data preparation

July 6, 2015
 

I recently used a JMP add-in I wrote to import my complete set of BodyMedia FIT food log data files, including data from Dec. 21, 2010, through the last day I logged my meals in that software on March 29, 2015. My final data table contained 39,942 rows of food item names. […]

The post Creating a JMP 12 Recode script from a Lookup Table containing original and new values appeared first on JMP Blog.
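
The JMP details are in the full post; as a rough Base SAS analogue of recoding from a lookup table of original and new values, PROC FORMAT with CNTLIN= can compile the table into a reusable format. This is a minimal sketch, not the add-in itself; the dataset and variable names (lookup, food_raw, item) and the sample values are hypothetical:

/* Hypothetical sketch: recode values via a lookup table of original/new pairs */
data lookup;                       /* original -> new value pairs              */
   length start label $ 40;       /* START and LABEL are what CNTLIN expects  */
   input start $ label $;
   datalines;
chiken chicken
tomatoe tomato
;
run;

data cntl;                         /* add the format name and type             */
   set lookup;
   retain fmtname '$fix' type 'C';
run;

proc format cntlin = cntl;         /* compile the lookup into a format         */
run;

data food_raw;                     /* toy data standing in for the food log    */
   length item $ 40;
   input item $;
   datalines;
chiken
apple
tomatoe
;
run;

data food_clean;                   /* values absent from the lookup pass through unchanged */
   set food_raw;
   item = put(item, $fix.);
run;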

July 1, 2015
 

The data you want to import into JMP often requires some manipulation before it’s ready to be analyzed in JMP. Sometimes data is arranged so that a row contains information for multiple observations. To prepare your data for analysis, you must restructure it so that each row of the JMP […]

The post How to stack data for a Oneway analysis appeared first on JMP Blog.
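
The post covers the JMP workflow; the same wide-to-long restructuring can be sketched in Base SAS with an ARRAY and an explicit OUTPUT. The dataset and variable names (wide, score1-score3) below are hypothetical:

/* Minimal sketch: stack wide columns into group/response pairs so each
   row holds exactly one observation, as a Oneway analysis expects */
data wide;
   input id score1 score2 score3;
   datalines;
1 10 12 14
2 11 13 15
;
run;

data long;
   set wide;
   array s{3} score1-score3;
   do group = 1 to 3;              /* one output row per original column */
      response = s{group};
      output;
   end;
   keep id group response;
run;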

May 4, 2015
 

In April, the free trial of SAS Data Loader for Hadoop became available globally. Now, you can take a test drive of our new technology designed to increase the speed and ease of managing data within Hadoop. The downloads might take a while (after all, this is big data), but I think you’ll […]

The post Self-service big data preparation in the age of Hadoop appeared first on The Data Roundtable.

April 15, 2011
 
Renaming the variables of a SAS dataset is a daily routine. SAS or the programmer gives arbitrary names to variables at the initial stage of data integration, and those names have to be modified afterward. Wensui [Ref. 1] developed a macro to add prefixes to variables, and Vincent et al. [Ref. 2] extended his idea with additional macro parameters. However, naming a variable in SAS carries many restrictions on length and format. For better understanding and recognition, labeling variables instead of renaming them can be more useful. The example below first integrates some complicated text data, where Proc Transpose generates a number of variables sharing the same prefix; invoking the label() macro then labels the resulting dataset as desired.

References:
1. Wensui Liu. 'How to rename many variables in SAS'. http://statcompute.blogspot.com/
2. Vincent Weng and Ying Feng. 'Renaming in Batches'. SAS Global Forum 2009.

****************(1) MODULE-BUILDING STEP******************;
%macro label(dsin = , dsout = , dslabel = );
   /***********************************************************
   *  MACRO:      label()
   *  GOAL:       use a label dataset to label the variables 
   *              of the target dataset
   *  PARAMETERS: dsin = input dataset
   *              dsout = output dataset
   *              dslabel = label dataset
   *
   ***********************************************************/
      data _tmp;
            set &dslabel ;
            num = _n_;
      run;

      ods listing close;
      ods output variables = _varlist;
      proc contents data = &dsin;
      run;

      proc sql;
            /* build VAR="label" pairs and pack them into one macro variable */
            select cats(a.variable, '="', b.labelname, '"') 
                   into :labellist separated by ' '
            from _varlist as a, _tmp as b 
            where a.num = b.num
      ;quit; 

      data &dsout;
            set &dsin;
            label &labellist;
      run;

      proc datasets;
            delete _:;
      quit;
      ods listing;
%mend;

****************(2) TESTING STEP******************;
******(2.1) INTEGRATE COMPLICATED DATA*************;
data have;
    infile datalines dlm = ',';
        retain _row;
        input _tmpvar $ @@ ; 
        if prxmatch("/10\d/", _tmpvar) ne 0 then _row + 1; 
        if missing(_tmpvar) then delete;
    datalines;
    100, Tom, 3,1,5,2,6
    101, Marlene, 1,2,4
    102, Jerry, 9,10,4,
    5, 6                    
    103, Jim,2 ,1, 2, 2,4
;
run;

proc transpose data=have out=want(drop = _:)
    prefix = var;
    by _row;
    var _tmpvar;
run;

******(2.2) INPUT LABELS FOR USE*************;
data label;
      input labelname $30.;
      cards;
      Patient ID
      Patient last name
      The 1st treatment
      The 2nd treatment
      The 3rd treatment
      The 4th treatment
      The 5th treatment
;
run;

******(2.3) INVOKE MACRO TO LABEL*************;
%label(dsin = want, dsout = want_labeled, dslabel = label);
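
A quick PROC CONTENTS on the output dataset confirms that the labels were applied:

proc contents data = want_labeled;   /* variable list should now show the labels */
run;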

****************END OF ALL CODING***************************************;


February 25, 2011
 

R surpassed Matlab and SAS in the recent TIOBE programming popularity index. I agree with this ranking[1], since R can largely replace Matlab and SAS for MB-size data, and it is getting hot. The fast-growing sectors, quantitative finance, bioinformatics and web analytics, are all embracing R. In those competitive fields, survival demands the skills of a developer rather than a programmer, which means whatever brings more power or a higher AUC. The cross-platform, open-source, C-based, function-friendly R is right on top of the tide. SAS programmers have to wait for SAS Institute to build tools for them, while an R developer can forge his own gloves and axes. SAS's licensing policy is also problematic: SAS is divided into modules, such as BASE, STAT, IML, GRAPH and ETS; if you want to use just one procedure in a module, you have to buy the whole module and renew the license each year. Norman Nie[2] said that 15 core statistical procedures, such as cluster analysis, factor analysis, PCA and Cox regression, would satisfy 85% of analytic needs. Thus SAS programmers may pay for hundreds of procedures in order to use those 15. With R, by contrast, you can download exactly the packages you need, or build brand-new functions and packages yourself.

However, with larger data, R and Matlab users have to jump, painfully, to low-level languages. For the convenience of matrix computation, R and Matlab routinely keep copies of data in memory and do not let the garbage collector frenetically dump them. With GB-size data, people soon hear the outcry of memory shortage on their workstation and ponder resorting to C or C++. That is possibly why Matlab and R talents are so popular in academia but far less welcome in the job market. Amazingly, SAS is pretty robust at this scale of data; it has rarely failed me, even though I sometimes complain about the speed. Of course, if memory is not a hurdle, R is another good option. Around this idea, Norman Nie's startup company Revolution Analytics has tried to reinvigorate R: its XDF file format holds data on hard disk before it enters the memory stack, and its proprietary RevoScaleR package performs data analysis directly against those files. This strategy is quite similar to the in-database technology in SAS's high-end enterprise products. Besides, XDF partitions data to allow multi-core processing. On a common PC with 3 GB of memory, I compared Revolution R and SAS using a 1.1 GB flat file: SAS integrated it in 20 seconds, while Rev R took 5 minutes. Afterward SAS completed some data transformation steps and summarization procedures without trouble; unfortunately, on those occasions the R core of Rev R usually crashed, leaving Rev R's IDE as a zombie. Presumably the compatibility between the R core and the XDF file system is a concern. In addition, at this stage the RevoScaleR package offers only a limited set of functions that exploit the XDF system, such as summary statistics, linear regression, logistic regression and cross-tabs. Another worry is that the R core contributors, led by Ross Ihaka and Robert Gentleman, may be reluctant to further assist Norman Nie's effort to develop his profitable product.

For TB-size data, R will roll back into favor. Just as Newton's theories expire in a physical world of extraordinary speed or scale, complex, high-volume, high-dimensional data stuns any extraction, transformation, loading and analysis operation based on the RDBMS concept. MySQL, Oracle and SAS all fade away. Distributed systems and functional languages are the rescue, and open-source software such as Python and R rules this domain. A data center can aggregate the total memory of thousands of servers; with such gigantic memory pooled together, data analysis in R can be performed by scheduling Map/Reduce tasks. However, the well-known IT giants that own those data centers are not likely to buy many licenses for a commercial package. Community R, not Revolution R, will enjoy a significant share of the rapid growth in this area.

Overall, SAS is still much better than Revolution R at handling GB-size data today, and Revolution R may need more time to mature for production use. Revolution R has launched a SAS-to-R challenge and deliberately created a function to transform SAS datasets into its XDF format. I like to see newcomers in the arena, and SAS may benefit from this competition.

References:
1. 'TIOBE Programming Community Index for February 2011'. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
2. 'Another Open Source Swipe at IBM and SAS'. Forbes. http://blogs.forbes.com/quentinhardy/2011/02/01/another-open-source-swipe-at-ibm-and-sas/


************(1) DOWNLOAD AND UNZIP THE DATA ***********************;
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2006us.zip

***********(2)SAS'S CODE TO INTEGRATE DATA*****************;
data mort06;
 infile 'C:\mort06.dat' 
    lrecl=700 missover;   /* lrecl must cover the last field read (@484 + 5 = column 488) */
 input    @20 Resident_status     $1. 
    @83 Plac_of_death_decedent_status   $1.  
    @85 Day_of_week_of_death   $1. 
    @94 Date_receipt    $8.  
    @102 Data_Year     $4. 
    @65 Month_of_Death     $2.
    @60 Sex      $1.  
    @445 Race      $6.
    @70 Age      $13.  
    @84 Marital      $1.  
    @484 Hispanic_origin     $5.  
    @61 Education             $4.  
    @146 ICD10 $4.  
    @150 code358    $3.   
    @154 code113 $3.     
    @157 code130 $3.    
    @160 cause_recode   $2.   
    @145 Place_injury   $1.   
    @106 Injury_work     $1.  
    @108 Method_disposition    $1. 
    @109 Autopsy      $1. 
    @144 Activity    $1. 
;
run;

proc contents data=mort06;
run;

############(3) REV R'S CODE TO INTEGRATE DATA#####################
colList <- list(  
    "Resident_status"        =list(type="factor",  start=20, width=1 ),
     "Plac_of_death_decedent_status"    =list(type="factor", start=83, width=1 ),
     "Day_of_week _of_death"      =list(type="factor", start=85, width=1 ),     
     "Date_receipt"         =list(type="factor", start=94, width=1 ),         
     "Data_Year"         =list(type="factor",  start=102, width=4  ),
     "Month_of_Death"        =list(type="factor", start=65, width=2  ),
     "Sex"           =list(type="factor", start=60, width=1  ),
     "Race"           =list(type="factor", start=445, width=6 ),
     "Age"           =list(type="factor", start=70, width=13  ),
     "Marital"          =list(type="factor", start=84, width=1 ),
     "Hispanic_origin"        =list(type="factor", start=484, width=5  ),
     "Education"         =list(type="factor", start=61, width=4 ),
     "ICD10"          =list(type="factor", start=146, width=4  ),     
     "code358"          =list(type="factor", start=150, width=3  ), 
     "code113"          =list(type="factor", start=154, width=3  ),
     "code130"          =list(type="factor", start=157, width=3  ),
     "cause_recode "        =list(type="factor", start=160, width=2  ),
     "Place_injury"         =list(type="factor", start=145, width=1  ),
     "Injury_work "         =list(type="factor", start=106, width=1  ),
     "Method_disposition"       =list(type="factor", start=108, width=1  ),
     "Autopsy"          =list(type="factor", start=109, width=1  ),
     "Activity"          =list(type="factor", start=144, width=1  )  
    )
mortFile <- file.path("C:", "MORT06.DAT")
sourceData <- RxTextData(mortFile, colInfo=colList )                 
outputData <- RxXdfData("MORT")
rxImportToXdf(sourceData, outputData,  overwrite = TRUE) 
rxGetInfoXdf("MORT.xdf", getVarInfo=TRUE) 

#################END OF CODING#######################################
December 13, 2010
 


Three dimensions are usually regarded as the maximum for data presentation. With the opening of ODS in SAS 9.2 and its graph template language, 3D graphing is no longer a perplexing problem for SAS programmers. However, today's magnificent amounts of multi-dimensional data need a more vivid and simpler way to be displayed.
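
As a minimal sketch of that GTL capability (using the gridded SASHELP.LAKE sample data that ships with SAS; this example is not part of the original post), a 3D surface takes only a few lines:

/* A 3D surface with the SAS 9.2 graph template language (GTL) */
proc template;
   define statgraph surf3d;
      begingraph;
         entrytitle 'Lake depth surface';
         layout overlay3d;
            surfaceplotparm x=length y=width z=depth;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=sashelp.lake template=surf3d;
run;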

The emergence of Google Motion Chart now provides a sound solution for visualizing data in more than three dimensions. This web-based analytical technology originated with Dr. Hans Rosling, whose Gapminder foundation invented a way to demonstrate relationships among multiple dimensions with animated bubbles. They developed many bubble plots on Gapminder's website to discover knowledge from bulks of public information, especially for regional and national comparisons. The idea soon attracted Google's attention: in 2008, after an agreement between Dr. Rosling and Google's two founders, Google launched its Motion Chart gadget, and people could create motion charts using Google Docs, an online alternative to Microsoft Office.

Combining SAS with Google Motion Chart offers a handy and cheap way to visualize up to five dimensions. A Motion Chart supports five variables in a single plot; the data structure typically calls for time (animation), var1 (X axis), var2 (Y axis), var3 (color) and var4 (bubble size). Correlation from var1 through var4 is expected: bubbles with changing color and size tend to move along the diagonal. Overall, 5-D visualization can be rendered within a single plot. In this example the SAS help dataset SASHELP.SHOES is used; it contains several regions to compare against each other. Logged returns are placed on the Y axis and logged sales on the X axis, a series of virtual dates is assigned to each region, inventory drives bubble size, and the number of stores drives color. With SAS, the data structure Motion Chart expects can be prepared quickly; once the CSV file is uploaded to Google Docs, a motion chart is ready to be published on any webpage. OK, it's time to sit back and discover some interesting tendencies...

References:
1. 'Show me--New ways of visualising data'. The Economist. Feb 25th 2010.
2. 'Making data dance'. The Economist. Dec 11th 2010.
3. Google Docs online help center. 2010.

*********(1) Extract data from SASHELP.SHOES***********;
proc sql;
create table test as
select region, Sales, Inventory, Returns, Stores
from sashelp.shoes
order by region , sales desc
;quit;
********(2) Create a random variable for time************;
data test1;
/* DOW loop: i numbers the rows within each region, and each row gets
   a virtual date counting back one day per row from today */
do i=1 by 1 until (last.region);
set test;
by region;
time=today()-i+1;
mytime=put(time, mmddyy8.);
drop i;
output;
end;
run;
********(3) Transform some variables with log**********;
proc sql;
create table test2 as
select region, mytime, log(sales) as logsales, log(returns) as logreturn, Stores as storenum, Inventory
from test1
order by region, mytime
;quit;
********(4) Export data as CSV***************;
proc export data=test2 outfile='C:\Users\Yanyi\Desktop\test.csv' replace;
run;
*******(5) Upload CSV to Google Docs************;
******(6)  Create Google Motion Chart manually**********;

**********END*********TEST PASSED 12DEC2010****************************;