More than 3.5 percent of the world’s population is on the move, considered international migrants. That’s more than 250 million people living in a country different than their country of birth or nationality. To put that another way, if all migrants lived in a single country, their population would be [...]
I've said it before: spreadsheets are not databases. However, many of us use spreadsheets as if they were databases, and then we struggle when the spreadsheet layout does not support database-style rigor of predictable rows, columns, and variable types -- the basic elements we need for analytics and reporting. If you're using SAS to read data from Microsoft Excel, what can you do when the data you need doesn't begin at cell A1?
By design, SAS can read data from any range of cells in your spreadsheet. In this article, I'll describe how to use the RANGE statement in PROC IMPORT to get the data you need.
With SAS 9.4 and later, SAS recommends using DBMS=XLSX for the most flexibility. It works on all operating systems without the need for additional components like the PC Files Server. Your Excel file must be in the Excel 2007-or-later format (XLSX). You do need a licence for SAS/ACCESS to PC Files. (Just learning? These DBMS=XLSX techniques also work in SAS University Edition.)
If your Excel data does not begin in cell A1 (the default start point for an import process), then you can add a RANGE= value that includes the specific cells. The easiest method is to use a Named Range in Excel to define the exact boundaries of the data.
How to add a Named Range
To define a named range in Excel, highlight the range of cells to include and simply type the new name of the range in the Name Box:
Then save the Excel file.
Then to import into SAS, specify that range name in the RANGE= option:
proc import datafile="/myprojects/myfile.xlsx" out=mydata replace; range="myspecialrange"; run;
Using Excel notation for a cell range
What if you don't know the range ahead of time? You can use PROC IMPORT to read the entire sheet, but the result will not have the column headers and types you want. Consider a sheet like this:
This code will read it:
proc import datafile="/myprojects/middle.xlsx" out=mid dbms=xlsx replace; run;
But the result will contain many empty cells, and the values will be read as all character types:
With additional coding, you can "fix" this result in another pass using DATA step. Or, if you're willing to add the RANGE option with the Excel notation for the specific cell ranges, you can read it properly in the first pass:
proc import datafile="/myprojects/middle.xlsx" out=mid dbms=xlsx replace; range="Sheet1$E7:K17" ; run;
How to "discover" the structure of your Excel file
You can also use LIBNAME XLSX to read entire sheets from Excel, or simply as a discovery step to see what sheets the Excel file contains before you run PROC IMPORT. However, LIBNAME XLSX does not show the Excel named ranges.
On SAS for Windows systems, you can use LIBNAME EXCEL (32-bit) or LIBNAME PCFILES (64-bit) to reveal a little more information about the Excel file.
libname d pcfiles path="c:\myprojects\middle.xlsx"; proc datasets lib=d; quit; /* always clear the libname, as it locks the file */ libname d clear;
Note that DBMS=XLSX does not support some of the options we see in the legacy DBMS=XLS (which supports only old-format XLS files), such as STARTROW and NAMEROW. DBMS=XLSX does support GETNAMES (treats the first record of the sheet or range as the variable names). See the full reference for Excel file import/export in the SAS documentation.
The post How to use SAS to read a range of cells from Excel appeared first on The SAS Dummy.
Regardless of industry, it has become a frequent occurrence that behind every data-driven marketer is an analytical ninja. Together, they formulate recipes in addressing the customer-centric paradigm that considers the different actions that a brand can take for a specific individual, and decides on the “best” one. The goal of [...]
Our company talks to utilities all over the world about the value of analytics. We like to talk about "the digital utility" and break down analytics use cases across: assets and operations; customers; portfolio; and corporate operations. I plan to highlight a few analytics use cases for utilities across these four areas [...]
Analytics use cases for utilities: Corporate operations was published on SAS Voices by David Pope
A previous article provides an example of using the BOOTSTRAP statement in PROC TTEST to compute bootstrap estimates of statistics in a two-sample t test. The BOOTSTRAP statement is new in SAS/STAT 14.3 (SAS 9.4M5). However, you can perform the same bootstrap analysis in earlier releases of SAS by using procedures in Base SAS and SAS/STAT. This article gives an example of how to bootstrap in SAS.
The main steps of the bootstrap method in SAS
A previous article describes how to construct a bootstrap confidence interval in SAS. The major steps of a bootstrap analysis follow:
- Compute the statistic of interest for the original data
- Resample B times (with replacement) from the data to form B bootstrap samples. The resampling process should respect the structure of the analysis and the null hypothesis. In SAS it is most efficient to use the DATA step or PROC SURVEYSELECT to put all B random bootstrap samples into a single data set.
- Use BY-group processing to compute the statistic of interest on each bootstrap sample. The BY-group approach is much faster than using macro loops. The union of the statistic is the bootstrap distribution, which approximates the sampling distribution of the statistic under the null hypothesis.
- Use the bootstrap distribution to obtain bootstrap estimates of bias, standard errors, and confidence intervals.
Compute the statistic of interest
This article uses the same bootstrap example as the previous article. The following SAS DATA step subsets the Sashelp.Cars data to create a data set that contains two groups: SUV" and "Sedan". There are 60 SUVs and 262 sedans. The statistic of interest is the difference of means between the two groups. A call to PROC TTEST computes the difference between group means for the data:
data Sample; /* create the sample data. The two groups are "SUV" and "Sedan" */ set Sashelp.Cars(keep=Type MPG_City); if Type in ('Sedan' 'SUV'); run; /* 1. Compute statistic (difference of means) for data */ proc ttest data=Sample; class Type; var MPG_City; ods output Statistics=SampleStats; /* save statistic in SAS data set */ run; /* 1b. OPTIONAL: Store sample statistic in a macro variable for later use */ proc sql noprint; select Mean into :Statistic from SampleStats where Method="Satterthwaite"; quit; %put &=Statistic;
The point estimate for the difference of means between groups is -4.98. The TTEST procedure produces a graph (not shown) that indicates that the MPG_City variable is moderately skewed for the "Sedan" group. Therefore you might question the usefulness of the classical parametric estimates for the standard error and confidence interval for the difference of means. The following bootstrap analysis provides a nonparametric estimate about the accuracy of the difference of means.
Resample from the data
For many resampling schemes, PROC SURVEYSELECT is the simplest way to generate bootstrap samples. The documentation for PROC TTEST states, "In a bootstrap for a two-sample design, random draws of size n1 and n2 are taken with replacement from the first and second groups, respectively, and combined to produce a single bootstrap sample." One way to carry out this sampling scheme is to use the STRATA statement in PROC SURVEYSELECT to sample (with replacement) from the "SUV" and "Sedan" groups. To perform stratified sampling, sort the data by the STRATA variable. The following statements sort the data and generate 10,000 bootstrap samples by drawing random samples (with replacement) from each group:
/* 2. Sample with replacement from each stratum. First sort by the STRATA variable. */ proc sort data=Sample; by Type; run; /* Then perform stratified sampling with replacement */ proc surveyselect data=Sample out=BootSamples noprint seed=123 method=urs /* with replacement */ /* OUTHITS */ /* use OUTHITS option when you do not want a frequency variable */ samprate=1 reps=10000; /* 10,000 resamples */ strata Type; /* sample N1 from first group and N2 from second */ run;
The BootSamples data set contains 10,000 random resamples. Each sample contains 60 SUVs and 262 sedans, just like the original data. The BootSamples data contains a variable named NumberHits that contains the frequency with which each original observation appears in the resample. If you prefer to use duplicated observations, specify the OUTHITS option in the PROC SURVEYSELECT statement. The different samples are identified by the values of the Replicate variable.
BY-group analysis of bootstrap samples
Recall that a BY-group analysis is an efficient way to process 10,000 bootstrap samples. Recall also that it is efficient to suppress output when you perform a large BY-group analysis. The following macros encapsulate the commands that suppress ODS objects prior to a simulation or bootstrap analysis and then permit the objects to appear after the analysis is complete:
/* Define useful macros */ %macro ODSOff(); /* Call prior to BY-group processing */ ods graphics off; ods exclude all; ods noresults; %mend; %macro ODSOn(); /* Call after BY-group processing */ ods graphics on; ods exclude none; ods results; %mend;
With these definitions, the following call to PROC TTEST computes the Satterthwaite test statistic for each bootstrap sample. Notice that you need to sort the data by the Replicate variable because the BootSamples data are ordered by the values of the Type variable. Note also that the NumberHits variable is used as a FREQ variable.
/* 3. Compute statistics */ proc sort data = BootSamples; by Replicate Type; run; %ODSOff /* suppress output */ proc ttest data=BootSamples; by Replicate; class Type; var MPG_City; freq NumberHits; /* Use FREQ variable in analysis (or use OUTHITS option) */ ods output ConfLimits=BootDist(where=(method="Satterthwaite") keep=Replicate Variable Class Method Mean rename=(Mean=DiffMeans)); run; %ODSOn /* enable output */
Obtain estimates from the bootstrap distribution
At this point in the bootstrap example, the data set BootDist contains the bootstrap distribution in the variable DiffMeans. You can use this variable to compute various bootstrap statistics. For example, the bootstrap estimate of the standard error is the standard deviation of the DiffMeans variable. The estimate of bias is the difference between the mean of the bootstrap estimates and the original statistic. The percentiles of the DiffMeans variable can be used to construct a confidence interval. (Or you can use a different interval estimate, such as the bias-adjusted and corrected interval.) You might also want to graph the bootstrap distribution. The following statements use PROC UNIVARIATE to compute these estimates:
/* 4. Plot sampling distribution of difference of sample means. Write stats to BootStats data set */ proc univariate data=BootDist; /* use NOPRINT option to suppress output and graphs */ var DiffMeans; histogram DiffMeans; /* OPTIONAL */ output out=BootStats pctlpts =2.5 97.5 pctlname=P025 P975 pctlpre =Mean_ mean=BootMean std=BootStdErr; run; /* use original sample statistic to compute bias */ data BootStats; set BootStats; Bias = BootMean - &Statistic; label Mean_P025="Lower 95% CL" Mean_P975="Upper 95% CL"; run; proc print data=BootStats noobs; var BootMean BootStdErr Bias Mean_P025 Mean_P975; run;
The results are shown. The bootstrap distribution appears to be normally distributed. This indicates that the bootstrap estimates will probably be similar to the classical parametric estimates. For this problem, the classical estimate of the standard error is 0.448 and a 95% confidence interval for the difference of means is [-5.87, -4.10]. In comparison, the bootstrap estimates are 0.444 and [-5.87, -4.13]. In spite of the skewness of the MPG_City variable for the "Sedan" group, the two-sample Satterthwaite t provides similar estimates regarding the accuracy of the point estimate for the difference of means. The bootstrap statistics also are similar to the statistics that you can obtain by using the BOOTSTRAP statement in PROC TTEST in SAS/STAT 14.3.
In summary, you can use Base SAS and SAS/STAT procedures to compute a bootstrap analysis of a two-sample t test. Although the "manual" bootstrap requires more programming effort than using the BOOTSTRAP statement in PROC TTEST, the example in this article generalizes to other statistics for which a built-in bootstrap option is not supported. This article also shows how to use PROC SURVEYSELECT to perform stratified sampling as part of a bootstrap analysis that involves sampling from multiple groups.
Cloud Analytic Services (CAS) is really exciting. It’s open. It’s multi-threaded. It’s distributed. And, best of all for SAS programmers, it’s SAS. It looks like SAS. It feels like SAS. In fact, you can even run DATA Step in CAS. But, how does DATA Step work in a multi-threaded, distributed context? What’s new? What’s different? If I’m a SAS programming wizard, am I automatically a CAS programming wizard?
While there are certain _n_ automatic variable as shown below:
DATA tableWithUniqueID; SET tableWithOutUniqueID; uniqueID = _n_; run;
CAS DATA Step
Creating a unique ID in CAS DATA Step is a bit more complicated. Each thread maintains its own _n_. So, if we just use _n_, we’ll get duplicate IDs. Each thread will produce an uniqueID field value of 1, 2..and so on. …. When the thread output is combined, we’ll have a bunch of records with an uniqueID of 1 and a bunch with an uniqueID of 2…. This is not useful.
To produce a truly unique ID, you need to augment _n_ with something else. _threadID_ automatic variable can help us get our unique ID as shown below:
DATA tableWithUniqueID; SET tableWithOutUniqueID; uniqueID = put(_threadid_,8.) || || '_' || Put(_n_,8.); run;
While there are surely other ways of doing it, concatenating _threadID_ with _n_ ensures uniqueness because the _threadID_ uniquely identifies a single thread and _n_ uniquely identifies a single row output by that thread.
Aggregation with DATA Step
Now, let’s look at “whole table” aggregation (no BY Groups).
SAS DATA Step
Aggregating an entire table in SAS DATA Step usually looks something like below. We create an aggregator field (totSalesAmt) and then add the detail records’ amount field (SaleAmt) to it as we process each record. Finally, when there are no more records (eof), we output the single aggregate row.
DATA aggregatedTable ; SET detailedTable end=eof; retain totSalesAmt 0; totSalesAmt = totSalesAmt + SaleAmt; keep totSalesAmt; if eof then output; run;
CAS DATA Step
While the above code returns one row in single-engine SAS, the same code returns multiple rows in CAS — one per thread. When I ran this code against a table in my environment, I got 28 rows (because CAS used 28 threads in this example).
As with the unique ID logic, producing a total aggregate is just a little more complicated in CAS. To make it work in CAS, we need a post-process step to bring the results together. So, our code would look like this:
DATA aggregatedTable ; SET detailedTable end=eof; retain threadSalesAmt 0; threadSalesAmt = threadSalesAmt + SaleAmt; keep threadSalesAmt; if eof then output; run; DATA aggregatedTable / single=yes; SET aggregatedTable end=eof; retain totSalesAmt 0; totSalesAmt = totSalesAmt + threadSalesAmt; if eof then output; run;
In the first data step in the above example, we ran basically the same code as in the SAS DATA Step example. In that step, we let CAS do its distributed, multi-threaded processing because our table is large. Spreading the work over multiple threads makes the aggregation much quicker. After this, we execute a second DATA Step but here we force CAS to use only one thread with the single=yes option. This ensures we only get one output row because CAS only uses one thread. Using a single thread in this case is optimal because we’ll only have a few input records (one per thread from the previous step).
Individual threads are then assigned to individual BY-Groups. Since each BY-Group is processed by one and only one thread, when we aggregate, we won’t see multiple output rows for a BY-Group. So, there shouldn’t be a need to consolidate the thread results like there was with “whole table” aggregation above.
Consequently, BY-Group aggregation DATA Step code should look exactly the same in CAS and SAS (at least for the basic stuff).
Coding DATA Step in CAS is very similar to coding DATA Step in SAS. If you’re a wizard in one, you’re likely a wizard in the other. The major difference is accounting for CAS’ massively parallel processing capabilities (which manifest as threads). For more insight into data processing with CAS, check out the SAS Global Forum paper.
We have entered the “second machine age.” The first machine age began with the industrial revolution, which was driven primarily by technology innovation. The ability to generate massive amounts of mechanical power made humans more productive. Where the steam engine started the industrial revolution, the second machine age has taken [...]
When making a new piece of code, I like to use the smallest font I can read. This lets me fit more text on the screen at once. When presenting code to others, especially in a classroom setting, I like to make the font large enough to see from the back of the room. Here’s how I change font size in SAS in our three programming interfaces.
Ever since the Moneyball book & movie came out, athletes have been scrambling to use data and analytics to gain a competitive advantage. One of my favorite sports is boat racing - the ones you paddle. Follow along as I lead you through some maps and graphs I created for [...]