1月 102019
 


My New Year's resolution: “Unclutter your life” and I hope this post will help you do the same.

Here I share with you a data preparation approach and SAS coding technique that will significantly simplify, unclutter and streamline your SAS programming life by using data templates.

Dictionary.com defines template as “anything that determines or serves as a pattern; a model.” However, I was flabbergasted when my “prior art research” for the topic of this blog post ended rather abruptly: “No results found for data template.”

What do you mean “no results?!” (Yes, sometimes I talk to the Internet. Do you?) We have templates for everything in the world: MS Word templates, C++ templates, Photoshop templates, website templates, holiday templates, we even have our own PROC TEMPLATE. But no templates for data?

At that point I paused, struggling to accept reality, but then felt compelled to come up with my own definition:

A data template is a well-defined data structure containing a data descriptor but no data.

Therefore, a SAS data template is a SAS dataset (data table) containing the descriptor portion with all necessary attributes defined (variable types, labels, lengths, formats, and informats) and empty (zero observations) data portion.

Less clutter = greater efficiency

When you construct SAS data tables using SAS code or data management tools such as using data design documentation as a feed to code-generating SAS program.

Unfortunately, despite all these benefits the data template concept is not explicitly and consistently employed and is noticeably absent from data development methodologies and practices.

Let’s try to change that!

How to create SAS data templates from scratch

It is very easy to create SAS data template. Here is an example:

 
libname PARMSDL 'c:\projects\datatemplates';
data PARMSDL.MYTEMPLATE;
   label
      newvar1 = 'Label for new variable 1'
      newvar2 = 'Label for new variable 2'
      /* ... */
      newvarN = 'Label for new variable N'
      ;
  length
      newvar1 newvar2 $40
      newvarN 8
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
   stop;
run;

First, you need to assign a permanent library (e.g. PARMSDL) where you are going to store your SAS dataset template. I usually do not store data templates in the same library as data. Nor do I store it in the same directory/folder where you store your SAS code. Ordinarily, I store data templates in a so-called parameter data library (that is why I use PARMSDL as a libref), along with other data defining SAS code structure.

In the data step, the very first statement LABEL defines all variables’ labels as well as variable position determined by the order in which they are listed.

Statement LENGTH defines variables’ types (numeric or character) and their length in bytes. Here you may group variables of the same length to shorten your code or define them individually to be more explicit.
Statement FORMAT defines variables’ formats as needed. You don’t have to define formats for all the variables; define them only if necessary.

Statement INFORMAT (also optional) defines informats that come handy if you use this data template for creating SAS datasets by reading external raw files. With informats defined on the data template, you won’t have to specify informats in your INPUT statement while reading external file, as the informats will be inherently associated with the variable names. That is why SAS data sets have informat attribute for its variables in the first place (if you ever wondered why.)

Finally, don’t forget the STOP statement at the end of your data step, just before the RUN statement. Otherwise, instead of zero observations, you will end up with a data table that has a single observation with all missing variable values. Not what we want.

It is worth noting that obs=0 system option will not work instead of the STOP statement as it is applied only to the data being read, but we read no data here. For the same reason, (obs=0) data set option will not work either. Try it, and SAS log will dispel your doubts:

 
data PARMSDL.MYTEMPLATE (obs=0);
                        ---
                        70
WARNING 70-63: The option OBS is not valid in this context.  Option ignored.

How to create SAS data templates by inheritance

If you already have some data table with well-defined variable attributes, you may easily create a data template out of that data table by inheriting its descriptor portion:

 
data PARMSDL.MYTEMPLATE;
   set SASDL.MYDATA (obs=0);
run;

Option (obs=0) does work here as it is applied to the dataset being read, and therefore STOP statement is not necessary.

You can also combine inheritance with defining new variables, as in the following example:

 
data MYTEMPLATE;
   set SASDL.MYDATA (obs=0); *<-- inherited template;
   * variables definition: ;
   label
      newvar1 = &#039;Label for new variable 1&#039;
      newvarN = &#039;Label for new variable N&#039;
      oldvar =  &#039;New Label for OLD variable’ 
      ;
   length
      newvar1 $40
      newvarN 8
      oldvar  $100 /* careful here, see notes below */
      ;
   format newvarN mmddyy10.;
   informat newvarN date9.;
run;

A word of warning

Be careful when your new variable definition type and length contradicts inherited definition.
You can overwrite/re-define inherited variable attributes such as labels, formats and informats with no problem, but you cannot overwrite type and in some cases length. If you do need to have a different variable type for a specific variable name on your data template, you should first drop that variable on the SET statement and then re-define it in the data step.

With the length attribute the picture is a bit different. If you try defining a different length for some variable, SAS will produce the following WARNING in the LOG:

WARNING: Length of character variable  has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

You can either use the advice of the WARNING statement and place the LENGTH statement as the very first statement or at least before the SET statement. In this case, you will find that you can increase the length without a problem, but if you try to reduce the length relative to the one on the parent dataset SAS will produce the following WARNING in the LOG:

WARNING: Multiple lengths were specified for the variable  by input data set(s). This can cause truncation of data.

In this case, a cleaner way will also be to drop that variable on the SET statement and redefine it with the LENGTH statement in the data step.

Keep in mind that when you drop these variables from the parent data set, besides losing their type and length attributes, you will obviously lose the rest of the attributes too. Therefore, you will need to re-define all the attributes (type, length, label, format, and informat) for the variables you drop. At least, this technique will allow you to selectively inherit some variables from a parent data set and explicitly define others.

How to use SAS data templates

One way to apply your data template to a newly created dataset is to: 1) Copy your data template in that new dataset; 2) Append your data table to that new data set. Here is an example:

 
/* copying data template into dataset */
data SASDL.MYNEWDATA
   set PARMSDL.MYTEMPLATE;
run;
 
/* append data to your dataset with descriptor */
proc append base=SASDL.MYNEWDATA data=WORK.MYDATA;
run;

Your variable types and lengths should be the same on the BASE= and DATA= tables; labels, formats and informats will be carried over from the BASE= dataset/template.

It is simple, but could be simplified even more to reduce your code to just a single data step:

 data SASDL.MYNEWDATA;
   if 0 then set PARMSDL.MYTEMPLATE;
   set WORK.MYDATA;
   /* other statements */
run;

Even though set PARMSDL.MYTEMPLATE; statement has never executed because of the explicitly FALSE condition (0 means FASLE) in the IF statement, the resulting dataset SASDL.MYDATA gets all its variable attributes carried over from the PARMSDL.MYTEMPLATE data template during data step compilation.

This same coding technique can be used to implicitly apply data variable attributes from a well-defined data set by inheritance even though that data set is not technically a data template (has more than 0 observations.) Run the following code to make sure MYDATA table has all the variables and attributes of the SASHELP.CARS data table while data values come from the ABC data set:

 
data ABC;
  make='Toyota';
run;
 
data MYDATA;
   if 0 then set SASHELP.CARS;
   set ABC;
run;

Perhaps the benefits of SAS data templates are best demonstrated when you read external data into SAS data table. Here is an example for you to run (of course, in real life MYTEMPLATE should be a permanent data set and instead of datalines it should be an external file):

 
data MYTEMPLATE;
   label
      fdate = 'Flight Date'
      count = 'Flight Count'
      fdesc = 'Flight Description'
      reven = 'Revenue, $';
   length fdate count reven 8 fdesc $22;
   format fdate date9. count comma12. reven dollar12.2;
   informat fdate mmddyy10. count comma8. fdesc $22. reven comma10.;
   stop;
run;
 
data FLIGHTS;
   if 0 then set MYTEMPLATE;
   input fdate count fdesc & reven;
   datalines;
12/05/2018 500   Flight from DCA to BOS  120,034
10/01/2018 1,200 Flight from BOS to DCA  90,534
09/15/2018 2,234 Flight from DCA to MCO  1,350
;

Here is how the output data set looks:

Notice how simple the last data step is. No labels, no lengths, no formats, no informats – no clutter. Yet, the raw data is read in nicely, with proper informats applied, and the resulting data set has all the proper labels and variable formatting. And when you repeat this process for another sample of similar data you can still use the same data template, and your read-in data step stays the same – simple and concise.

Your thoughts

Do you find SAS data templates useful? Do you use them in any shape or form in your SAS data development projects? Please share your thoughts.

Simplify data preparation using SAS data templates was published on SAS Users.

1月 092019
 

Numbers don't lie, but sometimes they don't reveal the full story. Last week I wrote about the most popular articles from The DO Loop in 2018. The popular articles are inevitably about elementary topics in SAS programming or statistics because those topics have broad appeal. However, I also write about advanced topics, which are less popular but fill an important niche in the SAS community. Not everyone needs to know how to fit a Pareto distribution in SAS or how to compute distance-based measures of correlation in SAS. Nevertheless, these topics are interesting to think about.

I believe that learning should not stop when we leave school. If you, too, are a lifelong learner, the following topics deserve a second look. I've included articles from four different categories.

Data Visualization

  • Fringe plot: When fitting a logistic model, you can plot the predicted probabilities versus a continuous covariate or versus the empirical probability. You can use a fringe plot to overlay the data on the plot of predicted probabilities. The SAS developer of PROC LOGISTIC liked this article a lot, so look for fringe plots in a future release of SAS/STAT software!
  • Order variables in a correlation matrix or scatter plot matrix: When displaying a graph that shows many variables (such as a scatter plot matrix), you can make the graph more understandable by ordering the variables so that similar variables are adjacent to each other. The article uses single-link clustering to order the variables, as suggested by Hurley (2004).
  • A stacked band plot, created in SAS by using PROC SGPLOT
  • Stacked band plot: You can use PROC SGPLOT to automatically create a stacked bar plot. However, when the bars represent an ordered categorical variable (such as months or years), you might want to create a stacked band plot instead. This article shows how to create a stacked band plot in SAS.

Statistics and Data Analysis

Random numbers and resampling methods

Process flow diagram shows how to resample data to create a bootstrap distribution.

Optimization

These articles are technical but provide tips and techniques that you might find useful. Choose a few topics that are unfamiliar and teach yourself something new in this New Year!

Do you have a favorite article from 2018 that I did not include on the list? Share it in a comment!

The post 10 posts from 2018 that deserve a second look appeared first on The DO Loop.

1月 082019
 

I love my job, but I am not a morning person so I need a bit of inspiration to get out of bed. I’ve been a marketer for more than 15 years, and that inspiration has never been easier to find than since I joined SAS as an industry marketer [...]

Health care convergence gets me up in the morning. Yes, really. was published on SAS Voices by Cameron McLauchlin

1月 072019
 

In the first of three posts on using automated analysis with SAS Visual Analytics, we explored a typical visualization designed to give telco customer care workers guidance on customers most receptive to upgrade their plans. While the analysis provided some insight, it lacked analytical depth -- and that increases the risk of  wasting time, energy and money on a strategy that may not succeed.

Let’s now look at the same data, but this time deepen the analytical view by putting SAS Visual Analytics' automated analysis into play. We’ll use automated analysis to determine significant variables that impact our key business measure, X-sell and Up-sell Flag.

Less time spent on data discovery, quicker response time

The automated analysis object determines the most important underlying factors for a specific response variable, in our case the X-Sell and Up-Sell flag. After you specify a response variable, most of the remaining data items are added as underlying factors. Variables that are identical to the response variable, variables that have excessive missing values, or variables that have high cardinality are not added as underlying factors. For category responses, you can select the event level (category value) that interests you.

To run automated analysis, I will use the data pane and right click on Xsell and Upsell Flag Category, select Analyze, then Analyze on new page.

 

Here we see the results for Not Yet Upgraded.

Seeing how we really want to understand what made our customers upgrade so we can learn from it, let’s change the results to see upgraded accounts. To do this I will use the drop-down menu to change the category value to Upgraded.

 

Now we see the details for Upgraded. Let’s look at each piece of information within this chart.

 

The top section tells us that the probability of a customer upgraded is 12.13%. It also tells us the other variables in our dataset that influences that probability. The strongest influencers are Total Days over plan, Days Suspended Last 6M (months), Total Times Over Plan and Delinquent Indicator. Remember from our previous analysis, the correlation matrix determined that Total Days over plan, Delinquent Indicator and Days suspended last 6M were correlated with our X-sell and Up-sell Flag. So this part of the analysis is pretty similar. However, the rest of the automated analysis provides so much more information than what we got from our previous analysis and it was produced in under a minute.

The next section gives us a visual on how strong each influencer is on our variable of interest, Xsell and Upsell Flag Category. Total Days Over Plan is the strongest followed by Days Suspended Last 6M followed by Total Times Over Plan…. If we mouse over each of the boxes, we’ll see their relative importance.

After SAS Visual Analytics adds the underlying factors, it creates a relative importance score for each underlying factor. The most important underlying factor is assigned a score of 1, and all other scores are proportional to that value.

If I mouse over Total Days Over Plan I’ll see the relative importance score for that variable.

Here we see that Total Days Over Plan relative importance in influencing a customer to change plans is 1. That means it was the most important factor in predicting our variable of interest, cross-sell and up-sell flag. If I mouse over the Days Suspended Last 6M, I can see that the relative importance for that variable is 0.6664.

The percentages along the left-hand side give us the probability (or chance) of the subgroups of customers likely to upgrade. SAS Visual Analysis shows the top groups and the bottom groups based on probability. The first group of customers are 100% likely to upgrade. These customers have Total Days Over Plan greater than or equal to 33, Days Suspended Last 6M greater than or equal to 6, 6M Avg Minutes on Network Normally Distributed less than -6.9, Delinquent Indicator of 1,2,3 or 4. This means going forward, if we have customers that meet these criteria we should target them for an upgrade because they are 100% likely to upgrade. We can also use the next three customer groups to target as well.

For measure responses, the results display the four groups that result in the greatest values of the response. The results also display the two groups that result in the smallest values of the response. For category responses, the results display the four groups that contain the greatest percentages of the response. The results also display the two groups that contain the least percentages of the response.

The bottom right chart shows how a variable relates to our variable of interest. Below the chart is a description outlining key findings.

An explanatory plot is included for each underlying factor. The contents of this plot depend on the variable type of both the response variable and the underlying factor.

If I click on Days Suspended Last 6M from the colored button bar, the informative text will be highlighted, and the plot chart will be updated to reflect my selection.

But what if you want to see all the variables analyzed and discover what actions were taken on them? If we maximize the automated analysis object we’d see a table at the bottom. This table outlines actions taken on the predictors.

Here we see that Census Area Total Males was rejected because it is too strongly correlated with another measure. This reason would be easy for someone to miss and would affect the results of an analysis or model if that predictor was not removed. Automated analysis really does do the thinking for us and makes models more accurate!

In the second post of this three-part series, we’ll see how we can turn the results from this automated analysis into actionable items.

SAS® Visual Analytics on SAS® Viya® Try it for free!

How SAS Visual Analytics' automated analysis takes customer care to the next level - Part 2 was published on SAS Users.

1月 072019
 

Deming regression (also called errors-in-variables regression) is a total regression method that fits a regression line when the measurements of both the explanatory variable (X) and the response variable (Y) are assumed to be subject to normally distributed errors. Recall that in ordinary least squares regression, the explanatory variable (X) is assumed to be measured without error. Deming regression is explained in a Wikipedia article and in a paper by K. Linnet (1993).

A situation in which both X and Y are measured with errors arises when comparing measurements from different instruments or medical devices. For example, suppose a lab test measures the amount of some substance in a patient's blood. If you want to monitor this substance at regular intervals (for example, hourly), it is expensive, painful, and inconvenient to take the patient's blood multiple times. If someone invents a medical device that goes on the patient's finger and measures the substance indirectly (perhaps by measuring an electrical property such as bioimpedance), then that device would be an improved way to monitor the patient. However, as explained in Deal, Pate, and El Rouby (2009), the FDA would first need to approve the device and determine that it measures the response as accurately as the existing lab test. The FDA encourages the use of Deming regression for method-comparison studies.

Deming regression in SAS

There are several ways to compute a Deming Regression in SAS. The SAS FASTats site suggests maximum likelihood estimation (MLE) by using PROC OPTMODEL, PROC IML, or PROC NLMIXED. However, you can solve the MLE equations explicitly to obtain an explicit formula for the regression estimates. Deal, Pate, and El Rouby (2009) present a rather complicated macro, whereas Njoya and Hemyari (2017) use simple SQL statements. Both authors also provide SAS code for estimating the variance of the Deming regression estimates, either by using the jackknife method or by using the bootstrap. However, the resampling schemes in both papers are inefficient because they use a macro loop to perform the jackknife or bootstrap.

The following SAS DATA Step defines pairs of hypothetical measurements for 65 patients, each of whom received the standard lab test (measured in micrograms per deciliter) and the new noninvasive device (measured in kiloohms):

data BloodTest;
label x="micrograms per deciliter"
      y="kiloOhms";
input x y @@;
datalines;
169.0 45.5 130.8 33.4 109.0 23.8 94.1 19.8 86.3 20.4 78.4 18.7 
 76.1 16.1  72.2 16.7  70.0 11.9 69.8 14.6 69.5 10.6 68.7 12.7 67.3 16.9 
174.7 57.8 137.9 39.0 114.6 30.4 99.8 21.1 90.1 21.7 85.1 25.2
 80.7 20.6 78.1 19.3  77.8 20.9  76.0 18.2 77.8 18.3 74.2 15.7 73.1 13.9 
182.5 55.5 144.0 38.7 123.8 35.1 107.6 30.6 96.9 25.7 92.8 19.2 
 87.2 22.4  86.3 18.4  84.4 20.7  83.7 20.6 83.3 20.0 83.9 18.8 82.7 21.8 
160.8 49.9 122.7 32.2 102.6 19.2 86.6 14.7 76.1 16.6 69.6 18.8 
 66.7  7.4  64.4  8.2  63.0 15.5 61.7 13.7 61.2 9.2 62.4 12.0 58.4 15.2 
171.3 48.7 136.3 36.1 111.9 28.6 96.5 21.8 90.3 25.6 82.9 16.8 
 78.1 14.1  76.5 14.2  73.5 11.9 74.4 17.7 73.9 17.6 71.9 10.2 72.0 15.6 
;
 
title "Deming Regression";
title2 "Gold Standard (X) vs New Method (Y)";
proc sgplot data=BloodTest noautolegend;
   scatter x=x y=y;
   lineparm x=0 y=-10.56415 slope=0.354463 / clip; /* Deming regression estimates */
   xaxis grid label="Lab Test (micrograms per deciliter)";
   yaxis grid label="New Device (kiloohms)";
run;
Deming regression line for  medical devices with different scales

The scatter plot shows the pairs of measurements for each patient. The linear pattern indicates that the new device is well calibrated with the standard lab test over a range of clinical values. The diagonal line represents the Deming regression estimate, which enables you to convert one measurement into another. For example, a lab test that reads 100 micrograms per deciliter is expected to correspond to 25 kiloohms on the new device and vice versa. (If you want to convert the new readings into the old, you can regress X onto Y and plot X on the vertical axis.)

The following SAS/IML function implements the explicit formulas that compute the slope and intercept of the Deming regression line:

/* Deming Regression in SAS */
proc iml;
start Deming(XY, lambda=);
   /* Equations from https://en.wikipedia.org/wiki/Deming_regression */
   m = mean(XY);
   xMean = m[1]; yMean = m[2];
   S = cov(XY);
   Sxx = S[1,1]; Sxy = S[1,2]; Syy = S[2,2];
   /* if lambda is specified (eg, lambda=1), use it. Otherwise, estimate. */
   if IsEmpty(lambda) then
      delta = Sxx / Syy;        /* estimate of ratio of variance */
   else delta = lambda;
   c = Syy - delta*Sxx;
   b1 = (c + sqrt(c**2 + 4*delta*Sxy**2)) / (2*Sxy);
   b0 = yMean - b1*xMean;
   return (b0 || b1);
finish;
 
/* Test the program on the blood test data */
use BloodTest; read all var {x y} into XY; close;
b = Deming(XY);
print b[c={'Intercept' 'Slope'} L="Deming Regression"];
Deming regression estimates in SAS

The SAS/IML function can estimate the ratio of the variances of the X and Y variable. In the SAS macros by Deal, Pate, and El Rouby (2009) and Njoya and Hemyari (2017), the ratio is a parameter that is determined by the user. The examples in both papers use a ratio of 1, which assumes that the devices have an equal accuracy and use the same units of measurement. In the current example, the lab test and the electrical device use different units. The ratio of the variances for these hypothetical devices is about 7.4.

Standard errors of estimates

You might wonder how accurate the parameter estimates are. Linnet (1993) recommends using the jackknife method to answer that question. I have previously explained how to jackknife estimates in SAS/IML, and the following program is copied from that article:

/* Helper modules for jackknife estimates of standard error and CI for parameters: */
/* return the vector {1,2,...,i-1, i+1,...,n}, which excludes the scalar value i */ 
start SeqExclude(n,i);
   if i=1 then return 2:n;
   if i=n then return 1:n-1;
   return (1:i-1) || (i+1:n);
finish;
 
/* return the i_th jackknife sample for (n x p) matrix X */
start JackSamp(X,i);
   return X[ SeqExclude(nrow(X), i), ];  /* return data without i_th row */
finish;
 
/* 1. Compute T = statistic on original data */
T = b;
 
/* 2. Compute statistic on each leave-one-out jackknife sample */
n = nrow(XY);
T_LOO = j(n,2,.);             /* LOO = "Leave One Out" */
do i = 1 to n;
   J = JackSamp(XY,i);
   T_LOO[i,] = Deming(J); 
end;
 
/* 3. compute mean of the LOO statistics */
T_Avg = mean( T_LOO );  
 
/* 4. Compute jackknife estimates of standard error and CI */
stdErrJack = sqrt( (n-1)/n * (T_LOO - T_Avg)[##,] );
alpha = 0.05;
tinv = quantile("T", 1-alpha/2, n-2); /* use df=n-2 b/c both x and y are estimated */
Lower = T - tinv#stdErrJack;
Upper = T + tinv#stdErrJack;
result = T` || T_Avg` || stdErrJack` || Lower` || Upper`;
print result[c={"Estimate" "Mean Jackknife Estimate" "Std Error" 
             "Lower 95% CL" "Upper 95% CL"} r={'Intercept' 'Slope'}];
Jackknife standard errors for Deming regression estimates

The formulas for the jackknife computation differs slightly from the SAS macro by Deal, Pate, and El Rouby (2009). Because both X and Y have errors, the t quantile must be computed by using n–2 degrees of freedom, not n–1.

If X and Y are measured on the same scale, then the methods are well-calibrated when the 95% confidence interval (CI) for the intercept includes 0 and the CI for the intercept includes 1. In this example, the devices use different scales. The Deming regression line enables you to convert from one measurement scale to the other; the small standard errors (narrow CIs) indicate that this conversion is accurate.

In summary, you can use a simple set of formulas to implement Deming regression in SAS. This article uses SAS/IML to implement the regression estimates and the jackknife estimate of the standard errors. You can also use the macros that are mentioned in the section "Deming regression in SAS," but the macros are less efficient, and you need to specify the ratio of the variances of the data vectors.

The post Deming regression for comparing different measurement methods appeared first on The DO Loop.

1月 052019
 

Do you want to start your year feeling great about data and analytics? Then dive into this list and read a few data for good stories that you might have missed in 2018. You'll learn how data scientists are using analytics to save endangered species, prevent suicide, combat cancer - [...]

Top 10 data for good articles of 2018 was published on SAS Voices by Alison Bolen

1月 052019
 

Do you want to start your year feeling great about data and analytics? Then dive into this list and read a few data for good stories that you might have missed in 2018. You'll learn how data scientists are using analytics to save endangered species, prevent suicide, combat cancer - [...]

Top 10 data for good articles of 2018 was published on SAS Voices by Alison Bolen