October 31, 2018
 

A useful feature in PROC SGPLOT is the ability to easily visualize subgroups of data. Most statements in the SGPLOT procedure support a GROUP= option that enables you to overlay plots of subgroups. When you use the GROUP= option, observations are assigned attributes (colors, line patterns, symbols, ...) that indicate the value of the grouping variable. This article reviews the GROUP= option and shows how to trick PROC SGPLOT into performing a group analysis for statements that do not support the GROUP= option.

Three ways to plot data by groups

It is common to use colors or symbols to indicate which observations belong to each category of a grouping variable. Typical grouping variables include gender (male and female), political affiliation (Democrats, Republicans, and independents), race, education level, and so forth. When you use the SAS SG procedures to plot subsets of the data, there are three ways to arrange the plots. You can plot each group individually, you can create a panel of graphs, or you can overlay the groups on a single graph:

  • If you use the BY statement in PROC SGPLOT, each subgroup is plotted independently in its own graph. The axes are scaled based only on the data in that subgroup.
  • If you use the PANELBY statement in PROC SGPANEL, each subgroup is plotted in a cell of a lattice in which the axes are scaled to a common range.
  • If you use the GROUP= option, the plots for each subgroup are overlaid in a single graph.

The following SAS statements demonstrate each approach. Only the GROUP= overlay is displayed because that is the topic of this article:

proc sgplot data=Sashelp.Iris;        /* BY-group visualization. Three independent graphs.  */
   by Species;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgpanel data=Sashelp.Iris;       /* Panel visualization. Shared common axis. */
   panelby Species / columns=1 onepanel;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgplot data=Sashelp.Iris;       /* Overlay three plots in one graph  */
   histogram SepalLength / GROUP=Species binstart=42 binwidth=3 transparency=0.5;
   density SepalLength / type=kernel GROUP=Species;
run;

How to emulate the GROUP= option

Many SGPLOT statements (such as the SERIES and SCATTER statements) have supported the GROUP= option since the early days of ODS graphics. For other statements, support for the GROUP= option was added more recently. For example, the GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4M2.

Here is a trick (shown to me by my colleague, Paul) that you can use to emulate the GROUP= option. If a statement in the SGPLOT procedure does not support the GROUP= option, but the statement DOES support the FREQ= option, you can often use the FREQ= option to construct a graph that overlays the subgroups. You need to do two things. First, create a binary indicator variable (sometimes called a dummy variable) for each level of the categorical variable. Second, use multiple statements, each with a different frequency variable, to overlay the subgroups. These two steps are shown by the following DATA step and call to PROC SGPLOT, which uses the FREQ= trick to overlay three histograms:

/* emulate a GROUP= option for SGPLOT statements that do not support GROUP= */
data IrisFreq;
   set sashelp.Iris;
   Freq1 = (Species='Setosa');     /* Binary. Equals 1 if observation is in 'Setosa' group     */
   Freq2 = (Species='Versicolor'); /* Binary. Equals 1 if observation is in 'Versicolor' group */
   Freq3 = (Species='Virginica');  /* Binary. Equals 1 if observation is in 'Virginica' group  */
run;
 
title "Overlay Histograms by Using the FREQ= Option";
%let binOpts = binstart=42 binWidth=3 transparency=0.5; /* ensure common bins */
proc sgplot data=IrisFreq;
   histogram SepalLength / freq=Freq1 &binOpts;    /* only the 'Setosa' group     */
   histogram SepalLength / freq=Freq2 &binOpts;    /* only the 'Versicolor' group */
   histogram SepalLength / freq=Freq3 &binOpts;    /* only the 'Virginica' group  */
run;

The graph overlays three histograms, one for each value of the Species variable. The result is similar to the earlier graph that used the GROUP= option. You can use the same trick on the DENSITY statement, although you will need to manually set the line attributes so that they match the attributes for the corresponding histograms.
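
As a rough sketch of the DENSITY variant, the following call adds one kernel density curve per group. It assumes the IrisFreq data set and the Freq1-Freq3 indicator variables created above. The explicit GraphData1-GraphData3 style elements and the names d1-d3 are my own choices for keeping each curve in step with its histogram, not something the original example prescribes:

```sas
/* Sketch: overlay kernel density curves by using the FREQ= trick.
   Assumes IrisFreq (with Freq1-Freq3) was created by the earlier DATA step. */
%let binOpts = binstart=42 binWidth=3 transparency=0.5;   /* ensure common bins */
title "Overlay Histograms and Densities by Using the FREQ= Option";
proc sgplot data=IrisFreq;
   /* assign fill attributes explicitly so that each group has a known color */
   histogram SepalLength / freq=Freq1 &binOpts fillattrs=GraphData1;
   histogram SepalLength / freq=Freq2 &binOpts fillattrs=GraphData2;
   histogram SepalLength / freq=Freq3 &binOpts fillattrs=GraphData3;
   /* match each density curve to the histogram for the same group */
   density SepalLength / type=kernel freq=Freq1 lineattrs=GraphData1
                         legendlabel="Setosa" name="d1";
   density SepalLength / type=kernel freq=Freq2 lineattrs=GraphData2
                         legendlabel="Versicolor" name="d2";
   density SepalLength / type=kernel freq=Freq3 lineattrs=GraphData3
                         legendlabel="Virginica" name="d3";
   keylegend "d1" "d2" "d3";       /* show only the density curves in the legend */
run;
```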

You can use this technique in old versions of SAS to emulate the GROUP= option on the HISTOGRAM statement. You can also use it for statements that do not support the GROUP= option.

Although this example uses the DATA step to manually create the dummy variables that are used as frequencies, you can also create the dummy variables automatically by generating the "design matrix" for the Species variable. The GLMMOD procedure is the simplest way to create dummy variables in SAS, but other procedures provide additional features.
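
A minimal sketch of the GLMMOD approach follows. It assumes the default GLM parameterization, in which Col1 of the OUTDESIGN= data set is the intercept and Col2-Col4 are the indicator columns for the three Species levels in sorted order. The data set name Design is hypothetical, and the one-to-one merge assumes that the design rows are in the same order as the input data (which should hold when no observations are dropped for missing values):

```sas
/* Sketch: let PROC GLMMOD generate the indicator (dummy) variables */
proc glmmod data=Sashelp.Iris noprint
            outdesign=Design(keep=Col2-Col4
                             rename=(Col2=Freq1 Col3=Freq2 Col4=Freq3));
   class Species;
   model SepalLength = Species;    /* the response variable is arbitrary here */
run;

data IrisFreq;
   merge Sashelp.Iris Design;      /* one-to-one merge; assumes identical row order */
run;
```

It is worth printing a few rows of IrisFreq to confirm that Freq1-Freq3 line up with the Species values before you use them as frequencies.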

Generate prediction ellipses for groups

Several years ago I showed how you can overlay prediction ellipses for each group on a scatter plot. (Note that the ELLIPSE statement does not support a GROUP= option.) The technique requires that you transpose the data from long to wide form by creating new variables, one for each group of the categorical variable. Paul recognized that creating a dummy variable and using the FREQ= option is a simpler way to overlay prediction ellipses on a scatter plot:

title "Prediction Ellipses for Iris Data";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa";
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor";
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica";
run;

Advantages and disadvantages of the FREQ= trick

The main advantage of using the FREQ= option for group processing is that it enables you to overlay subgroups even when a statement does not support the GROUP= option. A secondary advantage is that this technique gives you complete control of the attributes of each subgroup. Although you can use the STYLEATTRS statement to control many group attributes, the STYLEATTRS statement does not enable you to control marker sizes or line widths, to name two examples.

The FREQ= trick does have some disadvantages:

  • You can't use the FREQ= trick for statements that produce graphs of categorical variables. The SGPLOT documentation states, “If your plot is overlaid with other categorization plots, then the first FREQ variable that you specified is used for all of the plots.” [My emphasis.]
  • As mentioned earlier, if you are trying to produce multiple grouped plots, you might need to manually assign attributes to obtain consistency among the levels of the grouping variables. By default, most ODS styles use different attributes for each statement. If you want the attributes for the fourth statement to match the attributes for the first statement, you need to use an option such as LINEATTRS=GraphData1 on the fourth statement.
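
To illustrate both the extra control and the manual bookkeeping, here is a sketch of the ellipse example with explicit line attributes. It assumes the IrisFreq data set from earlier. It also assumes that the grouped scatter plot assigns GraphData1-GraphData3 to the species in the order they appear in the data. The thickness value is arbitrary:

```sas
/* Sketch: match each ellipse to its group's color and widen the lines */
title "Prediction Ellipses with Explicit Attributes";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa"
           lineattrs=GraphData1(thickness=2);
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor"
           lineattrs=GraphData2(thickness=2);
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica"
           lineattrs=GraphData3(thickness=2);
run;
```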

In conclusion, if a statement supports the GROUP= option, you should probably use that option to overlay plots of the groups. But if a statement does NOT support the GROUP= option (such as the ELLIPSE and HEATMAP statements), you can use the FREQ= trick to emulate the GROUP= behavior.

I thank my colleague, Paul, for showing me the ellipse example. I hope you agree that this trick is a real treat, not just on Halloween, but every day!

The post A trick to plot groups in PROC SGPLOT appeared first on The DO Loop.

October 30, 2018
 

Note: Today’s utility industry is in upheaval. All of the assumptions the business has run on have been turned on their heads. This post is the second in a three-part series looking at how analytics are helping utilities navigate this challenging landscape and find new opportunities for improvements in operations, [...]

The digital utility era is now: Charging ahead with EV analytics was published on SAS Voices by Mike F. Smith

October 30, 2018
 

GAHHHHHHHHH! My screams filled the office hallways on a Sunday afternoon. In agony, I hopped to the ice machine to find relief for my crushed toe. Moments before, while hanging my shiny new patent plaque on the wall, it dropped six feet and landed on my big toe joint. Since [...]

Screaming analytics. Let your safety data whisper. was published on SAS Voices by Marcia Walker

October 30, 2018
 

In an earlier blog post my colleague, Suneel Grover, wrote about how to take advantage of the product recommendations technology implemented in SAS Customer Intelligence 360. He noted that two methods are implemented behind the scenes, a visitor-centric and a product (or item)-centric approach. And most importantly, while some sophisticated [...]

Better product-centric recommendations using SAS Customer Intelligence 360 was published on Customer Intelligence Blog.

October 29, 2018
 

If you want to bootstrap the parameters in a statistical regression model, you have two primary choices. The first, case resampling, is discussed in a previous article. This article describes the second choice, which is resampling residuals (also called model-based resampling). This article shows how to implement residual resampling in Base SAS and in the SAS/IML matrix language.

Residual resampling assumes that the model is correctly specified and that the error terms in the model are identically distributed and independent. However, the errors do not need to be normally distributed. Before you run a residual-resampling bootstrap, you should use regression diagnostic plots to check whether there is an indication of heteroskedasticity or autocorrelation in the residuals. If so, do not use this bootstrap method.

Although residual resampling is primarily used for designed experiments, this article uses the same data set as in the previous article: the weights (Y) and heights (X) of 19 students. Therefore, you can compare the results of the two bootstrap methods. As in the previous article, this article uses the bootstrap to examine the sampling distribution and variance of the parameter estimates (the regression coefficients).

Bootstrap residuals

The following steps show how to bootstrap residuals in a regression analysis:

  1. Fit a regression model that regresses the original response, Y, onto the explanatory variables, X. Save the predicted values (Y_Pred) and the residual values (R).
  2. Form a bootstrap sample by creating a new response vector, Y_{i,Boot} = Y_{i,Pred} + R_rand, where Y_{i,Pred} is the i_th predicted value and R_rand is chosen randomly (with replacement) from the residuals in Step 1. Create B samples, where B is a large number.
  3. For each bootstrap sample, fit a regression model that regresses Y_Boot onto X.
  4. The bootstrap distribution is the union of all the statistics that you computed in Step 3. Analyze the bootstrap distribution to estimate standard errors and confidence intervals for the parameters.

Step 1: Fit a model, save predicted and residual values

To demonstrate residual resampling, I will use procedures in Base SAS and SAS/STAT. (A SAS/IML solution is presented at the end of this article.) The following statements create the data and rename the response variable (Weight) to Y and the explanatory variable (Height) to X so that their roles are clear. Step 1 of the residual bootstrap is to fit a model and save the predicted and residual values:

data sample(keep=x y);
   set Sashelp.Class(rename=(Weight=Y Height=X));
run;
 
/* 1. compute value of the statistic on original data */
proc reg data=Sample plots=none;
   model Y = X / CLB covb;
   output out=RegOut predicted=Pred residual=Resid;
run; quit;
 
%let IntEst = -143.02692;  /* set some macro variables for later use */
%let XEst   =    3.89903;
%let nObs   =   19;
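
The macro variables above are typed in from the PROC REG output. If you prefer not to copy numbers by hand, one possible alternative is to read them from an OUTEST= data set. This is only a sketch: the data set name PE is hypothetical, and the row count equals the number of observations used in the fit only when the data contain no missing values (as is the case for Sashelp.Class):

```sas
/* Sketch: populate the macro variables from the fit instead of typing them */
proc reg data=Sample noprint outest=PE;
   model Y = X;
run; quit;

data _null_;
   set PE;                              /* one row; Intercept and X hold the estimates */
   call symputx("IntEst", Intercept);
   call symputx("XEst",   X);
run;

data _null_;
   if 0 then set Sample nobs=n;         /* get the row count without reading data */
   call symputx("nObs", n);
   stop;
run;

%put &=IntEst &=XEst &=nObs;            /* write the values to the log */
```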

Step 2: Form the bootstrap resamples

The second step is to randomly draw residuals and use them to generate new response vectors from the predicted values of the fitted model. There are several ways to do this. If you have SAS 9.4m5 (SAS/STAT 14.3), you can use PROC SURVEYSELECT to select and output the residuals in a random order. An alternative method that uses the SAS DATA step is shown below. The program reads the residuals into an array. For each original observation, it adds a random residual to the predicted response. It does this B times, where B = 5,000. It then sorts the data by the SampleID variable so that the B bootstrap samples are ready to be analyzed.

%let NumSamples = 5000;             /* B = number of bootstrap resamples */
/* SAS macro: chooses random integer in [min, max]. See
   https://blogs.sas.com/content/iml/2015/10/05/random-integers-sas.html */
%macro RandBetween(min, max);
   (&min + floor((1+&max-&min)*rand("uniform")))
%mend;
 
/* 2. For each obs, add random residual to predicted values to create 
      a random response. Do this B times to create all bootstrap samples. */
data BootResiduals;
array _residual[&nObs] _temporary_; /* array to hold residuals */
do i=1 to &nObs;                    /* read data; put residuals in array */
   set RegOut point=i;
   _residual[i] = Resid;
end;
 
call streaminit(12345);             /* set random number seed */
set RegOut;                         /* for each observation in data */
ObsID = _N_;                        /* optional: keep track of obs position */
do SampleId = 1 to &NumSamples;     /* for each bootstrap sample */
   k = %RandBetween(1, &nObs);      /* choose a random residual */
   YBoot = Pred + _residual[k];     /* add it to predicted value to create new response */
   output;
end;
drop k;
run;
 
/* prepare for BY group analysis: sort by SampleID */
proc sort data=BootResiduals;
by SampleID ObsID;                  /* sorting by ObsID is optional */
run;

Step 3: Analyze the bootstrap resamples

The previous step created a SAS data set (BootResiduals) that contains B = 5,000 bootstrap samples. The unique values of the SampleID variable indicate which observations belong to which sample. You can use a BY-group analysis to efficiently analyze all samples in a single call to PROC REG, as follows:

/* 3. analyze all bootstrap samples; save param estimates in data set  */
proc reg data=BootResiduals noprint outest=PEBootResid; 
   by SampleID;
   model YBoot = X;
run;quit;
 
/* take a peek at the first few bootstrap estimates */
proc print data=PEBootResid(obs=5);
   var SampleID Intercept X;
run;

The result of the analysis is a data set (PEBootResid) that contains 5,000 rows. The i_th row contains the parameter estimates for the i_th bootstrap sample. Thus, the Intercept and X columns contain the bootstrap distribution of the parameter estimates, which approximates the sampling distribution of those statistics.

Step 4: Analyze the bootstrap distribution

The previous article includes SAS code that estimates the standard errors and covariance of the parameter estimates. It also provides SAS code for a bootstrap confidence interval for the parameters. I will not repeat those computations here. The following call to PROC SGPLOT displays a scatter plot of the bootstrap distribution of the estimates. You can see from the plot that the estimates are negatively correlated:

/* 4. Visualize and analyze the bootstrap distribution */
title "Parameter Estimates for &NumSamples Bootstrap Samples";
title2 "Residual Resampling";
proc sgplot data=PEBootResid;
   label Intercept = "Estimate of Intercept" X = "Estimate of Coefficient of X";
   scatter x=Intercept y=X / markerattrs=(Symbol=CircleFilled) transparency=0.7;
   /* Optional: draw reference lines at estimates for original data */
   refline &IntEst / axis=x lineattrs=(color=blue);
   refline &XEst / axis=y lineattrs=(color=blue);
   xaxis grid; yaxis grid;
run;
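
For readers who want numbers as well as the plot, the following sketch summarizes the bootstrap distribution in PEBootResid. It is not the code from the previous article. It uses the standard deviations as bootstrap standard errors and a simple percentile interval, which is only one of several ways to form a bootstrap confidence interval. The output data set name BootCI and the variable prefixes are hypothetical:

```sas
/* Sketch: bootstrap standard errors, covariance, and 95% percentile intervals */
proc means data=PEBootResid mean std;
   var Intercept X;                 /* Std Dev column estimates the standard errors */
run;

proc corr data=PEBootResid cov noprob;
   var Intercept X;                 /* covariance and correlation of the estimates */
run;

proc univariate data=PEBootResid noprint;
   var Intercept X;
   output out=BootCI pctlpts=2.5 97.5 pctlpre=Int_ X_ pctlname=Lower Upper;
run;

proc print data=BootCI noobs; run;  /* Int_Lower Int_Upper X_Lower X_Upper */
```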

Bootstrap residuals in SAS/IML

For standard regression analyses, the previous sections show how to bootstrap residuals in a regression analysis in SAS. If you are doing a nonstandard analysis, however, you might need to perform a bootstrap analysis in the SAS/IML language. I've previously shown how to perform a bootstrap analysis (using case resampling) in SAS/IML. I've also shown how to use the SWEEP operator in SAS/IML to run thousands of regressions in a single statement. You can combine these ideas, along with an easy way to resample data (with replacement) in SAS/IML, to write a short SAS/IML program that resamples residuals, fits the samples, and plots the bootstrap estimates:

proc iml;
use RegOut;  read all var {Pred X Resid};  close; /* read data */
design = j(nrow(X), 1, 1) || X;           /* design matrix for explanatory variables */
 
/* resample residuals with replacement; form bootstrap responses */
call randseed(12345);
R = sample(Resid, &NumSamples//nrow(Resid)); /* NxB matrix; each col is random residual */
YBoot = Pred + R;                         /* 2. NxB matrix; each col is Y_pred + resid */
M = design || YBoot;                      /*    add intercept and X columns */
ncol = ncol(M);
MpM = M`*M;                               /*    (B+2) x (B+2) crossproduct matrix */
free R YBoot M;                           /*    free memory */
 
/* Use SWEEP function to run thousands of regressions in a single call */
S = sweep(MpM, {1 2});                    /* 3. sweep in intercept and X for each Y_i */
ParamEst = S[{1 2}, 3:ncol];              /*    estimates for models Y_i = X */
 
/* 4. perform bootstrap analyses by analyzing ParamEst */
call scatter(ParamEst[1,], ParamEst[2,]) grid={x y} label={"Intercept" "X"}
     other="refline &IntEst /axis=x; refline &XEst /axis=y;";

The graph is similar to the previous scatter plot and is not shown. It is remarkable how concise the SAS/IML language can be, especially compared to the procedure-based approach in Base SAS. On the other hand, the SAS/IML program uses a crossproduct matrix (MpM) that contains approximately B^2 elements. If you want to analyze, say, B = 100,000 bootstrap samples, you would need to restructure the program. For example, you could analyze 10,000 samples at a time and accumulate the bootstrap estimates.
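
One way to restructure the program is sketched below. It processes the bootstrap samples in blocks so that each crossproduct matrix stays small, and it stores the estimates in a preallocated 2 x B matrix. The block size of 5,000 is an arbitrary choice that matches the size used above, and the variable names (B, chunk, b1, b2, nb) are hypothetical. The sketch assumes the RegOut data set from Step 1:

```sas
proc iml;
use RegOut;  read all var {Pred X Resid};  close;   /* read data */
design = j(nrow(X), 1, 1) || X;           /* design matrix: intercept and X */
call randseed(12345);

B     = 100000;                           /* total number of bootstrap samples */
chunk = 5000;                             /* samples per block (arbitrary choice) */
ParamEst = j(2, B, .);                    /* preallocate 2 x B matrix of estimates */

do b1 = 1 to B by chunk;
   b2 = min(b1 + chunk - 1, B);           /* last column index for this block */
   nb = b2 - b1 + 1;                      /* number of samples in this block */
   R = sample(Resid, nb // nrow(Resid));  /* N x nb matrix of resampled residuals */
   M = design || (Pred + R);              /* intercept, X, and nb bootstrap responses */
   S = sweep(M`*M, {1 2});                /* (nb+2) x (nb+2) crossproduct, then sweep */
   ParamEst[, b1:b2] = S[{1 2}, 3:ncol(M)];  /* store estimates for this block */
end;

BootMean = ParamEst[, :];                 /* mean of each row (Intercept, X) */
BootSE   = std(ParamEst`);                /* bootstrap standard errors */
print BootMean, BootSE;
quit;
```

With this layout, memory use is governed by the block size rather than by B, at the cost of an explicit loop.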

Further thoughts and summary

The %BOOT macro also supports resampling residuals in a regression context. The documentation for the %BOOT macro contains an example. No matter how you choose to bootstrap residuals, remember that the process assumes that the model is correct and that the error terms in the model are identically distributed and independent.

Use caution if the response variable has a physical meaning, such as a physical dimension that must be positive (length, weight, ...). When you create a new response as the sum of a predicted value and a random residual, you might obtain an unrealistic result, especially if the data contain an extreme outlier. If your resamples contain a negative length or a child whose height is two meters, you might reconsider whether resampling residuals is appropriate for your data and model.

When it is appropriate, the process of resampling residuals offers a way to use the bootstrap to investigate the variance of many parameters that arise in regression. It is especially useful for data from experiments in which the explanatory variables have values that are fixed by the design.

The post Bootstrap regression estimates: Residual resampling appeared first on The DO Loop.

October 29, 2018
 

CASL is a language specification that can be used by the SAS client to interact with and provide easy access to Cloud Analytic Services (CAS).  CASL is a statement-based scripting language with many uses and strengths including:

  • Specifying CAS actions to submit requests to the CAS server to perform work and return results.
  • Evaluating and manipulating the results returned by an action.
  • Creating user-defined actions and functions and creating the arguments to an action.
  • Developing analytic pipelines.

CASL uses PROC CAS, which enables you to program and execute CAS actions from the SAS client and use the results to prepare the parameters for a subsequent action.  A single PROC CAS statement can contain several CASL programs.  With the CAS procedure you can run any CAS action supported by the server, load new action sets into the server, use multiple sessions to perform asynchronous execution, and operate on parameters and results as variables using the function expression parser.

CASL, and the CAS actions, provide the most control, flexibility and options when interacting with CAS.  One can use DATA Step, CAS-enabled PROCS and CASL for optimal flexibility and control.  CASL works well with traditional SAS interfaces and the Base SAS language.

Each CAS action belongs to an action set.  Each action set is further categorized by product (e.g., VA, VS, VDMML).  In addition to the many CAS actions supplied by SAS, as of SAS® Viya™ 3.4, you can create your own actions using CASL.  Developing and utilizing your own CAS actions allows you to further customize your code and increase your ability to work with CAS in a manner that best suits you and your organization.

About user-defined action sets

A user-defined action set is a CASL program that is stored on the CAS server for processing.  Since the action set is stored on the CAS server, the CASL statements can be written once and executed by many users. This can reduce the need to exchange files between users that store common code.  Note that you cannot add, remove, or modify a single user-defined action. You must redefine the entire action set.

Before creating any user-defined actions, test your routines and functions first to ensure they execute successfully in CAS when submitted from the programming client.  To create user-defined actions, use the defineActionSet action in the builtins action set and add your code.  You also need to modify your code to use CASL functions such as SEND_RESPONSE, so the resulting objects on the server are returned to the client.

Developing new actions by combining SAS-provided CAS actions

One method for creating user-defined CAS actions is to combine one or more SAS-provided CAS actions into a user-defined CAS action.  This allows you to execute just one PROC CAS statement and call all user-defined CAS actions.  This is beneficial if you repeatedly run many of the same actions against a CAS table.  An example of this is shown below. If you would like a copy of the actual code, feel free to leave a reply below.

In this example, four user-defined CAS actions named listTableInfo, simplefreq, detailfreq, and corr have been created by using the corresponding SAS-provided CAS actions tableInfo, freq, freqTab, and correlation.  These four actions return information about a CAS table, simple frequency information, detailed frequency and tabulate information, and Pearson correlation coefficients respectively.  These four actions are now part of the newly created user-defined action set myActionSet.  When this code is executed, the log will display a note that the new action set has been added.

Once the new action set and actions have been created, you can call all four or any combination of them via a PROC CAS statement.  Specify the user-defined action set, user-defined action(s), and parameters for each.
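
The general shape of such a program might look like the following sketch. It is not the author's code. It defines only two of the four actions, and the parameter names (tbl, lib, var) and the table name iris are hypothetical. It assumes an active CAS session and a table named iris already loaded in the casuser caslib:

```sas
/* Sketch only: define a user-defined action set with two actions */
proc cas;
   builtins.defineActionSet /
      name="myActionSet"
      actions={
         {name="listTableInfo",
          desc="Return information about a CAS table",
          parms={{name="tbl", type="string", required=TRUE},
                 {name="lib", type="string", required=TRUE}},
          definition="table.tableInfo result=r / caslib=lib name=tbl;
                      send_response(r);"},
         {name="simplefreq",
          desc="Return simple frequencies for one variable",
          parms={{name="tbl", type="string", required=TRUE},
                 {name="lib", type="string", required=TRUE},
                 {name="var", type="string", required=TRUE}},
          definition="simple.freq result=r /
                         table={caslib=lib, name=tbl} inputs={var};
                      send_response(r);"}
      };
quit;

/* Call the user-defined actions: action set, action, and parameters */
proc cas;
   myActionSet.listTableInfo / tbl="iris" lib="casuser";
   myActionSet.simplefreq    / tbl="iris" lib="casuser" var="Species";
quit;
```

Each definition ends with SEND_RESPONSE so that the result object created on the server is returned to the client.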

Developing new actions by writing your own code

Another way to create user-defined CAS actions is to apply user-defined code, functions, and statements instead of SAS-provided CAS actions.

In this example, two user-defined CAS actions have been created, bdayPct and sos.  These actions belong to the new user-defined action set myFunctionSet.

To call one or both actions, specify the user-defined action set, user-defined action(s), and parameters for each.

The results for each action are shown in the log.

Save and load custom actions across CAS sessions

User-defined action sets only exist in the current CAS session.  If the current CAS session is terminated, the program to create the user-defined action set must be executed again unless an in-memory table is created from the action set and the in-memory table is subsequently persisted to a SASHDAT file.  Note: SASHDAT files can only be saved to path-based caslibs such as Path, DNFS, HDFS, etc.  To create an in-memory table and persist it to a SASHDAT file, use the actionSetToTable and save CAS actions.

To use the user-defined action set, it needs to be restored from the saved SASHDAT file.  This is done with the actionSetFromTable action.
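
A sketch of the save-and-restore round trip is shown below. The table and file names are hypothetical, the casuser caslib is assumed to be path-based, and the parameter names reflect my understanding of the builtins and table action sets, so check them against the documentation for your release. The restore step loads the SASHDAT file back into memory before calling actionSetFromTable:

```sas
/* Sketch: persist a user-defined action set */
proc cas;
   /* 1. Copy the action set definition into an in-memory table */
   builtins.actionSetToTable /
      actionSet="myActionSet"
      casOut={caslib="casuser", name="myActionSetTbl", replace=TRUE};
   /* 2. Save that table as a SASHDAT file in a path-based caslib */
   table.save /
      table={caslib="casuser", name="myActionSetTbl"}
      caslib="casuser" name="myActionSetTbl.sashdat" replace=TRUE;
quit;

/* Sketch: restore the action set in a later CAS session */
proc cas;
   table.loadTable /
      caslib="casuser" path="myActionSetTbl.sashdat"
      casOut={caslib="casuser", name="myActionSetTbl", replace=TRUE};
   builtins.actionSetFromTable /
      table={caslib="casuser", name="myActionSetTbl"}
      name="myActionSet";
quit;
```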

More about CASL programming and CAS actions

Check out these resources for further information on programming in the CASL language and running actions with CASL.

How to use CASL to develop and work with user-defined CAS actions was published on SAS Users.

October 25, 2018
 

The other day I was playing around with the voter registration data for all ~8 million registered voters in North Carolina (yes - this guy knows how to have fun!), and I got to wondering what last names were the most common. I summarized the data by county, and the [...]

The post What is the most common last name in each North Carolina county? appeared first on SAS Learning Post.

October 25, 2018
 

Recommendation systems work best when you can provide them with as many relevant examples as possible. On the other hand, increasing the number of products or offers leads to recommendation blind spots, especially early on in training the system. This cold-start problem is a challenge for most recommendation systems. SAS [...]

How hybrid recommendations improve SAS Customer Intelligence 360 results was published on Customer Intelligence Blog.