8月 092018
 

Recently I’ve been listening to the BBC Radio Series 50 Things That Made the Modern Economy, which was first broadcast in 2016. One of the episodes considers the impact of a simple box (the shipping container) and concludes its invention was a major contributor to the post-war boom in global trade. It’s worth a listen, if you can.

Notwithstanding the tenuous link, containerization is having perhaps an equally significant impact on Cloud Computing and I want to share a recent experience which highlights the convenience of containers. I’m not aiming to summarize all the multiple SAS initiatives in the Cloud (including SAS Viya and Cloud Foundry) here rather it’s to share a few observations about a specific offering for SAS 9.4.

Recently I attended a demonstration by SAS’ Doug Liming on SAS Analytics for Containers. While this product was launched in 2016, until now I confess I’d not appreciated its simplicity or potential. I’d like to use this blog post to share what I saw & learned because this session served as a bit of an epiphany for me.

As a reminder SAS Analytics for Containers consists of:

    • Foundation SAS (Base, STAT & Graph) ready-packaged to be deployed in a Docker container.
    • SAS Studio.
    • Optional SAS/Access connectors & Accelerators.

In the space of 20 minutes, Doug took us through the The power and potential of simplicity: SAS 9.4 and Containers was published on SAS Users.

8月 092018
 

We’re bringing the concept of #VideoTag to LinkedIn. What's #VideoTag, you ask? It's an online adaptation of the old schoolyard game. In short, you record a video of yourself, upload it to LinkedIn and tag others to respond. It’s a fun, easy way to spur conversation online by showing your [...]

#VideoTag: the online game you’ve never heard of (and will want to join) was published on SAS Voices by Lindsay Beth Gunter

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

8月 072018
 

How important is the customer experience in communications today? “Very important” is an understatement. It should be the top area of focus for all communications providers. It’s no secret that when customers are extremely satisfied, they become brand champions for the companies that provide them the products and services they [...]

How are CSPs using real-time analytics to improve customer experience? was published on SAS Voices by Michael Pawlak

8月 062018
 

There are many quotes with words of wisdom to help you live your life. But sometimes one quote seems to contradict another. For example, "Don't sweat the small stuff." ... and "The devil is in the details." When it comes to creating graphs (and perhaps living my life in general), [...]

The post Choose your markers carefully! (for scatter plots, that is) appeared first on SAS Learning Post.

8月 062018
 

Looking for use cases for analytics to derive value at an electrical utility? We have identified over 125 ways you can use analytics to improve the business processes at an electric utility. I recently posted a series of blog posts discussing four different use cases. Now I'd like to share [...]

12 real-world examples of smart grid analytics was published on SAS Voices by David Pope

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.