I peruse many different websites to get my news, and I always keep an eye out for good (or bad) presentations of data. I recently saw a posting on reddit claiming "U.S. GDP is greater than the total of all others combined." This news seemed too good to be true [...]

The post Fact checking a reddit post about GDP appeared first on SAS Learning Post.

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

### What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

```
data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;

title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;
```

The attempt fails. The graph does not show six individual curves. Instead, it shows three curves (one for each level of the G1 variable), and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X, and the GROUP=G1 option connects all observations that share a G1 value in data order.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

### Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

• Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that are derived from the original variables.
• Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

```
/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;

data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then GroupID + 1;     /* GroupID = 1, 2, 3, ... numGroups */
run;

/* For Option 2: use PROC FORMAT to form labels that encode the two groups.
   See https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html */
proc format;
value GroupFmt
      1 = "G1=1; G2='A'"
      2 = "G1=1; G2='B'"
      3 = "G1=2; G2='A'"
      4 = "G1=2; G2='B'"
      5 = "G1=3; G2='A'"
      6 = "G1=3; G2='B'";
run;
```

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

```
title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
   /* format GroupID GroupFmt.; */        /* for Option 2 */
   series x=x y=y / group=Label grouplc=G2 /* or group=GroupID for Option 2 */
                    lineattrs=(pattern=solid) curvelabel;
run;
```

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.
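The effect of the joint group variable can also be checked outside of SGPLOT. The following is a minimal Python sketch (not SAS, and not part of the original program) that rebuilds the TwoGroups data and splits it into polylines, first by G1 alone and then by the concatenated label. The helper names `split_by` and `x_is_increasing` are hypothetical and exist only for this illustration.

```python
from itertools import groupby

# Rebuild the TwoGroups data in the same sort order: G1, then G2, then X
rows = [
    {"G1": g1, "G2": g2, "X": x,
     "Y": 10 + g1 + 0.5 * (g2 == "A") + g1 * x / 20}
    for g1 in (1, 2, 3)
    for g2 in ("A", "B")
    for x in range(1, 11)
]

def split_by(rows, key):
    """Split rows into polylines, one per distinct key value (hypothetical helper)."""
    ordered = sorted(rows, key=key)  # stable sort keeps the X order within each key
    return {k: [(r["X"], r["Y"]) for r in grp]
            for k, grp in groupby(ordered, key=key)}

def x_is_increasing(line):
    """True if the X coordinates of a polyline never jump backward."""
    xs = [x for x, _ in line]
    return all(a < b for a, b in zip(xs, xs[1:]))

# Grouping by G1 alone: each polyline contains both the 'A' and the 'B' curve
by_g1 = split_by(rows, key=lambda r: r["G1"])

# Grouping by the concatenated label: one polyline per (G1, G2) combination
by_label = split_by(rows, key=lambda r: f"G1={r['G1']}; G2={r['G2']}")

print(len(by_g1), [x_is_increasing(v) for v in by_g1.values()])
print(len(by_label), all(x_is_increasing(v) for v in by_label.values()))
```

Grouping by G1 alone gives three polylines whose X values run from 1 to 10 and then restart at 1, which is the Z shape in the failed plot. Grouping by the joint label gives six polylines, each with increasing X.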

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

How important is the customer experience in communications today? “Very important” is an understatement. It should be the top area of focus for all communications providers. It’s no secret that when customers are extremely satisfied, they become brand champions for the companies that provide them the products and services they [...]

There are many quotes with words of wisdom to help you live your life. But sometimes one quote seems to contradict another. For example, "Don't sweat the small stuff." ... and "The devil is in the details." When it comes to creating graphs (and perhaps living my life in general), [...]

The post Choose your markers carefully! (for scatter plots, that is) appeared first on SAS Learning Post.

Looking for use cases for analytics to derive value at an electrical utility? We have identified over 125 ways you can use analytics to improve the business processes at an electric utility. I recently posted a series of blog posts discussing four different use cases. Now I'd like to share [...]

12 real-world examples of smart grid analytics was published on SAS Voices by David Pope

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right.

### The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

```
data Orig;     /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;

proc quantreg data=Orig
   algorithm=IPM                           /* use IPM algorithm for splines and binned data */
   ci=none plot(maxpoints=none)=fitplot;   /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
```

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

### Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement, so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of values of the independent variables. To use the missing value trick, do the following:

1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.
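The three steps do not depend on any particular procedure. As a rough illustration, here is a minimal Python sketch (not SAS) that applies them with an ordinary least squares line instead of a quantile regression. The toy data values and the helper `fit_and_predict` are assumptions made up for this example. The point is only that rows with a missing response are ignored during the fit but still receive a predicted value.

```python
# Toy data: the response y is roughly 2 + 3*x (values are made up for illustration)
orig = [{"x": x, "y": 2 + 3 * x + e}
        for x, e in zip([0, 1, 2, 3, 4], [0.1, -0.1, 0.0, 0.1, -0.1])]

# Step 1: the scoring data has the X values of interest and a missing response
score = [{"x": x, "y": None} for x in (0.5, 1.5, 2.5)]

# Step 2: append the scoring data and add a binary indicator (0 = original, 1 = scoring)
combined = ([dict(r, ScoreData=0) for r in orig] +
            [dict(r, ScoreData=1) for r in score])

def fit_and_predict(rows):
    """Fit y = b0 + b1*x on rows with a non-missing response; predict for every row."""
    used = [r for r in rows if r["y"] is not None]
    n = len(used)
    xbar = sum(r["x"] for r in used) / n
    ybar = sum(r["y"] for r in used) / n
    b1 = (sum((r["x"] - xbar) * (r["y"] - ybar) for r in used)
          / sum((r["x"] - xbar) ** 2 for r in used))
    b0 = ybar - b1 * xbar
    return [dict(r, Pred=b0 + b1 * r["x"]) for r in rows]

# Step 3: fit on the combined data, then keep the predictions for the scoring rows only
out = [r for r in fit_and_predict(combined) if r["ScoreData"] == 1]
for r in out:
    print(r["x"], round(r["Pred"], 3))
```

Because the scoring rows have no response, they cannot influence the fitted coefficients. That is why appending them to the original data is safe.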

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut. Its Pred variable contains the predicted values for each observation in the scoring data.

```
/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html */
Weight = .;                  /* set response (Y) variable to missing */
do MomWtGain = -30 to 40;    /* uniform spacing in the independent (X) variable */
   output;
end;
run;

/* 2. Append the score data to the original data. Use a binary variable to
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;

/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1))  /* output predicted values for the scoring data */
          P=Pred / columnwise;              /* COLUMNWISE option supports multiple quantiles */
run;
```

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.
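To make the difference between the two layouts concrete, the following minimal Python sketch (not SAS) pivots a few long-form records into wide form. The predicted values are made up for illustration, and the wide column names Pred1, Pred2, ... simply mirror the naming described above.

```python
# Long form: one row per (quantile, X) pair, as with COLUMNWISE (values are made up)
long_rows = [
    {"Quantile": 0.1, "MomWtGain": -30, "Pred": 2500.0},
    {"Quantile": 0.5, "MomWtGain": -30, "Pred": 3200.0},
    {"Quantile": 0.1, "MomWtGain": -29, "Pred": 2510.0},
    {"Quantile": 0.5, "MomWtGain": -29, "Pred": 3210.0},
]

# Map each quantile to a wide-form column name: Pred1, Pred2, ...
quantiles = sorted({r["Quantile"] for r in long_rows})
col = {q: f"Pred{i}" for i, q in enumerate(quantiles, start=1)}

# Wide form: one row per X value, one column per quantile
wide = {}
for r in long_rows:
    row = wide.setdefault(r["MomWtGain"], {"MomWtGain": r["MomWtGain"]})
    row[col[r["Quantile"]]] = r["Pred"]
wide_rows = [wide[x] for x in sorted(wide)]

for row in wide_rows:
    print(row)
```

The long form is the convenient one for plotting with a GROUP= variable, because the quantile is already a single column that identifies each curve.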

### Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

```
/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;

/* 5. (optional) If you want to overlay the plots, it's easiest to define
      separate variables for original data and scoring data */
data All;
set Orig                                     /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));  /* scoring data */
run;

title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid label="Mother's Relative Weight Gain (lbs)";
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;
```

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

In many movies, there is often a scene where the star says "We can do this the easy way, or the hard way" (and the hard way usually involves quite a bit of pain). So it is with interrogations ... and so it is with writing SAS code! Today I'm [...]

The post Cumulative values - "We can do this the easy way, or the hard way..." appeared first on SAS Learning Post.

SAS Viya has opened an entirely new set of capabilities, allowing SAS to analyze on cloud technology in real-time. One of the best new features of SAS Viya is its ability to pair with open source platforms, allowing developers the freedom of language and implementation to integrate with the power of SAS analytics.

At SAS Global Forum 2018, Sean Ankenbruck and Grace Heyne Lybrand from Zencos Consulting led the talk, SAS Viya: The Beauty of REST in Action. While the paper – and this blog post – outlines the use of Python and SAS Viya, note that SAS Viya integrates with R, Java and Lua as well.

Nonetheless, this Python integration example shows how easy it is to integrate SAS Viya and open source technologies. Here is the basic workflow:

1. A developer creates a web application, in a language of their choice
2. A user enters data in the web application
3. The collected data is passed to Viya via the defined APIs
4. Analysis is performed in Viya using SAS actions
5. Results are passed back to the web application
6. The web application presents the results to the user

SAS Cloud Analytic Services (CAS) acts as a server to analyze data, and REST APIs are used to integrate many programming languages with SAS Viya. REST stands for Representational State Transfer and is a set of constraints that allows scalability and integration of multiple web-based systems. In layman’s terms, it’s a set of software design patterns that provides handy connector points from one web app to another. The REST API is what developers use to interact with the processing system and submit requests to it.

CAS actions are what allow “tasks” to be completed on SAS Viya. These “tasks” are under the categories of Statistics, Analytics, System, and Data Mining and Machine Learning.

### Integration with Python

To access CAS through Python, the SAS Scripting Wrapper for Analytics Transfer (SWAT) package is used, letting Python conventions dictate CAS actions. To create this interface, data must be captured through a web application in a format that Python can transmit to SAS Viya.
In order to connect Python and CAS, the following is necessary:

• Hostname
• CAS Port

### Let’s see it in action

As an example, one project about wine preferences collected data through a questionnaire and stored it using Python’s pandas library. When the information was gathered, the decision tree was uploaded to SAS Viya. A model was created with common terms reviewers use to describe wines, feeding into a decision tree. The CAS server scored the users’ responses in real time and then sent the results back, providing users with suggested wines to match their inputs.

Process for model

Code to utilize tree:

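```
# conn is assumed to be an existing CAS connection created with the SWAT package
conn.loadactionset("decisionTree")
conn.decisionTree.dTreeTrain(
    casOut = {"name": "tree_model"},
    inputs = [{vars}],   # {vars} is left as a placeholder for the input variables, as in the source
    modelId = "DT_wine_variety",
    table = {"caslib": "public", "name": "wines_model_data"},
    target = "variety")
```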

"Decision" given to user

### Conclusion

SAS Viya has opened SAS to a plethora of opportunities, allowing many different programming languages to be interpreted and quickly integrated, giving analysts and data scientists more flexibility.