Statistical Graphics

11月 122018
 

The SGPLOT procedure enables you to use the value of a response variable to color markers or areas in a graph. For example, you can use the COLORRESPONSE= option to define a variable whose values will be used to color markers in a scatter plot or cells in a heat map. You can use the COLORMODEL= statement to use a pre-defined color ramp or to define and use a custom color ramp. These options are often used to visualize a "third variable" (Z) in a two-dimensional plot (Y vs X).

However, the SGPLOT procedure is not the only way to create color ramps. Recently, I needed to create a custom color ramp and specify a color for values that were outside of a certain range. I also needed to change the color used for missing values in the response variable. This article shows how to perform these tasks by using the RANGEATTRMAP statement in the SAS Graph Template Language (GTL).

Create a basic color ramp in SAS

SAS provides several ways to visualize the values of a response variable in scatter plots, contour plots, heat maps, and other graphs. An example is shown to the right. In this scatter plot, the markers represent data for patients in a heart study. The two-dimensional coordinates of the markers represent the systolic blood pressure and the MRW ratio (a body-mass index) of 80 patients. The colors indicate measurements of cholesterol. Three observations are highlighted by using arrows. One point has an extreme value of cholesterol (400). Two others have missing values of cholesterol and are shown in a grey color by default.

The code that generates the graph follows. The COLORRESPONSE= option is used to name the response variable. The values of the response variable determine the color for each marker according to the color ramp that is specified on the COLORMODEL= option.

data Have;           /* example data: a subset of the Heart data set */
set Sashelp.Heart(firstobs=5000 rename=(Systolic=X MRW=Y Cholesterol=Z)
                  where=(X < 200 AND Y < 200));
label X="Systolic" Z="Cholesterol";
keep X Y Z;
run;
 
title "Markers Colored by Using the COLORRESPONSE= Option";
proc sgplot data=Have;
   scatter x=X y=Y / colorresponse=Z                 /* specify response variable */
                     colormodel=ThreeAltColorRamp    /* specify color ramp */
                     filledoutlinedmarkers markerattrs=(symbol=CircleFilled size=12);
   xaxis grid; yaxis grid;
run;

Advanced features for custom color ramps

The SGPLOT option provide basic functionality for creating a custom color ramp. For more control, you can use either of the following advanced techniques:

The RANGEATTRMAP statement

As shown in the previous example, using a color ramp requires that you specify two pieces of information: a response variable and a color ramp. In GTL, use the RANGEATTRVAR statement to specify the response variable. Use the RANGEATTRMAP statement to define a custom color ramp.

Although the documentation discusses it, I was initially confused about the difference between color ramps and "alt" color ramps. I erroneously assumed that they are functionally equivalent. After all, when I use the COLORMODEL= option in PROC SGPLOT, I can choose any predefined color ramp. For example, the graph at the start of this article is created by using COLORMODEL=ThreeAltColorRamp, which is a blue-black-red color ramp. However, I could just as easily specify COLORMODEL=ThreeColorRamp to use a blue-white-red color ramp.

However, in the RANGEATTRMAP statement, there is a substantive difference between the RANGECOLOR= option and the RANGEALTCOLOR= option:

  • Use the RANGECOLOR= and RANGECOLORMODEL= options to define a color ramp for "area plots" that use a fill color, such as bar charts, histograms, mosaic plots, heat maps, and contour plots. Sanjay's example of a heat map and my example of a mosaic plot are both "area plots" that use the RANGECOLORMODEL= option.
  • Use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options to define a color ramp for graphs that use lines and markers, such as scatter plots and series plots. The documentation for the RANGEATTRMAP statement provides an example of a scatter plot whose markers are colored according to a range attribute map.

The following GTL defines a template that creates a scatter plot that is similar to the one at the beginning of this article. Because I want to color markers, I use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options. The RANGEATTRMAP statement contains three RANGE statements. Each RANGE statement associates a color or colors with ranges of data values:

  • A custom yellow-orange-red color ramp is used for cholesterol values less than 350.
  • A custom color (black) is used for markers whose Z-value is greater than 350.
  • A custom color (lime green) is used for markers whose Z-value is missing.
proc template;
define statgraph scatter3Dcol;
begingraph;
   rangeattrmap name="ResponseRange";
        range min-350 / rangeAltColorModel=(CXFFFFB2 CXFED976 CXFEB24C CXFD8D3C CXFC4E2A CXE31A1C CXB10026);
        range OTHER   / rangeAltColor=Black;   /* or use the OVER or UNDER keyword */
        range MISSING / rangeAltColor=Lime;    /* color for missing values */
   endrangeattrmap;
   rangeattrvar var=Z                          /* specify response variable in data set */
                attrmap="ResponseRange"        /* specify custom color ramp */
                attrvar=RangeVar;              /* alias for this variable/ramp combination */
 
   entrytitle "Markers Colored by Using the RANGEATTRMAP Statement";
   layout overlay;
      scatterplot x=X y=Y / markercolorgradient=RangeVar  /* color by Z and custom color ramp */
                  filledoutlinedmarkers=true
                  markerattrs=(symbol=circlefilled size=12) 
                  name="Scatter";
      continuouslegend "Scatter" / title='Cholesterol'; 
   endlayout;
endgraph;
end;
run;
 
proc sgrender data=Have template=scatter3Dcol;
run;

You can do more with the RANGEATTRMAP statement, but I'll stop here. In summary, you can use the RANGEATTRMAP (and RANGEATTRVAR) statement in the Graph Template Language to define a custom color ramp. The RANGEATTRMAP statement supports features that are not surfaced in PROC SGPLOT, such as enabling you to specify a color for out-of-range values and missing values. If you are going to use the color ramp in an "area plot" such as a heat map, use the RANGECOLOR= and RANGECOLORMODEL= options to define the color ramp. If you are going to use the color ramp to color lines and markers, use the RANGEALTCOLOR= and RANGEALTCOLORMODEL= options.

The post Define custom color ramps by using the RANGEATTRMAP statement in GTL appeared first on The DO Loop.

11月 072018
 

When solving optimization problems, it is harder to specify a constrained optimization than an unconstrained one. A constrained optimization requires that you specify multiple constraints. One little typo or a missing minus sign can result in an infeasible problem or a solution that is unrelated to the true problem. This article shows two ways to check that you've correctly specified the constraints for a two-parameter optimization. The first is a program that translates the linear constraint matrix in SAS/IML software into a system of linear equations in slope-intercept form. The second is a visualization of the constraint region.

For simplicity, this article only discusses two-dimensional constraint regions. For these regions, the SAS/IML constraint matrix is a k x 4 matrix. We also assume that the constraints are linear inequalities that define a 2-D feasible region.

The SAS/IML matrix for boundary and linear constraints

In SAS/IML software you must translate your boundary and linear constraints into a matrix that encodes the constraints. (The OPTMODEL procedure in SAS/OR software uses a more natural syntax to specify constraints.) The first and second rows of the constraint matrix specify the lower and upper bounds, respectively, for the variables. The remaining rows specify the linear constraints. In general, when you have p variables, the first p columns contain a matrix of coefficients; the (p+1)st column encodes whether the constraint is an equality or inequality; the (p+2)th column specifies the right-hand side of the linear constraints. In this article, p = 2.

The following matrix is similar to the one used in a previous article about how to find an initial guess in a feasible region. The matrix encodes three inequality constraints for a two-parameter optimization. The first two rows of the first column specify bounds for the first variable: 0 ≤ x ≤ 10. The first two rows of the second column specify bounds for the second variable: 0 ≤ y ≤ 8. The third through fifth rows specify linear inequality constraints, as indicated in the comments. The third column encodes the direction of the inequality constraints. A value of -1 means "less than"; a value of +1 means "greater than." You can also use the value 0 to indicate an equality constraint.

proc iml;
con = {  0   0   .   .,        /* lower bounds */
        10   8   .   .,        /* upper bounds */
         3  -2  -1  10,        /* 3*x1 + -2*x2 LE 10 */
         5  10  -1  56,        /* 5*x1 + 10*x2 LE 56 */
         4   2   1   7 };      /* 4*x1 +  2*x2 GE  7 */

Verify the constraint matrix

A pitfall of encoding the constraints is that you might wonder whether you made a mistake when you typed the constraint matrix. Also, each linear constraint (row) in the matrix represents a linear equation in "standard form" such as 3*x - 2*y ≤ 10, whereas you might be more comfortable visualizing constraints in the equivalent "slope-intercept form" such as "y ≥ (3/2)x - 5.

It is easy to transform the equations from standard form to slope-intercept form. If c2 > 0, then the standard equation c1*x + c2*y ≥ c0 is transformed to y ≥ (-c1/c2)*x + (c0/c2). If c2 < 0, you need to remember to reverse the sign of the inequality: y ≤ (-c1/c2)*x + (c0/c2). Because the (p+1)th column has values ±1, you can use the SIGN function to perform all transformations without writing a loop. The following function "decodes" the constraint matrix and prints it is "human readable" form:

start BLCToHuman(con);
   nVar = ncol(con) - 2;
   Range = {"x", "y"} + " in [" + char(con[1,1:nVar]`) + "," + char(con[2,1:nVar]`) + "]";
   cIdx = 3:nrow(con);              /* rows for the constraints */
   c1 = -con[cIdx,1] / con[cIdx,2];
   c0 =  con[cIdx,4] / con[cIdx,2];
   sign = sign(con[cIdx,3] # con[cIdx,2]);
   s = {"<=" "=" ">="};
   idx = sign + 2;                 /* convert from {-1 0 1} to {1 2 3} */
   EQN = "y" + s[idx] + char(c1) + "x + " + char(c0);
   print Range, EQN;
finish;
 
run BLCToHuman(con);

Visualize the feasible region

In a previous article, I showed a technique for visualizing a feasible region. The technique is crude but effective. You simply evaluate the constraints at each point of a dense, regular, grid. If a point satisfies the constraints, you plot it in one color; otherwise, you plot it in a different color.

For linear constraints, you can perform this operation by using matrix multiplication. If A is the matrix of linear coefficients and all inequalities are "greater than" constraints, then you simply check that A*X ≥ b, where b is the right-hand side (the (p+2)th column). If any of the inequalities are "less than" constraints, you can multiply that row of the constraint matrix by -1 to convert it into an equivalent "greater than" constraint. This algorithm is implemented in the following SAS/IML function:

start PlotFeasible(con);
   L = con[1, 1:2];                   /* lower bound constraints */
   U = con[2, 1:2];                   /* upper bound constraints */
   cIdx = 3:nrow(con);                /* rows for linear inequality constraints */
   C = con[cIdx,] # con[cIdx,3];      /* convert all inequalities to GT */ 
   M = C[, 1:2];                      /* coefficients for linear constraints */
   RHS = C[, 4];                      /* right hand side */
   x = expandgrid( do(L[1], U[1], (U[1]-L[1])/50),    /* define (x,y) grid from bounds */
                   do(L[2], U[2], (U[2]-L[2])/50) );
   q = (M*x`>= RHS);                  /* 0/1 indicator matrix */
   Feasible = (q[+,] = ncol(cIdx));   /* are all constraints satisfied? */
   call scatter(x[,1], x[,2],) group=Feasible grid={x y} option="markerattrs=(symbol=SquareFilled)";
finish;
 
ods graphics / width=430px height=450px;
title "Feasible Region";
title2 "Formed by Bounds and Linear Constraints";
run PlotFeasible(con);

The graph shows that the feasible region as a red pentagonal region. The left and lower edges are determined by the bounds on the parameters. There are three diagonal lines that are determined by the linear inequality constraints. By inspection, you can see that the point (2, 2) is in the feasible region. Similarly, the point (8,6) is in the blue region and is not feasible.

The technique uses the bounds on the parameters (the first two rows) to specify the range of the axes in the plot. If your problem does not have explicit upper/lower bounds on the parameters, you can "invent" bounds such as [0, 100] or [-10, 10] just for plotting the feasible region.

In summary, this article shows two techniques that can help you verify that a constraint matrix is correctly specified. The first translates that constraint matrix into "human readable" form. The second draws a crude approximation of the feasible region by evaluating the constraints at each location on a dense grid. Both techniques are shown for two-parameter problems. However, with a little effort, you could incorporate the second technique when searching for a good initial guess in higher-dimensional problems.

The post Visualize the feasible region for a constrained optimization appeared first on The DO Loop.

10月 312018
 

A useful feature in PROC SGPLOT is the ability to easily visualize subgroups of data. Most statements in the SGPLOT procedure support a GROUP= option that enables you to overlay plots of subgroups. When you use the GROUP= option, observations are assigned attributes (colors, line patterns, symbols, ...) that indicate the value of the grouping variable. This article reviews the GROUP= option and shows how to trick PROC SGPLOT into performing a group analysis for statements that do not support the GROUP= option.

Three ways to plot data by groups

It is common to use colors or symbols to indicate which observations belong to each category of a grouping variable. Typical grouping variables include gender (male and female), political affiliation (democrats, republicans, and independents), race, education level, and so forth. When you use the SAS SG procedures to plot subsets of the data, there are three ways to arrange the plots. You can plot each group individually, you can create a panel of graphs, or you can overlay the groups on a single graph:

  • If you use the BY statement in PROC SGPLOT, each subgroup is plotted independently in its own graph. The axes are scaled based only on the data in that subgroup.
  • If you use the PANELBY statement in PROC SGPANEL, each subgroup is plotted in a cell of a lattice in which the axes are scaled to a common range.
  • If you use the GROUP= option, the plots for each subgroup are overlaid in a single graph.

The following SAS statements demonstrate each approach. Only the GROUP= overlay is displayed because that is the topic of this article:

proc sgplot data=Sashelp.Iris;        /* BY-group visualization. Three independent graphs.  */
   by Species;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgpanel data=Sashelp.Iris;       /* Panel visualization. Shared common axis. */
   panelby Species / columns=1 onepanel;
   histogram SepalLength;
   density SepalLength / type=kernel;
run;
 
proc sgplot data=Sashelp.Iris;       /* Overlay three plots in one graph  */
   histogram SepalLength / GROUP=Species binstart=42 binwidth=3 transparency=0.5;
   density SepalLength / type=kernel GROUP=Species;
run;

How to emulate the GROUP= option

Many SGPLOT statements (such as the SERIES and SCATTER statements) have supported the GROUP= option since the early days of ODS graphics. For other statements, support for the GROUP= option was added more recently. For example, the GROUP= option was added to the HISTOGRAM and DENSITY statements in SAS 9.4M2.

Here is a trick (shown to me by my colleague, Paul) that you can use to emulate the GROUP= option. If a statement in the SGPLOT procedure does not support the GROUP= option, but the statement DOES support the FREQ= option, you can often use the FREQ= option to construct a graph that overlays the subgroups. You need to do two things. First, you need to create binary indicators variables (sometimes called dummy variables) for each level of the categorical variable. You then use multiple statements, each with a different frequency variable, to overlay the subgroups. These two steps are shown by the following DATA step and call to PROC SGPLOT, which uses the FREQ= trick to overlay three histograms:

/* emulate a GROUP= option for SGPLOT statements that do not support GROUP= */
data IrisFreq;
   set sashelp.Iris;
   Freq1 = (Species='Setosa');     /* Binary. Equals 1 if observation is in 'Setosa' group     */
   Freq2 = (Species='Versicolor'); /* Binary. Equals 1 if observation is in 'Versicolor' group */
   Freq3 = (Species='Virginica');  /* Binary. Equals 1 if observation is in 'Verginica' group  */
run;
 
title "Overlay Histograms by Using the FREQ= Option";
%let binOpts = binstart=42 binWidth=3 transparency=0.5; /* ensure common bins */
proc sgplot data=IrisFreq;
   histogram SepalLength / freq=Freq1 &binOpts;    /* only the 'Setosa' group     */
   histogram SepalLength / freq=Freq2 &binOpts;    /* only the 'Versicolor' group */
   histogram SepalLength / freq=Freq3 &binOpts;    /* only the 'Virginica' group  */
run;

The graph overlays three histograms, one for each value of the Species variable. The result is similar to the earlier graph that used the GROUP= option. You can use the same trick on the DENSITY statement, although you will need to manually set the line attributes so that they match the attributes for the corresponding histograms.

You can use this technique in old versions of SAS to emulate the GROUP= option on the HISTOGRAM statement. You can also use it for statements that do not support the GROUP= option.

Although this example uses the DATA step to manually create the dummy variables that are used as frequencies, you can also create the dummy variables automatically by generating the "design matrix" for the Species variable. The GLMMOD procedure is the simplest way to create dummy variables in SAS, but other procedures provide additional features.

Generate prediction ellipses for groups

Several years ago I showed how you can overlay prediction ellipses for each group on a scatter plot. (Note that the ELLIPSE statement does not support a GROUP= option.) The technique requires that you transpose the data from long to wide form by creating new variables, one for each group of the categorical variable. Paul recognized that creating a dummy variable and using the FREQ= option is a simpler way to overlay prediction ellipses on a scatter plot:

title "Prediction Ellipses for Iris Data";
proc sgplot data=IrisFreq;
   scatter x=PetalLength y=PetalWidth / group=Species;
   ellipse x=PetalLength y=PetalWidth / freq=Freq1 legendlabel="Setosa";
   ellipse x=PetalLength y=PetalWidth / freq=Freq2 legendlabel="Versicolor";
   ellipse x=PetalLength y=PetalWidth / freq=Freq3 legendlabel="Virginica";
run;

Advantages and disadvantages of the FREQ= trick

The main advantage of using the FREQ= option for group processing is that it enables you to overlay subgroups even when a statement does not support the GROUP= option. A secondary advantage is that this technique gives you complete control of the attributes of each subgroup. Although you can use the STYLEATTRS statement to control many group attributes, the STYLEATTTRS statement does not enable you to control marker sizes or line widths, to name two examples.

The FREQ= trick does have some disadvantages:

  • You can't use the FREQ= trick for statements that produce graphs of categorical variables. The SGPLOT documentation states, “If your plot is overlaid with other categorization plots, then the first FREQ variable that you specified is used for all of the plots.” [My emphasis.]
  • As mentioned earlier, if you are trying to produce multiple grouped plots, you might need to manually assign attributes to obtain consistency among the levels of the grouping variables. By default, most ODS styles use different attributes for each statement. If you want the attributes for the fourth statement to match the attributes for the first statement, you need to use an option such as LINEATTRS=GraphData1 on the fourth statement.

In conclusion, if a statement supports the GROUP= option, you should probably use that option to overlay plots of the groups. But if a statement does NOT support the GROUP= option (such as the ELLIPSE and HEATMAP statements), you can use the FREQ= trick to emulate the GROUP= behavior.

I thank my colleague, Paul, for showing me the ellipse example. I hope you agree that this trick is a real treat, not just on Halloween, but every day!

The post A trick to plot groups in PROC SGPLOT appeared first on The DO Loop.

10月 012018
 

Programmers on a SAS discussion forum recently asked about the chi-square test for proportions as implemented in PROC FREQ in SAS. One person asked the basic question, "how do I test the null hypothesis that the observed proportions are equal to a set of known proportions?" Another person said that the null hypothesis was rejected for his data, and he wanted to know which categories were "responsible for the rejection." This article answers both questions and points out a potential pitfall when you specify the proportions for a chi-square goodness-of-fit test in PROC FREQ.

The basic idea: The proportion of party affiliations for a group of voters

To make these questions concrete, let's look at some example data. According to a 2016 Pew research study, the party affiliation of registered voters in the US in 2016 was as follows: 33% of voters registered as Democrats, 29% registered as Republicans, 34% were Independents, and 4% registered as some other party. If you have a sample of registered voters, you might want to ask whether the observed proportion of affiliations matches the national averages. The following SAS data step defines the observed frequencies for a hypothetical sample of 300 voters:

data Politics;
length Party $5;
input Party $ Count;
datalines;
Dem   125
Repub  79
Indep  86
Other  10
;

You can use the TESTP= option on the TABLES statement in PROC FREQ to compare the observed proportions with the national averages for US voters. You might assume that the following statements perform the test, but there is a potential pitfall. The following statements contain a subtle error:

proc freq data=Politics;
/* National Pct:    D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04) nocum; /* WARNING: Contains an error! */
weight Count;
run;

If you look carefully at the OneWayFreqs table that is produced, you will see that the test proportions that appear in the fourth column are not the proportions that we intended to specify! The problem is that order of the categories in the table is alphabetical whereas the proportions in the LISTP= option correspond to the order that the categories appear in the data. In an effort to prevent this mistake, the documentation for the TESTP= option warns you to "order the values to match the order in which the corresponding variable levels appear in the one-way frequency table." The order of categories is important in many SAS procedures, so always think about the order! (The ESTIMATE and CONTRAST statements in linear regression procedures are other statements where order is important.)

Specify the test proportions correctly

To specify the correct order, you have two options: (1) list the proportions for the TESTP= option according to the alphabetical order of the categories, or (2) use the ORDER=DATA option on the PROC FREQ statement to tell the procedure to use the order of the categories as they appear in the data. The following statement uses the ORDER=DATA option to specify the proportions:

proc freq data=Politics ORDER=DATA;   /* list proportions in DATA order */
/*                 D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04);  /* cellchi2 not available for one-way tables */
weight Count;
ods output OneWayFreqs=FreqOut;
output out=FreqStats N ChiSq;
run;

The analysis is now correctly specified. The chi-square table indicates that the observed proportions are significantly different from the national averages at the α = 0.05 significance level.

Which categories are "responsible" for rejecting the null hypothesis?

A SAS programmer posted a similar analysis on a discussion and asked whether it was possible to determine which categories were the most different from the specified proportions. The analysis shows that the chi-square test rejects the null hypothesis, but does not indicate whether only one category is different than expected or whether many categories are different.

Interestingly, PROC FREQ supports such an option for two-way tables when the null hypothesis is the independence of the two variables. Recall that the chi-square statistic is a sum of squares, where each cell in the table contributes one squared value to the sum. The CELLCHI2 option on the TABLES statement "displays each table cell’s contribution to the Pearson chi-square statistic.... The cell chi-square is computed as
(frequencyexpected)2 / expected
where frequency is the table cell frequency (count) and expected is the expected cell frequency" under the null hypothesis.

Although the option is not supported for one-way tables, it is straightforward to use the DATA step to compute each cell's contribution. The previous call to PROC FREQ used the ODS OUTPUT statement to write the OneWayFreqs table to a SAS data set. It also wrote a data set that contains the sample size and the chi-square statistic. You can use these statistics as follows:

/* create macro variables for sample size and chi-square statistic */
data _NULL_;
   set FreqStats;
   call symputx("NumObs", N);         
   call symputx("TotalChiSq", _PCHI_);
run;
 
/* compute the proportion of chi-square statistic that is contributed
   by each cell in the one-way table */
data Chi2;
   set FreqOut;
   ExpectedFreq = &NumObs * TestPercent / 100;
   Deviation = Frequency - ExpectedFreq;
   ChiSqContrib = Deviation**2 / ExpectedFreq;  /* (O - E)^2 / E */
   ChiSqPropor = ChiSqContrib / &TotalChiSq;    /* proportion of chi-square contributed by this cell */
   format ChiSqPropor 5.3;
run;
 
proc print data=Chi2; 
   var Party Frequency TestPercent ExpectedFreq Deviation ChiSqContrib ChiSqPropor; 
run;

The table shows the numbers used to compute the chi-square statistic. For each category of the PARTY variable, the table shows the expected frequencies, the deviations from the expected frequencies, and the chi-square term for each category. The last column is the proportion of the total chi-square statistic for each category. You can see that the 'Dem' category contributes the greatest proportion. The interpretation is that the observed count of the 'Dem' group is much greater than expected and this is the primary reason why the null hypothesis is rejected.

You can also create a bar chart that shows the contributions to the chi-square statistic. You can create the "chi-square contribution plot" by using the following statements:

title "Proportion of Chi-Square Statistic for Each Category";
proc sgplot data=Chi2;
   vbar Party / response=ChiSqPropor datalabel=ChiSqPropor;
   xaxis discreteorder=data;
   yaxis label="Proportion of Chi-Square Statistic" grid;
run;
Contribution of each cell to the chi-square statistic

The bar chart makes it clear that the frequency of the 'Dem' group is the primary factor in the size of the chi-square statistic. The "chi-square contribution plot" is a visual companion to the Deviation Plot, which is produced automatically by PROC FREQ when you specify the PLOTS=DEVIATIONPLOT option. The Deviation Plot shows whether the counts for each category are more than expected or less than expected. When you combine the two plots, you can make pronouncements like Goldilocks:

  • The 'Dem' group contributes the most to the chi-square statistic because the observed counts are "too big."
  • The 'Indep' group contributes a moderate amount because the counts are "too small."
  • The remaining groups do not contribute much because their counts are "just right."

Summary

In summary, this article addresses three topics related to testing the proportions of counts in a one-way frequency table. You can use the TESTP= option to specify the proportions for the null hypothesis. Be sure that you specify the proportions in the same order that they appear in the OneWayFreqs table. (The ORDER=DATA option is sometimes useful for this.) If the data proportions do not fit the null hypothesis, you might want to know why. One way to answer this question is to compute the contributions of each category to the total chi-square computation. This article shows how to display that information in a table or in a bar chart.

The post Chi-square tests for proportions in one-way tables appeared first on The DO Loop.

9月 102018
 

Given a rectangular grid with unit spacing, what is the expected distance between two random vertices, where distance is measured in the L1 metric? (Here "random" means "uniformly at random.") I recently needed this answer for some small grids, such as the one to the right, which is a 7 x 6 grid. The graph shows that the L1 distance between the points (2,6) and (5,2) is 7, the length of the shortest path that connects the two points. The L1 metric is sometimes called the "city block" or "taxicab" metric because it measures the distance along the grid instead of "as the bird flies," which is the Euclidean distance.

The answer to the analogous question for the continuous case (a solid rectangle) is difficult to compute. The main result is that the expected distance is less than half of the diameter of the rectangle. In particular, among all rectangles of a given area, a square is the rectangle that minimizes the expected distance between random points.

Although I don't know a formula for the expected distance on a discrete regular grid, the grids in my application were fairly small so this article shows how to compute all pairwise distances and explicitly find the average (expected) value. The DISTANCE function in SAS/IML makes the computation simple because it supports the L1 metric. It is also simple to perform computer experiments to show that among all grids that have N*M vertices, the grid that is closest to being square minimizes the expected L1 distance.

L1 distance on a regular grid

An N x M grid contains NM vertices. Therefore the matrix of pairwise distances is NM x NM. Without loss of generality, assume that the vertices have X coordinates between 1 and N and Y coordinates between 1 and M. Then the following SAS/IML function defines the distance matrix for vertices on an N x M grid. To demonstrate the computation, the distance matrix for a 4 x 4 grid is shown, along with the average distance between the 16 vertices:

proc iml;
start DistMat(rows, cols);
   s = expandgrid(1:rows, 1:cols);   /* create (x, y) ordered pairs */
   return distance(s, "CityBlock");  /* pairwise L1 distance matrix */
finish;
 
/* test function on a 4 x 4 grid */
D = DistMat(4, 4);
AvgDist = mean(colvec(D));
print D[L="L1 Distance Matrix for 4x4 Grid"];
print AvgDist[L="Avg Distance on 4x4 Grid"];
L1 Distance matrix for vertices on a regular 4 x 4 grid Average L1 distance on a 4 x 4 grid

For an N x M grid, the L1 diameter of the grid is the L1 distance between opposite corners. That distance is always (N-1)+(M-1), which equates to 6 units for a 4 x 4 grid. As for the continuous case, the expected L1 distance is less than half the diameter. In this case, E(L1 distance) = 2.5.

A comparison of distances for grids of different aspect ratios

As indicated previously, the expected distance between two random vertices on a grid depends on the aspect ratio of the grid. A grid that is nearly square has a smaller expected distance than a short-and-wide grid that contains the same number of vertices. You can illustrate this fact by computing the distance matrices for grids that each contain 36 vertices. The following computation computes the distances for five grids: a 1 x 36 grid, a 2 x 18 grid, a 3 x 12 grid, a 4 x 9 grid, and a 6 x 6 grid.

/* average L1 distance on 36 x 36 grid in several configurations */
N=36;
rows = {1, 2, 3, 4, 6};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
/* show average distance as a decimal and as a fraction */
numer = AvgDist*108; AvgDistFract = char(numer) + " / 108"; 
print rows cols AvgDist AvgDistFract;

The table shows that short-and-wide tables have an average distance that is much greater than a nearly square grid that contains the same number of vertices. When the points are arranged in a 6 x 6 grid, the distance matrix naturally decomposes into a block matrix of 6 x 6 symmetric blocks, where each block corresponds to a row. When the points are arranged in a 3 x 12 grid, the distance matrix decomposes into a block matrix of 12 x 12 blocks The following heat maps visualize the patterns in the distance matrices:

D = DistMat(6, 6);
call heatmapcont(D) title="L1 Distance Between Vertices in a 6 x 6 Grid";
D = DistMat(3, 12);
call heatmapcont(D) title="L1 Distance Between Vertices in a 3 x 12 Grid";
Visualization of L1 distance matrix for items arranged on a 6 x 6 grid Visualization of L1 distance matrix for items arranged on a 3 x 12 grid

Expected (average) distance versus number of rows

You can demonstrate how the average distance depends on the number of rows by choosing the number of vertices to be a highly composite number such as 5,040. A highly composite number has many factors. The following computation computes the average distance between points on 28 grids that each contain 5,040 vertices. A line chart then displays the average distance as a function of the number of rows in the grid:

N = 5040;               /* 5040 is a highly composite number */
rows = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20, 
       21, 24, 28, 30, 35, 36, 40, 42, 45, 48, 56, 60, 63, 70};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
 
title "Average L1 Distance Between 5040 Objects in a Regular Grid";
call series(rows, AvgDist) grid={x y} other="yaxis offsetmin=0 grid;";
Expected L1 distance between vertices on a regular grid with 5,040 points. The curve demonstraes that nearly square grids have a much lower average distance than short and wide grids.

Each point on the graph represents the average distance for an N x M grid where NM = 5,040. The horizontal axis displays the value of N (rows). The graph shows that nearly square grids (rows greater than 60) have a much lower average distance than very short and wide grids (rows less than 10). The scale of the graph makes it seem like there is very little difference between the average distance in a grid with 40 versus 70 rows, but that is somewhat of an illusion. A 40 x 126 grid (aspect ratio = 3.15) has an average distance of 55.3; a 70 x 72 grid (aspect ratio = 1.03) has an average distance of 47.3.

Applications and conclusions

In summary, you can use the DISTANCE function in SAS/IML to explicitly compute the expected L1 distance (the "city block" distance) between random points on a regular grid. You can minimize the average pairwise distance by making the grid as square as possible.

City planning provides a real-world application of the L1 distance. If you are tasked with designing a city with N buildings along a grid, then the average distance between buildings is smallest when the grid is square. Of course, in practice, some buildings have a higher probability of being visited than others (such as a school or a grocery). You should position those buildings in the center of town to shorten the average distance that people travel.

The post Distances on rectangular grids appeared first on The DO Loop.

8月 082018
 
Use the GROUP= option in PROC SGPLOT to plot lines for TWO categorical variables

The SGPLOT procedure in SAS makes it easy to create graphs that overlay various groups in the data. Many statements support the GROUP= option, which specifies that the graph should overlay group information. For example, you can create side-by-side bar charts and box plots, and you can overlay multiple scatter plots and series plots in the same graph. However, the GROUP= option takes only a single grouping variable! What can you do if you need to visualize combinations of two (or even three!) categorical variables? This article shows how to construct a new group variable that combines the levels of two or more existing categorical variables.

For concreteness, I will show how to overlay multiple series plots (line plots), as shown to the right. (Click to enlarge.) By using this technique, you can overlay curves that describe combinations of gender and race. Or you can plot response curves for control-vs-experimental groups and the severity of a disease. You can use this technique for other plot types, such as creating box plots that visualize a two-way ANOVA.

What is the problem? Why can't you use the GROUP= option?

To motivate the discussion, let's first see why the GROUP= option in the SERIES statement does not work for overlaying two categorical variables. The following DATA step creates two categorical variables. The G1 variable has the values 1, 2, and 3. The G2 variable has the values 'A' and 'B'. For each of the six combinations of (G1, G2), the DATA step creates a curve (actually a line) of (X, Y) values. Let's see what happens if you try to plot the lines by using a SERIES statement with the GROUP=G1 option:

data TwoGroups;
do G1=1 to 3;
   do G2='A', 'B';
      do X = 1 to 10;
         Y = 10 + G1 + 0.5*(G2='A') + G1*X/20;
         output;
      end;
   end;
end;
run;
 
title "First Attempt: Does Not Work";
proc sgplot data=TwoGroups;
   series x=x y=y / group=G1 curvelabel;
run;

The attempt is a failure. The graph does not show six individual curves. Instead, it shows three curves (the number of categories in the G1 variables) and each "curve" is Z-shaped because the graph traces the curve for G2='A' on the range [1, 10] and then draws the curve for G2='B' without "picking up the pen." This happens because the data are sorted by G1, then by G2, then by X.

There are two ways to handle this situation. One way is to try to convert the data from "long form" to "wide form." For these data, which have identical X values for every curve, you can create two new variables Y_A and Y_B that contain the coordinates of the three curves for G2='A' and G2='B', respectively. You can then use two SERIES statements, each using the GROUP=G1 option. Each statement will draw three curves for a total of six. You can use the NOCYCLEATTRS option to make sure that each statement uses the same line colors and patterns. However, this approach becomes complicated if each curve is evaluated at a different set of X values. In that case, it is better to keep the data in long form.

Forming a new group variable by concatenation

The problem would be solved if there were one categorical variable that had six levels instead of two categorical variables that have six joint levels, so let's write some SAS code to make that happen. First sort the data by the categorical variables and then by the X variable. (For the current example, the data are already sorted correctly.) Then write a DATA step that does either of the following options:

  • Option 1: Use the CATT (or CATX) function to concatenate the values of the existing group variables. The new categorical variable will have values that derived from the original variables.
  • Option 2: Use the FIRST.variable syntax to create a new group variable that has the values 1–6. Then use PROC FORMAT to assign a meaningful value to each level of the new categorical variable.

The first option (the CATT function) is automated and less prone to error, but for the sake of completeness both options are shown below:

/* create a new group variable by concatenating the two existing variables */
proc sort data=TwoGroups;
   by G1 G2 X;
run;
 
data Make2Groups;
set TwoGroups;
by G1 G2;
/* Option 1: automatically create joint levels from original levels */
Label = catt("G1=", G1) || "; " || catt("G2=", G2);
/* Option 2: create new categorical variable and use PROC FORMAT to assign values */
if first.G2 then 
   GroupID + 1;        /* GroupID = 1, 2, 3, ... numGroups */
run;
 
/* For Option 2: use PROC FORMAT to form labels that encode the two groups. See
https://blogs.sas.com/content/iml/2017/04/24/two-way-anova-viz.html
*/
proc format;                 
  value GroupFmt 1 = "G1=1; G2='A'"   2 = "G1=1; G2='B'"   3 = "G1=2; G2='A'"
                 4 = "G1=2; G2='B'"   5 = "G1=3; G2='A'"   6 = "G1=3; G2='B'";
run;

The following call to PROC SGPLOT creates a series plot for the Label variable, which corresponds to the joint levels of the original grouping variables. The GROUPLC= option (supported in SAS 9.4M2) colors the lines according to the value of the G2 variable.

title "Two Groups, One SERIES Statement";
proc sgplot data=Make2Groups;
/* format GroupID GroupFmt.; */            /* for Option 2 */
series x=x y=y / group=Label grouplc=G2    /* or group=GroupID for Option 2 */
                 lineattrs=(pattern=solid) curvelabel;
run;

The graph is shown at the top of this article. The new categorical variable has values that correspond to joint levels of the original two variables. The GROUP= option creates six curves when you specify the Label variable. The GROUPLC= option sets the colors of the lines according to the values of the G2 variable.

This technique generalizes to three categorical variables, but I will leave the details to the reader. You might want to use the GROUPLP= option to set the line patterns according to the value of a third categorical variable. Beyond three variables the display will begin to resemble a spaghetti plot. For many categorical variables, you might want to use panels and BY groups to visualize the curves.

This technique also generalizes to other plot types, such as box plots and scatter plots.

The post Plot curves for levels of two categorical variables in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

8月 062018
 
Graph of quantile regression curves in SAS

This article shows how to score (evaluate) a quantile regression model on new data. SAS supports several procedures for quantile regression, including the QUANTREG, QUANTSELECT, and HPQUANTSELECT procedures. The first two procedures do not support any of the modern methods for scoring regression models, so you must use the "missing value trick" to score the model. (HPQUANTSELECT supports the CODE statement for scoring.) You can use this technique to construct a "sliced fit plot" that visualizes the model, as shown to the right. (Click to enlarge.)

The easy way to create a fit plot

The following DATA step creates the example data as a subset of the Sashelp.BWeight data set, which contains information about the weights of live births in the US in 1997 and information about the mother during pregnancy. The following call to PROC QUANTREG models the conditional quantiles of the baby's weight as a function of the mother's weight gain. The weight gain is centered according to the formula MomWtGain = "Actual Gain" – 30 pounds. Because the quantiles might depend nonlinearly on the mother's weight gain, the EFFECT statement generates a spline basis for the independent variable. The resulting model can flexibly fit a wide range of shapes.

Although this article shows how to create a fit plot, you can also get a fit plot directly from PROC QUANTREG. As shown below, the PLOT=FITPLOT option creates a fit plot when the model contains one continuous independent variable.

data Orig;   /* restrict to 5000 births; exclude extreme weight gains */
set Sashelp.BWeight(obs=5000 where=(MomWtGain<=40));
run;
 
proc quantreg data=Orig
              algorithm=IPM            /* use IPM algorithm for splines and binned data */
              ci=none plot(maxpoints=none)=fitplot; /* or fitplot(nodata) */
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );  /* 9 knots, equally spaced */
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
run;
Fit plot of quantile regression curves from PROC QUANTREG in SAS

The graph enables you to visualize curves that predict the 10th, 25th, 50th, 75th, and 90th percentiles of the baby's weight based on the mother's weight gain during pregnancy. Because the data contains 5,000 observations, the fit plot suffers from overplotting and the curves are hard to see. You can use the PLOT=FITPLOT(NODATA) option to exclude the data from the plot, thus showing the quantile curves more clearly.

Score a SAS procedure by using the missing value trick

Although PROC QUANTREG can produce a fit plot when there is one continuous regressor, it does not support the EFFECTPLOT statement so you have to create more complicated graphs manually. To create a graph that shows the predicted values, you need to score the model on a new set of independent values. To use the missing value trick, do the following:

  1. Create a SAS data set (the scoring data) that contains the values of the independent variables at which you want to evaluate the model. Set the response variable to missing for each observation.
  2. Append the scoring data to the original data that are used to fit the model. Include a binary indicator variable that has the value 0 for the original data and the value 1 for the scoring data.
  3. Run the regression procedure on the combined data set. Use the OUTPUT statement to output the predicted values for the scoring data. Of course, you can also output residuals and other observation-wise statistics, if necessary.

This general technique is implemented by using the following SAS statements. The scoring data consists of evenly spaced values of the MomWtGain variable. The binary indicator variable is named ScoreData. The result of these computations is a data set named QRegOut that contains a variable named Pred that contains the predicted values for each observation in the scoring data.

/* 1. Create score data set */
data Score;
/* Optionally define additional covariates here. See the example
   "Create a sliced fit plot manually by using the missing value trick"
   https://blogs.sas.com/content/iml/2017/12/20/create-sliced-fit-plot-sas.html
*/
Weight = .;               /* set response (Y) variable to missing */
do MomWtGain = -30 to 40; /* uniform spacing in the independent (X) variable */
   output;
end;
run;
 
/* 2. Append the score data to the original data. Use a binary variable to 
      indicate which observations are the scoring data */
data Combined;
set Orig                     /* original data */
    Score(in=_ScoreData);    /* scoring data  */
ScoreData = _ScoreData;      /* binary indicator variable. ScoreData=1 for scoring data */
run;
 
/* 3. Run the procedure on the combined (original + scoring) data */
ods select ModelInfo NObs;
proc quantreg data=Combined algorithm=IPM ci=none;
   effect MomWtSpline = spline( MomWtGain / knotmethod = equal(9) );
   model Weight = MomWtSpline / quantile = 0.1 0.25 0.5 0.75 0.90;
   output out=QRegOut(where=(ScoreData=1)) /* output predicted values for the scoring data */
              P=Pred / columnwise;         /* COLUMWISE option supports multiple quantiles */
run;

This technique can be used for any SAS regression procedure. In this case, the COLUMNWISE option specifies that the output data set should be written in "long form": A QUANTILE variable specifies the quantile and the variable PRED contains the predicted values for each quantile. If you omit the COLUMNWISE option, the output data is in "wide form": The predicted values for the five quantiles are contained in the variables Pred1, Pred2, ..., Pred5.

Using predicted values to create a sliced fit plot

You can use the predicted values of the scoring data to construct a fit plot. You merely need to sort the data by any categorical variables and by the X variable (in this case, MomWtGain). You can then plot the predicted curves. If desired, you can also append the original data and the predicted values and create a graph that overlays the data and predicted curves. You can use transparency to address the overplotting issue and also modify other features of the fit plot, such as the title, axes labels, tick positions, and so forth:

/* 4. If you want a fit plot, sort by the independent variable for each curve.
      Put QUANTILE and other covariates first, then the X variable. */
proc sort data=QRegOut out=ScoreData;
   by Quantile MomWtGain;
run;
 
/* 5. (optional) If you want to overlay the plots, it's easiest to define separate variables
      for original data and scoring data */
data All;
set Orig                                      /* original data */
    ScoreData(rename=(MomWtGain=X Pred=Y));   /* scoring data  */
run;
 
title "Quantile Regression Curves";
footnote J=C "Gain is centered: MomWtGain = Actual_Gain - 30";
proc sgplot data=All;
   scatter x=MomWtGain y=Weight / markerattrs=(symbol=CircleFilled) transparency=0.92;
   series x=X y=Y / group=Quantile lineattrs=(thickness=2) nomissinggroup name="p";
   keylegend "p" / position=right sortorder=reverseauto title="Quantile";
   xaxis values=(-20 to 40 by 10) valueshint grid  label="Mother's Relative Weight Gain (lbs)";;
   yaxis values=(1500 to 4500 by 500) valueshint grid label="Predicted Weight of Child (g)";
run;

The fit plot is shown at the top of this article.

In summary, this article shows how to use the missing value trick to evaluate a regression model in SAS. You can use this technique for any regression procedure, although newer procedures often support syntax that makes it easier to score a model.

As shown in this example, if you score the model on an equally spaced set of points for one of the continuous variables in the model, you can create a sliced fit plot. For a more complicated example, see the article "How to create a sliced fit plot in SAS."

The post How to score and graph a quantile regression model in SAS appeared first on The DO Loop.

7月 022018
 
Who was the greatest US president? Presidental greatness scores through 2018

Which president of the United States is ranked the greatest by presidential historians? This article visualizes the results of the 2018 Presidential Greatness Survey, which was created and administered by B. Rottinghaus and J. Vaughn. They analyzed 166 responses from experts in political science who ranked the 44 US presidents on a 0–100 "greatness scale" where 0 represents abject failure, 50 is average, and 100 is great. The survey results are presented in their 2018 report.

Table 1 of the report (p. 2-3) provides the average greatness scores of the presidents among all respondents. Just as "beauty is in the eye of the beholder," so too "greatness" is a subjective notion that depends on the values and beliefs of the assessor. Table 2A (p. 6-7) provides the average greatness scores for respondents by the political party and political ideology of the respondents. Of the respondents, 57.2% were Democrats, 30.1% were Independent or Other, and 12.7% were Republicans. Similarly, 58.4% self-identified as liberal or somewhat liberal, 24.1% self-identified as moderate, and 17.5% reported being conservative or somewhat conservative. The relatively large proportion of liberals should be considered when interpreting the survey results.

I imported the survey results into SAS and combined them to produce a visualization of the data. The graph at the right (click to enlarge) shows the presidents ranked by their overall mean greatness score, which is shown by an X. The mean scores for the conservative, moderate, and liberal respondents are shown as filled circles. The party of the president is indicated by a colored bar that shows the range of scores.

A few results are apparent from the graph:

  • Lincoln, Washington, FDR, Teddy Roosevelt, and Jefferson are the five greatest presidents according to these experts.
  • Obama and Reagan are two recent presidents who are ranked highly.
  • Statistically speaking, there are large groups of presidents for which the relative ranking is uncertain. For example, the presidents ranked 12–20 (Madison through Monroe) have mean scores that are similar.
  • The current president, Donald Trump, is ranked last. In fairness, Trump had not yet completed his first year in office at the time of the survey (December 2017). The authors of the report dedicate several pages (p. 10–13) analyzing a survey question about who was the Most Polarizing president, a category that Trump won handily.
  • Other presidents who scored low include James Buchanan (the "Do-Nothing President"), William Henry Harrison (who died 31 days into his term), and Franklin Pierce (a weak leader in a contentious pre-Civil War era).

You can also graph the data by using the political parties of the experts rather than their ideologies. The graph is very similar.

Bias and disagreement in ranking presidents

One of the most interesting aspects of these data is the relative widths of the ranges of most 20th and 21st century presidents. These wide intervals are caused when experts of one political ideology judge a president much differently than experts from another ideology. Typically, the conservative-liberal disparity amounts to 10–15 points on the "greatness" scale, but for several presidents (Coolidge and Obama) the average conservative score is more than 20 points different from the average liberal score.

This disparity is evidence of the partisan state of politics in the US. Even these experts, who presumably use historical facts rather than opinions to guide their ranking, are not immune from seeing current events through the lens of their own personal values and biases.

To better visualize the range of disagreement among the experts, I graphed the difference between the conservative and liberal scores for each president. Again, the color of the line indicates the political party of the president. The 20th century began with McKinley's presidency. For almost every president since McKinley, the gap for Democratic presidents is negative, indicating that the conservative experts judged the presidency less favorably than the liberal experts. For Republican presidents, the direction of the gap is reversed: the conservative experts ranked the presidency more favorably than their liberal counterparts. The exception is Teddy Roosevelt, whose progressive agenda (the "Square Deal" and conservation measures such as national parks and forests) appeals to liberal experts.

Summary

Rottinghaus and Vaughn's 2018 Presidential Greatness Survey analyzes the opinions of 166 experts as to which US presidents are considered great, average, or a failure. The first graph in this blog post shows that although there is general agreement among experts for many 18th and 19th century presidents (Lincoln and Washington were great; Pierce, W. H. Harrison, and Buchanan were not.), there is less agreement for modern presidents. As seen in the second graph, the variation tends to correlate along ideological grounds, with conservatives and liberals having very different views about which presidents were great.

You can download the SAS program that creates the graphs in this article. For a different visualization of these data, see Robert Lawrence's article "All the U.S. Presidents, Ranked by 'Greatness'."

The post Ranking US presidents appeared first on The DO Loop.

6月 132018
 

If you use PROC SGPLOT to create ODS graphics, "ATTRS" are everywhere. ATTRS is an abbreviation of "attributes." Most options that change the attributes of a graphical element end with the ATTRS suffix. For example, the MARKERATTRS option modifies attributes of markers, the LINEATTRS option modifies attributes of lines, and the FILLATTRS option modifies attributes of the filled area of a bar or other region. These options are easy to remember and use.

However, there are three "ATTRS" that you might find more confusing: CYCLEATTRS, ATTRPRIORITY, and STYLEATTRS. These options determine the colors, line patterns, and marker symbols that are used to represent groups in your data. They interact with each other when you use a GROUP= option to specify groups and also use multiple statements to overlay several graph types such as scatter plots and series plots.

This article summarizes the ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS keywords and provides an example that shows how they interact with each other. The end of this article presents a list of references for further reading.

What are ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS?

In PROC SGPLOT, many statements support the GROUP= option. In order to distinguish one group from another, the groups are assigned different attributes such as colors, line patterns, marker symbols, and so on. The colors, patterns, and other attributes are defined by the current ODS style (or by the STYLEATTRS statement or the DATTRMAP= data set), but the ATTRPRIORITY and CYCLEATTRS keywords determine how the colors, patterns, and symbols combine for each group. The following describes the syntax and purpose of each keyword:

  • The ATTRPRIORITY option is specified on the ODS GRAPHICS statement. If you specify ATTRPRIORITY=COLOR, then groups are represented by changing the color of markers and lines, but not line patterns or marker symbols. If you specify ATTRPRIORITY=NONE, groups are represented by changing the color, line patterns, and marker symbols for each group. The ATTRPRIORITY=NONE is essential for "no color" styles such as the Journal style.
  • The CYCLEATTRS (or NOCYCLEATTERS) option affect whether attributes change when you overlay multiple graphs by using two or more statements. Notice that the word "CYCLE" begins with an "S" sound, as does the word "statements." Say the following sentence out loud, emphasizing the common "S" sounds: "CYCLEATTRS affect Statements."
  • The STYLEATTRS statement in PROC SGPLOT overrides the colors, patterns, and symbols in the current style. You can use the STYLEATTRS statement to set the combination of attributes that appear for each group.

Before looking at the interaction between these keywords, recall how the ATTRPRIORITY= option affects a graph that has only one statement. The following calls to PROC SGPLOT are identical. The only difference is the ATTRPRIORITY= option.

title "Iris Data: AttrPriority=Color";
ods graphics / AttrPriority=Color;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
 
title "Iris Data: AttrPriority=None";
ods graphics / AttrPriority=None;
proc sgplot data=Sashelp.Iris;
   reg x=PetalWidth y=SepalLength / group=Species markerattrs=(size=10) lineattrs=(thickness=4);
run;
ATTRPRIORITY=COLOR and ATTRPRIORITY=NONE options in ODS graphics in SAS PROC SGPLOT

The plot on the left shows the result of using ATTRPRIORITY=COLOR with the HtmlBlue style. The three groups are represented by using the first three colors in the style (in this case, blue, red, and green), but the line patterns and marker symbols do not change between groups. In contrast, the plot on the left is the result of using ATTRPRIORITY=NONE. The three groups are represented by using the first three colors in the style and also the first three line patterns (solid, dashed, and dash-dot) and the first three marker symbols (circle, plus sign, and "X"). The results might depend on the current ODS style.

The interaction between ATTRPRIORITY and CYCLEATTRS

The ATTRPRIORITY and CYCLEATTRS options interact when you use two or more statements to overlay graphs. In Warren Kuhfeld's short course on "Advanced ODS Graphics Examples in SAS," he presents a slide that shows how these options interact. The following example creates a panel of graphs that is inspired by Warren's slide. Each graph in the example shows the same data (the closing stock price for two companies in 2001–2003) displayed for each of the four combinations of the ATTRPRIORITY and CYCLEATTRS options. Although the LOESS statement can display the markers, I used a separate SCATTER statement because I want to illustrate how the CYCLEATTRS statement affects the attributes.

data Stocks;
set Sashelp.Stocks;
where '01JAN2001'd <= Date <= '31OCT2003'd and Stock in ("IBM", "Microsoft");
run;
 
/* The code for each call is the same except for the options. Wrap in a macro for brevity. */
%macro GraphIt(priority, cycle);
title "AttrPriority=&priority.; &cycle.";
ods graphics / AttrPriority=&priority.;
proc sgplot data=Stocks &cycle.;
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;
%mend;
 
ods graphics / width=400px height=300px;
ods layout gridded columns=2 advance=table column_gutter=0 row_gutter=0;
%GraphIt(Color, NoCycleAttrs);
%GraphIt(None,  NoCycleAttrs);
%GraphIt(Color, CycleAttrs);
%GraphIt(None,  CycleAttrs);
ods layout end;
Interaction between the ATTRPRIORITY=  and CYCLEATTRS options in ODS graphics in SAS PROC SGPLOT

The panel for the example shows the interaction between the ATTRPRIORITY= and CYCLEATTRS options.

  • The top-left graph shows the graph for ATTRPRIORITY=COLOR and NOCYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The top-right graph shows the graph for ATTRPRIORITY=NONE and NOCYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The second statement (SCATTER) uses the same attributes as the first statement (LOESS).
  • The bottom-left graph shows the graph for ATTRPRIORITY=COLOR and CYCLEATTRS. Only colors are used to distinguish the groups (IBM and Microsoft). The first statement (LOESS) uses the first two colors in the style (blue and red) and the second statement (SCATTER) uses the third and fourth colors in the style (green and brown). Because the markers are not filled, it might be difficult to see that the marker colors are green and brown, but you can click on the graph to enlarge it.
  • The bottom-right graph shows the graph for ATTRPRIORITY=NONE and CYCLEATTRS. Colors, line patterns, and symbols are used to distinguish the groups. The first statement (LOESS) uses the first two attributes in the style (blue-solid and red-dashed) whereas the second statement (SCATTER) uses the third and fourth attributes (green-X and brown-triangle).

Depending on your needs, either graph in the first row would be appropriate for color graphs. The lower-right plot is most suitable for ODS styles that create monochrome graphs.

The STYLEATTRS statement

After you understand the interaction between the ATTRPRIORITY= and CYCLEATTRS options, it is straightforward to use the STYLEATTRS statement. The statement merely overrides the group attributes for the current style. If you specify fewer attributes than the number of groups, the attributes are cyclically reused. For example, the following STYLEATTRS statement specifies that all lines should be solid but specifies different colors and symbols for the groups. Because two attributes are specified, the second statement reuses the same colors and symbols even though the NOCYCLEATTRS option is set:

title "AttrPriority=None; NoCycleAttrs; StyleAttrs";
ods graphics / AttrPriority=None;
proc sgplot data=Stocks NoCycleAttrs;
   styleattrs datalinepatterns=(solid)
              datacontrastcolors=(SteelBlue DarkGreen)
              datasymbols=(CircleFilled TriangleFilled);
   loess   x=Date y=Close / group=Stock lineattrs=(thickness=4) NoMarkers smooth=0.75;
   scatter x=Date y=Close / group=Stock markerattrs=(size=10);
   yaxis label="Stock Price";
run;

Reset the default attribute priority

Here's another trick I learned from Warren's course: You can use the RESET= option to reset the default value of the ATTRPRIORITY option for the current style:

ods graphics / reset=attrpriority; /* reset the default value for the current style */

Summary and further reading

This article provides an example and summarizes the behavior of the ATTRPRIORITY= option (on the ODS GRAPHICS statement) and the CYCLEATTRS option (on the PROC SGPLOT statement). These options interact when you have several statements in PROC SGPLOT that each use the GROUP= option. You can use the STYLEATTRS statement to override the default attributes for the current ODS style.

Much more can be said about these options and how they interact. In addition to the documentation links in the article, I recommend the following:

The post Attrs, attrs, everywhere: The interaction between ATTRPRIORITY, CYCLEATTRS, and STYLEATTRS in ODS graphics appeared first on The DO Loop.