June 6, 2019
 

Want to learn SAS programming but worried about taking the plunge? Over at SAS Press, we are excited about an upcoming publication that introduces newbies to SAS in a peer-review instruction format we have found popular for the classroom. Professors Jim Blum and Jonathan Duggins have written Fundamentals of Programming in SAS using a spiral curriculum that builds upon topics introduced earlier and at its core, uses large-scale projects presented as case studies. To get ready for release, we interviewed our new authors on how their title will enhance our SAS Books collection and which of the existing SAS titles has had an impact on their lives!

What does your book bring to the SAS collection? Why is it needed?

Blum & Duggins: The book is probably unique in the sense that it is designed to serve as a classroom textbook, though it can also be used as a self-study guide. That also points to why we feel it is needed; there is no book designed for what we (and others) do in the classroom. As SAS programming is a broad topic, the goal of this text is to give a complete introduction of effective programming in Base SAS – covering topics such as working with data in various forms and data manipulation, creating a variety of tabular and visual summaries of data, and data validation and good programming practices.

The book pursues these learning objectives using large-scale projects presented as case studies. The intent of coupling the case-study approach with the introductory programming topics is to create a path for a SAS programming neophyte to evolve into an adept programmer by showing them how programmers use SAS, in practice, in a variety of contexts. The reader will gain the ability to author original code, debug pre-existing code, and evaluate the relative efficiencies of various approaches to the same problem using methods and examples supported by pedagogical theory. This makes the text an excellent companion to any SAS programming course.

What is your intended audience for the book?

Blum & Duggins: This text is intended for use in both undergraduate and graduate courses, without the need for previous exposure to SAS. However, we expect the book to be useful for anyone with an aptitude for programming and a desire to work with data, as a self-study guide to work through on their own. This includes individuals looking to learn SAS from scratch or experienced SAS programmers looking to brush up on some of the fundamental concepts. Very minimal statistical knowledge, such as elementary summary statistics (e.g. means and medians), would be needed. Additional knowledge (e.g. hypothesis testing or confidence intervals) could be beneficial but is not expected.

What SAS book(s) changed your life? How? And why?

Blum: I don’t know if this qualifies, but the SAS Programming I and SAS Programming II course notes fit this best. With those, and the courses, I actually became a SAS programmer instead of someone who just dabbled (and dabbled ineffectively). From there, many doors were opened for me professionally and, more importantly, I was able to start passing that knowledge along to students and open some doors for them. That experience also served as the basis for building future knowledge and passing it along, as well.

Duggins: I think the two SAS books that most changed my outlook on programming (which I guess has become most of my life, for better or worse) would either be The Essential PROC SQL Handbook for SAS Users by Katherine Prairie or Jane Eslinger's The SAS Programmer's PROC REPORT Handbook, because I read them at different times in my SAS programming career. Katherine's SQL book changed my outlook on programming because, until then, I had never done enough SQL to consistently consider it as a viable alternative to the DATA step. I had taken a course that taught a fair amount of SQL, but since I had much more experience with the DATA step and since that is what was emphasized in my doctoral program, I didn't use SQL all that often. However, after working through her book, I definitely added SQL to my programming arsenal. I think learning it, and then having to constantly evaluate whether the DATA step or SQL was better suited to my task, made me a better all-around programmer.

As for Jane's book - I read it much later after having used some PROC REPORT during my time as a biostatistician, but I really wasn't aware of how much could be done with it. I've also had the good fortune to meet Jane, and I think her personality comes through clearly - which makes that book even more enjoyable now than it was during my first read!

Read more

We at SAS Press are really excited to add this new release to our collection and will continue releasing teasers until its publication. For almost 30 years SAS Press has published books by SAS users for SAS users. Here is a free excerpt, located on Duggins' author page, to whet your appetite. (Please note that this excerpt is an unedited draft and not the final content.) Look out for news on this new publication; you will not want to miss it!

Want to find out more about SAS Press? For more about our books, subscribe to our newsletter. You’ll get all the latest news and exclusive newsletter discounts. Also, check out all our new SAS books at our online bookstore.

Interview with new SAS Press authors: Jim Blum and Jonathan Duggins was published on SAS Users.

June 5, 2019
 

A family of curves is generated by an equation that has one or more parameters. To visualize the family, you might want to display a graph that overlays four or five curves that have different parameter values, as shown to the right. The graph shows members of a family of exponential transformations of the form
f(x; α) = (1 – exp(-α x)) / (1 – exp(-α))
for α > 0 and x ∈ [0, 1]. This graph enables you to see how the parameter affects the shape of the curve. For example, for small values of the parameter, α, the transformation is close to the identity transformation. For larger values of α, the nonlinear transformation stretches intervals near x=0 and compresses intervals near x=1.
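To see why the transformation is close to the identity for small values of the parameter, you can expand the exponentials in a Taylor series (a quick check, using the same notation as above). For small α,
1 – exp(-α x) ≈ α x (1 – α x / 2)   and   1 – exp(-α) ≈ α (1 – α / 2),
so f(x; α) ≈ x (1 – α x / 2) / (1 – α / 2), which converges to the identity transformation f(x) = x as α → 0.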

Here's a tip for creating a graph like this in SAS. Generate the data in "long format" and use the GROUP= option on the SERIES statement in PROC SGPLOT to plot the curves and control their attributes. The long format and the GROUP= option make it easy to visualize the family of curves.

A family of exponential transformations

I recently read a technical article that used the exponential family given above. The authors introduced the family and stated that they would use α = 8 in their paper. Although I could determine in my head that the function is monotonically increasing on [0, 1] and f(0)=0 and f(1)=1, I had no idea what the transformation looked like for α = 8. However, it is easy to use SAS to generate members of the family for different values of α and overlay the curves:

data ExpTransform;
do alpha = 1 to 7 by 2;                             /* parameters in the outer loop */
   do x = 0 to 1 by 0.01;                           /* domain of function    */
      y = (1-exp(-alpha*x)) / (1 - exp(-alpha));    /* f(x; alpha) on domain */
      output;
   end;
end;
run;
 
/* Use ODS GRAPHICS / ATTRPRIORITY=NONE 
   if you want to force the line attributes to vary in the HTML destination. */
ods graphics / width=400px height=400px;
title "Exponential Family of Transformations";
proc sgplot data=ExpTransform;
   series x=x y=y / group=alpha lineattrs=(thickness=2);
   keylegend / location=inside position=E across=1 opaque sortorder=reverseauto;
   xaxis grid;  yaxis grid;
run;

The graph is shown at the top of this article. The best way to create this graph is to generate the points in the long-data format because:

  • The outer loop controls the values of the parameters and how many curves are drawn. You can use a DO loop to generate evenly spaced parameters or specify an arbitrary sequence of parameters by using the syntax
    DO alpha = 1, 3, 6, 10;
  • The domain of the curve might depend on the parameter value. As shown in the next section, you might want to use a different set of points for each curve.
  • You can use the GROUP= option and the KEYLEGEND statement in PROC SGPLOT to visualize the family of curves.

Visualize a two-parameter family of curves

You can use the same ideas and syntax to plot a two-parameter family of curves. For example, you might want to visualize the density of the Beta distribution for representative values of the shape parameters, a and b. The Wikipedia article about the Beta distribution uses five pairs of (a, b) values; I've used the same values in the following SAS program:

data BetaDist;
array alpha[5] _temporary_ (0.5 5 1 2 2);
array beta [5] _temporary_ (0.5 1 3 2 5);
do i = 1 to dim(alpha);                       /* parameters in the outer loop */
   a = alpha[i]; b = beta[i];
   Params = catt("a=", a, "; b=", b);         /* concatenate parameters */
   do x = 0 to 0.99 by 0.01;
      pdf = pdf("Beta", x, a, b);             /* evaluate the Beta(x; a, b) density */
      if pdf < 2.5 then output;               /* exclude large values */
   end;
end;
run;
 
ods graphics / reset;
title "Probability Density of the Beta(a, b) Distribution";
proc sgplot data=BetaDist;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   keylegend / position=right;
   xaxis grid;  yaxis grid;
run;

The resulting graph gives a good overview of how the parameters in the Beta distribution affect the shape of the probability density function. The program uses a few tricks:

  • The parameters are stored in arrays. The program loops over the number of parameters.
  • The CATT function concatenates the parameters into a string that identifies each curve. The CAT, CATS, CATT, and CATX concatenation functions are powerful and useful!
  • For this family, several curves are unbounded. The program caps the maximum vertical value of the graph at 2.5.
  • Although it is not obvious, some of the curves are drawn by using 100 points whereas others use fewer points. This is an advantage of using the long format.

In summary, you can use PROC SGPLOT to visualize a family of curves. The task is easiest when you generate the points along each curve in the "long format." The long format is easier to work with than the "wide format" in which each curve is stored in a separate Y variable. When the curve values are in long form, you can use the GROUP= option on the SERIES statement to create an effective visualization by using a small number of statements.

The post Plot a family of curves in SAS appeared first on The DO Loop.

June 4, 2019
 


Two sayings I’ve heard countless times throughout my life are “Work smarter, not harder,” and “Use the best tool for the job.” If you need to drive a nail, you pick up a hammer, not a wrench or a screwdriver. In the programming world, this could mean using an existing function library instead of writing your own or using an entirely different language because it’s more applicable to your problem. While that sounds good in practice, in the workplace you don’t always have that freedom.

So, what do you do when you’re given a hammer and told to fasten a screw? Or, as the title of this article asks, what do you do when you have Python functions you want to use in SAS?

Recently I was tasked with documenting an exciting new feature for SAS — the ability to call Python functions from within SAS. In this article I will highlight everything I’ve learned along the way to bring you up to speed on this powerful new tool.

PROC FCMP Python Objects

Starting with the May 2019 release of SAS 9.4M6, PROC FCMP added support for submitting and executing functions written in Python from within a SAS session using the new Python object. If you’re unfamiliar with PROC FCMP, I’d suggest reading the documentation. In short, FCMP, or the SAS Function Compiler, enables users to write their own functions and subroutines that can then be called from just about anywhere a SAS function can be used. Users are not restricted to using Python only inside a PROC FCMP statement. You can create an FCMP function that calls Python code, and then call that FCMP function from the DATA step. You can also use one of the products or solutions that support Python objects, including SAS High Performance Risk and SAS Model Implementation Platform.

The Why and How

So, what made SAS want to include this feature in our product? The scenario we imagined when creating this feature was a customer who already had resources invested in Python modeling libraries but now wanted to integrate those libraries into their SAS environment. As much fun as it sounds to convert and validate thousands of lines of Python code into SAS code, wouldn’t it be nice if you could simply call Python functions from SAS? Whether you’re in the scenario above with massive amounts of Python code, or you’re simply more comfortable coding in Python, PROC FCMP is here to help you. Your Python code is submitted to a Python interpreter of your choice. Results are packaged into a Python tuple and brought back inside SAS for you to continue programming.

Programming in Two Languages at Once

So how do you program in SAS and Python at the same time? Depending on your installation of SAS, you may be ready to start, or there could be some additional environment setup you need to complete first. In either case, I recommend pulling up the Using PROC FCMP Python Objects documentation before we continue. The documentation outlines an output string that must be added to your Python code before it can be submitted from SAS. When you call a Python function from SAS, the return value(s) is stored in a SAS dictionary. If you’re unfamiliar with SAS dictionaries, you can read more about them in Dictionaries: Referencing a New PROC FCMP Data Type.

Getting Started

There are multiple methods to load your Python code into the Python object. In the code example below, I’ll use the SUBMIT INTO statement to create an embedded Python block and show you the basic framework needed to execute Python functions in SAS.

/* A basic example of using PROC FCMP to execute a Python function */
proc fcmp;
 
/* Declare Python object */
declare object py(python);
 
/* Create an embedded Python block to write your Python function */
submit into py;
def MyPythonFunction(arg1, arg2):
	"Output: ResultKey"
	Python_Out = arg1 * arg2
	return Python_Out
endsubmit;
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
rc = py.call("MyPythonFunction", 5, 10);
 
/* Store the result in a SAS variable and examine the value */
SAS_Out = py.results["ResultKey"];
put SAS_Out=;
run;

You can gather from this example that there are essentially five parts to using PROC FCMP Python objects in SAS:

  1. Declaring your Python object.
  2. Loading your Python code.
  3. Publishing your Python code to the interpreter.
  4. Executing your Python Code.
  5. Retrieving your results in SAS.

From the SAS side, those are all the pieces you need to get started importing your Python code. Now what about more complicated functions? What if you have working models made using thousands of lines and a variety of Python packages? You still use the same program structure as before. This time I’ll be using the INFILE method to import my Python function library by specifying the file path to the library. You can follow along by copying my Python code into a .py file. The file, blackscholes.py, contains this code:

def internal_black_scholes_call(stockPrice, strikePrice, timeRemaining, volatility, rate):
    import numpy
    from scipy import stats
    import math
    if ((strikePrice != 0) and (volatility != 0)):
        d1 = (math.log(stockPrice/strikePrice) + (rate + (volatility**2)\
                       /  2) * timeRemaining) / (volatility*math.sqrt(timeRemaining))
        d2 = d1 - (volatility * math.sqrt(timeRemaining))
        callPrice = (stockPrice * stats.norm.cdf(d1)) - \
        (strikePrice * math.exp( (-rate) * timeRemaining) * stats.norm.cdf(d2))
    else:
        callPrice=0
    return (callPrice)
 
def black_scholes_call(stockPrice, strikePrice, timeRemaining, volatility, rate):
    "Output: optprice"
    import numpy
    from scipy import stats
    import math
    optPrice = internal_black_scholes_call(stockPrice, strikePrice,\
                                           timeRemaining, volatility, rate)
    callPrice = float(optPrice)
    return (callPrice,)

My example isn’t quite 1000 lines, but you can see the potential of having complex functions all callable inside SAS. In the next example, I’ll call these Python functions from SAS.

/* Using PROC FCMP to execute Python functions from a file */
proc fcmp;
 
/* Declare Python object */
declare object py(python);
 
/* Use the INFILE method to import Python code from a file */
rc = py.infile("C:\Users\PythonFiles\blackscholes.py");
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
rc = py.call("black_scholes_call", 132.58, 137, 0.041095, .2882, .0222);
 
/* Store the result in a SAS variable and examine the value */
SAS_Out = py.results["optprice"];
put SAS_Out=;
run;

Calling Python Functions from the DATA step

You can take this a step further and make it usable in the DATA step, outside of a PROC FCMP statement. We can use our program from the previous example as a starting point. From there, we just need to wrap the inner Python function call in an outer FCMP function. This function-within-a-function design may be giving you flashbacks of Inception, but I promise you this exercise won’t leave you confused and questioning reality. Even if you’ve never used FCMP before, creating the outer function is straightforward.

/* Creating a PROC FCMP function that calls a Python function  */
proc fcmp outlib=work.myfuncs.pyfuncs;
 
/* Create the outer FCMP function */
/* These arguments are passed to the inner Python function */
function FCMP_blackscholescall(stockprice, strikeprice, timeremaining, volatility, rate);
 
/* Create the inner Python function call */
/* Declare Python object */
declare object py(python);
 
/* Use the INFILE method to import Python code from a file */
rc = py.infile("C:\Users\PythonFiles\blackscholes.py");
 
/* Publish the code to the Python interpreter */
rc = py.publish();
 
/* Call the Python function from SAS */
/* Since this is the inner function, instead of values in the call        */
/* you will pass the outer FCMP function arguments to the Python function */
rc = py.call("black_scholes_call", stockprice, strikeprice, timeremaining, volatility, rate);
 
/* Store the inner function Python output in a SAS variable                              */
FCMP_out = py.results["optprice"];
 
/* Return the Python output as the output for outer FCMP function                        */
return(FCMP_out);
 
/* End the FCMP function                                                                 */
endsub;
run;
 
/* Specify the function library you want to call from                                    */
options cmplib=work.myfuncs;
 
/*Use the DATA step to call your FCMP function and examine the result                    */
data _null_;
   result = FCMP_blackscholescall(132.58, 137, 0.041095, .2882, .0222);
   put result=;
run;

With your Python function neatly tucked away inside your FCMP function, you can call it from the DATA step. You also effectively reduced the statements needed for future calls to the Python function from five to one by having an FCMP function ready to call.

Looking Forward

So now that you can use Python functions in SAS just like SAS functions, how are you going to explore using these two languages together? The PROC FCMP Python object expands the capabilities of SAS and, as a result, of you as a SAS user. Depending on your experience level, completing a task in Python might be easier for you than completing that same task in SAS. Or you could be in the scenario I mentioned before where you have a major investment in Python and converting to SAS is non-trivial. In either case, PROC FCMP now has the capability to help you bridge that gap.

SAS or Python? Why not use both? Using Python functions inside SAS programs was published on SAS Users.

June 3, 2019
 

Statistical programmers and analysts often use two kinds of rectangular data sets, popularly known as wide data and long data. Some analytical procedures require that the data be in wide form; others require long form. (The "long format" is sometimes called "narrow" or "tall" data.) Fortunately, the statistical graphics procedures in SAS (notably, PROC SGPLOT) can usually accommodate either format. You can use multiple statements to create graphs of wide data. You can use a single statement and the GROUP= option to create graphs of long data.

Example: Overlay line plots for multiple response variables

Suppose you have four variables (with N observations) and you want to overlay line plots of three of the variables graphed against the fourth. There are two natural ways to arrange the 4*N data values. The first (and most natural) is a data set that has N rows and 4 variables (call them X, Y1, Y2, and Y3). This is the "wide form" of the data. The "long form" data set has three variables and 3*N rows, as shown to the right. The first column (VarName) specifies the name of the three response variables. The second column (X) indicates the value of the independent variable and the third column (Y) represents the value of the dependent variable that is specified in the VarName column. Some people will additionally sort the long data by the VarName variable, but that is not usually necessary. In general, if you want to stack k variables, the long form data will contain k*N observations.

PROC SGPLOT enables you to plot either set of data. For the wide data, you can use three SERIES statements to plot X vs Y1, X vs Y2, and X vs Y3, as follows. Notice that you can independently set the attributes of each line, such as color, symbol, and line style. In the following program, the line thickness is set to the same value for all lines, but you could make that attribute vary, if you prefer.

data Wide;
input X Y1 Y2 Y3;
datalines;
10 2 3 4
15 0 4 6
20 1 4 5
;
 
title "Wide Form: Use k Statements to Plot k Variables";
proc sgplot data=Wide;
   series x=X y=Y1 / markers lineattrs=(thickness=2);
   series x=X y=Y2 / markers lineattrs=(thickness=2);
   series x=X y=Y3 / markers lineattrs=(thickness=2);
run;

You can use PROC TRANSPOSE or the SAS DATA step to convert the data from wide form to long form. When the data are in the long format, you use a single SERIES statement and the GROUP=VarName option to plot the three groups of lines. In addition, you can set the attributes for all the lines by using a single statement.

/* convert data from Wide to Long form */
data Long;
set Wide;
VarName='Y1'; Value=Y1; output;
VarName='Y2'; Value=Y2; output;
VarName='Y3'; Value=Y3; output;
drop Y1-Y3;
run;
 
title "Long Form: Use GROUP= Option to Plot k Variables";
proc sgplot data=Long;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);
run;
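If you prefer PROC TRANSPOSE over the DATA step for the wide-to-long conversion, the following sketch shows one way to do it. It assumes the Wide data set from the previous example; because Wide is already sorted by X, you can use X as the BY variable:

/* Sketch: convert Wide to Long by using PROC TRANSPOSE */
proc transpose data=Wide
               out=Long(rename=(Col1=Value)) /* transposed values land in COL1 */
               name=VarName;                 /* source variable names (Y1-Y3)  */
   by X;                                     /* one output row per (X, variable) pair */
   var Y1-Y3;                                /* the variables to stack */
run;

The resulting data set is ordered by X rather than by VarName, but the GROUP=VarName option on the SERIES statement handles either ordering.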

Advantages and disadvantages of wide and long formats

The two formats contain the same information, but sometimes one form is more convenient than the other. Here are a few reasons to consider wide-form and long-form data:

Use the wide form when...

  • You want to run a fixed-effect regression analysis. Many SAS procedures require data to be in wide form, including ANOVA, REG, GLM, LOGISTIC, and GENMOD.
  • You want to run a multivariate analysis. Multivariate analyses include principal components (PRINCOMP), clustering (FASTCLUS), discriminant analysis (DISCRIM), and most matrix-based computations (PROC IML).
  • You want to create a plot that overlays graphs of several distinct variables. With wide data, you can easily and independently control the attributes of each overlay.

Use the long form when...

  • You want to run a mixed model regression analysis for repeated measurements. PROC MIXED and GLIMMIX require the long format. In general, the long format is useful for many kinds of longitudinal analysis, where the same subject is measured at multiple time points.
  • The measurements were taken at different values of the X variable. For example, in the previous section, the wide format is applicable because Y1, Y2, and Y3 were all measured at the same three values of X. However, the long form enables Y1 to be measured at different values than Y2. In fact, Y1 could be measured at only three points whereas Y2 could be measured at more points.

The last bullet point is important. The long form is more flexible, so you can use it to plot quantities that are measured at different times or positions. Line plots of this type are sometimes called spaghetti plots. The following DATA step defines long-form data for which the response variables are measured at different values of X:

data Long2;
infile datalines truncover;
length VarName $2;
input VarName X Value @;
do until( Value=. );
   output;   input X Value @;
end;
datalines;
Y1 10 2 15 0 20 1
Y2 10 3 12 4 13 5 16 4 17 3 18 3 20 4
Y3 9 3 11 4 14 6 18 4 19 5
;
 
title "Long Form: Different Number of Measurements per Subject";
proc sgplot data=Long2;
   series x=X y=Value / group=VarName markers lineattrs=(thickness=2);
   xaxis grid; yaxis grid;
run;

In summary, you can use PROC SGPLOT to create graphs regardless of whether the data are in wide form or long form. I've presented a few common situations in which you might want to use each kind of data representation. Can you think of other situations in which the long format is preferable to the wide format? Or vice versa? Let me know by leaving a comment.

The post Graph wide data and long data in SAS appeared first on The DO Loop.

May 30, 2019
 

Knowing how to visualize a regression model is a valuable skill. A good visualization can help you to interpret a model and understand how its predictions depend on explanatory factors in the model. Visualization is especially important in understanding interactions between factors. Recently I read about work by Jacob A. Long who created a package in R for visualizing interaction effects in regression models. His graphs inspired me to discuss how to visualize interaction effects in regression models in SAS.

There are many ways to explore the interactions in a regression model, but this article describes how to use the EFFECTPLOT statement in SAS. The emphasis is on creating a plot that shows how the response depends on two regressors that might interact. Depending on the type of regressors (continuous or categorical), you can create the following plots:

  • Both regressors are continuous: Use the CONTOUR option to create a contour plot or the SLICEFIT option to display curves that show the predicted response as a function of the first regressor while fixing the second regressor at a sequence of values (often low, medium, and high values).
  • One regressor is categorical and the other is continuous: Use the SLICEFIT option to overlay a curve for the predicted response for each value of the categorical regressor.
  • Both regressors are categorical: Use the INTERACTION option to create a plot that shows the group means for each joint level of the regressors. Alternatively, you can use the BOX option to draw box plots for each pair of levels.

For an introduction to the EFFECTPLOT statement, see my 2016 article "Use the EFFECTPLOT statement to visualize regression models in SAS." The EFFECTPLOT statement and the PLM procedure were both introduced in SAS 9.22 in 2010.

This article uses the Sashelp.Cars data to demonstrate the visualizations. The response variable is MPG_City, which is the average miles per gallon for each vehicle during city driving. The regressors are Weight (mass of the vehicle, in pounds), Horsepower, Origin (place of manufacture: 'Asia', 'Europe', or 'USA'), and Type of vehicle. For simplicity, the example uses only four values of the Type variable: 'SUV', 'Sedan', 'Sports', or 'Wagon'.

Although this article shows only two-regressor models, the EFFECTPLOT statement supports arbitrarily many regressors. By default, the additional continuous explanatory variables are set to their mean values; the additional categorical regressors are set to their reference level. You can change this default behavior by using the AT keyword.
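For instance, assuming a hypothetical item store (here called MyModel) saved from a model that contains Horsepower, Origin, and Weight, a sketch of the AT syntax might look like this:

/* Sketch (hypothetical item store): fix Weight at 3000 instead of its mean */
proc plm restore=MyModel noinfo;
   effectplot slicefit(x=Horsepower sliceby=Origin) / at(Weight=3000) clm;
run;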

Interaction between two continuous variables

Suppose you want to visualize the interaction between two continuous regressors. The following call to PROC GLM creates a contour plot automatically. It also creates an item store which saves information about the model.

proc glm data=Sashelp.Cars;
   model MPG_City = Horsepower | Weight / solution;
   ods select ParameterEstimates ContourFit;
   store GLMModel;
run;

From the contour plot, you can see that the Horsepower and Weight variables interact. For low values of Weight, the predicted response has a negative slope with respect to Horsepower. In contrast, for high values of Weight, the predicted response has a positive slope with respect to Horsepower.

This fact is easier to see if you "slice" the contour plot at low, medium, and high values of the Weight variable. You can use PROC PLM to create a SLICEFIT plot. By default, the "slicing" variable is fixed at five values: its minimum value, first quartile value, median value, third quartile value, and maximum value.

/* Graph response vs X1. By default, X2 fixed at Min, Q1, Median, Q3, and Max values */
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower sliceby=Weight) / clm;
run;

Because the slopes of the lines depend on the value of Weight, the graph indicates an interaction.

Changing the slicing levels for continuous variables

As shown in the previous section, the SLICEFIT statement uses quantiles to fix the value of the second continuous regressor. When I read Jacob Long's web page, I noticed that his functions slice the second regressor at its mean and one standard deviation away from the mean (in both directions). In SAS, you can specify arbitrary values by using the SLICEBY= option. The following statements use PROC MEANS to compute the mean and standard deviation of the Weight variable, then use those values to specify slicing values for the Weight variable:

proc means data=Sashelp.Cars Mean StdDev ndec=0;
   var Weight;
run;
 
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower sliceby=Weight=2819 3578 4337) / clm;
run;

Again, the slope of the predicted response changes with values of Weight, which indicates an interaction effect. Notice that when Weight = 4337, which is one standard deviation above the mean, the slope of the predicted response is flat.

Of course, you can automate this process if you don't want to compute the slicing values in your head. You can use the DATA step or PROC SQL to compute the slicing values, then create macro variables for the sample mean and standard deviation. You can then use the macro variables to specify the slicing values:

proc sql;
  select mean(Weight) as mean, std(Weight) as std
  into :mean, :std             /* put Mean and StdDev into macro variables */
  from Sashelp.cars;
quit;
 
/* slice the Weight variable at mean - StdDev, mean, and mean + StdDev */
proc plm restore=GLMModel noinfo;
   effectplot slicefit(x=Horsepower 
      sliceby=Weight=%sysevalf(&mean-&std) &mean %sysevalf(&mean+&std)) / clm;
run;

The graph is similar to the previous graph and is not shown.

Interactions between a continuous and a categorical regressor

If one of the regressors is categorical and the other is continuous, it is easy to visualize the interaction because you can plot the predicted response versus the continuous regressor for each level of the categorical regressor. In fact, this plot is created automatically by many SAS procedures, so often you don't need to use the EFFECTPLOT statement. For example, the following call to PROC GLM overlays three regression curves on a scatter plot of the data:

ods graphics on;
proc glm data=Sashelp.Cars;
   class Origin(ref='Europe');
   model mpg_city = Horsepower | Origin / solution; /* one continuous, one categorical */
   store GLMModel2;
run;

The slopes of the lines change with the levels of the Origin variable, so there appears to be an interaction effect between those two regressors.

The GLM procedure has access to the original data, so the lines are overlaid on a scatter plot. If you create the same plot in PROC PLM, you obtain the lines (and, optionally, confidence bands), but the plot does not include a scatter plot because the data are not part of the saved item store. For completeness, the following call to PROC PLM creates a similar visualization of the Horsepower-Origin interaction:

proc plm restore=GLMModel2 noinfo;
   effectplot slicefit(x=Horsepower sliceby=Origin) / clm;
run;

Interactions between two categorical regressors

I have previously written about how to create an "interaction plot" for two categorical predictors. Many SAS procedures produce this kind of plot automatically. You can use the EFFECTPLOT BOX or EFFECTPLOT INTERACTION statement inside many regression procedures. Alternatively, you can call PROC PLM and create an interaction plot from an item store. Again, the main difference is that the regression procedures can overlay observed data values, whereas PROC PLM visualizes only the model, not the data.

The following example creates a model that has two categorical variables. By default, PROC GLM creates an interaction plot and overlays the observed data values:

proc glm data=Sashelp.Cars(where=(Type in ('SUV' 'Sedan' 'Sports' 'Wagon')));
   class Origin(ref='Europe') Type;
   model mpg_city = Origin | Type;
   store GLMModel3;
run;

This model does not exhibit much (if any) interaction between the regressors. For a vehicle of a specific type (such as 'SUV'), Asian-built vehicles tend to have a higher MPG_City than US-built vehicles, which tend to have a higher MPG than European-built vehicles. These trend lines have similar slopes as you vary the Type variable. If you check the ParameterEstimates table, you will see that the interaction effects are not statistically significant (α = 0.05).

As mentioned, you can create the same plot (without the data markers) by using PROC PLM. If you request confidence intervals, you get a slightly different graph. You can choose whether or not to connect the means of the response for each level. The following statements create a plot for which the means are not connected:

proc plm restore=GLMModel3 noinfo;
   effectplot interaction(x=Origin sliceby=Type) / clm;
/* effectplot interaction(x=Origin sliceby=Type) / clm connect; */ /* or connect the means */
run;

Lastly, you can use the EFFECTPLOT BOX statement in regression procedures. The information is the same as for the "interaction plot," but box plots are used to show the observed distribution of the response for each level of the first categorical regressor.

proc genmod data=Sashelp.Cars(where=(Type in ('SUV' 'Sedan' 'Sports' 'Wagon')));
   class Origin(ref='Europe') Type;
   model mpg_city = Origin | Type;
   effectplot box(x=Origin sliceby=Type) / nolabeloutlier;
run;

Summary

In summary, you can use the EFFECTPLOT statement to visualize the interactions between regressors in a regression model. In general, when the slopes of the response curves depend on the values of a second regressor, that indicates an interaction effect. For a continuous-continuous interaction, you can choose the values at which you slice the second regressor. By default, the regressor is sliced at quantiles, but you can modify that to, for example, slice the variable at its mean and at (plus or minus) one standard deviation away from the mean. If you use the EFFECTPLOT statement inside a regression procedure, you can overlay the model on the observed responses. In PROC PLM, the EFFECTPLOT statement visualizes only the model.

The post Visualize interaction effects in regression models appeared first on The DO Loop.

May 30, 2019
 

Human behavior is fascinating. We come in so many shapes, sizes and backgrounds. Doesn’t it make sense that any tests we write also accommodate our wonderful differences?

This picture is of Miko, a northern rescue and a recent addition to my family. He’s learning to live in an urban household and doing great with some training. He’s going through so many new tests as he adapts to life in the city, which is quite different from being free in the northern territories. Watch for a later post on his training successes.

I’m so happy to share how SAS has been helping candidates by offering a variety of certification credentials geared towards testing for differences and preferences in thought. If you are wondering: I’ve been addicted to psychometrics for a while now; anything human behavior-related interests me. I thought I would begin by sharing some of the different types of testing roles that I have held in the past.

1. Psychometric testing

Before I joined SAS, I worked at CSI. To answer that unspoken thought, dear reader: CSI has been providing financial training and accreditation since 1964 – way before CSI the TV show became popular.

My role as Test Manager was super exciting for someone with a curiosity for analytics and a desire to help people succeed. In a team of four, we scored over 200 exams to provide credentials. Psychometrics was the most exciting part of my job: analyzing the performance of test takers so that we could constantly improve our tests. Psychometric tests are used to identify a candidate's skills, knowledge and personality.

2. Multiple-choice testing

While setting multiple-choice exam questions, I learned that it is ideal for the four answer choices to be similar in length and complexity. For example, if candidates typically chose option A for a question whose correct response was B, we would dig deeper, comparing the length and language of the options, and then revise an option if the review committee agreed.

3. Adaptive testing

Prior to CSI, I worked at the test center of DeVry Institute of Technology. In adaptive testing, the test’s difficulty adapts to candidate performance: a correct response leads to a more complex question, while an incorrect response leads to an easier one. Eventually, this helped candidates decide which engineering program would be the right fit for their skills.

This is where I met the student who asked, “can my boyfriend write my exam?”
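The adaptive rule described above (correct answer → harder item, incorrect answer → easier item) is essentially a difficulty ladder. As a toy Python sketch only — the item pools, the fixed test length, and the scoring are hypothetical, not any real testing engine — it might look like this:

```python
def adaptive_test(items_by_level, answer_correctly, start_level=0, num_items=4):
    """Walk a difficulty ladder: move up one level after a correct answer,
    down one level after an incorrect answer, staying within the pool."""
    top = len(items_by_level) - 1
    level, history = start_level, []
    for i in range(num_items):
        # pick the next item at the current difficulty level
        item = items_by_level[level][i % len(items_by_level[level])]
        correct = answer_correctly(item)
        history.append((level, correct))
        level = min(level + 1, top) if correct else max(level - 1, 0)
    return level, history

# A candidate who answers everything correctly climbs to the hardest level.
pools = [["easy Q"], ["medium Q"], ["hard Q"], ["expert Q"]]
final_level, path = adaptive_test(pools, lambda item: True)
```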

4. Performance testing

With SAS at the forefront of analytics, it should come as no surprise that certification exams have evolved to the next level. As a certification candidate you can now try out performance-based testing.

A performance test requires a candidate to actually perform a task, rather than simply answering questions. An example is writing SAS code. Instead of answering a knowledge-level multiple choice exam about SAS code, the candidate is asked to actually write code to arrive at answers.

Certification at SAS

SAS Certified Specialist: Base Programming Using SAS 9.4 is great for those who can demonstrate ease in putting into practice the knowledge learned in the Foundation Programming classes 1 and 2. During this performance-based exam, you will access a SAS environment where coding challenges are presented, and you will need to write and execute SAS code to determine the correct answers to a series of questions.

The SAS® Certified Base Programmer for SAS®9 credential remains, but the exam will be retired in June 2019.

While writing this post I came across this on Wikipedia: it shows how the study of adaptive behavior goes back to Darwin’s time. It’s a good read for anyone intrigued by the science and art of testing.

“Charles Darwin was the inspiration behind Sir Francis Galton who led to the creation of psychometrics. In 1859, Darwin published his book The Origin of Species, which pertained to individual differences in animals. This book discussed how individual members in a species differ and how they possess characteristics that are more adaptive and successful or less adaptive and less successful. Those who are adaptive and successful are the ones that survive and give way to the next generation, who would be just as or more adaptive and successful. This idea, studied previously in animals, led to Galton's interest and study of human beings and how they differ one from another, and more importantly, how to measure those differences.”

Are you fascinated by the science and art of human behavior as it relates to testing? Are you as excited as I am about the possibilities of performance-based testing? I would love to hear your comments below.

New at SAS: Psychometric testing was published on SAS Users.

May 30, 2019
 

Note: This is the second post in a series featuring analytics leaders in the utility industry. St. Louis is home to Anheuser Busch, the legendary Cardinals baseball team, and Ameren, where analytics rocks. More on this later. Ameren is a large investor-owned utility that serves St. Louis and the surrounding [...]

Analytics rocks at Ameren was published on SAS Voices by Mike F. Smith

May 29, 2019
 

Interestingly enough, paperclips have their own day of honor. On May 29th, we celebrate #NationalPaperclipDay! That well-known piece of curved wire deserves attention for keeping our papers together and helping us stay organized. Do you remember who else deserved the same attention? Clippit – the infamous Microsoft Office assistant, popularly known as ‘Clippy’.

We saw the last of Clippy in 2004 before it was removed completely from Office 2007 after constant negative criticism. By then, most users considered it useless and decided to turn it off completely, despite the fact that it was supposed to help them perform certain tasks faster. Clippy was a conversational agent, like a chatbot, launched a decade before Apple’s Siri. The cartoonish paperclip-with-eyes resting on a yellow loose-leaf paper bed would pop up to offer assistance every time you opened a Microsoft Office program. Today, people seem to love AI. But a decade ago, why did everyone hate Clippy?

Why was Clippy such a failure?

One problem with Clippy was the terrible user experience: Clippy stopped users to ask whether they needed help, but in doing so suspended their work altogether. This was great for first-timers getting to know Clippy, goofing around and training themselves to use the agent; later on, however, the mandatory conversations became frustrating. A second problem was that Clippy was designed as a male agent. Clippy was born in a meeting room full of male Microsoft employees. His facial features resemble those of a male cartoon, or at least that is what the women in the testing focus group observed. In a farewell note for Clippy, the company addressed Clippy as ‘he’, affirming his gender. Some women even said that Clippy’s gaze created discomfort while they tried to do their tasks.

Rather than using natural language processing (NLP), Clippy invoked its actions through Bayesian algorithms that estimated the probability of a user wanting help based on specific actions. Clippy’s answers were rule-based and templated from Microsoft’s knowledge base. Clippy seemed like the agent you would consult when you were stuck on a problem, but the fact that he kept appearing on the screen in random situations only added to users’ frustration. In total, Clippy violated design principles, social norms and process workflows.
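Microsoft never published the details of that model, so the following Python fragment is only a toy illustration of the general idea — the prior and the likelihood ratios are made up: each observed action multiplies the odds that the user wants help, and the assistant pops up only when the resulting probability crosses a threshold.

```python
def update_help_belief(prior, likelihood_ratios):
    """Bayes update in odds form: each action's likelihood ratio
    P(action | wants help) / P(action | does not want help)
    multiplies the prior odds; convert back to a probability at the end."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Hypothetical evidence: typing "Dear ..." (strong signal the user is
# writing a letter), repeated undo (weak signal of confusion)
p = update_help_belief(prior=0.05, likelihood_ratios=[6.0, 1.5])
offer_help = p > 0.3   # interrupt the user only past this threshold
```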

What lessons did Clippy teach us?

We can learn from Clippy. The renewed interest in conversational agents in the tech industry often overlooks the aspects that affect efficient communication, resulting in failures. It is important to learn lessons from the past and design interfaces and algorithms that tackle the needs of humans rather than hyping the capabilities of artificial intelligence.

At SAS, we are working to deliver a natural language interaction (NLI) service that converts keyed or spoken natural language text into application-specific, executable code, and we are using apps like Q, a genderless AI voice for virtual assistants. We’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer. “Chatbots create a humanlike interaction that makes results accessible to all.”

The use of chatbots is growing exponentially, and all kinds of organizations are starting to see the exciting possibilities of combining chatbots with AI analytics. Clippy was a trailblazer of sorts but has come to represent what to avoid when designing AI. At SAS, we are focused on augmenting the human experience and on keeping the customer, not the technology, at the center.

Interested in seeing what SAS is doing with Natural Language Processing? Check out SAS Visual Text Analytics and try it for free. We also have a brand new SAS Book hot off the press focusing on information extraction models for unstructured text / language data: SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models. We even have a free chapter for you to take a peek!

What can we learn from Clippy about AI? was published on SAS Users.

May 28, 2019
 

Modern statistical software provides many options for computing robust statistics. For example, SAS can compute robust univariate statistics by using PROC UNIVARIATE, robust linear regression by using PROC ROBUSTREG, and robust multivariate statistics such as robust principal component analysis. Much of the research on robust regression was conducted in the 1970s, so I was surprised to learn that a robust version of simple (one variable) linear regression was developed way back in 1950! This early robust regression method uses many of the same techniques that are found in today's "modern" robust regression methods. This article describes and implements a robust estimator for simple linear regression that was developed by Theil (1950) and extended by Sen (1968).

The Theil-Sen robust estimator

I had not heard of the Theil-Sen robust regression method until recently, perhaps because it applies only to one-variable regression. The Wikipedia article about the Theil-Sen estimator states that the method is "the most popular nonparametric technique for estimating a linear trend" in the applied sciences due to its "simplicity in computation, ... robustness to outliers," and "[limited assumptions] regarding measurement errors."

The idea behind the estimator is simple. If the data contain N pairs of (x, y) values, compute the slopes between all pairs of points and choose the median as the estimate of the regression slope. Using that slope, pass a line through each (x, y) point to obtain N intercepts. Choose the median of the intercepts as the estimate of the regression intercept.

That's it. You compute "N choose 2" (which is N*(N-1)/2) slopes and take the median. Then compute N intercepts and take the median. The slope estimate is unbiased and the process is resistant to outliers.
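To make the recipe concrete, here is a minimal Python sketch of the same median-of-slopes computation (a cross-check only; the SAS/IML implementation appears later in this article):

```python
from itertools import combinations
from statistics import median

def theil_sen(points):
    """Theil-Sen line: median of all pairwise slopes, then median intercept."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)
              if x1 != x2]                       # Sen: skip pairs with equal x
    m = median(slopes)                           # robust slope estimate
    b = median(y - m * x for (x, y) in points)   # robust intercept estimate
    return b, m

# Sen's seven points plus the two outliers used in this article
pts = [(1, 9), (2, 15), (3, 19), (4, 20), (10, 45), (12, 55), (18, 78),
       (12.5, 30), (4.5, 50)]
b, m = theil_sen(pts)   # m is approximately 3.97, matching the text
```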

The adjacent scatter plot shows the Theil-Sen regression line for nine data points. The seven data points that appear to fall along the regression line were used by Sen (1968). I added two outliers. The plot shows that the Theil-Sen regression line ignores the outliers and passes close to the other data points. The slope of the Theil-Sen line is slightly less than 4. In contrast, the least squares line through these data has a slope of only 2.4 because of the influence of the two outliers.

Implement the Theil-Sen estimator in SAS

You can easily implement Theil-Sen regression in SAS/IML, as follows:

  1. Use the ALLCOMB function to generate all pairs of the values {1, 2, ..., N}. Or, for large N, use the RANDCOMB function to sample pairs of values.
  2. Use subscript operations to extract the pairs of points. Compute all slopes between the pairs of points.
  3. Use the MEDIAN function to compute the median slope and the median intercept.

The following SAS/IML program implements this algorithm. The program is very compact (six statements) because it is vectorized. There are no explicit loops.

proc iml;
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78, /* 7 data points used by Sen (1968) */
     12.5 30,  4.5 50};                               /* 2 outliers (not considered by Sen) */
 
/* Theil uses all "N choose 2" combinations of slopes of segments.
   Assume that the first coordinates (X) are distinct */
c = allcomb(nrow(XY), 2);         /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                 /* extract first point of line segments */
Pt2 = XY[c[,2],];                 /* extract second point of line segments */
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]); /* Careful! Assumes x1 ^= x2 */
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Theil Combs=All"];

As stated earlier, the Theil-Sen estimate has a slope of 3.97. That value is the median of the slopes among the 36 line segments that connect pairs of points. The following graphs display the 36 line segments between pairs of points and a histogram of the distribution of the slopes. The histogram shows that the value 3.97 is the median value of the distribution of slopes.


Handling repeated X values: Sen's extension

The observant reader might object that the slopes of the line segments will be undefined if any of the data have identical X values. One way to deal with that situation is to replace the undefined slopes by large positive or negative values, depending on the sign of the difference between the Y values. Since the median is a robust estimator, adding a few high and low values will not affect the computation of the median slope. Alternatively, Sen (1968) proved that you can omit the pairs that have identical X values and still obtain an unbiased estimate. In the following SAS/IML program, I modified the X values of the two outliers so that only seven of the nine X values are unique. The LOC function finds all pairs that have different X values, and only those pairs are used to compute the robust regression estimates.

/* Sen (1968) handles repeated X coords by using only pairs with distinct X */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     12 30,  4 50};  /* last two obs are repeated X values */
c = allcomb(nrow(XY), 2);        /* all "N choose 2" combinations of pairs */
Pt1 = XY[c[,1],];                /* first point of line segments */
Pt2 = XY[c[,2],];                /* second point of line segments */
idx = loc(Pt1[,1]-Pt2[,1]^=0);   /* find pairs with distinct X values */
Pt1 = Pt1[idx,];                 /* keep only pairs with different X values */
Pt2 = Pt2[idx,];
 
slope = (Pt1[,2] - Pt2[,2]) / (Pt1[,1] - Pt2[,1]);
m = median(slope);
b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
print (b||m)[c={'Intercept' 'Slope'} L="Method=Sen Combs=All"];

A function to compute the Theil-Sen estimator

The following program defines a SAS/IML function that implements the Theil-Sen regression estimator. I added two options. You can use the METHOD argument to specify how to handle pairs of points that have the same X values. You can use the NUMPAIRS option to specify whether to use the slopes of all pairs of points or the slopes of K randomly generated pairs of points.

proc iml;
/* Return (intercept, slope) for Theil-Sen robust estimate of a regression line.
   XY is N x 2 matrix. The other arguments are:
   METHOD: 
      If method="Theil" and a pair of points have the same X coordinate, 
         assign a large positive value instead of +Infinity and a large negative 
         value instead of -Infinity. 
      If method="Sen", omit any pairs of points that have the same first coordinate. 
   NUMPAIRS:
      If numPairs="All", generate all "N choose 2" combinations of the N points.
      If numPairs=K (positive integer), generate K random pairs of points. 
*/
start TheilSenEst(XY, method="SEN", numPairs="ALL");
   Infinity = 1e99;             /* big value for slope instead of +/- infinity */
   if type(numPairs)='N' then
      c = rancomb(nrow(XY), 2, numPairs);  /* random combinations of pairs */
   else if upcase(numPairs)="ALL" then 
      c = allcomb(nrow(XY), 2);            /* all "N choose 2" combinations of pairs */
   else stop "ERROR: The numPairs option must be 'ALL' or a positive integer";
 
   Pt1 = XY[c[,1],];                       /* first points for slopes */
   Pt2 = XY[c[,2],];                       /* second points for slopes */
   dy = Pt1[,2] - Pt2[,2];                 /* change in Y */
   dx = Pt1[,1] - Pt2[,1];                 /* change in X */ 
   idx = loc( dx ^= 0 );  
   if upcase(method) = "SEN" then do;      /* exclude pairs with same X value */
      slope = dy[idx] / dx[idx];           /* slopes of line segments */
   end;
   else do;                        /* assign big slopes for pairs with same X value */
      slope = j(nrow(Pt1), 1, .);  /* if slope calculation is 0/0, assign missing */
      /* Case 1: x1 ^= x2. Do the usual slope computation */
      slope[idx] = dy[idx] / dx[idx];
      /* Case 2: x1 = x2. Assign +Infinity if sign(y1-y2) > 0, else assign -Infinity */
      jdx = loc( dx = 0 & sign(dy)>0 );
      if ncol(jdx)>0 then 
         slope[jdx] = Infinity;
      jdx = loc( dx = 0 & sign(dy)<0 );
      if ncol(jdx)>0 then 
         slope[jdx] = -Infinity;
   end;
   m = median(slope);
   b = median( XY[,2] - m*XY[,1] );  /* median(y-mx) */
   return( b || m );
finish;
 
/* Test all four calls */
XY = {1  9,  2 15,  3 19, 4 20, 10 45,  12 55, 18 78,
     18 30,  4 50};  /* last two obs are outliers not considered by Sen */
 
est = TheilSenEst(XY, "Theil", "All");
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=All"];
 
est = TheilSenEst(XY, "Sen", "All");
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=All"];
 
call randseed(123, 1);
est = TheilSenEst(XY, "Theil", 200);
print est[c={'Intercept' 'Slope'} L="Method=Theil; Pairs=200"];
 
call randseed(123, 1);
est = TheilSenEst(XY, "Sen", 200);
print est[c={'Intercept' 'Slope'} L="Method=Sen; Pairs=200"];
QUIT;

For these data, the estimates are the same whether you exclude pairs of points that have identical X coordinates or whether you replace the undefined slopes with large values. For this small data set, there is no reason to use the randomly chosen pairs of points, but that syntax is shown for completeness. Of course, if you run the analysis with a different random number seed, you will get a different estimate.
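For large N, computing all N(N-1)/2 slopes becomes expensive, which is why the RANDCOMB option is useful. The same randomized idea — sample K pairs, then take the median of their slopes — can be sketched in Python (a sketch only; the sampling scheme here is an assumption, not a reimplementation of RANDCOMB):

```python
import random
from statistics import median

def theil_sen_random(points, k, seed=123):
    """Approximate Theil-Sen fit from k randomly sampled pairs of points."""
    rng = random.Random(seed)
    n = len(points)
    slopes = []
    while len(slopes) < k:
        i, j = rng.sample(range(n), 2)            # two distinct indices
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 != x2:                              # Sen: skip equal-x pairs
            slopes.append((y2 - y1) / (x2 - x1))
    m = median(slopes)
    b = median(y - m * x for (x, y) in points)
    return b, m

pts = [(1, 9), (2, 15), (3, 19), (4, 20), (10, 45), (12, 55), (18, 78),
       (12.5, 30), (4.5, 50)]
b, m = theil_sen_random(pts, k=200)
```

As with RANDCOMB, running the sketch with a different seed gives a slightly different estimate.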

Summary

You can download the SAS program that creates the analysis and graphs in this article.

Although Theil published the main ideas for this method in 1950, it contains many of the features of modern robust statistical estimates. Specifically, a theme in modern robust statistics is to exhaustively or randomly choose many small subsets of the data. You compute a (classical) estimate on each subset and then use the many estimates to obtain a robust estimate. I did not realize that Theil had introduced these basic ideas almost seventy years ago!

Theil and Sen also included confidence intervals for the estimates, but I have not included them in this brief article.

References
Sen, P. K. (1968). "Estimates of the regression coefficient based on Kendall's tau." Journal of the American Statistical Association, 63(324), 1379–1389.

Theil, H. (1950). "A rank-invariant method of linear and polynomial regression analysis, I, II, III." Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen, 53, 386–392, 521–525, 1397–1412.

The post The Theil-Sen robust estimator for simple linear regression appeared first on The DO Loop.