4月 252018
 

How many planets are there in our solar system? The answer hasn't always been 9 ... er, I mean 8 (sorry Pluto!). The count has changed throughout history as we got a better understanding of astronomy, discovered new planets, and redefined what a 'planet' is. Wouldn't it be helpful to [...]

The post How many planets in our solar system? (... tricky question!) appeared first on SAS Learning Post.

4月 252018
 

SAS programmers on SAS discussion forums sometimes ask how to run thousands of regressions of the form Y = B0 + B1*X_i, where i=1,2,.... A similar question asks how to solve thousands of regressions of the form Y_i = B0 + B1*X for thousands of response variables. I have previously written about how to solve the first problem by converting the data from wide form to a long form and using the BY statement in PROC REG to solve the thousands of regression problems in a single call to the procedure. My recent description of the SWEEP operator (which is implemented in SAS/IML software), provides an alternative technique that enables you to analyze both regression problems when the data are in the wide format.

Why would you anyone need to run 1,000 regressions?

Why anyone wants to solve thousands of linear regression problems? I can think of a few reasons:

  1. In the exploratory phase of an analysis, you might want to know which of thousands of variables are linearly related to a response. Running the regressions is a quick way to filter or "screen out" variables that do not have a strong linear association with the response.
  2. In a simulation study, each Y_i might be an independent draw from the same distribution. Running thousands of regressions enables you to approximate the sampling distribution of the parameter estimates.
  3. In a bootstrap analysis of regression estimates, you can perform case resampling or model-based resampling. In case resampling, you generate many X_i by resampling from the original X variable. In model-based resampling, you keep the X fixed and resample thousands of Y_i. In either case, Running thousands of regressions enables you to estimate the precision of the parameter estimates.

Many univariate regressions

In a previous article, I showed how to simulate data that satisfies a regression model. I created a data set that contains explanatory variables X1-X1000 and a single response variable, Y. You can download the SAS program that computes the data and performs the regression.

The following SAS/IML program reads the simulated data into a large matrix, M. For each regression, it forms the three-column matrix A from the intercept column, the k_th explanatory variable, and the variable Y. It then forms the sum of squares and cross products (SSCP) matrix (A`*A) and uses the SWEEP function to solve the least squares regression problem. The two parameter estimates (intercept and slope) for each explanatory variable are saved in a 2 x 1000 array. A few parameter estimates are displayed.

/* program to compute parameter estimates for Y = b0 + b1*X_k, k=1,2,... */
proc iml;
XVarNames = "x1":"x&nCont";      /* names of explanatory variables */
p = ncol(XVarNames );            /* number of X variables, excluding intercept */
varNames = XVarNames || "Y";     /* name of all data variables */
use Wide;  read all var varNames into M;  close; /* read data into M */
 
A = j(nrow(M), 3, 1);            /* columns for intercept, X_k, Y */
A[ ,3] = M[ , ncol(M)];          /* put Y in 3rd col */
ParamEst =  j(2, p, .);          /* allocate 2 x p matrix for estimates */
do k = 1 to ncol(XVarNames);
   A[ ,2] = M[ ,k];              /* X_k in 2nd col */
   S1 = sweep(A`*A, {1 2});      /* sweep in intercept and X_k */
   ParamEst[ ,k] = S1[{1 2}, 3]; /* estimates for k_th model Y = b0 + b1*X_k */
end;
print (paramEst[,1:6])[r={"Intercept" "Slope"} c=XVarNames];
Run thousands of regressions in SAS and obtain parameter estimates

You could perform a similar loop for models that contain multiple variables, such as all two-variable main-effect models of the form Y = b0 + b1*X_k + b2*X_j, where k ≠ j. You can use the ALLCOMB function in SAS/IML to choose the combinations of columns to sweep.

One large multivariate regression

You can also use the SWEEP operator to perform the regression of many responses onto a single explanatory variable. This case is easy in PROC REG because the procedure supports multiple response variables. The analysis is also easy if you use the SWEEP function in SAS/IML. It only requires a single function call!

The following DATA step simulates 1,000 response variables according to the model Y_i = b0 + b1*X + ε, where b0 = 39.07, b1 = 0.902, and ε ~ N(0, 5.71) is a normally distributed random error term. These values are the parameter estimates for the regression of SepalLength onto SepalWidth for the virginica species of iris flowers in the Sashelp.Iris data set. For more about simulating regression models, see Chapter 11 of Wicklin (2013).

/* based on Wicklin (2013, p. 203) */
%let NumSim = 1000;                  /* number of Y variables */
data RegSim(drop=RMSE b0 b1 k Species);
set Sashelp.Iris(where =(Species="Virginica") keep=Species SepalWidth
                 rename=(SepalWidth=X));
RMSE = 5.71; b0 = 39.07; b1 = 0.902; /* param estimates for model SepalLength = SepalWidth */
array Y[&NumSim];
call streaminit(321);
do k = 1 to &NumSim;
   Y[k] = b0 + b1*X + rand("Normal", 0, RMSE); /* simulate responses from model */
end;
output;
run;

After generating the data, you can compute all 1,000 parameter estimates by using a single call to the SWEEP function in SAS/IML. The following program forms a matrix M for which the first column is an intercept term, the second column is the X variable, and the next 1,000 columns are the simulated responses. Sweeping the first two rows of the M`*M matrix computes the 1,000 parameter estimates, which are then graphed in a scatter plot.

/* program to compute parameter estimates for Y_k = b0 + b1*X, k=1,2,... */
proc iml;
YVarNames = "Y1":"Y&numSim";     /* names of explanatory variables */
varNames = "X" || YVarNames;     /* name of all data variables */
use RegSim;  read all var varNames into M;  close;
M = j(nrow(M), 1, 1) || M;       /* add intercept column */
 
S1 = sweep(M`*M, {1 2});         /* sweep in intercept and X */
ParamEst = S1[{1 2}, 3:ncol(M)]; /* estimates for models Y_i = X */
 
title "Parameter Estimates for 1,000 Simulated Responses";
call scatter(ParamEst[1,], ParamEst[2,]) label={"Intercept" "X"}
     other="refline 39.07/axis=x; refline 0.902/axis=y;";
Run thousands of regression in SAS and obtain parameter estimates

The graph shows the parameter estimates for each of the 1,000 linear regressions. The distribution of the estimates provides more information about the sampling distribution than the estimate of standard error that PROC REG produces when you run a regression model. The graph visually demonstrates the "covariance of the estimates," which PROC REG estimates if you use the COVB option on the MODEL statement.

In summary, you can use the SWEEP operator (implemented in the SWEEP function in SAS/IML) to efficiently compute thousands of regression estimates for wide data. This article demonstrates how to compute models that have many explanatory variables (Y = b0 + b1* X_i) and models that have many response variables (Y_i = b0 + b1 * X). Are there drawbacks to this approach? Sure. You don't automatically get the dozens of ancillary regression statistics like R squared, adjusted R squared, p-values, and so forth. Also, PROC REG automatically handles missing values, whereas in SAS/IML you must first extract the complete cases before you try to form the SSCP matrix. Nevertheless, this computation can be useful for simulation studies in which the data are simulated and analyzed within SAS/IML.

Download the SAS programs for this article.

The post An easier way to run thousands of regressions appeared first on The DO Loop.

4月 242018
 

The ODS destination for PowerPoint uses table templates and style templates to display the tables, graphs, and other output produced by SAS procedures. You can customize the look of your presentation in a number of ways, including using custom style templates and images. Here we'll learn about using background images.

The post Background images and the ODS destination for PowerPoint appeared first on SAS Learning Post.

4月 242018
 

A SAS practice exam can help you prepare for SAS certification. Practice exams are similar in difficulty, objectives, length, and design of the actual exam. While there's no guarantee that passing the practice exam will result in passing the actual exam, they can help you determine how prepared you are for an exam.

The post Are SAS practice exams worth the cost? appeared first on SAS Learning Post.

4月 242018
 

SAS variables are variables in the statistics sense, not the computer programming sense. SAS has what many computer languages call “variables,” it just calls them “macro variables.” Knowing the difference between SAS variables and SAS macro variables will help you write more flexible and effective code.

The post When a variable is not a variable appeared first on SAS Learning Post.

4月 232018
 

Plastic pollution in the oceans is becoming a huge problem. And, as with any problem, finding the solution starts with identifying the source of the problem. A recent study estimated that 95% of the plastic pollution in our oceans comes from 10 rivers - let's put some visual analytics to [...]

The post Can you guess which 10 rivers produce 95% of ocean's plastic pollution? appeared first on SAS Learning Post.

4月 232018
 

You've probably heard about the "80-20 Rule," which describes many natural and manmade phenomena. This rule is sometimes called the "Pareto Principle" because it was discovered by Vilfredo Pareto (1848–1923) who used it to describe the unequal distribution of wealth. Specifically, in his study, 80% of the wealth was held by 20% of the population. The unequal distribution of effort, resources, and other quantities can often be described in terms of the Pareto distribution, which has a parameter that controls the inequity. Whereas some data seem to obey an 80-20 rule, other data are better described as following a "70-30 rule" or a "60-40 rule," and so on. (Although the two numbers do not need to add to 100, it makes the statement pleasingly symmetric.)

I thought about the Pareto distribution recently when I was looking at some statistics for the SAS blogs. I found myself wondering whether 80% of the traffic to a blog is generated by 20% of its posts. (Definition: A blog is a collection of articles. Each individual article is a post.) As you can imagine, some posts are more popular than others. Some posts are timeless. They appear in internet searches when someone searches for a particular statistical or programming technique, even though they were published long ago. Other posts (such as Christmas-themed posts) do not get much traffic from internet searches. They generate traffic for a week or two and then fade into oblivion.

To better understand how various blogs at SAS follow the Pareto Principle, I downloaded data for seven blogs during a particular time period. I then kept only the top 100 posts for each blog.

The following line plot shows the pageviews for one blog. The horizontal axis indicates the posts for this blog, ranked by popularity. (The most popular post is 1, the next most popular post is 2, and so forth.) This blog has four very popular posts that each generated more than 7,000 pageviews during the time period. Another group of four posts were slightly less popular (between 4,000 and 5,000 pageviews). After those eight "blockbuster" posts are the rank-and-file posts that were viewed less than 3,000 times during the time period.

Long-tailed distribution of pageviews for a blog

The pageviews for the other blogs look similar. This distribution is also common in book-sales data: most books sell only a few thousand copies whereas the best-sellers (think John Grisham) sell hundreds of millions of copies. Movie revenue is another example that follows this distribution.

The distribution is a long-tailed distribution, a fact that becomes evident if you graph a histogram of the underlying quantity. For the blog, the following histogram shows the distribution of the pageviews for the top 100 posts:

Long-tailed distribution of pageviews for a blog

Notice the "power law" nature of the distribution for the first few bars of the histogram. The height of each bar is about 0.4 of the previous height. About 57% of the blog posts had less than 1,000 pageviews. Another 22% had between 1,000 and 2,000 pageviews. The number of rank-and-file posts in each category decreases like a power law, but then the blockbuster posts start to appear. These popular posts give the distribution a very long tail.

Because some blogs (like the one pictured) attract thousands of readers whereas other blogs have fewer readers, you need to standardize the data if you want to compare the distributions for several blogs. Recall that the Pareto Principle is a statement about cumulative percentages. The following graph shows the cumulative percentages of pageviews for seven blogs at SAS (based on the Top 100 posts):

Cumulative percentages of pageviews for seven blogs

The graph shows a dashed line with slope –1 that is overlaid on the cumulative percentage curves. The places where the dashed line intersects a curve are the "Pareto locations" for which Y% of the pageviews are generated by the first (1-Y)% most popular blog posts. In general, these blogs appear to satisfy a "70-30 rule" because about 70% of the pageviews are generated by the 30 / 100 most popular posts. There is some variation between blogs, with the upper curve following a "72-28 rule" and the lower curve satisfying a "63-37 rule."

All this is approximate and is based on only the top 100 posts for each blog. For a more rigorous analysis, you could use PROC UNIVARIATE or PROC NLMIXED to fit the parameters of the Pareto distribution to the data for each blog. However, I am happy to stop here. In general, blogs at SAS satisfy an approximate "70-30 rule," where 70% of the traffic is generated by the top 30% of the posts.

The post The 80-20 rule for blogs appeared first on The DO Loop.

4月 212018
 

During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS user? SAS thinks like me: Dede Schreiber-Gregory appeared first on SAS Learning Post.

4月 212018
 

During SAS Global Forum 2018, SAS instructor Charu Shankar sat down with four SAS users to get their take on what makes them a SAS user. Read through to find valuable tips they shared and up your SAS game. I’m sure you will come away inspired, as you discover some universal commonalities in being a SAS user.

The post What makes a SAS user? Introverts find their tribe: Richann Watson appeared first on SAS Learning Post.

4月 212018
 

Have you ever been working in the macro facility and needed a macro function, but you could not locate one that would achieve your task? With the %SYSFUNC macro function, you can access most SAS® functions. In this blog post, I demonstrate how %SYSFUNC can help in your programming needs when a macro function might not exist. I also illustrate the formatting feature that is built in to %SYSFUNC. %SYSFUNC also has a counterpart called %QSYSFUNC that masks the returned value, in case special characters are returned.
%SYSFUNC enables the execution of SAS functions and user-written functions, such as those created with the FCMP procedure. Within the DATA step, arguments to the functions require quotation marks, but because %SYSFUNC is a macro function, you do not enclose the arguments in quotation marks. The examples here demonstrate this.

%SYSFUNC has two possible arguments. The first argument is the SAS function, and the second argument (which is optional) is the format to be applied to the value returned from the function. Suppose you had a report and within the title you wanted to issue today’s date in word format:

   title "Today is %sysfunc(today(),worddate20.)";

The title appears like this:

   "Today is               July 4, 2018"

Because the date is right-justified, there are leading blanks before the date. In this case, you need to introduce another function to remove the blank spaces. Luckily %SYSFUNC enables the nesting of functions, but each function that you use must have its own associated %SYSFUNC. You can rewrite the above example by adding the STRIP function to remove any leading or trailing blanks in the value:

   title "Today is %sysfunc(strip(%sysfunc(today(),worddate20.)))";

The title now appears like this:

    "Today is July 4, 2018"

The important thing to notice is the use of two separate functions. Each function is contained within its own %SYSFUNC.

Suppose you had a macro variable that contained blank spaces and you wanted to remove them. There is no macro COMPRESS function that removes all blanks. However, with %SYSFUNC, you have access to one. Here is an example:

   %let list=a    b    c; 
   %put %sysfunc(compress(&list));

The value that is written to the log is as follows:

   abc

In this last example, I use %SYSFUNC to work with SAS functions where macro functions do not exist.

The example checks to see whether an external file is empty. It uses the following SAS functions: FILEEXIST, FILENAME, FOPEN, FREAD, FGET, and FCLOSE. There are other ways to accomplish this task, but this example illustrates the use of SAS functions within %SYSFUNC.

   %macro test(outf);
   %let filrf=myfile;
 
   /* The FILEEXIST function returns a 1 if the file exists; else, a 0
   is returned. The macro variable &OUTF resolves to the filename
   that is passed into the macro. This function is used to determine
   whether the file exists. In this case you want to find the file
   that is contained within &OUTF. Notice that there are no quotation
   marks around the argument, as you will see in all cases below. If
   the condition is false, the %ELSE portion is executed, and a
   message is written to the log stating that the file does not
   exist.*/
 
   %if %sysfunc(fileexist(&outf)) %then %do;
 
   /* The FILENAME function returns 0 if the operation was successful; 
   else, a nonzero is returned. This function can assign a fileref
   for the external file that is located in the &OUTF macro 
   variable. */
 
   %let rc=%sysfunc(filename(filrf,&outf));
 
   /* The FOPEN function returns 0 if the file could not be opened; 
   else, a nonzero is returned. This function is used to open the
   external file that is associated with the fileref from &FILRF. */
 
   %let fid=%sysfunc(fopen(&filrf));
 
   /* The %IF macro checks to see whether &FID has a value greater
   than zero, which means that the file opened successfully. If the
   condition is true, we begin to read the data in the file. */
 
   %if &fid > 0 %then %do;
 
   /* The FREAD function returns 0 if the read was successful; else, a
   nonzero is returned. This function is used to read a record from
   the file that is contained within &FID. */
 
   %let rc=%sysfunc(fread(&fid));
 
   /* The FGET function returns a 0 if the operation was successful. A
   returned value of -1 is issued if there are no more records
   available. This function is used to copy data from the file data 
   buffer and place it into the macro variable, specified as the
   second argument in the function. In this case, the macro variable
   is MYSTRING. */   
 
   %let rc=%sysfunc(fget(&fid,mystring));
 
   /* If the read was successful, the log will write out the value
   that is contained within &MYSTRING. If nothing is returned, the
   %ELSE portion is executed. */
 
   %if &rc = 0 %then %put &mystring;
   %else %put file is empty;
 
   /* The FCLOSE function returns a 0 if the operation was successful;
   else, a nonzero value is returned. This function is used to close
   the file that was referenced in the FOPEN function. */
 
   %let rc=%sysfunc(fclose(&fid));
   %end;
 
   /* The FILENAME function is used here to deassign the fileref 
   FILRF. */
 
   %let rc=%sysfunc(filename(filrf));
   %end;
   %else %put file does not exist;
   %mend test;
   %test(c:\testfile.txt)

There are times when the value that is returned from the function used with %SYSFUNC contains special characters. Those characters then need to be masked. This can be done easily by using %SYSFUNC’s counterpart, %QSYSFUNC. Suppose we run the following example:

   %macro test(dte);
   %put &dte;
   %mend test;
 
   %test(%sysfunc(today(), worddate20.))

The above code would generate an error in the log, similar to the following:

   1  %macro test(dte);
   2  %put &dte;
   3  %mend test;
   4
   5  %test(%sysfunc(today(), worddate20.))
   MLOGIC(TEST):  Beginning execution.
   MLOGIC(TEST):  Parameter DTE has value July 20
   ERROR: More positional parameters found than defined.
   MLOGIC(TEST):  Ending execution.

The WORDDATE format would return the value like this: July 20, 2017. The comma, to a parameter list, represents a delimiter, so this macro call is pushing two positional parameters. However, the definition contains only one positional parameter. Therefore, an error is generated. To correct this problem, you can rewrite the macro invocation in the following way:

   %test(%qsysfunc(today(), worddate20.))

The %QSYSFUNC macro function masks the comma in the returned value so that it is seen as text rather than as a delimiter.

For a list of the functions that are not available with %SYSFUNC, see the “How to expand the number of available SAS functions within the macro language was published on SAS Users.