September 30, 2022
 

Recently I was on an email thread where someone asked how to do a swimmer plot in SAS Visual Analytics. People replied with other approaches that use SAS code. Though there is not a standard swimmer plot in VA, I thought it might be possible to create one with a custom graph. So I decided to give it a try.

I reviewed some related materials about the swimmer plot and discovered a useful blog post by my colleague Sanjay Matange. His post also provides SAS code to generate the data used by the swimmer plot, which simplifies things. My next step was to create a swimmer plot template in SAS Graph Builder and draw the plot in SAS Visual Analytics.

Create swimmer plot template

This will be done in SAS Graph Builder, where I will create a custom graph template named ‘Swimmer Plot’.

The composition of the swimmer plot template

To make the Swimmer plot template, I use two schedule charts and four scatter plots, as shown below.

The template is made up of the following charts:

a) Schedule Chart 1 will draw the High/Low bar representing the duration of each subject. It also needs to indicate the type of disease stage – Stage 1, 2, 3, 4.

b) Schedule Chart 2 will draw the Start/End line representing the duration of each response of each subject. It also needs to indicate the type of response - Complete or Partial.

c) Scatter Plot 1 will be used for the Start event, and Scatter Plot 2 for the End event.

d) Scatter Plot 3 will be used to indicate the Durable responder, and Scatter Plot 4 will show if the response is a Continued response.

Creating the swimmer plot template

In SAS Graph Builder, drag the above plots one by one to the work area, and then apply the settings listed for each object in the ‘Options’ pane.

Next, we need to define roles for these plots in the ‘Roles’ pane.

1 - In the ‘Shared Roles’ section, click the toolbox icon next to the role ‘Shared Role 1’. Edit the role to update the Role Name to ‘Item’ and click the OK button.

2 - In the ‘Schedule Chart 1’ section:

a) Click the toolbox icon next to the role ‘Schedule Chart 1 Start’. Edit the role to update the Role Name to ‘Low’ and click the OK button.

b) Click the toolbox icon next to the role ‘Schedule Chart 1 Finish’. Edit the role to update the Role Name to ‘High’ and click the OK button.

c) Add a role by clicking the ‘+ Add Role’ link, update the Role Name to ‘Stage’, leave the Role Type as ‘Group’, check the ‘Required’ checkbox, and click the OK button.

3 - In the ‘Schedule Chart 2’ section:

a) Click the toolbox icon next to the role ‘Schedule Chart 2 Start’. Edit the role to update the Role Name to ‘Start’ and click the OK button.

b) Click the toolbox icon next to the role ‘Schedule Chart 2 Finish’. Edit the role to update the Role Name to ‘Endline’ and click the OK button.

c) Add a role by clicking the ‘+ Add Role’ link, update the Role Name to ‘Status’, leave the Role Type as ‘Group’, check the ‘Required’ checkbox, and click the OK button.

4 - In the ‘Scatter Plot 1’ section, click the toolbox icon next to ‘Scatter Plot 1 X’, and select ‘Create Shared Role with Another Role’ > ‘Start’. Update the Role Name to ‘Start’ and click the OK button.

5 - In the ‘Scatter Plot 2’ section, click the toolbox icon next to ‘Scatter Plot 2 X’, select ‘Edit Role’, update the Role Name to ‘End’, and click the OK button.

6 - In the ‘Scatter Plot 3’ section, click the toolbox icon next to ‘Scatter Plot 3 X’, select ‘Edit Role’, update the Role Name to ‘Durable’, and click the OK button.

7 - In the ‘Scatter Plot 4’ section, click the toolbox icon next to ‘Scatter Plot 4 X’, select ‘Edit Role’, update the Role Name to ‘Continued’, and click the OK button.

Now I am done creating the template. Save it as ‘Swimmer Plot’ in ‘My Folder’.

Prepare the data for the swimmer plot

I generated the data set from Sanjay's swimmer plot code and updated the missing values in the ‘Stage’ column. This avoids missing values being shown in VA. I put the generated CSV file here. Next, I need to prepare the data so it can be used directly to draw the swimmer plot in VA.

1 - Change the Classification of the ‘item’ column from Measure to Category, as shown below.

2 - Create a custom sort for the ‘item’ column. Right-click the ‘item’ column in the ‘Data’ pane and select ‘Custom sort…’ from the menu. In the ‘Add Custom Sort’ pop-up window, click the ‘Add all’ icon to have all the items sorted as shown below.

3 - Create a calculated item named ‘Continued’ as shown below. Its expression is IF ( 'highcap'n NotMissing ) RETURN ( 'high'n + 0.2 ) ELSE . (that is, the item is missing when 'highcap'n is missing). The equivalent DATA step logic is sketched below.
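If you prefer to prepare this column in SAS code before loading the data into VA, the same logic can be written in a DATA step. This is a minimal sketch of mine, not part of the original post; the data set name swimmer is a hypothetical placeholder, and the column names come from the calculated-item expression above.

/* A minimal sketch (my addition): compute the 'Continued' marker position in a DATA step */
data swimmer_prepared;
   set swimmer;                                            /* hypothetical input data set    */
   if not missing(highcap) then Continued = high + 0.2;    /* offset the marker past the bar */
   else Continued = .;                                     /* missing: no continued response */
run;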

That’s all for the data preparation.

Create the swimmer plot in VA

We will first import the ‘Swimmer Plot’ template. In SAS Visual Analytics, go to the ‘Object’ pane and click the toolbox icon. Select ‘Import custom graph…’ from the pop-up menu and choose ‘Swimmer Plot’ in the Open dialog. Click the OK button to import the graph template we just created. Now the ‘Swimmer Plot’ will be listed in the ‘Graph’ section of the ‘Object’ pane.

Next, drag the ‘Swimmer Plot’ object to the canvas and assign the corresponding data columns to the roles; SAS Visual Analytics will then render the swimmer plot. To show more legends for the marks in the plot, I use an Image object. I put the ‘Swimmer Plot’ and the legend image in a Precision container. Now we see the chart as shown below.

Summary

With SAS Graph Builder, we created the swimmer plot template using two schedule charts and four scatter plots. After importing the template into SAS Visual Analytics, we can create the swimmer plot easily by assigning the corresponding data roles.

How to draw a swimmer plot in SAS Visual Analytics was published on SAS Users.

September 30, 2022
 

SAS is excited to announce our inaugural Customer Appreciation Awards. We want to give a big “thank you” and a round of applause to all our SAS customers and partners around the globe who help us change the world through analytics. We want to recognize a few of you for [...]

Congrats to our 2022 user recognition award winners was published on SAS Voices by Kevin Scanlon

September 29, 2022
 

Crises like the COVID-19 pandemic have increased the demand for public health experts who possess advanced analytics skills. After all, data – when properly collected, analyzed and understood – has immense power to inform decision-making. And in areas like public health, informed decision making can save lives. Azhar Nizam has [...]

Statistically savvy graduates sought by leading health organizations was published on SAS Voices by Lori Downen

September 28, 2022
 

The moments of a continuous probability distribution are often used to describe the shape of the probability density function (PDF). The first four moments (if they exist) are well known because they correspond to familiar descriptive statistics:

  • The first raw moment is the mean of a distribution. For a random variable, this is the expected value. It can be positive, negative, or zero.
  • The second central moment is the variance. The variance is never negative; in practice, the variance is strictly positive.
  • The third standardized moment is the skewness. It can be positive, negative, or zero. A distribution that has zero skewness is symmetric.
  • The fourth standardized moment is the kurtosis. Whereas the raw kurtosis, which is used in probability theory, is always positive, statisticians most often use the excess kurtosis. The excess kurtosis is 3 less than the raw kurtosis. Statisticians subtract 3 because the normal distribution has raw kurtosis of three, and researchers often want to compare whether the tail of a distribution is thicker than the normal distribution's tail (a positive excess kurtosis) or thinner than the normal distribution's tail (a negative excess kurtosis).

The purpose of this article is to point out a subtle point. There are three common definitions of moments: raw moments, central moments, and standardized moments. The adjectives matter. When you read a textbook, article, or software documentation, you need to know which definition the author is using.

A cautionary tale

If you re-read the previous list, you will see that it is traditional to use the RAW moment for the mean, the CENTRAL moment for the variance, the STANDARDIZED moment for the skewness, and the STANDARDIZED moment (sometimes subtracting 3) for the kurtosis. However, researchers can report other quantities. Recently, I read a paper in which the author reported formulas for the third and fourth moments of a distribution. I assumed that the author was referring to the standardized moments, but when I simulated data from the distribution, my Monte Carlo estimates for the skewness and excess kurtosis did not agree with the author's formulas. After many hours of checking and re-checking the formulas and my simulation, I realized that the author's formulas were for the central moments (not standardized). After I converted the formulas to report standardized moments (and the excess kurtosis), I was able to reconcile the published results and my Monte Carlo estimates.

Definitions of raw moments

For a continuous probability distribution for density function f(x), the nth raw moment (also called the moment about zero) is defined as
\(\mu_n^\prime = \int_{-\infty}^{\infty} x^n f(x)\, dx\)
The mean is defined as the first raw moment. Higher-order raw moments are used less often. The superscript on the symbol \(\mu_n^\prime\) is one way to denote the raw moment, but not everyone uses that notation. For the first raw moment, both the superscript and the subscript are often dropped, and we use \(\mu\) to denote the mean of the distribution.
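As a quick worked example (my addition, not part of the original article), consider the exponential distribution with rate parameter \(\lambda\), which has density \(f(x) = \lambda e^{-\lambda x}\) for \(x \geq 0\). Its first raw moment is
\(\mu_1^\prime = \int_{0}^{\infty} x \, \lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}\)
which is the familiar mean of the exponential distribution.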

For all moments, you should recognize that the moments might not exist. For example, the Student t distribution with ν degrees of freedom does not have finite moments of order ν or greater. Thus, you should mentally add the phrase "when they exist" to the definitions in this article.

Definitions of central moments

The nth central moment for a continuous probability distribution with density f(x) is defined as
\(\mu_n = \int_{-\infty}^{\infty} (x - \mu)^n f(x)\, dx\)
where \(\mu\) is the mean of the distribution. The most famous central moment is the second central moment, which is the variance. The second central moment, \(\mu_2\), is usually denoted by \(\sigma^2\) to emphasize that the variance is a positive quantity.

Notice that the central moments of even orders (2, 4, 6,...) are always positive since they are defined as the integral of positive quantities. It is easy to show that the first central moment is always 0.
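For completeness, here is the short computation (my addition), which uses the definition of the mean and the fact that the density integrates to 1:
\(\mu_1 = \int_{-\infty}^{\infty} (x - \mu) f(x)\, dx = \int_{-\infty}^{\infty} x f(x)\, dx - \mu \int_{-\infty}^{\infty} f(x)\, dx = \mu - \mu = 0\)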

The third and fourth central moments are used as part of the definition of the skewness and kurtosis, respectively. These moments are covered in the next section.

Definitions of standardized moments

The nth standardized moment is defined by dividing the nth central moment by the nth power of the standard deviation:
\({\tilde \mu}_n = \mu_n / \sigma^n\)
For the standardized moments, we have the following results:

  • The first standardized moment is always 0.
  • The second standardized moment is always 1.
  • The third standardized moment is the skewness of the distribution.
  • The fourth standardized moment is the raw kurtosis of the distribution. Because the raw kurtosis of the normal distribution is 3, it is common to define the excess kurtosis as \({\tilde \mu}_4 - 3\).

A distribution that has a negative excess kurtosis has thinner tails than the normal distribution. An example is the uniform distribution. A distribution that has a positive excess kurtosis has fatter tails than the normal distribution. An example is the t distribution. For more about kurtosis, see "Does this kurtosis make my tail look fat?"
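If you want to check which convention your software uses, a quick simulation helps. The following program is my addition (it is not part of the original article): it draws a large sample from the standard normal distribution, and because PROC MEANS reports the sample skewness and the excess kurtosis, both statistics should be close to 0.

/* A minimal sketch (my addition): for a large normal sample, the sample skewness
   and the excess kurtosis reported by PROC MEANS should both be near 0 */
data Normal;
   call streaminit(123);
   do i = 1 to 100000;
      x = rand("Normal");
      output;
   end;
run;
 
proc means data=Normal mean var skewness kurtosis;
   var x;
run;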

Moments for discrete distributions

Similar definitions exist for discrete distributions. Technically, the moments are defined by using the notion of the expected value of a random variable. Loosely speaking, you can replace the integrals by summations. For example, if X is a discrete random variable with a countable set of possible values {x1, x2, x3,...} that have probabilities {p1, p2, p3, ...} of occurring (respectively), then the nth raw moment for X is the sum
\(E[X^n] = \sum_i x_i^n p_i\)
and the nth central moment is
\(E[(X-\mu)^n] = \sum_i (x_i-\mu)^n p_i\)
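To make the discrete definitions concrete, the following DATA step (my addition, not part of the original article) computes the raw, central, and standardized moments of a fair six-sided die, for which each face has probability 1/6:

/* A minimal sketch (my addition): moments of a fair six-sided die, p_i = 1/6 */
data DieMoments;
   array x[6] _temporary_ (1 2 3 4 5 6);   /* possible values */
   p = 1/6;                                /* probability of each value */
   mu = 0;  mu2 = 0;  mu3 = 0;  mu4 = 0;
   do i = 1 to 6;
      mu = mu + x[i]*p;                    /* first raw moment (the mean) */
   end;
   do i = 1 to 6;
      mu2 = mu2 + (x[i]-mu)**2 * p;        /* second central moment (the variance) */
      mu3 = mu3 + (x[i]-mu)**3 * p;        /* third central moment */
      mu4 = mu4 + (x[i]-mu)**4 * p;        /* fourth central moment */
   end;
   sigma = sqrt(mu2);
   skewness = mu3 / sigma**3;              /* third standardized moment (0, by symmetry) */
   excessKurtosis = mu4 / sigma**4 - 3;    /* fourth standardized moment, minus 3 */
   keep mu mu2 skewness excessKurtosis;
run;
 
proc print data=DieMoments noobs; run;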

Summary

I almost titled this article, "Will the real moments please stand up!" The purpose of the article is to remind you that "the moment" of a probability distribution has several possible interpretations. You need to use the adjectives "raw," "central," or "standardized" to ensure that your audience knows which moment you are using. Conversely, when you are reading a paper that discusses moments, you need to determine which definition the author is using.

The issue is complicated because the common descriptive statistics refer to different definitions. The mean is defined as the first raw moment. The variance is the second central moment. The skewness and kurtosis are the third and fourth standardized moments, respectively. When using the kurtosis, be aware that most computer software reports the excess kurtosis, which is 3 less than the raw kurtosis.

The post Definitions of moments in probability and statistics appeared first on The DO Loop.

September 26, 2022
 

The correlations between p variables are usually displayed by using a symmetric p x p matrix of correlations. However, sometimes you might prefer to see the correlations listed in "long form" as a three-column table, as shown to the right. In this table, each row shows a pair of variables and the correlation between them. I call this the "list format."

This article demonstrates two ways to display correlations in a list format by using Base SAS. First, this article shows how to use the FISHER option in PROC CORR to create a table that contains the correlations in a list. Second, this article shows how to use the DATA step to convert the symmetric matrix into a simple three-column table that displays the pairwise statistics between variables. This is useful when you have a matrix of statistics that was produced by a procedure other than PROC CORR.

The FISHER option to create a table of pairwise correlations

If you want to get the correlations between variables in a three-column table (instead of a matrix), use the FISHER option in PROC CORR. The FISHER option is used to produce confidence intervals and p-values for a hypothesis test, but it also presents the statistics in a column. As usual, you can use the ODS OUTPUT statement to save the table to a data set. The following call to PROC CORR analyzes a subset of the Sashelp.Iris data.

proc corr data=sashelp.iris(where=(Species="Setosa")) FISHER;  /* FISHER ==> list of Pearson correlations */
   var _numeric_;
   ods output FisherPearsonCorr = CorrList;       /* Optional: Put the correlations in a data set */
run;
 
proc print data=CorrList noobs; 
   var Var WithVar Corr;
run;

The output is shown at the top of this article. Use this trick when you want to output Pearson (or Spearman) correlations in list format.

However, this trick does not work for other correlation statistics (such as Kendall's or Hoeffding's), nor for statistics that are produced by a different procedure. For example, the CORRB option on the MODEL statement in PROC REG produces a matrix of correlations between the regression parameters in the model. It does not provide a special option to display the correlations in a list format. The next section shows how to use the DATA step to convert the statistics from wide form (a symmetric matrix) to long form (the list format).
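The following call (my addition; the model is merely illustrative) shows the kind of output that motivates the conversion: the CORRB option prints the correlations of the parameter estimates only as a matrix.

/* Illustrative example (my addition): CORRB displays the correlations between
   the regression parameter estimates in matrix form only */
proc reg data=sashelp.cars plots=none;
   model MPG_Highway = EngineSize Horsepower Weight / CORRB;
quit;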

Generate the correlations between all variables

If you have a symmetric matrix stored in a SAS data set, you can use the DATA step to convert the matrix elements from wide form to long form. This section uses a correlation matrix as an example, but the technique also applies to covariance matrices, distance matrices, or other symmetric matrices. If the diagonal elements are constant (they are 1 for a correlation matrix), you need to convert only the upper-triangular elements of the symmetric matrix. If the diagonal elements are not constant, you should include them in the output.

The following statements compute the Pearson correlation for all numeric variables for a subset of the Sashelp.Iris data. The OUTP= option writes the statistics to the CorrSym data set:

proc corr data=sashelp.iris(where=(Species="Setosa")) outp=CorrSym;
   var _numeric_;
run;

The structure of a TYPE=CORR data set is documented, but it is easy enough to figure out the structure by printing the data set, as follows:

proc print data=CorrSym; run;

I have highlighted certain elements in the data set that are used in the next section. The goal is to display a list of the correlations inside the triangle.

Convert from a correlation matrix to a pairwise list

Notice that the correlations are identified by the rows where _TYPE_="CORR". Notice that the only numeric columns are the p variables that contain the correlations. You can therefore use a WHERE clause to subset the rows and use the _NUMERIC_ keyword to read the columns into an array. You can then iterate over the upper-triangular elements of the correlation matrix, as follows:

/* Convert a symmetric matrix of correlations into a three-column table.
   This is a "wide to long" conversion, but we need to display only the upper-triangular elements.
 
   Input: a dense symmetric matrix p x p matrix with elements C[i,j] for 1 <= i,j <= p
   Output: pairwise statistics by using the upper triangular elements.
           The output columns are  Var1 Var2 Correlation
*/
data SymToPairs;
  set CorrSym(where=(_type_='CORR'));  /* subset the rows */
  Var1 = _NAME_;
  length Var2 $32;
  array _cols (*) _NUMERIC_;     /* columns that contain the correlations */
  do _i = _n_ + 1 to dim(_cols); /* iterate over STRICTLY upper-triangular values  */
     Var2 = vname(_cols[_i]);    /* get the name of this column           */
     Correlation = _cols[_i];    /* get the value of this element         */
     output;
  end;
  keep Var1 Var2 Correlation;
run;
 
proc print data=SymToPairs noobs; run;

Success! The correlations which were in the upper-triangular portion of a p x p matrix (p=4 for this example) are now in the p(p-1)/2 rows of a table. Notice that the DATA step uses the VNAME function to obtain the name of a variable in an array. This is a useful trick.

This trick will work for any symmetric matrix in a data set. However, for statistics that are produced by other procedures, you must figure out which rows to extract and which variables to include on the ARRAY statement. If you are converting a matrix that does not have a constant diagonal, you can write the DO loop as follows:

  do _i = _n_ to dim(_cols); /* iterate over diagonal AND upper-triangular values  */
  ...
  end;

If you include the diagonal elements in the output, the list format will contain p(p+1)/2 rows. Try modifying the DATA step for the previous example to include the diagonal elements. You should get a table that has 10 rows. The correlation of each variable with itself is 1.

A rectangular array of correlations

The preceding sections show one way to convert a symmetric matrix of correlations into a list. You can use this technique when you have used the VAR statement in PROC CORR to compute all correlations between variables. Another common computation is to compute the correlations between one set of variables and another set. This is done by using the VAR and WITH statements in PROC CORR to define the two sets of variables. The result is a rectangular matrix of correlations. The matrix is not symmetric and does not contain 1s on the diagonal. Nevertheless, you can slightly modify the previous program to handle this situation:

proc corr data=sashelp.cars noprob outp=CorrRect;
   var EngineSize HorsePower MPG_Highway;
   with Length Wheelbase;
run;
 
/* Convert a rectangular matrix of correlations into a three-column table.   
   Input: an (r x p) dense matrix of statistics
   Output: pairwise statistics in a table: Var1 Var2 Correlation
*/
data RectToPairs;
  set CorrRect(where=(_type_='CORR'));
  Var1 = _NAME_;
  length Var2 $32;
  array _cols (*) _NUMERIC_;
  do _i = 1 to dim(_cols);     /* iterate over ALL values       */
     Var2 = vname(_cols[_i]);  /* get the name of this column   */
     Correlation = _cols[_i];  /* get the value of this element */
     output;
  end;
  keep Var1 Var2 Correlation;
run;
 
proc print data=RectToPairs noobs; run;

In this case, the original matrix is 2 x 3, and the converted list has 6 rows.

Summary

If you use PROC CORR in SAS to compute Pearson or Spearman correlations, you can use the FISHER option to display the correlations in list format. If you have a symmetric matrix of statistics from some other source, you can use a DATA step to display the upper-triangular elements in a list. You can choose to include or exclude the diagonal elements.

If you use the VAR and WITH statements in PROC CORR, you obtain a rectangular matrix of correlations, which is not symmetric. You can use the DATA step to display all elements in list format.

A table of pairwise statistics can be useful for visualizing the correlations in a bar chart or for highlighting correlations that are very large or very small. It is also possible to go the other way: From a list of all pairwise correlations, you can construct the symmetric correlation matrix.
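For example, one way to rebuild the matrix (a sketch of my own; the original post does not show this step) is to append the mirrored pairs and the unit diagonal to the SymToPairs data set and then transpose:

/* A minimal sketch (my addition): rebuild a symmetric correlation matrix from the list */
data Pairs;
  length Var1 Var2 $32;
  set SymToPairs                                  /* upper-triangular pairs           */
      SymToPairs(rename=(Var1=Var2 Var2=Var1));   /* mirrored lower-triangular pairs  */
run;
 
proc sql;                      /* unit diagonal: the correlation of a variable with itself is 1 */
  create table Diag as
  select distinct Var1, Var1 as Var2, 1 as Correlation
  from Pairs;
quit;
 
data AllPairs;  set Pairs Diag; run;
proc sort data=AllPairs;  by Var1 Var2; run;
 
proc transpose data=AllPairs out=CorrMatrix(drop=_NAME_);
  by Var1;
  id Var2;
  var Correlation;
run;
 
proc print data=CorrMatrix noobs; run;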

The post Display correlations in a list format appeared first on The DO Loop.

September 23, 2022
 

You’re down by 10 points in your NFL fantasy football league, and you need to choose a wide receiver from the free agency pool because your starter was injured. How do you decide to get the 11 points required for a win? What methods will you use to lead you [...]

3 advanced metrics to help you sustain a winning fantasy football team was published on SAS Voices by Caslee Sims

September 22, 2022
 

Even with today's technology, it's hard to know precisely when, where and how weather-related damage will occur. Flooding costs are expected to rise drastically over the next 20 years and climate change is a constant threat. Unfortunately, natural disasters are here to stay, but we can try our best to [...]

Predicting the uncontrollable: Using AI to help prevent flooding damages was published on SAS Voices by Olivia Ojeda

September 22, 2022
 

Whether working as a business analyst, data scientist or machine learning engineer, one thing remains the same – making an impact with data and AI is what really matters. Pre-processing and exploring data, building and deploying models and turning those scoring values into an actionable insight can be overwhelming. A [...]

3 ways to enhance productivity with AI was published on SAS Voices by Briana Ullman

September 21, 2022
 

The noncentral t distribution is a probability distribution that is used in power analysis and hypothesis testing. The distribution generalizes the Student t distribution by adding a noncentrality parameter, δ. When δ=0, the noncentral t distribution is the usual (central) t distribution, which is a symmetric distribution. When δ > 0, the noncentral t distribution is positively skewed; for δ < 0, the noncentral t distribution is negatively skewed. Thus, you can think of the noncentral t distribution as a skewed cousin of the t distribution.

SAS software supports the noncentrality parameter in the PDF, CDF, and QUANTILE functions. This article shows how to use these functions for the noncentral t distribution. The RAND function in SAS does not directly support the noncentrality parameter, but you can use the definition of a random noncentral t variable to generate random variates.

Visualize the noncentral t distribution

The classic Student t distribution contains a degree-of-freedom parameter, ν. (That's the Greek letter "nu," which looks like a lowercase "v" in some fonts.) For small values of ν, the t distribution has heavy tails. For larger values of ν, the t distribution resembles the normal distribution. The noncentral t distribution uses the same degree-of-freedom parameter.

The noncentral t distribution also supports a noncentrality parameter, δ. The simplest way to visualize the effect of δ is to look at its probability density function (PDF) for several values of δ. The support of the PDF is all real numbers, but most of the probability is close to x = δ. You can use the PDF function in SAS to compute the PDF for various values of the noncentrality parameter. The fourth parameter for the PDF("t",...) call is the noncentrality value. It is optional and defaults to 0 if not specified.

The following visualization shows the density functions for positive values of δ and positive values of x. In the computer programs, I use DF for the ν parameter and NC for the δ parameter.

/* use the PDF function to visualize the noncentral t distribution */
%let DF = 6;
data ncTPDFSeq;
df = &DF;                        /* degree-of-freedom parameter, nu */
do nc = 4, 6, 8, 12;             /* noncentrality parameter, delta */
   do x = 0 to 20 by 0.1;        /* most of the density is near x=delta */
      PDF = pdf("t", x, df, nc);
      output;
   end;
end;
label PDF="Density";
run;
 
title "PDF of Noncentral t Distributions";
title2 "DF=&DF";
proc sgplot data=ncTPDFSeq;
   series x=x y=PDF / group=nc lineattrs=(thickness=2);
   keylegend / location=inside across=1 title="Noncentral Param" opaque;
   xaxis grid; yaxis grid;
run;

The graph shows the density functions for δ = 4, 6, 8, and 12 for a distribution that has ν=6 degrees of freedom. You can see that the modes of the distributions are close to (but a little less than) δ when δ > 0. For negative values of δ, the functions are reflected across x=0. That is, if f(x; ν, δ) is the pdf of the noncentral t distribution with parameter δ, then f(-x; ν, -δ) = f(x; ν, δ).

The CDF and quantile function of the noncentral t distribution

If you change the PDF call to a CDF call, you obtain a visualization of the cumulative distribution function for various values of the noncentrality parameter, δ.
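For reference, here is that modification (my addition); it mirrors the PDF program above and reuses the same macro variable for the degrees of freedom.

/* use the CDF function to visualize the noncentral t distribution (my addition) */
data ncTCDFSeq;
df = &DF;                        /* degree-of-freedom parameter, nu */
do nc = 4, 6, 8, 12;             /* noncentrality parameter, delta */
   do x = 0 to 20 by 0.1;
      CDF = cdf("t", x, df, nc);
      output;
   end;
end;
label CDF="Cumulative Probability";
run;
 
title "CDF of Noncentral t Distributions";
title2 "DF=&DF";
proc sgplot data=ncTCDFSeq;
   series x=x y=CDF / group=nc lineattrs=(thickness=2);
   keylegend / location=inside across=1 title="Noncentral Param" opaque;
   xaxis grid; yaxis grid;
run;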

The quantile function is important in hypothesis testing. The following DATA step finds the quantile that corresponds to an upper-tail probability of 0.05. This would be a critical value in a one-sided hypothesis test where the test statistic is distributed according to a noncentral t distribution.

%let NC = 4;
data CritVal;
do alpha = 0.1, 0.05, 0.01;
   tCritUpper = quantile("T", 1-alpha, &DF, &NC);
   output;
end;
run;
 
proc print data=CritVal noobs; run;

The table shows the critical values of a noncentral t statistic for a one-sided hypothesis test at the α significance level for α=0.1, 0.05, and 0.01. A test statistic that is larger than the critical value would lead you to reject the null hypothesis at the given significance level.

Random variates from the noncentral t distribution

Although the RAND function in SAS does not support a noncentrality parameter for the t distribution, it is simple to generate random variates. By definition, a noncentral t random variable, T(ν, δ), is the ratio of a normal variate that has mean δ and unit variance and a scaled chi-distributed variable. If Z ~ N(δ,1) is a normal random variable and V ~ χ2(ν) is a chi-squared random variable with ν degrees of freedom, then the ratio T(ν, δ) = Z / sqrt(V / ν) is a random variable from a noncentral t distribution.

/* Rand("t",df) does not support a noncentrality parameter. Use the definition instead. */
data ncT;
df = &DF;
nc = &NC;
call streaminit(12345);
do i = 1 to 10000;
   z = rand("Normal", nc);   /* Z ~ N(nc, 1)    */
   v = rand("chisq", df);    /* V ~ ChiSq(df)   */
   t = z / sqrt(v/df);       /* T ~ NCT(df, nc) */
   output;
end;
keep t;
run;
 
title "Random Sample from Noncentral t distribution";
title2 "DF=&DF; nc=&NC";
proc sgplot data=ncT noautolegend;
   histogram t;
   density t / type=kernel;
   xaxis max=20;
run;

The graph shows a histogram for 10,000 random variates overlaid with a kernel density estimate. The density is very similar to the earlier graph that showed the PDF for the noncentral t distribution with ν=6 degrees of freedom and δ=4.

Summary

The noncentral t distribution is a probability distribution that is used in power analysis and hypothesis testing. You can think of the noncentral t distribution as a skewed t distribution. SAS software supports the noncentral t distribution by using an optional argument in the PDF, CDF, and QUANTILE functions. You can generate random variates by using the definition of a noncentral t random variable, which is the ratio of a normal variate and a scaled chi-distributed variable.

The post The noncentral t distribution in SAS appeared first on The DO Loop.

September 20, 2022
 

SAS automation using Windows batch scripts

Let’s consider the following ubiquitous scenario. You have created a SAS program that you want to run automatically on schedule on a SAS server under the Microsoft Windows operating system.

If you have SAS® Enterprise BI Server or SAS® BI Server and Platform Suite for SAS, you can do the scheduling using the Schedule Manager plug-in of SAS Management Console.

But even if you only have SAS Foundation installed on a Windows server (or any Windows machine), you can automate/schedule your SAS program to run in batch using the Windows Task Scheduler. In order to do that, you would need a Windows batch file.

What are Windows batch files?

A Windows batch file (or batch script) is a text file consisting of a series of Windows commands executed by the command-line interpreter. It can be created using a plain text editor such as Notepad or Notepad++ and saved as a text file with the .bat file name extension. (Do not use a word processor like Microsoft Word for batch script writing!)

Historically, batch files originated in DOS (Disk Operating System), but modern versions of Windows continue to support them. Batch files get executed when you double-click them in Windows File Explorer or type their name in the Command Prompt window and press Enter.

Even though Windows batch files are not as powerful as PowerShell scripts or Unix/Linux shell scripts, they are still quite versatile and useful for automating processes that run SAS in batch mode.

Besides simply running a bunch of OS commands sequentially, one after another, batch scripts can be more sophisticated. For example, they can use environment variables (which are similar to SAS macro variables). They can use functions and formats. They can capture the exit code from one command (or SAS program) and then, depending on its value, conditionally execute a command or a group of commands (which might include another SAS program or another batch script). They can even do looping.

They can submit a SAS program to run in batch mode and pass a parameter string to it.

They can control where a SAS program saves its log and generate a dynamic log file name by adding a datetime stamp suffix to it.

Windows batch script examples

Let’s explore a couple of examples.

Example 1. Simple script running SAS program in batch mode

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=%proj%\%name%.sas
set log=%proj%\logs\%name%.log
%sas% -sysin %pgm% -log %log% -nosplash -nologo -icon

Here, we define several environment variables (sas, proj, name, pgm, log) and reference these variables by surrounding their names with %-signs, as %variable-name% (analogous to SAS macro variables, which are defined by %let mvar-name = mvar-value; and referenced as &mvar-name).

This script will initiate a SAS session in batch mode that executes your SAS program mysasprogram.sas and outputs the SAS log as the file mysasprogram.log.

Example 2. Running SAS program in batch and date/time stamping SAS log

Batch scripts are often used to run SAS programs repeatedly at different times. In order to preserve the SAS log for each run, we can assign unique names to the log files by suffixing them with a date/time. For example, instead of saving the SAS log with the same name, mysasprogram.log, we can dynamically generate unique names for the SAS log, e.g. mysasprogram_YYYYMMDD_HHMMSS.log, where YYYYMMDD_HHMMSS indicates the date (YYYYMMDD) and time (HHMMSS) of the run. This will effectively keep all SAS logs and indicate when (date and time) each log file was created. Here is a Windows batch script that does it:

: generate suffix dt=YYYYMMDD_HHMMSS, pad HH with leading 0
set z=%time: =0%
set dt=%date:~-4%%date:~4,2%%date:~7,2%_%z:~0,2%%z:~3,2%%z:~6,2%

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=%proj%\%name%.sas
set log=%proj%\logs\%name%_%dt%.log
%sas% -sysin %pgm% -log %log% -nosplash -nologo -icon

Windows batch scripts with conditional execution

Let’s enhance our script by adding the following functionality:

  • Capture the exit code from mysasprogram.sas (exit code 0 means there are no ERRORs or WARNINGs).
  • If the exit code is not equal to 0, conditionally execute another SAS program, my_error_email.sas, which sends an email to designated recipients informing them that mysasprogram.sas failed (a successful-execution email can be sent from mysasprogram.sas itself).

One would expect that this can be achieved by adding the following scripting code to Example 2 above:

: capture exit code from sas
set exitcode=%ERRORLEVEL%

: generate email if ERROR and/or WARNING
if not %exitcode% == 0 (
   set ename=my_error_email
   set epgm=%proj%\Programs\%ename%.sas
   set elog=%proj%\SASLogs\%ename%_%dt%.log
   %sas% -sysin %epgm% -log %elog% -nosplash -nologo -icon -sysparm %log%
)

However, you might be in for a big surprise (I was!) when you discover that the my_error_email.sas program runs regardless of whether exitcode is equal to 0 or not. How is that possible?

It turns out that Windows script environment variable references in the form %variable-name% do not resolve at execution time the way SAS macro variable references (&mvar-name) or Unix/Linux script variable references ($variable-name) do. They resolve during the initial script parsing, before the run-time exitcode is evaluated. As a result, all the commands within the parentheses of the IF command (including the SAS session kickoff) are resolved and executed unconditionally.

Initially, DOS (and later Windows) scripts were implemented without conditional execution (the IF command) and looping (the FOR command), and their environment variable references were resolved during script parsing. Later, when the scripting language was extended to include conditional execution and looping, the developers decided to keep the original behavior of the %variable-name% references intact, but added a new form of environment variable reference, !variable-name!, which surrounds variable names with exclamation marks. They called it "delayed expansion" because it fundamentally alters the variable-reference behavior, causing the references to resolve (expand) at execution time rather than at parse time.

The following scripting command enables delayed expansion:

SetLocal EnableDelayedExpansion

We can place this SetLocal command right before the IF section and replace the variable references in it with !variable-name!. Alternatively, for consistency, we can place SetLocal EnableDelayedExpansion at the beginning of the script and replace all environment variable references with !variable-name!. In the latter case, all our variables will be resolved at execution time. Here is the final script:

SetLocal EnableDelayedExpansion

: generate suffix dt=YYYYMMDD_HHMMSS, pad HH with leading 0
set z=!time: =0!
set dt=!date:~-4!!date:~4,2!!date:~7,2!_!z:~0,2!!z:~3,2!!z:~6,2!

set sas=D:\SAS94\sashome\SASFoundation\9.4\sas.exe
set proj=C:\Projects\SAS_to_Excel
set name=mysasprogram
set pgm=!proj!\!name!.sas
set log=!proj!\logs\!name!_!dt!.log
!sas! -sysin !pgm! -log !log! -nosplash -nologo -icon

: capture exit code from sas
set exitcode=!ERRORLEVEL!

: generate email if ERROR and/or WARNING
if not !exitcode! == 0 (
   set ename=my_error_email
   set epgm=!proj!\Programs\!ename!.sas
   set elog=!proj!\SASLogs\!ename!_!dt!.log
   !sas! -sysin !epgm! -log !elog! -nosplash -nologo -icon -sysparm !log!
)

Notice how we pass the log name of the failed mysasprogram.sas to the my_error_email.sas program:

-sysparm !log!

This log name can be captured in the my_error_email.sas program by using the SYSPARM automatic macro variable:

%let failed_log = &sysparm;

Then, it can be used either to attach that log file to the automatically generated email or, at the very least, to provide its path and name in the email body.
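For completeness, here is a minimal sketch of what my_error_email.sas could look like (my addition, not part of the original post). It assumes that email has been configured for your SAS session (for example, via the EMAILSYS and EMAILHOST system options), and the recipient address is a placeholder:

/* A minimal sketch (my addition) of my_error_email.sas */
%let failed_log = &sysparm;              /* log name passed in via -sysparm        */
 
filename outmail email
   to="oncall@example.com"               /* hypothetical recipient address         */
   subject="SAS batch job failed"
   attach="&failed_log";                 /* attach the log of the failed program   */
 
data _null_;
   file outmail;
   put "The scheduled SAS program failed. See the attached log:";
   put "&failed_log";
run;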

Job scheduling

With the Windows batch script file in place, you can easily schedule and run your SAS program in batch mode on a SAS machine that has just SAS Foundation installed by using Microsoft Windows Task Scheduler. In order to do that, you would need to specify your script's fully qualified name in the Windows Task Scheduler (Create Task → New Action → Program/script field) as shown below:

Creating a new Task in Windows Task Scheduler

Then you would need to specify (add) new Trigger(s) that ultimately define the scheduling rules:

Setting up a new Trigger to define scheduling rules

That’s all. You can now sleep well while your job runs at 3:00 am.

Questions? Thoughts? Comments?

Do you have questions, concerns, comments or use other ways of automating SAS jobs? Please share with us below in the Comments section.


Automating SAS processes using Windows batch files was published on SAS Users.