Viya

4月 112022
 

Many discussions and articles about SAS Viya emphasize its ability to handle Big Data, perform parallel processing, and integrate open-source languages. These are important issues for some SAS customers. But for customers who program in SAS and do not have Big Data, SAS Viya is attractive because it is the development environment in which SAS R&D adds new features, options, and algorithms. Since 2018, most new development has been in SAS Viya. This article discusses one new feature in the SAS IML product: new state-of-the-art algorithms for solving ordinary differential equations (ODEs) that are formulated as initial value problems (IVPs).

In statistics, differential equations arise in several areas, including compartment models in the field of pharmacokinetics. The example in this article is a compartment model, although it is from epidemiology, not pharmacokinetics.

Solving ODEs in SAS

The SAS/IML product introduced the ODE subroutine way back in SAS 6. The ODE subroutine solves IVPs by using an adaptive, variable-order, variable-step-size integrator. This integrator is very good for a wide assortment of problems. The integrator was state-of-the-art in the 1980s, but researchers have made a lot of progress in solving ODEs since then.

The new ODEQN subroutine was introduced in Release 2021.1.5 of SAS IML software. See the documentation for a full description of the features of the ODEQN subroutine. Briefly, it has a simple calling syntax and supports about 36 different integration methods, including the following:

  • BDF: backward differentiation formulas for stiff problems
  • Adams-Moulton method for non-stiff problems
  • Hybrid (BDF + Adams): Automatic detection of stiffness and switching
  • Explicit Runge-Kutta methods: Popular solvers in the sciences and engineering fields

Each class of solvers supports several "orders," which equate to low- or high-order Taylor series approximations.

The new ODEQN subroutine has several advantages over the previous solver, but the following features are noteworthy:

  • Self-starting: the subroutine automatically chooses an initial step size
  • Variable-order methods use higher-order approximations to take larger steps, when possible
  • Automatically detect stiffness (Hybrid method)
  • Support piecewise-defined vector fields
  • Support backward-in-time integration
  • BLAS-aware linear algebra scales to handle large systems

Example: A basic SIR model in epidemiology

Many introductory courses in differential equations introduce the SIR model in epidemiology. The SIR model is a simple model for the transmission and recovery of a population that is exposed to an infectious pathogen. More sophisticated variations of the basic SIR model have been used by modelers during the COVID-19 pandemic. The SIR model assumes that the population is classified into three populations: Susceptible to the disease, Infected with the disease, and Recovered from the disease.

In the simplest SIR model, there are no births and no deaths. Initially, almost everyone is susceptible except for a small proportion of the population that is infected. The susceptible population can become infected by interacting with a member of the infected population. Everyone in the infected population recovers, at which point the model assumes that they are immune to reinfection. The SIR model has two parameters. One parameter (b) controls the average rate at which a susceptible-infected interaction results in a new infected person. Another parameter (k) represents how quickly an infected person recovers and is no longer infectious.

The following SAS IML module defines the SIR equations, which model the proportion of the population that is susceptible, infected, and recovered. Initially, 99.99% of the population is susceptible, 0.01% is infected, and 0% is recovered. The new ODEQN subroutine in SAS Viya solves the system of equations for 100 units of time. The output of the routines is a 100 x 3 matrix. The i_th row is the proportion of the population in each category after i units of time have passed. Notice that the syntax of the ODEQN subroutine is quite simple: You specify the module that defines the equations, the initial conditions, and the time points at which you want a solution.

/* Program requires SAS IML in SAS Viya 2021.1.5 */
proc iml;
/* SIR model: https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology */
start SIR(t, y) global(b, k);
   S = y[1]; I = y[2]; R = y[3];
   dSdt = -b * S*I;
   dIdt =  b * S*I - k*I;
   dRdt =            k*I;
   return ( dSdt || dIdt || dRdt );
finish;
 
/* set parameters and initial conditions */
b = 0.5;      /* b param: transition rate from S to I */
k = 1/14;     /* k param: transition rate from I to R */
IC = {0.9999  0.0001   0};   /* S=99.99% I=0.01%  R=0 */
t = T(0:100);  /* compute spread for each day up to 100 days */
 
call odeqn(soln, "SIR", IC, t);  /* in SAS Viya only */

You can graph the solution of this system by plotting the time-evolution of the three populations:

/* write solution to SAS data set */
S = t || soln;
create SIROut from S[c={'Time' 'Susceptible' 'Infected' 'Recovered'}];
append from S;
close;
QUIT;
 
title "Solution to SIR Model";
title2 "b = 1/2; k = 1/14";
proc sgplot data=SIROut;
   series x=Time y=Susceptible;
   series x=Time y=Infected;
   series x=Time y=Recovered;
   xaxis grid; yaxis grid label="Percent of Group";
run;

For the parameters (b, k) = (0.5, 0.07), the infected population rapidly increases until Day 25 when almost 60% of the population is infected. By this time, the population of susceptible people is only about 10% of the population. For these parameters, nearly everyone in the population becomes infected by Day 60. The highly infectious (but not lethal) epidemic is essentially over by Day 80 when almost everyone has been infected, recovered, and is now immune.

Example: An SIR model for a change in behavior

One of the advantages of mathematical models is that you can simulate "what if" scenarios. For example, what if everyone in the population changes their behavior on Day 20 to reduce the transmission rate? This might include the adoption of non-pharmaceutical interventions such as mask-wearing, social distancing, and quarantining the infected. What if these behaviors change the effective rate of transmission from 0.5 (very high) to 0.1 (moderate)? You can modify the module that defines the ODE to reflect that parameter change at t=20. As mentioned earlier, the ODEQN subroutine can integrate piecewise-defined systems if you specify the time at which the system changes its behavior. The following call to the ODEQN subroutine uses the TDISCONT= option to tell the integrator that the system of ODEs changes at t=20.

%let TChange = 20;
start SIR(t, y) global(k);
   if t < &TChange then b = 0.5;   /* more spread */
   else                 b = 0.1;   /* less spread */
 
   S = y[1]; I = y[2]; R = y[3];
   dSdt = -b * S*I;
   dIdt =  b * S*I - k*I;
   dRdt =            k*I;
   return ( dSdt || dIdt || dRdt );
finish;
 
/* set parameters and initial conditions */
k = 1/14;     /* k param: transition rate from I to R */
IC = {0.9999  0.0001   0};   /* S=99.99% I=0.01%  R=0 */
t = T(0:100);  /* compute spread for each day up to 100 days */
call odeqn(soln, "SIR", IC, t) TDISCON=&TChange;

If you plot the new solutions, you obtain the following graph:

In the new scenario, the increase in the infected population is reversed by the intervention on Day 20. From there, the proportion of infected persons decreases over time. Eventually, the disease dies out. By Day 100, 75% of the population was infected and recovered, leaving 25% that are still susceptible to a future outbreak.

Summary

Although moving to SAS Viya offers many advantages for SAS customers who use Big Data and want to leverage parallel processing, Viya also provides new features for traditional SAS programmers. One example is the new ODEQN subroutine in SAS IML, which solves ODEs that are initial value problems by using state-of-the-art numerical algorithms. In addition to computational improvements in accuracy and speed, the ODEQN subroutine has a simple syntax and makes many decisions automatically, which makes the routine easier for non-experts to use.

The post New methods for solving differential equations in SAS appeared first on The DO Loop.

2月 282022
 

An experienced SAS programmer recently switched to SAS Viya and asked how to discover what products are available on his version of Viya. We discussed a few older SAS 9 procedures, and I showed him a new Viya-specific way to get information about his version of SAS and his licensed products.

This article discusses the getLicensedProductInfo action, which is a new way to obtain information about all licensed products in SAS Viya. In addition, it discusses what you will see if you run the following SAS 9 procedures or statements in Viya:

  • The SYSVER and SYSVLONG macro variables. Did you know that there are newer macro variables that provide information about your SAS Viya version? Below, I discuss the SYSVIYAVERSION macro variable.
  • The SETINIT procedure.
  • The PRODUCT_STATUS procedure, which is not available in SAS Viya.

Since I am not an expert in SAS Administration, I invite comments from those who are more knowledgeable than I am.

The getLicensedProductInfo action

SAS Viya is built to work with many programming clients: SAS, Python, R, Lua, and so on. Accordingly, you can get the product information without using SAS-specific calls such as the SETINIT procedure. The getLicensedProductInfo action provides a client-agnostic way to obtain information about licensed products. You can call an action from any client language. From the SAS client, you can call actions by using PROC CAS. For example, if you want to display the products that are available in your Viya session, you can use the following statements:

cas;                 /* connect to CAS session */
 
proc cas;
   getLicensedProductInfo;
quit;

The output will contain every product that is available in your session. There are a total of eight columns in the output table, but only the first four columns are shown.

What version of SAS are you running?

A common inquiry on discussion forums is "What version of SAS are you running?" This is important because the person asking a question might be running an old version of SAS that does not contain newer features. In SAS 9, you can use the SYSVLONG system macro variables to discover the version of SAS 9 that you are running. (The SYSVLONG and SYSVLONG4 variables are similar: SYSVLONG displays a two-digit year such as 21 whereas SYSVLONG4 displays a four-digit year such as 2021.) For example, a SAS 9 programmer might submit the following statement, which displays the version information in the log:

%put In SAS 9: &=sysvlong4;

In SAS 9, the SAS log shows the following information:

In SAS 9: SYSVLONG4=9.04.01M6P11072018

The SYSVLONG4 string tells you that the SAS release is 9.4 ("9.04") and you are running the M6 release, which was built on 07NOV2018. You might wonder what the system macro variable contains if you submit the same statement on the SAS client that comes with SAS Viya:

%put In SAS Viya: &=sysvlong4;

In SAS Viya, the SAS log shows the following information:

In SAS Viya: SYSVLONG4=V.04.00M0P02142022

The output shows that the version is "V.04" and the build date was 14FEB2022. You can interpret "V.04" as "Viya 4," which is a cloud-native version of Viya.

If you know that you are running SAS Viya, there are two additional SAS macros that provide information about your Viya release: SYSVIYARELEASE and SYSVIYAVERSION, as follows:

%put In SAS Viya: &=SYSVIYARELEASE;
%put In SAS Viya: &=SYSVIYAVERSION;

The SAS log shows the following information:

In SAS Viya: SYSVIYARELEASE=20220221.1645427074741
In SAS Viya: SYSVIYAVERSION=Stable 2021.2.4

The value of SYSVIYARELEASE is a date (21FEB2022) followed by additional numbers that are related to the release. The value of SYSVIYAVERSION can be in two forms:

  • Stable YYYY.Major.Minor. For example, "Stable 2021.2.4" tells you that your version of Viya updates monthly (a "Stable" release) and that the cadence is version 6 of the 2nd release that occurred in 2021.
  • LTS YYYY.Release. For example, "LTS 2021.2" tells you that your version of Viya updates every six months (a "long-term stable" or LTS release) and that this is the 2nd release that occurred in 2021.

SAS 9 procedures

If you are a SAS 9 programmer, you might have used two SAS procedures that provide information about your SAS release. The first procedure is PROC SETINIT, which continues to be supported in SAS Viya. The following call to PROC SETINIT shows information that is similar to the output from the getLicensedProductInfo action:

proc setinit; 
run;

The SAS log displays the following information:

---Base SAS Software                             08MAY2022 (CPU A) 
---SAS/STAT                                      08MAY2022 (CPU A) 
---SAS/GRAPH                                     08MAY2022 (CPU A) 
---SAS/ETS                                       08MAY2022 (CPU A) 
---SAS/OR                                        08MAY2022 (CPU A) 
---SAS/IML                                       08MAY2022 (CPU A) 
<--- and many other lines --->

Although PROC SETINIT continues to work, it provides less information than the getLicensedProductInfo action, so I recommend using the action instead.

A second SAS 9 procedure that some customers use is PROC PRODUCT_STATUS. In SAS 9, the output from this procedure is similar to the PROC SETINIT output, but the output also includes version information for each product. For example, the output might tell you that the SAS/STAT version is 15.2 and the SAS/GRAPH version is 9.4_M5. This method of naming versions is not used in SAS Viya, so PROC PRODUCT_STATUS is no longer supported. If you try to run the procedure in SAS Viya, the log will contain the following error message:

ERROR: Procedure PRODUCT_STATUS not found.

Using the SYSVER macro variable in SAS macro programming

One use of the SYSVER macro variable is to check the SAS version to ensure that certain language features (procedures or options) are present. For example, I have seen SAS macros that include the following logic:

/* macro logic to test whether the version of SAS is 9.4 or greater */
%if %sysevalf(&sysver < 9.4) %then %do;
   %put SAS 9.4 or later is required.  Terminating.;
   %goto exit;
%end;

This snippet of code will exit the macro unless the SAS version is at least 9.4 or later. What will happen if you run this macro code in SAS Viya? Thankfully, the code continues to work properly!

To understand why the logic continues to work when SYSVER contains a value such as "V.04," first recall that the %IF condition compares STRINGS, not numbers. If the SYSVER variable is V.04 (or any string that begins with "V"), the logical expression %sysevalf(&sysver < 9.4) is false. This is because the ASCII value for "V" is 86 whereas the ASCII value for "9" is 57. You can confirm this logic yourself by running the following macro statements on any version of SAS 9 or SAS Viya. In SAS 9.4M5 or later, you don't even need to embed the code inside a macro. You can run %IF-%THEN statements in open code:

/* prior to SAS 9.4M5, you need to wrap this code inside a macro call */
%if %sysevalf(V.04 < 9.4) %then %do;
   %put Your version of SAS is less than 9.4;   /* What happens if &SYSVER resolves to V.xx? */
%end;
%else %do;
   %put Your version of SAS is SAS 9.4 or later;
%end;

This code snippet always prints "Your version of SAS is SAS 9.4 or later." Consequently, the SYSVER macro variable in SAS Viya has a value that is greater than 9.4 (or any other set of numbers).

Summary

In SAS Viya, you can use the getLicensedProductInfo action to obtain information about licensed products. Some of the older SAS 9 procedures (such as PROC SETINIT) continue to work in Viya, although others (such as PROC PRODUCT_STATUS) are deprecated. The SYSVER and SYSVLONG macro variables are supported, although in SAS Viya their values begin with "V." To discover your version of SAS Viya, use the SYSVIYAVERSION macro variable.

The post How to find release and licensing information in SAS Viya? appeared first on The DO Loop.

1月 242022
 

How can you estimate percentiles in SAS Viya? This article shows how to call the percentile action from PROC CAS to estimate percentiles of variables in a CAS data table. You can use the same technique to estimate weighted percentiles. Percentiles and quantiles are essentially the same (the pth quantile is the 100*pth percentile for p in [0, 1]), so this article also shows how to estimate quantiles in SAS Viya.

I previously mentioned that some SAS procedures are CAS-enabled, which means that they will call a CAS action that runs on the CAS server if you specify a CAS table on the DATA= option. However, I also mentioned that some Base SAS procedures are "hybrid," which means that they might run on the CAS server or on the SAS Compute Server. But be careful: some Base SAS procedures (including the DATA step) look at the options that you specify to decide whether they can process the data on the CAS server. If not, they will pull the data down from the server to the SAS client and compute the result there. This is probably not what you want. It is inefficient to copy data from the CAS server if you can perform the computation on the server where the data are.

PROC MEANS is a hybrid procedure

A good example of a hybrid procedure is PROC MEANS. If the data are in a CAS table, PROC MEANS will look at the options that you specify. If you request basic descriptive statistics (N, MIN, MAX, MEAN, STD, etc), the procedure will call the aggregate action and compute the statistics in CAS. However, for some procedure options (and, I confess, I don't know which ones) the procedure decides that the requested statistics should be computed on the SAS client and uses the CAS LIBNAME engine to access the data. This results in reading the data from the CAS table and performing the computation on the SAS client.

Let's construct an example for which PROC MEANS computes the statistics in CAS. The following program loads the data in the Sashelp.Cars data set into a CAS table. It then calls PROC MEANS to compute descriptive statistics:

cas;                /* connect to CAS server */
libname mylib cas;  /* active caslib, whatever it is */
 
proc casutil;
   load data=Sashelp.Cars casout='Cars' replace; /* load data into the active caslib */
quit;
 
proc means data=mylib.Cars nolabels min mean max;   /* runs action on CAS server */
   class Origin;
   vars Cylinders MPG_City;
run;

The procedure outputs a familiar table that displays the mean, min, and max statistics for two variables, each grouped according to levels of the classification variable. In addition, the procedure displays the following note in the log:

NOTE: The CAS aggregation.aggregate action will be used to perform the initial summarization.

This note informs you that the computation took place on the CAS server. It also tells you that PROC MEANS used the aggregate action to perform the computation.

The behavior of PROC MEANS changes if you request percentiles. For some reason (unknown to me), the percentiles are not computed on the CAS server. Instead, the data are read from the CAS table by using the CAS libname engine, and the computation takes place on the SAS client:

proc means data=mylib.Cars nolabels P25 P50 P75;  /* run entirely on the SAS compute server */
   class Origin;
   vars Cylinders MPG_City;
run;

In addition to the output, the procedure writes the following note to the log:

NOTE: There were 428 observations read from the data set MYLIB.CARS.

This note implicitly tells you that the computation did not occur on the CAS server.

Computing percentiles on the CAS server

Although PROC MEANS does not automatically compute percentiles on the CAS server, you can use a CAS action to estimate the percentiles. By definition, an action always computes on the CAS server. I often look at the "Action Sets by Name" documentation when I am trying to find an action that will perform an analysis. In this case, you can search that page for the word "percentile" and find the documentation for the syntax of the percentile action. There are separate tabs for calling the action from SAS (the "CASL syntax"), from Lua, from Python, and from R. This article uses the CASL syntax, which tells you how to call the action from PROC CAS in SAS.

Let's run a few examples. You can use the percentile action (in the percentile action set) to compute percentiles of variables in CAS tables. Recall that you can call an action by specify its full two-level name (ActionSet.ActionName) each time, or by using the LOADACTIONSET statement to load the actions into your CAS session. After loading the action, you can refer to the action by using only its name.

The following call to PROC CAS loads the percentile action set, then calls the percentile action. The table= parameter is required. It is used to specify the CAS table that contains the data. Optionally, you can use additional parameters to specify the variables in the analysis, the percentiles to estimate, and more. The following call is similar to the previous PROC MEANS call. It estimates the same percentiles for the same variables.

/* load the percentile action set and make a basic call */
proc cas;
   loadactionset 'percentile';           /* load the action set */
   percentile /                          /* call the percentile action */
      table={name="Cars",                    /* name of table (in active caslib) */
             vars={"Cylinders" "MPG_City"},  /* name of analysis variables */
             groupby={"Origin"}              /* (optional) name of classification variables */
            }
      values={25 50 75}                  /* specify the percentiles */
      ;                                  /* end of syntax for the action */
run;

The output from the percentile action is in "long form" rather than "wide form," but the estimates are the same.

You might notice that the output contains a column labeled "Converged," and that the rows display "Yes." By default, the percentile uses an iterative process (method="ITERATIVE") to estimate the percentiles. The documentation states that the iterative process "is very memory efficient to compute percentiles." If you want to run a different estimation method, you can change some parameters. Most importantly, you can use the values= parameter to specify any percentiles values in [0, 100]. (By convention, the 0th percentile is the sample minimum and the 100th percentile is the sample maximum.) For example, the following statements use the default estimation method that PROC MEANS uses and add additional values to the list of percentiles.

/* You can specify other options such as percentiles and method */
proc cas;
   percentile /                          /* call the percentile action */
      table={name="Cars",                    /* name of table (in active caslib) */
             vars={"Cylinders" "MPG_City"}   /* name of analysis variables */
            }
      values={10  17.5  50  82.5  90}    /* specify the percentiles */
      pctldef = 5                        /* choice of estimand */
      method = "Exact"                   /* method for estimation */
      ;                                  /* end of syntax for the action */
run;

The percentile action also supports a weight= parameter, which you can use to specify a weight variable to compute weighted percentiles.

Summary

This article shows how to call the percentile action from PROC CAS to compute percentiles of variables in a CAS data table. The percentile action can analyze multiple variables and can estimate any percentiles that you specify. You can use the groupby= parameter inside the table= specification to estimate the percentiles for joint levels of categorical variables, which is similar to using the CLASS statement in PROC MEANS. The action also supports a weight= parameter for specifying a weighted variable.

An action will always perform its computations on the CAS server (where the data are). In contrast, some Base SAS procedures are "hybrid" procedures that may or may not compute on the CAS server. Consequently, I prefer to call actions when I need to ensure that the computations will be performed in CAS.

The post Estimate percentiles in SAS Viya appeared first on The DO Loop.

1月 102022
 

On this blog, I write about a diverse set of topics that are relevant to statistical programming and data visualization. In a previous article, I presented some of the most popular blog posts from 2021. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst. However, I also write articles that discuss more advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles deserve a second look. I have grouped them into four categories: SAS programming, statistics, optimization, and data simulation.

SAS programming

For years, I've been writing articles about how to accomplish tasks in Base SAS. In addition, I now write about how to program basic tasks in SAS Viya.

Use the DOLIST Syntax to Specify Tick Marks on a Graph

Probability and statistics

A Family of Density Curves for the Inverse Gamma Distribution

Probability and statistics provide the theoretical basis for modern data analysis. You cannot understand data science, machine learning, or artificial intelligence without understanding the basic principles of statistics. Most readers are familiar with common probability distributions, Pearson correlation, and least-squares regression. The following articles discuss some of the lesser-known statistical cousins of these topics:

  • The inverse gamma distribution: To use any probability distribution, you need to know four essential functions: the density, the cumulative probability, the quantiles, and how to generate random variates. You might be familiar with the gamma distribution, but what is the inverse gamma distribution and how do you define the four essential functions in SAS? Similarly, what is the generalized gamma distribution?
  • The Hoeffding D statistic: The Hoeffding D statistic measures the association between two variables. How do you compute the Hoeffding D statistic in SAS, and how do you interpret the results?
  • Weibull regression: In ordinary least squares regression, the response variable is assumed to be modeled by a set of explanatory variables and normally distributed errors. If the error terms are Weibull distributed, you can estimate parameters for the Weibull distribution as part of a regression analysis, but you need to transform the regression estimates to obtain the usual Weibull parameters.

Optimization

Optimization is at the heart of many statistical techniques, such as maximum likelihood estimation. But sometimes you need to solve a "pure" optimization problem. SAS supports many methods for solving optimization problems:

Multivariate simulation and bootstrapping

It is straightforward to simulate univariate data. It is harder to simulate multivariate data while preserving important relations between the variables, such as correlation. Similarly, it can be challenging to analyze the bootstrap distribution of a multivariate statistic, such as a correlation matrix:

The Geometry of the Iman-Conover Transformation

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2021? If so, leave a comment and tell me what topic you found interesting or useful.

The post 12 blog posts from 2021 that deserve a second look appeared first on The DO Loop.

11月 292021
 

When SAS 9 programmers transition to SAS Viya, there are inevitably questions about how new concepts in Cloud Analytic Services (CAS) relate to similar concepts in SAS. This article discusses the question, "What is the difference between a libref and a caslib?" Both are used to access data, but they are used in different situations. Briefly, a caslib can be used by any language that calls an action (SAS, Python, Lua, R, and so forth), whereas librefs are used only by procedures in the SAS language.

This article also discusses how to determine the active caslib, how to load data into a caslib, and how to read data from a caslib by using a CAS-enabled procedure in SAS.

An overview of the CAS server and client languages

I use the term "SAS 9" to refer to the traditional SAS workspace server that runs procedures in products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the "classic" SAS programming environment that existed before SAS Viya.

For an overview of the CAS server and client languages, see my article about CAS-enabled procedures. I use the SAS client (called the SAS compute server) to connect to the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills.

What is a libref?

SAS 9 programmers are used to working with librefs, which are defined by using the LIBNAME statement in SAS. In SAS 9, you use a libref to point to a data source, which can be a file directory (such as C:/MyData) or a database such as DB2, SQL Server, Oracle, and so forth. A libref enables you to read or write data by using a SAS procedure.

Some librefs are automatically available, such as WORK or SASUSER. Others might be pre-defined by your SAS administrator so that everyone on a team can access the same data. A SAS programmer can use the LIBNAME statement to create a personal librefs, which is a convenient way to organize data that belong to different projects.

The LIBNAME statement enables you to use a SAS "engine" to access data from a wide variety of data sources. An engine enables SAS procedures to process data from a non-SAS data format such as XML, JSON, Excel, or a database. This is one of the reasons companies invest in SAS: it enables programmers to concentrate on analyzing the data rather than worrying about how to read it.

Many SAS programmers are familiar with the idea of associating a libref with a database or schema. For example, to connect to an ORACLE database, you might run the following LIBNAME statement:

libname mydblib ORACLE user=wicklin;  /* librefs can connect to various data sources */

I mention this because there is a CAS LIBNAME engine that you can use to access CAS tables from SAS procedures in Viya. That means you can access CAS tables from any SAS procedure, just by assigning a libref. But what does the libref refer to? It refers to a caslib in a CAS session. Therefore, it is important to understand how to define and use a caslib, which you can think of as being analogous to a libref, except that a caslib exists on the CAS server, not on the SAS client. As you read through the next sections, you will see many similarities between caslibs and librefs.

What is a caslib?

Caslibs are discussed in the documentation SAS Cloud Analytic Services: Fundamentals. A caslib is associated with a data source and includes the connection information for the data source. It enables you to read and write in-memory tables from CAS. I like to think of it as a "server side" libref, except that you can use a caslib from any client language, not just from SAS.

What is the active caslib?

When you use the CAS statement to start a CAS session, CAS creates a personal caslib for you. The name of your personal caslib is usually CASUSER(userID), although you can usually abbreviate this as CASUSER. By default, this is the active caslib for the session. Here is a partial output of the log when I use the CAS statement to connect to a CAS server:

cas;                 /* connect to CAS session */
   NOTE: The session CASAUTO connected successfully to Cloud Analytic Services <name> using port 5570. <snip>
       The user is frwick and the active caslib is CASUSER(frwick).
Notice that when you connect to a CAS session, the log tells you the name of the active caslib. For me, the log states "the active caslib is CASUSER(frwick)." The log also tells me the session name, which is CASAUTO in this example.

The active caslib is the default location to read and write an in-memory data set. If you store data in the active caslib, you don't need to specify the name of the caslib when you call most actions. By default, the actions look in the active caslib to find the data.

What other caslibs are defined?

Your administrator might have given you access to other predefined caslibs. You can use the CASLIB statement to display all available caslibs in the log, as follows:

caslib _all_ list;   /* display all caslibs to log */
   NOTE: Session = CASAUTO Name = CASUSER(frwick)
         Type = PATH
         Description = Personal File System Caslib
         Path = /cas/data/caslibs/casuserlibraries/frwick/
         Definition = 
         Subdirs = Yes
         Local = No
         Active = Yes
         Personal = Yes
   NOTE: Session = CASAUTO Name = Public
         Type = PATH
         Description = Shared and writeable caslib, accessible to all users.
         Path = /cas/data/caslibs/public/
         Definition = 
         Subdirs = No
         Local = No
         Active = No
         Personal = No

The log contains a list of all caslibs in the current session. One of the caslibs has the text "Active = Yes" as part of its description. That caslib is the active caslib. In this example, the active caslib is CASUSER.

There are different types of caslibs, which provide access to various data sources, such as DB2, Hadoop, Oracle, and Teradata, just to name a few. A complete list of data sources is provided in the documentation. I don't have experience with most of these data sources. I primarily use "Type = PATH" caslibs, which can store audio files, video files, SAS data sets, CSV files, and more. I also primarily use in-memory tables, but for a PATH caslib the "Path" field tells you the location (directory) in which CAS will save the data upon request.

In addition to my personal caslib, I also have access to a global caslib named 'Public'. You can change the active caslib by using the CASLIB= suboption on the CASSESSOPTS= option on the CAS statement. For example, the following statements change the active caslib in the CASAUTO session to 'Public', then immediately changes it back to my personal caslib.

cas CASAUTO sessopts=(caslib='public');    /* make the 'public' caslib active */
cas CASAUTO sessopts=(caslib='casuser');   /* make my personal caslib active */
   NOTE: 'Public' is now the active caslib.
   NOTE: 'CASUSER(frwick)' is now the active caslib.

How does a libref differ from a caslib?

  • A libref and a caslib both point to a data source.
  • The default libref is typically WORK or SASUSER. The analogous concept is the "active" caslib. By default, the active caslib is CASUSER, which is your personal caslib.
  • A libref is defined in that SAS language and is used only by SAS procedures. A caslib exists on the CAS server and can be used by any client language that can access CAS.
  • SAS 9 programmers often use the LIBNAME statements to define new librefs. In CAS, only authorized users can define a caslib (by using the CASLIB statement).
  • If you want to read a CAS table from a SAS procedure, you need to define a libref that points to a caslib. This is shown later in this article.

Load data to a caslib

You can upload data from the SAS client into a caslib by using the CASUTIL procedure. (You can also use the table.loadTable action.) The following statements upload the data in Sashelp.Cars (which is a SAS data set) to the active caslib in the current CAS session. Optionally, you can use the LIST statement to list the tables in that caslib:

proc casutil;
   load data=Sashelp.Cars casout='Cars' replace; /* use active caslib */
   list tables;                                  /* show tables in active caslib */
quit;

As mentioned earlier, when you interact with the active caslib, you don't need to explicitly specify it in most actions and procedures. If you want to upload the data to a different caslib, you can use the OUTCASLIB= option on the LOAD statement.

Read a CAS table by using a libref

If you want to use a SAS procedure to read data from a CAS in-memory table, then you need to define a libref by using the CAS LIBNAME engine. For example, the following statement defines a libref that refers to the active caslib:

libname myLib cas;     /* refers to active caslib, whatever it is */

Alternatively, you can use the CASLIB= option to specify a caslib:

libname mycasu cas caslib=casuser; /* refers to CASUSER, even if not active */

The mycasu libref refers to the CASUSER personal caslib, regardless of whether it is the active caslib.

You can use a libref to download or upload a CAS table by using the SAS DATA step. For example, the following DATA step is an alternative way to upload data from Sashelp.Cars into a table in the active caslib:

data myLib.Cars;       /* upload to active caslib */
   set Sashelp.Cars;   /* these data are on the SAS client */
run;

You can use the libref to read data into any SAS procedure. For example, the following statement computes basic statistics for some of the variables in the CAS in-memory data:

proc means data=myLib.Cars;       /* use libref to read CAS table */
   var EngineSize MPG_City Weight;
run;

PROC MEANS is a CAS-enabled procedure, so it calls an action to perform the computations for data that are in a CAS table. The log tells you that it calls the aggregate action:

NOTE: The CAS aggregation.aggregate action will be used to perform the initial summarization.

Not every CAS-enabled procedure tells you the name of the action that was called. Sometimes the log will only notify you that CAS was involved by displaying a note such as the following:

NOTE: The Cloud Analytic Services server processed the request in 0.0097 seconds.

Summary

In summary, this article discusses the concept of a caslib, including how to learn the names of the defined caslibs and the active caslib. Caslibs are used by all actions and procedures that interact with data table on CAS server. In contrast, a libref is the familiar SAS-only reference, which you can use in SAS procedures to read data from a variety of data sources, including a CAS table.

References

The post Caslibs and librefs in SAS Viya appeared first on The DO Loop.

11月 222021
 

I attended a seminar last week whose purpose was to inform SAS 9 programmers about SAS Viya. I could tell from the programmer's questions that some programmers were confused about three basic topics:

  • What are the computing environments in Viya, and how should a programmer think about them?
  • What procedures do you get when you order a programming-oriented Viya product such as SAS Visual Statistics or SAS Econometrics? Of these procedures, which are CAS-enabled?
  • If you have legacy SAS programs, can you still run them if your company migrates from SAS 9 to SAS Viya?

I am a programmer, so I thought it might be helpful for me to discuss these topics programmer-to-programmer. In a series of articles, I am going to discuss issues that a SAS statistical programmer might face when migrating to Viya from SAS 9. I use the term "SAS 9" to refer to the SAS Workspace Server that runs procedures in traditional products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the most recent version of the "classic" SAS programming environment. It is the version of SAS that existed before SAS Viya was created.

Clients and servers: Where do computations occur in SAS Viya?

In SAS 9, a procedure runs on the SAS Workspace Server. In SAS 9, the word "client," refers to a program such as Enterprise Guide (EG) or SAS Studio, which runs on a PC and submits code to the SAS Workspace Server. The server computes the results and sends tables and graphs back to the client, which displays them. Typically, the input and output data sets remain on the server.

You can think of SAS Viya as having two main components: the CAS server where the data are stored and the computations are performed, and support for several client languages. A client language enable you to connect to the CAS server and tell it what analyses you want to perform. So, in the world of Viya, "client" no longer refers to a GUI like EG, but to an entire programming environment such as SAS, Python, or R. The purpose of the client software is to connect to CAS, submit actions, and get back results. You then use the capabilities of the client language to display the results as a table or graph. For example, the SAS client uses ODS to display tables and graphs. In Python, you might use matplotlib to graph the results. In R, you might use ggplot. In all cases, you can also use the native capabilities of the client language (DATA step, Pandas, the tidyverse,....) to modify, aggregate, or enhance the output.

I use the SAS client to connect to and communicate with the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills. Others have written about how to use other clients (such as a Python client) to connect to CAS and to call CAS actions.

What procedures do you get when you order a programming-oriented Viya product?

When you purchase a product in SAS Viya, you get three kinds of computational capabilities:

  • Actions, which run on the CAS server. You can call an action from any client language.
  • CAS-enabled procedures, which are parsed on the SAS client but call CAS actions "under the covers."
  • Legacy SAS procedures that run on the SAS client, just as they do in SAS 9.

Obviously, the CAS-enabled and legacy procedures are only available on the SAS client.

To give an example, SAS Visual Statistic contains action sets (which contain actions), CAS-enabled procedures, and all the procedures in SAS/STAT. All procedures run on the SAS compute server, which is also called the SAS client. However, the CAS-enabled procedures call one or more actions that run on the CAS server, then display the results as ODS tables and graphics.

A CAS-enabled procedure performs very few computations on the client. In contrast, a legacy procedure that is not CAS-enabled performs all of its computations on the SAS client. It does not call any CAS actions. An example of a CAS-enabled procedure is the REGSELECT procedure, which performs linear regression with feature selection. It contains many of the features of the GLMSELECT procedure, which is a traditional regression procedure in SAS/STAT.

Can you run legacy programs in SAS Viya?

Naturally, SAS 9 statistical programmers want to make sure that their existing programs will run in Viya. That is why SAS Visual Statistics comes with the legacy SAS/STAT procedures. The same is true for SAS/ETS proceduires, which are shipped as part of SAS Econometrics. And the SAS IML product in Viya contains PROC IML, which runs on the SAS client, as well as the newer iml action, which runs in CAS.

So what happens if, for example, you call PROC REG in SAS and ask it to perform a regression on a SAS data set in the WORK libref? PROC REG will do what it has always done. It will run in the SAS environment. It will not run on the CAS server. It will not magically run faster than it used to in SAS 9. The performance of most legacy programs should be comparable to their performance in SAS 9.

There are some exceptions to that rule. Some SAS procedures have been enhanced and now perform better than their SAS 9 counterparts. For example, the SAS IML team has enhanced certain functions in PROC IML in SAS Viya so that they have better performance in SAS Viya than the SAS 9 version of the procedure. The SAS IML development team is focused exclusively on improving performance and adding features in SAS Viya, both PROC IML and the iml action.

Another exception is that some Base SAS procedures were enhanced so that they behave differently depending on the location of the data. Many Base SAS procedures are now hybrid procedures. If you tell them to analyze a CAS table, they will call an action, which runs in CAS, and retrieve the results. If you tell them to analyze a SAS data set, they will run on the SAS client, just like they do in SAS 9.

To make the situation more complicated, some of the legacy Base SAS procedures support features that are not supported in CAS. When you request an option that is not supported in CAS, the procedure will download the data from CAS into SAS and perform the computation on the client. This can be inefficient, so check the documentation before you start using legacy procedures to analyze CAS tables. As a rule, I prefer to use legacy procedures to analyze SAS data sets on the client; I use newer CAS-enabled procedures for analyzing CAS tables.

Summary

At a recent seminar for SAS 9 programmers, there were lots of questions about SAS Viya and what it means to for a SAS programmer to start programming in Viya. This article is the first of several articles that I intend to write for SAS programmers. I don't know everything, but I hope that other SAS programmers will join me in sharing what they have learned about the process of migrating from SAS 9 to SAS Viya.

If you are a SAS programmer who has general questions about SAS Viya, let me know what you are thinking by leaving a comment. I might not know the answer, but I'll try to find someone who does. For programming questions ("How do I do XYZ in Viya?"), post your question and sample data to the SAS Support Communities. There is a dedicated community for SAS Viya questions, so take advantage of the experts there.

The post What is a CAS-enabled procedure? appeared first on The DO Loop.

11月 112020
 

Data visualization has never been more widespread and consumed by a global audience as it has been this year with the Coronavirus pandemic. If you are interested in the statistics behind many of the numbers you see displayed in data visualizations then please reference my colleague’s blog series:

One visualization that is commonly used to display metrics of Coronavirus is a bar line chart where the bars display the actual values and the line is a moving average metric.

The screenshot below shows the number of web visits per week and a 5 week moving average value. I’ve named it Visits (5 Week Moving Avg Bracket) since I am including both past and present data points into this calculation.

I will demonstrate how to easily generate this moving average metric to use in your SAS Visual Analytics reports.

Bar Line With Moving Average

Moving Average

Moving average is essentially a block of data points averaged together, creating a series of averages for the data. This is commonly used with time series data to smooth out short-term fluctuations and highlight longer-term trends or cycles.

Traditionally, the moving average block takes the current data point and then moves forward in the series, which is often the case in financial applications. However, it is more common in science and engineering to take equal data points before and after the current position, creating what I am calling a moving average bracket.

SAS’ Visual Analytics calculation gives you the flexibility to define how to average the data points for your moving average: how many positions prior and/or after the current data point.

Steps to generate the Moving Average

From the Data pane, right-click on the measure you want to generate the moving average and select New calculation.

New Calculation

Next, in the Create Calculation window:

  • Name: Visits (5 Week Moving Avg Bracket) (a meaningful name of your choice)
  • Type: Moving average
  • Number of cells to average: 5 (or the number of your choice)

Click OK.

New Moving Average

Notice that you did not have the option to specify how many data points to include before and after the current position. To do that, we will right-click on the generated calculation and select Edit.

Edit Moving Average Scope

From the Edit Calculated Item window, make your starting and ending point position selections. The current position is denoted as zero. In my example, I want a 5 point moving average bracket, therefore I want to average the two positions before current, (-2), current, and then the two positions after current (2).

Aggregate Cells Average

Success. We now have generated our 5 point moving average bracket. You may be wondering, where do you specify the time frame? Granted I named this metric Visits (5 Week Moving Average Bracket) but I did not specify the week time metric anywhere. This is because, for this aggregated measure, the time duration is directly dependent on the data items you have assigned to the object.

In my example, for my web visit data, I want to look at the aggregation at the week level, therefore this Aggregate Cell measure will be grouped by week. I wanted to name my measure appropriately so that any user reading the visual knows the time frame that is being averaged.

If you are unsure of how you want to aggregate your time series, or you are allowing your report viewers the ability to change the role assignments through self-service reporting (see blog or YouTube for more information) then it would be best if you name your measure as 5 Point Moving Average Bracket and leave off the aggregation of the time series.

Bar Line Role Assignments

Let’s take a look at what the numbers are behind the scene. I’ve expanded the object to its Maximize mode so that I can see the summary table of the metrics that make up this object.

The data points will average until the full bracket size is met, then it will slide that bracket down the time series. Keep in mind that our bracket size is five points, two previous data points (-2), current data point (0), and two future data points (2).

Let’s look at week 01. Week 01 is our current, or zero, position. We do not have any prior data points so the only points that can be used from the bracket are positions 1 and 2. See the first highlighted yellow Visits (5 Week Moving Avg Bracket) and how it corresponds to the Visits.

Next, week 02 is able to include a -1 position into the bracket. The full moving average bracket isn’t met till week 03 where all of the data points: -2 through 2 are available. Then the moving average bracket slides along the rest of the time series.

Moving Average Bracket Details

Seeing the object’s summary data and the breakdown of how the moving average bracket is calculated for the data should drive home the fact that the time period aggregation is solely driven by the object’s role assignments.

In my second example, I am looking at retail data and this is aggregated to the day level.

By Day Example

I chose to demonstrate a 14 day moving average bracket.

14 Day Moving Average

Bar Line Chart

In both of my examples, I used the Dual axis bar-line chart object. Default Y axis behavior for bar charts is to start at zero, but line charts since they are intended to show trend can start at a value other than zero. Use the Options pane to select your desired line chart baseline. I choose to have both my bar and line charts fixed at zero. Explore the other Options to further customize your visual.

Bar Line Chart

Summary

Calculating the moving average is offered as a SAS Visual Analytics out-of-the-box Derived Item. See the SAS Documentation for more information about Derived Items.

Once you’ve specified the window of data points to average for the moving average you can go back and edit the starting and ending points around the current position. Remember that you will not be specifying a time frame for the aggregation but it will be dynamically determined based on the role assignments to the visual.

Learn more

If you are interested in other uses for the moving average operator Aggregate Cells take a look at this article: VA Report Example: Moving 30 Day Rolling Sum.

Also, check out these resources:

SAS Visual Analytics example: moving average was published on SAS Users.

9月 142020
 

My 2020 SAS Global Forum paper was about how to write custom parallel programs by using the iml action in SAS Viya 3.5. My conference presentation was canceled because of the coronavirus pandemic, but I recently recorded a 15-minute video that summarizes the main ideas in the paper.

One of the reasons I enjoy attending conferences is to obtain a high-level "mental roadmap" of a topic that I can leverage if I decide to study the details. Hopefully, this video will provide you with a roadmap of the iml action and its capabilities.


If your browser does not support embedded video, you can go directly to the video on YouTube.

The video introduces a few topics that I have written about in more detail:

For more details you can read the paper: Wicklin and Banadaki, 2020, "Write Custom Parallel Programs by Using the iml Action."

The post Video: How to Write a Custom Parallel Program in SAS Viya appeared first on The DO Loop.

8月 172020
 

I recently showed how to use simulation to estimate the power of a statistical hypothesis test. The example (a two-sample t test for the difference of means) is a simple SAS/IML module that is very fast. Fast is good because often you want to perform a sequence of simulations over a range of parameter values to construct a power curve. If you need to examine 100 sets of parameter values, you would expect the full computation to take about 100 times longer than one simulation.

But that calculation applies only when you run the simulations sequentially. If you run the simulations in parallel, you can complete the simulation study much faster. For example, if you have access to a machine or cluster of machines that can run 32 threads, then each thread needs to run only a few simulations. You can perform custom parallel computations by using the iml action in SAS Viya. The iml action is supported in SAS Viya 3.5.

This article shows how to distribute computations by using the PARTASKS function in the iml action. (PARTASKS stands for PARallel TASKS.) If you are not familiar with the iml action or with parallel processing, see my previous example that uses the PARTASKS function.

Use simulation to estimate a power curve

I previously showed how to use simulation to estimate the power of a statistical test. In that article, I simulated data from N(15,5) and N(16,5) distributions. Because the t test is invariant under centering and scaling operations, this study is mathematically equivalent to simulating from the N(0,1) and N(0.2,1) distributions. (Subtract 15 and divide by 5.)

Recall that the power of a statistical hypothesis test is the probability of making a Type 2 error. A Type 2 error is rejecting the null hypothesis when the alternative hypothesis is true. The power depends on the sample sizes and the magnitude of the effect you are trying to detect. In this case, the effect is the difference between two means. You can study how the power depends on the mean difference by estimating the power for a t test between data from an N(0,1) distribution and an N(δ, 1) distribution, where δ is a parameter in the interval [0, 2]. All hypothesis tests in this article use the α=0.05 significance level.

The power curve we want to estimate is shown to the right. The horizontal axis is the δ parameter, which represents the true difference between the population means. Each point on the curve is an estimate of the power of a two-sample t test for random samples of size n1 = n2 = 10. The curve indicates that when there is no difference between the means, you will conclude otherwise in 5% of random samples (because α=0.05). For larger differences, the probability of detecting the difference increases. If the difference is very large (more than 1.5 standard deviations apart), more than 90% of random samples will correctly reject the null hypothesis in the t test.

The curve is calculated at 81 values of δ, so this curve is the result of running 81 independent simulations. Each simulation uses 100,000 random samples and carries out 100,000 t tests. Although it is hard to see, the graph also shows 95% confidence intervals for power. The confidence intervals are very small because so many simulations are run.

So how long did it take to compute this power curve, which is the result of 8.1 million t tests? About 0.4 seconds by using parallel computations in the iml action.

Using the PARTASKS function to distribute tasks

Here's how you can write a program in the iml action to run 81 simulations across 32 threads. The following program uses four nodes, each with 8 threads, but you can achieve similar results by using other configurations. The main parts of the program are as follows:

  • The program assumes that you have used the CAS statement in SAS to establish a CAS session that uses four worker nodes, each with at least eight CPUs. For example, the CAS statement might look something like this:
        cas sess4 host='your_host_name' casuser='your_username' port=&MPPPort nworkers=4;
    The IML program is contained between the SOURCE/ENDSOURCE statements in PROC CAS.
  • The TTestH0 function implements the t test. It runs a t test on the columns of the input matrices and returns the proportion of samples that reject the null hypothesis. The TTestH0 function is explained in a previous article.
  • The "task" we will distribute is the SimTTest function. (This encapsulates the "main program" in my previous article.) The SimTTest function simulates the samples, calls the TTestH0 function, and returns the number of t tests that reject the null hypothesis. The first sample is always from the N(0, 1) distribution and the second sample is from the N(&delta, 1) distribution. Each call creates B samples. The input to the SimTTest function is a list that contains all parameters: a random-number seed, the sample sizes (n1 and n2), the number of samples (B), and the value of the parameter (delta).
  • The main program first sets up a list of arguments for each task. The i_th item in the list is the argument to the i_th task. For this computation, all arguments are the same (a list of parameters) except that the i_th argument includes the parameter value delta[i], where delta is a vector of parameter values in the interval [0, 2].
  • The call to the PARTASKS function is
         RL = ParTasks('SimTTest', ArgList, {2, 0});
    where the first parameter is the task (or list of tasks), the second parameter is the list of arguments to the tasks, and the third parameter specifies how to distribute the computations. The documentation of the PARTASKS function provides the full details.
  • After the PARTASKS function returns, the program processes the results. For each value of the parameter, δ, the main statistic is the proportion of t tests (p) that reject the null hypothesis. The program also computes a 95% (Wald) confidence interval for the binomial proportion. These statistics are written to a CAS table called 'PowerCurve' by using the MatrixWriteToCAS subroutine.
/* Power curve computation: delta = 0 to 2 by 0.025 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the iml action set           */
source TTestPower;
 
/* Helper function: Compute t test for each column of X and Y.
   X is (n1 x m) matrix and Y is (n2 x m) matrix.
   Return the number of columns for which t test rejects H0 */
start TTestH0(x, y);
   n1 = nrow(X);     n2 = nrow(Y);     /* sample sizes                      */
   meanX = mean(x);  varX = var(x);    /* mean & var of each sample         */
   meanY = mean(y);  varY = var(y);
   poolStd = sqrt( ((n1-1)*varX + (n2-1)*varY) / (n1+n2-2) );
 
   /* Compute t statistic and indicator var for tests that reject H0 */
   t = (meanX - meanY) / (poolStd*sqrt(1/n1 + 1/n2));
   t_crit =  quantile('t', 1-0.05/2, n1+n2-2);       /* alpha = 0.05        */
   RejectH0 = (abs(t) > t_crit);                     /* 0 or 1              */
   return  RejectH0;                                 /* binary vector       */
finish;
 
/* Simulate two groups; Count how many reject H0: delta=0 */
start SimTTest(L);                     /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   x = j(L$'n1', L$'B');               /* allocate space for Group 1        */
   y = j(L$'n2', L$'B');               /* allocate space for Group 2        */
   call randgen(x, 'Normal', 0,         1);   /* X ~ N(0,1)                 */
   call randgen(y, 'Normal', L$'delta', 1);   /* Y ~ N(delta,1)             */
   return sum(TTestH0(x, y));          /* count how many reject H0          */
finish;
 
/* ----- Main Program ----- */
numSamples = 1e5;
L = [#'delta' = .,   #'n1' = 10,  #'n2' = 10,
     #'seed'  = 321, #'B'  = numSamples];
 
/* Create list of arguments. Each arg gets different value of delta */
delta = T( do(0, 2, 0.025) );
ArgList = ListCreate(nrow(delta));
do i = 1 to nrow(delta);
   L$'delta' = delta[i];
   call ListSetItem(ArgList, i, L);
end;
 
RL = ParTasks('SimTTest', ArgList, {2, 0});  /* assign nodes before threads */
 
/* Summarize results and write to CAS table for graphing */
varNames = {'Delta' 'ProbEst' 'LCL' 'UCL'};  /* names of result vars        */
Result = j(nrow(delta), 4, .);
zCrit = quantile('Normal', 1-0.05/2);  /* zCrit = 1.96                      */
do i = 1 to nrow(delta);               /* for each task                     */
   p = RL$i / numSamples;              /* get proportion that reject H0     */
   SE = sqrt(p*(1-p) / L$'B');         /* std err for binomial proportion   */
   LCL = p - zCrit*SE;                 /* 95% CI                            */
   UCL = p + zCrit*SE;
   Result[i,] = delta[i] || p || LCL || UCL;
end;
 
call MatrixWriteToCas(Result, '', 'PowerCurve', varNames);
endsource;
iml / code=TTestPower nthreads=8;
quit;

You can pull the 81 statistics back to a SAS data set and use PROC SGPLOT to plot the results. You can download the complete program that generates the statistics and graphs the power curve.

Notice that there are 81 tasks but only 32 threads. That is not a problem. The PARTASKS tries to distribute the workload as evenly as possible. In this example, 17 threads are assigned three tasks and 15 threads are assigned two tasks. If T is the time required to perform one power estimate, the total time to compute the power curve will be approximately 3T, plus "overhead costs" such as the time required to set up the problem, distribute the parameters to each task, and aggregate the results. You can minimize the overhead costs by passing only small amounts of data across nodes.

Summary

In summary, if you have a series of independent tasks in IML, you can use the PARTASKS function to distribute tasks to available threads. The speedup can be dramatic; it depends on the time required for the thread that performs the most work.

This article shows an example of using the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute a sequence of computations among k threads that run concurrently and independently. In this example, k=32. Because the tasks are essentially identical, each thread computes 2/32 or 3/32 of the total work. The results from each task are returned in a list, where they can be further processed by the main program.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Estimate a power curve in parallel in SAS Viya appeared first on The DO Loop.

8月 032020
 

The iml action was introduced in Viya 3.5. As shown in a previous article, the iml action supports ways to implement the map-reduce paradigm, which is a way to distribute a computation by using multiple threads. The map-reduce paradigm is ideal for “embarrassingly parallel” computations, which are composed of many independent and essentially identical computations. I wrote an example program that shows how to run a Monte Carlo simulation in parallel by using the iml action in SAS Viya.

The map-reduce paradigm is often used when there is ONE task to perform and you want to assign a portion of the work to each thread. A different scenario is when you have SEVERAL independent tasks to perform. You can assign each task to a separate thread and ask each thread to compute an entire task. The iml action supports the PARTASKS function (short for "parallel tasks") for running tasks in separate threads. This article shows a simple example.

Three tasks run sequentially

Suppose you have three tasks. The first takes 2 seconds to run, the next takes 0.5 seconds, and the third takes 1 second. If you run a set of tasks sequentially, the total time is the sum of the times for the individual tasks: 2 + 0.5 + 1 = 3.5 seconds. In contrast, if you run the tasks in parallel, the total time is the maximum of the individual times: 2 seconds. Running the tasks in parallel saves time, and the speed-up depends on the ratio of the largest time to the total time, which is 2/3.5 in this example. Thus, the time for the parallel computation is 57% of the time for the sequential computation.

In general, the largest speedup occurs when the tasks each run in about the same time. In that situation (ignoring overhead costs), the time reduction can be as much 1/k when you run k tasks in k threads.

Let's run three tasks sequentially, then run the same tasks in parallel. Suppose you have a large symmetric N x N matrix and you need to compute three things: the determinant, the matrix inverse, and the eigenvalues. The following call to PROC IML runs the three tasks sequentially. The TOEPLITZ function is used to create a large symmetric matrix.

proc iml;
N = 2000;
A = toeplitz( (N:1)/100 );     /* N x N symmetric matrix */
 
det = det(A);                  /* get det(A) */
inv = inv(A);                  /* get inv(A) */
eigval = eigval(A);            /* get eigval(A) */

The previous computations are performed by directly calling three built-in SAS/IML functions. Of course, in practice, the tasks will be more complex than these. Almost surely, each task will be encapsulated into a SAS/IML module. To emulate that process, I am going to define three trivial "task modules." The following statements perform The same computations, but now each "task" is performed by calling a user-written module:

start detTask(A);
   return det(A);
finish;
start invTask(A);
   return inv(A);
finish;
start eigvalTask(A);
   return eigval(A);
finish;
 
det = detTask(A);              /* get det(A) */
inv = invTask(A);              /* get inv(A) */
eigval = eigvalTask(A);        /* get eigval(A) */

The syntax of the PARTASKS function

You can use the PARTASKS function to run the three tasks concurrently. To use the PARTASKS function, each task must be a user-defined SAS/IML function that has exactly one input argument and returns one output argument. For our example, the detTask, invTask, and eigvalTask modules satisfy these requirements. (If a task requires more than one input argument, you can pack the arguments into a list and pass the list as a single argument.)

The syntax of the PARTASKS function is
result = PARTASKS( Tasks, TaskArgs, Options );
where

  • Tasks is a character vector that names the SAS/IML functions. For our example, Tasks = {'detTask' 'invTask' 'eigvalTask'};
  • TaskArgs is a list of arguments. The i_th argument will be sent to the i_th module. In this example, each module takes the same matrix, so you can create a list that has three items, each being the same matrix: TaskArgs = [A, A, A];
  • Options is an (optional) vector that specifies how the tasks are distributed. The documentation of the PARTASKS function gives the full details, but the first element of the vector specifies how to distribute the computation to nodes and threads, and the second element specifies whether to print information about the performance of the tasks. For this example, you can choose options = {1, 1};, which will distribute tasks to threads on the controller node and print information about the tasks.

Parallel tasks

Suppose you are running SAS Viya and you use the have access to a machine that has at least four threads. You can use the CAS statement to define a session that will use only one machine (no worker nodes). The CAS statement might look something like this:
cas sess0 host='your_host_name' casuser='your_username' port=&YourPort; /* SMP, 0 workers */
Then you can run the PARTASKS function by making the following call to the iml action. If you are not familiar with the iml action, see the getting started example for the iml action.

proc cas;
session sess0;                         /* SMP session: controller node only    */
loadactionset 'iml';                   /* load the action set               */
source partasks;                       /* put program in SOURCE/ENDSOURCE block */
   start detTask(A);                   /* 1st task */
      return det(A);
   finish;
   start invTask(A);                   /* 2nd task */
      return inv(A);
   finish;
   start eigvalTask(A);                /* 3rd task */
      return eigval(A);
   finish;
 
   /* ----- Main Program ----- */
   N = 2000;
   A = toeplitz( (N:1)/100 );          /* N x N symmetric matrix */
   Tasks = {'detTask' 'invTask' 'eigvalTask'};
   Args = [A, A, A];                   /* each task gets same arg in this case */
   opt = {1,                           /* distribute to threads on controller  */
          1};                          /* display information about the tasks  */
   Results = ParTasks(Tasks, Args, opt); /* results are returned in a list     */
 
   /* the i_th list item is the result of the i_th task */
   det    = Results$1;                 /* get det(A)                           */
   inv    = Results$2;                 /* get inv(A)                           */
   eigval = Results$3;                 /* get eigval(A)                        */
endsource;
iml / code=partasks nthreads=4;
run;

The results are identical to the sequential computations in PROC IML. However, this program executes the three tasks in parallel, so the total time for the computations is the time required by the longest task.

Visualize the parallel tasks computations

You can use the following diagram to help visualize the program. (Click to enlarge.)

The diagram shows the following:

  • The CAS session does not include any worker nodes. Therefore, the iml action runs entirely on the controller, although it can still use multiple threads. This mode is known as symmetric multiprocessing (SMP) or single-machine mode. Notice that the call to the iml action (the statement just before the RUN statement) specifies the NTHREADS=4 parameter, which causes the action to use four threads. Because there are only three tasks, one thread remains idle.
  • The program defines the matrix, A, the vector of module names, and the list of arguments. When you call the ParTasks function, the i_th argument is sent to the i_th module on the i_th thread.
  • Each thread runs a different function module. Each function returns its result. The results are returned in a list. The i_th item in the list is the result of the call to the i_th function module.

Summary

This article demonstrates a simple call to the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute tasks among k threads. Each thread runs concurrently and independently. The total time for the computation depends on the longest time that any one thread spends on its computation.

This example is simple but demonstrates the main ideas of distributing tasks in parallel. In a future article, I will present a more compelling example.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Run tasks in parallel in SAS Viya appeared first on The DO Loop.