Viya

7月 132020
 

A previous article introduces the MAPREDUCE function in the iml action. (The iml action was introduced in Viya 3.5.) The MAPREDUCE function implements the map-reduce paradigm, which is a two-step process for distributing a computation to multiple threads. The example in the previous article adds a set of numbers by distributing the computation among four threads. During the "map" step, each thread computes the partial sum of one-fourth of the numbers. During the "reduce" step, the four partial sums are added to obtain the total sum. The same map-reduce framework can be used to implement a wide range of parallel computations. This article uses the MAPREDUCE function in the iml action to implement a parallel version of a Monte Carlo simulation.

Serial Monte Carlo simulation

Before running a simulation study in parallel, let's present a serial implementation in the SAS/IML language. In a Monte Carlo simulation, you simulate B random samples of size N from a probability distribution. For each sample, you compute a statistic. When B is large, the distribution of the statistics is a good approximation to the true sampling distribution for the statistic.

The example in this article uses simulation to approximate the sampling distribution of the sample mean. You can use the sampling distribution to estimate the standard error and to estimate a confidence interval for the mean. This example appears in the book Simulating Data with SAS(Wicklin 2013, p. 55–57) and in the paper "Ten Tips for Simulating Data with SAS" (Wicklin 2015, p. 6-9). The SAS/IML program runs a simulation to approximate the sampling distribution of the sample mean for a random sample of size N=36 that is drawn from the uniform U(0, 1) distribution. It simulates B=1E6 (one million) samples. The main function is the SimMeanUniform function, which does the following:

  1. Uses the RANDSEED subroutine to set a random number seed.
  2. Uses the RANDGEN subroutine to generate B samples of size N from the U(0, 1) distribution. In the N x B matrix, X, each column is a sample and there are B samples.
  3. Uses the MEAN function to compute the mean of each sample (column) and returns the row vector of the sample means.

The remainder of the program estimates the population mean, the standard error of the sample mean, and a 95% confidence interval for the population mean.

proc iml;
start SimMeanUniform(N, seed, B); 
   call randseed(seed);            /* set random number stream */
   x = j(N, B);                    /* allocate NxB matrix for samples */
   call randgen(x, 'Uniform');     /* simulate from U(0,1) */
   return mean(x);                 /* return row vector = mean of each column */
finish;
 
/* Simulate and return Monte Carlo distribution of mean */
stat = SimMeanUniform(36,          /* N = sample size */
                      123,         /* seed = random number seed */
                      1E6);        /* B = total number of samples */
 
/* Monte Carlo estimates: mean, std err, and 95% CI of mean */
alpha = 0.05;
stat = colvec(stat);
numSamples = nrow(stat);
MCEst = mean(stat);                /* estimate of mean, */
SE = std(stat);                    /* standard deviation, */
call qntl(CI, stat, alpha/2 || 1-alpha/2); /* and 95% CI */
R = numSamples || MCEst || SE || CI`;      /* combine for printing */
print R[format=8.4 L='95% Monte Carlo CI (Serial)'
        c={'NumSamples' 'MCEst' 'StdErr' 'LowerCL' 'UpperCL'}];
Monte Carlo simulation results from a serial PROC IML program in SAS

This program takes less than 0.5 seconds to run. It simulates one million samples and computes one million statistics (sample means). More complex simulations take longer and can benefit by distributing the computation to many threads, as shown in the next section.

A distributed Monte Carlo simulation

Suppose that you have access to a cluster of four worker nodes, each of which runs eight threads. You can distribute the simulation across the 32 threads and ask each thread to perform 1/32 of the simulation. Specifically, each thread can simulate 31,250 random samples from U(0,1) and return the sample means. The sample means can then be concatenated into a long vector and returned. The rest of the program (the Monte Carlo estimates) does not need to change.

The goal is to get the SimMeanUniform function to run on all 32 available threads. One way to do that is to use the SimMeanUniform function as the mapping function that is passed to the MAPREDUCE function. To use SimMeanUniform as a mapping function, you need to make two small modifications:

  • The function currently takes three arguments: N, seed, and B. But a mapping function that is called by the MAPREDUCE function is limited to a single argument. The solution is to pack the arguments into a list. For example, in the main program define L = [#'N'=36, #'seed' = 123, #'B' = 1E6] and pass that list to the MAPREDUCE function. In the definition of the SimMeanUniform module, define the signature as SimMeanUniform(L) and access the parameters as L$'N', L$'seed', and L$'B'.
  • The SimMeanUniform function currently simulates B random samples. But if this function runs on 32 threads, we want each thread to generate B/32 random samples. One solution would be to explicitly pass in B=1E6/32, but a better solution is to use the NPERTHREAD function in the iml action. The NPERTHREAD function uses the FLOOR-MOD trick to determine how many samples should be simulated by each thread. You specify the total number of simulations, and the NPERTHREAD function uses the thread ID to determine the number of simulations for each thread.

With these two modifications, you can implement the Monte Carlo simulation in parallel in the iml action. Because the mapping function returns a row vector, the reducer will be horizontal concatenation, which is available as the built-in "_HCONCAT" reducer. Thus, although each thread returns a row vector that has 31,250 elements, the MAPREDUCE function will return a row vector that has one million elements. The following program implements the parallel simulation in the iml action. If you are not familiar with using PROC CAS to call the iml action, see the getting started example for the iml action.

/* Simulate B independent samples from a uniform(0,1) distribution.
   Mapper: Generate M samples, where M ~ B/numThreads. Return M statistics.
   Reducer: Concatenate the statistics.
   Main Program: Estimate standard error and CI for mean.
 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the action set               */
source SimMean;                        /* put program in SOURCE/ENDSOURCE block */
 
start SimMeanUniform(L);               /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   M = nPerThread(L$'B');              /* number of samples for thread      */
   x = j(L$'N', M);                    /* allocate NxM matrix for samples   */
   call randgen(x, 'Uniform');         /* simulate from U(0,1)              */
   return mean(x);                     /* row vector = mean of each column  */
finish;
 
/* ----- Main Program ----- */
/* Put the arguments for the mapper into a list */
L = [#'N'    = 36,                     /* sample size                       */
     #'seed' = 123,                    /* random number seed                */
     #'B'    = 1E6 ];                  /* total number of samples           */
/* Simulate on all threads; return Monte Carlo distribution */
stat = MapReduce(L, 'SimMeanUniform', '_HCONCAT');   /* reducer is "horiz concatenate" */
 
/* Monte Carlo estimates: mean, std err, and 95% CI of mean */
alpha = 0.05;
stat = colvec(stat);
numSamples = nrow(stat);
MCEst = mean(stat);                    /* estimate of mean,                 */
SE = std(stat);                        /* standard deviation,               */
call qntl(CI, stat, alpha/2 || 1-alpha/2);     /* and 95% CI                */
R = numSamples || MCEst || SE || CI`;          /* combine for printing      */
print R[format=8.4 L='95% Monte Carlo CI (Parallel)'
        c={'NumSamples' 'MCEst' 'StdErr' 'LowerCL' 'UpperCL'}];
 
endsource;                             /* END of the SOURCE block */
iml / code=SimMean nthreads=8;         /* run the iml action in 8 threads per node */
run;
Monte Carlo simulation results from a parallel program in the iml action in SAS Viya

The output is very similar to the output from PROC IML. There are small differences because this program involves random numbers. The random numbers used by the serial program are different from the numbers used by the parallel program. By the way, each thread in the parallel program uses an independent set of random numbers. In a future article. I will discuss generating random numbers in parallel computations.

Notice that the structure of the parallel program is very similar to the serial program. The changes needed to "parallelize" the serial program are minor for this example. The benefit, however, is huge: The parallel program runs about 30 times faster than the serial program.

Summary

This article shows an example of using the MAPREDUCE function in the iml action, which is available in SAS Viya 3.5. The example shows how to divide a simulation study among k threads that run concurrently and independently. Each thread runs a mapping function, which simulates and analyzes one-kth of the total work. (In this article, k=32.) The results are then concatenated to form the final answer, which is a sampling distribution.

For comparison, the Monte Carlo simulation is run twice: first as a serial program in PROC IML and again as a parallel program in the iml action. Converting the program to run in parallel requires some minor modifications but results in a major improvement in overall run time.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post A parallel implementation of Monte Carlo simulation in SAS Viya appeared first on The DO Loop.

7月 082020
 

The iml action in SAS Viya (introduced in Viya 3.5) provides a set of general programming tools that you can use to implement a custom parallel algorithm. This makes the iml action different than other Viya actions, which use distributed computations to solve specific problems in statistics, machine learning, and artificial intelligence. By using the iml action, you can use programming statements to define the problem you want to solve. One of the simplest ways to run a parallel program is to use the MAPREDUCE function in the iml action, which enables you to distribute a computation across threads and nodes. This article describes the MAPREDUCE function and gives an example.

What is the map-reduce paradigm?

The MAPREDUCE function implements the map-reduce paradigm, which is a two-step process for distributing a computation. The MAPREDUCE function runs a SAS/IML module (called the mapping function, or the mapper) on every available node and thread in your CAS session. Each mapping function returns a result. The results are aggregated by using a reducing function (the reducer). The final aggregated result is returned by the MAPREDUCE function. The MAPREDUCE function is ideal for “embarrassingly parallel” computations, which are composed of many independent and essentially identical computations. Examples in statistics include Monte Carlo simulation and resampling methods such as the bootstrap. A Wikipedia article about the map-reduce framework includes other examples and more details.

A simple map-reduce example: Adding numbers

Perhaps the simplest map-reduce computation is to add a large set of numbers in a distributed manner. Suppose you have N numbers to add, where N is large. If you have access to k threads, you can ask each thread to add approximately N/k numbers and return the sum. The mapper function on each thread computes a partial sum. The next step is the reducing step. The k partial sums are passed to the reducer, which adds them and returns the total sum. In this way, the map-reduce paradigm computes the sum of the N numbers in parallel. For embarrassingly parallel problems, the map-reduce operation can reduce the computational time by up to a factor of k, if you do not pass huge quantities of data to the mappers and reducers.

You can use the MAPREDUCE function in the iml action to implement the map-reduce paradigm. The syntax of the MAPREDUCE function is
result = MAPREDUCE( mapArg, 'MapFunc', 'RedFunc' );
In this syntax, 'MapFunc' is the name of the mapping function and mapArg is a parameter that is passed to the mapper function in every thread. The 'RedFunc' argument is the name of the reducing function. You can use a predefined (built-in) reducers, or you can define your own reducer. This article uses only predefined reducers.

Let's implement the sum-of-N-numbers algorithm in the iml action by using four threads to sum the numbers 1, 2, ..., 1000. (For simplicity, I chose N divisible by 4, so each thread sums N/4 = 250 numbers.) There are many ways to send data to the mapper. This program packs the data into a matrix that has four rows and tells each thread to analyze one row. The program defines a helper function (getThreadID) and a mapper function (AddRow), which will run on each thread. The AddRow function does the following:

  1. Call the getThreadID function to find the thread in which the AddRow function is running. The getThreadID function is a thin wrapper around the NODEINFO function, which is a built-in function in the iml action. The thread ID is stored in the variable j.
  2. Extract the j th row of numbers. These are the numbers that the thread will sum.
  3. Call the SUM function to compute the partial sum. Return that value.

In the following program, the built-in '_SUM' reducer adds the partial sums and returns the total sum. The program prints the result from each thread. Because the computation is performed in parallel, the order of the output is arbitrary and can vary from run to run. If you are not familiar with using PROC CAS to call the iml action, see the getting started example for the iml action.

/* assume SESS0 is a CAS session that has 0 workers (controller only) and at least 4 threads */
proc cas;
session sess0;                         /* SMP session: controller node only        */
loadactionset 'iml';                   /* load the action set                      */
source MapReduceAdd;
   start getThreadID(j);               /* this function runs on the j_th thread    */
      j = nodeInfo()$'threadId';       /* put thread ID into the variable j        */
   finish;
   start AddRow(X);
      call getThreadId(j);             /* get thread ID                            */
      sum = sum(X[j, ]);               /* compute the partial sum for the j_th row */
      print sum;                       /* print partial sum for this thread        */
      return sum;                      /* return the partial sum                   */
   finish;
 
   /* ----- Main Program ----- */
   x = shape(1:1000, 4);                      /* create a 4 x 250 matrix           */
   Total = MapReduce(x, 'AddRow', '_SUM');    /* use built-in _SUM reducer         */
   print Total;
endsource;
iml / code=MapReduceAdd nthreads=4;
run;

The output shows the results of the PRINT statement in each thread. The output can appear in any order. For this run, the first output is from the second thread, which computes the sum of the numbers 251, 252, . . . , 500 and returns the value 93,875. The next output is from the fourth thread, which computes the sum of the numbers 751, 752, . . . , 1000. The other threads perform similar computations. The partial sums are sent to the built-in '_SUM' reducer, which adds them together and returns the total sum to the main program. The total sum is 500,500 and appears at the end of the output.

Visualize the map-reduce computations

You can use the following diagram to help visualize the program. (Click to enlarge.)

The following list explains parts of the diagram:

  • My CAS session did not include any worker nodes. Therefore, the iml action runs entirely on the controller, although it can still use multiple threads. This mode is known as symmetric multiprocessing (SMP) or single-machine mode. Notice that the call to the iml action (the statement just before the RUN statement) specifies the NTHREADS=4 parameter, which causes the action to use four threads.
  • The program defines the matrix X, which has four rows. This matrix is sent to each thread.
  • Each thread runs the AddRow function (the mapper). The function uses the NODEINFO function to determine the thread it is running in. It then sums the corresponding row of the X matrix and returns that partial sum.
  • The reducer combines the results of the four mappers. In this case, the reducer is '_SUM', so the reducer adds the four partial sums. This sum is the value that is returned by the MAPREDUCE function.

Summary

This article demonstrates a simple call to the MAPREDUCE function in the iml action, which is available in SAS Viya 3.5. The example shows how to divide a task among k threads. Each thread runs concurrently and independently. Each thread runs a mapping function, which computes a portion of the task. The partial results are then sent to a reducing function, which assembles the partial results into a final answer. In this example, the task is to compute a sum. The numbers are divided among four threads, and each thread computes part of the sum. The partial sums are then sent to a reducer to obtain the overall sum.

This example is one of the simplest map-reduce examples. Obviously, you do not need parallel computations to add a set of numbers, but I hope the simple example enables you to focus on map-reduce architecture. In a future article, I will present a more compelling example.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post A general method for parallel computation in SAS Viya appeared first on The DO Loop.

6月 172020
 

A previous article shows how to use the iml action to read a CAS data table into an IML matrix. This article shows how to write a CAS table from data in an IML matrix. You can read an overview of the iml action, which was introduced in SAS Viya 3.5.

Write a CAS data table from the iml action

Suppose you are running a program in the iml action and you want to write the result of a computation to a CAS data table. The simplest way is to use the the MatrixWriteToCAS subroutine. The syntax of the function is
call MatrixWriteToCAS(matrix, caslib, TableName, colnames);
where

  • matrix is the matrix that you want to save.
  • caslib is a caslib that specifies the location of the table. A blank string means "use the default caslib." Otherwise, specify the name of a caslib. For example, if the CAS table is in your personal caslib, specify 'CASUSER(userName)', where userName is your login name. For more information, see the CAS documentation about caslibs.
  • TableName is a string that specifies the name of the CAS table.
  • colnames is a character vector, whose elements specify the columns in the CAS table.

For simplicity, the following call to the iml action defines a matrix and then writes it to a table. In practice, the matrix would be the result of some computation:

proc cas;
loadactionset 'iml';   /* load action set (once per session) */
source WriteMat;
   corr = {1.0000  0.7874 -0.7095 -0.7173  0.8079,
           0.7874  1.0000 -0.6767 -0.6472  0.6308,
          -0.7095 -0.6767  1.0000  0.9410 -0.7380,
          -0.7173 -0.6472  0.9410  1.0000 -0.7910,
           0.8079  0.6308 -0.7380 -0.7910 1.0000 };
   varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
   call MatrixWriteToCas(corr, ' ', 'MyCorr', varNames);  /* Write to CAS data table in default caslib */ 
endsource;
iml / code=WriteMat;      /* call the iml action to run the program */
run;

The program writes the matrix to a CAS table named MYCORR. You can call the columnInfo action (which is similar to PROC CONTENTS) to verify that the CAS table exists:

proc cas;
   columnInfo / table='MyCorr';
run;

The output of the columnInfo action shows that the MYCORR data table was created. Notice that in addition to the five specified variables, the CAS table contains a sixth variable, called _ROWID_. This variable is automatically added by the MatrixWriteToCAS subroutine.

How did that variable get there? As explained in the previous article, CAS data tables do not have an inherent row order, but matrices do. The iml action includes the _ROWID_ variable in the output data table in case you need to read the data back into an IML matrix at a later time. If you do, the rows of the new matrix will be in the same order as the rows of the CORR matrix that created the data table. For a correlation matrix, this is important because (by convention) the order of the rows corresponds to the order of the columns. Preserving the row order ensures that the matrix has 1s along the main diagonal.

In summary, you can use the MatrixWriteToCAS subroutine to write a matrix to a CAS table. Not only does the call create a CAS table, but it augments the table with information that can recreate the matrix in the same row order. By the way, if you want to save mixed-type data, you can use the TableWriteToCAS subroutine to write a SAS/IML table to a CAS table.

Further reading

The post Write a CAS data table by using the iml action appeared first on The DO Loop.

6月 152020
 

A previous article compares a SAS/IML program that runs in PROC IML to the same program that runs in the iml action. (You can read an overview of the iml action.) The example in the previous article was very simple and did not read or write data. This article compares a PROC IML program that reads a SAS data set to a similar program that runs in the iml action and reads CAS tables. The iml action was introduced in SAS Viya 3.5.

PROC IML and SAS data sets

An important task for statistical programmers is reading data from a SAS data set into a SAS/IML matrix. SAS/IML programs are often used to pre-process or post-process data that are created by using other SAS procedures, and SAS data sets enable the output from one procedure to become the input to another procedure.

To simplify matters, let's focus on reading numerical data into a SAS/IML matrix. In PROC IML, the USE and READ statements are used to read data into a matrix. The following program is typical. Several numerical variables from the SasHelp.Cars data are read into the matrix X. The call to the CORR function then computes the correlation matrix for those variables:

proc iml;
varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
use Sashelp.Cars;              /* specify the libref and data set name */
read all var varNames into X;  /* read all specified variables into columns of X */
close;
 
corr = corr(X);
print corr[r=varNames c=varNames format=7.4];
quit;

The USE and READ statements read SAS data sets. However, there are no SAS data sets in CAS, so these statements are not supported in the iml action. Instead, the iml action supports reading a CAS table into a matrix. Whereas the READ statement in PROC IML reads data sequentially, the iml action reads data from a CAS table in parallel by using multiple threads.

The rows in a CAS table do not have a defined order

Before showing how to read a CAS table into a matrix, let's discuss a characteristic of CAS tables that might be unfamiliar to you. Namely, the order of observations in a CAS table is undefined.

Recall that one purpose of CAS is to process massive amounts of data. Large data tables might be stored in a distributed fashion across multiple nodes in a cluster. When rows of data are read, they are read by using multiple threads that execute in parallel. A consequence of this fact is that CAS data tables do not have an inherent order. If this sounds shocking, realize that the order of the observations does not matter for most data analysis. For example, order does not affect descriptive statistics such as the mean or percentiles. Furthermore, any analysis that uses the sum-of-squares crossproduct matrix (X`*X) is unaffected by reordering the observations. This includes correlation, regression, and a lot of multivariate statistics (for example, principal component analysis).

Of course, for some analyses (such as time series) and for some matrix computations, the order of rows is important. Therefore, the iml action has a way to ensure that the rows of a matrix are in a specific order. If the CAS table has a variable with the name _ROWID_ (or _ROWORDER_), then the data rows are sorted by that variable before the IML matrix is created. This happens automatically when you use the MatrixCreateFromCAS function, which is discussed in a subsequent section. (The _ROWID_ is used only to sort the rows; it does not become a column in the matrix.)

You can use the iml action to read any CAS table. If you read a table that does not contain a _ROWID_ variable, the order of rows in the data matrix might change from run to run.

Upload a data set from SAS to CAS

There are several ways to upload a SAS data set into a CAS table. This example uses the DATA step. By using the DATA step, you can also add a _ROWID_ variable to the data, in case you want the rows of the IML matrix to be in the same order as the data set.

I assume you have already established a CAS session. The following statements use the LIBNAME statement in SAS to create a libref to the active caslib in the current CAS session. You can use this libref to upload a data set into a CAS table:

libname mycaslib cas;    /* libref to the active caslib in the current CAS session */
data mycaslib.cars;      /* create CAS table named CARS in the active caslib */
   set Sashelp.Cars;
   _ROWID_ + 1;          /* add a sort variable */
run;
 
/* Optional: verify that the CAS table includes the _ROWID_ variable */
proc cas;
   columnInfo / table='cars';
run;

The output of the columnInfo action shows that the cars data table was created and includes the _ROWID_ variable.

How to read a CAS data table in the iml action

To read variables from a CAS data table into a matrix, you can use the CreateMatrixFromCAS function. The syntax of the function is
X = MatrixCreateFromCAS(caslib, TableName, options);
where

  • caslib is a caslib that specifies the location of the table. You can specify a blank string if you want to use the default caslib. Or you can specify a caslib as a string. For example, if the CAS table is in your personal caslib, specify 'CASUSER(userName)', where userName is your login name. For more information, see the CAS documentation about caslibs.
  • TableName is a string that specifies the name of the CAS table.
  • options is string that enables you to read only part of the data. The most common option is the KEEP statement, which you can use to specify all numeric values ('KEEP=_NUMERIC_', which is the default), all character variables ('KEEP=_CHARACTER_'), or to specify certain variables ('KEEP=X1 X1 Y Z').

Recall that matrices in the SAS/IML language are either numeric or character. If you want to read mixed-type data, you can use the TableCreateFromCAS function.

Read a CAS table in the iml action

In the previous sections, we created a CAS table ('cars') that contains the Sashelp.Cars data. You can use iml action and the MatrixCreateFromCAS function to read that data table into a matrix, as follows:

proc cas;
   loadactionset 'iml';    /* load the iml action set (only once per session) */
run;
 
proc cas;
source ReadData;
   KeepStmt = 'KEEP=EngineSize Horsepower MPG_City MPG_Highway Weight';
   X = MatrixCreateFromCas(' ', 'cars', KeepStmt);
 
   corr = corr(X);
   varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
   print corr[r=varNames c=varNames format=7.4];
endsource;
iml / code=ReadData;
run;

The output is identical to the output from PROC IML and is not shown.

The KEEP= option specifies five numeric variables to read. The first argument to the MatrixCreateFromCAS function is a blank string, which means "use the default caslib." Alternatively, you could specify a caslib such as 'CASUSER(userName)'. The matrix X is the data matrix. Because the CARS table contains a _ROWID_ variable, the rows of X are in the same order as the rows of the Sashelp.Cars data set.

After you read the data, you can write standard SAS/IML programs to manipulate or analyze X. For example, I used the CORR function and the PRINT statement to reproduce the results of the previous PROC IML program.

In the next article, I will show how to use the iml action to write a CAS table that contains data in an IML matrix.

Further reading

The post Read a CAS data table by using the iml action appeared first on The DO Loop.

6月 122020
 

A previous article provides an introduction and overview of the iml action, which is available in SAS Viya 3.5. The article compares the iml action to PROC IML and states that most PROC IML programs can be modified to run in iml action. This article takes a closer look at what a SAS/IML program looks like in the iml action. If you have an existing PROC IML program, how can you modify it to run in the iml action?

PROC CAS and the SOURCE/ENDSOURCE block

A feature of SAS Viya is that the actions can be called from multiple languages: SAS, Python, R, Lua, etc. I am a SAS programmer, so in my blog, I will use SAS to call actions.

An action (sometimes called a "CAS action") runs by using the SAS Cloud Analytic Services (CAS). The CAS procedure provides a language (called "CASL," for "CAS Language") that enables you to call CAS actions. The documentation for CASL states that "CASL is a scripting language that you use to prepare arguments for execution of CAS actions, submit actions to the CAS server, and then process action results." In short, PROC CAS enables you to "stitch together" a sequence of action calls into a workflow.

PROC CAS supports a SOURCE/ENDSOURCE block, which defines a program that you can submit to an action that supports programming statements. I will use the SOURCE/ENDSOURCE block to define programs for the iml action. To call the iml action by using PROC CAS, you need to do three things:

  1. Load the iml action set. Any CAS action set that is not auto-loaded must be loaded before you can use it. It only needs to be loaded once per session.
  2. Use the SOURCE/ENDSOURCE block to define the IML program.
  3. Call the iml action to run the program. The syntax to call the action is iml / code=ProgramName

In terms of syntax, a typical program in the iml action has the following structure:

PROC CAS;
loadactionset 'iml';        /* load the iml action set */
source ProgramName;
    < put the IML program here >
endsource;
iml / code=ProgramName;     /* run the program in the iml action */
RUN;

Convert a program from PROC IML to the iml action

The following SAS/IML program defines two vectors and computes their inner product. It then computes the variance of the x data vector. Lastly, it centers and scales the data and computes the variance of the standardized data. This program runs in PROC IML:

PROC IML;
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
QUIT;

This program does not read or write any data sets, which makes it easy to convert to the iml action. The following statements assume that you have a license for the SAS IML product in Viya and that you have already connected to a CAS server.

/* Example of using PROC CAS in SAS to call the iml action */
PROC CAS;
loadactionset 'iml';            /* load the action set (once) */
source pgm;
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
endsource;
iml / code=pgm;                 /* call iml action to run the 'pgm' program */
RUN;

For this example, the SAS/IML statements are exactly the same in both programs. The difference is the way that the program is submitted for execution. For this example, the program is named 'pgm', but you could also have named the program 'MyProgram' or 'StdAnalysis' or 'Robert' or 'Jane'. Whatever identifier you use on the SOURCE statement, use that same identifier for the CODE= parameter when you call in the action.

Calling the iml action from Python

I use PROC CAS in SAS to call CAS actions. However, I know that Python is a popular programming language for some data scientists, so here is how to call the iml action from Python. You first need to install the SAS Scripting Wrapper for Analytics Transfer (SWAT) package. You can then use the following statements to connect to a CAS server and load the action:

# Example of using Python to call the iml action
import swat # load the swat package
s = swat.CAS('myhost', 12345)    # use server='myhost'; port=12345
s.loadactionset('iml')           # load the action set (once)

As mentioned earlier, you only need to load the action set one time per session. After the action set is loaded, you can call the iml action by using the following. In Python, triple quotes enable you to preserve the indention and comments in the IML program.

m = s.iml(code=
"""
   c = {1, 2, 1, 3, 2, 0, 1};   /* weights */
   x = {0, 2, 3, 1, 0, 2, 2};   /* data */
   wtSum = c` * x;              /* inner product (weighted sum) */
   var1 = var(x);               /* variance of original data */
   stdX = (x-mean(x)) / std(x); /* standardize data */
   var2 = var(stdX);            /* variance of standardized data */
   print wtSum var1 var2;
""")

More realistic examples

In this case, the program did not read or write data. However, a typical IML program reads in data, analyzes it, and optionally writes out the results. This is the first (and most common) difference between a program that runs in PROC IML and a similar program that runs in the iml action. A PROC IML program reads and writes SAS data sets, whereas actions read and write CAS data tables.

The next article shows an example of a PROC IML program that reads and writes SAS data sets. It compares the PROC IML program to an analogous program that runs in the iml action and reads and writes CAS tables.

Further reading

The post Getting started with the iml action in SAS Viya appeared first on The DO Loop.

6月 102020
 

This article introduces the iml action, which is available in SAS Viya 3.5. The iml action supports most of the same syntax and functionality as the SAS/IML matrix language, which is implemented in PROC IML. With minimal changes, most programs that run in PROC IML also run in the iml action. In addition, the iml action supports new programming features for parallel programming.

Most actions in SAS Viya perform a specific task, but the iml action is different. The iml action provides a set of general programming tools that you can use to implement a custom parallel algorithm. The programmer can control many aspects of the computation, including how the computation is distributed among nodes and threads on a cluster of machines (or threads on a single machine).

Future articles will address the parallel programming capabilities of the iml action. This article provides an overview of the iml action. What is it? How do you get access to it? How is it similar to and different from PROC IML?

What is the iml action?

Recall that the SAS/IML language is a matrix-vector programming language that supports a rich library of functions in statistics, data analysis, matrix computations, numerical analysis, simulation, and optimization. In SAS 9 ("traditional SAS"), you can access the SAS/IML language by licensing the SAS/IML product and calling the IML procedure. PROC IML is also available in the SAS University Edition.

In SAS Viya 3.5. you get access to the SAS/IML language by licensing the SAS IML product. (Notice that there is no “slash” in the product name.) The SAS IML product gives you access to the iml action and to the IML procedure. Thus, in Viya, you can run all existing PROC IML programs, and you can also write new programs that run in the iml action and use SAS Cloud Analytic Services (CAS).

The iml action belongs to the iml action set. In addition to supporting most of the statements and functions in the SAS/IML language, the iml action supports new functionality that enables you to take advantage of the distributed computational resources in SAS Viya. In particular, you can use the iml action to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Even on one machine, you can run custom parallel programs on a multicore processor.

How is the iml action similar to PROC IML?

The iml action and the IML procedure share a common syntax. The mathematical and statistical function library is essentially the same in the action and in the procedure. Both environments support arithmetic and linear algebraic operations on matrices, operations to subset and query matrices, and programming features such as writing loops and using IF-THEN/ELSE logic.

Of the 300 functions and statements in the SAS/IML run-time library, only a handful of statements are not supported in the iml action. Most differences are related to the difference between the SAS 9 and SAS Viya environments. PROC IML interacts with traditional SAS constructs (such as data sets and catalogs) and supports calling SAS procedures and interacting with files on your local computer. The iml action interacts with analogous constructs in the Viya environment. It can read and write CAS tables, write analytic stores (astores), and can call other Viya actions.

Why use the iml action?

The iml action runs on a CAS server. Why might you choose to use the iml action instead of PROC IML? Or convert an existing PROC IML program into the iml action? There are two main reasons:

  • You want to use the SAS/IML language as part of a sequence of actions that analyze data that are in CAS tables. By using the iml action, you can read and write CAS tables directly. If you use PROC IML, you need to pull the data from CAS into a SAS data set, run the analysis in PROC IML, and then push the results to a CAS table.
  • You want to take advantage of the capabilities of the CAS server to perform parallel processing. You can use the iml action to create custom parallel computations.

How is the iml action different from PROC IML?

As mentioned previously, the iml action does not support every function and statement that PROC IML supports. The unsupported functions and statements are primarily in four areas:

  • Base SAS functions that are not supported in the CAS DATA step. For example, the old random number generator functions RANUNI and RANNOR are not supported in CAS because they cannot generate independent streams of random number in parallel.
  • Statements that read or write SAS data sets or text files.
  • Functions that create graphics. CAS is for computations. You can download the results of the computation to whatever language you are using to call the CAS actions. For example, I download the results to SAS and create SAS graphs, but you could also use Python or R.
  • SAS/IML functions that were deprecated in earlier releases of SAS.

See the documentation for the iml action for a complete list of the PROC IML functions and statements that are not supported in the iml action.

Will my program run faster in the iml action than PROC IML?

If you take an existing PROC IML program and run it in the iml action, it will take about the same amount of time to run. Sure, it might run a little faster if the machines in your CAS cluster are newer and more powerful than the SAS server or your PC, but that speedup is due to hardware. An existing program does not automatically run faster in the iml action because it runs serially until it encounters a programming statement that can be executed in parallel. There are two main sets of statements that run in parallel:

  • Reading and writing data from CAS tables.
  • Functions that distribute computations. You, the programmer, need to call these functions to write parallel programs.

So, yes, you can get certain programs to run faster in the iml action, but it doesn't happen automatically. You have to add new input/output statements or call functions that execute tasks in parallel.

Should you convert from PROC IML to the iml action?

What does this mean for the SAS/IML programmer whose company is changing from SAS 9 to Viya? Do you need to convert hundreds of existing PROC IML programs to run in the iml action? No, absolutely not. As mentioned previously, when you license SAS IML on Viya, you get both PROC IML and the iml action. The existing programs that you wrote in SAS 9 will continue to run in PROC IML in Viya.

The Viya platform provides an opportunity to use the iml action but does not require it. Under what circumstances might you want to convert a program from PROC IML to the iml action? Or write a new program in the iml action instead of using PROC IML? In my opinion, it comes down to two issues: workflow and performance.

  • Workflow: If your company is using CAS actions and CAS-enabled procedures, and if your data are stored in CAS tables, then it makes sense to use the iml action instead of the IML procedure. The SAS/IML language is often used to pre- or post-process data for other procedures or actions. The iml action can read and write CAS tables that are created by other CAS actions or that will be consumed by other CAS actions.
  • Performance: Suppose that you have a computation that takes a long time to process in PROC IML, but the computation is "embarrassingly parallel." An embarrassingly parallel problem, is one that consists of many identical independent subtasks. The iml action supports several functions for distributing a computation to multiple threads. Examples of embarrassingly parallel computations in statistics and machine learning include Monte Carlo simulation, resampling methods such as the bootstrap, ensemble models, and many "brute force" computations.

Further reading

I continue this exploration of the iml action in subsequent articles. Related articles include:

  • A Getting Started example that shows how to call the iml action and discusses how the action is similar to and different from PROC IML.
  • An example that shows how to read and write CAS tables from the iml action.
  • The MAPREDUCE function, which enables you to distribute a computation across threads and nodes.
  • The PARTASKS function, which enables you to distribute multiple independent computations across threads and nodes.
  • The SCORE function, which enables you to evaluate a function in parallel on every row of a CAS table.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post An introduction to the iml action in SAS Viya appeared first on The DO Loop.

4月 282020
 

With increasing interest in Continuous Integration/Continuous Delivery (CI/CD), many SAS Users want to know what can be done for Visual Analytics reports. In this article, I will explain how to use Python and SAS Viya REST APIs to extract a report from a SAS Viya environment and import it into another environment. For those trying to understand the secret behind CI/CD and DevOps, here it is:

What you do tomorrow will be better than what you did yesterday because you gained more experience today!

About Continuous Integration/Continuous Delivery

If you apply this principle to code development, it means that you may update your code every day, test it, deploy it, and start the process all over again the following day. You might feel like Sisyphus rolling his boulder around for eternity. This is where CI/CD can help. In the code deployment process, you have many recurrent tasks that can be automated to reduce repetitiveness and boredom. CI/CD is a paradigm where improvements to code are pushed, tested, validated, and deployed to production in a continuous automated manner.

About ModelOps and AnalyticOps

I hope you now have a better understanding of what CI/CD is. You might now wonder how CI/CD relates to Visual Analytics reports, models, etc. With the success of DevOps which describes the Development Operations for software development, companies have moved to the CI/CD paradigms for operations not related to software development. This is why you hear about ModelOps, AnalyticOps... Wait a second, what is the difference between writing or generating code for a model or report versus writing code for software? You create a model, you test it, you validate it, and finally deploy it. You create a report, you test it, you validate it, and then you deploy it. Essentially, the processes are the same. This is why we apply CI/CD techniques to models, reports, and many other business-related tasks.

About tools

As with many methodologies like CI/CD, tools are developed to help users through the process. There are many tools available and some of them are more widely used. SAS offers SAS Workflow Manager for building workflows to help with ModelOps. Additionally, you have surely heard about Git and maybe even Jenkins.

  • Git is a version control system that is used by many companies with some popular implementations: GitHub, GitLab, BitBucket.
  • Jenkins is an automation program that is designed to build action flows to ease the CI/CD process.

With these tools, you have the needed architecture to start your CI/CD journey.

The steps

With a basic understanding of the CI/CD world; you might ask yourself: How does this apply to reports?

When designing a report in an environment managed by DevOps principles, here are the steps to deploy the report from a development environment to production:

  1. Design the report in development environment.
  2. Validate the report with Business stakeholders.
  3. Export the report from the development environment.
  4. Save a version of the report.
  5. Import the report into the test environment.
  6. Test the report into the test environment.
  7. Import the report into the production environment.
  8. Monitor the report usage and performance.

Note: In some companies, the development and test environments are the same. In this case, steps 4 to 6 are not required.

Walking through the steps, we identify steps 1 and 2 are manual. The other steps can be automated as part of the CI/CD process. I will not explain how to build a pipeline in Jenkins or other tools in this post. I will nevertheless provide you with the Python code to extract, import, and test a report.

The code

To perform the steps described in the previous section, you can use different techniques and programming languages. I’ve chosen to use Python and REST APIs. You might wonder why I've chosen these and not the sas-admin CLI or another language. The reason is quite simple: my Chinese zodiac sign is a snake!

Jokes aside, I opted for Python because:

  • It is easy to read and understand.
  • There is no need to compile.
  • It can run on different operating systems without adaptation.
  • The developer.sas.com site provides examples.

I’ve created four Python files:

  1. getReport.py: to extract the report from an environment.
  2. postReport.py: to create the report in an environment.
  3. testReport.py: to test the report in an environment.
  4. functions.py: contains the functions that are used in the other Python files.

All the files are available on GitHub.

The usage

Before you use the above code, you should configure your SAS Viya environment to enable access to REST APIs. Please follow the first step from this article to register your client.

You should also make sure the environment executing the Python code has Python 3 with the requests package installed. If you're missing the requests package, you will get an error message when executing the Python files.

Get the report

Now that your environment is set up, you can execute the getReport code to extract the report content from your source environment. Below is the command line arguments to pass to execute the code:

python3 getReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -rl "/Users/sbxxab/My Folder" -rn CarsReport -o /tmp/CICD/

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • rl - the report location should be between quotes if it contains white spaces.
  • rn - the report name should be between quotes if it contains white spaces.
  • o - the output location that will be used.

The output location should ideally be a Git repository. This allows a commit as the next step in the CI/CD process to keep a history log of any report changes.

The output generated by getReport is a JSON file which has the following structure:

{
"name": "CarsReport",
"location": "/Users/sbxxab/My Folder",
"content": {
"@element": "SASReport",
"xmlns": "http://www.sas.com/sasreportmodel/bird-4.2.4",
"label": "CarsReport",
"dateCreated": "2020-01-14T08:10:31Z",
"createdApplicationName": "SAS Visual Analytics 8.5",
"dateModified": "2020-02-17T15:33:13Z",
"lastModifiedApplicationName": "SAS Visual Analytics 8.5",
"createdVersion": "4.2.4",
"createdLocale": "en",
"nextUniqueNameIndex": 94,
 
...
 
}

From the response:

  • The name value is the report name.
  • The location value is the folder location in the SAS Content.
  • The content value is the BIRD representation of the report.

The generated file is not a result of the direct extraction of the report in the SAS environment. It is a combination of multiple elements to build a file containing the required information to import the report in the target environment.

Version the report

The next step in the process is to commit the file within the Git repository. You can use the git commit command followed by a git push to upload the content to a remote repository. Here are some examples:

# Saving the report in the local repository
git commit -m "Save CarsReport"
 
# Pushing the report to a remote repository after a local commit
git push https://gitlab.sas.com/myRepository.git

Promote the report

As soon as you have saved a version of the report, you can import the report in the target environment (production). This is where the postReport comes into play. Here is a sample command line:

python3 postReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -i /tmp/CICD/CarsReport.json

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • i - the input JSON file which contains the output of the getReport.py.

The execution of the code returns nothing except in the case of an error.

Testing the report

You now have access to the report in the production environment. A good practice is to test/validate access to the report. While testing manually in an interface is possible, it's best to automate. Sure, you could validate one report, but what if you had twenty? Use the testReport script to verify the report. Below are the command line arguments to execute the code:

python3 testReport.py -a myAdmin -p myAdminPW -sn http://va85.gel.sas.com -an app -as appsecret -i /tmp/CICD/CarsReport.json

The parameters are:

  • a - the user that is used to connect to the SAS Viya environment.
  • p - the password of the user.
  • sn - the URL of the SAS Viya environment.
  • an - the name of the application that was defined to enable REST APIs.
  • as - the secret to access the REST APIs.
  • i - the input JSON file which contains the output of the getReport.py

The testReport connects to SAS Visual Analytics using REST APIs and generates an image of the first section of the report. The image generation process produces an SVG image of little interest. The most compelling part of the test is the duration of the execution. This gives us an indication of the report's performance.

Throughout the CI/CD process, validation is important and it is also interesting to get a benchmark for the report. This is why the testReport populates a .perf file for the report. The response file has the following structure:

{
"name": "CarsReport",
"location": "/Users/sbxxab/My Folder",
"performance": [
{
"testDate": "2020-03-26T08:55:37.296Z",
"duration": 7.571
},
{
"testDate": "2020-03-26T08:55:56.449Z",
"duration": 8.288
}
]
}

From the response:

  • The name value is the report name.
  • The location value is the folder location in the SAS Content.
  • The performance array contains the date time stamp of the test and the time needed to generate the image.

The file updates for each execution of the testReport code.

Conclusion

CI/CD is important to SAS. Many SAS users need solutions to automate their deployment processes for code, reports, models, etc. This is not something to fear because SAS Viya has the tools required to integrate into CI/CD pipelines. As you have seen in this post, we write code to ease the integration. Even if you don’t have CI/CD tools like Jenkins to orchestrate the deployment, you can execute the different Python files to promote content from one environment to another and test the deployed content.

If you want to get more information about ModelOps, I recommend to have a look at this series.

Continuous Integration/Continuous Delivery – Using Python and REST APIs for SAS Visual Analytics reports was published on SAS Users.

4月 102020
 

First introduced in SAS Visual Analytics 8.3, common filters are filters that can be shared between objects in your reports.

Common filter benefits include:

  • Easy to assign the same filter conditions to other report objects.
  • When you edit a common filter, it is updated everywhere that the common filter is used.
  • A common filter is available for the entire report, across pages.

Common filter limitations include:

  • Objects must share the same data source as the common filter definition.
  • Common filters are not available across multiple reports.

While the benefits speak for themselves, common filters also expedite designing reports and exploring data by quickly reusing the same filter conditions with only a few mouse clicks.

Let’s look at some examples of using common filters. In the screenshot below, I defined a filter to return data for the last 30 days. I converted that filter to a common filter and now the Last 30 Days common filter can be applied to any object in this report that uses the same data source. Then I applied it to the bottom treemap object.



In my second example, I parameterized a text input control to capture a user-defined value. Then I defined a parameter-driven filter for the bar chart object. Next, I converted that filter to a common filter named Text Input Contains Filter and applied it to the list table object below.



In my third example, I parameterized two drop-down list controls to capture date boundaries. Then I defined a parameter-driven filter for the waterfall chart object. Next, I converted that filter to a common filter named FromToWeekFilter and applied it to the key value object.



Pro tip: Embrace meaningful filter names. Recall that these common filters are associated with the data source for which they are defined. Being able to quickly identify a common filter definition by its name will save you additional mouse clicks in the long run.

Example 1: Last 30 Days Filter

Let’s pick up where we need to define the filters. Therefore, both objects have been added to the report page with roles assigned.

Next, select the line chart object and open the Filters pane. Click + New filter then select Advanced filter.

Now we will use the built-in filter conditions offered in SAS Visual Analytics. First, select the date/datetime data item, Day. Second, scroll down in the available conditions until you see Last 30 days, then double click on the condition to add it to the expression editor. Third, give the filter condition a meaningful name and as a final step, be sure the number of returned observations is expected. Click OK.

Now we need to convert this filter to a common filter. With the line chart object still selected, on the Filters pane, use the Last 30 Day overflow menu and select Change to common filter.



You should now see the Last 30 Days listed as a Common Filter in the Data pane.

Lastly, we can apply this common filter to the treemap object. Select the treemap object, then open the Filters pane. Next click + New Filter and select the common filter Last 30 Days.



It’s good practice to title your objects to reflect any filters that may be defined for them. Especially if there are no prompts, i.e. control objects, driving the subset of data. This is so your report consumers quickly understand that they are only seeing, in this case, the last 30 days of data.

Example 2: Text Input Contains Filter

This next example will define a parameter-driven common filter. If you are not familiar with using parameters in SAS Visual Analytics, start with this article and refer to the additional materials at the end.

See the screenshot below for more information on the data item role assignments for each object. The most important role that will drive the parameter-driven common filter is the parameter. The parameter, TextInputParameter, is a character parameter assigned to the text input control object and it is the only role assignment. If there is nothing entered in this prompt, i.e. control object, then the filter will return all of the data. We will define a contains expression to only return data rows that contain the entered text.

With the bar chart object selected, open the Filters pane and click + New filter and select Advanced filter.



Now we need to build our parameter-driven filter expression. In this filter we will be checking if the Product Description data item Contains the entered text stored in the TextInputParameter. I wrapped each string expression in an UpCase function so that mixed case is ignored. I could have just as easily used the LowerCase function to get the same result.

And finally, remember to give your filter a meaningful name. The returned observations number is not reflective of an applied filter since I do not have any text entered in the control object. If you had text entered there, then your returned observation number may show a subset of matched rows.



Now we need to convert this object-level filter to a common filter. With the bar chart still the active object, open the Filters pane and use the Text Input Contains Filter overflow menu and select Change to common filter.

You should now see the Text Input Contains Filter in the Data pane.



To add this common filter to the lower list table object, select the list table object and open the Filters pane. Click on the + New filter and select the Text Input Contains Filter.



Now let’s test the filter. In the screenshot below, I’ve typed the word red. Recall that our filter expression is to return rows where Product Description contains the entered text. Notice that even though I do not have the data item Product Description assigned as a data role in the bar chart object, SAS Visual Analytics is still able to apply the filter appropriately. The list table object does have a role assigned for Product Description so that filter application is easier to identify.



I used the contains operator instead of the equals to return a partial match to the Product Description to help identify trends across multiple Products. In these next examples, I entered the text (F) and (M) to return the rows where the Product Description indicates gender-specific products.

Example 3: From – To Week Filter

This last example will also define a parameter-driven common filter. If you need a more step-by-step guide to similar examples, refer to this YouTube video tutorial:

See the screenshot below for more information on the data item role assignments for each object. This example differs from the last in that these control objects, the drop-down lists, use both Category and Parameter Role assignments. The Year-Week data item will provide the available values and the selections will be stored in the parameters FromWeekParameter and ToWeekParameter. I will then create a common filter for an inclusive between of the selected year week values to apply to both the key value object and the waterfall chart.



With the waterfall chart object selected, open the Filters pane and click + New filter and select Advanced filter.



Now we need to build our parameter-driven filter expression. In this filter, I will subset the Year-Week date values which are inclusively between the FromWeekParameter and ToWeekParameter boundaries. Follow these steps:

  1. Select the date/datetime data item, Year-Week.
  2. Scroll down in the available conditions till you see Year-Week BetweenInclusive(‘x’,’y’), then double click on the condition to add it to the expression editor.
  3. Drag the FromWeekParameter and ToWeekParameter from the parameter data items list to the expression.
  4. Give the filter condition a meaningful name. Click OK.



Test the filter by adjusting the values in the drop-down list controls.



Now we need to convert this object-level filter to a common filter. With the waterfall chart still the active object, open the Filters pane and use the FromToWeekFilter overflow menu and select Change to common filter.

You should now see the FromToWeekFilter in the Data pane.



To add this common filter to the key value object; select the key value object and open the Filters pane. Click on the + New filter and select the FromToWeekFilter.

Conclusion

Remember that once a common filter is defined, it can be used for any object in the report that uses the same data source. In these examples, I applied the common filter on the same report page but you can use a common filter on any page within the report. Recall that all of the common filters are listed and available when I click to apply a new filter to an object.

Hopefully these examples have shown you new ways to explore data faster!

Additional materials for using parameters in SAS Visual Analytics:

Using common filters in SAS Visual Analytics was published on SAS Users.

1月 132020
 

Are you ready to get a jump start on the new year? If you’ve been wanting to brush up your SAS skills or learn something new, there’s no time like a new decade to start! SAS Press is releasing several new books in the upcoming months to help you stay on top of the latest trends and updates. Whether you are a beginner who is just starting to learn SAS or a seasoned professional, we have plenty of content to keep you at the top of your game.

Here is a sneak peek at what’s coming next from SAS Press.
 
 

For students and beginners

For beginners, we have Exercises and Projects for The Little SAS® Book: A Primer, Sixth Edition, the best-selling workbook companion to The Little SAS Book by Rebecca Ottesen, Lora Delwiche, and Susan Slaughter. Exercises and Projects for The Little SAS® Book, Sixth Edition will be updated to match the updates to the new The Little SAS® Book: A Primer, Sixth Edition. This hands-on workbook is designed to hone your SAS skills whether you are a student or a professional.

 

 

For data explorers of all levels

This free e-book explores the features of SAS® Visual Data Mining and Machine Learning, powered by SAS® Viya®. Users of all skill levels can visually explore data on their own while drawing on powerful in-memory technologies for faster analytic computations and discoveries. You can manually program with custom code or use the features in SAS® Studio, Model Studio, and SAS® Visual Analytics to automate your data manipulation and modeling. These programs offer a flexible, easy-to-use, self-service environment that can scale on an enterprise-wide level. This book introduces some of the many features of SAS Visual Data Mining and Machine Learning including: programming in the Python interface; new, advanced data mining and machine learning procedures; pipeline building in Model Studio, and model building and comparison in SAS® Visual Analytics

 

 

For health care data analytics professionals

If you work with real world health care data, you know that it is common and growing in use from sources like observational studies, pragmatic trials, patient registries, and databases. Real World Health Care Data Analysis: Causal Methods and Implementation in SAS® by Doug Faries et al. brings together best practices for causal-based comparative effectiveness analyses based on real world data in a single location. Example SAS code is provided to make the analyses relatively easy and efficient. The book also presents several emerging topics of interest, including algorithms for personalized medicine, methods that address the complexities of time varying confounding, extensions of propensity scoring to comparisons between more than two interventions, sensitivity analyses for unmeasured confounding, and implementation of model averaging.

 

For those at the cutting edge

Are you ready to take your understanding of IoT to the next level? Intelligence at the Edge: Using SAS® with the Internet of Things edited by Michael Harvey begins with a brief description of the Internet of Things, how it has evolved over time, and the importance of SAS’s role in the IoT space. The book will continue with a collection of chapters showcasing SAS’s expertise in IoT analytics. Topics include Using SAS Event Stream Processing to process real world events, connectivity, using the ESP Geofence window, applying analytics to streaming data, using SAS Event Stream Processing in a typical IoT reference architecture, the role of SAS Event Stream Manager in managing ESP deployments in an IoT ecosystem, how to use deep learning with Your IoT Digital, accounting for data quality variability in streaming GPS data for location-based analytics, and more!

 

 

 

Keep an eye out for these titles releasing in the next two months! We hope this list will help in your search for a SAS book that will get you to the next step in updating your SAS skills. To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.

Foresight is 2020! New books to take your skills to the next level was published on SAS Users.

5月 222019
 

This blog shows how the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules. Because of these capabilities highly customized models can be developed in VTA. The rules used in this blog are basic. Developing linguistic rules and accurately categorizing documents requires subject matter expertise and understanding the grammatical structure of the language(s) used.

I will use a data set that contains information on 1527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings. Two customized categories will be developed one for Children and the other for Sport movies. Because we are familiar with movies classification and MPAA ratings, it will be relatively easy to understand the rules used in this blog. The overall blog’s objective is to show how to formulate basic rules, thus their use can be extended to other fields.

SAS Visual Text Analytics (VTA) is the SAS offering designed to effectively extract insights from unstructured data in large scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and Linguistic Rules. Currently, VTA supports 30 languages and it has an open architecture supporting 3rd-party programming interfaces.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations where the insights found in one iteration are used in the next iterations. In relationship to the linguistic rules, one must determine if the new rules are an improvement over the ones used in previous iterations, and find how many true positives and false positives are matched by the new rules. This process should be repeated until one obtains the precision required.

Initial Text Analysis Using Visual Analytics

Because Visual Analytics (VA) and VTA are highly integrated, the initial Text Analysis can be done in VA.

Every Text Analytics dataset must have a unique identifier associated to each document. In my blog, Discover Main Topics on #MLKDayofService Tweets Using SAS Visual Text Analytics, I showed how to set a “Unique Row Identifier”, and how to work with the nodes in the Pipeline.

In Visual Analytics (VA) one can do the initial analysis of text data, see the Word Cloud, and a list of topics. In the Options menu, I indicated I wanted a Maximum of Topics to Generate=7.

The photo above shows that there are 364 movies with the term “kid” in the Topic “+show,+kid, +rate,+movie”. We could build a category that groups appropriate movies for kids.

There are 203 documents with the Topic related to science fiction. Therefore, if I wanted to have a category for Sport movies, I would have to build it myself because sports terms appear in fewer documents.

Create a Visual Text Analytics Project

In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the Text Analysis Process, I described in detail how to work with the nodes in my MLKDayofService blog mentioned above. Briefly, from the SAS Home menu select the action Build Models that will take you to SAS Model Studio, where you select and create a New Project.

The photo above shows the data role assignments done in the Data tab.

Notice that there is a Unique Row Identifier for each document, the Text Variable to analyze is Review, and two variables are used as Category: MPAARating and mainGenre. Later on, VTA will use these two variables to automatically create categories and their Boolean rules. Title doesn’t have a role but I want it to be displayed to facilitate the analysis.

Movies are already classified according to their main category (mainGenre), I want to see the Boolean rules that VTA automatically generates for each category, and if I can create new concepts and categories that improve on the initial categorization. For example, I would like:

  1. to find children movies that don’t contain violence,
  2. to find movies that are related to Sports,
  3. to read the reviews of my favorite old movies, and
  4. find movies whose reviews mention some of my favorite movie directors.

Method

I ran two pipelines. The first one had the default pipeline settings and also the option Include predefined concepts enabled for the Concepts node. The objective was to see the rules associated with the genres “Sports”, “Animated” and “Family”, the movies matched by these rules, as well as, the ones that shouldn’t have been matched. In the second pipeline, I developed LITI and Boolean rules with the objective of improving the default categorizations automatically produced in the first pipeline.

In the next sections I will describe how the new categories were built. In real business situations, sometimes we will have pre-defined categories available to us, and other times we will come up with categories that satisfy the business objectives after analyzing many documents.

Customized Concepts built in the Concept Node

In the second pipeline, I developed three customized concepts, I will use one of them “MySports” to build a new category later on.

Basic Boolean operators are used to define new concepts and categories. AND/NOT operators are applied to the whole document. There other operators that search within the same sentence (SENT), the same paragraph (PARA) or a number of terms (DIST).

# Any line that starts with “#” is a comment
# Use CLASSIFIER to match a literal sequence
# Use CONCEPT_RULE to use Boolean and proximity operators. The term extracted should use _c{ }

MySports Concept

I wrote this rule which matches 98 documents, most of them related to Sport movies and with few false positives. This rule will match a document if any of the terms sport, baseball, tennis, football, basketball, racetrack appear anywhere in the document (movie review) but the terms gambling, buddy or sporting do not appear anywhere in the document.

I will use this MySports CONCEPT_RULE to build the new Sports category:

CONCEPT_RULE: (AND, (OR, “_c{sport@}”,”_c{baseball}”,”_c{tennis}”,”_c{football}”,”_c{basketball}”, “_c{racetrack}”),(NOT,”sporting”),(NOT, “gambling”),(NOT,”buddy”))

filmmakersInReview Concept

I built this concept just to illustrate how to use a pre-defined concept, in this case nlpPerson:

CONCEPT_RULE: (DIST_10,(OR,”filmmaker”,”director”,”film producer”,”producer”,”movie maker”),”_c{nlpPerson}”)

favoriteMovies Concept

I built this concept to match my favorite old movies and one of my favorite directors. The first CONCEPT_RULE will match documents that contain in the same sentence the terms Stanley Kubrick and 2001. The second CONCEPT_RULE will match documents that contain the two terms anywhere in the document. Both CONCEPT_RULEs will only extract the first term "Stanley Kubrick":

CLASSIFIER:A Space Odyssey
CLASSIFIER:The Sound of Music
CLASSIFIER:Il Postino
CONCEPT_RULE:(SENT,”_c{Stanley Kubrick}”,”2001″)
CONCEPT_RULE:(AND,”_c{Stanley Kubrick}”,”A Space Odyssey”)

New Concepts in Parsing Text Node

The customized concepts developed in the Concepts node are passed to the Text Parsing Node. Notice the Terms football, sport, sports, baseball and the new Role MySports in the Kept Terms window, as well as the matched documents to the term "football":

Customized Categories in the Category Mode

In the second pipeline I developed new categories using as starting point the rules associated with the genres “Sports”, “Animated” and “Family”.

Sports Category

The input data has only 3 movies in the Sports category, it is difficult to generate a meaningful rule with such a small dataset. Once the first pipeline is ran, there are a total of 8 movies which include the 3 original ones, and 5 that are not related to sports. The automatically generated rule is:

(OR,(AND,”crowd-pleaser”),(AND,”conor”),(AND,”_x000d_stupid”))

For the second pipeline, I developed the MySports rule in the Concepts node as mentioned above, and write this Boolean rule in the Categories node:

(OR,(AND,”crowd-pleaser”),(AND,”conor”),(AND,”_x000d_stupid”),”[MySports]”)

The new rule matches 90 movies, most of them related to Sports. For the next iteration, one would need to look at the movies that don’t relate to Sports, the ones that relate to Sports and were not matched, and improve in the rule above.

ChildrenMovies Category

In the second pipeline, I combined the rules for the Family and Animation categories which were automatically produced in the first pipeline.

For the Family category, there were 6 movies matched by this rule

(OR,(AND,(OR,”adults”,”adult”),”oz”))

It matched “People vs Larry Flynt” which prompted me to use the terms “murder” and “obscenity” in the Concept rule.
The Animation category had 66 matched movies and the automatically generated rule was:

(OR,(AND,”pixar”),(AND,(OR,”animator”,”animators”)),(AND,(OR,”voiced”,”voices”,”voicing”,”voice”),(OR,”cartoon”,”cartoons”)),(AND,(OR,”cartoon characters”,”cartoon character”)),(AND,(OR,”lesson”,”lessons”),”animated”),(AND,”live action”),(AND,”jeffrey”,(OR,”features”,”feature”)),(AND,”3-d”))
I decided to modify these two rules. In the second pipeline, I used this new rule
(OR,(AND,(NOT,(OR,”adults”,”adult”,”suitable for children”,”rated R”,”strip@”,”suck@”,”crude humor”,”gore”,”horror”,”murder”,”obscenity”,”drug use@”)),”Wizard of Oz”),(AND,”pixar”),(AND,(OR,”animator”,”animators”)),(AND,(OR,”voiced”,”voices”,”voicing”,”voice”),(OR,”cartoon”,”cartoons”)),(AND,(OR,”cartoon characters”,”cartoon character”)),(AND,(OR,”lesson”,”lessons”),”animated”),(AND,”live action”),(AND,”jeffrey”,(OR,”features”,”feature”)),(AND,”3-d”))

This produced 73 movies and only two rated “R”. Therefore, I removed both the Animation and the Family categories and created the new category childrenMovies.

Again, to determine if the new rules are an improvement over the previous ones, we must find out how many true positives and false positives are matched by the new rules, and repeat the process until we obtain the precision required.

Conclusion

Because the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules, highly customized models can be developed in VTA.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations where the insights found in one iteration are used in the next iterations. In relationship to the linguistic rules, one must determine if the new rules are an improvement over the ones used in previous iterations, and find how many true positives and false positives are matched by the new rules. This process should be repeated until one obtains the precision required.

Many thanks to Teresa Jade and Biljana Belamaric Wilsey for reviewing the linguistic rules used in this blog. For more information about Visual Text Analytics, please check out:

Analysis of Movie Reviews using Visual Text Analytics was published on SAS Users.