Parallel Computations

September 14, 2020
 

My 2020 SAS Global Forum paper was about how to write custom parallel programs by using the iml action in SAS Viya 3.5. My conference presentation was canceled because of the coronavirus pandemic, but I recently recorded a 15-minute video that summarizes the main ideas in the paper.

One of the reasons I enjoy attending conferences is to obtain a high-level "mental roadmap" of a topic that I can leverage if I decide to study the details. Hopefully, this video will provide you with a roadmap of the iml action and its capabilities.


If your browser does not support embedded video, you can go directly to the video on YouTube.

The video introduces a few topics that I have written about in more detail:

For more details you can read the paper: Wicklin and Banadaki, 2020, "Write Custom Parallel Programs by Using the iml Action."

The post Video: How to Write a Custom Parallel Program in SAS Viya appeared first on The DO Loop.

August 17, 2020
 

I recently showed how to use simulation to estimate the power of a statistical hypothesis test. The example (a two-sample t test for the difference of means) is a simple SAS/IML module that is very fast. Fast is good because often you want to perform a sequence of simulations over a range of parameter values to construct a power curve. If you need to examine 100 sets of parameter values, you would expect the full computation to take about 100 times longer than one simulation.

But that calculation applies only when you run the simulations sequentially. If you run the simulations in parallel, you can complete the simulation study much faster. For example, if you have access to a machine or cluster of machines that can run 32 threads, then each thread needs to run only a few simulations. You can perform custom parallel computations by using the iml action in SAS Viya. The iml action is supported in SAS Viya 3.5.

This article shows how to distribute computations by using the PARTASKS function in the iml action. (PARTASKS stands for PARallel TASKS.) If you are not familiar with the iml action or with parallel processing, see my previous example that uses the PARTASKS function.

Use simulation to estimate a power curve

I previously showed how to use simulation to estimate the power of a statistical test. In that article, I simulated data from N(15,5) and N(16,5) distributions. Because the t test is invariant under centering and scaling operations, this study is mathematically equivalent to simulating from the N(0,1) and N(0.2,1) distributions. (Subtract 15 and divide by 5.)

Recall that the power of a statistical hypothesis test is the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true. (Equivalently, the power is 1 minus the probability of a Type 2 error, which is failing to reject the null hypothesis when the alternative is true.) The power depends on the sample sizes and the magnitude of the effect you are trying to detect. In this case, the effect is the difference between two means. You can study how the power depends on the mean difference by estimating the power for a t test between data from an N(0,1) distribution and an N(δ, 1) distribution, where δ is a parameter in the interval [0, 2]. All hypothesis tests in this article use the α=0.05 significance level.

The power curve we want to estimate is shown to the right. The horizontal axis is the δ parameter, which represents the true difference between the population means. Each point on the curve is an estimate of the power of a two-sample t test for random samples of size n1 = n2 = 10. The curve indicates that when there is no difference between the means, you will conclude otherwise in 5% of random samples (because α=0.05). For larger differences, the probability of detecting the difference increases. If the difference is very large (more than 1.5 standard deviations apart), more than 90% of random samples will correctly reject the null hypothesis in the t test.

The curve is calculated at 81 values of δ, so this curve is the result of running 81 independent simulations. Each simulation uses 100,000 random samples and carries out 100,000 t tests. Although it is hard to see, the graph also shows 95% confidence intervals for power. The confidence intervals are very small because so many simulations are run.

So how long did it take to compute this power curve, which is the result of 8.1 million t tests? About 0.4 seconds by using parallel computations in the iml action.
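If you want to experiment with the method outside SAS, the power estimate at a single value of δ can be sketched in a few lines of Python. This is an illustration of the statistical procedure, not the iml program: the function names are mine, the number of samples is reduced for speed, and the hard-coded critical value 2.1009 is the two-sided t quantile for α=0.05 and df = n1 + n2 − 2 = 18 (valid for the default samples of size 10).

```python
import math
import random

# Two-sided critical value for alpha = 0.05, df = 18 (n1 = n2 = 10)
T_CRIT = 2.1009

def reject_h0(x, y):
    """Pooled two-sample t test; return True if the test rejects H0."""
    n1, n2 = len(x), len(y)
    mx, my = sum(x) / n1, sum(y) / n2
    vx = sum((v - mx) ** 2 for v in x) / (n1 - 1)
    vy = sum((v - my) ** 2 for v in y) / (n2 - 1)
    pooled = math.sqrt(((n1 - 1) * vx + (n2 - 1) * vy) / (n1 + n2 - 2))
    t = (mx - my) / (pooled * math.sqrt(1 / n1 + 1 / n2))
    return abs(t) > T_CRIT

def estimate_power(delta, n=10, B=2000, seed=321):
    """Estimate power by simulating B pairs of samples of size n."""
    rng = random.Random(seed)
    reject = 0
    for _ in range(B):
        x = [rng.gauss(0, 1) for _ in range(n)]      # X ~ N(0, 1)
        y = [rng.gauss(delta, 1) for _ in range(n)]  # Y ~ N(delta, 1)
        reject += reject_h0(x, y)
    return reject / B

print(estimate_power(0.0))   # close to alpha = 0.05 when there is no effect
print(estimate_power(1.0))   # power exceeds 0.5 for this effect size
```

Repeating the call for a grid of δ values traces out the same power curve, albeit slowly; the point of the iml action is to run those repeated calls in parallel.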

Using the PARTASKS function to distribute tasks

Here's how you can write a program in the iml action to run 81 simulations across 32 threads. The following program uses four nodes, each with 8 threads, but you can achieve similar results by using other configurations. The main parts of the program are as follows:

  • The program assumes that you have used the CAS statement in SAS to establish a CAS session that uses four worker nodes, each with at least eight CPUs. For example, the CAS statement might look something like this:
        cas sess4 host='your_host_name' casuser='your_username' port=&MPPPort nworkers=4;
    The IML program is contained between the SOURCE/ENDSOURCE statements in PROC CAS.
  • The TTestH0 function implements the t test. It runs a t test on the columns of the input matrices and returns the proportion of samples that reject the null hypothesis. The TTestH0 function is explained in a previous article.
  • The "task" we will distribute is the SimTTest function. (This encapsulates the "main program" in my previous article.) The SimTTest function simulates the samples, calls the TTestH0 function, and returns the number of t tests that reject the null hypothesis. The first sample is always from the N(0, 1) distribution and the second sample is from the N(δ, 1) distribution. Each call creates B samples. The input to the SimTTest function is a list that contains all parameters: a random-number seed, the sample sizes (n1 and n2), the number of samples (B), and the value of the parameter (delta).
  • The main program first sets up a list of arguments for each task. The i_th item in the list is the argument to the i_th task. For this computation, all arguments are the same (a list of parameters) except that the i_th argument includes the parameter value delta[i], where delta is a vector of parameter values in the interval [0, 2].
  • The call to the PARTASKS function is
         RL = ParTasks('SimTTest', ArgList, {2, 0});
    where the first parameter is the task (or list of tasks), the second parameter is the list of arguments to the tasks, and the third parameter specifies how to distribute the computations. The documentation of the PARTASKS function provides the full details.
  • After the PARTASKS function returns, the program processes the results. For each value of the parameter, δ, the main statistic is the proportion of t tests (p) that reject the null hypothesis. The program also computes a 95% (Wald) confidence interval for the binomial proportion. These statistics are written to a CAS table called 'PowerCurve' by using the MatrixWriteToCAS subroutine.
/* Power curve computation: delta = 0 to 2 by 0.025 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the iml action set           */
source TTestPower;
 
/* Helper function: Compute t test for each column of X and Y.
   X is (n1 x m) matrix and Y is (n2 x m) matrix.
   Return the number of columns for which t test rejects H0 */
start TTestH0(x, y);
   n1 = nrow(X);     n2 = nrow(Y);     /* sample sizes                      */
   meanX = mean(x);  varX = var(x);    /* mean & var of each sample         */
   meanY = mean(y);  varY = var(y);
   poolStd = sqrt( ((n1-1)*varX + (n2-1)*varY) / (n1+n2-2) );
 
   /* Compute t statistic and indicator var for tests that reject H0 */
   t = (meanX - meanY) / (poolStd*sqrt(1/n1 + 1/n2));
   t_crit =  quantile('t', 1-0.05/2, n1+n2-2);       /* alpha = 0.05        */
   RejectH0 = (abs(t) > t_crit);                     /* 0 or 1              */
   return  RejectH0;                                 /* binary vector       */
finish;
 
/* Simulate two groups; Count how many reject H0: delta=0 */
start SimTTest(L);                     /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   x = j(L$'n1', L$'B');               /* allocate space for Group 1        */
   y = j(L$'n2', L$'B');               /* allocate space for Group 2        */
   call randgen(x, 'Normal', 0,         1);   /* X ~ N(0,1)                 */
   call randgen(y, 'Normal', L$'delta', 1);   /* Y ~ N(delta,1)             */
   return sum(TTestH0(x, y));          /* count how many reject H0          */
finish;
 
/* ----- Main Program ----- */
numSamples = 1e5;
L = [#'delta' = .,   #'n1' = 10,  #'n2' = 10,
     #'seed'  = 321, #'B'  = numSamples];
 
/* Create list of arguments. Each arg gets different value of delta */
delta = T( do(0, 2, 0.025) );
ArgList = ListCreate(nrow(delta));
do i = 1 to nrow(delta);
   L$'delta' = delta[i];
   call ListSetItem(ArgList, i, L);
end;
 
RL = ParTasks('SimTTest', ArgList, {2, 0});  /* assign nodes before threads */
 
/* Summarize results and write to CAS table for graphing */
varNames = {'Delta' 'ProbEst' 'LCL' 'UCL'};  /* names of result vars        */
Result = j(nrow(delta), 4, .);
zCrit = quantile('Normal', 1-0.05/2);  /* zCrit = 1.96                      */
do i = 1 to nrow(delta);               /* for each task                     */
   p = RL$i / numSamples;              /* get proportion that reject H0     */
   SE = sqrt(p*(1-p) / L$'B');         /* std err for binomial proportion   */
   LCL = p - zCrit*SE;                 /* 95% CI                            */
   UCL = p + zCrit*SE;
   Result[i,] = delta[i] || p || LCL || UCL;
end;
 
call MatrixWriteToCas(Result, '', 'PowerCurve', varNames);
endsource;
iml / code=TTestPower nthreads=8;
quit;

You can pull the 81 statistics back to a SAS data set and use PROC SGPLOT to plot the results. You can download the complete program that generates the statistics and graphs the power curve.

Notice that there are 81 tasks but only 32 threads. That is not a problem. The PARTASKS function tries to distribute the workload as evenly as possible. In this example, 17 threads are assigned three tasks and 15 threads are assigned two tasks. If T is the time required to perform one power estimate, the total time to compute the power curve will be approximately 3T, plus "overhead costs" such as the time required to set up the problem, distribute the parameters to each task, and aggregate the results. You can minimize the overhead costs by passing only small amounts of data across nodes.
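The even split is easy to verify with a quick sketch. The following Python function is my illustration of one natural way to divide tasks among threads (the remainder tasks go one apiece to the first few threads); it reproduces the counts stated above, although the actual PARTASKS scheduling is internal to the iml action.

```python
def tasks_per_thread(num_tasks, num_threads):
    # Split num_tasks as evenly as possible: the first (num_tasks % num_threads)
    # threads each receive one extra task.
    base, extra = divmod(num_tasks, num_threads)
    return [base + 1] * extra + [base] * (num_threads - extra)

counts = tasks_per_thread(81, 32)
print(counts.count(3), counts.count(2))  # 17 threads run 3 tasks; 15 run 2
```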

Summary

In summary, if you have a series of independent tasks in IML, you can use the PARTASKS function to distribute tasks to available threads. The speedup can be dramatic; it depends on the time required for the thread that performs the most work.

This article shows an example of using the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute a sequence of computations among k threads that run concurrently and independently. In this example, k=32. Because the tasks are essentially identical, each thread performs two or three of the 81 tasks, which is roughly 1/32 of the total work. The results from each task are returned in a list, where they can be further processed by the main program.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Estimate a power curve in parallel in SAS Viya appeared first on The DO Loop.

August 3, 2020
 

The iml action was introduced in Viya 3.5. As shown in a previous article, the iml action supports ways to implement the map-reduce paradigm, which is a way to distribute a computation by using multiple threads. The map-reduce paradigm is ideal for “embarrassingly parallel” computations, which are composed of many independent and essentially identical computations. I wrote an example program that shows how to run a Monte Carlo simulation in parallel by using the iml action in SAS Viya.

The map-reduce paradigm is often used when there is ONE task to perform and you want to assign a portion of the work to each thread. A different scenario is when you have SEVERAL independent tasks to perform. You can assign each task to a separate thread and ask each thread to compute an entire task. The iml action supports the PARTASKS function (short for "parallel tasks") for running tasks in separate threads. This article shows a simple example.

Three tasks run sequentially

Suppose you have three tasks. The first takes 2 seconds to run, the next takes 0.5 seconds, and the third takes 1 second. If you run a set of tasks sequentially, the total time is the sum of the times for the individual tasks: 2 + 0.5 + 1 = 3.5 seconds. In contrast, if you run the tasks in parallel, the total time is the maximum of the individual times: 2 seconds. Running the tasks in parallel saves time, and the speed-up depends on the ratio of the largest time to the total time, which is 2/3.5 in this example. Thus, the time for the parallel computation is 57% of the time for the sequential computation.

In general, the largest speedup occurs when the tasks each run in about the same time. In that situation (ignoring overhead costs), the total time can be reduced by a factor of as much as k when you run k tasks in k threads.
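The arithmetic in the three-task example is simple enough to check directly. A tiny Python sketch (illustrative only) of sequential versus parallel time:

```python
times = [2.0, 0.5, 1.0]        # task times in seconds
sequential = sum(times)        # 3.5 s when the tasks run one after another
parallel = max(times)          # 2.0 s when each task gets its own thread
print(round(parallel / sequential, 2))  # parallel time is about 57% of sequential
```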

Let's run three tasks sequentially, then run the same tasks in parallel. Suppose you have a large symmetric N x N matrix and you need to compute three things: the determinant, the matrix inverse, and the eigenvalues. The following call to PROC IML runs the three tasks sequentially. The TOEPLITZ function is used to create a large symmetric matrix.

proc iml;
N = 2000;
A = toeplitz( (N:1)/100 );     /* N x N symmetric matrix */
 
det = det(A);                  /* get det(A) */
inv = inv(A);                  /* get inv(A) */
eigval = eigval(A);            /* get eigval(A) */

The previous computations are performed by directly calling three built-in SAS/IML functions. Of course, in practice, the tasks will be more complex than these. Almost surely, each task will be encapsulated into a SAS/IML module. To emulate that process, I am going to define three trivial "task modules." The following statements perform the same computations, but now each "task" is performed by calling a user-written module:

start detTask(A);
   return det(A);
finish;
start invTask(A);
   return inv(A);
finish;
start eigvalTask(A);
   return eigval(A);
finish;
 
det = detTask(A);              /* get det(A) */
inv = invTask(A);              /* get inv(A) */
eigval = eigvalTask(A);        /* get eigval(A) */

The syntax of the PARTASKS function

You can use the PARTASKS function to run the three tasks concurrently. To use the PARTASKS function, each task must be a user-defined SAS/IML function that has exactly one input argument and returns one output argument. For our example, the detTask, invTask, and eigvalTask modules satisfy these requirements. (If a task requires more than one input argument, you can pack the arguments into a list and pass the list as a single argument.)

The syntax of the PARTASKS function is
result = PARTASKS( Tasks, TaskArgs, Options );
where

  • Tasks is a character vector that names the SAS/IML functions. For our example, Tasks = {'detTask' 'invTask' 'eigvalTask'};
  • TaskArgs is a list of arguments. The i_th argument will be sent to the i_th module. In this example, each module takes the same matrix, so you can create a list that has three items, each being the same matrix: TaskArgs = [A, A, A];
  • Options is an (optional) vector that specifies how the tasks are distributed. The documentation of the PARTASKS function gives the full details, but the first element of the vector specifies how to distribute the computation to nodes and threads, and the second element specifies whether to print information about the performance of the tasks. For this example, you can choose options = {1, 1};, which will distribute tasks to threads on the controller node and print information about the tasks.
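The calling pattern may be easier to see by analogy. The following Python sketch is a hypothetical stand-in for PARTASKS (the name par_tasks and its signature are mine, not the iml action API): it takes a list of one-argument functions and a matching list of arguments, runs each pair in its own thread, and returns the results in task order.

```python
from concurrent.futures import ThreadPoolExecutor

def par_tasks(tasks, args, max_workers=4):
    """Run tasks[i](args[i]) concurrently; return results in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(f, a) for f, a in zip(tasks, args)]
        return [f.result() for f in futures]

# Three trivial "tasks" applied to the same argument, as in the SAS example
A = [3, 1, 4, 1, 5]
print(par_tasks([min, max, sum], [A, A, A]))  # [1, 5, 14]
```

As in the iml action, the i_th result corresponds to the i_th task, regardless of which task finishes first.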

Parallel tasks

Suppose you are running SAS Viya and you have access to a machine that has at least four threads. You can use the CAS statement to define a session that will use only one machine (no worker nodes). The CAS statement might look something like this:
cas sess0 host='your_host_name' casuser='your_username' port=&YourPort; /* SMP, 0 workers */
Then you can run the PARTASKS function by making the following call to the iml action. If you are not familiar with the iml action, see the getting started example for the iml action.

proc cas;
session sess0;                         /* SMP session: controller node only    */
loadactionset 'iml';                   /* load the action set               */
source partasks;                       /* put program in SOURCE/ENDSOURCE block */
   start detTask(A);                   /* 1st task */
      return det(A);
   finish;
   start invTask(A);                   /* 2nd task */
      return inv(A);
   finish;
   start eigvalTask(A);                /* 3rd task */
      return eigval(A);
   finish;
 
   /* ----- Main Program ----- */
   N = 2000;
   A = toeplitz( (N:1)/100 );          /* N x N symmetric matrix */
   Tasks = {'detTask' 'invTask' 'eigvalTask'};
   Args = [A, A, A];                   /* each task gets same arg in this case */
   opt = {1,                           /* distribute to threads on controller  */
          1};                          /* display information about the tasks  */
   Results = ParTasks(Tasks, Args, opt); /* results are returned in a list     */
 
   /* the i_th list item is the result of the i_th task */
   det    = Results$1;                 /* get det(A)                           */
   inv    = Results$2;                 /* get inv(A)                           */
   eigval = Results$3;                 /* get eigval(A)                        */
endsource;
iml / code=partasks nthreads=4;
run;

The results are identical to the sequential computations in PROC IML. However, this program executes the three tasks in parallel, so the total time for the computations is the time required by the longest task.

Visualize the parallel tasks computations

You can use the following diagram to help visualize the program. (Click to enlarge.)

The diagram shows the following:

  • The CAS session does not include any worker nodes. Therefore, the iml action runs entirely on the controller, although it can still use multiple threads. This mode is known as symmetric multiprocessing (SMP) or single-machine mode. Notice that the call to the iml action (the statement just before the RUN statement) specifies the NTHREADS=4 parameter, which causes the action to use four threads. Because there are only three tasks, one thread remains idle.
  • The program defines the matrix, A, the vector of module names, and the list of arguments. When you call the ParTasks function, the i_th argument is sent to the i_th module on the i_th thread.
  • Each thread runs a different function module. Each function returns its result. The results are returned in a list. The i_th item in the list is the result of the call to the i_th function module.

Summary

This article demonstrates a simple call to the PARTASKS function in the iml action, which is available in SAS Viya 3.5. The example shows how to distribute tasks among k threads. Each thread runs concurrently and independently. The total time for the computation depends on the longest time that any one thread spends on its computation.

This example is simple but demonstrates the main ideas of distributing tasks in parallel. In a future article, I will present a more compelling example.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post Run tasks in parallel in SAS Viya appeared first on The DO Loop.

July 13, 2020
 

A previous article introduces the MAPREDUCE function in the iml action. (The iml action was introduced in Viya 3.5.) The MAPREDUCE function implements the map-reduce paradigm, which is a two-step process for distributing a computation to multiple threads. The example in the previous article adds a set of numbers by distributing the computation among four threads. During the "map" step, each thread computes the partial sum of one-fourth of the numbers. During the "reduce" step, the four partial sums are added to obtain the total sum. The same map-reduce framework can be used to implement a wide range of parallel computations. This article uses the MAPREDUCE function in the iml action to implement a parallel version of a Monte Carlo simulation.

Serial Monte Carlo simulation

Before running a simulation study in parallel, let's present a serial implementation in the SAS/IML language. In a Monte Carlo simulation, you simulate B random samples of size N from a probability distribution. For each sample, you compute a statistic. When B is large, the distribution of the statistics is a good approximation to the true sampling distribution for the statistic.

The example in this article uses simulation to approximate the sampling distribution of the sample mean. You can use the sampling distribution to estimate the standard error and to estimate a confidence interval for the mean. This example appears in the book Simulating Data with SAS (Wicklin 2013, pp. 55–57) and in the paper "Ten Tips for Simulating Data with SAS" (Wicklin 2015, pp. 6–9). The SAS/IML program runs a simulation to approximate the sampling distribution of the sample mean for a random sample of size N=36 that is drawn from the uniform U(0, 1) distribution. It simulates B=1E6 (one million) samples. The main function is the SimMeanUniform function, which does the following:

  1. Uses the RANDSEED subroutine to set a random number seed.
  2. Uses the RANDGEN subroutine to generate B samples of size N from the U(0, 1) distribution. In the N x B matrix, X, each column is a sample and there are B samples.
  3. Uses the MEAN function to compute the mean of each sample (column) and returns the row vector of the sample means.

The remainder of the program estimates the population mean, the standard error of the sample mean, and a 95% confidence interval for the population mean.

proc iml;
start SimMeanUniform(N, seed, B); 
   call randseed(seed);            /* set random number stream */
   x = j(N, B);                    /* allocate NxB matrix for samples */
   call randgen(x, 'Uniform');     /* simulate from U(0,1) */
   return mean(x);                 /* return row vector = mean of each column */
finish;
 
/* Simulate and return Monte Carlo distribution of mean */
stat = SimMeanUniform(36,          /* N = sample size */
                      123,         /* seed = random number seed */
                      1E6);        /* B = total number of samples */
 
/* Monte Carlo estimates: mean, std err, and 95% CI of mean */
alpha = 0.05;
stat = colvec(stat);
numSamples = nrow(stat);
MCEst = mean(stat);                /* estimate of mean, */
SE = std(stat);                    /* standard deviation, */
call qntl(CI, stat, alpha/2 || 1-alpha/2); /* and 95% CI */
R = numSamples || MCEst || SE || CI`;      /* combine for printing */
print R[format=8.4 L='95% Monte Carlo CI (Serial)'
        c={'NumSamples' 'MCEst' 'StdErr' 'LowerCL' 'UpperCL'}];
Monte Carlo simulation results from a serial PROC IML program in SAS

This program takes less than 0.5 seconds to run. It simulates one million samples and computes one million statistics (sample means). More complex simulations take longer and can benefit by distributing the computation to many threads, as shown in the next section.

A distributed Monte Carlo simulation

Suppose that you have access to a cluster of four worker nodes, each of which runs eight threads. You can distribute the simulation across the 32 threads and ask each thread to perform 1/32 of the simulation. Specifically, each thread can simulate 31,250 random samples from U(0,1) and return the sample means. The sample means can then be concatenated into a long vector and returned. The rest of the program (the Monte Carlo estimates) does not need to change.

The goal is to get the SimMeanUniform function to run on all 32 available threads. One way to do that is to use the SimMeanUniform function as the mapping function that is passed to the MAPREDUCE function. To use SimMeanUniform as a mapping function, you need to make two small modifications:

  • The function currently takes three arguments: N, seed, and B. But a mapping function that is called by the MAPREDUCE function is limited to a single argument. The solution is to pack the arguments into a list. For example, in the main program define L = [#'N'=36, #'seed' = 123, #'B' = 1E6] and pass that list to the MAPREDUCE function. In the definition of the SimMeanUniform module, define the signature as SimMeanUniform(L) and access the parameters as L$'N', L$'seed', and L$'B'.
  • The SimMeanUniform function currently simulates B random samples. But if this function runs on 32 threads, we want each thread to generate B/32 random samples. One solution would be to explicitly pass in B=1E6/32, but a better solution is to use the NPERTHREAD function in the iml action. The NPERTHREAD function uses the FLOOR-MOD trick to determine how many samples should be simulated by each thread. You specify the total number of simulations, and the NPERTHREAD function uses the thread ID to determine the number of simulations for each thread.
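The FLOOR-MOD bookkeeping can be sketched in a few lines of Python. The function name and signature below are hypothetical, chosen to illustrate the idea rather than to mirror the NPERTHREAD API: every thread simulates floor(B/k) samples, and the first (B mod k) threads simulate one extra so that the counts total B.

```python
def n_per_thread(total, thread_id, num_threads):
    # FLOOR-MOD trick: base samples per thread, plus one extra for the
    # first (total mod num_threads) threads.
    base, extra = divmod(total, num_threads)
    return base + (1 if thread_id < extra else 0)

counts = [n_per_thread(1_000_000, t, 32) for t in range(32)]
print(min(counts), max(counts), sum(counts))  # each of 32 threads gets 31250
```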

With these two modifications, you can implement the Monte Carlo simulation in parallel in the iml action. Because the mapping function returns a row vector, the reducer will be horizontal concatenation, which is available as the built-in "_HCONCAT" reducer. Thus, although each thread returns a row vector that has 31,250 elements, the MAPREDUCE function will return a row vector that has one million elements. The following program implements the parallel simulation in the iml action. If you are not familiar with using PROC CAS to call the iml action, see the getting started example for the iml action.

/* Simulate B independent samples from a uniform(0,1) distribution.
   Mapper: Generate M samples, where M ~ B/numThreads. Return M statistics.
   Reducer: Concatenate the statistics.
   Main Program: Estimate standard error and CI for mean.
 */
proc cas;
session sess4;                         /* use session with four workers     */
loadactionset 'iml';                   /* load the action set               */
source SimMean;                        /* put program in SOURCE/ENDSOURCE block */
 
start SimMeanUniform(L);               /* define the mapper                 */
   call randseed(L$'seed');            /* each thread uses different stream */
   M = nPerThread(L$'B');              /* number of samples for thread      */
   x = j(L$'N', M);                    /* allocate NxM matrix for samples   */
   call randgen(x, 'Uniform');         /* simulate from U(0,1)              */
   return mean(x);                     /* row vector = mean of each column  */
finish;
 
/* ----- Main Program ----- */
/* Put the arguments for the mapper into a list */
L = [#'N'    = 36,                     /* sample size                       */
     #'seed' = 123,                    /* random number seed                */
     #'B'    = 1E6 ];                  /* total number of samples           */
/* Simulate on all threads; return Monte Carlo distribution */
stat = MapReduce(L, 'SimMeanUniform', '_HCONCAT');   /* reducer is "horiz concatenate" */
 
/* Monte Carlo estimates: mean, std err, and 95% CI of mean */
alpha = 0.05;
stat = colvec(stat);
numSamples = nrow(stat);
MCEst = mean(stat);                    /* estimate of mean,                 */
SE = std(stat);                        /* standard deviation,               */
call qntl(CI, stat, alpha/2 || 1-alpha/2);     /* and 95% CI                */
R = numSamples || MCEst || SE || CI`;          /* combine for printing      */
print R[format=8.4 L='95% Monte Carlo CI (Parallel)'
        c={'NumSamples' 'MCEst' 'StdErr' 'LowerCL' 'UpperCL'}];
 
endsource;                             /* END of the SOURCE block */
iml / code=SimMean nthreads=8;         /* run the iml action in 8 threads per node */
run;
Monte Carlo simulation results from a parallel program in the iml action in SAS Viya

The output is very similar to the output from PROC IML. There are small differences because this program involves random numbers. The random numbers used by the serial program are different from the numbers used by the parallel program. By the way, each thread in the parallel program uses an independent set of random numbers. In a future article, I will discuss generating random numbers in parallel computations.

Notice that the structure of the parallel program is very similar to the serial program. The changes needed to "parallelize" the serial program are minor for this example. The benefit, however, is huge: The parallel program runs about 30 times faster than the serial program.

Summary

This article shows an example of using the MAPREDUCE function in the iml action, which is available in SAS Viya 3.5. The example shows how to divide a simulation study among k threads that run concurrently and independently. Each thread runs a mapping function, which simulates and analyzes one-kth of the total work. (In this article, k=32.) The results are then concatenated to form the final answer, which is a sampling distribution.

For comparison, the Monte Carlo simulation is run twice: first as a serial program in PROC IML and again as a parallel program in the iml action. Converting the program to run in parallel requires some minor modifications but results in a major improvement in overall run time.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post A parallel implementation of Monte Carlo simulation in SAS Viya appeared first on The DO Loop.

July 8, 2020
 

The iml action in SAS Viya (introduced in Viya 3.5) provides a set of general programming tools that you can use to implement a custom parallel algorithm. This makes the iml action different than other Viya actions, which use distributed computations to solve specific problems in statistics, machine learning, and artificial intelligence. By using the iml action, you can use programming statements to define the problem you want to solve. One of the simplest ways to run a parallel program is to use the MAPREDUCE function in the iml action, which enables you to distribute a computation across threads and nodes. This article describes the MAPREDUCE function and gives an example.

What is the map-reduce paradigm?

The MAPREDUCE function implements the map-reduce paradigm, which is a two-step process for distributing a computation. The MAPREDUCE function runs a SAS/IML module (called the mapping function, or the mapper) on every available node and thread in your CAS session. Each mapping function returns a result. The results are aggregated by using a reducing function (the reducer). The final aggregated result is returned by the MAPREDUCE function. The MAPREDUCE function is ideal for “embarrassingly parallel” computations, which are composed of many independent and essentially identical computations. Examples in statistics include Monte Carlo simulation and resampling methods such as the bootstrap. A Wikipedia article about the map-reduce framework includes other examples and more details.

A simple map-reduce example: Adding numbers

Perhaps the simplest map-reduce computation is to add a large set of numbers in a distributed manner. Suppose you have N numbers to add, where N is large. If you have access to k threads, you can ask each thread to add approximately N/k numbers and return the sum. The mapper function on each thread computes a partial sum. The next step is the reducing step. The k partial sums are passed to the reducer, which adds them and returns the total sum. In this way, the map-reduce paradigm computes the sum of the N numbers in parallel. For embarrassingly parallel problems, the map-reduce operation can reduce the computational time by up to a factor of k, if you do not pass huge quantities of data to the mappers and reducers.

You can use the MAPREDUCE function in the iml action to implement the map-reduce paradigm. The syntax of the MAPREDUCE function is
result = MAPREDUCE( mapArg, 'MapFunc', 'RedFunc' );
In this syntax, 'MapFunc' is the name of the mapping function and mapArg is a parameter that is passed to the mapper function in every thread. The 'RedFunc' argument is the name of the reducing function. You can use a predefined (built-in) reducer, or you can define your own reducer. This article uses only predefined reducers.

Let's implement the sum-of-N-numbers algorithm in the iml action by using four threads to sum the numbers 1, 2, ..., 1000. (For simplicity, I chose N divisible by 4, so each thread sums N/4 = 250 numbers.) There are many ways to send data to the mapper. This program packs the data into a matrix that has four rows and tells each thread to analyze one row. The program defines a helper function (getThreadID) and a mapper function (AddRow), which will run on each thread. The AddRow function does the following:

  1. Call the getThreadID function to find the thread in which the AddRow function is running. The getThreadID function is a thin wrapper around the NODEINFO function, which is a built-in function in the iml action. The thread ID is stored in the variable j.
  2. Extract the j_th row of numbers. These are the numbers that the thread will sum.
  3. Call the SUM function to compute the partial sum. Return that value.

In the following program, the built-in '_SUM' reducer adds the partial sums and returns the total sum. The program prints the result from each thread. Because the computation is performed in parallel, the order of the output is arbitrary and can vary from run to run. If you are not familiar with using PROC CAS to call the iml action, see the getting started example for the iml action.

/* assume SESS0 is a CAS session that has 0 workers (controller only) and at least 4 threads */
proc cas;
session sess0;                         /* SMP session: controller node only        */
loadactionset 'iml';                   /* load the action set                      */
source MapReduceAdd;
   start getThreadID(j);               /* this function runs on the j_th thread    */
      j = nodeInfo()$'threadId';       /* put thread ID into the variable j        */
   finish;
   start AddRow(X);
      call getThreadId(j);             /* get thread ID                            */
      sum = sum(X[j, ]);               /* compute the partial sum for the j_th row */
      print sum;                       /* print partial sum for this thread        */
      return sum;                      /* return the partial sum                   */
   finish;
 
   /* ----- Main Program ----- */
   x = shape(1:1000, 4);                      /* create a 4 x 250 matrix           */
   Total = MapReduce(x, 'AddRow', '_SUM');    /* use built-in _SUM reducer         */
   print Total;
endsource;
iml / code=MapReduceAdd nthreads=4;
run;

The output shows the results of the PRINT statement in each thread. The output can appear in any order. For this run, the first output is from the second thread, which computes the sum of the numbers 251, 252, . . . , 500 and returns the value 93,875. The next output is from the fourth thread, which computes the sum of the numbers 751, 752, . . . , 1000. The other threads perform similar computations. The partial sums are sent to the built-in '_SUM' reducer, which adds them together and returns the total sum to the main program. The total sum is 500,500 and appears at the end of the output.
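As a quick sanity check of the reported values (done here in Python, since the arithmetic is language-independent), thread j sums the numbers 250*j + 1 through 250*(j + 1):

```python
# partial sum computed by thread j (j = 0, 1, 2, 3)
partials = [sum(range(250*j + 1, 250*(j + 1) + 1)) for j in range(4)]
print(partials)        # [31375, 93875, 156375, 218875]
print(sum(partials))   # 500500
```

The second entry, 93,875, matches the partial sum that the second thread printed, and the reduced total matches the 500,500 returned by the MAPREDUCE function.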

Visualize the map-reduce computations

You can use the following diagram to help visualize the program.

The following list explains parts of the diagram:

  • My CAS session did not include any worker nodes. Therefore, the iml action runs entirely on the controller, although it can still use multiple threads. This mode is known as symmetric multiprocessing (SMP) or single-machine mode. Notice that the call to the iml action (the statement just before the RUN statement) specifies the NTHREADS=4 parameter, which causes the action to use four threads.
  • The program defines the matrix X, which has four rows. This matrix is sent to each thread.
  • Each thread runs the AddRow function (the mapper). The function uses the NODEINFO function to determine the thread it is running in. It then sums the corresponding row of the X matrix and returns that partial sum.
  • The reducer combines the results of the four mappers. In this case, the reducer is '_SUM', so the reducer adds the four partial sums. This sum is the value that is returned by the MAPREDUCE function.

Summary

This article demonstrates a simple call to the MAPREDUCE function in the iml action, which is available in SAS Viya 3.5. The example shows how to divide a task among k threads. Each thread runs concurrently and independently. Each thread runs a mapping function, which computes a portion of the task. The partial results are then sent to a reducing function, which assembles the partial results into a final answer. In this example, the task is to compute a sum. The numbers are divided among four threads, and each thread computes part of the sum. The partial sums are then sent to a reducer to obtain the overall sum.

This example is one of the simplest map-reduce examples. Obviously, you do not need parallel computations to add a set of numbers, but I hope this simple example enables you to focus on the map-reduce architecture. In a future article, I will present a more compelling example.

For more information and examples, see Wicklin and Banadaki (2020), "Write Custom Parallel Programs by Using the iml Action," which is the basis for these blog posts. Another source is the SAS IML Programming Guide, which includes documentation and examples for the iml action.

The post A general method for parallel computation in SAS Viya appeared first on The DO Loop.