April 21, 2021
 

Ranking is a fundamental concept in statistics. Ranks of univariate data are used by statisticians to estimate statistics such as percentiles (quantiles) and empirical distributions. A more advanced use is to compute various rank-based measures of correlation or association between pairs of variables. For example, ranks are used to compute the Spearman rank correlation.

The Spearman correlation uses univariate ranks. That is, the Spearman correlation between variables X and Y is determined by computing the tied ranks for X and Y separately. Other bivariate measures of association use a different type of ranking, which is known as the bivariate rank. In a bivariate ranking, the pairs of (X,Y) values are ranked. A bivariate ranking assigns a rank to the pairs by using the X values, the Y values, and the joint values. This article provides an example of a bivariate ranking scheme, which is used in computing a statistic known as Hoeffding's dependence coefficient.

A formula for a bivariate rank

The SAS/IML language supports the BRANKS function, which computes bivariate (tied) ranks according to the following formula. If the data are pairs of values {(Xi, Yi) | i=1,2,...,n}, then the bivariate rank of the ith point is
\(Q_i = 3/4 + \sum\nolimits_j u(X_i - X_j) u(Y_i - Y_j)\)
where the sum is over all n points (including j=i) and u is a step function that indicates whether a value is less than a given value. Tied values are counted as 0.5. Specifically, u(t)=1 if t>0, u(t)=1/2 if t=0, and u(t)=0 otherwise.

You can think of the formula as a (scaled) estimate of the bivariate cumulative distribution function (CDF). If you assume that all data values are distinct (no tied values), then the argument to the u function is never 0, and the formula counts how many data points have an X coordinate less than Xi and (simultaneously) a Y coordinate less than Yi. If the data have tied values in one or both coordinates, then the formula for the rank of P = (Xi, Yi) says:

  • Add 1 for every data point that is less than P in both coordinates.
  • Add 1/2 for every data point that has one coordinate less than P and the other coordinate equal to the corresponding coordinate of P.
  • Add 1/4 for every data point that is equal to P.
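The three rules above follow directly from the formula. As a minimal sketch in plain Python (not SAS; the function name bivariate_ranks and the four-point sample are made up for illustration), you can evaluate the formula directly:

```python
def u(t):
    """Step function from the formula: 1 if t > 0, 1/2 if t == 0, 0 otherwise."""
    if t > 0:
        return 1.0
    if t == 0:
        return 0.5
    return 0.0


def bivariate_ranks(points):
    """Evaluate Q_i = 3/4 + sum_j u(x_i - x_j) * u(y_i - y_j) for each point.

    The sum runs over all points, including j == i, which contributes 1/4.
    """
    ranks = []
    for xi, yi in points:
        total = sum(u(xi - xj) * u(yi - yj) for xj, yj in points)
        ranks.append(0.75 + total)
    return ranks


# Hypothetical sample: the last point is a repeat of the third point.
sample = [(1, 1), (1, 2), (2, 2), (2, 2)]
ranks = bivariate_ranks(sample)
for point, rank in zip(sample, ranks):
    print(point, rank)
```

For the point (2,2), the sum adds 1 for (1,1), 1/2 for (1,2), and 1/4 for each of the two copies of (2,2), so its rank is 3/4 + 2 = 2.75.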

Compute bivariate ranks in SAS

Let's apply the BRANKS function to a sample that has nine observations. The ninth observation is a repeat of the eighth observation. The BRANKS function returns a matrix that has three columns:

  • The first column of the output is the tied ranks (using the MEAN method) of the X coordinate of the data. (You can read a previous article to learn more about univariate tied ranks.)
  • The second column is the tied ranks of the Y coordinate of the data.
  • The third column is the bivariate ranks of the (X,Y) values, which is computed by using the formula in the previous section.
/* BIVARIATE RANKS */
proc iml;
w = {10 20, 
     10 21, 
     10 22, 
     11 20, 
     11 21, 
     11 22, 
     12 20, 
     12 22, 
     12 22 };  /* last row is a repeat */
Ranks = branks(w);
print w[c={x y}], Ranks[c={'RankX' 'RankY' 'BRank'}];

The first two columns are univariate tied ranks of the X and Y coordinates. The third column is the bivariate ranking of the pairs of points. If you think about plotting the points on a scatter plot, points that are in the lower-left corner of the plot have the lowest ranks, and points in the upper-right corner have the highest ranks.

Checking the BRANKS output

To ensure that the BRANKS function does, in fact, use the formula in the documentation, I wrote the following function, which evaluates the formula "manually." The following statements verify that the formula gives the same values as the third column of the BRANKS function:

/* compute bivariate ranks manually */
start BivarRank(xy);
   x = xy[,1];
   y = xy[,2];
   n = nrow(x);
   Q = j(n, 1, .);
   do i = 1 to n;
      ux = (x[i] > x) + 0.5*(x[i] = x); /* u(x[i]-x[j]): 1 if x[j] < x[i]; 0.5 if tied */
      uy = (y[i] > y) + 0.5*(y[i] = y); /* u(y[i]-y[j]): 1 if y[j] < y[i]; 0.5 if tied */
      Q[i] = 0.75 + sum( ux#uy );       /* bivariate rank of (x[i], y[i]) */
   end;
   return Q;
finish;
 
Q = BivarRank(w);
bivarRank = Ranks[,3];
print Q bivarRank (Q-bivarRank)[L="Diff"];

A visualization of bivariate ranks

It's not clear (to me) how this formula assigns ranks to a cloud of points in a scatter plot. We know that the points in the lower-left corner of the graph have low bivariate ranks and that points in the upper-right corner have high bivariate ranks. However, it is not clear what happens in the middle.

Let's generate some data and find out! The following program statements generate 1000 random uniform points in the unit square. The points and the bivariate ranks are written to a SAS data set. PROC SGPLOT displays the points and colors the markers according to the bivariate rank, as follows:

/* compute bivariate ranks for random data */
call randseed(1234);
xy = randfun({1000 2}, "Uniform" );  /* 1000 random uniform points in [0,1]x[0,1] */
bivarRank = branks(xy);              /* third column contains bivariate ranks */
m = bivarRank || xy;
create BivarRanks from m[c={'rx' 'ry' 'brank' 'x' 'y'}];
append from m;
close;
QUIT;
 
/* palette("spectral",9) */
%let colorRamp = CXD53E4F CXF46D43 CXFDAE61 CXFEE08B CXFFFFBF CXE6F598 CXABDDA4 CX66C2A5 CX3288BD;
title "Bivariate Ranks of 1000 Points";
proc sgplot data=BivarRanks aspect=1;
   scatter x=x y=y / colorresponse=brank markerattrs=(symbol=CircleFilled)
   colormodel=(&colorRamp);
run;

The colors of the markers indicate the bivariate ranks of the observations. The graph indicates that low ranks are assigned to points whose X or Y coordinates are small. High ranks are assigned only when both coordinates are large.

Notice that the ranks are not uniformly distributed among the 1000 points. That is because there are many tied ranks for the lower ranks. For example, about 50% of the points have ranks less than 200. Only 10% have ranks greater than 600.
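One way to check these percentages: if the bivariate rank divided by n is roughly the joint CDF (an assumption, and the connection that the next section discusses), then for independent uniform coordinates the scaled rank of a point (x, y) is approximately the product xy. The product of two independent uniform variates satisfies P(XY ≤ t) = t − t ln(t) for 0 < t ≤ 1. The following Python sketch (not part of the original SAS program; the function name is hypothetical) evaluates that expression at the cutoffs 200/1000 and 600/1000:

```python
import math


def prob_product_at_most(t):
    """P(X*Y <= t) for independent Uniform(0,1) variates X and Y, 0 < t <= 1."""
    return t - t * math.log(t)


# Approximate fraction of points whose bivariate rank is below 200 (of 1000)
frac_below_200 = prob_product_at_most(0.2)
# Approximate fraction of points whose bivariate rank is above 600 (of 1000)
frac_above_600 = 1.0 - prob_product_at_most(0.6)

print(f"rank < 200: {frac_below_200:.3f}")
print(f"rank > 600: {frac_above_600:.3f}")
```

The two values are about 0.52 and 0.09, which agrees with the "about 50%" and "10%" figures quoted above.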

A connection with bivariate CDF

As I indicated earlier, the formula for the bivariate ranks is reminiscent of the definition of a bivariate CDF. In retrospect, I should not have been surprised to see that coloring the observations by their bivariate rank looks a lot like a two-dimensional CDF. For example, the following graph shows the CDF for the bivariate normal distribution:

Summary

This article discusses the concept of a bivariate rank for ordered pairs. In a bivariate rank, both the X and Y coordinates are used to assign a rank. The formula that computes the bivariate rank is not complicated, but I did not initially understand how it assigns ranks to points in a scatter plot. As usual, a visualization helps. The visualization shows that bivariate ranks are conceptually similar to the computation of a two-dimensional CDF.

The post Compute bivariate ranks appeared first on The DO Loop.

April 21, 2021
 
A customer recently contacted SAS Technical Support and wanted to know how he could generate a report that displays just the column headings for a data set (or table) that does not contain any records. Rather than just omitting the missing data, he wanted to provide his customers with a visual way to see where data was missing.

This blog demonstrates how to create a report that provides only the column headings for data that is missing. The blog also explains how to create, select, and exclude output objects as well as how to generate reports with the SAS® Output Delivery System (ODS). These concepts are relevant to the task of generating a report with the column headings for a data set that contains no (0) observations.

The first section below provides some basic information that you need to understand about ODS. Specifically, it discusses ODS destinations along with the concept of output objects and how they work in ODS.

The second section explains tools that enable you to achieve the desired output for the report:

  • How to use the ODS SELECT statement to specify certain objects that you want to send to the destination for the report.
  • How to use a SAS® macro and dictionary tables to display column headings when no output object is generated.

The last section provides a code example that unites all of these concepts. The end result is a report that contains only column headings from the WORK.CLASS data set and information from the Moments output object in the SASHELP.CLASS data set.

Understanding the SAS® Output Delivery System

The SAS Output Delivery System has many destinations that you can use to generate files in various formats. Some of these destinations generate files in third-party formats such as XLSX (Excel), DOCX (Word), PPTX (PowerPoint), and HTML (HTML, HTML5). Other types of destinations are available, too. Examples include ODS Package, which generates Archive (or ZIP) files, and ODS Document, which generates binary objects from SAS DATA steps or procedures.

The foundation for the output delivery system is an output object, which is generated when a SAS® procedure or DATA step is executed. The output object is generated when you combine text and numbers with a template definition.

SAS® Output Delivery System

DATA steps generate only one output object, whereas procedures can generate one or more output objects.

Using ODS statements to select or exclude output objects

To see the contents of an output object, you can use the ODS TRACE statement to generate trace records. A trace record displays the object name, the template location, and the label.

The following example generates trace records for the SASHELP.CLASS data set:

ods trace on;
proc univariate data=sashelp.class;
run;

The output from this code is shown below:

ODS TRACE output

Once you discover an object's name, you can choose to select or exclude it from the output by using either the ODS SELECT statement or the ODS EXCLUDE statement.

For example, using the previous code example, you can use the ODS SELECT statement to choose a specific output object.

ods trace on;
ods select moments;
proc univariate data=sashelp.class;
run;

In this example, the ODS SELECT statement selects just the Moments object and sends it to any open ODS destinations so that the object's data can be printed in a report. No other objects are sent to the destination.

The ODS TRACE statement generates the following output in the trace log since only the Moments object is specified.

Moments object

Generating a report with column headings for a data set with 0 records and with information from a selected output object

The information in the previous section is helpful for data sets that contain records. But it is not helpful if you want to generate a report that shows column headings from a data set that does not have any records.

When a table has no records, it does not generate an output object. Because no object is generated, ODS cannot display headings.

However, you can use another strategy with ODS to display headings from a data set with no records. You can use dictionary tables to obtain the names of the columns in a table that has no records. Dictionary tables are created automatically by SAS® to store information related to SAS libraries, SAS system options, SAS catalogs, and so on. These tables enable you to query information about a data set (column names, titles, and so on).

To accomplish the task at hand, you can use a dictionary table and a SAS macro with ODS. The macro is used to verify whether you are processing a zero-observation data set:

  1. If the data set that is passed is not a zero-observation data set, the macro runs the procedure on that data set.
  2. If the data set that is passed is a zero-observation table, the SQL procedure is used with the DICTIONARY.COLUMNS table to query only the column names in the table. Then, the column names are transposed with the TRANSPOSE procedure, and they are displayed with the REPORT procedure.

Note: The column headings are arranged vertically. You need to transpose them so that they are displayed horizontally in the report that is generated by the ODS destination.

Creating the report with SAS® Output Delivery System, a SAS® macro, and dictionary tables

The following example illustrates the strategy that is described in the last section:

/* Sample table with 0 observations */
data work.class;
set sashelp.class;
stop;
run;
ods excel file="sample.xlsx" options(embedded_titles="yes");
 
%macro test(libref=,dsn=);
%let rc=%sysfunc(open(&libref..&dsn,i));
%let nobs=%sysfunc(attrn(&rc,NOBS));
%let close=%sysfunc(CLOSE(&rc));
%if &nobs ne 0 and %sysfunc(Exist(&libref..&dsn)) %then %do;
 
title "Report for Company XYZ";
ods select moments;
proc univariate data=&libref..&dsn;
run;
 
%end;
%else %do;
 
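/* Query the column names from the dictionary table */
proc sql noprint;
create table temp as
select name from dictionary.columns
where libname=%upcase("&libref") and memname=%upcase("&dsn");
quit;
/* Transpose the names so that they are arranged horizontally */
proc transpose data=temp out=temp1(drop=_label_ _name_);
id name;
var name;
run;
/* Display the names as a single row that is styled like a header */
proc report data=temp1 noheader style(column)=header[just=center] nowd;
title "No data for data set &libref..&dsn";
run;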
%end;
%mend;
%test(libref=work,dsn=class)
%test(libref=sashelp,dsn=class)
 
ods excel close;

As you can see below, the report that is generated shows the column headings for the empty WORK.CLASS data set as well as the data from the Moments object from the SASHELP.CLASS data set.

No data for data set work.class

Report for Company XYZ

Creating a report that displays only the column headings for a data set containing 0 records was published on SAS Users.

April 20, 2021
 

I can’t believe it’s true, but SAS Global Forum is just over a month away. I have some exciting news to share with you, so let’s start with the theme for this year:

New Day. New Answers. Inspired by Curiosity.

What a fitting theme for this year! Technology continues to evolve, so each new day is a chance to seek new answers to what can sometimes feel like impossible challenges. Our curiosity as humans drives us to seek out better ways to do things. And I hope your curiosity will drive you to register for this year’s SAS Global Forum.

We are excited to offer a global event across three regions. If you’re in the Americas, the conference is May 18-20. In Asia Pacific? Then we’ll see you May 19-20. And we didn’t forget about Europe. Your dates are May 25-26. We hope these region-specific dates and the virtual nature of the conference mean more SAS users than ever will join us for an inspiring event. Curious about the exciting agenda? It’s all on the website, so check it out.

Keynote speakers that you’ll talk about for months to come

Want to be inspired to chase your “impossible” dreams? Or hear more about the future of AI? How about learning about work-life balance and your mental health? We have you covered. SAS executives are gearing up to host an exciting lineup of extremely smart, engaging and thought-provoking keynote speakers like Adam Grant, Ayesha Khanna and Hakeem Oluseyi.

And who knows, we might have a few more surprises up our sleeve. You’ll just have to register and attend to find out.

Papers and proceedings: simplified and easy to find

Have you joined the SAS Global Forum online community? You should, because that’s where you’ll find all the discussion around the conference…before, during and after. It’s also where you’ll find a link to the 2021 proceedings, when they become available. Authors are busy preparing their presentations now and they are hard at work staging their proceedings in the community. Join the community so you can connect with other attendees and know when the proceedings become available.

Stay tuned for even more details

SAS Global Forum is the place where creativity meets curiosity, and amazing analytics happens! I encourage you to regularly check the conference website, as we’re continually adding new sessions and events. You don’t want to miss this year’s conference, so don’t forget to register for SAS Global Forum. See you soon!

Registration is open for a truly inspiring SAS Global Forum 2021 was published on SAS Users.

April 19, 2021
 

The ranks of a set of data values are used in many nonparametric statistics and statistical tests. When you request a statistic or nonparametric test in SAS, the procedure will automatically compute the ranks that are needed. However, sometimes it is useful to know how to compute the ranks yourself. This article shows how to compute ranks in SAS when the data contains repeated values, which result in tied ranks. The article shows how to use PROC RANK as well as the RANKTIE function in SAS/IML software.

What are tied ranks?

Ranks are easy to compute if there are no tied values in the data. You simply sort the data values and assign the ordinal position of the sorted data as the rank. For example, in the data {18, 13, 19, 16}, the corresponding ranks are {3, 1, 4, 2} because 13 is the first (sorted) value, 16 is the second (sorted) value, and so forth. For a sample of size n, the ranks are integers in the range [1,n].

When the data contains duplicate values, you must decide how to assign ranks to the tied values. All tied values should have the same rank, but what rank should you assign? There are several ways to handle ties, but the most common way is to assign the average rank of the tied values. This is advantageous in statistical tests because it preserves the sum of the ranks. For example, in the data {18, 13, 18, 16}, the "average rank" method would assign the ranks {3.5, 1, 3.5, 2} because the third and fourth sorted values are the same. Therefore, the average rank is (3 + 4)/2 = 3.5. Notice that the sum of the ranks is 10, which equals the sum of the integers 1:n, where n=4.

There are four common methods for handling ties. Suppose there are k tied values. If you sort the data, the tied values will appear in the ordinal positions R, R+1, ..., R+k-1.

  • MEAN method: As discussed above, you can assign the rank of the tied values to be the mean position, which is (R + R+k-1)/2 = R + (k-1)/2.
  • LOW or HIGH method: For the low method, the rank of the tied values is assigned to be R. For the high method, the rank of the tied values is assigned to be R+k-1.
  • DENSE method: In the dense method, you first write down the set of unique data values. You then rank those unique values. For the original data, you assign each datum a rank that equals the rank of its unique value. For example, for the data {18, 13, 18, 16, 18, 16}, the unique values are {13, 16, 18}, which are assigned ranks {1, 2, 3}. Therefore, the ranks for the original data are {3, 1, 3, 2, 3, 2}.
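The four tie-handling rules can be sketched in a few lines of plain Python. This is an illustration of the definitions above, not the implementation used by the RANKTIE function or PROC RANK; the function name tied_ranks is hypothetical.

```python
def tied_ranks(values, method="mean"):
    """Rank values (1-based) and resolve ties by 'mean', 'low', 'high', or 'dense'."""
    ordered = sorted(values)
    first_pos = {}   # value -> R, the first ordinal position of the value
    count = {}       # value -> k, the number of tied copies
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    # Dense ranks: rank of each value among the sorted unique values
    dense = {v: r for r, v in enumerate(sorted(first_pos), start=1)}

    def rank_of(v):
        R, k = first_pos[v], count[v]
        if method == "mean":
            return R + (k - 1) / 2
        if method == "low":
            return R
        if method == "high":
            return R + k - 1
        if method == "dense":
            return dense[v]
        raise ValueError(f"unknown method: {method}")

    return [rank_of(v) for v in values]


data = [18, 13, 18, 16, 18, 16]
for method in ("mean", "low", "high", "dense"):
    print(method, tied_ranks(data, method))
```

For these six values, the mean ranks sum to 21 = 1 + 2 + ... + 6, which illustrates that the MEAN method preserves the sum of the ranks.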

In SAS, you can compute ranks by using the RANKTIE function in SAS/IML, or you can use PROC RANK in Base SAS. Both methods support the four methods for handling ties. The default method is the mean method.

Compute tied ranks in SAS/IML

You can use the RANKTIE function in PROC IML to compute tied ranks, as follows:

proc iml;
x = {10, 10, 10, 11, 11, 12, 12, 12, 13};
rankMean  = ranktie(x);          /* default="MEAN" */
rankLow   = ranktie(x, "Low");  
rankDense = ranktie(x, "Dense");
print x rankMean rankLow rankDense;

The result shows three of the four methods:

  • MEAN method: The value 10 is the smallest value. In a sorted set, the value has the ordinal positions {1, 2, 3}. The average rank is therefore 2. Similarly, the (untied) ranks for the value 11 are {4, 5}, so the mean rank is 4.5.
  • LOW method: This is the method used in sports. If three athletes all have the best score, we say they are tied for first. The next two athletes are tied for fourth place, and so on. The HIGH method is similar but isn't used as often in sports. For the HIGH method, we would say the three athletes that have the best scores are tied for third place. Both of these methods can be used to compute an empirical distribution function.
  • DENSE method: There are four unique values, so the ranks are values 1, 2, 3, or 4.

For simplicity, the previous example lists the data in sorted order. However, the rank of each value does not depend on the order of the observations. For example, the following statements use the same values but change the order of the observations. Each value receives the same rank as before:

/* Change the order of the data. Each value has the same rank as before; only the order of the rows changes. */
x = {12, 10, 13, 11, 10, 10, 12, 11, 12};
rankMean  = ranktie(x);          /* default="MEAN" */
rankLow   = ranktie(x, "Low");  
rankDense = ranktie(x, "Dense");
print x rankMean rankLow rankDense;

Compute tied ranks by using PROC RANK

Although PROC RANK supports the TIES= option to specify the MEAN, LOW, HIGH, or DENSE methods, you can use only one method at a time. Therefore, the following program calls PROC RANK three times and merges the outputs into a single data set:

data X;
input x @@;
datalines;
10 10 10 11 11 12 12 12 13
;
 
proc rank data=X out=rankMean ties=MEAN;
   var X;   ranks rankMean;
run;
proc rank data=X out=rankLow ties=LOW;
   var X;   ranks rankLow;
run;
proc rank data=X out=rankDense ties=DENSE;
   var X;   ranks rankDense;
run;
data Ranks;
   merge rankMean rankLow rankDense;
run;
proc print data=Ranks; run;

The result is the same as for the RANKTIE function in SAS/IML. In practice, you typically apply only one method for handling ties, so you only need to call PROC RANK once.

Summary

Many statistics and statistical tests use ranks. When the data contain duplicate values, the ranks are not unique. You must choose a way to assign ranks to the tied values. This article discusses four ways to compute tied ranks in SAS. Computations are shown by using PROC IML and PROC RANK.

The post Compute tied ranks appeared first on The DO Loop.

April 16, 2021
 

Predictive models and medical image analysis have the potential to transform health care delivery. To accelerate innovative approaches in health analytics, teams around the world are participating in a global hackathon. In scientific communities, it's well-known that the fastest way to drive progress and discover new ideas is collaboration with [...]

Analytics hackathon sparks innovation for health care was published on SAS Voices by Alyssa Farrell

April 14, 2021
 

Until recently, I used UNIX/Linux shell scripts in a very limited capacity, mostly as a vehicle for submitting SAS batch jobs. All heavy lifting (conditional processing logic, looping, macro processing, etc.) was done in SAS and by SAS. If there was a need for parallel processing and synchronization, it was also implemented in SAS. I even wrote a blog post, Running SAS programs in parallel using SAS/CONNECT®, which I proudly shared with my customers.

The post caught their attention and I was asked if I could implement the same approach to speed up processes that were taking too long to run.

However, it turned out that SAS/CONNECT was not licensed at their site and procuring the license wasn’t going to happen any time soon. Bummer!

Or boon? You should never be discouraged by obstacles. In fact, encountering an obstacle might be a stroke of luck. Just add a mixture of curiosity, creativity, and tenacity – and you get a recipe for new opportunity and success. That’s exactly what happened when I turned to exploring shell scripting as an alternative way of implementing parallel processing.

Running several batch jobs in parallel

UNIX/Linux OS allows running several scripts in parallel. Let’s say we have three SAS batch jobs controlled by their own scripts script1.sh, script2.sh, and script3.sh. We can run them concurrently (in parallel) by submitting these shell scripts one after another in background mode using & at the end. Just put them in a wrapper “parent” script allthree.sh and run it in background mode as:

$ nohup allthree.sh &

Here is what is inside allthree.sh:

#!/bin/sh
script1.sh &
script2.sh &
script3.sh &
wait

With such an arrangement, the allthree.sh “parent” script starts all three background tasks (and the corresponding SAS programs), which the server runs concurrently (as far as resources allow). Depending on the server capacity (mainly, the number of CPUs), these jobs will run in parallel or quasi-parallel, competing for the server's shared resources, with the operating system taking charge of orchestrating their co-existence and load balancing.

The wait command at the end is responsible for the “parent” script’s synchronization. Since no process ID or job ID is specified with the wait command, it will wait for all current “child” processes to complete. Once all three tasks have completed, the parent script allthree.sh will continue past the wait command.
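The same start-everything-then-wait pattern can be sketched in Python with the standard subprocess module. This is an illustration only, not part of the original solution; the three child commands are stand-ins for script1.sh, script2.sh, and script3.sh.

```python
import subprocess
import sys


def run_all_and_wait(commands):
    """Start every command without blocking, then wait for all of them."""
    # Popen returns immediately, like a trailing '&' in the shell
    children = [subprocess.Popen(cmd) for cmd in commands]
    # wait() blocks until that child exits, like the shell's 'wait'
    return [child.wait() for child in children]


# Stand-in jobs: each child sleeps briefly and then reports that it finished
commands = [
    [sys.executable, "-c", f"import time; time.sleep(0.2); print('job {i} done')"]
    for i in (1, 2, 3)
]
exit_codes = run_all_and_wait(commands)
print("exit codes:", exit_codes)
```

Because all three children are started before the first wait() call, the total elapsed time is close to that of the slowest child rather than the sum of all three.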

Get the UNIX/Linux server information

To evaluate server capabilities as they relate to parallel processing, we would like to know the number of CPUs.

To get this information, we can run the lscpu command, which provides an overview of the CPU architectural characteristics such as the number of CPUs, number of CPU cores, vendor ID, model, model name, speed of each core, and lots more. Here is what I got:

Ha! 56 CPUs! This is not bad, not bad at all! I don’t even have to usurp the whole server after all. I can just grab about 50% of its capacity and be a nice guy leaving another 50% to all other users.

Problem: monthly data ingestion use case

Here is a simplified description of the problem I was facing.

Each month, shortly after the end of the previous month we needed to ingest a number of CSV files pertinent to transactions during the previous month and produce daily SAS data tables for each day of the previous month.  The existing process sequentially looped through all the CSV files, which (given the data volume) took about an hour to run.

This task was a perfect candidate for parallel processing since data ingestions of individual days were fully independent of each other.

Solution: massively parallel process

The solution comprises two parts:

  • Single thread SAS program responsible for a single day data ingestion.
  • Shell script running multiple instances of this SAS program concurrently.

Single thread SAS process

The first thing I did was re-writing the SAS program from looping through all of the days to ingesting just a single day of a month-year. Here is a bare-bones version of the SAS program:

/* capture parameter &sysparm passed from OS command */ 
%let YYYYMMDD = &sysparm;
 
/* create varlist macro variable to list all input variable names */
proc sql noprint;
   select name into :varlist separated by ' ' from SASHELP.VCOLUMN
   where libname='PARMSDL' and memname='DATA_TEMPLATE';
quit;
 
/* create fileref inf for the source file */
filename inf "/cvspath/rawdata&YYYYMMDD..cvs";
 
/* create daily output data set */
data SASDL.DATA&YYYYMMDD; 
   if 0 then set PARMSDL.DATA_TEMPLATE;
   infile inf missover dsd encoding='UTF-8' firstobs=2 obs=max;
   input &varlist;
run;

This SAS program (let’s call it oneday.sas) can be run in batch using the following OS command:

sas oneday.sas -log oneday.log -sysparm 20210301

Note that we pass a parameter (e.g., 20210301 means year 2021, month 03, day 01) that defines the requested date YYYYMMDD as the -sysparm value.

That value becomes available in the SAS program as a macro variable reference &sysparm.

We also use a pre-created data template PARMSDL.DATA_TEMPLATE - a zero-observations data set that contains descriptions of all the variables and their attributes (see Simplify data preparation using SAS data templates).

Shell script running the whole process in parallel

The shell script below, month_parallel_driver.sh, puts everything together. It spawns and runs concurrently as many daily processes as there are days in the specified month and year, and it synchronizes all single-day processes (threads) at the end by waiting for them all to complete. It logs all its threads and calculates (and prints) the total processing duration. As you can see, shell script as a programming language is quite versatile and powerful. Here it is:

#!/bin/bash
# Requires bash: uses substring expansion (${par:0:4}) and the SECONDS timer
 
# HOW TO RUN (pass the month to process as YYYYMM):
# cd /projpath/scripts
# nohup bash month_parallel_driver.sh 202103 &
 
# Project path
proj=/projpath
 
# Program file name
prgm=oneday
pgmname=$proj/programs/$prgm.sas
 
# Current date/time stamp
now=$(date +%Y.%m.%d_%H.%M.%S)
echo 'Start time:'$now
 
# Reset timer
SECONDS=0
 
# Get YYYYMM as the script parameter
par=$1
 
# Extract year and month from $par
y=${par:0:4}
m=${par:4:2}
 
# Get number of days in month $m of year $y
days=$(cal $m $y | awk 'NF {DAYS = $NF}; END {print DAYS}')
 
# Create log directory
logdir=$proj/saslogs/${prgm}_${y}${m}_${now}_logs
mkdir $logdir
 
# Loop through all days of month $m of year $y
for i in $(seq -f "%02g" 1 $days)
do
   # Assign log name for a single day thread
   logname=$logdir/${prgm}_${y}${m}_thread${i}_$now.log
 
   # Run single day thread
   /SASHome/SASFoundation/9.4/sas $pgmname -log $logname -sysparm $par$i &
done
 
# Wait until all threads are finished
wait
 
# Calculate and print duration
end=$(date +%Y.%m.%d_%H.%M.%S)
echo 'End time:'$end
hh=$(($SECONDS/3600))
mm=$(( $(($SECONDS - $hh * 3600)) / 60 ))
ss=$(($SECONDS - $hh * 3600 - $mm * 60))
printf " Total Duration: %02d:%02d:%02d\n" $hh $mm $ss
echo '------- End of job -------'

The script is documented by its comments. Pass the month to process (YYYYMM) as the first argument, which the script reads as $1:

cd /projpath/scripts
nohup bash month_parallel_driver.sh 202103 &
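The date arithmetic in the script (extract the year and month, count the days, and build one -sysparm value per day) can be sketched in Python as follows. This is only an illustration of the logic; the function name daily_sysparms is hypothetical, and it uses the standard calendar module instead of the cal command.

```python
import calendar


def daily_sysparms(yyyymm):
    """Return the YYYYMMDD strings for every day of the month given as YYYYMM."""
    year = int(yyyymm[:4])
    month = int(yyyymm[4:6])
    # monthrange returns (weekday of the first day, number of days in the month)
    days = calendar.monthrange(year, month)[1]
    return [f"{yyyymm}{day:02d}" for day in range(1, days + 1)]


params = daily_sysparms("202103")
print(len(params), params[0], params[-1])
```

For 202103 this yields 31 values, from 20210301 to 20210331, which correspond to the 31 background SAS processes that the driver script starts for that month.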

Results

The results were as expected as they were stunning. The overall duration was cut roughly by a factor of 25, so now this whole task completes in about two minutes vs. one hour before. Actually, now it is even fun to watch how SAS logs and output data sets are being updated in real time.

What is more, this script-centric approach can be used for running not just SAS processes, but non-SAS, open source and/or hybrid processes as well. This makes it a powerful amplifier and integrator for heterogeneous software applications development.

SAS Consulting Services

The solution presented in this post is a stripped-down version of the original production quality solution. This better serves our educational objective of communicating the key concepts and coding techniques. If you believe your organization’s computational powers are underutilized and may benefit from a SAS Consulting Services engagement, please reach out to us through your SAS representative, and we will be happy to help.

Thoughts? Comments?

Do you find this post useful? Do you have processes that may benefit from parallelization? Please share with us below.

Using shell scripts for massively parallel processing was published on SAS Users.

April 14, 2021
 

It can be frustrating to receive an error message from statistical software. In the early days of the SAS statistical graphics (SG) procedures, an error message that I dreaded was
ERROR: Attempting to overlay incompatible plot or chart types.
This error message appears when you attempt to use PROC SGPLOT to overlay two plots that have different properties. For example, you might be trying to overlay a bar chart, which requires a categorical variable, with a scatter plot or series plot, which often displays values of a continuous variable.

In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can combine a bar chart with other plot types by using the VBARBASIC or HBARBASIC statements, which create a bar chart that is compatible with other "basic" plots.

Compatibility of plot types

The SAS documentation includes an explanation of chart types, and a table that shows which plots you can overlay when you use PROC SGPLOT or PROC SGPANEL. (The doc erroneously puts HBARBASIC and VBARBASIC in the "categorization" group, but they should be in the "basic" group. I have contacted the doc writer to correct the mistake.) If you try to overlay plots from different chart types, you will get the dreaded ERROR: Attempting to overlay incompatible plot or chart types.

First, let me emphasize that this error message only appears when you use PROC SGPLOT or SGPANEL to overlay the plots. If you use the Graph Template Language (GTL) and PROC SGRENDER, you do not have this restriction.

I have previously written about two important cases for which it is necessary to overlay an empirical distribution (bar chart or histogram) and a theoretical distribution, which is visualized by using a scatter plot or series plot.

My previous article shows how to overlay a bar chart and a series plot, but the example is a little complicated. The examples in the next sections are much simpler. There are two ways to combine a bar chart and a line plot: You can use the HBAR and HLINE statements, or you can use the HBARBASIC and SERIES statements. By using HBARBASIC, you can overlay a bar chart with many other plots.

Overlay a bar chart and a line plot

Suppose you want to use a bar chart to display the average height (by age) of a sample of school children. You also want to add a line that shows the average heights in the population. Because you must use "compatible" plot types, the traditional approach is to combine the HBAR and HLINE statements, as follows. (For simplicity, I omit the legend on this graph.) The DATA step creates fake data, which are supposed to represent the national average and a range of values for the average heights.

data NationalAverage;         /* fake data for demonstration purposes */
label Average = "National Average";
input Age Average Low High;
datalines;
11 60 53 63
12 62 54 65
13 65 55 67
14 66 56 68
15 67 57 70
16 68 58 72
;
 
data All;
set Sashelp.Class NationalAverage;
run;
 
title "Average Heights of Students in Class, by Age";
proc sgplot data=All noautolegend;
   hbar Age / response=height stat=mean;
   hline Age / response=Average markers datalabel;
run;

This is the "classic" bar chart and line plot. This syntax has been available in SAS since at least SAS 9.2. It enables you to combine multiple statements for discrete variables, such as HBAR/VBAR, HLINE/VLINE, and DOT. However, in some situations, you might need to overlay a bar chart and more complicated plots. In those situations, use the HBARBASIC or VBARBASIC graphs, as shown in the next section.

Overlay a bar chart and plots of continuous data

The VBARBASIC and HBARBASIC statements (introduced in SAS 9.4M3) enable you to combine bar charts with one or more other "basic" plots such as scatter plots, series plots, and box plots. Like the VBAR and HBAR statements, these statements can summarize raw data. They have almost the same syntax as the VBAR and HBAR statements.

Suppose you want to combine a bar chart, a series plot, and a high-low plot. You can't use the VBAR or HBAR statements because that leads to "incompatible plot or chart types." However, you can use the VBARBASIC and HBARBASIC statements, as follows:

proc sgplot data=All noautolegend;
   hbarbasic Age / response=height stat=mean name="S" legendlabel="Class";
   series y=Age x=Average / markers datalabel=Average name="Avg" legendlabel="National Average";
   highlow y=Age low=Low high=High;
   keylegend "S" "Avg";
run;

Notice that the SERIES and HIGHLOW statements create "basic" graphs. To overlay these on a bar chart, use the HBARBASIC statement. In a similar way, you can overlay many other graph types on a bar chart.

Summary

Sometimes you need to overlay a bar chart and another type of graph. If you aren't careful, you might get the error message: ERROR: Attempting to overlay incompatible plot or chart types. In SAS 9.4M3 and later, there is a simple way to avoid this error message. You can use the VBARBASIC or HBARBASIC statements to create a "basic" bar chart that is compatible with other "basic" plots.

The post Overlay other graphs on a bar chart with PROC SGPLOT appeared first on The DO Loop.