July 1, 2019
 

Math and statistics are everywhere, and I always rejoice when I spot a rather sophisticated statistical idea "in the wild." For example, I am always pleased when I see a graph that shows the distribution of race times in a typical race (such as a 5K), as shown to the right. The finishing times are plotted against the order in which runners crossed the finish line. This is a great visualization because you can see the times for each participant, the range of times between the top finishers and the laggards, and how close each runner's time was to the time of the person who placed ahead of her.

What amazes me is that this graph essentially shows the cumulative distribution of race times. The cumulative distribution is not generally used outside of scientific publications. Yet here it is, easily understood and with no accompanying explanation! Graphs like this are often used to visualize race times for triathlons, marathons, 5Ks, and more.
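
If you want to create such a graph yourself, a minimal SAS sketch follows. (The data set name RaceTimes and the variable Time are hypothetical; substitute your own race data.) Sort the times, assign the finish order, and plot one against the other:

proc sort data=RaceTimes;
   by Time;
run;

data RaceTimes;
   set RaceTimes;
   Rank = _N_;          /* finish order: 1 = first across the line */
run;

proc sgplot data=RaceTimes;
   scatter x=Rank y=Time;
   xaxis label="Order of finish" grid;
   yaxis label="Time (minutes)" grid;
run;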

The distribution of race times

The graph is noteworthy for what it is and also for what it isn't. It isn't a histogram. If you give a statistician a set of measurements and ask for the distribution, you are likely to get a histogram. For small data sets, you might also get a fringe plot below the histogram, as shown below:

The graph shows the distribution of race times for the same race. The histogram bins the times into one-minute intervals. You can easily see that a small percentage of runners (about 2%) finished the race in under 19 minutes and that about 40% of the runners finished between 20 and 22 minutes. You can see that only a few runners exceeded 26 minutes.
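
If you want to reproduce this kind of graph, you can combine the HISTOGRAM and FRINGE statements in PROC SGPLOT. Here is a minimal sketch, again assuming the hypothetical RaceTimes data:

proc sgplot data=RaceTimes;
   histogram Time / binwidth=1;   /* one-minute bins */
   fringe Time;                   /* one tick mark per runner */
   xaxis label="Time (minutes)";
run;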

Since the histogram is such a standard plot, you would think it would be the "best" graph to use, but the time-versus-rank graph has several advantages:

  • The time-versus-rank graph connects the times to the place. What was the time for the 20th runner? It's easy to determine. How many runners finished under 22 minutes? Also easy to find.
  • You can see every runner's time. In a race, there might be a tenth of a second between times. The fringe plot suffers from overplotting. The histogram bins many times into a single bar. The time-versus-rank graph displays one marker per runner and the markers do not overlap unless there are hundreds of runners.
  • The time-versus-rank graph shows packs of runners. In long-distance races, there is often a "lead pack," a "trailing pack," and other clumps of runners of similar ability. On the time-versus-rank graph, these packs show up as groups of nearly horizontal markers. In the fringe plot, the vertical lines overlap and are harder to see.
  • Leaders and laggards stand out in the time-versus-rank graph because the markers are isolated. If someone wins a race by 20 seconds (a huge lead!), you can see that clearly in the first graph. In contrast, the histogram lumps together all the leaders into one bar.

The cumulative distribution of race times

The time-versus-rank graph is not exactly equal to the standard graph of the empirical cumulative distribution, but it's close. You can use PROC UNIVARIATE to create the following graph of the cumulative distribution of race times. The graph is known as a CDF plot.
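
A sketch of the PROC UNIVARIATE call follows; as before, the data set and variable names are hypothetical:

proc univariate data=RaceTimes;
   cdfplot Time;
run;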

The CDF plot has the same shape as the time-versus-rank graph, but you need to flip the axes. (Geometrically, flip the CDF plot across its diagonal.) The CDF plot differs in three minor ways:

  • The ECDF is plotted as a step function. The time-versus-rank graph is plotted as a scatter plot.
  • The ECDF has its axes reversed: the times are on the horizontal axis and the order is plotted vertically.
  • The ECDF standardizes the "order statistic" into a percentage, rather than using a rank. The runner who finishes 20th out of 66 runners is plotted at 30.3% (because 20/66 ≈ 0.303), as sketched below.
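
A quick way to compute those standardized percentages in SAS is the FRACTION option in PROC RANK. This is a minimal sketch, again assuming the hypothetical RaceTimes data:

proc rank data=RaceTimes fraction out=ECDF;
   var Time;
   ranks Pct;      /* Pct = rank / n, the ECDF value for each runner */
run;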

So, yes, there are minor differences. But I still smile whenever I see the time-versus-rank graph. Although the racers might not know or care, the plot contains the same information as a plot of the cumulative distribution of the race times.

You can download the SAS program that contains the data and creates all the graphs in this article.

The post Visualize race times in SAS appeared first on The DO Loop.

June 28, 2019
 

Congratulations on being chosen to speak at an event! Let the anxiety... er, preparation begin. But wait.

Did you know that social media can help you out? Yes, even now, while you plan. What's more, it can be instrumental in maximizing your entire presenter experience before, during and after your presentation. Here are some ideas to get you thinking.

Before

1. Solicit ideas online.

Most of your connections won't attend your event, but many are probably interested in your topic. Don't hesitate to get help from your network while you work on your paper or presentation. Ask them questions. Get their feedback. (And use the event hashtag -- #sasgf or #sasusers, for example -- when you do.)

2. Use social media for research.

Online properties like Quora, SAS Communities, Medium, SlideShare and even LinkedIn can lead to statistics, influencers or research you never knew existed. Type keywords or phrases in the basic search field on any of these websites. You never know what (or who) you might find.

3. Polish your LinkedIn (and/or Twitter) profile. (People will be looking.)

Need a checklist? Start with the Example SAS User LinkedIn Profile on communities.sas.com or Buffer's 7 Key Ingredients of a Great Twitter Bio.

4. Schedule a handful of posts.

One week before the conference or perhaps while you're en route, schedule a few posts to your social media accounts. You'll be too busy at the conference to do this. Free tools like Buffer or Hootsuite allow you to schedule posts throughout the week.

During

5. Skim activity around the event hashtag feed to like, reply, share or comment.

Don't know how? Enter the event hashtag, for instance "#sasgf" (no quotation marks), in the main search fields on Twitter and LinkedIn. Doing this is good for a few reasons:

  • It's easy. Especially since you'll be so busy during the event.
  • People (even strangers) appreciate when you interact with their event posts.
  • Social activity during an event is a sure-fire way to gain followers.

6. Post the occasional photo or a useful tip from a particularly inspirational session.

You'll be so busy during the event, it will be hard to find time to post. If you can, do it in small pieces. A favorite stat. A meaningful quote. A beautiful view of the venue. (Remember, use the event hashtag or other topic-specific hashtags when you do.)

After

7. Connect on LinkedIn or SAS Communities.

Immediately after the event (ideally, in less than 24 hours), connect with fellow conference goers on LinkedIn. Be sure to personalize your invitation with a brief note in case they forgot your name. Don't want to wait? Connect with them in person using the LinkedIn QR code trick.

Is your new friend fairly technical? If so, find and follow his or her activity on communities.sas.com (See subhead "How do I search for people?").

8. Add your paper or presentation to your LinkedIn profile (and direct people to it).

There are three sections of your profile where you can add media (in the form of hyperlinks, documents, PowerPoint slides, etc.): your Summary, Experience and Education sections. Professionals: Add your paper or presentation slides to your Summary or Experience sections; students: consider your Education section.

Pro tip: For additional profile views, create a post that points connections to it on your profile, or mention it during your presentation.

9. Write a useful blog post.

Alison Bolen wrote about this in 2012, yet her message remains perfectly relevant: How to transform your live event blogging into evergreen content. The bottom line? Readers care about the content, not the conference.

Nine #SocialMedia Speaker Tips to Use Before, During and After Events was published on SAS Users.

June 27, 2019
 

When Jack Shostak and I first started thinking about writing a SAS book on implementing CDISC (Clinical Data Interchange Standards Consortium) standards, we held one truth to be self-evident: that at least some parts of the book would be outdated before it was even published. Thanks to some lucky timing and the ability to make some minor tweaks just prior to publishing, the extent of the “outdated-ness” with our first edition of Implementing CDISC Using SAS: An End-to-End Guide, was fairly limited…for a few weeks at least.

Shortly after publishing, the final version 2.0 of the Define-XML specification came out and we knew there was some work to do in the future. So, after a bit of a writing break, we rolled up our sleeves again and began updating our %make_define macro and the associated metadata spreadsheets for the second edition of our book. Quite a few other changes were also in the works!

That edition came out in November of 2016. However, CDISC standards didn't stop for us. True to form, even before publishing, we realized that we weren't implementing NCI codes, aka "C-codes", in our controlled terminology metadata.

This was painfully obvious thanks to a check that started coming up in the Pinnacle 21 reports: “Missing NCI Code for Term in Codelist”. Some users shared this feedback with us, and we took action (thank you, users!).

So with some motivation from Jack, I started working on implementing C-codes. But I wanted it to be slick. The codes are all on the NCI website spreadsheets, so why should we expect users to enter them all into their study-specific metadata spreadsheets, right? Why not just read those spreadsheets, also available in XML format, and automatically merge the C-codes into the study-specific data? Well, I can tell you why it wasn’t that easy.

Jack, meanwhile, was a little more motivated than I to get a solution in place. So, as he frequently did back when we worked together at PRA Health Sciences in the early ’90s, he showed me how it was done. Thanks to him, in April of this year, Implementing CDISC Using SAS: An End-to-End Guide, Revised Second Edition hit the virtual bookstore shelves. The updated macros and meta-data spreadsheets can be downloaded on the SAS website. See the C-codes for yourself and let us know what you think!

Want a sneak peek? Check out our book excerpt of Chapter 1: Implementation Strategies! Want to learn more about SAS Books? Subscribe to our newsletter for the latest discounts and news.

Other Resources:

SAS and CDISC

Move beyond the ‘whys’ of CDISC and bridge the gap between theory and practice by Jack Shostak

Working with ever-changing CDISC standards was published on SAS Users.

June 26, 2019
 

"There's a way to do it better - find it." - Thomas A. Edison

Finding better SAS code

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But which feature makes code "better" – brevity, clarity or efficiency? It all depends on the purpose of your code. When code illustrates a coding concept or technique, clarity is paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won't hurt in either case. Ideally, your code should combine all three features - brevity, clarity and efficiency.

Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-the-box solution to this problem is the SAS FIND() function. However, this function searches only for a single (first) instance of a specified substring within a character string. Close enough: with some do-looping, we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;

Here, s is the text string (character variable) to be parsed; x is a character variable holding the group of characters that we are searching for within s; p is the position at which x is found within s; and n is the instance number.

If no n-th instance of x is found within s, the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1, where p is the position found in the prior iteration: p = find(s,x,p+1);.

Notice that if there are fewer than n instances of x within s, the do-loop ends prematurely, based on the until(p=0) condition, thus cutting the number of iterations to the minimum necessary.

Reverse string search

Since the FIND() function allows for a string search in a reverse direction (from right to left) when you make the third argument negative, the above code snippet can be easily modified to do just that: find the Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches for substring x within string s from right to left, starting from position -(p-1) = -p+1.

Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;
   put p=;
 
   /* from right to left */
   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;
   put p=;
run;

FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of the SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth;
   function findnth(str $, sub $, n);
      p = ifn(n>=0,0,length(str)+1);
      do i=1 to abs(n) until(p=0);
         p = find(str,sub,sign(n)*p+1);
      end;
      return (p);
   endsub;
run;

We conveniently named it findnth() to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except that our findnth() function accepts both a positive third argument (for left-to-right searches) and a negative one (for right-to-left searches), while Tableau's function allows only left-to-right searches.

Here is an example of the findnth() function usage:

options cmplib=sasuser.functions;
data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p=findnth(s,x,n);
   put p=;
 
   /* from right to left */
   p=findnth(s,x,-n);
   put p=;
run;

Using Perl regular expressions

As an alternative solution, I also implemented SAS code for finding the n-th occurrence of a substring within a string using Perl regular expressions (regex or prx):

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* using regex */
   xid = prxparse('/'||x||'/o');
   p = 0;
   do i=1 to n until(p=0);
      from = p + 1;
      call prxnext(xid, from, length(s), s, p, len);
   end;
   put p=;
run;

However, efficiency benchmarking tests demonstrated that the above solutions that use the FIND() function or the user-written FINDNTH() function run roughly twice as fast as this regex solution.
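
For the curious, here is a minimal sketch of the kind of benchmark you can run yourself. (The row count and string values are arbitrary; your timings will vary.) The FULLSTIMER option writes detailed timing to the log; repeat the test with the regex version and compare the real time:

options fullstimer;

data big;                     /* hypothetical benchmark data */
   length s $ 200 x $ 2;
   x = 'AB';
   s = repeat('AB bhdf +BA s Ab fs ABC Nfm ', 5);
   do j = 1 to 5e6;
      output;
   end;
run;

data _null_;                  /* time the FIND() loop */
   set big;
   n = 3;
   p = 0;
   do i=1 to n until(p=0);
      p = find(s, x, p+1);
   end;
run;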

Challenge

Can you come up with an even better solution to the problem of finding the Nth instance of a substring within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

Finding n-th instance of a substring within a string was published on SAS Users.

June 26, 2019
 

SAS/STAT software contains a number of so-called HP procedures for training and evaluating predictive models. ("HP" stands for "high performance.") A popular HP procedure is HPLOGISTIC, which enables you to fit logistic models on Big Data. A goal of the HP procedures is to fit models quickly. Inferential statistics such as standard errors, hypothesis tests, and p-values are less important for Big Data because, if the sample is big enough, all effects are significant and all hypothesis tests are rejected! Accordingly, many of the HP procedures do not support the same statistical tests as their non-HP cousins (for example, PROC LOGISTIC).

A SAS programmer recently posted an interesting question on the SAS Support Community. She is using PROC HPLOGISTIC for variable selection on a large data set. She obtained a final model, but she also wanted to estimate the covariance matrix for the parameters. PROC HPLOGISTIC does not support that statistic, but PROC LOGISTIC does (the COVB option on the MODEL statement). She asked whether it is possible to get the covariance matrix for the final model from PROC LOGISTIC, given that PROC HPLOGISTIC has already fit the model. Furthermore, PROC LOGISTIC supports computing and graphing odds ratios, so is it possible to get those statistics, too?

It is an intriguing question. The answer is "yes," although PROC LOGISTIC still has to perform some work. The main idea is that you can tell PROC LOGISTIC to use the parameter estimates found by PROC HPLOGISTIC.

What portion of a logistic regression takes the most time?

The main computational burden in logistic regression is threefold:

  • Reading the data and levelizing the CLASS variables in order to form a design matrix. This is done once.
  • Forming the sum of squares and crossproducts matrix (SSCP, also called the X`X matrix). This is done once.
  • Maximizing the likelihood function. The parameter estimates require running a nonlinear optimization. During the optimization, you need to evaluate the likelihood function, its gradient, and its Hessian many times. This is an expensive operation, and it must be done for each iteration of the optimization method.

The amount of time required by each step depends on the number of observations, the number of effects in the model, the number of levels in the classification variables, and the number of computational threads available on your system. You can use the Timing table in the PROC HPLOGISTIC output to determine how much time is spent during each step for your data/model combination. For example, the following results are for a simulated data set that has 500,000 observations, 30 continuous variables, four CLASS variables (each with 3 levels), and a main-effects model. It is running on a desktop PC with four cores. The iteration history is not shown, but this model converged in six iterations, which is relatively fast.

ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi OUTEST;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   ods output ParameterEstimates=PE;
   performance details;
run;

For these data and model, the Timing table tells you that the HPLOGISTIC procedure fit the model in 3.68 seconds. Of that time, about 17% was spent reading the data, levelizing the CLASS variables, and accumulating the SSCP matrix. The bulk of the time was spent evaluating the loglikelihood function and its derivatives during the six iterations of the optimization process.

Reduce time in an optimization by improving the initial guess

If you want to improve the performance of the model fitting, about the only thing you can control is the number of iterations required to fit the model. Both HPLOGISTIC and PROC LOGISTIC support an INEST= option that enables you to provide an initial guess for the parameter estimates. If the guess is close to the final estimates, the optimization method will require fewer iterations to converge.

Here's the key idea: If you provide the parameter estimates from a previous run, then you don't need to run the optimization algorithm at all! You still need to read the data, form the SSCP matrix, and evaluate the loglikelihood function at the (final) parameter estimates. But you don't need any iterations of the optimization method because you know that the estimates are optimal.

How much time can you save by using this trick? Let's find out. Notice that the previous call used the ODS OUTPUT statement to output the final parameter estimates. (It also used the OUTEST option to add the ParmName variable to the output.) You can run PROC TRANSPOSE to convert the ParameterEstimates table into a data set that can be read by the INEST= option on PROC HPLOGISTIC or PROC LOGISTIC. For either procedure, set the option MAXITER=0 to bypass the optimization process, as follows:

/* create INEST= data set from ParameterEstimates */
proc transpose data=PE out=inest(type=EST) label=_TYPE_;
   label Estimate=PARMS;
   var Estimate;
   id ParmName;
run;
 
ods select PerformanceInfo IterHistory Timing;
proc hplogistic data=simLogi INEST=inest MAXITER=0;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass;
   performance details;
run;

The time required to read the data, levelize, and form the SSCP is unchanged. However, the time spent evaluating the loglikelihood and its derivatives is about 1/(1+NumIters) of the time from the first run, where NumIters is the number of optimization iterations (6, for these data). The second call to the HPLOGISTIC procedure still has to evaluate the loglikelihood function to form statistics like the AIC and BIC. It needs to evaluate the Hessian to compute standard errors of the estimates. But the total time is greatly decreased by skipping the optimization step.

Use PROC HPLOGISTIC estimates to jump-start PROC LOGISTIC

You can use the same trick with PROC LOGISTIC. You can use the INEST= and MAXITER=0 options to greatly reduce the total time for the procedure. At the same time, you can request statistics such as COVB and ODDSRATIO that are not available in PROC HPLOGISTIC, as shown in the following example:

proc logistic data=simLogi INEST=inest 
   plots(only MAXPOINTS=NONE)=oddsratio(range=clip);
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0 COVB;
   oddsratio c1;
run;

The PROC LOGISTIC step takes about 4.5 seconds. It produces odds ratios and plots for the model effects and displays the covariance matrix of the betas (COVB). By using the parameter estimates that were obtained by PROC HPLOGISTIC, it was able to avoid the expensive optimization iterations.

You can also use the STORE statement in PROC LOGISTIC to save the model to an item store. You can then use PROC PLM to create additional graphs and to run additional post-fit analyses.
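
Here is a sketch of that approach. (The item-store name LogiModel and the EFFECTPLOT request are illustrative choices, not part of the original analysis.)

proc logistic data=simLogi INEST=inest;
   class c1-c&numClass;
   model y(event='1') = x1-x&numCont c1-c&numClass / MAXITER=0;
   store out=LogiModel;          /* save the fitted model to an item store */
run;

proc plm restore=LogiModel;
   effectplot slicefit(x=x1);    /* an example post-fit graph */
run;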

In summary, PROC LOGISTIC can compute statistics and hypothesis tests that are not available in PROC HPLOGISTIC. It is possible to fit a model by using PROC HPLOGISTIC and then use the INEST= and MAXITER=0 options to pass the parameter estimates to PROC LOGISTIC. This enables PROC LOGISTIC to skip the optimization iterations, which saves substantial computational time.

You can download the SAS program that creates the tables and graphs in this article. The program contains a DATA step that simulates logistic data of any size.

The post Jump-start PROC LOGISTIC by using parameter estimates from PROC HPLOGISTIC appeared first on The DO Loop.

June 26, 2019
 

In his article How to use CASL to develop and work with user-defined CAS actions, Brian Kinnebrew defines CASL as "a language specification used by the SAS client to interact with and provide easy access to Cloud Analytic Services (CAS). CASL is a statement-based scripting language with many uses and strengths." I can't come up with a better definition, so if you'd like to learn more about the basics of CASL, I encourage you to read Brian's post.

In a SAS Stored Process (or any traditional SAS program), the code typically contains multiple DATA steps and procedures with a good dose of macros.

Over the last couple of years, the focus of my projects is the use of CAS actions with no traditional procedures in the mix. Most of these involved web applications calling multiple CAS actions. My initial approach was to make multiple http calls - one per action. This could get tedious.

Then I met my action hero: sccasl.runCasl.

My favorite SAS CASL action

This action executes a CASL "script" on the CAS server, analogous to executing a SAS Stored Process. Running a CASL program with a mix of CAS actions and CASL statements on the CAS server has these benefits:

  1. Reduces the number of http calls to the server
  2. The client-side code is much easier to reason about
  3. The returned values can be a dictionary that is suited for further consumption by the client, simplifying the client code

My personal name for these scripts is "CAS stored process".

Where art thou Macro?

In many applications, user input is passed to the code running on the server. In a SAS Stored Process, macros pass the parameters. CASL has no macros. My initial approaches to passing parameters to CASL programs were:

  1. Generate the final CASL program in JavaScript with the user input values inserted into the code. Sample code is available here.
    • Drawback: Debugging the code in SAS Studio requires a cut-and-paste of the generated code into SAS Studio.
  2. For programs needing an input table, load the data into a CAS table using the table.upload action. Sample code is available here.
    • Drawback: This requires an additional http call to the server.

In developing the GraphQL approach to writing applications - GraphQL and SAS Viya applications - a good match - I addressed the two drawbacks listed above by creating two functions and putting them in my utility belt.

    Superfriend functions

  • jsonToDict.js - generates a string containing the CASL dictionary version of a JavaScript object
  • argsToTable - creates a CAS table from a dictionary

The remainder of this article discusses these two functions. To demonstrate the functions' usage, I use code listings from the scoring example, covered in the GraphQL example.

The jsonToDict function

This function produces a string containing a CASL dictionary that is suitable for inclusion in CASL code.

Function definition

jsonToDict(obj, name) ⇒ string

Returns: string - the string containing the CASL dictionary

  • obj (object) - the JavaScript object of interest
  • name (string) - the name to assign to the dictionary

 

Example function code

The code below outlines the jsonToDict function usage.

let obj = {x:1, b:2, c:['a', 'b']};
let r = jsonToDict(obj, '_appEnv_');

The result is:

r = `_appEnv_ = {x=1, b=2, c={"a", "b"}};`;

The following lists the input parameters passed to the CASL code for scoring.

let input = {
        JOB    : 'J1',
        CLAGE  : 100, 
        CLNO   : 20, 
        DEBTINC: 20, 
        DELINQ : 2, 
        DEROG  : 0, 
        MORTDUE: 4000, 
        NINQ   : 1,
        YOJ    : 10,
        LOAN: 1000,
        ASSET: 100000
    };

Below is the JavaScript code and the result.

let _args_ = jsonToDict(input, '_args_');

This results in:

_args_ = `_args_ = {  JOB= "J1" ,CLAGE=100  ,CLNO=20  ,DEBTINC=20  ,DELINQ=2  ,DEROG=0  ,
MORTDUE=4000  ,NINQ=1  ,YOJ=10  ,LOAN=1000  ,ASSET=100000  };`;

To allow the use of different versions of the model, the name of the scoring model is passed in as a parameter. The JavaScript for the model follows.

let env = {
     astore: {
            caslib: 'Public',
            name  : 'GRADIENT_BOOSTING___BAD_2'
        }
};
let _appEnv_ = jsonToDict(env, '_appEnv_');

resulting in:

_appEnv_ = `_appEnv_ = { astore = { caslib="Public", name="GRADIENT_BOOSTING___BAD_2"}};`;

Next, I prepend the strings to the CASL program in the client code with the following code snippets.

let code = _args_ + _appEnv_ + `the CASL code shown below`;
loadactionset "astore";
 
/* convert arguments to a cas table */
argsToTable(_args_, 'casuser', 'INPUTDATA' );
 
/* score */
action astore.score r=rc/
    table  = { caslib= 'casuser', name = 'INPUTDATA' } 
    rstore = { caslib= _appEnv_.astore.caslib,  name=_appEnv_.astore.name }
    casout  = { caslib = 'casuser', name = 'OUTPUTDATA' replace= TRUE};
 
/* fetch results */
action table.fetch r = result /
    table = { caslib = 'casuser', name = 'OUTPUTDATA' };
 
/* extract the score and send it as a dictionary */
score = result.Fetch[1].P_BAD;
send_response({score = score});

Now the CASL program can access the incoming information using the two dictionaries _args_ and _appEnv_. Note: As a personal choice, I use a convention of _args_ for user input and _appEnv_ for application-specific information. I use the restaf application framework to make the http calls, as shown below.

let payload = {
    action: 'sccasl.runCasl',
    data  : { code: code }
};
let result = await store.runAction(session, payload);
let score = result.items('results', 'score');

CASL coding - easier than saying Kilp-ill-skim

Notice the absence of string substitutions that look strange. Just simple, straightforward coding. Easier than saying your name backwards, forcing you back to the fifth dimension.

The argsToTable.casl function

The sample code above used the function argsToTable. As you may guess, it converts the _args_ dictionary into the CAS table used in the scoring action. argsToTable is the CASL function that handles this task.

Function definition

argsToTable(input, caslib, name) ⇒ loads a dictionary into a CAS table

  • input (dictionary) - the data to load
  • caslib (string) - the caslib of the output table
  • name (string) - the name of the output table

 

The relevant CASL code from the scoring example above is reproduced here:

argsToTable(_args_, 'casuser', 'INPUTDATA' );

The argsToTable function is either stored on a server or prepended to the CASL code sent to the runCasl action. This function removes the need for a separate http call to load the data into a CAS table.

Returning data from CAS - the send_response function

The ultimate sidekick function

Any good super hero has a sidekick. The function send_response in CASL is very versatile - it allows one to return data in the form the application needs and allows more than one result. In many programs I return data in a form easily consumed by the client code.

For example, if you want to return just the rows of a table, you can do the following:

function resultsToDict(r);
   casResults = {};
   i = 1;
   do row over r;
      casResults[i] = row;
      i = i + 1;
   end;
   return casResults;
end;
 
/* and use it as follows: */
 
action table.fetch r = result /
    table = {  caslib = 'casuser',  name = 'mydata' };
/* extract the data and return it as a dictionary */
casResults = resultsToDict(result.Fetch);
send_response({casResults = casResults});

Finally

Using a combination of CASL, a couple of utility functions, and the runCasl action, you can develop some very efficient programs with minimal traffic between your client and the server. If you run multiple actions in sequence, consider grouping them into a CASL program and executing them on the CAS server using the runCasl action.

Next

In my next article, which I hope to finish in a flash, I will discuss using the runCasl action to create a browser for CAS tables with support for pagination.

All comments are welcome. Please feel free to clone the code, make it better.

Cheers...
Deva

Let runCasl be your BFF and favorite action hero

"CAS Stored Process" with my Favorite Action Hero runCasl was published on SAS Users.

June 25, 2019
 

Everyone is talking about artificial intelligence. Unfortunately, a lot of what you hear about AI in the movies and on TV is sensationalized for entertainment. Indeed, AI is overhyped. But AI is also real and powerful. Consider this: engineers worked for years on hand-crafted models for object detection, facial [...]

Meet 5 data pioneers developing AI solutions for the real world was published on SAS Voices by Oliver Schabenberger

June 24, 2019
 

When fitting a least squares regression model to data, it is often useful to create diagnostic plots of the residuals versus the explanatory variables. If the model fits the data well, the plots of the residuals should not display any patterns. Systematic patterns can indicate that you need to include additional explanatory effects to model the data. Sometimes it is difficult to spot patterns in a seemingly random cloud of points, so some analysts like to add a scatter plot smoother to the residual plots. You can use the SMOOTH suboption to the PLOTS=RESIDUALS option in many SAS regression procedures to generate a panel of residual plots that contain loess smoothers. For SAS procedures that do not support the PLOTS=RESIDUALS option, you can use PROC SGPLOT to manually create a residual plot with a smoother.

Residual plots with loess smoothers

Many SAS linear regression procedures such as PROC REG and PROC GLM support the PLOTS=RESIDUAL(SMOOTH) option on the PROC statement. For example, the following call to PROC GLM automatically creates a panel of scatter plots where the residuals are plotted against each regressor. The model is a two-variable regression of the MPG_City variable in the Sashelp.Cars data.

/* residual plots with loess smoother */
ods graphics on;
proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight;
run; quit;

The loess smoothers can sometimes reveal patterns in the residuals that would not otherwise be perceived. In this case, it looks like there is a quadratic pattern to the residuals-versus-EngineSize graph (and perhaps for the Weight variable as well). This indicates that you might need to include a quadratic effect in the model. Because the EngineSize and Weight variables are highly correlated (ρ = 0.81), the following statements add only a quadratic effect for EngineSize:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   where Type in ('SUV', 'Truck');
   model MPG_City = EngineSize Weight
                    EngineSize*EngineSize ;
quit;

After adding the quadratic effect, the residual plots do not reveal any obvious systematic trends. Also, the residual plot for Weight no longer shows any quadratic pattern.

How to use PROC SGPLOT to create a residual plot with a smoother

If you use a SAS procedure that does not support the PLOTS=RESIDUALS(SMOOTH) option, you can output the residual values to a SAS data set and use PROC SGPLOT to create the residual plots. Even when a procedure DOES support the PLOTS=RESIDUALS(SMOOTH) option, you might want to customize the plot by adding legends, by changing attributes of the markers or curve, or by specifying a value for the smoothing parameter.

An example is shown below. If you use the same model for MPG_City, but use all observations in the data set, the residual plot for EngineSize looks very strange. For these data, the smoothing parameter for the loess curve is very small and therefore the loess curve overfits the residuals:

proc glm data=Sashelp.Cars plots(only) = Residuals(smooth);
   model MPG_City = EngineSize Weight;
   output out=RegOut predicted=Pred residual=Residual;
run; quit;

Yuck! The loess curve for the plot on the left clearly overfits the residuals-versus-EngineSize data! Unfortunately, you cannot change the smoothing parameter from the PROC GLM syntax. However, you can change the default smoothing parameter in PROC SGPLOT and you can make other modifications to the plot as well. Notice in the previous call to PROC GLM that the OUTPUT statement creates a data set named RegOut that contains the residual values and the original variables. Therefore, you can create a residual plot and add a loess smoother by using PROC SGPLOT, as follows:

ods graphics / attrpriority=NONE;
title "Residuals for Model";
proc sgplot data=RegOut ;
   scatter x=EngineSize y=Residual / group=Origin;
   loess x=EngineSize y=Residual / nomarkers smooth=0.5;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;

The smoothing parameter was manually set to 0.5, but you can use PROC LOESS if you want to choose a smoothing parameter that optimizes some information criterion such as the AICC statistic. Notice that you can use additional SGPLOT statements to add a reference grid and to change marker attributes. If you prefer, you could add a different kind of smoother such as a penalized B-spline by using the PBSPLINE statement.
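
For example, the following sketch swaps the loess smoother for a penalized B-spline; the other statements are unchanged:

proc sgplot data=RegOut;
   scatter x=EngineSize y=Residual / group=Origin;
   pbspline x=EngineSize y=Residual / nomarkers;
   refline 0 / axis=y;
   xaxis grid; yaxis grid;
run;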

You might wonder why the smoothing parameter for the EngineSize residual plot is so small. The parameter is chosen to optimize a criterion such as the AICC statistic, so why does it overfit the data? An example in the PROC LOESS documentation provides an explanation. The chosen value for the smoothing parameter is one that corresponds to a local minimum of an objective function that involves the AICC statistic. Unfortunately, a set of data can have multiple local minima, and this is the case for the residuals of the EngineSize variable. When the smoothing parameter is 0.534, the AICC criterion reaches a local minimum. However, there are smaller values of the smoothing parameter for which the AICC criterion is even smaller. The minimum value of the AICC occurs when the smoothing parameter is 0.015, which leads to the "jagged" loess curve seen in the panel of residual plots earlier in this section. If you want to see this phenomenon yourself, run the following PROC LOESS code and look at the criterion plot.

ods select CriterionPlot SmoothingCriterion FitPlot;
proc loess data=RegOut;
   model Residual = EngineSize / select=AICC(global) ;
run;

Because a data set can be smoothed at multiple scales, the "optimal" smoothing parameter that is chosen automatically by the PLOTS=RESIDUALS(SMOOTH) option might not enable you to see the general trend of the residuals. If you experience this phenomenon, output the residuals and use PROC SGPLOT or PROC LOESS to compute a more useful smoother.

In summary, SAS provides the PLOTS=RESIDUALS(SMOOTH) option to automatically create residual-versus-regressor plots. Although this panel usually provides a useful indication of patterns in the residuals, you can also output the residuals to a data set and use PROC SGPLOT or PROC LOESS to create a customized residual plot.

The post Add loess smoothers to residual plots appeared first on The DO Loop.

June 21, 2019
 

For every project in SAS®, the first step is almost always making your data available. This blog shows you how to load three of the most common input data types—a data set, a text file, and a Microsoft Excel file—into SAS® Cloud Analytic Services (CAS) tables.

The three methods that I show here are the three easiest ways to load each data type into CAS. Multiple tools can load data into CAS, but I am showing the tools that I consider the easiest to use and that are probably the most familiar to SAS programmers.

You need to place your data in a location that can be accessed by the programming environment that is used to access CAS. The most common such environment is SAS® Studio. Input data files that are used in CAS are often very large, so you will need to use an SFTP tool to move your data from your PC to a directory that your SAS Studio session can access. (Check with your system administrator to see which tool is preferred at your site.)

After your data is in a location that can be accessed by the programming environment, you need to start a CAS session. This is done with the CAS statement; here is the syntax:

cas session-name <option(s)>;

The options that you specify depend on how your system administrator configured your environment. For example, I asked my system administrator to set it up so that the only thing I need to do is issue the following statement:

cas;

That statement then creates a CAS session with the default name of CASAUTO, with an active caslib of CASUSER.
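
If your site requires more control, you can name the session and specify session options yourself. Here is a hypothetical example (the session name and option values are illustrative):

cas mySession sessopts=(caslib="casuser" timeout=1800 metrics=true);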

After you establish your CAS session, you can start loading data.

Load a SAS data set

The easiest way to load SAS data into CAS is to use a DATA step. The basic syntax is the same as it is when you are creating a SAS data set. The key difference is that the libref that is listed in the DATA step must point to a caslib.

The following example accesses SASHELP.CARS and then creates the table CARS in the CASUSER caslib.

cas;                    /* log on to the CAS session */
caslib _all_ assign;    /* create a libref for each active caslib */

data casuser.cars;      /* create the table in the caslib */
   set sashelp.cars;
run;

The one thing to note about this code is that this DATA step is running in SAS and not CAS. A DATA step runs in CAS only if the input and output librefs are both using the CAS engine and only if it uses language elements that CAS supports. In this example, the SASHELP libref was not created with the CAS engine, so the step must run in SAS.

There are no calculations in this step, so there is no effect on performance.
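
As a quick illustration, the following DATA step can run in CAS because both librefs point to a caslib (it assumes the CARS table created above):

data casuser.cars2;      /* output: CAS table */
   set casuser.cars;     /* input: CAS table */
run;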

When you load a data set into a CAS table, one task that you might want to perform is to promote the table. The following DATA step shows how to promote a table as you create it:

data casuser.cars(promote=yes);
   set sashelp.cars;
run;

Promoting a table gives it a global scope. You can then access the table in multiple sessions, and the table is also available in SAS® Visual Analytics. The PROMOTE= data set option can be used with all three of the examples in this blog.
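
If a table already exists in session scope, you can promote it without re-creating it. Here is a sketch that uses the PROMOTE statement in PROC CASUTIL (assuming the CARS table from the earlier example):

proc casutil;
   promote casdata="cars" incaslib="casuser" outcaslib="casuser";
quit;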

Load a delimited text file

The easiest way to load a delimited text file and store it as a CAS table is to use another familiar SAS step, the IMPORT procedure. The syntax is going to be basically the same as it is in SAS. Again, the key difference is that you need to point to a caslib in the OUT= option of the PROC IMPORT statement.

If you are running SAS® Studio 5.1, a common location for a text file that needs to be loaded into CAS is the SAS Content folder. This folder is a predefined repository for text files that you need to access from SAS Studio.

In order to access the files from SAS Content, you need to use the FILESRVC access method with the FILENAME statement. This method enables you to store and retrieve content using the SAS® Viya® Files service. Here is the basic syntax:

filename fileref filesrvc folderpath='path'
            filename='name';

For more information about this access method, see SAS 9.4 Global Statements Reference.

In the following example, PROC IMPORT and the FILESRVC access method are used to load the class.csv file from the SAS Content folder. The resulting ALLCLASS table is written to the CASUSER caslib.

filename myfile filesrvc folderpath='/Users/saskir'
   filename='class.csv';

proc import datafile=myfile out=casuser.allclass(promote=yes)
   dbms=csv;
run;

Load an Excel file

This section shows you how to load an Excel file into a CAS table by using PROC IMPORT, which requires a license for SAS/ACCESS® Interface to PC Files. If you are unsure whether you have this product, you can write all of your licensed products to the log by using the SETINIT procedure:

proc setinit;
   run;

You should see the following in the log when the product is licensed:

---SAS/ACCESS Interface to PC Files

After you confirm that you have this product, you can adapt the following PROC IMPORT code. This example loads the ReportTest.xlsx file and stores it in a table named ReportTest in the CASUSER caslib.

cas;
caslib _all_ assign;

proc import datafile='/viyashare/ReportTest.xlsx'
   out=casuser.ReportTest dbms=xlsx;
run;

There are other methods

The purpose of this blog was to show you simple ways to load a SAS data set, a text file, and an Excel file into CAS. Although these tasks can be accomplished in multiple ways, both programmatically and interactively, the methods shown here are the easiest and most straightforward ways to get data into CAS.

Learn the three easiest ways to load data into CAS tables was published on SAS Users.