November 15, 2018
 

When describing its business model, our customer Epipoli (one of the leading gift card companies in Europe) tells the story of the ancient Limoncello makers of Italy. Limoncello is an intensely lemon-flavored liqueur famously produced in Sorrento, the Amalfi Coast, and the island of Capri. The drink started as a local tradition, but quickly [...]

Lessons learned from a real-world AI implementation was published on SAS Voices by David Tareen

November 15, 2018
 

The US midterm elections have finally started to wind down, and we now have some (mostly) finalized results to study. But what's the best way to visualize who won the US congressional seats in each of the 435 districts? Let's dive into this topic!... Preparation For starters, I couldn't find [...]

The post Building a better election map appeared first on SAS Learning Post.

November 15, 2018
 

JSON is a popular format for data exchange between APIs and some modern databases. It's also used as a way to archive audit logs in many systems. Because JSON can represent hierarchical relationships of data fields, many people consider it to be superior to the CSV format -- although it's certainly not yet universal.

I learned recently that newline-delimited JSON, also called JSONL or JSON Lines, is growing in popularity. In a JSONL file, each line of text represents a valid JSON object -- building up to a series of records. But there is no hierarchical relationship among these lines, so when taken as a whole the JSONL file is not valid JSON. That is, a JSON parser can process each line individually, but it cannot process the file all at once.

In SAS, you can use the JSON libname engine to parse valid JSON files and PROC JSON to create them. But neither of these can create or parse JSONL files directly. Here's a simple example of a JSONL file. Each line, enclosed in braces, represents valid JSON. But if you paste the entire body into a validation tool like JSONLint, the parsing fails.

JSONL sample, not valid in JSON parsers
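
The sample image isn't reproduced here, but a JSONL file built from records like those in SASHELP.CLASS (the data set used in the code below) looks like this:

{"Name":"Alfred","Sex":"M","Age":14,"Height":69,"Weight":112.5}
{"Name":"Alice","Sex":"F","Age":13,"Height":56.5,"Weight":84}
{"Name":"Barbara","Sex":"F","Age":13,"Height":65.3,"Weight":98}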

If we need these records to be true JSON, we need a hierarchy. This requires us to set off the rows with additional braces and brackets and separate them with commas, like this:

Valid JSON with hierarchies
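
For example, here are the same three records wrapped as a valid JSON array:

[
  {"Name":"Alfred","Sex":"M","Age":14,"Height":69,"Weight":112.5},
  {"Name":"Alice","Sex":"F","Age":13,"Height":56.5,"Weight":84},
  {"Name":"Barbara","Sex":"F","Age":13,"Height":65.3,"Weight":98}
]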

Creating JSONL with PROC JSON and DATA step

In a recent SAS Support Communities thread, a SAS user was struggling to use PROC JSON and a SAS data set to create a JSONL file for use with the Amazon Redshift database. PROC JSON can't create the finished file directly, but we can use PROC JSON to create the individual JSON object records. Our solution looks like this:

  • Use PROC JSON to read each record of the source data set and create a new JSON file. DATA step and CALL EXECUTE can generate these steps for us.
  • Using DATA step, post-process the collection of JSON files and append these into a final JSONL file.

Here's the code we used. You need to change only the output file name and the source SAS data set.

/* Build a JSONL (newline-delimited JSON) file */
/* from the records in a SAS data set          */
filename final "c:\temp\final.jsonl" ;
%let datasource = sashelp.class;
 
/* Create a new subfolder in WORK to hold */
/* temp JSON files, avoiding conflicts    */
options dlcreatedir;
%let workpath = %sysfunc(getoption(WORK))/json;
libname json "&workpath.";
libname json clear;
 
/* Will create and run a separate PROC JSON step */
/* for each record.  This might take a while   */
/* for very large data.                        */
/* Each iteration will create a new JSON file  */
data _null_;
 set &datasource.;
 call execute(catt('filename out "',"&workpath./out",_n_,'.json";'));
 call execute('proc json out=out nosastags ;');
 call execute("export &datasource.(obs="||_n_||" firstobs="||_n_||");");
 call execute('run;');
run;
 
/* This will concatenate the collection of JSON files */
/* into a single JSONL file                           */
data _null_;
 file final encoding='utf-8' termstr=lf; /* LF newlines, per the JSONL convention */
 infile "&workpath./out*.json";
 input;
 /* trim the start and end [ ] characters */
 final = substr(_infile_,2,length(_infile_)-2);
 put final;
run;

From what I've read, it's a common practice to compress JSONL files with gzip for storage or faster transfers. That's a simple step to apply in our example, because the FILENAME ZIP access method supports a GZIP option as of SAS 9.4 Maintenance 5. To create a gzipped final result, change the first FILENAME statement to something like:

filename final ZIP "c:\temp\final.jsonl.gz" GZIP;

The JSONL format is new to me and I haven't needed to use it in any of my applications. If you use JSONL in your work, I'd love to hear your feedback about whether this approach would create the types of files you need.

The post Create newline-delimited JSON (or JSONL) with SAS appeared first on The SAS Dummy.

November 14, 2018
 

Prior to SAS Viya

With the creation of SAS Viya, the ability to run DATA Step code in a distributed manner became a reality. Before distributed DATA Step, programmers never had to think about achieving repeatable results when SAS7BDAT datasets were the sources for DATA Step code containing a BY statement. That's because, prior to SAS Cloud Analytic Services (CAS), DATA Step ran single-threaded and the source SAS7BDAT dataset was stored on disk. Every run of the code produced repeatable results because the sequence of rows within each BY group was preserved between runs. To illustrate this, review figures 1, 2, and 3.

Figure 1 shows the source SAS7BDAT dataset WORK.TEST1. Notice the sequence of VAR2, especially on rows 1 and 4 (i.e., _N_ = 1 and 4).

_N_   VAR1   VAR2
 1     1      N
 2     1      Y
 3     1      Y
 4     2      Y
 5     2      Y
 6     2      N


Figure 1. WORK.TEST1, the original SAS7BDAT dataset

In figure 2, we see a BY statement with the variable VAR1, which processes the rows in ascending order of VAR1. We are also using FIRST. processing to identify the first occurrence of each BY group. Because this data is stored on disk and the DATA Step executes in a single thread, the result table is repeatable no matter how many times we run the DATA Step code.

Figure 2. Focus on the IF statement, especially VAR2
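
The code itself isn't shown here, but a DATA Step consistent with the result in figure 3 might look like this sketch (the exact IF condition is an assumption):

data work.test2;
   set work.test1;
   by var1;
   /* subset to the first row of each VAR1 group, when that row has VAR2='N' */
   if first.var1 and var2='N';
run;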

In figure 3, we see the output SAS7BDAT dataset WORK.TEST2.

_N_   VAR1   VAR2
 1     1      N

Figure 3. WORK.TEST2 result dataset from running the code in Figure 2

In figure 4, we are running the same DATA Step but this time our source and target tables are CAS tables. The source table CASLIB.TEST1 was created by lifting the original SAS7BDAT dataset WORK.TEST1 (review figure 1) into CAS.

Figure 4. DATA Step executing in CAS
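
A sketch of the same step pointed at CAS tables (CASLIB is an assumed caslib libref):

data caslib.test2;
   set caslib.test1;
   by var1;
   if first.var1 and var2='N';
run;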

In figure 5, we see that the DATA Step logic is respected in runs 1, 2, and 3, but we are not achieving repeatable results. This is because CAS runs on multiple threads. The BY statement, which groups the data correctly for each BY group, is performed on the fly, and it does not preserve the sequence of rows within a BY group between runs.

For some processes this is not a concern, but for others it could be. If you need to obtain repeatable results from DATA Step code that runs distributed in CAS, and to match your SAS 9 single-threaded DATA Step results, I suggest the following workaround.

Figure 5. DATA Step logic is respected but yields different results with each run

With SAS Viya

The workaround is simple to understand and implement. For each SAS7BDAT dataset being lifted into a CAS table (see figure 6), we need to add a new variable, ROW_ID.

_N_   VAR1   VAR2
 1     1      N
 2     1      Y
 3     1      Y
 4     2      Y
 5     2      Y
 6     2      N

Figure 6. Original SAS7BDAT dataset source WORK.TEST1

To accomplish this, we will leverage the automatic variable _N_ that is available to all DATA Step programmers. _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated. In our case, the value for each row is the row sequence in the original SAS7BDAT dataset. Figure 7 contains the SAS code we ran on the SAS 9.4M5 workspace server or the SAS Viya compute server to add the new variable ROW_ID.

 

Figure 7. Creating the new variable ROW_ID
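
A minimal version of that code might be:

data work.test1;
   set work.test1;
   row_id = _n_;   /* preserve the original row sequence */
run;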

By reviewing figure 8 we can see the new variable ROW_ID in the SAS7BDAT dataset WORK.TEST1. Now that we have the new variable, we are ready to lift this dataset into CAS.

_N_   VAR1   VAR2   ROW_ID
 1     1      N       1
 2     1      Y       2
 3     1      Y       3
 4     2      Y       4
 5     2      Y       5
 6     2      N       6

Figure 8. WORK.TEST1 with the new variable ROW_ID

There are many ways to lift a SAS7BDAT dataset into CAS. One way is to use a DATA Step like we did in figure 9.

Figure 9. DATA Step code to create distributed CAS table CASLIB.TEST1 
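
For example (assuming CASLIB is a caslib libref and a CAS session is active):

data caslib.test1;
   set work.test1;   /* lift the SAS7BDAT dataset into a CAS table */
run;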

To obtain repeatable results, we need to control the sequence of rows within each BY group. We accomplish this by adding the new variable ROW_ID as the last variable in the BY statement of our DATA Step code (see figure 10).

Figure 10. Add ROW_ID as last variable of the BY group
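
A sketch of that DATA Step, with ROW_ID appended to the BY statement (the IF condition is assumed, as before):

data caslib.test2;
   set caslib.test1;
   by var1 row_id;   /* ROW_ID controls the row order within each VAR1 group */
   if first.var1 and var2='N';
run;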

Figure 11 shows us the output CAS table created by the code in figure 10. By adding the new variable ROW_ID and using that variable as the last variable of the BY statement, we are controlling the sequencing of rows within the BY groups for all 3 runs.

VAR1   VAR2   ROW_ID
 1      N       1

Figure 11. Distributed CAS table CASLIB.TEST2

Conclusion

With distributed DATA Step come great opportunities to improve runtimes. It also means we need to understand the differences between single-threaded processing of SAS7BDAT datasets stored on disk and distributed processing of CAS tables stored in-memory. To help you on that journey, I suggest you read the SAS Global Forum paper, Parallel Programming with the DATA Step: Next Steps.

How to achieve repeatable results with distributed DATA Step BY Groups was published on SAS Users.

November 14, 2018
 

Teenagers. I have three boys, and two of them are teenagers. 15-year-old Kav is my middle son, and he gave me permission to share his story here. When he was five, Kav was diagnosed with Type 1 diabetes, and I'm in awe of how he's accepted this challenging, lifelong [...]

How next-best-action could transform health care was published on SAS Voices by Heather Hallett

November 14, 2018
 

An ROC curve graphically summarizes the tradeoff between true positives and true negatives for a rule or model that predicts a binary response variable. An ROC curve is a parametric curve that is constructed by varying the cutpoint value at which estimated probabilities are considered to predict the binary event. Most SAS data analysts know that you can fit a logistic model in PROC LOGISTIC and create an ROC curve for that model, but did you know that PROC LOGISTIC enables you to create and compare ROC curves for ANY vector of predicted probabilities regardless of where the predictions came from? This article shows how!

If you want to review the basic construction of an ROC curve, see a previous article that constructs an empirical ROC curve from first principles. The PROC LOGISTIC documentation provides the formulas used for constructing an ROC curve.

Produce an ROC plot by using PROC LOGISTIC

Before discussing how to create an ROC plot from an arbitrary vector of predicted probabilities, let's review how to create an ROC curve from a model that is fit by using PROC LOGISTIC. The following data and model are taken from the PROC LOGISTIC documentation. The data are for 43 cancer patients who also had an intestinal obstruction. The response variable popInd is a postoperative indicator variable: popInd = 1 for patients who died within two months after surgery. The explanatory variables are three preoperative screening tests. The goal of the study is to identify patients who might benefit from surgery, where "benefit" is measured by postoperative survival of at least two months.

data roc;
   input alb tp totscore popind @@;
   totscore = 10 - totscore;
   datalines;
3.0 5.8 10 0   3.2 6.3  5 1   3.9 6.8  3 1   2.8 4.8  6 0
3.2 5.8  3 1   0.9 4.0  5 0   2.5 5.7  8 0   1.6 5.6  5 1
3.8 5.7  5 1   3.7 6.7  6 1   3.2 5.4  4 1   3.8 6.6  6 1
4.1 6.6  5 1   3.6 5.7  5 1   4.3 7.0  4 1   3.6 6.7  4 0
2.3 4.4  6 1   4.2 7.6  4 0   4.0 6.6  6 0   3.5 5.8  6 1
3.8 6.8  7 1   3.0 4.7  8 0   4.5 7.4  5 1   3.7 7.4  5 1
3.1 6.6  6 1   4.1 8.2  6 1   4.3 7.0  5 1   4.3 6.5  4 1
3.2 5.1  5 1   2.6 4.7  6 1   3.3 6.8  6 0   1.7 4.0  7 0
3.7 6.1  5 1   3.3 6.3  7 1   4.2 7.7  6 1   3.5 6.2  5 1
2.9 5.7  9 0   2.1 4.8  7 1   2.8 6.2  8 0   4.0 7.0  7 1
3.3 5.7  6 1   3.7 6.9  5 1   3.6 6.6  5 1
;
 
ods graphics on;
proc logistic data=roc plots(only)=roc;
   LogisticModel: model popind(event='0') = alb tp totscore;
   output out=LogiOut predicted=LogiPred;       /* output predicted value, to be used later */
run;
ROC curve for linear logistic model fitted in PROC LOGISTIC in SAS

You can see the documentation for details about how to interpret the output from PROC LOGISTIC, but the example shows that you can use the PLOTS=ROC option (or the ROC statement) to create an ROC curve for a model that is fit by PROC LOGISTIC. For this model, the area under the ROC curve is 0.77. Because a random "coin flip" prediction has an expected area of 0.5, this model predicts the survival of surgery patients better than random chance.

Create an ROC curve for any prediction rule

A logistic model is not the only way to predict a binary response. You could also use a decision tree, a generalized mixed model, a nonparametric regression model, or even ask a human expert for her opinion. An ROC curve only requires two quantities: for each observation, you need the observed binary response and a predicted probability. In fact, if you carefully read the PROC LOGISTIC documentation, you will find these sentences:

  • In the "Details" section: "ROC curves can be created ... from the specified model in the MODEL statement, from specified models in ROC statements, or from input variables which act as [predicted probabilities]." (Emphasis added.)
  • In the documentation of the ROC statement: "The PRED= option enables you to input a criterion produced outside PROC LOGISTIC; for example, you can fit a random-intercept model by using PROC GLIMMIX or use survey weights in PROC SURVEYLOGISTIC, then use the predicted values from those models to produce an ROC curve for the comparisons."

In other words, you can use PROC LOGISTIC to create an ROC curve regardless of how the predicted probabilities are obtained! For argument's sake, let's suppose that you ask a human expert to predict the probability of each patient surviving for at least two months after surgery. (Notice that there is no statistical model here, only a probability for each patient.) The following SAS DATA step defines the predicted probabilities, which are then merged with the output from the earlier PROC LOGISTIC call:

data ExpertPred;
   input ExpertPred @@;
   datalines;
0.95 0.2  0.05 0.3  0.1  0.6  0.8  0.5 
0.1  0.25 0.1  0.2  0.05 0.1  0.05 0.1 
0.4  0.1  0.2  0.25 0.4  0.7  0.1  0.1 
0.3  0.2  0.1  0.05 0.1  0.4  0.4  0.7
0.2  0.4  0.1  0.1  0.9  0.7  0.8  0.25
0.3  0.1  0.1 
;
data Survival;
   merge LogiOut ExpertPred;
run;
 
/* create ROC curve from a variable that contains predicted values */
proc logistic data=Survival;
   model popind(event='0') = ExpertPred / nofit;
   roc 'Expert Predictions' pred=ExpertPred;
   ods select ROCcurve;
run;
ROC curve from external predictions, created with PROC LOGISTIC in SAS

Notice that you only need to supply two variables on the MODEL statement: the observed responses and the variable that contains the predicted values. On the ROC statement, I've used the PRED= option to indicate that the ExpertPred variable is not being fitted by the procedure. Although PROC LOGISTIC creates many tables, I've used the ODS SELECT statement to suppress all output except for the ROC curve.

Overlay and compare ROC curves from different models or rules

You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows:

/* overlay two or more ROC curves by using variables of predicted values */
proc logistic data=Survival;
   model popind(event='0') = LogiPred ExpertPred / nofit;
   roc 'Logistic' pred=LogiPred;
   roc 'Expert'   pred=ExpertPred;
   ods select ROCOverlay;
   /* optional: for a statistical comparison, use ROCCONTRAST stmt and remove the ODS SELECT stmt */
   *roccontrast reference('Expert') / estimate e;
run;
Compare ROC curves by using PROC LOGISTIC in SAS to overlay the ROC curves

This ROC overlay shows that the "expert" prediction is almost always superior or equivalent to the logistic model in terms of true and false classification rates. As noted in the comments of the previous call to PROC LOGISTIC, you can use the ROCCONTRAST statement to obtain a statistical analysis of the difference between the areas under the curves (AUC).

In summary, you can use the ROC statement in PROC LOGISTIC to generate ROC curves for models that were computed outside of PROC LOGISTIC. All you need are the predicted probabilities and observed response for each observation. You can also overlay and compare two or more ROC curves and use the ROCCONTRAST statement to analyze the difference between areas under the curves.

The post Create and compare ROC curves for any predictive model appeared first on The DO Loop.

November 13, 2018
 

Note: Today’s utility industry is in upheaval. All of the assumptions the business has run on have been turned on their heads. This post is the third in a three-part series looking at how analytics are helping utilities navigate this challenging landscape and find new opportunities for improvements in operations, [...]

The digital utility era is now: The customer/grid nexus was published on SAS Voices by Mike F. Smith

November 13, 2018
 

In my previous blog post I demonstrated how to create your own CAS actions and action sets.  In this post, we will explore how to create your own CAS functions using the CAS Language (CASL).  A function is a component of the CASL programming language that can accept arguments, perform a computation or other operation, and return a value.  The value that is returned can be used in an assignment statement or elsewhere in expressions.

About SAS functions

SAS provides two types of supplied functions: built-in functions and common functions.  Built-in functions contain functionality that is unique to CASL.  These allow you to perform operations on your result tables, arrays, and dictionaries, and provide run-time support for your CASL programs.  Built-in functions cannot be replaced with user-defined functions.

Conversely, common functions provide functionality that is shared with other SAS languages, such as the DATA step.  When used in a CASL program, these functions accept CASL values and return a CASL value.  Unlike built-in functions, you can replace common functions with user-defined functions.

Since the capabilities of built-in functions are unique to CASL, let's look at these in depth and demonstrate with an example.  Save the following FedSQL code in an external file called hmeqsql.sas.  This code will be read into CAS and stored as a variable.
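
The query isn't reproduced here; a hypothetical hmeqsql.sas could contain FedSQL like this (the hmeq caslib and table names are assumptions, based on the output table referenced below):

/* hypothetical contents of hmeqsql.sas */
create table hmeq.out as
   select *
   from hmeq.hmeq
   where bad = 1;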

The execDirect action executes FedSQL code in CAS.  The READPATH built-in function reads the FedSQL code saved in hmeqsql.sas and stores it in the CASL variable sqlcode which is used as input to the query parameter.
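
As a sketch, those steps might look like this (the file path is an assumption):

proc cas;
   /* READPATH reads the FedSQL code into the CASL variable sqlcode */
   sqlcode = readpath("/home/user/hmeqsql.sas");
   /* execDirect executes the FedSQL code in CAS */
   fedSql.execDirect / query=sqlcode;
quit;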

The fetch action displays the first 20 rows from the output table hmeq.out.
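
For example (caslib and table names assumed as above):

proc cas;
   table.fetch / table={caslib="hmeq", name="out"} to=20;
quit;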

If you don’t feel like looking through the documentation for a built-in or common function, a list of each can be generated programmatically.  Run the following code to see a list of built-in functions.

Partial list of CASL built-in functions

Run the following code to see a list of common functions.

Partial list of common functions

User-defined CASL functions

In addition to the customizable capabilities of built-in functions supplied by SAS, you can also create your own functions using the FUNCTION statement.  User-defined functions can be called in expressions using CASL and they provide a large amount of flexibility.  The following example creates four different functions for temperature conversion.
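
The listing isn't reproduced here, but four temperature-conversion functions written with the FUNCTION statement might look like this sketch (the function names are assumptions):

proc cas;
   function FtoC(f);
      return (f - 32) * 5 / 9;   /* Fahrenheit to Celsius */
   end;
   function CtoF(c);
      return c * 9 / 5 + 32;     /* Celsius to Fahrenheit */
   end;
   function CtoK(c);
      return c + 273.15;         /* Celsius to Kelvin */
   end;
   function KtoC(k);
      return k - 273.15;         /* Kelvin to Celsius */
   end;
quit;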

After you create these functions, you can call them immediately, or you can store them in an external file and call them via a %include statement.  In this example, the user-defined functions have been stored in an external file called FunctionStore.sas.  You can call one, all, or any number of your user-defined functions.
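
For instance, assuming the four functions sketched above were saved to FunctionStore.sas, the calls might look like this (the path is hypothetical):

proc cas;
   %include "/home/user/FunctionStore.sas";   /* load the user-defined functions */
   boiling = FtoC(212);     /* 100 */
   steam   = CtoK(100);     /* 373.15 */
   print boiling;
   print steam;
quit;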

The output from each function call is displayed in the log.

Lastly, if you want to see all user-defined functions, run the FUNCTIONLIST statement.  A list will be printed to the log.
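
A minimal example:

proc cas;
   functionlist;   /* writes the user-defined function names to the log */
quit;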

More about CASL programming and using functions in CASL

Check out these resources for further information on programming in the CASL language and using functions in CASL.

Customize your CASL code with built-in and user-defined functions was published on SAS Users.