May 17, 2019
 

In the article Serverless functions and SAS Viya - a good match, I discussed using serverless functions to deliver SAS Viya applications. Ignoring all the buzzwords, a serverless function boils down to a set of REST APIs. So, if you tried the example, you are now a REST API developer 🙂 .

The serverless function allowed the application developer to do the following:

  1. Define what the end user must supply to the function. A good application developer will try to make the request simple and easy to understand.
  2. Return to the end user a response easily consumed by the client's program. Again, a good application developer would make sure the response satisfies most common usage scenarios.
  3. Hide all the details of what it took to satisfy the user's request.

This blog discusses using GraphQL to achieve the same goals. First, I will briefly discuss GraphQL, where it fits in with SAS Viya application integration, and how to create GraphQL-based applications. I also provide a series of examples based on real-world scenarios.

The image below displays a high-level comparison of the serverless and GraphQL approaches.

serverless and GraphQL process flow


Steps in the GraphQL flow

  1. A GraphQL server replaces the AWS API Gateway.
  2. The code that runs in the GraphQL server is referred to as "resolvers" - as the name implies, resolvers are used by the GraphQL server to execute user requests.
  3. The resolvers make the necessary REST API calls to the SAS Viya Server.

All of the code in this article resides in the restaf-graphql-demo GitHub repository. If you are not familiar with GraphQL, please review the links at the end of this article before proceeding.

Why GraphQL?

Some smart folks at Facebook created GraphQL to solve problems they encountered using standard REST APIs. Companies like GitHub, Netflix, PayPal, The New York Times and many others are adopting GraphQL.

Some of the key motivators are:

  1. Users define and request exactly what they need, following the specifications in the schema
  2. A convenient way to front existing systems (REST-based or not) and databases with a developer-friendly API
  3. Returning only the requested information reduces the data transferred - important for reducing network traffic
  4. GraphQL is less "chatty" - where a REST API often requires multiple trips to the server, GraphQL can accomplish the same task in a single round trip (a minimal illustration follows this list)
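As a minimal illustration (this query is mine, not from the repository), a GraphQL client asks for exactly the fields it needs, and only those fields are computed and returned:

# hypothetical query: only the three requested fields come back
{
    reports {
        name
        modifiedBy
        modifiedOn
    }
}

A comparable REST flow might need one call to list the reports and another call per report to gather the metadata.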

Why GraphQL for SAS Viya application developers?

While the general GraphQL characteristics listed above are important, GraphQL is also a useful technology for developers creating applications integrated with SAS Viya.

  1. GraphQL is a ready-made vehicle for SAS users to deliver their applications as the next-generation "stored process," developed with the data step and procedures, CAS language (CASL) statements, custom CASL actions and SAS REST APIs.
  2. GraphQL is a great way for front-end and back-end developers to communicate.
  3. Developers can code to an agreed contract as specified by the GraphQL schema.
  4. Front-end developers can be confident that what they get is exactly what they asked for.

Writing the GraphQL-based applications

The GraphQL queries used in this article are examples for demonstration purposes only and not "standards or strict guidelines" to follow. The code in the GitHub repository and the examples outlined below will help you jump-start your excellent adventure in GraphQL and SAS Viya applications.

The high-level steps for writing an application using GraphQL query are:

SAS Viya Side

  1. SAS programmers, data analysts and data scientists develop their intellectual property with SAS programs written with SAS procedures, CAS actions, the data step and the CASL language.

Server Side

  1. Build the GraphQL schema and define the queries (see the repository for examples). In relation to SAS Viya, the schema describes the input and output of the SAS programs.
    • Make sure you have discussed this with the UI developers and the SAS programmers.
  2. Write the resolvers - the GraphQL server will call this code to resolve the requests by the user (see the repository for examples).
  3. Register both of these with the GraphQL server (a minimal sketch follows this list).
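For illustration, here is a minimal registration sketch using the apollo-server package (the schema matches Example 1 below; the resolver module path is an assumption):

// a minimal sketch, assuming apollo-server and a scoreLoan resolver module
const { ApolloServer, gql } = require('apollo-server');

const typeDefs = gql`
    type Query {
        scoreLoan(amount: Int assets: Int): Float
    }
`;

const resolvers = {
    Query: {
        scoreLoan: require('./resolvers/scoreLoan')   // hypothetical path
    }
};

const server = new ApolloServer({ typeDefs, resolvers });
server.listen().then(({ url }) => console.log(`GraphQL server ready at ${url}`));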

Client Side

  1. You can build the web apps in the normal way with these characteristics:
    • These apps call a single endpoint (/graphql) with a POST method.
    • The payload is the GraphQL query.
    • The response matches the query and is easily accessible.

The image below shows the flow of a GraphQL-based application. Users' queries are sent to the GraphQL server. The server parses the queries and calls the appropriate resolver (your code) to obtain the values for the requested fields. In this project the resolvers use restaf to make REST API calls to SAS Viya.

GraphQL-based application process flow


The rest of the blog discusses a few examples. All these examples are available in the repository. I chose to write the examples using JavaScript since it is one of the languages I am familiar with and can write reasonably decent code in. You can develop GraphQL-based SAS Viya applications in all the popular languages of today.

Example 1: Scoring a loan from client app
In this example, a data scientist working for a bank has created a model to score a loan applicant's eligibility. The scientist outlines the following requirements:

  1. The user enters only the desired loan amount and their current assets. All the other parameters needed for scoring have set values. All the values must be passed to the SAS code as a dictionary named _args_.
  2. Since the scientist wants to run A/B experiments, the location and name of the scoring model's astore must be passed in as a dictionary named _appEnv_.
  3. The code developed by the data scientist is below. The score is returned as a dictionary.

    {score= <value>}

SAS Code

I wrote the SAS program in this example in CASL.

loadactionset "astore";

/* convert the arguments to a CAS table */
/* _args_ and _appEnv_ are generated by caslBase - see caslBase for details */

/* argsToTable is a CASL function that converts a dictionary to a CAS table - see lib/argsToTable.js for details */
argsToTable(_args_, 'casuser', 'INPUTDATA');

/* score */
action astore.score /
    table  = { caslib= 'casuser', name= 'INPUTDATA' }
    rstore = { caslib= _appEnv_.astore.caslib, name= _appEnv_.astore.name }
    casout = { caslib= 'casuser', name= 'OUTPUTDATA', replace= TRUE };

/* fetch the scored results (the scored table is OUTPUTDATA) */
action table.fetch r = result /
    table = { caslib= 'casuser', name= 'OUTPUTDATA' };

/* extract the score and send it as a dictionary */
score  = result.Fetch[1].P_BAD;
scoreo = {score= score};
send_response(scoreo);

Key points to note:

  1. The resolver creates and prepends two CASL dictionaries, _args_ and _appEnv_ (an illustrative example follows this list).
  2. The CASL program returns the result using the send_response function.
    • One of the cool things about CASL is that it lets the programmer customize the returned value. In this example the score is extracted into a dictionary.
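For illustration only (the parameter values mirror the resolver shown later; the LOAN and VALUE amounts are hypothetical user inputs), the prepended dictionaries might look like this in CASL:

/* illustrative CASL dictionaries, as caslBase might prepend them */
_args_   = {LOAN=10000, VALUE=100000, JOB='J1', CLAGE=100, CLNO=20,
            DEBTINC=20, DELINQ=2, DEROG=0, MORTDUE=4000, NINQ=1, YOJ=10};
_appEnv_ = {astore={caslib='Public', name='GRADIENT_BOOSTING___BAD_2'}};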

Schema

Based on the requirements, the schema is shown below:

type Query {
   scoreLoan(amount: Int assets: Int): Float
}

Key Point:

  1. The two values the user specifies are defined as the filter parameters to the query.

Application

(Screenshot: the scoreLoan application form)

Key point:

  1. The user enters the two values the data scientist requires.

Client code

async function runScore(amount, assets){
    let payload = {
        query: `query {
            scoreLoan(amount: ${amount} assets: ${assets} )
        }`
    }

    let config = {
        url            : host + '/graphql',
        withCredentials: true,
        method         : 'POST',
        data           : payload
    }

    let r = await axios(config);
    return r.data.data.scoreLoan;
}

Key points:

  1. The payload is the GraphQL query.
  2. I use the POST method.
  3. The endpoint is /graphql - this is the only endpoint the application will use.
  4. The response is available as r.data.data.scoreLoan.
  5. Note the simplicity of the client code to access the GraphQL server and obtain the results.

Resolver

let caslBase = require('../lib/caslBase');

module.exports = async function scoreLoan (_, args, context) {
    let { store } = context;
    let input = {
        JOB    : 'J1',
        CLAGE  : 100, 
        CLNO   : 20, 
        DEBTINC: 20, 
        DELINQ : 2, 
        DEROG  : 0, 
        MORTDUE: 4000, 
        NINQ   : 1,
        YOJ    : 10
    };

    input.LOAN  = args.amount;
    input.VALUE = args.assets;

    let env = {
        astore: {
            caslib: 'Public',
            name  : 'GRADIENT_BOOSTING___BAD_2'
        }
    }
    let result = await caslBase(store,['argsToTable.casl', 'score.casl'], input, env);
    let score = result.items('results', 'score');
    
    return score;

}

Key points:

  1. As required, the default values for the other parameters are added to the user input.
  2. The resolver contains the location and name of the model.
  3. The names of the SAS code files are passed to caslBase - this allows the code to be read from a repository.
  4. The caslBase function calls jsonToDict to convert the JSON parameters to CASL dictionaries and passes them to CAS along with the code (a rough sketch of the idea follows this list).
  5. The user receives the resulting score.
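The repository contains the real jsonToDict implementation; as a rough sketch of the idea (an assumption, not the repository code), it serializes a JavaScript object into CASL dictionary source text that can be prepended to the program:

// hypothetical sketch: serialize a JS object into CASL dictionary source
function jsonToDict (obj, name) {
    let pairs = Object.entries(obj).map(([key, value]) => {
        if (value !== null && typeof value === 'object') {
            return `${key}=${jsonToDict(value, null)}`;   // nested dictionary
        }
        return (typeof value === 'string') ? `${key}='${value}'` : `${key}=${value}`;
    });
    let dict = `{${pairs.join(', ')}}`;
    return (name == null) ? dict : `${name} = ${dict};\n`;
}

// jsonToDict({LOAN: 10000, VALUE: 100000}, '_args_')
// => "_args_ = {LOAN=10000, VALUE=100000};"
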
Example 2: Reporting wine production to management
The TwoBit winery management wants a simple report to view the production of different wines by year. They want to be able to pick the year range and the wines in which they are interested. The goal is to query for selected wines and filter on years.

The data for the winery is listed below.

 
Obs  year  cabernet  merlot  pinot  chardonnay  twobit
  1  2000        10      20     30          40      50
  2  2001         5      10     15           5       0
  3  2002         6       7     11          12      13
  4  2003         5       8      0           0      50
  5  2004        11       5      7           8     100
  6  2005         1       1      0           0    1000
  7  2006         0       0      0           0    3000

 

SAS Code

The SAS experts at the company created the following SAS code to meet management's request. Note that for demo purposes the wine data is created inline.

data wineList;
   input year cabernet merlot pinot chardonnay twobit;
   cards;
2000 10 20 30 40 50
2001 5 10 15 5 0
2002 6 7 11 12 13
2003 5 8 0 0 50
2004 11 5 7 8 100
2005 1 1 0 0 1000
2006 0 0 0 0 3000
;
run;

/* the _selections_ macro variable is generated in the src/lib/getSelections function */
data wine;
   set wineList(where=(year GE &from and year LE &to));
   keep &_selections_;
run;

ods html style=barrettsblue;
proc print data=wine; run;
ods html close;

Key points to note:

  1. The code requires that the macro variables &from, &to and &_selections_ be set before this code executes (an illustration follows this list).
  2. The name of the returned table is wine.
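For example, if the user picks the years 2001-2004 and the cabernet and merlot wines, the resolver would set the macro variables along these lines (illustrative values only):

/* illustrative values - the resolver sets these before submitting the program */
%let from = 2001;
%let to   = 2004;
%let _selections_ = year cabernet merlot;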

Schema

type Query {
    wineProduction(from: Int, to: Int): WineProduction
}

type WineProduction {
    """
    An array containing wine production
    """
    wines: [WineList]

    """
    ODS output and log output
    """
    report: SASResults
}

type WineList {
    year      : Int
    cabernet  : Int
    merlot    : Int
    pinot     : Int
    chardonnay: Int
    twobit    : Int
}

type WineProductionCas {
    wines: [WineList]
}

type SASResults {
    """
    ODS output from the server
    """
    ods: String
    """
    Log output from the server
    """
    log: String
}

Key points:

  1. As required, the year range is specified as filters for the query.
  2. As required, the user can pick the wines in which they are interested.

Application

The application is shown below.

Client code

The relevant client code is shown below (see the repository for the full program).

let gqString = `query userQuery($from: Int, $to: Int) {
    results: wineProduction(from: $from, to: $to) {
        wines {
            ${wineList}
        }
        ${reportList}
    }
}`;

let payload = {
    url   : host + '/graphql',
    method: 'POST',
    data  : {
        query    : gqString,
        variables: {
            from: fromYear.value,
            to  : toYear.value
        }
    }
};

setReportValues(null);
setResultValues(null);
axios(payload)
    .then(r => {
        let res = r.data.data.results;
        // extracting the results is simple
        setResultValues(res.wines);
        if (res.report != null) {
            setReportValues(res.report);
        }
    })
    .catch(e => alert(e));

Key points:

  1. The GraphQL query string is sent as the payload (wineList and reportList are strings computed earlier in the program based on user selection).
  2. The endpoint is again /graphql with a POST method.
  3. This snippet also shows the preferred way to send the filter values.

Resolver

The root resolver is shown below.

let getProgram    = require('../lib/getProgram');
let getSelections = require('../lib/getSelections');
let spBase        = require('../lib/spBase');

module.exports = async function wineProduction (_, args, context, info){
    let {store} = context;

    // read source - reads in the sas program
    let src = await getProgram(store, ['wines.sas']); 

    // update args with the wine list specified by the user
    let selections = getSelections(info, 'wines', args);

   // execute the sas code with compute server and get results
    let resultSummary = await spBase(store, selections.args, src);
    
    // resultSummary is now passed to the resolvers for wines and results fields.
    return resultSummary;
}

Key points:

  1. Code from the GitHub repo uses winelist.js to resolve the list of wines.
  2. Code from sasresults.js, sasOds.js and sasLog.js returns ODS output and the SAS log.
  3. The SAS code is read from a repository using the getProgram function.
Example 3: List SAS Visual Analytics reports
Another common use case is retrieving information about reports developed with SAS Visual Analytics. The GraphQL query shown below gets the list of reports, who last modified each one, and when. This example uses the reports REST API.

Query

{
    reports {
        name
        modifiedBy
        modifiedOn
    }
}
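Running this query from a client follows the same pattern as Example 1. Here is a minimal sketch (the host variable and credential handling are carried over from that example):

// minimal sketch: run the reports query with axios
async function listReports () {
    let payload = {
        query: `{ reports { name modifiedBy modifiedOn } }`
    };
    let r = await axios({
        url            : host + '/graphql',
        withCredentials: true,
        method         : 'POST',
        data           : payload
    });
    return r.data.data.reports;
}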

Creating a UI for this is a challenge exercise for the reader (meaning I did not get around to writing it 🙂 ). The returned results look something like this:

{
    "data": {
        "reports": [
            {
                "name": "Application Activity",
                "modifiedBy": "SAS Supplied",
                "modifiedOn": "2018-04-20T14:24:05.258Z"
            },
            {
                "name": "CAS Activity",
                "modifiedBy": "SAS Supplied",
                "modifiedOn": "2018-06-08T20:21:14.727Z"
            }
...

Resolver

module.exports = async function reports (_, args, context) {
    let { store } = context;
    let reports = store.getService('reports');
    let list = await getList(store, reports);
    return list;
}

async function getList (store, reports) {
    let reportsList = await store.apiCall(reports.links('reports'));
    if (reportsList.itemsList().size === 0) {
        return [];
    }
    let r = reportsList.itemsList().map(name => {
        let t = {
            name      : name,
            modifiedBy: reportsList.items(name, 'data', 'modifiedBy'),
            modifiedOn: reportsList.items(name, 'data', 'modifiedTimeStamp')
        };
        return t;
    });
    return r;
}

Example 4: Getting the URL and image of a specific report
The query below obtains the URL used to display the interactive report, as well as the SVG image of a specific report.

Query

{
    report(name: "Application Activity") {
        url
        image
    }
}

The returned value will be along these lines:

{
    "data": {
        "report": {
            "url": "http://superuser.com/?reportUri=/reports/reports/ecec39ad-994f-4055-8e40-4360f410bc6e...",
            "image": "{the svg of the image}"
        }
    }
}

Resolver

There are three resolvers associated with this query: the root resolver and the resolvers for the image and url fields. For the sake of brevity, I will not review those here; please visit the code in the repository.

In conclusion

The examples above cover some basic scenarios for SAS Viya applications.

  1. Using CAS actions
  2. Using traditional data step and procs
  3. Obtaining ODS output
  4. Working with SAS Visual Analytics

The simplicity of the client code and the resolvers is what makes GraphQL attractive for writing SAS Viya applications. You can also exploit other features in SAS Viya using the same pattern. Further, you can use the examples in this repository to easily customize your own use cases. The resolvers and helper functions are written to be reusable with minimal effort. The instructions are in the README file in the repository. If you create interesting schemas and resolvers for SAS Viya, please share them with the SAS user community.

Opinion

Like all new technologies, GraphQL has its proponents and detractors. Many people also get caught up in low-value arguments about whether GraphQL is better or worse than REST. I personally do not follow these discussions, since you should use the best tool for the job.

I find GraphQL most attractive when developing a back-end for SAS Viya applications. Both front-end and back-end developers will benefit from the clear definition of the schema. Well-supported GraphQL servers from Apollo and Facebook make it easier to adopt GraphQL.

Useful links

There are a growing number of resources from which to learn and model. Below is a small starter list.

  1. graphql.org
  2. Apollo
  3. Relay
  4. GraphQL Concepts Visualized by Dhaivat Pandya
  5. GraphQL tutorial from TutorialsPoint
  6. How to GraphQL

GraphQL and SAS Viya applications - a good match was published on SAS Users.

May 17, 2019
 

Competition in customer experience management has never been as challenging as it is now. Customers spend more money in aggregate, but less per brand. The average size of a single purchase has decreased, partly because competitive offers are just one click away. Predicting offer relevance to potential (and existing) customers plays a [...]

SAS Customer Intelligence 360: Factorization machines, visual analytics, and personalized marketing [Part 2] was published on Customer Intelligence Blog.

May 15, 2019
 

Have you ever run a statistical test to determine whether data are normally distributed? If so, you have probably used Kolmogorov's D statistic. Kolmogorov's D statistic (also called the Kolmogorov-Smirnov statistic) enables you to test whether the empirical distribution of data is different than a reference distribution. The reference distribution can be a probability distribution or the empirical distribution of a second sample. Most statistical software packages provide built-in methods for computing the D statistic and using it in hypothesis tests. This article discusses the geometric meaning of the Kolmogorov D statistic, how you can compute it in SAS, and how you can compute it "by hand."

This article focuses on the case where the reference distribution is a continuous probability distribution, such as a normal, lognormal, or gamma distribution. The D statistic can help you determine whether a sample of data appears to be from the reference distribution. Throughout this article, the word "distribution" refers to the cumulative distribution.

What is the Kolmogorov D statistic?

The geometry of Kolmogorov's D statistic, which measures the maximum vertical distance between an empirical distribution and a reference distribution.

The letter "D" stands for "distance." Geometrically, D measures the maximum vertical distance between the empirical cumulative distribution function (ECDF) of the sample and the cumulative distribution function (CDF) of the reference distribution. As shown in the adjacent image, you can split the computation of D into two parts:

  • When the reference distribution is above the ECDF, compute the statistic D- as the maximal difference between the reference distribution and the ECDF. Because the reference distribution is monotonically increasing and because the empirical distribution is a piecewise-constant step function that changes at the data values, the maximal value of D- will always occur at a right-hand endpoint of the ECDF.
  • When the reference distribution is below the ECDF, compute the statistic D+ as the maximal difference between the ECDF and the reference distribution. The maximal value of D+ will always occur at a left-hand endpoint of the ECDF.

The D statistic is simply the maximum of D- and D+.

You can compare the statistic D to critical values of the D distribution, which appear in tables. If the statistic is greater than the critical value, you reject the null hypothesis that the sample came from the reference distribution.
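For large samples there is a well-known asymptotic approximation to the critical value, D_crit ≈ sqrt( -0.5 ln(α/2) ) / sqrt(N). A quick DATA step computes it (for N=30 and α=0.05 the approximation gives 0.248, close to the exact tabulated value of 0.2417 used later in this article):

/* asymptotic approximation to the critical value of the D distribution */
data _null_;
   N = 30;  alpha = 0.05;
   Dcrit = sqrt( -0.5*log(alpha/2) ) / sqrt(N);   /* 0.248 for N=30, alpha=0.05 */
   put Dcrit=;
run;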

How to compute Kolmogorov's D statistic in SAS?

In SAS, you can use the HISTOGRAM statement in PROC UNIVARIATE to compute Kolmogorov's D statistic for many continuous reference distributions, such as the normal, beta, or gamma distributions. For example, the following SAS statements simulate 30 observations from a N(59, 5) distribution and compute the D statistic as the maximal distance between the ECDF of the data and the CDF of the N(59, 5) distribution:

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
 
/* simulate a random sample of size N from the reference distribution */
data Test;
call streaminit(73);
do i = 1 to &N;
   x = rand("Normal", &mu, &sigma);
   output;
end;
drop i;
run;
 
proc univariate data=Test;
   ods select Moments CDFPlot GoodnessOfFit;
   histogram x / normal(mu=&mu sigma=&sigma) vscale=proportion;  /* compute Kolmogorov D statistic (and others) */
   cdfplot x / normal(mu=&mu sigma=&sigma) vscale=proportion;    /* plot ECDF and reference distribution */
   ods output CDFPlot=ECDF(drop=VarName CDFPlot);                /* for later use, save values of ECDF */
run;

The computation shows that D = 0.131 for this simulated data. The plot at the top of this article shows that the maximum occurs for a datum that has the value 54.75. For that observation, the ECDF is farther away from the normal CDF than at any other location.

The critical value of the D distribution when N=30 and α=0.05 is 0.2417. Since D < 0.2417, you should not reject the null hypothesis. It is reasonable to conclude that the sample comes from the N(59, 5) distribution.

Although the computation is not discussed in this article, you can use PROC NPAR1WAY to compute the statistic when you have two samples and want to determine if they are from the same distribution. In this two-sample test, the geometry is the same, but the computation is slightly different because the reference distribution is itself an ECDF, which is a step function.
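For instance, a two-sample test looks something like the following sketch, where the EDF option requests the empirical-distribution-function statistics (including the two-sample Kolmogorov-Smirnov D) and the hypothetical Group variable distinguishes the two samples:

/* sketch: two-sample Kolmogorov-Smirnov test; data set and variable names are hypothetical */
proc npar1way data=Both edf;
   class Group;
   var x;
run;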

Compute Kolmogorov's D statistic manually

Although PROC UNIVARIATE in SAS computes the D statistic (and other goodness-of-fit statistics) automatically, it is not difficult to compute the statistic from first principles in a vector language such as SAS/IML.

The key is to recall that the ECDF is a piecewise-constant function that changes heights at the values of the observations. If you sort the data, the height at the i-th sorted observation is i/N, where N is the sample size. The height of the ECDF an infinitesimal distance before the i-th sorted observation is (i-1)/N. These facts enable you to compute D- and D+ efficiently.

The following algorithm computes the D statistic:

  1. Sort the data in increasing order.
  2. Evaluate the reference distribution at the data values.
  3. At each data value, compute the differences between the reference distribution and the ECDF (the D- and D+ candidates).
  4. Compute D as the maximum value of those differences.

The following SAS/IML program implements the algorithm:

/* Compute the Kolmogorov D statistic manually */
proc iml;
use Test;  read all var "x";  close;
N = nrow(x);                         /* sample size */
call sort(x);                        /* sort data */
F = cdf("Normal", x, &mu, &sigma);   /* CDF of reference distribution */
i = T( 1:N );                        /* ranks */
Dminus = F - (i-1)/N;
Dplus = i/N - F;
D = max(Dminus, Dplus);              /* Kolmogorov's D statistic */
print D;

The D statistic matches the computation from PROC UNIVARIATE.

The SAS/IML implementation is very compact because it is vectorized. By computing the statistic "by hand," you can perform additional computations. For example, you can find the observation for which the ECDF and the reference distribution are farthest away. The following statements find the index of the maximum for the Dminus and Dplus vectors. You can use that information to find the value of the observation at which the maximum occurs, as well as the heights of the ECDF and reference distribution. You can use these values to create the plot at the top of this article, which shows the geometry of the Kolmogorov D statistic:

/* compute locations of the maximum D+ and D-, then plot with a highlow plot */
i1 = Dminus[<:>];  i2 = Dplus [<:>];              /* indices of maxima */
/* Compute location of max, value of max, upper and lower curve values */
x1=x[i1];  v1=DMinus[i1];  Low1=(i[i1]-1)/N;  High1=F[i1]; 
x2=x[i2];  v2=Dplus [i2];  Low2=F[i2];        High2=i[i2]/N; 
Result = (x1 || v1 || Low1 || High1) // (x2 || v2 || Low2 || High2);
print Result[c={'x' 'Value' 'Low' 'High'} r={'D-' 'D+'} L='Kolmogorov D'];

The observations that maximize the D- and D+ statistics are x=54.75 and x=61.86, respectively. The value of D- is the larger of the two, so that is the value of Kolmogorov's D.

For completeness, the following statements show how to create the graph at the top of this article. The HIGHLOW statement in PROC SGPLOT is used to plot the vertical line segments that represent the D- and D+ statistics.

create KolmogorovD var {x x1 Low1 High1 x2 Low2 High2}; append; close;
quit;
 
data ALL;
   set KolmogorovD ECDF;   /* combine the ECDF, reference curve, and the D- and D+ line segments */
run;
 
title "Kolmogorov's D Statistic";
proc sgplot data=All;
   label CDFy = "Reference CDF" ECDFy="ECDF";
   xaxis grid label="x";
   yaxis grid label="Cumulative Probability" offsetmin=0.08;
   fringe x;
   series x=CDFx Y=CDFy / lineattrs=GraphReference(thickness=2) name="F";
   step x=ECDFx Y=ECDFy / lineattrs=GraphData1 name="ECDF";
   highlow x=x1 low=Low1 high=High1 / 
           lineattrs=GraphData2(thickness=4) name="Dm" legendlabel="D-";
   highlow x=x2 low=Low2 high=High2 / 
           lineattrs=GraphData3(thickness=4) name="Dp" legendlabel="D+";
   keylegend "F" "ECDF" "Dm" "Dp";
run;

Concluding thoughts

Why would anyone want to compute the D statistic by hand if PROC UNIVARIATE can compute it automatically? There are several reasons:

  • Understanding: When you compute a statistic manually, you are forced to understand what the statistic means and how it works.
  • Efficiency: When you run PROC UNIVARIATE, it computes many statistics (mean, standard deviation, skewness, kurtosis,...), many goodness-of-fit tests (Kolmogorov-Smirnov, Anderson-Darling, and Cramér–von Mises), a histogram, and more. That's a lot of work! If you are interested only in the D statistic, why needlessly compute the others?
  • Generalizability: You can compute quantities such as the location of the maximum value of D and D+. You can use the knowledge and experience you've gained to compute related statistics.

The post What is Kolmogorov's D statistic? appeared first on The DO Loop.

May 14, 2019
 

Interested in making business decisions with big data analytics? Our Wiley SAS Business Series book Profit Driven Business Analytics: A Practitioner’s Guide to Transforming Big Data into Added Value by Bart Baesens, Wouter Verbeke, and Cristian Danilo Bravo Roman has just the information you need to learn how to use SAS to make data and analytics decision-making a part of your core business model!

This book combines the authorial team’s worldwide consulting experience and high-quality research to open up a road map to handling data, optimizing data analytics for specific companies, and continuously evaluating and improving the entire process.

In the following excerpt from their book, the authors describe a value-centric strategy for using analytics to heighten the accuracy of your enterprise decisions:

“'Data is the new oil' is a popular quote pinpointing the increasing value of data and — to our liking — accurately characterizes data as raw material. Data are to be seen as an input or basic resource needing further processing before actually being of use.”

Analytics process model

In our book, we introduce the analytics process model that describes the iterative chain of processing steps involved in turning data into information or decisions, which is quite similar actually to an oil refinery process. Note the subtle but significant difference between the words data and information in the sentence above. Whereas data fundamentally can be defined to be a sequence of zeroes and ones, information essentially is the same but implies in addition a certain utility or value to the end user or recipient.

So, whether data are information depends on whether the data have utility to the recipient. Typically, for raw data to be information, the data first need to be processed, aggregated, summarized, and compared. In summary, data typically need to be analyzed, and insight, understanding, or knowledge should be added for data to become useful.

Applying basic operations on a dataset may already provide useful insight and support the end user or recipient in decision making. These basic operations mainly involve selection and aggregation. Both selection and aggregation may be performed in many ways, leading to a plentitude of indicators or statistics that can be distilled from raw data. Providing insight by customized reporting is exactly what the field of business intelligence (BI) is about.

Business intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to — and analysis of — information to improve and optimize decisions and performance.

This model defines the subsequent steps in the development, implementation, and operation of analytics within an organization.

    Step 1
    As a first step, a thorough definition of the business problem to be addressed is needed. The objective of applying analytics needs to be unambiguously defined. Some examples are: customer segmentation of a mortgage portfolio, retention modeling for a postpaid Telco subscription, or fraud detection for credit cards. Defining the perimeter of the analytical modeling exercise requires a close collaboration between the data scientists and business experts. Both parties need to agree on a set of key concepts; these may include how we define a customer, transaction, churn, or fraud. Whereas this may seem self-evident, it appears to be a crucial success factor to make sure a common understanding of the goal and some key concepts is agreed on by all involved stakeholders.

    Step 2
    Next, all source data that could be of potential interest need to be identified. The golden rule here is: the more data, the better! The analytical model itself will later decide which data are relevant and which are not for the task at hand. All data will then be gathered and consolidated in a staging area which could be, for example, a data warehouse, data mart, or even a simple spreadsheet file. Some basic exploratory data analysis can then be considered using, for instance, OLAP facilities for multidimensional analysis (e.g., roll-up, drill down, slicing and dicing).

    Step 3
    After we move to the analytics step, an analytical model will be estimated on the preprocessed and transformed data. Depending on the business objective and the exact task at hand, a particular analytical technique will be selected and implemented by the data scientist.

    Step 4
    Finally, once the results are obtained, they will be interpreted and evaluated by the business experts. Results may be clusters, rules, patterns, or relations, among others, all of which will be called analytical models resulting from applying analytics. Trivial patterns that may be detected by the analytical model (e.g., an association rule stating that spaghetti and spaghetti sauce are often purchased together) are interesting as they help to validate the model. But of course, the key issue is to find the unknown yet interesting and actionable patterns (sometimes also referred to as knowledge diamonds) that can provide new insights into your data that can then be translated into new profit opportunities!

    Step 5
    Once the analytical model has been appropriately validated and approved, it can be put into production as an analytics application (e.g., decision support system, scoring engine). Important considerations here are how to represent the model output in a user-friendly way, how to integrate it with other applications (e.g., marketing campaign management tools, risk engines), and how to make sure the analytical model can be appropriately monitored and back-tested on an ongoing basis.

Book giveaway!

If you are as excited about business analytics as we are and want a copy of Bart Baesens’ book Profit Driven Business Analytics: A Practitioner’s Guide to Transforming Big Data into Added Value, enter to win a free copy in our book giveaway today! The first 5 commenters to correctly answer the question below get a free copy of Baesens’ book! Winners will be contacted via email.

Here's the question:
What Free SAS Press e-book did Bart Baesens write the foreword to?

We look forward to your answers!

Further resources

Want to prove your business analytics skills to the world? Check out our Statistical Business Analyst Using SAS 9 certification guide by Joni Shreve and Donna Dea Holland! This certification is designed for SAS professionals who use SAS/STAT software to conduct and interpret complex statistical data analysis.

For more information about the certification and certification prep guide, watch this video from co-author Joni Shreve on their SAS Certification Prep Guide: Statistical Business Analysis Using SAS 9.

Big data in business analytics: Talking about the analytics process model was published on SAS Users.

May 13, 2019
 

In SAS/IML programs, a common task is to write values in a matrix to a SAS data set. For some programs, the values you want to write are in a matrix and you use the CREATE FROM/APPEND FROM syntax to create the data set, as follows:

proc iml;
X = {1  2  3, 
     4  5  6, 
     7  8  9, 
    10 11 12};
create MyData from X[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from X;                                /* write all rows of X */
close;                                        /* close the data set */

In other programs, the results are computed inside an iterative DO loop. If you can figure out how many observations are generated inside the loop, it is smart to allocate room for the results prior to the loop, assign the rows inside the loop, and then write to a data set after the loop.

However, sometimes you do not know in advance how many results will be generated inside a loop. Examples include certain kinds of simulations and algorithms that iterate until convergence. An example is shown in the following program. Each iteration of the loop generates a different number of rows, which are appended to the Z matrix. If you do not know in advance how many rows Z will eventually contain, you cannot allocate the Z matrix prior to the loop. Instead, a common technique is to use vertical concatenation to append each new result to the previous results, as follows:

/* sometimes it is hard to determine in advance how many rows are in the final result */
free Z;
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Y = j(k , 3, n);         /* k x 3 matrix */
   Z = Z // Y;              /* vertical concatenation of results */
end;
create MyData2 from Z[colname={'A' 'B' 'C'}];  /* create data set and variables */
append from Z;                                 /* write all rows */
close;                                         /* close the data set */

Concatenation within a loop tends to be inefficient. As I like to say, "friends don't let friends concatenate results inside a loop!"

If your ultimate goal is to write the observations to a data set, you can write each sub-result to the data set from inside the DO loop! The APPEND FROM statement writes whatever data are in the specified matrix, and you can call the APPEND FROM statement multiple times. Each call will write the contents of the matrix to the open data set. You can update the matrix or even change the number of rows in the matrix. For example, the following program opens the data set prior to the DO loop, appends to the data set multiple times (each time with a different number of rows), and then closes the data set after the loop ends.

/* alternative: create data set, write to it during the loop, then close it */
Z = {. . .};                /* tell CREATE stmt that data will contain three numerical variables */
create MyData3 from Z[colname={'A' 'B' 'C'}];   /* open before the loop. The TYPE of the variables are known. */
do n = 1 to 4;
   k = n + floor(n/2);      /* number of rows */
   Z = j(k , 3, n);         /* k x 3 matrix */
   append from Z;           /* write each block of data */
end;
close;                      /* close the data set */

The contents of the MyData3 data set are identical to the MyData2 data set.

Notice that the CREATE statement must know the number and type (numerical or character) of the data set variables so that it can set up the data set for writing. If you are writing character variables, you also need to specify the length of the variables. I typically use missing values to tell the CREATE statement the number and type of the variables. These values are not written to the data set. It is the APPEND statement that writes data.
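For example, here is a minimal sketch (the data set and variable names are hypothetical) that uses the CREATE ... VAR syntax to write one character and two numeric variables from inside a loop. The eight-blank string determines the length of the character variable:

proc iml;
c  = "        ";                 /* 8 blanks: a character variable of length 8 */
x1 = .;  x2 = .;                 /* two numeric variables */
create MyData4 var {c x1 x2};    /* types and lengths are known before any APPEND */
do n = 1 to 3;
   c  = concat("row", strip(char(n)));
   x1 = n;  x2 = n##2;
   append;                       /* writes the current values of c, x1, and x2 */
end;
close MyData4;
quit;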

I previously wrote about this technique in the article "Writing data in chunks," which was focused on writing a large data set that might not fit into RAM. However, the same technique is useful for writing data when the total number of rows is not known until run time. I also use it when running simulations that generate multivariate data. This technique provides a way to write data from inside a DO loop and to avoid concatenating matrices within the loop.

The post Write to a SAS data set from inside a SAS/IML loop appeared first on The DO Loop.

May 10, 2019
 

May 12th is #NationalLimerickDay! If you saw our Valentine’s Day poem, you know we at SAS Press love creating poems and fun rhymes, so check out our limericks below!

So, what’s a limerick?

National Limerick Day is observed each year on May 12th and honors the birthday of the famed English artist, illustrator, author and poet Edward Lear (May 12, 1812 – Jan. 29, 1888). Lear’s poetry is most famous for its nonsense or absurdity, and mostly consists of prose and limericks.

His book, “Book of Nonsense,” published in 1846, popularized the limerick poem.

A limerick poem has five lines and is often very short, humorous, and full of nonsense. To create a limerick the first two lines must rhyme with the fifth line, and the third and fourth lines rhyme together. The limerick’s rhythm is officially described as anapestic meter.

To celebrate, we want to ask all lovers of SAS books to enjoy the limericks written by us and to see if you can create your own! Can you top our limericks on our love for SAS Books? Check out our handy how-to limerick links below.

Our limericks

There once was a software named SAS
helping tons of analysts complete tasks.
a Text Analytics book to extract meaning as data flies by
and a Portfolio and Investment Analysis book so you’ll never go awry.
You know our SAS books are first-class!

We enjoyed meeting our awesome users at SAS Global Forum
who enjoy our books with true decorum.
a SAS Administration book on building from the ground up
and a new book about PROC SQL you need to pick-up.
Check out our SAS books today, you’ll adore ‘em!

For more about SAS Books and some more of our SAS Press fun, subscribe to our newsletter. You’ll get all the latest news and exclusive newsletter discounts. Also check out all our new SAS books at our online bookstore.

Resources:
Wiki-How: How to Write A Limerick
Limerick Generator: Create a Limerick in Seconds

Happy National Limerick Day from SAS Press! was published on SAS Users.

May 8, 2019
 

Legendary unicorn surrounded by bubbles representing zeros

To those of you who have not read my previous post, Dividing by zero with SAS, it's not too late to go back and catch up. You missed a lot of fun, deep thought and an opportunity to solve an unusual SAS coding challenge.

For those who have already read it, let’s get serious for a second.

I have found a lot of misconceptions and misunderstandings percolating among SAS online communities and discussion groups. The goal of this post is to dispel all those fallacies and delusions, also known in civilized societies as myths.

Myth

When SAS encounters division by zero during program execution it generates an ERROR message in the SAS log and stops. (A more histrionic version of this myth is, “Your computer disconnects from the Internet and burns down.”)

Reality

SAS does not generate an ERROR message in the SAS log and does not stop executing the current step and subsequent steps in the event of division by zero.

Myth

When SAS encounters division by zero during program execution it generates a WARNING message in the SAS log and continues execution.

Reality

Consider yourself officially warned that SAS does not generate even a WARNING message in the SAS log in the event of division by zero.

When SAS encounters division by zero during SAS data step execution it does the following:

1. For each occurrence, it generates a NOTE in the SAS log:

NOTE: Division by zero detected at line XXX column XX.

2. For each occurrence, it sets the result of division by zero to a missing value and dumps the Program Data Vector (PDV) including data step variables and automatic variables (_N_ and _ERROR_) to the SAS log, e.g.

a=0 b=0 c=. _ERROR_=1 _N_=1

3. Finally, it generates a summary NOTE to the SAS log, e.g.

NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
3 at 244:8

You can run the following SAS code snippet to confirm reality over myth for yourself:

data a;
   input n d;
   datalines;
2  0
-2 0
0  0
;
 
data b;
   set a;
   c = n/d;
run;

Myth

SAS programmers don’t care what messages SAS generates in the SAS log about division by zero and just ignore them.

Reality

Good programming practices and good SAS programmers do care about the possibility of dividing by zero and try to eliminate those messages by taking control over the situation. For example, instead of blindly dividing by a variable that can potentially have zero values, you can write the following “clean” code:

 
if d ne 0
   then r = n/d;
   else r = .;

In this case the division by zero event will never happen during program execution, and you will not receive any nastygrams in the SAS log, but still the result will be the same – a missing value.

Myth

To avoid those “Division by zero detected” NOTEs, one can use ifn() function as in the following example:

r = ifn(d=0, ., n/d);

The assumption is that SAS will first evaluate the first argument (logical expression d=0); if it is true, then return the second argument (missing value); if false, then evaluate and return the third argument (n/d).

Reality

Despite the ifn() function being one of my favorites, the reality is that it works somewhat differently than the above assumption – it evaluates all its arguments before deciding which argument to return. If d does in fact equal 0, evaluating the third argument, n/d, will trigger an attempt to divide by 0, resulting in the “Division by zero detected” NOTE and the PDV dump in the SAS log; that disqualifies this function from being a graceful handler of division by zero events.

Myth

SAS does not have an effective solution for graceful handling of division by zero events; therefore, SAS programmers are compelled to write additional special programming logic.

Reality

SAS doesn’t just have an effective solution to gracefully handle division by zero events. It has a perfect solution! That perfect solution is even conveniently named the divide() function.

The divide(n,d) function has two arguments, each is either a numeric constant, variable, or expression. It divides the first argument by the second argument and returns a non-missing quotient in cases when none of the arguments is missing and the second argument is not zero. Otherwise, it returns missing value. No ugly “Division by zero detected” NOTES in the SAS log, just what we want.

So, instead of using:

r = n/d;

you would just use

r = divide(n,d);

Moreover, it gives you extended functionality by providing additional information about its arguments' composition. In the case of a zero divisor, it returns three different types of missing values. If the dividend (the first argument) is positive, it returns the special missing value .I (I for infinity); if the dividend is negative, it returns the special missing value .M (M for minus infinity); if the dividend is zero, it returns . as an ordinary missing value.
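The following snippet (mine, not from the original post) demonstrates all three cases:

/* r is .I (infinity), .M (minus infinity), and . (ordinary missing), respectively */
data demo;
   input n d;
   r = divide(n, d);   /* no "Division by zero detected" NOTEs in the log */
   datalines;
2  0
-2 0
0  0
;
run;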

The icing on the cake

SAS’ missing() function equates all those ordinary and special missing values:

missing(.) = missing(.I) = missing(.M) = 1.

If for some reason you don't like having different missing values in your result variable X after using the X=divide(n,d) function, you can easily recode them all to the generic missing value:

if missing(X) then X=.;

Your Comments

How do you prefer handling division by zero? Please share with us.

Dividing by zero with SAS - myths and realities was published on SAS Users.