March 28, 2019
 

Over the past couple of years, I've written about a variety of use cases and value props regarding SAS® Customer Intelligence 360. As powerful as words and images can be, let's transition to the ultimate show – the demo. In the forty-minute video below, observe how SAS Customer Intelligence [...]

SAS Customer Intelligence 360 Meets SAS Viya: Show me the demo was published on Customer Intelligence Blog.

March 27, 2019
 

SAS Visual Analytics supports region maps for countries, US states, and provinces out of the box.  These work well for small-scale maps covering the world, a continent, or a single country.  However, other regions are often needed.  Beginning in version 8.3, VA supports custom polygons to display regions such as sales territories, counties, or zip codes.

Region (choropleth) maps use a fill color to show relationships between the regions based upon a response value from your data.  Using custom polygons in VA follows the same steps outlined in previous posts for predefined or custom coordinate geography items, with just a few additional steps.  Here’s the basic flow:

  • Identify your data
  • Import polygon shapefile into SAS dataset
  • Import the shape dataset into VA
  • Create a Custom polygon provider
  • Create the geography item
  • Create and customize the map

Before we begin

VA supports two sources for creating custom polygons: Esri shapefiles and Esri Feature Services.  The goal for this post is to show how to create custom polygons using an Esri shapefile.

Typically, when working with custom polygons, you will have two datasets: the first defines the custom regions (shape data) and the second contains the data you wish to map (business data).  The shape data is derived from an Esri shapefile or feature service.  The business data can be in a shapefile or any format supported by VA (.sas7bdat, .csv, .xls, etc.). It contains the information you want to analyze distributed across the regions defined by the shape data.

It is recommended that you verify the imported shape data before using it in your final map.  This confirms the data is valid and makes it easier to debug any errors you encounter.  To verify, use the same dataset for both the shape and business data.  The example below will use this approach.

Access to a GIS application such as Esri’s ArcGIS or QGIS is recommended.  There are two areas where they can help you prepare to use custom polygons in your VA map:

  • Creating a shapefile to define polygons specific to your business need or application
  • Viewing the attribute table of an existing shapefile to determine its unique identifier column

For this example, we will be creating a map of registered Neighborhood Associations in Boise, Idaho. To follow along, download the data from the City of Boise open data site: Boise Neighborhood Associations

1. Identify your data

Shape data

The shape data defining the custom regions needs to be in an Esri shapefile format. These files can be created in a GIS application or obtained from a wide variety of online sources such as: the US Census Bureau (http://www.census.gov); local and state municipalities; state agencies such as the Department of Transportation; and university GIS departments.  Most municipalities now have Open Data portals that provide a wealth of reliable data for public use.  These sources are maintained by dedicated staff and are updated regularly.

Business data

The business data can be specific to your company’s operation or customer base.  Or it can be broad and general, using census or demographic information.  It answers the question of what you want to analyze on the map.  The business data must contain a column that aligns with your shape data.  For example: if you want to map the age distribution and spending habits of your target customers across zip codes, then your business data must have a column for zip codes that allows it to be joined to a zip code region in the shape data.

2. Import polygon data into a SAS dataset

VA 8.3 does not support the native shapefile format. To use a shapefile in VA, you must first import it into SAS.  Included with SAS Viya 3.4, the %shpimprt macro converts a shapefile into a SAS dataset and loads it into CAS.  You can find the documentation for it here: %shpimprt documentation.

Alternatively, the shapefile can be manually imported with these basic steps:

  • Import the shapefile into SAS
  • Add a sequence column to the dataset
  • Reduce the density of the dataset
  • Limit the dataset based on the density value

Additional details and sample code for each of these steps can be found in the text file linked here: Manual shapefile import steps.
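For orientation, the sketch below outlines these four steps in Base SAS. Treat it as a hypothetical outline rather than the linked sample code: the shapefile path, the density cutoff of 1, and the casuser caslib are assumptions you would adjust for your environment, and it presumes an active CAS session.

/* Hypothetical sketch of the manual import; adjust paths, names, and cutoff */
proc mapimport datafile="/data/boise/Boise_Neighborhoods.shp"   /* assumed path */
               out=work.boise;
run;

data work.boise;                          /* add a sequence column for VA */
   set work.boise;
   sequence = _n_;
run;

proc greduce data=work.boise out=work.boise_reduced;   /* compute DENSITY values */
   id OBJECTID;                           /* unique identifier column */
run;

data work.boise_final;                    /* limit the dataset by density */
   set work.boise_reduced;
   if density <= 1;                       /* assumed cutoff; tune for your map */
run;

proc casutil;                             /* load the shape table into CAS */
   load data=work.boise_final outcaslib="casuser" casout="Boise_Neighborhoods";
quit;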

3. Import the shape dataset into VA

Next, if you are using the manual shapefile import process, you must import the dataset into VA.  To do this, locate the data pane on the left of VA.  From the ‘Open Data Source’ window, select Import > Local File.  Navigate to the location of the SAS dataset created in Step 2 and click the Open button.

Adjust the target location as needed, based on your VA installation, and make note of the location selected.  This path will be required to configure the custom polygon provider. Review and adjust the other options as needed.  Click the blue ‘Import Item’ button at the top of the window to start the import process.  A message will appear indicating the import status. Upon successful import, click the 'OK' button to open the dataset.

Since we are using the same dataset for the shape and business data, we need to make a copy of the category variable that will be used for our map. Right click on ‘ASSOCIATIO’ and select ‘Duplicate’.  Next, let’s change the names of both variables to better distinguish them from one another:

  • Change ‘ASSOCIATIO’ to ‘Business data’
  • Change ‘ASSOCIATIO (1)’ to ‘Shape data’

4. Create the geography item

We are now ready to start creating the geography item.  With Custom polygons, an additional step is required beyond what was described in previous posts with predefined and custom coordinates geography items.  We must define a Custom Polygon provider so VA knows how to locate and display the Boise Neighborhood Associations.  This is needed only once and is part of the geography item setup you are familiar with.

Our goal is to map the regions of the Boise Neighborhood Associations, so we will use ‘Shape data’ for our geography item.  Locate it in the VA data panel and change its Classification type to ‘Geography’.  From the ‘Geography data type’ dropdown, select ‘Custom polygonal shapes’. Several new fields will be displayed.  In the ‘Custom polygon provider’ dropdown, click the ‘Define new polygon provider’ button.

A ‘New Polygon Provider’ window will appear.  All fields shown are required.  The Advanced section has additional options, but they are not needed for this example.

Configure the fields based on the following:

  • Name / Label – Enter ‘Boise Neighborhoods’ for both (these values do not have to be the same)
  • Type – The default CAS Table is the correct option for this example.
  • Server / Library – These values must match those used for the data upload in Step 3.
  • Table – Select the name of the table uploaded in Step 3 (Boise_Neighborhoods)
  • ID Column – The unique identifier column of the dataset. Used to join the shape and business data together. (Select OBJECTID)
  • Sequence Column – This column is created during the import process from Step 2. Needed by VA to display the custom regions. (Select SEQUENCE)

The custom polygon provider is now configured.  All that is needed to finish the geography item setup is to identify the Region ID.  This is the crucial step that joins the shape data to the business data.  The Region ID column must match the ID Column chosen when the custom polygon provider was set up.  Since we are using the same dataset in this example, that value is the same: OBJECTID.

In cases where different datasets are used for the shape and business data, the name of Region ID and ID Column may be different.  The column labels are not important, but their content must match for the join to occur.

Notice that once you select the correct Region ID value, the preview window will display the custom regions from the imported shape data.  The Latitude and Longitude columns are not required in this example.  Click the ‘OK’ button to finish the setup.

5. Create and customize the map

You are now ready to create your map.  Drag the Boise Neighborhoods geography item to the report canvas.  Let’s enhance the appearance of our map by making a few style changes:

  • Set a Color role to shade the Neighborhood Association regions (Roles > Color > Business data)
  • Position the legend on the left of the map (Options > Legend)
  • Adjust the transparency of the fill color to 45% (Options > Map Transparency)
  • Change the map service to Esri World Street Map (Options > Map service)

Final map with custom polygons.

Congratulations!  You have just created your first custom region map.  In this post we discussed how to use the Custom Polygon provider to define your own regions using an Esri shapefile.  Compared to the Predefined and Custom Coordinate options, custom polygons give you additional flexibility and control over how your spatial data is analyzed.

Creating custom region maps with SAS Visual Analytics was published on SAS Users.

March 27, 2019
 

Note: This is the first in a series of profiles of leaders in today’s utility analytics marketplace. “My job is to make people uncomfortable with the way they do things today. Transformation implies a revolution, not small incremental changes, and today the big revolution in any industry is in data," [...]

Utility analytics leaders at the edge: Meet an industry unicorn was published on SAS Voices by Mike F. Smith

March 27, 2019
 

In simulation studies, sometimes you need to simulate outliers. For example, in a simulation study of regression techniques, you might want to generate outliers in the explanatory variables to see how the technique handles high-leverage points. This article shows how to generate outliers in multivariate normal data that are a specified distance from the center of the distribution. In particular, you can use this technique to generate "regular outliers" or "extreme outliers."

When you simulate data, you know the data-generating distribution. In general, an outlier is an observation that has a small probability of being randomly generated. Consequently, outliers are usually generated by using a different distribution (called the contaminating distribution) or by using knowledge of the data-generating distribution to construct improbable data values.

The geometry of multivariate normal data

A canonical example of adding an improbable value is to add an outlier to normal data by creating a data point that is k standard deviations from the mean. (For example, k = 5, 6, or 10.) For multivariate data, an outlier does not always have extreme values in all coordinates. However, the idea of an outlier as "far from the center" does generalize to higher dimensions. The following technique generates outliers for multivariate normal data:

  1. Generate uncorrelated, standardized, normal data.
  2. Generate outliers by adding points whose distance to the origin is k units. Because you are using standardized coordinates, one unit equals one standard deviation.
  3. Use a linear transformation (called the Cholesky transformation) to produce multivariate normal data with a desired correlation structure.

In short, you use the Cholesky transformation to transform standardized uncorrelated data into scaled correlated data. The remarkable thing is that you can specify the covariance and then directly compute the Cholesky transformation that will result in that covariance. This is a special property of the multivariate normal distribution. For a discussion about how to perform similar computations for other multivariate distributions, see Chapter 9 in Simulating Data with SAS (Wicklin 2013).

Let's illustrate this method by using a two-dimensional example. The following SAS/IML program generates 200 standardized uncorrelated data points. In the standardized coordinate system, the Euclidean distance and the Mahalanobis distance are the same. It is therefore easy to generate an outlier: you can generate any point whose distance to the origin is larger than some cutoff value. The following program generates outliers that are 5 units from the origin:

ods graphics / width=400px height=400px attrpriority=none;
proc iml;
/* generate standardized uncorrelated data */
call randseed(123);
N = 200;
x = randfun(N//2, "Normal"); /* 2-dimensional data. X ~ MVN(0, I(2)) */
 
/* covariance matrix is product of diagonal (scaling) and correlation matrices.
   See https://blogs.sas.com/content/iml/2010/12/10/converting-between-correlation-and-covariance-matrices.html
 */
R = {1   0.9,             /* correlation matrix */
     0.9 1  };
D = {4 9};                /* standard deviations */
Sigma = corr2cov(R, D);   /* Covariance matrix Sigma = D*R*D */
print Sigma;
 
/* U is the Cholesky transformation that scales and correlates the data */
U = root(Sigma);
 
/* add a few unusual points (outliers) */
pi = constant('pi');
t = T(0:11) * pi / 6;            /* evenly spaced points in [0, 2π) */
outliers = 5#cos(t) || 5#sin(t); /* evenly spaced on circle r=5 */
v = x // outliers;               /* concatenate MVN data and outliers */
w = v*U;                         /* transform from stdized to data coords */
 
labl = j(N,1," ") // T('0':'11');
Type = j(N,1,"Normal") // j(12,1,"Outliers");
title "Markers on Circle r=5";
title2 "Evenly spaced angles t[i] = i*pi/6";
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

The program simulates standardized uncorrelated data (X ~ MVN(0, I(2))) and then appends 12 points on the circle of radius 5. The Cholesky transformation correlates the data. The correlated data are shown in the scatter plot. The circle of outliers has been transformed into an ellipse of outliers. The original outliers are 5 Euclidean units from the origin; the transformed outliers are 5 units of Mahalanobis distance from the origin, as shown by the following computation:

center = {0 0};
MD = mahalanobis(w, center, Sigma);     /* compute MD based on MVN(0, Sigma) */
obsNum = 1:nrow(w);
title "Mahalanobis Distance of MVN Data and Outliers";
call scatter(obsNum, MD) grid={x y} group=Type;

Adding outliers at different Mahalanobis distances

The same technique enables you to add outliers at any distance to the origin. For example, the following modification of the program adds outliers that are 5, 6, and 10 units from the origin. The t parameter determines the angular position of the outlier on the circle:

MD = {5, 6, 10};               /* MD for outlier */
t = {0, 3, 10} * pi / 6;       /* angular positions */
outliers = MD#cos(t) ||  MD#sin(t);
v = x // outliers;             /* concatenate MVN data and outliers */
w = v*U;                       /* transform from stdized to data coords */
 
labl = j(N,1," ") // {'p1','p2','p3'};
Type = j(N,1,"Normal") // j(nrow(t),1,"Outlier");
call scatter(w[,1], w[,2]) grid={x y} datalabel=labl group=Type;

If you use the Euclidean distance, the second and third outliers (p2 and p3) are closer to the center of the data than p1. However, if you draw the probability ellipses for these data, you will see that p1 is more probable than p2 and p3. This is why the Mahalanobis distance is used for measuring how extreme an outlier is. The Mahalanobis distances of p1, p2, and p3 are 5, 6, and 10, respectively.
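If you want to verify those values numerically, you can reuse the MAHALANOBIS call from the first example. The following lines are a small sketch that assumes the matrices w and Sigma and the scalar N from the program above are still defined in the PROC IML session:

/* sketch: check the MD of the three appended outliers (rows N+1 to N+3 of w) */
MDcheck = mahalanobis(w[(N+1):(N+3), ], {0 0}, Sigma);
print MDcheck;   /* expect the values 5, 6, and 10 */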

Adding outliers in higher dimensions

In two dimensions, you can use the formula (x(t), y(t)) = r*(cos(t), sin(t)) to specify an observation that is r units from the origin. If t is chosen randomly, you can obtain a random point on the circle of radius r. In higher dimensions, it is not easy to parameterize a sphere, which is the surface of a d-dimensional ball. Nevertheless, you can still generate random points on a d-dimensional sphere of radius r by doing the following:

  1. Generate a point from the d-dimensional multivariate normal distribution: Y = MVN(0, I(d)).
  2. Project the point onto the surface of the d-dimensional sphere of radius r: Z = r*Y / ||Y||.
  3. Use the Cholesky transformation to correlate the variables.

In the SAS/IML language, it would look like this:

/* General technique to generate random outliers at distance r */
d = 2;                              /* use any value of d */
MD = {5, 6, 10};                    /* MD for outlier */
Y = randfun(nrow(MD)//d, "Normal"); /* d-dimensional Y ~ MVN(0, I(d)) */
outliers = MD # Y / sqrt(Y[,##]);   /* surface of spheres of radii MD */
/* then append to data, use Cholesky transform, etc. */
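
For the two-dimensional case (d = 2), the elided steps mirror the earlier examples. The following lines are a sketch that assumes x, U, N, and the MD vector defined above are still in memory:

/* sketch: append the random outliers and correlate the data (d = 2 only) */
v = x // outliers;             /* concatenate MVN data and outliers */
w = v*U;                       /* transform from stdized to data coords */
Type = j(N,1,"Normal") // j(nrow(MD),1,"Outlier");
call scatter(w[,1], w[,2]) grid={x y} group=Type;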

For these particular random outliers, the largest outlier (MD=10) is in the upper right corner. The others are in the lower left corner. If you change the random number seed in the program, the outliers will appear in different locations.

In summary, by understanding the geometry of multivariate normal data and the Cholesky transformation, you can manufacture outliers in specific locations or in random locations. In either case, you can control whether you are adding a "near outlier" or an "extreme outlier" by specifying an arbitrary Mahalanobis distance.

You can download the SAS program that generates the graphs in this article.

The post How to simulate multivariate outliers appeared first on The DO Loop.

March 27, 2019
 

PAYG financial services: coming to a bank near you

You walk into your neighborhood bank to see about a mortgage. You and your spouse have your eye on the perfect 3BR, 2BA brick ranch near your child's school, and it won't be on the market long. An hour later, you burst through the front door with a bottle of champagne: "We're qualified!"

Also celebrating is your bank's branch manager. She was skeptical when headquarters analysts equipped branches for a "cloud-based application using SAS," saying it would speed up loan applications. But your quick, frictionless transaction proved them right. The bank's accountants are happy too. The new pay-as-you-go mode of using SAS software in the cloud means big savings.

The above scenario is possible now through serverless functions, which enable your SAS Viya applications to take input from end users, score the loan application, and return results.

The rest of this post gets into the nitty-gritty of serverless functions and SAS Viya, detailing what happens in a bank's computers after a customer applies for a loan. The qualification process starts by running a previously built scoring model to generate a score. You will see how the combination of REST APIs in SAS Viya, analytic models and the restaf library makes the task of building the serverless function relatively simple.

The blog titled "SAS REST APIs: a sample application" demonstrated building a SAS Viya application using REST APIs, SAS Visual Analytics and SAS Operational Research. That is a typical web application, with an application server and SAS Viya running on premises.

If you are one of the many users working with (or considering) a cloud provider, serverless functions are a useful alternative way to deliver your applications to your users. This eliminates the need to manage the application server associated with your application. Additionally, you get zero administration and auto-scaling, among other benefits. Many SAS applications that respond quickly to user requests are ideal candidates to be deployed as serverless functions.

The example in this article is available on SAS software’s GitHub site in the viya-apps-serverless-score repository.  If you want to see the end application for a frame of reference, see the Using the serverless functions section at the bottom of this article.

Let’s begin with a bit of background on serverless computing and then dig into the details of the application and functions.

Serverless computing explained

The benefits of serverless functions as touted by AWS serverless, Azure and serverless.com:

AWS Lambda

AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

What is serverless computing?

According to Azure, serverless computing is the abstraction of servers, infrastructure, and operating systems. When you build serverless apps you don’t need to provision and manage any servers, so you can take your mind off infrastructure concerns. Serverless computing is driven by the reaction to events and triggers happening in near-real-time—in the cloud. As a fully managed service, server management and capacity planning are invisible to the developer and billing is based just on resources consumed or the actual time your code is running.

Four core benefits of serverless computing from serverless.com:

  1. Zero administration – Deploy code without provisioning anything beforehand or managing anything afterward. There is no concept of a fleet, an instance, or even an operating system. No more bothering the Ops department.
  2. Auto-scaling – Let your service providers manage the scaling challenges. No need to fire alerts or write scripts to scale up and down. Handle quick bursts of traffic and weekend lulls the same way — with peace of mind.
  3. Pay-per-use – Function-as-a-service compute and managed services charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.
  4. Increased velocity – Shorten the loop between having an idea and deploying to production. Because there’s less to provision up front and less to manage after deployment, smaller teams can ship more features. It’s easier than ever to make your idea live.

OK, so there is a server involved in serverless computing. The beauty in this technology is that once you deploy your code, you don't have to worry about the underlying infrastructure. You just know that the app should work and you only incur costs when the app is running.

Basic flow

Serverless functions are loaded and executed based on the occurrence of one of the triggers/events supported by the cloud vendor. In this example the API Gateway triggers the serverless functions when an http call invokes the function. The API Gateway calls the handler for the function and passes in the user data. On return from the handler the response is sent to the client. This article focuses on the code inside the Serverless Function box in the picture below.

Figure 1: Request Workflow

This example utilizes two key functions:

  1. app – This function serves up an HTML application for the user to enter data. This is an example of a web application as a serverless function.
  2. score – This function takes user input from the web app, executes scoring on a Viya Server and returns the results.

Serverless.yml

The serverless.yml file defines the serverless functions, their handlers, and other system-related information. We will focus only on the application-specific information.

The code snippet below shows the definition of the path and handler for the two functions in the serverless.yml file.

functions:
  app: 
    handler: src/app.app
    events:
      - http:
          path: app
          method: get
          cors: 
            origin: '*'
          request:
            parameters:
              paths:
                id: true  
 
  score:
    handler: src/score.score
    events:
      - http:
          path: score
          method: post
          cors: 
            origin: '*'

The functions (app and score) in the yaml define:

  1. event - an http event will trigger this function
  2. path - the path to the function - similar to what you define in Express or hapijs
  3. method - a standard HTTP method (GET, PUT, etc.)
  4. others - refer to the cloud vendor's documentation for other available options.

The serverless.yml file also sets application related information using environment variables. In this particular use case we define how to access SAS Viya and which scoring model to use.

environment:
#
# Information for logging into SAS Viya
#
  VIYA_SERVER: http://example.viya.server.com
  CLIENTID: raf
  CLIENTSECRET: raf
  USER: rafuser
  PASSWORD: rafpass
 
#
# astore to be used for scoring
#
  caslib: casuser
  name: GRADIENT_BOOSTING___BAD_2

A note on securing your password

In this example we store the userid and password in environment variables. This is to keep the focus on the internals of serverless functions for SAS Viya. Locally you can use "serverless variables" to secure the information during development. However, for production deployment, refer to your provider's recommendations and the user community for best practices.

Sounds like a follow-up blog in the future 🙂

Anatomy of the serverless function

Figure 2 shows the flow inside the serverless function for this example. This pattern will repeat itself in your serverless functions.

Figure 2: Serverless Function Flow

Serverless function score

The code below is the handler for the score function. The rest of this section will discuss each of the key features of the handler.

//
// See src/score.js for the full code
//
module.exports.score = async function (event, context ) {
 
   let store      =  restaf.initStore(); /* initialize restaf     */
   let inParms = parseEvent(event);  /* get user input        */
   let payload = getLogonPayload(); /* get logon information */
 
   return store.logon(payload)               /* logon to SAS Viya */
        .then (()    => scoreMain( store, inParms )) /* score     */
        .then(result => setPayload(result))  /* return results     */
        .catch(err   => setError(err));      /* else return errors */
}

Step 1: Parse the input

The event parameter contains the input from the caller (web application, another serverless function, etc).
The content of the event parameter is whatever the designer of the serverless function desires. In this particular case, a sample event data is shown below.

{
    "input": {
        "JOB"    : "J1",
        "CLAGE"  : 100,
        "CLNO"   : 20,
        "DEBTINC": 20,
        "DELINQ" : 2,
        "DEROG"  : 0,
        "MORTDUE": 4000,
        "NINQ"   : 1,
        "YOJ"    : 10,
        "LOAN"   : 10000,
        "VALUE"  : 1000000
    }
}

The parseEvent function validates the incoming information.

module.exports = function parseEvent(event) {
    let input = null;
    let body = {};
    let rstore = {
        caslib: process.env.ASTORE_CASLIB,
        name  : process.env.ASTORE_NAME
    };
    if ( event.body != null ) {
        body = ( typeof event.body === 'string') ? JSON.parse(event.body) : Object.assign({}, event.body);
        if ( body.hasOwnProperty('input') === true ) {
            input = body.input;
        }
    }
    return { rstore: rstore, input: input };
}

Step 2: Logon to SAS Viya

The serverless.yml defines the SAS Viya logon information. Note there are other, more secure ways to manage sensitive information like passwords. You should refer to your provider’s documentation.

module.exports = function getLogonPayload() {
    let p = {
        authType    : 'password',
        host        : `${process.env.VIYA_SERVER}`,
        user        : process.env['USER'],
        password    : process.env['PASSWORD'],
        clientID    : process.env['CLIENTID'],
        clientSecret: (process.env.hasOwnProperty('CLIENTSECRET')) ? process.env[ 'CLIENTSECRET' ] : ''
        };
    return p;
 }

The line store.logon(payload) in the handler code logs on to the SAS Viya server using this information.

Step 3 and Step 4: Create Payload and make REST API calls

On successful logon the server is called to do the scoring. This particular example uses the sccasl.runcasl method to run CAS Language (CASL) statements and return the scores. Creating the score has two steps:

  1. upload user input: The user input is converted to a csv and uploaded to a CAS table
  2. Submit CASL statements to SAS Viya (CAS) to do the scoring

The code in src/scoreMain in the repository accomplishes both these steps.

Each of these steps use a CAS action:

    • table.upload – to upload the user data into a CAS table. The input data is converted into a comma-delimited file (csv) and then uploaded. The REST call using restaf looks like this:
    let csv = makecsv(input); /* create a csv */
    let JSON_Parameters = {
        casout: {
            caslib : 'casuser', /* a valid caslib */
            name   : 'INPUTDATA', /* name of output file on cas server */
            replace: true
        },
 
        importOptions: {
            fileType: 'csv' /* type of the file being uploaded */
        }
    };
 
    let payload = {
        headers: { 'JSON-Parameters': JSON_Parameters },
        data   : csv,
        action : 'table.upload'
    };
 
    let result = await store.runAction(session, payload);
    • sccasl.runcasl – execute CASL statements to do the scoring
 // Set up the CASL statements
 let caslStatements = `
    loadactionset "astore";

    action table.loadTable /
       caslib = "${rstore.caslib}"
       path   = "${rstore.name}.sashdat"
       casout = { caslib = "${rstore.caslib}" name = "${rstore.name}" replace=TRUE};

    action astore.score /
       table  = { caslib= 'casuser' name = 'INPUTDATA' }
       rstore = { caslib= "${rstore.caslib}" name = '${rstore.name}' }
       out    = { caslib = 'casuser' name = 'OUTPUTDATA' replace= TRUE};

    action table.fetch r = result /
       format = TRUE
       table  = { caslib = 'casuser' name = 'OUTPUTDATA' };

    send_response(result);
 `;

 // Execute the CAS actions
 payload = {
    action: 'sccasl.runcasl',
    data  : { code: caslStatements }
 };
 result = await store.runAction(session, payload);

Step 5: Create response

AWS serverless functions must return data and error(s) in a certain form. The two functions setPayload.js and setError.js accomplish this.

module.exports = function setPayload (body) {
    return {
        "statusCode": 200,
        "headers"   : {
            'Access-Control-Allow-Origin'     : '*',
            'Access-Control-Allow-Credentials': true
          },
        "isBase64Encoded": false,
        "body"           : JSON.stringify(body)
    }
  }

Using the serverless functions

When the serverless function is deployed you will get a link for each of the functions. In our case we receive the request shown below (with xxxx replaced with appropriate information).

GET - https://xxxx.amazonaws.com/demo/app

The first link serves up the web application. The user enters some values and the app calls the score serverless function to get the results.
Alternatively, you can write your own application and make an HTTP POST call to the score function using a link such as:

POST - https://xxxx.amazonaws.com/demo/score

To invoke the web application, you will visit the link

https://xxxx.amazonaws.com/demo/app

with your browser. You should see a display shown in Figure 3:

Figure 3: Application Input Screen

Entering values into the two fields and pressing Submit calls the second serverless function, score, and results in a pie chart as seen in Figure 4:

Figure 4: Score Report Screen

Please see the loan.html file in the GitHub repository for details on the application. Displayed below is the relevant part of the JavaScript in loan.html. The score-function-url is the URL for the score function. The payload was described earlier in this article. The HTTP call is made using axios.

async function runScore(inputValues ){
 
    let payload = {
        astore: {
            caslib: 'Public',
            name: 'GRADIENT_BOOSTING___BAD_2'
        },
        input: inputValues
    }
    let config = {
        url   : '{score-function-url}', // replace with the URL of your deployed score function
        method: 'POST',
        data  : payload
    };
    let r = await axios(config);
    return r.data.score;
 
}

Porting to other cloud providers

The cloud provider dependent information is handled in the following functions: score.js, parseEvent.js, setPayload.js and setError.js. The rest of the code is host agnostic. In writing your own functions the recommendation is to follow the same pattern as much as possible. The generic code is then available in its own repository for reuse with other providers and applications.

Go try it yourself

I have shown you how to deliver your SAS Viya applications as serverless functions. To access more examples please see the GitHub restaf-demos repository.

Supporting Resources

Serverless functions and SAS Viya - a good match was published on SAS Users.

March 25, 2019
 

An important concept in multivariate statistical analysis is the Mahalanobis distance. The Mahalanobis distance provides a way to measure how far away an observation is from the center of a sample while accounting for correlations in the data. The Mahalanobis distance is a good way to detect outliers in multivariate normal data. It is better than looking at the univariate z-scores of each coordinate because a multivariate outlier does not necessarily have extreme coordinate values.

The geometry of multivariate outliers

In classical statistics, a univariate outlier is an observation that is far from the sample mean. (Modern statistics use robust statistics to determine outliers; the mean is not a robust statistic.) You might assume that an observation that is extreme in every coordinate is also a multivariate outlier, and that is often true. However, the converse is not true: when variables are correlated, you can have a multivariate outlier that is not extreme in any coordinate!

The following schematic diagram gives the geometry of multivariate normal data. The middle of the diagram represents the center of a bivariate sample.

  • The orange elliptical region indicates a region that contains most of the observations. Because the variables are correlated, the ellipse is tilted relative to the coordinate axes.
  • For observations inside the ellipse, their Mahalanobis distance to the sample mean is smaller than some cutoff value. For observations outside the ellipse, their Mahalanobis distance to the sample mean is larger than the cutoff.
  • The green rectangles at the left and right indicate regions where the X1 coordinate is far from the X1 mean.
  • The blue rectangles at the top and bottom indicate regions where the X2 coordinate is far from the X2 mean.
Geometry of multivariate outliers showing the relationship between Mahalanobis distance and univariate outliers. The point 'A' has large univariate z scores but a small Mahalanobis distance. The point 'B' has a large Mahalanobis distance. Only 'B' is a multivariate outlier.

The diagram displays two observations, labeled "A" and "B":

  • The observation "A" is inside the ellipse. Therefore, the Mahalanobis distance from "A" to the center is "small." Accordingly, "A" is not identified as a multivariate outlier. However, notice that "A" is a univariate outlier for both the X1 and X2 coordinates!
  • The observation "B" is outside the ellipse. Therefore, the Mahalanobis distance from "B" to the center is relatively large. The observation is classified as a multivariate outlier. However, notice that "B" is not a univariate outlier for either the X1 or X2 coordinates; neither coordinate is far from its univariate mean.

The main point is this: An observation can be a multivariate outlier even though none of its coordinate values are extreme. It is the combination of values which makes an outlier unusual. In terms of Mahalanobis distance, the diagram illustrates that an observation "A" can have high univariate z scores but not have an extremely high Mahalanobis distance. Similarly, an observation "B" can have a higher Mahalanobis distance than "A" even though its z scores are relatively small.

Applications to real data

This article was motivated by a question from a SAS customer. In his data, one observation had a large Mahalanobis distance score but relatively small univariate z scores. Another observation had large z scores but a smaller Mahalanobis distance. He wondered how that was possible. His data contained four variables, but the following two-variable example illustrates his situation:

Geometry of multivariate outliers. The point 'A' has large univariate z scores but the Mahalanobis distance is only about 2.5. The point 'B' has a Mahalanobis distance of 5 and is a multivariate outlier.

The blue markers were simulated from a bivariate normal distribution with μ = (0, 0) and covariance matrix Σ = {16 32.4, 32.4 81}. The red markers were added manually. The observation marked 'B' is a multivariate outlier. The Mahalanobis distance (MD) from 'B' to the center of the sample is about 5 units. (The center is approximately at (0,0).) In contrast, the observation marked 'A' is not a multivariate outlier even though it has extreme values for both coordinates. In fact, the MD from 'A' to the center of the sample is about 2.5, or approximately half the MD of 'B'. The coordinates (x1, x2), standardized coordinates (z1, z2), and MD for both points are shown below:

You can connect the Mahalanobis distance to the probability of a multivariate random normal variable. The squared MD for multivariate normal data is distributed according to a chi-square distribution. For bivariate normal data, the probability that an observation is within t MD units of the center is 1 - exp(-t²/2). Observations like 'A' are not highly unusual. Observations that have MD ≥ 2.5 occur in exp(-2.5²/2) ≈ 4.4% of random variates from the bivariate normal distribution. In contrast, observations like 'B' are extremely rare. Observations that have MD ≥ 5 occur with probability exp(-5²/2) = 3.73E-6. Yes, if you measure in Euclidean distance, 'A' is farther from the center than 'B' is, but the correlation between the variables makes 'A' much more probable. The Mahalanobis distance incorporates the correlation into the calculation of "distance."
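If you want to check these probabilities yourself, note that for bivariate normal data the tail probability exp(-t²/2) equals 1 minus the chi-square CDF with 2 degrees of freedom evaluated at t². The following DATA step is a small sketch (the variable names are illustrative):

/* sketch: tail probabilities for MD >= t in bivariate normal data */
data _null_;
   do t = 2.5, 5;
      pTail = exp(-t**2/2);                     /* closed form for d = 2        */
      pChi  = 1 - cdf('chisquare', t**2, 2);    /* same value via chi-square(2) */
      put t= pTail= pChi=;
   end;
run;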

Summary and further reading

In summary, things are not always as they appear. For univariate data, an outlier is an extreme observation. It is far from the center of the data. In higher dimensions, we need to account for correlations among variables when we measure distance. The Mahalanobis distance does that, and the examples in this post show that an observation can be "far from the center" (as measured by the Mahalanobis distance) even if none of its individual coordinates are extreme.

The following articles provide more information about Mahalanobis distance and multivariate outliers:

You can download the SAS program that generates the examples and images in this article.

The post The geometry of multivariate versus univariate outliers appeared first on The DO Loop.

March 25, 2019
 

These days, an ever-increasing number of customer interactions are taking place over digital channels and every single digital interaction offers an incredible source of customer intelligence for organisations to tap into. With every visit, customers leave a valuable trail of digital breadcrumbs. These breadcrumbs give organisations the ability to follow [...]

5 ways your digital analytics strategy is hindering your customer experience was published on Customer Intelligence Blog.

March 22, 2019
 

As data volumes continue to surge (with no signs of slowing down), the cost to handle all that data can have a very real impact on IT budgets.

But for all the talk of the growing value of data, the cost side of the equation is often overlooked. After all, not only does it take a lot of drives to hold all that data, it also requires significant computing resources to handle it. The costs quickly add up, forcing IT leaders to make tough choices about where to make their IT investments.

This is as true for SAS users as it is for any other IT organization. We work closely with our customers and know that the cost of our analytics software is only part of the budget puzzle – total cost of ownership (TCO) casts a long shadow over their budget decisions. So we were as curious as they were when Intel released its Intel® Optane™ DC Persistent Memory.

What is persistent memory?

Data written to persistent memory remains accessible using memory instructions even after the process that created or last modified them is gone. In this aspect, persistent memory is like a hard disk drive (HDD). But the speed at which the data can be accessed from persistent memory is considerably faster than the speed delivered by an HDD.

Of course, as a longtime Intel collaborator, we worked closely together to optimize SAS software for this innovation. But how would it work in real-life applications outside of the labs? Would the Intel Optane DC Persistent Memory - SAS® Viya® combination have a real impact, or would it just represent yet another incremental improvement – nice, but not enough to change the IT investment-performance equation?

Testing confirms performance

There was only one way to find out, of course: Test it.

So that’s what we did. In collaboration with Intel, we ran SAS Viya 3.4 on Intel Optane DC Persistent Memory. More specifically, the testing compared performance of a two-socket second-generation Intel® Xeon® Platinum 8280 processor with DRAM to a two-socket second-generation Intel Xeon Platinum 8280 processor with Intel Optane DC persistent memory. (See editor's note below for details.)

We didn’t hold back, either – we ran multiple instances of a large model containing 400 gigabytes of data, concurrently. All these jobs finished in a matter of minutes. Just as important, the costs of achieving performance in a persistent memory environment with SAS Viya, when compared to DRAM and SAS Viya, make their own compelling case.

Intel Optane persistent memory achieves more than 95 percent of the performance of DRAM, at a cost per unit of performance that is approximately 18 percent lower. With second-generation Intel Xeon Scalable Processors and Intel Optane DC persistent memory in Memory Mode, it’s clear that users can take advantage of systems with larger memory capacity, with little or no performance degradation.

Greater memory capacity = substantial cost savings

But performance may not even be the most interesting part of this story – not when you compare it to the cost benefits of running in this environment. In short, compared with DRAM – equal capacity memory, equal CPU, and so on – using Intel Optane DC Persistent Memory leads to a reduction in costs of 20 percent or more. That’s a tangible, meaningful impact for any IT budget, allowing CIOs to invest more heavily to address other pressing needs. By contrast, system costs increase dramatically with higher DRAM capacity.

Plus, it may be possible in the future to use the memory module in different modes – an opportunity we’re currently testing in non-volatile random-access memory (NVRAM) mode with the goal of allowing applications to take advantage of this benefit.

How does this combination of persistent memory tools and SAS Viya contribute to faster speeds and lower costs? The massive scale of data typically handled by SAS customers can be handled by fewer servers, because each can now have a much greater memory capacity. This equation can translate into a significant reduction in the total cost of ownership.

There are a lot of factors, many of which are unique to your organization, that go into reducing the total cost of ownership – so it goes without saying that your mileage may vary. Regardless, this is a conversation your organization needs to be having right now, if it hasn’t already. We’re happy to answer your questions about running SAS Viya together with Intel Optane DC Persistent Memory.

Want a deeper dive? | See the video on Intel Optane DC Persistent Memory

Editor's Note: Specs for this test included:

  • SAS® Viya® In-memory Analytics: SAS® Viya 3.4 VDMML application.
  • Workload: 3 concurrent logistic regression tasks each running on 400GB datasets.
  • Testing by: Intel and SAS completed on February 15, 2019.
  • Baseline hardware for comparison: 2S Intel® Xeon® Platinum 8280 processor, 2.7GHz, 28 cores, turbo and HT on, BIOS SE5C620.86B.0D.01.0286.011120190816, 1536GB total memory, 24 slots / 64GB / 2666 MT/s / DDR4 LRDIMM, 1x 800GB, Intel SSD DC S3710 OS Drive + 1x 1.5TB Intel Optane SSD DC P4800X NVMe Drive for CAS_DISK_CACHE + 1x 1.5TB Intel SSD DC P4610 NVMe Drive for application data, CentOS Linux* 7.6 kernel 4.19.8.
  • New hardware tested: 2S Intel® Xeon® Platinum 8280 processor, 2.7GHz, 28 cores, turbo and HT on, BIOS SE5C620.86B.0D.01.0286.011120190816, 1536GB Intel Optane DCPMM configured in Memory Mode (8:1), 12 slots / 128GB / 2666 MT/s, 192GB DRAM, 12 slots / 16GB / 2666 MT/s DDR4 LRDIMM, 1x 800GB, Intel SSD DC S3710 OS Drive + 1x 1.5TB Intel Optane SSD DC P4800X NVMe Drive for CAS_DISK_CACHE + 1x 1.5TB Intel SSD DC P4610 NVMe Drive for application data, CentOS Linux* 7.6 kernel 4.19.8.

Persistent memory: one way to rein in IT costs was published on SAS Users.

March 21, 2019
 

My favorite tech influencers teach, ask questions and share their knowledge freely. They value quality over quantity, and they engage with each other authentically. My goals on social media are to learn, engage with the world socially, and to share what I think is cool – sometimes this is about [...]

6 tech influencers to follow on Twitter was published on SAS Voices by John Maynard