Tech

October 4, 2018
 

You can now conduct a live demo of SAS Visual Analytics on your mobile device for participants who are geographically dispersed by using the Present Screen feature, a tucked-away option in the SAS Visual Analytics app for iPad, iPhone, and Android devices. Let’s say I am looking at a report on my mobile device, and I have questions about a couple of items for my colleagues Joe and Anita, who are located in two different cities. The three of us are able to see the report while I demo it, and I can draw their attention to specific areas of interest in the report.

Sitting in my office, or from any location where I have a Wi-Fi or cellular connection, I can use the Present Screen feature to do a live shared presentation with Joe and Anita. And I don't have to present just one report. During the live presentation, I can close a report, open a different one, perform interactions, and move around in the app between different reports.

And here’s the real beauty of this feature. Neither Joe nor Anita needs to have the SAS Visual Analytics app on a mobile device, or SAS Visual Analytics running on a desktop. Participants need only a mobile device (could be an iOS, Android, or Windows device), a laptop, or a desktop system, plus Internet access via Wi-Fi (or a cellular connection for mobile devices), an email client for receiving an email notification, and a Web browser. VPN connectivity might be required if the participants' organization requires VPN.

Plus, you can conduct your live presentation for up to 10 people! Before we move on, note that you can also use the iOS feature, AirDrop, on your iOS device to engage participants with the Present Screen feature. This is useful if you’re in a room with a bunch of folks who have iOS devices, and you want to do live sharing.

Ready to try it? Here’s a short checklist of what you need to do a live presentation of SAS Visual Analytics reports.

Requirements for the presenter

  • Use any one of these devices to present your screen for a shared presentation to your participants: iPad, iPhone, Android tablet or smartphone
  • SAS Visual Analytics app installed on your device
  • Connection via Wi-Fi or cellular to the server where the report(s) reside
  • Subscription to the reports that you want to present and share
  • Email client on the iPad or phone
  • VPN connectivity if your organization requires it

Couple of things to note

Say you're presenting to 10 participants via email or AirDrop.  Note that as the presenter, you must have the SAS Visual Analytics app open in your device with the Present Screen feature selected and active for your guests to be able to see your reports. Once you exit the app, the screen presentation session ends and the email or AirDrop invitations are no longer valid.

And a note on participant information. After a participant forwards an email invitation to view the presentation to a colleague, the recipient is required to enter a name and email address before joining your live screen presentation. You, as the presenter, get to see the names of the folks who have joined your screen presentation. Participant names and email addresses simply let the presenter know who has joined the presentation. Neither the participants' names nor their email addresses are validated by the SAS Visual Analytics server.

Supported versions for SAS Visual Analytics reports

You can present and share reports with the Present Screen feature for these versions of SAS Visual Analytics:

7.3, 7.4, 8.1, 8.2, 8.3

How to start the screen presentation

In the Analytics report I'm subscribed to from the SAS Visual Analytics app on my iPad, I choose Present Screen.

Choosing Present Screen in the subscribed report

I am reminded that I can present my screen to a maximum of 10 participants. I click OK.

The app prompts me to send email or choose AirDrop to present my screen to participants - I choose email.

My email client opens, and the app includes instructions along with the link that takes participants to my live screen presentation. I enter the email addresses for my participants and send the email.

In the email that Anita receives on her desktop PC, she taps on the link which takes her to her Web browser where my screen presentation is set to start soon.

Anita enters her name and email address, and notes that the presentation has not yet started.

Joe, who has logged on to the presentation from his Android phone, is also presented with the same message on his smartphone.

To begin the screen presentation, I tap on the blinking cursor in the app.

The app reminds me that now my participants can see everything on my iPad screen.

Next, a blue bar at the top of the report indicates that my screen presentation is live and can be seen by Anita and Joe. Now I can begin my report presentation, or exit this report and open a different report to share.

Here is my screen on the iPad Mini where I started my screen presentation:

Anita's screen on her Windows desktop monitor:

Joe's screen on the Android smartphone:

When I am finished with the presentation, I tap on the 'stop' button to end it.

A message displays to indicate that the presentation has ended. Here's an example of that message from Joe's Android smartphone.

Do it live! How to present your screen from the SAS Visual Analytics app was published on SAS Users.

September 26, 2018
 

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a Ph.D. researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

Speed: How fast an object is moving, usually expressed as distance over time (example: 10 meters per second)
Velocity: How fast an object is moving and in which direction (example: 10 meters per second, to the east)
Acceleration: The rate of change in the velocity of an object. Since it’s a rate of change, it’s expressed as distance over time (speed), per unit of time.  For example, to change speed from 0 to 60 miles-per-hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s².
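
To check the arithmetic in that last example, here is a minimal Python sketch. The function names are my own, chosen for illustration; the only inputs are the figures quoted above (0 to 60 miles per hour in 10 seconds).

```python
# Worked example: average acceleration for 0 to 60 mph in 10 seconds.
METERS_PER_MILE = 1609.344      # exact length of the international mile
SECONDS_PER_HOUR = 3600

def mph_to_mps(mph):
    """Convert miles per hour to meters per second."""
    return mph * METERS_PER_MILE / SECONDS_PER_HOUR

def average_acceleration(v_start_mph, v_end_mph, seconds):
    """Average acceleration in m/s^2 for a change in speed over a time span."""
    return (mph_to_mps(v_end_mph) - mph_to_mps(v_start_mph)) / seconds

accel = average_acceleration(0, 60, 10)
print(round(accel, 3))   # 2.682
```

Sixty miles per hour is about 26.82 meters per second, and spreading that change over 10 seconds gives the 2.682 m/s² figure.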

 

Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration along one of the axes.  Taken together, the three measurements give you a sense of the phone's overall direction of motion.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is perpendicular to the face, passing through the center of the phone.  Depending on the direction of the phone's movement, acceleration values might be positive or negative.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer.  Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp in Unix epoch format: the number of milliseconds since midnight UTC on January 1, 1970.  The next three values are the acceleration measurements for the x, y, and z axes, respectively.  Acceleration is measured in meters per second squared, or m/s².  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s² (1 g).
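
To make that record layout concrete, here is a small Python sketch that reads one record and converts the timestamp. It assumes the values are comma-separated, which the capture app may or may not do, and the sample line is invented for illustration rather than taken from my commute data.

```python
from datetime import datetime, timezone

def parse_record(line):
    """Split one comma-separated record into typed fields (delimiter is an assumption)."""
    counter, epoch_ms, x, y, z = line.strip().split(",")
    return {
        "counter": int(counter),
        # Epoch value is in milliseconds, so divide by 1000 to get seconds.
        "time": datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc),
        "x": float(x),
        "y": float(y),
        "z": float(z),
    }

# Hypothetical sample line: counter, epoch milliseconds, then x, y, z in m/s^2.
sample = "1,1537488000000,0.12,8.45,4.91"
record = parse_record(sample)
print(record["time"].isoformat())
print(record["x"], record["y"], record["z"])
```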

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)

commute accelerometer

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

histograms of x y z

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive, my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8 m/s².  With my phone at the vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.
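
A little trigonometry shows why the split looks the way it does. The sketch below assumes the phone is tilted exactly 30 degrees away from vertical and ignores sign conventions, which vary by device and app; my actual mounting angle is only a guess.

```python
import math

G = 9.8  # m/s^2, the approximate value of gravity used in the text

def gravity_components(tilt_from_vertical_deg):
    """Return the (y, z) share of gravity for a phone tilted away from vertical."""
    tilt = math.radians(tilt_from_vertical_deg)
    return G * math.cos(tilt), G * math.sin(tilt)

y, z = gravity_components(30)
print(round(y, 2), round(z, 2))   # 8.49 4.9
```

With no tilt, the whole 9.8 m/s² lands on the y axis. As the tilt grows, more of it moves to the z axis.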

Since the data collected represents a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious.  Sunish's team applied time series techniques to their data.  PROC TIMESERIES also offers a SPECTRA statement for spectrum analysis and similar options.
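
To give a feel for what spectrum analysis does, here is a minimal Python sketch of a periodogram built on a naive discrete Fourier transform. It is not what PROC TIMESERIES does internally and not Sunish's method; the signal is synthetic and the 50 Hz sampling rate is an assumption chosen for the example.

```python
import cmath
import math

def periodogram(samples, sample_rate_hz):
    """Return (frequency_hz, power) pairs from a naive discrete Fourier transform."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]   # remove the constant offset first
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        result.append((k * sample_rate_hz / n, abs(coeff) ** 2 / n))
    return result

# Synthetic signal: a 2 Hz wobble riding on a 9.8 m/s^2 baseline, sampled at 50 Hz.
rate = 50
signal = [9.8 + 0.5 * math.sin(2 * math.pi * 2 * t / rate) for t in range(200)]
peak_freq, _ = max(periodogram(signal, rate), key=lambda pair: pair[1])
print(peak_freq)   # 2.0
```

The strongest frequency in the output matches the wobble that was put in, which is the kind of pattern a spectrum can surface in real accelerometer data.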

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.
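
One common way to control for gravity, shown here as a simple stand-in and not as one of the techniques Sunish presented, is to estimate gravity with a low-pass filter and subtract it from each reading. The smoothing factor of 0.9 and the sample readings are invented for illustration.

```python
def remove_gravity(readings, alpha=0.9):
    """Subtract a slowly varying gravity estimate from each reading.

    An alpha close to 1 makes the gravity estimate change more slowly.
    """
    gravity = readings[0]
    linear = []
    for value in readings:
        gravity = alpha * gravity + (1 - alpha) * value   # exponential low-pass filter
        linear.append(value - gravity)                      # what is left after gravity
    return linear

# Invented y-axis readings: a steady 8.5 m/s^2 with one brief jolt.
y_axis = [8.5] * 20 + [11.5] + [8.5] * 20
linear = remove_gravity(y_axis)
jolt_index = max(range(len(linear)), key=lambda i: linear[i])
print(jolt_index)   # 20
```

After the subtraction, steady readings sit near zero no matter how the phone is tilted, and the jolt stands out on its own.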

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.

 

Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 25, 2018
 

If you use SAS Visual Analytics and don’t have the SAS Visual Analytics app, you're missing out on a ton of convenience and interaction you could be having while on-the-go. And even if you don’t have access to SAS Visual Analytics today, you can still download and try the mobile app with some cool sample reports.

Ready to take a quick dive and look at the app?

How to get the app

Download and install the free app to your Apple, Android or Windows device from the app store:

Apple iTunes Store
Google Play
Microsoft Store

When you open the app, you are greeted with an introductory launch screen:

In the introductory launch screen that displays when you first open the iOS or Android app, go to the third screen and tap on Learn how to use the Tray.

You are taken to the SAS Help Center. Watch the short slide show at the help center to understand the special Tray feature in the iOS or Android app or to find out what’s new in the app.

Using the Windows-based app? Here’s what you see:

Sample reports on the SAS Demo Server

In the app, sample charts and reports are instantly made available to you in the Subscriptions view via a connection to the SAS Demo Server. This server hosts a nice variety of reports that you can view on your phone or tablet. Interact with a wide spectrum of sample SAS Visual Analytics reports for different industries.

Subscriptions View in the App With Sample Reports

Tap on Add to view the different folders that contain additional sample reports for you to browse, subscribe, and view.

Additional Sample Reports on the SAS Demo Server

When you select and subscribe to the additional reports that are available on the SAS Demo Server, these reports are downloaded to the Subscriptions view in the mobile app. Just tap on the tile for any report in the Subscriptions view to open it and view the charts, graphs, and their associated data.

Here are a couple of reports as viewed in the Windows 10 app:

Already have SAS Visual Analytics in your organization?

If you view SAS Visual Analytics reports on your laptop or a desktop computer, this app extends your ability to view those same reports on your phone or tablet. If your organization has deployed SAS Visual Analytics, but is not taking advantage of extending report viewing ability to mobile devices, I urge you to consider it.

The app supports SAS Visual Analytics 8.3, 8.2, 7.4, and 7.3. Almost every type of interaction that you have with a SAS Visual Analytics report on your desktop can be done with reports viewed in the app on your phone or tablet!

If you have SAS Visual Analytics deployed in your organization, reach out to the SAS Visual Analytics administrator in your organization and ask them to enable support for mobile devices so that you can start viewing your reports in the app.

To give you a little more guidance, here are some FAQs about the app.

If we have reports in our organization that were created with SAS Visual Analytics, can we view those reports in this app?

Yes. The same reports that you view in your web browser on a desktop can be viewed in the mobile app.

How do I view our organization’s reports in the app?

Access from your mobile device to SAS Visual Analytics reports on your company’s server is granted by your SAS Administrator. Live data access requires either a Wi-Fi or cellular connection, and your company may require VPN access or other company-specific security measures.

Contact your SAS Visual Analytics Administrator to request access from your mobile device to the server hosting your SAS Visual Analytics reports. Your administrator ensures that your mobile device is registered as a valid device in the SAS Environment Manager where mobile device access to your organization’s server is managed.

How do I add a server connection?

When your mobile device is registered for access to the SAS Visual Analytics server, simply create a server connection within the app to your company server and browse for reports.

Here’s a nice slide show with the steps you follow to create a server connection to the SAS Visual Analytics server by entering the complete server name, port number, your username, and password:

Quick primer on the SAS Visual Analytics app was published on SAS Users.

September 21, 2018
 

SAS Technical Support occasionally receives requests from users who want to insert blank rows into their TABULATE procedure tables. PROC TABULATE syntax does not have any specific options that insert blank rows into the table results.

One way to accomplish this task is explained in SAS Sample 45972, "Add a blank row in PROC TABULATE output in ODS destinations." This sample shows you how to add a blank row between class variables in a PROC TABULATE table.  The sample code creates a new data set variable that you can use in a CLASS statement in the PROC TABULATE step.

In addition to the method that is shown in this sample, there are two other methods that you can use to insert a blank row in PROC TABULATE table results. The following sections explain those methods.

Method 1: Adding a blank row between row dimension variables

This section demonstrates how to add a blank row between row dimension variables.  This method expands on the approach that is used in Sample 45972.

In the following example, additional data-set variables are added to the data set in a DATA step. The BLANKROW variable is used as a class variable, and the variables DUMMY1 and DUMMY2 are used as analysis variables in the PROC TABULATE step. This sample code also uses the AGE and SEX variables from the SASHELP.CLASS data set as class variables in PROC TABULATE.

/***  Method 1  ***/
/* Add a blank class value and two helper analysis variables. */
data class;
   set sashelp.class;
   blankrow='  ';   /* blank class level used as an empty row label */
   dummy1=1;        /* summed to count the students */
   dummy2=.;        /* always missing, so MISSTEXT=' ' prints blank cells */
run;

proc tabulate data=class missing;
   class age sex blankrow;
   var dummy1 dummy2;
   table age*dummy1=' '
         blankrow=' '*dummy2=' '

         sex*dummy1=' '
         blankrow=' '*dummy2=' '

         all*dummy1=' ',

         sum*F=8. pctsum / misstext=' ' row=float;

   keylabel sum='# of Students'
            pctsum='% of Total'
            all='Total';

   title 'Add a Blank Row by Using New Data Set Variables';
run;

This method produces the result that is shown below.  You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.

Method 2: Adding blank rows with user-defined formats and the PRELOADFMT option in the CLASS statement

The second method creates user-defined formats and uses the PRELOADFMT option in a CLASS statement in PROC TABULATE.  The VALUE statements that use the NOTSORTED option in the PROC FORMAT step establish the desired order of the rows in the TABULATE results. Using the formats and specifying the ORDER=DATA option in the CLASS statement and the PRINTMISS option in the TABLE statement keeps the order requested in the PROC FORMAT VALUE statements and displays the blank rows.

/***  Method 2 ***/
/* NOTSORTED keeps the values of each format in the order listed. */
proc format;
   value $sexf (notsorted)
      'F'='F'
      'M'='M'
      ' '=' ';

   value agef (notsorted)
      11='11'
      12='12'
      13='13'
      14='14'
      15='15'
      16='16'
      .='  ';

   value $sex2f (notsorted default=8)
      'F'='F'
      'M'='M'
      ' '='Missing'
      '_'=' ';

   value age2f (notsorted)
      11='11'
      12='12'
      13='13'
      14='14'
      15='15'
      16='16'
      .=' .'
      99=' ';

   /* Display a blank instead of 0 for percentages. */
   value mymiss
      0=' '
      other=[8.2];

run;

proc tabulate data=sashelp.class missing;
   class sex age / preloadfmt order=data;

   table sex age all, N pctn*F=mymiss.
         / printmiss misstext=' ' style=[cellwidth=2in];

   format sex $sexf. age agef.;
   /* If there are no missing values for the class  */
   /* variables, use the formats $SEXF and AGEF.    */
   /* With missing values for the class variables,  */
   /* use the formats $SEX2F and AGE2F.             */

   keylabel N='# of Students'
            PctN='% of Total'
            all='Total';

   title 'Add a Blank Row by Using the PRELOADFMT Option';
run;

This method produces the result that is shown below. You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.

If you have any questions about these methods, contact SAS Technical Support. From this link, you can search the Technical Support Knowledge Base, visit SAS Communities to ask for assistance, and contact SAS Technical Support directly.

Adding blank rows in TABULATE procedure results was published on SAS Users.

September 19, 2018
 

Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy.  If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells.  No cancer cells found -- that's good news!  But if cancer cells are present, then you have decisions to make about treatment.

A young woman named Richa Sehgal taught me that it's not so simple.  There aren't just two types of cells (cancerous and non-cancerous).  There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment.  I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old.  I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.

Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection.  That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.

The importance of cell types and how cancer works

Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting.  Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works.  And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.

What we call "cancer" is actually a collection of different types of cells.  Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs).  CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells.  But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat.  CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types.  TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy.  The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs.  TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.

(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)

Richa Sehgal presents to a standing-room-only crowd at #AnalyticsX

The inexact science of biopsies

Now that we understand that cancer is made up of a variety of cell types, it makes sense to hope that when we extract a biopsy, we get a sample that represents this cell type variability.  Richa used an example of sampling a chocolate chip cookie.  If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie.  And as a result, you might treat that cookie differently.  (If you encountered a raisin instead...well...that might require a different treatment altogether.  Blech.)

But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers.  This makes it difficult to design better biopsies.  Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup.  And when that day comes, Richa has a statistical method to make biopsies better.

Using SAS and Python to model cancer cell clusters

Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician.  However, you don't need an inside connection to get access to SAS for learning.  In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.

Since she didn't have real data that represent the makeup of a tumor, Richa created simulations of the cancer cells, their different types and proximity to each other in a 3D model.  With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed.  The result shows how closely cancer cells of the same type are positioned to one another.  With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
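
For readers who want to see the idea behind Ward's method, here is a minimal Python sketch. It is not Richa's code or her simulation: the "cells" are invented random points, and the merge cost is computed directly from cluster centroids instead of from a precomputed distance matrix.

```python
import random

def centroid(points):
    """Mean position of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist_sq = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist_sq

def ward_clusters(points, k):
    """Repeatedly merge the cheapest pair of clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)      # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Invented data: two groups of simulated "cells" around different centers.
rng = random.Random(42)

def blob(center, n):
    return [tuple(c + rng.gauss(0, 0.5) for c in center) for _ in range(n)]

cells = blob((0, 0, 0), 15) + blob((10, 10, 10), 15)
groups = ward_clusters(cells, 2)
print(sorted(len(g) for g in groups))   # [15, 15]
```

This brute-force version is fine for a few dozen points. A real analysis would lean on a tuned procedure such as PROC CLUSTER.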

Richa then used the Python package plotly to visualize the 3D model of her cell map.  (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)

A bright future -- for all of us

Clearly, Richa is an extremely accomplished young woman.  When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at.  I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.

Richa's accomplishments make me feel optimism for her, but also for the rest of us.  As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful.  SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring.  Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired.  Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."

Using machine learning to improve tumor biopsies was published on SAS Users.

September 13, 2018
 

With the release of SAS Viya 3.4, you can easily build large-scale machine learning models and seamlessly publish and run models to Hadoop, or other external databases such as Teradata, without the data ever leaving the Hadoop environment. In this process, SAS Viya:

1) Converts the model into MapReduce Code.

2) Executes the MapReduce code.

3) Returns a new, scored dataset in Hadoop.

SAS Viya is a new, distributed in-memory product that allows users to easily build predictive models at scale. Using the SAS Model Studio interface, I can build complex models without the need to write large amounts of underlying code.

For this blog post, I'll go through the steps to build my model using a telecommunications dataset to predict customer churn. Under the “Data” tab, I can see all of my variables, assign the proper roles, and view the dataset.

With the data prepared, I build a pipeline to perform data preprocessing steps such as imputation and binning and build several predictive models, including Regression, Neural Networks and Gradient Boosting. Pipelines are powerful because they automate the heavy lifting of the model building process, allowing you to solve problems faster. In addition, pipelines are re-usable across different users and datasets, allowing the adoption of best practices across an organization.
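
To show what two of those preprocessing steps do to a column, here is a minimal Python sketch of mean imputation and equal-width binning. It does not reproduce what the Model Studio nodes generate; the column values and the choice of three bins are invented for illustration.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]          # all values identical: a single bin
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

# Invented column of monthly charges with two missing entries.
charges = [20.0, None, 35.0, 50.0, None, 95.0]
filled = impute_mean(charges)
bins = equal_width_bins(filled, 3)
print(filled)   # [20.0, 50.0, 35.0, 50.0, 50.0, 95.0]
print(bins)     # [0, 1, 0, 1, 1, 2]
```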

 

After building the models, I combine the models into one ensemble model with ease, and compare their performance on the validation sample. I determine that the gradient boosting model is the most accurate based on the misclassification rate. You can pick from a large number of accuracy criteria, including KS Statistic, AUC, MCR or F1.
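
The misclassification rate itself is simple to compute. This Python sketch compares three sets of predictions against the same actual outcomes and picks the lowest rate; the labels and predictions are invented and are not results from my churn models.

```python
def misclassification_rate(actual, predicted):
    """Share of observations whose predicted label differs from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Invented validation labels (1 = churned) and invented model predictions.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predictions = {
    "regression":        [1, 0, 1, 0, 0, 1, 0, 1],
    "neural_network":    [1, 0, 0, 0, 0, 1, 1, 0],
    "gradient_boosting": [1, 0, 0, 1, 0, 1, 1, 0],
}

rates = {name: misclassification_rate(actual, pred)
         for name, pred in predictions.items()}
champion = min(rates, key=rates.get)
print(champion, rates[champion])   # gradient_boosting 0.125
```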

 

After having identified the best model, I publish the model to Hadoop. This allows me to perform future scoring at the data source, meaning data does not have to leave Hadoop. I could have configured the system to publish the model directly from SAS Model Studio; however, I publish and score the model via SAS code for maximum flexibility. With SAS Studio, I can easily control, and change, where I write my resulting models and datasets in Hadoop.

In the “Compare Models” tab, I then download the score code, which provides me with the following:

  1. A .sas file containing DS2 code that performs all the data preprocessing steps, such as binning and imputation, in the pipeline above. I load this .sas file into a location that can be viewed in SAS Studio.
  2. A .sashdat file in the “Models” Caslib that is a binary representation of the model, called an ASTORE, which is used to score the model in Hadoop.

 

When I open the dm_epscore.sas file in SAS Studio, the comments at the top tell me which ASTORE file is needed to publish the model.

This scoring file allows the data preparation within the pipeline above to be published to Hadoop as well. In this case, the file bins the variables before the Gradient Boosting model is built.

The scoring file then invokes the ASTORE file needed to score the model in Hadoop.

Now, I switch to SAS Studio to publish and score my model in Hadoop.  The full code can be found here.

Below is the syntax to publish the model.  I make sure to set the classpath variable to the appropriate jar and config files for my Hadoop cluster. Note that you will need permission to read and write from the modeldir directory. Publishing the model converts the .sas scoring code and the ASTORE file into MapReduce code for execution in the cluster.

This publishing code will create a directory called “telco_churn” in my home directory in HDFS, /user/ankram. In a SAS Viya environment co-located with Hadoop, the “CASUSERHDFS” Caslib points to this location by default, allowing me to confirm that the “telco_churn” directory was successfully published.

The next step is to score the model in Hadoop. The code below scores the “looking_glass_v4” table in Hive and creates a new table called “looking_glass_v4_scored”, without the data ever leaving Hadoop.

If everything is configured properly, the log should show that the SAS Embedded Process executed correctly.

 

Using a previously set up Caslib called “Hivelib” that points to the default schema in the Hive Server, I can now load the “Looking_Glass_v4_scored” dataset into CAS to view the table.

Using the Table Viewer, I can then see the predicted probabilities of churn for each individual.
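
The load step described above might look like the following sketch. It assumes an active SAS Viya environment, the “Hivelib” Caslib from the text, and write access to the CASUSER Caslib; the session name mysess and the libref mycas are hypothetical names chosen for illustration, not taken from the original code.

```sas
/* Start a CAS session (the session name is an assumption) */
cas mysess;

/* Load the scored Hive table into CAS through the Hivelib Caslib */
proc casutil;
   load casdata="looking_glass_v4_scored" incaslib="Hivelib"
        outcaslib="casuser" casout="looking_glass_v4_scored" replace;
quit;

/* Assign a libref to the CASUSER Caslib and print a few scored rows */
libname mycas cas caslib=casuser;
proc print data=mycas.looking_glass_v4_scored(obs=5);
run;
```

The PROC PRINT step is only a code-based alternative to the Table Viewer for a quick look at the predicted probabilities.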

 

To conclude, many organizations have very large datasets, often terabytes or larger, and find that minimizing data movement is critical to successfully putting models into production. The in-database technologies for Hadoop on SAS Viya allow you and your fellow data scientists to easily prepare data and score large-scale models entirely in Hadoop, with the data never leaving the environment. You can now focus on solving more problems and are no longer at the mercy of large datasets and network latency.

Publishing and running models to Hadoop in SAS Viya was published on SAS Users.

September 5, 2018
 

Typically, when filters are applied in SAS Visual Analytics, they affect all the records and aggregations in linked objects. For example, in the typical sales report below, applying filters changes all the measures of the linked objects.

With this kind of filtering, it becomes difficult to calculate measures that require a different level of aggregation. In the above image, the expectation is that ‘Total Customers’ should not change, irrespective of the ‘Region’, ‘State’, ‘Category’ and ‘Subcategory’ control selections. ‘Total Customers (Geo)’ should change only based on the ‘Region’ and ‘State’ control selections. ‘Total Customers (Geo and Prod)’ should change based on all the controls mentioned above. In the above example, only the ‘Total Customers (Geo and Prod)’ calculation is correct.

We will learn to create measures with different levels of aggregation by using the ‘Customer Penetration’ measure as an example.

          Customer Penetration = Distinct customers at selected geography and product level / Distinct customers at selected geography level

Selective filtering may be used to create similar reports, such as Dealer Participation and Sales Contribution. The section below walks through the creation of a customer penetration report with selective filtering.

Customer penetration using SAS Visual Analytics 8.2 (selective filtering)

Customer penetration is used to analyze whether marketing and sales strategies are working or not. Managers often use customer penetration or dealer participation measures along with other measures to gauge the popularity of a product, category or brand.

The report requirement is that the numerator in the ‘Customer Penetration’ formula should be filtered based on the region, state, category and subcategory list control selections, while the denominator should be filtered based only on the region and state list control selections. This is not the same requirement as filtering the whole table through common list controls. In general, if you link a table with any control, all the measures in that table are filtered according to the value(s) selected in the controls. That is not what we need here. Instead of linking controls and tables, we will use control parameters to achieve our objective.

Assume we have a customer transaction table with the following variables:

Before we move on, prepare the basic report as shown in the image below:

 

Once you are ready with the report as per the above image, create parameters for ‘Region’, ‘State’, ‘Category’, ‘SubCategory’:

Region Parameter


 

State Parameter

 


Category Parameter

 


SubCategory Parameter

 

Now create the following two calculated items derived from ‘Customer_ID’:

Geo_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography levels; all other rows are set to missing.

Geo_and_Prod_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography and product levels; all other rows are set to missing.

 

Create the following two aggregated measures:

Total Customers (Geo)
Subtract 1 from the distinct count, because the missing value of ‘Geo_Customer_ID’ is itself counted as one distinct value.

 

Total Customers (Geo and Prod)
Subtract 1 from the distinct count, because the missing value of ‘Geo_and_Prod_Customer_ID’ is itself counted as one distinct value.

 

Now you can create an aggregated measure ‘Customer Penetration’.

Customer Penetration = Total Customers (Geo and Prod) / Total Customers (Geo)
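
To make the arithmetic concrete, here is a minimal sketch of the same calculation in SAS code rather than in SAS Visual Analytics. The table, its values, and the macro variables region_sel and category_sel (standing in for the control parameters) are all hypothetical. Note one difference from the report: COUNT(DISTINCT ...) in PROC SQL ignores missing values, so no subtraction of 1 is needed in this sketch.

```sas
/* Hypothetical transactions; names and values are illustrative only */
data work.transactions;
   length Customer_ID $4 Region $4 State $2 Category $10;
   input Customer_ID $ Region $ State $ Category $;
   datalines;
C001 East NY Furniture
C002 East NY Technology
C003 East NJ Furniture
C004 West CA Furniture
C001 East NY Technology
;
run;

/* Macro variables stand in for the list-control parameters */
%let region_sel=East;
%let category_sel=Furniture;

/* Keep Customer_ID only where the row matches the selection, else blank */
proc sql;
   create table work.penetration as
   select count(distinct case when Region="&region_sel"
                              then Customer_ID else ' ' end)
             as Total_Customers_Geo,
          count(distinct case when Region="&region_sel"
                               and Category="&category_sel"
                              then Customer_ID else ' ' end)
             as Total_Customers_Geo_Prod
   from work.transactions;
quit;

/* Ratio of the two selectively filtered distinct counts */
data work.penetration;
   set work.penetration;
   Customer_Penetration = Total_Customers_Geo_Prod / Total_Customers_Geo;
   format Customer_Penetration percent8.1;
   put Total_Customers_Geo= Total_Customers_Geo_Prod= Customer_Penetration=;
run;
```

With this sample data, three distinct customers fall in the East region and two of them bought Furniture, so the penetration is two thirds.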

 

The final report will look like this:

Comparative images with default and selective filtering implementation:

 

If you compare the above images, you will find the difference in the highlighted measures: in the first image, the aggregation level is based on selective filtering, while in the second image the aggregation level is uniform.

Note – ‘Total Customers’ is the count of distinct ‘Customer_ID’ values, i.e., the total customer count is independent of the geography and product hierarchy selections.

Conclusion

This process allows you to use control parameters in ‘If Then Else…’ statements to create a variable (calculated item) having character values. You can utilize this feature in several other applications – this is just one way you can use parameters to fulfil a business requirement.

Selective filtering in SAS Visual Analytics 8.2 was published on SAS Users.

August 30, 2018
 

A combination of SAS Grid Manager and SAS Viya can change the game for IT leaders looking to take on peak computing demands without sacrificing reliability or driving higher costs.  Maybe that’s why we fielded so many questions about SAS Grid Manager and SAS Viya in our recent webinar  about how the two can work together to process massive volumes of data – fast.

Participants asked us so many great questions that we wanted to share the answers here, assuming that you may have the same questions.  This is the first of two blog posts focusing on some of the very best questions we received.  Stay tuned for more soon – and if you don’t see your own burning questions posed here, just post your question in the comments and we’ll respond.

1. Do SAS Grid Manager and SAS Viya need to be collocated in the same data center?

And if they’re in different data centers, can SAS Grid Manager and SAS Viya communicate with one another? Functionally, this shouldn’t be a concern: if there’s network connectivity between the two data centers, they can communicate. Just be mindful of a few things:

  • The size of data being processed in each environment.
  • How much data is going back and forth between the two.
  • The impact all that data movement can have on response times and overall performance.

In addition to performance, there's a greater sensitivity around data handling in physically separated deployments. The emergence of stricter data protection regulations increases the complexity of compliance when moving data between locations with different legal jurisdictions. It will be important to consider the additional performance implications of encryption of the transferred data as well. Ultimately, having compute as close as possible to the data it needs results in less complexity and better performance.

In this case it is important to remember the old saying “just because you can doesn’t mean that you should.” When SAS Grid Manager and SAS Viya are collocated, they can share the same data. To be clear, sharing the same data (for example, data sets) has implications. For instance, the data cannot be open by both a process running on the grid and the SAS Viya analytics server at the same time. If your business processes can accommodate this requirement, then sharing the same physical copies of data in storage may save your organization money as well as ease compliance efforts. I also cannot sufficiently stress the need to complete a proof of concept with production data volumes and job complexity to compare the performance of hosting SAS Grid Manager and SAS Viya in the same data center with that of hosting them in geographically separated data centers.

2. Should SAS Viya and SAS Grid Manager 9.4M5 run on the same OS for integration purposes?

From an integration perspective, they can definitely run on different operating systems. In fact, as of SAS 9.4M5, you can access a CAS (Cloud Analytic Services) server from Solaris, AIX, or 64-bit Windows. The CAS server itself will run on Linux – and soon we’ll roll out SAS Viya for Windows server.

Short version: Crossing operating systems does not present a functional problem.  Those of you who have used SAS in heterogeneous environments know that there are performance implications when processing data that is not native to the running session.  You should carefully consider the performance implications of deploying in a heterogeneous topology before committing to a mixed environment.

3. How do SAS Viya and SAS Grid Manager compare in terms of complexity – particularly in the context of platform administration?

They’re actually very comparable.  Some level of detail varies but in many cases the underlying concepts and effort required are similar – especially when you reach the level of multi-node administration, keeping multiple hosts patched, and those sorts of issues.

4. SAS Grid Manager requires high-performance storage. Do we need to have that same level of storage (such as IBM General Parallel File System) for SAS Viya? 

No – SAS Viya relies on Cloud Analytic Services (CAS), so it doesn’t have the same storage requirements as SAS Grid Manager. It’s more like what you’d find in a Hadoop environment – the CAS reference architecture is mainly a collection of nodes with local storage. This allows SAS Viya to perform memory-mapping to disk, so that jobs needing more resources than the total RAM can use disk cache to continue running. SAS Viya can ingest data serially or in parallel, so for customers that have the ability to use cost-efficient distributed file systems, movement of data into CAS can be done in parallel.

Note, the shared file system that is part of an existing SAS Grid Manager environment could be further leveraged as a means to share data between the SAS Grid and SAS Viya environments.

SAS’ recommended IO throughput for SAS Grid Manager deployments is based upon years of experience with customers who have been unsatisfied with the performance of their chosen storage. The resulting best practice is one that minimizes performance complaints and allows customers to process very large data in the timeliest manner. If SAS Viya is deployed with a multi-node analytics server (MPP mode) then a shared file system is required. The SAS Viya Cloud Analytic Server (CAS) has been designed with Network File System (NFS) in mind.

Our customers get the best value from environments built with a blend of storage solutions, including both shared file systems for job/user/application concurrency and less expensive distributed storage for workloads that may not require concurrency, like large machine learning and AI training problems. The latter is where SAS Viya shines.

These were all great questions that we thought deserved more detail than we could offer in a webinar – and there are more!  Soon we’ll post a second set of questions that you can use to inform your work with SAS Grid Manager and SAS Viya.  In the meantime, feel free to post any further questions in the comment section of this post.  We’ll answer them quickly.

4 FAQs about SAS Grid Manager and SAS Viya was published on SAS Users.

August 24, 2018
 

I know a lot of you have been programming in SAS for a long time, which is awesome! However, when you do something for a long time, sometimes you get set in your ways and you miss out on new ways of doing things.

Although the COUNT and CAT functions have been around for a while now, I see a lot of customer code that is counting and concatenating text strings the "old-fashioned" way. In this article, I would like to introduce you to the COUNT, COUNTW, CATS and CATX functions. These functions make certain tasks much simpler, like counting words in a string and concatenating text together.

Counting words or text occurrences

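First let's take a look at the COUNT and COUNTW functions.
The COUNT function counts the number of times that a specified substring appears within a character string. For example: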

Data a;
  Contributors='The Big Company INC, The Little Company, ACME Incorporated,    Big Data Co, Donut Inc.';
  Num=count(contributors,'inc','i');  /* the 'i' modifier means to ignore case*/
  Put num=;
Run;

When we examine the SAS log, we can see that NUM has a value of 3.

Num=3
NOTE: The data set WORK.A has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

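The COUNTW function counts the number of words in a character string. Without it, code often loops with the SCAN function until no more words are found, like this: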

/* DON'T USE - use COUNTW instead */
data a(drop=done i);
  x='a#b#c#d#e';
  do until(done);
    i+1;
    y=scan(x,i,'#');
    if y='' then done=1;
    else output;
  end;
  run;

I realize this code isn't terrible, but I try to avoid DO UNTIL/WHILE loops if I can. There is always the possibility of going into an infinite loop.
The COUNTW function eliminates the need for a DO UNTIL/WHILE loop.
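
As a minimal sketch of that point, the loop above could be rewritten with COUNTW supplying the upper bound of an iterative DO loop. It reuses the same sample string and delimiter; the rewrite itself is an illustration rather than code from the original post.

```sas
data a(drop=i);
  x='a#b#c#d#e';
  /* COUNTW returns the number of '#'-delimited words, so the loop bound is known up front */
  do i=1 to countw(x,'#');
    y=scan(x,i,'#');   /* extract the i-th word */
    output;
  end;
run;
```

Because the number of iterations is fixed before the loop starts, there is no DONE flag to maintain and no risk of looping forever.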

Here is an example of logic that I use all the time. In this example, I have a macro variable that contains a list of values that I want to loop through. I can use the COUNTW function to easily loop through each file listed in the resolved value of &FILE_NAMES. The code then uses the file name on the DATA statement and the INFILE statement.

%let file_names=01JAN2018.csv 01FEB2018.csv 01MAR2018.csv 01APR2018.csv;
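%macro test(files);
  /* Keep the loop counter and file name local to the macro */
  %local i file;
  /* Use the FILES parameter rather than the global macro variable */
  %do i=1 %to %sysfunc(countw(&files,%str( )));
    %let file=%scan(&files,&i,%str( ));
    data _%scan(&file,1,.);
      infile "c:\my files\&file";
      input region $ manager $ sales;
    run;
  %end;
%mend;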
%test(&file_names)

The log is too large to list here, but you can see one of the generated DATA steps in the MPRINT output of this snapshot of the log.

MPRINT(TEST):   data _01JAN2018;
MPRINT(TEST):   infile "c:\my files\01JAN2018.csv";
MPRINT(TEST):   input region $ manager $ sales;
MPRINT(TEST):   run;

This data step will be generated for each file listed.

Counting strings within another text string should be easy to do. The COUNT functions definitely make this a reality!

Concatenating strings in SAS

Now that we know how to COUNT text in SAS, let me show you how to CAT in SAS with the CATS and CATX functions.

Back in the old days, I had hair(!) and we concatenated text strings using double pipes syntax.

  X=var1||var2||var3;

This syntax is not too bad, but what if VARn has trailing blanks? Prior to SAS Version 9 you had to remove the trailing blanks from each value. Also, if the text was right justified, you had to left justify the text. This complicates the syntax:

X=trim(left(var1))||trim(left(var2))||trim(left(var3));

You can now accomplish the same thing using CATS. The CATS function removes leading and trailing blanks and concatenates the strings:

data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=cats(var1,var2,var3);
  put x=;
run;

VAR1-VAR3 have a length of 12, which means each value contains trailing blanks. By using the CATS function, all trailing blanks are removed and the text is concatenated without any spaces between the text. Here is the result of the above PUT statement.

x=abc123xyz

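Another common need when concatenating text together is to create a delimited string. This can now be done using the CATX function. The CATX function removes leading and trailing blanks, inserts a delimiter between the values, and concatenates them: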

data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=catx(',',var1,var2,var3);
  put x=;
run;

This syntax creates a comma separated list with all leading and trailing blanks removed. Here is the result of the PUT statement.

x=abc,123,xyz

Just as the COUNT functions make counting text in SAS easier, the CAT functions make concatenating strings so much easier. For more explanation and examples of these CAT* functions, see this paper by Louise Hadden, Purrfectly Fabulous Feline Functions (because they are CAT functions, get it?).

Before I let you go, let me point out that in addition to the COUNT, COUNTW, CATS and CATX functions, there are also the COUNTC, CAT, CATQ and CATT functions that provide even more functionality. These functions are not used as often, so I haven't discussed them here.

How to COUNT CATs in SAS was published on SAS Users.

August 24, 2018
 

Dynamic programming is a powerful technique for implementing algorithms, and is often used to solve complex computational problems. Some applications are world-changing, such as aligning DNA sequences; others are more "everyday," such as spelling correction. If you search for "dynamic programming," you will find lots of materials, including sample programs written in other programming languages such as Java, C, and Python, but there isn't any SAS sample program. SAS is a powerful language, and of course SAS can do it! This article shows you how to write your own dynamic programming function, using an edit distance function in the spirit of the SPEDIS function as the example. Its purpose is to demonstrate how you can implement such a function in SAS with the dynamic programming method.

What is edit distance?

Edit distance is a metric used to measure dissimilarity between two strings, by matching one string to the other through insertions, deletions, or substitutions.

What is minimum edit distance?

Minimum edit distance measures the dissimilarity between two strings through the least number of edit operations.
Take the two strings "BAD" and "BED" as an example. These two words have multiple match possibilities. One solution is an edit distance of 1, resulting from one substitution of the letter "A" with "E".

Another solution is to delete the letter "A" from "BAD" and then insert the letter "E" between "B" and "D", which results in an edit distance of 2.

There are several variants of edit distance, depending on the cost of edit operation. For example, given a string pair "BAD" and "BED", if each operation has cost of 1, then its edit distance is 1; if we set substitution cost as 2 (Levenshtein edit distance), then its edit distance is 2.

What is dynamic programming?

The standard method used to solve minimum edit distance uses a dynamic programming algorithm.

Dynamic programming is a method used to solve a complex problem by breaking it into simpler sub-problems and solving these recursively. Partial solutions are saved in a big table, so they can be quickly accessed for successive calculations while avoiding repetitive work. Through this process of building on each preceding result, we eventually solve the original, challenging problem efficiently. Many difficult problems can be solved using this method.

Here's the algorithm that solves Levenshtein edit distance through dynamic programming:
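
In recurrence form, and consistent with the program in the appendix (insertions and deletions cost 1, substitutions cost 2), the algorithm can be written as follows. The notation is introduced here for illustration: x and y are the two strings with lengths m and n, and D(i,j) is the edit distance between the first i characters of x and the first j characters of y.

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j, \\
D(i,j) &= \min
\begin{cases}
D(i-1,\,j) + 1 & \text{(deletion)} \\
D(i,\,j-1) + 1 & \text{(insertion)} \\
D(i-1,\,j-1) + \delta(x_i, y_j) & \text{(substitution or match)}
\end{cases} \\
\delta(x_i, y_j) &=
\begin{cases}
0 & \text{if } x_i = y_j \\
2 & \text{otherwise}
\end{cases}
\end{aligned}
```

The minimum edit distance between the two full strings is then D(m,n).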

The following image shows the annotated SAS program that implements the algorithm. The complete code (which you can copy and test for yourself) is at the end of this article.
[Image: SPEDIS algorithm]

To demonstrate the usage of the edit distance function and validate the intermediate edit distance matrix table, I used the string pair "INTENTION" and "EXECUTION", which I copied from Stanford's class material, as an example. (This same resource also shows how the technique applies to DNA sequence alignment.)

options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

The Levenshtein edit distance between "INTENTION" and "EXECUTION" is 8.

The edit distance table is as follows.

N 8 9 10 11 12 11 10 9 8
O 7 8 9 10 11 10 9 8 9
I 6 7 8 9 10 9 8 9 10
T 5 6 7 8 9 8 9 10 11
N 4 5 6 7 8 9 10 11 10
E 3 4 5 6 7 8 9 10 9
T 4 5 6 7 8 7 8 9 8
N 3 4 5 6 7 8 7 8 7
I 2 3 4 5 6 7 6 7 8
  E X E C U T I O N
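
The same table can also be filled bottom-up, without recursion. The following DATA step is a minimal sketch of that variant, not the article's FCMP implementation: it assumes strings of at most 20 characters (matching the $20. informat used above) and the same costs of 1 for insertion or deletion and 2 for substitution. The array name d and the fixed bounds are choices made for this illustration.

```sas
data _null_;
   length query keyword $20;
   query='INTENTION';
   keyword='EXECUTION';
   m=length(query);
   n=length(keyword);

   /* d[i,j] = edit distance between the first i characters of QUERY
      and the first j characters of KEYWORD (bounds are an assumption) */
   array d[0:20,0:20] _temporary_;

   /* Base cases: transforming to or from an empty prefix */
   do i=0 to m; d[i,0]=i; end;
   do j=0 to n; d[0,j]=j; end;

   /* Fill the table row by row from the three neighboring cells */
   do i=1 to m;
      do j=1 to n;
         if substr(query,i,1)=substr(keyword,j,1) then delta=0;
         else delta=2;
         d[i,j]=min(d[i-1,j-1]+delta, d[i-1,j]+1, d[i,j-1]+1);
      end;
   end;

   dist=d[m,n];
   put dist=;
run;
```

For this string pair, the value written to the log should agree with the distance of 8 reported above.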

 

More dynamic programming applications

Dynamic programming is a powerful technique, and it can be used to solve many complex computation problems. Anna Di and I are presenting a paper at the PharmaSUG 2018 China conference to demonstrate how to align DNA sequences with SAS FCMP and SAS Viya. If you are interested in this topic, please look for our paper after the conference proceedings are published.

Appendix: Complete SAS program for editDistance function

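proc fcmp outlib=work.funcs.spedis;
/* Wrapper: allocates the memo table, calls the recursive solver, prints the table */
function editDistance(query $, keyword $);
   array distance[1,1]/nosymbols;    
   m = length(query);
   n = length(keyword);   
   /* Resize the table at run time to fit the two strings */
   call dynamic_array(distance, m+1, n+1); 
   /* -1 marks a cell that has not been computed yet */
   do i=1 to m+1;
      do j=1 to n+1;
         distance[i, j]=-1;  
      end;
   end;
 
   i_max=m;
   j_max=n;
 
   dist = edDistRecursive(query, keyword, distance, m, n);
 
   /* Write the table to the log, last row of QUERY first */
   do i=i_max to 1 by -1;
      do j=1 to j_max;
         put distance[i, j] best3. @;  
      end;
      put;
   end;
 
   return (dist);
endsub; 
 
/* Recursive solver with memoization in DISTANCE */
function edDistRecursive(query $, keyword $, distance[*,*], m, n);
   outargs distance;
 
   /* Base cases: one prefix is empty */
   if m = 0 then
      return (n);
   if n = 0 then
      return (m);
 
   /* Reuse a previously computed result */
   if distance[m,n] >= 0 then
      return (distance[m,n]);
   /* Substitution costs 2 (Levenshtein), a match costs 0 */
   if (substr(query,m,1) = substr(keyword,n,1)) then
      delta = 0;
   else
      delta = 2;
 
   /* Minimum over substitution/match, deletion, and insertion */
   ans = min(edDistRecursive(query, keyword, distance, m - 1, n - 1) + delta, 
             edDistRecursive(query, keyword, distance, m - 1, n) + 1, 
             edDistRecursive(query, keyword, distance, m, n - 1) + 1);
   distance[m,n] = ans;
   return (distance[m,n]);   
endsub;
run;
quit; 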
 
options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

Dynamic programming with SAS FCMP was published on SAS Users.