
October 10, 2018

Deep learning (DL) is a subset of neural networks, which have been around since the 1960s. Computing resources and the need for a lot of training data were the crippling factors for neural networks. But with the growing availability of computing resources such as multi-core machines, graphics processing unit (GPU) accelerators, and specialized hardware, DL is becoming much more practical for business problems.

Financial institutions use a large number of computations to evaluate portfolios, price securities, and financial derivatives. For example, every cell in a spreadsheet potentially implements a different formula. Time is also usually of the essence so having the fastest possible technology to perform financial calculations with acceptable accuracy is paramount.

In this blog, we talk to Henry Bequet, Director of High-Performance Computing and Machine Learning in the Finance Risk division of SAS, about how he uses DL as a technology to maximize performance.

Henry discusses how the performance of numerical applications can be greatly improved by using DL. Once a DL network is trained to compute analytics, using that DL network becomes drastically faster than more classic methodologies like Monte Carlo simulations.

We asked him to explain deep learning for numerical analysis (DL4NA) and the most common questions he gets asked.

Can you describe the deep learning methodology proposed in DL4NA?

Yes, it starts with writing your analytics in a transparent and scalable way. All content that is released as a solution by the SAS financial risk division uses the "many task computing" (MTC) paradigm. Simply put, when writing your analytics using the many task computing paradigm, you organize code in SAS programs that define task inputs and outputs. A job flow is a set of tasks that will run in parallel, and the job flow will also handle synchronization.

Fig 1.1 A Sequential Job Flow

The job flow in Figure 1.1 visually gives you a hint that the two tasks can be executed in parallel. The addition of the task into the job flow is what defines the potential parallelism, not the task itself. The task designer or implementer doesn’t need to know that the task is being executed at the same time as other tasks. It is not uncommon to have hundreds of tasks in a job flow.

Fig 1.2 A Complex Job Flow

Using that information, the SAS platform and the Infrastructure for Risk Management (IRM) are able to automatically infer the parallelization in your analytics. This allows your analytics to run on tens or hundreds of cores. (Most SAS customers run out of cores before they run out of tasks to run in parallel.) By running SAS code in parallel, on a single machine or on a grid, you gain orders-of-magnitude performance improvements.

This methodology also has the benefit of expressing your analytics in the form of Y= f(x), which is precisely what you feed a deep neural network (DNN) to learn. That organization of your analytics allows you to train a DNN to reproduce the results of your analytics originally written in SAS. Once you have the trained DNN, you can use it to score tremendously faster than the original SAS code. You can also use your DNN to push your analytics to the edge. I believe that this is a powerful methodology that offers a wide spectrum of applicability. It is also a good example of deep learning helping data scientists build better and faster models.
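As a rough illustration of that idea, the sketch below assumes you have already generated a CAS table of training pairs (hypothetical inputs x1-x5 and the result y computed by your original SAS analytics, for example a Monte Carlo pricer) and uses PROC NNET from SAS Visual Data Mining and Machine Learning as one way to fit a network to those pairs. The table name, variable names, and layer sizes are placeholders, not the specific method from Henry's book.

proc nnet data=mycas.training_pairs;
   architecture mlp;                 /* multilayer perceptron */
   input x1-x5 / level=int;          /* inputs of the original analytics */
   target y / level=int;             /* result computed by the original code */
   hidden 50;                        /* two hidden layers of 50 neurons each */
   hidden 50;
   train outmodel=mycas.dnn_model seed=12345;   /* persist the network for scoring */
run;

Once trained, scoring new inputs with the stored network is typically much faster than rerunning the original simulation.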

Fig 1.3 Example of a DNN with four layers: two visible layers and two hidden layers.

The number of neurons of the input layer is driven by the number of features. The number of neurons of the output layer is driven by the number of classes that we want to recognize, in this case, three. The number of neurons in the hidden layers as well as the number of hidden layers is up to us: those two parameters are model hyper-parameters.

How do I run my SAS program faster using deep learning?

In the financial risk division, I work with banks and insurance companies all over the world that are faced with increasing regulatory requirements like CCAR and IFRS17. Those problems are particularly challenging because they involve big data and big compute.

The good news is that new hardware architectures are emerging with the rise of hybrid computing. Computers are increasingly built as a combination of traditional CPUs and innovative devices like GPUs, TPUs, FPGAs, and ASICs. Those hybrid machines can run significantly faster than legacy computers.

The bad news is that hybrid computers are hard to program, and each of them is specific: if you write code for a GPU, it won't run on an FPGA, and it won't even run on different generations of the same device. Consequently, software developers and software vendors are reluctant to jump into the fray, and data scientists and statisticians are left out of the performance gains. So there is a gap, a big gap in fact.

To fill that gap is the raison d’être of my new book, Deep Learning for Numerical Applications with SAS. Check it out and visit the SAS Risk Management Community to share your thoughts and concerns on this cross-industry topic.

Deep learning for numerical analysis explained was published on SAS Users.

October 8, 2018

When was the last time you or your colleagues wanted access to data and tools to produce reports and dashboards for a business need? Probably within the last hour. Self-service BI applications – gaining popularity as we speak – make gaining insights and decision making faster. But they've also generated a greater need for governance.

Part of governance is understanding the data lifecycle or data lineage. For example, say a co-worker performed some modifications to a dataset and used it to produce a report that you would like to use to help solve a business need. How can you be sure that the information in this report is accurate? How did the producer of the report calculate certain measures? On what original data set was the report based?

SAS provides many tools to help govern platforms and solutions.  Let’s look at one of those tools to understand the data lifecycle: SAS Lineage Viewer.

Here we have a report created to explore and visualize telecommunications data using SAS Visual Analytics.  The report shows our variable of interest, cross-sell and up-sell flag, and its relationship to other variables. This report will be used to target customers for cross-sell or up-sell.

This report is based on an Analytical Base Table (ABT) that was created by joining two data sets:

  1. Usage information from a subset of customers who have contacted customer care centers.
  2. Cleansed demographics data.

The name of the joined dataset the report is based on is LG_FINAL_ABT.

To make sure we understand the data behind this report we’ll explore it using a lineage viewer (you will need to be assigned to the "Data Management Business User” or “Data Management: Lineage” group, which an administrator can help you with).  From the applications menu, select Explore Lineage.

 

We’ll click on Search for Subjects and search for the report we were just reviewing: Telecommunications.

I’ll enter Telecommunications in the search field then select the Telecommunications report.

The first thing I see is the LG_Final_ABT CAS table my report is dependent on.

If I click on the + sign on the top right corner of the data, LG_Final_ABT, I can see all the other relationships to that CAS table.  There is a Model Studio project, two Visual Analytics Reports (including the report we looked at), and a data view that are all dependent on the LG_FINAL_ABT CAS Table.  This diagram also shows us that the LG_FINAL_ABT CAS table is dependent on the Public CAS library.  We also see that the LG_FINAL_ABT CAS table was loaded into CAS from the LG_FINAL_ABT.sashdat file.

Let’s explore the LG_FINAL_ABT.sashdat file to see its lineage. Clicking on the + expands the view. In the following diagram, I expanded all the remaining items to see the full data lifecycle.

This image shows us the whole data life cycle. From LG_FINAL_ABT.sashdat we see that it is dependent on the Create Final LG ABT data preparation plan. That plan is dependent on two CAS tables: LG_CUSTOMER and LG_ORIG_ABT. The data lineage viewer shows us that the LG_CUSTOMER table was loaded in from a csv file (lg_customer.csv) and the LG_ORIG_ABT CAS table was loaded in from a SAS data set (lg_orig_abt.sas7bdat).

To dive deeper into the mashups and data manipulations that took place to produce LG_FINAL_ABT.sashdat, we can open the data preparation plan.  To do this I’ll right click on Create Final LG ABT and select Actions then Prepare Data.

Here is the data preparation plan.  At the top you can see that the creator of this data set performed five steps – Gender Analysis, Standardize, Remove, Rename and Join.

To get details into each of these steps, click on the titles at the top.  Clicking on Gender Analysis, I see that a gender analysis was performed based on the customer_name field and the results were added to the data set in a variable named customer_name_GND.

Clicking on the Standardize title, I see that there were two standardization tasks performed on the original data set. One for customer state and the other for customer phone number.  I can also see that the results were placed in new fields (customer_state_STND and customer_primary_phone_STND).

Clicking on the Remove title, I see that three variables were dropped from the final dataset.  These variables were the original ones that the user had “fixed” in the previous steps: customer_gender, customer_state, and customer_primary_phone.

Clicking on the Rename title, I see that the new variables have been renamed.

The last step in the process is a join. Clicking on the Join title I see that LG_CUSTOMER was joined with LG_ORIG_ABT based on an inner join on Customer_ID.

We have just walked through the data lineage or data lifecycle for the dataset LG_FINAL_ABT, using SAS tools. I now understand how the data in the report we were looking at was generated. I am confident that the information that I gain from the report will be accurate.

Since sharing information and data among co-workers has become so common, it's now more crucial than ever to think about the data lifecycle. When you gain access to a report that you did not create, it is always a good idea to check the underlying data to ensure that you understand the data and that any business insights gained are accurate. Also, if you are sharing data with others and you want to make modifications to it, you should always check the lineage to ensure that you won't be undermining someone else's work with changes you make. Thanks to SAS Visual Analytics, all the tools one needs to review data lineage are available within one interface.

Keep track of where data originated with data lineage in SAS was published on SAS Users.

October 5, 2018

In my earlier blog, I described how to create maps in SAS Visual Analytics 8.2 if you have an ESRI shapefile with  granular geographies, such as counties, that you wish to combine into regions. Since posting this blog in January 2018, I received a lot of questions from users on a range of mapping topics, so I thought a more general post on using – and troubleshooting - custom polygons in SAS Visual Analytics on Viya was in order. Since version 8.3 is now generally available, this post is tailored to the 8.3 version of SAS Visual Analytics, but the custom polygon functionality hasn’t really changed between the 8.2 and 8.3 releases.

What are custom polygons?

Custom polygons are geographic boundaries that enable you to visualize data as shaded areas on the map; such maps are sometimes referred to as choropleth maps. For example, say you work for a non-profit organization which is trying to decide where to put a new senior center, so you create a map that shows the population of people over 65 years of age by US census tract. The darker polygons suggest a larger number of seniors, and thus a potentially better location to build a senior center:

SAS Visual Analytics 8.3 includes a few predefined polygonal shapes, including countries and states/provinces. But if you need something more granular, you can upload your own polygonal shapes.

How do I create my own polygonal shapes?

To create a polygonal map, you need two components:

  1. A dataset with a measure variable and a region ID variable. For example, you may have population as a measure, and census tract ID as a region ID. A simple frequency can be used as a measure, too.
  2. A “polygon provider” dataset, which contains the same region ID as above, plus the geographic coordinates of each vertex in each polygon, a segment ID, and a sequence number (a hypothetical example follows this list).
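To make that structure concrete, here is a purely hypothetical polygon provider containing a single square region. The variable names and coordinates are made up for illustration; in practice, the shapefile import described below supplies the real values.

data work.poly_provider;
   input region_id :$11. x y segment seqno;   /* region ID, vertex coordinates, segment ID, sequence number */
   datalines;
37037020101  -79.10  35.79  1  1
37037020101  -79.08  35.79  1  2
37037020101  -79.08  35.77  1  3
37037020101  -79.10  35.77  1  4
;
run;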

So where do I get this mysterious polygon provider? Typically, you will need to search for a shapefile that contains the polygons you need, and do a little bit of data preparation. Shapefile is a geographic data format supported by ESRI. When you download a shapefile and look at it on the file system, you will see that it contains several files. For example, my 2010 Census Tract shapefile includes all these components:

Sometimes you may see other components present as well. Make sure to keep all components together.

To prepare this data for SAS Visual Analytics, you have two options.

Preparing shapefile for SAS Visual Analytics: The long way

One method to prepare the polygon provider is to run PROC MAPIMPORT to convert the shapefile into a SAS dataset, add a sequence ID field and then load into the Cloud Analytic Services (CAS) server in SAS Viya. The sequence ID is mandatory, as it helps SAS Visual Analytics to draw the lines connecting vertices in the correct order.

A colleague recently reached out for help with a map of Census block groups for Chatham County in North Carolina. Let’s look at his example:

The shapefile was downloaded from here. We then ran the following code on my desktop:

libname geo 'C:\...\Data';
 
proc mapimport datafile="C:\...\Data\Chatham_County__2010_Census_Block_Groups.shp"
   out=work.chatham_cbg;
run;
 
data geo.chatham_cbg;
   set work.chatham_cbg;
   seqno=_n_;   /* sequence ID that tells SAS Visual Analytics the drawing order */
run;

We then manually loaded the geo.chatham_cbg dataset into CAS using self-service import in SAS Visual Analytics. Alternatively, you can skip these manual steps and use the %shpimprt macro, which will automatically run PROC MAPIMPORT, create a sequence ID variable, and load the table into CAS. Here's an example:

%shpimprt(shapefilepath=/path/Chatham_County__2010_Census_Block_Groups.shp, id=GEOID, outtable=Chatham_CBG, cashost=my_viya_host.com,   casport=5570, caslib='Public');

For this macro to work, the shapefile must be copied to a location that your SAS Viya server can access, and the code needs to be executed in an environment that has SAS Viya installed. So, it wouldn’t work if I tried to run it on my desktop, which only has SAS 9.4 installed. But it works beautifully if I run it in SAS Studio on my SAS Viya machine.
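If you used the long way above and would rather script the CAS load than use self-service import, a PROC CASUTIL step is one option. This is only a sketch: it assumes an existing CAS session, the GEO libname from the earlier code, and a target caslib named Public, as in this example.

cas mysession;

proc casutil;
   /* load the prepared polygon provider table into the Public caslib */
   load data=geo.chatham_cbg outcaslib="Public" casout="CHATHAM_CBG" promote;
quit;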

Configuring the polygon provider

The next step is to configure the polygon provider inside your report. I provided a detailed description of this in my earlier blog, so here I’ll just summarize the steps:

  • Add your data to the SAS Visual Analytics report, locate the region ID variable, right-click and select New Geography
  • Give it a name and select Custom Polygonal Shapes as geography type
  • Click on the Custom Polygon Provider box and select Define New Polygon Provider
  • Configure your polygon provider by selecting the library, table and ID column. The values in your ID column must match the values of the region ID variable in the dataset you are visualizing. The ID column, however, does not need to have the same name as in the visualization dataset.
  • If necessary, configure advanced options of the polygon provider (more on that in the troubleshooting section of this blog).

If all goes well, you should see a preview of your polygons and a percentage of regions mapped. Click OK to save your geographic item, and feel free to use it in the Geo Map object.

I followed your instructions, but the map is not working. What am I missing?

I observed a few common troubleshooting issues with custom maps, and all are fairly easy to fix. The table below summarizes symptoms and solutions.
 

Symptom: In the Geographic Item preview, 0% of the regions are mapped.
Solution: Check that the values in the region ID variable match between the main dataset and the polygon provider dataset.

Symptom: I successfully created the map, but the colors of the polygons all look the same. I know I have a range of values, but the map doesn’t convey the differences.
Solution: In your main dataset, you probably have missing region ID values or region IDs that don’t exist in the polygon provider dataset. Add a filter to your Geo Map object to exclude region IDs that can’t be mapped.

Symptom: Only a subset of regions is rendered.
Solution: You may have too many points (vertices) in your polygon provider dataset. SAS Visual Analytics can render up to 250,000 points. If you have a large number of polygons represented in a detailed way, you can easily exceed this limit. You have two options, which you can mix and match: (1) filter the map to show fewer polygons, or (2) reduce the level of detail in the polygon provider dataset using PROC GREDUCE (see example here). Also, if you imported data using the %shpimprt macro, it has an option to reduce the dataset.

Symptom: In the Geographic Item preview, the note shows that 100% of the regions are mapped, but the regions don’t render, or the regions are rendered in the wrong location (e.g., in the middle of the ocean) and/or at an incorrect scale.
Solution: This is probably the trickiest issue, and the most likely culprit is an incorrectly specified coordinate space code (EPSG code). The EPSG code corresponds to the type of projection applied to the latitude and longitude in the polygon provider dataset (and the originating shapefile). Projection is a method of displaying points from a sphere (the Earth) on a two-dimensional plane (flat surface). See this tutorial if you want to know more about projections. There are several projection types and numerous flavors of each type. The default EPSG code used in SAS Visual Analytics is EPSG:4326, which corresponds to the unprojected coordinate system. If you open the advanced properties of your polygon provider, you can see the current EPSG code. Finding the correct EPSG code can be tricky, as not all shapefiles have consistent and reliable metadata built in. Here are a couple of things you can try: (1) Open your shapefile as a layer in a mapping application such as ArcMap (licensed by ESRI) or QGIS (open source) and view the properties of the layer. In many cases the EPSG code will appear in the properties. (2) Go to the location of your shapefile and open the .prj file in Notepad. It will show the projection information for your shapefile, although it may look a bit cryptic. Take note of the unit of measure (e.g., feet), datum (e.g., NAD 83), and projection type (e.g., Lambert Conformal Conic). Then, go to https://epsg.io/ and search for your geography. Going back to the example for Chatham county, I searched for North Carolina. If more than one code is listed, select a few codes that seem to match your .prj information the best, then go back to SAS Visual Analytics and change the polygon provider Coordinate Space property. You may have to try a few codes before you find the one that works best.

Symptom: I ruled out a projection issue, and the note in the Geographic Item preview shows that 100% of the regions are mapped, but the regions still don’t render.
Solution: Take a look at your polygon provider preparation code and double-check that the order of observations didn’t accidentally get changed. The order of records may change, for example, if you use a PROC SQL join when you prepare the dataset. If you accidentally changed the order of the records prior to assigning the sequence ID, it can result in an illogical order of points which SAS Visual Analytics will have trouble rendering. Remember, the sequence ID is needed so that SAS Visual Analytics can render the outlines of each polygon correctly.

You can validate the order of records by mapping the polygon provider using PROC GMAP, for example:

proc gmap map=geo.chatham_cbg data=geo.chatham_cbg;
   id geoid;
   choro geoid / nolegend levels=1;
run;

For example, in image #1 below, the records are ordered correctly. In image #2, the order of records is clearly wrong, hence the lines going crisscross.

 

As you can see, custom regional maps in SAS Visual Analytics 8.3 are pretty straightforward to implement. The few "gotchas" I described will help you troubleshoot some of the common issues you may encounter.

P.S. I would like to thank Falko Schulz for his help in reviewing this blog.

Troubleshooting custom polygon maps in SAS Visual Analytics 8.3 was published on SAS Users.

October 4, 2018

You can now conduct a live demo of SAS Visual Analytics on your mobile device to participants who are geographically dispersed by using the Present Screen feature, a tucked-away option in the SAS Visual Analytics app for iPad, iPhone, and Android devices. Let’s say I am looking at a report on my mobile device, and I have questions about a couple of items for my colleagues Joe and Anita, both of whom are located in two different cities. The three of us are able to see the report while I demo it, drawing their attention to specific areas of interest in the report.

Sitting in my office, or from any location where I have a Wi-Fi or cellular connection, I can use the Present Screen feature to do a live shared presentation with Joe and Anita. And I don't have to present just one report. During the live presentation, I can close a report, open a different one, perform interactions, and move around in the app between different reports.

And here’s the real beauty of this feature. Neither Joe nor Anita need to have the SAS Visual Analytics app on a mobile device, or SAS Visual Analytics running on a desktop. The only requirement for participants is that they have a mobile device (could be an iOS, Android, or Windows device), a laptop, or a desktop system.  Internet access via Wi-Fi (or a cellular connection for mobile devices), an email client for receiving an email notification, and a Web browser are necessary. VPN connectivity might be required if the participants' organization requires VPN.

Plus, you can conduct your live presentation for up to 10 people! Before we move on, note that you can also use the iOS feature, AirDrop, on your iOS device to engage participants with the Present Screen feature. This is useful if you’re in a room with a bunch of folks who have iOS devices, and you want to do live sharing.

Ready to try it? Here’s a short checklist of what you need to do a live presentation of SAS Visual Analytics reports.

Requirements for the presenter

  • Use any one of these devices to present your screen for a shared presentation to your participants: iPad, iPhone, Android tablet or smartphone
  • SAS Visual Analytics app installed on your device
  • Connection via Wi-Fi or cellular to the server where the report(s) reside
  • Subscription to the reports that you want to present and share
  • Email client on the iPad or phone
  • VPN connectivity if your organization requires it

Couple of things to note

Say you're presenting to 10 participants via email or AirDrop.  Note that as the presenter, you must have the SAS Visual Analytics app open in your device with the Present Screen feature selected and active for your guests to be able to see your reports. Once you exit the app, the screen presentation session ends and the email or AirDrop invitations are no longer valid.

And a note on participant information. After a participant forwards an email invitation to view the presentation to a colleague, the recipient is required to enter a name and email address before joining your live screen presentation. You, as the presenter, get to see the names of the folks who have joined your screen presentation. Participant names and email addresses simply let the presenter know who has joined the presentation. Neither the participants' names nor their email addresses are validated by the SAS Visual Analytics server.

Supported versions for SAS Visual Analytics reports

You can present and share reports with the Present Screen feature for these versions of SAS Visual Analytics:

7.3, 7.4, 8.1, 8.2, 8.3

How to start the screen presentation

In the Analytics report I'm subscribed to from the SAS Visual Analytics app on my iPad, I choose Present Screen.

Choosing Present Screen in the subscribed report

I am reminded that I can present my screen to a maximum of 10 participants. I click OK.

The app prompts me to send email or choose AirDrop to present my screen to participants - I choose email.

My email client opens, and the app includes instructions along with the link that takes them to my live screen presentation. I enter the email addresses for my participants and send the email.

In the email that Anita receives on her desktop PC, she taps on the link which takes her to her Web browser where my screen presentation is set to start soon.

Anita enters her name and email address, and notes that the presentation has not yet started.

Joe, who has logged on to the presentation from his Android phone, is also presented with the same message on his smartphone.

To begin the screen presentation, I tap on the blinking cursor in the app.

The app reminds me that now my participants can see everything on my iPad screen.

Next, a blue bar at the top of the report indicates that my screen presentation is live and can be seen by Anita and Joe. Now I can begin my report presentation, or exit this report and open a different report to share.

Here is my screen on the iPad Mini where I started my screen presentation:

Anita's screen on her Windows desktop monitor:

Joe's screen on the Android smartphone:

When I am finished with the presentation, I tap on the 'stop' button to end it.

A message displays to indicate that the presentation has ended. Here's an example of that message from Joe's Android smartphone.

Do it live! How to present your screen from the SAS Visual Analytics app was published on SAS Users.

September 26, 2018

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a PhD. researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

Speed: How fast an object is moving, usually expressed as distance over time (example: 10 meters per second).
Velocity: How fast an object is moving and in which direction (example: 10 meters per second, to the east).
Acceleration: The rate of change in the velocity of an object. Since it’s a rate of change, it’s expressed as distance over time (speed), per unit of time. For example, to change speed from 0 to 60 miles per hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s².

 

Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration across one of the axes.  Taken together, you can get a sense for the phone's overall direction.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is along the perpendicular plane passing through the center of the phone.  Depending on the direction of the phone's movement, acceleration values might have a positive or negative value.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer. Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp value in Unix epoch format.  That's the number of milliseconds since midnight on January 1, 1970.  And the next three values are the acceleration measurements for the x, y, and z axes, respectively.  Acceleration is measured in meters-per-second squared, or m/s².  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s² (1g).
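As a hedged illustration of that layout, the DATA step below reads such a file and converts the Unix epoch milliseconds to a SAS datetime. The file name, delimiter, and variable names are assumptions, not the actual app output.

data work.commute;
   infile 'accel.csv' dsd truncover;
   input recno epoch_ms x y z;                       /* counter, timestamp, three axes */
   /* SAS datetimes count seconds from 01JAN1960; Unix epochs count from 01JAN1970 */
   timestamp = epoch_ms/1000 + '01JAN1970:00:00:00'dt;
   format timestamp datetime23.3;
run;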

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)
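Here is a minimal sketch of that SERIES plot, reusing the hypothetical work.commute table and variable names from the step above:

proc sgplot data=work.commute;
   series x=timestamp y=x / legendlabel='x axis';
   series x=timestamp y=y / legendlabel='y axis';
   series x=timestamp y=z / legendlabel='z axis';
   xaxis label='Time';
   yaxis label='Acceleration (m/s^2)';
run;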

commute accelerometer

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

histograms of x y z

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive, my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8 m/s².  With my phone at the vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.

Since the data collected represents a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious. In Sunish's case, his team applied time series techniques to do exactly that; in SAS, PROC TIMESERIES can decompose a series into components and also offers a SPECTRA statement for spectrum analysis.
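Here is a hedged sketch of that kind of analysis with PROC TIMESERIES (SAS/ETS). It assumes the hypothetical work.commute table above, with the timestamp already converted to a SAS datetime; the interval and accumulation choices are illustrative only.

proc timeseries data=work.commute
                outdecomp=work.decomp
                outspectra=work.spectra
                plots=(series decomp);
   id timestamp interval=second accumulate=mean;   /* collapse raw samples to one value per second */
   var y;                                          /* the axis carrying most of the motion */
   decomp;                                         /* trend and irregular components */
   spectra;                                        /* periodogram for spectrum analysis */
run;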

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.

 

Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 25, 2018

If you use SAS Visual Analytics and don’t have the SAS Visual Analytics app, you're missing out on a ton of convenience and interaction you could be having while on-the-go. And even if you don’t have access to SAS Visual Analytics today, you can still download and try the mobile app with some cool sample reports.

Ready to take a quick dive and look at the app?

How to get the app

Download and install the free app to your Apple, Android or Windows device from the app store:

Apple iTunes Store
Google Play
Microsoft Store

When you open the app, you are greeted with an introductory launch screen:

In the introductory launch screen that displays when you first open the iOS or Android app, go to the third screen and tap on Learn how to use the Tray.

You are taken to the SAS Help Center. Watch the short slide show at the help center to understand the special Tray feature in the iOS or Android app or to find out what’s new in the app.

Using the Windows-based app? Here’s what you see:

Sample reports on the SAS Demo Server

In the app, sample charts and reports are instantly made available to you in the Subscriptions view via a connection to the SAS Demo Server. This server hosts a nice variety of reports that you can view on your phone or tablet. Interact with a wide spectrum of sample SAS Visual Analytics reports for different industries.

Subscriptions View in the App With Sample Reports

Tap on Add to view the different folders that contain additional sample reports for you to browse, subscribe, and view.

Additional Sample Reports on the SAS Demo Server

When you select and subscribe to the additional reports that are available on the SAS Demo Server, these reports are downloaded to the Subscriptions view in the mobile app. Just tap on the tile for any report in the Subscriptions view to open it and view the charts, graphs, and their associated data.

Here are a couple of reports as viewed in the Windows 10 app:

Already have SAS Visual Analytics in your organization?

If you view SAS Visual Analytics reports on your laptop or a desktop computer, this app extends your ability to view those same reports on your phone or tablet. If your organization has deployed SAS Visual Analytics, but is not taking advantage of extending report viewing ability to mobile devices, I urge you to consider it.

The app supports SAS Visual Analytics 8.3, 8.2, 7.4, and 7.3. Almost every type of interaction that you have with a SAS Visual Analytics report on your desktop can be done with reports viewed in the app on your phone or tablet!

If you have SAS Visual Analytics deployed in your organization, reach out to the SAS Visual Analytics administrator in your organization and ask them to enable support for mobile devices so that you can start viewing your reports in the app.

To give you a little more guidance, here are some FAQs about the app.

If we have reports in our organization that were created with SAS Visual Analytics, can we view those reports in this app?

Yes. The same reports that you view in your web browser on a desktop can be viewed in the mobile app.

How do I view our organization’s reports in the app?

Access from your mobile device to SAS Visual Analytics reports on your company’s server is granted by your SAS Administrator. Live data access requires either a Wi-Fi or cellular connection, and your company may require VPN access or other company-specific security measures.

Contact your SAS Visual Analytics Administrator to request access from your mobile device to the server hosting your SAS Visual Analytics reports. Your administrator ensures that your mobile device is registered as a valid device in the SAS Environment Manager where mobile device access to your organization’s server is managed.

How do I add a server connection?

When your mobile device is registered for access to the SAS Visual Analytics server, simply create a server connection within the app to your company server and browse for reports.

Here’s a nice slide show with the steps you follow to create a server connection to the SAS Visual Analytics server by entering the complete server name, port number, your username, and password:

Quick primer on the SAS Visual Analytics app was published on SAS Users.

September 21, 2018

SAS Technical Support occasionally receives requests from users who want to insert blank rows into their TABULATE procedure tables. PROC TABULATE syntax does not have any specific options that insert blank rows into the table results.

One way to accomplish this task is explained in SAS Sample 45972, "Add a blank row in PROC TABULATE output in ODS destinations." This sample shows you how to add a blank row between class variables in a PROC TABULATE table.  The sample code creates a new data set variable that you can use in a CLASS statement in the PROC TABULATE step.

In addition to the method that is shown this sample, there are two other methods that you can use to insert a blank row in PROC TABULATE table results. The following sections explain those methods:

Method 1: Adding a blank row between row dimension variables

This section demonstrates how to add a blank row between row dimension variables.  This method expands on the approach that is used in Sample 45972.

In the following example, additional data-set variables are added to the data set in a DATA step. The BLANKROW variable is used as a class variable, and the variables DUMMY1 and DUMMY2 are used as analysis variables in the PROC TABULATE step. This sample code also uses the AGE and SEX variables from the SASHELP.CLASS data set as class variables in PROC TABULATE.

/***  Method 1  ***/
data class;
set sashelp.class;
blankrow='  ';
 
dummy1=1;
dummy2=.;
run;
 
proc tabulate data=class missing;
class age sex blankrow;
var dummy1 dummy2;
table age*dummy1=' '
blankrow=' '*dummy2=' '
 
sex*dummy1=' '
blankrow=' '*dummy2=' '
 
all*dummy1=' ',
 
sum*F=8. pctsum / misstext=' ' row=float;
 
keylabel sum='# of Students' 
pctsum='% of Total'
all='Total';
 
title 'Add a Blank Row by Using New Data Set Variables';
run;

This method produces the result that is shown below.  You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.

Method 2: Adding blank rows with user-defined formats and the PRELOADFMT option in the CLASS statement

The second method creates user-defined formats and uses the PRELOADFMT option in a CLASS statement in PROC TABULATE.  The VALUE statements that use the NOTSORTED option in the PROC FORMAT step establish the desired order of the values in the TABULATE results. Using the formats and specifying the ORDER=DATA option in the CLASS statement and the PRINTMISS option in the TABLE statement keeps the order requested in the PROC FORMAT VALUE statements and displays the blank rows.

/***  Method 2 ***/
proc format;
value $sexf (notsorted)
'F'='F'
'M'='M'
' '=' ';
 
value agef (notsorted)
11='11'
12='12'
13='13'
14='14'
15='15'
16='16'
.='  ';
 
value $sex2f (notsorted default=8)
'F'='F'
'M'='M'
' '='Missing'
'_'=' ';
 
value age2f (notsorted)
11='11'
12='12'
13='13'
14='14'
15='15'
16='16'
.=' .'
99=' ';
 
value mymiss
0=' '
other=[8.2];
 
run;
 
proc tabulate data=sashelp.class missing;
class sex age / preloadfmt order=data;
 
table sex age all, N pctn*F=mymiss.
/ printmiss misstext=' ' style=[cellwidth=2in];
 
format sex $sexf. age agef.;
/* If there are no missing values for the class  */
/* variables, use the formats $SEXF and AGEF.*/
/* With missing values for the class variables,  */ 
/* use the formats $SEX2F and AGE2F.             */
 
keylabel N='# of Students'
PctN='% of Total'
all='Total';
 
title 'Add a Blank Row by Using the PRELOADFMT Option';
run;

This method produces the result that is shown below. You can use ODS HTML, ODS PDF, ODS RTF, or ODS EXCEL statements to display the table.

If you have any questions about these methods, contact SAS Technical Support. From this link, you can search the Technical Support Knowledge Base, visit SAS Communities to ask for assistance, and contact SAS Technical Support directly.

Adding blank rows in TABULATE procedure results was published on SAS Users.

September 19, 2018

Like most people, I believed that the process of diagnosing and treating cancer begins with a biopsy.  If cancer is suspected, a doctor will extract a small tissue sample -- usually a tiny cylindrical "core sample" -- and examine it for cancer cells.  No cancer cells found -- that's good news!  But if cancer cells are present, then you have decisions to make about treatment.

A young woman named Richa Sehgal taught me that it's not so simple.  There aren't just two types of cells (cancerous and non-cancerous).  There are actually several types of cancer cells, and these do not all have the same importance when it comes to effective cancer treatment.  I learned this from Richa during her presentation at Analytics Experience 2018 -- a remarkable talk for several reasons, not the least of which is this: Richa Sehgal is a high school student, just 18 years old.  I'll have to check the record books, but this might make her the youngest-ever presenter at this premier analytics event.

Last year, Richa served as a student intern at the Canary Center at Stanford for Cancer Early Detection.  That's where she learned about the biology of cancer. She was allowed (encouraged!) to attend all lab meetings – and the experience opened her eyes to the challenges of cancer detection.

The importance of cell types and how cancer works

Unlike many technical conference talks that I've attended, Richa did not dive directly into the math or the code that support the techniques she was presenting.  Instead, Richa dedicated the first 25 minutes of her talk to teach the audience how cancer works.  And that primer was essential to help the (standing-room only!) audience to understand the relevance and value of her analytical solution.

What we call "cancer" is actually a collection of different types of cells.  Richa focused on three types: cancer stem cells (CSCs), transient amplifying cells (TACs), and terminally differentiated cells (TDCs).  CSCs are the most rare type within a tumor, making up just a few percent of the total mix of cells.  But because of their self-renewing qualities and their ability to grow all other types of cancer cells, these are very important to treat.  CSCs require targeted therapy -- that is, you can't use the same type of treatment for all cell types.  TACs usually require their own treatment, depending on the stage of the disease and the ability of a patient to tolerate the therapy.  The presence of TACs can activate CSCs to grow more cancer cells, so if you can't eradicate the CSCs (and that's difficult to manage with 100% certainty, as we'll see) then it is important to treat the TACs.  TDCs represent cancer cells that are no longer capable of dividing, and so generally don't require a treatment -- they will die off on their own.

(I know that my explanation here represents a simplistic view of cancer -- but it was enough of a framework to help me to understand the rest of Richa's talk.)

Richa Sehgal presents to a standing-room-only crowd at #AnalyticsX

The inexact science of biopsies

Now that we understand that cancer is made up of a variety cell types, it makes sense to hope that when we extract a biopsy, that we get a sample that represents this cell type variability.  Richa used an example of sampling a chocolate chip cookie.  If you were to use a needle to extract a core sample from a chocolate chip cookie...but didn't manage to extract any portions of the (disappointingly rare) chocolate chips, you might conclude that the cookie was a simple sugar cookie.  And as a result, you might treat that cookie differently.  (If you encountered a raisin instead...well..that might require a different treatment altogether.  Blech.)

But, as Richa told us, we don't yet know enough about the distribution and proximity of the different cell types for different types of cancers.  This makes it difficult to design better biopsies.  Richa is optimistic that it's just a matter of time -- medical science will crack this and we'll one day have good models of cancer makeup.  And when that day comes, Richa has a statistical method to make biopsies better.

Using SAS and Python to model cancer cell clusters

Most high school students wouldn't think to pick up SAS for use in their science fair projects, but Richa has an edge: her uncle works for SAS as a research statistician.  However, you don't need an inside connection to get access to SAS for learning.  In Richa's case, she used SAS University Edition hosted on AWS -- nothing to install, easy to access, and free to use for any learner.

Since she didn't have real data that represent the makeup of a tumor, Richa created simulations of the cancer cells, their different types, and their proximity to each other in a 3D model.  With this data in hand, she could use cluster analysis (PROC CLUSTER with Ward's method and then PROC TREE) to analyze a distance matrix that she computed.  The result shows how closely cancer cells of the same type are positioned to one another.  With that information, it's possible to design a biopsy that captures a highly variable collection of cells.
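Here is a minimal sketch of that clustering step. The dataset and variable names are hypothetical, and unlike Richa's approach it clusters on the simulated coordinates directly rather than on a precomputed distance matrix.

/* work.cells is a hypothetical table of simulated cancer cells with an ID and 3D coordinates */
proc cluster data=work.cells method=ward outtree=work.tree noprint;
   var x y z;          /* spatial proximity drives the clustering */
   id cell_id;
run;

proc tree data=work.tree nclusters=10 out=work.clusters noprint;
   id cell_id;         /* cut the dendrogram into 10 candidate sampling clusters */
run;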

Richa then used the Python package plotly to visualize the 3D model of her cell map.  (I didn't have the heart to tell her that she could accomplish this in SAS with PROC SGPLOT -- some things you just have to learn for yourself.)

A bright future -- for all of us

Clearly, Richa is an extremely accomplished young woman.  When I asked about her college plans for next year, she told me that she has a long list of "stretch schools" that she's looking at.  I'm having a difficult time understanding what constitutes a "stretch" for Richa -- I'm certain that any institution would love to have her.

Richa's accomplishments make me feel optimism for her, but also for the rest of us.  As a father of three daughters, I'm encouraged to see young women enter technical fields and be successful.  SAS is among the elite technology companies that work to close the analytics skills gap by providing free software, education, and mentoring.  Throughout the Analytics Experience 2018 conference, I've heard from many attendees who also saw Richa's talk -- they were similarly impressed and inspired.  Presentations like Richa's deliver on the conference tagline: "Analytics redefines innovation. You redefine the future."

Using machine learning to improve tumor biopsies was published on SAS Users.

September 13, 2018

With the release of SAS Viya 3.4, you can easily build large-scale machine learning models and seamlessly publish and run models to Hadoop, or other external databases such as Teradata, without the data ever leaving the Hadoop environment. In this process, SAS Viya:

1) Converts the model into MapReduce Code.

2) Executes the MapReduce code.

3) Returns a new, scored dataset in Hadoop.

SAS Viya is a new, distributed in-memory product that allows users to easily build predictive models at scale. Using the SAS Model Studio interface, I can build complex models without the need to write large amounts of underlying code.

For this blog post, I'll go through the steps to build my model using a telecommunications dataset to predict customer churn. Under the “Data” tab, I can see all of my variables, assign the proper roles, and view the dataset.

With the data prepared, I build a pipeline to perform data preprocessing steps such as imputation and binning and build several predictive models, including Regression, Neural Networks and Gradient Boosting. Pipelines are powerful because they automate the heavy lifting of the model building process, allowing you to solve problems faster. In addition, pipelines are re-usable across different users and datasets, allowing the adoption of best practices across an organization.

 

After building the models, I combine the models into one ensemble model with ease, and compare their performance on the validation sample. I determine that the gradient boosting model is the most accurate based on the misclassification rate. You can pick from a large number of accuracy criteria, including KS Statistic, AUC, MCR or F1.

 

After having identified the best model, I  publish the model to Hadoop. This allows me to perform future scoring at the data source, meaning data does not have to leave Hadoop. I could have configured the system to publish the model directly from SAS Model Studio; however, I publish and score the model via SAS code for maximum flexibility. With SAS Studio, I can easily control, and change, where I write my resulting models and datasets in Hadoop.

In the “Compare Models” tab, I then download the score code, which provides me with the following:

  1. A .sas file containing DS2 code that performs all the data preprocessing steps, such as binning and imputation, in the pipeline above. I load this .sas file into a location that can be viewed in SAS Studio.
  2. A .sashdat file in the “Models” caslib that is a binary representation of our model, called an ASTORE, which is used to score our model in Hadoop.

 

Opening up the dm_epscore.sas file in SAS Studio, the comments at the top tell me which ASTORE file is needed to publish the model.

This scoring file allows the data preparation within the pipeline above to be published to Hadoop as well. In this case, the file is binning the variables before building the Gradient Boosting.

The scoring file then invokes the ASTORE file needed to score the model in Hadoop.

Now, I switch to SAS Studio to publish and score my model in Hadoop.  The full code can be found here.

Below is the syntax to publish the model.  I'm sure to set the classpath variable to the appropriate jar and config files for my Hadoop cluster. Note that you will need permission to read and write from the modeldir directory. Publishing the model converts the .sas scoring code and the ASTORE file into MapReduce code for execution in the cluster.

This publishing code will create a directory called “telco_churn” in my home directory in HDFS, /user/ankram. In a SAS Viya environment co-located with Hadoop, the “CASUSERHDFS” Caslib is by default pointed to this location, allowing me to ensure the “telco_churn” file was successfully published.

The next step is to score the model in Hadoop. The code below scores the “looking_glass_v4” table in Hive and creates a new table called “looking_glass_v4_scored”, without the data ever leaving Hadoop.

If everything is configured properly, the log should show that the SAS Embedded Process executed correctly.

 

Using a previously set up caslib called “Hivelib” that points to the default schema in the Hive Server, I can now load the “Looking_Glass_v4_scored” dataset into CAS to view the table.
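A hedged sketch of that load using PROC CASUTIL follows; the caslib names reflect the description above, and you would adjust them for your environment.

proc casutil;
   /* read the scored table through the Hivelib caslib and load it into CAS */
   load casdata="looking_glass_v4_scored" incaslib="Hivelib"
        casout="looking_glass_v4_scored" outcaslib="casuser";
quit;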

Using the Table Viewer, I can then see the predicted probabilities of churn for each individual.

 

To conclude, many organizations have very large datasets, often times terabytes or larger, and often find that minimizing data movement is critical to successfully putting models into production. The in-database technologies for Hadoop on SAS Viya allow you and your fellow data scientists to easily prepare data and score large-scale models entirely in Hadoop, with the data never leaving the environment. You can now focus on solving more problems and are no longer at the mercy of large datasets and network latency.

Publishing and running models to Hadoop in SAS Viya was published on SAS Users.

September 5, 2018

Typically, when filters are applied in SAS Visual Analytics it affects all the records and aggregations in linked objects. For example, in a typical sales report below, when filters are applied, it changes all the measures of linked objects.

With this kind of filtering, it becomes difficult to calculate measures which require a different level of aggregation. In the above image, the expectation is that ‘Total Customers’ should not change regardless of the ‘Region’, ‘State’, ‘Category’ and ‘Subcategory’ control selections; ‘Total Customers (Geo)’ should change based only on the ‘Region’ and ‘State’ control selections; and ‘Total Customers (Geo and Prod)’ should change based on all the controls mentioned above. In the above example, only the ‘Total Customers (Geo and Prod)’ calculation is correct.

We will learn to create measures with different levels of aggregation by using ‘Customer Penetration’ measure as an example.

          Customer Penetration = (Distinct customers at the selected geography and product level) / (Distinct customers at the selected geography level)

Selective filtering may be used for creating similar reports like: Dealer Participation, Sales Contribution, etc. The below section exemplifies the creation of a customer penetration report with selective filtering.

Customer penetration using SAS Visual Analytics 8.2 (selective filtering)

Customer penetration is used to analyze whether marketing and sales strategies are working or not. Managers often use customer penetration or dealer participation measures along with other measures to gauge the popularity of a product, category or brand.

This report requirement is such that the numerator in the ‘Customer Penetration’ formula should be filtered based on region and state list control selections, while the denominator should be filtered based on region, state, category and subcategory list control selections. This is not the same requirement as filtering the whole table through common list controls. In general, if you link a table with any control, all the measures in that table will be filtered as per selected value(s) in controls. However, our requirement is not like that. Instead of linking control and tables we will use control parameters to achieve our objective.

Assume we have a customer transaction table that, at a minimum, contains ‘Customer_ID’ along with the geography (‘Region’, ‘State’) and product (‘Category’, ‘SubCategory’) variables used below.

Before we move on, be ready with the basic report as per the image below:

 

Once you are ready with the report as per the above image, create parameters for ‘Region’, ‘State’, ‘Category’, ‘SubCategory’:

Region Parameter


 

State Parameter

 


Category Parameter

 


SubCategory Parameter

 

Now create the following two calculated items derived from ‘Customer_ID’:

Geo_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography levels; the rest are set to missing.

Geo_and_Prod_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography and product levels; the rest are set to missing.

 

Create the following two aggregated measures:

Total Customers (Geo)
You need to subtract the distinct count related to missing ‘Geo_Customer_ID’, which is 1.

 

Total Customers (Geo and Prod)
You need to subtract the distinct count related to missing ‘Geo_and_Prod_Customer_ID’, which is 1.

 

Now you can create an aggregated measure ‘Customer Penetration’.

Customer Penetration = Total Customers (Geo and Prod) / Total Customers (Geo)

 

Final report will look like this:

Comparative images with default and selective filtering implementation:

 

If you compare the above images, you will find the difference in the highlighted measures: in the first image the aggregation level is based on selective filtering, while in the second image the aggregation level is uniform.

Note – ‘Total Customers’ is the count of distinct ‘Customer_ID’ values; i.e., the total customer count is independent of the geography and product hierarchy selections.

Conclusion

This process allows you to use control parameters in ‘If Then Else…’ statements to create a variable (calculated item) having character values. You can utilize this feature in several other applications – this is just one way you can use parameters to fulfil a business requirement.

Selective filtering in SAS Visual Analytics 8.2 was published on SAS Users.