October 2, 2018
 

I often get asked for programming tips. Here, I share three of my favorite tips for beginners. Tip #1: COUNTC and CATS Functions Together The CATS function concatenates all of its arguments after it strips leading and trailing blanks. The COUNTC function counts characters. Together, they can let you operate [...]
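As a minimal sketch of how the two functions work together (hypothetical survey data, not the example from the original tip), you could count how many 'Y' answers appear across a set of variables:

data quiz;
   input Q1 $ Q2 $ Q3 $ Q4 $ Q5 $;
   /* CATS strips blanks and concatenates Q1-Q5; COUNTC then counts the 'Y' characters */
   NumYes = countc(cats(of Q1-Q5), 'Y');
   datalines;
Y N Y Y N
N Y N N N
;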

The post Three of My Favorite Programming Tips appeared first on SAS Learning Post.

October 1, 2018
 

The solar farm at SAS world headquarters is a treasure trove of data. Jessica Peter, Senior User Experience Designer at SAS, had an idea about using that treasure in an art installation to show how data can tell a story. Her idea became a reality when she and others at SAS [...]

What can you learn from an AI art installation? was published on SAS Voices by Lane Whatley

October 1, 2018
 

Programmers on a SAS discussion forum recently asked about the chi-square test for proportions as implemented in PROC FREQ in SAS. One person asked the basic question, "how do I test the null hypothesis that the observed proportions are equal to a set of known proportions?" Another person said that the null hypothesis was rejected for his data, and he wanted to know which categories were "responsible for the rejection." This article answers both questions and points out a potential pitfall when you specify the proportions for a chi-square goodness-of-fit test in PROC FREQ.

The basic idea: The proportion of party affiliations for a group of voters

To make these questions concrete, let's look at some example data. According to a 2016 Pew research study, the party affiliation of registered voters in the US in 2016 was as follows: 33% of voters registered as Democrats, 29% registered as Republicans, 34% were Independents, and 4% registered as some other party. If you have a sample of registered voters, you might want to ask whether the observed proportion of affiliations matches the national averages. The following SAS data step defines the observed frequencies for a hypothetical sample of 300 voters:

data Politics;
length Party $5;
input Party $ Count;
datalines;
Dem   125
Repub  79
Indep  86
Other  10
;

You can use the TESTP= option on the TABLES statement in PROC FREQ to compare the observed proportions with the national averages for US voters. You might assume that the following statements perform the test, but there is a potential pitfall. The following statements contain a subtle error:

proc freq data=Politics;
/* National Pct:    D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04) nocum; /* WARNING: Contains an error! */
weight Count;
run;

If you look carefully at the OneWayFreqs table that is produced, you will see that the test proportions that appear in the fourth column are not the proportions that we intended to specify! The problem is that the order of the categories in the table is alphabetical, whereas the proportions in the TESTP= option correspond to the order in which the categories appear in the data. In an effort to prevent this mistake, the documentation for the TESTP= option warns you to "order the values to match the order in which the corresponding variable levels appear in the one-way frequency table." The order of categories is important in many SAS procedures, so always think about the order! (The ESTIMATE and CONTRAST statements in linear regression procedures are other statements where order is important.)

Specify the test proportions correctly

To specify the correct order, you have two options: (1) list the proportions for the TESTP= option according to the alphabetical order of the categories, or (2) use the ORDER=DATA option on the PROC FREQ statement to tell the procedure to use the order of the categories as they appear in the data. The following statement uses the ORDER=DATA option to specify the proportions:

proc freq data=Politics ORDER=DATA;   /* list proportions in DATA order */
/*                 D=33%, R=29%, I=34%, Other=04% */
tables Party / TestP=(0.33 0.29 0.34 0.04);  /* cellchi2 not available for one-way tables */
weight Count;
ods output OneWayFreqs=FreqOut;
output out=FreqStats N ChiSq;
run;

The analysis is now correctly specified. The chi-square table indicates that the observed proportions are significantly different from the national averages at the α = 0.05 significance level.
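For completeness, here is a sketch of option (1): keep the default (alphabetical) ordering of the character values and list the test proportions in that order (Dem, Indep, Other, Repub). This is my illustration rather than code from the original analysis:

proc freq data=Politics;   /* default order for these values: Dem, Indep, Other, Repub */
/* National Pct:   D=33%, I=34%, Other=04%, R=29% */
tables Party / TestP=(0.33 0.34 0.04 0.29) nocum;
weight Count;
run;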

Which categories are "responsible" for rejecting the null hypothesis?

A SAS programmer posted a similar analysis on a discussion forum and asked whether it was possible to determine which categories were the most different from the specified proportions. The analysis shows that the chi-square test rejects the null hypothesis, but it does not indicate whether only one category differs from its expected value or whether many categories differ.

Interestingly, PROC FREQ supports such an option for two-way tables when the null hypothesis is the independence of the two variables. Recall that the chi-square statistic is a sum of squares, where each cell in the table contributes one squared value to the sum. The CELLCHI2 option on the TABLES statement "displays each table cell’s contribution to the Pearson chi-square statistic.... The cell chi-square is computed as
(frequency − expected)² / expected
where frequency is the table cell frequency (count) and expected is the expected cell frequency" under the null hypothesis.

Although the option is not supported for one-way tables, it is straightforward to use the DATA step to compute each cell's contribution. The previous call to PROC FREQ used the ODS OUTPUT statement to write the OneWayFreqs table to a SAS data set. It also wrote a data set that contains the sample size and the chi-square statistic. You can use these statistics as follows:

/* create macro variables for sample size and chi-square statistic */
data _NULL_;
   set FreqStats;
   call symputx("NumObs", N);         
   call symputx("TotalChiSq", _PCHI_);
run;
 
/* compute the proportion of chi-square statistic that is contributed
   by each cell in the one-way table */
data Chi2;
   set FreqOut;
   ExpectedFreq = &NumObs * TestPercent / 100;
   Deviation = Frequency - ExpectedFreq;
   ChiSqContrib = Deviation**2 / ExpectedFreq;  /* (O - E)^2 / E */
   ChiSqPropor = ChiSqContrib / &TotalChiSq;    /* proportion of chi-square contributed by this cell */
   format ChiSqPropor 5.3;
run;
 
proc print data=Chi2; 
   var Party Frequency TestPercent ExpectedFreq Deviation ChiSqContrib ChiSqPropor; 
run;

The table shows the numbers used to compute the chi-square statistic. For each category of the PARTY variable, the table shows the expected frequencies, the deviations from the expected frequencies, and the chi-square term for each category. The last column is the proportion of the total chi-square statistic for each category. You can see that the 'Dem' category contributes the greatest proportion. The interpretation is that the observed count of the 'Dem' group is much greater than expected and this is the primary reason why the null hypothesis is rejected.

You can also create a bar chart that shows the contributions to the chi-square statistic. You can create the "chi-square contribution plot" by using the following statements:

title "Proportion of Chi-Square Statistic for Each Category";
proc sgplot data=Chi2;
   vbar Party / response=ChiSqPropor datalabel=ChiSqPropor;
   xaxis discreteorder=data;
   yaxis label="Proportion of Chi-Square Statistic" grid;
run;
Contribution of each cell to the chi-square statistic

The bar chart makes it clear that the frequency of the 'Dem' group is the primary factor in the size of the chi-square statistic. The "chi-square contribution plot" is a visual companion to the Deviation Plot, which is produced automatically by PROC FREQ when you specify the PLOTS=DEVIATIONPLOT option. The Deviation Plot shows whether the counts for each category are more than expected or less than expected. When you combine the two plots, you can make pronouncements like Goldilocks:

  • The 'Dem' group contributes the most to the chi-square statistic because the observed counts are "too big."
  • The 'Indep' group contributes a moderate amount because the counts are "too small."
  • The remaining groups do not contribute much because their counts are "just right."

Summary

In summary, this article addresses three topics related to testing the proportions of counts in a one-way frequency table. You can use the TESTP= option to specify the proportions for the null hypothesis. Be sure that you specify the proportions in the same order that they appear in the OneWayFreqs table. (The ORDER=DATA option is sometimes useful for this.) If the data proportions do not fit the null hypothesis, you might want to know why. One way to answer this question is to compute the contributions of each category to the total chi-square computation. This article shows how to display that information in a table or in a bar chart.

The post Chi-square tests for proportions in one-way tables appeared first on The DO Loop.

September 29, 2018
 

The Grand Buddha at Ling Shan, located on the northern bank of Taihu Lake near Wuxi, China, is a fitting metaphor for smart city initiatives in China, specifically Wuxi in the Jiangsu Province in Eastern China.  One of the largest Buddha statues in the world, the bronze monument reaches 88 [...]

What Wuxi can teach us about being a smart city was published on SAS Voices by Lee Ann Dietz

September 26, 2018
 

Marketing has changed and continues to change. Some of this is because of technology, but mostly it is the result of evolving customer expectations. Isolated customer interactions, however good, do not create a great customer experience. As isolated events, in fact, they may have a low level of influence on [...]

Why customer journey design matters (and having analytics at its heart matters, too) was published on Customer Intelligence Blog.

September 26, 2018
 

Here's a challenge.  You're a passenger in an automobile, and you've been asked to evaluate whether the driver's habits behind the wheel are "safe" or "risky."  But there's a catch: you have to collect all of your information with your eyes closed.

Think about it -- with your eyes shut, you're denied important information such as your location, traffic conditions, speed limits and traffic signals, and weather conditions.  Sightless, your only source of data comes from your sense of motion as the vehicle accelerates, slows down, and turns.

Sunish Menon, a Ph.D. researcher at State Farm Insurance, faced this challenge with his team as they designed the data collection scheme for State Farm's Drive Safe and Save program.  Sunish shared his experience and ideas with attendees at the Analytics Experience 2018 conference in San Diego.

Accelerometer: simple measurements with rich results

Sunish's team knew that they were going to build a smartphone app to support the Drive Safe and Save program.  After all, a smartphone can collect a ton of information: location with GPS, phone use during a trip, traffic conditions, trip duration and speed, and more.  But accessing these details has a cost.  Every sensor on a phone consumes precious battery life, and potential users might not be comfortable sharing their location constantly with an insurance company -- even if there is a premium discount at stake.  So what's the minimum amount of information you can collect and still assemble a meaningful profile?  Maybe capturing the changes in speed and direction is enough.

Like people, your smartphone also has a "sense of motion" -- it's called an accelerometer.  As you might guess from its name, an accelerometer is a small electrical sensor that measures acceleration.  For a quick physics refresher, let's review the difference between speed, velocity, and acceleration:

  • Speed: How fast an object is moving, usually expressed as distance over time (example: 10 meters per second).
  • Velocity: How fast an object is moving and in which direction (example: 10 meters per second, to the east).
  • Acceleration: The rate of change in the velocity of an object. Since it’s a rate of change, it’s expressed as distance over time (speed), per unit of time. For example, to change speed from 0 to 60 miles per hour in 10 seconds, an object must accelerate at 2.682 meters per second per second, or 2.682 m/s².

 

Your smartphone measures acceleration across three axes, traditionally labeled as x, y, and z.  The measurements are sampled multiple times per second.  Each measurement reflects acceleration across one of the axes.  Taken together, you can get a sense for the phone's overall direction.

I've included Sunish's diagram of how these axes are oriented on a smartphone.  The x axis is horizontal along the face of the phone, and the y axis is vertical along the face.  The z axis is perpendicular to the face of the phone, passing through its center.  Depending on the direction of the phone's movement, acceleration values might be positive or negative.

Capturing my commute data

Inspired by Sunish's presentation, I decided to get a bit of hands-on practice with accelerometer data. I installed a free app on my phone to capture the raw data from the accelerometer.  Here's what the data values look like, as measured from the start of my driving commute from work to home.

In these data, the first value is a record counter.  The second "big number" value is a timestamp in Unix epoch format: the number of milliseconds since midnight on January 1, 1970.  The next three values are the acceleration measurements for the x, y, and z axes, respectively.  Acceleration is measured in meters per second squared, or m/s².  For reference, keep in mind that Earth's gravity -- the force that keeps us grounded (literally) -- is about 9.8 m/s² (1 g).
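As a rough sketch of how you might read such records into SAS (the file name and the comma-separated layout are my assumptions, not details from the app):

/* read raw accelerometer records assumed to be comma-separated as: counter, epoch_ms, x, y, z */
data accel;
   infile 'accelerometer.csv' dsd;
   input counter epoch_ms x y z;
   /* convert Unix epoch (milliseconds since 1970) to a SAS datetime (seconds since 1960) */
   datetime = epoch_ms/1000 + '01JAN1970:00:00:00'dt;
   format datetime datetime22.3;
run;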

The data from my commute contains over 85,000 measurements, captured over about 30 minutes (it was a busy Friday afternoon).  I used the SERIES plot in PROC SGPLOT to create a simple visualization.  Can you tell where the longest stoplight occurs?  (It's right near the shopping mall -- I really don't like that intersection.)

commute accelerometer
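A minimal sketch of that kind of SERIES plot, assuming the accel data set from the sketch above:

proc sgplot data=accel;
   series x=datetime y=y;                  /* acceleration along one axis over time */
   xaxis label="Time";
   yaxis label="Acceleration (m/s**2)" grid;
run;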

Teasing out "events" from the data

In my commute as represented in the above chart, it seems simple enough to locate the mundane events of accelerating, braking, and waiting in traffic.  There are a few spikes and dips that might represent more dramatic braking events, or perhaps a fast start from a traffic light (my car has some pep!).  Let's use some histograms to look at these measurements another way.

histograms of x y z
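A quick sketch of such histograms, again assuming the accel data set (transparency simply makes the overlap of the three distributions visible):

proc sgplot data=accel;
   histogram x / transparency=0.5;
   histogram y / transparency=0.5;
   histogram z / transparency=0.5;
   xaxis label="Acceleration (m/s**2)";
run;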

Most of my commute is uninteresting, as I'm driving at a steady speed or waiting in traffic.  The histogram shows the x axis measurements are centered around 0.  But why don't the y and z axes behave the same?  During my drive, my phone is positioned nearly vertical in a dashboard holder, with perhaps a 30-degree forward tilt.  Gravity works on all of us at about 9.8 m/s².  With my phone at that vertical-ish tilt, you can see most of that force applied to the y axis, with some shared with the z axis.

Since the collected data represent a time series, it makes sense to apply a time series analysis to see if we can decompose its components and make the interesting events more obvious.  In Sunish's case, his team applied this kind of decomposition; in SAS, PROC TIMESERIES offers a SPECTRA statement for spectrum analysis with similar options.
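As a very rough sketch of what that might look like (the options shown are deliberately minimal and are my assumptions; see the PROC TIMESERIES documentation for the full SPECTRA syntax):

proc timeseries data=accel outspectra=spec;
   var y;          /* analyze one acceleration component */
   spectra;        /* request spectral (periodogram) output */
run;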

Here's a tip: if you are trying this on your own and you get stuck, post a question to the SAS forecasting/time series community.  Experts are eager to answer!

Confounding factors when analyzing a drive

During a drive, the measurements from the accelerometer "start at zero" (or their natural baseline) only when the phone is lying flat, with the top of the phone pointed toward the front of the car.  But who keeps their phone stationed like that?  When I'm driving alone in my car, my phone is usually in a holder mounted on the dash, positioned nearly vertical, tilted slightly.  Or it's in my pants pocket.

Sunish presented a series of techniques to help control for this -- all of them applying more math than I am qualified to describe.  The smartphone also has a gyroscope sensor, which can measure the phone's "tilt" along any of its axes (labeled as pitch, roll, and yaw).  Combining these measurements with the acceleration readings, as well as controlling for the force of gravity, can help create a more accurate picture of your driving experience.
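One simple, orientation-independent trick (my illustration here, not one of the techniques Sunish presented) is to work with the magnitude of the acceleration vector, which does not change as the phone tilts, and then subtract the constant pull of gravity:

data accel_net;
   set accel;                               /* accel data set from the earlier sketch */
   magnitude = sqrt(x**2 + y**2 + z**2);    /* total acceleration across all three axes */
   net = magnitude - 9.8;                   /* crude removal of gravity (9.8 m/s**2)    */
run;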

When I'm not alone in the car, the phone might not stay in one place.  A passenger might pick it up to find directions, or to reference IMDB to settle a bet.  All of those movements will also register on the accelerometer, and how will a "safe driving app" judge these actions?  That's a challenge for analytics.

Safe driver versus risky driver: more than just measurement

Please do not rush to judgement about my driving behavior from this one sample.  In fact, even if you had hundreds of samples of my driving, it would probably be difficult to fairly judge whether I am a high-risk driver.

For insurance companies, assessment of risk is influenced more by how similar you are to known risky populations.  That's why young drivers tend to command higher premiums.  It's not just because they are young, exactly, but it's because insurance companies have to pay out more claims due to accidents caused by young drivers.  The cause might be due to their inexperience and immaturity, but that's almost beside the point.  It's a numbers game.

By collecting data from millions of car trips across a wide range of customers, an insurance company can apply machine learning to discern the patterns of drivers who make claims versus those who don't.  If your driving patterns are scored as too similar to those of other drivers who cause accidents...well, don't expect to receive a discount when you share your driving data.

Programs like State Farm's Drive Safe and Save accomplish more than just "proving" that you're a good driver.  The program incents you to be more conscious of your driving behavior, especially while you have that app running and collecting data.  State Farm provides periodic reports to program subscribers that show how your driving behavior compares (favorably or not) to other drivers in the pool.  The gamification and feedback aspect of the program might do just as much to improve driving as the promise of a discount.

 

Using your smartphone accelerometer to build a safe driving profile was published on SAS Users.

September 26, 2018
 

A radial basis function is a scalar function that depends on the distance to some point, called the center point, c. One popular radial basis function is the Gaussian kernel φ(x; c) = exp( -||x − c||² / (2 σ²) ), which uses the squared distance from a vector x to the center c to assign a weight. The weighted sum of Gaussian kernels, Σ wᵢ φ(x; cᵢ), arises in many applications in statistics, including kernel density estimation, kernel smoothing, and machine learning algorithms such as support vector machines. It is therefore important to be able to efficiently evaluate a radial basis function and compute a weighted sum of several such kernel functions.

One of the many useful features of the SAS/IML language is its ability to compactly represent matrix and vector expressions. The expression ||x − c|| looks like the distance between two vectors, but in the SAS/IML language the DISTANCE function can handle multiple sets of vectors:

  • The DISTANCE function can compute the distance between two vectors of arbitrary dimensions. Thus when x and c are both d-dimensional row vectors, you can compute the distance by using r = DISTANCE(x, c). The result is a scalar distance.
  • The DISTANCE function can compute the distance between multiple points and a center. Thus when x is an m x d matrix that contains m points, you can compute the m distances between the points and c by using r = DISTANCE(x, c). Again, the syntax is the same, but now r is an m x 1 vector of distances.
  • The DISTANCE function in SAS/IML 14.3 can compute the distance between multiple points and multiple centers. Thus when x is an m x d matrix that contains m points and c is a p x d matrix that contains p centers, you can compute the m*p distances between the points and c by using r = DISTANCE(x, c). The syntax is the same, but now r is an m x p matrix of distances.

A SAS/IML function that evaluates a Gaussian kernel function

The following SAS/IML statements define a Gaussian kernel function. Notice that the function is very compact! To test the function, define one center at C = (2.3, 3.2). Because SAS/IML is a matrix language, you can evaluate the Gaussian kernel on a grid of integer coordinates (x,y) where x is an integer in the range [1,5] and y is in the range [1,8]. Let Z be the matrix of the 40 ordered pairs. The following call evaluates the Gaussian kernel at the grid of points:

proc iml;
/* Radial basis function (Gaussian kernel). If z is m x d and c is n x d, this function
   returns the mxn matrix of values exp( -||z[i,] - c[j,]||**2 / (2*sigma**2) ) */
start GaussKernel(z, c, sigma=1);
   return exp( -distance(z,c)##2 / (2*sigma**2) );
finish;
 
/* test on small data: Z is a 5 x 8 grid of points and C = {2.3 3.2} */
xPts = 1:5; yPts = 1:8;
Z = expandgrid(xPts, yPts);              /* expand into (8*5) x 2 matrix */
C = {2.3 3.2};                           /* the center */ 
phi = GaussKernel(Z, C);                 /* phi is 40 x 1 vector */
 
print Z phi;                             /* print in expanded form */
phi_Grid = shapecol(phi, ncol(yPts));    /* reshape into grid (optional) */
print phi_Grid[c=(char(xPts)) r=(char(yPts)) F=4.2];

The table shows the Gaussian kernel evaluated at the grid points. The columns represent the values at the X locations and the rows indicate the Y locations. The function is largest at the grid point (x,y) = (2,3) because (2,3) is the grid point closest to the center (2.3, 3.2). The largest value is 0.94. Notice that the function is essentially zero at points that are more than 3 units from the center, which you would expect from a Gaussian distribution with σ = 1.

You can use the HEATMAPCONT subroutine to make a heat map of the function values. However, notice that in the matrix the rows increase in the downward direction, whereas in the usual Cartesian coordinate system the Y direction increases upward. Consequently, you need to reverse the rows and the Y-axis labels when you create a heat map:

start PlotValues( v, xPts, yPts );
   G = shapecol(v, ncol(yPts));      /* reshape vector into grid */
   M =  G[nrow(G):1, ];              /* flip Y axis (rows) */
   yRev = yPts[, ncol(yPts):1];      /* reverse the Y-axis labels */
   call heatmapcont(M) xvalues=xPts yValues=yRev;
finish;
run PlotValues(phi, xPts, yPts);
Evaluate Gaussian kernel on grid of points. Visualize with a heat map.

Sums of radial basis functions

Locations of 86 US cities with population greater than 200,000

Often the "centers" are the locations of some resource such as a warehouse, a hospital, or an ATM. Let's use the locations of 86 large US cities, which I used in a previous article about spatial data analysis. A graph of the locations of the cities is shown to the right. (Click to enlarge.) The locations are in a standardized coordinate system, so they do not directly correspond to longitudes and latitudes.

If there are multiple centers, the GaussKernel function returns a column for every center. Many applications require a weighted sum of the columns. You can achieve a weighted sum by using a matrix-vector product A*w, where w is a column vector of weights. If you want an unweighted sum, you can use the SAS/IML subscript reduction operator to sum across the columns: A[,+].

For example, the following statements evaluate the Gaussian kernel function at each value in a grid (the Z matrix) and for each of 86 cities (the C matrix). The result is a 3726 x 86 matrix of values. You can use the subscript reduction operator to sum the kernel evaluations over the cities, as shown:

use BigCities;
   read all var {x y} into C;     /* C = (x,y) locations of centers */
   read all var "City";
close;
 
/* Z = a regular grid in (x,y) coordinates that contains the data */
XGridPts = round( do(-0.4, 0.4, 0.01), 0.001);
YGridPts = round( do(-0.2, 0.25, 0.01), 0.001);
Z = expandgrid( XGridPts, YGridPts );  /* 3,726 points on an 81 x 46 grid */
 
phi = GaussKernel(Z, C, 0.025);   /* evaluate kernel at the grid Z; use a smaller bandwidth */
sumPhi = phi[,+];                 /* for each grid point, sum the kernel evaluations over all centers */
Sum of 86 Gaussian kernels evaluated on a regular grid

The resulting heat map shows blobs centered at each large city in the data. Locations near isolated cities (such as Oklahoma City) are lighter in color than locations near multiple nearby cities (such as southern California and the New York area) because the image shows the superposition of the kernel functions. At points that are far from any large city, the sum of the Gaussian kernel functions is essentially zero.
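If you need a weighted sum rather than the unweighted sum shown above, a minimal sketch (continuing the same PROC IML session; the weight vector here is hypothetical):

w = j(ncol(phi), 1, 1);      /* hypothetical weights, one per city; replace with real values */
w = w / sum(w);              /* optional: normalize the weights to sum to 1 */
wSumPhi = phi * w;           /* weighted sum of the kernel columns: A*w */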

In summary, if you work with algorithms that use radial basis functions such as Gaussian kernels, you can use the SAS/IML language to evaluate these functions. By using the matrix features of the language and the fact that the DISTANCE function supports matrices as arguments, you can quickly and efficiently evaluate weighted sums of these kernel functions.

The post Radial basis functions and Gaussian kernels in SAS appeared first on The DO Loop.

September 25, 2018
 

If you use SAS Visual Analytics and don’t have the SAS Visual Analytics app, you're missing out on a ton of convenience and interaction you could be having while on the go. And even if you don’t have access to SAS Visual Analytics today, you can still download and try the mobile app with some cool sample reports.

Ready to take a quick dive and look at the app?

How to get the app

Download and install the free app to your Apple, Android or Windows device from the app store:

Apple iTunes Store
Google Play
Microsoft Store

When you open the app, you are greeted with an introductory launch screen:

In the introductory launch screen that displays when you first open the iOS or Android app, go to the third screen and tap on Learn how to use the Tray.

You are taken to the SAS Help Center. Watch the short slide show at the help center to understand the special Tray feature in the iOS or Android app or to find out what’s new in the app.

Using the Windows-based app? Here’s what you see:

Sample reports on the SAS Demo Server

In the app, sample charts and reports are instantly made available to you in the Subscriptions view via a connection to the SAS Demo Server. This server hosts a nice variety of reports that you can view on your phone or tablet. Interact with a wide spectrum of sample SAS Visual Analytics reports for different industries.

Subscriptions View in the App With Sample Reports

Tap on Add to view the different folders that contain additional sample reports for you to browse, subscribe, and view.

Additional Sample Reports on the SAS Demo Server

When you select and subscribe to the additional reports that are available on the SAS Demo Server, these reports are downloaded to the Subscriptions view in the mobile app. Just tap on the tile for any report in the Subscriptions view to open it and view the charts, graphs, and their associated data.

Here are a couple of reports as viewed in the Windows 10 app:

Already have SAS Visual Analytics in your organization?

If you view SAS Visual Analytics reports on your laptop or a desktop computer, this app extends your ability to view those same reports on your phone or tablet. If your organization has deployed SAS Visual Analytics, but is not taking advantage of extending report viewing ability to mobile devices, I urge you to consider it.

The app supports SAS Visual Analytics 8.3, 8.2, 7.4, and 7.3. Almost every type of interaction that you have with a SAS Visual Analytics report on your desktop can be done with reports viewed in the app on your phone or tablet!

If you have SAS Visual Analytics deployed in your organization, reach out to the SAS Visual Analytics administrator in your organization and ask them to enable support for mobile devices so that you can start viewing your reports in the app.

To give you a little more guidance, here are some FAQs about the app.

If we have reports in our organization that were created with SAS Visual Analytics, can we view those reports in this app?

Yes. The same reports that you view in your web browser on a desktop can be viewed in the mobile app.

How do I view our organization’s reports in the app?

Access from your mobile device to SAS Visual Analytics reports on your company’s server is granted by your SAS Administrator. Live data access requires either a Wi-Fi or cellular connection, and your company may require VPN access or other company-specific security measures.

Contact your SAS Visual Analytics Administrator to request access from your mobile device to the server hosting your SAS Visual Analytics reports. Your administrator ensures that your mobile device is registered as a valid device in the SAS Environment Manager where mobile device access to your organization’s server is managed.

How do I add a server connection?

When your mobile device is registered for access to the SAS Visual Analytics server, simply create a server connection within the app to your company server and browse for reports.

Here’s a nice slide show with the steps you follow to create a server connection to the SAS Visual Analytics server by entering the complete server name, port number, your username, and password:

Quick primer on the SAS Visual Analytics app was published on SAS Users.