11月 232017
 

If you're preparing a big Thanksgiving dinner, then you don't want to leave out the most popular side dish, do you?!? But what is the most popular side dish? ... If you don't already know, then perhaps some data & analytics can help! But before we get started, here's a [...]

The post What's the must-have Thanksgiving side dish? appeared first on SAS Learning Post.

11月 222017
 

A statistical programmer read my article about the beta-binomial distribution and wanted to know how to compute the cumulative distribution (CDF) and the quantile function for this distribution. In general, if you know the PDF for a discrete distribution, you can also compute the CDF and quantile functions. This article shows how.

Continuous versus discrete distributions

Recall the four essential functions for any probability distribution: the probability density function (PDF), the cumulative distribution function (CDF), the quantile function, and the random-variate generator. SAS software provides the PDF, CDF, QUANTILE, and RAND function, which support for about 25 common families of distributions. However, there are infinitely many probability distributions. What can you do if you need a distribution that is not natively supported in SAS? The answer is that you can use tools in SAS (especially in SAS/IML software) to implement these functions yourself.

I have previously shown how to use numerical integration and root-finding algorithms to compute the CDF and quantile function for continuous distributions such as the folded normal distribution and the generalized Gaussian distribution. For discrete distributions, you can use a summation to obtain the CDF from the PDF. The CDF at X=x is the sum of the PDF values for all values of X that are less than or equal to x. The discrete CDF is a step function, so it does not have an inverse function. Given a probability, p, the quantile for p is defined as the smallest value of x for which CDF(x) ≥ p.

To simplify the discussion, assume that the random variable X takes values on the integers 0, 1, 2, .... Many discrete probability distributions in statistics satisfy this condition, including the binomial, Poisson, geometric, beta-binomial, and more.

Compute a CDF from a PDF

I will use SAS/IML software to implement the CDF and quantile functions. The following SAS/IML modules define the PDF function and the CDF function. If you pass in a scalar value for x, the CDF function computes the PDF from X=0 to X=x and sums up the individual probabilities. When x is a vector, the implementation is slightly more complicated, but the idea is the same.

proc iml;
/* PMF function for the beta-binomial distribution. See
   https://blogs.sas.com/content/iml/2017/11/20/simulate-beta-binomial-sas.html */
start PDFBetaBinom(x, nTrials, a, b);
   logPMF = lcomb(nTrials, x) + logbeta(x + a, nTrials - x + b) - logbeta(a, b);
   return ( exp(logPMF) );
finish;
 
/* CDF function for the beta-binomial distribution */
start CDFBetaBinom(x, nTrials, a, b);
   y = 0:max(x);                               /* all values of X less than max(x) */
   PMF = PDFBetaBinom(y, NTrials, a, b);       /* compute PMF(t) for t=0..max(x) */
   cumPDF = cusum(PMF);                        /* cdf(T) = sum( PMF(t), t=0..T ) */
   /* The input x can be a vector of values in any order.
      Use the LOC function to assign CDF(x) for each value of x */
   CDF = j(nrow(x), ncol(x), .);               /* allocate result (same size as x) */
   u = unique(x);                              
   do i = 1 to ncol(u);                        /* for each unique value of x... */
      idx = loc(x=u[i]);                       /*    find the indices */
      CDF[idx] = cumPDF[1+u[i]];               /*    assign the CDF   */
   end;
   return CDF;
finish;
 
/* test the CDF function */
a = 6; b = 4; nTrials = 10;
c = CDFBetaBinom(t, nTrials, a, b);

Notice the computational trick that is used to compute the CDF for a vector of values. If you want to compute the CDF at two or more values (say X=5 and X=7), you can compute the PDF for X=0, 1, ... M where M is the maximum value that you need (such as 7). Along the way, you end up computing the CDF for all smaller values.

You can write the CDF values to a SAS data set and graph them to visualize the CDF of the beta-binomial distribution, as shown below:

Beta-binomial cumulative distribution

Compute quantiles from the CDF

Given a probability value p, the quantile for p is smallest value of x for which CDF(x) ≥ p. Sometimes you can find a formula that enables you to find the quantile directly, but, in general, the definition of a quantile requires that you compute the CDF for X=0, 1, 2,... and so forth until the CDF equals or exceeds p.

The support of the beta-binomial distribution is the set {0, 1, ..., nTrials}. The following function first computes the CDF function for all values. It then looks at each value of p and returns the corresponding value of X that satisfies the quantile definition.

start QuantileBetaBinom(_p, nTrials, a, b);
   p = colvec(_p);                         /* ensure input is column vector */
   y = 0:nTrials;                          /* all possible values of X   */
   CDF = CDFBetaBinom(y, NTrials, a, b);   /* all possible values of CDF */
   q = j(nrow(p), 1, .);                   /* allocate result vector */
   do i = 1 to nrow(p);
      idx = loc( p[i] <= CDF );            /* find all x such that p <= CDF(x) */ 
      q[i] = idx[1] - 1;                   /* smallest index. Subtract 1 from index to get x */
   end;
   return ( q );
finish;
 
p = {0.2, 0.5, 0.8};
q = QuantileBetaBinom(p, nTrials, a, b);
print p q;

The output shows the quantile values for p = {0.2, 0.5, 0.8}. You can find the quantiles visually by using the graph of the CDF function, shown in the previous section. For each value of p on the vertical axis, draw a horizontal line until the line hits a bar. The x value of the bar is the quantile for p.

Notice that the QuantileBetaBinom function uses the fact that the beta-binomial function has finite support. The function computes the CDF at the values 0, 1, ..., nTrials. From those values you can "look up" the quantile for any p. If you have distribution with infinite support (such as the geometric distribution) or very large finite support (such as nTrials=1E9), it is more efficient to apply an iterative process such as bisection to find the quantiles.

Summary

In summary, you can compute the CDF and quantile functions for a discrete distribution directly from the PDF. The process was illustrated by using the beta-binomial distribution. The CDF at X=x is the sum of the PDF evaluated for all values less than x. The quantile for p is the smallest value of x for which CDF(x) ≥ p.

The post Compute the CDF and quantiles of discrete distributions appeared first on The DO Loop.

11月 222017
 

Social media has brought anniversary dates to the forefront. Every day, my view of Google Photos or Facebook shows me a collection of photos from exactly some number of years ago to remind me of how good things were back then. These apps are performing the simplest of date-based math to trick me. They think, "We have a collection of photos from this month/day from some previous year -- let's collect those together and then prey on Chris' nostalgia instincts."

Sometimes it works and I share them with friends. Weren't my kids so cute back then? It's true, they were -- but I don't pine for those days. I live in the present and I look towards the future -- as all citizens of the world should.

Looking back at dates in SAS

But you know who likes to look at the past? Managers (and probably a few of my colleagues...). Many of my SAS jobs are in support of date-based reports. They answer questions such as, "what happened in the past 30 days?" or "how much activity in the past 6 months?" When I have SAS date values, going back 30 days is simple. The SAS date for "30 days ago" is simply today()-30. (Because, remember, a SAS date is simply an integer representing the count of days since Jan 1, 1960.)

"6 months ago" is a little more nuanced. I could fudge it by subtracting 183 or 184 from the value of today(), but that's not precise enough for the analytical wonks that I work with. Fortunately, SAS provides a function that can compute any date -- given a starting date -- using just about any criteria you can imagine. The function is called INTNX (

six_mo_ago = intnx(
 'month',  /* unit of time interval */
  today(), /* beginning date */
  -6,      /* number of intervals, negative goes to the past */
  'same'   /* alignment of interval date. "Same" is for same day of month */
 );

If I wanted to go 6 years into the past, I would simply change that first argument to 'year'. 6 days? Change it to 'day'. in %SYSFUNC -- the macro function that breaks out of macro processing to invoke built-in SAS functions. I also need to remove the quotes around the interval and alignment values -- the SAS macro processor will treat these as string literals and would not approve of the quotes.

%let six_mo_ago = 
 %sysfunc(
  intnx(
   month,             /* unit of time interval */
   %sysfunc(today()), /* function to get current date */
   -6,                /* number of intervals, negative goes to the past */
   same               /* alignment of interval date. "Same" is for same day of month */
   )
  );
 %put &=six_mo_ago;

Output (from today's run):

45          %put &=six_mo_ago;
SIX_MO_AGO=20960

Notice the result is a number -- which is exactly what I need for date value comparisons. But what if I wanted a formatted version of the date? The

 %let six_mo_ago_fmt = 
  %sysfunc(
   intnx(
   month,             /* unit of time interval */
   %sysfunc(today()), /* function to get current date */
   -6,                /* number of intervals, negative goes to the past */
   same               /* alignment of interval date. "Same" is for same day of month */
   ), date9.  /* Tell %SYSFUNC how to format the result */
  );
 %put &=six_mo_ago_fmt;

Output (from today's run):

56          %put &=six_mo_ago_fmt;
SIX_MO_AGO_FMT=21MAY2017

Adapting INTNX for SAS datetime values

This computed date works perfectly when my data sets contain SAS date values that I want to filter. For example, I can limit the records to those from the past 6 months with code similar to this:

proc freq data=comm.visits (where=(date > &six_mo_ago.)) 
 noprint 
 ;
tables geo_country_code / out=country_visits_6mo;
run;

But I realized that some of my data sets use datetime values, not date values. A SAS datetime value is simply the number of seconds since midnight on Jan 1, 1960. I've seen programs that adapt the code above by converting the computed cutoff date from a date value to a datetime value -- it's simple to do with math:

proc freq data=comm.visits (where=(event_time > %sysevalf( &six_mo_ago. * 60 * 60 * 24 ))) 
 noprint 
 ;
tables geo_country_code / out=country_visits_6mo;
run;

The

%let six_mo_ago_dt = 
 %sysfunc(
  intnx(
   dtmonth,              /* unit of time interval */
   %sysfunc(datetime()), /* function to get current datetime */
   -6,                   /* number of intervals, negative goes to the past */
   same                  /* alignment of interval date. "Same" is for same day of month/same time */
   )
  );
 %put &=six_mo_ago_dt;

Output (at this moment):

34          %put &=six_mo_ago_dt;
SIX_MO_AGO_DT=1810991917.93526

I can then use the much more readable version of my comparison code:

proc freq data=comm.visits (where=(event_time > &six_mo_ago_dt)) 
 noprint 
 ;
tables geo_country_code / out=country_visits_6mo;
run;

Learning more about date and datetime intervals

First, an acknowledgment. I recently had the privilege of presenting at the Quebec user groups with SAS-world superstar Marje Fecht. Marje is an excellent instructor, and she delivered a popular talk about working with SAS date values -- a topic that trips up many new SAS users. It's useful for me -- someone who has used SAS for a long time -- to see a basic topic presented to me as if I were brand new. Her talk reminded me of some good practices that I was able to bring home and apply in my own production SAS jobs. I know that Marje was invited to present this talk at SAS Global Forum 2018, and I hope that I did not steal too much of her thunder. She has a good origin story about the 01JAN1960 date decision. You should attend the conference and see her talk in person.

The INTNX (and its sister function for computing date differences, INTCK) are powerful tools for manipulating date and datetime values. Other programming languages offer complex code libraries to accomplish what these two functions can do as part of Base SAS. They are tricky to learn at first, but once you get the hang of them they can really simplify your SAS programs that deal with time-based data.

Rick Wicklin presented a useful introduction to both functions in INTCK and INTNX: Two essential functions for computing intervals between dates in SAS.

While these functions are available in Base SAS, they are maintained by the developers who look after SAS/ETS (econometrics and time series). You'll find see this thread on the SAS Support Communities.

The post Computing a date from the past (or future) with SAS appeared first on The SAS Dummy.

11月 212017
 

Fifth annual STEM Showcase draws 350 students Temple Grandin, a prominent author and speaker on autism and animal behavior, and professor of Animal Science at Colorado State University, was shocked this past August to see how many students on her college campus seemed oblivious to the solar eclipse – despite [...]

The post Temple Grandin: 'We need different kinds of minds working together' appeared first on SAS Analytics U Blog.

11月 212017
 

Here in the US, we're preparing to celebrate the Thanksgiving holiday. Therefore  this Thursday most families in the US will be having a big turkey dinner. Although I'm a bachelor guy and eat out all the time, I'm actually a pretty good cook - and I'd like to share with [...]

The post How a data guy cooks a Thanksgiving dinner! appeared first on SAS Learning Post.

11月 212017
 

Here in the US, we're preparing to celebrate the Thanksgiving holiday. Therefore  this Thursday most families in the US will be having a big turkey dinner. Although I'm a bachelor guy and eat out all the time, I'm actually a pretty good cook - and I'd like to share with [...]

The post How a data guy cooks a Thanksgiving dinner! appeared first on SAS Learning Post.

11月 212017
 

I recently spent two days with an innovative communications customer explaining exactly what SAS analytics can do to help them take their advertising platform to a whole new level. Media meets data resulting in addressable advertising. SAS would essentially be the brain behind all their advertising decisions, helping them ingest [...]

Analytics = brilliance was published on SAS Voices by Suzanne Clayton

11月 212017
 

Motivating students is one of the major challenges teachers and student advocates face on a daily basis and encouraging students to be interested in analytics is a whole other mountain to climb, or is it? What motivates students? Often we make assumptions that students are not motivated or are not [...]

The post Using SAS to Help Girl Scouts Grow Cookie Sales! appeared first on SAS Learning Post.

11月 212017
 

SAS Viya provides a robust, scalable, cloud-ready, distributed runtime engine. This engine is driven by CAS (Cloud Analytic Services), providing fast processing for many data management techniques that run distributive, i.e. using all threads on all defined compute nodes.

Why

PROC APPEND is a common technique used in SAS processes. This technique will concatenate two data sets together. However, PROC APPEND will produce an ERROR if the target CAS table exists prior to the PROC APPEND.

Simulating PROC APPEND

Figure 1. SAS Log with the PROC APPEND ERROR message

Now what?

How

To explain how to simulate PROC APPEND we first need to create two CAS tables. The first CAS table is named CASUSER.APPEND_TARGET. Notice the variables table, row and variable in figure 2.

Figure 2. Creating the CAS table we need to append rows to

The second CAS table is called CASUSER.TABLE_TWO and in figure 3 we can review the variables table, row, and variable.

Figure 3. Creating the table with the rows that we need to append to the existing CAS table

To simulate PROC APPEND we will use a DATA Step. Notice on line 77 in figure 4 we will overwrite an existing CAS table, i.e. CASUSER.APEND_TARGET. On Line 78, we see the first table on the SET statement is CASUSER.APPEND_TARGET, followed by CASUSER.TABLE_TWO. When this DATA Step runs, all of the rows in CASUSER.APPEND_TARGET will be processed first, followed by the rows in CASUSER.TABLE_TWO. Also, note we are not limited to two tables on the SET statement with DATA Step; we can use as many as we need to solve our business problems.

Figure 4. SAS log validating the DATA Step ran in CAS i.e. distributed

The result table created by the DATA Step is shown in figure 5.

Figure 5. Result table from simulating PROC APPEND using a DATA Step

Conclusion

SAS Viya’s CAS processing allows us to stage data for downstream consumption by leveraging robust SAS programming techniques that run distributed, i.e. fast. PROC APPEND is a common procedure used in SAS processes. To simulate PROC APPEND when using CAS tables as source and target tables to the procedure use DATA Step.

How to simulate PROC APPEND in CAS was published on SAS Users.

11月 212017
 

The outbound marketing landscape, especially in the retail industry, is a challenging one. Customers are receiving multiple communications from multiple businesses, and for marketers it’s sometimes hard to understand which activity is driving results. Many retailers have a complex multichannel strategy of internet, email, direct mail, telemarketing, and catalogues. These [...]

Using analytics to determine best paths in customer journeys was published on Customer Intelligence Blog.