1月 272021
 

SASensei logoSAS offers myriad ways to level up your SAS skills (scroll to the bottom to see a list of SAS-provided learning paths and channels). In this post, I introduce you to SASensei, an independent, third-party online SAS learning resource that I enjoy a lot.

Learning: dull or fun?

Learning is not always associated with fun. Sometimes it feels difficult and exhausting. New concepts often contradict our prior knowledge and experience, compelling us to rethink, adjust, change and adapt to new paradigms.

Learning new ideas, skills and technologies can be intimidating, challenging and demanding. While learning, you are stretching out of your comfort zone. But that feeling is only transient. As a matter of attitude, learning is not about pushing yourself out of your comfort zone, it’s about expanding your comfort zone. And that is long lasting. The more you learn, the more comfortable and self-confident you become.

Learning does not have to be tedious. Look at pre-school kids. They learn basic life skills like walking (rolling, crawling), talking (in one or more languages), asking questions (a lot) – all without taking classes, just through their natural curiosity and ... playing games.

What is SASensei? Gamified SAS learning

When I first discovered the SASensei online SAS learning game/application I was pleasantly surprised by its non-traditional approach to learning such a serious and well-established platform as SAS.

As stated on its website, “Sasensei is a question based learning system. You must demonstrate your command of SAS® to earn Tokens - which should be wisely invested, to enable you to unlock new levels within the game...”

The following screenshot shows the main page of the SASensei website that displays a dashboard of the top players (they call it leaderboard). You can filter it geographically - by Country, Continent, or World, as well as by the timeline – by Past Month, Past Year, or All Time.

SASensei leaderboard

Privacy or prominence

Users have full control of their privacy or prominence. As you can see in the screenshot above, registered users are displayed by their screen names. This allows the users to either remain anonymous by selecting some fictitious obscure screen name or use their real name. Users can change their screen name at any time.

Rules of the game

In this blog post I provide just an overview of the main functionality and features of the SASensei learning platform. For detailed rules of the game, see SASensei Documentation.

Play and learn

Users are offered a variety of learning activities:

  • Viewing, reviewing and submitting SAS-learning flashcards;
  • Playing, reviewing and submitting questions by different SAS-related topics;
  • Taking and creating public, private, multi-player and custom quizzes;
  • Providing feedback on questions and flashcards by voting and commenting.

Users can challenge themselves by delving into different topics. Your successes and failures will provide you an honest and objective estimation of your SAS strengths as well as weaknesses. A healthy competition with other users encourages you to learn more and hone your SAS skills. When you fail a question, you can review the explanation of the correct answer and thus learn why you failed and acquire new knowledge, tips and tricks quickly and efficiently.

Invest, score, win and build a reputation

To play you will need to earn and spend tokens which are essentially the game’s currency. To motivate you further, you also earn reputation points, which is your ultimate score, a level of achievement in demonstrating SAS skills. Your reputation score is prominently displayed in your public profile. As you progress in the game and your reputation grows, additional functionality unlocks and becomes available to you. Your reputation score determines your SASensei standing level which is derived from those used in martial arts:

Sasensei title Sasamurai title

  • White Belt (new players)
  • Yellow Belt (reputation ≥ 50)
  • Green Belt (reputation ≥ 100)
  • Black Belt (reputation ≥ 200)
  • Sasamurai (reputation ≥ 500)
  • Assassin (reputation ≥ 1000)
  • Sasensei (reputation ≥ 5000)

Sample SASensei question

When you play a question, you select a topic, and then you are presented with a randomly selected multiple-choice question of a specified time limit (30, 60, 90 or 120 seconds). Here is a sample of such question:

Question:

What is wrong with the following LIBNAME statement?
libname fruits (apples oranges tomatoes);

Answers:

  • Incorrect syntax
  • You cannot mix apples and oranges in LIBNAME statement
  • Nothing is wrong, valid LIBNAME statement
  • Tomatoes are not fruits, therefore the statement is not correct

Correct answer:

Nothing is wrong, valid LIBNAME statement

Explanation:

According to Combine and conquer with SAS for examples of usage.

Question
Try tackling a question on your own in the SASensei environment to get real life experience: Sample Question.

Take a SASensei sample quiz

There are various quizzes available at SASensei: public quizzes, multiplayer quiz games, private quizzes (tests) for students.

A public quiz contains 12 questions with a total time cap of 12 minutes, and costs eight tokens to play. You can choose a single topic (sas statements, sas macro, procedures, options, etc.), and if you pass (75% or more) you get 12 tokens back, plus 20 point to your reputation. If you get 100%, you get 30 reputation points plus Top Student badge. A count of passed sessions (by topic) is displayed on your public profile.

QuizAlthough public quizzes are unlocked at the SASamurai level, for the readers of this blog, I have created a special custom quiz sample so you can experience it firsthand right here, right now. Just click on this link, Sample Quiz, register, and enjoy your ride.

See you at the top of the SASensei dashboard!

Credit

Big THANKS to Allan Bowe (United Kingdom) – SAS innovator and entrepreneur who created and founded SASensei learning platform.

Other SAS learning resources

Game on! SASensei: a fun way to learn SAS was published on SAS Users.

1月 272021
 

The inverse gamma distribution is a continuous probability distribution that is used in Bayesian analysis and in some statistical models. The inverse gamma distribution is closely related to the gamma distribution. For any probability distribution, it is essential to know how to compute four functions: the PDF function, which returns the probability density at a given point; the CDF function, which returns the probability that an observation is less than or equal to a particular value; the QUANTILE function, which is the inverse CDF function; and the RAND function, which generates a random variate.

For the gamma distribution, the four essential functions are supported directly in Base SAS. The functions for the inverse gamma distribution are not supported in the same way. However, you can exploit the relationship between the gamma and inverse gamma distributions to implement these four functions for the inverse gamma distribution. This article shows how to implement the PDF, CDF, QUANTILE, and RAND functions for the inverse gamma distribution in SAS.

What is the inverse gamma distribution?

If X is gamma distributed, then X > 0. As its name suggests, the inverse gamma distribution is the distribution of 1/X when X is gamma distributed. More precisely, let Gamma(α, β) be the gamma distribution with shape parameter α and scale parameter β. (This article always uses scale parameters, never a rate parameter.) Let IGamma(α, β) be the inverse gamma distribution with shape and scale parameters. Then if X ~ Gamma(α, β), the random variable 1/X ~ IGamma(α, 1/β). The documentation for PROC MCMC provides an additional discussion of the relationship between the gamma and inverse gamma distributions. There is also a Wikipedia page about the inverse gamma distribution.

A feature of the inverse gamma distribution is that it has a long tail for small values of the α shape parameter. In fact, that the expected value (mean) is undefined when α < 1 and the variance is undefined when α < 2.

Generate random variates from the inverse gamma distribution

It is easy to generate random values from the inverse gamma distribution because that is how the distribution is defined. To obtain a random variate from IGamma(α, β), simply take the reciprocal of a random variate from the Gamma(α, 1/β) distribution. It might be helpful to define a little macro function that makes it easier to generate a random sample. The following SAS program generates a large random sample and displays a histogram:

/* Random values: If X ~ Gamma(alpha, 1/beta), then 1/X ~ IGamma(alpha, beta) */
%macro RandIGamma(alpha, beta);
   (1 / rand('Gamma', &alpha, 1/(&beta)))
%mend;
data IGammaRand;
   call streaminit(12345);
   alpha = 3;
   beta = 0.5;
   do j = 1 to 10000;
      x = %RandIGamma(alpha, beta);
      output;
   end;
run;
 
title "Random Sample from the Inverse Gamma Distribution";
title2 "alpha=3, beta=0.5";
proc sgplot data=IGammaRand noautolegend;
   histogram x / binwidth=0.05 binstart=0.025; 
   xaxis values=(0 to 3 by 0.25);
run;

For these values of the parameters, the histogram shows that the distribution has a peak at x=1/8 and a very long tail. The tail is truncated at x=3, but you can use PROC MEANS to see that the maximum value in the sample is 11.69.

The PDF of the inverse gamma distribution

The Wikipedia article gives a formula for the PDF of the inverse gamma article. For x > 0, the PDF formula is
\(\mbox{IGamma PDF}(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (1/x)^{\alpha+1} \exp(-\beta/x)\)

If you compare this to the PDF for the gamma distribution, you will notice that they are very similar. If you evaluate the gamma PDF with scale 1/β at the point 1/x, you almost get the inverse gamma PDF. All you need to do is divide by x2, as follows:
IGamma_PDF(x; α, β) = Gamma_PDF(1/x; α, 1/β) / x2

You can use this relationship to evaluate the inverse gamma PDF in SAS. The following SAS DATA step evaluates the inverse gamma PDF for four choices of (α, β) parameters. These are the same choices that are used in the Wikipedia article for the inverse gamma distribution.

/* PDF : parameterize IGamma by using the scale parameter
   in terms of the Gamma PDF with shape parameter. */
%macro PDFIGamma(x, alpha, beta);
   (pdf("Gamma", 1/(&x), &alpha, 1/(&beta)) / ((&x)**2))
%mend;
 
data IGammaPDF;
array a[4] _temporary_ (1 2 3 3);
array b[4] _temporary_ (1 1 1 0.5);
do i = 1 to dim(a);
   alpha = a[i];
   beta  = b[i];
   group = cat('a=',alpha,', b=',beta);
   do x = 0.01 to 3 by 0.01;
      pdf = %PDFIGamma(x, alpha, beta);
      /* equivalent to 
      pdf = beta**alpha / Gamma(alpha) * x**(-alpha-1) * exp(-beta/x); */
      output;
   end;
end;
run;
 
title "PDF of the Inverse Gamma Distribution";
proc sgplot data=IGammaPDF;
   series x=x y=PDF / group=group lineattrs=(thickness=2);
   xaxis grid label="Density"; yaxis grid;
run;

You can see that the brown curve (the PDF for (α, β) = (3, 0.5)) has the same shape as the histogram of the random sample in the previous section.

The CDF and quantiles of the inverse gamma distribution

In a previous article, I show that there is a relationship between the CDF of the gamma distribution and a special function called the regularized upper incomplete gamma function. I also show how to compute the incomplete gamma function in SAS. In Wikipedia, the CDF of the inverse gamma distribution is given in terms of the incomplete gamma function. Consequently, we can compute the CDF in SAS without difficulty.

The quantile function is more difficult. The quantile function is the inverse CDF. There are a LOT of reciprocals to keep track of during the derivation! Nevertheless, you can evaluate the quantile function of the inverse gamma distribution by using an expression that includes the quantile function of the gamma distribution. The derivation is left as an exercise.

The following DATA step evaluates the CDF of the inverse gamma distribution for four sets of parameters. The quantile function is evaluated at the same time to ensure that Quantile(CDF(x)) = x. The CDF function is very flat for very small values of x, so the quantile function might fail to converge for very small values of the probability.

/* The CDF of the IGAMMA function is SDF('Gamma',1/x,alpha,1/beta) */
%macro CDFIGamma(x, alpha, beta);
   (SDF('Gamma', (&beta)/(&x), &alpha))
%mend;
%macro QuantileIGamma(p, alpha, beta);
   (1 / quantile('Gamma', 1-(&p), &alpha, 1/(&beta)))
%mend;
 
data IGammaCDF;
array a[4] _temporary_ (1 2 3 3);
array b[4] _temporary_ (1 1 1 0.5);
do i = 1 to dim(a);
   alpha = a[i];
   beta  = b[i];
   group = cat('a=',alpha,', b=',beta);
   do x = 0.05 to 3 by 0.01;
      CDF = %CDFIGamma(x, alpha, beta); 
      Quantile = %QuantileIGamma(CDF, alpha, beta); /* x - Quantile is approx 0 */
      output;
   end;
end;
run;
 
title "CDF of the Inverse Gamma Distribution";
proc sgplot data=IGammaCDF;
   series x=x y=CDF / group=group lineattrs=(thickness=2);
   xaxis grid; yaxis grid;
run;
 
title "Quantiles of the Inverse Gamma Distribution";
proc sgplot data=IGammaCDF;
   series x=CDF y=Quantile / group=group lineattrs=(thickness=2);
   xaxis grid label="Probability"; yaxis grid;
run;

Summary

This article shows how to compute the four essentials functions of the inverse gamma distribution: random variates, PDF, CDF, and quantiles. In every case, you can exploit the relationship between the gamma and inverse gamma distributions. The result is that you can use the RAND, PDF, CDF, and QUANTILE functions for the gamma distribution to evaluate similar quantities for the inverse gamma distribution. A contribution of this article is to write down the four formulas in one place and show how to compute them in SAS.

The post The inverse gamma distribution in SAS appeared first on The DO Loop.

1月 252021
 

Safety, efficacy, speed and costs must all be prioritized and balanced in the delivery of life-changing therapies to patients. A drug that's quickly and cost-efficiently delivered to market, but isn’t effective and safe is unacceptable. An effective, safe drug that doesn’t get to patients in time to save lives has [...]

The evolving role of AI in drug safety was published on SAS Voices by Cameron McLauchlin

1月 252021
 

Years ago, I wrote about how to compute the incomplete beta function in SAS. Recently, a SAS programmer asked about a similar function, called the incomplete gamma function. The incomplete gamma function is a "special function" that arises in applied math, physics, and statistics. You should not confuse the gamma function with the gamma distribution in probability, although they are related, as we will soon see.

There are two kinds of incomplete gamma functions: the "upper" and the "lower." This article shows how to compute the upper and lower incomplete gamma functions in SAS. A graph of the upper incomplete gamma function is shown to the right for several values of the parameter, α. A graph of the lower incomplete gamma function is shown later in this article.

The gamma function

Before discussing the incomplete gamma function, let's review the "complete" gamma function, which is usually called THE gamma function. The gamma function is defined by the following integral:
\(\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t} \, dt\)

Notice that the parameter to the function is α. The integration variable, t, is integrated out and does not appear in the answer. The integral is "complete" because the bounds of integration is the complete positive real line, (0, ∞).

The gamma function generalizes the factorial operation to include positive real numbers. In particular, Γ(1) = 1 and Γ(x+1) = x Γ(x) for any positive real number, x. If n is a positive integer, then Γ(n) = (n-1)!.

In SAS, you can use the GAMMA function to evaluate the (complete) gamma function at any positive real number:

/* The (complete) gamma function */
data Gamma;
do x = 0.5 to 3 by 0.5;
   gamma = gamma(x);
   output;
end;
run;
 
proc print; run;

The upper and lower incomplete gamma functions

An incomplete gamma function replaces the upper or lower limit of integration in the integral that defines the complete gamma function. If you replace the upper limit of integration (∞) by x, you get the lower incomplete gamma function:
\(p(\alpha, x) = \int_0^{x} t^{\alpha-1} e^{-t} \, dt\)
Here "lower" refers to integrating only the "left side", which means values of t < x.

If you replace the lower limit of integration (0) by x, you get the upper incomplete gamma function:
\(q(\alpha, x) = \int_x^{\infty} t^{\alpha-1} e^{-t} \, dt\)
Here "upper" refers to integrating only the "right side", which means values of t > x.

Notice that p(α, x) + q(α, x) = Γ(α) for all values of x. Furthermore, p(α, 0) = 0 and q(α, 0) = Γ(α).

Compute the incomplete gamma function in SAS

The lower incomplete gamma function looks a LOT like the definition of the cumulative distribution function (CDF) of the gamma distribution. The only difference is that the maximum value of a CDF is unity whereas the maximum value of the lower incomplete gamma function is Γ(α). Thus, Γ(α) is a constant scale factor that relates the lower incomplete gamma function and the CDF of the gamma distribution.

Specifically, the CDF function for the gamma distribution with shape parameter α and scale parameter 1 is
\(\mbox{CDF}('\mbox{Gamma}', x, \alpha) = \frac{1}{\Gamma(\alpha)} \int_0^x t^{\alpha-1} e^{-t} \, dt\)
Equivalently,
\(p(\alpha, x) = \Gamma(\alpha) * \mbox{CDF}('\mbox{Gamma}', x, \alpha)\)

In the same way, recall that the complementary CDF (sometimes called the survival function or the reliability function) is the quantity SDF(x) = 1 - CDF(x). The SDF of the gamma distribution is a scaled version of the upper incomplete gamma function, as follows:
\(q(\alpha, x) = \Gamma(\alpha) * \mbox{SDF}('\mbox{Gamma}', x, \alpha)\)

Consequently, you can compute the lower or upper incomplete gamma function in SAS by using the following macros. The SAS DATA step shows how to evaluate the functions on the domain (0, 4] for three values of the parameter, α = 1, 2, and 3. The graphs of the upper incomplete gamma function are shown:

%macro LowerIncGamma(x, alpha); /* lower incomplete gamma function */
   (Gamma(&alpha) * CDF('Gamma', &x, &alpha))
%mend;
%macro UpperIncGamma(x, alpha); /* upper incomplete gamma function */
   (Gamma(&alpha) * SDF('Gamma', &x, &alpha))
%mend;
data IncGamma;
do alpha = 1 to 3;
   do x = 0.01 to 4 by 0.01;
      LowerIncGamma = %LowerIncGamma(x, alpha);
      UpperIncGamma = %UpperIncGamma(x, alpha);
      output;
   end;
end;
run;
 
title "Lower Incomplete Gamma Function";
proc sgplot data=IncGamma;
   series x=x y=LowerIncGamma / group=alpha;
   xaxis grid; yaxis grid;
run;
title "Upper Incomplete Gamma Function";
proc sgplot data=IncGamma;
   series x=x y=UpperIncGamma / group=alpha;
   xaxis grid; yaxis grid;
run;

The graph of the upper incomplete gamma function is shown at the top of this article. The following graph shows the lower incomplete gamma function for several values of the α parameter.

By the way, some fields of study will use the term "regularized" incomplete gamma function. The regularized lower incomplete gamma function is simply the CDF of the gamma distribution. The regularized upper incomplete gamma function is the SDF of the gamma distribution.

Summary

As is sometimes the case, different areas of application use different names for essentially the same quantity. Although the SAS language does not provide a built-in function that evaluates the incomplete gamma function, it is easy to evaluate by using the GAMMA and CDF functions, as follows:

  • The complete gamma function, Γ(α), is computed by using the GAMMA function.
  • The lower/upper incomplete gamma function is a scaled version of the CDF and SDF (respectively) of the gamma distribution:
    • The lower incomplete gamma function is p(alpha,x) = GAMMA(alpha)*CDF('Gamma',x,alpha);
    • The upper incomplete gamma function is q(alpha,x) = GAMMA(alpha)*SDF('Gamma',x,alpha);
  • The lower/upper regularized incomplete gamma function is another named for the CDF and SDF, respectively.
    • The regularized lower incomplete gamma function is computed as CDF('Gamma',x,alpha);
    • The regularized upper incomplete gamma function is computed as SDF('Gamma',x,alpha);

The post How to compute the incomplete gamma function in SAS appeared first on The DO Loop.

1月 232021
 

It’s been a full eight years since HBR declared that data scientists hold the sexiest job of the 21st century. Yet, here we are in 2021, and the work of data scientists is still found outside the spotlight in many settings. Younger data scientists, in particular, can find it hard [...]

5 data scientists offer advice on improving your visibility was published on SAS Voices by Alison Bolen

1月 202021
 

Editor's note: This blog post is the first in a series of posts, originally published here by our partner News Literacy Project, exploring the role of data in understanding our world. Like infographics, social media and other forms of user-generated content pose unique challenges regarding data. Many news outlets and journalists have checks and balances [...]

Beware of data shared via social media - get the facts was published on SAS Voices by Jen Sabourin

1月 202021
 

This is the third and last introductory article about how to bootstrap time series in SAS. In the first article, I presented the simple block bootstrap and discussed why bootstrapping a time series is more complicated than for regression models that assume independent errors. Briefly, when you perform residual resampling on a time series, you need to resample in blocks to simulate the autocorrelation in the original series. In the second article, I introduced the moving-block bootstrap. For both methods, all blocks are the same size, and the block size must evenly divide the length of the series (n).

In contrast, the stationary block bootstrap uses blocks of random lengths. This article describes the stationary block bootstrap and shows how to implement it in SAS.

The "stationary" bootstrap gets its name from Politis and Romano (1992), who developed it. When applied to stationary data, the resampled pseudo-time series is also stationary.

Choosing the length of a block

In the moving-block bootstrap, the starting location for a block is chosen randomly, but all blocks have the same length. For the stationary block bootstrap, both the starting location and the length of the block are chosen at random.

The distribution for the block length is typically chosen from a geometric distribution with parameter p. If you want the average block length to be L, define p = 1/L because the expected value of a geometric distribution with parameter p is 1/p.

Why do we choose lengths from the geometric distribution? Well, because a length from the geometric distribution is easy to interpret in terms of a discrete process. Imagine performing the following process:

  • Choose an initial location, t, at random.
  • With probability p, choose a different location (at random) for the next residual and start a new block.
  • With probability (1-p), choose the location t+1 for the next residual and add that residual to the current block.
By definition, this process generates blocks whose lengths are geometrically distributed: L ~ Geom(p).

Generate a random block

There are several ways to implement the stationary block bootstrap in SAS. A straightforward method is to generate a starting integer, t, uniformly at random in the range [1, n], where n is the length of the series. Then, choose a length, L ~ Geom(p). If t + L exceeds n, truncate the block. (Some practitioners "wrap" the block to include residuals at the beginning of the series. This alternative method is called the circular block bootstrap.) Continue this process until the total length of the blocks is n. You will usually need to truncate the length of the last block so that the sum of the lengths is exactly n. This process is implemented in the following SAS/IML user-defined functions:

/* STATIONARY BLOCK BOOTSTRAP */
/* Use a random starting point, t, and a random block length, L ~ Geom(p), 
   - probability p of starting a new block at a new position
   - probability (1-p) of using the next residual from the current block
*/
%let p = (1/12);                /* expected block length is 1/p */ 
proc iml;
/* generate indices for a block of random length L~Geom(p)
   chosen from the sequence 1:n */
start RandBlockIdx(n, p);
   beginIdx = randfun(1, "Integer", n); /* random start: 1:n */
   L = randfun(1, "Geom", p);           /* random length */
   endIdx = min(beginIdx+L-1, n);       /* truncate into 1:n */
   return( beginIdx:endIdx );           /* return sequence of indices in i:n */
finish;
 
/* generate indices for one stationary bootstrap resample */
start StationaryIdx(n, p);
   bootIdx = j(n,1,.);
   s = 1;                       /* starting index for first block */
   do while( s <= n );
      idx = RandBlockIdx(n, p); /* get block of random length */
      if ncol(idx) > n-s then   /* truncate if block too long */
         idx = idx[, 1:n-s+1];
      L = ncol(idx);            /* length of this block */
      jdx = s:s+L-1;            /* all indices for this block */
      bootIdx[jdx] = idx;       /* assign this block */
      s = s + L;                /* starting index for next block */
   end;
   return bootIdx;
finish;

Let's run these functions on some sample data. The first article in this series shows how to fit a model to the Sashelp.Air data and write the predicted and residual values. The following statements use the stationary block bootstrap to generate one bootstrap sample of length n from the residuals. When you add the bootstrapped residuals to the predicted values, you get a "new" time series, which is graphed:

use OutReg;                     /* data to bootstrap: Y = Pred + Resid */
   read all var {'Time' 'Pred' 'Resid'};
close;
 
call randseed(1234);
n = nrow(Pred);                 /* length of series */
prob = &p;                      /* probability of jumping to a new block */
 
/*---- Visualization Example: generate one bootstrap resample ----*/
idx = StationaryIdx(n, prob);
YBoot = Pred + Resid[idx];
 
/* visualize the places where a sequence is not ascending:
https://blogs.sas.com/content/iml/2013/08/28/finite-diff-estimate-maxi.html */
blocksIdx = loc(dif(idx) <= 0); /* where is the sequence not ascending? */
b = Time[blocksIdx];
 
title "One Bootstrap Resample";
title2 "Stationary Block Bootstrap";
refs = "refline " + char(b) + " / axis=x;";
call series(Time, YBoot) other=refs;
/*---- End of visualization example ----*/

The graph shows a single bootstrap sample. I have added vertical reference lines to show the blocks of residuals. For this random sample, there are nine blocks whose lengths vary between 2 (the last block) and 28. If you choose another random sample, you will get a different set of blocks that have different lengths.

The stationary block bootstrap

In practice, the bootstrap method requires many bootstrap resamples. You can call the StationaryIdx function repeatedly in a loop to generate B=1000 bootstrap samples, as follows:

/* A complete stationary block bootstrap generates B resamples
   and usually writes the resamples to a SAS data set. */
B = 1000;
SampleID = j(n,1,.);
create BootOut var {'SampleID' 'Time' 'YBoot'};  /* create outside of loop */
do i = 1 to B;
   SampleId[,] = i;   /* fill array. See https://blogs.sas.com/content/iml/2013/02/18/empty-subscript.html */
   idx = StationaryIdx(n, prob);
   YBoot = Pred + Resid[idx];
   append;
end;
close BootOut;

The BootOut data set contains all the bootstrap samples. You can use a BY-group analysis to obtain the bootstrap estimates for almost any statistic. For example, the following scatter plot shows the distribution of 1000 bootstrap estimates of the slope and intercept parameters for an autoregressive model of the series. Reference lines indicate the estimates for the original data.

You can download the complete SAS program that performs all computations and creates all graphs in this article.

Summary

This article is the third of a series. It shows how to perform a stationary block bootstrap by randomly sampling blocks of residuals. Whereas the simple-block and the moving-block bootstrap methods use blocks that have a constant length, the stationary block bootstrap chooses blocks of random lengths.

In researching these blog posts, I read several articles. I recommend Bates (2019) for an introduction and Politis and Romano (1994) for details.

The post The stationary block bootstrap in SAS appeared first on The DO Loop.

1月 182021
 

Recommended soundtrack for this blog post: Netflix Trip by AJR.

This week's news confirms what I already knew: The Office was the most-streamed television show of 2020. According to reports that I've seen, the show was streamed for 57 billion minutes during this extraordinary year. I'm guessing that's in part because we've all been shut in and working from home; we crave our missing office interactions. We lived vicariously (and perhaps dysfunctionally) through watching Dunder Mifflin staff. But another major factor was the looming deadline of the departure of The Office from Netflix as of January 1, 2021. It was a well-publicized event, so Netflix viewers had to get their binge on while they could.

People in my house are fans of the show, and they account for nearly 6,000 of those 57 billion streaming minutes. I can be this precise (nerd alert!) because I'm in the habit of analyzing our Netflix activity by using SAS. In fact, I can tell you that since late 2017, we've streamed 576 episodes of The Office. We streamed 297 episodes in 2020. (Since the show has only 201 episodes we clearly we have a few repeats in there.)

I built a heatmap that shows the frequency and intensity of our streaming of this popular show. In this graph each row is a month, each square is a day. White squares are Office-free. A square with any red indicates at least one virtual visit with the Scranton crew; the darker the shade, the more episodes streamed during that day. You can see that Sept 15, 2020 was a particular big binge with 17 episodes. (Each episode is about 20-21 minutes, so it's definitely achievable.)

netflix trip through The Office

Heatmap of our household streaming of The Office

How to build the heatmap

To build this heatmap, I started with my Netflix viewing history (downloaded from my Netflix account as CSV files). I filtered to just "The Office (U.S.)" titles, and then merged with a complete "calendar" of dates between late 2017 and the start of 2021. Summarized and merged, the data looks something like this:


With all of the data summarized in this way such that there is only one observation per X and Y value, I can use the HEATMAPPARM statement in PROC SGPLOT to visualize it. (If I needed the procedure to summarize/bin the data for me, I would use the HEATMAP statement. Thanks to Rick Wicklin for this tip!)

proc sgplot data=ofc_viewing;
 title height=2.5 "The Office - a Netflix Journey";
 title2 height=2 "&episodes. episodes streamed on &days. days, over 3 years";
 label Episodes="Episodes per day";
 format monyear monyy7.;
 heatmapparm x=day y=monyear 
   colorresponse=episodes / x2axis
    outline
   colormodel=(white  CXfcae91 CXfb6a4a CXde2d26 CXa50f15) ;
 yaxis  minor reverse display=(nolabel) 
  values=(&amp;allmon.)
  ;
 x2axis values=(1 to 31 by 1) 
   display=(nolabel)  ;
run;

You can see the full code -- with all of the data prep -- on my GitHub repository here. You may even run the code in your own SAS environment -- it will fetch my Netflix viewing data from another GitHub location where I've stashed it.

Distribution of Seasons (not "seasonal distribution")

If you examine the heatmap I produced, you can almost see our Office enthusiasm in three different bursts. These relate directly to our 3 children and the moments they discovered the show. First was early 2018 (middle child), then late 2019 (youngest child), then late 2020 (oldest child, now 22 years old, striving to catch up).

The Office ran for 9 seasons, and our kids have their favorite seasons and episodes -- hence the repeated viewings. I used PROC FREQ to show the distribution of episode views across the seasons:


Season 1 is remarkably low for two reasons. First and most importantly, it contains the fewest episodes. Second, many viewers agree that Season 1 is the "cringiest" content, and can be uncomfortable to watch. (This Reddit user leaned into the cringe with his data visualization of "that's what she said" jokes.)

From the data (and from listening to my kids), I know that Season 2 is a favorite. Of the 60 episodes we streamed at least 4 times, 19 of them were in Season 2.

More than streaming, it's an Office lifestyle

Office fandom goes beyond just watching the show. Our kids continue to embrace The Office in other mediums as well. We have t-shirts depicting the memes for "FALSE." and "Schrute Farms." We listen to The Office Ladies podcast, hosted by two stars of the show. In 2018 our daughter's Odyssey of the Mind team created a parody skit based on The Office (a weather-based office named Thunder Mifflin) -- and advanced to world finals.

Rarely does a day go by without some reference to an iconic phrase or life lesson that we gleaned from The Office. We're grateful for the shared experience, and we'll miss our friends from the Dunder Mifflin Paper Company.

The post Visualizing our Netflix Trip through <em>The Office</em> appeared first on The SAS Dummy.

1月 182021
 

What if you had a technology solution that creates a real-time link between the customer demand signal and what's happening on the ground? What if plans that are being steered centrally could  finally be connected to every shipping lane, while simultaneously, creating cost saving carrier adjustments? The first-of-its kind integration [...]

SAS and C.H. Robinson are rewriting the rules of transportation planning and management was published on SAS Voices by Charlie Chase

1月 182021
 

Have you ever heard of the DOLIST syntax? You might know the syntax even if you are not familiar with the name. The DOLIST syntax is a way to specify a list of numerical values to an option in a SAS procedure. Applications include:

  • Specify the end points for bins of a histogram
  • Specify percentiles to be output to a data set
  • Specify tick marks for a custom axis on a graph
  • Specify the location of reference lines on a graph
  • Specify a list of parameters for an algorithm. Examples include smoothing parameters (the SMOOTH= option in PROC LOESS), sample sizes (the NTOTAL= option in PROC POWER), and initial guess for parameters in an optimization (the PARMS statement in PROC NLMIXED and PROC NLIN)

This article demonstrates how to use the DOLIST syntax to specify a list of values in SAS procedures. It shows how to use a single statement to specify individual values and also a sequence of values.

The DOLIST syntax enables you to write a single statement that specifies individual values and one or more sequences of values. The DOLIST syntax should be in the toolbox of every intermediate-level SAS programmer!

The DOLIST syntax in the SAS DATA step

According to the documentation of PROC POWER, the syntax described in this article is sometimes called the DOLIST syntax because it is based on the syntax for the iterative DO loop in the DATA step.

The most common syntax for a DO loop is DO x = start TO stop BY increment. For example, DO x = 10 TO 90 BY 20; iterates over the sequence of values 10, 30, 50, 70, and 90. If the increment is 1, you can omit the BY increment portion of the statement. However, you can also specify values as a common-separated list, such as DO x = 10, 30, 50, 70, 90;, which generates the same values. What you might not know is that you can combine these two methods. For example, in the following DATA step, the values are specified by using two comma-separated lists and three sequences. For clarity, I have placed each list on a separate line, but that is not necessary:

/* the DOLIST syntax for a DO loop in the DATA step */
data A;
do pctl = 5,                  /* individual value(s)  */
          10 to 50 by 20,     /* a sequence of values */
          54.3, 69.1,         /* individual value(s)  */
          80 to 90 by 5,      /* another sequence     */
          60 to 40 by -20;    /* yet another sequence */
   output;
end;
run;
 
proc print; run;

The output (not shown) is a list of values: 5, 10, 30, 50, 54.3, 69.1, 80, 85, 90, 60, 40. Notice that the values do not need to be in sorted order, although they often are.

The expressions to the right of the equal sign are what I mean by the "DOLIST syntax." You can use the same syntax to specify a list of options in many SAS procedures. When the SAS documentation says that an option takes a "list of values," you can often use a comma-separated list, a space-separated list, and the syntax start TO stop BY increment. (Or a combination of these expressions!) The following sections provide a few examples, but there are literally hundreds of options in SAS that support the DOLIST syntax!

Some procedures (for example, PROC SGPLOT) require the DOLIST values to be in parentheses. Consequently, I have adopted the convention of always using parentheses around DOLIST values, even if the parentheses are not strictly required. As far as I know, it is never wrong to put the DOLIST inside parentheses, and it keeps me from having to remember whether parentheses are required. The examples in this article all use parentheses to enclose DOLIST values.

Histogram bins and percentiles

You can use the DOLIST syntax to specify the endpoints of bins in a histogram. For example, in PROC UNIVARIATE, the ENDPOINTS= option in the HISTOGRAM statement supports a DOLIST. Because histograms use evenly spaced bins, usually you will specify only one sequence, as follows:

proc univariate data=sashelp.cars;
   var weight;
   histogram weight / endpoints=(1800 to 7200 by 600);   /* DOLIST sequence expression */
run;

You can also use the DOLIST syntax to specify percentiles. For example, the PCTLPTS= option on the OUTPUT statement enables you to specify which percentiles of the data should be written to a data set:

proc univariate data=sashelp.cars;
   var MPG_City;
   output out=UniOut pctlpre=P_  pctlpts=(50 75, 95 to 100 by 2.5);  /* DOLIST */
run;

Notice that this example specifies both individual percentiles (50 and 75) and a sequence of percentiles (95, 97.5, 100).

Tick marks and reference lines

The SGPLOT procedure enables you to specify the locations of tick marks on the axis of a graph. Most of the time you will specify an evenly spaced set of values, but (just for fun) the following example shows how you can use the DOLIST syntax to combine evenly spaced values and a few custom values:

title "Specify Ticks on the Y Axis";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=Mpg_City;
   yaxis grid values=(10 to 40 by 5, 50 60); /* DOLIST; commas optional */
run;

As shown in the previous example, the GRID option on the XAXIS and YAXIS statements enables you to display reference lines at each tick location. However, sometimes you want to display reference lines independently from the tick marks. In that case, you can use the REFLINE statement, as follows:

title "Many Reference Lines";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City;
   refline (1800 to 6000 by 600, 7000) / axis=x;  /* many reference lines */
run;

Statistical procedures

Many statistical procedures have options that support lists. In most cases, you can use the DOLIST syntax to provide values for the list.

I have already written about how to use the DOLIST syntax to specify initial guesses for the PARM statement in PROC NLMIXED and PROC NLIN. The documentation for the POWER procedure discusses how to specify lists of values and uses the term "DOLIST" in its discussion.

Some statistical procedures enable you to specify multiple parameter values, and the analysis is repeated for each parameter in the list. One example is the SMOOTH= option in the MODEL statement of the LOESS procedure. The SMOOTH= option specifies values of the loess smoothing parameter. The following call to PROC LOESS fits four loess smoothers to the data. The call to PROC SGPLOT overlays the smoothers on a scatter plot of the data:

proc loess data=sashelp.cars plots=none;
   model MPG_City = Weight / smooth=(0.1 to 0.5 by 0.2, 0.75); /* value-list */
   output out=LoessOut P=Pred;
run;
 
proc sort data=LoessOut; by SmoothingParameter Weight; run;
proc sgplot data=LoessOut;
   scatter x=Weight y=MPG_City / transparency=0.9;
   series x=Weight y=Pred / group=SmoothingParameter 
          curvelabel curvelabelpos=min;
run;

Summary

In summary, this article describes the DOLIST syntax in SAS, which enables you to simultaneously specify individual values and evenly spaced sequences of values. A sequence is specified by using the start TO step BY increment syntax. The DOLIST syntax is valid in many SAS procedures and in the DATA step. In some procedures (such as PROC SGPLOT), the syntax needs to be inside parentheses. For readability, you can use commas to separate individual values and sequences.

Many SAS procedures accept the special syntax even if it is not explicitly mentioned in the documentation. In the documentation for an option, look for terms such as value-list or numlist or value-1 <...value-n>, which indicate that the option supports the DOLIST syntax.

The post The DOLIST syntax: Specify a list of numerical values in SAS appeared first on The DO Loop.