April 25, 2019
 

I’m excited because in a couple days I will fly to Dallas for SAS Global Forum 2019, the biggest SAS conference of the year, attended by thousands.

If you are coming, I hope you will say hello to me. If you can’t make it to Dallas, you’ll be glad to know that many presentations will be livecast. Here is the schedule.

A few highlights:

Sunday, April 28, 7:00-8:30 pm CT–Opening Session

Monday, April 29, 8:30-10:00 am CT–General Session: Technology Connection

Tuesday, April 30, 3:00-4:00 pm CT–Career Advice We’d Give to Our Kids: A Panel Discussion

Wednesday, May 1, 10:30-11:30 am CT–The Good, the Bad, and the Creepy: Why Data Scientists Need to Understand Ethics

These presentations may not be available after the conference, so check the schedule and make sure to tune in at the right time.

April 24, 2019
 

The CUSUM test has many incarnations. Different areas of statistics use different assumptions and test different hypotheses. This article presents a brief overview of CUSUM tests and gives an example of using the CUSUM test in PROC AUTOREG for autoregressive models in SAS.

A CUSUM test uses the cumulative sum of some quantity to investigate whether a sequence of values can be modeled as random. Here are some examples:

  • A sequence of binary values (call them +1 and -1) might appear to be random, like a coin flip, or nonrandom. A random sequence has a cumulative sum that does not deviate far from zero, as I've discussed in a previous article about the CUSUM test for randomness of a binary sequence.
  • In quality control, the CUSUM chart and test are used to monitor whether a process is drifting away from its mean. The CUSUM chart is centered around the mean value of the process. The process is said to be "out of control" if the cumulative sums of the standardized deviations exceed a specified range. The documentation for the CUSUM procedure in SAS/QC software includes an example and a page of formulas that describe the statistics behind the CUSUM chart.
  • In time series analysis, the CUSUM statistics use the sequence of residual deviations from a model to indicate whether the autoregressive model is misspecified. The CUSUM statistics are produced by PROC AUTOREG in SAS/ETS software.

Whereas the CUSUM test for a binary sequence uses cumulative sums for a discrete {+1, -1} sequence, the other tests assume that the sequence is a random sequence of normally distributed values. The main idea behind the tests is the same: The test statistic measures how far the sequence has drifted away from an expected value. If the sequence drifts too far too fast, the sequence is unlikely to be random.
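To make the idea of a drifting cumulative sum concrete, here is a minimal sketch that simulates a sequence of independent standard normal values and plots its running total. The data set name (Walk), the variable names, the seed, and the sequence length are all assumptions chosen for illustration; they do not come from any of the CUSUM tests described above.

```sas
/* Hypothetical illustration: cumulative sums of independent N(0,1) values */
data Walk;
   call streaminit(12345);      /* arbitrary seed so the run is reproducible */
   do t = 1 to 100;
      z = rand("Normal");       /* standardized deviation from the expected value */
      cusum + z;                /* sum statement: running total, starts at 0 */
      output;
   end;
run;

proc sgplot data=Walk;
   series x=t y=cusum;          /* path of the cumulative sum */
   refline 0 / axis=y;          /* expected value of the sum */
run;
```

For a random sequence, the path wanders but tends to stay near zero. A CUSUM test formalizes how far is too far for a sequence of a given length.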

CUSUM test for time series

Let's see how the CUSUM test in PROC AUTOREG can help to identify a misspecified model. For simplicity, consider two response variables, one that is linear in time (with uncorrelated errors) and the other that is quadratic in time. If you fit a linear model to both variables, the CUSUM test can help you to see that the model does not fit the quadratic data.

In a previous article, I discussed Anscombe's quartet and created two series that have the same linear fit and correlation coefficient. These series are ideal to use for the CUSUM test because the first series is linear whereas the second is quadratic. The following calls to PROC AUTOREG fit a linear model to each variable.

ods graphics on;
/* PROC AUTOREG models a time series with autocorrelation */
proc autoreg data=Anscombe2;
  Linear: model y1 = x;        /* Y1 is linear. Model is correctly specified. */
  output out=CusumLinear cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;
 
proc autoreg data=Anscombe2;
  Quadratic: model y2 = x;     /* Y2 is quadratic. Model is misspecified. */
  output out=CusumQuad cusum=cusum cusumub=upper cusumlb=lower recres=RecursiveResid;
run;

The AUTOREG procedure creates a panel of standard residual diagnostic plots. The panel includes a plot of the residuals and a fit plot that shows the fitted model and the observed values. For the linear data, the residual plots seem to indicate that the model fits the data well:

In contrast, the same residual panel for the quadratic data indicates a systematic pattern in the residuals:

If this were a least squares model, which assumes independence of the residuals, those residual plots would indicate that this data-model combination does not satisfy the assumptions of the least squares regression model. For an autoregressive model, however, raw residuals can be correlated and exhibit a pattern. To determine whether the model is misspecified, PROC AUTOREG supports a special kind of residual analysis that uses recursive residuals.

The recursive residual for the kth point is formed by fitting a line to the first k-1 points and then forming a standardized residual for the kth point. The complete formulas are in the AUTOREG documentation. Galpin and Hawkins (1984) suggest plotting the cumulative sums of the recursive residuals as a diagnostic plot. Galpin and Hawkins credit Brown, Durbin, and Evans (1975) with proposing the CUSUM plot of the recursive residuals. The statistics output from the AUTOREG procedure are different from those in Galpin and Hawkins, but the idea and purpose behind the CUSUM charts are the same.
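For reference, the textbook form of the recursive residual for a linear regression, in the style of Brown, Durbin, and Evans, can be written as follows. The notation is an assumption made for this sketch: p is the number of regression parameters, b_{k-1} is the least squares estimate from the first k-1 observations, X_{k-1} is the corresponding design matrix, and \hat{\sigma} is an estimate of the standard deviation of the recursive residuals. The exact scaling that PROC AUTOREG uses may differ, so consult the AUTOREG documentation for the formulas behind the output statistics.

```latex
w_k = \frac{y_k - x_k^{\top} b_{k-1}}
           {\sqrt{1 + x_k^{\top}\left(X_{k-1}^{\top} X_{k-1}\right)^{-1} x_k}},
\qquad k = p+1, \ldots, n,
\qquad
C_t = \frac{1}{\hat{\sigma}} \sum_{k=p+1}^{t} w_k .
```

Under a correctly specified model with independent normal errors, the w_k are independent with mean zero, so the cumulative sums C_t behave like a random walk around zero.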

Galpin and Hawkins show a panel of nine plots that display different patterns that you might see in the CUSUM plots. I have reproduced two of the plots from the paper. (Remember, these graphs were produced in 1984!) The graph on the left shows what you should see for a correctly specified model. The cumulative sums stay within a region near the expected value of zero. In contrast, the graph on the right is one example of a CUSUM plot for a misspecified model.

The previous calls to PROC AUTOREG wrote the cumulative sums and the upper and lower boundaries of the confidence region to a data set. You can use PROC SGPLOT to create the CUSUM plot. The BAND statement is used to draw the confidence band:

ods layout gridded columns=2 advance=table;
 proc sgplot data=CusumLinear noautolegend;
    band x=x lower=lower upper=upper;
    series x=x y=cusum / break markers;
    refline 0  /axis=y noclip;
    xaxis grid; yaxis grid;
 run;
 proc sgplot data=CusumQuad noautolegend;
    band x=x lower=lower upper=upper;
    series x=x y=cusum / break markers;
    refline 0  /axis=y noclip;
    xaxis grid; yaxis grid;
 run;
ods layout end;
CUSUM graphs of cumulative sums of recursive residuals

The graph on the left looks like a random walk on independent normal data. The cumulative sums stay within the colored confidence region. The model seems to fit the data. In contrast, the graph on the right quickly leaves the shaded region, which indicates that the model is misspecified.

In summary, there are many statistical tests that use a CUSUM statistic to determine whether deviations are random. These tests appear in many areas of statistics, including random walks, quality control, and time series analysis. For quality control, SAS supports the CUSUM procedure in SAS/QC software. For time series analysis, the AUTOREG procedure in SAS supports CUSUM charts of recursive residuals, which enable you to diagnose misspecified models.

You can download the SAS program that generates the graphs in this article.

The post A CUSUM test for autoregressive models appeared first on The DO Loop.

April 24, 2019
 
The ODS Excel destination, which became a production feature in SAS 9.4M3 (TS1M3), generates Microsoft Excel workbooks in native XLSX format. By default, this destination creates a new worksheet in the workbook for each output object (e.g., a table or graphic) that it encounters. The ODS Excel destination is also flexible, enabling you to control where output objects are placed on a worksheet. This post demonstrates the destination’s flexibility and how you can modify its default behavior by using the ODS EXCEL statement's SHEET_INTERVAL= option.

Adding tables and graphics on the same Microsoft Excel worksheet

By default, the ODS Excel destination adds a new worksheet for each table and graphic. However, at times, you might not want to use this default behavior. If you want more control over this, the SHEET_INTERVAL= ODS Excel option determines when a new worksheet is created. Valid values for the SHEET_INTERVAL= option include:

  • TABLE (the default value) - new sheet for each table in output
  • NONE - keep the output that follows on the current sheet
  • PAGE - new sheet for each page of SAS output
  • PROC - new sheet beginning at the PROC step boundary
  • BYGROUP - new sheet for each BY group of output
  • NOW - begin a new sheet immediately

The value NOW, new for SAS 9.4M5 (TS1M5), triggers the creation of a new worksheet when the destination encounters the next output object.
As an example of opting not to use the default behavior, consider a case where you have a CONTENTS procedure without any options. This procedure generates three separate worksheets with the data-set attributes, the engine host information, and the variable list.

The following display shows the default output, with three individual worksheets:

However, if you want to place all three objects on a single worksheet, you can do that by setting the option SHEET_INTERVAL="NONE". The option setting SHEET_INTERVAL="PROC" could also be used in this example; it creates a new worksheet only when a new procedure is encountered.
The following example illustrates how to use this option to include all your output on the same worksheet:

ods excel file="c:\test.xlsx" options(sheet_interval="none");
proc contents data=sashelp.class;
run; 
ods excel close;

Output
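As a sketch of the SHEET_INTERVAL="PROC" alternative mentioned above, the following program runs two procedures. The file path and the choice of PROC MEANS as the second procedure are assumptions made for illustration. With this setting, the three PROC CONTENTS tables are expected to share the first worksheet, and the PROC MEANS table is expected to start a second worksheet.

```sas
/* Hypothetical file name; adjust the path for your system */
ods excel file="c:\temp\test_proc.xlsx" options(sheet_interval="proc");

proc contents data=sashelp.class;   /* all tables from this step share a worksheet */
run;

proc means data=sashelp.class;      /* a new procedure starts a new worksheet */
run;

ods excel close;
```

Compared with SHEET_INTERVAL="NONE", this setting still separates the output of different procedures, which can be convenient when each procedure's output belongs on its own tab.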

Adding text and tables to a new worksheet

Two of the most popular ways to add text on worksheets are to use either the ODS TEXT= statement or the ODSTEXT procedure with the Excel destination. The following example adds text to a worksheet by using the ODS TEXT= statement. You include this statement before each PRINT procedure in this example:

ods excel file="c:\temp\test.xlsx";
ods excel options(sheet_name="Females");
 
ods text="Data for Female Patients";
proc print data=sashelp.class(where=(sex="F"));
run;
 
ods excel options(sheet_name="Males");
ods text="Data for Male Patients";
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
ods excel close;

Output

Notice that the first text string appears at the top of the first worksheet as expected. However, the text from the second ODS TEXT= statement appears at the bottom of that same worksheet rather than at the top of the next worksheet, which contains the related data table. This behavior illustrates that the ODS TEXT= statement is not very flexible: used by itself, it offers no good way to place the text on the second worksheet.

However, you can use the SAS 9.4 ODSTEXT procedure in combination with the SHEET_INTERVAL= option to move the text string to the appropriate worksheet.
The following example uses PROC ODSTEXT and the SHEET_INTERVAL= option to move the text string "Data for Male Patients" to the top of the second worksheet:

ods excel file="c:\temp\test.xlsx";
ods excel options(sheet_name="Females");
 
ods text="Data for Female Patients";
proc print data=sashelp.class(where=(sex="F"));
run;
 
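/* NOW: start a new worksheet at the next output object */
ods excel options(sheet_name="Males" sheet_interval="now");
/* NONE: keep the output that follows on that worksheet */
ods excel options(sheet_name="Males" sheet_interval="none");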
 
proc odstext;
   p "Data for Male Patients";
run;
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
ods excel close;

Output

Adding multiple tables or graphs on the same worksheet

This final example demonstrates how you can use the SHEET_INTERVAL= option to add multiple tables and graphics to the same Excel worksheet. First, the SHEET_INTERVAL="NONE" option in the first ODS EXCEL statement places the first table and graph on the same worksheet. Then, the SHEET_INTERVAL="NOW" option in the second ODS EXCEL statement creates a second worksheet, and the second table and graph are written to that worksheet:

ods graphics / height=2.5in width=3.5in;
ods excel file="c:\scratch\test.xlsx" options(sheet_interval="none");
 
proc print data=sashelp.class(where=(sex="F"));
run;
 
proc sgplot data=sashelp.class(where=(sex="F"));
scatter x=age y=height;
run;
 
ods excel options(sheet_interval="now");
 
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
proc sgplot data=sashelp.class(where=(sex="M"));
scatter x=age y=height;
run;
 
ods excel close;

Output

Control the location of tables, graphs, and text with ODS Excel was published on SAS Users.

April 23, 2019
 

From saving cheetahs to preserving fresh water systems, analytics plays a role in many inspiring conservation projects around the world. Read on to learn about a few of our favorites. 1. Protecting animals from extinction  Learn how analyzing footprints could help save cheetahs in Namibia and other endangered animals around [...]

5 inspiring reads for Earth Day was published on SAS Voices by Alison Bolen

April 22, 2019
 

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and understanding (NLU) – that day may not be far off.

In order to provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words through interaction, not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience; for example, words like red, heavy, and above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves and not from what they represent. AI systems that are trained only on textual data, without real-world representations, lack true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researcher gives the agents simple English words, the agents inevitably diverge to their own unintelligible language. Recently, researchers at Facebook, Google, and OpenAI all experienced this phenomenon!

Agents are reward driven. If there is no reward for using English (or human language) then the agents will develop a more efficient shorthand for themselves.

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black box technologies. A lot of data is fed into the AI, in this case a neural network, to train on and develop its own rules. The model is then fed new data, which it uses to produce answers or information. The black box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives the output (answers). If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can’t follow how it is making its decisions and what it is telling other AI?

Still, the experiment demonstrates how machines are redefining people’s understanding of so many realms once believed to be exclusively human, such as language. The Facebook researchers concluded that it offered a fascinating insight into human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, and will have applications in medicine, law, and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI’s tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.

April 22, 2019
 

Many statistical tests use a CUSUM statistic as part of the test. It can be confusing when a researcher refers to "the CUSUM test" without providing details about exactly which CUSUM test is being used. This article describes a CUSUM test for the randomness of a binary sequence. You start with a long sequence of binary values such as heads and tails from a coin toss. The test tries to determine whether the sample comes from a Bernoulli distribution with probability p=0.5. In short, is the binary sequence random?

The CUSUM test for randomness of a binary sequence is one of the NIST tests for verifying that a random or pseudorandom generator is generating bits that are indistinguishable from truly random bits (Rukhin, et al., 2000, revised 2010, pp. 2-31 through 2-33). The test is straightforward to implement. You first translate the data to {-1, +1} values. You then compute the cumulative sums of the sequence. If the sequence is random, the cumulative sum is equivalent to the position of a random walker who takes unit steps, and the sums will not move away from 0 (the expected sum) too quickly. I've previously visualized the random walk with unit steps (sometimes called a "Drunkard's walk").

Before proceeding to the CUSUM test, I should mention that this test is often used in conjunction with other tests, such as the "runs test" for randomness. That is because a perfectly alternating sequence such as 0101010101... will pass the CUSUM test even though the sequence is clearly not randomly generated. In fact, any sequence that repeatedly has k zeros followed by k ones also passes the test, provided that k is small enough.
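The following sketch shows why a perfectly alternating sequence passes. It builds the {-1, +1} version of 0101... and computes the maximum absolute cumulative sum, which is the quantity the CUSUM test examines. The sequence length and variable names are assumptions chosen for illustration.

```sas
proc iml;
n = 100;                          /* assumed sequence length */
x = j(1, n, -1);                  /* start with all -1 (the 0 bits) */
x[ do(2, n, 2) ] = 1;             /* even positions are +1 (the 1 bits) */
z = max(abs(cusum(x)));           /* maximum deviation of the cumulative sums */
print z;
quit;
```

The cumulative sums of this sequence alternate between -1 and 0, so the maximum deviation never exceeds 1. Because the deviation is so small, the CUSUM test finds no evidence against randomness, which is why it is usually paired with a test such as the runs test.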

The CUSUM test for randomness of a binary sequence in SAS

The NIST report contains an example of calling the CUSUM test with a sequence of length N=100. The following SAS/IML statements define a sequence of {0, 1} values, convert those values to {-1, +1}, and plot the cumulative sums:

proc iml;
eps = {1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1
       1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 
       0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 
       0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 
       0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 }; 
x = 2*eps - 1;            /* convert to {-1, +1} sequence */
S = cusum(x);
 
title "Cumulative Sums of Sequence of {-1, +1} Values";
call series(1:ncol(S), S) option="markers" other="refline 0 / axis=y" label="Observation Number";
Plot of the cumulative sums of a binary sequence of values in {-1, +1}.

The sequence contains 58 values of one category and 42 values of the other. For a binomial distribution with p=0.5, the probability of a sample that has proportions at least this extreme is about 13%, as shown by the computation 2*cdf("Binomial", 42, 0.5, 100);. Consequently, the proportions are not unduly alarming. However, to test whether the sequence is random, you need to consider not only the proportion of values, but also the sequence. The graph of the cumulative sums of the {-1, +1} sequence shows a drift away from the line S=0, but it is not clear from the graph whether the deviation is more extreme than would be expected for a random sequence of this length.

The CUSUM test gives you a way to quantify whether the sequence is likely to have occurred as a random draw from a Bernoulli(p=0.5) distribution. The test statistic is the maximum deviation from 0. As you can see from the graph, the test statistic for this sequence is 16. The NIST paper provides a formula for the probability that a statistic at least this extreme occurs in a random sequence of length N=100. I implemented a (vectorized) version of the formula in SAS/IML.

/* NIST CUSUM test for randomness in a binary {-1, +1} sequence.
   https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf
   Section 2.13.  Pages 2-31 through 2-33.
   INPUT: x is sequence of {-1, +1} values.      */
start BinaryCUSUMTest(x, PrintTable=1, alpha=0.01);
   S = colvec( cusum(x) );     /* cumulative sums */
   n = nrow(S);
   z = max(abs(S));            /* test statistic = maximum deviation */
 
   /* compute probability of this test statistic for a sequence of this length */
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
 
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   pValue = 1 - sum1 + sum2;
 
   /* optional: print the test results in a nice human-readable format */
   cusumTest = z || pValue;
   if PrintTable then do;
      print cusumTest[L="Result of CUSUM Test" c={"Test Statistic" "p Value"}];
      labl= "H0: Sequence is a random binary sequence";
      if pValue <= alpha then 
         msg = "Reject H0 at alpha = " + char(alpha); /* sequence does not seem random */
      else
         msg = "Do not reject H0 at alpha = " + char(alpha); /* sequence seems random */
      print msg[L=labl];
   end;
   return ( cusumTest );
finish;
 
/* call the function for the {-1, +1} sequence */
cusumTest = BinaryCUSUMTest(x);

According to the CUSUM test, there is not sufficient evidence to doubt that the sequence was generated from a random Bernoulli process.

A few comments about the program:

  • If you vectorize the computations, the CUSUM test requires only a few SAS/IML statements. Half of the function is dedicated to printing the results in a friendly format.
  • The computation of the p-value took me a while to puzzle over. The formulas in the NIST article do not include the INT() function for the limits of the summation. But the summation only makes sense when the index of summation (k) is an integer.
  • The significance level for the CUSUM test is usually chosen to be alpha = 0.01.
  • The CUSUM test depends only on the maximum deviation of the cumulative sums (the test statistic) and on the length of the sequence. For a sequence of length 100, the test statistic can be as large as 28 without rejecting the null hypothesis. If the statistic is 29 or larger, then the null hypothesis is rejected and we conclude that the sequence is not generated by a random process.

A neat thing about the CUSUM test is that you can compute the maximum test statistic based only on the sequence length. Thus if you plan to toss a coin 100 times to determine if it is fair, you can stop tossing (with 99% confidence) if the number of heads ever exceeds the number of tails by 29. Similarly, you can stop tossing if you know that the number of excess heads cannot possibly be 29 or greater. (For example, you've tossed 80 times and the current cumulative sum is 5.) You can apply the same argument to excess tails.
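One way to find that threshold is to evaluate the p-value for every possible test statistic and report the smallest statistic that is rejected. The sketch below does that. The helper name CusumPValue is hypothetical; it restates the p-value computation from the BinaryCUSUMTest function above as a function of the statistic z and the length n. The loop over candidate statistics is a convenience for this illustration.

```sas
proc iml;
/* Hypothetical helper: p-value for test statistic z and sequence length n.
   Same formula as in BinaryCUSUMTest, without the printing. */
start CusumPValue(z, n);
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   return ( 1 - sum1 + sum2 );
finish;

n = 100;                          /* sequence length */
alpha = 0.01;                     /* significance level */
p = j(1, n, .);
do z = 1 to n;                    /* every possible maximum deviation */
   p[z] = CusumPValue(z, n);
end;
idx = loc(p <= alpha);            /* statistics for which H0 is rejected */
if ncol(idx) > 0 then do;
   zCrit = min(idx);              /* smallest rejected statistic */
   print zCrit;
end;
quit;
```

For n=100 and alpha=0.01, the printed value can be compared with the threshold of 29 reported above. Changing n or alpha gives the stopping rule for other experiment sizes.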

In summary, this article shows how to implement the CUSUM test for randomness of a binary sequence in SAS. Only a few lines of SAS/IML are required, and you can implement the test without using any loops. Be aware that the CUSUM test is not very powerful because regular sequences can pass the test. For example, the sequence 000111000111000111... has a maximum deviation of 3.

The post The CUSUM test for randomness of a binary sequence appeared first on The DO Loop.