April 24, 2019
 
The ODS Excel destination, which became a production feature in SAS 9.4M3 (TS1M3), generates Microsoft Excel workbooks in native XLSX format. By default, this destination creates a new worksheet for each output object (e.g., a table or graphic) that it encounters, so a single workbook can contain many worksheets. The ODS Excel destination is also flexible, enabling you to control the vertical layout of a worksheet and place output objects wherever you want. This blog demonstrates the destination’s flexibility and how you can modify its default behavior by using the ODS EXCEL statement's SHEET_INTERVAL= option.

Adding tables and graphics on the same Microsoft Excel worksheet

By default, the ODS Excel destination adds a new worksheet for each table and graphic. However, at times you might not want this default behavior. If you want more control, the SHEET_INTERVAL= ODS Excel option determines when a new worksheet is created. Valid values for the SHEET_INTERVAL= option include:

  • TABLE (the default value) - new sheet for each table in output
  • NONE - keep the output that follows on the current sheet
  • PAGE - new sheet for each page of SAS output
  • PROC - new sheet beginning at the PROC step boundary
  • BYGROUP - new sheet for each BY group of output
  • NOW - begin a new sheet immediately

The value NOW, new for SAS 9.4M5 (TS1M5), triggers the creation of a new worksheet when the destination encounters the next output object.
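
To make the list concrete, here is a minimal sketch of the BYGROUP value (the file path and the sort step are my own illustration, not from the original example). Each BY group of the PROC PRINT output lands on its own worksheet:

proc sort data=sashelp.class out=class;
   by sex;
run;
 
ods excel file="c:\temp\bygroups.xlsx" options(sheet_interval="bygroup");
proc print data=class;
   by sex;
run;
ods excel close;
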
As an example of opting out of the default behavior, consider a case where you have a CONTENTS procedure without any options. This procedure generates three separate worksheets: one with the data set attributes, one with the engine/host information, and one with the variable list.

The following table shows the default output that you receive with three individual worksheets:

However, if you want to place all three objects on a single worksheet, you can do so by setting the option SHEET_INTERVAL="NONE". The setting SHEET_INTERVAL="PROC" would also work in this example, because it creates a new worksheet only when a new procedure is encountered.
The following example illustrates how to use this option to include all your output on the same worksheet:

ods excel file="c:\test.xlsx" options sheet_interval="none");
proc contents data=sashelp.class;
run; 
ods excel close;

Output

Adding text and tables to a new worksheet

Two of the most popular ways to add text to worksheets are to use either the ODS TEXT= statement or the ODSTEXT procedure with the Excel destination. The following example adds text to a worksheet by using the ODS TEXT= statement. You include this statement before each PRINT procedure in this example:

ods excel file="c:\temp\test.xlsx";
ods excel options(sheet_name="Females");
 
ods text="Data for Female Patients";
proc print data=sashelp.class(where=(sex="F"));
run;
 
ods excel options(sheet_name="Males");
ods text="Data for Male Patients";
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
ods excel close;

Output

Notice that the first text string appears at the top of the first worksheet, as expected. However, the text from the second ODS TEXT= statement appears at the bottom of that same worksheet rather than at the top of the next worksheet, which contains the related data table. This behavior illustrates that the ODS TEXT= statement is not very flexible: there is no good way to control the placement of the text with this statement alone.

However, you can use the SAS 9.4 ODSTEXT procedure in combination with the SHEET_INTERVAL= option to move the text string to the appropriate worksheet.
The following example uses PROC ODSTEXT and the SHEET_INTERVAL= option to move the text string "Data for Male Patients" to the top of the second worksheet:

ods excel file="c:\temp\test.xlsx";
ods excel options(sheet_name="Females");
 
ods text="Data for Female Patients";
proc print data=sashelp.class(where=(sex="F"));
run;
 
ods excel options(sheet_name="Males" sheet_interval="now");
ods excel options(sheet_name="Males" sheet_interval="none");
 
proc odstext;
   p "Data for Male Patients";
run;
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
ods excel close;

Output

Adding multiple tables or graphs on the same worksheet

This final example demonstrates how you can use the SHEET_INTERVAL= option to add multiple tables and graphics to the same Excel worksheet. First, we use the SHEET_INTERVAL="NONE" option in the first ODS EXCEL statement to place the first table and graph on the same worksheet. Then, the SHEET_INTERVAL="NOW" option is included in the second ODS EXCEL statement to create a second worksheet and write the second table and graph to it:

ods graphics / height=2.5in width=3.5in;
ods excel file="c:\scratch\test.xlsx" options(sheet_interval="none");
 
proc print data=sashelp.class(where=(sex="F"));
run;
 
proc sgplot data=sashelp.class(where=(sex="F"));
scatter x=age y=height;
run;
 
ods excel options(sheet_interval="now");
 
 
proc print data=sashelp.class(where=(sex="M"));
run;
 
proc sgplot data=sashelp.class(where=(sex="M"));
scatter x=age y=height;
run;
 
ods excel close;

Output

Learn more

Control the location of tables, graphs, and text with ODS Excel was published on SAS Users.

April 23, 2019
 

From saving cheetahs to preserving fresh water systems, analytics plays a role in many inspiring conservation projects around the world. Read on to learn about a few of our favorites. 1. Protecting animals from extinction  Learn how analyzing footprints could help save cheetahs in Namibia and other endangered animals around [...]

5 inspiring reads for Earth Day was published on SAS Voices by Alison Bolen

April 22, 2019
 

Imagine a world where satisfying human-computer dialogues exist. With the resurgence of interest in natural language processing (NLP) and understanding (NLU), that day may not be far off.

In order to provide more satisfying interactions with machines, researchers are designing smart systems that use artificial intelligence (AI) to develop better understanding of human requests and intent.

Last year, OpenAI used a machine learning technique called reinforcement learning to teach agents to design their own language. The AI agents were given a simple set of words and the ability to communicate with each other. They were then given a set of goals that were best achieved by cooperating (communicating) with other agents. The agents independently developed a simple ‘grounded’ language.

Grounded vs. inferred language


Human language is said to be grounded in experience. People grasp the meaning of many basic words by interaction – not by learning dictionary definitions by rote. They develop understanding in terms of sensory experience – for example, words like red, heavy, above.

Abstract word meanings are built in relation to more concretely grounded terms. Grounding allows humans to acquire and understand words and sentences in context.

The opposite of a grounded language is an inferred language. Inferred languages derive meaning from the words themselves and not from what they represent. AI that is trained only on textual data, and not on real-world representations, lacks true understanding of what the words mean.

What if the AI agent develops its own language we can’t understand?

It happens. Even if the researcher gives the agents simple English words, the agents inevitably diverge to their own, unintelligible language. Recently, researchers at Facebook, Google, and OpenAI all experienced this phenomenon!

Agents are reward driven. If there is no reward for using English (or human language) then the agents will develop a more efficient shorthand for themselves.

That’s cool – why is that a problem?

When researchers at the Facebook Artificial Intelligence Research lab designed chatbots to negotiate with one another using machine learning, they had to tweak one of their models because otherwise the bot-to-bot conversation “led to divergence from human language as the agents developed their own language for negotiating.” They had to use what’s called a fixed supervised model instead.

The problem there is transparency. Machine learning techniques such as deep learning are black box technologies. A lot of data is fed into the AI, in this case a neural network, to train on and develop its own rules. The model is then fed new data, which it uses to produce answers or information. The black box analogy is used because it is very hard, if not impossible in complex models, to know exactly how the AI derives the output (answers). If AI develops its own languages when talking to other AI, the transparency problem compounds. How can we fully trust an AI when we can’t follow how it is making its decisions and what it is telling other AI?

But it does demonstrate how machines are redefining people’s understanding of so many realms once believed to be exclusively human—like language. The Facebook researchers concluded that it offered a fascinating insight to human and machine language. The bots also proved to be very good negotiators, developing intelligent negotiating strategies.

These new insights, in turn, lead to smarter chatbots that have a greater understanding of the real world and the context of human dialog.

At SAS, we’re developing different ways to incorporate chatbots into business dashboards or analytics platforms. These capabilities have the potential to expand the audience for analytics results and attract new and less technical users.

“Chatbots are a key technology that could allow people to consume analytics without realizing that’s what they’re doing,” says Oliver Schabenberger, SAS Executive Vice President, Chief Operating Officer and Chief Technology Officer in a recent SAS Insights article. “Chatbots create a humanlike interaction that makes results accessible to all.” The evolution of NLP toward NLU has a lot of important implications for businesses and consumers alike.

Satisfying human-computer dialogues will soon exist, and will have applications in medicine, law, and the classroom, to name but a few. As the volume of unstructured information continues to grow exponentially, we will benefit from AI’s tireless ability to help us make sense of it all.

Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
SAS: What are chatbots?
Blog: Let’s chat about chatbots, by Wayne Thompson

Moving from natural language processing to natural language understanding was published on SAS Users.

April 22, 2019
 

Many statistical tests use a CUSUM statistic as part of the test. It can be confusing when a researcher refers to "the CUSUM test" without providing details about exactly which CUSUM test is being used. This article describes a CUSUM test for the randomness of a binary sequence. You start with a long sequence of binary values such as heads and tails from a coin toss. The test tries to determine whether the sample comes from a Bernoulli distribution with probability p=0.5. In short, is the binary sequence random?

The CUSUM test for randomness of a binary sequence is one of the NIST tests for verifying that a random or pseudorandom generator is generating bits that are indistinguishable from truly random bits (Rukhin et al., 2000, revised 2010, pp 2-31 through 2-33). The test is straightforward to implement. You first translate the data to {-1, +1} values. You then compute the cumulative sums of the sequence. If the sequence is random, the cumulative sum is equivalent to the position of a random walker who takes unit steps, and the sums will not move away from 0 (the expected sum) too quickly. I've previously visualized the random walk with unit steps (sometimes called a "Drunkard's walk").

Before proceeding to the CUSUM test, I should mention that this test is often used in conjunction with other tests, such as the "runs test" for randomness. That is because a perfectly alternating sequence such as 0101010101... will pass the CUSUM test even though the sequence is clearly not randomly generated. In fact, any sequence that repeatedly has k zeros followed by k ones also passes the test, provided that k is small enough.
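
You can verify that claim in a few SAS/IML statements. The following sketch (my own check, not part of the NIST example) computes the maximum deviation of the cumulative sums for a perfectly alternating sequence of length 100:

proc iml;
eps = repeat({0 1}, 1, 50);      /* the 0101... pattern, length 100 */
x = 2*eps - 1;                   /* convert to {-1, +1} sequence */
maxDev = max(abs(cusum(x)));     /* maximum deviation is only 1 */
print maxDev;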

The CUSUM test for randomness of a binary sequence in SAS

The NIST report contains an example of calling the CUSUM test with a sequence of length N=100. The following SAS/IML statements define a sequence of {0, 1} values, convert those values to {-1, +1}, and plot the cumulative sums:

proc iml;
eps = {1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1
       1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 
       0 1 1 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 
       0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 
       0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 }; 
x = 2*eps - 1;            /* convert to {-1, +1} sequence */
S = cusum(x);
 
title "Cumulative Sums of Sequence of {-1, +1} Values";
call series(1:ncol(S), S) option="markers" other="refline 0 / axis=y" label="Observation Number";
Plot of the cumulative sums of a binary sequence of values in {-1, +1}.

The sequence contains 58 values of one category and 42 values of the other. For a binomial distribution with p=0.5, the probability of a sample that has proportions at least this extreme is about 13%, as shown by the computation 2*cdf("Binomial", 42, 0.5, 100);. Consequently, the proportions are not unduly alarming. However, to test whether the sequence is random, you need to consider not only the proportion of values, but also the sequence. The graph of the cumulative sums of the {-1, +1} sequence shows a drift away from the line S=0, but it is not clear from the graph whether the deviation is more extreme than would be expected for a random sequence of this length.
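
If you want to reproduce that 13% figure, here is the computation as a short SAS/IML step (a sketch of the calculation mentioned above):

proc iml;
/* two-sided probability of a 58/42 split (or a more extreme one) when n=100, p=0.5 */
p = 2*cdf("Binomial", 42, 0.5, 100);
print p;    /* approximately 0.13 */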

The CUSUM test gives you a way to quantify whether the sequence is likely to have occurred as a random draw from a Bernoulli(p=0.5) distribution. The test statistic is the maximum deviation from 0. As you can see from the graph, the test statistic for this sequence is 16. The NIST paper provides a formula for the probability that a statistic at least this extreme occurs in a random sequence of length N=100. I implemented a (vectorized) version of the formula in SAS/IML.

/* NIST CUSUM test for randomness in a binary {-1, +1} sequence.
   https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf
   Section 2.13.  Pages 2-31 through 2-33.
   INPUT: x is sequence of {-1, +1} values.      */
start BinaryCUSUMTest(x, PrintTable=1, alpha=0.01);
   S = colvec( cusum(x) );     /* cumulative sums */
   n = nrow(S);
   z = max(abs(S));            /* test statistic = maximum deviation */
 
   /* compute probability of this test statistic for a sequence of this length */
   zn = z/sqrt(n);
   kStart = int( (-n/z +1)/4 );
   kEnd   = int( ( n/z -1)/4 );
   k = kStart:kEnd;
   sum1 = sum( cdf("Normal", (4*k+1)*zn) - cdf("Normal", (4*k-1)*zn) );
 
   kStart = int( (-n/z -3)/4 );
   k = kStart:kEnd;
   sum2 = sum( cdf("Normal", (4*k+3)*zn) - cdf("Normal", (4*k+1)*zn) );
   pValue = 1 - sum1 + sum2;
 
   /* optional: print the test results in a nice human-readable format */
   cusumTest = z || pValue;
   if PrintTable then do;
      print cusumTest[L="Result of CUSUM Test" c={"Test Statistic" "p Value"}];
      labl= "H0: Sequence is a random binary sequence";
      if pValue <= alpha then 
         msg = "Reject H0 at alpha = " + char(alpha);        /* sequence does not seem random */
      else
         msg = "Do not reject H0 at alpha = " + char(alpha); /* sequence seems random */
      print msg[L=labl];
   end;
   return ( cusumTest );
finish;
 
/* call the function for the {-1, +1} sequence */
cusumTest = BinaryCUSUMTest(x);

According to the CUSUM test, there is not sufficient evidence to doubt that the sequence was generated from a random Bernoulli process.

A few comments about the program:

  • If you vectorize the computations, the CUSUM test requires only a few SAS/IML statements. Half of the function is dedicated to printing the results in a friendly format.
  • The computation of the p-value took me a while to puzzle over. The formulas in the NIST article do not include the INT() function for the limits of the summation, but the summation makes sense only when the index of summation (k) is an integer.
  • The significance value for the CUSUM test is usually chosen to be alpha = 0.01.
  • The CUSUM test depends only on the maximum deviation of the cumulative sums (the test statistic) and on the length of the sequence. For a sequence of length 100, the test statistic can be as large as 28 without rejecting the null hypothesis. If the statistic is 29 or larger, then the null hypothesis is rejected and we conclude that the sequence is not generated by a random process.

A neat thing about the CUSUM test is that you can compute the maximum test statistic based only on the sequence length. Thus if you plan to toss a coin 100 times to determine if it is fair, you can stop tossing (with 99% confidence) if the number of heads ever exceeds the number of tails by 29. Similarly, you can stop tossing if you know that the number of excess heads cannot possibly be 29 or greater. (For example, you've tossed 80 times and the current cumulative sum is 5.) You can apply the same argument to excess tails.
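
Here is a small sketch of that stopping rule (my own illustration, not part of the NIST test). It simulates 100 tosses of a fair coin and reports the first toss, if any, at which the cumulative deviation reaches 29:

proc iml;
call randseed(54321);
x = randfun(100, "Bernoulli", 0.5);   /* simulate 100 tosses of a fair coin */
x = 2*x - 1;                          /* convert to {-1, +1} */
S = cusum(x);
idx = loc( abs(S) >= 29 );            /* tosses at which the deviation reaches 29 */
if ncol(idx) > 0 then
   print "Reject randomness at toss" (idx[1]);
else
   print "Maximum deviation is at most 28; do not reject H0";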

In summary, this article shows how to implement the CUSUM test for randomness of a binary sequence in SAS. Only a few lines of SAS/IML are required, and you can implement the test without using any loops. Be aware that the CUSUM test is not very powerful because regular sequences can pass the test. For example, the sequence 000111000111000111... has a maximum deviation of 3.

The post The CUSUM test for randomness of a binary sequence appeared first on The DO Loop.

April 21, 2019
 

 

This year I’ve had the honor of helping to recruit speakers for the Career Development area at SAS Global Forum. We have some fantastic presentations that everyone can benefit from, whether you are a student, a new graduate, or a mid-career professional.

I particularly recommend the panel discussion (Career Advice We’d Give to Our Kids) Tuesday April 30, 3:00-4:00 in Level 2, Ballroom C4. The panelists (Shelley Blozis, AnnMaria De Mars, Paul LaBrec) are all great so this should be both informative and entertaining.

The following presentations are listed in order by day and time. As you scroll through this list, you may notice that most (but not all!) of these presentations are in Level 1 Room D168.

Poster (available every day)
Tips to Ensure Success in Your New SAS Project
Flora Fang Liu

Tuesday, April 30, 2019

10:00-11:00 Level 1, D168
Don’t Just Survive, Thrive! A Learning-Based Strategy for a Modern Organization Centered Around SAS
Jovan Marjanovic

11:00-12:00 Level 1, D168
The Power of Know-How: Pump Up Your Professional Value by Refining Your SAS Skills
Gina Huff

1:00-1:15 Level 2, Exhibit Hall D, Super Demo 12
SAS Programming Exam Moves to Performance-Based Format
Mark Stevens

1:30-2:00 Level 1, D168
The Why and How of Teaching SAS to High School Students
Jennifer Richards

2:00-2:30 Level 1, D168
Puzzle Me, Puzzle You: How a Thought Experiment Became a Rubik’s Cube Among a Set of Fun Puzzles
Amit Patel, Lewis Mitchell

2:30-3:00 Level 1, D168
How to Land Work as a SAS Professional
Charu Shankar

3:00-3:15 Level 2, Exhibit Hall D, Super Demo 12
Take SAS Certification Exams from Home Online Proctored
Terry Barham

3:00-4:00 Level 2, Ballroom C4
Panel Discussion: Career Advice We’d Give to Our Kids
Shelley Blozis, AnnMaria De Mars, Paul LaBrec

3:00-4:00 Level 1, D168
How To Be an Effective Statistician
Alexander Schacht

4:00-5:00 Level 1, D168
Stories from the Trenches: Tips and Techniques for Career Advancement from a SAS Industry Recruiter
Molly Hall

5:00-5:30 Level 1, D168
How to HOW: Hands-On Workshops Made Easy
Chuck Kincaid

Wednesday, May 1, 2019

10:00-11:00 Level 2, Ballroom C3
Tell Me a Data Story
Kat Greenbrook

10:00-11:00 Level 2 Ballroom C4
The Good, The Bad, and The Creepy: Why Data Scientists Need to Understand Ethics
Jennifer Priestley

11:30-12:00 Level 1, D168
New to SAS? Helpful Hints for Developing Your Own Professional Development Plan
Kelly Smith

April 20, 2019
 

Do you have a favorite television show? Or a favorite movie franchise that you follow? If you call yourself a "fan," just how much of a fan are you? Are you merely a spectator, or do you take your fanaticism to the next level by creating something new?

When it comes to fandom for franchises like Game of Thrones, the Marvel movies, or Stranger Things, there's a new kind of nerd in town. And this nerd brings data science skills. You've heard of the "second screen" experience for watching television, right? That's where fans watch a show (or sporting event or awards ceremony), but also keep up with Twitter or Facebook so they can commune with other fans of the show on social media. These fan-data-scientists bring a third screen: their favorite data workbench IDE.

I was recently lured into a rabbit hole of Game of Thrones data by a tweet. The Twitter user was reacting to a data visualization of character screen time during the show. The visualization was built in a different tool, but the person was wondering whether it could be done in SAS. I knew the answer was Yes...as long as we could get the data. That turned out to be the easiest part.

WARNING: While this blog post does not reveal any plot points from the show, the data does contain spoilers! No spoilers in what I'm showing here, but if you run my code examples there might be data points that you cannot "unsee." I was personally conflicted about this, since I'm a fan of the show but I'm not yet keeping up with the latest episodes. I had to avert my eyes for the most recent data.

Data is Coming

A GitHub user named Jeffrey Lancaster has shared a repository for all aspects of data around Game of Thrones. He also has similar repos for Stranger Things and the Marvel universe. Inside that repo there's a JSON file with episode-level data for all episodes and seasons of the show. With a few lines of code, I was able to read the data directly from the repo into SAS:

filename eps temp;
 
/* Big thanks to this GoT data nerd for assembling this data */
proc http
 url="https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/episodes.json"
 out=eps
 method="GET";
run;
 
/* slurp this in with the JSON engine */
libname episode JSON fileref=eps;

Note that I've shared all of my code for my steps in my own GitHub repo (just trying to pay it forward). Everything should work in Base SAS, including in SAS University Edition.

The JSON library reads the data into a series of related tables that show all of the important things that can happen to characters within a scene. Game of Thrones fans know that death, sex, and marriage (in that order) make up the inflection points in the show.
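
If you are curious about those tables, a quick way to explore them is to list the library members and peek at the combined ALLDATA member that the JSON engine always creates (a sketch for exploration; the other member names depend on the structure of the JSON file):

/* the directory listing for the JSON library is written to the SAS log */
proc datasets lib=episode;
quit;
 
/* peek at the combined ALLDATA member that the JSON engine builds */
proc print data=episode.alldata(obs=10);
run;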

Building the character-scene data

With a little bit of data prep using SQL, I was able to show the details of the on-screen time per character, per scene. These are the basis of the visualization I was trying to create.

/* Build details of scenes and characters who appear in them */
PROC SQL;
   CREATE TABLE WORK.character_scenes AS 
   SELECT t1.seasonNum, 
          t1.episodeNum,
          t2.ordinal_scenes as scene_id, 
          input(t2.sceneStart,time.) as time_start format=time., 
          input(t2.sceneEnd,time.) as time_end format=time., 
          (calculated time_end) - (calculated time_start) as duration format=time.,
          t3.name
      FROM EPISODE.EPISODES t1, 
           EPISODE.EPISODES_SCENES t2, 
           EPISODE.SCENES_CHARACTERS t3
      WHERE (t1.ordinal_episodes = t2.ordinal_episodes AND 
             t2.ordinal_scenes = t3.ordinal_scenes);
QUIT;

With a few more data prep steps (see my code on GitHub), I was able to summarize the screen time for scene locations:

You can see that The Crownlands dominate as a location. In the show that's a big region and a sort of headquarters for The Seven Kingdoms, and the show data actually includes "sub-locations" that can help us to break that down. Here's the makeup of that 18+ hours of time in The Crownlands:

Screen time for characters

My goal is to show how much screen time each of the major characters receives, and how that changes over time. I began by creating a series of charts with PROC SGPLOT. These were created in a single SGPLOT step that uses a BY group, segmented by show episode. They appear in a grid because I used ODS LAYOUT GRIDDED to arrange them.

Here's the code segment that creates these dozens of charts. Again, see my GitHub for the intermediate data prep work.

/* Create a gridded presentation of Episode graphs CUMULATIVE timings */
ods graphics / width=500 height=300 imagefmt=svg noborder;
ods layout gridded columns=3 advance=bygroup;
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc  
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
   ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis display=(nolabel)  grid ;
  yaxis display=none grid ;
run;
ods layout end;
ods html5 close;

Creating an animated timeline

The example shared on Twitter showed an animation of screen time, per character, over the complete series of episodes. So instead of a huge grid with many plots, we need to produce a single file with layers for each episode. In SAS we can produce an animated GIF or animated SVG (scalable vector graphics) file. The SVG is a much smaller file format, but you need a browser or a special viewer to "play" it. Still, that's the path I followed:

/* Create a single animated SVG file for all episodes */
options printerpath=svg animation=start animduration=1 
  svgfadein=.25 svgfadeout=.25 svgfademode=overlap
  nodate nonumber; 
 
/* change this file path to something that works for you */
ODS PRINTER file="c:\temp\got_cumulative.svg" style=daisy;
 
/* For SAS University Edition
ODS PRINTER file="/folders/myfolders/got_cumulative.svg" style=daisy;
*/
 
proc sgplot data=all_times noautolegend ;
  hbar name / response=cumulative 
    categoryorder=respdesc 
    colorresponse=total_screen_time dataskin=crisp
    datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
    seglabel seglabelattrs=(weight=bold size=10pt color=white) ;
   ;
  by epLabel notsorted;
  format cumulative time.;
  label epLabel="Ep";
  where rank<=10;
  xaxis label="Cumulative screen time (HH:MM:SS)" grid ;
  yaxis display=none grid ;
run;
options animation=stop;
ods printer close;

Here's the result (hosted on my GitHub repo -- but as a GIF for compatibility.)

I code and I know things

Like the Game of Thrones characters, my visualization is imperfect in many ways. As I was just reviewing it I discovered a few data prep missteps that I should correct. I used some features of PROC SGPLOT that I've learned only a little about, and so others might suggest improvements. And my next mission should be to bring this data into SAS Visual Analytics, where the real "data viz maesters" who work with me can work their magic. I'm just hoping that I can stay ahead of the spoilers.

The post Deeper enjoyment of your favorite shows -- through data appeared first on The SAS Dummy.

April 17, 2019
 

I think every course in exploratory data analysis should begin by studying Anscombe's quartet. Anscombe's quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but very different graphical properties. They are a great reminder of why you should graph your data. You can read about Anscombe's quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe's first two examples are shown below:

The Wikipedia article states, "It is not known how Anscombe created his data sets." Really? Creating different data sets that have the same statistics sounds like a fun challenge! As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data set.

Generating your own version of Anscombe's data

The Wikipedia article notes that there are "several methods to generate similar data sets with identical statistics and dissimilar graphics," but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!

I used the following approach to construct the first two data sets:

  1. Use a simulation to create linear data with random normal errors: Y1 = 3 + 0.5 X + ε, where ε ~ N(0,1).
  2. Compute the regression estimates (b0 and b1) and the sample correlation for the linear data. These are the target statistics. They define three equations that the second data set must match.
  3. The response variable for the second data set is of the form Y2 = β0 + β1 X + β2 X². There are three equations and three unknown parameters, so we can solve a system of nonlinear equations to find β.

From geometric reasoning, there are three different solutions for the β parameter: one with β2 > 0 (a parabola that opens up), one with β2 = 0 (a straight line), and one with β2 < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.

Construct the first data set

You can construct the first data set in a number of ways, but I chose to construct it randomly. The following SAS/IML statements construct the data, define a helper function (LinearReg), and compute the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:

proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */
 
/* Helper function. Return parameter estimates for linear regression. Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equation */
   return b;
finish;
 
targetB = LinearReg(y, x);          /* compute regression estimates */
targetCorr = corr(y||x)[2];         /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];

You can use these values as the target values. The next step is to find a parameter vector β such that Y2 = β0 + β1 X + β2 X² has the same regression line and corr(Y2, X) has the same sample correlation. For uniqueness, set β2 < 0.

Construct the second data set

You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.) The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and the sample correlation between Y2 and X. The second function is the objective function for an optimization: it subtracts the target values from those estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter for which the quadratic data match the target statistics.

/* Define system of simultaneous equations:
   https://blogs.sas.com/content/iml/2018/02/28/solve-system-nonlinear-equations-sas.html */
/* This function returns linear regression estimates (b0, b1) and correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;    /* construct quadratic Y */
   b = LinearReg(y2, x);      /* linear fit */
   corr = corr(y2||x)[2];     /* sample corr */
   return ( b` || corr);      /* return row vector */
finish;
 
/* This function returns the vector quantity (beta - target). 
   Find the value that minimizes Sum | F_i(beta) - Target_i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );              /* return row vector */
finish;
 
/* now let's solve for quadratic parameters so that same 
   linear fit and correlation occur */
beta0 = {-5 1 -0.1};         /* initial guess */
con = {.  .  .,              /* constraint matrix */
       0  .  0};             /* quadratic term is negative */
optn = ncol(beta0) || 0;     /* LS with 3 components || amount of printing */
/* minimize sum( beta[i] - target[i])**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];
 
/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];

The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y2 = 4.955 + 2.566*X - 0.118*X². The following statements create scatter plots of the new Anscombe-like data:

y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};  append;  close;
QUIT;
 
ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=X y=y;    lineparm x=0 y=3.561 slope=0.447 / clip;
run;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
ods layout end;

Notice that the construction of the second data set depends only on the statistics for the first data set. If you modify the first data set, the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.

What about the other data sets?

You can create the other data sets similarly. For example:

  • For the data set that consists of points along a line except for one outlier, there are three free parameters. Most of the points fall along the line Y3 = a + b*X and one point is at the height Y3 = c. Therefore, you can run an optimization to solve for the values of (a, b, c) that match the target statistics, as sketched after this list.
  • I'm not sure how to formulate the requirements for the fourth data set. It looks like all but one point have coordinates (X, Y4), where X is a fixed value and the vertical mean of the cluster is c. The outlier has coordinate (a, b). I'm not sure whether the variance of the cluster is important.
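
Here is a rough sketch of the first bullet. This is my own formulation, not Anscombe's, and it assumes that the x vector, the LinearReg helper, and the target statistics (targetB and targetCorr) from the earlier program are still in scope:

start Func3(p) global(x, targetB, targetCorr);
   y3 = p[1] + p[2]*x;          /* most points fall along the line Y3 = a + b*X */
   y3[nrow(y3)] = p[3];         /* replace the last point with an outlier at height c */
   b = LinearReg(y3, x);        /* linear fit */
   corr = corr(y3||x)[2];       /* sample correlation */
   return( (b` || corr) - (rowvec(targetB) || targetCorr) );
finish;
 
p0 = {3 0.4 10};                /* initial guess for (a, b, c) */
optn = 3 || 0;                  /* LS with 3 components || amount of printing */
call nlphqn(rc, p, "Func3", p0, optn);
print p[L="(a, b, c) estimates"];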

Summary

In summary, you can solve a system of equations to construct data similar to Anscombe's quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I've presented does not enable you to reproduce Anscombe's quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe's data share seven statistics!

Anscombe was a pioneer (along with Tukey) in using computers to assist statistical analysis. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet. However he did it, Anscombe's quartet is truly remarkable!

You can download the SAS program that creates the results in this article.

The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.