May 10, 2018
 

For those of you who weren't able to attend SAS Global Forum 2018, you can still learn a lot from the content shared there and gain knowledge from your SAS family: SAS Global Forum 2018 papers and videos are now available.

The post Knowledge from the SAS family: SAS Global Forum 2018 papers and videos now available appeared first on SAS Learning Post.

May 9, 2018
 

In a previous blog post, I discussed ways to produce statistically independent samples from a random number generator (RNG). The best way is to generate all samples from one stream. However, if your program uses two or more SAS DATA steps to simulate the data, you cannot use the same seed value multiple times because that could introduce biases into your study. An easy way to deal with this situation is to use the CALL STREAM subroutine to set different key values for each DATA step.

An example of using two DATA steps to generate random numbers

Suppose that you have a simulation that needs to call the DATA step twice, and each call generates random numbers. If you want to encapsulate that functionality into a macro call, you might want to pass in the seed for the random number streams. An INCORRECT approach is to do the following:

%macro DoItWRONG(seed=);
data A;
   call streaminit('PCG', &seed);
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);       /* <== uh-oh! Problem here! */
   /* ... do something else with random numbers ... */
run;
%mend;

This code is unacceptable because the same seed value and RNG are used for both DATA steps. That means that the stream of random numbers is the same for the second DATA step as for the first, which can lead to correlations that can bias your study.

Prior to SAS 9.4M5, the only way to resolve this issue was to use different seed values for each DATA step. For example, you could use &seed + 1 in the second DATA step. Because the Mersenne twister RNG (the only RNG in SAS prior to 9.4M5) has an astronomically large period, streams that use different seed values have an extremely small probability of overlapping.

Use the CALL STREAM subroutine

SAS 9.4M5 introduced new random number generators. The Mersenne twister algorithm supports one parameter (the seed) which initializes each stream. The Threefry and PCG algorithms support two initialization parameters. To prevent confusion, the first parameter is called the seed value and the second is called the key value. By default, the key value is zero, but you can use the CALL STREAM subroutine to set a positive key value.

You should use the key value (not the seed value) to generate independent streams in two DATA steps. The following statements are true:

  • The PCG generator is not guaranteed to generate independent streams for different seed values. However, you can use the key value to ensure independent streams. Two streams with the same seed and different key values never overlap.
  • The Threefry generators (TF2 and TF4) generate independent streams for different seed values. However, the Threefry generators also accept key values, and two streams with different key values never overlap.
  • The Mersenne twister algorithm does not inherently support a key value. However, the STREAM routine creates a new seed from the current seed and the specified key value. Because of the long-period property of the MT algorithm, two streams that use the same seed and different key values are "independent with high probability."

The following example modifies the previous macro to use the CALL STREAM subroutine. For the PCG RNG, two streams that have the same seed and different key values do not overlap:

%macro DoIt(seed=);
data A;
   call streaminit('PCG', &seed);
   call stream(1); 
   /* ... do something with random numbers ... */
run;
data B;
   call streaminit('PCG', &seed);
   call stream(2);                  /* <== different stream, guaranteed */
   /* ... do something else with random numbers ... */
run;
%mend;

Of course, this situation generalizes. If the macro uses three (or 10 or 20) DATA steps, each of which generates random numbers, you can use a different key value in each DATA step. If you initialize each DATA step with the same seed and a different key, the DATA steps generate independent streams (for the Threefry and PCG RNGs) or generate independent streams with high probability (for the Mersenne twister RNG).
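Outside of SAS, the same seed-plus-key idea can be sketched in Python with NumPy. This is an analogy, not the SAS implementation: NumPy's PCG64 generator takes no separate key argument, so the sketch feeds each (seed, key) pair to SeedSequence to select an independent stream.

```python
import numpy as np

seed = 12345

def stream(key):
    # An independent PCG stream from the shared seed plus a per-step key;
    # the (seed, key) pair determines the stream, just as in the SAS macro.
    return np.random.Generator(np.random.PCG64(np.random.SeedSequence([seed, key])))

a = stream(1).normal(size=5)   # plays the role of DATA step A
b = stream(2).normal(size=5)   # plays the role of DATA step B
```

Rerunning `stream(1)` reproduces `a` exactly, while `stream(2)` produces an unrelated sequence, which is the property the CALL STREAM subroutine provides in the DATA step.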

A real example of using different streams

This section provides a real-life example of using the CALL STREAM subroutine. Suppose a researcher wants to create a macro that simulates many samples for a regression model. The macro first creates a data set that contains p independent continuous variables. The macro then simulates B response variables for a regression model that uses the independent variables from the first data set.

To prevent the second DATA step from using the same random number stream that was used in the first DATA step, the second DATA step calls the STREAM subroutine to set a key value:

%macro Simulate(seed=, N=, NCont=,  /* params for indep vars */
                key=, betas=,       /* params for dep vars */
                NumSamples=);       /* params for MC simul */
/* Step 1: Generate nCont explanatory variables */
data MakeData;
   call streaminit('PCG', &seed);
   /* if you do not call STREAM, you get the stream for key=0 */
   array x[&NCont];         /* explanatory vars are named x1-x&nCont  */
 
   do ObsNum = 1 to &N;     /* for each observation in the sample  */
      do j = 1 to &NCont;
         x[j] = rand("Normal"); /* simulate NCont explanatory variables   */
      end;
      output;
    end;
    keep ObsNum x:;
run;
 
/* Step 2: Generate NumSamples responses, each from the same regression model 
   Y = X*beta + epsilon, where epsilon ~ N(0, 0.1)  */
data SimResponse;
   call streaminit('PCG', &seed);
   call stream(&key); 
   keep SampleId ObsNum x: Y;
   /* define regression coefficients beta[0], beta[1], beta[2], ... */
   array beta[0:&NCont] _temporary_ (&betas);
 
   /* construct linear model eta = beta[0] + beta[1]*x1 + beta[2]*x2 + ... */
   set MakeData;                    /* for each obs in data ...*/
   array x[&nCont] x1-x&NCont;      /* put the explanatory variables into an array */
   eta = beta[0];
   do i = 1 to &NCont;
      eta = eta + beta[i] * x[i];
   end;
 
   /* add random noise for each obs in each sample */
   do SampleID = 1 to &NumSamples;
      Y = eta + rand("Normal", 0, 0.1); 
      output;
   end;
run;
 
/* sort to prepare data for BY-group analysis */
proc sort data=SimResponse;
   by SampleId ObsNum;
run;
%mend;
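For comparison outside SAS, here is a hedged Python/NumPy sketch of the same two-stage design. The names and defaults are illustrative only, and because NumPy's PCG64 has no key argument, each (seed, key) pair is passed through SeedSequence to pick an independent stream:

```python
import numpy as np

def simulate(seed, n=50, n_cont=3, key=1,
             betas=(1, -0.5, 0.5, -0.25), num_samples=1000):
    # Stage 1: explanatory variables, drawn from the stream for key=0
    rng_x = np.random.Generator(np.random.PCG64(np.random.SeedSequence([seed, 0])))
    X = rng_x.normal(size=(n, n_cont))

    # Stage 2: responses, drawn from an independent stream selected by the key
    rng_y = np.random.Generator(np.random.PCG64(np.random.SeedSequence([seed, key])))
    eta = betas[0] + X @ np.asarray(betas[1:])             # linear predictor
    Y = eta + rng_y.normal(0, 0.1, size=(num_samples, n))  # one row per sample
    return X, Y

X, Y = simulate(seed=123)
```

Calling `simulate` twice with the same seed but different keys reproduces the same explanatory data while generating fresh, non-overlapping responses, which mirrors how the macro's `key=` parameter behaves.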

For a discussion of how the simulation works, see the article "Simulate many samples from a linear regression model." In this example, the macro uses two DATA steps to simulate the regression data. The second DATA step uses the STREAM subroutine to ensure that the random numbers it generates are different from those used in the first DATA step.

For completeness, here's how you can call the macro to generate the data and then call PROC REG with a BY statement to analyze the 1,000 regression models.

%Simulate(seed=123, N=50, NCont=3,         /* params for indep vars */
          key=1, Betas = 1 -0.5 0.5 -0.25, /* params for dep vars */
          NumSamples=1000)                 /* params for MC sim */
 
proc reg data=SimResponse noprint outest=PE;
   by SampleID;
   model Y = x1-x3;
run;
 
title "Parameter Estimates for key=1";
proc print data=PE(obs=5) noobs;
   var SampleID Intercept x1-x3;
run;

You can verify that the macro generates different simulated data if you use the same seed value but change the key value in the macro call.

In summary, this article shows that you can use the STREAM call in the SAS DATA step to generate different random number streams from the same seed value. For many simulation studies, the STREAM subroutine is not necessary. However, it is useful when the simulation is run in stages or is implemented in modules that need to run independently.

The post Independent streams of random numbers in SAS appeared first on The DO Loop.

May 8, 2018
 

I have good news to share about the future. Despite what you may have heard elsewhere, the future of work in a world with artificial intelligence (AI) is not all doom and gloom. And thanks to a research-backed book from Malcolm Frank, What to Do When Machines Do Everything, we [...]

Optimistic about AI and the future of work was published on SAS Voices by Randy Guard

May 8, 2018
 

WARNING: This blog post references Avengers: Infinity War and contains story spoilers. But it also contains useful information about random number generators (RNGs) -- tempting! If you haven't yet seen the movie, you should make peace with this inner conflict before reading on.

Throughout the movie, Thanos makes it clear that his goal is to eliminate half of the population of every civilization in the universe. With the power of all six infinity stones imbued into his gauntlet, he'll be able to accomplish this with a "snap of his fingers." By the end of the film, Thanos has all of the stones, and then he literally snaps his fingers. (Really? I kept thinking that this was just a figure of speech he used to indicate how simple this will be -- but I guess it works more like the ruby slippers in The Wizard of Oz. Some clicking was required.)

So, Thanos snaps his huge fingers and -- POOF -- there goes half of us. Apparently the universe already had some sort of population-reduction subroutine just waiting for a hacker like Thanos to access it. Who put that there? Not a good plan, universe designer. (Check here to see if you survived the snap.)

But how did Thanos (or the universe) determine which of us was wiped from existence and which of us was spared? I have to assume that it was a seriously high-performing, massively parallel random number generator. And if Thanos had access to SAS 9.4 Maintenance 5 or later (part of the Power [to Know] stone?), then he would have his choice of algorithms.

(Tony Stark has been to SAS headquarters, but we haven't seen Thanos around here. Still, he's welcome to download SAS University Edition.)

Your own RNG gauntlet, built into SAS

I know a little bit about this topic because I talked with Rick Wicklin about RNGs. As Rick discusses in his blog post, a recent release of SAS added support for several new/updated RNG algorithms, including Mersenne twister, PCG, Threefry, and one that introduces hardware-based entropy for "extra randomness." If you want to save yourself some reading, watch our 10-minute discussion here.

Implementing my own random Avengers terminator

I was going to write a SAS program to simulate Thanos' "snap," but I don't have a list of every single person in the universe (thanks GDPR!). However, courtesy of IMDB.com, I do have a list of the approximately 100 credited characters in the Infinity War movie. I wrote a DATA step to pull each name into a data set and "randomly decide" each fate by using the new PCG algorithm and the RAND function with a Bernoulli (binomial) distribution. I learned that trick from Rick's post about simulating coin flips. (I hope I did this correctly; Rick will tell me if I didn't.)

%let algorithm = PCG;
data characters;
  call streaminit("&algorithm.", 2018);
  infile datalines dsd;
  retain x 0 y 1;
  length Name $ 60 spared 8 x 8 y 8;
  input Name;
  Spared = rand("Bernoulli", 0.5);
  x+1;
  if x > 10 then do;
    y+1;
    x = 1;
  end;
datalines;
Tony Stark / Iron Man
Thor
Bruce Banner / Hulk
Steve Rogers / Captain America
/* about 96 more */
;
run;

After all of the outcomes were generated, I used PROC FREQ to check the distribution. In this run, only 48% were spared, not an even 50%. Well, that's randomness for you. On the universal scale, I'm not sure that anyone is keeping track.

How many spared
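How likely is an exact 50/50 split, anyway? With roughly 100 characters and a fair Bernoulli draw for each, a quick back-of-the-envelope check in Python (assuming exactly 100 characters, which is an approximation) shows that a perfectly even snap happens only about 8% of the time, so a 48% result is entirely unremarkable:

```python
from math import comb

n = 100                            # approximate number of credited characters
# P(exactly half spared) = C(n, n/2) / 2^n for a fair coin on each character
p_exact_half = comb(n, n // 2) / 2**n
print(round(p_exact_half, 3))      # about 0.08
```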

Using a trick I learned from Sample 54315: Customize your symbols with the SYMBOLCHAR statement in PROC SGPLOT, I created a scatter plot of the outcomes. I included special Unicode characters to signify the result in terms that even Hulk can understand. Hearts represent survivors; frowny faces represent the vanished heroes. Here's the code:

data thanosmap;
  input id $ value $ markercolor $ markersymbol $;
  datalines;
status 0 black frowny
status 1 red heart
;
run;
 
title;
ods graphics / height=400 width=400 imagemap=on;
proc sgplot data=Characters noautolegend dattrmap=thanosmap;
  styleattrs wallcolor=white;
  scatter x=x y=y / markerattrs=(size=40) 
    group=spared tip=(Name Spared) attrid=status;
  symbolchar name=heart char='2665'x;
  symbolchar name=frowny char='2639'x;
  xaxis integer display=(novalues) label="Did Thanos Kill You? Red=Dead" 
    labelattrs=(family="Comic Sans MS" size=14pt);
    /* Comic Sans -- get it ???? */
  yaxis integer display=none;
run;

Scatter plot of spared

For those of you who can read, you might appreciate a table with the rundown. For this one, I used a trick that I saw on SAS Support Communities to add strike-through text to a report. It's a simple COMPUTE column with a style directive, in a PROC REPORT step.

proc report data=Characters nowd;
  column Name spared;
  define spared / 'Spared' display;
  compute Spared;
    if spared=1 then
      call define(_row_,"style",
        "style={color=green}");
    if spared=0 then
      call define(_row_,"style",
        "style={color=red textdecoration=line_through}");
  endcomp;
run;

Table of results

Remember, my results here were generated with SAS and don't match the results from the film. (I feel like I need to say that to preempt a few comments.) The complete code for this blog post is available on my public Gist.

Learn more about RNGs

Just as the end of Avengers: Infinity War has sent throngs of viewers to the Internet to find out What's Next, I expect that readers of this blog are eager to learn more about these modern random number generators. Here are the go-to articles from Rick that are worth your review:

Unanswered questions

Before Thanos completed his gauntlet, his main hobby was traveling around the cosmos reducing the population of each civilization "the hard way." With the gauntlet in hand when he snapped his fingers, did he eliminate one-half of the remaining population? Or did the universe's algorithm spare those civilizations that had already been culled? Was this a random sample with replacement or not? In the film, Thanos did not express concern about these details (typical upper management attitude), but the grunt-workers of the universe need to know the parameters for this project. Coders need exact specifications, or else you can expect less-than-heroic results from your infinity gauntlet. I'm pretty sure it says so in the owner's manual.

The post Which random number generator did Thanos use? appeared first on The SAS Dummy.

May 8, 2018
 

The European Union’s General Data Protection Regulation (GDPR), which takes effect on 25 May 2018, pertains not only to organizations located within the EU; it applies to all companies processing and holding the personal data of data subjects residing in the European Union, regardless of the company’s location.

If the GDPR acronym does not mean much to you, think of the one that does – HIPAA, FERPA, COPPA, CIPSEA, or any other that is relevant to your jurisdiction – this blog post is equally applicable to all of them.

The GDPR prohibits personal data processing revealing such individual characteristics as race or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, as well as the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. It also has special rules for data relating to criminal convictions or offenses and the processing of children’s personal data.

Whenever SAS users produce reports on demographic data, there is always a risk of inadvertently revealing personal data protected by law, especially when reports are generated automatically or interactively via dynamic data queries. Even for aggregate reports there is a high potential for such exposure.

Suppose you produce an aggregate cross-tabulation report on a small demographic group, representing a count distribution by students’ grade and race. It is highly probable that you can get the count of 1 for some cells in the report, which will unequivocally identify persons and thus disclose their education record (grade) by race. Even if the count is not equal to 1, but is equal to some other small number, there is still a risk of possible deducing or disaggregating of Personally Identifiable Information (PII) from surrounding data (other cells, row and column totals) or related reports on that small demographic group.

The following are four selected SAS tools that help you protect personal data in SAS reports by suppressing counts for small demographic groups.

1. Automatic data suppression in SAS reports

This blog post explains the fundamental concepts of data suppression algorithms. It takes you behind the scenes of the iterative process of complementary data suppression and walks you through SAS code implementing a primary and secondary complementary suppression algorithm. The suppression code uses BASE SAS – DATA STEPs, SAS macros, PROC FORMAT, PROC MEANS, and PROC REPORT.
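The core idea of that iterative process can be sketched compactly outside SAS as well. The following Python sketch is a naive greedy illustration, not the post's actual SAS macros: primary suppression masks any cell below a threshold, and complementary (secondary) suppression then masks an additional cell in any row or column that would otherwise have exactly one masked cell, since a lone masked cell can be recovered from its marginal total.

```python
import pandas as pd

def suppress(counts, threshold=5):
    """Mask small cells, then complementary-mask so totals cannot reveal them."""
    mask = counts < threshold              # primary suppression
    changed = True
    while changed:                         # secondary (complementary) suppression
        changed = False
        # A row with exactly one masked cell leaks it via the row total,
        # so also mask the smallest unmasked cell in that row.
        for row in counts.index[mask.sum(axis=1) == 1]:
            j = counts.loc[row][~mask.loc[row]].idxmin()
            mask.loc[row, j] = True
            changed = True
        # Same reasoning for columns and column totals.
        for col in counts.columns[mask.sum(axis=0) == 1]:
            i = counts[col][~mask[col]].idxmin()
            mask.loc[i, col] = True
            changed = True
    return counts.mask(mask)               # masked cells become NaN

# Hypothetical grade-by-race counts for a small demographic group
grades = pd.DataFrame({"A": [12, 1, 20], "B": [3, 15, 30], "C": [7, 2, 40]},
                      index=["Grade 9", "Grade 10", "Grade 11"])
result = suppress(grades)
```

Note how aggressively the suppression cascades in a small table: masking one cell forces companions in its row and column, which is exactly why the iterative process described in the post is needed.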

2. Implementing Privacy Protection-Compliant SAS® Aggregate Reports

This SAS Global Forum 2018 paper solidifies and expands on the above blog post. It walks you through the intricate logic of an enhanced complementary suppression process, and demonstrates SAS coding techniques to implement and automatically generate aggregate tabular reports compliant with privacy protection law. The result is a set of SAS macros ready for use in any reporting organization responsible for compliance with privacy protection.

3. In SAS Visual Analytics you can create derived data items that are aggregated measures.  SAS Visual Analytics 8.2 on SAS Viya introduces a new Type for the aggregated measures derived data items called Data Suppression. Here is an excerpt from the documentation on the Data Suppression type:

“Obscures aggregated data if individual values could easily be inferred. Data suppression replaces all values for the measure on which it is based with asterisk characters (*) unless a value represents the aggregation of a specified minimum number of values. You specify the minimum in the Suppress data if count less than parameter. The values are hidden from view, but they are still present in the data query. The calculation of totals and subtotals is not affected.

Some additional values might be suppressed when a single value would be suppressed from a subgroup. In this case, an additional value is suppressed so that the suppressed value cannot be inferred from totals or subtotals.

A common use of suppressed data is to protect the identity of individuals in aggregated data when some crossings are sparse. For example, if your data contains testing scores for a school district by demographics, but one of the demographic categories is represented only by a single student, then data suppression hides the test score for that demographic category.

When you use suppressed data, be sure to follow these best practices:

  • Never use the unsuppressed version of the data item in your report, even in filters and ranks. Consider hiding the unsuppressed version in the Data pane.
  • Avoid using suppressed data in any object that is the source or target of a filter action. Filter actions can sometimes make it possible to infer the values of suppressed data.
  • Avoid assigning hierarchies to objects that contain suppressed data. Expanding or drilling down on a hierarchy can make it possible to infer the values of suppressed data.”

This Data Suppression type functionality is significant as it represents the first such functionality embedded directly into a SAS product.

4. Is it sensitive? Mask it with data suppression

This blog post provides an example of using the above Data Suppression type aggregated measures derived data items in SAS Visual Analytics.

We need your feedback!

We want to hear from you.  Is this blog post useful? How do you comply with GDPR (or other Privacy Law of your jurisdiction) in your organization? What SAS privacy protection features would you like to see in future SAS releases?

SAS tools for GDPR privacy compliant reporting was published on SAS Users.


May 7, 2018
 

Datasets can present themselves in different ways. Identical data can be arranged differently, often as wide or tall datasets. Generally, the tall dataset is better. Learn how to convert wide data into tall data with PROC TRANSPOSE.
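As a cross-language aside (the post itself uses PROC TRANSPOSE), the same wide-to-tall reshaping in Python's pandas is a single `melt` call; the toy data here is hypothetical:

```python
import pandas as pd

# Wide layout: one row per id, one column per quarter
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 30], "q2": [20, 40]})

# Tall layout: one row per (id, quarter) combination
tall = wide.melt(id_vars="id", var_name="quarter", value_name="sales")
```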

The post "Wide" versus "Tall" data: PROC TRANSPOSE v. the DATA step appeared first on SAS Learning Post.

May 7, 2018
 

Do you periodically delete unneeded global macro variables? You should! Deleting macro variables releases memory and keeps your symbol table clean. Learn about the macro language statement that deletes global macro variables, and how the %DELETEALL statement can be a life saver for macro programmers.

The post Deleting global macro variables appeared first on SAS Learning Post.

May 7, 2018
 

Remember the military computer Joshua from the 1983 Matthew Broderick movie WarGames? Joshua learned how to “play a game” by competing against other computers, got confused about reality, and nearly started WWIII. As depicted in that movie, Joshua isn’t all that different from Google’s DeepMind, which became a superhuman chess [...]

Magic vs monetization: AI tips for manufacturing executives was published on SAS Voices by Roger Thomas