5月 082018
 

I have good news to share about the future. Despite what you may have heard elsewhere, the future of work in a world with artificial intelligence (AI) is not all doom and gloom. And thanks to a research-backed book from Malcolm Frank, What to Do When Machines Do Everything, we [...]

Optimistic about AI and the future of work was published on SAS Voices by Randy Guard

5月 082018
 

WARNING: This blog post references Avengers: Infinity War and contains story spoilers. But it also contains useful information about random number generators (RNGs) -- tempting! If you haven't yet seen the movie, you should make peace with this inner conflict before reading on.

Throughout the movie, Thanos makes it clear that his goal is to eliminate half of the population of every civilization in the universe. With the power of all six infinity stones imbued into his gauntlet, he'll be able to accomplish this with a "snap of his fingers." By the end of the film, Thanos has all of the stones, and then he literally snaps his fingers. (Really? I kept thinking that this was just a figure of speech he used to indicate how simple this will be -- but I guess it works more like the ruby slippers in The Wizard of Oz. Some clicking was required.)

So, Thanos snaps his huge fingers and -- POOF -- there goes half of us. Apparently the universe already had some sort of population-reduction subroutine just waiting for a hacker like Thanos to access it. Who put that there? Not a good plan, universe designer. (Check here to see if you survived the snap.)

But how did Thanos (or the universe) determine which of us was wiped from existence and which of us was spared? I have to assume that it was a seriously high-performing, massively parallel random number generator. And if Thanos had access to 9.4 Maintenance 5 or later (part of the Power [to Know] stone?), then he would have his choice of algorithms.

(Tony Stark has been to SAS headquarters, but we haven't seen Thanos around here. Still, he's welcome to download SAS University Edition.)

Your own RNG gauntlet, built into SAS

I know a little bit about this topic because I talked with Rick Wicklin about RNGs. As Rick discusses in his blog post, a recent release of SAS added support for several new/updated RNG algorithms, including Mersenne twister, PCG, Threefry, and one that introduces hardware-based entropy for "extra randomness." If you want to save yourself some reading, watch our 10-minute discussion here.

Implementing my own random Avengers terminator

I was going to write a SAS program to simulate Thanos' "snap," but I don't have a list of every single person in the universe (thanks GDPR!). However, courtesy of IMDB.com, I do have a list of the approximately 100 credited characters in the Infinity War movie. I wrote a DATA step to pull each name into a data set and "randomly decide" each fate by using the new PCG algorithm and the RAND function with a Bernoulli (binomial) distribution. I learned that trick from Rick's post about simulating coin flips. (I hope I did this correctly; Rick will tell me if I didn't.)

%let algorithm = PCG;
data characters;
  call streaminit(2018,"&algorithm.");
  infile datalines dsd;
  retain x 0 y 1;
  length Name $ 60 spared 8 x 8 y 8;
  input Name;
  Spared = rand("Bernoulli", 0.5);
  x+1;
  if x > 10 then
    do; y+1; x = 1;end;
datalines;
Tony Stark / Iron Man
Thor
Bruce Banner / Hulk
Steve Rogers / Captain America
/* about 96 more */
;
run;

After all of the outcomes were generated, I used PROC FREQ to check the distribution. In this run, only 48% were spared, not an even 50%. Well, that's randomness for you. On the universal scale, I'm not sure that anyone is keeping track.

How many spared

Using a trick I learned from Sample 54315: Customize your symbols with the SYMBOLCHAR statement in PROC SGPLOT, I created a scatter plot of the outcomes. I included special Unicode characters to signify the result in terms that even Hulk can understand. Hearts represent survivors; frowny faces represent the vanished heroes. Here's the code:

data thanosmap;
  input id $ value $ markercolor $ markersymbol $;
  datalines;
status 0 black frowny
status 1 red heart
;
run;
 
title;
ods graphics / height=400 width=400 imagemap=on;
proc sgplot data=Characters noautolegend dattrmap=thanosmap;
  styleattrs wallcolor=white;
  scatter x=x y=y / markerattrs=(size=40) 
    group=spared tip=(Name Spared) attrid=status;
  symbolchar name=heart char='2665'x;
  symbolchar name=frowny char='2639'x;
  xaxis integer display=(novalues) label="Did Thanos Kill You? Red=Dead" 
    labelattrs=(family="Comic Sans MS" size=14pt);
    /* Comic Sans -- get it ???? */
  yaxis integer display=none;
run;

Scatter plot of spared

For those of you who can read, you might appreciate a table with the rundown. For this one, I used a trick that I saw on SAS Support Communities to add strike-through text to a report. It's a simple COMPUTE column with a style directive, in a PROC REPORT step.

proc report data=Characters nowd;
  column Name spared;
  define spared / 'Spared' display;
  compute Spared;
    if spared=1 then
      call define(_row_,"style",
        "style={color=green}");
    if spared=0 then
      call define(_row_,"style",
        "style={color=red textdecoration=line_through}");
  endcomp;
run;

Table of results

Remember, my results here were generated with SAS and don't match the results from the film. (I feel like I need to say that to preempt a few comments.) The complete code for this blog post is available on my public Gist.

Learn more about RNGs

Just as the end of Avengers: Infinity War has sent throngs of viewers to the Internet to find out What's Next, I expect that readers of this blog are eager to learn more about these modern random number generators. Here are the go-to articles from Rick that are worth your review:

Unanswered questions

Before Thanos completed his gauntlet, his main hobby was traveling around the cosmos reducing the population of each civilization "the hard way." With the gauntlet in hand when he snapped his fingers, did he eliminate one-half of the remaining population? Or did the universe's algorithm spare those civilizations that had already been culled? Was this a random sample with replacement or not? In the film, Thanos did not express concern about these details (typical upper management attitude), but the grunt-workers of the universe need to know the parameters for this project. Coders need exact specifications, or else you can expect less-than-heroic results from your infinity gauntlet. I'm pretty sure it says so in the owner's manual.

The post Which random number generator did Thanos use? appeared first on The SAS Dummy.

5月 082018
 

SAS reporting tools for GDPR and other privacy protection lawThe European Union’s General Data Protection Regulation (GDPR) taking effect on 25 May 2018 pertains not only to organizations located within the EU; it applies to all companies processing and holding the personal data of data subjects residing in the European Union, regardless of the company’s location.

If the GDPR acronym does not mean much to you, think of the one that does – HIPAA, FERPA, COPPA, CIPSEA, or any other that is relevant to your jurisdiction – this blog post is equally applicable to all of them.

The GDPR prohibits personal data processing revealing such individual characteristics as race or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, as well as the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. It also has special rules for data relating to criminal convictions or offenses and the processing of children’s personal data.

Whenever SAS users produce reports on demographic data, there is always a risk of inadvertently revealing personal data protected by law, especially when reports are generated automatically or interactively via dynamic data queries. Even for aggregate reports there is a high potential for such exposure.

Suppose you produce an aggregate cross-tabulation report on a small demographic group, representing a count distribution by students’ grade and race. It is highly probable that you can get the count of 1 for some cells in the report, which will unequivocally identify persons and thus disclose their education record (grade) by race. Even if the count is not equal to 1, but is equal to some other small number, there is still a risk of possible deducing or disaggregating of Personally Identifiable Information (PII) from surrounding data (other cells, row and column totals) or related reports on that small demographic group.

The following are the four selected SAS tools that allow you to take care of protecting personal data in SAS reports by suppressing counts in small demographic group reports.

1. Automatic data suppression in SAS reports

This blog post explains the fundamental concepts of data suppression algorithms. It takes you behind the scenes of the iterative process of complementary data suppression and walks you through SAS code implementing a primary and secondary complementary suppression algorithm. The suppression code uses BASE SAS – DATA STEPs, SAS macros, PROC FORMAT, PROC MEANS, and PROC REPORT.

2. Implementing Privacy Protection-Compliant SAS® Aggregate Reports

This SAS Global Forum 2018 paper solidifies and expands on the above blog post. It walks you through the intricate logic of an enhanced complementary suppression process, and demonstrates SAS coding techniques to implement and automatically generate aggregate tabular reports compliant with privacy protection law. The result is a set of SAS macros ready for use in any reporting organization responsible for compliance with privacy protection.

3. In SAS Visual Analytics you can create derived data items that are aggregated measures.  SAS Visual Analytics 8.2 on SAS Viya introduces a new Type for the aggregated measures derived data items called Data Suppression. Here is an excerpt from the documentation on the Data Suppression type:

“Obscures aggregated data if individual values could easily be inferred. Data suppression replaces all values for the measure on which it is based with asterisk characters (*) unless a value represents the aggregation of a specified minimum number of values. You specify the minimum in the Suppress data if count less than parameter. The values are hidden from view, but they are still present in the data query. The calculation of totals and subtotals is not affected.

Some additional values might be suppressed when a single value would be suppressed from a subgroup. In this case, an additional value is suppressed so that the suppressed value cannot be inferred from totals or subtotals.

A common use of suppressed data is to protect the identity of individuals in aggregated data when some crossings are sparse. For example, if your data contains testing scores for a school district by demographics, but one of the demographic categories is represented only by a single student, then data suppression hides the test score for that demographic category.

When you use suppressed data, be sure to follow these best practices:

  • Never use the unsuppressed version of the data item in your report, even in filters and ranks. Consider hiding the unsuppressed version in the Data pane.
  • Avoid using suppressed data in any object that is the source or target of a filter action. Filter actions can sometimes make it possible to infer the values of suppressed data.
  • Avoid assigning hierarchies to objects that contain suppressed data. Expanding or drilling down on a hierarchy can make it possible to infer the values of suppressed data.”

This Data Suppression type functionality is significant as it represents the first such functionality embedded directly into a SAS product.

4. Is it sensitive? Mask it with data suppression

This blog post provides an example of using the above Data Suppression type aggregated measures derived data items in SAS Visual Analytics.

We need your feedback!

We want to hear from you.  Is this blog post useful? How do you comply with GDPR (or other Privacy Law of your jurisdiction) in your organization? What SAS privacy protection features would you like to see in future SAS releases?

SAS tools for GDPR privacy compliant reporting was published on SAS Users.

5月 082018
 

SAS reporting tools for GDPR and other privacy protection lawThe European Union’s General Data Protection Regulation (GDPR) taking effect on 25 May 2018 pertains not only to organizations located within the EU; it applies to all companies processing and holding the personal data of data subjects residing in the European Union, regardless of the company’s location.

If the GDPR acronym does not mean much to you, think of the one that does – HIPAA, FERPA, COPPA, CIPSEA, or any other that is relevant to your jurisdiction – this blog post is equally applicable to all of them.

The GDPR prohibits personal data processing revealing such individual characteristics as race or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, as well as the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health, and data concerning a natural person’s sex life or sexual orientation. It also has special rules for data relating to criminal convictions or offenses and the processing of children’s personal data.

Whenever SAS users produce reports on demographic data, there is always a risk of inadvertently revealing personal data protected by law, especially when reports are generated automatically or interactively via dynamic data queries. Even for aggregate reports there is a high potential for such exposure.

Suppose you produce an aggregate cross-tabulation report on a small demographic group, representing a count distribution by students’ grade and race. It is highly probable that you can get the count of 1 for some cells in the report, which will unequivocally identify persons and thus disclose their education record (grade) by race. Even if the count is not equal to 1, but is equal to some other small number, there is still a risk of possible deducing or disaggregating of Personally Identifiable Information (PII) from surrounding data (other cells, row and column totals) or related reports on that small demographic group.

The following are the four selected SAS tools that allow you to take care of protecting personal data in SAS reports by suppressing counts in small demographic group reports.

1. Automatic data suppression in SAS reports

This blog post explains the fundamental concepts of data suppression algorithms. It takes you behind the scenes of the iterative process of complementary data suppression and walks you through SAS code implementing a primary and secondary complementary suppression algorithm. The suppression code uses BASE SAS – DATA STEPs, SAS macros, PROC FORMAT, PROC MEANS, and PROC REPORT.

2. Implementing Privacy Protection-Compliant SAS® Aggregate Reports

This SAS Global Forum 2018 paper solidifies and expands on the above blog post. It walks you through the intricate logic of an enhanced complementary suppression process, and demonstrates SAS coding techniques to implement and automatically generate aggregate tabular reports compliant with privacy protection law. The result is a set of SAS macros ready for use in any reporting organization responsible for compliance with privacy protection.

3. In SAS Visual Analytics you can create derived data items that are aggregated measures.  SAS Visual Analytics 8.2 on SAS Viya introduces a new Type for the aggregated measures derived data items called Data Suppression. Here is an excerpt from the documentation on the Data Suppression type:

“Obscures aggregated data if individual values could easily be inferred. Data suppression replaces all values for the measure on which it is based with asterisk characters (*) unless a value represents the aggregation of a specified minimum number of values. You specify the minimum in the Suppress data if count less than parameter. The values are hidden from view, but they are still present in the data query. The calculation of totals and subtotals is not affected.

Some additional values might be suppressed when a single value would be suppressed from a subgroup. In this case, an additional value is suppressed so that the suppressed value cannot be inferred from totals or subtotals.

A common use of suppressed data is to protect the identity of individuals in aggregated data when some crossings are sparse. For example, if your data contains testing scores for a school district by demographics, but one of the demographic categories is represented only by a single student, then data suppression hides the test score for that demographic category.

When you use suppressed data, be sure to follow these best practices:

  • Never use the unsuppressed version of the data item in your report, even in filters and ranks. Consider hiding the unsuppressed version in the Data pane.
  • Avoid using suppressed data in any object that is the source or target of a filter action. Filter actions can sometimes make it possible to infer the values of suppressed data.
  • Avoid assigning hierarchies to objects that contain suppressed data. Expanding or drilling down on a hierarchy can make it possible to infer the values of suppressed data.”

This Data Suppression type functionality is significant as it represents the first such functionality embedded directly into a SAS product.

4. Is it sensitive? Mask it with data suppression

This blog post provides an example of using the above Data Suppression type aggregated measures derived data items in SAS Visual Analytics.

We need your feedback!

We want to hear from you.  Is this blog post useful? How do you comply with GDPR (or other Privacy Law of your jurisdiction) in your organization? What SAS privacy protection features would you like to see in future SAS releases?

SAS tools for GDPR privacy compliant reporting was published on SAS Users.

5月 072018
 

Datasets can present themselves in different ways. Identical data can bet arranged differently, often as wide or tall datasets. Generally, the tall dataset is better. Learn how to convert wide data into tall data with PROC TRANSPOSE.

The post "Wide" versus "Tall" data: PROC TRANSPOSE v. the DATA step appeared first on SAS Learning Post.

5月 072018
 

Do you periodically delete unneeded global macro variables? You should! Deleting macro variables releases memory and keeps your symbol table clean. Learn how the macro language statement that deletes global macro variables and about the %DELETEALL statement that can be a life saver for macro programmers.

The post Deleting global macro variables appeared first on SAS Learning Post.

5月 072018
 

Remember the military computer Joshua from the 1983 Matthew Broderick movie WarGames? Joshua learned how to “play a game” by competing against other computers, got confused about reality, and nearly started WWIII. As depicted in that movie, Joshua isn’t all that different from Google’s DeepMind, which became a superhuman chess [...]

Magic vs monetization: AI tips for manufacturing executives was published on SAS Voices by Roger Thomas

5月 072018
 

Simulation studies require both randomness and reproducibility, two qualities that are sometimes at odds with each other. A Monte Carlo simulation might need to generate millions of random samples, where each sample contains dozens of continuous variables and many thousands of observations. In simulation studies, the researcher wants each sample to be statistically independent. For example, if you are drawing samples of size N=100 from a continuous distribution, it is not acceptable for the millionth sample to equal the first sample. Such a situation would bias the study. Similarly, it is not acceptable for half (or a fourth, or an eighth,...) of the numbers in the millionth sample to be the same as the numbers in the first sample.

The easiest way to ensure that each sample is statistically independent is to generate all samples from one high-quality random number stream. However, if the structure of your simulation requires that you use two or more streams, then SAS 9.4M5 provides several ways to produce random samples that are statistically independent. One way is to use one of the new Threefry random number generators. The other way is to use the new CALL STREAM subroutine.

Independence and overlap in random number streams

I've previously written about the many random number generators (RNGs) in SAS. Recall that all RNGs produce uniform variates. When you call a RNG repeatedly, you are generating a stream of random numbers. A stream is the sequence of pseudorandom uniform values that is generated from a given seed value. (If you request a random value from a nonuniform distribution, such as normal or exponential, the software generates one or more uniform variates and then applies a mathematical transformation to produce a nonuniform variate.)

In SAS, the modern RNGs are high quality and have a very long period. For all practical purposes, the samples that are generated from a single stream are independent. However, for the Mersenne twister and PCG RNGs, samples from different streams might not be independent. It might happen that the first sample for one stream is equal to (or has a large intersection with) the millionth sample from another seed, as shown in the image to the right. This phenomenon is known as overlap of streams.

The possibility that streams might overlap is important because some programmers use a macro loop to generate thousands of samples, where each sample is from a different stream. Not only is this technique inefficient, but the birthday problem in probability warns us that the probability that two streams overlap is larger than our intuition suggests.

The probability of overlapping streams is usually small, even for very large simulation studies. The "number of birthdays" equals the period of the RNG, which for Mersenne twister RNG is astronomically huge (≈ 106000). However, for the PCG RNG (which has a smaller period of "only" 264 ≈ 1019), the probability of an overlap could be nonnegligible.

Vigna (2009) estimates the probability of overlap when you use the first L variates in each of n streams from a RNG that has period P. Assuming that nL/P ≪ 1, the probability is approximately n2L/P. When P = 1019 and you generate n = 10,000 different streams, each containing L = 1,000,000 variates, the probability of an overlap is approximately 10-5.

Fortunately, SAS makes it it is easy to ensure the independence of random samples:

  • As shown in Simulating Data with SAS and in many blog posts, you can almost always use a single stream to generate all the random samples.
  • You can use one of the Threefry generators, TF2 or TF4. For a Threefry generator, two streams with different seed values never overlap! Similarly, streams for the hardware-based RDRAND generator never overlap.
  • You can use the CALL STREAM subroutine to set a key value for the PCG algorithm. For the PCG RNG, streams that have the same seed value but different key values never overlap.

In my next blog post, I will demonstrate how to use the STREAM subroutine in the SAS DATA step to generate independent streams. The technique can be useful when you are running a simulation that requires several DATA steps, each of which generates random numbers.

The post Independence and overlap in streams of random numbers appeared first on The DO Loop.