Enochlophobia. It means “fear of crowds,” and I have it. I wouldn’t be caught dead anywhere near Times Square on New Year’s Eve, and how did I ever endure the stadium-sized rock concerts of my youth? I’m thinking all this as I navigate the sea of people at the National [...]
What is a random-number generator? What are the random-number generators in SAS, and how can you use them to generate random numbers from probability distributions? In SAS 9.4M5, you can use the STREAMINIT subroutine to select from eight random-number generators (RNGs), including five new RNGs. After choosing an RNG, you can use the RAND function to obtain random values from probability distributions.
What is a random-number generator?
A random-number generator usually refers to a deterministic algorithm that generates uniformly distributed random numbers. The algorithm (sometimes called a pseudorandom-number generator, or PRNG) is initialized by an integer, called the seed, which sets the initial state of the PRNG. Each iteration of the algorithm produces a pseudorandom number and advances the internal state. The seed completely determines the sequence of random numbers. Although it is deterministic, a modern PRNG generates extremely long sequences of numbers whose statistical properties are virtually indistinguishable from a truly random uniform process.
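The claim that the seed completely determines the sequence can be checked directly. The following is a minimal sketch, not taken from the original article: the data set names Run1 and Run2 and the seed 54321 are arbitrary choices made for illustration. Two DATA steps that use the same seed should produce identical streams, so PROC COMPARE should report no differences.

```sas
/* Illustrative sketch: Run1, Run2, and the seed 54321 are arbitrary choices */
data Run1(drop=i);
   call streaminit(54321);        /* default RNG, seed=54321 */
   do i = 1 to 5;
      u = rand('Uniform');
      output;
   end;
run;

data Run2(drop=i);
   call streaminit(54321);        /* same RNG, same seed */
   do i = 1 to 5;
      u = rand('Uniform');
      output;
   end;
run;

/* Compare the two streams value by value */
proc compare base=Run1 compare=Run2;
run;
```

If you change the seed in one of the DATA steps, the comparison should report unequal values.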
An RNG can also refer to a hardware-based device (sometimes called a "truly random RNG") that samples from an entropy source, which is often thermal noise within the silicon of a computer chip. In the Intel implementation, the random entropy source is used to initialize a PRNG in the chip, which generates a short stream of pseudorandom values before resampling from the entropy source and repeating the process. A hardware-based RNG is not deterministic or reproducible.
An RNG generates uniformly distributed numbers. If you request random numbers from a nonuniform probability distribution (such as the normal or exponential distributions), SAS will automatically apply mathematical transformations that convert the uniform numbers into nonuniform random variates.
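One well-known transformation of this kind is the inverse-CDF method. The sketch below is only an illustration of the idea and is not a description of what SAS does internally; the data set name ExpDemo and the variable names are hypothetical. If U is uniform on (0,1), then -log(1-U) follows the standard exponential distribution, so its sample mean and standard deviation should both be close to 1.

```sas
/* Illustrative sketch of the inverse-CDF method; not the SAS internal algorithm */
data ExpDemo(drop=i u);
   call streaminit(12345);
   do i = 1 to 10000;
      u = rand('Uniform');            /* U ~ U(0,1) */
      xInverse = -log(1 - u);         /* inverse CDF of the Exp(1) distribution */
      xBuiltIn = rand('Exponential'); /* built-in exponential variate, for comparison */
      output;
   end;
run;

/* Both variables should have mean and standard deviation near 1 */
proc means data=ExpDemo n mean std min max;
   var xInverse xBuiltIn;
run;
```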
Using random numbers in the real world
In the real world, random numbers are used as the basis for statistical simulations. Financial analysts use simulations of the economy to evaluate the risk of a portfolio. Insurance companies use simulations of natural disasters to assess the impact on their customers. Government workers use simulations of population growth to predict the infrastructure and services that will be required by future generations. In all fields, data analysts generate random values for statistical sampling, bootstrap and resampling methods, Monte Carlo estimation, and data simulation.
For these techniques to be statistically valid, the values that are produced by statistical software must contain a high degree of randomness. The new random-number generators in SAS provide analysts with fast, high-quality, state-of-the-art random numbers.
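As a small illustration of Monte Carlo estimation, the following sketch estimates pi by generating uniform points in the unit square and counting the proportion that fall inside the quarter circle of radius 1. The data set name, the sample size, and the seed are arbitrary choices that are not from the original article.

```sas
/* Illustrative sketch: Monte Carlo estimate of pi */
data MonteCarloPi(keep=N piEst);
   call streaminit(12345);
   N = 100000;                     /* number of random points */
   count = 0;
   do i = 1 to N;
      x = rand('Uniform');
      y = rand('Uniform');
      if x**2 + y**2 <= 1 then count = count + 1;   /* point is inside the quarter circle */
   end;
   piEst = 4 * count / N;          /* area of quarter circle is pi/4 */
run;

proc print data=MonteCarloPi noobs;
run;
```

The estimate varies with the seed and the sample size; larger values of N should give estimates closer to 3.14159.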
New random-number generators in SAS
Prior to SAS 9.4M5, SAS supported the Mersenne twister random-number generator (introduced in SAS 9.1) and a hardware-based RNG (supported on Intel CPUs in SAS 9.4M3). SAS 9.4M5 introduces new random-number generators and new variants of the Mersenne twister, as follows:
- PCG: A 64-bit permuted congruential generator (O’Neill 2014) with good statistical properties.
- TF2: A 2x64-bit counter-based RNG that is based on the Threefish encryption function in the Random123 library (Salmon et al. 2011). The generator is also known as the 2x64 Threefry generator.
- TF4: A 4x64-bit counter-based RNG that is also known as the 4x64 Threefry generator.
- RDRAND: A hardware-based RNG (Intel, 2014) that is repeatedly reseeded by using random values from thermal noise in the chip. The RNG requires an Intel processor (Ivy Bridge and later) that supports the RDRAND instruction.
- MT1998: (deprecated) The original 1998 32-bit Mersenne twister algorithm (Matsumoto and Nishimura, 1998). This was the default RNG for the RAND function prior to SAS 9.4M3. For a small number of seed values (those exactly divisible by 8,192), this RNG does not produce a sufficiently random stream of numbers. Consequently, other RNGs are preferable.
- MTHYBRID: (default) A hybrid method that improves the MT1998 method by using the MT2002 initialization for seeds that are exactly divisible by 8,192. This is the default RNG, beginning with SAS 9.4M3.
- MT2002: The 2002 32-bit Mersenne twister algorithm (Matsumoto and Nishimura 2002).
- MT64: A 64-bit version of the 2002 Mersenne twister algorithm.
Choose a random-number generator in SAS
In SAS, you can use the STREAMINIT subroutine to choose an RNG and to set the seed value. You can then use the RAND function to produce high-quality streams of random numbers for many common distributions. The first argument to the RAND function specifies the name of the distribution. Subsequent arguments specify parameters for the distribution.
If you want to use the default RNG (which is 'MTHYBRID'), you can specify only the seed. For example, the following DATA step initializes the default PRNG with the seed value 12345 and then generates five observations from the uniform distribution on (0,1) and from the standard normal distribution:
data RandMT(drop=i);
   call streaminit(12345);   /* choose default RNG (MTHYBRID) and seed=12345 */
   do i = 1 to 5;
      uMT = rand('Uniform'); /* generate random uniform: U ~ U(0,1) */
      nMT = rand('Normal');  /* generate random normal: N ~ N(mu=0, sigma=1) */
      output;
   end;
run;
If you want to choose a different RNG, you can specify the name of the RNG followed by the seed. For example, the following DATA step initializes the PCG method with the seed value 12345 and then generates five observations from the uniform and normal distributions:
data RandPCG(drop=i);
   call streaminit('PCG', 12345); /* SAS 9.4M5: choose PCG method, same seed */
   do i = 1 to 5;
      uPCG = rand('Uniform');
      nPCG = rand('Normal');
      output;
   end;
run;
A hardware-based method is initialized by the entropy source, so you do not need to specify a seed value. For example, the following DATA step initializes the RDRAND method and then generates five observations. The three data sets are then merged and printed:
data RandHardware(drop=i);
   call streaminit('RDRAND');     /* SAS 9.4M3: Hardware method, no seed */
   do i = 1 to 5;
      uHardware = rand('Uniform');
      nHardware = rand('Normal');
      output;
   end;
run;

data All;
   merge RandMT RandPCG RandHardware;
run;

title "Uniform Random Variates for Three RNGs";
proc print data=All noobs;
   var u:;
run;

title "Normal Random Variates for Three RNGs";
proc print data=All noobs;
   var n:;
run;
The output shows that each RNG generates a different sequence of numbers. If you run this program yourself, you will generate the same sequences for the first two columns (the MTHYBRID and PCG streams), but your numbers for the RDRAND RNG will be different. Each column of the first table is an independent random sample from the uniform distribution. Similarly, each column of the second table is an independent random sample from the standard normal distribution.
In SAS 9.4M5, you can choose from eight different random-number generators. You use the STREAMINIT subroutine to select the RNG and use the RAND function to obtain random values. In a future blog post, I will compare the attributes of the different RNGs and discuss the advantages of each. I will also show how to use the new random-number generators to generate random numbers in parallel by running a DS2 program in multiple threads.
The information in this article is taken from my forthcoming SAS Global Forum 2018 paper, "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018).
- Intel Corporation (2014). “Intel Digital Random Number Generator (DRNG) Software Implementation Guide.” Accessed December 23, 2017.
- Matsumoto, M., and Nishimura, T. (1998). “Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudorandom Number Generator.” ACM Transactions on Modeling and Computer Simulation 8:3–30.
- Matsumoto, M., and Nishimura, T. (2002). “Mersenne Twister with Improved Initialization.” Accessed April 10, 2015.
- O’Neill, M. E. (2014). “PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation.” Accessed December 23, 2017.
- Salmon, J. K., Moraes, M. A., Dror, R. O., and Shaw, D. E. (2011). “Parallel Random Numbers: As Easy as 1, 2, 3.” In SC ’11: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12.
Are you interested in using SAS Visual Analytics 8.2 to visualize a state by regions, but all you have is a county shapefile? As long as you can cross-walk counties to regions, this is easier to do than you might think.
Here are the steps involved:
Obtain a county shapefile and extract all components to a folder. For example, I used the US Counties shapefile found in this SAS Visual Analytics community post.
Note: Shapefile is a geospatial data format developed by ESRI. A shapefile consists of multiple files. When you unzip the shapefile found on the community site, make sure to extract all of its components and not just the .shp file. You can get more information about shapefiles from this Wikipedia article: https://en.wikipedia.org/wiki/Shapefile.
Run PROC MAPIMPORT to convert the shapefile into a SAS map dataset.
libname geo 'C:\Geography'; /*location of the extracted shapefile*/

proc mapimport datafile="C:\Geography\UScounties.shp"
   out=geo.shapefile_counties;
run;
Add a Region variable to your SAS map dataset. If all you need is one state, you can subset the map dataset to keep just the state you need. For example, I only needed Texas, so I used the State_FIPS variable to subset the map dataset:
proc sql;
   create table temp as
   select *,
      /*cross-walk counties to regions*/
      case
         when name='Anderson' then '4'
         when name='Andrews' then '9'
         when name='Angelina' then '5'
         when name='Aransas' then '11'
         <……>
         when name='Zapata' then '11'
         when name='Zavala' then '8'
      end as region
   from geo.shapefile_counties
   /*subset to Texas*/
   where state_fips='48';
quit;
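Because the CASE expression has no ELSE clause, any county that is not matched by a WHEN clause gets a missing region. Before you dissolve boundaries, it can be worth checking the cross-walk. The following is a minimal sketch of one such check; it uses only the temp data set and the name and region variables created above, and it is a suggestion rather than a step from the original workflow.

```sas
/* Optional sanity check: list counties that were not assigned a region */
proc sql;
   select distinct name
   from temp
   where region is missing;
quit;

/* Optional sanity check: number of map observations in each region */
proc freq data=temp;
   tables region / missing;
run;
```

If the first query returns any rows, those counties need a WHEN clause in the cross-walk.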
Use PROC GREMOVE to dissolve the boundaries between counties that belong to the same region. It is important to sort the county dataset by region before you run PROC GREMOVE.
proc sort data=temp;
   by region;
run;

proc gremove data=temp out=geo.regions_shapefile nodecycle;
   by region;
   id name; /*name is county name*/
run;
To validate that your boundaries dissolved correctly, run PROC GMAP to view the regions. If the regions do not look right when you run this step, it may signal an issue with the underlying data. For example, when I ran this with a county shapefile obtained from the Census Bureau, I found that some of the counties were mislabeled, which, of course, caused the regions to dissolve incorrectly.
proc gmap map=geo.regions_shapefile data=geo.regions_shapefile all;
   id region;
   choro region / nolegend levels=1;
run;
Here’s the result I got, which is exactly what I expected:
Add a sequence number variable to the regions dataset. SAS Visual Analytics 8.2 needs it to properly define a custom polygon inside a report:
data geo.regions_shapefile;
   set geo.regions_shapefile;
   seqno=_n_;
run;
Load the new region shapefile in SAS Visual Analytics.
In the dataset with the region variable that you want to visualize, create a new geography variable and define a new custom polygon provider.
Now, you can create a map of your custom regions:
How to create custom regional maps in SAS Visual Analytics 8.2 was published on SAS Users.
Let’s lay down some fundamentals. In business you want to achieve the highest revenues with the best margins and the lowest costs. More specifically, in manufacturing, you want your products to be the highest quality (relative to specification) when you make the item. And you want it shipped to the [...]
How do you take your manufacturing business to the next level? was published on SAS Voices by Tim Clark
If you have worked with the different types of score code generated by the high-performance modeling nodes in SAS® Enterprise Miner 14.1, you have probably come across the Analytic Store (or ASTORE) file type for scoring. The ASTORE file type works very well for scoring complex machine learning models such as random forests, gradient boosting, and support vector machines. In this article, we will focus on ASTORE files generated by SAS® Viya® Visual Data Mining and Machine Learning (VDMML) procedures. An introduction to analytic stores on SAS Viya can be found here.
In this post, we will:
- Generate an ASTORE file for a gradient boosting model.
- Override the scoring decision using PROC ASTORE in SAS Visual Data Mining and Machine Learning.
Generate an ASTORE file for a gradient boosting model
Our example dataset is a distributed in-memory CAS table that contains information about applicants who were granted credit for a home equity loan. The binary target variable ‘BAD’ indicates whether a client defaulted on or repaid the loan. The remaining variables, which describe the applicant’s credit history, debt-to-income ratio, occupation, and so on, are used as predictors for the model. In the code below, we train a gradient boosting model on a randomly sampled 70% of the data and validate against the remaining 30%. The SAVESTATE statement creates an analytic store (ASTORE) for the model and saves it as a binary file named “astore_gb.”
proc gradboost data=PUBLIC.HMEQ;
   partition fraction(validate=0.3);
   target BAD / level=nominal;
   input LOAN MORTDUE DEBTINC VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO / level=interval;
   input REASON JOB / level=nominal;
   score out=public.hmeq_scored copyvars=(_all_);
   savestate rstore=public.astore_gb;
   id _all_;
run;
Shown below are a few observations from the scored dataset hmeq_scored where YOJ (years at present job) is greater than 10 years.
Override the scoring decision using PROC ASTORE
In this segment, we will use PROC ASTORE to override the scoring decision from the gradient boosting model. To that end, we will first make use of the DESCRIBE statement in PROC ASTORE to produce basic DS2 scoring code using the EPCODE option. We will then edit the score code in DS2 language syntax to override the scoring decision produced from the gradient boosting model.
proc astore;
   describe rstore=public.astore_gb
      epcode="/viyafiles/jukhar/gb_epcode.sas";
run;
A snapshot of the output from the above code is shown below. The analytic store is assigned a unique string identifier. We also get information about the analytic engine that produced the store (gradient boosting, in this case) and the time when the store was created. In addition, though not shown in the snapshot below, we get a list of the input and output variables used.
Let’s take a look at the DS2 score code (“gb_epcode.sas”) produced by the EPCODE option in the DESCRIBE statement within PROC ASTORE.
data sasep.out;
   dcl package score sc();
   dcl double "LOAN";
   dcl double "MORTDUE";
   dcl double "DEBTINC";
   dcl double "VALUE";
   dcl double "YOJ";
   dcl double "DEROG";
   dcl double "DELINQ";
   dcl double "CLAGE";
   dcl double "NINQ";
   dcl double "CLNO";
   dcl nchar(7) "REASON";
   dcl nchar(7) "JOB";
   dcl double "BAD";
   dcl double "P_BAD1" having label n'Predicted: BAD=1';
   dcl double "P_BAD0" having label n'Predicted: BAD=0';
   dcl nchar(32) "I_BAD" having label n'Into: BAD';
   dcl nchar(4) "_WARN_" having label n'Warnings';
   Keep "P_BAD1" "P_BAD0" "I_BAD" "_WARN_" "BAD" "LOAN" "MORTDUE" "VALUE"
        "REASON" "JOB" "YOJ" "DEROG" "DELINQ" "CLAGE" "NINQ" "CLNO" "DEBTINC" ;
   varlist allvars[_all_];

   method init();
      sc.setvars(allvars);
      sc.setKey(n'F8E7B0B4B71C8F39D679ECDCC70F6C3533C21BD5');
   end;

   method preScoreRecord();
   end;

   method postScoreRecord();
   end;

   method term();
   end;

   method run();
      set sasep.in;
      preScoreRecord();
      sc.scoreRecord();
      postScoreRecord();
   end;
enddata;
The sc.setKey call in the init() method block contains a string identifier for the analytic store; this is the same ASTORE identifier that was previously output by PROC ASTORE. To override the scoring decision from the original gradient boosting model, we will edit the gb_epcode.sas file (shown above) by inserting new statements in the postScoreRecord method block; the edited file must follow DS2 language syntax. For more information about the DS2 language, see
method postScoreRecord();
   if YOJ>10 then do;
      I_BAD_NEW='0';
   end;
   else do;
      I_BAD_NEW=I_BAD;
   end;
end;
Because we are saving the outcome into a new variable called “I_BAD_NEW,” we will need to declare this variable upfront along with the rest of the variables in the score file.
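A minimal sketch of that declaration follows. It assumes that I_BAD_NEW should have the same type and length as I_BAD, which is nchar(32) in the generated code; the label text is an illustrative choice. Because the generated file contains a KEEP statement, the new variable most likely also needs to be added to that list so that it appears in the scored output. These lines are fragments to merge into gb_epcode.sas, not a standalone program.

```sas
/* Fragment of gb_epcode.sas: only the added or changed lines are shown */

/* 1. Declare the new variable alongside the existing I_BAD declaration. */
/*    Assumption: same type as I_BAD; the label text is illustrative.    */
dcl nchar(32) "I_BAD_NEW" having label n'Into: BAD (after override)';

/* 2. Add the new variable to the existing KEEP list */
Keep "P_BAD1" "P_BAD0" "I_BAD" "I_BAD_NEW" "_WARN_" "BAD" "LOAN" "MORTDUE" "VALUE"
     "REASON" "JOB" "YOJ" "DEROG" "DELINQ" "CLAGE" "NINQ" "CLNO" "DEBTINC" ;
```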
In order for this override to take effect, we will need to run the SCORE statement in PROC ASTORE and provide both the original ASTORE file (astore_gb), as well as the edited DS2 score code (gb_epcode.sas).
proc astore;
   score data=public.hmeq
      epcode="/viyafiles/jukhar/gb_epcode.sas"
      rstore=public.astore_gb
      out=public.hmeq_new;
run;
A comparison of “I_BAD” and “I_BAD_NEW” in the output of the above code for selected observations shows that the override rule for scoring has indeed taken effect.
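One way to make this comparison is a cross-tabulation of the two variables. The sketch below is a suggestion rather than part of the original workflow; it assumes that the public libref still points to the caslib used above and that I_BAD_NEW was kept in the scored output.

```sas
/* Cross-tabulate the original and overridden decisions for YOJ > 10 */
proc freq data=public.hmeq_new;
   where YOJ > 10;
   tables I_BAD*I_BAD_NEW / norow nocol nopercent;
run;
```

For applicants with more than 10 years at their present job, every observation should fall in the I_BAD_NEW='0' column, regardless of the value of I_BAD.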
In this article we explored how to override the scoring decision produced from a machine learning model in SAS Viya. You will find more information about scoring in the
Using PROC ASTORE to override scoring decisions in SAS® Viya® was published on SAS Users.