Customer data platforms (CDPs), data management platforms (DMPs), people-based marketing, identity graphs, and more overlapping topics represent an important ingredient of any martech brainstorming session in 2020. As your brand spreads out across touchpoints — from web to mobile applications, as well as call centers, email and direct mail — [...]
Through our new strategic partnership, SAS and Microsoft are helping the public sector realize the power of analytics in the cloud. I spoke with Daniel Sumner, Worldwide Director of Smart Infrastructure at Microsoft, to explore the possibilities of more robust analytics and cloud strategies in government — and the technology [...]
The future of government with SAS Analytics on Azure was published on SAS Voices by Steve Bennett
The HighLow plot often enables you to create many custom plots without resorting to annotation. Although it is designed to create a candlestick chart for stocks, it is incredibly versatile. Recently, a SAS programmer wanted to create a patient-profile graph that looked like a stacked bar chart but had repeated categories. The graph is shown to the right. The graph shows which treatments each patient was given and in which order. Treatments can be repeated, and during some periods a patient was given no treatment.
Although the graph looks like a stacked bar chart, there is an important difference. A stacked bar chart is a graph of (counts of) discrete categories. For each patient, each category appears at most one time. Although the categories can be ordered, the categories are displayed in the same order for every patient.
In contrast, by using the HighLow chart, you can order the categories independently for each patient and you can display repeated categories.
A graph of mutually exclusive events over time
This chart visualizes the sequence of treatments for a set of patients over time, but you can generalize the idea. In general, this graph is used to show different subjects over time. At each instant, a subject can experience one and only one event. In theory, this means that the events should be mutually exclusive, but you can get around that restriction by defining additional events such as "A and B", "A and C", and "B and C".
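The "one event at a time" requirement is easy to check before plotting. Here is a minimal sketch in Python (not SAS), assuming the data are stored as long-form rows of (ID, start, end, event). The function name find_overlaps is hypothetical, and the sample rows are a subset of the patient data used in this article.

```python
from collections import defaultdict

# Long-form rows: (subject ID, start day, end day, event).
rows = [
    (101, 0, 90, "A"),
    (101, 90, 120, "B"),
    (101, 120, 180, "A"),
    (201, 0, 180, "B"),
    (201, 180, 240, "None"),
]

def find_overlaps(rows):
    """Return (id, earlier, later) for each pair of consecutive intervals
    (after sorting by start time) that overlap within the same subject."""
    by_subject = defaultdict(list)
    for subject, start, end, event in rows:
        by_subject[subject].append((start, end, event))
    overlaps = []
    for subject, intervals in by_subject.items():
        intervals.sort()
        for prev, cur in zip(intervals, intervals[1:]):
            if cur[0] < prev[1]:   # next interval starts before the previous one ends
                overlaps.append((subject, prev, cur))
    return overlaps
```

An empty result means the events are mutually exclusive for every subject. Intervals that merely touch (one ends on day 90 and the next starts on day 90) are not counted as overlapping.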
The following DATA step defines the data in "long form," meaning that each patient profile is defined by using multiple rows. There are four variables: The subject ID, the starting time, the ending time, and the treatment category. Each row of the data set defines the starting and ending time for one treatment for one patient. The DATA step also implements a trick that I like: prepend "fake data" to specify the order that the treatments will appear in the legend.
data Patients;
length Treatment $ 10;
input ID StartDay EndDay Treatment;
/* trick: The first patient is a 'missing patient', which determines
   the order that the categories appear in the legend. */
datalines;
.     0   0 A
.     0   0 B
.     0   0 C
.     0   0 None
101   0  90 A
101  90 120 B
101 120 180 A
101 180 210 None
101 210 270 B
101 270 360 None
101 360 420 A
201   0 180 B
201 180 240 None
201 240 300 C
301   0 270 C
301 270 320 None
301 320 480 A
401   0 180 C
401 180 240 None
401 240 360 A
401 360 450 B
;

title "Treatments by Day for Each Patient";
proc sgplot data=Patients;
   highlow y=ID low=StartDay high=EndDay / group=Treatment type=BAR;
   yaxis type=discrete reverse;   /* the ID var is numeric, so you must specify DISCRETE */
   xaxis grid label='Day Since Start of Treatment';
run;
The graph is shown at the top of the article. The call to PROC SGPLOT is essentially a one-liner: a single HIGHLOW statement draws the bars. The "hard part" about creating this graph is knowing that it can be created by using a HighLow plot and creating the data in the correct form.
In summary, the HighLow plot gives you more control over the graph than a standard stacked bar chart. It enables you to display repeated categories and to order the categories independently for each subject. By using these features, you can visualize a sequence of mutually exclusive events over time for each subject.
There have been many blog posts that use the HighLow plot to create custom graphs. In statistics, I have used the HighLow plot to create deviation plots and butterfly fringe plots. In clinical graphs, it can create graphs such as:
- Swimmer plots
- Adverse event charts and other patient profile charts
- Schedule charts and GANTT charts
The post The HighLow plot: When a stacked bar chart is not enough appeared first on The DO Loop.
In this multi-part series we're going to explore a real-life example of operationalizing your analytics and then quickly dive into the technical details to make it happen. The phrase Operationalize your Analytics itself encompasses a framework for realizing timely and relevant business value from models, business rules, and algorithms. It also refers to the development-to-deployment life cycle commonly used in the ModelOps space. This series is geared towards walking you through a piece of the puzzle, exposing your analytics as REST APIs, but to get a better feel for the whole picture I would encourage taking a look at these resources:
- Operationalizing your analytics at SAS (includes links to multiple white papers)
- 7 tips for operationalizing your analytics
- ModelOps: How to operationalize the model life cycle
Fair warning: I get detailed here. This is a journey. I'm not trying to pitch the story for the next Netflix original, but I want to outline business challenges companies face and how they can utilize technology to solve these problems. The more background you have, the more the solution resonates. I also want to have a little fun and give props to my champion salesman.
Framing the use case
Our goal in this first part of the series is to better understand the value of building custom applications on top of SAS Viya. To illustrate my point, I'm going to use Tara's Auto Mall, a used car buying and selling business I hypothetically own. We incorporated the business back in 1980 when neon and hairspray were king and MTV played music videos. Kurt, pictured at our lot, was my best salesman.
This man knew how to spot a great buy. Every auction we'd send him to looking for cars, he'd find the gems that just needed a little polishing, a little clean up, and knew exactly how much to pay for them. Sure, we still bust on him today for a couple of heaps he bought that we never made a dime off, but for the most part he knew how not to get caught up in the heat of the auction. It's an easy thing to do while just trying to outbid the others that you end up paying too much for a car. After all, we've got to turn a profit to keep the lights on and we do that by bringing used cars from auction to our lot for resale. Over the years, Kurt learned which makes and models we could get the best profit margins out of, while also learning how to spot cars that might end up racking up too many unexpected costs. "Salt and sun damage will take a huge bite out of your commission" he'd always tell the new guys.
But these days Kurt, like most of the other great salesmen I've had over the years, spends his days fishing in the middle of nowhere, happily retired. Sure, I've got a few folks still on the team that really know a good buy when they see it, but more and more often we're losing money on deals. As hard as these new kids work, they just don't have the years of experience to know what they're getting into at auction.
Sometimes a car will come up that looks like it's in great shape. Clean and well-kept, low miles even, and everyone starts spouting off bids, one right after the other. That'll make just about anyone think "they must know something I don't know, I've got to win this one!" But unless you really know what to look for, like Kurt did, that good-looking deal might end up being a dud. And while Kurt had experience under his belt to help his eye for that, keeping track of the good buys and the bad ones, learning from all those lessons - that's the best yardstick for comparisons when you're trying to make a quick call at auction. How much have we been able to sell cars with similar characteristics for before? Have they ended up costing us more money to fix up than they're worth?
Needing to modernize our business
Okay, you get it. It's a challenge many are actively facing: a retiring workforce taking their years of experience with them, combined with agile competitors and market disrupters. Not only to better compete but also to cut down on poor "gut" decisions, Tara's Auto Mall decides to build an app. This way all their employees can use the app at auction to help make the best possible decisions for which cars to bid on and how much is too much before overpaying. Over the years they've bought a lot of cars, thousands, and no one can really keep track of all that data in their head to know how to truly evaluate a good buy.
The buying team should be able to pull up this custom app on their phone, put in the details for the car they're thinking about making a bid on, and almost instantly get back a suggestion on whether or not they should bid at all and, if so, a maximum bid amount and some other helpful information. Those results should be based on both the historical data from previous sales and from business logic. And they should be able to record how much each car up at auction went for and its characteristics, whether they won it or not, so we can feed that back into our database to make the models even more accurate over time.
How can Tara's Auto Mall improve?
Now that we've got the main requirements for the application that Tara's Auto Mall needs to build so the buying team can make data-driven decisions at auction, we can start to hash out how those recommendations will be generated. The solution they decided on is to use SAS Viya to create a decision flow to evaluate used cars up for bidding at auction by using that specific car's details as inputs and first checking them against some business rules that they've come to trust over time. This adds a layer of intelligence into the decision-making process that models alone cannot provide.
We heard a bit about the lessons learned over time, like looking for salt and sun damage, so we can layer this into the recommendation logic by checking for certain states where that type of damage is typical. This allows us to add more "human knowledge" into the computer-driven process and automatically alert the buyer that they should be aware of this potential damage. That decision flow then uses both supervised and unsupervised models to evaluate the car's unique characteristics against previous sales, the profits from them, and market data, to generate a maximum bid suggestion.
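To make the "rules first, then models" idea concrete, here is a rough Python sketch (not a SAS Viya decision flow). Everything in it is hypothetical: the state lists are placeholders, the function names are made up for illustration, and the model is passed in as a plain callable rather than a trained model.

```python
# Hypothetical rule data: placeholder state lists, for illustration only.
SALT_RISK_STATES = {"MI", "NY", "OH"}
SUN_RISK_STATES = {"AZ", "FL", "NV"}

def damage_alerts(state):
    """Return rule-based alerts for the state where the car was registered."""
    alerts = []
    if state in SALT_RISK_STATES:
        alerts.append("Check for salt damage")
    if state in SUN_RISK_STATES:
        alerts.append("Check for sun damage")
    return alerts

def recommend(car, predict_max_bid):
    """Run the business rules first, then ask a model for a maximum bid.
    predict_max_bid is any callable that maps car details to a dollar amount."""
    alerts = damage_alerts(car["state"])
    max_bid = predict_max_bid(car)
    return {"bid": max_bid > 0, "max_bid": max_bid, "alerts": alerts}
```

The point of the layering is that the alerts come back to the buyer alongside the model's number, so the "human knowledge" is visible even when the model suggests bidding.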
Here's the fun part
During an auction we don't have time to wait for some batch process to give recommendations and we don't want our buyers to have to log in and enter values into a written program either. Instead, we can publish the entire decision flow as an API endpoint to the SAS Micro Analytic Service (MAS), a “compile-once, execute-many-times” service in SAS Viya. Then we can have our custom application make REST calls to that endpoint with the specific car details as inputs, where it will execute against a packaged version of that entire flow we outlined. The result is a recommendation from both the business rules and models.
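For a feel of what such a REST call could look like from the app side, here is a minimal Python sketch that builds the request without sending it. Treat the details as assumptions: the host name, module ID, step ID, token, and input names are hypothetical, and the URL pattern and the "inputs"/"outputs" payload shape reflect my understanding of the MAS REST API, so check them against the documentation on developer.sas.com before relying on them.

```python
import json
import urllib.request

def build_mas_request(base_url, module_id, step_id, token, car):
    """Build (but do not send) a POST request that passes car details as step inputs."""
    url = f"{base_url}/microanalyticScore/modules/{module_id}/steps/{step_id}"
    body = {"inputs": [{"name": k, "value": v} for k, v in car.items()]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

def outputs_to_dict(response_json):
    """Flatten an {"outputs": [{"name": ..., "value": ...}]} response into a dict."""
    return {o["name"]: o["value"] for o in response_json.get("outputs", [])}
```

Sending the request would be a call to urllib.request.urlopen on the returned object. The sketch stops short of that so it can run without a SAS Viya environment.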
The app is part of the SAS decisioning process, and evolves over time. As more sales are added to the database through the app, the solution automatically retrains the recommendation model. It yields more-accurate suggestions as the data changes and then republishes to MAS so our custom app can automatically take advantage of the updated model.
I want to pause here and note that, while this example is for an application making REST calls to MAS, that is really the tip of the iceberg. One of the greatest advantages to the modernization of the SAS platform with SAS Viya is the sheer number of underlying services that are now exposed as API endpoints for you to incorporate into your larger IT landscape and business processes. Check out developer.sas.com to take a look.
Tara's Auto Mall has now decided on the custom application to help their buyers at auction and the solution to building repeatable analytics and using that to provide near-real-time recommendations. Next, we need to work through integrating the custom app into their SAS Viya environment so they can embed the recommendations.
Questions from the author
While you might not have gotten to work with Kurt, Tara's Auto Mall's best and most fashionably-dressed salesman, is your business experiencing similar challenges? Do you have homegrown applications that could benefit from injecting repeatable analytics and intelligent decision making into them?
On discussion forums, many SAS programmers ask about the best way to generate dummy variables for categorical variables. Well-meaning responders offer all sorts of advice, including writing your own DATA step program, sometimes mixed with macro programming. This article shows that the simplest and easiest way to generate dummy variables in SAS is to use PROC GLMSELECT. It is not necessary to write a SAS program to generate dummy variables. This article shows an example of generating dummy variables that have meaningful names, which are based on the name of the original variable and the categories (levels) of the variable.
A dummy variable is a binary indicator variable. Given a categorical variable, X, that has k levels, you can generate k dummy variables. The j_th dummy variable indicates the presence (1) or absence (0) of the j_th category.
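If the definition feels abstract, the following few lines of Python (not SAS) spell it out. This is only an illustration of the definition, assuming the values are held in a list. The function name dummy_variables is hypothetical, and the X_level naming simply mimics the convention described later in this article.

```python
def dummy_variables(name, values):
    """Return {dummy_name: [0/1, ...]} with one indicator column
    for each level of a categorical variable."""
    levels = sorted(set(values))
    return {
        f"{name}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# A variable X with k=3 levels produces 3 dummy variables.
dummies = dummy_variables("X", ["A", "B", "C", "A"])
# dummies == {"X_A": [1, 0, 0, 1], "X_B": [0, 1, 0, 0], "X_C": [0, 0, 1, 0]}
```

For every observation, exactly one of the k dummy variables equals 1.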
Why GLMSELECT is the best way to generate dummy variables
I usually avoid saying "this is the best way" to do something in SAS. But if you are facing an impending deadline, you are probably more interested in solving your problem and less interested in comparing five different ways to solve it. So let's cut to the chase: If you want to generate dummy variables in SAS, use PROC GLMSELECT.
Why do I say that? Because PROC GLMSELECT has the following features that make it easy to use and flexible:
- The syntax of PROC GLMSELECT is straightforward and easy to understand.
- The dummy variables that PROC GLMSELECT creates have meaningful names. For example, if the name of the categorical variable is X and it has values 'A', 'B', and 'C', then the names of the dummy variables are X_A, X_B, and X_C.
- PROC GLMSELECT creates a macro variable named _GLSMOD that contains the names of the dummy variables.
- When you write the dummy variables to a SAS data set, you can include the original variables or not.
- By default, PROC GLMSELECT uses the GLM parameterization of CLASS variables. This is what you need to generate dummy variables. But the same procedure also enables you to generate design matrices that use different parameterizations, that contain interaction effects, that contain spline bases, and more.
The only drawback to using PROC GLMSELECT is that it requires a response variable to put on the MODEL statement. But that is easily addressed.
How to generate dummy variables
Let's show an example of generating dummy variables. I will use two categorical variables in the Sashelp.Cars data: Origin and Cylinders. First, let's look at the data. As the output from PROC FREQ shows, the Origin variable has three levels ('Asia', 'Europe', and 'USA') and the Cylinders variable has seven valid levels and also contains two missing values.
%let DSIn    = Sashelp.Cars;       /* name of input data set */
%let VarList = Origin Cylinders;   /* name of categorical variables */

proc freq data=&DSIn;
   tables &VarList;
run;
In order to use PROC GLMSELECT, you need a numeric response variable. PROC GLMSELECT does not care what the response variable is, but it must exist. The simplest thing to do is to create a "fake" response variable by using a DATA step view. To generate the dummy variables, put the names of the categorical variables on the CLASS and MODEL statements. You can use the OUTDESIGN= option to write the dummy variables (and, optionally, the original variables) to a SAS data set. The following statements generate dummy variables for the Origin and Cylinders variables:
/* An easy way to generate dummy variables is to use PROC GLMSELECT */
/* 1. add a fake response variable */
data AddFakeY / view=AddFakeY;
   set &DSIn;
   _Y = 0;
run;

/* 2. Create the dummy variables as a GLM design matrix.
      Include the original variables, if desired */
proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=Want(drop=_Y);
   class      &VarList;   /* list the categorical variables here */
   model _Y = &VarList / noint selection=none;
run;
The dummy variables are contained in the WANT data set. As mentioned, the GLMSELECT procedure creates a macro variable (_GLSMOD) that contains the names of the dummy variables. You can use this macro variable in procedures and in the DATA step. For example, you can use it to look at the names and labels for the dummy variables:
/* show the names of the dummy variables */
proc contents varnum data=Want(keep=&_GLSMOD);
   ods select Position;
run;
Notice that the names of the dummy variables are very understandable. The three levels of the Origin variable are 'Asia', 'Europe', and 'USA', so the dummy variables are named Origin_Asia, Origin_Europe, and Origin_USA. The dummy variables for the seven valid levels of the Cylinders variable are named Cylinders_N, where N is a valid level.
A macro to generate dummy variables
It is easy to encapsulate the two steps into a SAS macro to make it easier to generate dummy variables. The following statements define the %DummyVars macro, which takes three arguments:
- DSIn is the name of the input data set, which contains the categorical variables.
- VarList is a space-separated list of the names of the categorical variables. Dummy variables will be created for each variable that you specify.
- DSOut is the name of the output data set, which contains the dummy variables.
/* define a macro to create dummy variables */
%macro DummyVars(DSIn,    /* the name of the input data set */
                 VarList, /* the names of the categorical variables */
                 DSOut);  /* the name of the output data set */
   /* 1. add a fake response variable */
   data AddFakeY / view=AddFakeY;
      set &DSIn;
      _Y = 0;      /* add a fake response variable */
   run;
   /* 2. Create the design matrix. Include the original variables, if desired */
   proc glmselect data=AddFakeY NOPRINT outdesign(addinputvars)=&DSOut(drop=_Y);
      class      &VarList;
      model _Y = &VarList / noint selection=none;
   run;
%mend;

/* test macro on the Age and Sex variables of the Sashelp.Class data */
%DummyVars(Sashelp.Class, Age Sex, ClassDummy);
When you run the macro, it writes the dummy variables to the ClassDummy data set. It also creates a macro variable (_GLSMOD) that contains the names of the dummy variables. You can use the macro variable to analyze or print the dummy variables, as follows:
/* _GLSMOD is a macro variable that contains the names of the dummy variables */
proc print data=ClassDummy noobs;
   var Name &_GLSMod;
run;
The dummy variables tell you that Alfred is a 14-year-old male, Alice is a 13-year-old female, and so forth.
What happens if a categorical variable contains a missing value?
If a categorical variable contains a missing value, so do all dummy variables that are generated from that variable. For example, we saw earlier that the Cylinders variable for the Sashelp.Cars data has two missing values. You can use PROC MEANS to show that the dummy variables (named Cylinders_N) also have two missing values. Because the dummy variables are binary variables, the sum of each dummy variable equals the number of observations in the corresponding level. Compare the SUM column in the PROC MEANS output with the earlier output from PROC FREQ:
/* A missing value in Cylinders results in a missing value for
   each dummy variable that is generated from Cylinders */
proc means data=Want N NMiss Sum ndec=0;
   vars Cylinders_:;
run;
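The same bookkeeping can be mimicked in a few lines of Python (not SAS) to see why the sums line up with the frequencies. This is a sketch under stated assumptions: missing values are represented by None, the function names are hypothetical, and the six sample values are made up for illustration rather than taken from Sashelp.Cars.

```python
from collections import Counter

def dummies_with_missing(name, values):
    """Indicator columns in which a missing input (None) yields None in every column."""
    levels = sorted({v for v in values if v is not None})
    return {
        f"{name}_{lev}": [None if v is None else int(v == lev) for v in values]
        for lev in levels
    }

def column_sum(col):
    """Sum of the nonmissing entries, analogous to the SUM statistic."""
    return sum(x for x in col if x is not None)

cylinders = [4, 6, None, 4, 8, None]            # made-up sample values
design = dummies_with_missing("Cylinders", cylinders)
freq = Counter(v for v in cylinders if v is not None)
```

Each column of design has two missing entries, and column_sum(design["Cylinders_4"]) equals freq[4], which is the count of observations for which Cylinders is 4.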
In most analyses, it is unnecessary to generate dummy variables. Most SAS procedures support the CLASS statement, which enables you to use categorical variables directly in statistical analyses. However, if you do need to generate dummy variables, there is an easy way to do it: Use PROC GLMSELECT or use the %DummyVars macro in this article. The result is a SAS data set that contains the dummy variables and a macro variable (_GLSMOD) that contains the names of the dummy variables.
Here are links to previous articles about dummy variables and creating design matrices in SAS.
- The GLMMOD procedure enables you to create dummy variables. However, the dummy variables are named COL1, COL2, ..., which might be harder to work with.
- Generating dummy variables is a special case of generating a design matrix. You can read about four procedures that can generate a design matrix in SAS. However, PROC GLMSELECT can do everything those procedures can do, except that the GLIMMIX procedure can generate design columns for random effects.
- There are other ways to generate meaningful names for dummy variables, but none is easier than using PROC GLMSELECT.
Remember back to your early school days, singing with all your classmates “If you’re happy and you know it clap your hands!” and then we’d all clap our hands. Being happy back then was so simple. Today, it’s hard to get away from all the negative headlines of 2020! It’s [...]
Decision trees are a fundamental machine learning technique that every data scientist should know. Luckily, the construction and implementation of decision trees in SAS is straightforward and easy to produce.
There are simply three sections to review for the development of decision trees:
- Data
- Tree development
- Model evaluation
The data that we will use for this example is found in the fantastic UCI Machine Learning Repository. The data set is titled “Bank Marketing Dataset,” and it can be found at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
This data set represents a direct marketing campaign (phone calls) conducted by a Portuguese banking institution. The goal of the direct marketing campaign was to have customers subscribe to a term deposit product. The data set consists of 15 independent variables that represent customer attributes (age, job, marital status, education, etc.) and marketing campaign attributes (month, day of week, number of marketing campaigns, etc.).
The target variable in the data set is represented as “y.” This variable is a binary indicator of whether the phone solicitation resulted in a sale of a term deposit product (“yes”) or did not result in a sale (“no”). For our purposes, we will recode this variable and label it as “TARGET,” and the binary outcomes will be 1 for “yes” and 0 for “no.”
The data set is randomly split into two data sets at a 70/30 ratio. The larger data set will be labeled “bank_train” and the smaller data set will be labeled “bank_test”. The decision tree will be developed on the bank_train data set. Once the decision tree has been developed, we will apply the model to the holdout bank_test data set.
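The recoding and the 70/30 split are simple enough to sketch in a few lines. The following is a minimal Python illustration (not the SAS code used for this example), assuming the rows are held in a list. The function names and the seed value are hypothetical choices.

```python
import random

def recode_target(y):
    """Map the original "yes"/"no" outcome to 1/0."""
    return 1 if y == "yes" else 0

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the split is random, the two pieces should have similar proportions of positive outcomes. Fixing the seed means you get the same bank_train and bank_test rows every time you rerun the split.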
The code below specifies how to build a decision tree in SAS. The data set mydata.bank_train is used to develop the decision tree. The output code file will enable us to apply the model to our unseen bank_test data set.
ODS GRAPHICS ON;
PROC HPSPLIT DATA=mydata.bank_train;
   CLASS TARGET _CHARACTER_;
   MODEL TARGET(EVENT='1') = _NUMERIC_ _CHARACTER_;
   PRUNE costcomplexity;
   PARTITION FRACTION(VALIDATE=0.3 SEED=42);
   CODE FILE='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
   OUTPUT OUT = SCORED;
run;
The output of the decision tree algorithm is a new column labeled “P_TARGET1”. This column shows the probability of a positive outcome for each observation. The output also contains the standard tree diagram that demonstrates the model split points.
Once you have developed your model, you will need to evaluate it to see whether it meets the needs of the project. In this example, we want to make sure that the model adequately predicts which observation will lead to a sale.
The first step is to apply the model to the holdout bank_test data set.
DATA test_scored;
   SET MYDATA.bank_test;
   %INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/bank_tree.sas';
RUN;
The %INCLUDE statement applied the decision tree algorithm to the bank_test data set and created the P_TARGET1 column for the bank_test data set.
Now that the model has been applied to the bank_test data set, we will need to evaluate the performance of the model by creating a lift table. Lift tables provide additional information that has been summarized in the ROC chart. Remember that every point along the ROC chart is a probability threshold. The lift table provides detailed information for every point along the ROC curve.
The model evaluation macro that we will use was developed by Wensui Liu. This easy-to-use macro is labeled “separation” and can be applied to any binary classification model output to evaluate the model results.
You can find this macro in my GitHub repository for my new book, End-to-End Data Science with SAS®. This GitHub repository contains all of the code demonstrated in the book along with all of the macros that were used in the book.
This macro is saved on my C drive, and we call it with a %INCLUDE statement.
%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';
%separation(data = test_scored, score = P_TARGET1, y = target);
The score script that was generated from the CODE FILE statement in the PROC HPSPLIT procedure is applied to the holdout bank_test data set through the use of the %INCLUDE statement.
The table below is generated from the lift table macro.
This table shows that the model adequately separated the positive and negative observations. If we examine the top two rows of data in the table, we can see that the cumulative bad percent for the top 20% of observations is 47.03%. In other words, we can identify 47.03% of positive cases by selecting the top 20% of the population. This selection is made by selecting observations with a P_TARGET1 score greater than or equal to 0.8276 as defined by the MAX SCORE column.
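To see where a number like "cumulative percent of positives in the top 20%" comes from, here is a minimal Python sketch of a lift-style table. It is not the separation macro and makes no claim to reproduce its columns. It assumes a list of scores and a list of 0/1 outcomes with at least one positive and at least as many rows as bins, and the function name lift_table is hypothetical.

```python
def lift_table(scores, targets, n_bins=10):
    """Sort by descending score, cut into n_bins nearly equal groups, and report
    the cumulative percent of positive cases captured through each group."""
    ranked = sorted(zip(scores, targets), key=lambda p: p[0], reverse=True)
    total_pos = sum(t for _, t in ranked)
    n = len(ranked)
    table, cum_pos = [], 0
    for b in range(n_bins):
        group = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        cum_pos += sum(t for _, t in group)
        table.append({
            "bin": b + 1,
            "min_score": group[-1][0],   # lowest score in this group
            "max_score": group[0][0],    # highest score in this group
            "cum_pct_positive": 100.0 * cum_pos / total_pos,
        })
    return table
```

With n_bins=10, the second row of the returned table gives the share of all positive cases that fall in the top 20% of scores, which is the quantity interpreted in the paragraph above.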
Fraud, waste and abuse (FWA) ravages the US health care system. Estimates from the National Health Care Anti-Fraud Association show fraud costs health care organizations $70 billion to $230 billion each year. The precise figure is unknowable because only 3 to 10% of this fraud is ever detected. With more [...]
Catch a fraudster: Finding the needle in the haystack with AI was published on SAS Voices by Alyssa Farrell
In the paper "Tips and Techniques for Using the Random-Number Generators in SAS" (Sarle and Wicklin, 2018), I discussed an example that uses the new STREAMREWIND subroutine in Base SAS 9.4M5. As its name implies, the STREAMREWIND subroutine rewinds a random number stream, essentially resetting the stream to the beginning. I struggled to create a compelling example for the STREAMREWIND routine because using the subroutine "results in dependent streams of numbers" and because "it is usually not necessary in simulation studies" (p. 12). Regarding an application, I asserted that the subroutine "is convenient for testing."
But recently I was thinking about two-factor authentication and realized that I could use the STREAMREWIND subroutine to emulate generating a random token that changes every 30 seconds. I think it is a cool example, and it gives me the opportunity to revisit some of the newer features of random-number generation in SAS, including new generators and random number keys.
A brief overview of two-factor authentication
I am not an expert on two-factor authentication (TFA), but I use it to access my work computer, my bank accounts, and other sensitive accounts. The main idea behind TFA is that before you can access a secure account, you must authenticate yourself in two ways:
- Provide a valid username and password.
- Provide information that depends on a physical device that you own and that you have previously registered.
Most people use a smartphone as the physical device, but it can also be a PC or laptop. If you do an internet search for "two factor authentication tokens," you can find many images like the one on the right. This is the display from a software program that runs on a PC, laptop, or phone. The "Credential ID" field is a long string that is unique to each device. (For simplicity, I've replaced the long string with "12345.") The "Security Code" field displays a pseudorandom number that changes every 30 seconds. The Security Code depends on the device and on the time of day (within a 30-second interval). In the image, you can see a small clock and the number 28, which indicates that the Security Code will be valid for another 28 seconds before a new number is generated.
After you provide a valid username and password, the account challenges you to type in the current Security Code for your registered device. When you submit the Security Code, the remote server checks whether the code is valid for your device and for the current time of day. If so, you can access your account.
Two-factor random number streams
I love the fact that the Security Code is pseudorandom and yet verifiable. And it occurred to me that I can use the main idea of TFA to demonstrate some of the newer features in the SAS random number generators (RNGs).
Long-time SAS programmers know that each stream is determined by a random number seed. But a newer feature is that you can also set a "key" for a random number stream. For several of the new RNGs, streams that have the same seed but different keys are independent. You can use this fact to emulate the TFA app:
- The Credential ID (which is unique to each device) is the "seed" for an RNG.
- The time of day is the "key" for an RNG. Because the Security Code must be valid for 30 seconds, round the time to the nearest 30-second boundary.
- Usually each call to the RAND function advances the state of the RNG so that the next call to RAND produces a new pseudorandom number. For this application, we want to get the same number for any call within a 30-second period. One way to do this is to reset the random number stream before each call so that RAND always returns the FIRST number in the stream for the (seed, time) combination.
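The three ingredients above (device ID, rounded time, always the first number) can be sketched outside of SAS as well. The following Python sketch swaps in a different technique: instead of a seeded and keyed RNG stream, it hashes the (device ID, rounded time) pair with SHA-256 and keeps six digits. It is an emulation for illustration only, not the algorithm that real authenticator apps use, and the function names are hypothetical.

```python
import hashlib
import math

def time_key(t, interval=30):
    """Round t (in seconds) to the nearest multiple of interval, halves rounded up."""
    return math.floor(t / interval + 0.5) * interval

def security_code(device_id, t, interval=30):
    """Return a 6-digit code (as an int in 0..999999) that depends only on
    the device ID and on the rounded time."""
    key = time_key(t, interval)
    digest = hashlib.sha256(f"{device_id}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 1_000_000
```

Any two calls whose times round to the same boundary return the same code, which is what makes the code verifiable: a server that knows the device ID and the current time can recompute it. Use f"{code:06d}" to display leading zeros.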
Using a key to change a random-number stream
Before worrying about using the time of day as the key value, let's look at a simpler program that returns the first pseudorandom number from independent streams that have the same seed but different key values. I will use PROC FCMP to write a function that can be called from the SAS DATA step. Within the DATA step, I set the seed value and use the "Threefry 2" (TF2) RNG. I then call the Rnd6Int function for six different key values.
proc fcmp outlib=work.TFAFunc.Access;
   /* this function sets the key of a random-numbers stream and
      returns the first 6-digit pseudorandom number in that stream */
   function Rnd6Int(Key);
      call stream(Key);                /* set the Key for the stream */
      call streamrewind(Key);          /* rewind stream with this Key */
      x = rand("Integer", 0, 999999);  /* first 6-digit random number in stream */
      return( x );
   endsub;
quit;

options cmplib=(work.TFAFunc);   /* DATA step looks here for unresolved functions */
data Test;
DeviceID = 12345;                   /* ID for some device */
call streaminit('TF2', DeviceID);   /* set RNG and seed (once per data step) */
do Key = 1 to 6;
   SecCode = Rnd6Int(Key);          /* get random number from seed and key values */
   /* Call the function again. Should produce the same value b/c of STREAMREWIND */
   SecCodeDup = Rnd6Int(Key);
   output;
end;
keep DeviceID Key SecCode:;
format SecCode SecCodeDup Z6.;
run;

proc print data=Test noobs; run;
Each key generates a different pseudorandom six-digit integer. Notice that the program calls the Rnd6Int function twice for each key value. The function returns the same number each time because the random number stream for the (seed, key) combination gets reset by the STREAMREWIND call during each call. Without the STREAMREWIND call, the function would return a different value for each call.
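The effect of rewinding is not specific to SAS. As a rough analogy in Python (not a reimplementation of STREAMREWIND or of the Threefry generator), the sketch below saves the initial state of a generator and restores it on demand. The class name KeyedStream and the way the seed and key are combined into one seed string are assumptions made for illustration.

```python
import random

class KeyedStream:
    """A toy stand-in for a (seed, key) random stream that can be rewound."""

    def __init__(self, seed, key):
        self._rng = random.Random(f"{seed}:{key}")
        self._start = self._rng.getstate()    # remember the initial state

    def rewind(self):
        self._rng.setstate(self._start)       # reset to the beginning of the stream

    def next_code(self):
        return self._rng.randint(0, 999999)   # both endpoints are included
```

Calling next_code repeatedly walks forward through the stream. Calling rewind first makes next_code return the first number in the stream again, which is the behavior that the Rnd6Int function relies on.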
Using a time value as a key
With a slight modification, the program in the previous section can be made to emulate the program/app that generates a new TFA token every 30 seconds. However, so that we don't have to wait so long, the following program sets the time interval (the DT macro) to 10 seconds instead of 30. Instead of talking about a 30-second interval or a 10-second interval, I will use the term "DT-second interval," where DT can be any time interval.
The program below gets the "key" by looking at the current datetime value and rounding it to the nearest DT-second interval. This value (the RefTime variable) is sent to the Rnd6Int function to generate a pseudorandom Security Code. To demonstrate that the program generates a new Security Code every DT seconds, I call the Rnd6Int function 10 times, waiting 3 seconds between each call. The results are printed below:
%let DT = 10;   /* change the Security Code every DT seconds */

/* The following DATA step takes 30 seconds to run because it
   performs 10 iterations and waits 3 secs between iterations */
data TFA_Device;
   keep DeviceID Time SecCode;
   DeviceID = 12345;
   call streaminit('TF2', DeviceID);   /* set the RNG and seed */
   do i = 1 to 10;
      t = datetime();                  /* get the current time */
      /* round to the nearest DT seconds and save the "reference time" */
      RefTime = round(t, &DT);
      SecCode = Rnd6Int(RefTime);      /* get a random Security Code */
      Time = timepart(t);              /* output only the time */
      call sleep(3, 1);                /* delay 3 seconds; unit=1 sec */
      output;
   end;
   format Time TIME10. SecCode Z6.;
run;

proc print data=TFA_Device noobs;
   var DeviceId Time SecCode;
run;
The output shows that the program generated three different Security Codes. Each code is constant for a DT-second period (here, DT=10) and then changes to a new value. For example, when the seconds are in the interval [05, 15), the Security Code has the same value. The Security Code is also constant when the seconds are in the interval [15, 25) and so forth. A program like this emulates the behavior of an app that generates a new pseudorandom Security Code every DT seconds.
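The way the rounding groups times into DT-second intervals can be checked without waiting or generating any random numbers. The following sketch uses a fixed, hypothetical start time (noon on 01JAN2020) and the hypothetical data set name RoundDemo; only the ROUND call is taken from the program above.

```sas
%let DT = 10;

data RoundDemo;
   t0 = '01JAN2020:12:00:00'dt;    /* hypothetical start time */
   do Offset = 1 to 28 by 3;       /* 1, 4, 7, ..., 28 seconds after t0 */
      t = t0 + Offset;
      RefTime = round(t, &DT);     /* same rounding as in the program above */
      output;
   end;
   keep Offset t RefTime;
   format t RefTime DATETIME20.;
run;

proc print data=RoundDemo noobs; run;
```

Offsets 1 and 4 round down to 12:00:00, offsets 7, 10, and 13 round to 12:00:10, offsets 16, 19, and 22 round to 12:00:20, and offsets 25 and 28 round to 12:00:30. Every time that shares a RefTime value would be passed to Rnd6Int with the same key, so it would receive the same Security Code.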
Different seeds for different devices
For TFA, every device has a unique Device ID. Because the Device ID is used to set the random number seed, the pseudorandom numbers that are generated on one device will be different from the numbers generated on another device. The following program uses the Device ID as the seed value for the RNG and the time of day for the key value. I wrapped a macro around the program and called it for three hypothetical values of the Device ID.
%macro GenerateCode(ID, DT);
data GenCode;
   keep DeviceID Time SecCode;
   format DeviceID 10. Time TIME10. SecCode Z6.;
   DeviceID = &ID;
   call streaminit('TF2', DeviceID);   /* set the seed from the device */
   t = datetime();                     /* look at the current time */
   /* round to the nearest DT seconds and save the "reference time" */
   RefTime = round(t, &DT);            /* round to nearest DT seconds */
   SecCode = Rnd6Int(RefTime);         /* get a random Security Code */
   Time = timepart(t);                 /* output only the time */
run;

proc print data=GenCode noobs; run;
%mend;

/* each device has a unique ID */
%GenerateCode(12345, 30);
%GenerateCode(24680, 30);
%GenerateCode(97531, 30);
As expected, the program produces different Security Codes for different Device IDs, even though the time (key) value is the same.
In summary, you can use features of the SAS random number generators in SAS 9.4M5 to emulate the behavior of a TFA token generator. The SAS program in this article uses the Device ID as the "seed" and the time of day as a "key" to select an independent stream. The time is rounded to the nearest DT-second interval so that the key stays constant within each interval. For this application, you don't want the RAND function to advance the state of the RNG, so you can use the STREAMREWIND call to rewind the stream before each call. In this way, you can generate a pseudorandom Security Code that depends on the device and is valid for a certain length of time.
Meet Alfred Mukudu in this third post of the Humans of SAS Services series.