3月 102017
 

I recently needed to solve a fun programming problem. I challenge other SAS programmers to solve it, too! The problem is easy to state: Given a long sequence of digits, can you write a program to count how many times a particular subsequence occurs? For example, if I give you a sequence of 1,000 digits, can you determine whether the five-digit pattern {1 2 3 4 5} appears somewhere as a subsequence? How many times does it appear?

If the sequence is stored in a data set with one digit in each row, then SAS DATA step programmers might suspect that the LAG function will be useful for solving this problem. The LAG function enables a DATA step to examine the values of several digits simultaneously.

The SAS/IML language also has a LAG function which enables you to form a matrix of lagged values. This leads to an interesting way use vectorized computations to solve this problem. The following SAS/IML program defines a small eight-digit set of numbers in which the pattern {1 2 3} appears twice. The LAG function in SAS/IML accepts a vector of lags and creates a matrix where each column is a lagged version of the input sequence:

/* Find a certain pattern in sequence of digits */
proc iml;
Digit = {1,1,2,3,3,1,2,3};      /* digits to search */
target = {1 2 3};               /* the pattern to find */
p = ncol(target);               /* length of target sequence */
D = lag(Digit, (p-1):0);        /* columns shift the digits */
print D;

The output shows a three-column matrix (D) that contains the second, first, and zeroth lag (in that order) for the input sequence. Notice that if I am searching for a particular three-digit pattern, this matrix is very useful. The rows of this matrix are all three-digit patterns that appear in the original sequence. Consequently, to search for a three-digit pattern, I can use the rows of the matrix D.

To make the task easier, you can delete the first two rows, which contain missing values. You can also form a binary matrix X that has the value X[i,j]=1 when the j_th element of the pattern equals the j_th element of the i_th row, and 0 otherwise, as shown in the following:

D = D[p:nrow(Digit),];          /* delete first p rows */
X = (D=target);                 /* binary matrix */
print X;

Notice that in SAS/IML, the comparison operator (=) can perform a vector comparison. The binary comparison operator detects that the matrix on the left (D) and the vector on the right (target) both contain three columns. Therefore the operator creates the three-column logical matrix X, as shown. The X matrix has a wonderful property: a row of X contains all 1s if and only if the corresponding row of D matches the target pattern. So to find matches, you just need to sum the values in the rows of X. If the row sum equals the number of digits in the pattern, then that row indicates a place where the target pattern appears in the original sequence.

You can program this test in PROC IML as follows. The subscript reduction operator [,+] is shorthand for "compute the sum of each row over all columns".

/* sum across columns. Which rows contain all 1s? */
b = (X[,+] = p);                /* b[i]=1 if i_th row matches target */
NumRepl = sum(b);               /* how many times does target appear? */
if NumRepl=0 then 
   print "The target does not appear in the digits";
else
   print "The target appears at location " (loc(b)[1]),  /* print 1st location */
         "The target appears" (NumRepl) "times.";

The program discovered that the target pattern appears in the sequence twice. The first appearance begins with the second digit in the sequence. The pattern also appears in the sequence at the sixth position, although that information is not printed.

Notice that you can solve this problem in SAS/IML without writing any loops. Instead, you can use the LAG function to convert the N-digit sequence into a matrix with N-p rows and p columns. You can then test whether the target pattern matches one of the rows.

Your turn! Can you solve this problem?

Now that I have shown one way to solve the problem, I invite SAS programmers to write their own program to determine whether a specified pattern appears in a numeric sequence. You can use the DATA step, DS2, SQL, or any other SAS language.

Post a comment to submit your best SAS program. Extra points for programs that count all occurrences of the pattern and display the location of the first occurrence.

To help you debug your program, here is test data that we can all use. It contains 10 million random digits:

data Have(keep=Digit);
call streaminit(31488);
do i = 1 to 1e7;
   Digit = floor(10*rand("uniform"));
   output;
end;
run;

To help you determine if your program is correct, you can use the following results. In this sequence of digits:

  • The five-digit pattern {1 2 3 4 5} occurs 101 times and the first appearance begins at row 34417
  • The six-digit patter {6 5 4 3 2 1} occurs 15 times and the first appearance begins at row 120920

You can verify these facts by using PROC PRINT as follows:

proc print data=Have(firstobs=34417 obs=34421); run;
proc print data=Have(firstobs=120920 obs=120925); run;

Happy programming!

The post Find a pattern in a sequence of digits appeared first on The DO Loop.

3月 092017
 

Unlike some other UK Government departments, the Home Office has done well out of the recent spending review. Overall police spending has been protected – following the debacle of the earlier calculation errors – to protect against emerging crime threats and to train more firearms officers. Counter-terrorism has received a [...]

Why deeper insights, not more cash, will transform Home Office outcomes was published on SAS Voices by Peter Snelling

3月 092017
 

SAS® Federation Server provides a central, virtual environment for administering and securing access to your data. It also allows you to combine data from multiple sources without moving or copying the data. SAS Federation Server Manager, a web-based application, is used to administer SAS Federation Server(s).

Data privacy is a major concern for organizations and one of the features of SAS Federation Server is it allows you to effectively and efficiently control access to your data, so you can limit who is able to view sensitive data such as credit card numbers, personal identification numbers, names, etc. In this three-part blog series, I will explore the topic of controlling data access using SAS Federation Server. The series covers the following topics:

SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server and SAS Federation Server Manager is used to help control access to the data. SAS Metadata Server is used to perform authentication for users and groups in SAS Federation Server and SAS Federation Server Manager is used to help control access to the data.  Note: Permissions applied for a particular data source cannot be bypassed with SAS Federation Server security. If permissions are denied at the source data, for example on a table, then users will always be denied access to that table, no matter what permissions are set in SAS Federation Server.

In this post, I will build on the examples from my previous articles and demonstrate how you can use data masking to conceal actual data values from users, but still allow them access for analysis and reporting purposes.

In previous posts, I gave the Finance Users group access to the SALARY table. Linda is a member of the Finance Users group, so currently she has access to the SALARY table.

However, I want to restrict her access. She needs access to the Salary info for analytic purposes, but does not need to know the identifying data of IDNUM, so I can hide that column from her. She does need the JOBCODE information for her analytics; however, she does not need to know the actual JOBCODE information associated with the record, so that data can be masked to prevent her from viewing that identifying information.

First, I create a FedSQL View of the SALARY table. FedSQL is the implementation of SQL that SAS Federation Server uses to access relational data.  For the view, I set the Security to Use the definer’s privileges when accessed since I will eventually deny Linda the rights to view the underlying table to the view.

Here is the default code for the view:

I change the code to the following to remove the IDNUM column from the view and mask the JOBCODE column, so Linda will not know what is the real JOBCODE associated with the Salary.

There are several data masking functions available for use. In this instance, I use the TRANC function to mask the JOBCODE field using transliterated values by replacing the first three characters with other values.  Refer to the Data Masking section of the SAS Federation Server Manager 4.2: User’s Guide for more information on the different data masking functions.

Now that I have created the FedSQL view, I then need to grant Linda authorization to it.

Next, I need to deny Linda authorization to the SALARY table, so she won’t be able to access the original table.

Linda is only able to view the SALARY_VIEW with the IDNUM column removed and the JOBCODE information masked.

Linda is denied access to the SALARY table.

However, Kate another member of the Finance team is able to view the full SALARY table with the IDNUM column and the real information (non-masked) in the JOBCODE column.

In this blog entry, I covered the third part of this series on controlling data access to SAS Federation Server 4.2.  Other blogs in the series include

For more information on SAS Federation Server visit the:

Securing sensitive data using SAS Federation Server data masking was published on SAS Users.

3月 092017
 

Several months ago, I posted a blog about calculating moving averages for a measure in the Visual Analytics Designer. Soon after that, I was asked about calculating not only the average, but also the standard deviation over a period of months, when the data might consist of one or more repeated values of a measure for each month of a series of N months.  For the example of N=20 months, we might want to view the average and standard deviation over the last n months, where n is any number between 3 and 20.

The example report shown below allows the user to type in a number, n, between 3 and 20, to display a report consisting of the amount values for past n months, the amount values for Current Month Amt-Previous, the Avg over the last n months, the Standard Deviation over the last n months, and the absolute value of the (Current Month Amt – Previous Month Amt), divided by the Standard Deviation over the last n months. A Display rule is applied to the final Abs column, showing Green for a value less than 1 and red for a value greater than or equal to 1.

The data used in this example had multiple Amount values for each month, so we first used the Visual Data Builder to create a SUM aggregation for Amount for each unique Date value.  This step gives more flexibility in using the amount value for aggregations in the designer.

When the modified data source is initially added to the report, it contains only the Category data item Month, with a format of MMYYYY, and the measure Amount Sum for Month.

The data will be displayed in a list table. The first columns added to the table will be Month, displayed with a MMYYYY format, and Amount Sum for Month.

Specify the properties for the list table as below:

Since we want to display the last n months, we create a new calculated data item, Numeric Date, calculated as below, using the TREATAS operator on the Month data item:

Then we create the Current Month Amt-Previous aggregated measure using the RelativePeriod date operator:

Next, create the Avg over all displayed months aggregated measure as below:

Then, create the Std.Dev. over all displayed months aggregated measure as shown below:

Create the Abs (Current-Previous/StdDev) as shown below:

Create a numeric parameter, Number of Months, as shown, with minimum value of 3 (smallest value that a standard deviation will make sense) and maximum value of 20 (the number of months in our data). You can let the default (Current value) value be any value that you choose:

For the List Table, create a Rank, as shown below. Note that we are creating the rank on the Numeric Date (not the Month data item), and rather than a specific value for count, we are going to use the value of the parameter, Number of Months.

Create a text input object that enables the user to type in a ‘number of months’ between 3 and 20.

Associate the Parameter with the Text input object:

If you wish, you can add display rules to sound an alarm whenever there is an alarming month-to-month difference in comparison to the standard deviation for the months.

So the final result of all of the above is this report, which points out month-to-month differences, which might deserve further concern or investigation. Note that the Numeric Date value is included below just to enable you to see what those values look like—you likely would not want to include that calculated data item in your report.

Calculating standard deviation of a measure in Visual Analytics Designer was published on SAS Users.

3月 092017
 

Editor’s note: This is the first in a series of posts to help current SAS programmers add SAS Viya to their analytics skillset. In this post, SAS instructors Stacey Syphus and Marc Huber introduce you to the new Transitioning from Programming in SAS 9 to SAS Viya video library, designed to show SAS programmers [...]

The post Transitioning from programming in SAS 9 to SAS Viya appeared first on SAS Learning Post.

3月 082017
 

The NC Scholastic Chess Championship is coming up this weekend, and my buddy Michael Thomas asked if I might could create a few graphs to help analyze the event data. How could I pass up an opportunity like that?!?! Read along, and find out what graphs I created, and the [...]

The post What's your next move? (analytics for a chess tournament) appeared first on SAS Learning Post.

3月 082017
 

Suppose you have several discrete variables. You want to conduct a frequency analysis of these variables and print the results, but ONLY for variables that have three or more levels. In other words, you want to conditionally display some results, but you don't know which variables satisfy the condition until after you run the analysis.

An experienced SAS programmer can probably think of several ways to solve this problem. The simplest solution requires going through the data twice. During the first pass you use PROC SQL or PROC FREQ to count the number of distinct levels for each variable. You then create a list of the variables that have three or more levels and call PROC FREQ on those variables and show the one-way frequency tables that result.

That is a fine solution. However, I read the question just after I finished writing an article about how to select and reorder output with PROC DOCUMENT. It occurred to me that a more efficient solution is to let PROC FREQ compute tables for all the variables, but use PROC DOCUMENT to display only the tables that satisfy the condition. If you don't mind extra complexity, you can even use the DATA step and CALL EXECUTE to automate some of the replaying, a technique that I learned from a 2016 paper by Warren Kuhfeld. (He uses similar ideas in his free e-book Advanced ODS Graphics Examples.)

To demonstrate this technique, I will create a modified version of the Sashelp.Cars data. The following DATA step copies the data and adds two new character variables, one with one level and another with two levels:

data Have;
set sashelp.cars;
c1 = "A";
if _N_ < 100 then c2 = "A"; 
             else c2 = "B";
run;

Step 1: Store the output in a document

The goal is to print ONLY frequency tables for variables that have three or more levels. The following ODS statements suppress output to all open destinations, open the DOCUMENT destination (named "RDoc"), and select only the OneWayFreqs table. The ODS OUTPUT destination is used to save the "NLevels" table of PROC FREQ, which contains information about the number of levels in each variable.

ods exclude all;                          /* suppress output */
ods document name=RDoc(write);            /* write to document */
ods document select OneWayFreqs;          /* these tables go into the doc */
ods output NLevels=Levels;                /* save number of levels to data set */
   proc freq data=Have nlevels;
     tables origin c1 cylinders c2 type;  /* specify variables to analyze */
   run;
ods document close;
ods exclude none;

If the preceding statements seem confusing, try running just the PROC FREQ statement. It produces five frequency tables and an output data set (Levels) which contains the number of levels for each variable. The other ODS statements just ensure that only the DOCUMENT destination receives the OneWayFreqs tables.

Step 2: Examine the names of the objects in the document

While developing the program, you will want to see the contents of the Levels data and the RDoc document, as follows. These statements will not appear in the final program.

proc print data=Levels noobs;
   var TableVar NLevels;
run;
 
proc document name=RDoc;
   list ^ (where=(_TYPE_="Table")) / levels=all;  /* list all tables */
run; quit;

The first table shows which variables have three or more levels. The second table lists the names of the tables in the document. The variables are stored in the same order as the variables in the Levels data set.

Step 3: Display the output for certain variabes

If you were doing this task manually, you would look at the Levels data set and conclude that the first, third, and fifth variables have three or more levels. You could then use the REPLAY statement in PROC document to display those tables. The manual code would look like the following:

/* No automation: Print only OneWayFreqs tables w/ 3 or more levels */
proc document name=RDoc(read);
   replay Freq#1Table1#1OneWayFreqs#1;   /* display Table1 */
   replay Freq#1Table3#1OneWayFreqs#1;   /* display Table3 */
   replay Freq#1Table5#1OneWayFreqs#1;   /* display Table5 */
run; quit;

The observant programmer will notice that these statements are just the result of an algorithm:

  1. Loop over each row in the Levels data set
  2. If the NLevels variable is greater than some threshold, output the corresponding table.

You can program that algorithm in the SAS DATA step and generate the corresponding PROC DOCUMENT statements. One way is to write the statements to a text file and then use the %INCLUDE statement to execute the statements. An alternative approach is to use the CALL EXECUTE subroutine to buffer up the statement so that they run when the DATA step terminates, as shown by the following program:

%let L = 3;              /* print only OneWayFreqs tables w/ L or more levels */
options source;          /* show the statements submitted by CALL EXECUTE*/
title "Replay only the tables that contain &L or more levels";
data _NULL_;
set Levels end=EOF;     /* implicit loop over rows of the data */
if _N_ = 1 then         /* first statement */
   call execute('proc document name=RDoc(read);');
if NLevels >= &L then   /* replay tables that satisfy condition */
   call execute('replay Freq#1Table'|| strip(putn(_N_,3)) ||'#1OneWayFreqs#1;');
if EOF then             /* last statement */
   call execute('run; quit;');
run;

The DATA step generates the complete call to PROC DOCUMENT, which executes after the DATA set exits. The result is that one-way frequency tables are conditionally printed. Although PROC FREQ analyzed all the variables, only the tables that have more than three levels are displayed.

If you haven't seen this technique before, it might be a little jarring because you are using a SAS program to write a SAS program. This is an advanced technique, to be sure, but one that can be very useful. It can be adapted to many other situations in which you want to conditionally display certain tables, but you must run the analysis before you know which tables satisfy the condition.

The post Display output conditionally with PROC DOCUMENT appeared first on The DO Loop.

3月 082017
 

Digitalisation is blasting the cobwebs out from enterprises and organisations of all kinds – freeing them to innovate and take advantage of the always-on economy. But it’s also helping new disruptive players to gain an unexpectedly strong foothold in many markets. One of the key advantages these new players have [...]

Is governance getting in the way of innovation? was published on SAS Voices by Peter Pugh-Jones

3月 072017
 

To get a high-performing analytics team producing insights that matter, you need great people, powerful software and a culture of experimentation and innovation. Three simple ingredients, but getting there is far from easy. In this post, I’d like to get you thinking about how to organize for success by building [...]

Building a high-performing analytics team was published on SAS Voices by Steven O'Donoghue