Ron Cody

10月 132021
 

There are many reasons why you might want to encrypt data. I use a SAS program to encrypt a list of logon names and passwords. Before we get started describing how to encrypt data, let's discuss some basic concepts concerning encrypting and decrypting data.

All computer data is stored as a series of 1s and 0s. For example, an uppercase A in ASCII is 01000001. Many encrypting schemes use a key to transform plaintext into ciphertext. As an example, suppose your key is B (01000010).

I'm sure you remember the Boolean operators AND, NOT, and OR. If you perform an AND operator on A and B, the result is a 1 if both values are true (1) and false (0) otherwise. So, A AND B is 01000000. The OR operator results in a value of true if either A or B is true, or both A and B are true. Therefore, A OR B is 01000011. The operator that you might not be as familiar with is the exclusive OR (XOR) operator. This is similar to an OR operator except that if both A and B are true, the result is false. A XOR B is equal to 00000011. Why is this useful? An interesting property of the XOR operator is if you take the result and use the XOR operator again on the previous result, you get back to the original value.

The table below shows how the XOR operator works:

A 0 1 0 0 0 0 0 1
B 0 1 0 0 0 0 1 0
A XOR B 0 0 0 0 0 0 1 1
(A XOR B) XOR B 0 1 0 0 0 0 0 1

 

If A is the cleartext message and B is the key, A XOR B is the ciphertext. If you perform an exclusive OR with the key (B) and the ciphertext (as was done in the last line), you get back the cleartext value (A).

Of course, you don't use single letter keys to encode messages. Key lengths of 8, 16, or even up to 512 are common. The problem with any key is that if you apply it to a long message, there is a pattern in the ciphertext that allows a code breaker to figure out how long the key is, and even what it is. There are several computer programs that can decode many of the popular encrypting methods. My favorite is:

https://www.geocachingtoolbox.com/

Here is a list of ciphers that can be broken with this program:

ADFGX/ADFGVX cipher Four-square cipher Substitution cipher
Affine cipher Gronsfeld cipher Trifid cipher
Atbash cipher Kamasutra cipher Vanity code
Bacon cipher Kenny code Vigenère cipher
Bifid cipher One-time pad Vigenère cipher decoder
Burrows-Wheeler transform Playfair cipher
Caeser cipher (ROT13) Rail Fence cipher

 

Let's write a short SAS program that uses a key to encode a text string.

data Encode;
 
retain Key 12345; ❶
 
length Letter $ 1; ❷
 
String = 'This is a test';
 
do i = 1 to lengthn(String);
 
Letter = substr(String,i,1); ❸
 
Rank = rank(Letter); ❹
 
Coded = bxor(Rank,Key); ❺
 
Decoded = bxor(Coded,Key); ❻
 
Clear = byte(Decoded); ❼
 
output;
 
end;
 
drop i;
 
run;
 
 
 
title "Listing of Data Set Encode";
 
proc print dta=Encode noobs;
 
var Letter Key Rank Coded Decoded Clear;
 
format Key Rank Coded binary8.;
 
run;

❶ A RETAIN statement is used to assign the number 12345 to a numeric variable called Key.  (You could have used an assignment statement, but using a RETAIN statement is more efficient and elegant.

❷ The variable Letter will hold each letter of the message and is set to a length of one.

❸ The SUBSTR function will extract each letter from String.

❹ Because Boolean operators only operate on true/false values, you use the RANK function to convert each letter to its ASCII value (stored internally as a series of 0s and 1s).

❺ You now use the BXOR (binary exclusive OR) function to encode each letter of your message.

❻ To demonstrate that the program is working, you use the BXOR function again to demonstrate that this process will return the original String.

❼ The BYTE function takes an ASCII value and returns the appropriate character.

Here is the listing of data set Encode:

Because this encryption method uses a single (and short) key, it would be fairly easy to break. What if you encode every letter of the original message with a different key? You can accomplish this by using the SAS random function RAND and using a seed value so that the same series of random numbers gets generated every time you run the program. You can even use one of a dozen different random distributions, to make it harder for someone to decode your file. Here is an example:

First, here is a copy of my text file that contains my secret message (stored in the location c:\Books\Blogs\Cipher\Clear_Text.txt).

Good morning Mr. Phelps.

Your mission, should you decide to accept it,

is to rid the world of evil.

As usual, if you or any member of your team are caught or killed,

the Secretary will disavow any knowledge of your actions.

The following program encrypts this file and creates a temporary data set (in a real situation, you would make this a permanent data set):

data Coded;
 
call streaminit(13579); ❶
 
array l[150] $ 1 _temporary_; ❷
 
array num[150] _temporary_; ❸
 
array xor[150]; ❹
 
infile "c:\Books\Blogs\Cipher\Clear_Text.txt" pad;
 
input string $150.; ❺
 
do i = 1 to dim(l);
 
l[i] = substr(string,i,1); ❼
 
num[i] = rank(l[i]); ❽
 
xor[i] = bxor(num[i],int(100*rand('Uniform'))); ❾
 
end;
 
keep xor1-xor150;
 
run;

❶ You need to set a seed value using CALL STREAMINIT so that when you run the decoding program, you will generate the same series of random numbers.

❷ This temporary array will hold up to 150 characters.

❸ The Num array holds the numerical ASCII value for each of the characters in the line.

❹ The XOR array holds the values of the exclusive OR between each numerical ASCII value and the Key.

❺ Read in a string of up to 150 characters.

❻ The DO LOOP picks up each character in a line, starting from 1 and ending at the length of each line.

❼ The RANK function outputs the ASCII value of each character.

❽ The BXOR (binary exclusive OR) function performs the exclusive OR between each ASCII value and the Key.

To decode this message, use the following program:

data Decode;
 
call streaminit(13579); ❶
 
array l[150] $ 1 _temporary_;
 
array num[150] _temporary_;
 
array xor[150];
 
length String $ 150;
 
set Coded;
 
do i = 1 to dim(l);
 
num[i] = bxor(xor[i],int(100*rand('Uniform'))); ❷
 
l[i] = byte(num[i]); ❸
 
substr(String,i,1) = l[i]; ❹
 
end;
 
keep String;
 
run;

❶ Notice that the value of the CALL STREAMINIT routine uses the same seed as the previous program.

❷ The BXOR function between each coded value and the Key, will produce the cleartext.

❸ The BYTE function will convert the ASCII values back to letters, numbers, and other characters.

❹ Finally, the SUBSTR function used on the left-hand side of the equal sign will place each of the characters into the appropriate location in the String variable. (See my previous blog that discusses the use of the SUBSTR function used on the left-hand side of the equal sign.)

Here is the output:

It would be straightforward to convert these two programs into macros so that you could encrypt and decrypt any file.

Of course, I would be remiss if I didn't mention that you can encrypt a SAS data set using two data set options ENCRYPT=and ENCRYPTKEY="password". But what would be the fun of that?

Here is an example:

*Note: AES stands for Advanced Encryption Standard
 
If you use quotation marks on the ENCRYPTKEY= option, you have
 
more flexibility in choosing a password (maximum length=64);
 
 
 
data Secret(encrypt=aes encryptkey="mypassword");
 
input String $80.;
 
datalines;
 
This is a secret message.
 
See if you can decode it.
 
This message will not self-destruct!
 
;

You can decode the encrypted data set by including the DATA set option ENCRYPTKEY="password" in any procedure, such as the PROC PRINT shown below:

title "Listing of Data Set Secret";
 
proc print data=Secret(encryptkey="mypassword") noobs;
 
run;

Encrypting Data Using SAS was published on SAS Users.

9月 212021
 

The term "fuzzy matching" describes a method of comparing two strings that might have slight differences, such as misspelling or a middle initial in a name included or not included. One of my favorite functions to compare the "closeness" of two strings is the SPEDIS (spelling distance) function.

Have you ever misspelled a word in Microsoft Word? If so, the word is underlined in red (so they tell me, because I have never misspelled anything) and if you right-click on the word, you see a list of possible alternatives. The algorithm used is similar to the SPEDIS function.

Here is how the SPEDIS function works. It takes two arguments, representing the two strings being compared. If the two strings are exactly the same, the function returns a zero. For each spelling mistake, the function assigns penalty points. Some errors are assigned more penalty points than others. For example, getting the first letter wrong in a word earns high points—interchanging two letters (for example 'ie' for 'ei') earns only a few points. Other errors such as adding a letter or omitting a letter, result an intermediate number of penalty points.

After the two strings are compared, all the penalty points are added up. However, it is not finished yet.  The SPEDIS functions returns the number of penalty points as a percentage of the length of the function's first argument. Why does this make sense? Making a spelling mistake in a three-letter word is quite a severe error. Making a spelling mistake in a 10-letter word is considered a minor mistake.

The following table shows a few examples.

Notice that the SPEDIS function is also useful in comparing digits, such as Social Security numbers.

The next program demonstrates how to use the SPEDIS function to perform a fuzzy match:

*Demonstrating "Fuzzy" Matching;
 
data Compare1;
   input Subj $3. Name1 $25.;
   Name1 = compbl(propcase(Name1)); ❶
datalines;
001 Ronald P. Cody
002 Alfred Smith
003 Fred Mastermind
004 Mixed cAse
005 Two  SPACES
;
Data Compare2;
   input Name2 $25.;
   Name2 = compbl(propcase(Name2)); ❷
datalines;
Ronald Cody
Alfred Smith
Fred Dummy
MIXEd CAse
Too  Spaces
;

❶ To be on the safe side, two functions, COMBL and PROPCASE, are used to convert multiple spaces to a single space (COMPBL) and to convert all names to proper case (PROPCASE).

❷ You use COMPBL and PROPCASE on the second data set.

The SAS code that selects exact and near matches uses PROC SQL to create a Cartesian Product. If you have not seen this before, here is a brief explanation. A Cartesian Product contains every combination of observations from one data set to every observation in a second data set. In "real life," you would always include a WHERE clause in the SQL code because the number of observations in a Cartesian Product is the number of observations in the first data set times the number of observations in the second data set. This would, most likely, be huge.

The table below is a partial listing of the Cartesian Product of the two data sets Compare1 and Compare2 (without a WHERE clause).

Next is the remaining code that produces two data sets—one with exact matches and one with possible matches:

proc sql; ❸
   create table Exact as
   select *
   from Compare1,Compare2
   where spedis(Name1,Name2) eq 0;quit;
 
proc sql; ❺
   create table Close as
   select *
   from Compare1,Compare2
   where 0 lt spedis(Name1,Name2) le 25;quit;

 

❸ This is the PROC SQL code that create the Cartesian Product. You can get more information on this from SAS Help.

❹ If the SPEDIS function returns a 0, you have an exact match.

❺ You repeat the SQL code to create the data set for close, but not exact matches.

❻ You are looking for spelling distances greater than 0 and less than or equal to 25.

Here are the two listings.

You will need to experiment with the value you choose for considering two strings to be a possible match. I chose 25 in the sample program. If you want to be more certain that the names are possible matches, use a smaller number for creating the data set Exact. If you want to see even remote possible matches, choose a larger value. You also need to consider the length of the two strings. If they are quite long, you might choose a smaller value to be considered a match because the percentage for a single spelling error will result in a relatively small value for the spelling distance.

Fuzzy Matching was published on SAS Users.

 Posted by at 9:09 下午  Tagged with:
8月 102021
 

This post demonstrates how to rank data and how to place these ranks into roughly equal groups.

There are certain variables, such as annual salary, that are highly skewed. There are many who earn between $50,00 and $150,000, but some who earn millions or hundreds of millions of dollars a year. Trying to use variables like annual salary in statistical models typically violates assumptions of many popular statistical techniques. There are several solutions to the types of distribution problems we just described. One solution is to use a transformation like a logarithm of a value to "bring in the tail." Another solution is to substitute ranks for the original values. For example, the lowest salary would be assigned a rank of one, the next highest would be assigned a rank of two, and so forth. Another method is to place all of the values into a number of bins. For example, you could place all the salaries into ranges such that there would be approximately an equal number of values in each range.

You can use SAS Studio tasks to create ranks and, with a tiny bit of editing, create salary ranges.

Let's start with a data set called Salary that was created by a small program using a random number function. Shown below is a histogram and a smooth line representing 1,000 values of salary from this data set.

You see a grouping of values on the left side of the distribution and a few very high salaries in the right tail. For curious readers, here is the program that generated these data values.

The RAND function can generate quite a few distributions, such as uniform and normal. For this program, an exponential distribution was used.

Suppose you plan to use yearly salary in a binary logistic regression model. Using the actual values from the Salary data set would not work well. Let's start out by creating a new variable that represents the rank of salary. In SAS Studio, this is easily done using the Rank Data task as one of the selections under the Data tab. You can see this in the figure below.

You choose the data set and variable to rank on the DATA tab, like this.

The Salary data set was selected, and the variable Salary was chosen as the variable (column) to rank. Finally, Rank_Salary was selected for the output data set name. A histogram of the ranks is, as you would expect, uniform ranging from one to 1,000 (see figure below).

How can you place these 1,000 values into 10 bins? To do this, you click the CODE tab and then click Edit (circled in the figure below).

All you need to do is add the PROC RANK option Groups=10 to this program as shown next.

This option groups all the ranks into 10 groups. Below is a histogram of the variable Rank_Salary with the Groups= option included.

This new variable would work quite well in a logistic regression model or other types of regression.

If you found this blog post helpful, you might be interested in some of my books. As always, comments and/or suggestions are welcome.

How to Transform a Skewed Distribution to a Uniform Distribution was published on SAS Users.

8月 032021
 

Thousands of SAS users are migrating from SAS University Edition to SAS OnDemand for Academics (ODA). I thought I would share some of my thoughts, having just finished two books using ODA (Getting Started with SAS Programming: Using SAS Studio in the Cloud and A Gentle Introduction to Statistics Using SAS Studio in the Cloud). Users of SAS University Edition had to download software to create a virtual computer on their local computer, then download the SAS software, and finally, set up a way to read and write files from their “real” computer to their “virtual computer.” This caused many people massive headaches (including me). The great news about SAS OnDemand for Academics is that you don’t have to download anything! You access SAS on a cloud platform. Also, reading data from your real computer is quite simple.

ODA was developed so that people can use it to learn how to program and run tasks using SAS. It is not to be used for commercial purposes. There is also a 5-gigabyte limit for data files, but that is certainly not a problem for learning how to use SAS.

Registering for ODA

To gain access to ODA, you need to register with SAS. Part of the registration process is to create a SAS profile. If you already have a SAS profile, skip that portion of the instructions.

To start, point your browser to:

https://welcome.oda.sas.com

You will have to wait for an email from SAS with an approval, but once you have access, it is a delight to use. Let me show you step-by-step how easy it is to upload an Excel workbook from your local computer to SAS Studio using ODA.

This first figure shows a worksheet (Drug_Study.xlsx) with information on four individuals.

Figure 1: Excel Spreadsheet

All the variable names should be self-explanatory except for SBP (systolic blood pressure) and DBP (diastolic blood pressure). It is a good idea to choose variable names that are compatible with SAS V7. The next step is to select a folder in the Server Files and Folder tab in SAS Studio and upload the file. This tab is shown in Figure 2 (with the UPLOAD icon circled).

Figure 2: Upload the File to a Folder in SAS Studio

Once you click the UPLOAD icon, you are given the option to select a file.

Figure 3: Select Your File

Select your file, click UPLOAD, and the file on your local hard drive (Drug_Study.xlsx) will appear in the SAS Studio folder (Figure 4). (Note: you can upload multiple files at one time if you choose.)

Figure 4: The File Drug_Study.xlsx in Now in the SAS Studio Folder

You can now use the IMPORT utility (under Tasks and Utilities) to convert this Excel file to a SAS data set. In this example, the default name of the WORK file (IMPORT) was changed to Drug_Study. A list of variables is shown in Figure 5.

Figure 5: List of Variable Names in Data Set Drug_Study

You can now write programs or use SAS Studio tasks to analyze this data set.

To learn more about how to use SAS Studio as part of SAS OnDemand for Academics, write SAS programs, or use SAS Studio tasks, please take a look at my new book, Getting Started with SAS Programing: Using SAS Studio in the Cloud.

I welcome comments and questions on this blog.

Migrating from SAS University Edition to SAS OnDemand for Academics was published on SAS Users.

7月 272021
 

In the past, the COMPRESS function was useful. Since SAS version 9, it has become a blockbuster, and you might not have noticed. The major change was the addition of a new optional parameter called MODIFIERS.

The traditional use of the COMPRESS function was to remove blanks or a list of selected characters from a character string. The addition of a MODIFIER argument does two things. First, you can specify classes of characters to remove, such as all letters, all punctuation marks, or all digits. That is extremely useful, but the addition of the 'k' modifier is why I used the term blockbuster in my description. The 'k' modifier flips the function from one that removes characters from a string to one that keeps a list of characters and removes everything else. Let me show you some examples.

This first example stems from a real problem I encountered while trying to read values that contained units. My data looked something like this:

ID     Weight 
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg

My goal was to create a variable called Wt that represented the person's weight in pounds as a numeric value.

First, let’s look at the code. Then, I’ll give an explanation.

data Convert;
   length ID $3 Weight $8;
   input ID Weight;
 
   Wt = input(compress(Weight,,'kd'),8.);
   /* The COMPRESS function uses two modifiers, 'k' and 'd'.  This means
      keep the digits, remove anything else.  The INPUT function does the
      character-to-numeric conversion.
   */
 
   If findc(Weight,'k','i') then Wt = Wt * 2.2;
 
   /* the FINDC function is looking for an upper or lowercase 'k' in the
      original character string.  If found, it converts the value in
      kilograms to pounds (note: 1 kg = 2.2 pounds).
   */
 
datalines;
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg
;
title "Listing of Data Set Convert";
footnote "This program was run using SAS OnDemand for Academics";
proc print data=Convert noobs;
run;

The program reads the value of Weight as a character string. The COMPRESS function uses 'k' and 'd' as modifiers. Notice the two commas in the list of arguments. A single comma would interpret 'kd' as the second argument (the list of characters to remove). Including two commas notifies the function that 'kd' is the third argument (modifiers). You can list these modifiers in any order, but I like to use 'kd', and I think of it as "keep the digits." What remains is the string of digits. The INPUT function does the character-to-numeric conversion.

Your next step is to figure out if the original value of Weight contained an upper or lowercase 'k'. The FINDC function can take three arguments: the first is the string that you are examining, the second is a list of characters that you are searching for, and the third argument is the 'i' modifier that says, "ignore case" (very useful).

If the original character string (Weight) contains an uppercase or lowercase 'k', you convert from kilograms to pounds.

Here is the output:

There is one more useful application of the COMPRESS function that I want to discuss. Occasionally, you might have a text file in ASCII or EBCDIC that contains non-printing characters (usually placed there in error). Suppose you want just the digits, decimal points (periods), blanks, and commas. You need to read the original value as a text string. Let's call the original string Contains_Junk. All you need to convert these values is one line of code like this:

Valid = compress(Contains_Junk,'.,','kdas');

In this example, you are using all three arguments of the COMPRESS function. As in pre-9 versions of SAS, the second argument is a list of characters that you want to remove. However, because the third argument (modifiers) contains a 'k', the second argument is a list of characters that you want to keep. In addition to periods and commas, you use modifiers to include all digits, uppercase and lowercase letters (the 'a' modifier - 'a' for alpha), and space characters (these include spaces, tabs, and a few others such as carriage returns and linefeeds). If you did not want to include tabs and other "white space" characters, you could rewrite this line as:

Valid = compress(Contains_Junk,'., ','kd');

Here you are including a blank in the second argument and omitting the 's' in the modifier list.

You can read more about the COMPRESS function in any of the following books, available from SAS Press as an e-book or from Amazon in print form:

Or my latest programming book:

 

Questions and/or comments are welcome.

The Amazing COMPRESS Function was published on SAS Users.

5月 252021
 

In SAS Studio, the ordering of rows and columns in the Table Analysis task are, by default, arranged by the internal ordering of the values used in the table. The table arranges the variables alphabetically or numerically by increasing value. For example, traditional coding uses 1 for Yes and 0 for No, so the No column is created as the first row because the internal value is 0. There are times when it makes more sense to change the order of the rows and/or columns.

Suppose you have data on risk factors for having a heart attack (including high blood pressure) and outcome data (heart attack). A data set called Risk has data on the status of blood pressure and heart attack (simulated data).

Here are the first 10 observations from that data set:

You can use PROC FREQ to create a 2x2 table or you can use the SAS Studio task called Table Analysis (in the Statistics task list) to create your table. Regardless of whether you decide to write a program or use a SAS Studio task, the resulting table looks like this:

Because we are more interested in what causes heart attacks, we would prefer to have Yes (1) as the first row and column of the table. Here is how to do it with a short SAS program:

You create a format that labels 1 as '1:Yes' and 0 as '2:No' and associate this format with both variables in the PROC FREQ step. You also include the PROC FREQ option ORDER=formatted. This option orders values by their formatted values rather than the default ordering—by the internal values. The original table placed 0 before 1 for that reason. By being tricky and placing the 1: and 2: in the format label, you are forcing the Yes values to come before the No values (otherwise, 'No' would come before 'Yes' – alphabetical order). Here is the output:

If you decided to use a SAS Studio task to create the table, you would open the Code window and click the Edit icon. You could then add PROC FORMAT and the ORDER=formatted option in the TABLES statement.

For the curious readers who would like to see how the Risk data set was created, here is the code:

Here the RAND function is generating a Bernoulli distribution (0 or 1), based on a probability of a getting a 1. You specify this probability using the second argument of the function. One more note: The statement CALL STREAMINIT is used to generate the same series of random numbers every time you run the program. If you omit this statement, the program generates a different series of random numbers every time you run it.

You can read more about how to reorder rows and columns in a 2x2 table in my new book, A Gentle Introduction to Statistics Using SAS Studio in the Cloud. In that book, I demonstrate how to edit the SAS Studio-generated code to reorder rows and columns in a table.

Reordering rows and columns in a 2x2 table with SAS® Studio Tasks was published on SAS Users.

3月 022021
 

The more I use SAS Studio in the cloud via SAS OnDemand for Academics, the more I like it. To demonstrate how useful the Files tab is, I'm going to show you what happens when you drag a text file, a SAS data set, and a SAS program into the Editor window.

I previously created a folder called MyBookFiles and uploaded several files from my local computer to that folder.  You can see a partial list of files in the figure below.

Notice that there are text files, SAS data sets, SAS programs, and some Excel workbooks. Look what happens when I drag a text file (Blank_Delimiter.txt) into the Editor window.

No need to open Notepad to view this file—SAS Studio displays it for you. What about a SAS data set? As an example, I dragged a SAS data set called blood_pressure into the Editor.

You see a list of variables and some of the observations in this data set.  There are vertical and horizontal scroll bars (not shown in the figure) to see more rows or columns. If you want to see a listing of the entire data set or the first 'n' observations, you can run the List Data task, located under the Tasks and Utilities tab.

For the last example, I dragged a SAS program into the editor. It appears exactly the same as if I opened it in my stand-alone version of SAS.

At this point, you can run the program or continue to write more SAS code. By the way, the tilde (~) used In the INFILE statement is a shortcut for your home directory. Follow it with the folder name and the file name.

You can read more about SAS Studio in the cloud in my latest book, Getting Started with SAS Programming: Using SAS Studio in the Cloud.

Viewing files, programs, and data sets in SAS Studio was published on SAS Users.

6月 242020
 

A lookup table is a programming technique where one or more values can be used to retrieve another value. For example, many years ago, I had benzene exposure estimates for 10 years (1940 to 1949) for each of five locations in a factory. Given a year and a job location, I needed to know the benzene concentration.

I would be terribly embarrassed today if anyone saw the first program I wrote to solve the problem! This blog shows a better way that uses temporary arrays to create an n-way lookup table. To keep the example simple, let's use five years of data (1944 to 1948) and four locations (1 to 4).

Temporary arrays

Before we get into the program, let's discuss temporary arrays, one of my favorite SAS tools. Here is an example of a one-dimensional temporary array:

Data Pass_Fail;
   input ID $ Grade1 - Grade5;
   array PF[5] _temporary_ (65 70 55 65 55);
   array Grade[5]; *If you leave off the variable list SAS will use the
                    array name with numbers 1-5 added. In this example
                    the variables will be Grade1, Grade2, etc.;
   array Pass_or_Fail[5] $ 4;
   do i = 1 to 5;
      if Grade[i] ge PF[i] then Pass_or_Fail[i] = 'Pass';
      else if not missing(Grade[i]) then Pass_or_Fail[i] = 'Fail';
   end;
   drop i;
datalines;
001 90 68 52 70 72
002 56 69 72 75 88
;
Title "Listing of Data Set Pass_Fail";
Proc print data=Pass_Fail noobs;
Run;

In this example, the temporary array is called PF (pass fail values), and it has 5 elements. There are no actual variables PF1, PF2, and so on, only array elements PF[1], PF[2], and so on. The initial values of the five passing grades are placed in parentheses following the key word _temporary_. In many situations, you load the values of the temporary array from a data file.

To keep this first example easy to understand, we will put the initial values in the array statement. You can now compare each student's grade for every test and assign a value of "Pass" or "Fail."

Here is the output:

Note: You can read a blog that I wrote years ago on temporary arrays for another example. 

Example

Now for the two-way table lookup example.

*Two-dimensional table lookup using a temporary array;
data Lookup;
   array Benzene[1944:1948,4] _temporary_; ①
 
   /* Populate the array */
   if _n_ = 1 then do Year = 1944 to 1948; ②
       do Location = 1 to 4;
	      input Benzene[Year,Location] @; ③
	   end;
	end;
 
   input Subj $ Year Location;
   Benzene_Level = Benzene[Year, Location]; ④
datalines;
250 200 150 130
90 180 155 90
95 35 170 140 
80 50 45 100 
40 50 25 15
001 1944 3
002 1948 1
003 1945 4
;
title "Listing od Data Set Lookup";
proc print data=Lookup noobs;
run;

① This ARRAY statement creates an array with two dimensions (you use a comma to create multiple dimensions). To make programming easier to understand, the first dimension of the array uses subscripts 1944 to 1948, rather than 1 to 5 (the colon enables you to specify the lower and upper bounds of an array). Also notice that there are no initial values in this statement—they will be read from data.
② This section of code populates the values in the Benzene temporary array. You use the statement if _n_ = 1 to ensure that this section of code executes only once.
③ The INPUT statement reads in a value for Year and Location. The single trailing @ sign prevents SAS from going to a new line each time to DO loop iterates.
④ Notice how easy it is to retrieve an exposure value, given a value of Year and Location. The first five lines of data are the values used to populate the temporary array.

You can read more about temporary arrays in my book, Learning SAS by Example: A Programmers Guide, Second Edition.

Comments on this blog are welcome.

Multi-way lookup tables was published on SAS Users.

5月 292020
 

While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Disease – 9th edition) codes along with some patient demographic data.

I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and ARIMA) to assess the association between ozone levels and the number of admissions to emergency departments diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating if the date fell on a weekend. (On weekdays, patients were more likely to visit the personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.

It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!

In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here's the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter's advice to write clever code that no one can understand so they can't fire you!)

The following program demonstrates the =: operator:

*An interesting trick to read ICD codes;
<strong>Data</strong> ICD_9;
  input ICD : $7. @@;
  if ICD =: "493" the output;
datalines;
493 770.6 999 493.9 493.90 493.100
;
title "Listing of All Asthma Codes";
<strong>proc</strong> <strong>print</strong> data=ICD_9 noobs;
<strong>run</strong>;

 

Normally, when SAS compares two strings of different length, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator truncates the longer string to the length of the shorter string before making the comparison.

The usual reason to write a SAS blog is to teach some aspect of SAS programming or to just point out something interesting about SAS. While that is usually my motivation, I have an ulterior motive in writing this blog – I want to plug a new book I have just published on Amazon. It's called 10-8 Awaiting Crew: Memories of a Volunteer EMT. One of the chapters discusses the difficulty of conducting statistical studies in pre-hospital settings. This was my first attempt at a non-technical book. I hope you take a look. (Enter "10-8 awaiting crew" or "Ron Cody" in Amazon search to find the book.) Drop me an email with your thoughts at ron.cody@gmail.com.

Using SAS to estimate the link between ozone and asthma (and a neat trick) was published on SAS Users.

5月 292020
 

While working at the Rutgers Robert Wood Johnson Medical School, I had access to data on over ten million visits to emergency departments in central New Jersey, including ICD-9 (International Classification of Disease – 9th edition) codes along with some patient demographic data.

I also had the ozone level from several central New Jersey monitoring stations for every hour of the day for ten years. I used PROC REG (and ARIMA) to assess the association between ozone levels and the number of admissions to emergency departments diagnosed as asthma. Some of the predictor variables, besides ozone level, were pollen levels and a dichotomous variable indicating if the date fell on a weekend. (On weekdays, patients were more likely to visit the personal physician than on a weekend.) The study showed a significant association between ozone levels and asthma attacks.

It would have been nice to have the incredible diagnostics that are now produced when you run PROC REG. Imagine if I had SAS Studio back then!

In the program, I used a really interesting trick. (Thank you Paul Grant for showing me this trick so many years ago at a Boston Area SAS User Group meeting.) Here's the problem: there are many possible codes such as 493, 493.9, 493.100, 493.02, and so on that all relate to asthma. The straightforward way to check an ICD-9 code would be to use the SUBSTR function to pick off the first three digits of the code. But why be straightforward when you can be tricky or clever? (Remember Art Carpenter's advice to write clever code that no one can understand so they can't fire you!)

The following program demonstrates the =: operator:

*An interesting trick to read ICD codes;
<strong>Data</strong> ICD_9;
  input ICD : $7. @@;
  if ICD =: "493" the output;
datalines;
493 770.6 999 493.9 493.90 493.100
;
title "Listing of All Asthma Codes";
<strong>proc</strong> <strong>print</strong> data=ICD_9 noobs;
<strong>run</strong>;

 

Normally, when SAS compares two strings of different length, it pads the shorter string with blanks to match the length of the longer string before making the comparison. The =: operator truncates the longer string to the length of the shorter string before making the comparison.

The usual reason to write a SAS blog is to teach some aspect of SAS programming or to just point out something interesting about SAS. While that is usually my motivation, I have an ulterior motive in writing this blog – I want to plug a new book I have just published on Amazon. It's called 10-8 Awaiting Crew: Memories of a Volunteer EMT. One of the chapters discusses the difficulty of conducting statistical studies in pre-hospital settings. This was my first attempt at a non-technical book. I hope you take a look. (Enter "10-8 awaiting crew" or "Ron Cody" in Amazon search to find the book.) Drop me an email with your thoughts at ron.cody@gmail.com.

Using SAS to estimate the link between ozone and asthma (and a neat trick) was published on SAS Users.