sas programming

December 2, 2022
 

Welcome back to my SAS Users blog series CAS Action! - a series on fundamentals. If you'd like to start by learning more about the distributed CAS server and CAS actions, please see CAS Actions and Action Sets - a brief intro. Otherwise, let's learn how to rename columns in CAS tables.

In this example, I will use the CAS language (CASL) to execute the alterTable CAS action. Note that instead of using CASL, I could execute the same action from Python, R, and other languages with slight changes to the syntax. Refer to the documentation for the syntax in other languages.

Load the demonstration data into memory

I'll start by executing the loadTable action to load the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory in the Casuser caslib. By default the Samples caslib should be available in your SAS Viya environment. Then I'll preview the CAS table using the columnInfo and fetch CAS actions.

* Connect to the CAS server and name the connection CONN *;
cas conn;
 
proc cas;
   * Specify the output CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Load the CAS table *;
   table.loadtable / 
      path = "WARRANTY_CLAIMS_0117.sashdat", caslib = "samples",
      casOut = casTbl;
 
    * Preview the CAS table *;
    table.columnInfo / table = casTbl;
    table.fetch / table = casTbl, to = 5;
quit;

The columnInfo action returns information about each column. Notice that the WARRANTY_CLAIMS CAS table has both column names and column labels.

The fetch CAS action returns five rows.

Notice that by default the fetch action uses column labels in the header.

Rename columns in a CAS table

To rename columns in a CAS table, use the alterTable CAS action. In the alterTable action, specify the CAS table using the name and caslib parameters. Additionally, use the columns parameter to specify the columns to modify. The columns parameter requires a list of dictionaries; each dictionary specifies a column to modify.

Here, I'll rename the claim_attribute_1, seller_attribute_5 and product_attribute_1 columns. Then I'll execute the columnInfo action to view the updated column information.

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Rename columns *;
   table.alterTable / 
      name = casTbl['name'], caslib = casTbl['caslib'],
      columns = {
	{name = 'claim_attribute_1', rename = 'Campaign_Type'},
	{name = 'seller_attribute_5', rename = 'Selling_Dealer'},
	{name = 'product_attribute_1', rename = 'Vehicle_Class'}
      };
 
   * View column metadata *;
   table.columnInfo / table = casTbl;
quit;

The results show that the alterTable CAS action renamed the columns to Campaign_Type, Selling_Dealer and Vehicle_Class. While this worked, what if you wanted to rename all columns in the CAS table using the column labels?

Rename all columns using the column labels

I'll dynamically rename the CAS table columns using the column labels. Since the column labels contain spaces, I'll also replace all spaces with underscores. Now, I could manually specify each column and column label in the alterTable action, but why do all that work? Instead, you can dynamically create a list of dictionaries for use in the alterTable action.

proc cas;
* Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
  * Rename columns with the labels. Spaces replaced with underscores *;
 
   *1. Store the results of the columnInfo action in a dictionary *;
   table.columnInfo result=cr / table = casTbl;
 
   *Loop over the columnInfo result table and create a list of dictionaries *;
   *2*;
   listElementCounter = 0;
   *3*;
   do columnMetadata over cr.ColumnInfo;
	*4.1*; listElementCounter = listElementCounter + 1;
	*4.2*; convertColLabel = tranwrd(columnMetadata['Label'],' ','_');
	*4.3*; renameColumns[listElementCounter] = {name = columnMetadata['Column'], rename = convertColLabel};
   end;
 
   *5. Rename columns *;
   table.alterTable / 
	name = casTbl['Name'], 
	caslib = casTbl['caslib'], 
	columns=renameColumns;
 
   *6. Preview CAS table *;
   table.columnInfo / table = casTbl;
quit;
  1. The columnInfo action will store the results in a dictionary named cr.
  2. The variable listElementCounter will act as a counter that can be used to append each dictionary to the list.
  3. Loop over the result table stored in the cr dictionary. When you loop over a result table, each row is treated as a dictionary; the keys are the column names, and each key returns the value of that column for the current row.
  4. In the loop:
    1. increment the counter
    2. access the column label and replace all spaces with underscores by using the TRANWRD function
    3. create a list named renameColumns that contains each dictionary with the column to rename and its new name.
  5. The alterTable action will use the list of dictionaries to rename each column.
  6. The columnInfo action will display the new column information.

The results show that each column was dynamically renamed using the column label and the spaces replaced with underscores.
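
The same loop pattern supports other dynamic renames. As a quick sketch, the following program (which assumes the UPCASE function, one of the many Base SAS functions available in CASL) renames every column to its uppercase name instead of its label:

proc cas;
   * Reference the CAS table *;
   casTbl = {name = "WARRANTY_CLAIMS", caslib = "casuser"};
 
   * Store the columnInfo results *;
   table.columnInfo result=cr / table = casTbl;
 
   * Build a list of dictionaries with the uppercase names *;
   i = 0;
   do columnMetadata over cr.ColumnInfo;
      i = i + 1;
      renameColumns[i] = {name = columnMetadata['Column'],
                          rename = upcase(columnMetadata['Column'])};
   end;
 
   * Rename columns *;
   table.alterTable /
      name = casTbl['name'], caslib = casTbl['caslib'],
      columns = renameColumns;
quit;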

Summary

In summary, the alterTable CAS action enables you to rename columns in a CAS table. With some knowledge of lists, dictionaries, and loops in the CAS language, you can dynamically use the column labels to rename the columns. When using the alterTable action, remember that:

  • The name and caslib parameters specify the CAS table.
  • The columns parameter requires a list of dictionaries.
  • Each dictionary specifies the column to modify.

Want to learn how to do this using Python? Check out my post Getting started with Python integration to SAS® Viya® - Part 11 - Rename Columns.

Additional resources

simple.freq CAS action
SAS® Cloud Analytic Services: CASL Programmer’s Guide 
CAS Action! - a series on fundamentals
Getting Started with Python Integration to SAS® Viya® - Index
SAS® Cloud Analytic Services: Fundamentals

CAS-Action! Rename Columns in a CAS Table was published on SAS Users.

October 31, 2022
 

Every year, I write a special article for Halloween in which I show a SAS programming TRICK that is a real TREAT! This year, the trick is to concatenate two strings into a single string in a way that guarantees you can always recover the original strings. I learned this trick from @RobinHouston on Twitter.

Why would you want to combine and pull apart strings?

I have combined strings when I wanted to concatenate levels of two categorical variables to form an interaction variable. By combining the strings, you can create a box plot that shows the response for each joint level of two categorical variables. I have not needed to separate the strings later, but I can imagine scenarios in which that capability might be useful.

To give a concrete example, suppose you have two strings: "ghost" and "goblin." The first trick constructs a string such as "ghost/goblin" and shows you how to reconstruct the original values from that string if the first string does not contain the character ('/') that is used to separate the first string from the second. The second trick constructs a more complicated string: "ghost/goblinghost*goblin". You can always reconstruct the original values regardless of the characters in the strings.

The easy way: Use one delimiter

A simple technique for combining strings is to use a single delimiter to concatenate two strings and then pull them apart. That idea works provided that the delimiter is not part of the first string. Let's see how it works by concatenating the Make and Model character variables from the Sashelp.Cars data set. The following DATA step extracts a subset of the full data:

data Have;
set Sashelp.Cars;
by Make;
if First.Make;
keep Make Model;
run;
 
proc print data=Have(obs=5);
run;

Suppose that you want to concatenate the Make and Model variables to form a new string while still being able to determine where the first string ends and the second string begins. If you know that some character such as '/' is not part of the first string, you can use the CAT or CATS function to concatenate the strings. The CATS function strips off leading and trailing blanks from all strings. Because SAS strings are blank-padded, I prefer to use the CATS function, as shown in the following DATA step:

%let DELIM = /;    /* choose any rare character as a delimiter */
data SimpleCombine;
length Join $55;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
Join = cats(Make,"&DELIM",Model);  
run;
 
proc print data=SimpleCombine(obs=5);
run;

The JOIN variable contains the concatenated strings, separated by the '/' character. If you want to recover the original two values (minus any spaces that were stripped off), you can use the FIND function to locate the position of the delimiter, then use the SUBSTR function to extract the substrings before and after the delimiter, as follows:

data SimpleSplit;
length v1 v2 $55;
set SimpleCombine;
loc = find(Join, "&DELIM");       /* location of delimiter */
v1 = substr(Join, 1, loc-1);
v2 = substr(Join, loc+1);
run;
 
proc print data=SimpleSplit(obs=5);
run;

The output shows that the v1 and v2 variables contain the original string values. The v1 and v2 variables do not contain any leading or trailing blanks, but they would if you use the CAT function instead of the CATS function. Thus, if the delimiter is not part of the first string, you can use a single delimiter to combine and split strings. The logic of the program assumes that the position of the first delimiter in the combined string is the location at which to split the string. That logic fails if the first string contains the delimiter as one of its characters.
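
To see the failure concretely, here is a quick sketch with made-up values in which the first string contains the '/' delimiter:

data FailCase;
length Join v1 v2 $55;
Join = cats("A/B", "/", "C");   /* first string "A/B" contains the delimiter */
loc = find(Join, "/");          /* finds the FIRST '/', which is inside "A/B" */
v1 = substr(Join, 1, loc-1);    /* v1 = "A"   (expected "A/B") */
v2 = substr(Join, loc+1);       /* v2 = "B/C" (expected "C")   */
run;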

A trick for combining and splitting any strings

In many situations, you know that the input strings will not contain a certain character, such as '/'. However, if you want a FOOLPROOF method that will work for ANY strings regardless of their contents, the trick from @RobinHouston becomes relevant. The trick enables you to combine and split any two strings!

The basic idea is to combine the strings by using two different delimiters instead of only one. I will use the delimiters '/' and '*', but the values do not matter. You can use 'A' and 'B' or '0' and '1', if you prefer. The method still works if the delimiter is contained in one or both strings.

Here's how to combine the strings so that they can be split later. If the original strings are v1 and v2, you form the concatenated strings
      s1 = v1 + '/' + v2
and
      s2 = v1 + '*' + v2
Notice that the strings s1 and s2 have the same length. In fact, they are identical except for one location: s1 contains the first delimiter whereas s2 contains the second delimiter. You then store the concatenated string
      s = s1 + s2
which always has an even number of characters.

You can use the SAS DATA step to implement this combination step:

/* Use the characters '/' and '*' to mark the boundary between Make and Model 
   Note: '*'=ASCII 42 precedes '/'=ASCII 47 in the ASCII table */
data Encode;
length s1 s2 $55 Join $110;
set Have;
/* Use CAT to preserve leading and trailing blanks.
   Use CATS to strip leading and trailing blanks.   */
s1 = cats(Make,'/',Model);  
s2 = cats(Make,'*',Model);
Join = cats(s1,s2);        /* Join does not have leading spaces */
drop s1 s2;
run;
 
proc print data=Encode(obs=5);
run;

To recover the original strings from s, you can do the following:

  • Let s1 be the first half of s and s2 be the second half of s. Because s has an even number of characters, this operation is always well defined.
  • Use the COMPARE function to find the position, k, where s1 differs from s2. By construction, this is the location of the delimiter.
  • Recover v1 by extracting the first k-1 characters from s1, and recover v2 by using the characters of s1 after the kth position.

The following DATA step splits the string back into its original values:

/* Split the encoded string in half. Use the COMPARE function to find where the two 
   halves are different. Split the string at that location.  */
data Decode;
set Encode;
L = lengthc(trim(Join));         /* trim any trailing spaces */
s1 = substr(Join, 1, L/2);       /* first half of string */
s2 = substr(Join, L/2 + 1, L/2); /* second half */
/* find the location where strings differ */
diffLoc = abs(compare(s1, s2));
v1 = substr(s1, 1, diffLoc-1);
v2 = substr(s1, diffLoc+1);
drop s1 s2 diffLoc;
run;
 
proc print data=Decode(obs=5);
run;

The output shows that the v1 and v2 variables contain the original strings.

Summary

This article shows how to concatenate two strings so that you can recover the original string values. The first trick uses only one delimiter and requires that you choose a character that is not part of the first string. The second trick is more complicated, but it works for ANY strings, regardless of their contents.

I think both tricks are real treats! You might never have need for the second trick, but it is very clever. I like to write these tricks down in case someone in the future (maybe me!) searches for "how do I combine two strings so that I can recover the original strings?" Post a comment if you use either of these tricks!

The post A trick to combine and split strings appeared first on The DO Loop.

October 5, 2022
 

Have you ever typed your credit card into an online order form and been told that you entered the wrong number? Perhaps you wondered, "How do they know that the numbers I typed do not make a valid credit card number?" The answer is that credit card numbers and other account numbers incorporate a simple rule that protects against many typographical errors. In general, a rule that validates a block of data is called a checksum. The checksum that is used by credit cards is computed by using the Luhn algorithm or the "mod 10" algorithm. Chris Hemedinger wrote a blog post that shows how to implement the Luhn checksum algorithm in SAS.
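
For reference, here is a minimal sketch of the mod 10 check in the SAS DATA step. This is not Chris's implementation, and the card numbers in the DATALINES block are standard test values, not real accounts:

data LuhnCheck;
input CardNum $16.;
sum = 0;
L = lengthn(CardNum);             /* number of digits */
do i = 1 to L;                    /* process digits from the right */
   d = input(substr(CardNum, L-i+1, 1), 1.);
   if mod(i,2) = 0 then do;       /* double every second digit */
      d = 2*d;
      if d > 9 then d = d - 9;    /* same as summing the two digits */
   end;
   sum + d;
end;
isValid = (mod(sum,10) = 0);      /* valid iff the sum is divisible by 10 */
keep CardNum isValid;
datalines;
79927398713
79927398710
;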

You can also use checksum methods to ensure that information or data is not changed after it is created. I recently needed to ensure that some data was not being corrupted as it was passed along a complicated SAS/IML program. I was able to use the SHA256 function in SAS to create a checksum for the data. Chris Hemedinger has a blog post that describes the SHA256 function and how to call it from the SAS DATA step to ensure the data integrity of character data. Although he doesn't say it, you can use the SHA256 function on numerical data by applying a format to the value (for example, by using the PUTN function in SAS). The format will determine the value of the checksum.
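
For example, here is a minimal DATA step sketch of that idea. The numeric value is arbitrary, and a different format would produce a different hash:

data _null_;
x = 0.8805;
str = putn(x, "BEST12.");      /* apply a format to the number */
hash = sha256(strip(str));     /* hash the formatted string */
put hash $hex64.;              /* display the 256-bit value as 64 hex characters */
run;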

This article shows how to use the SHA256 function to create a checksum for character and numerical data. For SAS/IML programmers, I also show how to use a checksum to ensure that items in a SAS/IML list have not changed since the list was created.

Hashing character values

You can use any hash function to create a checksum. SAS supports a large variety of hash functions. This article uses the SHA256 function, but you can use a different function if you prefer.

The following example illustrates how you can use a hash function for an array of character values. In the SAS/IML language, the ROWVEC function takes a matrix of input values and converts them into a one-dimensional row vector. The ROWCATC function takes a character matrix and concatenates the values across columns, removing all spaces. Thus, the SAS/IML expression rowcatc(rowvec(m)) returns a string that concatenates the elements in m. You can perform a similar operation in the SAS DATA step by using the CATS function.

/* a simple checksum: SHA256 applied to data strings */
proc iml;
ID = {"Andrew", "Sofya", "Shing", "Maryam", "Terence"};
 
str1 = rowcatc(rowvec(ID)); 
hash = sha256(str1);
print str1, hash[F=$hex64.];

The program takes the names of five people, concatenates them together into a long string, and then applies the SHA256 hash function. The output is a fixed-length string. For the SHA256 function, the output string contains 256 bits. Since each hexadecimal digit can represent four bits of information, you can display this 256-bit value by using a 64-character hexadecimal string. This is one of the advantages of using a hash function: regardless of the length of the original string, the output is a string of a known length.

Hashing numerical values

If you want to hash an array of numeric values, you can use the PUTN function in SAS to apply a format, which converts the array of numbers into a string. This article uses the BEST12. format. If you use a different format, you will get a different string and a different hash value.

In SAS 9.4, character strings are limited to 32,767 characters. (In SAS Viya, you can use the VARCHAR type to store strings of any length.) If you allocate 12 characters to each number, you can encode at most 32767/12 = 2730 numbers into a single string. There are ways to get around this, but for simplicity, if an array contains more than 100 values, I will use only the first 100 values to create the hash value. This is implemented in the following SAS/IML user-defined function, which takes a matrix of numbers as input and returns a long string representation of the numbers:

start NumToStr(m, maxN=100);
   if nrow(m)*ncol(m) < maxN then 
      s = putn(m, "BEST12.");          /* apply format to all values */
   else
      s = putn(m[1:maxN], "BEST12.");  /* apply format to first 100 values */
   return( rowcatc(rowvec(s)) );       /* concatenate values into one string */
finish;
 
Height = {175, 143, 165, 157, 161};      /* cm */
Weight = {50, 38.5, 44, 46.5, 47};       /* kg */
corr = 0.8805;
 
str2 = NumToStr(Height);
str3 = NumToStr(Weight);
str4 = NumToStr(corr);
print str3;

The function converts an array of numbers into a string of formatted values. The function is tested by using two arrays and one scalar value. One of the strings is shown. You can see that the formatted Weight values (which include decimal points) are concatenated into a string.

Creating a checksum

The previous sections created four strings: str1, str2, str3, and str4. Notice that the correlation value (0.8805) depends on the Height and Weight data values. If the values in those arrays were to change (either accidentally or intentionally), it would be useful to know that the correlation value is no longer valid. One way to find out if any information has changed is to create a checksum for the original information and periodically compare the original checksum to the checksum for the current state of the information. The following statements compute the checksum for the original data by concatenating the strings for the ID, Height, Weight, and Corr variables and computing the SHA256 value of the result:

s = cats(str1, str2, str3, str4);
check = sha256(s);
print check[F=$hex64.];

The output is a checksum. If you modify any of the numbers or strings in these arrays, then (with high probability) the new checksum will not match the original checksum. For example, the following statements compute the checksum for the correlation equal to "0.89". The new checksum is different from the original checksum, which tells you that one or more data values have changed.

/* What if the numbers or order of numbers change? */
s = cats(str1, str2, str3, "0.89");
check1 = sha256(s);
print check1[F=$hex64.];

A checksum for a SAS/IML list

This section shows one way to automate the computations in the previous sections. In the SAS/IML Language, it is often convenient to use a list to pack related information into a single object that you can pass to functions. This section shows how to compute the checksum for a list of arbitrary items.

The program defines the following functions:

  • The CharToStr function concatenates up to 100 elements in a character matrix. The result is a single string.
  • The NumToStr function applies the BEST12. format to up to 100 elements in a numeric matrix. These formatted values are concatenated together to form a single string.
  • The GetChecksum function iterates over the items in a list. If an item is a character matrix, call the CharToStr function. If the item is a numeric matrix, call the NumToStr function. For simplicity, other items are skipped in this implementation. The strings for each item are concatenated together. The SHA256 function creates a hash value for the concatenated strings.

This process is demonstrated by using the same data as in the previous sections:

proc iml;
start CharToStr(s, maxN=100);
   if nrow(s)*ncol(s) < maxN then 
      return( rowcatc(rowvec(s)) );         /* concatenate all values */
   else
      return( rowcatc(rowvec(s[1:maxN])) ); /* concatenate 100 values */
finish;
 
start NumToStr(m, maxN=100);
   if nrow(m)*ncol(m) < maxN then 
      s = putn(m, "BEST12.");          /* apply format to all values */
   else
      s = putn(m[1:maxN], "BEST12.");  /* apply format to first 100 values */
   return( CharToStr(s) );             /* concatenate values into one string */
finish;
 
/* get hash of the numerical and character elements of a list*/
start GetChecksum(L);
   allStr = "";
   do i = 1 to ListLen(L);
      m = L$i;
      if type(m)='N' then 
         str = NumToStr(m);
      else if type(m)='C' then
         str = CharToStr(m);
      else str = "";         /* you could handle other types here */
      allStr = cats(allStr, str);
   end;
   checksum = sha256(allStr);
   return checksum;
finish;
 
ID = {"Andrew", "Sofya", "Shing", "Maryam", "Terence"};
Height = {175, 143, 165, 157, 161};      /* cm */
Weight = {50, 38.5, 44, 46.5, 47};       /* kg */
corr = 0.8805;
 
L = [ID, Height, Weight, corr];   /* list with four items */
check = GetChecksum(L);
print check[F=$hex64.];

The hash value in this program is identical to the hash value for the original data. That is because the items were packed into the list in the same order.

This process was motivated by a real-life problem. I was writing a complicated program that pre-computes certain statistics and then calls a series of functions. I wanted to ensure that the items in the list were not modified as the list was passed from function to function. One way to do that is to compute the checksum of the original items in the list and compare it to the checksum at the end of the program.
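
A minimal sketch of that verification step, assuming the GetChecksum function and the list L from the previous program, might look like this:

origCheck = GetChecksum(L);   /* checksum of the original items */
/* ... call functions that use L but should not modify it ... */
newCheck = GetChecksum(L);    /* checksum at the end of the program */
if origCheck ^= newCheck then
   print "WARNING: An item in the list was modified!";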

Summary

This article discusses how to use a hash function to compute a checksum to ensure data integrity. The article uses the SHA256 hash function, but SAS supports other hash functions. A hash function takes an arbitrary string as input and returns a fixed-length string, called the hash value. With high probability, the hash value "identifies" the data: if you change the data, you are likely to obtain a different hash value. You can use this fact to compute a checksum for the original data and compare it to the checksum for the data at a later time. If the checksum has changed, the data have been corrupted.

The post Checksums and data integrity in SAS programs appeared first on The DO Loop.

September 26, 2022
 

The correlations between p variables are usually displayed by using a symmetric p x p matrix of correlations. However, sometimes you might prefer to see the correlations listed in "long form" as a three-column table, as shown to the right. In this table, each row shows a pair of variables and the correlation between them. I call this the "list format."

This article demonstrates two ways to display correlations in a list format by using Base SAS. First, this article shows how to use the FISHER option in PROC CORR to create a table that contains the correlations in a list. Second, this article shows how to use the DATA step to convert the symmetric matrix into a simple three-column table that displays the pairwise statistics between variables. This is useful when you have a matrix of statistics that was produced by a procedure other than PROC CORR.

The FISHER option to create a table of pairwise correlations

If you want to get the correlations between variables in a three-column table (instead of a matrix), use the FISHER option in PROC CORR. The FISHER option is used to produce confidence intervals and p-values for a hypothesis test, but it also presents the statistics in a column. As usual, you can use the ODS OUTPUT statement to save the table to a data set. The following call to PROC CORR analyzes a subset of the Sashelp.Iris data.

proc corr data=sashelp.iris(where=(Species="Setosa")) FISHER;  /* FISHER ==> list of Pearson correlations */
   var _numeric_;
   ods output FisherPearsonCorr = CorrList;       /* Optional: Put the correlations in a data set */
run;
 
proc print data=CorrList noobs; 
   var Var WithVar Corr;
run;

The output is shown at the top of this article. Use this trick when you want to output Pearson (or Spearman) correlations in list format.

However, this trick does not work for other correlations (such as Kendall's or Hoeffding's), nor on statistics that are produced by a different procedure. For example, the CORRB option on the MODEL statement in PROC REG produces a matrix of correlations between the regression parameters in the model. It does not provide a special option to display the correlations in a list format. The next section shows how to use the DATA step to convert the statistics from wide form (a symmetric matrix) to long form (the list format).

Generate the correlations between all variables

If you have a symmetric matrix stored in a SAS data set, you can use the DATA step to convert the matrix elements from wide form to long form. This section uses a correlation matrix as an example, but the technique also applies to covariance matrices, distance matrices, or other symmetric matrices. If the diagonal elements are constant (they are 1 for a correlation matrix), you need to convert only the upper-triangular elements of the symmetric matrix. If the diagonal elements are not constant, you should include them in the output.

The following statements compute the Pearson correlation for all numeric variables for a subset of the Sashelp.Iris data. The OUTP= option writes the statistics to the CorrSym data set:

proc corr data=sashelp.iris(where=(Species="Setosa")) outp=CorrSym;
   var _numeric_;
run;

The structure of a TYPE=CORR data set is documented, but it is easy enough to figure out the structure by printing the data set, as follows:

proc print data=CorrSym; run;

I have highlighted certain elements in the data set that are used in the next section. The goal is to display a list of the correlations inside the triangle.

Convert from a correlation matrix to a pairwise list

Notice that the correlations are identified by the rows where _TYPE_="CORR". Notice that the only numeric columns are the p variables that contain the correlations. You can therefore use a WHERE clause to subset the rows and use the _NUMERIC_ keyword to read the columns into an array. You can then iterate over the upper-triangular elements of the correlation matrix, as follows:

/* Convert a symmetric matrix of correlations into a three-column table.
   This is a "wide to long" conversion, but we need to display only the upper-triangular elements.
 
   Input: a dense symmetric matrix p x p matrix with elements C[i,j] for 1 <= i,j <= p
   Output: pairwise statistics by using the upper triangular elements.
           The output columns are  Var1 Var2 Correlation
*/
data SymToPairs;
  set CorrSym(where=(_type_='CORR'));  /* subset the rows */
  Var1 = _NAME_;
  length Var2 $32;
  array _cols (*) _NUMERIC_;     /* columns that contain the correlations */
  do _i = _n_ + 1 to dim(_cols); /* iterate over STRICTLY upper-triangular values  */
     Var2 = vname(_cols[_i]);    /* get the name of this column           */
     Correlation = _cols[_i];    /* get the value of this element         */
     output;
  end;
  keep Var1 Var2 Correlation;
run;
 
proc print data=SymToPairs noobs; run;

Success! The correlations, which were in the upper-triangular portion of a p x p matrix (p=4 for this example), are now in the p(p-1)/2 = 6 rows of a table. Notice that the DATA step uses the VNAME function to obtain the name of a variable in an array. This is a useful trick.

This trick will work for any symmetric matrix in a data set. However, for statistics that are produced by other procedures, you must figure out which rows to extract and which variables to include on the ARRAY statement. If you are converting a matrix that does not have a constant diagonal, you can write the DO loop as follows:

  do _i = _n_ to dim(_cols); /* iterate over diagonal AND upper-triangular values  */
  ...
  end;

If you include the diagonal elements in the output, the list format will contain p(p+1)/2 rows. Try modifying the DATA step for the previous example to include the diagonal elements. You should get a table that has 10 rows. The correlation of each variable with itself is 1.
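
For reference, here is a sketch of the modified DATA step. It assumes the CorrSym data set from the earlier example and differs from SymToPairs only in the loop bounds:

data SymToPairsDiag;
  set CorrSym(where=(_type_='CORR'));  /* subset the rows */
  Var1 = _NAME_;
  length Var2 $32;
  array _cols (*) _NUMERIC_;
  do _i = _n_ to dim(_cols); /* iterate over diagonal AND upper-triangular values */
     Var2 = vname(_cols[_i]);    /* get the name of this column   */
     Correlation = _cols[_i];    /* get the value of this element */
     output;
  end;
  keep Var1 Var2 Correlation;
run;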

A rectangular array of correlations

The preceding sections show one way to convert a symmetric matrix of correlations into a list. You can use this technique when you have used the VAR statement in PROC CORR to compute all correlations between variables. Another common computation is to compute the correlations between one set of variables and another set. This is done by using the VAR and WITH statements in PROC CORR to define the two sets of variables. The result is a rectangular matrix of correlations. The matrix is not symmetric and does not contain 1s on the diagonal. Nevertheless, you can slightly modify the previous program to handle this situation:

proc corr data=sashelp.cars noprob outp=CorrRect;
   var EngineSize HorsePower MPG_Highway;
   with Length Wheelbase;
run;
 
/* Convert a rectangular matrix of correlations into a three-column table.   
   Input: an (r x p) dense matrix of statistics
   Output: pairwise statistics in a table: Var1 Var2 Correlation
*/
data RectToPairs;
  set CorrRect(where=(_type_='CORR'));
  Var1 = _NAME_;
  length Var2 $32;
  array _cols (*) _NUMERIC_;
  do _i = 1 to dim(_cols);     /* iterate over ALL values       */
     Var2 = vname(_cols[_i]);  /* get the name of this column   */
     Correlation = _cols[_i];  /* get the value of this element */
     output;
  end;
  keep Var1 Var2 Correlation;
run;
 
proc print data=RectToPairs noobs; run;

In this case, the original matrix is 2 x 3, and the converted list has 6 rows.

Summary

If you use PROC CORR in SAS to compute Pearson or Spearman correlations, you can use the FISHER option to display the correlations in list format. If you have a symmetric matrix of statistics from some other source, you can use a DATA step to display the upper-triangular elements in a list. You can choose to include or exclude the diagonal elements.

If you use the VAR and WITH statements in PROC CORR, you obtain a rectangular matrix of correlations, which is not symmetric. You can use the DATA step to display all elements in list format.

A table of pairwise statistics can be useful for visualizing the correlations in a bar chart or for highlighting correlations that are very large or very small. It is also possible to go the other way: From a list of all pairwise correlations, you can construct the symmetric correlation matrix.
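
For example, here is a sketch in the SAS/IML language that rebuilds the symmetric matrix, assuming the CorrList data set from the FISHER example earlier in this article:

proc iml;
use CorrList;
   read all var {Var WithVar Corr};   /* read the pairwise list */
close;
vars = unique(Var // WithVar);        /* sorted unique variable names */
p = ncol(vars);
R = I(p);                             /* diagonal of a correlation matrix is 1 */
do k = 1 to nrow(Corr);
   i = loc(vars = Var[k]);            /* row index */
   j = loc(vars = WithVar[k]);        /* column index */
   R[i,j] = Corr[k];                  /* fill both triangles */
   R[j,i] = Corr[k];
end;
print R[r=vars c=vars];
quit;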

The post Display correlations in a list format appeared first on The DO Loop.

August 15, 2022
 

John Tukey was an influential statistician who proposed many statistical concepts. In the 1960s and 70s, he was fundamental in the discovery and exposition of robust statistical methods, and he was an ardent proponent of exploratory data analysis (EDA). In his 1977 book, Exploratory Data Analysis, he discussed a small family of power transformations that you can use to explore the relationship between two variables. Given bivariate data (x1, y1), (x2, y2), ..., (xn, yn), Tukey's transformations attempt to transform the Y values so that the scatter plot of the transformed data is as linear as possible, and the relationship between the variables is simple to explain.

Tukey's Transformations

Tukey's transformations depend on a "power parameter," λ, which can be either positive or negative. When Y > 0, the transformations are as follows:

  • When λ > 0, the transformation is T(y; λ) = yλ.
  • When λ = 0, the transformation is T(y; λ) = log(y).
  • When λ < 0, the transformation is T(y; λ) = –(yλ).

If not all values of Y are positive, it is common to shift Y by adding a constant to every value. A common choice is to add c = 1 – min(Y) to every value. This ensures that the shifted values of Y are positive, and the minimum value is 1.

Although Tukey's transformation is defined for all values of the parameter λ, he argues for using values that result in simple relationships such as integer powers (squares and cubes), logarithms, and reciprocal power laws such as inverse-square laws. Thus, he simplified the family to the small set of useful transformations λ ∈ {-2, -1, -1/2, 0, 1/2, 1, 2}.

Some practitioners add -3, -1/3, 1/3, and 3 to this set of transformations. Tukey's ladder of transformations is similar to the famous Box-Cox family of transformations (1964), which I will describe in a subsequent article. The Box-Cox transformations have a different objective: they transform variables to achieve normality, rather than linearity.

A graphical method for Tukey's transformation

Tukey proposed many graphical techniques for exploratory data analysis. To explore whether a transformation of the Y variable results in a nearly linear relationship with the X variable, he suggests plotting the data. This is easily performed by using a SAS DATA step to transform the Y variable for each value of λ in the set {-2, -1, -1/2, 0, 1/2, 1, 2}. To ensure that the transformation only uses positive values, the following program adds c = 1 - min(Y) to each value of Y before transforming. The resulting scatter plots are displayed by using PROC SGPANEL.

/* define the data set and the X and Y variables */
%let dsName = Sashelp.Cars;
%let XName = Weight;
%let YName = MPG_Highway;
 
/* compute c = 1 - min(Y) and put in a macro variable */
proc sql noprint;                              
 select 1-min(&YName) into :c trimmed 
 from &dsName;
quit;
%put &=c;
 
data TukeyTransform;
array lambdaSeq[7] _temporary_ (-2, -1, -0.5, 0, 0.5, 1, 2);
set &dsName;
x = &XName;
y = &YName + &c;  /* offset by c to ensure Y >= 1 */
do i = 1 to dim(lambdaSeq);
   lambda = lambdaSeq[i];
   if lambda=0 then        z = log(y);
   else if lambda > 0 then z = y**lambda;
   else                    z = -(y**lambda);
   output;
end;
keep lambda x y z;
label x="&xName" y="&yName" z="Transformation of &YName";
run;
 
ods graphics / width=400px height=840px;
title "Tukey's Ladder of Transformations";
proc sgpanel data=TukeyTransform;
   panelby lambda / layout=rowlattice uniscale=column onepanel rowheaderpos=right;
   scatter x=x y=z;
   rowaxis grid; colaxis grid;
run;

The panel of scatter plots shows seven transformations of the Y variable plotted against X. It is difficult to visually discern which plot is "most linear." The graph for λ=0 seems to be most linear, but the scatter plot for λ=0.5 also has very little curvature.

This panel of graphs is typical. Graphical methods can hint at the relationships between variables, but analytical techniques are often necessary to quantify the relationships. I put the name of the data set and the name of the variables in macro variables so that you can easily run the code on other data and investigate how it works on other examples.

An analytical method for Tukey's transformation

Tukey emphasized graphical methods in Exploratory Data Analysis (1977), but his ladder of transformations has a rigorous statistical formulation. The correlation coefficient is a measure of linearity in a scatter plot, so for each value of λ you can compute ρλ = corr(X, T(Y; λ)), where T(Y; λ) is the Tukey transformation for the parameter λ. The value of λ that results in the correlation that has the largest magnitude (closest to ±1) is the optimal parameter. In SAS, you can sort the transformed data by the Lambda variable and use PROC CORR to compute the correlation for each parameter value:

proc sort data=TukeyTransform;
   by lambda;
run;
 
/* compute corr(X, T(Y; lambda)) for each lambda */
proc corr data=TukeyTransform outp=CorrOut noprint;
   by lambda;
   var Z;
   with X;
run;
 
proc print data=CorrOut(where=(_TYPE_="CORR")) noobs label;   
   var lambda z;
   label z="corr(x, T(y))";
run;

The correlation coefficients indicate that λ=0 results in a correlation of -0.85, which is the largest magnitude. For these data, λ=0.5 also does a good job of linearizing the data (ρ=-0.84). Several values of λ result in a correlation coefficient that is very strong, which is why it is difficult to find the best parameter by visually inspecting the graphs.

Summary

Tukey's ladder of transformations is a way to check whether two variables, X and Y, have a simple nonlinear relationship. For example, they might be related by a square-root or a logarithmic transformation. By using Tukey's ladder of transformations, you can discover relationships between variables such as inverse square laws, reciprocal laws, power laws, and logarithmic/exponential laws.

The post Tukey's ladder of variable transformations appeared first on The DO Loop.

August 3, 2022
 

A SAS programmer asked for help on a discussion forum: "My SAS session will not display any tables or graphs! I try to use PROC PRINT and other procedures, but no output is displayed! What can I do?"

The most common reasons why you might not see any output when you generate tables and graphs are:

  1. The ODS SELECT list is set to NONE or is otherwise preventing output from appearing.
  2. All ODS destinations are closed. This could happen if a program uses ODS <destination> CLOSE to close one or more ODS destinations.

A solution that usually works is to exit the current SAS session and start a new one. This will reset all ODS destinations. However, if the problem occurs every time you run a program, then something in the program is messing with the ODS destinations or SELECT list. In that case, it is helpful to understand how to examine the SELECT list and how to check which ODS destinations are open. This article shows you how to do those two things.

The quick solution

Let's jump straight to the solution. To make sure that you will see output, submit the ODS SELECT ALL statement, then make sure you have an open ODS destination. The following statements set the ODS SELECT list and display the active ODS destinations:

ods select all;                      /* clear the SELECT lists for all destinations*/
proc print data=sashelp.vdest; run;  /* display open ODS destinations */

The output that you see will depend on which interface you are using to run SAS. In the old SAS Windowing Environment, you might see output like the following:

This output shows that the open destination is the HTML destination. In the newer SAS Studio interface, you might see output like the following:

This output indicates that there are two open destinations. The HTML5(WEB) destination is a newer version of the HTML destination. The LISTING destination is an old pre-HTML destination.

If you do not see any output from the PROC PRINT call, then check the log. If there are no open ODS destinations, then the log will display notes such as the following:

NOTE: No observations in data set SASHELP.VDEST.
NOTE: There were 0 observations read from the data set SASHELP.VDEST.

In this case, you need to study the code to find out why your program has closed all the destinations! In the SAS Windowing Environment, you can reopen ODS destinations fairly easily. In SAS Studio, it is best to open new destinations by using the menus and dialog boxes: Options->Preferences->Results.
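
For example, in the SAS Windowing Environment, you can reopen a destination by submitting its ODS statement, as in this quick sketch:

ods html;       /* reopen the HTML destination */
ods listing;    /* reopen the LISTING destination */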

The next sections look closer at these two statements and discuss a few related statements, including the ODS SHOW statement.

Examine the overall ODS SELECT list

The ODS system maintains a SELECT list for each ODS destination. It also maintains an overall (or "global") SELECT list, which applies to all destinations. To see the "global" SELECT list, you can submit the ODS SHOW statement:

ods show;       /* display the overall SELECT list in the log */

The result of this statement is displayed in the SAS log. To ensure that all output is being routed to the open destinations, you want to see a log message such as the following:

Current OVERALL select list is: ALL

If you see any other message in the log, it means that the SELECT list has changed, which affects which output appears when you try to display a graph or table. When you submit the ODS SELECT ALL statement, it clears all SELECT lists and ensures that all output makes its way to the open ODS destinations.

You can also set and clear destination-specific SELECT lists. For example, the following statements create a restricted SELECT list for only the LISTING destination. You can use ODS <destination> SHOW to view a destination-specific SELECT list, as follows:

ods LISTING;                 /* open the LISTING destination */
ods LISTING select ANOVA ParameterEstimates;  /* select tables ONLY for LISTING */
ods show;                    /* display the overall SELECT list in the log */
ods LISTING show;            /* display the SELECT list for the LISTING destination */

The SAS log shows the following text:

Current OVERALL select list is: ALL
Current LISTING select list is:
1. ANOVA
2. ParameterEstimates

The log reveals that the SELECT list for the LISTING destination is filtering the output. When you run the next procedure, only tables named ANOVA and ParameterEstimates will appear. All other tables and graphs are suppressed.

You can use ODS <destination> SELECT ALL to clear a specific destination. However, I recommend that you use ODS SELECT ALL without specifying a destination, which clears all destinations. This is shown in the following example:

ods select all;              /* clear the SELECT lists for all destinations */
ods LISTING show;            /* display the SELECT list for the LISTING destination */

The SAS log shows that the SELECT lists are reset:

Current LISTING select list is set to default value (ALL).
Current OVERALL select list is: ALL

Examine open ODS destinations

Did you know that you can print the Sashelp.VDest data view to display the open ODS destinations and what style each is using? After you ensure that the ODS SELECT list is clear, run the following PROC PRINT statements:

proc print data=sashelp.vdest;
run;

You will see a table such as the one above. Each row displays the name and style for an open ODS destination. As mentioned earlier, if you have closed all destinations, you will get a note in the log that says:

NOTE: No observations in data set SASHELP.VDEST.
NOTE: There were 0 observations read from the data set SASHELP.VDEST.

If you are not familiar with the Sashelp.VDest view, it is a view of a read-only PROC SQL table that contains information about the current state of SAS. DICTIONARY tables are documented in the PROC SQL documentation. By looking at the documentation, you can discover that the DICTIONARY.Destinations table in PROC SQL contains information about open ODS destinations. If you want to read that information by using a non-SQL SAS procedure, use the Sashelp.VDest view, which contains the same information.
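
For example, the following sketch queries the dictionary table directly; the Destination and Style columns are the same ones that Sashelp.VDest displays:

proc sql;
   select destination, style
   from dictionary.destinations;   /* same information as Sashelp.VDest */
quit;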

Summary

This article shows how to inspect and reset the ODS SELECT list to ensure that all output will be displayed by SAS. The article also shows how to use a dictionary table (DICTIONARY.Destinations) or its associated data view (Sashelp.VDest) to display the open ODS destinations.

The post How to examine the status of ODS destinations in SAS appeared first on The DO Loop.

July 25, 2022
 

I've previously shown how to use Monte Carlo simulation to estimate probabilities and areas. I illustrated the Monte Carlo method by estimating π ≈ 3.14159... by generating points uniformly at random in a unit square and computing the proportion of those points that were inside the unit circle. The previous article used the SAS/IML language to estimate the area. However, you can also perform a Monte Carlo computation by using the SAS DATA step. Furthermore, the method is not restricted to simple geometric shapes like circles. You can use the Monte Carlo technique to estimate the area of any bounded planar figure. You merely need a bounding box and a way to assess whether a random point is inside the figure. This article uses Monte Carlo simulation to estimate the area of a polar rose with five petals.

The polar rose

You can read about how to draw a polar rose in SAS. The simplest form of the polar rose is given by the polar equation r = cos(k θ), where k is an integer, r is the polar radius, and θ is the polar angle (0 ≤ θ ≤ 2 π).

The following DATA step computes the coordinates of the polar rose for k=5. The subsequent call to PROC SGPLOT graphs the curve.

%let k = 5;      /* for odd k, the rose has k petals */
/* draw the curve r=cos(k * theta) in polar coordinates:
   https://blogs.sas.com/content/iml/2015/12/16/polar-rose.html */
data Rose;
do theta = 0 to 2*constant("pi") by 0.01;
   r = cos(&k * theta);       /* rose */
   x = r*cos(theta);          /* convert to Euclidean coordinates */
   y = r*sin(theta);
   output;
end;
keep x y;
run;
 
title "Polar Rose: r = cos(&k theta)";
ods graphics / width=400px height=400px;
proc sgplot data=Rose aspect=1;
   refline 0 / axis=x;   refline 0 / axis=y;
   series x=x y=y;
   xaxis min=-1 max=1;   yaxis min=-1 max=1;
run;

The remainder of this article uses Monte Carlo simulation to estimate the area of the five-petaled rose. The answer is well-known and is often proved in a calculus class. When k is odd, the area of the rose generated by r = cos(k θ) is A = π / 4 ≈ 0.7854.

Monte Carlo estimates of area

To estimate the area of a geometric figure by using Monte Carlo simulation, do the following:

  1. Generate N points uniformly in a bounding rectangle.
  2. For each point, compute whether the point is inside the figure. This enables you to form the proportion, p, of points that are inside the figure.
  3. Estimate the area of the figure as p AR, where AR is the area of the bounding rectangle.
  4. Optionally, you can construct a confidence interval for the area. A confidence interval for the binomial proportion is p ± zα sqrt(p(1-p) / N), where zα is the 1-α/2 quantile of the normal distribution. If you multiply the interval by AR, you get a confidence interval for the area.

The Monte Carlo method produces a simulation-based estimate. The answer depends on the random number stream, so you obtain AN estimate rather than THE estimate. In this article, the rose will be bounded by using the rectangle [-1, 1] x [-1, 1], which has area AR = 4.

The plot to the right shows the idea of a Monte Carlo estimate. The plot shows 3,000 points from the uniform distribution on the rectangle [-1, 1] x [-1, 1]. There are 600 points, shown in red, that are inside one of the petals of the polar rose. Because 600 is 20% of the total number of points, we estimate that the rose occupies 20% of the area of the square. Thus, with N=3000, an estimate of the area of the rose is 0.2*AR = 0.8, which is somewhat close to π / 4 ≈ 0.7854.

Statistical theory suggests that the accuracy of the proportion estimate should be proportional to 1/sqrt(N). This statement has two implications:

  1. If you want to double the accuracy of an estimate, you must increase the number of simulated points fourfold.
  2. Loosely speaking, if you use one million points (N=1E6), you can expect your estimate of the proportion to be roughly 1/sqrt(N) = 1E-3 units from the true proportion. A later section makes this statement more precise.
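
As a quick sketch of that heuristic, the following DATA step computes the smallest N for which 1/sqrt(N) is below a few tolerances:

data _null_;
do tol = 1e-2, 1e-3, 1e-4;
   N = ceil(1/tol**2);      /* smallest N with 1/sqrt(N) <= tol */
   put tol= best8. N= comma14.;
end;
run;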

A Monte Carlo implementation in the SAS DATA step

Let's increase the number of simulated points to improve the accuracy of the estimates. We will use one million (1E6) random points in the bounding rectangle.

There are two primary ways to obtain the Monte Carlo estimate. The first way is to generate a data set that has N random points and contains a binary indicator variable that specifies whether each point is inside the figure. You can then use PROC FREQ to obtain an estimate of the binomial proportion and a confidence interval for the proportion, as follows:

%let N = 1E6;           /* generate N random points in bounding rectangle */
data Area;
/* define rectangle on which to uniformly sample */
a = -1;  b = 1;         /* x in (a,b) */
c = -1;  d = 1;         /* y in (c,d) */
AreaRect = (b-a)*(d-c); /* area of bounding rectangle */
call streaminit(12345);
do i = 1 to &N;
   x = rand("Uniform", a, b);
   y = rand("Uniform", c, d);
   /* Add code HERE to determine if (x,y) is inside the region of interest */
   theta = atan2(y, x);                 /* notice order */
   rxy = sqrt(x**2 + y**2);             /* observed value of r */
   isInside = (rxy < cos(&k * theta));  /* inside the rose */
   output;
end;
keep x y isInside;
run;
 
/* isInside is binary, so can compute binomial CI for estimate */
proc freq data=Area;
   tables IsInside / binomial(level='1' CL=wald);   /* Optional: Use BINOMIAL option for CL */
   ods select Binomial BinomialCLs;
run;

The proportion of points inside the figure is 0.1966. A 95% confidence interval for the proportion is [0.1958, 0.1974]. To transform these values into areas, you need to multiply them by the area of the bounding rectangle, which is 4. Thus, an estimate for the area of the rose is 0.7864, and a confidence interval is [0.7832, 0.7896]. This estimate is within 0.001 units of the true area, which is π/4 ≈ 0.7854. If you change the random number seed, you will get a different estimate.

Manual computation of binomial estimates

If you are comfortable with programming the binomial proportion yourself, you can perform the entire computation inside a DATA step without using PROC FREQ. The advantage of this method is that you do not generate a data set that has N observations. Rather, you can count the number of random points that are inside the rose, and merely output an estimate of that area and a confidence interval. By using this method, you can simulate millions or billions of random points without writing a large data set. The following DATA step demonstrates this technique:

%let N = 1E6;   
data MCArea;
/* define rectangle on which to uniformly sample */
a = -1;  b = 1;         /* x in (a,b) */
c = -1;  d = 1;         /* y in (c,d) */
AreaRect = (b-a)*(d-c); /* area of bounding rectangle */
call streaminit(12345);
do i = 1 to &N;
   x = rand("Uniform", a, b);
   y = rand("Uniform", c, d);
   /* Add code HERE to determine if (x,y) is inside the region of interest */
   theta = atan2(y, x);                 /* notice order */
   rxy = sqrt(x**2 + y**2);             /* observed value of r */
   isInside = (rxy < cos(&k * theta));  /* inside the rose */
   /* we now have binary variable isInside */
   sumInside + isInside;                /* count the number of points inside */
end;
 
/* simulation complete; output only the results */
p = sumInside / &N;                     /* proportion of points inside region */
AreaEst = p * AreaRect;                 /* estimate of area of the figure */
/* Optional: Since p is a binomial proportion, you can get a confidence interval */
alpha = 0.05;
zAlpha = quantile("Normal", 1-alpha/2);  /* = 1.96 when alpha=0.05 */
SE = zAlpha*sqrt(p*(1-p)/&N);
LCL = (p - SE)*AreaRect;
UCL = (p + SE)*AreaRect;
/* Optional: if you know the true area, you can compare it with MC estimate */
ExactArea = constant('pi') / 4;    
Difference = ExactArea - AreaEst;
keep AreaRect p AreaEst LCL UCL ExactArea Difference;
label AreaRect="Area of Rectangle" p = "Proportion Inside" AreaEst="MC Area Estimate" 
      ExactArea="Exact Area" LCL="95% Lower Limit" UCL="95% Upper Limit";
run;
 
proc print data=MCArea noobs label; 
   format Difference 8.5;
run;

Notice that this DATA step outputs only one observation, whereas the previous DATA step created N observations, which were later read into PROC FREQ to estimate the proportion. Also, this second program multiplies the proportion and the confidence interval by the area of the bounding rectangle. Thus, you get the estimates for the area of the figure, not merely the estimates for the proportion of points inside the figure. The two answers are the same because both programs use the same set of random points.

Regarding the accuracy of the computation, notice that the confidence interval for the proportion is based on the standard error computation, which is zα sqrt(p*(1-p)/N). This quantity is largest when p=1/2 and zα = 1.96 for α = 0.05. Thus, for any p, the standard error is less than 1/sqrt(N). This is what I meant earlier when I claimed that the estimate of the proportion is likely to be within 1/sqrt(N) units from the true proportion. Of course, your estimates for the area might not be that small if the area of the bounding rectangle is large. The estimate of the area is likely to be AR/sqrt(N) units from the true area. But by choosing N large enough, you can make the error as small as you like.

Summary

This article shows how to use the SAS DATA step to perform a Monte Carlo approximation for the area of a bounded planar figure, the five-petaled polar rose. The key to the computation is being able to assess whether a random point is inside the figure of interest. If you can do that, you can estimate the area by using the proportion of random points inside the figure times the area of the bounding rectangle.

The article shows two implementations. The first writes the binary indicator variable to a SAS data set and uses PROC FREQ to compute the proportion estimate. To estimate the area, multiply the proportion by the area of the bounding rectangle. The second implementation performs all computations inside the DATA step and outputs only the estimates of the proportion and the area.

The post Monte Carlo estimates of area appeared first on The DO Loop.

7月 252022
 

I've previously shown how to use Monte Carlo simulation to estimate probabilities and areas. I illustrated the Monte Carlo method by estimating π ≈ 3.14159... by generating points uniformly at random in a unit square and computing the proportion of those points that were inside the unit circle. The previous article used the SAS/IML language to estimate the area. However, you can also perform a Monte Carlo computation by using the SAS DATA step. Furthermore, the method is not restricted to simple geometric shapes like circles. You can use the Monte Carlo technique to estimate the area of any bounded planar figure. You merely need a bounding box and a way to assess whether a random point is inside the figure. This article uses Monte Carlo simulation to estimate the area of a polar rose with five petals.

The polar rose

You can read about how to draw a polar rose in SAS. The simplest form of the polar rose is given by the polar equation r = cos(k θ), where k is an integer, r is the polar radius, and θ is the polar angle (0 ≤ θ ≤ 2 π).

The following DATA step computes the coordinates of the polar rose for k=5. The subsequent call to PROC SGPLOT graphs the curve.

%let k = 5;      /* for odd k, the rose has k petals */
/* draw the curve r=cos(k * theta) in polar coordinates:
   https://blogs.sas.com/content/iml/2015/12/16/polar-rose.html */
data Rose;
do theta = 0 to 2*constant("pi") by 0.01;
   r = cos(&k * theta);       /* rose */
   x = r*cos(theta);          /* convert to Euclidean coordinates */
   y = r*sin(theta);
   output;
end;
keep x y;
run;
 
title "Polar Rose: r = cos(&k theta)";
ods graphics / width=400px height=400px;
proc sgplot data=Rose aspect=1;
   refline 0 / axis=x;   refline 0 / axis=y;
   series x=x y=y;
   xaxis min=-1 max=1;   yaxis min=-1 max=1;
run;

The remainder of this article uses Monte Carlo simulation to estimate the area of the five-petaled rose. The answer is well-known and is often proved in a calculus class. When k is odd, the area of the rose generated by r = cos(k θ) is A = π / 4 ≈ 0.7854.

Monte Carlo estimates of area

To estimate the area of a geometric figure by using Monte Carlo simulation, do the following:

  1. Generate N points uniformly in a bounding rectangle.
  2. For each point, compute whether the point is inside the figure. This enables you to form the proportion, p, of points that are inside the figure.
  3. Estimate the area of the figure as p AR, where AR is the area of the bounding rectangle.
  4. Optionally, you can construct a confidence interval for the area. A confidence interval for the binomial proportion is p ± zα sqrt(p(1-p) / N), where zα is the 1-α/2 quantile of the normal distribution. If you multiply the interval by AR, you get a confidence interval for the area.

The Monte Carlo method produces a simulation-based estimate. The answer depends on the random number stream, so you obtain AN estimate rather than THE estimate. In this article, the rose is bounded by the rectangle [-1, 1] x [-1, 1], which has area AR = 4.

The plot to the right shows the idea of a Monte Carlo estimate. The plot shows 3,000 points from the uniform distribution on the rectangle [-1, 1] x [-1, 1]. There are 600 points, shown in red, that are inside one of the petals of the polar rose. Because 600 is 20% of the total number of points, we estimate that the rose occupies 20% of the area of the square. Thus, with N=3000, an estimate of the area of the rose is 0.2*AR = 0.8, which is somewhat close to π / 4 ≈ 0.7854.
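
To make the confidence interval in Step 4 concrete, the following sketch (my own) plugs in the values from this illustration, p = 0.2 and N = 3000:

data CI3000;
p  = 0.2;                        /* proportion of points inside, from the plot */
N  = 3000;                       /* number of random points */
AR = 4;                          /* area of the bounding rectangle */
z  = quantile("Normal", 0.975);  /* z_alpha for a 95% CI: about 1.96 */
SE = z * sqrt(p*(1-p)/N);        /* half-width of the CI for the proportion */
LowerArea = (p - SE)*AR;         /* about 0.743 */
UpperArea = (p + SE)*AR;         /* about 0.857 */
run;
 
proc print data=CI3000 noobs; run;

The resulting 95% confidence interval for the area, approximately [0.74, 0.86], contains the true area π/4.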

Statistical theory suggests that the error of the proportion estimate should be proportional to 1/sqrt(N). This statement has two implications:

  1. If you want to double the accuracy of an estimate, you must increase the number of simulated points fourfold.
  2. Loosely speaking, if you use one million points (N=1E6), you can expect your estimate of the proportion to be roughly 1/sqrt(N) = 1E-3 units from the true proportion. A later section makes this statement more precise. (A short sketch after this list tabulates the bound for several values of N.)
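
The following sketch (mine, using only the formula above) tabulates the rough error bound 1/sqrt(N); each 100-fold increase in N buys roughly one more decimal digit of accuracy:

data Accuracy;
do pow = 3 to 8;
   N = 10**pow;
   Bound = 1/sqrt(N);   /* rough bound on the error of the proportion estimate */
   output;
end;
drop pow;
run;
 
proc print data=Accuracy noobs; run;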

A Monte Carlo implementation in the SAS DATA step

Let's increase the number of simulated points to improve the accuracy of the estimates. We will use one million (1E6) random points in the bounding rectangle.

There are two primary ways to obtain the Monte Carlo estimate. The first way is to generate a data set that has N random points and contains a binary indicator variable that specifies whether each point is inside the figure. You can then use PROC FREQ to obtain an estimate of the binomial proportion and a confidence interval for the proportion, as follows:

%let N = 1E6;           /* generate N random points in bounding rectangle */
data Area;
/* define rectangle on which to uniformly sample */
a = -1;  b = 1;         /* x in (a,b) */
c = -1;  d = 1;         /* y in (c,d) */
AreaRect = (b-a)*(d-c); /* area of bounding rectangle */
call streaminit(12345);
do i = 1 to &N;
   x = rand("Uniform", a, b);
   y = rand("Uniform", c, d);
   /* Add code HERE to determine if (x,y) is inside the region of interest */
   theta = atan2(y, x);                 /* notice order */
   rxy = sqrt(x**2 + y**2);             /* observed value of r */
   isInside = (rxy < cos(&k * theta));  /* inside the rose */
   output;
end;
keep x y isInside;
run;
 
/* isInside is binary, so can compute binomial CI for estimate */
proc freq data=Area;
   tables IsInside / binomial(level='1' CL=wald);   /* Optional: Use BINOMIAL option for CL */
   ods select Binomial BinomialCLs;
run;

The proportion of points inside the figure is 0.1966. A 95% confidence interval for the proportion is [0.1958, 0.1974]. To transform these values into areas, you need to multiply them by the area of the bounding rectangle, which is 4. Thus, an estimate for the area of the rose is 0.7864, and a confidence interval is [0.7832, 0.7896]. This estimate is within 0.001 units of the true area, which is π/4 ≈ 0.7854. If you change the random number seed, you will get a different estimate.
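
The conversion from proportions to areas is a simple rescaling. As a sketch, the following step multiplies the values reported by PROC FREQ by AR = 4:

data ScaleToArea;
AR = 4;                                       /* area of bounding rectangle */
p = 0.1966;  pLCL = 0.1958;  pUCL = 0.1974;   /* estimates from PROC FREQ */
AreaEst = p*AR;                               /* 0.7864 */
AreaLCL = pLCL*AR;                            /* 0.7832 */
AreaUCL = pUCL*AR;                            /* 0.7896 */
run;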

Manual computation of binomial estimates

If you are comfortable with programming the binomial proportion yourself, you can perform the entire computation inside a DATA step without using PROC FREQ. The advantage of this method is that you do not generate a data set that has N observations. Rather, you can count the number of random points that are inside the rose, and merely output an estimate of that area and a confidence interval. By using this method, you can simulate millions or billions of random points without writing a large data set. The following DATA step demonstrates this technique:

%let N = 1E6;   
data MCArea;
/* define rectangle on which to uniformly sample */
a = -1;  b = 1;         /* x in (a,b) */
c = -1;  d = 1;         /* y in (c,d) */
AreaRect = (b-a)*(d-c); /* area of bounding rectangle */
call streaminit(12345);
do i = 1 to &N;
   x = rand("Uniform", a, b);
   y = rand("Uniform", c, d);
   /* Add code HERE to determine if (x,y) is inside the region of interest */
   theta = atan2(y, x);                 /* notice order */
   rxy = sqrt(x**2 + y**2);             /* observed value of r */
   isInside = (rxy < cos(&k * theta));  /* inside the rose */
   /* we now have binary variable isInside */
   sumInside + isInside;                /* count the number of points inside */
end;
 
/* simulation complete; output only the results */
p = sumInside / &N;                     /* proportion of points inside region */
AreaEst = p * AreaRect;                 /* estimate of area of the figure */
/* Optional: Since p is a binomial proportion, you can get a confidence interval */
alpha = 0.05;
zAlpha = quantile("Normal", 1-alpha/2);  /* = 1.96 when alpha=0.05 */
SE = zAlpha*sqrt(p*(1-p)/&N);
LCL = (p - SE)*AreaRect;
UCL = (p + SE)*AreaRect;
/* Optional: if you know the true area, you can compare it with MC estimate */
ExactArea = constant('pi') / 4;    
Difference = ExactArea - AreaEst;
keep AreaRect p AreaEst LCL UCL ExactArea Difference;
label AreaRect="Area of Rectangle" p = "Proportion Inside" AreaEst="MC Area Estimate" 
      ExactArea="Exact Area" LCL="95% Lower Limit" UCL="95% Upper Limit";
run;
 
proc print data=MCArea noobs label; 
   format Difference 8.5;
run;

Notice that this DATA step outputs only one observation, whereas the previous DATA step created N observations, which were later read into PROC FREQ to estimate the proportion. Also, this second program multiplies the proportion and the confidence interval by the area of the bounding rectangle. Thus, you get the estimates for the area of the figure, not merely the estimates for the proportion of points inside the figure. The two answers are the same because both programs use the same set of random points.

Regarding the accuracy of the computation, notice that the confidence interval for the proportion is based on the standard error, zα sqrt(p(1-p)/N). This quantity is largest when p=1/2, and zα = 1.96 for α = 0.05. Thus, for any p, the standard error is less than 1/sqrt(N). This is what I meant earlier when I claimed that the estimate of the proportion is likely to be within 1/sqrt(N) units of the true proportion. Of course, the error in your estimate of the area might not be that small if the area of the bounding rectangle is large: the estimate of the area is likely to be within AR/sqrt(N) units of the true area. But by choosing N large enough, you can make the error as small as you like. The short sketch that follows verifies the bound numerically.
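
A minimal numerical check (my own sketch, not part of the original program): for N = 1E6 and α = 0.05, the standard error peaks at p = 1/2 and stays below 1/sqrt(N) = 0.001.

data SEBound;
N = 1E6;
z = quantile("Normal", 0.975);   /* 1.96 when alpha = 0.05 */
Bound = 1/sqrt(N);               /* claimed upper bound: 0.001 */
do ip = 1 to 9;
   p = ip/10;
   SE = z*sqrt(p*(1-p)/N);       /* largest at p = 1/2: 0.00098 < Bound */
   output;
end;
drop ip;
run;
 
proc print data=SEBound noobs; run;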

Summary

This article shows how to use the SAS DATA step to perform a Monte Carlo approximation for the area of a bounded planar figure, the five-petaled polar rose. The key to the computation is being able to assess whether a random point is inside the figure of interest. If you can do that, you can estimate the area by using the proportion of random points inside the figure times the area of the bounding rectangle.

The article shows two implementations. The first writes the binary indicator variable to a SAS data set and uses PROC FREQ to compute the proportion estimate. To estimate the area, multiply the proportion by the area of the bounding rectangle. The second implementation performs all computations inside the DATA step and outputs only the estimates of the proportion and the area.

The post Monte Carlo estimates of area appeared first on The DO Loop.

July 18, 2022
 

A colleague was struggling to compute a right-tail probability for a distribution. Recall that the cumulative distribution function (CDF) is defined as a left-tail probability. For a continuous random variable, X, with density function f, the CDF at the value x is
    F(x) = Pr(X ≤ x) = ∫ f(t) dt
where the integral is evaluated on the interval (-∞, x). That is, it is the area under the density function "to the left" of x. In contrast, the right-tail probability is the complement of this quantity. The right-tail probability is defined by
    S(x) = Pr(X ≥ x) = 1 – F(x)
It is the area under the density function "to the right" of x.

The right-tail probability is also called the right-side or upper-tail probability. The function, S, that gives the right-tail probability is called the survival distribution function (SDF). Similarly, a quantile that is associated with a probability that is close to 0 or 1 might be called an extreme quantile. This article discusses why a naive computation of an extreme quantile can be inaccurate. It shows how to compute extreme probabilities and extreme quantiles by using two special functions in SAS: the SQUANTILE function for extreme-right quantiles and the SDF function for right-tail probabilities.
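
As a quick illustration of these definitions, the following sketch (my own, using the standard normal distribution) computes both tail probabilities at a familiar quantile:

data TailDemo;
x = 1.645;                 /* a familiar quantile of the standard normal */
F = cdf("Normal", x);      /* left-tail probability F(x): about 0.95 */
S = 1 - F;                 /* right-tail probability S(x): about 0.05 */
put F= S=;
run;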

Accurate computations of extreme probabilities

My colleague contacted me to discuss some issues related to computing an extreme quantile for p ≈ 1. To demonstrate, I will use the standard lognormal distribution, whose PDF is shown to the right. (However, this discussion applies equally well to other distributions.) When p is small, the lognormal quantile is close to 0. When p is close to 1, the lognormal quantile is arbitrarily large because the lognormal distribution is unbounded.

The following SAS DATA step uses the QUANTILE function to compute extreme quantiles in the left and right tails of the lognormal distribution. The quantiles are associated with the probabilities p=1E-k and p=1 – 1E-k for k=8, 9, 10, ..., 15.

%let Distrib = Lognormal;
data Have;
do pow = -8 to -15 by -1;
   p = 10**pow;                   
   Lx = quantile("&Distrib", p);   /* left-tail quantile: find x such that CDF(x) = p */
   Rp = 1 - p;                     /* numerically, 1-p = 1 for p < MACEPS */
   Rx = quantile("&Distrib", Rp);  /* right-tail quantile: find x such that CDF(x) = 1-p */
   output;
end;
format Rp 18.15;
run;
 
proc print noobs; run;

The second column shows the probability, p, and the Lx column shows the corresponding left-tail quantile. For example, the first row shows that 1E-8 is the probability that a standard lognormal random variate has a value less than 0.00365. The Rp column shows the probability 1 – p. The Rx column shows the corresponding quantile. For example, the first row shows that 1 – 1E-8 is the probability that a standard lognormal random variate has a value less than 273.691.

From experimentation, it appears that the QUANTILE function returns a missing value when the probability argument is greater than or equal to 1 – 1E-12. The SAS log displays a note that states
    NOTE: Argument 2 to function QUANTILE('Normal',1) at line 5747 column 9 is invalid.
This means that the function will not compute extreme-right quantiles. Why? Well, I can think of two reasons:

  • When p is very small, the quantity 1 – p does not have full precision. (This is called the "loss-of-precision problem" in numerical analysis.) In fact, if p is less than machine epsilon, the expression 1 – p EXACTLY EQUALS 1 in floating-point computations! (A short demonstration follows this list.) Thus, we need a better way to specify probabilities that are close to 1.
  • For unbounded distributions, the CDF function is extremely flat for values in the right tail. Because the pth quantile is obtained by solving for the root of the function CDF(x) - p = 0, the computation of an extreme-right quantile is numerically imprecise for unbounded distributions.
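
The following sketch (mine) demonstrates the loss-of-precision problem directly; the value p = 1E-17 is chosen because it is smaller than machine epsilon:

data LossOfPrecision;
macEps = constant("maceps");   /* machine epsilon: about 2.2E-16 */
p = 1e-17;                     /* a probability smaller than macEps */
q = 1 - p;                     /* rounds to exactly 1 in double precision */
isOne = (q = 1);               /* the flag is 1: the subtraction lost p entirely */
put macEps= p= q= isOne=;
run;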

Compute extreme quantiles in SAS

The previous paragraph does not mean that you can't compute extreme-right quantiles. It means that you should use a function that is specifically designed to handle probabilities that are close to 1. The special function is the SQUANTILE function, which uses asymptotic expansions and other tricks to ensure that extreme-right quantiles are computed accurately. If x is the value returned by the SQUANTILE function, then 1 – CDF(x) = p. You can, of course, rewrite this expression to see that x satisfies the equation CDF(x) = 1 – p.

The syntax for the SQUANTILE function is almost the same as for the QUANTILE function. For any distribution, QUANTILE("Distrib", 1-p) = SQUANTILE("Distrib", p). That is, you add an 'S' to the function name and replace 1 – p with p. Notice that by specifying p instead of 1 – p, the right-tail probability can be specified more accurately.
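
A quick check of this identity (my sketch, using a non-extreme probability where both functions are accurate):

data IdentityCheck;
p  = 0.025;
x1 = quantile("Normal", 1-p);   /* left-tail form:  CDF(x) = 1-p  */
x2 = squantile("Normal", p);    /* right-tail form: 1-CDF(x) = p  */
put x1= x2=;                    /* both print 1.959964 */
run;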

The following SAS DATA step is a modification of the previous DATA step. It uses the SQUANTILE function to compute the extreme-right quantiles:

data Want;
do pow = -8 to -15 by -1;
   p = 10**pow;
   Lx = quantile("&Distrib", p);   /* find x such that CDF(x) = p */
   Rp = 1 - p;
   Rx = squantile("&Distrib", p);  /* find x such that 1-CDF(x)=p */
   output;
end;
format Rp 18.15;
drop pow;
run;
 
proc print noobs; run;

The table shows that the SQUANTILE function can compute quantiles that are far in the right tail. The first row of the table shows that the probability is 1E-8 that a lognormal random variate exceeds 273.69. (Equivalently, 1 – 1E-8 is the probability that the variate is less than 273.69.) The last row of the table shows that 1E-15 is the probability that a lognormal random variate exceeds 2811.14.

Compute extreme probabilities in SAS

The "loss-of-precision problem" also affects the computation of right-tail probabilities. If you compute a right-tail probability naively as 1 – CDF(x), the result can lose precision when x is large because CDF(x) ≈ 1. To counter that problem, SAS provides the SDF function, which computes the survival distribution function. For a value x, SDF("Distrib", x) returns the right-tail probability Pr(X > x) = 1 – CDF(x), but it computes the tail probability directly instead of subtracting CDF(x) from 1.

The following DATA step computes several one-sided probabilities. For each value of x, the program computes the probability that a random lognormal variate is greater than x:

%let Distrib = lognormal;
data SDF;
do x = 500 to 2500 by 500;              /* extreme quantiles  */
   Rp       = 1 - cdf("&Distrib", x);   /* right-tail probability */
   RpBetter = sdf("&Distrib", x);       /* better computation of right-tail probability */
   output;
end;
run;
 
proc print noobs; run;

The differences between the probabilities in the Rp and RpBetter columns are not as dramatic as the differences for the quantiles in the previous section. The reason is that the values of x in this example are not extreme enough for CDF(x) to round to 1, so the subtraction 1 – CDF(x) still retains several significant digits. Nevertheless, you can see small differences between the columns, and the RpBetter column tends to be more accurate when x is large. The short sketch below pushes x far enough into the tail that the naive computation fails completely.
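
In the following sketch (my own; the value x = 1E4 is chosen for illustration), the true right-tail probability is smaller than machine epsilon, so CDF(x) rounds to 1 and the naive computation returns exactly 0, while the SDF function returns an accurate value:

data UnderflowDemo;
x  = 1E4;                        /* very far in the right tail of the lognormal */
Rp = 1 - cdf("Lognormal", x);    /* CDF(x) rounds to 1, so this is exactly 0 */
RpBetter = sdf("Lognormal", x);  /* a tiny positive value, about 1.6E-20 */
put Rp= RpBetter=;
run;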

Summary

This article shows two tips for working with extreme quantiles and extreme probabilities. When the probability is close to 1, consider using the SQUANTILE function to compute quantiles. When the quantile is large, consider using the SDF function to compute probabilities.

The post Tips for computing right-tail probabilities and quantiles appeared first on The DO Loop.