sas programming

5月 222017
 

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the the 90th percentile that is reported in Significance magazine.

That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.

You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.

Definitions of sample quantiles

In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.

For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the data value x[Nq]. However, when Nq is not an integer, then the quantile is defined (somewhat arbitrarily) as the average of the two data x[j] and x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the midpoint between the ordered values x[1] and x[2].

Averaging is not the only choices you can make when Nq is not an integer. The other percentile definitions correspond to making different choices. For example, you could round Nq down (QNTLDEF=3), or you could round it to the nearest integer (QNTLDEF=2). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].

Visualizing the definitions for quantiles

The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.

data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
ECDF of a small data set

You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportions values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.

Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL function in PROC IML, as below. Regardless, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, ..., 0.998, 0.999}. You can the graph of the probabilities versus the quantiles to visualize how the percentile definition computes quantiles for the sample data.

proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
call qntl(quantile, x, prob, 5);      /* use method=5 definition  */
create Pctls var {"quantile" "prob" "x"}; append; close;
quit;
 
title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
Sample quantiles (percentiles) for a small data set

For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.

If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.

Comparing all five SAS percentile definitions

It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:

/* Compare all SAS methods */
proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def); /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);   /* ID variable */
   append;
end;
close;
quit;
 
proc print data=Pctls noobs;
   where prob = 0.45;                 /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;

You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:

ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob/ markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
Compare percentile definitions in SAS

The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=2 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: the methods will return quantile values that can be any value between observed data values.

If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.

The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round down" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation method" (QNTLDEF=1) gives 5.03, whereas the "forward interpolation method" (QNTLDEF=4) gives 7.733.

In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.

Reference

For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J. and Fan, Y. (1996) "Sample quantiles in statistical packages", American Statistician, 50, 361–365.

The post Quantile definitions in SAS appeared first on The DO Loop.

5月 152017
 

The little SAS program’s official name was Extract_Transform_Load_ 0314.sas.  But, that name was much too formal, way too long, and did not roll off of the tongue very easily at all.  So, everybody simply called her:  ETL Pi. ETL Pi was conceived in a 2-hour project strategy meeting in conference [...]

The post The Little SAS Program’s Big Night Out appeared first on SAS Learning Post.

5月 152017
 

Last week I showed a timeline of living US presidents. The number of living presidents is constant during the time interval between inaugurations and deaths of presidents. The data was taken from a Wikipedia table (shown below) that shows the number of years and days between events. This article shows how you can use the INTCK and INTNX functions in SAS to compute the time between events in this format. In particular, I use two little-known options to these functions that make this task easy.

Intervals between dates

If you are computing the interval between two dates (a start date and an end date) there are two SAS functions that you absolutely must know about.

  • The INTCK function returns the number of time units between two dates. For the time unit, you can choose years, months, weeks, days, and more. For example, in my previous article I used the INTCK function to determine the number of days between two dates.
  • The INTNX function returns a SAS date that is a specified number of time units away from a specified date. For example, you can use the INTNX function to compute the date that is 308 days in the future from a given date.

These two functions complement each other: one computes the difference between two dates, the other enables you to add time units to a date value.

By default, these functions use the number of "calendar boundaries" between the dates, such as the first day of a year, month, or week. For example, if you choose to measure year intervals, the INTCK function counts how many times 01JAN occurred between the dates, and the INTNX function returns a future 01JAN date. Similarly, if you measure month intervals, the INTCK function counts how many first-of-the-months occur between two dates, and the INTNX function returns a future first-of-the-month date.

Options to compute anniversary dates

Both functions support many options to modify the default behavior. If you want to count full year intervals, instead of the number of times people celebrated New Year's Eve, these function support options (as of SAS 9.2) to count the number of "anniversaries" between two dates and to compute the date of a future anniversary. You can use the 'CONTINUOUS' option for the INTCK function and the 'SAME' option for the INTNX function, as follows:

  • The 'CONTINUOUS' option in the INTCK function enables you to count the number of anniversaries of one date that occur prior to a second date. For example, the statement
    Years = intck('year', '30APR1789'd, '04MAR1797'd, 'continuous');
    returns the value 7 because there are 7 full years (anniversaries of 30APR) between those two dates. Without the 'CONTINUOUS' option, the function returns 8 because 01JAN occurs 8 times between those dates.
  • The statement
    Anniv = intnx('year', '30APR1789'd, 7, 'same');
    returns the 7th anniversary of the date 30APR1789. In other words, it returns the date value for 30APR1796.

The beauty of these functions is that they automatically handle leap years! If you request the number of days between two dates, the INTCK function includes leap days in the result. If an event occurs on a leap day, and you ask the INTNX function for the next anniversary of that event, you will get 28FEB of the next year, which is the most common convention for handling anniversaries of a leap day.

An algorithm to compute years and days between events

The following algorithm computes the number of years and days between dates in SAS:

  • Use the INTCK function with the 'CONTINUOUS' option to compute the number of complete years between the two dates.
  • Use the INTNX function to find a third date (the anniversary date) which is the same month and day as the start date, but occurs less than one year prior to the end date. (The anniversary of a leap days is either 28FEB or 29FEB, depending on whether the anniversary occurs in a leap year.)
  • Use the INTCK function to compute the number of days between the anniversary date and the end date.

The following DATA step computes the time interval in years and days between the first few US presidential inaugurations and deaths. The resulting Year and Day variables contain the same information as is displayed in the Wikipedia table.

data YearDays;
format Date prevDate anniv Date9.;
input @1  Date anydtdte12.
      @13 Event $26.;
prevDate = lag(Date);
if _N_=1 then do;                               /* when _N_=1, lag(Date)=. */
   Years=.; Days=.; return;            /* set years & days, go to next obs */
end;
Years = intck('year', prevDate, Date, 'continuous'); /* num complete years */
Anniv = intnx('year', prevDate, Years, 'same');      /* most recent anniv  */
Days = intck('day', anniv, Date);                    /* days since anniv   */
datalines;
Apr 30, 1789 Washington Inaug
Mar 4, 1797  J Adams Inaug
Dec 14, 1799 Washington Death
Mar 4, 1801  Jefferson Inaug
Mar 4, 1809  Madison Inaug
Mar 4, 1817  Monroe Inaug
Mar 4, 1825  JQ Adams Inaug
Jul 4, 1826  Jefferson Death
Jul 4, 1826  J Adams Death
run;
 
proc print data=YearDays;
var Event prevDate Date Anniv Years Days;
run;

Summary and references

In summary, the INTCK and INTNX functions are essential for computing intervals between dates. In this article, I emphasized two little-known options: the 'CONTINUOUS' option in INTCK and the 'SAME' option in INTNX. By using these options, you can to compute the number of anniversaries between dates and the most recent anniversary. Thus you can compute the years and days between two dates.

There have been countless articles and papers written about SAS dates and finding intervals between dates. I recommend the following articles:

Lastly, do you know what the acronyms INTCK and INTNX stand for? Obviously the 'INT' part refers to INTervals. The general consensus is that 'INTCK' stands for 'Interval Check' and 'INTNX' stands for "Interval Next."

The post INTCK and INTNX: Two essential functions for computing intervals between dates in SAS appeared first on The DO Loop.

4月 252017
 

When developing SAS® data sets, program code and/or applications, efficiency is not always given the attention it deserves, particularly in the early phases of development. Since data sizes and system performance can affect a program and/or an application’s behavior, SAS users may want to access information about a data set’s [...]

The post Determining the size of a SAS data set appeared first on SAS Learning Post.

4月 142017
 

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  We updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.


4月 142017
 

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  We updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.


4月 142017
 

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  We updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.


4月 132017
 
The Penitent Magdalene

Titian (Tiziano Vecellio) (Italian, about 1487 - 1576) The Penitent Magdalene, 1555 - 1565, Oil on canvas 108.3 × 94.3 cm (42 5/8 × 37 1/8 in.) The J. Paul Getty Museum, Los Angeles; Digital image courtesy of the Getty's Open Content Program.

Even if you are a traditional SAS programmer and have nothing to do with cybersecurity, you still probably have to deal with this issue in your day-to-day work.

The world has changed, and what you do as a SAS programmer is not just between you and your computer anymore. However, I have found that many of us are still careless, negligent or reckless enough to be dangerous.

Would you scotch-tape your house key to the front door next to the lock and go on vacation? Does anybody do that? Still, some of us have no problem explicitly embedding passwords in our code.

That single deadly sin, the thing that SAS programmers (or any programmers) must not do under any circumstances, is placing unmasked passwords into their code. I must confess, I too have sinned, but I have seen the light and hope you will too.

Password usage examples

Even if SAS syntax calls for a password, never type it or paste it into your SAS programs. Ever.

If you connect to a database using a SAS/ACCESS LIBNAME statement, your libname statement might look like:

libname mydblib oracle path=airdb_remote schema=hrdept
	user=myusr1 password=mypwd1;

If you specify the LIBNAME statement options for the metadata engine to connect to the metadata server, it may look like:

libname myeng meta library=mylib
	repname=temp metaserver='a123.us.company.com' port=8561 
 		user=idxyz pw=abcdefg;

If you use LIBNAME statement options for the metadata engine to connect to a database, it may look like:

libname oralib meta library=oralib dbuser=orauser dbpassword=orapw;

In all of the above examples, some password is “required” to be embedded in the SAS code. But it does not mean you should put it there “as is,” unmasked. SAS program code is usually saved as a text file, which is stored on your laptop or somewhere on a server. Anyone who can get access to such a file would immediately get access to those passwords, which are key to accessing databases that might contain sensitive information. It is your obligation and duty to protect this sensitive data.

Hiding passwords in a macro variable or a macro?

I’ve seen some shrewd SAS programmers who do not openly put passwords in their SAS code. Instead of placing the passwords directly where the SAS syntax calls for, they assign them to a macro variable in AUTOEXEC.SAS, an external SAS macro file, a compiled macro or some other SAS program file included in their code, and then use a macro reference, e.g.

/* in autoexec.sas or external macro */
%let opw = mypwd1;
 
/* or */
 
%macro opw;
	mypwd1
%mend opw;
/* in SAS program */
libname mydblib oracle user=myusr1 password=&opw
        path=airdb_remote schema=hrdept;
 
/* or */
 
libname mydblib oracle user=myusr1 password=%opw
        path=airdb_remote schema=hrdept;

Clever! But it’s no more secure than leaving your house key under the door mat. In fact it is much less secure. One who wants to look up your password does not even need to look under the door mat, oh, I mean look at your program file where the password is assigned to a macro variable or a macro. For a macro variable, a simple %put &opw; statement will print the password’s actual value in the SAS log. In case of a macro, one can use %put %opw; with the same result.

In other words, hiding the passwords do not actually protect them.

What to do

What should you do instead of openly placing or concealing those required passwords into your SAS code? The answer is short: encrypt passwords.

SAS software provides a powerful procedure for encrypting passwords that renders them practically unusable outside of the SAS system.

This is PROC PWENCODE, and it is very easy to use. In its simplest form, in order to encrypt (encode) your password abc123 you would need to submit just the following two lines of code:

proc pwencode in="abc123";
run;

The encrypted password is printed in the SAS log:

1 proc pwencode in=XXXXXXXX;
2 run;

{SAS002}3CD4EA1E5C9B75D91A73A37F

Now, you can use this password {SAS002}3CD4EA1E5C9B75D91A73A37F in your SAS programs. The SAS System will seamlessly take care of decrypting the password during compilation.

The above code examples can be re-written as follows:

libname mydblib oracle path=airdb_remote schema=hrdept
	user=myusr1 password="{SAS002}9746E819255A1D2F154A26B7";
 
libname myeng meta library=mylib
	repname=temp metaserver='a123.us.company.com' port=8561 
 		user=idxyz pw="{SAS002}9FFC53315A1596D92F13B4CA";
 
libname oralib meta library=oralib dbuser=orauser
dbpassword="{SAS002}9FFC53315A1596D92F13B4CA";

Encryption methods

The {SAS002} prefix indicates encoding method. This SAS-proprietary encryption method which uses 32-bit key encryption is the default, so you don’t have to specify it in the PROC PWENCODE.

There are other, stronger encryption methods supported in SAS/SECURE:

{SAS003} – uses a 256-bit key plus 16-bit salt to encode passwords,

{SAS004} – uses a 256-bit key plus 64-bit salt to encode passwords.

If you want to encode your password with one of these stronger encryption methods you must specify it in PROC PWENCODE:

proc pwencode in="abc123" method=SAS003;
run;

SAS Log: {SAS003}50374C8380F6CDB3C91281FF2EF57DED10E6

proc pwencode in="abc123" method=SAS004;
run;

SAS Log: {SAS004}1630D14353923B5940F3B0C91F83430E27DA19164FC003A1

Beyond encryption

There are other methods of obscuring passwords to protect access to sensitive information that are available in the SAS Business Intelligence Platform. These are the AUTHDOMAIN= SAS/ACCESS option supported in LIBNAME statements, as well as PROC SQL CONNECT statements, SAS Token Authentication, and Integrated Windows Authentication. For more details, see Five strategies to eliminate passwords from your SAS programs.

Conclusion

Never place unencrypted password into your SAS program. Encrypt it!

Place this sticker in front of you until you memorize it by heart.

PROC PWENCODE sticker

 

One deadly sin SAS programmers should stop committing was published on SAS Users.

4月 132017
 

I recently asked a SAS user, “Which interface do you use for SAS?” She replied, “Interface? I just install SAS and use it.” “You’re using the SAS windowing environment,” I explained, but she had no idea what I was talking about. This person is an extremely sophisticated SAS user who [...]

The post What’s your SAS interface? appeared first on SAS Learning Post.

3月 272017
 

Did you know that you can check a SAS macro variable to see if ODS graphics is enabled? The other day I wanted to write a SAS program that creates a graph only if ODS graphics is enabled. The solution is to check the SYSODSGRAPHICS macro variable, which is automatically updated by the SAS system whenever ODS graphics is enabled or disabled. (Thanks to Warren Kuhfeld for this tip!) The SYSODSGRAPHICS variable has the value 0 for "off" and 1 for "on."

For example, the following SAS/IML program checks to see if ODS graphics is on. If so, it creates a bar chart of a categorical variable. If not, it computes the count for each category and prints a frequency table:

ods graphics on;         /* -OR-  ods graphics off */
proc iml;
use sashelp.cars;  read all var "Origin";  close;
 
if &SYSODSGRAPHICS then  /* is ODS graphics enabled? */
   call bar(Origin);     /* optionally display a graphical summary */
/* always display a tabular summary */
call tabulate(Categories, Counts, Origin);
print Counts[colname=Categories];

The SYSODSGRAPHICS automatic macro variable is a recent addition to the automatic macro variables in SAS. You can read Rick Langston's 2015 paper to discover other (relatively) new macro features in SAS 9.3 and SAS 9.4.

SAS has many other automatic macro variables. In general, you can use automatic macro variables in SAS to check the status of the SAS system at run time. Some of my other favorite automatic macro variables (which some people call system macro variables) are the following:

  • SYSERR, SYSERRORTEXT, and SYSWARNINGTEXT: Provide information about errors and warnings from SAS procedures. In my book Statistical Programming with SAS/IML I show how to use these macro variables in conjunction with the SUBMIT/ENDSUBMIT statements in SAS/IML.
  • SYSVER and SYSVLONG: Provide information about your version of SAS. You can use these macro variables to execute newer, more efficient, code if the user has a recent version of SAS.
  • SYSSCP and SYSSCPL: Provide information about the operating system. For example, is SAS running on Windows or Linux?

Do you have a favorite automatic variable? Leave a comment and tell me how you use automatic variables in your SAS programs.

The post Is ODS graphics enabled? Use automatic macro variables to determine the state of SAS appeared first on The DO Loop.