sas programming

6月 062017
 

Wait!  Don't close this window.  I understand that regular expressions can be very complicated (yes, there are many books on the subject), but some basic expressions to test patterns such as zip codes or telephone numbers are not that difficult.  In addition, you can sometimes use Google to search for [...]

The post Using regular expressions to verify the pattern of character data appeared first on SAS Learning Post.

5月 312017
 

SAS programs in Excel, finally!When SAS Add-In for Microsoft Office was first created over a decade ago, SAS programmers were told to check their skills at the door. This new product was for non-programmers only. SAS programmers were invited to contribute to the experience by packaging their code in SAS stored processes, which end users would then run using point-and-click menus. But there was no way to write and run your SAS programs directly in Microsoft Excel or Word or PowerPoint.

This was a comfort to many SAS administrators, who wanted to provide SAS analytics to their end users but didn't want them to have to learn to program. Or perhaps to even allow them to program. But, times have changed. Citizen data scientists have been practicing their coding skills, and now they want to mix it up in Microsoft Office. In response to this demand, SAS R&D has added the SAS programming environment -- the parts that make sense, anyway -- into SAS Add-In for Microsoft Office. You can write programs, run them, and drop the results into any part of your Microsoft Office document.

Here's a short screencast of how it works in Microsoft Excel:

SAS programming in AMO

If you want to learn more and see a narrated demo from the principal developer, watch this interview with Tim Beese from SAS R&D. Tim shows the coding feature along with several other cool enhancements to SAS Add-In for Microsoft Office. As Tim explains, SAS administrators still have the final say when it comes to allowing Excel users to let loose with SAS code -- they can enable the feature by role, and so grant this as a privilege at their discretion.

The first few minutes of this video shows some impressive integration with SAS Visual Analytics and Microsoft Excel. The SAS programming demo begins around the 5-minute mark.

The post Create and run SAS code in SAS Add-In for Microsoft Office appeared first on The SAS Dummy.

5月 312017
 

SAS programs in Excel, finally!When SAS Add-In for Microsoft Office was first created over a decade ago, SAS programmers were told to check their skills at the door. This new product was for non-programmers only. SAS programmers were invited to contribute to the experience by packaging their code in SAS stored processes, which end users would then run using point-and-click menus. But there was no way to write and run your SAS programs directly in Microsoft Excel or Word or PowerPoint.

This was a comfort to many SAS administrators, who wanted to provide SAS analytics to their end users but didn't want them to have to learn to program. Or perhaps to even allow them to program. But, times have changed. Citizen data scientists have been practicing their coding skills, and now they want to mix it up in Microsoft Office. In response to this demand, SAS R&D has added the SAS programming environment -- the parts that make sense, anyway -- into SAS Add-In for Microsoft Office. You can write programs, run them, and drop the results into any part of your Microsoft Office document.

Here's a short screencast of how it works in Microsoft Excel:

SAS programming in AMO

If you want to learn more and see a narrated demo from the principal developer, watch this interview with Tim Beese from SAS R&D. Tim shows the coding feature along with several other cool enhancements to SAS Add-In for Microsoft Office. As Tim explains, SAS administrators still have the final say when it comes to allowing Excel users to let loose with SAS code -- they can enable the feature by role, and so grant this as a privilege at their discretion.

The first few minutes of this video shows some impressive integration with SAS Visual Analytics and Microsoft Excel. The SAS programming demo begins around the 5-minute mark.

The post Create and run SAS code in SAS Add-In for Microsoft Office appeared first on The SAS Dummy.

5月 262017
 

Elizabeth is courageous. Scoliosis since birth, corrective spinal surgery replaced her spine with steel, tripping on stairs permanently broke her right ankle. Then she decided to come take yoga with me. To help ease back pain & reduce hip stress, I offered options like bent legs not cross. In class [...]

The post Take that SAS option appeared first on SAS Learning Post.

5月 222017
 

In last week's article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn't mention it, the value that I reported is different from the the 90th percentile that is reported in Significance magazine.

That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.

You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population. The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.

Definitions of sample quantiles

In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.

For convenience, assume that the sample data are listed in sorted order. In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the data value x[Nq]. However, when Nq is not an integer, then the quantile is defined (somewhat arbitrarily) as the average of the two data x[j] and x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the midpoint between the ordered values x[1] and x[2].

Averaging is not the only choices you can make when Nq is not an integer. The other percentile definitions correspond to making different choices. For example, you could round Nq down (QNTLDEF=3), or you could round it to the nearest integer (QNTLDEF=2). Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].

Visualizing the definitions for quantiles

The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations. The differences between the definitions are most apparent on small data sets that contain integer values, so let's create a tiny data set and apply the five definitions to it. The following example has 10 observations and six unique values.

data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
ECDF of a small data set

You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportions values (such as 0.45) are "in a gap." By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.

Since the default definition is QNTLDEF=5, let's visualize the sample quantiles for that definition. You can use the PCTPTS= option on the OUTPUT statement in PROC UNIVARIATE to declare the percentiles that you want to compute. Equivalently, you can use the QNTL function in PROC IML, as below. Regardless, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, ..., 0.998, 0.999}. You can the graph of the probabilities versus the quantiles to visualize how the percentile definition computes quantiles for the sample data.

proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
call qntl(quantile, x, prob, 5);      /* use method=5 definition  */
create Pctls var {"quantile" "prob" "x"}; append; close;
quit;
 
title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
Sample quantiles (percentiles) for a small data set

For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.

If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.

Comparing all five SAS percentile definitions

It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:

/* Compare all SAS methods */
proc iml;
use Q; read all var "x"; close;       /* read data */
prob = T(1:999) / 1000;               /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def); /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);   /* ID variable */
   append;
end;
close;
quit;
 
proc print data=Pctls noobs;
   where prob = 0.45;                 /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;

You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let's plot them and see:

ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob/ markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
Compare percentile definitions in SAS

The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=2 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: the methods will return quantile values that can be any value between observed data values.

If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.

The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively. If you compute the 0.901 quantile, you will discover that the "round down" method (QNTLDEF=2) gives 5 as the sample quantile, whereas the "round up" method (QNTLDEF=3) gives the value 8. Similarly, the "backward interpolation method" (QNTLDEF=1) gives 5.03, whereas the "forward interpolation method" (QNTLDEF=4) gives 7.733.

In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.

Reference

For more information about sample quantiles, including a mathematical discussion of the various formulas, see
Hyndman, R. J. and Fan, Y. (1996) "Sample quantiles in statistical packages", American Statistician, 50, 361–365.

The post Quantile definitions in SAS appeared first on The DO Loop.

5月 152017
 

The little SAS program’s official name was Extract_Transform_Load_ 0314.sas.  But, that name was much too formal, way too long, and did not roll off of the tongue very easily at all.  So, everybody simply called her:  ETL Pi. ETL Pi was conceived in a 2-hour project strategy meeting in conference [...]

The post The Little SAS Program’s Big Night Out appeared first on SAS Learning Post.

5月 152017
 

Last week I showed a timeline of living US presidents. The number of living presidents is constant during the time interval between inaugurations and deaths of presidents. The data was taken from a Wikipedia table (shown below) that shows the number of years and days between events. This article shows how you can use the INTCK and INTNX functions in SAS to compute the time between events in this format. In particular, I use two little-known options to these functions that make this task easy.

Intervals between dates

If you are computing the interval between two dates (a start date and an end date) there are two SAS functions that you absolutely must know about.

  • The INTCK function returns the number of time units between two dates. For the time unit, you can choose years, months, weeks, days, and more. For example, in my previous article I used the INTCK function to determine the number of days between two dates.
  • The INTNX function returns a SAS date that is a specified number of time units away from a specified date. For example, you can use the INTNX function to compute the date that is 308 days in the future from a given date.

These two functions complement each other: one computes the difference between two dates, the other enables you to add time units to a date value.

By default, these functions use the number of "calendar boundaries" between the dates, such as the first day of a year, month, or week. For example, if you choose to measure year intervals, the INTCK function counts how many times 01JAN occurred between the dates, and the INTNX function returns a future 01JAN date. Similarly, if you measure month intervals, the INTCK function counts how many first-of-the-months occur between two dates, and the INTNX function returns a future first-of-the-month date.

Options to compute anniversary dates

Both functions support many options to modify the default behavior. If you want to count full year intervals, instead of the number of times people celebrated New Year's Eve, these function support options (as of SAS 9.2) to count the number of "anniversaries" between two dates and to compute the date of a future anniversary. You can use the 'CONTINUOUS' option for the INTCK function and the 'SAME' option for the INTNX function, as follows:

  • The 'CONTINUOUS' option in the INTCK function enables you to count the number of anniversaries of one date that occur prior to a second date. For example, the statement
    Years = intck('year', '30APR1789'd, '04MAR1797'd, 'continuous');
    returns the value 7 because there are 7 full years (anniversaries of 30APR) between those two dates. Without the 'CONTINUOUS' option, the function returns 8 because 01JAN occurs 8 times between those dates.
  • The statement
    Anniv = intnx('year', '30APR1789'd, 7, 'same');
    returns the 7th anniversary of the date 30APR1789. In other words, it returns the date value for 30APR1796.

The beauty of these functions is that they automatically handle leap years! If you request the number of days between two dates, the INTCK function includes leap days in the result. If an event occurs on a leap day, and you ask the INTNX function for the next anniversary of that event, you will get 28FEB of the next year, which is the most common convention for handling anniversaries of a leap day.

An algorithm to compute years and days between events

The following algorithm computes the number of years and days between dates in SAS:

  • Use the INTCK function with the 'CONTINUOUS' option to compute the number of complete years between the two dates.
  • Use the INTNX function to find a third date (the anniversary date) which is the same month and day as the start date, but occurs less than one year prior to the end date. (The anniversary of a leap days is either 28FEB or 29FEB, depending on whether the anniversary occurs in a leap year.)
  • Use the INTCK function to compute the number of days between the anniversary date and the end date.

The following DATA step computes the time interval in years and days between the first few US presidential inaugurations and deaths. The resulting Year and Day variables contain the same information as is displayed in the Wikipedia table.

data YearDays;
format Date prevDate anniv Date9.;
input @1  Date anydtdte12.
      @13 Event $26.;
prevDate = lag(Date);
if _N_=1 then do;                               /* when _N_=1, lag(Date)=. */
   Years=.; Days=.; return;            /* set years & days, go to next obs */
end;
Years = intck('year', prevDate, Date, 'continuous'); /* num complete years */
Anniv = intnx('year', prevDate, Years, 'same');      /* most recent anniv  */
Days = intck('day', anniv, Date);                    /* days since anniv   */
datalines;
Apr 30, 1789 Washington Inaug
Mar 4, 1797  J Adams Inaug
Dec 14, 1799 Washington Death
Mar 4, 1801  Jefferson Inaug
Mar 4, 1809  Madison Inaug
Mar 4, 1817  Monroe Inaug
Mar 4, 1825  JQ Adams Inaug
Jul 4, 1826  Jefferson Death
Jul 4, 1826  J Adams Death
run;
 
proc print data=YearDays;
var Event prevDate Date Anniv Years Days;
run;

Summary and references

In summary, the INTCK and INTNX functions are essential for computing intervals between dates. In this article, I emphasized two little-known options: the 'CONTINUOUS' option in INTCK and the 'SAME' option in INTNX. By using these options, you can to compute the number of anniversaries between dates and the most recent anniversary. Thus you can compute the years and days between two dates.

There have been countless articles and papers written about SAS dates and finding intervals between dates. I recommend the following articles:

Lastly, do you know what the acronyms INTCK and INTNX stand for? Obviously the 'INT' part refers to INTervals. The general consensus is that 'INTCK' stands for 'Interval Check' and 'INTNX' stands for "Interval Next."

The post INTCK and INTNX: Two essential functions for computing intervals between dates in SAS appeared first on The DO Loop.

4月 252017
 

When developing SAS® data sets, program code and/or applications, efficiency is not always given the attention it deserves, particularly in the early phases of development. Since data sizes and system performance can affect a program and/or an application’s behavior, SAS users may want to access information about a data set’s [...]

The post Determining the size of a SAS data set appeared first on SAS Learning Post.

4月 142017
 

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  We updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.


4月 142017
 

Image of The Little SAS Enterprise Guide BookThere is a new member of The Little SAS Book family: The Little SAS Enterprise Guide Book.

If you are familiar with our other EG books, you may be wondering why this one isn’t called the “Fourth Edition.”  That is because we changed the title slightly.  Our previous EG books were each written for a specific version of EG, and consequently had the version number right in the title.  This book was written using EG 7.1, but it also applies to some earlier versions (5.1 and 6.1).  With a little luck, this book will also apply to future versions.  So it’s a keeper.

I’m very pleased with how this book has turned out.  We updated it so that all the windows and icons match the current EG, and we also added some great new sections.  Even with the new topics, this book is 60 pages shorter than our previous EG book!  It is shorter because we replaced some chapters on specific types of tasks, with a new chapter that explains how tasks work in general.  The result is a book that is easier to read and more useful.

For more information about this book including the table of contents, an excerpt, and reviews, click here.