Ron Cody

1月 082020

Did I trick you into seeing what this blog is about with its mysterious title? I am going to talk about how to use the FIND function to search text values.

The FIND function searches for substrings in character values. For example, you might want to extract all email addresses ending in .edu from a list of email addresses. If you are a slightly older SAS programmer like me, you may be more familiar with the INDEX function. If you use only two arguments in the FIND function, the first being the string you are searching and the second being the substring you are looking for, the FIND function is identical to the INDEX function. Both of these functions will searczh the string (first argument) for the substring (second argument) and return the position where the substring starts. If the substring is not found, the function returns a zero.

The newer FIND function has several advantages over the older INDEX function. These advantages are realized by the optional third and fourth arguments to the FIND function. These two arguments allow you to specify a starting position for the search and modifiers that allow you to ignore case. You can use either of these two arguments, or both, and the order doesn't matter! How is this possible? The value for the starting position is always a numeric value and the value for the modifier is always a character value. Thus, SAS can always figure out if a value is a starting position or a modifier.

Let's look at an example

Suppose you have a SAS data set called Emails, and each observation in the data set contains a name and an email address.

Here is a listing of the SAS data set Emails:

You want to select all observations where the variable Email_Address contains .edu (ignoring case).

The program below does just that:

*Searching for .edu;
data Education;
   set Emails;
   if find(Email_Address,'.edu','i') then output;
title "Listing of Data Set Education";
proc print data=Education noobs;

The 'i' modifier is an instruction to ignore case. In the listing of Education below, notice that all the .edu addresses are listed, regardless of case.

Not only is the FIND function more flexible than the older INDEX function, the ignore case modifier is really handy.

For more tips on writing code and how to get started in SAS Studio, check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To also learn more about SAS Press, check out the up-and-coming titles and receive exclusive discounts, make sure to subscribe to the SAS Books newsletter.

Adventures of a SAS detective and the fantastic FIND function was published on SAS Users.

11月 122019

The t-test is a very useful test that compares one variable (perhaps blood pressure) between two groups. T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis. There are several SAS Studio tasks that include options to test this assumption. Let's use the t-test task as an example.

You start by selecting:

Tasks and Utilities □ Tasks □ Statistics □ t Tests

On the DATA tab, select the Cars data set in the SASHELP library. Next request a Two-sample test, with Horsepower as the Analysis variable and Cylinders as the Groups variable. Use a filter to include only 4- or 6-cylinder cars. It should look like this:

On the OPTIONS tab, check the box for Tests for normality as shown below.

All the tests for normality for both 4-cylinder and 6-cylinder cars reject the null hypothesis that the data values come from a population that is normally distributed. (See the figure below.)

Should you abandon the t-test results and run a nonparametric test analysis such as a Wilcoxon Rank Sum test that does not require normal distributions?

This is the point where many people make a mistake. You cannot simply look at the results of the tests for normality to decide if a parametric test is valid or not. Here is the reason: When you have large sample sizes (in this data set, there were 136 4-cylinder cars and 190 6-cylinder cars), the tests for normality have more power to reject the null hypothesis and often result in p-values less than .05. When you have small sample sizes, the tests for normality will not be significant unless there are drastic departures from normality. It is with small sample sizes where departures from normality are important.

The bottom line is that the tests for normality often lead you to make the wrong decision. You need to look at the distributions and decide if they are somewhat symmetrical. The central limit theory states that the sampling distribution of means will be normally distributed if the sample size is sufficiently large. "Sufficiently large" is a judgment call. If the distribution is symmetrical, you may perform a t-test with sample sizes as small as 10 or 20.

The figure below shows you the distribution of horsepower for 4- and 6-cylinder cars.

With the large sample sizes in this data set, you should feel comfortable in using a t-test. The results, shown below, are highly significant.

If you are in doubt of your decision to use a parametric test, feel free to check the box for a nonparametric test on the OPTIONS tab. Running a Wilcoxon Rank Sum test (a nonparametric alternative to a t-test), you also find a highly significant difference in horsepower between 4- and 6-cylinder cars. (See the figure below.)

You can read more about assumptions for parametric tests in my new book, A Gentle Introduction to Statistics Using SAS Studio.

For a sneak preview check out the free book excerpt. You can also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts. To do so, make sure to subscribe to the newsletter.

Testing the Assumption of Normality for Parametric Tests was published on SAS Users.

11月 062019

I have been programming SAS for a LONG time and have never seen much in the way of programming standards. For example, most SAS programmers indent DATA and PROC statements (I like three spaces). Most programmers do not like to see more than one statement on a line and most agree that there should be blank lines between program boundaries (DATA and PROC steps).

I thought I would share some of my thoughts on programming standards, with the hope that others will chime in with their ideas.

    • I like to indent all the statements in a DO group or DO loop. If there are nested groups, each one gets indented as well.
    • I prefer variable names in proper case.
    • I am not a fan of camel-case. For example, I prefer Weight_Kg to WeightKg. The reason that some programmers like camel-case is that SAS will automatically split a variable name at a capital letter in some headings.
    • I like my TITLE statements in open code, not inside a PROC. To me, that makes sense because TITLE statements are global.
    • There should be no conversion messages (character to numeric or numeric to character) in the SAS log. For example use Num = INPUT(Char_Num,12.); instead of Num = 1*Char_Num;. The latter statement forces an automatic character to numeric conversion and places a message in the log.
    • I always use the statement ODS NOPROCTITLE;. This eliminates the default SAS procedure name at the top of the output.
    • Although fewer and fewer people are reading raw text data, I like my @ signs to all line up in my INPUT statement.
    • I like to use the /* and */ comments to define all macro variables. For example:

Notice that I prefer named parameters in my macros, instead of positional parameters.

If this seems like too much work - SAS Studio has an automatic formatting tool that can help standardize your programs. For example, look at the code below:

Really ugly, right? Here is how you can use the automatic formatting tool in SAS Studio.

When you click this icon, the program now looks like this:

That’s pretty much the way I would write it. By the way, if you don't like how Studio formatted your code, enter a control-z to undo it.

For more tips on writing code and how to get started in SAS Studio – check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts make sure to subscribe to the SAS Books newsletter.

Making your SAS code more readable was published on SAS Users.

7月 252019

Years ago I saw a line of SAS code that was really puzzling. It was a statement that started with:

if 0 then … ;

What? This was a statement that would always be evaluated as false. Why would anyone write such a statement? Recently, I was discussing with a good friend of mine a macro I wrote where I used an IF statement that was always false.

DATA Step compiler shortcut
In some programming languages, the "IF 0" trick ensures that the code that follows would never be executed.  But with the DATA step, SAS actually does still evaluate the "zeroed-out" code during the compile phase.

Let's see why this can actually be useful.

The problem I was trying to solve was to assign the number of observations in a data set to a macro variable. First, the code, then the explanation.

if 0 then set Oscar Nobs=Number_of_Obs; call symputx('Number',Number_of_Obs); stop; run;

The SET option Nobs= places the number of observations in data set Oscar in a variable I called Number_of_Obs. This happens at compile time, before any observations are read from data set Oscar. The CALL SYMPUTX routine takes two arguments. The first is the name of a macro variable (Number) and the second is the name of a SAS variable (Number_of_Obs). The call routine assigns the value of the SAS variable to the Macro variable.

Titles by Ron CodyMost data sets stop executing when you reach the end-of-file on any data set. Because you are not reading any values from data set Oscar, you need a STOP statement to prevent the Data Step from looping. When this Data Step executes, the number of observations in data set Oscar is assigned to the macro variable called Number.

Now you, too, can write programs using this useful trick. The macro where I used this trick can be found in my book, Cody's Data Cleaning techniques, 3rd edition. You can see a copy of this macro (and all the other macros described in the book) by going to my author web site:, scrolling down to the Data Cleaning book and clicking on the link to see example and data. One of the macros that use this trick is called As an alternative, you can buy the book!

By the way, you can download programs and data from any of my books by using this same method and it is free!

Books by Ron Cody See them all!

A DATA step compiler trick to get the record count was published on SAS Users.

10月 022018

I often get asked for programming tips. Here, I share three of my favorite tips for beginners. Tip #1: COUNTC and CATS Functions Together The CATS function concatenates all of its arguments after it strips leading and trailing blanks. The COUNTC function counts characters. Together, they can let you operate [...]

The post Three of My Favorite Programming Tips appeared first on SAS Learning Post.

6月 272017

In a previous blog, I demonstrated a program and macro that could identify all numeric variables set to a specific value, such as 999. This blog discusses an immensely useful technique that allows you to perform an operation on all numeric or all character variables in a SAS data set. [...]

The post How to perform an operation on all numeric or all character variables in a SAS data set appeared first on SAS Learning Post.

6月 272017

When I teach my Data Cleaning course, the last topic I cover in the two-day course is SAS Integrity Constraints.  I find that most of the students, who are usually quite advanced programmers, have never heard of Integrity Constraints (abbreviated ICs).  I decided a short discussion on this topic would [...]

The post Keeping your data set clean: Integrity constraints appeared first on SAS Learning Post.