
August 1, 2022
 

Recently, the SAS Press team moved to a new building on the SAS campus. And when the SAS Press team moves, we bring a lot of books with us! Packing and organizing all of our books gave us a chance to appreciate all of our authors’ hard work during the more than 30 years that SAS Press has existed.

One author has an outsized presence on the SAS Press bookshelves – Ron Cody. He has written over a dozen books that include some of the most popular titles for new SAS users. He taught statistics at the Rutgers Robert Wood Johnson Medical School for many years and is a frequent presenter at SAS conferences.

Few people know more about SAS than Ron, which made him the perfect person to be our first SAS Press Author of the Month. During August, you can get special deals on all of Ron's books at RedShelf. Find out more information on his author page. Ron is also hosting a free webinar on data cleaning tips and tricks on August 11th.

We recently asked Ron to share a little bit about his journey as an author and teacher. As you might imagine, Ron has a lot of SAS knowledge and advice to share.

The Cody section of the SAS Press bookshelf.

Q: When did you decide to write your first book?

I decided to write my first SAS book (Applied Statistics and the SAS Programming Language) in 1985. It was published by Elsevier Publishing, which was later bought by Prentice Hall. At the time I was writing the book, there were no other books out there—just SAS manuals. This book is still in print (in a fifth edition).

Q: What made you decide to keep writing more books about SAS?

Once I realized I could write well enough to get published, I got the "writer's bug." Although writing a book is hard work, the reward is substantial. My motivation is more about teaching people about SAS and statistics rather than the monetary reward. My goal in writing any book is to make enough money to take my wife to dinner!

Q: Is one of your books your favorite? Or do you love them all equally?

I do have some favorites. I would put Learning SAS by Example as one. It contains a section for beginners as well as later chapters that are useful to intermediate, or even advanced SAS programmers. I particularly like my latest book, A Gentle Introduction to Statistics Using SAS Studio in the Cloud. I think I made statistical concepts accessible to non-mathematically minded people. There are only a few equations in the entire book.

Q: As a teacher, how did you encourage students who were having a hard time understanding statistics?

I try really hard to convince them that the statistical concepts really make sense, and it is sometimes the terminology that gets in the way.

Q: What do you think students struggle with the most when they are learning SAS?

I believe the overwhelming richness of SAS can intimidate a beginning programmer. That's why I start from simple examples and explain step-by-step how each program works.

Q: What is your best advice for someone who wants to learn SAS?

Buy all of my books, of course! Just kidding. There are many YouTube videos and other online resources that are useful. I think two of the best books would be The Little SAS Book, and either Learning SAS by Example (for the more serious student) or Getting Started with SAS Programming Using SAS Studio in the Cloud. The latter book is more suited to someone using SAS OnDemand for Academics.

Q: You recently published a memoir about your time as an EMT. How did it feel to reflect back on that time of your life? Any more memoirs in your future?

I thoroughly enjoyed writing a book about my years as a volunteer EMT (10-8 Awaiting Crew: Memories of a Volunteer EMT). I was fortunate that I had kept a journal and recorded details of some of the more interesting and exciting calls. As of now, I do not have another memoir in my future, but I am working on a nonfiction novel. I’m not sure how successful it will be, but I’m going to give it a try. I strive to write almost every day. I tell other beginning authors that if you spend a few hours writing every day, it becomes much easier. I call it “getting in the groove.”


Ron Cody sitting in a chair with a cat in his lap. Ron is wearing a shirt that says "using Statistics to prove a point is just MEAN."

Ron and Dudley. Credit: Jan Cody

Thanks for talking to us, Ron! We hope you will join us in celebrating Ron's accomplishments as our Author of the Month. Check out his author page to see special deals on all of Ron’s books during August and to discover more resources.

 

Meet our SAS Press Author of the Month – Ron Cody was published on SAS Users.

April 5, 2022
 

Have you ever heard something referred to as the bee’s knees? Most likely the person uttering that expression meant that it was truly amazing and extraordinary. Maybe you stopped and pondered the origin of the phrase. Well wonder no more! In the 1920s in the United States, people were obsessed with rhyming and anthropomorphism (giving human traits or attributes to animals). This era saw the introduction of multiple animal-centric idioms: the ant’s pants, the cat’s pajamas, and (you guessed it) the bee’s knees. Most likely this idiom refers to the pollen basket on the lower section of a worker bee’s legs (where the tibia and the tarsi connect). When bees are out foraging, they carry all the pollen they find in the pollen baskets on their legs. If you ever see a bee with lots of pollen on her legs, you know she’s been working hard… one might even say she’s an amazing and extraordinary worker!

SAS Visual Analytics has a lot of features that you might call the bee’s knees, but one of the most amazing and extraordinary features is the AggregateTable operator. This operator, introduced in SAS Visual Analytics 8.3, enables you to perform aggregations on data crossings that are independent of (or changed from) the data in your objects. This means you can use this operator to compare aggregations for different groups of data in one single object.

To illustrate the AggregateTable operator in action (and to keep with the theme), let’s consider an example.

I’m a hobby beekeeper in Texas. This means that I own and maintain a few hives in my backyard from which I collect honey and wax to sell to family and friends. I’m interested in learning about honey production in the United States for different years. I’m pretty sure I’m not the biggest honey producer in my state (or even my county), but I want to look at different crossings of total production (by state and year, by county and year, and by state).

The United States Department of Agriculture’s National Agricultural Statistics Service has Census data on honey production (measured in pounds) for all counties in the United States for 2002, 2007, 2012, and 2017.

Type of Aggregation: Add

To start, I would like to calculate the total production for each state and then display, for each year, the production of the state that produced the most honey (a data item I will call Maximum by State). For example, in 2002 California produced the most honey of any state (15,585,925 pounds), and in 2017 North Dakota produced the most honey of any state (24,296,437 pounds).

Because the table contains details by county and I’m interested in the total production by state, I will either need to create an aggregated data source that contains details by state, or I will need to use the AggregateTable operator. Since this post is about the wonders of the AggregateTable operator, let’s focus on that.

The AggregateTable operator requires five parameters:

  • Aggregation-Aggregated: The aggregation applied to the aggregated item when it is used in an object that displays fewer group-by crossings than the table in the expression.
  • Aggregation-Measure: The aggregation applied to the measure in the inner table context.
  • Type of aggregation: The type of aggregation that is performed. Values are Fixed, Add, or Remove.
  • Categories: The list of categories used to alter the data crossing for the aggregation.
  • Measure: The measure that is aggregated. A Table operator can be added as the measure to perform a nested aggregation.

It also has a nested operator, Table, that creates an intermediate table defined by the Type of aggregation, Categories, Measure, and Aggregation-Measure parameters.

For this example, I want to use a Key value object to see the total production values for the state that produced the most honey in each year. The object will contain one group-by crossing (or category): Year. The calculation, however, will need to use two group-by crossings to determine the highest producing state for each year: Year and State. Therefore, the Aggregation-Measure is _Sum_ (to calculate the total production by state), the Type of aggregation is Add (because I want to add a crossing for State to the calculation), Categories is set to State, and Measure is Production (lb).

The intermediate table will contain one row for each state and year and contain total production values.

Then, for each year, I want the highest production value (for 2002, 15,585,925 pounds). Therefore, the Aggregation-Aggregated parameter should be _Max_ to grab the maximum values for each year from the intermediate table.
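To make the two-step logic concrete, here is a minimal Base SAS sketch of the same computation in PROC SQL. This is only an illustration of what the operator does behind the scenes, not the report expression itself, and the table and column names (work.honey, Year, State, Production) are hypothetical.

proc sql;
   /* Step 1 (inner table): total production by Year and State */
   create table work.state_totals as
   select Year, State, sum(Production) as Total_Production
      from work.honey
      group by Year, State;

   /* Step 2 (outer aggregation): the largest state total within each Year */
   create table work.max_by_state as
   select Year, max(Total_Production) as Maximum_by_State
      from work.state_totals
      group by Year;
quit;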

Then, I can display the Maximum by State with Year in a Key value object.

Note: Beginning in SAS Visual Analytics 2021.1.1 (May 2021), a new role is available for the Key value object, Lattice category. This enables you to display a key value object for each distinct value of a category data item (in this example, Year).

Now that I have a data item that contains the production amount for the highest producing state for each year, I can create some more complex calculations, like the percentage of total production for each year by the state that had the highest production. This will enable me to see if the highest producing state is doing all the heavy lifting or if all the states are producing a near-equal amount.

Type of Aggregation: Remove

The Type of aggregation parameter also enables you to remove categories from the calculation. For this example, suppose I want to compare the production in each county to the production from the highest producing county in that state (Maximum by County). I want to use a list table to compare these values.

The object will contain three group-by crossings: Year, State, and County. The calculation, however, will only use two group-by crossings to determine the highest producing county in each state for each year: Year and State. Therefore, the Aggregation-Measure is _Max_ (to calculate the maximum production in each state), the Type of aggregation is Remove (because I want to remove a crossing for County from the calculation), Categories is set to County, and Measure is Production (lb).

The intermediate table will contain one row for each state and year and contain the production values for the county in that state with the highest production. Notice that for this table, the aggregation for Production was set to Maximum to show the maximum production for each state.
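Again as a hedged PROC SQL illustration (same hypothetical names as before), the intermediate table keeps the top county's production for each Year and State:

proc sql;
   /* County is removed from the crossing: one row per Year and State,  */
   /* carrying the production of that state's highest-producing county  */
   create table work.max_by_county as
   select Year, State, max(Production) as Maximum_by_County
      from work.honey
      group by Year, State;
quit;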

Because the number of groupings in the object (County, Year, and State) is not fewer than the number of groupings in the calculation (Year and State), the Aggregation-Aggregated parameter is not applied and can be set to any value.

Then, I can display Maximum by County in a list table with total production by county to easily compare each county’s production with the production of the county in that state that had the highest production.

Now, I can calculate each county’s difference from the county with the highest production in that state for each year.

Type of Aggregation: Fixed

The Type of aggregation parameter also enables you to set a fixed category for the calculation. For this example, suppose I want to compare the production in each county to the total production across all years by state (Total by State). I want to use a list table to compare these values.

The object will contain three group-by crossings: Year, State, and County. The calculation, however, will only use one group-by crossing to determine the total production by state across all years: State. Therefore, the Aggregation-Measure is _Sum_ (to calculate the total production by state across all years), the Type of aggregation is Fixed (because I want to fix the crossing to State for the calculation), Categories is set to State, and Measure is Production (lb).

The intermediate table will contain one row for each state and total production values across all years.
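As a hedged PROC SQL illustration (same hypothetical names), the crossing is fixed to State only, so both Year and County are collapsed:

proc sql;
   /* one row per State: production summed across all years and counties */
   create table work.total_by_state as
   select State, sum(Production) as Total_by_State
      from work.honey
      group by State;
quit;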

Because the number of groupings in the object (County, Year, and State) is not fewer than the number of groupings in the calculation (State), the Aggregation-Aggregated parameter is not applied and can be set to any value.

Then, I can display Total by State in a list table with total production by county to easily compare each county’s production with total production in the state across all years.

I can even compare total state values for each year with total production in that state across all years.

Then, I can calculate the share of total production produced each year.

For more information about how to create advanced data items and filters for your SAS Visual Analytics reports, including examples that use different types of operators, check out my book Interactive Reports in SAS Visual Analytics: Advanced Features and Customization.

Bees are important contributors to pollinating much of the food that you eat and are in desperate need of your help! There are many ways you can help the honeybee population thrive:

  • Become a beekeeper
  • Plant a garden for all pollinators (including bumblebees, butterflies, bats, and moths)
  • Reduce or eliminate pesticide use
  • Support your local beekeepers by buying local honey
  • Contact a beekeeping group if you see a swarm
  • Volunteer as a citizen data scientist by taking pictures and posting information on the iNaturalist app
  • Create a bee bath for those hot summer days
  • Build homes for native bees

 

The AggregateTable operator… It’s the bee’s knees! was published on SAS Users.

February 1, 2022
 

When we moved out to the country with our two dogs, our oldest dog Todd suddenly decided he liked to howl…. And he would do so every time we left the house. Maybe it was the country air? Maybe it was a time-lapse gene? Maybe he just wanted to learn something new?

If you’ve been using SAS Visual Analytics for a while, it’s possible you have never created reports that link to external URLs. SAS Visual Analytics can do so much on its own that perhaps you never thought about extending its functionality beyond the product itself! Well, it’s time to learn a new trick.

To illustrate how this can be done (and to keep with the theme), let’s consider an example.

I’m interested in adding a new member to my family (a dog!), and I know I would like to adopt an animal in need. I’m not sure, however, which breed will suit my lifestyle. I need a dog that’s playful and sweet, but one that also likes to sleep late.

I have a report in SAS Visual Analytics that shows details about animals surrendered at an Austin animal shelter. I’d like to select various characteristics (like animal type, sex, whether the dog is spayed or neutered, and condition) and see what breeds they have available. Then, I’d like to see additional details about each breed at the American Kennel Club website (www.akc.org). On this website, you can find information about various breeds of dogs (and cats!), including average sizes, life expectancy, personality, and many other traits.

I’ll add an interactive link to the report, so when a user selects a specific breed, the page for that breed appears. The interactive link will use parameters to pass a selected value from the report to the web page.

To create interactive links, I like to follow four simple steps:

  1. Research the structure of the URL
  2. Use a hardcoded value to test the link
  3. Parameterize the link
  4. Test the parameterized value

Step 1: Research

Before adding interactive links to a report, you need to understand how the target web page structures the URL. I typically do this by accessing the target web page and searching for a specific subject. For some websites, you might need to view the Developer Guide for the website to fully understand the structure.

Typically, URLs are constructed in one of three ways:

  • Path: In these URLs, the subject is added at the end of the URL. For example, to view a country page on Wikipedia, you use the following URL: https://en.wikipedia.org/wiki/country where country is the full name of the country of interest.
  • Query: In these URLs, the subject is assigned to the value of a URL parameter using a sequence of attribute-value pairs: ?parameter1=value1&parameter2=value2. Multiple parameters can be assigned by separating the attribute-value pairs with an ampersand (&). For example, to search Etsy for a specific type of item, you use the following URL: https://www.etsy.com/search?q=item where item is the specific search string.
  • File: In these URLs, the subject is a part of a file name at the end of the URL. For example, to view a country profile on CIA Factbook, you use the following URL: https://www.cia.gov/library/publications/the-world-factbook/geos/country-code.html where country-code is the 2-letter abbreviation of the country of interest.

To start, I’ll select one of the breeds in the list: Australian Cattle Dog (my current dog’s breed). The American Kennel Club website has a drop-down selector at the top of the page where you can select a breed.

Australian cattle dogs are alert, curious, and pleasant. Tell me about it! He won’t let a leaf fly by outside without raising the alarm.

The URL is constructed with the breed as part of the path: https://www.akc.org/dog-breeds/australian-cattle-dog/. Notice that for breeds with multiple words (like Australian Cattle Dog), the link uses hyphens (-) instead of spaces.

Step 2: Hardcode

Now that you understand the structure of the URL, you can test the link using various hardcoded values. For example, to view details about dachshunds, go to https://www.akc.org/dog-breeds/dachshund/. Dachshunds are friendly, curious, and spunky. They must be! Why else would they be chosen to star in dog races at Oktoberfest celebrations around the world?

Step 3: Parameterize

After you have tested the URL using hardcoded values, you need to replace the hardcoded value with parameters. These are values that will be passed from your report to the external URL to make the links interactive. For the report, I’ll add the link to the word cloud and replace the hardcoded breed with the breed I select in the report.

Because the URL replaces spaces with hyphens, I have created a calculated item in SAS Visual Analytics that has breeds with multiple words separated by hyphens instead of spaces, Breed (ForLink).
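For readers who think in code, here is a minimal DATA step sketch of that transformation. In the report it is a calculated item, and the table and column names below (work.breeds, Breed) are hypothetical.

data work.breeds_forlink;
   set work.breeds;                  /* hypothetical table with a Breed column */
   length Breed_ForLink $ 60;
   /* lowercase the breed and swap spaces for hyphens,            */
   /* so "Australian Cattle Dog" becomes "australian-cattle-dog"  */
   Breed_ForLink = lowcase(tranwrd(strip(Breed), ' ', '-'));
run;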

Because I want to pass Breed (ForLink) from the word cloud to the URL, I need to add it to one of the roles for the word cloud. I don’t want the breeds to appear with hyphens in the word cloud, so I’ll add the calculated item to the Hidden role.

Tip: Data items assigned to the Hidden role are available for color-mapped display rules, external links, and mapping data sources and should only be assigned if doing so will not increase the number of rows in the query. In this example, the word cloud shows details about breeds. Adding Breed (ForLink) to the Hidden role makes the value available for the external link and does not increase the number of rows in the query.

Then, to add the link:

  1. With the word cloud selected, click Actions in the right pane and expand URL Links.
  2. Click New URL Link.
  3. Specify a descriptive name for the link.
  4. For the URL, enter the URL up to, but not including, the breed (https://www.akc.org/dog-breeds/); this value will be passed from the selected breed in the word cloud.
  5. Next to Parameters, click the Add icon.
  6. For the Source field, select Breed (ForLink) and leave the Target value blank. Adding nothing to the Target field indicates that the value of Breed (ForLink) should be appended to the end of the URL.

When a viewer selects a breed in the word cloud, the breed value will be appended to the end of the URL and details for that breed will be displayed.

Step 4: Test

After the interactive link has been created, you need to ensure that the link works by testing it in the report.

I’ll select both Cat and Dog as the type of animal, Male for sex, Yes for spayed or neutered, and Aged for condition. There are 143 animals in the Austin animal shelter that meet these criteria. I’m thinking a Labrador retriever might be good for my family, so I’ll double-click Labrador Retriever in the word cloud to see the traits and characteristics for that breed. It looks like Labrador retrievers are friendly, active, and outgoing, and they are also highly adaptable (meaning I can train them to sleep late). It sounds like a perfect fit!

For more information about how to add interactive links to your SAS Visual Analytics reports, including examples on creating links with different URL structures, check out my book Interactive Reports in SAS Visual Analytics: Advanced Features and Customization.

You can’t teach an old dog new tricks… or can you? was published on SAS Users.

January 11, 2022
 

Last year, I wrote a blog demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This blog demonstrates a different approach—one that is useful when you can identify a reasonable range of values for each variable. For example, you would not expect resting heart rates below 40 or over 100, or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid, it would be a good idea to check those out-of-range values to see if they are real or data errors.

In the third edition of my book, Cody's Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno, and you see three variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • VAR=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing=Error or Missing=Ignore. The default for this argument is IGNORE, but it is still good practice to include it in the call so that anyone reading your program understands how missing values are being handled.

Because you might be calling this macro for many variables, the values of two macro variables &Dsn (data set name) and &IDVar (the identifying variable such as Patno or Subj) are assigned values once, using two %Let statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of your possible errors.

Here is an example:

/* Set values for &Dsn and &Idvar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &Dsn(keep=&Idvar &Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&Var";
   Value = &Var;
   if &Var lt &Low and not missing(&Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&Missing) ne IGNORE %then %do;
      else if missing(&Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &Var gt &High then do;
      Reason='High';
      output;
   end;

   drop &Var;
run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea for the %Errors macro is to test each variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For those readers who are not that comfortable with macro programming, the third section (beginning with %if %upcase(&Missing)) is executed only when the value of the macro variable &Missing is not equal to IGNORE.
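As a concrete illustration, the call %Errors(Var=HR, Low=40, High=100, Missing=error) resolves to roughly the following code (reconstructed from the macro above, not copied from an MPRINT log). With Missing=ignore, the middle ELSE IF block is simply not generated.

data Tmp;
   set Clean.Patients(keep=Patno HR);
   length Reason $ 10 Variable $ 32;
   Variable = "HR";
   Value = HR;
   if HR lt 40 and not missing(HR) then do;
      Reason = 'Low';
      output;
   end;
   else if missing(HR) then do;   /* generated only because Missing=error */
      Reason = 'Missing';
      output;
   end;
   else if HR gt 100 then do;
      Reason = 'High';
      output;
   end;
   drop HR;
run;

proc append base=Errors data=Tmp;
run;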

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &Dsn";
      id &Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Errors and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody's Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, searching for the book, and then clicking Example Code and Data. You do not have to buy the book to download all the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.

September 13, 2021
 

The Day of the Programmer is not enough time to celebrate our favorite code-creators. That’s why at SAS, we celebrate an entire week with SAS Programmer Week! If you want to extend the fun and learning of SAS Programmer Week year-round, SAS Press is here to support you with books for programmers at every level.

2021 has been a big year for learning, so we wanted to share the six most popular books for programmers this year. There are some old favorites on this list as well as some brand-new books on a variety of topics. Check out the list below, and see what your fellow programmers are reading this year!

  1. The Little SAS Book: A Primer, Sixth Edition

This book is at the top of almost every list of recommended books for anyone who wants to learn SAS. And for good reason! It breaks down the basics of SAS into easy-to-understand chunks with tons of practice questions. If you are new to SAS or are interested in getting your basic certification, this is the book for you.

  2. Learning SAS by Example: A Programmer’s Guide, Second Edition

Whether you are learning SAS for the first time or just need a quick refresher on a single topic, this book is well-organized so that you can read start to finish or skip to your topic of interest. Filled with real-world examples, this is a book that should be on every SAS programmer’s bookshelf!

  3. Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS

If you work with big data, then you probably work with a lot of text. The third book on our list is for anyone who handles unstructured data. This book focuses on practical solutions to real-life problems. You’ll learn how to collect, cleanse, organize, categorize, explore, analyze, and interpret your data.

  4. End-to-End Data Science with SAS: A Hands-On Programming Guide

This book offers a step-by-step explanation of how to create machine learning models for any industry. If you want to learn how to think like a data scientist, wrangle messy code, choose a model, and evaluate models in SAS, then this book has the information that you need to be a successful data scientist.

  5. Cody's Data Cleaning Techniques Using SAS, Third Edition

Every programmer knows that garbage in = garbage out. Take out the trash with this indispensable guide to cleaning your data. You’ll learn how to find data errors and develop techniques for correcting them.

  6. SAS Graphics for Clinical Trials by Example

If you are a programmer who works in the health care and life sciences industry and want to create visually appealing graphs using SAS, then this book is designed specifically for you. You’ll learn how to create a wide range of graphs using Graph Template Language (GTL) and statistical graphics procedures to solve even the most challenging clinical graph problems.

An honorable mention also goes to the SAS Certification Guides. They are a great way to study for the certification exams for the SAS Certified Specialist: Base Programming and SAS Certified Professional: Advanced Programming credentials.

We have many books available to support you as you develop your programming skills – and some of them are free! Browse all our available titles today.

Top Books for SAS Programmers was published on SAS Users.

August 3, 2021
 

Thousands of SAS users are migrating from SAS University Edition to SAS OnDemand for Academics (ODA). I thought I would share some of my thoughts, having just finished two books using ODA (Getting Started with SAS Programming: Using SAS Studio in the Cloud and A Gentle Introduction to Statistics Using SAS Studio in the Cloud). Users of SAS University Edition had to download software to create a virtual computer on their local computer, then download the SAS software, and finally, set up a way to read and write files from their “real” computer to their “virtual computer.” This caused many people massive headaches (including me). The great news about SAS OnDemand for Academics is that you don’t have to download anything! You access SAS on a cloud platform. Also, reading data from your real computer is quite simple.

ODA was developed so that people can use it to learn how to program and run tasks using SAS. It is not to be used for commercial purposes. There is also a 5-gigabyte limit for data files, but that is certainly not a problem for learning how to use SAS.

Registering for ODA

To gain access to ODA, you need to register with SAS. Part of the registration process is to create a SAS profile. If you already have a SAS profile, skip that portion of the instructions.

To start, point your browser to:

https://welcome.oda.sas.com

You will have to wait for an email from SAS with an approval, but once you have access, it is a delight to use. Let me show you step-by-step how easy it is to upload an Excel workbook from your local computer to SAS Studio using ODA.

This first figure shows a worksheet (Drug_Study.xlsx) with information on four individuals.

Figure 1: Excel Spreadsheet

All the variable names should be self-explanatory except for SBP (systolic blood pressure) and DBP (diastolic blood pressure). It is a good idea to choose variable names that are compatible with SAS V7. The next step is to select a folder in the Server Files and Folders tab in SAS Studio and upload the file. This tab is shown in Figure 2 (with the UPLOAD icon circled).

Figure 2: Upload the File to a Folder in SAS Studio

Once you click the UPLOAD icon, you are given the option to select a file.

Figure 3: Select Your File

Select your file, click UPLOAD, and the file on your local hard drive (Drug_Study.xlsx) will appear in the SAS Studio folder (Figure 4). (Note: you can upload multiple files at one time if you choose.)

Figure 4: The File Drug_Study.xlsx Is Now in the SAS Studio Folder

You can now use the IMPORT utility (under Tasks and Utilities) to convert this Excel file to a SAS data set. In this example, the default name of the WORK file (IMPORT) was changed to Drug_Study. A list of variables is shown in Figure 5.
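If you prefer code to the point-and-click utility, a minimal PROC IMPORT sketch does the same conversion. The folder path below is a hypothetical placeholder; substitute the path to your own ODA home directory.

/* Convert the uploaded workbook to a SAS data set named Drug_Study */
proc import datafile="/home/your-user-id/Drug_Study.xlsx"
            out=work.Drug_Study
            dbms=xlsx
            replace;
   getnames=yes;   /* use the first worksheet row as variable names */
run;

proc contents data=work.Drug_Study varnum;   /* verify the variable list */
run;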

Figure 5: List of Variable Names in Data Set Drug_Study

You can now write programs or use SAS Studio tasks to analyze this data set.

To learn more about how to use SAS Studio as part of SAS OnDemand for Academics, write SAS programs, or use SAS Studio tasks, please take a look at my new book, Getting Started with SAS Programming: Using SAS Studio in the Cloud.

I welcome comments and questions on this blog.

Migrating from SAS University Edition to SAS OnDemand for Academics was published on SAS Users.

July 27, 2021
 

In the past, the COMPRESS function was useful. Since SAS version 9, it has become a blockbuster, and you might not have noticed. The major change was the addition of a new optional parameter called MODIFIERS.

The traditional use of the COMPRESS function was to remove blanks or a list of selected characters from a character string. The addition of a MODIFIER argument does two things. First, you can specify classes of characters to remove, such as all letters, all punctuation marks, or all digits. That is extremely useful, but the addition of the 'k' modifier is why I used the term blockbuster in my description. The 'k' modifier flips the function from one that removes characters from a string to one that keeps a list of characters and removes everything else. Let me show you some examples.

This first example stems from a real problem I encountered while trying to read values that contained units. My data looked something like this:

ID     Weight 
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg

My goal was to create a variable called Wt that represented the person's weight in pounds as a numeric value.

First, let’s look at the code. Then, I’ll give an explanation.

data Convert;
   length ID $3 Weight $8;
   input ID Weight;
 
   Wt = input(compress(Weight,,'kd'),8.);
   /* The COMPRESS function uses two modifiers, 'k' and 'd'.  This means
      keep the digits, remove anything else.  The INPUT function does the
      character-to-numeric conversion.
   */
 
   If findc(Weight,'k','i') then Wt = Wt * 2.2;
 
   /* the FINDC function is looking for an upper or lowercase 'k' in the
      original character string.  If found, it converts the value in
      kilograms to pounds (note: 1 kg = 2.2 pounds).
   */
 
datalines;
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg
;
title "Listing of Data Set Convert";
footnote "This program was run using SAS OnDemand for Academics";
proc print data=Convert noobs;
run;

The program reads the value of Weight as a character string. The COMPRESS function uses 'k' and 'd' as modifiers. Notice the two commas in the list of arguments. A single comma would interpret 'kd' as the second argument (the list of characters to remove). Including two commas notifies the function that 'kd' is the third argument (modifiers). You can list these modifiers in any order, but I like to use 'kd', and I think of it as "keep the digits." What remains is the string of digits. The INPUT function does the character-to-numeric conversion.

Your next step is to figure out if the original value of Weight contained an upper or lowercase 'k'. The FINDC function can take three arguments: the first is the string that you are examining, the second is a list of characters that you are searching for, and the third argument is the 'i' modifier that says, "ignore case" (very useful).

If the original character string (Weight) contains an uppercase or lowercase 'k', you convert from kilograms to pounds.

Here is the output:

There is one more useful application of the COMPRESS function that I want to discuss. Occasionally, you might have a text file in ASCII or EBCDIC that contains non-printing characters (usually placed there in error). Suppose you want just the digits, decimal points (periods), blanks, and commas. You need to read the original value as a text string. Let's call the original string Contains_Junk. All you need to convert these values is one line of code like this:

Valid = compress(Contains_Junk,'.,','kdas');

In this example, you are using all three arguments of the COMPRESS function. As in pre-9 versions of SAS, the second argument is a list of characters that you want to remove. However, because the third argument (modifiers) contains a 'k', the second argument is a list of characters that you want to keep. In addition to periods and commas, you use modifiers to include all digits, uppercase and lowercase letters (the 'a' modifier - 'a' for alpha), and space characters (these include spaces, tabs, and a few others such as carriage returns and linefeeds). If you did not want to include tabs and other "white space" characters, you could rewrite this line as:

Valid = compress(Contains_Junk,'., ','kd');

Here you are including a blank in the second argument and omitting the 's' in the modifier list.

You can read more about the COMPRESS function in any of my SAS programming books, available from SAS Press as e-books or from Amazon in print form.

 

Questions and/or comments are welcome.

The Amazing COMPRESS Function was published on SAS Users.

July 21, 2021
 

In my new book, I explain how segmentation and clustering can be accomplished in three ways: coding in SAS, point-and-click in SAS Visual Statistics, and point-and-click in SAS Visual Data Mining and Machine Learning using SAS Model Studio. These three analytical tools allow you to do many diverse types of segmentation, and one of the most common methods is clustering. Clustering is still among the top 10 machine learning methods used based on several surveys across the globe.

One of the best methods for learning about your customers, patrons, clients, or patients (or simply the observations in almost any data set) is to perform clustering: finding clusters whose members are similar within each cluster, while each cluster has a differing combination of attributes. You can use this method to aid in understanding your customers or to profile various data sets. This can be done in an environment where SAS and open-source software work together seamlessly on a unified platform. (While open source is not discussed in my book, stay tuned for future blog posts where I will discuss more fun and exciting things of interest for clustering and segmentation.)

Let’s look at an example of clustering. The importance of looking at one’s data quickly and easily is a real benefit when using SAS Visual Statistics.

Initial data exploration and preparation

To demonstrate the simplicity of clustering in SAS Visual Statistics, the data set CUSTOMERS is used here and also throughout the book. I have loaded the CUSTOMERS data set into memory, and it is now listed in the active tab. I can easily explore and visualize this data by right-clicking the data set and selecting Actions and then Explore and Visualize. This takes you to the SAS Visual Analytics page.

I have added four new compute items by taking the natural logarithm of four attributes and will use these newly transformed attributes in a clustering.
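For comparison, here is a hedged sketch of the coding-in-SAS route mentioned above: log-transform four attributes and request five clusters with PROC FASTCLUS. The variable names are hypothetical placeholders, not the actual columns of the CUSTOMERS data set.

data work.customers_log;
   set work.customers;               /* hypothetical copy of the CUSTOMERS table */
   /* the log of a zero or negative value is missing, so those rows drop out later */
   log_recency   = log(recency);
   log_frequency = log(frequency);
   log_monetary  = log(monetary);
   log_tenure    = log(tenure);
run;

proc fastclus data=work.customers_log maxclusters=5 out=work.clustered;
   var log_recency log_frequency log_monetary log_tenure;
run;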

Performing simple clustering

Clustering in SAS Visual Statistics can be found by selecting the Objects icon on the left and scrolling down to see the SAS Visual Statistics menus as seen below. Dragging the Cluster icon onto the Report template area will allow you to use that statistic object and visualize the clusters.

Once the Cluster object is on the template, adding data items to the Data Roles is simple: just check the four computed data items.

Click the OK icon, and the clustering runs immediately, producing a report like the one below, where five clusters were found using the four data items.

There are 105,456 total observations in the data set; however, only 89,998 were used for the analysis. Some observations were not used because the natural logarithm could not be computed for them. To see how to handle that situation easily, please pick up a copy of Segmentation Analytics with SAS Viya. Let me know if you have any questions or comments.

 

 

Clustering made simple was published on SAS Users.