
December 14, 2022
 

If you have ever needed to score a multiple-choice test, this blog is for you. Even if you are not planning to score a test, the techniques used in this example are useful for many other programming tasks. The example I am going to present assumes the answer key and student answers are in a single file. A sample file for scoring is shown below:

Sample_Test.txt

Student IDs are in columns 1–9. (The answer key uses all zeros, but the program would work equally well if you left these columns blank.) The student answers are in columns 11–20. The "trick" is to read the answer key and place the correct answers in the variables Key1–Key10.  You can do this by writing a conditional INPUT statement that uses the internal SAS variable _N_ to let you know when you are reading the first observation. Let's take a look at the program and annotate it.

data Score;
   infile 'c:\books\Test Scoring\Sample_Test.txt' pad; ❶
   array Ans[10] $ 1 Ans1-Ans10;     ***student answers; ❷
   array Key[10] $ 1 Key1-Key10;     ***answer key;
   array Score[10] Score1-Score10;   ***score array: 1=right,0=wrong;
   retain Key1-Key10; ❸
 
   if _n_ = 1 then input @11 (Key1-Key10)($1.); ❹
   input @1  ID $9.
         @11 (Ans1-Ans10)($1.);
 
   do Item=1 to 10;
      Score[Item] = Key[Item] eq Ans[Item]; ❺
   end;
 
   Raw=sum(of Score1-Score10); ❻
   Percent=100*Raw / 10;
   keep Ans1-Ans10 ID Raw Percent;
   label ID      = 'Student ID'
         Raw     = 'Raw Score'
         Percent = 'Percent Score';
run;
 
proc sort data=Score; ❼
   by ID;
run;
 
title "Student Scores";
proc print data=Score label;	 ❽
   id ID;
   var Raw Percent;
run;

❶ The data file is located in the c:\books\Test Scoring folder. The PAD option is useful if there are lines of data that contain fewer than 10 answers: it instructs SAS to pad each line with blanks up to the defined record length.

❷ You set up three arrays: Ans holds the student answers, Key holds the answer key, and Score is a variable that indicates if the answer is correct or not (1=correct, 0=incorrect).

❸ Because you are reading the Key values from raw data, you need to RETAIN these values so that SAS does not set them to a missing value on subsequent iterations of the DATA step.

❹ When _N_ is equal to 1, you read the variables Key1–Key10.

❺ This is an interesting statement. The logical expression Key[Item] eq Ans[Item] will be 1 if the student has the correct answer and 0 otherwise. This value is then assigned to the variable Score[Item].

❻ The raw score (the number of items answered correctly) is the sum of the variables Score1–Score10.

❼ For a more logical report, you sort the observations in ID order.

❽ You use PROC PRINT (or PROC REPORT if you prefer) to list the data. The PROC PRINT option LABEL causes the procedure to use variable labels (if they have labels—and they do) instead of the default use of variable names as column headings. By the way, the default behavior for PROC REPORT is to use variable labels.

Below is the output from this program.
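
As an aside, here is a minimal PROC REPORT version of the same listing, a sketch of the alternative mentioned in ❽. It relies on the labels assigned in the DATA step, which PROC REPORT uses by default:

proc report data=Score;
   column ID Raw Percent;
   define ID / order;
run;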

If you would like to know more about test scoring, including how to score tests with multiple versions or how to perform item analysis, please see my book Test Scoring and Item Analysis Using SAS (used copies are available on Amazon). As usual, comments or corrections are welcome. You may email me directly at Ron.Cody@gmail.com.

Using SAS to score a test was published on SAS Users.

November 1, 2022
 

SAS SQL handles missing values differently than the ANSI standard for SQL. PROC SQL follows the SAS convention for handling missing values: numeric missing values always sort as smaller than all nonmissing values. My first blog showed that missing values can be troublemakers in non-grouped descriptive statistics. This blog focuses on how to handle tables that contain missing values.

In a single table

Missing values can lead to unexpected results while working with only one table, for example, in WHERE conditions, when sorting data using ORDER BY, and when grouping data using GROUP BY. (See the following examples.)

WHERE

If a column contains missing values, a WHERE condition can lead to undesirable results. If you want to query all S0666 values smaller than 1.5, as in the example code below, a simple WHERE condition would also return rows with missing values, because a missing value is smaller than any nonmissing value.

To keep S0666 values less than 1.5 but exclude missing values, you can extend the WHERE condition by adding IS NOT MISSING.

Example: WHERE

proc sql ;
  select * 
  from SASHELP.SYR1001
  where S0666 < 1.5 
            and S0666 is not missing ;
quit ;

ORDER BY

When sorting data in ascending order, PROC SQL places missing values before all nonmissing values, whether numeric or character. Rows that tie on the sort key (for example, all rows with a missing value) are not ordered further unless you expressly add more sort columns to ORDER BY after the column containing missing values, for example, T and S0502 in the example below.

Example: ORDER BY

proc sql ;
  select * 
  from SASHELP.SYR1001
  order by S0666, T, S0502 ;
quit ;

GROUP BY

Missing values have subtle effects that can lead to misleading statistics. The following example demonstrates the effect that missing values have on a seemingly simple calculation of means. To demonstrate, I replaced the missing values in column S0666 with 0s and saved the new column as S0666_0. The mean of each column is calculated as the sum of all available values divided by the number of nonmissing values.

Example: GROUP BY

proc sql ;
  select T, S0666,
         avg(S0666) as S0666_MEAN,
         case
            when S0666 is missing then 0
            else S0666
         end as S0666_0,
         avg(calculated S0666_0) as S0666_0_MEAN
  from SASHELP.SYR1001 ;
quit ;

Output

                   S0666_              S0666_0_
 T      S0666       mean     S0666_0       mean       

 52    1.52000    1.75444    1.52000    0.90229
 53    1.84000    1.75444    1.84000    0.90229
 54    1.96000    1.75444    1.96000    0.90229
 55     .         1.75444    0.00000    0.90229
 56     .         1.75444    0.00000    0.90229
 57     .         1.75444    0.00000    0.90229
 58     .          ...

Because the denominator for S0666_0_MEAN is larger, the resulting mean is (not surprisingly) lower in the output. Note also that S0666_MEAN is displayed even on rows that contributed no value to its calculation, whereas for S0666_0_MEAN those rows contribute only to the denominator.
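
As an aside, the CASE expression above can be written more compactly with the COALESCE function, which returns its first nonmissing argument. A minimal sketch of the same query:

proc sql ;
  select T, S0666,
         avg(S0666) as S0666_MEAN,
         coalesce(S0666, 0) as S0666_0,
         avg(calculated S0666_0) as S0666_0_MEAN
  from SASHELP.SYR1001 ;
quit ;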

When joining tables

If you are about to join tables, you need to check whether the keys of the tables to be joined contain missing values, and if so, whether the keys are of the same type. SAS defines missing values in several ways: unfortunately, a numeric column may even contain different types of missing values (., .A through .Z, and ._) at the same time. Missing numeric values and missing strings are handled completely differently.

Even if your data contain only one type of numeric or string missing value, you may be in for a big surprise. In the following example, the data sets ONE and TWO are identical except for the names of the columns in TWO. In each data set, the second and third rows have missing values in the key; the example below highlights the effect.

 

Data sets ONE and TWO:
data ONE ;
   input ID A B C ;
datalines;
01 1   2  3
.  1   2  3
.  99 99 99
05 1   2  3
;
run ;
data TWO ;
   input ID D E F;
datalines;
01 1   2  3
.  1   2  3
.  99 99 99
05 1   2  3
;
run ;

 

Example: Join

proc sql ; 
create table FJ as select
a.ID, A,B,C,D,E,F
from ONE a full join TWO b
on a.ID=b.ID
order by ID ;
quit ;

 

Output

ID     A     B     C     D     E     F
 .     1     2     3    99    99    99
 .    99    99    99     1     2     3
 .    99    99    99    99    99    99
 .     1     2     3     1     2     3
 1     1     2     3     1     2     3
 5     1     2     3     1     2     3

The missing values in the keys of ONE and TWO generate a Cartesian product: each missing value of ONE is combined with each missing value of TWO (the same effect applies to duplicate nonmissing values as well). With two missing keys in each table, the join already produces 2 × 2 = 4 rows for the missing key. Note also the limited effect of ORDER BY in the presence of missing values. Beware if your keys may contain more than one missing value and you are about to join one or more tables.
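
If missing keys should not be matched against each other, one remedy (my variation, not part of the original example) is to move the check into the join condition. Rows with missing keys then still appear in the full join result, but each appears once, unmatched, instead of being cross-multiplied:

proc sql ; 
create table FJ2 as select
a.ID, A,B,C,D,E,F
from ONE a full join TWO b
on a.ID=b.ID and a.ID is not missing
order by ID ;
quit ;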

Have you had unexpected results from missing values in your data when using PROC SQL? You can learn more about effects caused by missing values in Chapter 2 of my book Advanced SQL with SAS.

Unexpected results from missing values with PROC SQL was published on SAS Users.

October 3, 2022
 

We publish a lot of books by SAS experts at SAS Press, but how does someone become an expert in the first place? Becoming certified is one step, but who develops the certifications in the first place? Those are the true experts. They have to have a deep understanding of SAS, an ability to explain difficult concepts, and a passion for sharing their knowledge. Our SAS Press Author of the Month for October is Jane Eslinger, also known around the office as the PROC REPORT expert.

Jane has written two books on PROC REPORT that are the go-to reference for anyone who wants to understand all of the capabilities of the REPORT procedure. I asked her a few questions about how she became the PROC REPORT expert and what advice she has for SAS novices.

Q: What made you decide to write a book about PROC REPORT?

At the time, I was in SAS Technical Support answering customer questions on PROC REPORT. I tell people that PROC REPORT has a steep learning curve. I knew PROC REPORT but discovered that information about it was spread across a wide variety of places – SAS notes, conference papers, and SAS Communities posts. I decided that I wanted to put all of the information in one place, so I wrote the first book.

Q: Have you noticed any new trends in reporting in the last few years?

Good question. I would say that programmers are more interested in making drill-down or cascading reports. People want to click from one report to the next, more refined report.

Q: What do you think is the most misunderstood feature of PROC REPORT?

Two things, actually. First is the compute block. Programmers know that they can assign or change values within the compute block. They also know it can be used to affect styling. But it takes practice to get all of it right. And lots of programmers use compute blocks only for the bare essentials, even though they can do so much more.

Second is the COLUMN statement. Programmers want to control what is in the header section of the report but do not understand how to use spanning headers and null headers to get what they want.
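
To make those two answers concrete, here is a small sketch of my own (using the SASHELP.CLASS sample table, not an example from the interview) that puts a spanning header in the COLUMN statement and derives a new column in a compute block:

proc report data=sashelp.class;
   column Name ("Measurements" Height Weight) Note; /* spanning header over two columns */
   define Name   / display;
   define Height / analysis mean format=5.1;
   define Weight / analysis mean format=5.1;
   define Note   / computed;
   compute Note / character length=10;              /* compute block assigns the value */
      if Weight.mean gt 100 then Note = 'Heavy';
      else Note = ' ';
   endcomp;
run;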

Q: How did you first learn SAS?

I had to use it in my regression statistics course in college, as well as take a basic programming class for my degree.

Q: As an instructor, how do you encourage students who get frustrated when learning SAS?

I tell them that they are learning a new language or a new skill. You aren’t going to get it right the first time! I also tell them that programmers with 20 years of experience still have to look up syntax.

Q: What is your best advice for someone who wants to learn SAS?

Practice. But really the best way to practice is to have a project (school or at work) that requires you to use SAS. Learning SAS is much easier when you can connect programming concepts to your own project instead of arbitrary data.

Q: Just for fun, what is the last non-technical book you read that you enjoyed?

Wish You Were Here by Jodi Picoult.


Thanks for talking to me, Jane! Visit Jane’s author page to find an exclusive discount code that you can use to purchase her e-books in the month of October.

You can also find her at SESUG 2022, where she will be giving a hands-on training session on how to create PROC REPORT code for multiple report types. And check out Jane's "Ask the Expert" webinars anytime to learn more about PROC REPORT.

 

Meet our SAS Press Author of the Month -- Jane Eslinger was published on SAS Users.

August 1, 2022
 

Recently, the SAS Press team moved to a new building on the SAS campus. And when the SAS Press team moves, we bring a lot of books with us! Packing and organizing all of our books gave us a chance to appreciate all of our authors’ hard work during the more than 30 years that SAS Press has existed.

One author has an outsized presence on the SAS Press bookshelves – Ron Cody. He has written over a dozen books that include some of the most popular titles for new SAS users. He taught statistics at the Rutgers Robert Wood Johnson Medical School for many years and is a frequent presenter at SAS conferences.

Few people know more about SAS than Ron, which made him the perfect person to be our first SAS Press Author of the Month. During August, you can get special deals on all of Ron's books at RedShelf. Find out more information on his author page. Ron is also hosting a free webinar on data cleaning tips and tricks on August 11th.

We recently asked Ron to share a little bit about his journey as an author and teacher. As you might imagine, Ron has a lot of SAS knowledge and advice to share.


The Cody section of the SAS Press bookshelf.

Q: When did you decide to write your first book?

I decided to write my first SAS book (Applied Statistics and the SAS Programming Language) in 1985. It was published by Elsevier Publishing, which was later bought by Prentice Hall. At the time I was writing the book, there were no other books out there—just SAS manuals. This book is still in print (in a fifth edition).

Q: What made you decide to keep writing more books about SAS?

Once I realized I could write well enough to get published, I got the "writer's bug." Although writing a book is hard work, the reward is substantial. My motivation is more about teaching people about SAS and statistics rather than the monetary reward. My goal in writing any book is to make enough money to take my wife to dinner!

Q: Is one of your books your favorite? Or do you love them all equally?

I do have some favorites. I would put Learning SAS by Example as one. It contains a section for beginners as well as later chapters that are useful to intermediate, or even advanced SAS programmers. I particularly like my latest book, A Gentle Introduction to Statistics Using SAS Studio in the Cloud. I think I made statistical concepts accessible to non-mathematically minded people. There are only a few equations in the entire book.

Q: As a teacher, how did you encourage students who were having a hard time understanding statistics?

I try really hard to convince them that the statistical concepts really make sense, and it is sometimes the terminology that gets in the way.

Q: What do you think students struggle with the most when they are learning SAS?

I believe the overwhelming richness of SAS can intimidate a beginning programmer. That's why I start from simple examples and explain step-by-step how each program works.

Q: What is your best advice for someone who wants to learn SAS?

Buy all of my books, of course! Just kidding. There are many YouTube videos and other online resources that are useful. I think two of the best books would be The Little SAS Book, and either Learning SAS by Example (for the more serious student) or Getting Started with SAS Programming Using SAS Studio in the Cloud. The latter book is more suited to someone using SAS OnDemand for Academics.

Q: You recently published a memoir about your time as an EMT. How did it feel to reflect back on that time of your life? Any more memoirs in your future?

I thoroughly enjoyed writing a book about my years as a volunteer EMT (10-8 Awaiting Crew: Memories of a Volunteer EMT). I was fortunate that I had kept a journal and recorded details of some of the more interesting and exciting calls. As of now, I do not have another memoir in my future, but I am working on a nonfiction novel. I'm not sure how successful it will be, but I'm going to give it a try. I strive to write almost every day. I tell other beginning authors that if you spend a few hours writing every day, it becomes much easier. I call it "getting in the groove."



Ron and Dudley. Credit: Jan Cody

Thanks for talking to us, Ron! We hope you will join us in celebrating Ron's accomplishments as our Author of the Month. Check out his author page to see special deals on all of Ron’s books during August and to discover more resources.

 

Meet our SAS Press Author of the Month – Ron Cody was published on SAS Users.

July 12, 2022
 

SQL (Structured Query Language) is the most widely used programming language for relational databases worldwide. No other programming language produces more hits for a web search than SQL and interest is growing rapidly. In June 2010, Google showed 135 million hits versus 586 million hits in June 2020.

SQL is the programming language to handle big data. Within a very short time after its invention, SQL developed into an industry quasi-standard. The reasons behind this rapid development include the fast spread of databases, proven performance, popularity among users, and the growing relevance of analysts and programmers. Most analyst positions expect at least knowledge of basic SQL capabilities.

For analysts and programmers, SQL is attractive because once acquired, SQL skills can be applied in basically all environments from finance to healthcare, and in all systems from open source to commercial. Due to its relative independence from manufacturers (an ANSI standard), what you learn to do in SQL in Oracle can be applied to SQL in SAS.

PROC SQL in SAS is powerful and versatile. It offers myriad possibilities for working with:

  • descriptive statistics (see the sketch after this list)
  • advanced macro programming
  • SAS datasets, views, or complex queries
  • special application areas like integrity constraints or performance tuning
  • thousands of SAS functions and SAS function calls
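
As a taste of the first item in that list, here is a sketch of descriptive statistics with PROC SQL. I am assuming the SASHELP.SHOES sample table, which is presumably the source of the WORK.SHOES table used in the examples below:

proc sql;
   select REGION,
          count(*)   as N,
          sum(SALES) as Total_Sales format=dollar14.,
          avg(SALES) as Mean_Sales  format=dollar14.2
   from sashelp.shoes
   group by REGION;
quit;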

If you start learning PROC SQL, you will also acquire the basics of PROC FEDSQL, PROC DS2, and PROC CAS. And that will offer you a handy toolbox for SAS platforms and applications like SAS 9.4, SAS Viya, SAS Integration Studio, SAS Studio, Enterprise Guide, and many more.

Is that tempting enough to try your hand at PROC SQL? Not yet? Do you want to see what you get first? I will show you four example programs that do the same thing using PROC SQL, PROC FEDSQL, PROC DS2, and PROC CAS. I'll keep it simple just to prove the point. But have no fear, SQL can accomplish very advanced tasks. I've been involved in rewriting complex SQL programs that were thousands of lines long.

Let’s use PROC SQL as a springboard. From there, choose where you want to go.

Example 1: PROC SQL

proc sql;
   select REGION, SUBSIDIARY, SALES
   from work.shoes
      where SALES > 750000 ; 
quit;

With PROC FEDSQL, you can start working in the cloud environment of SAS Viya. Please note that PROC FEDSQL is not always a 1:1 match with PROC SQL, as this example might suggest.

Example 2: PROC FEDSQL

proc fedsql;
   select REGION, SUBSIDIARY, SALES
   from work.shoes
      where SALES > 750000 ; 
quit;

DS2 allows you to speed up processing by using its built-in multi-threading capabilities.

Example 3: PROC DS2

proc DS2 ; 
data LEFT_RIGHT4 (overwrite=yes) ; 
method run(); 
set {select LEFT.ID, 
             LEFT.A, RIGHT.F 
     from work.LEFT, work.RIGHT 
       where LEFT.ID = RIGHT.ID} ; 
output ; 
end ; 
enddata ;
run ;
quit;

PROC CAS enables you to take advantage of SAS Cloud Analytic Services (CAS).

Example 4: PROC CAS

proc CAS; 
session my_CAS_session ; 
  fedsql.execdirect 
  query=
   'select * 
    from 
    CASUSER.CAS_CLASS' ;
quit ;

Notice that SQL language elements like the SELECT statement are the same in each example. Once you have learned the basic syntax, you can use it in PROC FEDSQL, PROC DS2, and PROC CAS. And I am pretty sure there are more to come.

Why learn SQL? Because it’s a sustainable investment in your future. If you want to learn more about PROC SQL techniques, check out my book Advanced SQL with SAS®.

Why learn SQL? was published on SAS Users.

April 5, 2022
 

Have you ever heard something referred to as the bee’s knees? Most likely the person uttering that expression meant that it was truly amazing and extraordinary. Maybe you stopped and pondered the origin of the phrase. Well, wonder no more! In the 1920s in the United States, people were obsessed with rhyming and anthropomorphism (giving human traits or attributes to animals). This era saw the introduction of multiple animal-centric idioms: the ant’s pants, the cat’s pajamas, and (you guessed it) the bee’s knees. Most likely this idiom refers to the pollen basket on the lower section of a worker bee’s legs (where the tibia and the tarsi connect). When bees are out foraging, they carry all the pollen they find in the pollen baskets on their legs. If you ever see a bee with lots of pollen on her legs, you know she’s been working hard… one might even say she’s an amazing and extraordinary worker!

SAS Visual Analytics has a lot of features that you might call the bee’s knees, but one of the most amazing and extraordinary features is the AggregateTable operator. This operator, introduced in SAS Visual Analytics 8.3, enables you to perform aggregations on data crossings that are independent of (or changed from) the data in your objects. This means you can use this operator to compare aggregations for different groups of data in one single object.

To illustrate the AggregateTable operator in action (and to keep with the theme), let’s consider an example.

I’m a hobby beekeeper in Texas. This means that I own and maintain a few hives in my backyard from which I collect honey and wax to sell to family and friends. I’m interested in learning about honey production in the United States for different years. I’m pretty sure I’m not the biggest honey producer in my state (or even my county), but I want to look at different crossings of total production (by state and year, by county and year, and by state).

The United States Department of Agriculture’s National Agricultural Statistics Service has Census data on honey production (measured in pounds) for all counties in the United States for 2002, 2007, 2012, and 2017.

Type of Aggregation: Add

To start, I would like to calculate the total production for each state and display, for each year, the total for the state that produced the most honey (a derived data item I will call Maximum by State). For example, in 2002 California produced the most honey of any state (15,585,925 pounds), and in 2017 North Dakota produced the most honey of any state (24,296,437 pounds).

Because the table contains details by county and I’m interested in the total production by state, I will either need to create an aggregated data source that contains details by state, or I will need to use the AggregateTable operator. Since this post is about the wonders of the AggregateTable operator, let’s focus on that.

The AggregateTable operator requires five parameters:

Parameter                Description
Aggregation-Aggregated   The aggregation applied to the aggregated item when it is used in an object that displays fewer group-by crossings than the table in the expression.
Aggregation-Measure      The aggregation applied to the measure in the inner table context.
Type of aggregation      The type of aggregation that is performed: Fixed, Add, or Remove.
Categories               The list of categories used to alter the data crossing for the aggregation.
Measure                  The measure that is aggregated. A Table operator can be added as the measure to perform a nested aggregation.

It also has a nested operator, Table, that creates an intermediate table defined by the Type of aggregation, Categories, Measure, and Aggregation-Measure parameters.

For this example, I want to use a Key value object to see the total production values for the state that produced the most honey in each year. The object will contain one group-by crossing (or category): Year. The calculation, however, will need to use two group-by crossings to determine the highest producing state for each year: Year and State. Therefore, the Aggregation-Measure is _Sum_ (to calculate the total production by state), the Type of aggregation is Add (because I want to add a crossing for State to the calculation), Categories is set to State, and Measure is Production (lb).

The intermediate table will contain one row for each state and year and contain total production values.

Then, for each year, I want the highest production value (for 2002, 15,585,925 pounds). Therefore, the Aggregation-Aggregated parameter should be _Max_ to grab the maximum value for each year from the intermediate table.
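
Although the operator is defined interactively in SAS Visual Analytics, its two-stage logic is easy to express in code. Below is a PROC SQL sketch of the equivalent computation, assuming a hypothetical detail table named HONEY with columns Year, State, and Production. The inner query builds the intermediate table; the outer query applies the _Max_ aggregation per year. The Remove and Fixed types described below change only the grouping of the inner query:

proc sql;
   create table MAX_BY_STATE as
   select Year, max(State_Total) as Maximum_by_State
   from (select Year, State, sum(Production) as State_Total /* intermediate table */
         from HONEY
         group by Year, State)
   group by Year;
quit;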

Then, I can display the Maximum by State with Year in a Key value object.

Note: Beginning in SAS Visual Analytics 2021.1.1 (May 2021), a new role is available for the Key value object, Lattice category. This enables you to display a key value object for each distinct value of a category data item (in this example, Year).

Now that I have a data item that contains the production amount for the highest producing state for each year, I can create some more complex calculations, like the percentage of total production for each year by the state that had the highest production. This will enable me to see if the highest producing state is doing all the heavy lifting or if all the states are producing a near-equal amount.

Type of Aggregation: Remove

The Type of aggregation parameter also enables you to remove categories from the calculation. For this example, suppose I want to compare the production in each county to the production from the highest producing county in that state (Maximum by County). I want to use a list table to compare these values.

The object will contain three group-by crossings: Year, State, and County. The calculation, however, will only use two group-by crossings to determine the highest producing county in each state for each year: Year and State. Therefore, the Aggregation-Measure is _Max_ (to calculate the maximum production in each state), the Type of aggregation is Remove (because I want to remove a crossing for County from the calculation), Categories is set to County, and Measure is Production (lb).

The intermediate table will contain one row for each state and year and contain the production values for the county in that state with the highest production. Notice that for this table, the aggregation for Production was set to Maximum to show the maximum production for each state.

Because the number of groupings in the object (County, Year, and State) is not fewer than the number of groupings in the calculation (Year and State), the Aggregation-Aggregated parameter is not applied and can be set to any value.

Then, I can display Maximum by County in a list table with total production by county to easily compare each county’s production with the production of the county in that state that had the highest production.

Now, I can calculate each county’s difference from the county with the highest production in that state for each year.

Type of Aggregation: Fixed

The Type of aggregation parameter also enables you to set a fixed category for the calculation. For this example, suppose I want to compare the production in each county to the total production across all years by state (Total by State). I want to use a list table to compare these values.

The object will contain three group-by crossings: Year, State, and County. The calculation, however, will only use one group-by crossing to determine the total production by state across all years: State. Therefore, the Aggregation-Measure is _Sum_ (to calculate the total production by state across all years), the Type of aggregation is Fixed (because I want to fix the crossing to State for the calculation), Categories is set to State and Measure is Production (lb).

The intermediate table will contain one row for each state and total production values across all years.

Because the number of groupings in the object (County, Year, and State) is not fewer than the number of groupings in the calculation (State), the Aggregation-Aggregated parameter is not applied and can be set to any value.

Then, I can display Total by State in a list table with total production by county to easily compare each county’s production with total production in the state across all years.

I can even compare total state values for each year with total production in that state across all years.

Then, I can calculate the share of total production produced each year.

For more information about how to create advanced data items and filters for your SAS Visual Analytics reports, including examples that use different types of operators, check out my book Interactive Reports in SAS Visual Analytics: Advanced Features and Customization.

Bees are important pollinators of much of the food that you eat, and they are in desperate need of your help! There are many ways you can help the honeybee population thrive:

  • Become a beekeeper
  • Plant a garden for all pollinators (including bumblebees, butterflies, bats, and moths)
  • Reduce or eliminate pesticide use
  • Support your local beekeepers by buying local honey
  • Contact a beekeeping group if you see a swarm
  • Volunteer as a citizen data scientist by taking pictures and posting information on the iNaturalist app
  • Create a bee bath for those hot summer days
  • Build homes for native bees

 

The AggregateTable operator… It’s the bee’s knees! was published on SAS Users.

February 17, 2022
 

This blog serves two purposes: the main purpose is to show you some useful SAS coding techniques, and the second is to show you an interesting method of creating a Beale cipher.

TJ Beale is famous in Virginia for leaving behind three ciphers, supposedly describing the location of hidden gold and treasures. (Most cryptologists and historians believe the whole set of ciphers and treasure was a hoax.) In one of the ciphers, he used a method based on the Declaration of Independence. His coding method was as follows:

  • Get a copy of the Declaration of Independence and number each word.
  • Take the first letter of each word and form a list.
  • Associate each number with that letter.

For example, consider this text:

“Four score and seven years ago, our fathers brought forth upon this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal.”

To create a Beale cipher, you would proceed as follows:

Four(1) score(2) and(3) seven(4) years(5) ago(6), our(7) fathers(8) brought(9) forth(10) upon(11) this(12) continent(13) a(14) new(15) nation(16), conceived(17) in(18) liberty(19) and(20) dedicated(21) to(22) the(23) proposition(24) that(25) all(26) men(27) are(28) created(29) equal(30).

Next, you would make a table like this:

Letter  Numbers
F       1,8,10 (all the numbers of words that begin with 'F')
S       2,4
A       3,6,14,20,26,28
Y       5
…and so on

 

You would then want to put the list in alphabetical order like this:

A 3,6,14,20,26,28
B 9
C 13,17,29
D 21
E 30
F 1,8,10
…and so on

 

To create your cipher, select a number at random from the list of numbers corresponding to the letter that you want to encode. The advantage of this over a simple substitution cipher is that you cannot use frequency analysis to guess which letter a particular number represents.

This blog explains how to create a Beale cipher; my next blog will explain how to decipher a Beale cipher.

You need to start out with a book or document that is accessible to the sender and recipient of the cipher. To offer some additional security, you could decide to start from a specific page in a book. For a simple demonstration of how to create a Beale cipher, I have entered part of the Declaration of Independence in a text file called Declare.txt.

A funny aside: I was teaching my Functions course in the UK, in a small town north of London on the Thames. One of the programs demonstrating several SAS character functions was the program I'm using here to demonstrate how to create a Beale cipher. I had completely forgotten that the document was the Declaration of Independence. Whoops! I asked the class, "I hope you're not still angry with us about that." Apparently not, and we all had a good laugh.

Back to the problem. I will break down the program into small steps and provide a partial listing of the data sets along the way so that you can see exactly how the program works. The first step is to read the text file, extract the first letter from each word, change the letter to uppercase, and associate each letter with the count of words in the text.

Here is the first part of the program.

data Beale;
   length Letter $ 1;  
   infile 'c:\Books\Blogs\Declare.txt'; 
   input Letter : $upcase1. @@; ❶
   N + 1; ❷	
   output;
run;
 
title "First Step in the Beale Cipher (first 10 observations)";
proc print data=Beale(obs=10) noobs;
run;

❶ By using the $UPCASE1. informat, you are selecting the first letter of each word and converting it to uppercase. If you are unfamiliar with the $UPCASEn. informat, it is similar to the $n. informat with the additional task of converting the character(s) to uppercase.

❷ You use a SUM statement to associate each letter with the word count.

Here is the listing from this first step:

Next, you need to sort the data set by Letter so that all the words that start with A, all those that start with B, and so forth are placed together.

proc sort data=Beale;
   by Letter;
run;
 
title "The list in sorted order (partial listing)";
proc print data=Beale(obs=10) noobs;
run;

Below is a partial listing of the sorted file:

Any of the numbers 24, 25, 27, and so forth can be used to code an 'A'.

The final step is to list all the letters from A to Z (Z is pronounced Zed in the UK and Canada) in a line, followed by all the possible numbers associated with each letter.

data Next;
   length List $ 40; ❸
   retain List; ❹
   set Beale;
   by Letter; ❺
   if first.Letter then List = ' '; ❻
   List = catx(',',List,N); ❼ 
   if last.Letter then output; ❽ 
run;
 
title "List of Beale Substitutions";
proc print data=next(obs=5) noobs;
   var Letter List;
run;

❸ The variable List will hold all the possible numbers that can be used to code any of the letters. In a real program, this list might be longer.

❹ You need to RETAIN this variable; otherwise, it would be set back to a missing value for each iteration of the DATA step.

❺ Following the SET statement with a BY statement creates the two temporary variables, First.Letter and Last.Letter. First.Letter is true when you are reading the first observation for each letter—Last.Letter is true when you are reading the last observation for a letter.

❻ For the first A, B, C, and so on, initialize the variable List to a missing value.

❼ Use the CATX function to concatenate all the numbers, separated by commas.

❽ When you are done reading the last A, B, C, and so on, output the string.

Below are a few lines generated by this program:
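
To round this out, here is a sketch of my own (not part of the original program) that uses the data set Next to encode a message: for each letter, look up its list and pick one of its numbers at random. The message 'SAS' is a hypothetical example whose letters all appear in the sample text:

data Cipher;
   if _n_ = 1 then do;
      length Letter $ 1 List $ 40;
      declare hash h(dataset:'Next');   ***load the Letter-to-List lookup table;
      h.defineKey('Letter');
      h.defineData('List');
      h.defineDone();
      call missing(Letter, List);
   end;
   Message = 'SAS';   ***hypothetical message to encode;
   do i = 1 to length(Message);
      Letter = upcase(char(Message,i));
      if h.find() = 0 then do;   ***this letter has a list of numbers;
         Pick = ceil(rand('uniform')*countw(List,','));
         Code = input(scan(List,Pick,','),8.);   ***choose one at random;
         output;
      end;
   end;
   keep Letter Code;
run;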

For more information about the CATX function and other SAS functions, please take a look at my book, SAS Functions by Example, Second Edition.

Fun with Ciphers (Part 1) was published on SAS Users.

January 11, 2022
 

Last year, I wrote a blog demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This blog demonstrates a different approach—one that is useful for variables for which you can identify reasonable ranges of values for each variable. For example, you would not expect resting heart rates below 40 or over 100 or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid values, it would be a good idea to check those out-of-range values to see if they are real or data errors.

In the third edition of my book, Cody's Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno, and you see three variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • VAR=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing=Error or Missing=Ignore. The default for this argument is IGNORE, but it is still good practice to include it in the call so that anyone reading your program understands how missing values are being handled.

Because you might be calling this macro for many variables, the values of two macro variables &Dsn (data set name) and &IDVar (the identifying variable such as Patno or Subj) are assigned values once, using two %Let statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of your possible errors.

Here is an example:

/* Set values for &Dsn and &Idvar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &Dsn(keep=&Idvar &Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&Var";
   Value = &Var;
   if &Var lt &Low and not missing(&Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&Missing) ne IGNORE %then %do;
      else if missing(&Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &Var gt &High then do;
      Reason='High';
      output;
   end;
   drop &Var;
run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea for the %Errors macro is to test each variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For those readers who are not that comfortable with macro programming, the third section (beginning with %if %upcase(&Missing)) is executed only when the value of the macro variable &Missing is not equal to IGNORE.
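
To see how the pieces fit together, here is a sketch of the DATA step that the macro generates for %Errors(Var=HR, Low=40, High=100, Missing=error), using the %LET values above (my expansion, shown for illustration; with Missing=ignore, the middle ELSE IF block simply disappears):

data Tmp;
   set Clean.Patients(keep=Patno HR);
   length Reason $ 10 Variable $ 32;
   Variable = "HR";
   Value = HR;
   if HR lt 40 and not missing(HR) then do;
      Reason='Low';
      output;
   end;
   else if missing(HR) then do;
      Reason='Missing';
      output;
   end;
   else if HR gt 100 then do;
      Reason='High';
      output;
   end;
   drop HR;
run;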

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &Dsn";
      id &Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Errors and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody's Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, searching for the book, and then clicking Example Code and Data. You do not have to buy the book to download the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.
