Macro

4月 062022
 

Background

Among the many logic puzzles in the world, the Zebra Puzzle is the one of the most famous. It was first published on December 17, 1962 by Life Magazine, an old weekly magazine owned by Time Warner, and it claimed that only 0.1% of people in the world can solve it. The original puzzle is as follows:

Zebra Puzzle: Who drinks the water? Who owns the zebra?

  1. There are five houses.
  2. The Englishman lives in the red house.
  3. The Spaniard owns the dog.
  4. Coffee is drunk in the green house.
  5. The Ukrainian drinks tea.
  6. The green house is immediately to the right of the ivory house.
  7. The Old Gold smoker owns snails.
  8. Kools are smoked in the yellow house.
  9. Milk is drunk in the middle house.
  10. The Norwegian lives in the first house.
  11. The man who smokes Chesterfields lives in the house next to the man with the fox.
  12. Kools are smoked in the house next to the house where the horse is kept.
  13. The Lucky Strike smoker drinks orange juice.
  14. The Japanese smokes Parliaments.
  15. The Norwegian lives next to the blue house.

Life Magazine did not publish the puzzle's solution. When it was released, the Zebra puzzle gained attention from logic puzzle fans all over the world who would then challenge their friends and it spread like wildfire. On March 25, 1963, the names of hundreds of people who successfully solved the puzzle were published in that month's edition of Life Magazine.

The Zebra puzzle is now often referred to as Einstein’s puzzle and some say it was invented by Einstein as a boy. In actuality, there is no real evidence that Einstein did invent the puzzle. The cigarette brand Kools mentioned in the original puzzle did not exist during Einstein’s boyhood. The version of Einstein’s puzzle that is widely circulated today is a variant of the above Zebra puzzle. Some claim that Einstein once asserted that 98% the world's population would not be able to solve it. The puzzle is as follows:

Einstein’s Puzzle: Who keeps the fish?

  1. The British man lives in a red house.
  2. The Swedish man keeps dogs as pets.
  3. The Danish man drinks tea.
  4. The Green house is next to, and on the left of the White house.
  5. The owner of the green house drinks coffee.
  6. The person who smokes Pall Mall rears birds.
  7. The owner of the yellow house smokes Dunhill.
  8. The man living in the center house drinks milk.
  9. The Norwegian lives in the first house.
  10. The man who smokes Blends lives next to the one who keeps cats.
  11. The man who keeps horses lives next to the man who smokes Dunhill.
  12. The man who smokes Blue Master drinks beer.
  13. The German smokes Prince.
  14. The Norwegian lives next to the blue house.
  15. The Blends smoker lives next to the one who drinks water.

For a problem in the form of the above two puzzles, this is essentially a classic logic problem. It has important experimental significance in mathematics, logic, and computer science because it can reveal to some extent what intelligence is, how logical reasoning works and how to empower machines to create artificial intelligence.

Problem Analysis

Using computers to solve logic puzzles has been always a common topic and one brute force method is to enumerate every possible situation before deciding. For both the Zebra puzzle and Einstein’s puzzle above, there are 5 dimensions (Nationality, Color, Pets, Drinks, Cigarettes) that need to be connected to the appropriate house. For any dimension, the possible choices are 5 × 4 × 3 × 2 × 1 = 5!, there are (5!) 5, or 24, 883, 200, 000 (about 25 billion) possibilities in total for five dimensions (See Table 1). A well-written program can solve this in seconds on a capable computer. The traversal determination is completed within minutes. This enumeration selection and decision method is essentially a brute force method of an exhaustive search space in which there is no intelligence at all.

Location Nationality Color Pets Drinks Cigarettes
1 C(5, 1)=5 5 5 5 5
2 C(4, 1)=4 4 4 4 4
3 C(3, 1)=3 3 3 3 3
4 C(2, 1)=2 2 2 2 2
5 C(1, 1)=1 1 1 1 1
Possibilities 5! 5! 5! 5! 5!

Table 1: Searching space

We can also think of such puzzles as Constraint Satisfaction Problems (CSPs). A CSP is a special type of problem that involves a set of variables and a set of constraints between them, finding one (or multiple) assignments such that all constraints are satisfied. For a constraint satisfaction problem with limited definitions, it can be solved with a traditional backtracking search in which the variable selection strategy is optimized by the minimum remaining value, the most constrained variable and failure priority strategies. This method is obviously more rational and closer to the way people think.

In addition to the brute force method and the CSP solution method above, is there any other solution method for this logic puzzle? Is it possible to build a general automatic reasoning system for it? This needs to be viewed from the perspective of human thinking.

The brain structure and intelligence level of different people often reveal huge differences when solving problems. A relatively clear difference is the ability of logical reasoning. Logical reasoning is essentially a form of thinking that logically deduces some new knowledge (conclusion) based on some known knowledge (premise). The process of people's cognition of the objective world is essentially the process of observing, identifying, naming, measuring various objects in the real world and connecting these basic cognitive components to build a cognitive building.

The reasoning process from individual to general, or general to individual, can be summarized as deduction and induction. We generally start from various concrete individuals and continue to abstract and summarize to form "abstract" concepts, then summarize them into general principles and laws. On the other hand, based on the existing knowledge system and the known rules and laws, we will carry out the knowledge application process for the newly emerged concrete individuals in turn. Just like a credit scoring mechanism, this is nothing more than an artificial evaluation model or system based on collected financial data, personal data, etc. of various users, that then evaluates new credit card applicants to determine whether or not to grant, what the credit limit is, what the risk is, etc.

The difference between machines and humans is very significant in several aspects. The memory capacity and computing power of computers can be expanded infinitely, especially cloud computing, which has raised this ability to a new level. No matter how good the human memory is, it is not comparable to those well-organized and indexed knowledge base systems that exist on hard disks or in the cloud. No matter how good the human mental arithmetic ability is, it cannot deduce the recursive processing and stack-type repetitive processing of the computer. No matter how sharp the human eye is, it cannot be compared with the bit-level discrimination ability of the color components of a pixel. However, our human brain is capable of innovation, intuition, perception, high abstraction, and derived wisdom, which is the advantage of our carbon-based life over silicon-based machines that need electricity to operate.

Human Perspective

Let's analyze how a "human" with strong logical reasoning ability solves the Zebra puzzle. From all the puzzle's statements, we can perceive various entities, that is, images constructed by the identification and naming of various real-world entities, including Englishman, red house, Spaniard, dogs, etc.; similarly, we deconstruct the relationship between entities from some verbs or prepositions, such as... lives..., owns..., drinks..., to the right of..., etc., this relationship is also an entity in essence.

Secondly, according to the relationship between entities in the real world, people will establish corresponding connections in the brain, and abstract higher-level category concepts, such as nationality, color, pets, drinks, cigarettes, and other concepts according to the type of connection. Among them, you can also see that concepts like "nationality" are almost purely virtual. These virtual entities are developed based on the needs of human beings to organize society. There are almost no physical entities except an ID number to prove this. Also, you can't say someone is a stateless person if the passport is not issued or lost!

Therefore, according to the order of the puzzle statements, we can summarize 5 dimensions (corresponding variables) according to the entity type, the nationality of the homeowner, the color of the house, pets, drinks, and cigarettes. From this, we can list 5 explicit dimensions and their possible values (in order of appearance in the puzzle). Undoubtedly, these variables are nominal:

  • Nationality: British, Spaniard, Ukrainian, Norwegian, Japanese
  • House: red, green, ivory, yellow, blue
  • Pets: dog, snails, fox, horse
  • Drinks: coffee, tea, milk, orange juice
  • Cigarettes: Old Gold, Kools, Chesterfields, Lucky Strike, Parliaments

Wait, there's another dimension hidden in the puzzle: the location of the house. This is an entity relationship identified based on the words "to the right of," "in the middle," "next to," “in the first” and other concepts in the statement set, so we should establish a concept in our mind to express the order relationship between entities. The most straightforward way is to number them sequentially: 1, 2, 3, 4, 5. Thus, position variables are ordinal variables that not only contain differences but also describe differences in their order or rank.

At this point in the puzzle, the solution to the problem "Who drinks water and who owns zebras?" becomes searching for the mapping relationship of entities in the above six dimensions. The essence of knowledge is connections, so the intelligence based on this model is the process of establishing principles and criteria and applying them to solve, predict and change the future.

Everyone uses different strategies when dealing with problems and even the same person will use different strategies in the derivation process. The 6-dimensional variable is enough to confuse most people's minds - "how to choose the next statement to get a solution as quickly as possible" is the key to solving the problem! Like the minimum remaining strategy for constraint satisfaction problems, we can usually use the deterministic highest strategy to reason forward and use elimination strategies to reduce the solution space.

When people reason, they often begin with a certain statement (knowledge), then apply knowledge to uncertain statements and constantly accept and exclude uncertain statements. Uncertain means that there are multiple possible distributions for the same statement. If there are multiple possible branches, hypothetical reasoning will be used to select one of the limited possible paths to try until the result meets the maximum statements accepted. If there are too many branches of uncertainty, the process degenerates into a search process.

For a logic puzzle, we also need to consider whether its final solution is unique and how many variants of methods (paths) there are to solve it. That is to say, the former can be obtained by the fastest method, while the latter is more complicated and needs to be traversed on the equivalent branch to obtain all possible paths.

Implementation perspective

Based on the above thinking logic, we can use SAS to design a model-oriented automatic reasoning system. It takes the tagged text as the input and outputs the answer text for a query. The so-called general automatic reasoning means that it must be abstract enough and completely centered on an abstract model, which means that it should be able to solve arbitrary logic puzzles without changing any code. The solver itself should not have any domain-related formulas or constraints. The difference between it and the constraint satisfaction problems solver is that the constraints are contained in the model generation stage, rather than determined in the solving stage. The overall design (Figure 1) is as follows, and currently you can click here to download the precompiled SAS code for preview, but it can only run in English, Unicode or Simplified Chinese SAS 94M7+(WIN) or SAS Viya(LAX) environments.

Figure 1: Overall design

Due to the constraints being parsed and implied in the generated model from statements, an automatic reasoning system can be a pure model structure-based solver which does not contain any predefined domain-specific knowledge. The solution generated by the reasoning system is just a subset of the model that can be further processed or presented. The boundaries between humans and machines are limited to model generation and question answering. Thanks to this, a statement set can be any set of finite constraints, whether it is a zebra puzzle, a simpler Einstein’s puzzle or a more complicated blood donation puzzle that can be solved by the same reasoning system. They differ only in the statement set and the questions. The practice has proved that according to the above principles of human thinking, all zebra-type puzzles can be resolved by the same system as they are essentially the same.

Since machines cannot automatically identify different dimensions or variables from entities and relationships like humans, we need to provide a minimal set of tags in the statement set, which can be in the form of a parsimonious [N:Englishman] keyword tag or form such as <Entity Type="N">Englishman</Entity> XML markup. The following table (Table 2) lists the statement set of the Zebra puzzle which consists of 14 statements. This labeled statement set is the only input of the reasoning system.

Table 2: Tagged Statements as input

%SolvePuzzle(input=Zebra_en, question=%str([N:Who] drinks [D:Water],[N:Who] owns [P:Zebra]));

 

The above statement set is parsed and a model is generated. The type of the annotation, i.e. the variable type is a key factor. In the Zebra puzzle, all variables are nominal except the Location, which is an ordinal variable.

To solve a single solution, the termination condition is that all statements are accepted (satisfied). As the reasoning system proceeds, it must maintain two queues: accepted knowledge and rejected knowledge. We know that when we accept new knowledge, we may accept more compatible knowledge. We may also reject some contradictory knowledge; when we reject a piece of knowledge, if it is propagatable in logic, we will also reject more associated knowledge. This is an active elimination strategy. The method of solving all solutions shares this reasoning logic with solving a single solution, but the outer structure is different.

Given the specificity of Location variables in Zebra puzzle, determining the mapping of other categorical variables on the Location variable is a key point. Various zebra puzzle variants have an ordinal variable, when we accept a statement, determining the value of the Location  variable is the most critical step, so the reasoning system should expose it. The following table (Table 3) shows the reasoning steps of the system for the zebra puzzle, where L is the location of the house:


Table 3: Reasoning Steps as output

The result of the reasoning steps can be presented in a more intuitive matrix as follows (Table 4):

Table 4: Reasoning Result

The system can also print the detailed solution process in the form of matrix according to the reasoning steps (Table 5) :

Table 5: Reasoning Result step by step

For the solution results, we can ask any reasonable question in the form of text and let the system return an intuitive answer. Note that the question must be in the tagged form above with question-condition pairs, that is, the sequence of forms in which the question comes first, and the condition comes after. For example:

 %QueryAnswer(question=%str([N:Who] drink [D:Water][N:Who] owns [P:Zebra]));

Table 6: Reasoning Answer for question

In fact, we can ask the system to answer all kinds of valid questions, whether they are known or unknown conditions in the statement, such as:

%QueryAnswer(question=%str([P:What] was kept by [N:Spaniard][N:Who] smokes [C:Chesterfields].));

Table 7: Reasoning Answer for question

As a puzzle terminator, we try to let the automatic reasoning system solve the so-called very hard zebra puzzle: Blood Donation puzzle. The complete text description of the puzzle is as follows:

Blood Donation Puzzle

There is a hospital with a blood donation section with five units next to each other. One day after five women donated blood simultaneously, the secretary forgot to make a record of the event and the managers hired you to conduct research and hopefully find out each donor's blood type correctly. After your investigation, you observed that each donor has a different shirt color, name, blood type, age, weight and job. Then you talked with the laboratory workers and they told you they have all the information about the donors but it's completely mixed up. They also give you a report about what they remember from the instance of donation.

  • Shirt Color: black, blue, green, purple, red
  • Names: Andrea, Brooke, Kathleen, Meghan, Nichole.
  • Blood Types: A+, AB+, B+, B-, O-.
  • Ages: 25, 30, 35, 40, 45 years.
  • Weights: 120, 130, 140, 150, 160 lbs.
  • Jobs: Actress, Chef, Engineer, Florist, Policewoman

Chief security officers also provided a report for you according to what he remembers.

  1. The A+ donor is next to the B+ donor.
  2. Brooke is at one of the ends.
  3. The woman wearing a Black shirt is somewhere to the left of the 150 lb woman.
  4. The Actress is next to the Chef.
  5. Kathleen is 40 years old.
  6. The Florist is somewhere to the right of the woman wearing the purple shirt.
  7. The oldest year-old donor weighs 130 lb.
  8. Brooke is next to Nichole.
  9. The 35-year-old woman is exactly to the left of the 30-year-old woman.
  10. The 120 lb donor is somewhere between the the O- donor and the 150 lb donor, in that order.
  11. Kathleen is at one of the ends.
  12. The woman wearing the purple shirt is somewhere to the right of the woman wearing the green shirt.
  13. The B+ donor weighs 140 lb.
  14. The youngest woman is next to the 30-year-old woman.
  15. The woman considered AB+ universal recipient is exactly to the left of the A+ donor.
  16. Meghan is somewhere to the right of the woman wearing the purple shirt.
  17. The woman wearing the green shirt is somewhere between the Actress and the woman wearing the red shirt, in that order.
  18. At one of the ends is the 130 lb woman.
  19. The O- universal donor is 35 years old.
  20. The Florist is somewhere between the Actress and the Engineer, in that order.
  21. The woman wearing the blue shirt is somewhere to the left of the woman wearing the red shirt.
  22. The AB+ donor is next to the youngest woman.

According to all this information, the managers expect you to find out all details about the blood donors. In fact, the solution to the blood donation puzzle can be formulated into four more concise questions: Whose blood type is B-? And what color shirt is worn by the person weighing 160 lb? How many years old is Andrea, which unit is the Policewoman?

To make the system to solve the blood donation puzzle, we only need to create the following set of labeled statements as the input to the automatic reasoning system (Table 8):

Table 8: Tagged statement as input

The system solution results are as follows (Table 9), please note the Youngest and Oldest in the column A, which are mapping to the age 25 and 45. If you want 25 and 45 to appear in the matrix, you should let these numbers substitute for youngest/oldest in the input statements.

Table 9: Reasoning Result

Further tests indicate that the automatic reasoning system can solve various similar logic puzzles and the difference in input information is in the statement set. Further analysis and comparison showed that the Zebra puzzle, Einstein’s puzzle, and the Blood Donation puzzle all have unique solutions, but there can be multiple solution order variants. E.g., the Zebra puzzle has 144 variants while the Blood Donation puzzle has 972 variants.

Table 10: Multiple Reasoning Variants

Einstein’s puzzle is less difficult than the Zebra puzzle. The Blood Donation puzzle has more dimensions and requires more questions to be answered, and the average time required to solve all variants becomes longer due to the expansion of the search space. Another interesting finding is that although there are many questions to be answered in the Blood Donation puzzle, the reasoning steps to find the first solution will not lead to any dead ends, indicating that the puzzle was purposely designed in the most natural way that humans can solve. The table below lists the brief analysis data.

Puzzle Name Statements Questions

Model Size X Dimensions

First Solution Time (s) Variants Average Solution Time (s)
Einstein’  15 Who keeps fish? 154 X 6 0.1 s 736 0.02 s
Zebra  14

Who drinks water?

Who owns zebra?

138 X 6 0.2 s 144 0.11 s
Blood Donation 22 Whose blood type is B-?

What color shirt is worn by the person weighing 160 lb?

How many years old is Andrea?

Which unit is the Policewoman?

332 X 7 0.2 s 972 6 s

Table 11: Comparison of three puzzles

The evolution from data to intelligence must go through several stages: data -> information -> knowledge -> intelligence. The underlying data plays the most fundamental role. It comes from actual observations and is the most primitive entity, which has no meaning itself except to represent a kind of existence. For the reasoning system, the four types of variables in statistics, nominal variables, ordinal variables, interval variables and ratio variables, reflect the capacity of information. The various puzzles mentioned above usually only contain limited nominal and ordinal variables and do not contain interval and ratio variables, so they can be easily solved by the automatic reasoning system mentioned in this article. Currently, the model generation part can only deal with limited categorical variables, it has not been generalized to continuous variables but discrete variables so far. In contrast, the rules are implied in the constraint formulation of general constraint satisfaction problems, so there is no such restriction of variable type. It still requires further research and optimization to solve, e.g., the mixed model approach.

Due to space limitations, only the key concepts, implementation bird view and some system capabilities are presented in this post, please allow me to discuss technical details for this automatic reasoning system in the near future!

Summary

Starting from the Zebra puzzle, this post identifies the common paradigm of logic puzzles and designs a general logic reasoning system to solve them automatically. It is not only different from the brute force method without intelligence, but also different from the general solution to the constraint satisfaction problems. This practice has shown that by starting from a set of purely declarative statements, any zebra puzzle with limited constraints can be tagged and solved quickly in the same way without changing any code.

Learn more

Zebra Puzzle Terminator: A general automatic reasoning system solving method was published on SAS Users.

1月 272022
 

Many SAS programmers use macros. I have seen students in my SAS classes use several methods to activate their macros. One way is to load the macro in the Display manager or editor in SAS OnDemand for Academics and submit it. Another technique is to use the statement %Include macro-name.

The best way to make all of your macros available in every SAS session is to take advantage of the Auto Call facility. There are several steps to make this work. I will show you how this works in SAS running on your local computer or SAS OnDemand for Academics.

Instructions for SAS on your local computer

First, place the macros that you plan to use in a pre-defined place, such as "C:\MySASFiles\MyMacros."

Next, edit your Autoexec.sas file. Remember, all the statements in this file are executed every time you open a SAS session. Add the following statements:

filename MyMacros 'C:\MySASFiles\MyMacros;'

(or wherever you placed your macros)

options mautosource SASAutos=(MyMacros);

(The name in parentheses is the file name above.)

You are done. The next time you open a SAS session on your local machine any of the macros in your macro library are available to use.

Instructions for SAS OnDemand for Academics

Open SAS Studio. In the Navigation Pane, click Server Files and Folders. Create a new folder as shown below. (The new folder icon is circled.)

In this example, I chose Macros as my folder name.

Click the Upload icon (circled in the figure below).

Upload the macros from your local computer to this folder (shown below).

Now it's time to add the appropriate statements to the Autoexec.sas file. Do this by clicking the icon circled in the figure below.

Choose Edit Autoexec File from the drop-down list.

Add the two lines in the box shown in the next figure.

Click SAVE. (You can click RUN first and check the log to see that there are any errors.)

Demonstrating a Macro Call (with the Auto Call facility)

You can now use any of the macros in your macro folder. For example, I have a macro called %Print that will either print an entire data set or the first “n” observations and add a title to the listing. The program below was run after opening a new session in SAS OnDemand for Academics.

Here is the result:

If you have never used the Auto Call facility, you now know how to do it. One important final point: you must save your macros using the same name as the macro name. For example, if you have a macro called Print, save it as Print.sas. By doing that, SAS knows where to find and compile the macro that you want to use.

If you are curious about my %Print macro or want to test out the Auto Call facility, here is a listing:

*----------------------------------------------------------------*
| Program Name: Print.sas  in C:\sasdata\macros                  |
| Purpose: Macro that prints out a data set and uses the data    |
|          set name in the title                                 |
| Calling Argument: %PRINT(dsn,obs=)                             |
| Example: %PRINT(test,obs=5)                                    |
| Date: January 13, 1998                                         |
| Updated March 6, 2000 to add obs positional parameter          |
*----------------------------------------------------------------*;
%macro Print(dsn,obs=max);
   proc print data=&amp;dsn (obs=&amp;obs) noobs;
      title "Listing of data set %upcase(&amp;dsn)";
      %if &amp;obs ne max %then title2 "First &amp;obs Observations";;
   run;
%mend print;

 

How to use the SAS Auto Call facility was published on SAS Users.

1月 112022
 

Last year, I wrote a blog demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This blog demonstrates a different approach—one that is useful for variables for which you can identify reasonable ranges of values for each variable. For example, you would not expect resting heart rates below 40 or over 100 or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid values, it would be a good idea to check those out-of-range values to see if they are real or data errors.

In the third edition of my book, Cody's Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno, and you see three variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • VAR=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing=Error or Missing=Ignore. The default for this argument is IGNORE, but it is still good practice to include it in the call so that anyone reading your program understands how missing values are being handled.

Because you might be calling this macro for many variables, the values of two macro variables &Dsn (data set name) and &IDVar (the identifying variable such as Patno or Subj) are assigned values once, using two %Let statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of your possible errors.

Here is an example:

/* Set values for &amp;Dsn and %IDVar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &amp;Dsn(keep=&amp;Idvar &amp;Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&amp;Var";
   Value = &amp;Var;
   if &amp;Var lt &amp;Low and not missing(&amp;Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&amp;Missing) ne IGNORE %then %do;
      else if missing(&amp;Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &amp;Var gt &amp;High then do;
      Reason='High';
      output;
      end;
      drop &amp;Var;
   run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea for the %Errors macro is to test each variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For those readers who are not that comfortable with macro programming, the third section (beginning with %if %upcase(&Missing)) is executed only when the value of the macro variable &Missing is not equal to IGNORE.

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &amp;Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &amp;Dsn";
      id &amp;Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Error and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody's Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, search for the book, then click Example Code and Data. You do not have to buy the book to download all the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.

1月 112022
 

Last year, I wrote a blog demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This blog demonstrates a different approach—one that is useful for variables for which you can identify reasonable ranges of values for each variable. For example, you would not expect resting heart rates below 40 or over 100 or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid values, it would be a good idea to check those out-of-range values to see if they are real or data errors.

In the third edition of my book, Cody's Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno, and you see three variables HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • VAR=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing=Error or Missing=Ignore. The default for this argument is IGNORE, but it is still good practice to include it in the call so that anyone reading your program understands how missing values are being handled.

Because you might be calling this macro for many variables, the values of two macro variables &Dsn (data set name) and &IDVar (the identifying variable such as Patno or Subj) are assigned values once, using two %Let statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of your possible errors.

Here is an example:

/* Set values for &amp;Dsn and %IDVar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &amp;Dsn(keep=&amp;Idvar &amp;Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&amp;Var";
   Value = &amp;Var;
   if &amp;Var lt &amp;Low and not missing(&amp;Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&amp;Missing) ne IGNORE %then %do;
      else if missing(&amp;Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &amp;Var gt &amp;High then do;
      Reason='High';
      output;
      end;
      drop &amp;Var;
   run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea for the %Errors macro is to test each variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For those readers who are not that comfortable with macro programming, the third section (beginning with %if %upcase(&Missing)) is executed only when the value of the macro variable &Missing is not equal to IGNORE.

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &amp;Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &amp;Dsn";
      id &amp;Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Error and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody's Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, search for the book, then click Example Code and Data. You do not have to buy the book to download all the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.

2月 012021
 

Do you want to spend less time on the tedious task of preparing your data? I want to tell you about a magical and revolutionary SAS macro called %TK_codebook. Not only does this macro create an amazing codebook showcasing your data, it also automatically performs quality control checks on each variable. You will easily uncover potential problems lurking in your data including variables that have:

  • Incomplete formats
  • Out of range values
  • No variation in response values
  • Variables missing an assigned user-defined format
  • Variables that are missing labels

All you need is a SAS data set with labels and formats assigned to each variable and the accompanying format catalogue. Not only will this macro change the way you clean and prepare your data, but it also gives you an effortless way to evaluate the quality of data you obtain from others before you start your analysis. Look how easy it is to create a codebook if you have a SAS data set with labels and formats:

title height=12pt 'Master Codebook for Study A Preliminary Data';
title2 height=10pt 'Simulated Data for Participants in a Health Study';
title3 height=10pt 'Data simulated to include anomalies illustrating the power of %TK_codebook';
 
libname library "/Data_Detective/Formats/Blog_1_Codebooks";
 
%TK_codebook(lib=work,
	file1=STUDYA_PRELIM,
	fmtlib=LIBRARY,
	cb_type=RTF,
	cb_file=/Data_Detective/Book/Blog/SAS_Programs/My_Codebook.rtf,
	var_order=INTERNAL,
	organization = One record per CASEID,
	include_warn=YES;
run;

Six steps create your codebook

After creating titles for your codebook, this simple program provides %TK_codebook with the following instructions:

  1. Create a codebook for SAS data set STUDYA_PRELIM located in the temporary Work library automatically defined by SAS
  2. Find the formats assigned to the STUDYA_PRELIM in a format catalogue located in the folder assigned to the libref LIBRARY
  3. Write your codebook in a file named /Data_Detective/Book/Blog/SAS_Programs/My_Codebook.rtf
  4. List variables in the codebook by their INTERNAL order (order stored in the data set)
  5. Add “One record per CASEID” indicating which variable(s) uniquely identify each observation to codebook header
  6. Include reports identifying potential problems in the data

Just these few lines of code will create the unbelievably useful codebook shown below.

The data set used has many problems that can interfere with analysis. %TK_codebook creates reports showing a concise summary of only those problem variables needing close examination. These reports save you an incredible amount of time.

Using assigned formats, %TK_codebook identifies unexpected values occurring in each variable and provides a summary in the first two reports.

Values occurring outside those defined by the assigned format indicate two possible problems:

  1. A value was omitted from the format definition (Report 1 – Incomplete formats)
  2. The variable has unexpected values needing mitigation before the data is analyzed (Report 2 – Out of Range Value)

The next report lists numeric variables that have no variation in their values.

These variables need examining to uncover problems with preparing the data set.

The next two reports warn you about variables missing an assigned user-defined format. These variables will be excluded from screening for out-of-range values and incomplete format definitions.

The last report informs you about variables that are missing a label or have a label that matches the variable name.

It is easy to use %TK_codebook to resolve problems in your data and create an awesome codebook. Instead of spending your time preparing your data, you will be using your data to change the world!

Create your codebook

Download %TK_codebook from my author page, then learn to use it from my new book, The Data Detective’s Toolkit: Cutting-Edge Techniques and SAS Macros to Clean, Prepare, and Manage Data.

THE DATA DETECTIVE'S TOOLKIT | BUY IT NOW

Creating codebooks with SAS macros was published on SAS Users.

7月 142020
 

In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables.  This program utilizes a macro that loops through a list of variables to make the necessary adjustments and creates an output data set.

One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. This rule is very straightforward and easy to understand. For any continuous variable, you can simply multiply the interquartile range by the number 1.5. You then add that number to the third quartile. Any values above that threshold are suspected as being an outlier. You can also perform the same calculation on the low end. You can subtract the value of IQR x 1.5 from the first quartile to find low-end outliers.

The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected as having outliers. You will need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You would then need to develop a program that identifies values above and below the 1.5 IQR rule thresholds and overwrite those values with new values at the threshold.

The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:

    1. Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
    2. Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
    3. Merge the newly restricted outlier variable with the BASE data set.
/*Step 1: Create BASE and OUTLIER data sets*/
 
%let outliers = /*list of variables*/;
 
DATA MYDATA.BASE;
    SET MYDATA.LOAN_ADJUST (DROP=&amp;outliers.);
    ROW_NUM = _N_;
RUN;
 
DATA outliers;
    SET MYDATA.LOAN_ADJUST (KEEP=&amp;outliers. ROW_NUM);
    ROW_NUM = _N_;
RUN;
 
 /*Step 2: Create loop and apply the 1.5 IQR rule*/
 
%MACRO loopit(mylist);
    %LET n = %SYSFUNC(countw(&amp;mylist));
 
    %DO I=1 %TO &amp;n;
        %LET val = %SCAN(&amp;mylist,&amp;I);
 
        PROC UNIVARIATE DATA = outliers ;
            VAR &amp;val.;
            OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr;
        run;
 
        data _NULL_;
           SET boxStats;
           CALL symput ('median',median);
           CALL symput ('iqr', iqr);
        run;
 
        %PUT &amp;median;
        %PUT &amp;iqr;
 
        DATA out_&amp;val.(KEEP=ROW_NUM &amp;val.);
        SET outliers;
 
       IF &amp;val. ge &amp;median + 1.5 * &amp;iqr THEN
           &amp;val. = &amp;median + 1.5 * &amp;iqr;
       RUN;
 
/*Step 3: Merge restricted value to BASE data set*/
 
       PROC SQL;
           CREATE TABLE MYDATA.BASE AS
               SELECT *
               FROM MYDATA.BASE AS a
               LEFT JOIN out_&amp;val. as b
                   on a.ROW_NUM = b.ROW_NUM;
       QUIT;
 
    %END;
%MEND;
 
%LET list = &amp;outliers;
%loopit(&amp;list);

Notes on the outlier adjustment program:

  • A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
  • Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
  • A macro program is developed to contain the process of looping through the list of variables.
  • A macro variable (n) is created that counts the number of variables contained in the macro variable.
  • A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
  • PROC UNIVARIATE identifies the variable’s median and interquartile range.
  • A macro variable is created to contain the values of the median and interquartile range.
  • A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
  • PROC SQL adds the adjusted variables to the BASE data set.

This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time. This is often the case; however, when you have a large number of outlier variables, it is often beneficial to create an algorithm to transform them efficiently and consistently

Adjusting outliers with the 1.5 IQR rule was published on SAS Users.

2月 052020
 

One of the first and most important steps in analyzing data, whether for descriptive or inferential statistical tasks, is to check for possible errors in your data. In my book, Cody's Data Cleaning Techniques Using SAS, Third Edition, I describe a macro called %Auto_Outliers. This macro allows you to search for possible data errors in one or more variables with a simple macro call.

Example Statistics

To demonstrate how useful and necessary it is to check your data before starting your analysis, take a look at the statistics on heart rate from a data set called Patients (in the Clean library) that contains an ID variable (Patno) and another variable representing heart rate (HR). This is one of the data sets I used in my book to demonstrate data cleaning techniques. Here is output from PROC MEANS:

The mean of 79 seems a bit high for normal adults, but the standard deviation is clearly too large. As you will see later in the example, there was one person with a heart rate of 90.0 but the value was entered as 900 by mistake (shown as the maximum value in the output). A severe outlier can have a strong effect on the mean but an even stronger effect on the standard deviation. If you recall, one step in computing a standard deviation is to subtract each value from the mean and square that difference. This causes an outlier to have a huge effect on the standard deviation.

Macro

Let's run the %Auto_Outliers macro on this data set to check for possible outliers (that may or may not be errors).

Here is the call:

%Auto_Outliers(Dsn=Clean.Patients,
               Id=Patno,
               Var_List=HR SBP DBP,
               Trim=.1,
               N_Sd=2.5)

This macro call is looking for possible errors in three variables (HR, SBP, and DBP); however, we will only look at HR for this example. Setting the value of Trim equal to .1 specifies that you want to remove the top and bottom 10% of the data values before computing the mean and standard deviation. The value of N_Sd (number of standard deviations) specifies that you want to list any heart rate beyond 2.5 trimmed standard deviations from the mean.

Result

Here is the result:

After checking every value, it turned out that every value except the one for patient 003 (HR = 56) was a data error. Let's see the mean and standard deviation after these data points are removed.

Notice the Mean is now 71.3 and the standard deviation is 11.5. You can see why it so important to check your data before performing any analysis.

You can download this macro and all the other macros in my data cleaning book by going to support.sas.com/cody. Scroll down to Cody's Data Cleaning Techniques Using SAS, and click on the link named "Example Code and Data." This will download a file containing all the programs, macros, and data files from the book.  By the way, you can do this with any of my books published by SAS Press, and it is FREE!

Let me know if you have questions in the comments section, and may your data always be clean! To learn more about SAS Press, check out up-and-coming titles, and to receive exclusive discounts make sure to subscribe to the newsletter.

Finding Possible Data Errors Using the %Auto_Outliers Macro was published on SAS Users.

4月 012019
 

dividing by zero with SAS

Whether you are a strong believer in the power of dividing by zero, agnostic, undecided, a supporter, denier or anything in between and beyond, this blog post will bring all to a common denominator.

History of injustice

For how many years have you been told that you cannot divide by zero, that dividing by zero is not possible, not allowed, prohibited? Let me guess: it’s your age minus 7 (± 2).

But have you ever been bothered by that unfair restriction? Think about it: all other numbers get to be divisors. All of them, including positive, negative, rational, even irrational and imaginary. Why such an injustice and inequality before the Law of Math?

We have our favorites like π, and prime members (I mean numbers), but zero is the bottom of the barrel, the lowest of the low, a pariah, an outcast, an untouchable when it comes to dividing by. It does not even have a sign in front of it. Well, it’s legal to have, but it’s meaningless.

And that’s not all. Besides not being allowed in a denominator, zeros are literally discriminated against beyond belief. How else could you characterize the fact that zeros are declared as pathological liars as their innocent value is equated to FALSE in logical expressions, while all other more privileged numbers represent TRUE, even the negative and irrational ones!

Extraordinary qualities of zeros

Despite their literal zero value, their informational value and qualities are not less than, and in many cases significantly surpass those of their siblings. In a sense, zero is a proverbial center of the universe, as all the other numbers dislocated around it as planets around the sun. It is not coincidental that zeros are denoted as circles, which makes them forerunners and likely ancestors of the glorified π.

Speaking of π, what is all the buzz around it? It’s irrational. It’s inferior to 0: it takes 2 π’s to just draw a single zero (remember O=2πR?). Besides, zeros are not just well rounded, they are perfectly rounded.

Privacy protection experts and GDPR enthusiasts love zeros. While other small numbers are required to be suppressed in published demographical reports, zeros may be shown prominently and proudly as they disclose no one’s personally identifiable information (PII).

No number rivals zero. Zeros are perfect numerators and equalizers. If you divide zero by any non-zero member of the digital community, the result will always be zero. Always, regardless of the status of that member. And yes, zeros are perfect common denominators, despite being prohibited from that role for centuries.

Zeros are the most digitally neutral and infinitely tolerant creatures. What other number has tolerated for so long such abuse and discrimination!

Enough is enough!

Dividing by zero opens new horizons

Can you imagine what new opportunities will arise if we break that centuries-old tradition and allow dividing by zero? What new horizons will open! What new breakthroughs and discoveries can be made!

With no more prejudice and prohibition of the division by zero, we can prove virtually anything we wish. For example, here is a short 5-step mathematical proof of “4=5”:

1)   4 – 4 = 10 – 10
2)   22 – 22 = 5·(2 – 2)
3)   (2 + 2)·(2 – 2) = 5·(2 – 2) /* here we divide both parts by (2 – 2), that is by 0 */
4)   (2 + 2) = 5
5)   4 = 5

Let’s make the next logical step. If dividing by zero can make any wish a reality, then producing a number of our choosing by dividing a given number by zero scientifically proves that division by zero is not only legitimate, but also feasible and practical.

As you will see below, division by zero is not that easy, but with the power of SAS, the power to know and the powers of curiosity, imagination and perseverance nothing is impossible.

Division by zero - SAS implementation

Consider the following use case. Say you think of a “secret” number, write it on a piece of paper and put in a “secret” box. Now, you take any number and divide it by zero. If the produced result – the quotient – is equal to your secret number, wouldn’t it effectively demonstrate the practicality and magic power of dividing by zero?

Here is how you can do it in SAS. A relatively “simple”, yet powerful SAS macro %DIV_BY_0 takes a single number as a numerator parameter, divides it by zero and returns the result equal to the one that is “hidden” in your “secret” box. It is the ultimate, pure artificial intelligence, beyond your wildest imagination.

All you need to do is to run this code:

 
data MY_SECRET_BOX;        /* you can use any dataset name here */
   MY_SECRET_NUMBER = 777; /* you can use any variable name here and assign any number to it */
run;
 
%macro DIV_BY_0(numerator);
 
   %if %sysevalf(&numerator=0) %then %do; %put 0:0=1; %return; %end;
   %else %let putn=&sysmacroname; 
   %let %sysfunc(putn(%substr(&putn,%length(&putn)),words.))=
   %sysevalf((&numerator/%sysfunc(constant(pi)))**&sysrc);  
   %let a=com; %let null=; %let nu11=%length(null); 
   %let com=*= This is going to be an awesome blast! ;
   %let %substr(&a,&zero,&zero)=*Close your eyes and open your mind, then;
   %let imagine = "large number like 71698486658278467069846772 Bytes divided by 0";
   %let O=%scan(%quote(&c),&zero+&nu11); 
   %let l=%scan(%quote(&c),&zero);
   %let _=%substr(%scan(&imagine,&zero+&nu11),&zero,&nu11);
   %let %substr(&a,&zero,&zero)%scan(&&&a,&nu11+&nu11-&zero)=%scan(&&&a,-&zero,!b)_;
   %do i=&zero %to %length(%scan(&imagine,&nu11)) %by &zero+&zero;
   %let null=&null%sysfunc(&_(%substr(%scan(&imagine,&nu11),&i,&zero+&zero))); %end;
   %if &zero %then %let _0=%scan(&null,&zero+&zero); %else;
   %if &nu11 %then %let _O=%scan(&null,&zero);
   %if %qsysfunc(&O(_&can)) %then %if %sysfunc(&_0(&zero)) %then %put; %else %put;
   %put &numerator:0=%sysfunc(&_O(&zero,&zero));
   %if %sysfunc(&l(&zero)) %then;
 
%mend DIV_BY_0;
 
%DIV_BY_0(55); /* parameter may be of any numeric value */

When you run this code, it will produce in the SAS LOG your secret number:

55:0=777

How is that possible without the magic of dividing by zero? Note that the %DIV_BY_0 macro has no knowledge of your dataset name, nor the variable name holding your secret number value to say nothing about your secret number itself.

That essentially proves that dividing by zero can practically solve any imaginary problem and make any wish or dream come true. Don’t you agree?

There is one limitation though. We had to make this sacrifice for the sake of numeric social justice. If you invoke the macro with the parameter of 0 value, it will return 0:0=1 – not your secret number - to make it consistent with the rest of non-zero numbers (no more exceptions!): “any number, except zero, divided by itself is 1”.

Challenge

Can you crack this code and explain how it does it? I encourage you to check it out and make sure it works as intended. Please share your thoughts and emotions in the Comments section below.

Disclosure

This SAS code contains no cookies, no artificial sweeteners, no saturated fats, no psychotropic drugs, no illicit substances or other ingredients detrimental to your health and integrity, and no political or religious statements. It does not collect, distribute or sell your personal data, in full compliance with FERPA, HIPPA, GDPR and other privacy laws and regulations. It is provided “as is” without warranty and is free to use on any legal SAS installation. The whole purpose of this blog post and the accompanied SAS programming implementation is to entertain you while highlighting the power of SAS and human intelligence, and to fool around in the spirit of the date of this publication.

Dividing by zero with SAS was published on SAS Users.

4月 212018
 

Have you ever been working in the macro facility and needed a macro function, but you could not locate one that would achieve your task? With the %SYSFUNC macro function, you can access most SAS® functions. In this blog post, I demonstrate how %SYSFUNC can help in your programming needs when a macro function might not exist. I also illustrate the formatting feature that is built in to %SYSFUNC. %SYSFUNC also has a counterpart called %QSYSFUNC that masks the returned value, in case special characters are returned.
%SYSFUNC enables the execution of SAS functions and user-written functions, such as those created with the FCMP procedure. Within the DATA step, arguments to the functions require quotation marks, but because %SYSFUNC is a macro function, you do not enclose the arguments in quotation marks. The examples here demonstrate this.

%SYSFUNC has two possible arguments. The first argument is the SAS function, and the second argument (which is optional) is the format to be applied to the value returned from the function. Suppose you had a report and within the title you wanted to issue today’s date in word format:

   title "Today is %sysfunc(today(),worddate20.)";

The title appears like this:

   "Today is               July 4, 2018"

Because the date is right-justified, there are leading blanks before the date. In this case, you need to introduce another function to remove the blank spaces. Luckily %SYSFUNC enables the nesting of functions, but each function that you use must have its own associated %SYSFUNC. You can rewrite the above example by adding the STRIP function to remove any leading or trailing blanks in the value:

   title "Today is %sysfunc(strip(%sysfunc(today(),worddate20.)))";

The title now appears like this:

    "Today is July 4, 2018"

The important thing to notice is the use of two separate functions. Each function is contained within its own %SYSFUNC.

Suppose you had a macro variable that contained blank spaces and you wanted to remove them. There is no macro COMPRESS function that removes all blanks. However, with %SYSFUNC, you have access to one. Here is an example:

   %let list=a    b    c; 
   %put %sysfunc(compress(&list));

The value that is written to the log is as follows:

   abc

In this last example, I use %SYSFUNC to work with SAS functions where macro functions do not exist.

The example checks to see whether an external file is empty. It uses the following SAS functions: FILEEXIST, FILENAME, FOPEN, FREAD, FGET, and FCLOSE. There are other ways to accomplish this task, but this example illustrates the use of SAS functions within %SYSFUNC.

   %macro test(outf);
   %let filrf=myfile;
 
   /* The FILEEXIST function returns a 1 if the file exists; else, a 0
   is returned. The macro variable &OUTF resolves to the filename
   that is passed into the macro. This function is used to determine
   whether the file exists. In this case you want to find the file
   that is contained within &OUTF. Notice that there are no quotation
   marks around the argument, as you will see in all cases below. If
   the condition is false, the %ELSE portion is executed, and a
   message is written to the log stating that the file does not
   exist.*/
 
   %if %sysfunc(fileexist(&outf)) %then %do;
 
   /* The FILENAME function returns 0 if the operation was successful; 
   else, a nonzero is returned. This function can assign a fileref
   for the external file that is located in the &OUTF macro 
   variable. */
 
   %let rc=%sysfunc(filename(filrf,&outf));
 
   /* The FOPEN function returns 0 if the file could not be opened; 
   else, a nonzero is returned. This function is used to open the
   external file that is associated with the fileref from &FILRF. */
 
   %let fid=%sysfunc(fopen(&filrf));
 
   /* The %IF macro checks to see whether &FID has a value greater
   than zero, which means that the file opened successfully. If the
   condition is true, we begin to read the data in the file. */
 
   %if &fid > 0 %then %do;
 
   /* The FREAD function returns 0 if the read was successful; else, a
   nonzero is returned. This function is used to read a record from
   the file that is contained within &FID. */
 
   %let rc=%sysfunc(fread(&fid));
 
   /* The FGET function returns a 0 if the operation was successful. A
   returned value of -1 is issued if there are no more records
   available. This function is used to copy data from the file data 
   buffer and place it into the macro variable, specified as the
   second argument in the function. In this case, the macro variable
   is MYSTRING. */   
 
   %let rc=%sysfunc(fget(&fid,mystring));
 
   /* If the read was successful, the log will write out the value
   that is contained within &MYSTRING. If nothing is returned, the
   %ELSE portion is executed. */
 
   %if &rc = 0 %then %put &mystring;
   %else %put file is empty;
 
   /* The FCLOSE function returns a 0 if the operation was successful;
   else, a nonzero value is returned. This function is used to close
   the file that was referenced in the FOPEN function. */
 
   %let rc=%sysfunc(fclose(&fid));
   %end;
 
   /* The FILENAME function is used here to deassign the fileref 
   FILRF. */
 
   %let rc=%sysfunc(filename(filrf));
   %end;
   %else %put file does not exist;
   %mend test;
   %test(c:\testfile.txt)

There are times when the value that is returned from the function used with %SYSFUNC contains special characters. Those characters then need to be masked. This can be done easily by using %SYSFUNC’s counterpart, %QSYSFUNC. Suppose we run the following example:

   %macro test(dte);
   %put &dte;
   %mend test;
 
   %test(%sysfunc(today(), worddate20.))

The above code would generate an error in the log, similar to the following:

   1  %macro test(dte);
   2  %put &dte;
   3  %mend test;
   4
   5  %test(%sysfunc(today(), worddate20.))
   MLOGIC(TEST):  Beginning execution.
   MLOGIC(TEST):  Parameter DTE has value July 20
   ERROR: More positional parameters found than defined.
   MLOGIC(TEST):  Ending execution.

The WORDDATE format would return the value like this: July 20, 2017. The comma, to a parameter list, represents a delimiter, so this macro call is pushing two positional parameters. However, the definition contains only one positional parameter. Therefore, an error is generated. To correct this problem, you can rewrite the macro invocation in the following way:

   %test(%qsysfunc(today(), worddate20.))

The %QSYSFUNC macro function masks the comma in the returned value so that it is seen as text rather than as a delimiter.

For a list of the functions that are not available with %SYSFUNC, see the “How to expand the number of available SAS functions within the macro language was published on SAS Users.

10月 172017
 

Would you like to format your macro variables in SAS? Good news. It's easy!  Just use the %FORMAT function, like this: %let x=1111; Log %put %format(&amp;x,dollar11.); $1,111 %put %format(&amp;x,roman.); MCXI %put %format(&amp;x,worddate.); January 16, 1963   %let today=%sysfunc(today()); %put %format(&amp;today,worddate.); October 13, 2017   %put %format(Macro,$3.); Mac What?!  You never [...]

The post How to format a macro variable appeared first on SAS Learning Post.