Yinliang Wu

June 8, 2022
 

When generating animation files, the number of frames shown per second is called the frame rate (FPS - Frames per second). Generally, the higher the frame rate, the smoother the animation. Human eyes can process 10 to 12 frames per second and perceive them individually, while higher frame rates are perceived as motion due to "persistence of vision." Modern movies are usually shot and presented at 24 FPS, and there are currently five commonly used frame rates in the film and television industry: 24 FPS, 25 FPS, 30 FPS, 50 FPS and 60 FPS for HDTV.

Animations can present richer time-varying analysis results than static images. They can consider and compare analysis results from different angles to present more details. Animations generated in SAS can be in GIF or SVG format, and various SAS/GRAPH and SG procedures support generating animated GIF files. This short post discusses some timing control issues the author found while generating animations for rigorous physics simulations with SAS. Preserving time accuracy is important in physics simulation and other time-sensitive animation generation.
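For reference, a minimal skeleton for generating an animated GIF with an SG procedure looks roughly like this (a sketch only; it assumes a data set WORK.FRAMES in which a BY variable FRAMEID drives one plot per frame, and the ANIMDURATION value is discussed in detail below):

/* Sketch: animated GIF skeleton (assumes WORK.FRAMES with variables frameid, x and y) */
options printerpath=gif animation=start
        animduration=0.04      /* seconds per frame; see the discussion below */
        animloop=yes noanimoverlay;
ods printer file='simulation.gif';

proc sgplot data=work.frames;
  by frameid;                  /* one GIF frame is emitted per BY group */
  scatter x=x y=y;
run;

options animation=stop;
ods printer close;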

GIF format review

The Graphics Interchange Format (GIF) is a bitmap image format with animation features created by CompuServe in 1987, and GIF89a is version 89a (July 1989) of the format. Its advantages include smaller file sizes and wide support across web browsers and platforms, but the smallest unit of frame duration in the GIF89a file format is hundredths of a second (0.01 seconds) instead of milliseconds (0.001 seconds). Many GIF tools allow specifying the frame duration in milliseconds even though it will be rounded to hundredths of a second anyway, which unintentionally masks this limitation of the file format. On the other hand, some GIF-playing software will automatically modify a time interval that is too small. For example, Microsoft's IE rounds up time intervals under 50ms to 100ms. This creates certain challenges for producing time-accurate animations.

Research in 2014[1] indicated that GIF89a does not produce substantial delays when an animation interval is larger than 100ms, regardless of the web browser. Below this threshold, the resulting performance falls considerably. IE9 reinterprets any delay below 100ms as 100ms, so a 50ms interval shows a 50ms delay, while a 16.67ms interval shows around 85ms delay. Firefox 10 and Chrome 17 can execute a GIF89a animation correctly (except for an 85ms delay on 16.67ms intervals), and their mean number of missed frames is smaller and stabilizes over time. The refresh interval is 16.67 milliseconds for a screen with a refresh rate of 60Hz.

Due to the need to control the size of the generated file, 8 FPS (125ms) is enough for most GIF files. When there are many dynamics at play, we can use 15 FPS (66.67ms). The file size of an uncompressed 24 FPS (41.67ms) GIF will be larger than a standard MP4 video file, so for higher frame rates it would be better to produce the animation as an MP4 instead of a GIF.

What is the proper FPS?

When generating animations in SAS, you can specify the duration of a frame through the SAS system option ANIMDURATION. It can accept a value up to 10 characters in length, that is, up to 8 digits after the decimal point. This commonly leads programmers to think, incorrectly, that the frame duration can be specified to the millisecond. Actual testing shows that the frame duration specified via the ANIMDURATION option as 1/FPS and the actual frame duration in the generated animation can be quite different. That is, if the programmer wants to generate a relatively time-accurate animation directly from the simulation, they should pay attention when selecting a specific FPS and use the following relationship to derive the target DURATION value:

/*Calculate the accurate ANIMDURATION from/to FPS in SAS*/
 
proc fcmp inlib=work.proto outlib=work.fcmp.animation;
  function fps_animduration(fps);
    dur=1/fps;
    return( floor( dur * 100)/100 ); *Keep the first 2 digits after decimal point instead of round(dur, 0.01);
  endsub;
 
  function animduration_fps(dur);
    return( floor(1/dur) ); *Truncate to the integer part;
  endsub;
run;
 
option cmplib=(work.fcmp);
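Once compiled, the two functions can be used like any other SAS function; for example, a quick round-trip check (results written to the log):

data _null_;
  dur = fps_animduration(24);   /* 1/24 = 0.041666..., truncated to 0.04 */
  fps = animduration_fps(dur);  /* floor(1/0.04) = 25, so 24 FPS actually plays back at 25 FPS */
  put dur= fps=;
run;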

Accordingly, the following conversion can be used in SAS Macro:

%let DUR=%sysevalf( %sysfunc(floor((1/&FPS)*100 ))/100 );

In a time-accurate physics simulation, the duration or period is the absolute time, so the number of frames that need to be generated is the duration divided by the frame duration. Since the minimum unit of GIF frame duration mentioned above is one-hundredth of a second, the actual frame rate in a generated GIF file is larger than expected, which means that the actual number of frames generated is less than expected for a specified simulation duration. Experiments show that when the FPS is less than or equal to 100, if no error in the number of generated frames is allowed, there are only 19 frame rates that can be used: 1-12, 14, 16, 20, 25, 33, 50 and 100 FPS. If a 5% error in the generated frames or time is allowed, there are an additional 8 frame rates to choose from: 24, 32, 48, 49, 96, 97, 98 and 99 FPS. Based on the background knowledge mentioned earlier, combined with actual needs, GIF file size limitations and playback smoothness requirements, it is generally best to choose a frame rate between 8 and 25 to produce time-accurate animations. 8, 10, 16, 20 and 25 FPS are all good choices, and 24 FPS can be chosen if a frame error rate of up to 5% is allowed. The following table lists the error between the actual number of frames generated and the expected number of frames for a 1s duration under different FPS settings:

Figure 1: Frame loss rate (<=5%) and FPS in GIF generation
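One way to reproduce these error figures is a short DATA step that compares each specified FPS with the playback rate implied by the truncated frame duration (a sketch; the exact counting in the figures may differ slightly):

/* Sketch: relative frame/time error for FPS = 1 to 100 caused by the        */
/* hundredth-of-a-second granularity of the GIF frame duration.              */
data fps_error;
  do fps = 1 to 100;
    dur        = floor((1/fps)*100)/100;      /* duration actually stored in the GIF */
    actual_fps = floor(1/dur);                /* playback frame rate implied by dur  */
    error_rate = abs(actual_fps - fps) / fps; /* relative frame/time error           */
    output;
  end;
  format error_rate percent8.2;
run;

proc print data=fps_error noobs;
  where error_rate <= 0.05;   /* the 19 exact plus 8 near-exact frame rates discussed above */
run;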

The relationship between the actual loss rate of generated frames and the specified FPS, shown in the figure below, reveals which frame rate we should specify when generating time-accurate animations:

Figure 2: Frame loss rate and FPS specified in GIF generation

The figure below shows a single-cycle animation generated by solving the figure-8 solution of the three-body problem in SAS. The physical time of a single cycle is about 6.3259 seconds, and the actual animation time is also 6.32 seconds with a time scale factor of 1. The duration is 0.04s per frame, for a total of 158 frames. In fact, the problem discussed in this article was discovered when trying to solve the three-body problem with SAS. Of course, in addition to the frame rate selection above, applying compression to SAS-generated GIFs is recommended to reduce the file size. Several experiments have shown that third-party tools such as GIF Optimizer can reduce a GIF file's size by 90% without a significant reduction in graphics quality.

Figure 3: An animation with a galaxy background for the figure-8 solution of the three-body problem in SAS

Conclusion

When generating GIF animations in SAS that need to be time-accurate, such as physics simulations, pay attention to selecting the ANIMDURATION value corresponding to an appropriate frame rate. When programming a GIF animation to display analysis results, do not choose an arbitrary frame rate, or else there will be a large loss of generated frames or a time error. 8, 10, 16, 20 and 25 FPS can be chosen when no generated frame loss is allowed, and 24 FPS can be selected when a generated frame loss rate below 5% is allowed. The limitations discussed here are inherent to the GIF file specification, not due to SAS or third-party tools.

Learn more

Considerations for generating time-accurate animation with SAS was published on SAS Users.

April 6, 2022
 

Background

Among the many logic puzzles in the world, the Zebra Puzzle is one of the most famous. It was first published on December 17, 1962 by Life Magazine, an old weekly magazine owned by Time Warner, and it was claimed that only 0.1% of people in the world could solve it. The original puzzle is as follows:

Zebra Puzzle: Who drinks the water? Who owns the zebra?

  1. There are five houses.
  2. The Englishman lives in the red house.
  3. The Spaniard owns the dog.
  4. Coffee is drunk in the green house.
  5. The Ukrainian drinks tea.
  6. The green house is immediately to the right of the ivory house.
  7. The Old Gold smoker owns snails.
  8. Kools are smoked in the yellow house.
  9. Milk is drunk in the middle house.
  10. The Norwegian lives in the first house.
  11. The man who smokes Chesterfields lives in the house next to the man with the fox.
  12. Kools are smoked in the house next to the house where the horse is kept.
  13. The Lucky Strike smoker drinks orange juice.
  14. The Japanese smokes Parliaments.
  15. The Norwegian lives next to the blue house.

Life Magazine did not publish the puzzle's solution. When it was released, the Zebra puzzle gained attention from logic puzzle fans all over the world who would then challenge their friends and it spread like wildfire. On March 25, 1963, the names of hundreds of people who successfully solved the puzzle were published in that month's edition of Life Magazine.

The Zebra puzzle is now often referred to as Einstein's puzzle, and some say it was invented by Einstein as a boy. In actuality, there is no real evidence that Einstein invented the puzzle; the cigarette brand Kools mentioned in the original puzzle did not exist during Einstein's boyhood. The version of Einstein's puzzle that is widely circulated today is a variant of the above Zebra puzzle. Some claim that Einstein once asserted that 98% of the world's population would not be able to solve it. The puzzle is as follows:

Einstein’s Puzzle: Who keeps the fish?

  1. The British man lives in a red house.
  2. The Swedish man keeps dogs as pets.
  3. The Danish man drinks tea.
  4. The Green house is next to, and on the left of the White house.
  5. The owner of the green house drinks coffee.
  6. The person who smokes Pall Mall rears birds.
  7. The owner of the yellow house smokes Dunhill.
  8. The man living in the center house drinks milk.
  9. The Norwegian lives in the first house.
  10. The man who smokes Blends lives next to the one who keeps cats.
  11. The man who keeps horses lives next to the man who smokes Dunhill.
  12. The man who smokes Blue Master drinks beer.
  13. The German smokes Prince.
  14. The Norwegian lives next to the blue house.
  15. The Blends smoker lives next to the one who drinks water.

Both puzzles above are, in essence, instances of a classic logic problem. Such problems have important experimental significance in mathematics, logic, and computer science because they can reveal to some extent what intelligence is, how logical reasoning works and how to empower machines to create artificial intelligence.

Problem Analysis

Using computers to solve logic puzzles has always been a common topic, and one brute-force method is to enumerate every possible situation before deciding. For both the Zebra puzzle and Einstein's puzzle above, there are 5 dimensions (Nationality, Color, Pets, Drinks, Cigarettes) that need to be connected to the appropriate house. For any one dimension, the possible arrangements are 5 × 4 × 3 × 2 × 1 = 5!, so there are (5!)^5, or 24,883,200,000 (about 25 billion), possibilities in total for the five dimensions (see Table 1). On a capable computer, a well-written program can complete the traversal and decision within minutes. This enumerate-and-check method is essentially a brute-force exhaustive search in which there is no intelligence at all.

Location Nationality Color Pets Drinks Cigarettes
1 C(5, 1)=5 5 5 5 5
2 C(4, 1)=4 4 4 4 4
3 C(3, 1)=3 3 3 3 3
4 C(2, 1)=2 2 2 2 2
5 C(1, 1)=1 1 1 1 1
Possibilities 5! 5! 5! 5! 5!

Table 1: Searching space
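To get a sense of the search space, the sketch below enumerates the 5! orderings of a single dimension with CALL ALLPERM; a full brute-force solver would nest such enumerations for all five dimensions ((5!)^5 cases) and keep only the assignments that satisfy every clue (a sketch only, using the Zebra puzzle's nationality dimension):

/* Sketch: enumerate the 120 orderings of one dimension (nationality by house position). */
data nationality_perms;
  array nat[5] $12 n1-n5 ('Englishman' 'Spaniard' 'Ukrainian' 'Norwegian' 'Japanese');
  do i = 1 to fact(5);
    call allperm(i, of nat[*]);   /* produce the i-th permutation in minimal-change order */
    output;
  end;
  drop i;
run;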

We can also think of such puzzles as Constraint Satisfaction Problems (CSPs). A CSP is a special type of problem that involves a set of variables and a set of constraints between them; the goal is to find one (or more) assignments such that all constraints are satisfied. A constraint satisfaction problem with a finite definition can be solved with a traditional backtracking search in which variable selection is optimized by strategies such as minimum remaining values, most constrained variable and fail-first. This method is obviously more rational and closer to the way people think.

In addition to the brute force method and the CSP solution method above, is there any other solution method for this logic puzzle? Is it possible to build a general automatic reasoning system for it? This needs to be viewed from the perspective of human thinking.

The brain structure and intelligence level of different people often reveal huge differences when solving problems. A relatively clear difference is the ability of logical reasoning. Logical reasoning is essentially a form of thinking that logically deduces some new knowledge (conclusion) based on some known knowledge (premise). The process of people's cognition of the objective world is essentially the process of observing, identifying, naming, measuring various objects in the real world and connecting these basic cognitive components to build a cognitive building.

The reasoning process from individual to general, or general to individual, can be summarized as deduction and induction. We generally start from various concrete individuals and continue to abstract and summarize to form "abstract" concepts, then summarize them into general principles and laws. On the other hand, based on the existing knowledge system and the known rules and laws, we will carry out the knowledge application process for the newly emerged concrete individuals in turn. Just like a credit scoring mechanism, this is nothing more than an artificial evaluation model or system based on collected financial data, personal data, etc. of various users, that then evaluates new credit card applicants to determine whether or not to grant, what the credit limit is, what the risk is, etc.

The difference between machines and humans is very significant in several aspects. The memory capacity and computing power of computers can be expanded infinitely, especially cloud computing, which has raised this ability to a new level. No matter how good the human memory is, it is not comparable to those well-organized and indexed knowledge base systems that exist on hard disks or in the cloud. No matter how good the human mental arithmetic ability is, it cannot deduce the recursive processing and stack-type repetitive processing of the computer. No matter how sharp the human eye is, it cannot be compared with the bit-level discrimination ability of the color components of a pixel. However, our human brain is capable of innovation, intuition, perception, high abstraction, and derived wisdom, which is the advantage of our carbon-based life over silicon-based machines that need electricity to operate.

Human Perspective

Let's analyze how a "human" with strong logical reasoning ability solves the Zebra puzzle. From all the puzzle's statements, we can perceive various entities, that is, images constructed by identifying and naming real-world entities, including the Englishman, the red house, the Spaniard, dogs, etc. Similarly, we deconstruct the relationships between entities from verbs or prepositions such as ... lives ..., ... owns ..., ... drinks ..., ... to the right of ..., etc.; such a relationship is, in essence, also an entity.

Secondly, according to the relationships between entities in the real world, people establish corresponding connections in the brain and abstract higher-level category concepts, such as nationality, color, pets, drinks, cigarettes and so on, according to the type of connection. Among them, you can also see that concepts like "nationality" are almost purely virtual. These virtual entities are developed based on the needs of human beings to organize society; there is almost no physical entity except an ID number to prove them. Also, you can't say someone is stateless just because a passport has not been issued or has been lost!

Therefore, according to the order of the puzzle statements, we can summarize 5 dimensions (corresponding variables) according to the entity type, the nationality of the homeowner, the color of the house, pets, drinks, and cigarettes. From this, we can list 5 explicit dimensions and their possible values (in order of appearance in the puzzle). Undoubtedly, these variables are nominal:

  • Nationality: British, Spaniard, Ukrainian, Norwegian, Japanese
  • House: red, green, ivory, yellow, blue
  • Pets: dog, snails, fox, horse
  • Drinks: coffee, tea, milk, orange juice
  • Cigarettes: Old Gold, Kools, Chesterfields, Lucky Strike, Parliaments

Wait, there's another dimension hidden in the puzzle: the location of the house. This is an entity relationship identified based on the words "to the right of," "in the middle," "next to," “in the first” and other concepts in the statement set, so we should establish a concept in our mind to express the order relationship between entities. The most straightforward way is to number them sequentially: 1, 2, 3, 4, 5. Thus, position variables are ordinal variables that not only contain differences but also describe differences in their order or rank.

At this point in the puzzle, the solution to the problem "Who drinks water and who owns zebras?" becomes searching for the mapping relationship of entities in the above six dimensions. The essence of knowledge is connections, so the intelligence based on this model is the process of establishing principles and criteria and applying them to solve, predict and change the future.

Everyone uses different strategies when dealing with problems and even the same person will use different strategies in the derivation process. The 6-dimensional variable is enough to confuse most people's minds - "how to choose the next statement to get a solution as quickly as possible" is the key to solving the problem! Like the minimum remaining strategy for constraint satisfaction problems, we can usually use the deterministic highest strategy to reason forward and use elimination strategies to reduce the solution space.

When people reason, they often begin with a certain statement (knowledge), then apply knowledge to uncertain statements, constantly accepting and excluding uncertain statements. Uncertain means that there are multiple possible distributions for the same statement. If there are multiple possible branches, hypothetical reasoning will be used to select one of the limited possible paths to try until the result satisfies the maximum number of accepted statements. If there are too many branches of uncertainty, the process degenerates into a search process.

For a logic puzzle, we also need to consider whether its final solution is unique and how many variants of methods (paths) there are to solve it. That is to say, the former can be obtained by the fastest method, while the latter is more complicated and needs to be traversed on the equivalent branch to obtain all possible paths.

Implementation perspective

Based on the above thinking logic, we can use SAS to design a model-oriented automatic reasoning system. It takes tagged text as the input and outputs the answer text for a query. So-called general automatic reasoning means that it must be abstract enough and completely centered on an abstract model, which means that it should be able to solve arbitrary logic puzzles without changing any code. The solver itself should not contain any domain-related formulas or constraints. The difference between it and a constraint satisfaction problem solver is that the constraints are captured in the model generation stage rather than specified in the solving stage. The overall design (Figure 1) is as follows; you can currently click here to download the precompiled SAS code for preview, but it can only run in English, Unicode or Simplified Chinese SAS 9.4M7+ (WIN) or SAS Viya (LAX) environments.

Figure 1: Overall design

Due to the constraints being parsed and implied in the generated model from statements, an automatic reasoning system can be a pure model structure-based solver which does not contain any predefined domain-specific knowledge. The solution generated by the reasoning system is just a subset of the model that can be further processed or presented. The boundaries between humans and machines are limited to model generation and question answering. Thanks to this, a statement set can be any set of finite constraints, whether it is a zebra puzzle, a simpler Einstein’s puzzle or a more complicated blood donation puzzle that can be solved by the same reasoning system. They differ only in the statement set and the questions. The practice has proved that according to the above principles of human thinking, all zebra-type puzzles can be resolved by the same system as they are essentially the same.

Since machines cannot automatically identify different dimensions or variables from entities and relationships like humans, we need to provide a minimal set of tags in the statement set, which can be in the form of a parsimonious [N:Englishman] keyword tag or form such as <Entity Type="N">Englishman</Entity> XML markup. The following table (Table 2) lists the statement set of the Zebra puzzle which consists of 14 statements. This labeled statement set is the only input of the reasoning system.

Table 2: Tagged Statements as input
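For illustration, the first few statements of such a tagged set might look like the sketch below (hypothetical: the tag letters for house color and the exact data set layout expected by %SolvePuzzle are assumptions; only the [N:], [D:], [P:] and [C:] tags appear in the article's examples):

/* Hypothetical sketch of a tagged statement set; Zebra_en matches the macro call below. */
data Zebra_en;
  length statement $200;
  input statement $char200.;
  datalines;
The [N:Englishman] lives in the [H:Red] house.
The [N:Spaniard] owns the [P:Dog].
[D:Coffee] is drunk in the [H:Green] house.
The [N:Ukrainian] drinks [D:Tea].
The [C:OldGold] smoker owns [P:Snails].
;
run;

The puzzle is then solved and queried by passing this statement set and the questions to the %SolvePuzzle macro: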

%SolvePuzzle(input=Zebra_en, question=%str([N:Who] drinks [D:Water],[N:Who] owns [P:Zebra]));

 

The above statement set is parsed and a model is generated. The type of the annotation, i.e. the variable type is a key factor. In the Zebra puzzle, all variables are nominal except the Location, which is an ordinal variable.

To solve a single solution, the termination condition is that all statements are accepted (satisfied). As the reasoning system proceeds, it must maintain two queues: accepted knowledge and rejected knowledge. We know that when we accept new knowledge, we may accept more compatible knowledge. We may also reject some contradictory knowledge; when we reject a piece of knowledge, if it is propagatable in logic, we will also reject more associated knowledge. This is an active elimination strategy. The method of solving all solutions shares this reasoning logic with solving a single solution, but the outer structure is different.

Given the special role of the Location variable in the Zebra puzzle, determining the mapping of the other categorical variables onto the Location variable is a key point. The various zebra puzzle variants all have such an ordinal variable; when we accept a statement, determining the value of the Location variable is the most critical step, so the reasoning system should expose it. The following table (Table 3) shows the reasoning steps of the system for the Zebra puzzle, where L is the location of the house:


Table 3: Reasoning Steps as output

The result of the reasoning steps can be presented in a more intuitive matrix as follows (Table 4):

Table 4: Reasoning Result

The system can also print the detailed solution process in the form of a matrix according to the reasoning steps (Table 5):

Table 5: Reasoning Result step by step

For the solution results, we can ask any reasonable question in the form of text and let the system return an intuitive answer. Note that the question must be in the tagged form above with question-condition pairs, that is, the sequence of forms in which the question comes first, and the condition comes after. For example:

 %QueryAnswer(question=%str([N:Who] drink [D:Water][N:Who] owns [P:Zebra]));

Table 6: Reasoning Answer for question

In fact, we can ask the system to answer all kinds of valid questions, whether they are known or unknown conditions in the statement, such as:

%QueryAnswer(question=%str([P:What] was kept by [N:Spaniard][N:Who] smokes [C:Chesterfields].));

Table 7: Reasoning Answer for question

As a puzzle terminator, we try to let the automatic reasoning system solve the so-called very hard zebra puzzle: the Blood Donation puzzle. The complete text description of the puzzle is as follows:

Blood Donation Puzzle

There is a hospital with a blood donation section with five units next to each other. One day after five women donated blood simultaneously, the secretary forgot to make a record of the event and the managers hired you to conduct research and hopefully find out each donor's blood type correctly. After your investigation, you observed that each donor has a different shirt color, name, blood type, age, weight and job. Then you talked with the laboratory workers and they told you they have all the information about the donors but it's completely mixed up. They also give you a report about what they remember from the instance of donation.

  • Shirt Color: black, blue, green, purple, red
  • Names: Andrea, Brooke, Kathleen, Meghan, Nichole.
  • Blood Types: A+, AB+, B+, B-, O-.
  • Ages: 25, 30, 35, 40, 45 years.
  • Weights: 120, 130, 140, 150, 160 lbs.
  • Jobs: Actress, Chef, Engineer, Florist, Policewoman

The chief security officer also provided a report for you according to what he remembers.

  1. The A+ donor is next to the B+ donor.
  2. Brooke is at one of the ends.
  3. The woman wearing a Black shirt is somewhere to the left of the 150 lb woman.
  4. The Actress is next to the Chef.
  5. Kathleen is 40 years old.
  6. The Florist is somewhere to the right of the woman wearing the purple shirt.
  7. The oldest donor weighs 130 lb.
  8. Brooke is next to Nichole.
  9. The 35-year-old woman is exactly to the left of the 30-year-old woman.
  10. The 120 lb donor is somewhere between the O- donor and the 150 lb donor, in that order.
  11. Kathleen is at one of the ends.
  12. The woman wearing the purple shirt is somewhere to the right of the woman wearing the green shirt.
  13. The B+ donor weighs 140 lb.
  14. The youngest woman is next to the 30-year-old woman.
  15. The woman considered AB+ universal recipient is exactly to the left of the A+ donor.
  16. Meghan is somewhere to the right of the woman wearing the purple shirt.
  17. The woman wearing the green shirt is somewhere between the Actress and the woman wearing the red shirt, in that order.
  18. At one of the ends is the 130 lb woman.
  19. The O- universal donor is 35 years old.
  20. The Florist is somewhere between the Actress and the Engineer, in that order.
  21. The woman wearing the blue shirt is somewhere to the left of the woman wearing the red shirt.
  22. The AB+ donor is next to the youngest woman.

According to all this information, the managers expect you to find out all the details about the blood donors. In fact, the solution to the blood donation puzzle can be formulated as four more concise questions: Whose blood type is B-? What color shirt is worn by the person weighing 160 lb? How old is Andrea? Which unit is the Policewoman in?

To make the system solve the blood donation puzzle, we only need to create the following set of labeled statements as the input to the automatic reasoning system (Table 8):

Table 8: Tagged statement as input

The system solution results are as follows (Table 9). Please note the Youngest and Oldest values in column A, which map to the ages 25 and 45. If you want 25 and 45 to appear in the matrix, substitute these numbers for youngest/oldest in the input statements.

Table 9: Reasoning Result

Further tests indicate that the automatic reasoning system can solve various similar logic puzzles and the difference in input information is in the statement set. Further analysis and comparison showed that the Zebra puzzle, Einstein’s puzzle, and the Blood Donation puzzle all have unique solutions, but there can be multiple solution order variants. E.g., the Zebra puzzle has 144 variants while the Blood Donation puzzle has 972 variants.

Table 10: Multiple Reasoning Variants

Einstein’s puzzle is less difficult than the Zebra puzzle. The Blood Donation puzzle has more dimensions and requires more questions to be answered, and the average time required to solve all variants becomes longer due to the expansion of the search space. Another interesting finding is that although there are many questions to be answered in the Blood Donation puzzle, the reasoning steps to find the first solution will not lead to any dead ends, indicating that the puzzle was purposely designed in the most natural way that humans can solve. The table below lists the brief analysis data.

Puzzle Name | Statements | Questions | Model Size x Dimensions | First Solution Time (s) | Variants | Average Solution Time (s)
Einstein's | 15 | Who keeps fish? | 154 x 6 | 0.1 s | 736 | 0.02 s
Zebra | 14 | Who drinks water? Who owns zebra? | 138 x 6 | 0.2 s | 144 | 0.11 s
Blood Donation | 22 | Whose blood type is B-? What color shirt is worn by the person weighing 160 lb? How old is Andrea? Which unit is the Policewoman? | 332 x 7 | 0.2 s | 972 | 6 s

Table 11: Comparison of three puzzles

The evolution from data to intelligence must go through several stages: data -> information -> knowledge -> intelligence. The underlying data plays the most fundamental role. It comes from actual observations and is the most primitive entity, which has no meaning in itself except to represent a kind of existence. For the reasoning system, the four types of variables in statistics (nominal, ordinal, interval and ratio variables) reflect the capacity of information. The puzzles mentioned above usually contain only a limited number of nominal and ordinal variables and no interval or ratio variables, so they can be easily solved by the automatic reasoning system described in this article. Currently, the model generation part can only deal with a limited set of categorical (discrete) variables; it has not yet been generalized to continuous variables. In contrast, for general constraint satisfaction problems the rules are implied in the constraint formulation, so there is no such restriction on variable type. Handling such cases, e.g., via a mixed model approach, still requires further research and optimization.

Due to space limitations, only the key concepts, a bird's-eye view of the implementation and some system capabilities are presented in this post. Please allow me to discuss the technical details of this automatic reasoning system in the near future!

Summary

Starting from the Zebra puzzle, this post identifies the common paradigm of logic puzzles and designs a general logic reasoning system to solve them automatically. It is not only different from the brute force method without intelligence, but also different from the general solution to the constraint satisfaction problems. This practice has shown that by starting from a set of purely declarative statements, any zebra puzzle with limited constraints can be tagged and solved quickly in the same way without changing any code.

Learn more

Zebra Puzzle Terminator: A general automatic reasoning system solving method was published on SAS Users.

February 15, 2022
 

The Missionaries and Cannibals Problem (MCP) is a classic river-crossing logic puzzle that derives from the famous Jealous Husbands problem. The earliest version of the MCP was described by Pocock in 1891. In the article “The jealous husbands and the missionaries and cannibals” by Ian Pressman and David Singmaster, published in The Mathematical Gazette, Vol. 73 (JSTOR 3619658), the following theorem was stated as the 4th theorem, without proof, for this river-crossing problem:

THEOREM 4. The minimal number of crossings to ferry n >= 3 missionaries and n cannibals across a river with an island, using a two-person boat and bank-to-bank crossings, is 4n - 1.

The Missionaries and Cannibals Problem is usually defined as follows:

On one bank of a river are 3 missionaries and 3 cannibals. There is 1 boat available that can carry at most 2 people and that they would like to use to cross the river. If the cannibals ever outnumber the missionaries on either of the river’s banks or on the boat, the missionaries will get eaten. How can the boat be used to carry all the missionaries and cannibals across the river safely? The boat cannot cross the river by itself with no people on board and there is no island in the middle of the river.

Solving the problem

To build a system to solve this problem, we can define how to represent the state of the system and how the states change under the applied actions. We also need to define the initial state and the final state, so solving the problem is abstracted as finding a path from the initial state to the final state. In this way, we can solve the problem with the help of graph theory, by finding a path in a connected directed graph. The detailed algorithm can be described as follows:

  1. The primary arguments for the system are the number of missionaries (M), the number of cannibals (C) and the capacity of the boat (B). For the problem above, M=3, C=3 and B=2.
  2. The system state can be uniquely defined by the numbers of missionaries and cannibals and the boat flag on the left bank. In other words, [m=3, c=3, b=1] indicates there are 3 missionaries, 3 cannibals and the boat on the left bank. The state of the other bank is uniquely determined by the left bank after a crossing. Both banks need to always abide by the game's rules for all [m, c, b]: m>=0, c>=0, and m>=c if m>0.
  3. The number of valid crossing actions depends on the capacity of the boat and the state of the departure bank. If the capacity of the boat is 2, a boat load of p missionaries and q cannibals needs to meet all of the following conditions, in addition to the rules defined in #2 (a small SAS sketch of these rules follows this list):
    • (p+q)<=B : a boat can carry at most B people.
    • (p=0 OR (p>0 AND p>=q)): cannibals can’t outnumber the missionaries on the boat if there is any missionary.
    • NOT (p=0 AND q=0): the boat cannot cross the river by itself with no people.

So, we get the following table of candidate boat loads (p, q) for the river crossing:

(p, q) | q=0 | q=1 | q=2
p=0 | (0, 0) | (0, 1) | (0, 2)
p=1 | (1, 0) | (1, 1) | (1, 2)
p=2 | (2, 0) | (2, 1) | (2, 2)

 

  4. Starting from the initial system state, we can use the Breadth First Search (BFS) algorithm to drive the state-space transitions. The starting point is the initial state, while the end point is the target state. To avoid reentrant visits, we need a dictionary to record the nodes that have been visited. We also need a queue to hold newly found system states that have not been visited yet. We then apply the actions defined in #3 until the state space is traversed.
  5. The output of #4 is the set of path segments for a final solution. Whenever we find a solution, we need to dump out the full path, and all these paths form a Directed Acyclic Graph (DAG). For those endpoints that are not the final goal state, we need to remove them to build a single clean graph with the final state as the endpoint.
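As mentioned in rule #3, the building blocks of the state transition are the legal boat loads and the legal bank states. The sketch below enumerates both for the instance M=3, C=3, B=2 (a sketch of the rules only; the BFS driver itself is in the downloadable code referenced below):

/* Sketch: legal boat loads (p missionaries, q cannibals) and legal left-bank states */
/* for the MCP instance M=3, C=3, B=2.                                               */
%let M=3;
%let C=3;
%let B=2;

data boat_loads;                        /* rule #3: legal boat loads */
  do p = 0 to &B;
    do q = 0 to &B;
      if (p + q <= &B)                  /* the boat carries at most B people            */
         and (p = 0 or p >= q)          /* missionaries are not outnumbered on the boat */
         and not (p = 0 and q = 0)      /* the boat cannot cross empty                  */
      then output;
    end;
  end;
run;

data bank_states;                       /* rule #2: legal states [m, c, b] of the left bank */
  do m = 0 to &M;
    do c = 0 to &C;
      do b = 0 to 1;
        left_ok  = (m = 0 or m >= c);                 /* left bank is safe  */
        right_ok = (&M - m = 0 or &M - m >= &C - c);  /* right bank is safe */
        if left_ok and right_ok then output;
      end;
    end;
  end;
run;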

Visualizing the solution

The following result is generated by the SAS implementation of the above algorithm for the MCP problem (M=3, C=3 and B=2). It is a directed acyclic graph that represents all possible solutions on one page. The NETDRAW procedure in SAS was designed to draw a network diagram of the activities in a project, but here we use it to visualize the nodes and relationships of a directed acyclic graph (DAG) (click here to download the precompiled code). We can also generate the step description for a solution (the top-most path) in that directed acyclic graph. For example, the first intuitive solution for (M=3, C=3, B=2) is listed below. It is one of the 4 possible solutions revealed by the directed acyclic graph above. There are three other variants for (M=3, C=3 and B=2) besides the following solution.

Generally, if the boat's capacity is 2 (B=2) and the number of missionaries and cannibals is equal (C=M), we draw the following conclusions:

  • When M = 1, there is one and only one solution, that is, N(M=1, C=1, B=2) =1.
  • When M = 2, there are 4 different solutions, that is, N(M=2, C=2, B=2) =4. Each solution needs 5 trips.
  • When M = 3, there are 4 different solutions, that is, N(M=3, C=3, B=2) =4. Each solution needs 11 trips. In fact, this is the only case that meets the 4n-1 statement of THEOREM 4.
  • When M>=4, there is no solution, that is, N(M>=4, C=M, B=2)=0.

When the number of cannibals is less than that of missionaries, such as 1 less, that is (C=M-1), all values of M have solutions. This is because fewer cannibals weaken the constraints, so there will be more solutions.

  • When M = 1, there is one and only one solution, that is, N(M=1, C=0, B=2)=1.
  • When M = 2, there are 3 different solutions, that is, N(M=2, C=1, B=2)=3. Each solution needs 3 trips.
  • When M = 3, there are 9 different solutions, that is, N(M=3, C=2, B=2) =9. Each solution needs 7 trips.
  • When M = 4, there are 25 different solutions, that is, N(M=4, C=3, B=2)=25. Each solution needs 11 trips.

If the number of missionaries and cannibals is equal (C=M) when the boat's capacity is 3 (B=3), then:

  • When M = 1, there is one and only one solution, that is, N(M=1, C=1, B=3) = 1.
  • When M = 2, there are 5 different solutions, that is, N(M=2, C=2, B=3) = 5. Each solution needs 3 trips.
  • When M = 3, there are 6 different solutions, that is, N(M=3, C=3, B=3) = 6. Each solution needs 5 trips.
  • When M = 4, there are 32 different solutions, that is, N(M=4, C=4, B=3) = 32. Each solution needs 9 trips.
  • When M = 5, there are 25 different solutions, that is, N(M=5, C=5, B=3) = 25. Each solution needs 11 trips.
  • When M>=6, there is no solution, that is, N(M>=6, C=M, B=3) = 0.

For the Missionaries and Cannibals Problem above (M=5, C=5, B=3), the step description of a solution can also be generated by SAS as below:

In the same way, when the number of cannibals is less than that of the missionaries, such as 1 less (C=M-1), then all values of M can be solved because fewer cannibals weaken the restriction conditions.

When the capacity of boat B is greater than or equal to 4, there are solutions for all values of M if the number of missionaries and cannibals are equal (C=M). E.g., here is a list of all solutions for MCP(M=5, C=5, B=4) and the step description of a solution below:

Furthermore, the following table lists the statistics of all possible MCP solutions when M<=16, C=M, B=1 to 6. The SolutionsNum column indicates the number of solutions while MinTrips and MaxTrips indicate the minimum and maximum trips needed, respectively. The DistinctTripsLength column indicates whether the number of trips is variable; the distinct trips length is either 1 or 2.

The maximum number of trips across the river is not monotonically increasing; it shows the following correlation. When other conditions are the same, B=4 requires the greatest number of trips if M>=6.

 

The first case with a variable number of trips is MCP(M=2, C=2, B=4), which has 6 solutions with 1 or 3 trips; the second case with a variable number of trips is MCP(M=3, C=3, B=4), which has 25 solutions with 3 or 5 trips. You can see more details below:

Summary and more resources

We have described how to solve the classic Missionaries and Cannibals Problem with SAS, visualized the solutions with the SAS NETDRAW procedure and generated step descriptions automatically for any MCP problem. The statistics of all possible MCP solutions when M<=16 proved that MCP(M=3, C=3, B=2) is the only case that conforms to Theorem 4. This is just one example of how powerful SAS can be for problem-solving and data visualization. For more examples, please check out some of my other articles:

Solve the Missionaries and Cannibals Problem with SAS was published on SAS Users.

January 28, 2022
 

In May of 1999, Ian Stewart published the article “A Puzzle for Pirates” in that month's edition of Scientific American (pp. 98-99). This mathematical game went on to become a worldwide hit, especially for puzzlers who like a good challenge. The rules of the pirate game are described below:

Five rational pirates have found 100 gold coins on an island and need to decide how to distribute them. They are a democratic bunch and they ask the fiercest pirate to propose a plan of distribution. Then, all of the living pirates (including the proposer) vote on whether to accept the distribution plan. If the majority (50% or more) accepts that proposal, then coins are dispersed and the game ends. Otherwise, the proposer is thrown overboard and killed, then the procedure is repeated with the next fiercest pirate until a proposal is accepted or only one pirate is left. If there is a tie vote, then the proposer (say, the fiercest pirate) has the casting vote.

At first glance, the proposer may offer the most favorable terms to all voters to pass the proposal, but this is not the case. The pirates are not only rational but also greedy and want to maximize their personal gain. Due to the game's rules and how the order of pirate fierceness is known to all in advance, these rational pirates prioritize surviving first, then maximizing gold coins for themselves. They would love to kill off as many pirates as possible to make more coins for themselves through future distribution proposals. This logic may become surprisingly complicated due to the changes in the numbers of pirates and coins.

Suppose N is the number of pirates and G is the number of gold coins. We number the pirates in order from meekest to fiercest; the greater the number, the fiercer the pirate. So P1 is the least fierce pirate, and PN is the fiercest pirate. Because there are too many branches from the beginning, we think about the problem backward from the end, that is, starting from the game results and reversing the entire distribution process.

If there are only two pirates (N=2), then obviously the fierce pirate is going to propose G coins for himself and none for the meek pirate (P1=0, P2=G). Any proposal will pass no matter what it is in this case, so the fierce pirate would take all the coins while the meek pirate has no choice but to accept it. The fierce pirate doesn't need to please the meek pirate at all!

If there are three pirates (N=3), and all of them are aware of an inevitable positive result, then the fiercest pirate P3 is going to make a proposal that passes to guarantee his own survival first, so he needs at least one vote from P1 or P2. Because P2 has the possibility of getting all G coins in the two-pirate round, he can't be bribed with gold coins. P1 knows he is going to get 0 coins if the game goes on to the next round, and P3 knows what P1 is thinking, so P3 can offer the least amount of coins, e.g., just 1 coin, to buy the vote of P1, and P1 would accordingly love to support P3 as he can get a better result: survival plus coins. So, the final proposal by P3 is P1=1, P2=0, P3=G-1, where G>1.

From the two trial calculations above, we can conclude the following decision-making patterns or rules for a player of the pirate game. The rules apply to all pirates and encode a pirate's rationality; together they form the recursive solving algorithm for the pirate game (click here to download the precompiled code).

  1. For any N and G, both must be greater than zero for the game. To avoid the possibility of unlimited bribing, we usually define them as integers with 2*G < N, but this is not required.
  2. If the number of pirates N is 2, then the proposal is P1=0, P2=G.
  3. Otherwise, leave the proposer's vote out and let the remaining (N-1) pirates distribute the G gold coins as a trial calculation. For that trial proposal, count the supporting votes:
    • The fiercest remaining pirate whose own proposal cannot pass in the next round is an inevitable supporter of the current proposal, because supporting it lets him avoid making a doomed proposal and being killed in the next round.
    • The pirates who are going to receive 0 coins in the next round can be bought as supporters with the fewest coins (e.g., 1 coin each) out of the total G coins. For those pirates who received coins in the trial calculation, take their coins away!
  4. Check if the current proposal is going to pass the vote and return that flag.
    • If the number of supporting votes plus the current proposer's own vote is equal to or greater than N/2, then the proposal passes. The current proposer takes all the gold coins remaining after buying supporters in 3.b, and all the pirates who were going to die in the next round now survive because the further distribution process is avoided.
    • If the current proposal cannot be passed, then throw the current proposer overboard, marking PN as a must-die pirate.

Please note that in step #3 we call the pirate game-solving algorithm recursively, so it will eventually reach the game-over condition. The following proposal plan table was generated by the SAS implementation of the above algorithm for pirate games with N<=20 and G=100. You can see that the best solution is for the fiercest pirate to choose the pirates with the same parity as himself, in the list ordered by fierceness, as his supporters. E.g., P3 would choose P1 as his supporter in an N=3 pirate game, and P5 would choose P3 and P1 as his supporters in an N=5 game.
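For the regime N <= 2*G, where every proposer has enough coins to buy the votes he needs, the parity pattern above can also be reproduced with a simple iterative pass instead of the full recursion. The sketch below rebuilds each proposer's allocation from the zero-coin pirates of the previous round (an illustration of the pattern only, not the full recursive solver, which must also handle the death cases discussed next):

/* Sketch: pirate-game allocations for N <= 2*G (here N=20, G=100).                  */
/* In round k, proposer Pk gives 1 coin to every pirate who would receive 0 coins in */
/* the (k-1)-pirate game and keeps the rest; one observation is output per round.    */
%let N=20;
%let G=100;

data proposals;
  array a[&N] alloc1-alloc&N;      /* a[j] = coins for pirate Pj in the current round */
  k = 2;  a[1] = 0;  a[2] = &G;    /* base case: two pirates                          */
  output;
  do k = 3 to &N;
    bought = 0;
    do j = k-1 to 1 by -1;
      if a[j] = 0 then do;         /* zero-coin pirate from the previous round: buy the vote */
        a[j] = 1;
        bought + 1;
      end;
      else a[j] = 0;               /* pirates who got coins last round get nothing now */
    end;
    a[k] = &G - bought;            /* the proposer keeps the remainder                 */
    output;
  end;
  drop j bought;
run;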

Will this pattern continue for any N-pirate game with a constant G gold coins? The answer is no, because we have only a limited number of gold coins (G) with which to buy supporters. However, it does hold for any game with N <= 2*G; e.g., P200 can keep 1 coin and buy 99 supporters with the other 99 coins. The detailed proposal plan for the N=200 pirate game is shown below:

Interesting things happen when the number of pirates N exceeds 2*G. E.g., P201 and P202 can still survive by taking no gold coins for themselves and offering 1 gold coin each to 100 pirates with the same parity in the list ordered by fierceness. The detailed proposals for the N=201 and N=202 cases are shown below:

Unfortunately, P203 has no coins left to buy supporters in the pirate game (G=100, N=203), so P201 and P202 in this round would love to throw P203 overboard to earn more coins through future distribution proposals. In other words, P203 must die in this case (we use -1 to indicate death in the detailed proposal).

P204 is a lucky guy who can survive. Since P203 must die in the next round, P203 is going to support any proposal made by P204 in the current round. P204 applies the same strategy that P202 used in the N=202 pirate game above, so P204 takes no coins and gets support from P203 in the current round. So, P201, P202, P203 and P204 all survive with no coins. The detailed proposal is shown below:

Accordingly, P205, P206 and P207 are all unlucky and doomed to die in their round, just like P203 in pirate game (G=100, N=203). The key reason is that they cannot get sufficient votes to make their proposals pass. The four rational pirates (P201 to P204) know they can survive and would love to throw these three pirates (P205 to P207) overboard to earn more coins through future distribution proposals. The detailed proposal for the game (G=100, N=207) is shown below:

Just as lucky as P204 in pirate game (G=100, N=204), P208 is also lucky enough to make a proposal pass because he can get sufficient support from P205, P206 and P207 who are doomed to die. P205, P206 and P207 must support P208 to avoid making a proposal and getting killed in the next round. The detailed proposal for the game (G=100, N=208) is shown below:

In general, suppose PI is the I-th pirate (I <= N) in a pirate game (N, G); we have the following conclusions based on the results of the recursive solving algorithm:

  1. When I <= 2*G, pirate PI will always survive and also has a 50% possibility of getting 1 or more gold coins.
  2. When I > 2*G and I = N, the pirate PI (=PN, the proposer) is going to survive only when N - 2*G is a power of 2; otherwise he will die.
  3. When I > 2*G and I < N, all pirates with number I <= (2*G + M) will survive but receive no coins, and all pirates with number I > (2*G + M) will die, where M is the highest power of 2 that does not exceed N - 2*G. Please note that survival for such a PI depends on the value of N; those PI will survive again if N reaches 2*G + 2*M.

E.g., for G=100 and N=(205, 206, 207), N-2*G=(5, 6, 7) and M=4 (the highest power of 2 not exceeding 5, 6 or 7), so 2*G+M=204 and P201, P202, P203 and P204 will survive in the pirate games N=(205, 206, 207). P205 must die in pirate games N=(205, 206, 207), P206 must die in pirate games N=(206, 207), and P207 must die in pirate game N=207. But P205, P206 and P207 will survive again if N is equal to or greater than 208.
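These rules are easy to compute directly. The sketch below reproduces the example above for G=100 and N from 201 to 220 (a sketch based on the stated formulas):

/* Sketch: survival cutoff for pirate games with G=100 and N > 2*G. */
%let G=100;

data survival;
  do n = 201 to 220;
    excess = n - 2*&G;                   /* N - 2*G                                     */
    m = 2**floor(log2(excess));          /* highest power of 2 not exceeding N - 2*G    */
    cutoff = 2*&G + m;                   /* highest-numbered surviving pirate           */
    proposer_survives = (excess = m);    /* PN survives only if N - 2*G is a power of 2 */
    output;
  end;
run;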

In fact, there is no unique solution to distribute gold coins if a pirate already guarantees his survival. This means that the fiercest pirate has some freedom of selection when I>=2G+2, but the simplest solution is to select those pirates with the same parity by order of fierceness.

From the proposer's perspective (I=N), if he attends a pirate game with a different N, his personal outcome differs. For a game with a number of pirates less than or equal to 2*G, the proposer will always survive with coins. For games with more than 2*G pirates, most proposers can't survive, except those whose number minus 2*G happens to be an integer power of 2. We can visualize the solution for the pirate game (N=500, G=100) with SAS as seen below.

Accordingly, we can also calculate the probability of survival in a pirate game with N pirates (N<=500) and a constant G=100. The survival curve below indicates how likely a pirate (proposer) is to survive if he participates in a game with N pirates. As expected, the curve is non-linear.

The least fierce pirate has no chance to distribute gold coins, but he has the most chances to vote on proposals. In other words, he can survive longer than other pirates. For the fiercest pirates, the survival rate shows a cliff-like fall at certain points for pirates whose number PI is greater than 2*G. The survival curve below indicates how likely a pirate PI is to survive if he participates in a game with N pirates (N<=500).

E.g., for a game with 500 pirates, the 44 fiercest pirates (P457 to P500) are doomed to be thrown overboard, so the blue survival line falls to 0 after P456 (2*G+2^Y=2*100+2^8=456); accordingly, the death rate (red) suddenly increases to 100%.

Summary

We have figured out how to solve the pirate game with a recursive solving algorithm in SAS, and how to analyze and visualize the law behind the complex logic of the pirate game. Now the next time you join a pirate game, you will know your destiny ahead of time and reap the benefits while avoiding getting thrown overboard.

Solve "A Puzzle for Pirates" with SAS was published on SAS Users.

December 23, 2021
 

As the leader in analytics, SAS provides a very rich set of SAS functions and CALL routines for programmers. There are a total of 693 uniquely named SAS functions, in addition to the hundreds of SAS PROC steps. These SAS functions not only cover general areas such as character/string processing, mathematics, permutations and combinations, probability, search and sorting, date and time, etc., but also include powerful domain-specific function sets for variable information and control, external files and system extensions, web services and tools, descriptive statistics, financial calculations, geodetic distance and ZIP codes, Git and other fields. This article reviews the financial functions provided by the SAS system and demonstrates how to use them to do financial analysis and simplify your decisions. Functions with the same name can be invoked easily when programming in the SAS DATA step, DS2, FedSQL and the CAS environment, which improves your programs' portability.

Overview

SAS provides 94 financial functions in total, which are mainly used to calculate financial values, such as interest, periodic payments, depreciation, and prices for European options on stocks and futures. The 53 financial functions conforming to industry standards are packaged in one general function called FINANCE. Each of these functions is identified by a unique string passed as the first parameter of the FINANCE function, and they have to be invoked via the FINANCE function. For example, we can use irr=finance("IRR", -400,100,200,300); to compute the internal rate of return (IRR) for a series of cash flows.

The 53 FINANCE functions are compatible with the 53 financial functions provided by Microsoft Excel for Office 365. In fact, the list of financial functions available in the latest Excel for Office 365 software contains 56 functions. Among the 3 extra functions, PDURATION and RRI are specific functions for Excel 2013, and STOCKHISTORY is a specific function for Office 365. These 53 financial functions are designed for financial calculation such as depreciation, maturation, accrued interest, net present value, periodic savings, internal rates of return, etc. They can be categorized as investment, depreciation, Internal rate of return, financial value, and yield calculations. The complete list of SAS FINANCE functions is shown in Table 1.

Category | Subcategory | Functions
Investment Calculation [12] | Future Value [2] | FV, FVSCHEDULE
Investment Calculation [12] | Payment [4] | PMT, IPMT, PPMT, ISPMT
Investment Calculation [12] | Present Value [3] | PV, NPV, XNPV
Investment Calculation [12] | Annual interest rate [2] | EFFECT, NOMINAL
Investment Calculation [12] | Number of periods [1] | NPER
Depreciation [7] | | AMORDEGRC, AMORLINC, DB, DDB, SLN, SYD, VDB
Internal rate of return [4] | | IRR, RATE, MIRR, XIRR
Financial value and yield [30] | Accrued/Cumulative Interest & Principal [4] | ACCRINT, ACCRINTM, CUMIPMT, CUMPRINC
Financial value and yield [30] | Coupon days/date and number [6] | COUPDAYBS, COUPDAYS, COUPDAYSNC, COUPNCD, COUPNUM, COUPPCD
Financial value and yield [30] | Yield [6] | INTRATE, ODDFYIELD, ODDLYIELD, YIELD, YIELDDISC, YIELDMAT
Financial value and yield [30] | Treasury bill [3] | TBILLEQ, TBILLPRICE, TBILLYIELD
Financial value and yield [30] | Duration [2] | DURATION, MDURATION
Financial value and yield [30] | Security [7] | DISC, ODDFPRICE, ODDLPRICE, PRICE, PRICEDISC, PRICEMAT, RECEIVED
Financial value and yield [30] | Dollar Price [2] | DOLLARDE, DOLLARFR

Table 1: 53 financial functions packaged in the one FINANCE function

Another 40 independently named SAS financial functions are used for investment or loan calculation, cash flow, depreciation, and pricing models. For example, we can use irr =irr(1,-400,100,200,300); to compute the internal rate of return (IRR) as a percentage for a series of cash flows.

There are 7 functions (IRR, NPV, PMT, PPMT, IPMT, CUMIPMT, CUMPRINC) whose names are the same as the industry-standard FINANCE functions, but they have different signatures or results, or even different calculation logic. For example, FINANCE PMT is used to calculate the periodic payment of an annuity, but the PMT function is used to calculate the periodic payment for a constant payment loan or the periodic savings for a future balance. Annuities and periodic savings differ in how they anchor market interest rates, how interest is calculated, liquidity and binding, so the same arguments return different results for them. The list is shown in Table 2.

Category | Functions
Investment/Loan [16] | IRR/INTRR, NPV/NETPV, SAVING, SAVINGS, NOMRATE, EFFRATE, COMPOUND, PMT, PPMT, IPMT, CUMIPMT, CUMPRINC, MORT, TIMEVALUE
Depreciation [10] | DACCDB, DACCDBSL, DACCSL, DACCSYD, DACCTAB, DEPDB, DEPDBSL, DEPSL, DEPSYD, DEPTAB
Cash flow [6] | CONVX, CONVXP, DUR, DURP, PVP, YIELDP
Pricing Models [8] | Stocks: BLKSHCLPRC/BLKSHPTPRC, GARKHCLPRC/GARKHPTPRC, MARGRCLPRC/MARGRPTPRC; Futures: BLACKCLPRC/BLACKPTPRC
Table 2: 40 uniquely named financial functions
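The overlap is easy to see by running the two IRR calls quoted above side by side; they should return the same underlying rate, one as a fraction and one as a percentage (a quick sketch, results written to the log):

data _null_;
  irr_frac = finance('IRR', -400, 100, 200, 300);   /* Excel-compatible FINANCE function: a fraction          */
  irr_pct  = irr(1, -400, 100, 200, 300);           /* uniquely named SAS IRR function: a percentage, freq=1  */
  put irr_frac= irr_pct=;
run;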

To keep semantic consistency, these industry-standard financial functions usually have the same parameter names, and some of the most important parameters are:

  • Interest rate (rate): The interest rate or discount rate of an investment or loan.
  • Number of periods (nper): The total number of payment periods for investment or loan.
  • Payment (pmt): The regular payment that is made each period for investment or loan, the value cannot change over the life of the annuity. Typically, it contains principal and interest, but not other fees and taxes.
  • Present value (pv): The current value of an investment or loan. The present value of an investment is the value at the beginning of the investment period, and the present value of the loan is the value of the principal borrowed.
  • Future value (fv): The value of an investment or loan after all payments have occurred, or a cash balance that you want to attain after the last payment is made.
  • Payments due type (type): A flag that indicates whether payments are due at the beginning of the period. An omitted type or 0 means payments are due at the end of the period; 1 indicates payments are due at the beginning of the period.
  • Period (period): Specify the period for interest rate or depreciation calculation. The period must be between 1 and the number of periods (nper).
  • Day count basis (basis): Optional, specify the type of day count basis to use. 0-represents US (NASD) 30/360, default; 1-represents Actual/actual; 2-represents Actual /360; 3-represents Actual /365; 4-represents European 30/360.
  • Frequency (frequency): Specify the number of coupon payments per year, 1-annual payment; 2-semiannual payment; 4-quarterly payment.

Below, we use some simple examples to show how to use these SAS financial functions to calculate and analyze mortgages.

Mortgage calculation

The average interest rate of first home mortgage in China is 5.46% since September 2021, and the average interest rate of second home mortgages is 5.83%, up 23 and 29 basis points respectively from 2020. Assuming that the first housing loan amount is 2 million yuan (CNY), the loan periods is 25 years. If the mortgage type is Constant Payment Mortgage (CPM), how much will the monthly payment and total payment be?

The FINANCE PMT function can be used to calculate the periodic payment of an annuity, it can be used to calculate the constant payment for Constant Payment Mortgage by default. That is, the payment is a constant for all periods, the same amount of the loan, including principal and interest, is repaid every month. The required parameters of the FINANCE PMT function are the interest rate (rate), the number of periods (nper), and the loan amount as the present value (pv). The optional parameters are the future value (fv) and the payment due type (type).

The FINANCE PPMT function can be used to calculate the payment on the principal during a given period for an investment or a loan based on a constant periodic payment and a constant interest rate. The required parameters are interest rate (rate), the payment period (period), number of periods (nper), and the loan amount as the present value (pv), optional parameters are future value (fv) and payment due type (type). Since FINANCE PPMT returns the monthly principal, we can calculate the monthly interest and the principal balance (pb) that need to be repaid. The complete SAS program is as follows:

 /*Mortgage: Constant Payment Mortgage*/
data Mortgage1;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  pmt = FINANCE('pmt', rate, nper, pv); /*Monthly payment*/
  pb=pv;           /*Principal Balance*/
  do period = 1 to nper;
    ppmt=FINANCE('ppmt', rate, period, nper, pv); /*Payment on the principal*/
    mi=pmt- ppmt; /*Payment on the interest*/
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
run;
 
title 'Mortgage: Constant Payment Mortgage';
proc print label;
run;

The results are as follows (Figure 1):

Figure 1: Details of Constant Payment Mortgage payment

To show the total payment and the total interest, we use PROC REPORT to generate a summary table as Figure 2. The total payment for the Constant Payment Mortgage is 3,670,206.07, the total interest is 1,670,206.07, and the ratio of interest to principal is 83.51%.

proc report;
  column period periodid pmt ppmt mi pb;
  format period 3.0 pmt ppmt  mi pb comma15.2;
  define period / order noprint;
  define pb / order;
  rbreak after / summarize style=[font_weight=bold];
 
  define periodid / computed "Period" center; 
  compute periodid / character length=20;
    periodid=period ;
  endcomp;
  compute after;
    if _break_ in ('_RBREAK_') then periodid='Total';
  endcomp;
run;

Figure 2: Constant Payment Mortgage payment schedule

If the mortgage type is Constant Amortization Mortgage (CAM), what will happen to its monthly payment and interest expenses? A Constant Amortization Mortgage divides the total amount of the loan into equal parts during the payment periods and the lender repays the same amount of principal every month and the interest incurred on the remaining loan of that month. As a result, the monthly payment amount is not constant, the initial payment is high, but the interest component of the payments declines rapidly over time. For this type of mortgage, the monthly payment on the principal (ppmt) is constant; we only need to calculate the current payment on the interest (mi) generated by the principal balance (pb) to get the current total payment (pmt) for the given period. The calculation can be accomplished without calling SAS FINANCE functions. The program is as follows:

/*Mortgage: Constant Amortization Mortgage*/
data Mortgage2;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  ppmt=-pv/nper;   /*Monthly Payment on the principal*/
  pb=pv;           /*Principal Balance*/
  do period=1 to nper;
    mi=-pb*rate;   /*Payment on the interest*/
    pmt=ppmt+ mi;
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
RUN;
 
title 'Mortgage: Constant Amortization Mortgage';
PROC PRINT;
RUN;

We use the same PROC REPORT program above to generate the summary table as Figure 3. In the table, the total payment for the Constant Amortization Mortgage is 3,369,550.00, the total interest is 1,369,550.00, and the ratio of interest to principal is 68.48%. This indicates that using this type of mortgage can save 300,656.07 in interest expenses.

Figure 3: Constant Amortization Mortgage payment schedule

To compare the monthly payment, monthly interest, and their cumulative value changes with the repayment periods for two mortgage types, we use PROC SGPANEL to draw the following figures. The code is as follows:

/*Merge two mortgage data with type=1/2 in long format*/
data Mortgage;
  set Mortgage1 Mortgage2;  
  retain type 1;
  if period=1 and _N_^=1 then type=type+1;
  pmt=abs(pmt);
  mi=abs(mi);
  ppmt=abs(ppmt);
run;
/*Create format for mortgage type*/
proc format;
  value typefmt 1="Constant Payment Mortgage"
2="Constant Amortization Mortgage";
run;
data Mortgage;
  set Mortgage;
  accuppmt+ppmt;
  accupmt+pmt;
  if first.type then do; 
	accuppmt=ppmt;
	accupmt=pmt;
  end;
  by type;
  label accuppmt="Cumulative monthly payment on principal" accupmt="Cumulative monthly payment";
  format type typefmt.;
run;
 
ods graphics / width=800px height=300px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2;
 
  vline period / response=pmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=pmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;
  vbar period / response=ppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;  
  colaxis label="Period" type=linear  values=(1 to 300 by 12  ) valueattrs=(color=gray size=12pt) grid;
run;
 
ods graphics / width=800px height=500px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2 ;
 
  vline period / response=accupmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=accupmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;;
  vbar period / response=accuppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;;
  colaxis label="Period" type=linear  values=(1 to 300 by 12  )  valueattrs=(color=gray size=11pt) grid;  
run;

In the following figures, the y-axis represents the monthly payment or the cumulative monthly payment, and the x-axis represents the payment period. The orange represents the principal component and the gray represents the interest component; the two constitute the monthly payment.

In Figure 4, the monthly payment of CPM (top), and the monthly payment on principal of CAM (bottom) are fixed values, which appear as a horizontal line and do not change over time. Correspondingly, in Figure 5, the cumulative monthly payment of CPM (top) and the cumulative monthly payment on principal (bottom) show linear growth.

Figure 4: Monthly payment of CPM vs CAM over period

Figure 5: Cumulative monthly payment of CPM vs CAM over period

Since the monthly payment of CPM is constant, the principal component increases month by month, but the interest component decreases accordingly. For CAM, the monthly payment and the interest component decreases rapidly month by month, but the principal component is constant.

CPM brings less stress to lenders and it is suitable for people whose income is expected to rise, and the lender is only willing to bear a lower monthly payment. CPM is also suitable for those people who have other channels to manage their own money to achieve greater benefits than mortgage expenditures. It is also suitable for those people who have a plan to sell real estate after a certain period. The disadvantage of CPM is that the overall interest expense is relatively high.

CAM is suitable for people who have higher income but whose income is flat or declining in the future. These people hope to put the payment stress on now rather than in the future. Declining payments may be appropriate to match a deflationary environment, but its high initial payment and rapid paydown of principal reduces leverage too fast. Furthermore, some people may complain the constantly changing payment is difficult to budget for and manage.

Summary

We have summarized the design and implementation of SAS functions in financial calculations. We then provided examples to calculate and analyze the monthly payment, interest, and principal for CPM/CAM mortgages. Finally, we compared the monthly payment, principal, interest, and cumulative value changes to a period for two mortgage types.

SAS financial functions review and mortgage payment analysis was published on SAS Users.

December 23, 2021
 

As the leader in analytics, SAS provides a very rich set of SAS functions and CALL routines for programmers. There are a total of 693 uniquely named SAS functions in addition to the hundreds of SAS PROC steps. These SAS functions not only cover general areas such as character/string processing, mathematics, permutations and combinations, probability, search and sorting, date and time, and so on, but also include powerful domain-specific function sets involving variable information and control, external files and system extensions, web services and tools, descriptive statistics, financial calculations, geodetic distance and ZIP codes, Git, and other fields. This article reviews the financial functions provided by the SAS system and demonstrates how to use them to do financial analysis and simplify your decisions. The same functions can be invoked easily in the SAS DATA step, DS2, FedSQL, and CAS environments, which improves your programs' portability.

Overview

SAS provides 94 financial functions in total, which are mainly used to calculate financial values such as interest, periodic payments, depreciation, and prices for European options on stocks and futures. The 53 financial functions conforming to industry standards are packaged in one general function called FINANCE. Each of them is identified by a unique string passed as the first parameter of the FINANCE function, so these functions have to be invoked via the FINANCE function. For example, we can use irr=finance("IRR", -400, 100, 200, 300); to compute the internal rate of return (IRR) for a series of cash flows.

The 53 FINANCE functions are compatible with the 53 financial functions provided by Microsoft Excel for Office 365. In fact, the list of financial functions available in the latest Excel for Office 365 software contains 56 functions. Among the 3 extra functions, PDURATION and RRI are specific to Excel 2013, and STOCKHISTORY is specific to Office 365. These 53 financial functions are designed for financial calculations such as depreciation, maturation, accrued interest, net present value, periodic savings, internal rates of return, etc. They can be categorized as investment, depreciation, internal rate of return, financial value, and yield calculations. The complete list of SAS FINANCE functions is shown in Table 1.

Category Subcategory Function
Investment Calculation [12] Future Value [2] FV, FVSCHEDULE
Payment [4] PMT, IPMT, PPMT, ISPMT
Present Value [3] PV, NPV, XNPV
Annual interest rate [2] EFFECT, NOMINAL
Number of periods [1] NPER
Depreciation [7] AMORDEGRC, AMORLINC, DB, DDB, SLN, SYD, VDB
Internal rate of return [4] IRR, RATE, MIRR, XIRR
Financial value and yield [30] Accrued/Cumulative Interest & Principal [4] ACCRINT, ACCRINTM, CUMIPMT, CUMPRINC
Coupon days/date and Number [6] COUPDAYBS, COUPDAYS, COUPDAYSNC, COUPNCD, COUPNUM, COUPPCD
Yield [6] INTRATE, ODDFYIELD, ODDLYIELD, YIELD, YIELDDISC, YIELDMAT
Treasury bill [3] TBILLEQ, TBILLPRICE, TBILLYIELD
Duration [2] DURATION, MDURATION
Security [7] DISC, ODDFPRICE, ODDLPRICE, PRICE, PRICEDISC, PRICEMAT, RECEIVED
Dollar Price [2] DOLLARDE, DOLLARFR

Table 1: 53 FINANCE XXX functions in one FINANCE function

Another 40 independently named SAS financial functions are used for investment or loan calculation, cash flow, depreciation, and pricing models. For example, we can use irr=irr(1, -400, 100, 200, 300); to compute the internal rate of return (IRR) as a percentage for a series of cash flows.

There are 7 functions (IRR, NPV, PMT, PPMT, IPMT, CUMIPMT, CUMPRINC) whose names are the same as the industry-standard FINANCE functions, but they return results with different signs or even use different calculation logic. For example, FINANCE PMT is used to calculate the periodic payment of an annuity, whereas the standalone PMT function is used to calculate the periodic payment for a constant payment loan or the periodic savings needed for a future balance. Annuities and periodic savings differ in how they anchor market interest rates, and in interest calculation, liquidity, and binding, so the same arguments return different results for them. The list is shown in Table 2.

Category Function
Investment/Loan [16] IRR/INTRR, NPV/NETPV, SAVING, SAVINGS,
NOMRATE, EFFRATE, COMPOUND,
PMT, PPMT, IPMT, CUMIPMT, CUMPRINC, MORT, TIMEVALUE
Depreciation [10] DACCDB, DACCDBSL, DACCSL, DACCSYD, DACCTAB
DEPDB, DEPDBSL, DEPSL, DEPSYD, DEPTAB
Cash flow [6] CONVX, CONVXP, DUR, DURP, PVP, YIELDP
Pricing Models [8] Stocks: BLKSHCLPRC/BLKSHPTPRC, GARKHCLPRC/GARKHPTPRC, MARGRCLPRC/MARGRPTPRC
Futures: BLACKCLPRC/BLACKPTPRC

Table 2: 40 uniquely named financial functions
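
As a quick side-by-side of the two IRR calls shown earlier (a minimal sketch of my own, not from the original post), the FINANCE version returns the rate as a fraction, while the standalone IRR function takes the payment frequency as its first argument and returns the same rate as a percentage:

data _null_;
  irr_finance=finance("IRR", -400, 100, 200, 300); /*Rate as a fraction, roughly 0.194*/
  irr_standalone=irr(1, -400, 100, 200, 300);      /*Same rate expressed as a percentage*/
  put irr_finance= irr_standalone=;
run;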

To keep semantic consistency, these industry-standard financial functions usually share the same parameter names. Some of the most important parameters are listed below, followed by a short illustration:

  • Interest rate (rate): The interest rate or discount rate of an investment or loan.
  • Number of periods (nper): The total number of payment periods for investment or loan.
  • Payment (pmt): The regular payment that is made each period for investment or loan, the value cannot change over the life of the annuity. Typically, it contains principal and interest, but not other fees and taxes.
  • Present value (pv): The current value of an investment or loan. The present value of an investment is the value at the beginning of the investment period, and the present value of the loan is the value of the principal borrowed.
  • Future value (fv): The value of an investment or loan after all payments have occurred, or a cash balance that you want to attain after the last payment is made.
  • Payment due type (type): A flag that indicates whether payments are due at the beginning of the period. An omitted type or 0 means payments are due at the end of the period; 1 means payments are due at the beginning of the period.
  • Period (period): Specify the period for interest rate or depreciation calculation. The period must be between 1 and the number of periods (nper).
  • Day count basis (basis): Optional; specifies the type of day count basis to use. 0 represents US (NASD) 30/360 (default); 1 represents Actual/actual; 2 represents Actual/360; 3 represents Actual/365; 4 represents European 30/360.
  • Frequency (frequency): Specify the number of coupon payments per year, 1-annual payment; 2-semiannual payment; 4-quarterly payment.
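
For instance, here is a hedged illustration of these parameters (the scenario numbers are made up for this sketch, not taken from the article): the monthly deposit needed to reach a 100,000 balance in 5 years at a 3% annual rate, with deposits due at the beginning of each period (type=1).

data _null_;
  rate=0.03/12;   /*Monthly interest rate*/
  nper=5*12;      /*Number of periods*/
  pv=0;           /*No starting balance*/
  fv=100000;      /*Target future value*/
  deposit=finance('pmt', rate, nper, pv, fv, 1);  /*Returned as a negative cash flow, roughly -1543*/
  put deposit=;
run;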

Below, we use some simple examples to show how to use these SAS financial functions to calculate and analyze mortgages.

Mortgage calculation

The average interest rate on first-home mortgages in China has been 5.46% since September 2021, and the average rate on second-home mortgages is 5.83%, up 23 and 29 basis points respectively from 2020. Assume the first-home loan amount is 2 million yuan (CNY) and the loan period is 25 years. If the mortgage type is Constant Payment Mortgage (CPM), how much will the monthly payment and the total payment be?

The FINANCE PMT function can be used to calculate the periodic payment of an annuity; by default it calculates the constant payment for a Constant Payment Mortgage. That is, the payment is constant for all periods: the same amount, including principal and interest, is repaid every month. The required parameters of the FINANCE PMT function are the interest rate (rate), the number of periods (nper), and the loan amount as the present value (pv). The optional parameters are the future value (fv) and the payment due type (type).
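
Before looking at the full program, here is a minimal sanity check of my own (not part of the original article): the FINANCE PMT result should match the standard annuity closed form rate*pv/(1-(1+rate)**(-nper)), apart from the convention that payments are returned as negative cash flows.

data _null_;
  rate=0.0546/12;
  nper=25*12;
  pv=2000000;
  pmt_finance=finance('pmt', rate, nper, pv);  /*Roughly -12234.02*/
  pmt_formula=rate*pv/(1-(1+rate)**(-nper));   /*Roughly  12234.02*/
  put pmt_finance= pmt_formula=;
run;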

The FINANCE PPMT function can be used to calculate the payment on the principal during a given period for an investment or a loan based on a constant periodic payment and a constant interest rate. The required parameters are the interest rate (rate), the payment period (period), the number of periods (nper), and the loan amount as the present value (pv); the optional parameters are the future value (fv) and the payment due type (type). Since FINANCE PPMT returns the monthly principal, we can derive the monthly interest and the principal balance (pb) that still needs to be repaid. The complete SAS program is as follows:

 /*Mortgage: Constant Payment Mortgage*/
data Mortgage1;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  pmt = FINANCE('pmt', rate, nper, pv); /*Monthly payment*/
  pb=pv;           /*Principal Balance*/
  do period = 1 to nper;
    ppmt=FINANCE('ppmt', rate, period, nper, pv); /*Payment on the principal*/
    mi=pmt- ppmt; /*Payment on the interest*/
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
run;
 
title 'Mortgage: Constant Payment Mortgage';
proc print label;
run;

The results are as follows (Figure 1):

Figure 1: Details of Constant Payment Mortgage payment

To show the total payment and the total interest, we use PROC REPORT to generate a summary table as Figure 2. The total payment for the Constant Payment Mortgage is 3,670,206.07, the total interest is 1,670,206.07, and the ratio of interest to principal is 83.51%.

proc report;   /*Runs against the most recently created dataset*/
  column period periodid pmt ppmt mi pb;
  format period 3.0 pmt ppmt mi pb comma15.2;
  define period / order noprint;       /*Numeric period kept only for ordering*/
  define pb / order;
  rbreak after / summarize style=[font_weight=bold];   /*Summary row with totals*/
 
  define periodid / computed "Period" center;          /*Character copy of the period*/
  compute periodid / character length=20;
    periodid=period;
  endcomp;
  compute after;
    if _break_ in ('_RBREAK_') then periodid='Total';  /*Label the summary row*/
  endcomp;
run;

Figure 2: Constant Payment Mortgage payment schedule

If the mortgage type is Constant Amortization Mortgage (CAM), what will happen to its monthly payment and interest expenses? A Constant Amortization Mortgage divides the total loan amount into equal parts over the payment periods: the borrower repays the same amount of principal every month plus the interest incurred on the remaining balance for that month. As a result, the monthly payment is not constant; the initial payments are high, but the interest component declines rapidly over time. For this type of mortgage, the monthly payment on the principal (ppmt) is constant; we only need to calculate the current payment on the interest (mi) generated by the principal balance (pb) to get the total payment (pmt) for the given period. The calculation can be accomplished without calling SAS FINANCE functions. The program is as follows:

/*Mortgage: Constant Amortization Mortgage*/
data Mortgage2;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  ppmt=-pv/nper;   /*Monthly Payment on the principal*/
  pb=pv;           /*Principal Balance*/
  do period=1 to nper;
    mi=-pb*rate;   /*Payment on the interest*/
    pmt=ppmt+ mi;
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
RUN;
 
title 'Mortgage: Constant Amortization Mortgage';
PROC PRINT;
RUN;

We use the same PROC REPORT program above to generate the summary table as Figure 3. In the table, the total payment for the Constant Amortization Mortgage is 3,369,550.00, the total interest is 1,369,550.00, and the ratio of interest to principal is 68.48%. This indicates that using this type of mortgage can save 300,656.07 in interest expenses.

Figure 3: Constant Amortization Mortgage payment schedule
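
As a quick cross-check of the totals quoted above (my own verification sketch, not from the original post), we can simply sum the monthly interest in each schedule; the ABS() is needed because the payments are stored as negative cash flows.

proc sql;
  select sum(abs(mi)) as cpm_total_interest format=comma15.2 from Mortgage1;
  select sum(abs(mi)) as cam_total_interest format=comma15.2 from Mortgage2;
quit;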

To compare the monthly payment, monthly interest, and their cumulative value changes with the repayment periods for two mortgage types, we use PROC SGPANEL to draw the following figures. The code is as follows:

/*Merge two mortgage data with type=1/2 in long format*/
data Mortgage;
  set Mortgage1 Mortgage2;  
  retain type 1;
  if period=1 and _N_^=1 then type=type+1;
  pmt=abs(pmt);
  mi=abs(mi);
  ppmt=abs(ppmt);
run;
/*Create format for mortgage type*/
proc format;
  value typefmt 1="Constant Payment Mortgage"
2="Constant Amortization Mortgage";
run;
data Mortgage;
  set Mortgage;
  by type;
  accuppmt+ppmt;
  accupmt+pmt;
  if first.type then do;
    accuppmt=ppmt;
    accupmt=pmt;
  end;
  label accuppmt="Cumulative monthly payment on principal" accupmt="Cumulative monthly payment";
  format type typefmt.;
run;
 
ods graphics / width=800px height=300px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2;
 
  vline period / response=pmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=pmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;
  vbar period / response=ppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;  
  colaxis label="Period" type=linear  values=(1 to 300 by 12  ) valueattrs=(color=gray size=12pt) grid;
run;
 
ods graphics / width=800px height=500px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2;
 
  vline period / response=accupmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=accupmt nooutline fillattrs=(color=cxE0E0E0) filltype=solid transparency=0.5;
  vbar period / response=accuppmt nooutline fillattrs=(color=DarkOrange) filltype=solid transparency=0.5;
  colaxis label="Period" type=linear values=(1 to 300 by 12) valueattrs=(color=gray size=11pt) grid;
run;

In the following figures, the y-axis represents the monthly payment or the cumulative monthly payment, and the x-axis represents the payment period. The orange represents the principal component and the gray represents the interest component; the two constitute the monthly payment.

In Figure 4, the monthly payment of CPM (top) and the monthly payment on principal of CAM (bottom) are fixed values, which appear as horizontal lines and do not change over time. Correspondingly, in Figure 5, the cumulative monthly payment of CPM (top) and the cumulative monthly payment on principal of CAM (bottom) show linear growth.

Figure 4: Monthly payment of CPM vs CAM over period

Figure 5: Cumulative monthly payment of CPM vs CAM over period

Since the monthly payment of CPM is constant, the principal component increases month by month while the interest component decreases accordingly. For CAM, the monthly payment and the interest component decrease rapidly month by month, but the principal component is constant.

CPM puts less stress on borrowers, and it is suitable for people whose income is expected to rise and who are only willing to bear a lower monthly payment at present. CPM is also suitable for people who have other channels to manage their own money and achieve greater returns than the mortgage cost, and for those who plan to sell the property after a certain period. The disadvantage of CPM is that the overall interest expense is relatively high.

CAM is suitable for people who have a higher income now but expect it to be flat or declining in the future. These people prefer to bear the payment pressure now rather than later. Declining payments may also suit a deflationary environment, but the high initial payments and the rapid paydown of principal reduce leverage quickly. Furthermore, some people may find the constantly changing payment difficult to budget for and manage.

Summary

We have summarized the design and implementation of SAS functions for financial calculations. We then provided examples to calculate and analyze the monthly payment, interest, and principal for CPM and CAM mortgages. Finally, we compared the monthly payment, principal, interest, and their cumulative values over the repayment period for the two mortgage types.

SAS financial functions review and mortgage payment analysis was published on SAS Users.

June 25, 2021
 

In many programming languages, there is a function named eval() that can be used to evaluate an expression and return the result at run time. For example, in Python, the eval() function parses the expression passed to it and runs it as a Python expression within the program. Although Python's eval() supports additional optional parameters, such as global and local dictionaries for the runtime context, the goal of the evaluation remains the same.

When an expression is stored as a character variable of a SAS observation, the running context is the SAS program execution phase, so internal data structures such as the Input Buffer (IB), the Program Data Vector (PDV), and the descriptor information of the output data set are all accessible. Furthermore, all SAS system functions are also accessible in such an expression, just like raw SAS code exposed to the SAS compiler and runtime. In the following SAS data set, the user wants to get the real value of the expression in variable C. This is different from the commonly used calculated column, which applies the same computing rule to a new variable for all observations. Here column C has a different computing rule for each observation; the user's expected result is column D.

%let MVAR=2;
data a; 
  length a b 8 c $ 255;
  a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
  a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
  a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
  a=0; b=1; c='cdf("NORMAL", 0, a, b)'; output; /* Probability */
run;
proc print;run;

What solutions ahead?

One might think of a solution that parses the expression in variable C, rebuilds the abstract syntax tree, and evaluates it from the bottom up. I can't say this approach is wrong, but it is very complex and hard to turn into a general solution. If SAS provided an eval() function in the DATA step, we could easily use the following code to achieve the goal. Unfortunately, SAS does not provide an eval() function.

 
data b; 
  set a;
  d=eval(c);
run;

SAS provides the macro functions %eval and %sysevalf to evaluate arithmetic and logical expressions using integer or floating-point arithmetic. SAS also provides the resolve() function (not supported in the DATA step that runs in CAS) to return the resolved value of its argument after it has been processed by the macro facility. However, a macro expression inside double quotation marks is resolved when the DATA step is compiled, before call symput() or call symputx() executes, so the code below outputs c=150 and d=200. For a simple arithmetic or logical expression, we can use the tranwrd() function to replace all variable names with their real values and then use resolve('%sysevalf('|| cats(expression_with_real_values) || ')') to get the evaluated result, but this is limited to the operators and functions supported by the %eval() and %sysevalf() macro functions.

%let a=100;
%let b=50;
data _null_;
  call symput("b", 200);  
  c=resolve("%eval(&a+&b)");
  put c=;
  d=symget("b");
  put d=; 
run;
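
For completeness, here is a rough sketch of my own of the tranwrd() plus resolve('%sysevalf(...)') route mentioned above. It only works for simple arithmetic expressions; DATA step functions such as cos() or cdf() are beyond what %sysevalf can evaluate.

data _null_;
  a=3; b=4; c='a**2+b**2';
  expr=tranwrd(tranwrd(c, "a", cats(a)), "b", cats(b)); /*Becomes 3**2+4**2*/
  d=input(resolve('%sysevalf(' || cats(expr) || ')'), best32.);
  put d=;  /*25*/
run;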

Now let's return to the key question we raised: how can we implement an eval() function in the SAS DATA step? The best solution we have found so far is to use the dynamic code generation mechanism in SAS and then submit the generated code to SAS for parsing and execution. In this way, the string expression stored in a variable becomes a real expression in generated SAS code. We do not need to care what the expression contains; we simply convert it into code snippets and submit them for execution. Syntax checking, references to PDV variables, built-in functions, and even SAS macro variables are all supported. It is a smart and concise implementation for general purposes.

Because each observation has a different computing rule, we need to use _N_ to control the observation mapping. For the existing dataset A, we can use the following code to achieve the goal. The key is to use CALL EXECUTE to generate SAS code dynamically and delay its execution until the generated code is complete. In the output dataset B, we get a numeric column D whose value is evaluated from the character expression in column C. You can see the output on the right part of Figure 1.

data _null_;
  set a end=last;
  if _n_=1 then call execute("data b; set a;");
  call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
  if last then call execute("run;");
run;
proc print; run;

Wrap up as a reusable SAS macro

We can also wrap the above SAS code in a reusable SAS macro, so this logic can be used anywhere you want. The %EvalVar macro has four arguments: ds= and var= are the input dataset and the column holding the expression to be evaluated; outds= and outvar= are the output dataset and the column that stores the result. We can also specify the same value for ds= and outds= to simply expand the existing dataset with an additional evaluated column.

%macro EvalVar(ds=, var=, outds=, outvar=);
data _null_;
  set &ds end=last;
  if _n_=1 then call execute("data &outds; set &ds;");
  call execute( "if _n_=" || compress(_n_) || " then &outvar=" || &var || ";");
  if last then call execute("run;");
 run;
%mend; 
 
%EvalVar(ds=a, var=c, outds=b, outvar=d);
proc print; run;

By the way, I also have a temporary SAS code generation solution implemented via the %INCLUDE statement. The temporary SAS code file is deleted automatically when the SAS session is closed. The sample code is attached here for your information.

filename tmpfile temp;
data _null_;
  file tmpfile;
  set a end=last;  
  if _N_=1 then put "data b; set a;";
  put "if _N_ =" _N_ "then d=" c ";";
  if last then 	put "run;";
run;
%include tmpfile;
proc print; run;

Summary

In this article, we talked about how to evaluate SAS expressions in the DATA step dynamically. The expression parsing and execution are handled entirely by SAS at the program execution phase, which saves us from parsing and evaluating an abstract syntax tree ourselves. We introduced two dynamic code generation implementations, via CALL EXECUTE and via %INCLUDE. We also used _N_ to control the observation mapping because each observation has a different computing rule. This implementation reflects the beauty of simplicity in SAS, at least until SAS provides a system eval() function one day.

How to evaluate SAS expression in DATA Step dynamically was published on SAS Users.

June 3, 2020
 

In natural language processing, word vectors play a key role in making technologies such as machine translation and speech recognition possible. A word vector is a row of numeric values in which each position captures a dimension of the word's meaning. Each value represents how closely the word relates to the concept behind that dimension, so the semantics of the word are embedded across the dimensions of the vector. Since similar words have similar vectors, representing words this way simplifies and unifies vector operations.

Word vectors are generated by training on word-word co-occurrence statistics over a large corpus. You can also use pre-trained word vectors such as GloVe, provided by Stanford University.

Let's talk about how to transform word vector tables from long to wide in SAS, so we can potentially get sentence vectors to process further. Suppose we generate word vectors from the following 3 sentences:

Jacky went outside.
Jill likes to draw in the afternoon.
Tony is a boy.

Each word has 2 numeric values (Vector1, Vector2); each value represents how closely the word relates to the concept defined by that dimension. The number of values (here VNUM=2) may range from hundreds to thousands in real text analysis scenarios.

Long word vector table

The sample code below generates the sample table above and sorts it for further processing.

data HAVE;
  length Word $ 45;
  input SentenceID Word Vector1-Vector2; /*300+*/
datalines;
1	Jacky 	   0.24011	 0.400996 
1	went	  -0.047581	 0.868716 
1	outside	  -1.197891	 1.162238
2	Jill	  -0.199579	 0.251252
2	likes	  -1.935640	-0.288264
2	to	  -0.526053	-1.143420
2	draw	  -0.736289	-0.794812
2	in 	  -2.757234	 0.506639
2	the	  -0.736289	-0.794812
2	afternoon -0.047581	 0.868716
3	Tony 	   0.34032	 0.600983 
3	is	   0.147531	 0.968817
3	a	   1.347543	 2.568323
3	boy       -3.257891      3.172238
run; 
proc sort data=HAVE;
  by SentenceID;
run;
proc print data=have;run;

If we want to transform the long table above into the wide table shown below, how can we do this as efficiently and simply as possible? The 14 words above belong to 3 sentences, which would result in 3 rows with 22 columns (1 + WNUM + WNUM x VNUM = 1 + 7 + 7 x 2 = 22).

Wide word vector table

Please note that we can calculate the maximum word count (WNUM) of a sentence at runtime with the SAS code below. For the case above, the value of WNUM is 7.

proc sql noprint;
  select max(count) into :wnum from (
    select count(Word) as count from HAVE group by SentenceID 
  );
quit;

In fact, we don't need any SAS PROC to handle this kind of transformation; a SAS DATA step provides an efficient and convenient way to transform the data. The key is to use an ARRAY to map all word vectors from the source table, and then define two more arrays to store the output words and vectors in the wide layout. These two output arrays need to be RETAINed across the implicit loop and KEPT for output when the last observation of a SentenceID is reached.

You can see the full SAS code below with detailed comments.

/*Long table to Wide table*/
%let vnum=2; /*vector numbers for a word*/
%let wnum=7; /*max word number in a sentence*/
data WANT;
  set HAVE;
  by Sentenceid;
  array _vector_ [*] vector:;         /*Map to source vectors*/
 
  array _word [ %eval(1*&wnum)] $ 45; /*Array to store WORD in wide table*/
  array _vector [ %eval(&wnum*&vnum)];/*Array to store VECTORS in wide table*/
  retain _word: _vector:;             /*RETAIN during the implicit loop*/
 
  retain _offset_ 0;                  /*Offset of a WORD in a sentence, base 0*/
  if first.Sentenceid then do;
    call missing(of _word[*]);
	call missing(of _vector[*]);
    _offset_=0;
  end;
  else _offset_=_offset_+1;
 
  _word[ _offset_+1 ]=word;           /*Cache current word to array WORD at [ _offset_+1]*/
  do i=1 to dim(_vector_);            /*Cache each vectors to array VECTORS at [_offset_* &vnum +i]*/
    _vector[_offset_* &vnum +i]=_vector_[i]; 
  end;
  keep Sentenceid _word: _vector: ;   /*Keep for output when it hit last.Sentenceid*/
 
  if last.Sentenceid then output;     /*Output the cached WORD and VECTORS*/
run;
 
proc print data=want;run;

Accordingly, if we need to transform a word vector table back from the wide layout to the long layout, we need to generate &WNUM rows with &VNUM vector columns for each sentence; it is the reverse of the logic above. The full SAS code with detailed comments is listed below:

/*Wide table to Long table*/
data HAVE2;
  set WANT; 
 
  array _word [*] _word:;           /*Array _word[] mapping to WORD in wide table*/
  array _vector_ [*] _vector:;     /*Array _vector[] mapping to VECTORS in wide table*/
 
  length Word $ 45;                 /*Output Word in the long table*/
  array Vector[&vnum];              /*Output Vectors in the long table*/
  do i=1 to &wnum;                  /*Unpack word from array _word[]*/       
    word=_word[i]; 
	if word=" " then continue;
    do j=1 to &vnum;                /*Unpack vectors from array _vector[]*/
      Vector[j]=_vector_[j + &vnum *(i-1)];
    end;
	keep Sentenceid Word Vector:;
	output;                          /*One row in wide table generate &wnum rows[]*/
  end;
run;
proc print data=HAVE2;run;

To wrap up this bi-directional transformation for general reuse in text analysis, we provide two SAS macros, %Long2Wide and %Wide2Long, invoked as shown below:

%Long2Wide(data=Have, vnum=2, wnum=7, sid=SentenceId, word=Word, out=Want);
proc print data=Want;run;
 
%Wide2Long(data=Want, vnum=2, wnum=7, sid=Sentenceid, out=Have2, outword=Word, outvector=Vector);
proc print data=Have2;run;
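
The macro definitions themselves are not reproduced in this excerpt. As a rough sketch (my own assumption, simply parameterizing the DATA step shown earlier), %Long2Wide could look like the following; %Wide2Long would wrap the reverse step in the same way.

%macro Long2Wide(data=, vnum=, wnum=, sid=, word=, out=);
  /*Assumes the source vector columns are named Vector1-Vector&vnum*/
  data &out;
    set &data;
    by &sid;
    array _vector_ [*] vector:;          /*Map to source vectors*/
    array _word [&wnum] $ 45;            /*Words of one sentence in wide style*/
    array _vector [%eval(&wnum*&vnum)];  /*Vectors of one sentence in wide style*/
    retain _word: _vector:;
    retain _offset_ 0;
    if first.&sid then do;
      call missing(of _word[*]);
      call missing(of _vector[*]);
      _offset_=0;
    end;
    else _offset_=_offset_+1;
    _word[_offset_+1]=&word;
    do i=1 to dim(_vector_);
      _vector[_offset_*&vnum+i]=_vector_[i];
    end;
    keep &sid _word: _vector:;
    if last.&sid then output;
  run;
%mend;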

We have demonstrated how to transform a word vector table from a long layout to a wide layout (and vice versa) efficiently with a SAS DATA step. We have also provided two well-wrapped SAS macros for general reuse.

Transform word vector tables from long to wide was published on SAS Users.

December 20, 2019
 

One day, I received a question from the SAS community on how to exclude variables whose percentage of missing values exceeds a certain threshold in data analysis. Before I could dive into the details, another question was raised about how to remove variables whose percentage of repeated values exceeds a certain threshold. In fact, both demands focus on the value characteristics of a single variable, which determine whether the user should include that variable in subsequent analysis. They reflect the need for variable selection in wide, noisy datasets: data filtering is enhanced by selecting variables (columns) in addition to the usual filtering of observations (rows).

DATA Step implementation

If the variable filtering is based on the percentage of missing values, the most intuitive approach is to use a DATA step to calculate the percentage of missing values for each column of the input dataset, then collect the variable names that meet the threshold condition and pass them to a global macro variable for subsequent analysis. Since a SAS DATA step includes both character and numeric data types, we cannot define one single array to map all columns of the dataset. Instead, we need to define two arrays that map all numeric variables and all character variables respectively.

In order to build a universal, reusable SAS macro, we also need to detect the numbers of numeric and character variables in advance and pass them to the array definitions in the next DATA step. In that DATA step, we use two arrays to access the dataset variables and count the number of missing values respectively. When the loop reaches the last observation, it collects, through the vname() function, the variable names that meet the threshold condition and finally places them into the SAS global macro variable named by &VARLIST. The SAS macro below demonstrates this logic:

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_;  /*get numeric and character variables count */
    set &data (obs=1);
    array _ncols_n _numeric_;
    array _ccols_n $ _character_;
    call symput ('_ncols_n',put(dim(_ncols_n),best.));
    call symput ('_ccols_n',put(dim(_ccols_n),best.));
  run;
  data _null_;
	set &data end=last;
    /*Process numeric types*/
	array _ncols_ _numeric_;	
	array _ncols_count {&_ncols_n};
	retain _ncols_count (&_ncols_n*0);	
	do i=1 to dim(_ncols_);			
	  if _ncols_[i]= .  then _ncols_count[i]=_ncols_count[i] + 1;
	end;
    /*Process character types*/
	array _ccols_ $  _character_;
    array _ccols_count {&_ccols_n};	
	retain _ccols_count (&_ccols_n*0);
	do i=1 to dim(_ccols_);			
	  if _ccols_[i]=" " then _ccols_count[i]=_ccols_count[i]+1;
	end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if last then do;	  
	  length _varlist_ $ 32767;
retain _varlist_ " ";
	  do i=1 to &_ncols_n; 
	    if (_ncols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ncols_[i]) );
        end;				
	  end;
	  do i=1 to &_ccols_n; 
	    if (_ccols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ccols_[i]) );
		end;
	  end;	
	  call symput ("&VARLIST",trim(_varlist_));			
	end;
  run;
%mend; 
/*Exclude variables if missing value pct GE. threshold  &th*/
%macro FilterData(data=, th=, out=);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars);
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

To verify the programming logic, let's generate a demo dataset WORK.CLASS based on SASHELP.CLASS; the output dataset has different missing value percentages (10.53%, 36.84%, 52.63%) in the Sex, Height, and Weight columns.

data class; /*To generate demo dataset*/
  set sashelp.class;  
  if age <= 11 then sex=" ";
  if age <= 12 then height=.;
  if age <= 13 then weight=.;  
run;
proc print data=class; run; /*See Figure 1 for output */

Now we can use the following code to filter out columns whose percentage of missing values is greater than or equal to 40%; the Weight column will not appear in the output CLASS2 dataset.

 
%FilterData(data=class, th=0.4, out=class2);
proc print data=class2; run; /*See Figure 1 for output */

Figure 1: Data before filtered and after filtered

The calculation of the percentage of missing values here is implemented by user programming in the DATA step. The missing value counts are computed separately in two arrays, one for character variables and one for numeric variables. If you want to do it in a single array, you must use a hash object to store each variable name and its corresponding index in that array. Whenever a missing value is detected, we check whether the variable is already in the hash object; if not, we add it directly. Otherwise, we look up the table to find the appropriate array index and then increment the counter in the corresponding element of that array. The advantage of this approach is that we only need one array in the later variable-name collection phase; the disadvantage is that the hash object increases programming complexity. The complete implementation based on this logic is as follows, and it can completely replace the first version of the %FilterCols implementation.

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_; /*Detect variable numbers of dataset*/
    dsid=open("&data.",'i');
    if dsid>0 then do;
      nvars= attrn(dsid, 'NVARS'); 
	  call symput("NVARS", nvars); 
    end;
  run;
  data _null_;
    set &data. end=last; 
    /*Define 2 arrays map to numeric/character vars respectively*/
    array _ncols_ _numeric_;
    array _ccols_ _character_;
 
    if _N_=1 then do; /*HASH Object definition and initialize*/
      length key $ 32;
      declare hash h(); 
	  rc=h.definekey("key");
	  rc=h.definedata("data", "key");
	  rc=h.definedone();
	  call missing(key, data); 
    end;  
    /*Define one array for counting of missing value*/
	array _cols_count[&NVARS] _temporary_; 
	retain _cols_count ( &NVARS * 0); 
    retain _col_id_ 1; 
 
    do i=1 to dim(_ncols_);	  
      if missing( _ncols_[i] ) then do; 
        /*Find mapped array index and modify counter value*/
	    key=vname( _ncols_[i] );  
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();  
		   _col_id_=_col_id_+1;		
	    end;    
        _cols_count[ data ] = _cols_count[ data ] +1; 
	  end;
    end;
    do i=1 to dim(_ccols_); 
      if missing( _ccols_[i] ) then do;
        /*Find mapped array index and modify counter value*/
	    key=vname( _ccols_[i] );
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();	 
		   _col_id_=_col_id_+1;		
	    end;	        
        _cols_count[ data ] = _cols_count[ data ] +1; 	    		
	  end;
    end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if last then do;
      declare hiter iter('h');
      rc = iter.first();  
      do while (rc = 0);   	    
        pct=_cols_count[data]/ _N_;
		if pct >= &th. then do; 
	      length _varlist_ $ 32767; 
retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(key);		  	  
        end; 
        rc = iter.next(); 
      end;	
	  call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
%mend;

PROC FREQ Implementation

In the two macros above, the counting of missing values for each column is implemented by user programming in the DATA step, but SAS provides a more powerful frequency procedure, PROC FREQ. Can we directly use its output for variable filtering? The answer is YES. For example, the following SAS code lists the Frequency, Percent, Cumulative Frequency, and Cumulative Percent of each variable's values (including missing values) in the dataset.

 
  proc freq data=class;
    table Sex Height Weight /missing;  /*See Figure 2 for output */
  run;

Figure 2: PROC FREQ output

To get the result of PROC FREQ into a dataset, we just need to add an ODS TABLE statement before running PROC FREQ:

 
ods table onewayfreqs=class_freqdata;

Figure 3: OneWayFreqs table class_freqdata

So the frequency statistics produced by PROC FREQ are stored in the class_freqdata dataset. We only need to check the value of the variable Percent for each column: if it is greater than or equal to the specified threshold, the variable name is added to the buffer variable _varlist_; otherwise it is ignored. The final value of _varlist_ is placed into the SAS global macro variable &VARLIST for later use. Compared to the DATA step implementation, the entire code here is much shorter and simpler.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  proc freq data=&data;
    table _all_ /missing;  
  run; 
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var); 
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
        retain _varlist_ " "; 
	    _varlist_=trim(_varlist_) || " "|| trim(var);	
		put var= _varlist_;
	  end;
   end;
   if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Please note that PROC FREQ would generate output to ODS HTML destination by default. If you want to avoid generating ODS HTML output when generating the temporary dataset “class_freqdata”, you can use the ods html close; statement to temporarily close the ODS HTML output and then open it again anytime needed.

Since the output of PROC FREQ already has the frequency information of each variable, we can also naturally filter columns based on the percentage of duplicate values. However, note that a single variable usually contains multiple distinct values, and each distinct value (including the missing value) has its own frequency percentage. If the percentage of duplicate values is too high in a specific column, the user may exclude that column from the following analysis. As there are multiple frequency percentages, we simply choose the largest one as the duplicate value percentage of that variable. For example, the Sex column of the demo dataset CLASS has a missing percentage of 10.53%, female of 42.11%, and male of 47.37%, so the percentage of duplicate values in the Sex column is 47.37%.

The following code filters variables by the percentage of duplicate values with a specific threshold. The code logic is almost the same as before, but it evaluates only the non-missing values. Running the same test code against the demo dataset therefore removes the Sex column by the duplicate-value percentage threshold, instead of the Weight column by the missing-value percentage threshold.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ / missing;  
  run; 
  ods html;
  data _null_;
    set &data._freqdata end=last;
    length _varlist_ $ 32767;
    retain _varlist_ " ";
    retain maxpct 0 maxvar lastvar;
    var=scan(Table, 2);
    if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
    value=vvaluex(var);
    /*Only non-missing levels count toward the duplicate value percentage*/
    if ^(value=" " or compress(value)=".") then do;
      /*A new variable starts: decide first whether the previous one meets the threshold*/
      if _N_>1 and var ^= lastvar then do;
        if maxpct/100 >= &th. then _varlist_=trim(_varlist_) || " " || trim(maxvar);
        maxpct=0;
      end;
      /*Track the largest level percentage of the current variable*/
      if percent > maxpct then do;
        maxpct=percent;
        maxvar=var;
      end;
      lastvar=var;
    end;
    /*Finalize the last variable and place the names that meet the condition in the &VARLIST global macro variable*/
    if last then do;
      if maxpct/100 >= &th. then _varlist_=trim(_varlist_) || " " || trim(maxvar);
      call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata ;
  quit;
%mend;

To combine these two filtering logics (by missing values and by duplicate values) in one macro, we introduce a flag parameter byduprate. The default value 0 filters variables by missing values, and the value 1 filters variables by duplicate values. The complete code is below:

 
%macro FilterCols(data=, th=0, varlist=, byduprate=0);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ /missing;
  run;
  ods html;
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2); 
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”"); 
	value=vvaluex(var);
  %if &byduprate=0 %then %do; /*Filter variables by missing value*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
	    retain _varlist_ " ";
	    _varlist_=trim(_varlist_) || " "|| trim(var);				 
	  end;
   end;
  %end;
  %else %do; /*Filter variables by duplicate value percentage*/
    length _varlist_ $ 32767;
    retain _varlist_ " ";
    retain maxpct 0 maxvar lastvar;
    if ^(value=" " or compress(value)=".") then do;
      /*A new variable starts: decide first whether the previous one meets the threshold*/
      if _N_>1 and var ^= lastvar then do;
        if maxpct/100 >= &th. then _varlist_=trim(_varlist_) || " " || trim(maxvar);
        maxpct=0;
      end;
      /*Track the largest level percentage of the current variable*/
      if percent > maxpct then do;
        maxpct=percent;
        maxvar=var;
      end;
      lastvar=var;
    end;
    /*Finalize the last variable before the list is published below*/
    if last and maxpct/100 >= &th. then _varlist_=trim(_varlist_) || " " || trim(maxvar);
  %end;
    if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Accordingly, we also need to introduce a parameter byduprate with the same name and behavior in the %FilterData macro that invokes %FilterCols. The %FilterData macro also uses the default value 0, and it passes the byduprate argument specified by the user through to the %FilterCols macro. The enhanced version is below:

 
%macro FilterData(data=, th=, out=, byduprate=0);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars, byduprate=&byduprate); 
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

Now the final version of the %FilterData macro unifies the two kinds of variable filtering logic; the behavior depends on the value of the byduprate argument. If byduprate is not specified or is 0, variables are filtered by the percentage of missing values; otherwise, they are filtered by the percentage of duplicate values. Of the following two code snippets, the former drops the Weight column, while the latter drops the Sex column.

 
/*Drop columns with missing value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class2);              
proc print data=class2;run;
 
/*Drop columns with duplicate value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class3, byduprate=1); 
proc print data=class3;run;

Figure 4: Drop column by missing and duplicate value percentage

Summary

This article discussed how to use SAS to filter variables in a dataset based on the percentage of missing values or duplicate values. The missing value statistics can be implemented either by DATA step programming on your own or by reusing the powerful PROC FREQ, whose output covers both missing and duplicate values. The five macros in this article also demonstrate how to wrap existing SAS analytics into reusable, general-purpose SAS macros for a larger project. The final version of the %FilterData macro unifies filtering variables by the percentage of either missing values or duplicate values, which makes the macro convenient to call and easy to reuse.

How to filter variables by percentage of missing values or duplicate values in SAS was published on SAS Users.

November 28, 2019
 

In time series data analysis, we can apply moving average methods to predict data points for series without seasonality. These include the Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), and so on. For series with a trend but without seasonality, we can use linear, non-linear, and autoregressive prediction models. In practice, we often use moving average methods for series with large variance, e.g. financial and stock market data.

Because it introduces a window, SMA reflects historical data with a lag, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importance of points within a window, which leads to WMA and EMA. WMA gives greater weight to recent observations and less weight to older ones, with the weights decreasing linearly; when the weights are all equal, WMA degenerates into SMA, and when the weights decrease exponentially, WMA becomes EMA. However, these traditional moving average methods all have a lag problem caused by the smoothing window, and their results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we need help from a smarter algorithm.

What is Hull Moving Average?

The Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australian investor Alan Hull in 2005. It is known to almost eliminate lag altogether while improving smoothing at the same time. Longer-period HMAs can be used as an indicator to identify the trend, and shorter-period HMAs can be used as entry signals in the direction of the prevailing trend. Due to the lag problem of all moving average methods, it is not recommended as a reversal signal on its own. Still, we can use HMAs with different windows to build a more robust trading strategy, e.g., a 52-week HMA as the trend indicator and a 13-week HMA for entry signals. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short-period HMA identifies entry signals in the direction of the prevailing trend: when the prevailing trend is rising, an upturn in the short-period HMA indicates a long entry signal; when the prevailing trend is falling, a downturn in the short-period HMA indicates a short entry signal.

When both the long- and short-period HMAs are bullish, they generate a Long Signal; when both are bearish, they generate a Short Signal. Long trades get a buy signal after a Long Signal occurs, while short trades get a sell signal after a Short Signal occurs. Long trades get a sell (exit) signal when either the long- or short-period trend is no longer bullish, and short trades get a buy (cover) signal when either trend is no longer bearish. Figure 1 shows a sample of this HMA-based trading strategy.
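
As an illustration only (my own sketch, not code from the original strategy), the DATA step below encodes these rules, assuming a hypothetical input data set hma_data that already contains long- and short-period HMA columns named HMA_long and HMA_short, for example built with the %AddHMA macro defined later in this post:

data signals;
  set hma_data;                                   /* hypothetical input with HMA_long / HMA_short */
  long_up  = (HMA_long  > lag(HMA_long));         /* 1 = long-period HMA rising  */
  short_up = (HMA_short > lag(HMA_short));        /* 1 = short-period HMA rising */
  length signal $5;
  if _N_ = 1 then signal = ' ';                   /* no direction on the first row */
  else if long_up and short_up then signal = 'LONG';          /* both bullish -> enter long  */
  else if not long_up and not short_up then signal = 'SHORT'; /* both bearish -> enter short */
  else signal = 'EXIT';                           /* trends disagree -> close open positions */
run;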

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate the WMA of the original sequence X using window length n;
  2. Calculate the WMA of X using window length n/2;
  3. Calculate the WMA of the derived sequence Y using window length sqrt(n), where Y is the new series 2 * WMA(x, n/2) − WMA(x, n). So the HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )
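
For example, with n = 9 the formula expands as follows (my own illustration of the window arithmetic; as in the macro implementation later in this post, the half window and the square root are rounded to integers, so round(9/2) = 5 and round(sqrt(9)) = 3):

HMA(x, 9) = WMA( 2*WMA(x, 5) − WMA(x, 9), 3 )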

HMA makes the moving average line more responsive to the current value while keeping the curve smooth: it almost eliminates lag and improves smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA gives a curve of the average of the historical data falling in the current window, but it has a constant lag of at least half a window, and the output may also lack smoothness. If we nest SMA multiple times to take the average of the averages, the result becomes smoother but the lag increases at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. With window length n=10, the SMA at the last point is 5.5, i.e. SUM[1…10]/10, which is far from the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we take a second SMA with half the window size and add the difference between these two SMAs, we get 8 + (8 − 5.5) = 10.5, which is very close to the real value 10 and nearly eliminates the lag. We can then apply one more smoothing pass with a suitable window length to reduce that slight overcompensation and improve smoothness. HMA uses the linearly weighted WMA instead of SMA, and the final window length is sqrt(n). This is the key trick behind HMA.
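
To make this arithmetic concrete, here is a minimal DATA step sketch of my own (not part of the original post) that reproduces the numbers above; the names lag_demo, sum10, sum5, sma10, sma5 and corrected are only for illustration:

data lag_demo;
  input x @@;
  retain sum10 sum5;
  sum10 = sum(sum10, x, -lag10(x));      /* rolling sum of the last 10 values */
  sum5  = sum(sum5,  x, -lag5(x));       /* rolling sum of the last 5 values  */
  if _N_ = 10 then do;
    sma10 = sum10 / 10;                  /* 5.5 : SMA with window 10          */
    sma5  = sum5  / 5;                   /* 8   : SMA with window 5           */
    corrected = sma5 + (sma5 - sma10);   /* 8 + (8 - 5.5) = 10.5              */
  end;
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=lag_demo; run;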

Figure 2 below shows the difference between HMA(x, 4) and the twice-nested SMA, SMA(SMA(x, 4), 4).

Figure 2: Comparison of SMA(SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are many complicated technical indicators derived from historical stock data, such as the Ease of Movement indicator (EMV), the momentum indicator (MTM), MACD, the energy indicator CR, the KDJ indicator, Bollinger Bands (BOLL) and so on. In essence, this is like the feature engineering done before neural network training: the fundamental purpose is to extract some intrinsic property from price fluctuations. The Hull Moving Average itself was discovered and applied in real trading practice, which gives us a glimpse of how complicated building an automated financial trading strategy can be behind the scenes. And that is just the beginning: reflexivity in trading and the aging of models both bring uncertainty, so the effectiveness of a quantitative trading strategy is a very complex topic that needs to be validated in real practice.

HMA in SAS Code

SAS was created for data analysis, however, and my colleagues have already gone into detail on how to calculate HMA in SAS. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. In this article, I would like to share how to implement both HMA and WMA with only 25 lines of SAS macro code. You can use the macros anywhere to build HMA series with a specific window length.

First, implement the Weighted Moving Average (WMA) as a SAS macro, shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name; default is the input name with the _WMA_n suffix
  • OUTLABEL   Label of the output variable; default is "input name WMA(n)"
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

Watch out for the trick in the WMA implementation above with linear weights. There is no DO loop in the code, and the weights and values are not recomputed for every observation. We simply keep a rolling sum of the values in the current window (lines 2-3); the weighted sum for WMA is then updated from its previous value by adding the window length times the current value and subtracting the previous rolling window sum (lines 4-5). For the sum of weights, we only need special handling while the observation count is still less than or equal to the window length (lines 6-7).
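
For example (a minimal usage sketch of my own; the data set name wma_demo and output variable WMA4 are just for illustration), the macro can be dropped into any DATA step. On the series 1..10, the last WMA(4) value should be (4*10 + 3*9 + 2*8 + 1*7) / (4+3+2+1) = 9:

data wma_demo;
  input x @@;
  %AddWMA(WMA_N=4, INPUT=x, OUT=WMA4, OUTLABEL=%str(x WMA(4)))
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=wma_demo label; run;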

Second, implement HMA with three WMA invocations, as shown below (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT          Input variable name
  • OUT             Output variable name; default is the input name with the _HMA_n suffix
  • OUTLABEL   Label of the output variable; default is "input name HMA(n)"
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The code above is quite intuitive: the first two macro calls compute WMA with window lengths n and n/2 for X, the next two lines build the new series Y and compute its WMA with window length sqrt(n), and the remaining statements just improve the macro's flexibility and drop temporary variables.
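
As a quick sketch of how the defaults and the rounding work together (the window value 16 is simply an example of mine, not a claim from the original post): calling %AddHMA with HMA_N=16 internally uses WMA windows 16, round(16/2) = 8 and round(sqrt(16)) = 4, creates and then drops close_WMA_16, close_WMA_8 and close_HMA_16_DELTA, and leaves a single output variable named close_HMA_16 by default:

data stocks_hma16;
  set sashelp.stocks(where=(stock='IBM'));
  %AddHMA(HMA_N=16, INPUT=close)   /* default output variable: close_HMA_16 */
run;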

To compare the HMA result with SMA, we also implement SMA in the 7 lines of code shown below. It has the same arguments as the WMA/HMA macros above, but SMA is not actually needed by the HMA macro %AddHMA implementation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all of the above implementations. The first test is the simple series we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below. The HMA result is much closer to the real value x than SMA: it is more responsive and has less lag.

Figure 3: HMA output sample

Now we can use the HMA macro on the IBM stock data in the SAS sample dataset SASHELP.Stocks and plot the trend lines to compare the HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have covered what the Hull Moving Average is and how it works, and walked through a very simple SAS macro implementation of HMA, WMA and SMA in only 25 lines of SAS code. This implementation makes it quite simple and efficient to calculate HMA in the SAS DATA step without SAS/IML or SAS Visual Analytics. In fact, it shows how easy rolling-window computation for time series data can be in SAS, and it opens a window to building complicated technical indicators for your own stock trading system.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.