“The future is already here — it's just not very evenly distributed.” ~ William Gibson, author The same can be said for climate change – global warming is here, in a big way, but its effects are still an arm's length away for many of us. How is climate change [...]
Designing interactive reports can be a fun and unique challenge. As user interface experience designers can attest, there are several aspects that go into developing a successful and effective self-service tool. Granted I’m not designing the actual software, but reports require a similar approach to be sure that visualizations are clear and that users can get to the answers they are looking for. Enter prompts.
Reports prompt users to better understand trends, how their data points compare to the whole, and to narrow the scope of data. Being able to pick the placement of these prompts quickly and easily will open the possibilities of your report layouts! I’m specifically speaking about Report and Page level prompts. Traditionally, these global prompt controls were only able to be placed at the top; see the yellow highlighted areas below.
Let’s take a look at an example report with the traditional Report and Page prompt layout. The Report prompts are extremely easy to pick out, since they sit above the pages, but the Page prompts can sometimes blend in with other prompts contained in the report body.
Introduced in the SAS Visual Analytics 8.4 release is the ability to control the layout position of these prompts. Using my example report, let’s change the placement of these prompts. In Edit mode, open the Options pane and use the top level drop-down to select the report name. This will activate the report level, and the report level Options will display. Next, under the Report Controls subgroup, move the placement radio button to the west cardinal point.
Depending on the type of control objects you are using in your report, you may not like this layout yet. For instance, you can see here that my date slider is taking up too much space.
When you activate the slide control, use the Options pane to alter the Slider Direction and Layout. You can even use the Style option to change the font size. You can see that after these modifications, the Report prompt space can be configured to your liking.
Next, let’s change the placement for the Page prompts, for demonstration purposes. From the Options pane, use the top drop-down to select the page name. This will activate the page level, and the page level Options will display. Next, under the Page Controls subgroup, move the placement radio button to the west cardinal position.
You can see that the direction of the button bar control was automatically changed to vertical. Now we can clearly see which prompts belong to the page level.
If I switch to view mode, and adjust the browser size, you can get a better feel for the Report and Page prompt layout changes.
But as with many things, just because you can, doesn’t mean you should. This is where the report designer’s creativity and style can really take flight. Here is the same report, but with my preferred styling.
Notice that I kept the Report prompts along the top but moved the Page prompts to the left of the report. I also added two containers and configured a gray border for each container to better separate the objects. This helps the user quickly see that the drop-down will filter the word cloud is only. I also used the yellow highlighting through styling and a display rule to emphasize the selected continent. The bar chart is fed from an aggregated data source which is why the report prompt is not filtering out the other continents.
Feel free to send me your latest report design ideas!
Additional material related to Report and Page prompts:
- Use a List Control as a Visual Analytics Report or Page/Section Prompt
- VA 7.4: Configure Report or Section Level Cascading Prompts
- VA 7.3: Configure Cascading Prompts
New control prompt placement option in SAS Visual Analytics was published on SAS Users.
Biplots are two-dimensional plots that help to visualize relationships in high dimensional data. A previous article discusses how to interpret biplots for continuous variables. The biplot projects observations and variables onto the span of the first two principal components. The observations are plotted as markers; the variables are plotted as vectors. The observations and/or vectors are not usually on the same scale, so they need to be rescaled so that they fit on the same plot. There are four common scalings (GH, COV, JK, and SYM), which are discussed in the previous article.
This article shows how to create biplots in SAS. In particular, the goal is to create the biplots by using modern ODS statistical graphics. You can obtain biplots that use the traditional SAS/GRAPH system by using the %BIPLOT macro by Michael Friendly. The %BIPLOT macro is very powerful and flexible; it is discussed later in this article.
There are four ways to create biplots in SAS by using ODS statistical graphics:
- You can use PROC PRINQUAL in SAS/STAT software to create the COV biplot.
- If you have a license for SAS/GRAPH software (and SAS/IML software), you can use Friendly's %BIPLOT macro and use the OUT= option in the macro to save the coordinates of the markers and vectors. You can then use PROC SGPLOT to create a modern version of Friendly's biplot.
- You can use the matrix computations in SAS/IML to "manually" compute the coordinates of the markers and vectors. (These same computations are performed internally by the %BIPLOT macro.) You can use the Biplot module to create a biplot, or you can use the WriteBiplot module to create a SAS data set that contains the biplot coordinates. You can then use PROC SGPLOT to create the biplot.
For consistency with the previous article, all methods in this article standardize the input variables to have mean zero and unit variance (use the SCALE=STD option in the %BIPLOT macro). All biplots show projections of the same four-dimensional Fisher's iris data. The following DATA step assigns a blank label. If you do not supply an ID variable, some biplots display observations numbers.
data iris; set Sashelp.iris; id = " "; /* create an empty label variable */ run;
Use PROC PRINQUAL to compute the COV biplot
The PRINQUAL procedure can perform a multidimensional preference analysis, which is visualized by using a MDPREF plot. The MDPREF plot is closely related to biplot (Jackson (1991), A User’s Guide to Principal Components, p. 204). You can get PROC PRINQUAL to produce a COV biplot by doing the following:
- Use the N=2 option to specify you want to compute two principal components.
- Use the MDPREF=1 option to specify that the procedure should not rescale the vectors in the biplot. By default, MDPREF=2.5 and the vectors appear 2.5 larger than they should be. (More on scaling vectors later.)
- Use the IDENTITY transformation so that the variables are not transformed in a nonlinear manner.
The following PROC PRINQUAL statements produce a COV biplot (click to enlarge):
proc prinqual data=iris plots=(MDPref) n=2 /* project onto Prin1 and Prin2 */ mdpref=1; /* use COV scaling */ transform identity(SepalLength SepalWidth PetalLength PetalWidth); /* identity transform */ id ID; ods select MDPrefPlot; run;
Use Friendly's %BIPLOT macro
Friendly's books [SAS System for Statistical Graphics (1991) and Visualizing Categorical Data (2000)] introduced many SAS data analysts to the power of using visualization to accompany statistical analysis, and especially the analysis of multivariate data. His macros use traditional SAS/GRAPH graphics from the 1990s. In the mid-2000s, SAS introduced ODS statistical graphics, which were released with SAS 9.2. Although the %BIPLOT macro does not use ODS statistical graphics directly, the macro supports the OUT= option, which enables you to create an output data set that contains all the coordinates for creating a biplot.
/* A. Use %BIPLOT macro, which uses SAS/IML to compute the biplot coordinates. Use the OUT= option to get the coordinates for the markers and vectors. B. Transpose the data from long to wide form. C. Use PROC SGPLOT to create the biplot */ %let FACTYPE = SYM; /* options are GH, COV, JK, SYM */ title "Biplot: &FACTYPE, STD"; %biplot(data=iris, var=SepalLength SepalWidth PetalLength PetalWidth, id=id, factype=&FACTYPE, /* GH, COV, JK, SYM */ std=std, /* NONE, MEAN, STD */ scale=1, /* if you do not specify SCALE=1, vectors are auto-scaled */ out=biplotFriendly,/* write SAS data set with results */ symbols=circle dot, inc=1); /* transpose from long to wide */ data Biplot; set biplotFriendly(where=(_TYPE_='OBS') rename=(dim1=Prin1 dim2=Prin2 _Name_=_ID_)) biplotFriendly(where=(_TYPE_='VAR') rename=(dim1=vx dim2=vy _Name_=_Variable_)); run; proc sgplot data=Biplot aspect=1 noautolegend; refline 0 / axis=x; refline 0 / axis=y; scatter x=Prin1 y=Prin2 / datalabel=_ID_; vector x=vx y=vy / datalabel=_Variable_ lineattrs=GraphData2 datalabelattrs=GraphData2; xaxis grid offsetmin=0.1 offsetmax=0.2; yaxis grid; run;
Because you are using PROC SGPLOT to display the biplot, you can easily configure the graph. For example, I added grid lines, which are not part of the output from the %BIPLOT macro. You could easily change attributes such as the size of the fonts or add additional features such as an inset. With a little more work, you can merge the original data and the biplot data and color-code the markers by a grouping variable (such as Species) or by a continuous response variable.
Notice that the %BUPLOT macro supports a SCALE= option. The SCALE= option applies an additional linear scaling to the vectors. You can use this option to increase or decrease the lengths of the vectors in the biplot. For example, in the SYM biplot, shown above, the vectors are long relative to the range of the data. If you want to display vectors that are only 25% as long, you can specify SCALE=0.25. You can specify numbers greater than 1 to increase the vector lengths. For example, SCALE=2 will double the lengths of the vectors. If you omit the SCALE= option or set SCALE=0, then the %BIPLOT macro automatically scales the vectors to the range of the data. If you use the SCALE= option, you should tell the reader that you did so.
SAS/IML modules that compute biplots
The %BIPLOT macro uses SAS/IML software to compute the locations of the markers and vectors for each type of biplot. I wrote three SAS/IML modules that perform the three steps of creating a biplot:
- The CalcBiplot module computes the projections of the observations and scores onto the first few principal components. This module (formerly named CalcPrinCompBiplot) was written in the mid-2000s and has been distributed as part of the SAS/IML Studio application. It returns the scores and vectors as SAS/IML matrices.
- The WriteBiplot module calls the CalcBiplot module and then writes the scores to a SAS data set called _SCORES and the vectors (loadings) to a SAS data set called _VECTORS. It also creates two macro variables, MinAxis and MaxAxis, which you can use if you want to equate the horizontal and vertical scales of the biplot axes.
- The Biplot function calls the WriteBiplot module and then calls PROC SGPLOT to create a biplot. It is the "raw SAS/IML" version of the %BIPLOT macro.
You can use the CalcBiplot module to compute the scores and vectors and return them in IML matrices. You can use the WriteBiplot module if you want that information in SAS data sets so that you can create your own custom biplot. You can use the Biplot module to create standard biplots. The Biplot and WriteBiplot modules are demonstrated in the next sections.
Use the Biplot module in SAS/IML
The syntax of the Biplot module is similar to the %BIPLOT macro for most arguments. The input arguments are as follows:
- X: The numerical data matrix
- ID: A character vector of values used to label rows of X. If you pass in an empty matrix, observation numbers are used to label the markers. This argument is ignored if labelPoints=0.
- varNames: A character vector that contains the names of the columns of X.
- FacType: The type of biplot: 'GH', 'COV', 'JK', or 'SYM'.
- StdMethod: How the original variables are scaled: 'None', 'Mean', or 'Std'.
- Scale: A numerical scalar that specifies additional scaling applied to vectors. By default, SCALE=1, which means the vectors are not scaled. To shrink the vectors, specify a value less than 1. To lengthen the vectors, specify a value greater than 1. (Note: The %BIPLOT macro uses SCALE=0 as its default.)
- labelPoints: A binary 0/1 value. If 0 (the default) points are not labeled. If 1, points are labeled by the ID values. (Note: The %BIPLOT macro always labels points.)
The last two arguments are optional. You can specify them as keyword-value pairs outside of the parentheses. The following examples show how you can call the Biplot module in a SAS/IML program to create a biplot:
ods graphics / width=480px height=480px; proc iml; /* assumes the modules have been previously stored */ load module=(CalcBiplot WriteBiplot Biplot); use sashelp.iris; read all var _NUM_ into X[rowname=Species colname=varNames]; close; title "COV Biplot with Scaled Vectors and Labels"; run Biplot(X, Species, varNames, "COV", "Std") labelPoints=1; /* label obs */ title "JK Biplot: Relationships between Observations"; run Biplot(X, NULL, varNames, "JK", "Std"); title "JK Biplot: Automatic Scaling of Vectors"; run Biplot(X, NULL, varNames, "JK", "Std") scale=0; /* auto scale; empty ID var */ title "SYM Biplot: Vectors Scaled by 0.25"; run Biplot(X, NULL, varNames, "SYM", "Std") scale=0.25; /* scale vectors by 0.25 */
The program creates four biplots, but only the last one is shown. The last plot uses the SCALE=0.25 option to rescale the vectors of the SYM biplot. You can compare this biplot to the SYM biplot in the previous section, which did not rescale the length of the vectors.
Use the WriteBiplot module in SAS/IML
If you prefer to write an output data set and then create the biplot yourself, use the WriteBiplot module. After loading the modules and the data (see the previous section), you can write the biplot coordinates to the _Scores and _Vectors data sets, as follows. A simple DATA step appends the two data sets into a form that is easy to graph:
run WriteBiplot(X, NULL, varNames, "JK", "Std") scale=0; /* auto scale vectors */ QUIT; data Biplot; set _Scores _Vectors; /* append the two data sets created by the WriteBiplot module */ run; title "JK Biplot: Automatic Scaling of Vectors"; title2 "FacType=JK; Std=Std"; proc sgplot data=Biplot aspect=1 noautolegend; refline 0 / axis=x; refline 0 / axis=y; scatter x=Prin1 y=Prin2 / ; vector x=vx y=vy / datalabel=_Variable_ lineattrs=GraphData2 datalabelattrs=GraphData2; xaxis grid offsetmin=0.1 offsetmax=0.1 min=&minAxis max=&maxAxis; yaxis grid min=&minAxis max=&maxAxis; run;
In the program that accompanies this article, there is an additional example in which the biplot data is merged with the original data so that you can color-code the observations by using the Species variable.
This article shows four ways to use modern ODS statistical graphics to create a biplot in SAS. You can create a COV biplot by using the PRINQUAL procedure. If you have a license for SAS/IML and SAS/GRAPH, you can use Friendly's %BIPLOT macro to write the biplot coordinates to a SAS data set, then use PROC SGPLOT to create the biplot. This article also presents SAS/IML modules that compute the same biplots as the %BIPLOT macro. The WriteBiplot module writes the data to two SAS data sets (_Score and _Vector), which can be appended and used to plot a biplot. This gives you complete control over the attributes of the biplot. Or, if you prefer, you can use the Biplot module in SAS/IML to automatically create biplots that are similar to Friendly's but are displayed by using ODS statistical graphics.
You can download the complete SAS program that is used in this article. For convenience, I have also created a separate file that defines the SAS/IML modules that create biplots.
The t-test is a very useful test that compares one variable (perhaps blood pressure) between two groups. T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis. There are several SAS Studio tasks that include options to test this assumption. Let's use the t-test task as an example.
You start by selecting:
Tasks and Utilities □ Tasks □ Statistics □ t Tests
On the DATA tab, select the Cars data set in the SASHELP library. Next request a Two-sample test, with Horsepower as the Analysis variable and Cylinders as the Groups variable. Use a filter to include only 4- or 6-cylinder cars. It should look like this:
On the OPTIONS tab, check the box for Tests for normality as shown below.
All the tests for normality for both 4-cylinder and 6-cylinder cars reject the null hypothesis that the data values come from a population that is normally distributed. (See the figure below.)
Should you abandon the t-test results and run a nonparametric test analysis such as a Wilcoxon Rank Sum test that does not require normal distributions?
This is the point where many people make a mistake. You cannot simply look at the results of the tests for normality to decide if a parametric test is valid or not. Here is the reason: When you have large sample sizes (in this data set, there were 136 4-cylinder cars and 190 6-cylinder cars), the tests for normality have more power to reject the null hypothesis and often result in p-values less than .05. When you have small sample sizes, the tests for normality will not be significant unless there are drastic departures from normality. It is with small sample sizes where departures from normality are important.
The bottom line is that the tests for normality often lead you to make the wrong decision. You need to look at the distributions and decide if they are somewhat symmetrical. The central limit theory states that the sampling distribution of means will be normally distributed if the sample size is sufficiently large. "Sufficiently large" is a judgment call. If the distribution is symmetrical, you may perform a t-test with sample sizes as small as 10 or 20.
The figure below shows you the distribution of horsepower for 4- and 6-cylinder cars.
With the large sample sizes in this data set, you should feel comfortable in using a t-test. The results, shown below, are highly significant.
If you are in doubt of your decision to use a parametric test, feel free to check the box for a nonparametric test on the OPTIONS tab. Running a Wilcoxon Rank Sum test (a nonparametric alternative to a t-test), you also find a highly significant difference in horsepower between 4- and 6-cylinder cars. (See the figure below.)
You can read more about assumptions for parametric tests in my new book, A Gentle Introduction to Statistics Using SAS Studio.
For a sneak preview check out the free book excerpt. You can also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts. To do so, make sure to subscribe to the newsletter.
Testing the Assumption of Normality for Parametric Tests was published on SAS Users.
US military veterans are mission-focused, team-oriented and natural leaders that benefit any organization that hires them. SAS has many programs to help US military veterans find jobs within the company and elsewhere. SAS also works with veterans organizations to use their data to help transitioning military members and their spouses [...]
Analytics helping veterans make the transition to civilian life was published on SAS Voices by Trent Smith
In grade school, students learn how to round numbers to the nearest integer. In later years, students learn variations, such as rounding up and rounding down by using the greatest integer function and least integer function, respectively. My sister, who is an engineer, learned a rounding method that rounds half-integers to the nearest even number. This method is called the round-to-even method. (Other names include the round-half-to-even method, the round-ties-to-even method, and "bankers' rounding.") When people first encounter the round-to-even method, they are often confused. Why would anyone use such a strange rounding scheme? This article describes the round-to-even method, explains why it is useful, and shows how to use SAS software to apply the round-to-even method.
What is the round-to-even method of rounding?
The round-to-even method is used in engineering, finance, and computer science to reduce bias when you use rounded numbers to estimate sums and averages. The round-to-even method works like this:
- If the difference between the number and the nearest integer is less than 0.5, round to the nearest integer. This familiar rule is used by many rounding methods.
- If the difference between the number and the nearest integer is exactly 0.5, look at the integer part of the number. If the integer part is EVEN, round towards zero. If the integer part of the number is ODD, round away from zero. In either case, the rounded number is an even integer.
All rounding functions are discontinuous step functions that map the real numbers onto the integers. The graph of the round-to-even function is shown to the right.
I intentionally use the phrase "round away from zero" instead of "round up" because you can apply the rounding to positive or negative numbers. If you round the number -2.5 away from zero, you get -3. If you round the number -2.5 towards zero, you get -2. However, for simplicity, the remainder of the article uses positive numbers.
Examples of the round-to-even method of rounding
For the number 2.5, which integer is closest to it? Well, it's a tie: 2.5 is a "half-integer" that is just as close to 2 as it is to 3. So which integer should you choose as THE rounded value? The traditional rounding method rounds the midpoint between two integers away from zero, so 2.5 is traditionally rounded to 3. This produces a systematic bias: all half-integers are rounded away from zero. This fact leads to biased estimates when you use the rounded data in an analysis.
To reduce this systematic bias, you can use the round-to-even method, which rounds some half-integers away from zero and others towards zero. For the round-to-even method, the number 2.5 rounds down to 2, whereas the number 3.5 rounds up to 4.
The table at the right shows some decimal values and the results of rounding the values under the standard method and the round-to-even method. The second column (Round(x)) shows the result of the traditional rounding method where all half-integers are rounded away from 0. The third column (RoundE(x)) is the round-to-even method that rounds half-integers to the nearest even integer. The red boxes indicate numbers for which the Round and RoundE functions produce different answers. Notice that for the round-to-even method, 50% of the half-integers round towards 0 and 50% round away from 0. In the table, 0.5 and 2.5 are rounded down by the round-to-even method, whereas 1.5 and 3.5 are rounded up.
Why use the round-to-even method of rounding?
The main reason to use the round-to-even method is to avoid systematic bias when calculating with rounded numbers. One application involves mental arithmetic. If you want to estimate the sum (or mean) of a list of numbers, you can mentally round the numbers to the nearest integer and add up the numbers in your head. The round-to-even method helps to avoid bias in the estimate, especially if many of the values are half-integers.
Most computers use the round-to-even method for numerical computations. The round-to-even method has been a part of the IEEE standards for rounding since 1985.
How to use the round-to-even method in SAS?
SAS software supports the ROUND function for standard rounding of numbers and the ROUNDE function ('E' for 'even') for round-to-even rounding. For example, the following DATA step produces the table that is shown earlier in this article:
data Seq; keep x Round RoundEven; label Round = "Round(x)" RoundEven="RoundE(x)"; do x = 0 to 3.5 by 0.25; Round = round(x); /* traditional: half-integers rounded away from 0 */ RoundEven = rounde(x); /* round half-integers to the nearest even integer */ output; end; run; proc print data=Seq noobs label; run;
An application: Estimate the average length of lengths with SAS
Although the previous sections discuss rounding values like 0.5 to the nearest integer, the same ideas apply when you round to the nearest tenth, hundredth, thousandth, etc. The next example rounds values to the nearest tenth. Values like 0.95, 1.05, 1.15, etc., are equidistant from the nearest tenth and can be rounded up or down, depending on the rounding method you choose. In SAS, you can use an optional argument to the ROUND and ROUNDE functions to specify the unit to which you want to round. For example, the expression ROUND(x, 0.1) rounds x to the nearest tenth.
An example in the SAS documentation for PROC UNIVARIATE contains the effective channel length (in microns) for 425 transistors from "Lot 1" of a production facility. In the data set, the lengths are recorded to two decimal places. What would be the impact on statistical measurements if the engineer had been lazy and decided to round the measurements to one decimal place, rather than typing all those extra digits?
The following SAS DATA step rounds the data to one decimal place (0.1 microns) by using the ROUND and ROUNDE functions. The call to PROC MEANS computes the mean and sum of the unrounded and rounded values. For the full-precision data, the estimate of the mean length is 1.011 microns. If you round the data by using the standard rounding method, the estimate shoots up to 1.018, which overestimates the average. In contrast, if you round the data by using the round-to-even method, the estimate is 1.014, which is closer to the average of the unrounded numbers (less biased). Similarly, the Sum column shows that the sum of the round-to-even values is much closer to the sum of the unrounded values.
/* round real data to the nearest 0.1 unit */ data rounding; set Channel1; Round = round(Length, 0.1); /* traditional: round to nearest tenth */ RoundEven = rounde(Length, 0.1); /* use round-to-even method to round to nearest tenth */ /* create a binary indicator variable: Was x rounded up or down? */ RoundUp = (Round > Length); /* 1 if rounded up; 0 if rounded down */ RoundEvenUp = (RoundEven > Length); run; proc means data=rounding sum mean ndec=3; label Length="True values" Round ="Rounded values" RoundEven="Round-to-even values"; var Length Round RoundEven; run;
As mentioned earlier, when you use the traditional rounding method, you introduce a bias every time you encounter a "half-unit" datum such as 0.95, 1.05, or 1.15. For this real data, you can count how many data were rounded up versus rounded down by each method. To get an unbiased result, you should round up the half-unit data about as often as you round down. The following call to PROC MEANS shows the proportion of data that are rounded up and rounded down by each method. The output shows that about 55% of the data are rounded up by the traditional rounding method, whereas a more equitable 50.1% of the values are rounded up by the round-to-even method.
proc means data=rounding mean ndec=3; label RoundUp = "Proportion rounded up for ROUND" RoundEvenUp= "Proportion rounded up for ROUNDE"; var RoundUp RoundEvenUp; run;
This example illustrates a general truism: The round-to-even method is a less biased way to round data.
This article explains the round-to-even method of rounding. This method is not universally taught, but it is taught to college students in certain disciplines. The method rounds most numbers to the nearest integer. However, half-integers, which are equally close to two integers, are rounded to the nearest EVEN integer, thus giving the method its name. This method reduces the bias when you use rounded values to estimate quantities such as sums, means, standard deviations, and so forth.
Whereas SAS provides separate ROUND and ROUNDE functions, other languages might default to the round-to-even method. For example, the round() function in python and the round function in R both implement the round-to-even method. Because some users are unfamiliar with this method of rounding, the R documentation provides an example and then explicitly states, "this is *good* behaviour -- do *NOT* report it as bug!"
Six editions is a lot! If you had told us, back when we wrote the first edition of The Little SAS Book, that someday we would write a sixth; we would have wondered how we could possibly find that much to say. After all, it is supposed to be The Little SAS Book, isn’t it? But the developers at SAS Institute are constantly hard at work inventing new and better ways of analyzing and visualizing data. And some of those ways turn out to be so fundamental that they belong even in a little book about SAS.
One of the biggest changes to SAS software in recent years is the proliferation of interfaces. SAS programmers have more choices than ever before. Previous editions contained some sections specific to the SAS windowing environment (also called Display Manager). We wrote this edition for all SAS programmers whether you use SAS Studio, SAS Enterprise Guide, the SAS windowing environment, or run in batch. That sounds easy, but it wasn’t. There are differences in how SAS behaves with different interfaces, and these differences can be very fundamental. In particular, the system option that sets the rules for names of variables varies depending on how you run SAS. So old sections had to be rewritten, and we added a whole new section showing how to use variable names containing blanks and special characters.
New ways to read and write Microsoft Excel files
Previous editions already covered how to read and write Microsoft Excel files, but SAS developers have created some great new ways. This edition contains new sections about the XLSX LIBNAME engine and the ODS EXCEL destination.
More PROC SQL
From the very first edition, The Little SAS Book always covered PROC SQL. But it was in an appendix and over time we noticed that most people ignore appendices. So for this edition, we removed the appendix and added new sections on using PROC SQL to
- Subset your data
- Join data sets
- Add summary statistics to a data set
- Create macro variables with the INTO clause
For people who are new to SQL, these sections provide a good introduction; for people who already know SQL, they provide a model of how to leverage SQL in your SAS programs.
Updates and additions throughout the book
Almost every section in this edition has been changed in some way. We added new options, made sure everything is up-to-date, and ran every example in every SAS interface noting any differences. For example, PROC SGPLOT has some new options, the default ODS style for PDF has changed, and the LISTING destination behaves differently in different interfaces. Here’s a short list, in no particular order, of new or expanded topics in the sixth edition:
- More examples with permanent SAS data sets, CSV files, or tab-delimited files
- More log notes throughout the book showing what to look for
- LIKE or sounds-like (=*) operators in WHERE statements
- CROSSLIST, NOCUM, and NOPRINT options in PROC FREQ
- Grouping data with a user-defined format and the PUT function
- Iterative DO groups
- DO WHILE and DO UNTIL statements
- %DO statements
Even though we have added a lot to this edition, it is still a little book. In fact, this edition is shorter than the last—by twelve pages! We think this is the best edition yet.
I have been programming SAS for a LONG time and have never seen much in the way of programming standards. For example, most SAS programmers indent DATA and PROC statements (I like three spaces). Most programmers do not like to see more than one statement on a line and most agree that there should be blank lines between program boundaries (DATA and PROC steps).
I thought I would share some of my thoughts on programming standards, with the hope that others will chime in with their ideas.
• I like to indent all the statements in a DO group or DO loop. If there are nested groups, each one gets indented as well.
• I prefer variable names in proper case.
• I am not a fan of camel-case. For example, I prefer Weight_Kg to WeightKg. The reason that some programmers like camel-case is that SAS will automatically split a variable name at a capital letter in some headings.
• I like my TITLE statements in open code, not inside a PROC. To me, that makes sense because TITLE statements are global.
• There should be no conversion messages (character to numeric or numeric to character) in the SAS log. For example use Num = INPUT(Char_Num,12.); instead of Num = 1*Char_Num;. The latter statement forces an automatic character to numeric conversion and places a message in the log.
• I always use the statement ODS NOPROCTITLE;. This eliminates the default SAS procedure name at the top of the output.
• Although fewer and fewer people are reading raw text data, I like my @ signs to all line up in my INPUT statement.
• I like to use the /* and */ comments to define all macro variables. For example:
Notice that I prefer named parameters in my macros, instead of positional parameters.
If this seems like too much work - SAS Studio has an automatic formatting tool that can help standardize your programs. For example, look at the code below:
Really ugly, right? Here is how you can use the automatic formatting tool in SAS Studio.
When you click this icon, the program now looks like this:
That’s pretty much the way I would write it. By the way, if you don't like how Studio formatted your code, enter a control-z to undo it.
For more tips on writing code and how to get started in SAS Studio – check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts make sure to subscribe to the SAS Books newsletter.
Principal component analysis (PCA) is an important tool for understanding relationships in multivariate data. When the first two principal components (PCs) explain a significant portion of the variance in the data, you can visualize the data by projecting the observations onto the span of the first two PCs. In a PCA, this plot is known as a score plot. You can also project the variable vectors onto the span of the PCs, which is known as a loadings plot. See the article "How to interpret graphs in a principal component analysis" for a discussion of the score plot and the loadings plot.
A biplot overlays a score plot and a loadings plot in a single graph. An example is shown at the right. Points are the projected observations; vectors are the projected variables. If the data are well-approximated by the first two principal components, a biplot enables you to visualize high-dimensional data by using a two-dimensional graph.
In general, the score plot and the loadings plot will have different scales. Consequently, you need to rescale the vectors or observations (or both) when you overlay the score and loadings plots. There are four common choices of scaling. Each scaling emphasizes certain geometric relationships between pairs of observations (such as distances), between pairs of variables (such as angles), or between observations and variables. This article discusses the geometry behind two-dimensional biplots and shows how biplots enable you to understand relationships in multivariate data.
Some material in this blog post is based on documentation that I wrote in 2004 when I was working on the SAS/IML Studio product and writing the SAS/IML Studio User's Guide. The documentation is available online and includes references to the literature.
The Fisher iris data
A previous article shows the score plot and loadings plot for a PCA of Fisher's iris data. For these data, the first two principal components explain 96% of the variance in the four-dimensional data. Therefore, these data are well-approximated by a two-dimensional set of principal components. For convenience, the score plot (scatter plot) and the loadings plot (vector plot) are shown below for the iris data. Notice that the loadings plot has a much smaller scale than the score plot. If you overlay these plots, the vectors would appear relatively small unless you rescale one or both plots.
The mathematics of the biplot
You can perform a PCA by using a singular value decomposition of a data matrix that has N rows (observations) and p columns (variables). The first step in constructing a biplot is to center and (optionally) scale the data matrix. When variables are measured in different units and have different scales, it is usually helpful to standardize the data so that each column has zero mean and unit variance. The examples in this article use standardized data.
The heart of the biplot is the singular value decomposition (SVD). If X is the centered and scaled data matrix, then the SVD of X is
X = U L V`
where U is an N x N orthogonal matrix, L is a diagonal N x p matrix, and V is an orthogonal p x p matrix. It turns out that the principal components (PCs) of X`X are the columns of V and the PC scores are the columns of U. If the first two principal components explain most of the variance, you can choose to keep only the first two columns of U and V and the first 2 x 2 submatrix of L. This is the closest rank-two approximation to X. In a slight abuse of notation,
X ≈ U L V`
where now U, L, and V all have only two columns.
Since L is a diagonal matrix, you can write L = Lc L1-c for any number c in the interval [0, 1]. You can then write
X ≈ (U Lc)(L1-c V`)
= A B
This the factorization that is used to create a biplot. The most common choices for c are 0, 1, and 1/2.
The four types of biplots
The choice of the scaling parameter, c, will linearly scale the observations and vectors separately. In addition, you can write X ≈ (β A) (B / β) for any constant β. Each choice for c corresponds to a type of biplot:
- When c=0, the vectors are represented faithfully. This corresponds to the GH biplot. If you also choose β = sqrt(N-1), you get the COV biplot.
- When c=1, the observations are represented faithfully. This corresponds to the JK biplot.
- When c=1/2, the observations and vectors are treated symmetrically. This corresponds to the SYM biplot.
The GH biplot for variables
If you choose c = 0, then A = U and B = L V`. The literature calls this biplot the GH biplot. I call it the "variable preserving" biplot because it provides the most faithful two-dimensional representation of the relationship between vectors. In particular:
- The length of each vector (a row of B) is proportional to the variance of the corresponding variable.
- The Euclidean distance between the i_th and j_th rows of A is proportional to the Mahalanobis distance between the i_th and j_th observations in the data.
In preserving the lengths of the vectors, this biplot distorts the Euclidean distance between points. However, the distortion is not arbitrary: it represents the Mahalanobis distance between points.
The GH biplot is shown to the right, but it is not very useful for these data. In choosing to preserve the variable relationships, the observations are projected onto a tiny region near the origin. The next section discusses an alternative scaling that is more useful for the iris data.
The COV biplot
If you choose c = 0 and β = sqrt(N-1), then A = sqrt(N-1) U and B = L V` / sqrt(N-1). The literature calls this biplot the COV biplot. This biplot is shown at the top of this article. It has two useful properties:
- The length of each vector is equal to the variance of the corresponding variable.
- The Euclidean distance between the i_th and j_th rows of A is equal to the Mahalanobis distance between the i_th and j_th observations in the data.
In my opinion, the COV biplot is usually superior to the GH biplot.
The JK biplot
If you choose c = 1, you get the JK biplot, which preserves the Euclidean distance between observations. Specifically, the Euclidean distance between the i_th and j_th rows of A is equal to the Euclidean distance between the i_th and j_th observations in the data.
In faithfully representing the observations, the angles between vectors are distorted by the scaling.
The SYM biplot
If you choose c = 1/2, you get the SYM biplot (also called the SQ biplot), which attempts to treat observations and variables in a symmetric manner. Although neither the observations nor the vectors are faithfully represented, often neither representation is very distorted. Consequently, some people prefer the SYM biplot as a compromise between the COV and JK biplots. The SYM biplot is shown in the next section.
How to interpret a biplot
As discussed in the SAS/IML Studio User's Guide, you can interpret a biplot in the following ways:
- The cosine of the angle between a vector and an axis indicates the importance of the contribution of the corresponding variable to the principal component.
- The cosine of the angle between pairs of vectors indicates correlation between the corresponding variables. Highly correlated variables point in similar directions; uncorrelated variables are nearly perpendicular to each other.
- Points that are close to each other in the biplot represent observations with similar values.
- You can approximate the relative coordinates of an observation by projecting the point onto the variable vectors within the biplot. However, you cannot use these biplots to estimate the exact coordinates because the vectors have been centered and scaled. You could extend the vectors to become lines and add tick marks, but that becomes messy if you have more than a few variables.
If you want to faithfully interpret the angles between vectors, you should equate the horizontal and vertical axes of the biplot, as I have done with the plots on this page.
If you apply these facts to the standardized iris data, you can make the following interpretations:
- The PetalLength and PetalWidth variables are the most important contributors to the first PC. The SepalWidth variable is the most important contributor to the second PC.
- The PetalLength and PetalWidth variables are highly correlated. The SepalWidth variable is almost uncorrelated with the other variables.
- Although I have suppressed labels on the points, you could label the points by an ID variable or by the observation number and use the relative locations to determine which flowers had measurements that were most similar to each other.
This article presents an overview of biplots. A biplot is an overlay of a score plot and a loadings plot, which are two common plots in a principal component analysis. These two plots are on different scales, but you can rescale the two plots and overlay them on a single plot. Depending upon the choice of scaling, the biplot can provide faithful information about the relationship between variables (lengths and angles) or between observations (distances). It can also provide approximates relationships between variables and observations.
In a subsequent post, I will show how to use SAS to create the biplots in this article.
After you start to use arrays, it is easy to see their usefulness and power. In this post, I will share tips and techniques that can increase the functionality of arrays and enable you to use programming shortcuts to take maximum advantage of them.
These tips will also streamline your programming, even if you are not using arrays. I'll provide complete code examples for each tip at the end of each section, as well as explain how the code is constructed along the way.
Use arrays to zoom out for greater perspective
One of the biggest uses of arrays, of course, is to reshape your data from one observation per identifying variable per data point, to one observation per ID containing all the data points for that ID. For example, consider the following data set:
Now, you want the values of the EVENT variable to become the names of the variables in the output data set, as shown here:
How do you know which variables to create if your data set is more complicated than shown above?
Create a macro variable
You can use the SQL procedure to create a macro variable (called EVENTLIST here) that contains all the unique values of EVENT. You then can use that macro variable in your array definition statement:
proc sql noprint; select distinct event into :eventlist separated by ' ' from work.events; quit;
When you run the following statement, the resulting log shows that &EVENTLIST contains the unique values for the EVENT variable:
Here is the log information:
3365 %put &eventlist;
aaa bbb ccc ddd
You can then use the EVENTLIST macro variable in an ARRAY statement to define the variables that you are including:
array events_[*] &eventlist;
Use CALL MISSING to set all variables
However, you do not want the values for the previous ID to contaminate the next ID, so you need to reset the array variables to missing with each new ID. You can use a DO loop that uses FIRST.variable and LAST.variable processing in order to set each value to missing:
if first.id then do; do i = 1 to dim(events_); events_[i] = .; end; end;
DO loops were commonly used in the past to initialize all elements of an array to missing. And it still needs to be used if you want to set all the elements to a non-missing value, such as 0. However, you can replace this code with one function call. The CALL MISSING routine sets all the variables in the array to missing in one statement:
if first.id then call missing(of events_[*]);
Compare via OF variable-list syntax
The OF variable-list syntax is another helpful feature. You can use OF with a list of variables or an array reference.
So, now you have the EVENT values as variable names. How are you going to compare the value of a variable in an observation with a variable name in the array? You can reference each element of the array that you created with EVENTS_[i], but that will return the value of that element. For this comparison, you need to obtain the variable name for each element in the array.
To return the name of each variable in the array, you can use the VNAME function:
if event = vname(events_[i]) then do;
Now you can find when the value of the EVENT variable in the original data set matches the name of the variable in the array.
Additional ways to extract variables information in arrays
Other variable information functions can also be used in the same way to extract information about each of the variables in the array.
In this example, a value that is read from each observation can match only one variable name in the array, so you want to stop when you achieve that instead of continuing the DO loop.
To stop the DO loop at that point, use the LEAVE statement, which stops processing the current loop and resumes with the next statement in the DATA step.
Because you want to output only one observation per ID, you must explicitly use the OUTPUT statement at the end of each ID to override the default output that occurs at the end of each iteration of the DATA step:
if last.id then output;
Combine above methods with DATA Step
Putting all these methods together, here is a short DATA Step that will reshape the original data set into the output data set that you want:
data events_out (drop =i event); set events; by id; array events_[*] &eventlist; retain &eventlist; if first.id then call missing(of events_[*]); do i = 1 to dim(events_); if event = vname(events_[i]) then do; events_[i] = 1; leave; end; end; if last.id then output; run;
Bonus shortcuts using arrays
Want to use even more shortcuts with arrays? Here, your original data has a numeric variable called RATING, which is included in every observation:
You want to find the lowest rating and corresponding event and also the highest rating and corresponding event for each ID. You can start off by defining the arrays and reading each observation in the same way as in the previous example. In this case, you have two arrays, so you need to read both the EVENTS and RATINGS variables into their own arrays:
array events_ $3.;
…more SAS statements…
After you have read all the observations for an ID, how do you find the lowest rating? Of course, you can use a DO loop to loop through the array and check each value to see whether it is the lowest and keep track of which one it is. Or you can use the OF array syntax with the MIN and MAX functions:
This code returns the value of the lowest and highest ratings for ID 1, which are 3 and 9, respectively. But you also want to know which events were rated the highest and lowest for this ID. First, you have to determine which elements in the RATINGS_ array had the lowest and highest values. You know the values, but you do not yet know which specific elements match those values. Use the WHICHN function to determine the indexes for those values in the array. Keep in mind that if there is more than one element that matches the value using the WHICHN function, the function returns only the index of the first element in the array to match:
min_index=whichn(lowest_rating, of ratings_[*]);
max_index=whichn(highest_rating, of ratings_[*]);
For ID 1, these functions return 3 and 5, respectively. Now you can use the indexes to retrieve the corresponding elements in the EVENTS_ array:
Putting all these methods together, here is the complete code for returning the desired results:
data rating (keep = id lowest_rated_event lowest_rating highest_rated_event highest_rating); set events; by id; array ratings_; array events_ $3.; retain ratings: events:; if first.id then call missing(of ratings_[*],of events_[*], count); count + 1; ratings_[count] = rating; events_[count]=event; if last.id then do; lowest_rating = min(of ratings_[*]); highest_rating = max(of ratings_[*]); min_index =whichn(lowest_rating,of ratings_[*]); max_index=whichn(highest_rating, of ratings_[*]); lowest_rated_event=events_[min_index]; highest_rated_event=events_[max_index]; output; end; run;
Here is the resulting output:
You can expand the functionality and capability of arrays by incorporating other features of the SAS programming language in your DATA Step code. I hope these tips and links improve your array use and lead you to explore even more ways to work with arrays. The resource below complements this post and provides additional tips and tricks.Adventures in Arrays | Learn more about array processing!