February 17, 2020
 

Administrators like to be able to keep track of resource usage and who is using what in a system. When an administrator has this capability, they can look out for issues of high resource usage that may have an impact on overall system performance. In Viya, data is accessed in caslibs. In this post, I will show you how an administrator can track and control resource usage of personal caslibs.

A caslib is an in-memory space that holds tables, access control lists, and data source information. As we have discussed CAS and caslibs in GEL enablement classes, one of the biggest asks has been about personal caslibs and how an administrator can keep track of the resources they use. Until recently we didn't have a great answer; the good news is that now we do.

Personal caslibs are, by default, the active caslib when a user starts a CAS session. This gives each user a default location to access and load data (sort of like a home directory). As the name suggests, they are personal and only the user who starts the session can access the data. In early releases, administrators saw this as a big problem because personal caslibs were basically invisible to them. There was no way to monitor what was going on with a personal caslib, leaving the system open to issues where one user could consume a lot of resources and have an adverse impact.

Introduced in Viya 3.4, the accessControl.accessPersonalCaslibs action brings all existing personal Caslibs into a session where an administrator has assumed the data or superuser role.

Running the accessControl.accessPersonalCaslibs action has the following effects. The administrator can:

  • See all personal caslibs that existed at the time the action was run
  • View promoted tables characteristics within the personal caslibs
  • Drop promoted tables within other users’ personal caslibs.

This elevation of access remains in effect for the duration of the session. The action does not give an administrator full access to personal caslibs: an administrator can never fetch data from other users' personal caslibs, drop a personal caslib, set access controls on a personal caslib, or set access controls on any table in a personal caslib. What it does provide is a view into personal caslibs so that the administrator can monitor and troubleshoot their resource usage.

Let's look at how it works. Logged into Viya as a Viya administrator (by default also a CAS administrator), I can use the table.caslibinfo action to see all the caslibs that the administrator has permission to view. In the output below, I see my own personal caslib and all the other caslibs that the administrator has permission (set by the CAS authorization system) to view.

cas mysess;
proc cas;
table.caslibinfo;
quit;
cas mysess terminate;

In the code below, the super user role is assumed for this session (SAS Administrators by default can also assume the super user role in CAS). With the role assumed, the administrator can execute the accessControl.accessPersonalCaslibs action and the subsequent table.caslibinfo action returns all caslibs including any personal caslibs that existed when the session started.

cas adminsession;
proc cas;
/* need to be a super user or data administrator */
accessControl.assumeRole / adminRole="superuser";
accessControl.accessPersonalCaslibs;
table.caslibinfo ;
quit;
cas adminsession terminate;

That helps, but what about the details? How can the administrator see how much memory and disk space the tables in a personal caslib are using? To get the details, we can access an individual caslib and its tables and, for each table, execute the table.tabledetails action. The program below loops over all the personal caslibs and, for each caslib, loops over its in-memory tables and executes the table.tabledetails action. The output of tabledetails gives an idea of how much memory and disk space a table is using.

cas adminsession;
proc cas;
/* need to be a super user */
accessControl.assumeRole / adminRole="superuser";
accessControl.accessPersonalCaslibs;
table.caslibinfo result=fileresult;
casliblist=findtable(fileresult);

/* loop caslibs */
do cvalue over casliblist;

   if cvalue.name==: 'CASUSER' then
   do; /* only look at caslibs that contain CASUSER */

      table.tableinfo result=tabresult / caslib=cvalue.name;

      tablelist=findtable(tabresult);
      x=dim(tablelist);

      if x>1 then
      do; /* there are tables available */

         do tvalue over tablelist; /* loop all tables in the caslib */

            table.tabledetails / caslib=cvalue.name name=tvalue.name;
            table.tableinfo / caslib=cvalue.name name=tvalue.name;

         end; /* loop all tables in the caslib */
      end; /* there are tables available */
   end; /* only look at caslibs that contain CASUSER */
end; /* loop caslibs */

accessControl.dropRole / adminRole="superuser";
quit;
cas adminsession terminate;

Two fields that can give an administrator an idea of how big a table is are:

  • Data size: the size of the SAS Dataset in memory
  • Memory Mapped: the part of the data that has been memory-mapped and is backed by memory-mapped files in the CAS disk cache.

The table below shows the output for one user's personal caslib.

If one table in particular is causing problems, it is possible for the administrator to drop the table from memory.

cas adminsession;
proc cas;
 
accessControl.assumeRole / adminRole="superuser";
accessControl.accessPersonalCaslibs;
sessionProp.setSessOpt / caslib="CASUSER(gatedemo499)";
table.droptable / name="TRAIN";
 
quit;
cas adminsession terminate;

Visibility into personal caslibs will be a big help to Viya administrators monitoring CAS resource usage. Check out the following for more details:

Viya administrators can now get personal with users' Caslibs was published on SAS Users.

February 17, 2020
 

A previous article shows how to interpret the collinearity diagnostics that are produced by PROC REG in SAS. The process involves scanning down numbers in a table in order to find extreme values. This can be a tedious and error-prone process. Friendly and Kwan (2009) compare this task to a popular picture book called Where's Waldo? in which children try to find one particular individual (Waldo) in a crowded scene that involves hundreds of people. The game is fun for children, but less fun for a practicing analyst who is trying to discover whether a regression model suffers from severe collinearities in the data.

Friendly and Kwan suggest using visualization to turn a dense table of numbers into an easy-to-read graph that clearly displays the collinearities, if they exist. Friendly and Kwan (henceforth, F&K) suggest several different useful graphs. I decided to implement a simple graph (a discrete heat map) that is easy to create and enables the analyst to determine whether there are collinearities in the data. One version of the collinearity diagnostic heat map is shown below. (Click to enlarge.) For comparison, the table from my previous article is shown below it. The highlighted cells in the table were added by me; they are not part of the output from PROC REG.


Visualization principles

There are two important sets of elements in a collinearity diagnostics table. The first is the set of condition indices, which are displayed in the leftmost column of the heat map. The second is the set of cells that show the proportion of variance explained by each row. (However, only the rows that have a large condition index are important.) F&K make several excellent points about the collinearity diagnostic table:

  • Display order: In a table, the important information is in the bottom rows. It is better to reverse-sort the table so that the largest condition indices (the important ones) are at the top.
  • Condition indices: A condition number between 20 and 30 is starting to get large (F&K use 10-30). An index over 30 is generally considered large and an index that exceeds 100 is "a sign of potential disaster in estimation" (p. 58). F&K suggest using "traffic lighting" (green, yellow, and red) to color the condition indices by the severity of the collinearity. I modified their suggestion to include an orange category.
  • Proportion of variance: F&K note that "the variance proportions corresponding to small condition numbers are completely irrelevant" (p. 58) and also that tables print too many decimals. "Do we really care that [a] variance proportion is 0.00006088?" Of course not! Therefore we should only display the large proportions. F&K also suggest displaying a percentage (instead of a proportion) and rounding the percentage to the nearest integer.

A discrete heat map to visualize collinearity diagnostics

There are many ways to visualize the Collinearity Diagnostics table. F&K use traffic lighting for the condition numbers and a bubble plot for the proportion of variance entries. Another choice would be to use a panel of bar charts for the proportion of variance. However, I decided to use a simple discrete heat map. The following list describes the main steps to create the plot; a partial sketch of the first three steps appears after the list. You can download the complete SAS program that creates the plot and modify it (if desired) to use with your own data. For each step, I link to a previous article that describes more details about how to perform the step.

  1. Use the ODS OUTPUT statement to save the Collinearity Diagnostics table to a data set.
  2. Use PROC FORMAT to define a format. The format converts the table values into discrete values. The condition indices are in the range [1, ∞) whereas the values for the proportion of variance are in the range [0, 1). Therefore you can use a single format that maps these values into 'low', 'medium', and 'high' values.
  3. The HEATMAPPARM statement in PROC SGPLOT is designed to work with data in "long format." Therefore convert the Collinearity Diagnostics data set from wide form to long form.
  4. Create a discrete attribute map that maps categories to colors.
  5. Use the HEATMAPPARM statement in PROC SGPLOT to create a discrete heat map that visualizes the collinearity diagnostics. Overlay (rounded) values for the condition indices and the important (relatively large) values of the proportion of variance.

The discrete heat map enables you to draw the same conclusions as the original collinearity diagnostics table. However, whereas using the table is akin to playing "Where's Waldo," the heat map makes it apparent that the most severe collinearity (top row; red condition index) is between the RunPulse and MaxPulse variables. The second most severe collinearity (second row from top; orange condition index) is between the Intercept and the Age variable. None of the remaining rows have two or more large cells for the proportion of variance.

You can download the SAS program that creates the collinearity plot. It would not be hard to turn it into a SAS macro, if you intend to use it regularly.

References

Friendly, M., & Kwan, E. (2009). "Where's Waldo? Visualizing collinearity diagnostics." The American Statistician, 63(1), 56-65. https://doi.org/10.1198/tast.2009.0012

The post Visualize collinearity diagnostics appeared first on The DO Loop.

February 14, 2020
 

In honor of Valentine’s day, we thought it would be fitting to present an excerpt from a paper about the LIKE operator because when you like something a lot, it may lead to love! If you want more, you can read the full paper “Like, Learn to Love SAS® Like” by Louise Hadden, which won best paper at WUSS 2019.

Introduction

SAS provides numerous time- and angst-saving techniques to make the SAS programmer’s life easier. Among those techniques are the ability to search and select data using SAS functions and operators in the data step and PROC SQL, as well as the ability to join data sets based on matches at various levels. This paper explores how LIKE is featured in each one of these techniques and is suitable for all SAS practitioners. I hope that LIKE will become part of your SAS toolbox, too.

Smooth Operators

SAS operators are used to perform a number of functions: arithmetic calculations, comparing or selecting variable values, or logical operations. Operators are loosely grouped as "prefix" (for example, a sign before a variable) or "infix" (which generally perform an operation between two operands). Arithmetic operations using SAS operators may include exponentiation (**), multiplication (*), and addition (+), among others. Comparison operators may include greater than (>, GT) and equals (=, EQ), among others. Logical, or Boolean, operators include such operators as || or !!, AND, and OR, and serve the purpose of grouping SAS operations. Some operations that are performed by SAS operators have been formalized in functions. A good example of this is the concatenation operators (|| and !!) and the more powerful CAT functions, which perform similar, but not identical, operations. The LIKE operator is most frequently utilized in the DATA step and in PROC SQL.
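As a small sketch of that last point about operators versus functions, here is a contrast between the concatenation operator and one of the CAT functions; the variable names and values are made up for illustration:

data _null_;
   a = 'SAS ';
   b = ' code';
   c1 = a || b;       /* operator keeps the blanks: c1 = 'SAS  code' */
   c2 = cats(a, b);   /* CATS strips leading/trailing blanks: c2 = 'SAScode' */
   put c1= / c2=;
run;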

There is a category of SAS operators that act as comparison operators under special circumstances, generally in WHERE statements in PROC SQL and the DATA step (and DS2) and in subsetting IF statements in the DATA step. These operators include the LIKE operator and the SOUNDS LIKE operator, as well as the CONTAINS and the SAME-AND operators. It is beyond the scope of this short paper to discuss all the smooth operators, but they are definitely worth a look.

LIKE Operator

Character operators are frequently used for “pattern matching,” that is, evaluating whether a variable value equals, does not equal, or sounds like a specified value or pattern. The LIKE operator is a case-sensitive character operator that employs two special “wildcard” characters to specify a pattern: the percent sign (%) indicates any number of characters in a pattern, while the underscore (_) indicates the presence of a single character per underscore in a pattern. The LIKE operator is akin to the GREP utility available on Unix/Linux systems in terms of its ability to search strings.

The LIKE operator also includes an escape facility in case you need to search for a string that itself contains one of the special characters, such as the underscore or the percent sign. An example of the escape syntax, when looking for a string containing a percent sign, is:

where yourvar like '100%' escape '%';

Additionally, SAS practitioners can use the NOT LIKE operator to select variables WITHOUT a given pattern. Please note that the LIKE operator is case-sensitive. You can use the UPCASE, LOWCASE, or PROPCASE functions to adjust input strings prior to using the LIKE operator. You may string multiple LIKE expressions together with the AND or OR operators.
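Here is a minimal sketch of both wildcards in a subsetting WHERE statement; SASHELP.ZIPCODE is used purely for illustration:

data spr_cities;
   set sashelp.zipcode;
   /* '%' matches any number of characters; '_' matches exactly one character */
   where upcase(city) like 'SPR%' and statecode like 'M_';
run;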

SOUNDS LIKE Operator

The LIKE operator, described above, searches the actual spelling of operands to make a comparison. The SOUNDS LIKE operator uses phonetic values to determine whether character strings match a given pattern. As with the LIKE operator, the SOUNDS LIKE operator is useful when there are misspellings and similar-sounding names in strings to be compared. The SOUNDS LIKE operator is denoted with the shortcut '=*'. SOUNDS LIKE is based on SAS's SOUNDEX algorithm. Strings are encoded by retaining the original first column, stripping all letters that are or act as vowels (A, E, H, I, O, U, W, Y), and then assigning numbers to groups: 1 includes B, F, P, and V; 2 includes C, G, J, K, Q, S, X, Z; 3 includes D and T; 4 includes L; 5 includes M and N; and 6 includes R. "Tristn" therefore becomes T6235, as does Tristan, Tristen, Tristian, and Tristin.
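As a small sketch (the input data set and variable name below are hypothetical), the operator can be used in a WHERE statement like this:

data tristans;
   set roster;                      /* hypothetical input data set */
   where firstname =* 'Tristan';    /* also selects Tristen, Tristin, Tristn, ... (all encode to T6235) */
run;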

For more on the SOUNDS LIKE operator, please read the documentation.

Joins with the LIKE Operator

It is possible to select records with the LIKE operator in PROC SQL with a WHERE statement, including with joins. For example, the code below selects records from the SASHELP.ZIPCODE file that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
    CREATE TABLE TEMP1 AS
    select
        a.City ,
        a.countynm  , a.city2 ,
         a.statename , a.statename2
    from sashelp.zipcode as a
    where upcase(a.city) like 'SPR%' and 
upcase(a.statename)='MASSACHUSETTS' ; 
quit;

The test print of table TEMP1 shows only cases for Springfield, Massachusetts.

The code below joins SASHELP.ZIPCODE and a copy of the same file with a renamed key column (city --> geocity), again selecting records for the join that are in the state of Massachusetts and are for a city that begins with “SPR”.

proc sql;
    CREATE TABLE TEMP2 AS
    select
        a.City , b.geocity, 
        a.countynm  ,
        a.statename , b.statecode, 
        a.x, a.y
    from sashelp.zipcode as a, zipcode2 as b
    where a.city = b.geocity and upcase(a.city) like 'SPR%'
          and b.statecode = 'MA' ;
quit;

The test print of table TEMP2 shows only cases for Springfield, Massachusetts with additional variables from the joined file.

The LIKE “Condition”

The LIKE operator is sometimes referred to as a “condition,” generally in reference to character comparisons where the prefix of a string is specified in a search. LIKE “conditions” are restricted to the DATA step because the colon modifier is not supported in PROC SQL. The syntax for the LIKE “condition” is:

where firstname=: 'Tr';

This statement would select all first names in Table 2 above. To accomplish the same goal in PROC SQL, the LIKE operator can be used with a trailing % in a where statement.
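A minimal sketch of the two equivalent selections (the data set and variable names are hypothetical):

/* DATA step: the =: "condition" selects values that begin with 'Tr' */
data trs;
   set people;
   where firstname =: 'Tr';
run;
 
/* PROC SQL: the same selection with the LIKE operator and a trailing % */
proc sql;
   create table trs2 as
   select * from people
   where firstname like 'Tr%';
quit;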

Conclusion

SAS provides practitioners with several useful techniques using LIKE statements including the smooth LIKE operator/condition in both the DATA step and PROC SQL. There’s definitely reason to like LIKE in SAS programming.

To learn more about SAS Press, check out our up-and-coming titles, and to receive exclusive discounts make sure to subscribe to our newsletter.


Learn to Love SAS LIKE was published on SAS Users.

February 13, 2020
 

If you’re a SAS user, you know that SAS Global Forum is where you want to be! It’s our premier, can’t-miss event for SAS professionals—that includes thousands of users, executives, partners and academics.

SAS Global Forum 2018 first-time attendee, Rachel Nemecek

It's also our largest users' event, organized by users, for users, and this year's event marks our 44th annual conference! Of course, back when SAS got its start, our inaugural event was then called SAS.ONE. And while it's grown from a couple dozen sessions to hundreds, it's been a major keystone for SAS users around the world who look forward to connecting and building knowledge.

Considering attending this year? We sat down with Analytics Manager, Rachel Nemecek, who attended for the first time in 2018, to share her SAS Global Forum experience with us!

Rachel’s favorite part of SAS Global Forum? “The access to so many different SAS subject matter experts in one place.  I could just wander around the Attendee Hub [The Quad] and check out different technologies, meet product experts and get all my questions answered. It’s also a great opportunity to set up meetings with experts on particular topics relevant to my business. And, of course, swag.  I won some SAS socks that I still wear and all my analyst friends are jealous.”

Keep reading to hear more about all the great things to expect at SAS Global Forum, and Rachel’s first-time experience, tips and insights!

Psst: Looking for conference proceedings? Our very own Lex Jansen has documented SAS conference proceedings from 1976-present!

Learn: Enhance your analytics skills

At the forefront of advanced analytics, digital transformation and innovation, you’ll hear the latest SAS advancements and announcements from SAS executives during the event. You’ll also hear from a great lineup of partners, customers (like Kellogg Company!) and industry experts on topics like artificial intelligence, IoT, cloud and more. With separate tracks for Users, Executives, Partners and Academics, the event features customized content for each audience.

What was Rachel’s favorite session of SAS Global Forum 2018? “I particularly enjoyed hearing from Reshma Saujani, founder of Girls Who Code.  Kept thinking of my nieces and how great it was that opportunities like that to experiment with technical ideas could be available to them at a young age.”

Covering all learning preferences and formats, the event offers a variety of session types. In addition to a plethora of SAS topics, you’ll enjoy over 600 sessions, workshops, demos and presentations. And The Quad is not to be missed! With a 20% larger space than last year, you’ll learn more about SAS and our partners, and take part in hands-on activities, experiences and games like chess, smart darts and more!

Lastly, don't forget that event attendees can also take advantage of (discounted) SAS training and certification right on-site! To secure your seat, register for pre- and post-conference training and exams during your event registration. During the event, check out the Learning Lab, where you can take free SAS e-learning classes and access our free SAS Viya for Learners tool.

Rachel’s advice? “Make sure to explore all the activities that are available outside just the conference sessions – hands-on activities, demos, and of course, the networking events.”

Tip:  Don’t miss Opening Session for insights and entertainment! Also, don’t forget to schedule in time to visit the custom t-shirt press in the Cherry Blossom Lounge for your SAS Global Forum 2020 t-shirt!

Once available, use the SAS Global Forum app to build, map, and plan out your agenda in advance—not only will it help you keep track of the sessions you want to attend, but you’ll have a handy guide for navigating around the conference venue.

You're sure to take away a lot of great SAS nuggets!

Network: Build your analytics circle

Attendees can also enjoy organized receptions for networking. Taking advantage of this year’s location, the 2020 event social will take place across two (Smithsonian) museums in just one night!

And, as Rachel points out, networking isn’t limited to who you meet outside your organization: “There were probably a couple dozen folks from various parts of my company that attended, which ended up being an extra benefit – to network with internal analytics colleagues outside my immediate organization.”

Be sure to sign up for the free networking events during registration, take advantage of the free lunch on-site (and lunchtime chats), and utilize the SAS Global Forum app! This year, all content will be housed in the app as part of our “go green” initiative. These are great opportunities to mingle with fellow SAS users and build your SAS network. Also, keep an eye out for SAS user group gatherings and special events!

Tip: Connect online via the SAS Global Forum Community and follow along on Twitter, Facebook, Instagram and LinkedIn. #SASGF

Join us there!

Will Rachel be attending this year’s event? “Yes, I’ll be there and will be bringing a few folks from my team.”

Don’t miss out! Registration is open now!

We’ll see you in DC!

Tip: Not able to attend? You can still join in online. Several sessions will be live streamed via the SAS Users YouTube channel.

Have you attended #SASGF in the past? Share your favorite highlights from the event with us below!

First-timer to the event this year? We’re excited to have you join us and we’re looking forward to helping you create your SAS Global Forum story!

Experience SAS Global Forum: A First-Timer’s View & Tip Sheet was published on SAS Users.

February 13, 2020
 

What do a Pulitzer prize-winning author, an Emmy award-winning TV personality and one of the top 5 influencers in the world have in common? They’ll all be at SAS Global Forum this year. It’s where analytics enthusiasts and executive thought leaders meet to share strategies, get training in abundance, and [...]

3 people I can’t wait to meet at SAS Global Forum  was published on SAS Voices by Jenn Chase

February 12, 2020
 

The Johnson system (Johnson, 1949) contains a family of four distributions: the normal distribution, the lognormal distribution, the SB distribution, and the SU distribution. Previous articles explain why the Johnson system is useful and show how to use PROC UNIVARIATE in SAS to estimate parameters for the Johnson SB distribution or for the Johnson SU distribution.

How to choose between the SU and SB distribution?

The graph to the right shows a histogram with an overlay of a fitted Johnson SB density curve. But why fit the SB distribution? Why not the SU? For a given sample of data, it is not usually clear from the histogram whether the tails of the data are best described as thin-tailed or heavy-tailed. Accordingly, it is not clear whether the SB (bounded) or the SU (unbounded) distribution is a more appropriate model.

One way to answer that question is to plot the sample kurtosis and skewness on a moment-ratio diagram to see if it is in the SB or SU region of the diagram. Unfortunately, high-order sample moments have a lot of variability, so this method is not very accurate. See Chapter 16 of Simulating Data with SAS for examples.

Slifker and Shapiro (1980) devised a method that does not use high-order moments. Instead, they compare the length of the tails of the distribution to the length of the central portion of the distribution. They use four quantiles to define "the tails" and "the central portion," so their method is similar to a robust definition of skewness, which also uses quantile information.

For ease of exposition, in this section I will oversimplify Slifker and Shapiro's results. Essentially, they suggest using the 6th, 30th, 70th, and 94th percentiles of the data to determine whether the data are best modeled by the SU, SB, or lognormal distribution. Denote these percentiles by P6, P30, P70, and P94, respectively. The key quantities in the computation are lengths of the intervals between percentiles of the data. In particular, define

  • m = P94 - P70 (the length of the upper tail)
  • n = P30 - P6 (the length of the lower tail)
  • p = P70 - P30 (the length of the central portion)

Slifker and Shapiro show that if you use percentiles of the distributions, the ratio m*n/p² has the following properties:

  • m*n/p² > 1 for percentiles of the SU distribution
  • m*n/p² < 1 for percentiles of the SB distribution
  • m*n/p² = 1 for percentiles of the lognormal distribution

Therefore, they suggest that you use the sample estimates of the percentiles to compute the ratio. If the ratio is close to 1, use the lognormal distribution. Otherwise, if the ratio is greater than 1, use the SU distribution. Otherwise, if the ratio is less than 1, use the SB distribution.

Details of the computation

The previous section oversimplifies one aspect of the computation. Slifker and Shapiro don't actually recommend using the 6th, 30th, 70th, and 94th percentiles of the data. They recommend choosing a normal variate, z (0 < z < 1) that depends on the size of the sample. They recommend using "a value of z near 0.5 such as z = 0.524" for "moderate-sized data sets" (p. 240). After choosing z, consider the evenly spaced values {-3*z, -z, z, 3*z}. These points divide the area under the normal density curve into four regions.

For z = 0.524, the cumulative probabilities at these points are 0.058, 0.300, 0.700, and 0.942. These, then, are the percentile levels to use for "moderate-sized data sets." This choice assumes that the 5.8th and 94.2nd sample percentiles (which define the "tails") are good approximations of the corresponding percentiles of the distribution. If you have a large data set, another reasonable choice would be z = 0.6745, which leads to the 2.2nd, 25th, 75th, and 97.8th percentiles.

A SAS program to determine the Johnson family from data

You can write a SAS/IML program that reads in univariate data, computes the percentiles of the data, and computes the ratio m*n/p2. You can use the QNTL function in the SAS/IML language to compute the percentiles, but Slifker and Shapiro use a slightly nonstandard definition of percentiles in their paper. For consistency, the following program uses their definition. Part of the percentile computation requires looking at the integer and fractional part of a number.

The following program analyzes the EngineSize variable in the Sashelp.Cars data set:

/* For Sashelp.Cars, the EngineSize (and Cylinders) variable is SB. Others are SU. */
%let dsname = Sashelp.Cars;
%let varName = EngineSize; /* OR  %let varName = mpg_city; */
 
/* Implement the Slifker and Shapiro (1980) https://www.jstor.org/stable/1268463
   method that uses sample percentiles to assess whether the data are best 
   modeled by the SB, SU, or SL (lognormal) distributions. */
proc iml;
/* exclude any row with missing value: https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
start ExtractCompleteCases(X);
   idx = loc(countmiss(X, "row")=0);
   if ncol(idx)>0 then return( X[idx, ] ); else return( {} ); 
finish;
 
/* read the variable into x */
use &dsname;
   read all var {&varName} into x;
close;
 
x = ExtractCompleteCases(x);      /* remove missing values */
if nrow(x)=0 then abort;
call sort(x);
N = nrow(x);
 
/* Generate the percentiles as percentiles of the normal distribution for evenly spaced 
   variates. This computation does not depend on the data values. */
z0 = 0.524;                       /* one possible choice. Another might be 0.6745 */
z = -3*z0 // -z0 // z0 // 3*z0;   /* evenly space z values */
pctl = cdf("Normal", z);          /* percentiles of the normal distribution */
print pctl[f=5.3];                /* These are the percentiles to use */
 
/* Note: for z0 = 0.524, the percentiles are approximately 
   the 5.8th, 30th, 70th, and 94.2th percentiles of the data */
 
/* The following computations are almost (but not quite) the same as 
   call qntl(xz, x, p);   
   Use MOD(k,1) to compute the fractional part of a number. */
k = pctl*N + 0.5;
intk = int(k);          /* int(k) is integer part and mod(k,1) is fractional part */
xz = x[intk] + mod(k,1) # (x[intk+1] - x[intk]); /* quantiles using linear interpol */
 
/* Use xz to compare length of left/right tails with length of central region */
m = xz[4] - xz[3];      /* right tail: length of 94th - 70th percentile */
n = xz[2] - xz[1];      /* left tail: length of 30th - 6th percentile */
p = xz[3] - xz[2];      /* central region: length of 70th - 30th percentile */
 
/* use ratio to decide between SB, SL (lognormal) and SU distributions */
ratio = m*n/p**2;
eps = 0.05;             /* if ratio is within 1 +/- eps, use lognormal */
 
if (ratio > 1 + eps) then 
   type = 'SU';
else if (ratio < 1 - eps) then
   type = 'SB';
else 
   type = 'LOGNORMAL';
 
print type;
call symput('type', type);   /* optional: store value in a macro variable */
quit;

The output of the program is the word 'SB', 'SU', or 'LOGNORMAL', which indicates which Johnson distribution seems to fit the data. When you know this information, you can use the appropriate option in the HISTOGRAM statement of PROC UNIVARIATE. For example, the SB distribution seems to be appropriate for the EngineSize variable, so you can use the following statements to produce the graph at the top of this article.

proc univariate data=Sashelp.Cars;
   var EngineSize;
   histogram EngineSize / SB(theta=0 sigma=EST fitmethod=moments);
   ods select Histogram;
run;

Most of the other variables in the Sashelp.Cars data are best modeled by using the SU distribution. For example, if you change the value of the &varName macro to mpg_city and rerun the program, the program will print 'SU'.

Concluding thoughts

Slifker and Shapiro's method is not perfect. One issue is that you have to choose a particular set of percentiles (derived from the choice of "z0" in the program). Different choices of z0 could conceivably lead to different results. And, of course, the sample percentiles—like all statistics—have sampling variability. Nevertheless, the method seems to work adequately in practice.

The second part of Slifker and Shapiro's paper describes a method of using the percentiles of the data to fit the parameters of the distribution. This is the "method of percentiles" that is mentioned in the PROC UNIVARIATE documentation. In a companion article, Mage (1980, p. 251) states: "Although a given percentile point fit of the SB parameters may be adequate for most purposes, the ambiguity of obtaining different parameters by different percentile choices may be unacceptable in some applications." In other words, it would be preferable to "avoid the situation where one choice of percentiles leads to [one conclusion] and a second choice of percentiles leads to [a different conclusion]."

It is worth noting that the UNIVARIATE procedure supports three different methods for fitting parameters in the Johnson distributions. No one method works perfectly for all possible data sets, so experiment with several different methods when you use the Johnson distribution to model data.

The post The Johnson system: Which distribution should you choose to model data? appeared first on The DO Loop.

February 10, 2020
 

You can represent every number as a nearby integer plus a decimal. For example, 1.3 = 1 + 0.3. The integer is called the integer part of x, whereas the decimal is called the fractional part of x (or sometimes the decimal part of x). This representation is not unique. For example, you can also write 1.3 = 2 + (-0.7). There are several ways to produce the integer part of a number, depending on whether you want to round up, round down, round towards zero, or use some alternative rounding method.

Just as each rounding method defines the integer part of a number, so, too, does it define the fractional part. If [x] denotes the integer part of x (by whatever rounding method you choose), then the fractional part is defined by frac(x) = x - [x]. For some choices of a rounding function (for example, FLOOR), the fractional part is nonnegative for all x. For other choices, the sign of the fractional part might vary according to the value of x.

In applications, two common representations are as follows:

  • Round x towards negative infinity. The fractional part of x is always positive. You can round towards negative infinity by using the FLOOR function in a computer language. For example, if x = -1.3, then FLOOR(x) is -2 and the fractional part is 0.7.
  • Round x towards zero. The fractional part of x always has the same sign as x. You can round towards zero by using the INT function. For example, if x = -1.3, then INT(x) is -1 and the fractional part is -0.3.

Here is an interesting fact: for the second method (INT), you can compute the fractional part directly by using the MOD function in SAS. In SAS, the expression MOD(x,1) returns the signed fractional part of a number because the MOD function in SAS returns a result that has the same sign as x. This can be useful when you are interested only in the fraction portion of a number.

The following DATA step implements both common methods for representing a number as an integer and a fractional part. Notice the use of the MOD function for the second method:

data Fractional;
input x @@;
/* Case 1: x = floor(x) + frac1(x) where frac1(x) >= 0 */
Floor = floor(x);
Frac1 = x - Floor;                  /* always positive */
 
/* Case 2: x = int(x) + frac2(x) where frac2(x) has the same sign as x */
Int = int(x);
Frac2 = mod(x,1);                   /* always same sign as x */
label Floor = 'Floor(x)' Int='Int(x)' Frac2='Mod(x,1)';
datalines;
-2 -1.8 -1.3 -0.7 0 0.6 1.2 1.5 2
;
 
proc print data=Fractional noobs label;
run;

The table shows values for a few positive and negative values of x. The second and third columns represent the number as x = Floor(x) + Frac1. The fourth and fifth columns represent the number as x = Int(x) + Mod(x,1). For non-negative values of x, the two methods are equivalent. When x can be either positive or negative, I often find that the second representation is easier to work with.

In statistics, the fractional part of a number is used in the definition of sample estimates for percentiles and quantiles. SAS supports five different definitions of quantiles, some of which look at the fractional part of a number to decide how to estimate a percentile.
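For example, here is a hedged sketch of how you can request a particular percentile definition in PROC UNIVARIATE with the PCTLDEF= option; the choice of definition 3 and of the percentile points is arbitrary:

proc univariate data=Fractional pctldef=3;
   var x;
   output out=Pctls pctlpre=P pctlpts=25 50 75;   /* P25, P50, P75 under definition 3 */
run;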

The post Find the fractional part of a number appeared first on The DO Loop.

February 5, 2020
 

One of the first and most important steps in analyzing data, whether for descriptive or inferential statistical tasks, is to check for possible errors in your data. In my book, Cody's Data Cleaning Techniques Using SAS, Third Edition, I describe a macro called %Auto_Outliers. This macro allows you to search for possible data errors in one or more variables with a simple macro call.

Example Statistics

To demonstrate how useful and necessary it is to check your data before starting your analysis, take a look at the statistics on heart rate from a data set called Patients (in the Clean library) that contains an ID variable (Patno) and another variable representing heart rate (HR). This is one of the data sets I used in my book to demonstrate data cleaning techniques. Here is output from PROC MEANS:
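A minimal sketch of the PROC MEANS call that produces these statistics, assuming the Clean.Patients data set and the HR variable described above, is:

proc means data=Clean.Patients n mean std min max maxdec=1;
   var HR;
run;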

The mean of 79 seems a bit high for normal adults, but the standard deviation is clearly too large. As you will see later in the example, there was one person with a heart rate of 90.0, but the value was entered as 900 by mistake (shown as the maximum value in the output). A severe outlier can have a strong effect on the mean but an even stronger effect on the standard deviation. If you recall, one step in computing a standard deviation is to subtract the mean from each value and square that difference. This causes an outlier to have a huge effect on the standard deviation: here, the erroneous value contributes (900 - 79)², roughly 674,000, to the sum of squared deviations, which dwarfs the contribution of any valid value.

Macro

Let's run the %Auto_Outliers macro on this data set to check for possible outliers (that may or may not be errors).

Here is the call:

%Auto_Outliers(Dsn=Clean.Patients,
               Id=Patno,
               Var_List=HR SBP DBP,
               Trim=.1,
               N_Sd=2.5)

This macro call is looking for possible errors in three variables (HR, SBP, and DBP); however, we will only look at HR for this example. Setting the value of Trim equal to .1 specifies that you want to remove the top and bottom 10% of the data values before computing the mean and standard deviation. The value of N_Sd (number of standard deviations) specifies that you want to list any heart rate beyond 2.5 trimmed standard deviations from the mean.
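To see the idea behind Trim=.1, here is a hedged sketch that computes a 10% trimmed mean directly with PROC UNIVARIATE; it illustrates the concept and is not the macro's internal code:

proc univariate data=Clean.Patients trimmed=0.1;
   var HR;
   ods select TrimmedMeans;   /* the trimmed mean is far less affected by the 900 outlier */
run;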

Result

Here is the result:

After checking every flagged value, it turned out that every one except the value for patient 003 (HR = 56) was a data error. Let's see the mean and standard deviation after these data points are removed.

Notice the mean is now 71.3 and the standard deviation is 11.5. You can see why it is so important to check your data before performing any analysis.

You can download this macro and all the other macros in my data cleaning book by going to support.sas.com/cody. Scroll down to Cody's Data Cleaning Techniques Using SAS, and click on the link named "Example Code and Data." This will download a file containing all the programs, macros, and data files from the book.  By the way, you can do this with any of my books published by SAS Press, and it is FREE!

Let me know if you have questions in the comments section, and may your data always be clean! To learn more about SAS Press, check out up-and-coming titles, and to receive exclusive discounts make sure to subscribe to the newsletter.

Finding Possible Data Errors Using the %Auto_Outliers Macro was published on SAS Users.

February 5, 2020
 

A SAS programmer wanted to create a graph that illustrates how Deming regression differs from ordinary least squares regression. The main idea is shown in the panel of graphs below.

  • The first graph shows the geometry of least squares regression when we regress Y onto X. ("Regress Y onto X" means "use values of X to predict Y.") The residuals for the model are displayed as vectors that show how the observations are projected onto the regression line. The projection is vertical when we regress Y onto X.
  • The second graph shows the geometry when we regress X onto Y. The projection is horizontal.
  • The third graph shows the perpendicular projection of both X and Y onto the identity line. This is the geometry of Deming regression.

This article answers the following two questions:

  1. Given any line and any point in the plane, how do you find the location on the line that is closest to the point? This location is the perpendicular projection of the point onto the line.
  2. How do you use the SGPLOT procedure in SAS to create the graphs that show the projections of points onto lines?

The data for the examples are shown below:

data Have;
input x y @@;
datalines;
0.5 0.6   0.6 1.4   1.4 3.0   1.7 1.4   2.2 1.7
2.4 2.1   2.4 2.4   3.0 3.3   3.1 2.5 
;

The projection of a point onto a line

Assume that you know the slope and intercept of a line: y = m*x + b. You can use calculus to show that the projection of the point (x0, y0) onto the line is the point (xL, yL) where
xL = (x0 + m*(y0 - b)) / (1 + m²) and yL = m*xL + b.

To derive this formula, you need to solve for the point on the line that minimizes the distance from (x0, y0) to the line. Let (x, m*x + b) be any point on the line. We want to find a value of x so that the distance from (x0, y0) to (x, m*x + b) is minimized. The solution that minimizes the distance also minimizes the squared distance, so define the squared-distance function
f(x) = (x - x0)² + (m*x + b - y0)².
To find the location of the minimum for this function, set the derivative equal to zero and solve for the value of x:

  • f′(x) = 2(x - x0) + 2m*(m*x + b - y0)
  • Set f′(x) = 0 and solve for x. The solution is the value xL = (x0 + m*(y0 - b)) / (1 + m²), which minimizes the distance from the point to the line.
  • Plug xL into the formula for the line to find the corresponding vertical coordinate on the line: yL = m * xL + b.

You can use the previous formulas to write a simple SAS DATA step that projects each observation onto a specified line. (For convenience, I put the value of the slope (m) and intercept (b) into macro variables.) The following DATA step projects a set of points onto the line y = m*x + b. You can use PROC SGPLOT to create a scatter plot of the observations. Use the VECTOR statement to draw the projections of the points onto the line.

/* projection onto general line of the form y = &m*x + &b */
%let b = 0.4;
%let m = 0.8;
data Want;
set Have;
xL = (x + &m *(y - &b)) / (1 + &m**2);
yL = &m * xL + &b;
run;
 
title "Projection onto Line y=&m x + &b";
proc sgplot data=Want aspect=1 noautolegend;
   scatter x=x y=y;
   vector x=xL y=yL / xorigin=x yorigin=y; /* use the NOARROWHEADS option to suppress the arrow heads */
   lineparm x=0 y=&b slope=&m / lineattrs=(color=black);
   xaxis grid; yaxis grid;
run;

You can get the graph for Deming regression by setting b=0 and m=1 in the previous formulas and program.
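That is, keep the same DATA step and PROC SGPLOT code but change the macro variables before rerunning:

%let b = 0;   /* identity line: y = x */
%let m = 1;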

In summary, you can use that math you learned in high school to find the perpendicular projection of a point onto a line. You can then use the VECTOR statement in PROC SGPLOT in SAS to create a graph that illustrates the projection. Such a graph is useful for comparing different kinds of regressions, such as comparing least-squares and Deming regression.

The post Visualize residual projections for linear regression appeared first on The DO Loop.