January 29, 2020
 

In a previous article, I showed how to perform collinearity diagnostics in SAS by using the COLLIN option in the MODEL statement in PROC REG. For models that contain an intercept term, I noted that there has been considerable debate about whether the data vectors should be mean-centered prior to performing the collinearity diagnostics. In other words, if X is the design matrix used for the regression, should you use X to analyze collinearity or should you use the centered data X – mean(X)? The REG procedure provides options for either choice. The COLLIN option uses the X matrix to assess collinearity; the COLLINOINT option uses the centered data.

As Belsley (1984, p. 76) states, "centering will typically seem to improve the conditioning." However, he argues that running collinearity diagnostics on centered data "gives us information about the wrong problem." He goes on to say, "mean-centering typically removes from the data the interpretability that makes conditioning diagnostics meaningful."

This article looks at how centering the data affects the collinearity diagnostics. Throughout this article, when I say "collinearity diagnostics," I am referring to the variance-decomposition algorithm that is implemented by the COLLIN option in PROC REG, which was described in the previous article. Nothing in this article applies to the VIF or TOL options in PROC REG, which provide alternative diagnostics.

The article has two main sections:

  • The mathematics behind the COLLIN (and COLLINOINT) options in PROC REG.
  • An example of an ill-conditioned linear system that becomes perfectly conditioned if you center the data.

The arguments in this article are taken from the references at the end of this article. This article assumes that you have already read my previous article about collinearity diagnostics.

The mathematics of the COLLIN option

The COLLIN option implements the regression-coefficient variance decomposition due to Belsley and presented in Belsley, Kuh, and Welsch (1980), henceforth, BKW. The collinearity diagnostics algorithm (also known as an analysis of structure) performs the following steps:

  1. Let X be the data matrix. If the model includes an intercept, X has a column of ones. BKW recommend that you NOT center X, but if you choose to center X, do it at this step. As a reminder, the COLLIN option in PROC REG does not center the data whereas the COLLINOINT option centers the data.
  2. Scale X so that each column has unit length (unit variance).
  3. Compute the singular value decomposition of X = UDV`.
  4. From the diagonal matrix, D, compute the eigenvalues and condition indices of X`X.
  5. Compute P, the matrix of variance-decomposition proportions as described in BKW, p. 105-107.
  6. From this information, you can determine whether the regression model suffers from harmful collinearity.

To make sure that I completely understand an algorithm, I like to implement it in the SAS/IML matrix language. The following SAS/IML statements implement the analysis-of-structure method. You can run the program on the same Fitness data that were used in the previous article. The results are the same as from PROC REG.

proc iml;
start CollinStruct(lambda, cond, P,         /* output variables */
                   XVar, opt);              /* input variables  */
 
   /* 1. optionally center the data */
   if upcase(opt) = "NOINT" then 
      X = XVar - mean(XVar);                /* COLLINOINT: do not add intercept, center */
   else
      X = j(nrow(XVar), 1, 1) || XVar;      /* COLLIN: add intercept, do not center */
 
   /* 2. Scale X to have unit column length (unit variance) */
   Z = X / sqrt(X[##, ]);
   /* 3. Obtain the SVD of X and calculate condition indices and the P matrix */
   call svd(U, D, V, Z);
   /* 4. compute the eigenvalues and condition indices of X`X */
   lambda = D##2;                           /* eigenvalues are square of singular values */
   cond = sqrt(lambda[1] / lambda);         /* condition indices */
 
   /* 5. Compute P = matrix of variance-decomposition proportions */
   phi = V##2 / lambda`;          /* divide squared columns by eigenvalues (proportions of each PC) */
   phi_k = phi[,+];               /* for each component, sum across columns */
   P = T( phi / phi_k );          /* create proportions and transpose the result */
finish;
 
/* Perform Regression-Coefficient Variance Decomposition of BKW */
varNames = {RunTime Age Weight RunPulse MaxPulse RestPulse};
use fitness;
   read all var varNames into XVar;
close;
 
/* perform COLLIN analysis */
call CollinStruct(lambda, cond, P,  XVar, "INT");
print "----- Do Not Center the Data (COLLIN) -----", lambda cond;
 
/* perform COLLINOINT analysis */
call CollinStruct(lambda0, cond0, P0,  XVar, "NOINT");
print "----- Center the Data (COLLINOINT) -----", lambda0 cond0;

The first table shows the eigenvalues and (more importantly) the condition indices for the original (uncentered) data. You can see that there are three condition indices that exceed 30, which indicates that there might be as many as three sets of nearly collinear relationships among the variables. My previous analysis showed two important sets of relationships:

  • Age is moderately collinear with the intercept term.
  • RunPulse is strongly collinear with MaxPulse.

In the second table, which analyzes the structure of the centered data, none of the condition indices are large. An interpretation of the second table is that the variables are not collinear. This contradicts the first analysis.

Why does centering the data change the condition indices so much? This phenomenon was studied by Belsley who showed that "centering will typically seem to improve the conditioning," sometimes by a large factor (Belsley, 1984, p. 76). Belsley says that the second table "gives us information about the wrong problem; namely, it tells us about the sensitivity of the LS solution... to numerically small relative changes in the centered data. And since the magnitude of the centered changes is usually uninterpretable," so also are the condition indices for the centered data.

Ill-conditioned data that becomes perfectly conditioned by centering

Belsley (1984) presents a small data set (N=20) for which the original variables are highly collinear (maximum condition index is 1,242) whereas the centered data is perfectly conditioned (all condition indices are 1). Belsley could have used a much smaller example, as shown in Chennamaneni et al. (2008). I used their ideas to construct the following example.

Suppose that the (uncentered) data matrix is
X = A + ε B
where A is any N x k matrix that has constant columns, B is a centered orthogonal matrix, and ε > 0 is a small number, such as 0.001. Clearly, X is a small perturbation of a rank-deficient and ill-conditioned matrix (A). The condition indices for X can be made arbitrarily large by making ε arbitrarily small. I think everyone would agree that the columns of X are highly collinear. As shown below, the analysis-of-structure algorithm on X reveals the collinearities.

But what happens if you center the data? Because A has constant columns, the mean-centered version of A is the zero matrix. Centering B does not change it because the columns are already centered. Therefore, the centered version of X is ε B, which is a perfectly conditioned orthogonal matrix! This construction is valid for arbitrarily large data, but the following statements implement this construction for a small 3 x 2 matrix.

A = { 1  2,         /* linearly dependent ==> infinite condition index */
      1  2,
      1  2};
B = {-1  1,         /* orthogonal columns ==> perfect condition number (1) */
      0 -2,
      1  1};
eps = 0.001;        /* the smaller eps is, the more ill-conditioned X is */
X = A + eps * B;    /* small perturbation of a rank deficient matrix */
 
/* The columns of X are highly collinear. The columns X - mean(X) are perfectly conditioned. */
Xc = X - mean(X);
print X, Xc;

This example reveals how "mean-centering can remove from the data the information needed to assess conditioning correctly" (Belsley, 1984, p. 74). As expected, if you run the analysis-of-structure diagnostics on this small example, the collinearity is detected in the original data. However, if you center the data prior to running the diagnostics, the results do not indicate that the data are collinear:

/* The columns of the X matrix are highly collinear, but only 
   the analysis of the uncentered data reveals the collinearity */
call CollinStruct(lambda, cond, P, X, "INT");
print lambda cond P[c={"Intercept" "X1" "X2"}];  /* as ill-conditioned as you want */
 
call CollinStruct(lambda0, cond0, P0, X, "NOINT");
print lambda0 cond0 P0[c={"X1" "X2"}];           /* perfectly conditioned */

In the first table (which is equivalent to the COLLIN option in PROC REG), the strong collinearities are apparent. In the second table (which is equivalent to the COLLINOINT option), the collinearities are not apparent. As Belsley (1984, p. 75) says, an example like this "demonstrates that it matters very much in what form the data are taken in order to assess meaningfully the conditioning of a LS problem, and that centered data are not usually the correct form."

Geometrically, the situation is similar to the following diagram, which is part of Figure 2 on p. 7 of Chennamaneni, et al. (2008). The figure shows two highly collinear vectors (U and V). However, the mean-centered vectors are not close to being collinear and can even be orthogonal (perfectly conditioned).

U and V are highly collinear. The mean centered vectors are orthogonal.

Summary

In summary, this article has presented Belsley's arguments about why collinearity diagnostics should be performed on the original data. As Belsley (1984, p. 75) says, there are situations in which centering the data is useful, but "assessing conditioning is not one of them." The example in the second section demonstrates why Belsley's arguments are compelling: the data are clearly strongly collinear, yet if you apply the algorithm to the mean-centered data, you do not get any indication that the problem exists. The analysis of the fitness data shows that the same situation can occur in real data.

These examples convince me that the analysis-of-structure algorithm reveals collinearity only when you apply it to the original data. If you agree, then you should use the COLLIN option in PROC REG to perform collinearity diagnostics.

However, not everyone agrees with Belsley. If you are not convinced, you can use the COLLINOINT option in PROC REG to perform collinearity diagnostics on the centered data. However, be aware that the estimates for the centered data are still subject to inflated variances and sensitive parameter estimates (Belsley, 1984, p. 74), even if this diagnostic procedure does not reveal that fact.

Further reading

The post Collinearity diagnostics: Should the data be centered? appeared first on The DO Loop.

January 29, 2020
 

Editor's note: This post is the first in our Young Data Scientists series, so be sure to check back for future posts! Stefan Stoyanov is a Business Analytics and Research Intern at Boemska and a 2020 SAS Student Ambassador. He’s an MSc Business Analytics student with a passion for [...]

Data science gives you power to change the world was published on SAS Voices by Jelena Stankovic

January 28, 2020
 

Despite all the handwringing by marketers and advertisers, Google’s latest move to phase out Chrome’s third-party cookie support -- “in two years” -- is not a new one. Apple released Intelligent Tracking Prevention for its Safari browser in 2017 to prevent third parties from tracking users for over 24 hours [...]

3 ways the crumbling cookie could improve customer experience was published on SAS Voices by Wilson Raj

January 27, 2020
 

Workday calendar including weekends and holidays

Way too often, SAS programmers run into a task where, for a given date (e.g. an event date), they need to shift it (add or subtract) by a specified number of days excluding weekends and holidays — in other words, to move a date by a given number of workdays. It does not matter how many days off fall within our date span as long as it contains exactly the required number of workdays.

For the purpose of this blog post, we will use the following words as synonyms: workdays, work days, working days, business days; as opposed to their antonym: days off.

In the ideal world

If not for gifts from governments and employers called holidays, shifting (incrementing or decrementing) a date by a number of workdays using SAS would be a piece of cake. It’s literally a single line of code using the INTNX function with the WEEKDAY date interval. For example, the following DATA Step code:

data _null_;
   event_date = '02JAN2020'd;
   shift_date = intnx('weekday', event_date, -10);
   put shift_date= date9.;
run;

produces in the SAS log:

shift_date=19DEC2019

Here:

  • 'weekday' is the date interval covering Monday through Friday;
  • event_date is the starting date;
  • -10 is the number of workdays to shift by (a positive number means increment; a negative number means decrement).

Note that the WEEKDAY date interval can be modified to accommodate different weekend days; a short demonstration follows the list below. For example:

  • WEEKDAY17W - five-day work week with a Sunday (1) and Saturday (7) weekend (same as WEEKDAY);
  • WEEKDAY1W - six-day week with Sunday (1) as the only weekend day;
  • WEEKDAY67W - five-day week with Friday (6) and Saturday (7) as weekend days, etc.
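Here is a short demonstration of these modified intervals. This sketch is my addition rather than part of the original post; the starting date is arbitrary, and the results are simply written to the SAS log so you can compare how the choice of interval changes the shifted date.

data _null_;
   event_date = '02JAN2020'd;
   d1 = intnx('weekday17w', event_date, -10);  /* five-day week, Sat/Sun off (same as WEEKDAY) */
   d2 = intnx('weekday1w',  event_date, -10);  /* six-day week, only Sunday off */
   d3 = intnx('weekday67w', event_date, -10);  /* five-day week, Fri/Sat off */
   put d1= date9. / d2= date9. / d3= date9.;
run;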

Holidays schedule

In the real world, however, weekends defined by the WEEKDAY interval are not the only days off, as they do not account for holidays. In the example above, when we shifted our starting date (2 January 2020) by -10 we arrived at 19 December 2019 which means we miscounted several holidays as workdays.

Which holidays (and how many) we miscounted depends on the jurisdiction (country, state, business), as holiday schedules vary. For example, for US federal agencies we would have missed three: New Year's Day (1Jan2020), Christmas Day (25Dec2019), and Christmas Eve Day (24Dec2019 – although this is not an official US federal holiday, most federal employees were given that day off by presidential executive order).

For SAS Institute (USA), we would have miscounted 6 non-weekend holiday days (Winter Holiday, 25Dec2019 – 27Dec2019 and 30Dec2019 – 1Jan2020).

In other countries or businesses, the holiday schedule might be quite different, and that is why this date-shifting task, which must account for the holiday schedule, causes so much confusion. Let’s straighten it out with the help of our old friend – the SAS user-defined format. But first, let’s create a workday calendar – a data table listing all OUR work days as well as days off.

Workday calendar

Practically every organization has (or must have) a workday calendar that defines the valid working days and consists of a repeating pattern of days on and days off, as well as exceptions to that pattern. While such a calendar may span multiple years, for our purposes, we can use a subset of that calendar, which reliably covers the date range of our interest.

Let’s create an example of the workday calendar as a SAS data table:

data DAYS_OF_WEEK;
   format DATE date9.;
   do DATE='01JAN2019'd to '31JAN2020'd;
      WEEK_DAY = weekday(DATE);
      DAY_NAME = put(DATE,downame.);
      WORK_DAY = 1<WEEK_DAY<7;
      output;
   end;
run;
 
data DAYS_HOLIDAY;
   format DATE date9.;
   input DATE date9.;
   WORK_DAY = 0;
   datalines;
01JAN2019
21JAN2019
18FEB2019
27MAY2019
04JUL2019
02SEP2019
11NOV2019
28NOV2019
24DEC2019
25DEC2019
01JAN2020
20JAN2020
; 
 
/* Overlay holidays onto weekdays */
data DAYS_WEEKENDS_AND_HOLIDAYS;
   merge
      DAYS_OF_WEEK
      DAYS_HOLIDAY;
   by DATE;
run;

Here is a fragment of the resulting workday calendar table:
Workday calendar table

If date shifting is needed on an individual-level, then workday calendars should be created for every person and must include working days, weekends, holidays as well as personal days off such as vacations, sick days etc.

SAS format to distinguish workdays from days off

Now, for the dates range of our interest, we want to create a SAS user-defined format that lists all the days off while workdays are lumped into the other category. It’s just more efficient that way, as the number of days off is usually smaller than the number of work days so our explicit list of dates will be shorter. For example:

proc format;
   value dayoff
   '01DEC2019'd = 'Y'
   '07DEC2019'd = 'Y'
   '08DEC2019'd = 'Y'
   . . .
   '24DEC2019'd = 'Y'
   '25DEC2019'd = 'Y'
   '01JAN2020'd = 'Y'
   '20JAN2020'd = 'Y'
   other = 'N'
   ;
run;

In this user-defined SAS format, values labeled 'Y' mean a day off, and values labeled 'N' mean a workday. That takes care of both weekends and holidays.

The PROC FORMAT step above serves only to illustrate the kind of format we are going to create. However, by no means do I suggest implementing it this hard-coded way. Quite the contrary, we are going to create the format dynamically, in a 100% data-driven way. Here is how we do it:

data WORK.DAYSOFF (rename=(DATE=START));
   set DAYS_WEEKENDS_AND_HOLIDAYS(where=(WORK_DAY=0)) end=last;
   retain FMTNAME 'dayoff' TYPE 'n' LABEL 'Y';
   output;
   if last then do;
      HLO = 'O';
      LABEL = 'N';
      output;
   end;
run;
 
proc format cntlin=WORK.DAYSOFF;
run;

In the above code, HLO='O' and LABEL='N' are responsible for generating the OTHER category for the dayoff format.
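As a quick sanity check (my addition, not part of the original post), you can PUT a few dates through the newly built format and inspect the SAS log; the sample dates below come from the holiday list and the surrounding weekend.

data _null_;
   do d = '24DEC2019'd, '26DEC2019'd, '28DEC2019'd, '20JAN2020'd;
      put d date9. ' is a day off: ' d dayoff.;
   end;
run;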

Shifting dates by a number of workdays using dayoff format

With the dayoff user-defined format at hand, we can easily increment or decrement dates by a number of workdays. Here is how:

/* data table of some dates */
data EVENTS;
   do EVENT_DATE='01DEC2019'd to '31DEC2019'd;
      output;
   end;
   format EVENT_DATE date9.;
run;
 
/* Calculating new dates shifted by a number of workdays */
data EVENTS_WITH_SHIFTS;
   set EVENTS;
 
   /* Decrement EVENT_DATE by 10 workdays */ 
   d = EVENT_DATE;
   do i=1 to 10;
      d = d - 1;
      if put(d, dayoff.)='Y' then i = i - 1;
   end;
   BEFORE_DATE = d;
 
   /* Increment EVENT_DATE by 12 workdays */ 
   d = EVENT_DATE;
   do i=1 to 12;
      d = d + 1;
      if put(d, dayoff.)='Y' then i = i - 1;
   end;
   AFTER_DATE = d;
 
   format BEFORE_DATE AFTER_DATE date9.;
   drop d i;
run;

In this code, we decrement (d=d-1) or increment (d=d+1) our event date every time the do-loop iterates. It will iterate while counter i does not exceed the number of workdays. However, within the do-loop we modify counter i to i-1 every time we come across a day off as determined by condition put(d,dayoff.)='Y'. This will effectively exclude days off from counting towards the number of workdays. The do-loop will iterate the number of workdays plus the number of days off thus moving date d by the number of days that includes exactly the given number of workdays (plus some number of days off which we don’t care about). Just pause for a second and think to absorb this.

This simple technique can be modularized by implementing it as a SAS user-defined function or a SAS data-step macro.

User-defined function to shift a date by a number of workdays

Here is the user-defined function shiftwd() that shifts a beginning date specified in the first argument from_date by a number of workdays specified in the second argument shift_by. The second argument can be either positive or negative. Positive second argument means advancing the first argument (incrementing); negative second argument means subtracting workdays from the first argument (decrementing). Both arguments can be either variable names or numerals representing whole numbers.

libname funclib 'c:\projects\shift\functions';
proc fcmp outlib=funclib.funcs.dates; 
   function shiftwd(from_date, shift_by); 
      d = from_date; 
      do i=1 to abs(shift_by); 
         d = d + sign(shift_by); 
         if put(d,dayoff.)='Y' then i = i - 1; 
      end; 
      return(d); 
   endfunc; 
run;

Function usage example:

libname funclib 'c:\projects\shift\functions';
options cmplib= funclib.funcs;
 
data EVENTS_WITH_SHIFTS;
   set EVENTS;
   BEFORE_DATE = shiftwd(EVENT_DATE,-10); /* Decrement EVENT_DATE by 10 workdays */ 
   AFTER_DATE  = shiftwd(EVENT_DATE, 12); /* Increment EVENT_DATE by 12 workdays */ 
   format BEFORE_DATE AFTER_DATE date9.;
run;

SAS macro to shift a date by a number of workdays

Similarly, the same can be implemented as a data-step macro:

%macro shiftwd (fromvar=,endvar=,wdays=,sign=);
   &endvar = &fromvar;
   do i=1 to &wdays;
      &endvar = &endvar &sign 1;
      if put(&endvar, dayoff.)='Y' then i = i - 1;  
   end;
   drop i;
%mend;

This macro has 4 required parameters:

  • fromvar - variable name of the beginning date;
  • endvar - variable name of the ending date;
  • wdays - variable name or numeral representing number of workdays to move from the beginning date;
  • sign - operation sign defining direction of the date move (+ for incrementing, - for decrementing).

Macro usage example:

data EVENTS_WITH_SHIFTS;
   set EVENTS;
   %shiftwd(fromvar=EVENT_DATE,endvar=BEFORE_DATE,wdays=10,sign=-); /* Decrement EVENT_DATE by 10 workdays */ 
   %shiftwd(fromvar=EVENT_DATE,endvar=AFTER_DATE, wdays=12,sign=+); /* Increment EVENT_DATE by 12 workdays */ 
   format BEFORE_DATE AFTER_DATE date9.;
run;

Related materials

Calculating the number of working days between two dates (Blog post)

Custom Time Intervals (SAS Documentation)

Your thoughts?

Do you find this material useful? How do you handle the task of adding or subtracting workdays from a date? Please share in the comments section below.

Shifting a date by a given number of workdays was published on SAS Users.

January 27, 2020
 

The Johnson system (Johnson, 1949) contains a family of four distributions: the normal distribution, the lognormal distribution, the SB distribution (which models bounded distributions), and the SU distribution (which models unbounded distributions). Note that 'B' stands for 'bounded' and 'U' stands for 'unbounded.' A previous article explains the purpose of the Johnson system and shows how to work with the Johnson SB distribution, including how to compute the probability density function (PDF), generate random variates, and estimate parameters from data. This article provides similar information for the Johnson SU distribution.

What is the Johnson SU distribution?

The SU distribution contains a location parameter, θ, and a scale parameter, σ > 0. The SU density is positive for all real numbers. The two shape parameters for the SU distribution are called δ and γ in the SAS documentation for PROC UNIVARIATE. By convention, the δ parameter is taken to be positive (δ > 0).

If X is a random variable from the Johnson SU distribution, then the variable
Z = γ + δ sinh⁻¹( (X-θ) / σ )
is standard normal. The transformation to normality is invertible. Because sinh⁻¹(t) = arsinh(t) = log(t + sqrt(1 + t²)), some authors specify the transformation in terms of logs and square roots.

Generate random variates from the SU distribution

The relationship between the SU and the normal distribution provides an easy way to generate random variates from the SU distribution. To be explicit, define Y = (Z-γ) / δ, where Z ∼ N(0,1). Then the following transformation defines a Johnson SU random variable:
X = θ + σ sinh(Y),
where sinh(t) = (exp(t) – exp(-t)) / 2.

The following SAS DATA step generates a random sample from a Johnson SU distribution:

/* Johnson SU(location=theta, scale=sigma, shape=delta, shape=gamma) */
data SU(keep= X);
call streaminit(1);
theta = 0;  sigma = 1;   gamma = -1.1;   delta = 1.5;  
do i = 1 to 1000;
   Y = (rand("Normal")-gamma) / delta;
   X = theta + sigma * sinh(Y);
   output;
end;
run;

You can use PROC UNIVARIATE in SAS to overlay the SU density on the histogram of the random variates:

proc univariate data=SU;
   histogram x / SU(theta=0 sigma=1 gamma=-1.1 delta=1.5);
   ods select Histogram;
run;
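As an additional check (my addition, not in the original article), you can apply the normalizing transformation from the first section to the simulated values and verify that the result is approximately standard normal. The arsinh function is written in its log form so that the step uses only standard DATA step functions.

data SU_check;
   set SU;
   theta = 0;  sigma = 1;  gamma = -1.1;  delta = 1.5;   /* same parameters as the simulation */
   w = (X - theta) / sigma;
   Z = gamma + delta*log(w + sqrt(1 + w**2));            /* Z = gamma + delta*arsinh(w) */
run;
 
proc univariate data=SU_check;
   var Z;
   histogram Z / normal(mu=0 sigma=1);                   /* Z should be approximately N(0,1) */
   ods select Histogram GoodnessOfFit;
run;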

The PDF of the SU distribution

The formula for the SU density function is given in the PROC UNIVARIATE documentation (set h = v = 1 in the formula). You can evaluate the probability density function (PDF) on the interval [θ – 4 σ, θ + 4 σ] for visualization purposes. So that you can easily compare various shape parameters, the following examples use θ=0 and σ=1. The θ parameter translates the density curve and the σ parameter scales it, but these parameters do not change the shape of the curve. Therefore, only the two shape parameters are varied in the following program, which generates the PDF of the SU distribution for four combinations of parameters.

/* Visualize a few SU distributions */
data SU_pdf;
array _theta[4] _temporary_ (0 0 0 0);
array _sigma[4] _temporary_ (1 1 1 1);
array _gamma[4] _temporary_ (-1.1 -1.1  0.5  0.5);
array _delta[4] _temporary_ ( 1.5  0.8  0.8  0.5 );
sqrt2pi = sqrt(2*constant('pi'));
do i = 1 to dim(_theta);
   theta=_theta[i]; sigma=_sigma[i]; gamma=_gamma[i]; delta=_delta[i]; 
   Params = catt("gamma=", gamma, "; delta=", delta);
   do x = theta-4*sigma to theta+4*sigma by 0.01;
      c = delta / (sigma * sqrt2pi);
      w = (x-theta) / sigma;
      z = gamma + delta*arsinh(w);
      PDF = c / sqrt(1+w**2) * exp( -0.5*z**2 );
      output;
   end;
end;
drop c w z sqrt2pi;
run;
 
title "The Johnson SU Probability Density";
title2 "theta=0; sigma=1";
proc sgplot data=SU_pdf;
   label pdf="Density";
   series x=x y=pdf / group=Params lineattrs=(thickness=2);
   xaxis grid offsetmin=0 offsetmax=0;  yaxis grid;
run;

The SU distributions have relatively large kurtosis. The graph shows that the SU distribution contains a variety of unimodal curves. The blue curve (γ=–1.1, δ=1.5) and the red curve (γ=–1.1, δ=0.8) both exhibit positive skewness whereas the other curves are closer to being symmetric.

The SU distributions are useful for modeling "heavy-tailed" distributions. A random sample from a heavy-tailed distribution typically contains more observations that are far from the mean than a random normal sample of the same size. Thus you might want to use a Johnson SU distribution to model a phenomenon in which extreme events occur somewhat frequently.

The CDF and quantiles of the SU distribution

I am not aware of explicit formulas for the cumulative distribution function (CDF) and quantile function of the SU distribution. You can use numerical integration of the PDF to evaluate the cumulative distribution. You can use numerical root-finding methods to evaluate the inverse CDF (quantile) function, but it will be an expensive computation.
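That said, because the normalizing transformation stated earlier is explicit and monotone (for δ > 0), one practical alternative is to push both computations through the standard normal distribution: F(x) = Φ(γ + δ arsinh((x-θ)/σ)), and the p-th quantile is θ + σ sinh((Φ⁻¹(p) - γ)/δ). The following DATA step is my own sketch of this shortcut, not part of the original article; the evaluation points x and p are arbitrary.

/* Sketch (my addition): evaluate the SU CDF and quantile through the
   standard normal distribution via the normalizing transformation.
   The parameter values match the ones used earlier in this article. */
data SU_cdf_quantile;
   theta = 0;  sigma = 1;  gamma = -1.1;  delta = 1.5;
   /* CDF at an arbitrary point x */
   x = 2;
   w = (x - theta) / sigma;
   F = cdf("Normal", gamma + delta*log(w + sqrt(1 + w**2)));
   /* quantile at an arbitrary probability p */
   p = 0.95;
   q = theta + sigma*sinh( (quantile("Normal", p) - gamma)/delta );
   put F= q=;
run;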

Fitting parameters of the SU distribution

If you have data, you can use PROC UNIVARIATE to estimate the four parameters of the SU distribution that best fit the data. The following call to PROC UNIVARIATE estimates the parameters for the random sample that was simulated earlier in this article:

proc univariate data=SU;
   histogram x / SU(theta=EST sigma=EST gamma=EST delta=EST
                    fitmethod=percentiles);
run;

The HISTOGRAM statement supports three methods for fitting the parameters from data. By default, the procedure uses the FITMETHOD=PERCENTILE method (Slifker and Shapiro, 1980) to estimate the parameters. You can also use FITMETHOD=MLE to obtain a maximum likelihood estimate or FITMETHOD=MOMENTS to estimate the parameters by matching the sample moments. For these data, I've used the percentile method. The method of moments does not provide a feasible solution for these data.
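For comparison, the following variation (my addition) requests maximum likelihood estimates instead; only the FITMETHOD= value changes.

proc univariate data=SU;
   histogram x / SU(theta=EST sigma=EST gamma=EST delta=EST
                    fitmethod=MLE);
run;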

Summary

The Johnson SU distribution is a family that models unbounded distributions. It is especially useful for modeling distributions that have heavy tails. This article shows how to simulate random values from the SU distribution and how to visualize the probability density function (PDF). It also shows how to use PROC UNIVARIATE to fit parameters of the SU distribution to data.

The post The Johnson SU distribution appeared first on The DO Loop.

January 23, 2020
 

When I attend a conference, one of the first things I do is look at the agenda. This gives me a good overview of how my time will be spent.

The next thing I do is find the detailed breakdown of sessions, so I can start building out my own personal agenda. I know my areas of interest, and I want to make sure my time is spent learning as much as possible. I’ve done this at industry conferences, as well as every SAS GF I’ve attended (and I’ve attended a lot).

I am happy to report that the session catalog for SAS Global Forum 2020 helps me understand what sessions are available so I can make the most of my conference experience. Here are my tips to make the most of the session catalog:

Use filters

For the first time, all programs are represented in the catalog. And anyone can attend any session. You can start broad, then filter by level of expertise, topic of interest, industry, product, or style of presentation. Filters are your friend…use them!

Use the interest list

You can bookmark sessions to your interest list, but beware that these do not automatically save for you. Be sure to use the export feature. Then you can share your chosen sessions with friends, print them, and cross-check them as you’re building your agenda in the mobile app starting March 1. It’s a great tool.

Review sample agendas

Speaking of agendas, it’s sometimes hard to narrow down your options. Start with one of the custom-tailored agendas within the session catalog and use these as a starting point to help complete your personal agenda. We keep adding new sample agendas, so you will have many great sessions to pick from!

Solicit the help of your SAS point of contact

If you’re a first-timer, or someone who is overwhelmed by too many choices, let your SAS contact help. They can help select sessions that they feel will be beneficial to your learning and networking.

Moral of the story

If you can’t tell, I’m a big fan of the session catalog. It’s totally customizable. The way I use it may not be the way you use it and that’s fine. That’s why there are so many ways to search, filter and save. Just give it a try. Your SAS Global Forum experience will be all the better when you include the session catalog.

And remember, anyone can check out the session catalog. Not planning to attend? Well, boo, go ahead and see what you’re missing. If you’re on the fence about the conference, check out all the great things SAS Global Forum has to offer then go register!

Make the most of SAS Global Forum with the session catalog was published on SAS Users.

January 23, 2020
 

I was recently asked about how to interpret the output from the COLLIN (or COLLINOINT) option on the MODEL statement in PROC REG in SAS. The example in the documentation for PROC REG is correct but is somewhat terse regarding how to use the output to diagnose collinearity and how to determine which variables are collinear. This article uses the same data but goes into more detail about how to interpret the results of the COLLIN and COLLINOINT options.

An overview of collinearity in regression

Collinearity (sometimes called multicollinearity) involves only the explanatory variables. It occurs when a variable is nearly a linear combination of other variables in the model. Equivalently, there is a set of explanatory variables that is linearly dependent in the sense of linear algebra. (Another equivalent statement is that the design matrix and the X`X matrix are not full rank.)

For example, suppose a model contains five regressor variables and the variables are related by X3 = 3*X1 - X2 and X5 = 2*X4. In this case, there are two sets of linear relationships among the regressors: one relationship involves the variables X1, X2, and X3, and another involves the variables X4 and X5. In practice, collinearity means that a set of variables are almost linear combinations of each other. For example, the vectors u = X3 - 3*X1 + X2 and v = X5 - 2*X4 are close to the zero vector.

Unfortunately, the words "almost" and "close to" are difficult to quantify. The COLLIN option on the MODEL statement in PROC REG provides a way to analyze the design matrix for potentially harmful collinearities.
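To make the idea concrete before turning to real data, here is a small simulation of the situation described above. It is my own sketch rather than part of the original article; the variable names, sample size, and noise level are arbitrary, but the near-dependencies X3 ≈ 3*X1 - X2 and X5 ≈ 2*X4 are built in, so the COLLIN diagnostics should report two large condition indices.

/* Sketch (my addition): regressors with two built-in near-dependencies */
data CollinSim;
   call streaminit(123);
   do i = 1 to 100;
      x1 = rand("Normal");
      x2 = rand("Normal");
      x4 = rand("Normal");
      x3 = 3*x1 - x2 + 0.01*rand("Normal");   /* nearly 3*X1 - X2 */
      x5 = 2*x4      + 0.01*rand("Normal");   /* nearly 2*X4 */
      y  = x1 + x2 + x4 + rand("Normal");     /* arbitrary response */
      output;
   end;
   drop i;
run;
 
proc reg data=CollinSim plots=none;
   model y = x1-x5 / collin;       /* expect two rows with large condition indices */
   ods select CollinDiag;
quit;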

Why should you avoid collinearity in regression?

The assumptions of ordinary least squares (OLS) regression are not violated if there is collinearity among the independent variables. OLS regression still provides the best linear unbiased estimates of the regression coefficients.

The problem is not the estimates themselves but rather the variance of the estimates. One problem caused by collinearity is that the standard errors of those estimates will be big. This means that the predicted values, although the "best possible," will have wide prediction limits. In other words, you get predictions, but you can't really trust them.

A second problem concerns interpretability. The sign and magnitude of a parameter estimate indicate how the dependent variable changes due to a unit change of the independent variable when the other variables are held constant. However, if X1 is nearly collinear with X2 and X3, it does not make sense to discuss "holding constant" the other variables (X2 and X3) while changing X1. The variables necessarily must change together. Collinearities can even cause some parameter estimates to have "wrong signs" that conflict with your intuitive notion about how the dependent variable should depend on an independent variable.

A third problem with collinearities is numerical rather than statistical. Strong collinearities cause the cross-product matrix (X`X) to become ill-conditioned. Computing the least squares estimates requires solving a linear system that involves the cross-product matrix. Solving an ill-conditioned system can result in relatively large numerical errors. However, in practice, the statistical issues usually outweigh the numerical one. A matrix must be extremely ill-conditioned before the numerical errors become important, whereas the statistical issues are problematic for moderate to large collinearities.

How to interpret the output of the COLLIN option?

The following example is from the "Collinearity Diagnostics" section of the PROC REG documentation. Various health and fitness measurements were recorded for 31 men, such as time to run 1.5 miles, the resting pulse, the average pulse rate while running, and the maximum pulse rate while running. These measurements are used to predict the oxygen intake rate, which is a measurement of fitness but is difficult to measure directly.

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182   40 75.07 45.313 10.07 62 185 185
44 85.84 54.297  8.65 45 156 168   42 68.15 59.571  8.17 40 166 172
38 89.02 49.874  9.22 55 178 180   47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180   43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176   38 81.87 60.055  8.63 48 170 186
44 73.03 50.541 10.13 45 168 168   45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176   47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170   49 81.42 49.156  8.95 44 180 185
51 69.63 40.836 10.95 57 168 172   51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164   49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176   54 79.38 46.080 11.17 62 156 165
52 76.32 45.441  9.63 48 164 166   50 70.87 54.625  8.92 48 146 155
51 67.25 45.118 11.08 48 172 172   54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188   57 59.08 50.545  9.93 49 148 155
49 76.32 48.673  9.40 56 186 188   48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
 
proc reg data=fitness plots=none;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / collin;
   ods select ParameterEstimates CollinDiag;
   ods output CollinDiag = CollinReg;
quit;

The output from the COLLIN option is shown. I have added some colored rectangles to the output to emphasize how to interpret the table. To determine collinearity from the output, do the following:

  • Look at the "Condition Index" column. Large values in this column indicate potential collinearities. Many authors use 30 as a number that warrants further investigation. Other researchers suggest 100. Most researchers agree that no single number can handle all situations.
  • For each row that has a large condition index, look across the columns in the "Proportion of Variation" section of the table. Identify cells that have a value of 0.5 or greater. The columns of these cells indicate which variables contribute to the collinearity. Notice that at least two variables are involved in each collinearity, so look for at least two cells with large values in each row. However, there could be three or more cells that have large values. "Large" is relative to the value 1, which is the sum of each column.

Let's apply these rules to the output for the example (a programmatic version of this check appears after the list):

  • If you use 30 as a cutoff value, there are three rows (marked in red) whose condition numbers exceed the cutoff value. They are rows 5, 6, and 7.
  • For the 5th row (condition index=33.8), there are no cells that exceed 0.5. The two largest cells (in the Weight and RestPulse columns) indicate a small near-collinearity between the Weight and RestPulse measurements. The relationship is not strong enough to worry about.
  • For the 6th row (condition index=82.6), there are two cells that are 0.5 or greater (rounded to four decimals). The cells are in the Intercept and Age columns. This indicates that the Age and Intercept terms are nearly collinear. Collinearities with the intercept term can be hard to interpret. See the comments at the end of this article.
  • For the 7th row (condition index=196.8), there are two cells that are greater than 0.5. The cells are in the RunPulse and MaxPulse columns, which indicates a very strong linear relationship between these two variables.
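If you prefer to automate these checks, you can post-process the CollinReg data set that the ODS OUTPUT statement created earlier. The sketch below is my addition; the column names (CondIndex and one proportion column per model term) are an assumption based on the displayed table, so run PROC CONTENTS on CollinReg to confirm them before relying on this.

/* Sketch (my addition): list only the rows whose condition index exceeds 30.
   The variable names below are assumed -- verify with PROC CONTENTS. */
proc print data=CollinReg noobs;
   where CondIndex > 30;
   var Number Eigenvalue CondIndex Intercept RunTime Age Weight
       RunPulse MaxPulse RestPulse;
run;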

Your model has collinearities. Now what?

After you identify the cause of the collinearities, what should you do? That is a difficult and controversial question that has many possible answers.

  • Perhaps the simplest solution is to use domain knowledge to omit the "redundant" variables. For example, you might want to drop MaxPulse from the model and refit. However, in this era of Big Data and machine learning, some analysts want an automated solution.
  • You can use dimensionality reduction and an (incomplete) principal component regression.
  • You can use a biased estimation technique such as ridge regression, which allows bias but reduces the variance of the estimates.
  • Some practitioners use variable selection techniques to let the data decide which variables to omit from the model. However, be aware that different variable-selection methods might choose different variables from among the set of nearly collinear variables.

Keep the intercept or not?

Equally controversial is the question of whether to include the intercept term in a collinearity diagnostic. The COLLIN option in PROC REG includes the intercept term among the variables to be analyzed for collinearity. The COLLINOINT option excludes the intercept term. Which should you use? Here are two opinions that I found:

  1. Use the intercept term: Belsley, Kuh, and Welsch (Regression Diagnostics, 1980, p. 120) state that omitting the intercept term is "inappropriate in the event that [the design matrix] contains a constant column." They go on to say (p. 157) that "centering the data [when the model has an intercept term] can mask the role of the constant in any near dependencies and produce misleading diagnostic results." These quotes strongly favor using the COLLIN option when the model contains an intercept term.
  2. Do not use the intercept term if it is outside of the data range: Freund and Littell (SAS System for Regression, 3rd Ed, 2000) argue that including the intercept term in the collinearity analysis is not always appropriate. "The intercept is an estimate of the response at the origin, that is, where all independent variables are zero. .... [F]or most applications the intercept represents an extrapolation far beyond the reach of the data. For this reason, the inclusion of the intercept in the study of multicollinearity can be useful only if the intercept has some physical interpretation and is within reach of the actual data space." For the example data, it is impossible for a person to have zero age, weight, or pulse rate; therefore, I suspect Freund and Littell would recommend using the COLLINOINT option instead of the COLLIN option for these data.

So what should you do if the experts disagree? I usually defer to the math. Although I am reluctant to contradict Freund and Littell (both widely published experts and Fellows of the American Statistical Association), the mathematics of the collinearity analysis (which I will discuss in a separate article) seems to favor the opinion of Belsley, Kuh, and Welsch. I use the COLLIN option when my model includes an intercept term.

Do you have an opinion on this matter? Leave a comment.

The post Collinearity in regression: The COLLIN option in PROC REG appeared first on The DO Loop.

January 21, 2020
 

One great thing about being a SAS programmer is that you never run out of new things to learn. SAS often gives us a variety of methods to produce the same result. One good example of this is the DATA step and PROC SQL, both of which manipulate data. The DATA step is extremely powerful and flexible, but PROC SQL has its advantages too. Until recently, my knowledge of PROC SQL was pretty limited. But for the sixth edition of The Little SAS Book, we decided to move the discussion of PROC SQL from an appendix (who reads appendices?) to the body of the book. This gave me an opportunity to learn more about PROC SQL.

When developing my programs, I often find myself needing to calculate the mean (or sum, or median, or whatever) of a variable, and then merge that result back into my SAS data set. That would generally involve at least a couple PROC steps and a DATA step, but using PROC SQL I can achieve the same result all in one step.

Example

Consider this example using the Cars data set in the SASHELP library. Among other things, the data set contains the 2004 MSRP for over 400 models of cars of various makes and car types. Suppose you want a data set that contains the make, model, type, and MSRP for each model, along with the median MSRP for all cars of the same make. In addition, you would like a variable that is the difference between the MSRP for that model and the median MSRP for all models of the same make. Here is the PROC SQL code that will create a SAS data set, MedianMSRP, with the desired result:

*Create summary variable for median MSRP by Make;

PROC SQL;
   CREATE TABLE MedianMSRP AS
   SELECT Make, Model, Type, MSRP,
          MEDIAN(MSRP) AS MedMSRP,
          (MSRP - MEDIAN(MSRP)) AS MSRP_VS_Median
   FROM sashelp.cars
   GROUP BY Make;
QUIT;


 

The CREATE TABLE clause simply names the SAS data set to create, while the FROM clause names the SAS data set to read. The SELECT clause lists the variables to keep from the old data set (Make, Model, Type, and MSRP) along with specifications for the new summary variables. The new variable, MedMSRP, is the median of the old MSRP variable, while the new variable MSRP_VS_Median is the MSRP minus the median MSRP. The GROUP BY clause tells SAS to do the calculations within each value of the variable Make. If you leave off the GROUP BY clause, then the calculations would be done over the entire data set. When you run this code, you will get the following message in your SAS log telling you it is doing exactly what you wanted it to do:

NOTE: The query requires remerging summary statistics back with the original data.

The following PROC PRINT produces a report showing just the observations for two makes – Porsche and Jeep.

PROC PRINT DATA = MedianMSRP;
  TITLE '2004 Car Prices';
  WHERE Make IN ('Porsche','Jeep');
  FORMAT MedMSRP MSRP_VS_Median DOLLAR8.0;
RUN;

Results

Here are the results:

Now PROC SQL aficionados will tell you that if all you want is a report and you don’t need to create a SAS data set, then you can do it all in just the PROC SQL step. But that is the topic for another blog!
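For the curious, here is a minimal sketch (my addition, not from the original post) of that report-only approach: drop the CREATE TABLE clause and add the WHERE clause, and PROC SQL prints the result directly.

TITLE '2004 Car Prices';
PROC SQL;
   SELECT Make, Model, Type, MSRP FORMAT=DOLLAR8.0,
          MEDIAN(MSRP) AS MedMSRP FORMAT=DOLLAR8.0,
          (MSRP - MEDIAN(MSRP)) AS MSRP_VS_Median FORMAT=DOLLAR8.0
   FROM sashelp.cars
   WHERE Make IN ('Porsche','Jeep')
   GROUP BY Make;
QUIT;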

 

Expand Your SAS Knowledge by Learning PROC SQL was published on SAS Users.