August 22, 2018
 

A SAS programmer recently asked how to interpret the "standardized regression coefficients" as computed by the STB option on the MODEL statement in PROC REG and other SAS regression procedures. The SAS documentation for the STB option states, "a standardized regression coefficient is computed by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor." Although correct, this definition does not provide an intuitive feeling for how to interpret the standardized regression estimates. This article uses SAS to demonstrate how parameter estimates for the original variables are related to parameter estimates for standardized variables. It also derives how regression coefficients change after a linear transformation of the variables.

Proof by example

One of my college physics professors used to smile and say "I will now prove this by example" when he wanted to demonstrate a fact without proving it mathematically. This section uses PROC STDIZE and PROC REG to "prove by example" that the standardized regression estimates for the original data are equal to the parameter estimates that you obtain by standardizing the data. The following example uses continuous response and explanatory variables, but there is a SAS Usage Note that describes how to standardize classification variables.

The following call to PROC REG uses the STB option to compute the standardized parameter estimates for a model that predicts the weights of 19 students from heights and ages:

proc reg data=Sashelp.Class plots=none;
   Orig: model Weight = Height Age / stb;
   ods select ParameterEstimates;
quit;

The last column is the result of the STB option on the MODEL statement. You can get the same numbers by first standardizing the data and then performing a regression on the standardized variables, as follows:

/* Put original and standardized variables into the output data set.
   Standardized variables have the names 'StdX' where X was the name of the original variable. 
   METHOD=STD standardizes variables according to StdX = (X - mean(X)) / std(X) */
proc stdize data=Sashelp.Class out=ClassStd method=std OPREFIX SPREFIX=Std;
run;
 
proc reg data=ClassStd plots=none;
   Std: model StdWeight = StdHeight StdAge;
   ods select ParameterEstimates;
quit;

The parameter estimates for the standardized data are equal to the STB estimates for the original data. Furthermore, the t values and p-values for the slope parameters are equivalent because these statistics are scale- and translation-invariant. Notice, however, that scale-dependent statistics such as standard errors and covariance of the betas will not be the same for the two analyses.

Linear transformations of random variables

Mathematically, you can use a little algebra to understand how linear transformations affect the relationship between two linearly dependent random variables. Suppose X is a random variable and Y = b0 + b1*X for some constants b0 and b1. What happens to this linear relationship if you apply linear transformations to X and Y?

Define new variables U = (Y - cY) / sY and V = (X - cX) / sX. If you solve for X and Y and plug the expressions into the equation for the linear relationship, you find that the new random variables are related by
U = (b0 + b1*cX - cY) / sY + b1*(sX/sY)*V.
If you define B0 = (b0 + b1*cX - cY) / sY and B1 = b1*(sX/sY), then U = B0 + B1*V, which shows that the transformed variables (U and V) are linearly related.
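
If you want to check this derivation numerically, the following DATA step (a sketch of mine, not part of the original article) picks arbitrary constants, applies both transformations, and verifies that U - (B0 + B1*V) is zero for several values of X:

data _null_;
   b0 = 2;  b1 = 3;                  /* original linear relationship: Y = b0 + b1*X */
   cX = 10; sX = 4;                  /* arbitrary shift and scale for X */
   cY = 5;  sY = 2;                  /* arbitrary shift and scale for Y */
   Beta0 = (b0 + b1*cX - cY) / sY;   /* B0 in the text */
   Beta1 = b1 * (sX / sY);           /* B1 in the text */
   do X = -1, 0, 2.5, 7;             /* a few test values */
      Y = b0 + b1*X;
      V = (X - cX) / sX;             /* transformed X */
      U = (Y - cY) / sY;             /* transformed Y */
      diff = U - (Beta0 + Beta1*V);  /* should be 0, up to rounding error */
      put X= Y= V= U= diff=;
   end;
run;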

The physical significance of this result is that linear relationships persist no matter what units you choose to measure the variables. For example, if X is measured in inches and Y is measured in pounds, then the quantities remain linearly related if you measure X in centimeters and measure Y in "kilograms from 20 kg."

The effect of standardizing variables on regression estimates

The analysis in the previous section holds for any linear transformation of linearly related random variables. But suppose, in addition, that

  • U and V are standardized versions of Y and X, respectively. That is, cY and cX are the sample means and sY and sX are the sample standard deviations.
  • The parameters b0 and b1 are the regression estimates for a simple linear regression model.

For simple linear regression, the intercept estimate is b0 = cY - b1*cX, which implies that B0 = 0. Furthermore, the coefficient B1 = b1*(sX/sY) is the original parameter estimate "divided by the ratio of the sample standard deviation of the dependent variable to the sample standard deviation of the regressor," just as the PROC REG documentation states and just as we saw in the PROC REG output in the previous section. Thus the STB option on the MODEL statement does not need to standardize any data! It produces the standardized estimates by setting the intercept term to 0 and dividing the parameter estimates by the ratio of standard deviations, as noted in the documentation. (A similar proof handles multiple explanatory variables.)
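
To see this concretely, the following sketch (my addition, not code from the original article; the data set and variable names PE, SD, ManualSTB, and StdEst are arbitrary) reproduces the STB column for the Class example without standardizing any data. It captures the unstandardized estimates with ODS OUTPUT, computes the sample standard deviations with PROC MEANS, and multiplies each slope by std(regressor)/std(Weight):

proc reg data=Sashelp.Class plots=none;
   model Weight = Height Age;
   ods output ParameterEstimates=PE;            /* capture the unstandardized estimates */
quit;
 
proc means data=Sashelp.Class noprint;
   var Weight Height Age;
   output out=SD std=sWeight sHeight sAge;      /* sample standard deviations */
run;
 
data ManualSTB;
   if _n_ = 1 then set SD;                      /* make the std devs available on every row */
   set PE;
   if      Variable = 'Height' then StdEst = Estimate * sHeight / sWeight;
   else if Variable = 'Age'    then StdEst = Estimate * sAge    / sWeight;
   else StdEst = 0;                             /* the standardized intercept is 0 */
   keep Variable Estimate StdEst;
run;
 
proc print data=ManualSTB noobs; run;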

Interpretation of the regression coefficients

For the original (unstandardized) data, the intercept estimate predicts the value of the response when the explanatory variables are all zero. The regression coefficients predict the change in the response for one unit change in an explanatory variable. The "change in response" depends on the units for the data, such as kilograms per centimeter.

The standardized coefficients predict the number of standard deviations that the response will change for one STANDARD DEVIATION of change in an explanatory variable. The "change in response" is a unitless quantity. The fact that the standardized intercept is 0 indicates that the predicted value of the (centered) response is 0 when the model is evaluated at the mean values of the explanatory variables.

In summary, standardized coefficients are the parameter estimates that you would obtain if you standardize the response and explanatory variables by centering and scaling the data. A standardized parameter estimate predicts the change in the response variable (in standard deviations) for one standard deviation of change in the explanatory variable.

The post Standardized regression coefficients appeared first on The DO Loop.

August 21, 2018
 

When you hear someone refer to an ‘inside baseball’ move, it means they’re playing into the subtleties of the game. Inside baseball requires a high level of awareness, experience, and strategic thought. This typically results in a mix of strategies to get runners on base and manufacture runs rather than [...]

Three 'inside baseball' tactics for manufacturing leaders was published on SAS Voices by Roger Thomas

August 20, 2018
 

As SAS Global Forum 2019 Conference Chair, I am excited to share our plans for an extraordinary SAS experience in Dallas, Texas, April 28-May 1. SAS has always been my analytic tool of choice and has provided me with an amazing journey in my professional career. Having attended many SAS global conferences throughout the years, I've found this is the best venue to learn, contribute, and network with other SAS colleagues.

The content will not disappoint – submit your abstract now
What presentation topics will you hear at SAS Global Forum? In 2019, expanded topics will include deep dives into text analytics, business visualization, AI/Machine Learning, SAS administration, open source integration and career development discussions. As always, SAS basic programming techniques using the current SAS software environment will be the underpinning of many sessions.

Do you want to be a part of that great content but have never presented at a conference? Now could be the time to learn new skills by participating in the SAS Global Forum Presenter Mentoring Program. Available to first-time presenters, the program pairs you with a seasoned expert who will guide you in preparing and delivering a professional presentation. The mentors can also help before you even submit your abstract in the call for content – which is now open – by helping you craft your title and abstract for your submission. The open call is available through October 22.

How to get to Dallas
Did you know there are scholarships and awards to help SAS users attend SAS Global Forum? Our award program has been enhanced for New SAS Professionals. This award provides an opportunity for new SAS users to enhance their basic SAS skills as well as learn about the latest technology in their line of business. Hear what industry experts are doing to solve business issues by sharing real-world evidence cases.

In addition, we're offering an enhanced International Professional Award focused on global engagement and participation. This award is available to those outside of the continental 48 states who want to share their expertise and industry solutions from around the world. At the conference, you will have a chance to learn from and network with international professionals who work on analytic projects similar to yours.

Don’t miss out on this valuable experience! Submit your abstract for consideration to be a presenter! I look forward to seeing you in Dallas and hearing about your work.

Call for content now open for SAS Global Forum 2019 was published on SAS Users.

August 20, 2018
 

Using small multiples is a neat way to display a lot of information in a small amount of space. But depending on how deeply you want to analyze and scrutinize the data, you need to be careful in choosing just how small you make your small multiples. Let's look at [...]

The post Most common jobs in the US - a small multiples comparison appeared first on SAS Learning Post.

August 20, 2018
 

I love data; I’m a real and unabashed data geek. I'm the sci-fi nerd who has fun with data from Star Wars and analyzes World of Warcraft logs using SAS. More importantly, I love what data can do. I love the way it can show people new insights and new ideas, [...]

Having fun, getting involved and feeling useful: Data for Good was published on SAS Voices by Mary Osborne

August 20, 2018
 
Video killed the radio star....
We can't rewind, we've gone too far.
      -- The Buggles (1979)

"You kids have it easy," my father used to tell me. "When I was a kid, I didn't have all the conveniences you have today." He's right, and I could say the same thing to my kids, especially about today's handheld technology. In particular, I've noticed that powerful handheld technology (especially the modern calculator) has killed the standard probability tables that were once ubiquitous in introductory statistics textbooks.

In my first probability and statistics course, I constantly referenced the 23 statistical tables (which occupied 44 pages!) in the appendix of my undergraduate textbook. Any time I needed to compute a probability or test a hypothesis, I would flip to a table of probabilities for the normal, t, chi-square, or F distribution and use it to compute a probability (area) or quantile (critical value). If the value I needed wasn't tabulated, I had to manually perform linear interpolation from two tabulated values. I had no choice: my calculator did not have support for these advanced functions.

In contrast, kids today have it easy! When my son took AP statistics in high school, his handheld calculator (a TI-84, which costs about $100) could compute the PDF, CDF, and quantiles of all the important probability distributions. Consequently, his textbook did not include an appendix of statistical tables.

It makes sense that publishers would choose to omit these tables, just as my own high school textbooks excluded the trig and logarithm tables that were prevalent in my father's youth. When handheld technology can reproduce all the numbers in a table, why waste the ink and paper?

In fact, by using SAS software, you can generate and display a statistical table with only a few statements. To illustrate this, let's use SAS to generate two common statistical tables: a normal probability table and a table of critical values for the chi-square statistic.

A normal probability table

A normal probability table gives an area under the standard normal density curve. As discussed in a Wikipedia article about the standard normal table, there are three equivalent kinds of tables, but I will use SAS to produce the first table on the list. Given a standardized z-score, z > 0, the table gives the probability that a standard normal random variate is in the interval (0, z). That is, the table gives P(0 < Z < z) for a standard normal random variable Z. The graph below shows the shaded area that is given in the body of the table.

You can create the table by using the SAS DATA step, but I'll use SAS/IML software. The rows of the table indicate the z-score to the first decimal place. The columns of the table indicate the second decimal place of the z-score. The key to creating the table is to recall that you can call any Base SAS function from SAS/IML, and you can use vectors of parameters. In particular, the following statements use the EXPANDGRID function to generate all two-decimal z-scores in the range [0, 3.4]. The program then calls the CDF function to evaluate the probability P(Z < z) and subtracts 1/2 to obtain the probability P(0 < Z < z). The SHAPE function reshapes the vector of probabilities into a 10-column table. Finally, the PUTN function converts the column and row headers into character values that are printed at the top and side of the table.

proc iml;
/* normal probability table similar to 
   https://en.wikipedia.org/wiki/Standard_normal_table#Table_examples  */
z1 = do(0, 3.4, 0.1);          /* z-score to first decimal place */
z2 = do(0, 0.09, 0.01);        /* second decimal place */
z = expandgrid(z1, z2)[,+];    /* sum of each pair of z1 and z2 values */
p = cdf("Normal", z) - 0.5;    /* P( 0 < Z < z ) */
Prob = shape( p, ncol(z1) );   /* reshape into table with 10 columns */
 
z1Lab = putn(z1, "3.1");       /* formatted values of z1 for row headers */
z2Lab = putn(z2, "4.2");       /* formatted values of z2 for col headers */
print Prob[r=z1Lab c=z2Lab F=6.4
           L="Standard Normal Table for P( 0 < Z < z )"];

Standard Normal Probability Table

To find the probability between 0 and z=0.67, find the row for z=0.6 and then move over to the column labeled 0.07. The value of that cell is P(0 < Z < 0.67) = 0.2486.
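
Of course, the calculator-style computation makes the table unnecessary. The following one-line check (my addition) reproduces the table lookup directly with the CDF function:

data _null_;
   p = cdf("Normal", 0.67) - 0.5;   /* P(0 < Z < 0.67) */
   put p= 6.4;                      /* prints 0.2486, the tabulated value */
run;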

Chi-square table of critical values

Some statistical tables display critical values for a test statistic instead of probabilities. This section shows how to construct a table of the critical values of a chi-square test statistic for common significance levels (α). The rows of the table correspond to degrees of freedom; the columns correspond to significance levels. The following graph shows the situation. The shaded area corresponds to significance levels.

The corresponding table provides the quantile of the distribution that corresponds to each significance level. Again, you can use the DATA step, but I have chosen to use SAS/IML software to generate the table:

/* table of critical values for the chi-square distribution
   https://flylib.com/books/3/287/1/html/2/images/xatab02.jpg */
df = (1:30) || do(40,100,10);   /* degrees of freedom */
/* significance levels for common one- or two-sided tests */
alpha = {0.99 0.975 0.95 0.9 0.1 0.05 0.025 0.01}; 
g = expandgrid(df, alpha);                    /* all pairs of (df, alpha) values */
p = quantile("ChiSquare", 1 - g[,2], g[,1]);  /* compute quantiles for (df, 1-alpha) */
CriticalValues = shape( p, ncol(df) );        /* reshape into table */
 
dfLab = putn(df, "3.");                       /* formatted row headers */
pLab = putn(alpha, "5.3");                    /* formatted col headers */
print CriticalValues[rowname=dfLab colname=pLab F=6.3 
      L="Critical Values of Chi-Square Distribution"];

To illustrate how to use the table, suppose that you have computed a chi-square test statistic for 9 degrees of freedom. You want to determine if you should reject the (one-sided) null hypothesis at the α = 0.05 significance level. You trace down the table to the row for df=9 and then trace over to the column for 0.05. The value in that cell is 16.919, so you would reject the null hypothesis if your test statistic exceeds that critical value.
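
Again, a single call to the QUANTILE function (my addition) reproduces the table lookup:

data _null_;
   crit = quantile("ChiSquare", 1 - 0.05, 9);   /* critical value for alpha=0.05, df=9 */
   put crit= 6.3;                               /* prints 16.919 */
run;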

Just as digital downloads of songs have supplanted records and CDs, so, too, have modern handheld calculators replaced the statistical tables that used to feature prominently in introductory statistics courses. However, if you ever feel nostalgic for the days of yore, you can easily resurrect your favorite table by writing a few lines of SAS code. To be sure, there are some advanced tables (the Kolmogorov-Smirnov test comes to mind) that have not yet been replaced by a calculator key, but the simple tables are dead. Killed by the calculator.

It might be ill to speak of the dead, but I say, "good riddance"; I never liked using those tables anyway. What are your thoughts? If you are old enough to remember statistical tables, do you have fond memories of using them? If you learned statistics recently, are you surprised that tables were once so prevalent? Share your experiences by leaving a comment.

The post Calculators killed the standard statistical table appeared first on The DO Loop.

August 18, 2018
 

Get ready to have your mind blown. Whether or not you plan to attend Analytics Experience in San Diego on September 17, you'll be inspired by the speakers we have lined up to keynote the event. There's an inventor. A data scientist. A world class athlete. And a photographer. They've [...]

4 inspiring videos from Analytics Experience keynote speakers was published on SAS Voices by Alison Bolen

August 17, 2018
 

Data density estimation is often used in statistical analysis as well as in data mining and machine learning. Visualizing a density estimate reveals characteristics of the data such as its distribution, skewness, and modality. The most widely used visualizations of data density are the box plot, the histogram, and the kernel density estimate. SAS has several procedures that can create such plots. Here, I'll superimpose a kernel density estimate on a histogram using SAS Visual Analytics.

A histogram shows the data distribution over a set of continuous interval bins, and it is a very useful way to present a distribution. With a histogram, we get a rough view of the density of the values. However, the bin width (or number of bins) has a significant impact on the shape of a histogram and thus can give viewers different impressions of the same data. For example, the two histograms below display the same data, the left one with 6 bins and the right one with 4 bins, yet they suggest different distributions. In addition, a histogram is not smooth enough to compare visually with a mathematical density model. For these reasons, many people prefer kernel density estimates, which vary more smoothly.
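
To reproduce that kind of comparison in code, you can draw the same variable with two different bin counts. The sketch below is my addition (the original screenshots might have used different data); it uses the MSRP variable that appears later in this post:

proc sgplot data=sashelp.cars;
   histogram MSRP / nbins=6;   /* one impression of the distribution: 6 bins */
run;
 
proc sgplot data=sashelp.cars;
   histogram MSRP / nbins=4;   /* a different impression of the same data: 4 bins */
run;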

Kernel density estimation (KDE) is a widely used nonparametric approach to estimating the probability density of a random variable. Nonparametric means that the estimate adapts to the observations in the data, so it is more flexible than a parametric estimate. To plot a KDE, we need to choose a kernel function and its bandwidth. The kernel function is used to compute the density estimate, and the bandwidth controls the smoothness of the KDE plot; it is essentially the width of the sliding window used to generate the density. SAS offers several ways to compute kernel density estimates. Here I use PROC UNIVARIATE to create the KDE output as an example (for simplicity, I set c=SJPI so that SAS selects the bandwidth by using the Sheather-Jones plug-in method), and then I build the corresponding visualization in SAS Visual Analytics.

Visualize the kernel density estimates using SAS code

It is straightforward to compute kernel density estimates with PROC UNIVARIATE. Take the variable MSRP in the SASHELP.CARS data set as an example. The minimum and maximum values of the MSRP column are 10280 and 192465, respectively, and I plot the histogram with 15 bins in this example. Below is the code segment I used to construct the kernel density estimate of the MSRP column:

title 'Kernel density estimates of MSRP';
proc univariate data=sashelp.cars noprint;
   histogram MSRP / kernel(c=SJPI) endpoints=10280 to 192465 by 12145
                    outkernel=KDE odstitle=title;
run;

Run the above code in SAS Studio, and we get the following graph.
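
Before moving to SAS Visual Analytics, it can be helpful to inspect the OUTKERNEL= data set, because its columns (including the ‘Data Value’ and ‘Percent of Observations Per Data Unit’ items used in the steps below) are what you will assign to the numeric series plot. This quick check is my addition:

proc contents data=KDE;       run;   /* list the variables and labels in the kernel data set */
proc print data=KDE(obs=5);   run;   /* peek at the first few rows */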

Visualize the kernel density estimates using SAS Visual Analytics

  1. In SAS Visual Analytics, load SASHELP.CARS and the KDE data set (from the previous PROC UNIVARIATE step) to the CAS server.
  2. Drag and drop a ‘Precision Container’ onto the canvas, and put a histogram and a numeric series plot in the container.
  3. Assign the corresponding data to the histogram: assign CARS.MSRP as the histogram Measure and ‘Frequency Percent’ as the histogram Frequency. Then set the histogram options as follows:
    Object -> Title: No title;

    Graph Frame: Grid lines: disabled

    Histogram -> Bin range: Measure values; check the ‘Set a fixed bin count’ and set ‘Bin count’ to 15.

    X Axis options:

       Fixed minimum: 10280

       Fixed maximum: 192465

       Axis label: disabled

       Axis Line: enabled

       Tick value: enabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

  4. Assign the corresponding KDE data to the numeric series plot. Define a calculated item, Percent, as (‘Percent of Observations Per Data Unit’n / 100) with the format ‘PERCENT12.2’, and assign it to the ‘Y axis’; assign ‘Data Value’ to the ‘X axis’. Then set the options of the numeric series plot as follows:
    Object -> Title: No title;

    Style -> Line/Marker: (change the first color to purple)

    Graph Frame -> Grid lines: disabled

    Series -> Line thickness: 2

    X Axis options:

       Axis label: disabled

       Axis Line: disabled

       Tick value: disabled

    Y Axis options:

       Fixed minimum: 0

       Fixed maximum: 0.5

       Axis label: enabled

       Axis Line: enabled

       Tick value: enabled

    Legend:

       Visibility: Off

  5. Now we can overlay the two charts. As can be seen in the screenshot below, SAS Visual Analytics 8.3 provides a smart guide for the precision container, which displays grid lines to help you align the objects in it. If you hold the Ctrl key while dragging the numeric series plot over the histogram, the smart guide displays fine grid lines to help with basic alignment. It is still a little tricky: to make the overlay precise, you may need to fine-tune the Left/Top/Width/Height values in the Layout section of the VA Options panel. The goal is to make the intersections of the two sets of axes coincide.

After that, we can add a text object above the charts we just made, and we are done: the kernel density estimate is superimposed on the histogram shown in the screenshot below, similar to the plot we got from PROC UNIVARIATE. (If you'd like to use the PROC KDE UNIVAR statement to estimate the density instead, you can visualize its output in SAS Visual Analytics in a similar way; a minimal sketch follows.)
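
Here is a minimal sketch of that PROC KDE alternative (my addition, not the author's code). It writes the univariate density estimate of MSRP to a data set named KDE2, which you could load into SAS Visual Analytics in the same way as the OUTKERNEL= data set:

proc kde data=sashelp.cars;
   univar MSRP / method=SJPI out=KDE2;   /* Sheather-Jones plug-in bandwidth; density written to KDE2 */
run;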

To go further, I made a KDE overlay with a scatter plot, where the little circles also give an impression of the data density, and another KDE overlay with a needle plot, where the density is represented by barcode-like lines. Both are created in ways similar to the histogram example above.

So far, I’ve shown how I visualize KDE using SAS Visual Analytics. There are other approaches: for example, you can create a custom graph in Graph Builder and import it into SAS Visual Analytics for the visualization. In any case, a KDE plot is a good way to understand more about your data. Why not give it a try?

Visualizing kernel density estimates in SAS Visual Analytics was published on SAS Users.