October 10, 2017
 

Industry-University relationships can be very complex, yet at Regent’s University London (RUL), we are very proud to report on the success of our first joint SAS-RUL project. In June 2016, as part of SAS Academy activities at Regent’s, I started a research collaboration with Mike Turner and Neil Griffin from [...]

The post Inspiring students to re-think marketing for the digital economy appeared first on SAS Analytics U Blog.

October 9, 2017
 

My new SAS Press book “An Introduction to SAS Visual Analytics” (written in collaboration with Tricia Aanderud and Rob Collum) covers all of the different aspects of SAS® Visual Analytics, including how to develop reports, load data, and handle administration. Below is an example of the types of tips that you can find [...]

The post An introduction to SAS Visual Analytics: the Parallel Period function of the Derived Item calculations appeared first on SAS Learning Post.

October 9, 2017
 

Every day 91 Americans die from opioid abuse, every nine seconds a student drops out of high school and every 47 seconds a child is confirmed to be abused or neglected. These are sobering statistics that show the challenge our government leaders face to help those in need. While there [...]

Inspirational, emotional government leadership event focuses on challenges, solutions was published on SAS Voices by Paula Henderson

October 9, 2017
 

Correlations between variables are typically displayed in a matrix. Because the layout of the correlation matrix is determined by the order of the variables, it is difficult to find the largest and smallest correlations, which is why analysts sometimes use colors to visualize the correlation matrix. Another visualization option is the pairwise correlation plot, which orders pairs of variables by their correlations.

Neither graph addresses a related problem: for each variable, which other variables are strongly correlated with it? Which are weakly correlated? In SAS, the CORR procedure supports a little-known option that answers that question. You can use the RANK option in the PROC CORR statement to order the correlations for each variable (independently) according to the magnitude of the correlations.

Notice that the RANK option does not compute "rank correlation." You can compute rank correlation by using the SPEARMAN option.
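For contrast, here is a minimal sketch of a call that requests Spearman rank correlations. It reuses the variable list from the example in the next section, which is a choice made only for illustration.

```sas
/* Sketch: SPEARMAN requests rank correlations (a different statistic).
   It does not reorder the output the way the RANK option does. */
proc corr data=sashelp.Heart SPEARMAN noprob;
   var Height Weight MRW Smoking Diastolic Systolic Cholesterol;
run;
```

The output of this call is a Spearman correlation matrix in the original variable order.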

Ordering correlations by size

Consider the Sashelp.Heart data set, which contains data for 5209 patients who enrolled in the Framingham Heart Study. You might want to know which variables are highly correlated with weight, or smoking, or blood pressure, to name a few examples. The following call to PROC CORR uses the RANK option to order each row of correlations in the output:

/* Note: The MRW variable is similar to the body-mass index (BMI). */
proc corr data=sashelp.Heart RANK noprob;
   var Height Weight MRW Smoking Diastolic Systolic Cholesterol;
run;

Figure: Correlations ordered by magnitude

The output orders each row according to the magnitude of the correlations. For example, look at the row for the Weight variable, which is highlighted by a red rectangle. Scanning across the row, you can see that the variables that are the most strongly correlated with Weight are MRW (which measures whether a patient is overweight) and Height. At the end of the row are the variables that are essentially uncorrelated with Weight, namely Smoking and Cholesterol. The numbers at the bottom of each cell indicate the number of nonmissing pairwise observations.

In a similar way, look at the row for the Smoking variable. That variable is most strongly correlated with Height and MRW. Notice that the correlation with MRW is negative, which shows that the correlations are ordered by absolute values (magnitude). The highly correlated variables (whether positively or negatively correlated) appear first and the uncorrelated variables (correlations near zero) appear last in each row.

Ordering correlations for groups of variables

It is common to want to examine the correlations between groups of variables. For example, a clinician might want to look at the correlations between clinical measurements (blood pressure, cholesterol,...) and genetic or lifestyle choices (weight, smoking habits,...). The following call to PROC CORR uses the VAR and WITH statements to compare groups of variables, and uses the RANK option to order the correlations along each row:

proc corr data=sashelp.Heart RANK noprob;
   var Height Weight MRW Smoking;       /* genetic and lifestyle factors */
   with Diastolic Systolic Cholesterol; /* clinical measurements */
run;

Notice that the variables in the VAR statement determine the columns, and the variables in the WITH statement determine the rows. Within each row, the variables in the VAR statement are ordered by the magnitude of the correlation. For these data, the Diastolic and Systolic variables are similar with respect to how they correlate with the column variables: the order of the column variables is the same for the first two rows. In contrast, the correlations between the Cholesterol variable and the column variables appear in a different order.

In summary, you can use the RANK option in the PROC CORR statement to order the rows of a correlation matrix according to the magnitude (absolute value) of the correlations between each variable and the others. This makes it easy to find pairs of variables that are strongly correlated and pairs that are weakly correlated.

The post Order correlations by magnitude appeared first on The DO Loop.

October 6, 2017
 

After a few tough years in the trenches, analytics leaders in utilities are emerging and making a difference as their utilities vie to stay relevant in the ever-changing energy landscape. At the core of this emergence are leaders that are embracing open analytics platforms and pushing analytics to the edge. [...]

For utilities, the horizon is wide open was published on SAS Voices by Mike F. Smith

October 6, 2017
 

Sometimes life is just too busy, and I bet many of you feel the same way. If you’re like me, you’re playing a number of roles: employee, spouse, parent, community leader, coach and so on. With all the craziness, it’s likely we’re shortchanging another role that’s critically important to our [...]

The post Learning in the middle of the night - or whenever appeared first on SAS Learning Post.

October 6, 2017
 

The movie Blade Runner came out about 30 years ago, portraying a dystopian future set in 2019 Los Angeles. Many of the technology predictions in the movie have actually become reality - but how about the weather and pollution? The movie was all about smog and a lack of sunshine, and I wondered [...]

The post Los Angeles ozone levels - 30th anniversary revisit appeared first on SAS Learning Post.

October 5, 2017
 

As traditional reserves deplete and oil prices rise, market analysts predict that the global demand for petroleum products will increasingly be met with oil extracted from non-traditional resources in more challenging and harsher environments. Therefore, companies across the oil & gas industry are evaluating technologies and processes that can deliver [...]

Why analytics and IIoT are critical for the oilfield of the future was published on SAS Voices by Keith Holdaway

October 4, 2017
 

Stories have been flooding the market lately about artificial intelligence (AI) causing the next world war. Based originally on comments made by Elon Musk (see below), many others are jumping in to share similar fears. China, Russia, soon all countries w strong computer science. Competition for AI superiority at national level [...]

Will AI cause the next major war? was published on SAS Voices by Mary Beth Ainsworth

October 4, 2017
 

If you perform a weighted statistical analysis, it can be useful to produce a statistical graph that also incorporates the weights. This article shows how to construct and interpret a weighted histogram in SAS.

How to construct a weighted histogram

Before constructing a weighted histogram, let's review the construction of an unweighted histogram. A histogram requires that you specify a set of evenly spaced bins that cover the range of the data. An unweighted histogram of frequencies is constructed by counting the number of observations that are in each bin. Because counts are dependent on the sample size, n, histograms often display the proportion (or percentage) of values in each bin. The proportions are the counts divided by n. On the proportion scale, the height of each bin is the sum of the quantity 1/n, where the sum is taken over all observations in the bin.

That fact is important because it reveals that the unweighted histogram is a special case of the weighted histogram. An unweighted histogram is equivalent to a weighted histogram in which each observation receives a unit weight. Therefore the quantity 1/n is the standardized weight of each observation: the weight divided by the sum of the weights. The formula is the same for non-unit weights: the height of each bin is the sum of the quantity w_i / Σ_j w_j, where the sum is taken over all observations i in the bin. That is, you add up all the standardized weights in each bin to produce the bin height.
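Written as a formula, the bin height described above can be expressed as follows. The symbols B_k (the k-th bin), h_k (its height), and n_k (the number of observations in it) are notation introduced here for illustration only.

```latex
h_k \;=\; \sum_{i \,:\, x_i \in B_k} \frac{w_i}{\sum_{j=1}^{n} w_j},
\qquad\text{and for unit weights } (w_i = 1):\quad
h_k \;=\; \sum_{i \,:\, x_i \in B_k} \frac{1}{n} \;=\; \frac{n_k}{n}.
```

Because every observation falls in exactly one bin, the heights h_k sum to 1 on the proportion scale.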

An example of a weighted histogram

The SAS documentation for the WEIGHT statement includes the following example. Twenty subjects estimate the diameter of an object that is 30 cm across. Some people are placed closer to the object than others. The researcher believes that the precision of the estimate is inversely proportional to the distance from the object. Therefore the researcher weights each subject's estimate by using the inverse distance.

The following DATA step creates the data, and PROC SGPLOT creates a weighted histogram of the data by using the WEIGHT= option on the HISTOGRAM statement. (The WEIGHT= option was added in SAS 9.4M1.)

data Size;
input Distance ObjectSize @@;
Wt = 1 / distance;        /* precision */
x = ObjectSize; 
label x = "Estimate of Size";
datalines;
1.5 30   1.5 20   1.5 30   1.5 25
3   43   3   33   3   25   3   30
4.5 25   4.5 36   4.5 48   4.5 33
6   43   6   36   6   23   6   48
7.5 30   7.5 25   7.5 50   7.5 38
;
 
title "Weighted Histogram of Size Estimate";
proc sgplot data=size noautolegend;
   histogram x / WEIGHT=Wt scale=proportion datalabel binwidth=5;
   fringe x / lineattrs=(thickness=2 color=black) transparency=0.6;
   yaxis grid offsetmin=0.05 label="Weighted Proportion";
   refline 30 / axis=x lineattrs=(pattern=dash);
run;

Figure: Weighted histogram in SAS; weights proportional to inverse variance

The weighted histogram is shown to the right. The data values are shown in the fringe plot beneath the histogram. The height of each bin is the sum of the standardized weights of the observations in that bin. The dashed line represents the true diameter of the object. Most estimates are clustered around the true value, except for a small cluster of larger estimates. Notice that I use the SCALE=PROPORTION option to plot the weighted proportion of observations in each bin, although the default behavior (SCALE=PERCENT) would also be acceptable.

If you remove the WEIGHT= option and study the unweighted graph, you will see that the average estimate for the unweighted distribution (33.6) is not as close to the true diameter as the weighted estimate (31.1). Furthermore, the weighted standard deviation is about half the unweighted standard deviation, which shows that the weighted distribution of these data has less variance than the unweighted distribution.
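One way to check the summary statistics yourself is to run PROC MEANS twice, once without and once with a WEIGHT statement. This is a minimal sketch that assumes the Size data set from the DATA step above and the default divisor for the variance (VARDEF=DF). The statistic keywords and the MAXDEC= value are choices made only for illustration.

```sas
/* Unweighted mean and standard deviation of the size estimates */
proc means data=Size n mean std maxdec=2;
   var x;
run;

/* Weighted statistics: each estimate is weighted by Wt = 1/Distance */
proc means data=Size n mean std maxdec=2;
   weight Wt;
   var x;
run;
```

Comparing the two tables shows how the weights shift the mean and shrink the standard deviation.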

By the way, although PROC UNIVARIATE can produce weighted statistics, it does not create weighted graphics as of SAS 9.4M5. One reason is that the graphics statements (CDFPLOT, HISTOGRAM, QQPLOT, etc.) not only create graphs but also fit distributions and produce goodness-of-fit statistics, and those analyses do not support weight variables.

Checking the computation

Although a weighted histogram is not conceptually complex, I understand a computation better when I program it myself. You can write a SAS program that computes a weighted histogram by using the following algorithm:

  1. Construct the bins. For this example, there are seven bins of width 5 (defined by eight endpoints), and the first bin starts at x=17.5. (It is centered at x=20.) Initialize all bin heights to zero.
  2. For each observation, find the bin that contains it. Increment the bin height by the weight of that observation.
  3. Standardize the heights by dividing by the sum of weights. You can skip this step if the weights sum to unity.

A SAS/IML implementation of this algorithm requires only a few lines of code. A DATA step implementation that uses arrays is longer, but probably looks more familiar to many SAS programmers:

data BinHeights(keep=height:);
array EndPt[8] _temporary_;
binStart = 17.5;  binWidth = 5;  /* anchor and width for bins */
do i = 1 to dim(EndPt);          /* define endpoints of bins */
   EndPt[i] = binStart + (i-1)*binWidth;
end;
 
array height[7];                 /* height of each bin */
set Size end=eof;                /* for each observation ... */
sumWt + Wt;                      /* compute sum of weights */
Found=0;
do i = 1 to dim(EndPt)-1 while (^Found); /* find bin for each obs */
   Found = (EndPt[i] <= x < EndPt[i+1]);
   if Found then height[i] + Wt; /* increment bin height by weight */
end;
if eof then do;            
   do i = 1 to dim(height);      /* scale heights by sum of weights */
      height[i] = height[i] / sumWt;
   end;
   output;
end;
run;
 
proc print noobs data=BinHeights; run;

Figure: Heights of bars in weighted histogram

The computations from the DATA step match the data labels that appear on the weighted histogram in PROC SGPLOT.
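For comparison, the SAS/IML version mentioned earlier might look like the following sketch. It is not the author's program. It assumes the Size data set from above and the same seven bins of width 5 anchored at 17.5, and the matrix names (binIdx, height, idx) are hypothetical.

```sas
proc iml;
use Size;  read all var {x Wt};  close Size;

binStart = 17.5;  binWidth = 5;  nBins = 7;    /* same bins as the DATA step */
binIdx = floor((x - binStart) / binWidth) + 1; /* bin number for each obs */

height = j(1, nBins, 0);                       /* initialize heights to zero */
do i = 1 to nBins;
   idx = loc(binIdx = i);                      /* observations in the i-th bin */
   if ncol(idx) > 0 then height[i] = sum(Wt[idx]);
end;
height = height / sum(Wt);                     /* standardize by sum of weights */
print height[format=6.4];
quit;
```

The FLOOR computation assigns each value to a half-open interval [left, right), which is the same rule that the DATA step uses when it tests EndPt[i] <= x < EndPt[i+1].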

Summary

In SAS, the HISTOGRAM statement in PROC SGPLOT supports the WEIGHT= option, which enables you to create a weighted histogram. A weighted histogram shows the weighted distribution of the data. If the histogram displays proportions (rather than raw counts), then the heights of the bars are the sum of the standardized weights of the observations within each bin. You can download the SAS program that computes the quantities in this article.

How can you interpret a weighted histogram? That depends on the meaning of the weight variable. For survey data and sampling weights, the weighted histogram estimates the distribution of a quantity in the population. For inverse variance weights (such as were used in this article), the weighted histogram overweights precise measurements and underweights imprecise measurements. When the weights are correct, the weighted histogram is a better estimate of the density of the underlying population and the weighted statistics (mean, variance, quantiles,...) are better estimates of the corresponding population quantities.

Have you ever plotted a weighted histogram? What was the context? Leave a comment.

The post Create and interpret a weighted histogram appeared first on The DO Loop.