8月 172017
 

Analytics-driven forecasting means more than measuring trend and seasonality. It includes all categories of methods (e.g. exponential smoothing, dynamic regression, ARIMA, ARIMA(X), unobserved component models, and more), including artificial intelligence, but not necessarily deep learning algorithms. That said, deep learning algorithms like neural networks can also be used for demand forecasting, [...]

At the end of the day, it’s all about analytics-driven forecasting was published on SAS Voices by Charlie Chase

8月 172017
 

This question has been asked repeatedly for decades by anyone facing a new system. That system might be a new product, a new piece of equipment, a new process, or anything really that is new and not well understood. Ultimately, you might need to change this system but first you [...]

The post Which of my factors are important? appeared first on SAS Learning Post.

8月 162017
 

I frequently get asked about my favorite book on a particular topic, how to find free SAS learning materials online, how to get help with SAS issues, etc. This lead me to compile a "Great Resources for SAS Programmers" document which I freely share with my students upon request. So [...]

The post Jedi SAS Tips - Favorite Resources for SAS Programmers appeared first on SAS Learning Post.

8月 162017
 

Visualizing the correlations between variables often provides insight into the relationships between variables. I've previously written about how to use a heat map to visualize a correlation matrix in SAS/IML, and Chris Hemedinger showed how to use Base SAS to visualize correlations between variables.

Recently a SAS programmer asked how to construct a bar chart that displays the pairwise correlations between variables. This visualization enables you to quickly identify pairs of variables that have large negative correlations, large positive correlations, and insignificant correlations.

In SAS, PROC CORR can computes the correlations between variables, which are stored in matrix form in the output data set. The following call to PROC CORR analyzes the correlations between all pairs of numeric variables in the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. Because of missing values, some pairwise correlations use more observations than others.

ods exclude all;
proc corr data=sashelp.Heart;      /* pairwise correlation */
   var _NUMERIC_;
   ods output PearsonCorr = Corr;  /* write correlations, p-values, and sample sizes to data set */
run;
ods exclude none;

The CORR data set contains the correlation matrix, p-values, and samples sizes. The statistics are stored in "wide form," with few rows and many columns. As I previously discussed, you can use the HEATMAPCONT subroutine in SAS/IML to quickly visualize the correlation matrix:

proc iml;
use Corr;
   read all var "Variable" into ColNames;  /* get names of variables */
   read all var (ColNames) into mCorr;     /* matrix of correlations */
   ProbNames = "P"+ColNames;               /* variables for p-values are named PX, PY, PZ, etc */
   read all var (ProbNames) into mProb;    /* matrix of p-values */
close Corr;
 
call HeatmapCont(mCorr) xvalues=ColNames yvalues=ColNames
     colorramp="ThreeColor" range={-1 1} title="Pairwise Correlation Matrix";
Heat map of correlations between variables

The heat map gives an overall impression of the correlations between variables, but it has some shortcomings. First, you can't determine the magnitudes of the correlations with much precision. Second, it is difficult to compare the relative sizes of correlations. For example, which is stronger: the correlation between systolic and diastolic blood pressure or the correlation between weight and MRW? (MRW is a body-weight index.)

These shortcomings are resolved if you present the pairwise correlations as a bar chart. To create a bar chart, it is necessary to convert the output into "long form." Each row in the new data set will represent a pairwise correlation. To identify the row, you should also create a new variable that identifies the two variables whose correlation is represented. Because the correlation matrix is symmetric and has 1 on the diagonal, the long-form data set only needs the statistics for the lower-triangular portion of the correlation matrix.

Let's extract the data in SAS/IML. The following statements construct a new ID variable that identifies each new row and extract the correlations and p-values for the lower-triangular elements. The statistics are written to a SAS data set called CorrPairs. (In Base SAS, you can transform the lower-triangular statistics by using the DATA step and arrays, similar to the approach in this SAS note; feel free to post your Base SAS code in the comments.)

numCols = ncol(mCorr);                /* number of variables */
numPairs = numCols*(numCols-1) / 2;
length = 2*nleng(ColNames) + 5;       /* max length of new ID variable */
PairNames = j(NumPairs, 1, BlankStr(length));
i = 1;
do row= 2 to numCols;                 /* construct the pairwise names */
   do col = 1 to row-1;
      PairNames[i] = strip(ColNames[col]) + " vs. " + strip(ColNames[row]);
      i = i + 1;
   end;
end;
 
lowerIdx = loc(row(mCorr) > col(mCorr));  /* indices of lower-triangular elements */
Corr = mCorr[ lowerIdx ];
Prob = mProb[ lowerIdx ];
Significant = choose(Prob > 0.05, "No ", "Yes");  /* use alpha=0.05 signif level */
 
create CorrPairs var {"PairNames" "Corr" "Prob" "Significant"};
append;
close;
QUIT;

You can use the HBAR statement in PROC SGPLOT to construct the bar chart. This bar chart contains 45 rows, so you need to make the graph tall and use a small font to fit all the labels without overlapping. The call to PROC SORT and the DISCRETEORDER=DATA option on the YAXIS statement ensure that the categories are displayed in order of increasing correlation.

proc sort data=CorrPairs;  by Corr;  run;
 
ods graphics / width=600px height=800px;
title "Pairwise Correlations";
proc sgplot data=CorrPairs;
hbar PairNames / response=Corr group=Significant;
refline 0 / axis=x;
yaxis discreteorder=data display=(nolabel) 
      labelattrs=(size=6pt) fitpolicy=none 
      offsetmin=0.012 offsetmax=0.012  /* half of 1/k, where k=number of catgories */
      colorbands=even colorbandsattrs=(color=gray transparency=0.9);
xaxis grid display=(nolabel);
keylegend / position=topright location=inside across=1;
run;
Bar chart of pairwise correlations between variables

The bar chart (click to enlarge) enables you to see which pairs of variables are highly correlated (positively and negatively) and which have correlations that are not significantly different from 0. You can use additional colors or reference lines if you want to visually emphasize other features, such as the correlations that are larger than 0.25 in absolute value.

The bar chart is not perfect. This example, which analyzes 10 variables, is very tall with 45 rows. Among k variables there are k(k-1)/2 correlations, so the number of pairwise correlations (rows) increases quadratically with the number of variables. In practice, this chart would be unreasonably tall when there are 14 or 15 variables (about 100 rows).

Nevertheless, for 10 or fewer variables, a bar chart of the pairwise correlations provides an alternative visualization that has some advantages over a heat map of the correlation matrix. What do you think? Would this graph be useful in your work? Leave a comment.

The post Use a bar chart to visualize pairwise correlations appeared first on The DO Loop.

8月 152017
 

If you know me, you probably know that I spend a lot of time on the water. I like speed paddling (dragon boat, outrigger canoe, surfski, and racing SUP), and I also have a big pontoon boat at Jordan Lake where I try to go fishing at least once a [...]

The post Tracking floods and droughts in Jordan Lake, NC appeared first on SAS Learning Post.

8月 152017
 

Veterans transitioning back to civilian life face a number of challenges, many of which involve working in a non-military environment. The Honor Foundation helps Navy SEALs and other current and transitioning members of the U.S. Special Operations community enter civilian life. Last week, a group of former special operators visited several [...]

Helping military veterans transition to civilian life, pursue analytics careers was published on SAS Voices by Trent Smith

8月 152017
 

At the SAS UK Office in Marlow we are very fortunate to have a number of interns working here. I’m always amazed at how professional, prepared and just downright smart these students are. I recently asked one of our interns: Josie Comber, who is a student at the University of [...]

The post Learning experience from a real internship appeared first on SAS Analytics U Blog.

8月 152017
 

CAS data modelingThe CAS physical data model, i.e.what features CAS offers for data storage, and how to use them to maximize performance in CAS (and consequently SAS Visual Analytics 8.1 too).

So, specifically let’s answer the question:

What CAS physical table storage features can we use to get better performance in CAS and SAS Visual Analytics/CAS?

CAS Physical Table Storage Features

The following data storage features affect how CAS tables are physically structured:

  • Compression
  • Partitioning
  • Sorting
  • Repeated Tables
  • Extended Data Types (Varchar)
  • User Defined Formats

Compression — the Storage Option that Degrades Performance

data public.MegaCorp (compress=yes);
   set baselib.MegaCorp;
run;

Partitioning and Sorting

Partitioning is a powerful tool for improving Bar Charts, Decision Tree, Linear Regression) provide grouping as well as classification functionality.

When performing analyses/processing, CAS first groups the data into the required BY-groups. Pre-partitioning on commonly-used BY-groups means CAS can skip this step, vastly improving performance.

Within partitions, tables can be sorted by non-partition-key variables. Pre-sorting by natural ordering variables (e.g. time) allows CAS to skip the ordering step in many cases just like partitioning allows CAS to skip the grouping step.

For a full use-case, consider a line graph that groups sales by region and plots by date. This graph object would benefit greatly from a CAS table that is pre-partitioned by region and pre-sorted by date.

Join Optimization

Partitioning can also support join operations since both the CAS FedSQL Merge Join algorithm utilize BY-GROUP operations to support their processing.

Pre-partitioning tables in anticipation of joins will greatly improve join performance. A good use case is partitioning both a large transaction table and an equally large reference table (e.g. an enormous Customer table) by the common field, customerID. When a DATA Step MERGE or a FedSQL join is performed between the two tables on that field, the join/merge will take advantage of partitioning for the BY-GROUP operation resulting in something similar to a partition-wise join.

Like Compression, partitioning and sorting can be implemented via CAS actions as well as data set options. Using the data set options is demonstrated below:

data mycas.bigOrderTable (partition=(region division) orderby=(year quarter month));
   set CASorBase.bigOrderTable;
run;

Repeated Tables

By default, in distributed CAS Server deployments, CAS divides incoming tables into blocks and distributes those blocks among its DUPLICATE data set option or the Repeated Tables have two main use-cases in CAS:

1.     Join Optimization
2.     Small Table Operation Optimization

Join Optimization

For join operations, the default data distribution scheme can result in significant network traffic as matching records from the two tables travel between worker nodes to meet. If one of the two tables was created with the DUPLICATE/REPEAT option, then every possible record from that table is available on every node to the other table. There is no need for any network traffic.

Small Table Operation Optimization

For small tables, even single table operations can perform better with repeated instead of divided distribution. LASR actually implemented the “High Volume Access to Smaller Tables” feature for the same reason. When a table is repeated, CAS runs any required operation on a single worker node against the full copy of the table that resides there, instead of distributing the work.

As stated, repeated tables can be implemented with the DUPLICATE data set option, it can also be implemented with the REPEAT option on the PROC CASUTIL LOAD statement. The CASUTIL method is shown below:

proc casutil ;
   load data=sashelp.prdsale outcaslib=”caspath”
           casout=”prdsale” replace REPEAT ;
quit ;

Extended Data Types (VARCHAR)

With Viya 3.2 comes SAS’ first widespread implementation of variable length character fields. While Base SAS offers variable length character fields through compression, Viya 3.2 is the first major SAS release to include a save storage space, it also improves performance by reducing the size of the record being processed. CAS, like any other processing engine, will process narrower records more quickly than wide records.

User Defined Formats

User defined formats (UDFs) exist in CAS in much the same way they do in Base SAS. Their primary function, of course, is to provide display formatting for raw data values. Think about a format for direction. The raw data might be: “E”, “W”, “N”, “S” while the corresponding format values might be “East”, “West”, “North”, “South.”

So how might user defined formats improve performance in CAS? The same way they do in Base SAS, and the same way that VARCHAR does, by reducing the size of the record that CAS has to process. Imagine replacing multiple 200 byte description fields with 1 byte codes. If you had 10 such fields, the record length would decrease 1990 bytes ((10 X 200) – 10). This is an extreme example but it illustrates the point: User defined formats can reduce the amount of data that CAS has to process and, consequently, will lead to performance gains.

CAS data modeling for performance was published on SAS Users.

8月 142017
 

In an increasingly complicated operating environment, the global oil and gas industry struggles to optimize their asset portfolio. Producers combat severe challenges inherent to aging facilities in the context of pricing volatility. Change is the new constant, as the cyclical nature of the oil and gas industry means that producers [...]

Analytics holds the key to optimizing oilfield assets was published on SAS Voices by Keith Holdaway

8月 142017
 

In an increasingly complicated operating environment, the global oil and gas industry struggles to optimize their asset portfolio. Producers combat severe challenges inherent to aging facilities in the context of pricing volatility. Change is the new constant, as the cyclical nature of the oil and gas industry means that producers [...]

Analytics holds the key to optimizing oilfield assets was published on SAS Voices by Keith Holdaway