February 6, 2019
 

Splitting external text data files into multiple files

Recently, I worked on a cybersecurity project that entailed processing a staggering number of raw text files about web traffic. Millions of rows had to be read and parsed to extract variable values.

The problem was complicated by the varying composition of the records. Each external raw file was a collection of records with different structures, and each structure required its own parsing logic. Besides, those heterogeneous records could not possibly fit into a single rectangular data table with a fixed set of columns.

Solving the problem

To solve the problem, I decided to employ a "divide and conquer" strategy: split the external file into many files, each with a homogeneous structure, then parse each of them separately into its own SAS data set.

My plan was to use a SAS DATA step to loop through the rows (records) of the external file: read each row, identify its type, and, based on that type, write the row to the corresponding output file.

This is similar to how you would split one data set into several:

 
data CARS_ASIA CARS_EUROPE CARS_USA;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   output CARS_ASIA;
      when('Europe') output CARS_EUROPE;
      when('USA')    output CARS_USA;
   end;   
run;

But how do you switch between the output files? The idea came from Chris Hemedinger of SAS, who suggested using multiple FILE statements to redirect output to different external files.

Splitting an external raw file into many

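As you know, you can use the PUT statement in a SAS DATA step to write a character string, or a combination of character strings and variable values, to an external file. That external file (the destination) is defined by a FILE statement. Only one destination is current at any time, but a DATA step can contain several FILE statements, and each one redirects the output of subsequent PUT statements: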

 
filename inf  'c:\temp\input_file.txt';
filename out1 'c:\temp\traffic.txt';
filename out2 'c:\temp\system.txt';
filename out3 'c:\temp\threat.txt';
filename out4 'c:\temp\other.txt';
 
data _null_;
   infile inf;
   input REC_TYPE $10. @;
   input;
   select(REC_TYPE);
      when('TRAFFIC') file out1;
      when('SYSTEM')  file out2;
      when('THREAT')  file out3;
      otherwise       file out4;
   end;
   put _infile_;
run;

In this code, the first INPUT statement retrieves the value of REC_TYPE. The trailing @ line-hold specifier ensures that the input record is held for the next INPUT statement within the same iteration of the DATA step. You may not be able to use this statement exactly as written, but the point is that you need to capture the field(s) of interest and stay on the same row.

The second INPUT statement names no variables. It releases the held record, whose full text remains available in the _infile_ automatic variable.

Depending on the value of the REC_TYPE variable assigned by the first INPUT statement, the SELECT block switches the FILE destination to one of the four filerefs: out1, out2, out3, or out4.

Then the PUT statement outputs the _infile_ automatic variable value to the output file defined in the SELECT block.
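As noted above, the first INPUT statement will rarely fit your data exactly as written. The following variation is a minimal sketch for the case where the record type is the first comma-separated field of each record. That layout is an assumption for illustration, not the layout of the project files. It reads the whole record first and then extracts the type with the SCAN function:

```sas
filename inf  'c:\temp\input_file.txt';
filename out1 'c:\temp\traffic.txt';
filename out2 'c:\temp\system.txt';
filename out3 'c:\temp\threat.txt';
filename out4 'c:\temp\other.txt';

data _null_;
   infile inf;
   input;                          /* load the whole record into _infile_ */
   length REC_TYPE $10;
   /* Assumption: fields are comma-separated and field 1 names the record type */
   REC_TYPE = upcase(scan(_infile_, 1, ','));
   select(REC_TYPE);
      when('TRAFFIC') file out1;
      when('SYSTEM')  file out2;
      when('THREAT')  file out3;
      otherwise       file out4;
   end;
   put _infile_;                   /* write the record to the selected file */
run;
```

Because the record is read only once, no line-hold specifier is needed in this version.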

Splitting a data set into several external files

A similar technique can be used to split a data table into several external raw files. Let’s combine the two code samples above to demonstrate how you can split a data set into several external raw files:

 
filename outasi 'c:\temp\cars_asia.txt';
filename outeur 'c:\temp\cars_europe.txt';
filename outusa 'c:\temp\cars_usa.txt';
 
data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi;
      when('Europe') file outeur;
      when('USA')    file outusa;
   end;
   put _all_; 
run;

This code reads the observations of the SASHELP.CARS data table and, depending on the value of the ORIGIN variable, PUT _ALL_ writes all the variables (including the automatic variables _ERROR_ and _N_) as named values (VARIABLE_NAME=VARIABLE_VALUE pairs) to one of the three external raw files specified by their respective file references (outasi, outeur, or outusa).

You can modify this code to produce delimited files with full control over which variables and in what order to output. For example, the following code sample produces 3 files with comma-separated values:

 
data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi dlm=',';
      when('Europe') file outeur dlm=',';
      when('USA')    file outusa dlm=',';
   end;
   put make model type origin msrp invoice; 
run;
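One caveat with a plain DLM=',' is that a value which itself contains a comma splits into two fields when the file is read back. As far as I know, list-style PUT writes a variable with its associated format, and MSRP and INVOICE carry a DOLLAR format in SASHELP.CARS, so values such as $36,945 are likely. A minimal sketch of one way around this, using the DSD option of the FILE statement (which defaults to a comma delimiter and quotes values that contain it):

```sas
filename outasi 'c:\temp\cars_asia.txt';
filename outeur 'c:\temp\cars_europe.txt';
filename outusa 'c:\temp\cars_usa.txt';

data _null_;
   set SASHELP.CARS;
   select(origin);
      when('Asia')   file outasi dsd;   /* DSD: comma-delimited, quotes as needed */
      when('Europe') file outeur dsd;
      when('USA')    file outusa dsd;
   end;
   put make model type origin msrp invoice;
run;
```

If you would rather write unformatted numbers, adding a statement such as format msrp invoice best12.; to the DATA step is another option.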

You may use different delimiters for your output files. In addition, rather than using mutually exclusive SELECT, you may use different logic for re-directing your output to different external files.
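As one example of different redirection logic, the FILEVAR= option of the FILE statement takes the output path from a variable, so you do not have to list every destination in advance. The sketch below is an illustration under stated assumptions: the c:\temp folder, the file-naming pattern, and the names CARS_SORTED and outpath are all hypothetical. The data are sorted first because, as far as I know, a file that is reopened later in the same step is overwritten unless you also specify the MOD option.

```sas
/* Sort so that each output file is opened only once */
proc sort data=SASHELP.CARS out=WORK.CARS_SORTED;
   by origin;
run;

data _null_;
   set WORK.CARS_SORTED;
   length outpath $200;
   /* Build the destination path from the data (hypothetical naming pattern) */
   outpath = cats('c:\temp\cars_', lowcase(origin), '.csv');
   /* 'dummy' is only a placeholder; FILEVAR= supplies the actual path */
   file dummy filevar=outpath dsd;
   put make model type origin msrp invoice;
run;
```

This produces one file per distinct value of ORIGIN, however many values there are.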

Bonus: How to zip your output files as you create them

For those readers who are patient enough to read to this point, here is another tip. As described in this series of blog posts by Chris Hemedinger, in SAS you can read your external raw files directly from zipped files without unzipping them first, as well as write your output raw files directly into zipped files. You just need to specify that in your filename statement. For example:

UNIX/Linux

 
filename outusa ZIP '/sas/data/temp/cars_usa.txt.gz' GZIP;

Windows

 
filename outusa ZIP 'c:\temp\cars.zip' member='cars_usa.txt';
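Reading works the same way. As a minimal sketch, assuming the zip file and member from the Windows example above already exist, the following step reads the member directly from the archive and echoes its first five records to the SAS log (the fileref name inzip is hypothetical):

```sas
filename inzip ZIP 'c:\temp\cars.zip' member='cars_usa.txt';

data _null_;
   infile inzip;                     /* read straight from the zipped member */
   input;
   if _n_ <= 5 then put _infile_;    /* with no FILE statement, PUT writes to the log */
run;
```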

Your turn

What is your experience with creating multiple external raw files? Could you please share with the rest of us?

How to split a raw file or a data set into many external raw files was published on SAS Users.

February 6, 2019
 

Feature generation (also known as feature creation) is the process of creating new features to use for training machine learning models. This article focuses on regression models. The new features (which statisticians call variables) are typically nonlinear transformations of existing variables or combinations of two or more existing variables. This article argues that a naive approach to feature generation can lead to many correlated features (variables) that increase the cost of fitting a model without adding much to the model's predictive power.

Feature generation in traditional statistics

Feature generation is not new. Classical regression often uses transformations of the original variables. In the past, I have written about applying a logarithmic transformation when a variable's values span several orders of magnitude. Statisticians generate spline effects from explanatory variables to handle general nonlinear relationships. Polynomial effects can model quadratic dependence and interactions between variables. Other classical transformations in statistics include the square-root and inverse transformations.

SAS/STAT procedures provide several ways to generate new features from existing variables, including the EFFECT statement and the "stars and bars" notation. However, an undisciplined approach to feature generation can quickly lead to an unmanageable number of features. For example, if you generate all pairwise quadratic interactions of N continuous variables, you obtain "N choose 2" or N*(N-1)/2 new features. For N=100 variables, this leads to 4950 pairwise quadratic effects!
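To get a feel for how fast this count grows, here is a small sketch that uses the COMB function to evaluate "N choose 2" for a few values of N. The particular values of N are arbitrary choices for illustration:

```sas
data _null_;
   do N = 10, 100, 1000;
      pairs = comb(N, 2);     /* number of pairwise interactions: N*(N-1)/2 */
      put N= pairs=;          /* results are written to the SAS log */
   end;
run;
```

The count grows with the square of N, so multiplying the number of variables by ten multiplies the number of pairwise effects by roughly one hundred.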

Generated features might be highly correlated

In addition to the sheer number of features that you can generate, another problem with generating features willy-nilly is that some of the generated effects might be highly correlated with each other. This can lead to difficulties if you use automated model-building methods to select the "important" features from among thousands of candidates.

I was reminded of this fact recently when I wrote an article about model building with PROC GLMSELECT in SAS. The data were simulated: X from a uniform distribution on [-3, 3] and Y from a cubic function of X (plus random noise). I generated the polynomial effects x, x^2, ..., x^7, and the procedure had to build a regression model from these candidates. The stepwise selection method added the x^7 effect first (after the intercept). Later it added the x^5 effect. Of course, the polynomials x^7 and x^5 have a similar shape to x^3, but I was surprised that those effects entered the model before x^3 because the data were simulated from a cubic formula.

After thinking about it, I realized that the odd-degree polynomial effects are highly correlated with each other and have high correlations with the target (response) variable. The same is true for the even-degree polynomial effects. Here is the DATA step that generates 1000 observations from a cubic regression model, along with the correlations between the effects (x1-x7) and the target variable (y):

%let d = 7;
%let xMean = 0;
data Poly;
   call streaminit(54321);
   array x[&d];
   do i = 1 to 1000;
      x[1] = rand("Normal", &xMean);  /* x1 ~ N(&xMean, 1) */
      do j = 2 to &d;
         x[j] = x[j-1] * x[1];        /* x[j] = x1**j, j = 2..&d */
      end;
      y = 2 - 1.105*x1 - 0.2*x2 + 0.5*x3 + rand("Normal");  /* response is cubic function of x1 */
      output;
   end;
   drop i j;
run;
 
proc corr data=Poly nosimple noprob;
   var y;
   with x:;
run;
Correlations between the target variable and polynomial effects

You can see from the output that the x^5 and x^7 effects have the highest correlations with the response variable. Because the squared correlations are the R-square values for the regression of Y onto each effect, it makes intuitive sense that these effects are added to the model early in the process. Towards the end of the model-building process, the x^3 effect enters the model and the x^5 and x^7 effects are removed. To the model-building algorithm, these effects have similar predictive power because they are highly correlated with each other, as shown in the following correlation matrix:

proc corr data=Poly nosimple noprob plots(MAXPOINTS=NONE)=matrix(NVAR=ALL);
   var x:;
run;
Correlations among polynomial effects

I've highlighted certain cells in the lower triangular correlation matrix to emphasize the large correlations. Notice that the correlations between the even- and odd-degree effects are close to zero and are not highlighted. The table is a little hard to read, but you can use PROC IML to generate a heat map for which cells are shaded according to the pairwise correlations:

Heat map of correlations among polynomial effects
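The program for the heat map is not listed in this post. One way to produce a similar graph is sketched below, assuming that SAS/IML is available and that the Poly data set from the DATA step above exists. It computes the correlation matrix of x1-x7 and passes it to the HEATMAPCONT subroutine. The title text and the use of the default color ramp are my choices, not necessarily those used for the image above.

```sas
proc iml;
   varNames = "x1":"x7";                /* names of the polynomial effects */
   use Poly;
   read all var varNames into X;        /* 1000 x 7 data matrix */
   close Poly;
   corr = corr(X);                      /* 7 x 7 Pearson correlation matrix */
   /* Shade each cell by its correlation; fix the scale to [-1, 1] */
   call HeatmapCont(corr) xvalues=varNames yvalues=varNames
        range={-1 1} title="Correlations among polynomial effects";
quit;
```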

This checkerboard pattern shows the large correlations that occur between the polynomial effects in this problem. This image shows that many of the generated features do not add much new information to the set of explanatory variables. This can also happen with other transformations, so think carefully before you generate thousands of new features. Are you producing new effects or redundant ones?

Generated features are not independent

Notice that the generated effects are not statistically independent. They are all generated from the same X variable, so they are functionally dependent. In fact, a scatter plot matrix of the data will show that the pairwise relationships are restricted to parametric curves in the plane. The graph of (x, x^2) is quadratic, the graph of (x, x^3) is cubic, the graph of (x^2, x^3) is an algebraic cusp, and so forth. The PROC CORR statement in the previous section created a scatter plot matrix, which is shown below. Again, I have highlighted cells that have highly correlated variables.

Scatter plot matrix that shows the statistical dependencies between polynomial effects

I like this scatter plot matrix for two reasons. First, it is soooo pretty! Second, it visually demonstrates that two variables that have low correlation are not necessarily independent.
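A short calculation shows why the even- and odd-degree effects are nearly uncorrelated even though each one is a deterministic function of x. This is a sketch under one assumption: that X has a distribution that is symmetric about zero (as in the simulation above) with finite moments, so that every odd moment vanishes.

```latex
\begin{align*}
\operatorname{Cov}(X^j, X^k) &= E[X^{j+k}] - E[X^j]\,E[X^k] \\
\intertext{If $j$ is odd and $k$ is even, then $j+k$ is odd, so $E[X^{j+k}] = 0$ and $E[X^j] = 0$:}
\operatorname{Cov}(X^j, X^k) &= 0 - 0 \cdot E[X^k] = 0 .
\end{align*}
```

The population correlation is therefore exactly zero, and the sample correlations in the table differ from zero only by sampling variation. When j and k have the same parity, j+k is even and the covariance is generally not zero, which gives the checkerboard pattern.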

Summary

Feature generation is important in many areas of machine learning. It is easy to create thousands of effects by applying transformations and generating interactions between all variables. However, the example in this article demonstrates that you should be a little cautious of using a brute-force approach. When possible, you should use domain knowledge to guide your feature generation. Every feature you generate adds complexity during model fitting, so try to avoid adding a bunch of redundant highly-correlated features.

For more on this topic, see the article "Automate your feature engineering" by my colleague, Funda Gunes. She writes: "The process [of generating new features] involves thinking about structures in the data, the underlying form of the problem, and how best to expose these features to predictive modeling algorithms. The success of this tedious human-driven process depends heavily on the domain and statistical expertise of the data scientist." I agree completely. Funda goes on to discuss SAS tools that can help data scientists with this process, such as Model Studio in SAS Visual Data Mining and Machine Learning. She shows an example of feature generation in her article and in a companion article about feature generation. These tools can help you generate features in a thoughtful, principled, and problem-driven manner, rather than relying on a brute-force approach.

The post Feature generation and correlations among features in machine learning appeared first on The DO Loop.

February 4, 2019
 

One of the key health trends we’ll continue to follow in 2019 is the flood of medical and personal data that, if managed and analyzed properly, could help health care organizations provide better care, life sciences companies deliver better therapies and individuals make smarter lifestyle choices. Sounds great, but there [...]

The case for personal change in health care was published on SAS Voices by Cameron McLauchlin

February 4, 2019
 
Animation of models built by using PROC GLMSELECT in SAS

I previously discussed how you can use validation data to choose between a set of competing regression models. In that article, I manually evaluated seven models on the training data and manually chose the model that gave the best predictions for the validation data. Fortunately, SAS software provides ways to automate this process! This article describes how PROC GLMSELECT builds models on training data and uses validation data to choose a final model. The animated GIF to the right visualizes the sequence of models that are built.

You can download the complete SAS program that creates the results in this blog post.

How to run PROC GLMSELECT

The GLMSELECT procedure in SAS/STAT is a workhorse procedure that implements many variable-selection methods, including least angle regression (LAR), LASSO, and elastic nets. Even though PROC GLMSELECT was introduced in SAS 9.1 (Cohen, 2006), many of its options remain relatively unknown to many SAS data analysts.

Some statisticians argue against automated model building and selection. Both Cohen (2006) and the PROC GLMSELECT documentation contain a discussion of the dangers and controversies regarding automated model building. Nevertheless, machine learning techniques routinely use this sort of process to build accurate predictive models from among hundreds of variables (or features, as they are known in the ML community).

The following four statements create the analyses and graphs in this article. The (x, y) data are simulated from the cubic polynomial y = 2 - 1.105*x - 0.2*x^2 + 0.5*x^3 + N(0,1).

title "Select Model from 8 Effects";
proc glmselect data=Have seed=1 plots(StartStep=1)=(ASEPlot Coefficients);
   effect poly = polynomial(x / degree=7);    /* generate monomial effects: x, x^2, ..., x^7 */
   partition fraction(validate=0.4);          /* use 40% of data for validation */
   model y = poly / selection=stepwise(select=SBC choose=validate)
                details=steps(Candidates ParameterEstimates); /* OPTIONAL */
run;
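The PROC GLMSELECT call reads a data set named Have, which is not listed here (it is part of the downloadable program). If you want to experiment without the download, the following is a minimal sketch of how such data might be simulated from the cubic formula above. The sample size and the seed are my assumptions, and the uniform distribution on [-3, 3] follows the description in the companion article on feature generation, so your results will not match the output in this post exactly.

```sas
data Have;
   call streaminit(1);                 /* arbitrary seed (assumption) */
   do i = 1 to 250;                    /* sample size is an assumption */
      x = -3 + 6*rand("Uniform");      /* x ~ U(-3, 3) */
      y = 2 - 1.105*x - 0.2*x**2 + 0.5*x**3 + rand("Normal");  /* cubic signal + N(0,1) noise */
      output;
   end;
   drop i;
run;
```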

The analysis uses the following options:

  • The PLOTS= option requests two plots. The ASEPlot is a plot of the average square error (ASE) of each model on the validation data. The STARTSTEP= option displays the ASE beginning with the first step. (The zeroth step is usually the intercept-only model, which typically has a very large ASE.) The Coefficients plot shows how various effects enter or leave the model at each step of the model-building process. By default, a panel is created, but the lower part of the panel duplicates part of the ASE plot so I've unpacked the panel into separate plots.
  • The EFFECT statement generates the monomial effects {x, x^2, ..., x^7} from the variable x.
  • The PARTITION statement randomly divides the input data into two subsets. The validation set contains 40% of the data and the training set contains the other 60%. The SEED= option on the PROC GLMSELECT statement specifies the seed value for the random split.
  • The SELECTION= option specifies the algorithm that builds a model from the effects. This example uses the stepwise selection algorithm because it is easy to understand. At each step in the model-building process, the stepwise algorithm builds a new model by modifying the model from the previous step. The new model will either contain a new effect that was not in the previous model or will remove an effect from the previous model.
  • The SELECT=SBC option specifies that the procedure will use the SBC criterion to assess the candidate effects and determine which effect should be added (or removed) from the previous model.
  • The CHOOSE=VALIDATE option specifies that the models are scored on the validation data. The ASE values for the models are used to choose the most predictive model from among the models that were built.
  • The DETAILS= option displays information about the models that are built at each step. If you only care about the final model, you do not need this option.

The chosen model is a cubic polynomial. The parameter estimates (below) are very close to the parameter values that are used to simulate the cubic data, so the procedure "discovered" the correct model.

Parameter estimates for a model chosen among models built by PROC GLMSELECT in SAS

How PROC GLMSELECT constructs the models

Average square error (ASE) plot for models built by PROC GLMSELECT in SAS

How did PROC GLMSELECT obtain that model? The output of the DETAILS=STEPS option shows that the GLMSELECT procedure built eight models. It then used the validation data to decide that the four-parameter cubic model was the model that best balances parsimony (simple versus complex models) and prediction accuracy (underfitting versus overfitting). I won't display all the output, but you can summarize the details by using the ASE plot and the Coefficients plot.

The ASE plot (shown to the right) visualizes the prediction accuracy of the models. The initial model (zeroth step) is the intercept-only model. The horizontal axis of the ASE plot shows how the models are formed from the previous model. The label for the first tick mark is "1+x^5", which means "the model at Step=1 adds the x^5 term to the previous model." The label for the second tick mark is "2+x^2", which means "the model at Step=2 adds the x^2 term to the previous model." A minus sign means that an effect is removed. For example, the label "6-x^7" means that "the model at Step=6 removes the x^7 effect from the previous model." The vertical axis tracks the change of the ASE for each successive model. The model-building process stops when it can no longer decrease the ASE on the validation data. For this example, that happens at Step=7.

If you prefer a table, the SelectionSummary table summarizes the models that are built. The columns labeled ASE and Validation ASE contain the precise values in the ASE plot.

SBC and validation ASE for models built by PROC GLMSELECT in SAS
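If you want the values in the SelectionSummary table as data rather than as a displayed table, you can capture the table with the ODS OUTPUT statement. The following is a minimal sketch that assumes the Have data set exists. The output data set name SelSummary is a hypothetical name chosen for illustration.

```sas
ods output SelectionSummary=SelSummary;   /* save the ODS table as a data set */
proc glmselect data=Have seed=1;
   effect poly = polynomial(x / degree=7);
   partition fraction(validate=0.4);
   model y = poly / selection=stepwise(select=SBC choose=validate);
run;

proc print data=SelSummary noobs;         /* one row per step of the selection */
run;
```

You can then sort, plot, or compare the per-step statistics with other procedures.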

How PROC GLMSELECT defines the models

Coefficient progression plot for models built by PROC GLMSELECT in SAS

The models are least squares estimates for the included effects on the training data. The DETAILS=STEPS option displays the parameter estimates for each model. For these models, which are all polynomial effects for a single continuous variable, you can graph the eight models and overlay the fitted curves on the data. You can do this in eight separate plots or you can use PROC SGPLOT in SAS to create an animated gif that animates the sequence of models. The animation is shown at the top of this article.

In general, you can't visualize models with many effects, but the Coefficients plot displays the values of the estimates for each model. The plot (shown to the right; click to enlarge) labels only the effects in the final model, but I have manually added labels for the x^5 and x^7 effects to help explain the plot.

Because the magnitudes of the parameter estimates can vary greatly, the vertical axis of the plot shows standardized estimates. As before, the horizontal axis represents the various models. Each line visualizes the evolution of values for a particular effect. For example, the brown line dominates the upper left portion of the graph. This line shows the progression of the standardized coefficient of the x^5 term. In models 1–4, the coefficient of the x^5 effect is large and positive. In models 5 and 6, the standardized coefficient of x^5 is small and negative. In the last model, the coefficient of x^5 is set to zero because the effect is removed. Similarly, the magenta line displays the standardized coefficients for x^7. The coefficient is negative for Steps 3 and 4, positive for Step 5, and is zero for the last two models, which do not include the x^7 effect. By using this plot, you can visually discern which effects are included in each model and study the relative change of the coefficients between models.

Summary

This article shows how to use PROC GLMSELECT in SAS to build a sequence of models on training data. From among the models, a final model is chosen that best predicts a validation data set. The example uses the stepwise selection technique because it is easy to understand, but the GLMSELECT procedure supports other model selection algorithms.
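As a sketch of what switching algorithms looks like, the call below replaces stepwise selection with LASSO while keeping the same effects and the same partition. It assumes the same Have data set. The STOP=NONE suboption, which lets the LASSO path run to its end before the validation ASE chooses a model, is my choice for illustration. The sequence of models and the chosen model will generally differ from the stepwise results described in this article.

```sas
proc glmselect data=Have seed=1 plots=(ASEPlot Coefficients);
   effect poly = polynomial(x / degree=7);
   partition fraction(validate=0.4);
   /* LASSO path; choose the step with the smallest validation ASE */
   model y = poly / selection=lasso(choose=validate stop=none);
run;
```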

You can visualize the model selection process by using the ASE plot. You can visualize the progression of candidate models by using the coefficient plot. For this univariate regression model, you can also visualize each candidate model by overlaying curves or by using an animation.

For more information about the model selection procedures in SAS, see the SAS/STAT documentation.

The post Model selection with PROC GLMSELECT appeared first on The DO Loop.

February 2, 2019
 

SAS Visual Analytics

I don't know about you, but when I read challenges like:

  • Detecting hidden heart failure before it harms an individual
  • Can SAS Viya AI help to digitalize pension management?
  • How to recommend your next adventure based on travel data
  • How to use advanced analytics in building a relevant next best action
  • Can SAS help you find your future home?
  • When does a customer have their travel mood on, and to which destination will he travel?
  • How can SAS Viya, Machine Learning and Face Recognition help find missing people?

…and I could go on with the list of ideas provided by the teams participating in the SAS Nordics User Group’s Hackathon, one thing is for sure: I become enthusiastic and eager to discover the answers and to see how analytics can help solve these questions.

When the Nordics team asked for help providing SAS Viya infrastructure on the Azure Cloud platform, I didn't hesitate to agree and started planning the environment.

Environment needs

Colleagues from the Nordic countries informed us that their Hackathon currently included fourteen registered teams. Hence, they needed at least fourteen different environments with the latest and greatest SAS Viya tools, such as SAS Visual Analytics, SAS VDMML and SAS Text Analytics. In addition, participants wanted the chance to use open source technologies with SAS and asked us to install R-Studio and Jupyter. This would allow data scientists to develop models in the programming language of their choice and would provide access to SAS predictive modeling capabilities.

The challenge I faced was how to automate this installation process. We didn't want to repeat an exact installation fourteen times! Also, in case of a failure we needed a way to quickly reinstall a fresh virtual machine in our environment. We wanted to create the virtual machines on the Azure Cloud platform. The goal was to quickly get SAS Viya instances up and running on Azure, with little user interaction. We ended up with a single script expecting one parameter: the name of the instance. Next, I provide an overview of how we accomplished our task.

The setup

Because we needed to deploy fourteen identical copies of the same SAS Viya software, we decided to use SAS Mirror Manager, a utility for synchronizing SAS software repositories. After downloading the mirror repository, we moved the complete file structure to a web server hosted on a separate Nordics Hackathon repository virtual machine, but within a similar private network where the SAS Viya instances will run. This guarantees low latency when downloading the software.

Once the repository server is up and running, we have what we need to create a SAS Viya base image. Within that image, we first need to make sure we meet the requirements described in the SAS Viya Deployment Guide. To complete this task, we turned to the Viya Infrastructure Resource Kit (VIRK). The VIRK is a collection of tools, created by Erwan Granger, that assist in infrastructure and readiness-verification tasks. The script is located in a repository on SAS software’s GitHub page. By running the VIRK script before creating the base image, we guarantee that all virtual machines based on the image meet the necessary requirements.

Next, we create the SAS Viya Playbook within the base image, as described in the SAS Viya Deployment Guide. That allows us to kick off a SAS Viya installation later. The Viya installation must occur later, during the initial launch of a new VM based on that image. We cannot install SAS Viya beforehand because one of the requirements is a static IP address and a static hostname, which are different for each VM we launch. However, we can install R-Studio server on the base image. Another important file we make available on this base image is a script to initiate the Ansible installations of OpenLDAP, SAS Viya and Jupyter.
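To make the idea of such an install script concrete, here is a minimal sketch of how the three Ansible installations could be chained. It is an illustration, not the script we used: the directory layout under INSTALL_ROOT and the playbook names openldap.yml and jupyter.yml are hypothetical, and only site.yml is the playbook name used in the SAS Viya Deployment Guide.

```shell
# Sketch only. INSTALL_ROOT, its subdirectories, openldap.yml and jupyter.yml
# are hypothetical names chosen for illustration.
INSTALL_ROOT="${INSTALL_ROOT:-/opt/install}"
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Run one playbook from its own directory (in a subshell, so the caller's
# working directory is unchanged).
run_playbook() {
  ( cd "$INSTALL_ROOT/$1" && "$ANSIBLE_PLAYBOOK" "$2" )
}

# Install OpenLDAP, then SAS Viya, then Jupyter; stop at the first failure.
install_all() {
  run_playbook openldap openldap.yml &&
  run_playbook sas_viya_playbook site.yml &&
  run_playbook jupyter jupyter.yml
}
```

Calling install_all on a freshly launched VM would run the steps in order. Chaining with && means a failed identity provider installation does not leave a half-configured SAS Viya behind it.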

Deployment

After the common components are in place, we follow the instructions from Azure on how to create a custom image of an Azure VM. This capability is available on other public cloud providers as well. Now all the prerequisites to create working Viya environments for the Hackathon are complete. Finally, we create a launch script to install a full SAS Viya environment with a single command and one parameter, the hostname, from the Azure CLI.

$ ./launchscript.sh viya01
$ ./launchscript.sh viya02
$ ./launchscript.sh viya03
...
$ ./launchscript.sh viya12
$ ./launchscript.sh viya13
$ ./launchscript.sh viya14
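Typing fourteen commands by hand is not necessary. A small wrapper, sketched here as an assumption rather than as part of our actual tooling, can generate the zero-padded host names and call the launch script for each one. The LAUNCHER variable and both function names are hypothetical.

```shell
# Sketch only. LAUNCHER defaults to the launch script shown above;
# instance_names and launch_all are hypothetical helper names.
LAUNCHER="${LAUNCHER:-./launchscript.sh}"

# Print zero-padded instance names, one per line: viya01, viya02, ...
instance_names() {
  i=1
  while [ "$i" -le "$1" ]; do
    printf 'viya%02d\n' "$i"
    i=$((i + 1))
  done
}

# Launch the requested number of instances in turn; stop at the first failure.
launch_all() {
  for name in $(instance_names "$1"); do
    "$LAUNCHER" "$name" || { echo "launch failed for $name" >&2; return 1; }
  done
}
```

With these definitions loaded, launch_all 14 issues the same fourteen calls as the list above, one after another.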

The script

The main parts of this launch script are:

  1. Test whether the Nordics Hackathon Repository VM is running, because we must download software from our own locally created repository.
  2. Launch a new VM based on the SAS Viya image we created during preparation, assign a public static IP address, and choose the Standard_E32-16s_v3 Azure VM size.
  3. Launch our own Viya-install script to perform the following three sub-steps:
    • Install OpenLDAP as the identity provider.
    • Install SAS Viya just as you would by following the SAS Viya Deployment Guide.
    • Install Jupyter with a customized Ansible script made by my colleague Alexander Koller.
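The three parts listed above could be sketched with the Azure CLI roughly as follows. This is a simplified illustration under stated assumptions, not our actual launch script: the resource group, VM, and image names and the install script path are hypothetical, and starting the install through az vm run-command is only one possible mechanism. Only the VM size comes from our setup.

```shell
# Sketch only. RESOURCE_GROUP, REPO_VM, BASE_IMAGE and INSTALL_SCRIPT are
# hypothetical names; AZ can be overridden to stub the Azure CLI.
RESOURCE_GROUP="${RESOURCE_GROUP:-hackathon-rg}"
REPO_VM="${REPO_VM:-hackathon-repo}"
BASE_IMAGE="${BASE_IMAGE:-viya-base-image}"
INSTALL_SCRIPT="${INSTALL_SCRIPT:-/opt/install/viya-install.sh}"
VM_SIZE="Standard_E32-16s_v3"
AZ="${AZ:-az}"

# Part 1: succeed only if the repository VM reports the power state "VM running".
repo_is_running() {
  state=$("$AZ" vm show --show-details --resource-group "$RESOURCE_GROUP" \
    --name "$REPO_VM" --query powerState --output tsv)
  [ "$state" = "VM running" ]
}

# Part 2: create the VM from the base image with a static public IP address.
create_vm() {
  "$AZ" vm create --resource-group "$RESOURCE_GROUP" --name "$1" \
    --image "$BASE_IMAGE" --size "$VM_SIZE" \
    --public-ip-address-allocation static
}

# Part 3: start the install script on the new VM (assumed mechanism).
run_install() {
  "$AZ" vm run-command invoke --resource-group "$RESOURCE_GROUP" --name "$1" \
    --command-id RunShellScript --scripts "$INSTALL_SCRIPT"
}

# Entry point: takes one parameter, the hostname of the new instance.
launch() {
  [ -n "$1" ] || { echo "usage: launch <hostname>" >&2; return 2; }
  repo_is_running || { echo "repository VM is not running" >&2; return 1; }
  create_vm "$1" && run_install "$1"
}
```

Checking the repository VM first means a launch fails fast instead of creating a VM that cannot download any software. For an installation that runs for over an hour, starting the script over SSH may be a better fit than run-command, which can time out on long-running scripts.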

As a result, we had fourteen full SAS Viya installations ready in about one hour and 45 minutes. We recently posted a LinkedIn video describing the entire process.

Final thoughts

I am planning to write a blog on SAS Communities to share more technical insight on how we created the script. I am honored I was asked to be part of the jury for the Hackathon. I am looking forward to the analytical insights that the different teams will discover and how they will make use of SAS Viya running on the Azure Cloud platform.

Additional resources

Series of Webinars supporting the Nordic Hackathon

Installing SAS Viya Azure virtual machines with a single click was published on SAS Users.