July 30, 2021
 

In this Q&A with MIT/SMR Connections, Gavin Day, Senior Vice President of Technology at SAS, shares real-life examples of artificial intelligence (AI) at work, discusses picking the right problems to solve with AI, dispels a common misconception about AI, and defines AI success. Q: Could you describe some especially interesting [...]

AI in real-life: A Q&A with Gavin Day was published on SAS Voices by Kimberly Nevala

July 29, 2021
 

When using ordinary least squares (OLS) regression, if your response (dependent) attribute isn’t close to a normal distribution, your analysis will suffer, and typically not in a good way. The farther your data is from normality, the greater the impact on your model. One of the key assumptions in OLS and other regressions is that the dependent variable be at least approximately normally distributed.

This blog post discusses two distributional data transforms: the natural (or common) logarithm, and the Johnson family of transforms [1], specifically the Su transform. The Su transform has been empirically shown to improve logistic regression performance from the input attributes [2]. I will focus primarily on the response or dependent attribute.

Example

In this example, the attribute called VALUE24 is the amount of revenue generated from campaign purchases that were obtained in the last 24 months. The histogram below shows the distribution, and the fitted line is what a normal distribution should look like given the mean and variance of this data.

You can clearly see that the normal curve overlaid on the histogram indicates that the data isn’t close to a normal distribution. The dependent attribute of VALUE24 will need to be transformed.

The following histogram shows the VALUE24 attribute transformed with a natural logarithm, with the normal curve plotted as before. The logarithm transform is much better because the normal curve is much closer to the transformed distribution, but it still has some room for improvement.

Below is the Su transformed distribution of VALUE24. Notice that not only does the transformed data conform closely to a normal distribution, but the mean is centered at zero and the variance is one!

Benefits of the Su transform

The benefits of using this transform in predictive models are:

  • Link-free consistency, by allowing an elliptically contoured predictor space.
  • Reduction in nonlinear confounding, which leads to less misspecification.
  • Local adaptivity: suitable transforms allow models to balance their sensitivity to dense versus sparse regions of the predictor space [2].

For details about the first benefit, please refer to Potts [2]. The second benefit means your model has a much greater chance of being specified correctly due to a reduction in confounding among attributes, which improves not only your model’s accuracy but also its interpretability. The last major benefit is that the Su transform pulls in the very long tails of a distribution much better than the log or other transforms [1].

Computing the Su transform

The transform that works best for global smoothing, without resorting to nonparametric density estimation, is to estimate a Beta distribution, which can be very close to a normal distribution when certain parameters of the Beta distribution are set [2]. The general equation for the Su family is below; however, we won’t be using that exact equation.
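For reference, the Su member of the Johnson system is commonly written in the following four-parameter form (a standard statement of the transform, supplied here for context):

z = \gamma + \delta \,\operatorname{arcsinh}\!\left(\frac{x-\xi}{\lambda}\right) = \gamma + \delta \,\ln\!\left(\frac{x-\xi}{\lambda} + \sqrt{\left(\frac{x-\xi}{\lambda}\right)^{2}+1}\,\right),

where γ and δ are shape parameters, ξ is a location parameter, and λ > 0 is a scale parameter (Johnson, 1949 [1]). The framed, three-parameter version fitted below drops the location parameter ξ.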

What we need is to reframe the above equation as an optimal transformation that can be estimated using nonlinear regression, with normal scores as the response variable. The equation can be framed as follows [2]:
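One illustrative way to write that framing, consistent with the three parameters λ, δ, and γ solved for below (a sketch of the form rather than a verbatim quotation of Potts), is

r_i = \gamma + \delta \,\ln\!\left(\frac{x_i}{\lambda} + \sqrt{\left(\frac{x_i}{\lambda}\right)^{2}+1}\,\right) + \varepsilon_i ,

where r_i is the normal score of the i-th value of the dependent attribute and x_i is its raw value; the three parameters are then estimated by nonlinear least squares.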

In SAS, you can use the RANK procedure to compute the normal scores and the nonlinear regression procedure PROC NLIN (or, on the completely in-memory SAS Viya platform, PROC NLMOD) to solve for the three parameters λ, δ, and γ.
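Here is a minimal sketch of that two-step fit, assuming a SAS 9 sample data set named work.buytest that contains VALUE24 (the data set name and the starting parameter values are hypothetical):

proc rank data=work.buytest normal=blom out=work.buytest_ns;
   var value24;
   ranks ns_value24;     /* normal scores serve as the response */
run;

proc nlin data=work.buytest_ns;
   parms gamma=0 delta=1 lambda=1;
   bounds lambda > 0;
   /* arcsinh written with LOG and SQRT because Base SAS has no arcsinh function */
   model ns_value24 = gamma + delta*log( value24/lambda
                      + sqrt((value24/lambda)**2 + 1) );
   output out=work.su_fit predicted=su_value24;
run;

The predicted values from PROC NLIN are the Su-transformed values (su_value24) used in the PROC GENSELECT step below.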

However, there is one serious downside to fitting a Su transform with the above nonlinear function: unlike the log and some other transforms, it doesn’t have an inverse link function to transform the data back to its original form. The saving grace is that you can perform an empirical data simulation and develop score code that mimics the relationship between the Su-transformed and the untransformed data. This enables the Su transform to be estimated on a good, statistically representative sample and then applied to the larger data set from which the sample was drawn. The scatter plot below shows the general relationship between VALUE24 and the Su-transformed VALUE24.

Using PROC GENSELECT, the empirical fit of the above relationship can be performed and score code generated. While other SAS procedures, such as TRANSREG and GLMSELECT, can do the same fitting, at present they won’t write out score code when computed EFFECT statements are used. After fitting this data with a sixth-degree polynomial spline, the fitted curve is shown in the scatter plot below.

The following SAS score code was generated from the GENSELECT procedure and written to a .sas file on the server.

SAS code used to fit the sixth-degree curve

ods graphics on;

title 'General Smoothing Spline for Value24 Empirical Link Function';

proc genselect data=casuser.merge_buytest;
   effect poly_su24 = poly(su_value24 / degree=6);
   model value24 = poly_su24 / distribution=normal;
   code file="/shared/users/racoll/poly_value24_score.sas";
   output out=casuser.glms_out pred=pred_value24 copyvar=id;
run;
title;

Generated SAS scoring code from PROC GENSELECT

drop _badval_ _linp_ _temp_ _i_ _j_; 
_badval_ = 0; 
_linp_   = 0; 
_temp_   = 0; 
_i_      = 0; 
_j_      = 0; 
drop MACLOGBIG; 
MACLOGBIG= 7.0978271289338392e+02; 
array _xrow_0_0_{7} _temporary_; 
array _beta_0_0_{7} _temporary_ (    210.154143004534 
    123.296571835751 
    49.7657489425605 
    9.82725091766856 
    -3.13300266038598 
    -0.70029417420551 
    0.17335174313709); 
array _xtmp_0_0_{7} _temporary_; 
array _xcomp_0_0_{7} _temporary_; 
array _xpoly1_0_0_{7} _temporary_;  
 if missing(su_value24) 
 then do; 
    _badval_ = 1; 
    goto skip_0_0; 
end;   
 
do _i_=1 to 7; _xrow_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xtmp_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xcomp_0_0_{_i_} = 0; end; 
do _i_=1 to 7; _xpoly1_0_0_{_i_} = 0; end;   
 
_xtmp_0_0_[1] = 1;   
 
_temp_ = 1; 
_xpoly1_0_0_[1] = su_value24; 
_xpoly1_0_0_[2] = su_value24 * _xpoly1_0_0_[1];
 _xpoly1_0_0_[3] = su_value24 * _xpoly1_0_0_[2]; 
_xpoly1_0_0_[4] = su_value24 * _xpoly1_0_0_[3]; 
_xpoly1_0_0_[5] = su_value24 * _xpoly1_0_0_[4];
 _xpoly1_0_0_[6] = su_value24 * _xpoly1_0_0_[5]; 
do _j_=1 to 1; _xtmp_0_0_{1+_j_} = _xpoly1_0_0_{_j_}; end;   
 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{2+_j_} = _xpoly1_0_0_{_j_+1}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{3+_j_} = _xpoly1_0_0_{_j_+2}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{4+_j_} = _xpoly1_0_0_{_j_+3}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{5+_j_} = _xpoly1_0_0_{_j_+4}; end; 
_temp_ = 1; 
do _j_=1 to 1; _xtmp_0_0_{6+_j_} = _xpoly1_0_0_{_j_+5}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+0} = _xtmp_0_0_{_j_+0}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+1} = _xtmp_0_0_{_j_+1}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+2} = _xtmp_0_0_{_j_+2}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+3} = _xtmp_0_0_{_j_+3}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+4} = _xtmp_0_0_{_j_+4}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+5} = _xtmp_0_0_{_j_+5}; end; 
do _j_=1 to 1; _xrow_0_0_{_j_+6} = _xtmp_0_0_{_j_+6}; end;  
 
do _i_=1 to 7; 
_linp_ + _xrow_0_0_{_i_} * _beta_0_0_{_i_}; 
end;   
 
skip_0_0: 
label P_VALUE24 = 'Predicted: VALUE24'; 
if (_badval_ eq 0) and not missing(_linp_) then do; 
  P_VALUE24 = _linp_; 
end; else do; 
    _linp_ = .; 
    P_VALUE24 = .; 
end;

This score code can be placed in a SAS DATA step along with your analytical model’s score code so that your model’s predictions can be back-transformed from the Su scale to the original VALUE24 scale.
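A minimal sketch of that step, assuming a scored table named work.model_scored in which the model’s prediction on the Su scale has been renamed to su_value24 (both names are hypothetical):

data work.value24_backxform;
   set work.model_scored;   /* hypothetical table; must contain su_value24 */
   /* the generated code reads su_value24 and returns P_VALUE24 on the original scale */
   %include "/shared/users/racoll/poly_value24_score.sas";
run;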

If you liked this blog post, then you might like my latest book, Segmentation Analytics with SAS® Viya®: An Approach to Clustering and Visualization.

References

[1] Johnson, N. L., "Systems of Frequency Curves Generated by Methods of Translation," Biometrika, 1949.

[2] Potts, W., "Elliptical Predictors for Logistic Regression," Keynote Address, SAS Data Mining Conference, Las Vegas, NV, 2006.

 

 

An analytic transform for cantankerous data was published on SAS Users.

July 27, 2021
 

In the past, the COMPRESS function was useful. Since SAS version 9, it has become a blockbuster, and you might not have noticed. The major change was the addition of a new optional third argument called modifiers.

The traditional use of the COMPRESS function was to remove blanks or a list of selected characters from a character string. The addition of a MODIFIER argument does two things. First, you can specify classes of characters to remove, such as all letters, all punctuation marks, or all digits. That is extremely useful, but the addition of the 'k' modifier is why I used the term blockbuster in my description. The 'k' modifier flips the function from one that removes characters from a string to one that keeps a list of characters and removes everything else. Let me show you some examples.

This first example stems from a real problem I encountered while trying to read values that contained units. My data looked something like this:

ID     Weight 
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg

My goal was to create a variable called Wt that represented the person's weight in pounds as a numeric value.

First, let’s look at the code. Then, I’ll give an explanation.

data Convert;
   length ID $3 Weight $8;
   input ID Weight;
 
   Wt = input(compress(Weight,,'kd'),8.);
   /* The COMPRESS function uses two modifiers, 'k' and 'd'.  This means
      keep the digits, remove anything else.  The INPUT function does the
      character-to-numeric conversion.
   */
 
   If findc(Weight,'k','i') then Wt = Wt * 2.2;
 
   /* the FINDC function is looking for an upper or lowercase 'k' in the
      original character string.  If found, it converts the value in
      kilograms to pounds (note: 1 kg = 2.2 pounds).
   */
 
datalines;
001    100lbs.
002     59Kgs.
003    210LBS
004    83kg
;
title "Listing of Data Set Convert";
footnote "This program was run using SAS OnDemand for Academics";
proc print data=Convert noobs;
run;

The program reads the value of Weight as a character string. The COMPRESS function uses 'k' and 'd' as modifiers. Notice the two commas in the list of arguments. A single comma would interpret 'kd' as the second argument (the list of characters to remove). Including two commas notifies the function that 'kd' is the third argument (modifiers). You can list these modifiers in any order, but I like to use 'kd', and I think of it as "keep the digits." What remains is the string of digits. The INPUT function does the character-to-numeric conversion.

Your next step is to figure out if the original value of Weight contained an upper or lowercase 'k'. The FINDC function can take three arguments: the first is the string that you are examining, the second is a list of characters that you are searching for, and the third argument is the 'i' modifier that says, "ignore case" (very useful).

If the original character string (Weight) contains an uppercase or lowercase 'k', you convert from kilograms to pounds.

Here is the output:
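Based on the code and data above, the listing should look roughly like this (59 Kgs × 2.2 = 129.8 lbs and 83 kg × 2.2 = 182.6 lbs):

Listing of Data Set Convert

ID    Weight       Wt
001   100lbs.     100
002    59Kgs.     129.8
003   210LBS      210
004   83kg        182.6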

There is one more useful application of the COMPRESS function that I want to discuss. Occasionally, you might have a text file in ASCII or EBCDIC that contains non-printing characters (usually placed there in error). Suppose you want to keep just the digits, letters, decimal points (periods), commas, and blanks. You need to read the original value as a text string. Let's call the original string Contains_Junk. All you need to convert these values is one line of code like this:

Valid = compress(Contains_Junk,'.,','kdas');

In this example, you are using all three arguments of the COMPRESS function. As in pre-9 versions of SAS, the second argument is a list of characters that you want to remove. However, because the third argument (modifiers) contains a 'k', the second argument becomes a list of characters that you want to keep. In addition to the periods and commas listed in the second argument, the modifiers keep all digits (the 'd' modifier), uppercase and lowercase letters (the 'a' modifier, 'a' for alpha), and space characters (the 's' modifier; space characters include blanks, tabs, and a few others such as carriage returns and linefeeds). If you did not want to include tabs and other "white space" characters, you could rewrite this line as:

Valid = compress(Contains_Junk,'., ','kd');

Here you are including a blank in the second argument and omitting the 's' in the modifier list.
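As a quick, made-up illustration of that second version (the junk string below is contrived, with a tab character standing in for a non-printing character):

data _null_;
   Contains_Junk = '12,345.67' || '09'x || 'ABC';  /* a tab and stray letters play the role of "junk" */
   Valid = compress(Contains_Junk, '., ', 'kd');   /* keep digits plus period, comma, and blank */
   put Valid=;                                     /* writes Valid=12,345.67 to the log */
run;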

You can read more about the COMPRESS function in any of the following books, available from SAS Press as an e-book or from Amazon in print form:

Or my latest programming book:

 

Questions and/or comments are welcome.

The Amazing COMPRESS Function was published on SAS Users.

July 26, 2021
 

A SAS programmer recently asked why his SAS program and his colleague's R program display different estimates for the quantiles of a very small data set (less than 10 observations). I pointed the programmer to my article that compares the nine common definitions for sample quantiles. The article has a section that explicitly compares the default sample quantiles in SAS and R. The function in the article is written to support all nine definitions. The programmer asked whether I could provide a simpler function that computes only the default definition in R.

This article compares the default sample quantiles in SAS and R. It is a misnomer to refer to one definition as "the SAS method" and to another as "the R method." In SAS, procedures such as PROC UNIVARIATE and PROC MEANS enable you to use the QNTLDEF= option to choose among five different quantile estimates. By using SAS/IML, you can compute all nine estimation methods. Similarly, R supports all nine definitions. However, users of statistical software often use the default methods, then wonder why they get different answers from different software. This article explains the difference between the default method in SAS and the default method in R. The default in R is also the default method in Julia and in the Python packages SciPy and NumPy.

The Hyndman and Fan taxonomy

The purpose of a sample statistic is to estimate the corresponding population parameter. That is, the sample quantiles are data-based estimates of the unknown quantiles in the population. Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996) discuss nine definitions of sample quantiles that commonly appear in statistical software packages. All nine definitions result in valid estimates. For large data sets (say, 100 or more observations), they tend to give similar results. The differences between the definitions are most evident if you use a small data set that has wide gaps between two adjacent pairs of values (after sorting the data). The example in this article is small and has a wide gap between the largest value and the next largest value.

By default, SAS uses Hyndman and Fan's Type=2 method, whereas R (and Julia, SciPy, and NumPy) use the Type=7 method. The Type=2 method uses the empirical cumulative distribution of the data (empirical CDF) to estimate the quantiles, whereas the Type=7 method uses a piecewise-linear estimate of the cumulative distribution function. This is demonstrated in the next section.
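In Hyndman and Fan's notation (letting x_{(1)} \le \cdots \le x_{(n)} denote the ordered data), the two rules can be summarized compactly as follows; this is a paraphrase of the 1996 paper, not a quotation:

Type=7: \quad h = (n-1)p + 1, \quad j = \lfloor h \rfloor, \quad Q_7(p) = x_{(j)} + (h - j)\,\bigl(x_{(j+1)} - x_{(j)}\bigr)

Type=2: \quad j = \lfloor np \rfloor, \quad g = np - j, \quad Q_2(p) = x_{(j+1)} \ \text{if } g > 0, \quad Q_2(p) = \tfrac{1}{2}\bigl(x_{(j)} + x_{(j+1)}\bigr) \ \text{if } g = 0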

An example of sample quantiles

To focus the discussion, consider the data {0, 1, 1, 1, 2, 2, 2, 4, 5, 8}. There are 10 observations, but only six unique values. The following graphs show the estimates of the cumulative distribution function used by the Type=2 and Type=7 methods. The fringe plot (rug plot) below the CDF shows the locations of the data:

The sample quantiles are determined by the estimates of the CDF. The largest gap in the data is between the values X=5 and X=8, so for extreme quantiles (greater than 0.9) we expect to see differences between the Type=2 and the Type=7 estimates. The following examples show that the two methods agree for some quantiles, but not for others:

  • The 0.5 quantile (the median) is determined by drawing a horizontal line at Y=0.5 and seeing where the horizontal line crosses the estimate of the CDF. For both graphs, the corresponding X value is X=2, which means that both methods give the same estimate (2) for the median.
  • The 0.75 quantile (the 75th percentile) estimates are different between the two methods. In the upper graph, a horizontal line at 0.75 crosses the empirical CDF at X=4, which is a data value. In the lower graph, the estimate for the 0.75 quantile is X=3.5, which is neither a data value nor the average of adjacent values.
  • The 0.95 quantile (the 95th percentile) estimates are different. In the upper graph, a horizontal line at 0.95 crosses the empirical CDF at X=8, which is the maximum data value. In the lower graph, the estimate for the 0.95 quantile is X=6.65, which is between the two largest data values.
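For example, at p = 0.95 with n = 10, the Type=7 rule gives h = 9(0.95) + 1 = 9.55, so the estimate is x_{(9)} + 0.55\,(x_{(10)} - x_{(9)}) = 5 + 0.55(3) = 6.65; the Type=2 rule gives np = 9.5, which is not an integer, so the estimate is x_{(10)} = 8.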

Comments on the CDF estimates

The Type=2 method (the default in SAS) uses an empirical CDF (ECDF) to estimate the population quantiles. The ECDF has a long history of being used for fitting and comparing distributions. For example, the Kolmogorov-Smirnov test uses the ECDF to compute nonparametric goodness-of-fit tests. When you use the ECDF, a quantile is always an observed data value or the average of two adjacent data values.

The Type=7 method (the default in R) uses a piecewise-linear estimate of the CDF. There are many ways to create a piecewise-linear estimate, and there have been many papers (going back to the 1930s) written about the advantages and disadvantages of each choice. In Hyndman and Fan's taxonomy, six of the nine methods use piecewise-linear estimates. Some people prefer the piecewise-linear estimates because the inverse CDF is continuous: a small change to the probability value (such as 0.9 to 0.91) results in a small change to the quantile estimates. This property is not present in the methods that use the ECDF.

A function to compute the default sample quantiles in R

Back in 2017, I wrote a SAS/IML function that can compute all of the common definitions of sample quantiles. If you only want to compute the default (Type=7) definition in R, you can use the following simpler function:

proc iml;
/* By default, R (and Julia, and some Python packages) uses
   Hyndman and Fan's Type=7 definition. Compute Type=7 sample quantiles.
*/
start GetQuantile7(y, probs);
   x = colvec(y);
   N = nrow(x);
   if N=1 then return (y);  /* handle the degenerate case, N=1 */
 
   /* remove missing values, if any */
   idx = loc(x^=.);
   if ncol(idx)=0 then  
      return (.);           /* all values are missing */
   else if ncol(idx)<N then do;
      x = x[idx,]; N = nrow(x);  /* remove missing */
   end;
 
   /* Main computation: Compute Type=7 sample quantile. 
      Estimate is a linear interpolation between x[j] and x[j+1]. */
   call sort(x);
   p = colvec(probs);
   m = 1-p;
   j = floor(N*p + m);      /* indices into sorted data values */
   g = N*p + m - j;         /* 0 <= g <= 1 for interpolation */
 
   q = j(nrow(p), 1, x[N]); /* if p=1, estimate by x[N]=max(x) */
   idx = loc(p < 1);
   if ncol(idx) >0 then do;
      j = j[idx]; g = g[idx];
      q[idx] = (1-g)#x[j] + g#x[j+1]; /* linear interpolation */
   end;   
   return q;
finish;
 
/* Compare the SAS and R default definitions.
   The differences between definitions are most apparent 
   for small samples that have large gaps between adjacent data values. */
x = {0 1 1 1 2 2 2 4 5 8}`;
prob = {0.5, 0.75, 0.9, 0.95};
 
call qntl(SASDefaultQntl, x, prob);
RDefaultQntl = GetQuantile7(x, prob);
print prob SASDefaultQntl RDefaultQntl;

The table shows some of the quantiles that were discussed previously. If you choose prob to be evenly spaced points in [0,1], you get the values on the graphs shown previously.

Summary

There are many ways to estimate quantiles. Hyndman and Fan (1996) list nine common definitions. By default, SAS uses the Type=2 method, whereas R (and other software) uses the Type=7 method. SAS procedures support five of the nine common definitions of sample quantiles, and you can use SAS/IML to compute the remaining definitions. To make it easy to reproduce the default values of sample quantiles from other software, I have written a SAS/IML function that computes the Type=7 quantiles.

If you do not have SAS/IML software, but you want to compute estimates that are based on a piecewise-linear estimate of the CDF, I suggest you use the QNTLDEF=1 option in PROC UNIVARIATE or PROC MEANS. This produces the Type=4 method in Hyndman and Fan's taxonomy. For more information about the quantile definitions that are natively available in SAS procedures, see "Quantile definitions in SAS."
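For example, a call such as the following (the data set work.have and the variable x are placeholders) requests percentiles under QNTLDEF=1:

proc means data=work.have qntldef=1 p50 p75 p90 p95;
   var x;
run;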

The post Compare the default definitions for sample quantiles in SAS, R, and Python appeared first on The DO Loop.

July 24, 2021
 

Are you thinking about migrating your environment over to SAS Viya but don’t know if your current architecture will be supported? Luckily, with the help of the SAS 9 Content Assessment Tool, we can help to alleviate some of those worries.

If you look at our documentation on the SAS 9 Content Assessment Tool, you’ll see that there is a fairly large amount of information to parse through. However, each section is designed to cover a different application within the SAS 9 Content Assessment Tool to help assist in your transition. Before diving in and running all the applications, first ask yourself a few questions:

  • What is your goal with using the SAS 9 Content Assessment Tool?
  • Are you simply wanting to count (inventory) what is in your environment?
  • Do you want a more detailed explanation (profile) about the items that exist in your deployment?
  • Are you wanting to know if your current SAS code will transition (code check) correctly to SAS Viya?

These are all important questions to consider before choosing which application to run and for what purpose. In this guide, I will walk through a typical use case for running the SAS 9 Content Assessment Tool and show how to check whether your environment is Viya-ready.

Download and unpack the tool

Before we can access SAS Content Assessment, we will first need to download and unpack the tool itself. The latest download available at the time of this article is version 2021.1.2 and can be downloaded from the SAS 9 Content Assessment 2021.1.3 support site.

We will be downloading the Linux version for our purposes:

You will then want to unpack the tar file using the Linux command below:

tar zxvf /SAS9ContentAssessment/SAS9ContentAssessment.2021.1.2.lax.tgz

Unpacking the tar file should produce the “assessment” and “migration” directories:

Now that we have successfully unpacked the Content Assessment tool, we can proceed to modifying the setenv.yaml and metaparms.sas files for configuration.

Set up the configuration

We’ll start by specifying specific configuration information necessary to run the Content Assessment tool successfully. Let’s modify the setenv.yaml file first:

We are required to set explicit values for each key-value pair that is denoted with an asterisk. However, SAS_CATALOGSDIR and ENTERPRISE_GUIDE_PROJECTSDIR only need values if you intend to gather that content. For this example, that means we will be setting values for:

  • SAS_HOME
  • ASSESSMENT_CONFIGDIR

We have modified the file in our text editor of choice, Notepad++. We do not have any SAS catalogs or SAS Enterprise Guide projects to gather for this particular example, so we leave those two fields blank.

Once you have set these values, you can Save and Close the file and proceed to editing metaparms.sas.

In this file, two specific values must be defined:

  • METADATAHOSTNAME
  • METADATAUSERPASSWORD

In this case, we will be specifying additional values to help distinguish which host machines we are running the Content Assessment Tool on:

Once you have set these values, you can Save and Close the file.

Task #1: Inventory Content

Now that we have filled in the required values for the configuration files, we can begin with running the applications within SAS Content Assessment Tool. We will first begin with running Inventory Content. In summary, Inventory Content performs the following:

  • It determines what licensed products are installed.
  • It examines the SAS Deployment Registry for various actions and items.
  • It enumerates all file system content.
  • It enumerates SAS metadata, such as:
    • SAS metadata objects contained in SAS folders
    • SAS server and SAS application server contexts
    • ACTs and ACEs

We will go ahead and run Inventory Content by navigating to our unpacked tool directory:

Then we will run the script:

[sas@trcv037 assessment]$ ./inventoryContent

Running the script should show multiple checks like those below:

Now that Inventory Content has completed, it should have produced three different data sets:

  • all_objects.sas7bdat
  • deploymentinfo.sas7bdat
  • licenseinfo.sas7bdat

We can confirm this by navigating to the path for inventory as defined in the setenv.yaml file: ASSESSMENT_DATAMARTDIR/inventory/METADATALABEL.

We have completed the Inventory Content task.

Task #2: Profile Content

Running the Profile Content application allows us to profile SAS objects and SAS content. To run the Profile Content application, navigate to the unpacked tool directory and issue the command:


./profileContent

The process will look similar to:

Note: If you are running a version of SAS earlier than 9.4 M3, you will need to run the Relationship Loader by adding the command option --load-relationships

This will produce multiple SAS datasets in the output defined for ASSESSMENT_DATAMARTDIR/profile/METADATALABEL.

We have now completed the Profile Content task.

Task #3: Code Check

The goal of the Code Check application is to check SAS code for SAS Viya incompatibilities. During the process Code Check performs the following actions:

  • It classifies any file with a .SAS extension in the starting directory and its subdirectories as SAS code and adds it to the list of files to be scanned.
  • It searches the files for keywords that have been identified as not compatible in SAS Viya.
  • It searches the files for hardcoded paths in statements such as LIBNAME, INFILE, FILENAME, %INC, and %INCLUDE. These occurrences might not cause errors, but they are gathered so that they can be validated.

When executing the codeCheck command, you have a few options available. You can use the --scan-tag option to distinguish separate Code Check runs across different directories.

Then there's --source-location, which is required and specifies the directory to scan for code.

Finally, there's the --sources-file option if you have several directories you would like to scan all at once.

In our instance, we will simply point to one directory to gather our code and distinguish this run with a unique scan tag. We’ll run the following command:



./codeCheck --scan-tag exampleGather --source-location /home/sas/My_SAS_Files

You can see that we are targeting the directory: /home/sas/My_SAS_Files. This directory contains 5 example programs that we would like to check:

After executing the command, you should see the following generated from the script:

And two datasets created for each scan tag in the Code Check datamart:

  • scan-tag_elements.sas7bdat
  • scan-tag_issues.sas7bdat

We have now completed our run of the Code Check application.

Task #4: Publish Assessed Content

Now that we have finished running Inventory, Profile, and Code Check, we can choose to Publish and Merge our content so that we can view the results in a more readable format. If Inventory, Profile, or Code Check has been run on more than one machine, we would need to aggregate all the resulting data sets in one location and run publishAssessedContent on that directory. However, in our instance, we generated the results on only one machine, so this is not necessary.

When running the publishAssessedContent command, you will need to specify which datamart type you are targeting. For example, for Code Check, we would specify the Code Check datamart type:


./publishAssessedContent --datamart-type codecheck

The datamart-type value would change for Inventory and Profile.

When executing the publishAssessedContent command, the result looks like this:

You can see that Publish found the resulting datasets generated from Code Check and merged and published them to the respective datamart.

If we go to that datamart location we can see that some additional datasets were generated:

  • codechk_elements.sas7bdat
  • codechk_issues.sas7bdat
  • codechk_keywords.sas7bdat
  • elements_aggregated.sas7bdat

Once you have the datasets generated in this manner, you can choose to import these by following the steps in our Importing the Data Mart documentation.

You can also choose to encrypt your results by specifying one of the encryption options:

--encrypt-aes

--encrypt-sas

When you use encryption, an assessment-key.sas file is created in the location where the tool was unpacked.

If you are coordinating with your SAS success team to plan a road map to SAS Viya, you will likely be asked to provide these datamarts. To do so, you will need to add the --create-uploads option to the original command. For example:


$ ./SAS9ContentAssessment/assessment/publishAssessedContent --create-uploads \
    --datamart-type profile --encrypt-aes

The result is three upload files that are created:

OURCustomer_profile_abcdef03470.tgz

OURCustomer_profile_abcdef03470_assessment-key.sas

OURCustomer_profile_abcdef03470_reports.tgz

These will be delivered to your SAS success team by following the steps in our documentation.

Related Resources

READ MORE | THE SAS 9 CONTENT ASSESSMENT TOOL
WATCH ON YOUTUBE | SAS 9 CONTENT ASSESSMENT DEMO
READ MORE | SYSTEM EVALUATION TOOL HELPS WITH SAS 9.4 UPGRADE-IN-PLACE PLANNING

How to use the SAS® 9 Content Assessment Tool was published on SAS Users.

July 22, 2021
 

How do you convince decision makers in your enterprise to give a machine learning (ML) project the green light?

You might be super excited about machine learning – as many of us are – and might think that this stuff should basically sell itself! The value proposition can seem totally obvious when you are already invested in it. The improvement to current operations is a "no-brainer." And the core ML technology is nifty as heck.

But to get traction for a new initiative, to sell it to decision makers, you need to take a step back from the excitement that you feel and tell a simple, non-technical business story that is sober rather than fervent.

Start with an elevator pitch

99.5% of our direct mail is ineffective. Only half a percent respond.

If we can lower that nonresponse rate to 98.5% — and increase the response rate to 1.5% — that would mean a projected $500,000 increase in annual profit, tripling the ROI of the marketing campaigns. I can show you the arithmetic in detail.

We can use machine learning to hone down the size of our mailings by targeting the customers more likely to respond. This should cut costs about three times the amount that it will decrease revenue, giving us the gains and ROI I just mentioned.

A short pitch like this is the best place to start before asking for questions. Get straight to the point – the business value and the bottom line – and then see where your colleagues are coming from. Remember, they're not necessarily excited about ML, so in this early stage, it is really, really easy to bore them. That’s why you must lead with the value and then get into the ML technology only to the degree necessary to establish credibility.

Keep your pitch focused on accomplishing these three things

  1. Your pitch must lead with the value proposition, expressed in business terms without any real details about ML, models, or data. Nothing about how ML works, only the actionable value that it delivers. Focus on the functional purpose, the operational improvement gained by model deployment – and yet, in this opening, don't use the words "model" or "deployment."
  2. Your pitch must estimate a performance improvement in terms of one or two key performance indicators (KPIs) such as response rate, profit, ROI, costs, or labor/staff requirements. Express this potential result in simple terms. For example, the profit curve of a model is “TMI” (Too Much Information) – it introduces unnecessary complexity during this introductory pitch. Instead, just show a bar chart with only two bars to illustrate the potential improvement. Stick with the metrics that matter, the ones people care about — that is, the ones that actually drive business decisions at your company. Make the case that the performance improvement more than justifies the expense of the ML project. Don't get into predictive model performance measures such as lift.
  3. Stop and listen -- keep your pitch short and then open the conversation. Realize that your pitch isn't the conclusion but rather a catalyst to begin a dialogue. By laying out the fundamental proposition and asking them to go next, you get to find out which aspects are of concern and which are of interest, and you get a read on their comfort level with ML or with analytics in general.

So, does the wondrous technology of machine learning itself even matter in this pitch? Can you really sell ML without getting into ML? Well, yes, it does matter, and usually you will get into it, eventually. But you need to interactively determine when to do so, to what depth, and at what pace.

With machine learning, leading with the scientific virtues and quantitative capabilities of the technology that you are selling – predictive modeling algorithms, the idea of learning from data, probabilities, and so on – is like pitching the factory rather than the sausage. Instead, lead with the business value proposition.

It's more common than you may realize for the business professional to whom you're speaking to feel nervous about their own ability to understand analytical technology. The elevator-pitch format serves as an antidote to this type of "tech aversion." Lead with a simple story about how value is delivered or how processes will improve.

These tactics for green lighting compose just one part of machine learning leadership. For machine learning projects to succeed, a very particular leadership practice must be followed. To fully dive in, enroll in my SAS Business Knowledge Series course, Machine Learning Leadership and Practice – End-to-End Mastery. (This article is based on one of the course’s 142 videos.) I developed this curriculum to empower you to generate value with machine learning, whether you work as a techie, a business leader, or some combination of the two. This course delivers the end-to-end expertise that you need, covering both the core technology and the business-side practice. Why cover both sides? Because both sides need to learn both sides! Click here for more details, the full syllabus, and to enroll.

Getting the green light for a machine learning project was published on SAS Users.

July 21, 2021
 

In my new book, I explain how segmentation and clustering can be accomplished in three ways: coding in SAS, point-and-click in SAS Visual Statistics, and point-and-click in SAS Visual Data Mining and Machine Learning using SAS Model Studio. These three analytical tools allow you to do many diverse types of segmentation, and one of the most common methods is clustering. Clustering is still among the top 10 machine learning methods used based on several surveys across the globe.

One of the best methods for learning about your customers, patrons, clients, or patients (or simply the observations in almost any data set) is to perform clustering: finding clusters whose members share similar characteristics within each cluster, while the clusters themselves have differing combinations of attributes. You can use this method to better understand your customers or to profile various data sets. This can be done in an environment where SAS and open-source software work together seamlessly on a unified platform. (While open source is not discussed in my book, stay tuned for future blog posts where I will discuss more fun and exciting things that should be of interest to you for clustering and segmentation.)

Let’s look at an example of clustering. Being able to look at your data quickly and easily is a real benefit of SAS Visual Statistics.

Initial data exploration and preparation

To demonstrate the simplicity of clustering in SAS Visual Statistics, the data set CUSTOMERS is used here and also throughout the book. I have loaded the CUSTOMERS data set into memory, and it is now listed in the active tab. I can easily explore and visualize this data by right-clicking and selecting Actions and then Explore and Visualize. This takes you to the SAS Visual Analytics page.

I have added four new compute items by taking the natural logarithm of four attributes and will use these newly transformed attributes in a clustering.
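For readers who prefer code, a rough equivalent of these computed items and of the clustering performed below might look like the following sketch (the attribute names amount1-amount4 are hypothetical, since the CUSTOMERS variables are not listed here):

/* log-transform four attributes and run k-means clustering in CAS */
data casuser.customers_log;
   set casuser.customers;
   array raw{4}  amount1-amount4;        /* hypothetical input attributes */
   array logx{4} log_amt1-log_amt4;
   do i = 1 to 4;
      if raw{i} > 0 then logx{i} = log(raw{i});   /* log is undefined for zero or negative values */
   end;
   drop i;
run;

proc kclus data=casuser.customers_log maxclusters=5;
   input log_amt1-log_amt4 / level=interval;
run;

The point-and-click steps that follow accomplish the same kind of analysis interactively.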

Performing simple clustering

Clustering in SAS Visual Statistics can be found by selecting the Objects icon on the left and scrolling down to see the SAS Visual Statistics menus as seen below. Dragging the Cluster icon onto the Report template area will allow you to use that statistic object and visualize the clusters.

Once the Cluster object is on the template, adding data items to the Data Roles is simple by checking the four computed data items.

Click the OK icon, and the report below appears immediately, showing the five clusters that were found using the four data items.

There are 105,456 total observations in the data set; however, only 89,998 were used for the analysis. Some observations were not used because the natural logarithm could not be computed for them. To see how to handle that situation easily, please pick up a copy of Segmentation Analytics with SAS Viya. Let me know if you have any questions or comments.

 

 

Clustering made simple was published on SAS Users.