March 11, 2020

In the early days and weeks of any widespread global health concern, particularly in a fast-moving outbreak like the coronavirus, there are many unknowns. Data visualization can be a good starting point to understand trends and piece data points together into a meaningful story. The ability to visualize the spread [...]

Using data visualization to track the coronavirus outbreak was published on SAS Voices by Mark Lambrecht

March 10, 2020

If you have been using SAS for long, you have probably noticed that there is generally more than one way to do anything. (For an example, see my co-author Lora Delwiche’s blog about PROC SQL.) The Little SAS Book has long covered reading and writing Microsoft Excel files with the IMPORT and EXPORT procedures, but for the Sixth Edition, we decided it was time to add two more ways: The ODS EXCEL destination makes it easy to convert procedure results into Excel files, while the XLSX LIBNAME engine allows you to access Excel files as if they were SAS data sets.

With the XLSX LIBNAME engine, you can convert an Excel file to a SAS data set (or vice versa) if you want to, but you can also access an Excel file directly without the need for a SAS data set. This engine works for files created using any version of Microsoft Excel 2007 or later in the Windows or UNIX operating environments. You must have SAS 9.4M2 or higher and SAS/ACCESS Interface to PC Files software. A nice thing about this engine is that it works with any combination of 32-bit and 64-bit systems.

The XLSX LIBNAME engine uses the first line in your file for the variable names, scans each full column to determine the variable type (character or numeric), assigns lengths to character variables, and recognizes dates and numeric values that contain commas or dollar signs. The XLSX LIBNAME engine does not offer many options, but because you are using an Excel file like a SAS data set, you can use many standard data set options. For example, you can use the RENAME= data set option to change the names of variables, and FIRSTOBS= and OBS= to select a subset of rows.
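For example, the following sketch (a hypothetical illustration that assumes the exfiles libref assigned in the next section) uses FIRSTOBS= and OBS= to print only rows 2 through 4 of a worksheet:

* Print a subset of rows using data set options (hypothetical example);
PROC PRINT DATA = exfiles.sheet1 (FIRSTOBS = 2 OBS = 4);
   TITLE 'Rows 2 Through 4 of the Worksheet';
RUN;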

Reading an Excel file as is 

Suppose you have the following Excel file containing data about magnolia trees:

With the XLSX LIBNAME engine, SAS can read the file without first converting it to a SAS data set. Here is a PROC PRINT that prints the data directly from the Excel file.

* Read an Excel spreadsheet using XLSX LIBNAME;
LIBNAME exfiles XLSX 'c:\MyExcel\Trees.xlsx';

PROC PRINT DATA = exfiles.sheet1;
   TITLE 'PROC PRINT of Excel File';
RUN;

Here are the results of the PROC PRINT. Notice that the variable names were taken from the first row in the file.

PROC PRINT of Excel File

Converting an Excel file to a SAS data set 

If you want to convert an Excel file to a SAS data set, you can do that too. Here is a DATA step that reads the Excel file. The RENAME= data set option changes the variable name MaxHeight to MaxHeightFeet. Then a new variable is computed that is equal to the height in meters.

* Import Excel into a SAS data set and compute height in meters;
DATA magnolia;
   SET exfiles.sheet1 (RENAME = (MaxHeight = MaxHeightFeet));
   MaxHeightMeters = ROUND(MaxHeightFeet * 0.3048);
RUN;

Here is the SAS data set with the renamed and new variables:
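To display it yourself, a simple PROC PRINT will do (a minimal sketch):

* Print the converted SAS data set;
PROC PRINT DATA = magnolia;
   TITLE 'PROC PRINT of Magnolia Data Set';
RUN;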


Writing to an Excel file 

It is just as easy to write to an Excel file as it is to read from it.

* Write a new sheet to the Excel file;
DATA exfiles.trees;
   SET magnolia;
RUN;
LIBNAME exfiles CLEAR;

Here is what the Excel file looks like with the new sheet. Notice that the new tab is labeled with the name of the SAS data set TREES.

The XLSX LIBNAME engine is so flexible and easy to use that we think it’s a great addition to any SAS programmer’s skill set.

To learn more about the content in The Little SAS Book, check out the free book excerpt.  To see up-and-coming titles and get exclusive discounts, make sure to subscribe to the SAS Books newsletter.

Accessing Excel files using LIBNAME XLSX was published on SAS Users.

March 9, 2020

In a previous article, I discussed the binormal model for a binary classification problem. This model assumes a set of scores that are normally distributed for each population, and the mean of the scores for the Negative population is less than the mean of scores for the Positive population. I showed how you can construct an exact ROC curve for the population.

Of course, in the real world, you do not know the distribution of the population, so you cannot obtain an exact ROC curve. However, you can estimate the ROC curve by using a random sample from the population.

This article draws a random sample from the binormal model and constructs the empirical ROC curve by using PROC LOGISTIC in SAS. The example shows an important truth that is sometimes overlooked: a sample ROC curve is a statistical estimate. Like all estimates, it is subject to sampling variation. You can visualize the variation by simulating multiple random samples from the population and overlaying the true ROC curve on the sample estimates.

Simulate data from a binormal model

The easiest way to simulate data from a binormal model is to simulate the scores. Recall the assumptions of the binormal model: all variables are continuous and there is a function that associates a score with each individual. The distribution of scores is assumed to be normal for the positive and negative populations. Thus, you can simulate the scores themselves, rather than the underlying data.

For this article, the scores for the Negative population (those who do not have a disease or condition) are distributed N(0, 1). The scores for the Positive population (those who do have the disease or condition) are distributed N(1.5, 0.75). The ROC curve for the population is shown to the right. It was computed by using the techniques from a previous article about binormal ROC curves.

The following SAS DATA step simulates a sample from this model. It samples n_N = 50 scores from the Negative population and n_P = 25 scores from the Positive population. The distributions of the scores in the sample are graphed below:

%let mu_N    = 0;       /* mean of Negative population */
%let sigma_N = 1;       /* std dev of Negative population */
%let mu_P    = 1.5;     /* mean of Positive population */
%let sigma_P = 0.75;    /* std dev of Positive population */
 
%let n_N = 50;     /* number of individuals from the Negative population */
%let n_P = 25;     /* number of individuals from the Positive population */
 
/* simulate one sample from the binormal model */
data BinormalSample;
call streaminit(12345);
y = 1;             /* positive population */
do i = 1 to &n_P;
   x = rand("Normal", &mu_P, &sigma_P); output;
end;
y = 0;             /* negative population */
do i = 1 to &n_N;
   x = rand("Normal", &mu_N, &sigma_N); output;
end;
drop i;
run;
 
title "Distribution of Scores for 'Negative' and 'Positive' Samples";
ods graphics / width=480px height=360px subpixel;
proc sgpanel data=BinormalSample noautolegend;
   styleattrs datacolors=(SteelBlue LightBrown);
   panelby y      / layout=rowlattice onepanel noheader;
   inset y        / position=topleft textattrs=(size=14) separator="=";
   histogram x    / group=y;
   rowaxis offsetmin=0;
   colaxis label="Score";
run;

Create an ROC plot for a sample

You can create an ROC curve by first creating a statistical model that classifies each observation into one of the two classes. You can then call PROC LOGISTIC in SAS to create the ROC curve, which summarizes the misclassification matrix (also called the confusion matrix) at various cutoff values for a threshold parameter. Although you can create an ROC curve for any predictive model, the following statements fit a logistic regression model. You can use the OUTROC= option on the MODEL statement to write the values of the sample ROC curve to a SAS data set. You can then merge in the population ROC curve (the PopROC data set, which contains the curve from the previous article) and overlay the two curves to see how they compare.
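If you did not save PopROC from the previous article, here is a minimal sketch (assuming only the binormal parameters defined earlier) that regenerates it. For each threshold t, the false positive rate is the probability that a Negative score exceeds t, and the true positive rate is the probability that a Positive score exceeds t:

/* sketch: regenerate the population ROC curve (PopROC) */
data PopROC;
do t = -5 to 5 by 0.01;                          /* threshold values */
   FPR = 1 - cdf("Normal", t, &mu_N, &sigma_N);  /* population false positive rate */
   TPR = 1 - cdf("Normal", t, &mu_P, &sigma_P);  /* population true positive rate */
   output;
end;
drop t;
run;

The following statements fit the logistic model, merge the curves, and plot them: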

/* use logistic regression to classify individuals */
proc logistic data=BinormalSample noprint;
   model y(event='1') = x / outroc=ROC1;
run;
 
/* merge in the population ROC curve */
data ROC2;
   set ROC1 PopROC;
run;
 
title "Population ROC Curve vs. Estimate";
ods graphics / width=480px height=480px;
proc sgplot data=ROC2 aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / lineattrs=(color=blue) name="est" legendlabel="Estimate";
   series x=FPR y=TPR / lineattrs=(color=black) name="pop" legendlabel="Population";
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   keylegend "est" "pop" / location=inside position=bottomright across=1 opaque;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows the sample ROC curve (the blue, piecewise-constant curve) and the population ROC curve (the black, smooth curve). The sample ROC curve is an estimate of the population curve. For this random sample, you can see that most estimates of the true positive rate are too large (given a value for the threshold parameter) and most estimates of the false positive rate are too low. In summary, the ROC estimate for this sample is overly optimistic about the ability of the classifier to discriminate between the positive and negative populations.

The beauty of the binormal model is that we know the true ROC curve. We can compare estimates to the truth. This can be useful when evaluating competing models: Instead of comparing sample ROC curves with each other, we can compare them to the ROC curve for the population.

Variation in the ROC estimates

If we simulate many samples from the binormal model and plot many ROC estimates, we can get a feel for the variation in the ROC estimates in random samples that have n_N = 50 and n_P = 25 observations. The following DATA step simulates B = 100 samples. The subsequent call to PROC LOGISTIC uses BY-group analysis to fit all B samples and generate the ROC curves. The ROC curves for five samples are shown below:

/* 100 samples from the binormal model */
%let NumSamples = 100;
data BinormalSim;
call streaminit(12345);
do SampleID = 1 to &NumSamples;
   y = 1;             /* sample from positive population */
   do i = 1 to &n_P;
      x = rand("Normal", &mu_P, &sigma_P); output;
   end;
   y = 0;             /* sample from negative population */
   do i = 1 to &n_N;
      x = rand("Normal", &mu_N, &sigma_N); output;
   end;
end;
drop i;
run;
 
proc logistic data=BinormalSim noprint;
   by SampleID;
   model y(event='1') = x / outroc=ROCs;
run;
 
title "5 Sample ROC Curves";
proc sgplot data=ROCs aspect=1 noautolegend;
   where SampleID <= 5;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

There is a moderate amount of variation between these curves. The brown curve looks different from the magenta curve, even though each curve results from fitting the same model to a random sample from the same population. The differences between the curves are due only to random variation in the data (scores).

You can use partial transparency to overlay all 100 ROC curves on the same graph. The following DATA step overlays the empirical estimates and the ROC curve for the population:

/* merge in the population ROC curve */
data AllROC;
   set ROCs PopROC;
run;
 
title "ROC Curve for Binormal Model";
title2 "and Empirical ROC Curves for &NumSamples Random Samples";
proc sgplot data=AllROC aspect=1 noautolegend;
   step x=_1MSPEC_ y=_SENSIT_ / group=SampleID transparency=0.9
                                lineattrs=(color=blue pattern=solid thickness=2);
   series x=FPR y=TPR / lineattrs=(color=black thickness=2);
   lineparm x=0 y=0 slope=1 / lineattrs=(color=gray);
   xaxis grid;   yaxis grid;
   label _1MSPEC_ ="False Positive Rate (FPR)" _SENSIT_ ="True Positive Rate (TPR)";
run;

The graph shows 100 sample ROC curves in the background (blue) and the population ROC curve in the foreground (black). The ROC estimates show considerable variability.

Summary

In summary, this article shows how to simulate samples from a binormal model. You can use PROC LOGISTIC to generate ROC curves for each sample. By looking at the variation in the ROC curves, you can get a sense for how these estimates can vary due to random variation in the data.

The post ROC curves for a binormal sample appeared first on The DO Loop.

March 6, 2020

While I am a woman in computer science, let me start by saying that I’m a reluctant coder. I have an eye for data but find the abstract nature of many computer languages to be very challenging and not something I typically love. My first experience with coding was in [...]

Why are fewer women getting computer science degrees? was published on SAS Voices by Mary Osborne

March 5, 2020

Have you heard that SAS offers a collection of new, high-performance CAS procedures that are compatible with a multi-threaded approach? The free e-book Exploring SAS® Viya®: Data Mining and Machine Learning is a great resource to learn more about these procedures and the features of SAS® Visual Data Mining and Machine Learning. Download it today and keep reading for an excerpt from this free e-book!

In SAS Studio, you can access tasks that help automate your programming so that you do not have to manually write your code. However, there are three options for manually writing your programs in SAS® Viya®:

  1. SAS Studio provides a SAS programming environment for developing and submitting programs to the server.
  2. Batch submission is also still an option.
  3. Open-source languages such as Python, Lua, and Java can submit code to the CAS server.

In this blog post, you will learn the syntax for two of the new, advanced data mining and machine learning procedures: PROC TEXTMINE and PROC TMSCORE.

Overview

The TEXTMINE and TMSCORE procedures integrate natural language processing and statistical analysis to provide essential functionality for text mining. The procedures support essential natural language processing (NLP) features such as tokenizing, stemming, part-of-speech tagging, entity recognition, customized stop lists, and so on. They also support dimensionality reduction and topic discovery through Singular Value Decomposition.

In this example, you will learn about some of the essential functionalities of PROC TEXTMINE and PROC TMSCORE by using a text data set containing 1,830 Amazon reviews of electronic gaming systems. The data set is named Amazon. You can find similar data sets of Amazon reviews at http://jmcauley.ucsd.edu/data/amazon/.

PROC TEXTMINE

The Amazon data set has already been loaded into CAS. The review content is stored in the variable ReviewBody, and we generate a unique review ID for each review. In the PROC call shown in Program 1, we ask PROC TEXTMINE to perform three tasks:

  1. parse the documents in the input table and generate the term-by-document matrix
  2. perform dimensionality reduction via Singular Value Decomposition
  3. perform topic discovery based on Singular Value Decomposition results

Program 1: PROC TEXTMINE

data mycaslib.amazon;
    set mylib.amazon;
run;

data mycaslib.engstop;
    set mylib.engstop;
run;

proc textmine data=mycaslib.amazon;
    doc_id id;
    var reviewbody;

 /*(1)*/  parse reducef=2 entities=std stoplist=mycaslib.engstop 
          outterms=mycaslib.terms outparent=mycaslib.parent
          outconfig=mycaslib.config;

 /*(2)*/  svd k=10 svdu=mycaslib.svdu outdocpro=mycaslib.docpro
          outtopics=mycaslib.topics;

run;

(1) The first task (parsing) is specified in the PARSE statement. The REDUCEF= option specifies the minimum number of times a term must appear in the text to be included in the analysis. The STOPLIST= option specifies a list of terms to exclude from the analysis, such as “the”, “this”, and “that”. The OUTPARENT= option names the output table that stores the term-by-document matrix, and the OUTTERMS= option names the output table that stores information about the terms that are included in the term-by-document matrix. The OUTCONFIG= option names the output table that stores configuration information for future scoring.

(2) Tasks 2 and 3 (dimensionality reduction and topic discovery) are specified in the SVD statement. The K= option specifies the desired number of dimensions and the number of topics. The SVDU= option names the output table that stores the U matrix from the SVD calculations, which is needed for future scoring. The OUTDOCPRO= option names the output table that stores the new matrix with reduced dimensions, and the OUTTOPICS= option names the output table that stores the topics that were discovered.

Click the Run shortcut button or press F3 to run Program 1. The terms table shown in Output 1 stores the tagging, stemming, and entity recognition results. It also stores the number of times each term appears in the text data.

Output 1: Results from Program 1
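In addition to the terms table shown in Output 1, you can inspect the other result tables directly. For example, this hypothetical snippet (not part of the e-book excerpt) prints the table of discovered topics:

/* hypothetical: inspect the discovered topics */
proc print data=mycaslib.topics;
run;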

PROC TMSCORE

PROC TEXTMINE is used with large training data sets. When you have new documents coming in, you do not need to re-run all the parsing and SVD computations with PROC TEXTMINE. Instead, you can use PROC TMSCORE to score new text data. The scoring procedure parses the new document(s) and projects the text data into the same dimensions using the SVD weights derived from the original training data.

In order to use PROC TMSCORE to generate results consistent with PROC TEXTMINE, you need to provide the following tables generated by PROC TEXTMINE:

  • SVDU table – provides the required information for projection into the same dimensions.
  • Config table – provides parameter values for parsing.
  • Terms table – provides the terms that should be included in the analysis.

Program 2 shows an example of PROC TMSCORE. It uses the same input data layout as the PROC TEXTMINE code, so it generates the same docpro and parent output tables, as shown in Output 2.

Program 2: PROC TMSCORE

proc tmscore data=mycaslib.amazon svdu=mycaslib.svdu
        config=mycaslib.config terms=mycaslib.terms
        svddocpro=mycaslib.score_docpro outparent=mycaslib.score_parent;
    var reviewbody;
    doc_id id;
run;

 

Output 2: Results from Program 2

To learn more about advanced data mining and machine learning procedures available in SAS Viya, including PROC FACTMAC, PROC TEXTMINE, and PROC NETWORK, you can download the free e-book, Exploring SAS® Viya®: Data Mining and Machine Learning. Exploring SAS® Viya® is a series of e-books that are based on content from SAS® Viya® Enablement, a free course available from SAS Education. You can follow along with examples in real time by watching the videos.

 

Learn about new data mining and machine learning procedures in SAS Viya was published on SAS Users.

March 5, 2020

Fifty years ago, as the women’s liberation movement was gaining momentum in the U.S., my maternal great-grandmother, Pearl, worked in a factory sewing American flags while volunteering with the Girl Scouts and caring for her grandchildren. My paternal grandmother, Greta, also worked in local factories while caring for her family. [...]

50 years of strong, intelligent women was published on SAS Voices by Ashley Binder

March 4, 2020

Your brand is customer journey obsessed, and every interaction with your company provides a potential opportunity to make an intelligent decision, deepen engagement and meet conversion goals. The hype of martech innovation in 2020 is continuing to elevate, and every technology vendor is claiming the following statement: "Bolster the customer [...]

SAS Customer Intelligence 360: Marketing AI vision was published on Customer Intelligence Blog.

March 4, 2020

I'm a planner, and my plan was to have a completely natural birth and to breast feed only. This plan was written in stone (or so I thought). Long story short – my birthing plan flew out the window. One of my complications was very high blood pressure, which refused [...]

Breast milk, babies, NCSU students and SAS/OR was published on SAS Voices by Natalia Summerville

March 4, 2020

Suppose that a data set contains a set of parameter values. For each row of parameters, you need to perform some computation. A recent discussion on the SAS Support Communities mentions an important point: if there are duplicate rows in the data, a program might repeat the same computation several times. This is inefficient. An efficient alternative is to perform the computations once and store them in a list. This article describes the issue and implements a solution by using lists in SAS/IML software. Lists were introduced in SAS/IML 14.2 (SAS 9.4M4). A natural syntax for creating lists was introduced in SAS/IML 14.3 (SAS 9.4M5).

This article assumes that the matrices are not known until you read a data set that contains the parameters. The program must be able to handle computing, storing, and accessing an arbitrary number of matrices: 4, 10, or even 100. Obviously, if you know in advance the number of matrices that you need, then you can just create those matrices and give them names such as X1, X2, X3, and X4.
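If you have never seen the natural list syntax, here is a minimal sketch (the values are arbitrary and unrelated to this article's example) that creates an ordinary list and a named list:

proc iml;
/* sketch: create lists by using the natural syntax (SAS/IML 14.3) */
L1 = [1, "two", {3 4, 5 6}];                   /* items can be scalars, strings, or matrices */
L2 = [#"Name" = "magnolia", #"X" = {1 2 3}];   /* named items are accessed by name */
quit;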

Example: Using powers of a matrix

To give a concrete example, suppose the following data set contains multiple values for a parameter, p:

data Powers;
input p @@;
datalines;
9 1 5 2 2 5 9 1 2 5 2 1 
;

For each value of p, you need to compute X = A^p = A*A*...*A for a matrix, A, and then use X in a subsequent computation. Here's how you might perform that computation in a straightforward way in SAS/IML without using lists:

proc iml;
/* use a small matrix, but in a real application the matrix might be large. */
A = {2 1, 1 3};
 
use Powers;                    /* read the parameters from the data set */
read all var "p";
close; 
 
/* The naive method: Compute the power A**p[i] each time you need it. */
do i = 1 to nrow(p);
  power = p[i];
  X = A**power;                /* compute each power as needed */
  /* compute with X ... */
end;

The program computes A^p for each value of p in the data set. Because the data set contains duplicate values of p, the program computes A^2 four times, A^5 three times, and A^9 two times. If A is a large matrix, it is more efficient to compute and store these matrices and reuse them as needed.

Store all possible matrices in a list

If the values of p are uniformly likely, the easiest way to store matrix powers in a list is to find the largest value of p (call it m) and then create a list that has m items. The first item in the list is A, the second item is A^2, and so forth up to A^m. This is implemented by using the following statements:

/* Method 1: Find upper bound of power, m. 
             Compute all matrices A^p for 1 <= p <= m. 
             Store these matrices in a list so you don't need to recompute. */
m = max(p);                    /* largest power in data */
L = ListCreate( m );
do i = 1 to m;
  L$i = A**i;                  /* compute all powers up to maximum; store A^i as i_th item */
end;
 
do i = 1 to nrow(p);           /* extract and use the matrices, as needed */
  power = p[i];
  X = L$power;                 /* or   X = ListGetItem(L, power); */
  /* compute with X ... */
end;

For these data, the list contains nine elements because 9 is the largest value in the data. The following statements show how to display a compact version of the list, which shows the matrices A^p for p=1,2,...,9.

package load ListUtil;         /* load the Struct and ListPrint subroutines */
run struct(L);
Structure of a nine-item list. The p_th item stores A^p.

The table shows the first few elements of every item in the list. In this case, the p_th item stores the 2 x 2 matrix A^p. The elements of the matrix are displayed in row-major order.

Store only the necessary matrices in a list

Notice that the data only contain four values of p: 1, 2, 5, and 9. The method in the previous section computes and stores unnecessary matrices such as A^3, A^4, and A^8. This is inefficient. Depending on the maximum value of p (such as max(p)=100), this method might be wasteful in terms of memory and computational resources.

A better solution is to compute and store only the matrices that are needed for the computation. For this example, you only need to compute and store four matrices. You can use the UNIQUE function to find the unique values of p (in sorted order). The following statements compute and store A^p for only the unique values of p in the data set:

/* Method 2: Compute matrices A^p for the unique values of p in the data.
             Store only these matrices in a named list. */
u = unique(p);                 /* find the unique values of p */
nu = ncol(u);                  /* how many unique values? */
L = ListCreate( nu );
do i = 1 to nu;
  power = u[i];                /* compute only the powers in the data; store in a named list */
  L$i =  [#"Power"  = power,   /* optional: store as a named list */
          #"Matrix" = A**power]; 
end;
 
run struct(L);
Structure of a four-item list, where each item is a named list.

For this example, each item in the list L is itself a list. I used a "named list" with items named "Power" and "Matrix" so that the items in L have context. The STRUCT subroutine shows that L contains only four items.

How can you access the matrix for, say, A^5? There are several ways, but the easiest is to use the u matrix that contains the unique values of p in sorted order. You can use the LOC function to look up the index of the item that you want. For example, A^5 is stored as the third item in the list because 5 is the third element of u. Because each item in L is itself a list, you can extract the "Matrix" item as follows:

do i = 1 to nrow(p);
  power = p[i];
  k = loc(u = power);          /* the k_th item is A^p */
  X = L$k$"Matrix";            /* extract it */
  /* compute with X ... */
  *print i power X;
end;

In summary, this article describes how to use a list to store matrices so that you can compute them once and reuse them many times. The article assumes that the number of matrices is not known in advance but is obtained by reading a data set at run time. The example stores powers of a matrix, but this technique is generally applicable.

If you have never used lists in SAS/IML, see the article "Create lists by using a natural syntax in SAS/IML" or watch this video that describes the list syntax.

The post Store pre-computed matrices in a list appeared first on The DO Loop.