July 21, 2020
 

A client recently asked if there's a programmatic way to reduce memory requirements of a CAS table. In this post, you'll learn how to accomplish that when your SASWORK data set has lengthy character data.

What is SASWORK?

The SASWORK library is the temporary library that SAS automatically defines at the start of each SAS session or job. It stores the temporary SAS files that you create, such as data sets, catalogs, and formats, as well as files created internally by SAS. To access a file in the SASWORK library, specify a one-level name for the file, or use a two-level name with the WORK libref.
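
As a brief illustration (the table name scores is hypothetical), a one-level name and a two-level name with the WORK libref refer to the same temporary data set:

/* one-level name: SAS resolves it to the WORK library */
data scores;
   x = 1;
run;

/* two-level name: the same temporary data set, referenced through the WORK libref */
proc print data=work.scores;
run;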

A programmatic CASLIB to SASWORK

It is common practice for SAS programmers to read a source table and store it in SASWORK. When the SASWORK data set has long character data, you can significantly reduce memory requirements by creating a path-based CASLIB to SASWORK and loading the data with PROC CASUTIL, using the CASDATA= option together with VARCHARCONVERSION=16 in the IMPORTOPTIONS= option. The VARCHARCONVERSION=16 option automatically converts every character variable with a length of 16 or greater to the VARCHAR data type.

Code in Figure 1 below creates a macro variable called WORKPATH which we will use in our CASLIB statement:

/* The macro variable WORKPATH contains the path to SASWORK */
%let workpath = %sysfunc(quote(%sysfunc(pathname(work))));
%put &workpath;

(Figure 1: Creating the Macro Variable WORKPATH)

Figure 2 below is the resulting SAS log:


(Figure 2: SAS log of creating the Macro Variable WORKPATH)

Code in Figure 3 below creates our SASWORK data set that we will lift into CAS:

data saswork_table_with_char300;
   length a $ 300 b $ 15 c $ 16;
   a='a300'; b='b15' ; c='c16' ; 
   output;
   a='a300300'; b='b151515'; c='c161616'; 
   output;
   c='c161616161616161'; 
   b='b15151515151515';
   a="a300qzwsxedcrfvtgbyhnujmiklopqazwsxedcrfvtgbyhnujmikolp1234567890123456789012345678901234567890"; 
   output;
run;
 
proc contents data=saswork_table_with_char300;
title "Contents of WORK.SASWORK_TABLE_WITH_CHAR300";
run;

(Figure 3: Code to create a SASWORK data set using a one-level name, i.e. saswork_table_with_char300)

Notice in Figure 4 below the lengths of the character variables a, b, and c:


(Figure 4: Results from PROC CONTENTS)

Code in Figure 5 below creates a path-based CASLIB to SASWORK by using our macro variable WORKPATH:

proc cas;
   file log;
   table.dropCaslib /
      caslib='sas7bdat' quiet=true;
run;
   table.addCaslib /
      dataSource={srcType="path"}
      name="sas7bdat"
      path=&workpath;
run;
quit;

(Figure 5: Path-based CASLIB to SASWORK)

Figure 6 below is the resulting SAS log:


(Figure 6: SAS log of path-based CASLIB to SASWORK)

Code in Figure 7 below lifts the SASWORK data set into CAS and ensures the CAS table will contain VARCHAR data types for any character data with a length of 16 or greater:

proc casutil;
 load casdata="saswork_table_with_char300.sas7bdat" 
 casout="cas_table_with_varchar" 
 outcaslib="casuser"
 importoptions=(filetype="basesas", dtm="auto", debug="dmsglvl1", 
                varcharconversion=16) ;
run;
quit;

(Figure 7: Code that lifts the SASWORK data set into CAS)

Figure 8 below is the resulting SAS log:


(Figure 8: SAS log of code to lift the SASWORK data set into CAS)

Code in Figure 9 below displays the characteristics of our CAS table, cas_table_with_varchar. Notice in Figure 10 that variables a and c have a data type of VARCHAR.

proc cas;
   sessionProp.setSessOpt /
      caslib="casuser";
run;
   table.columnInfo / table="cas_table_with_varchar";
quit;

(Figure 9: SAS code to display information of a CAS table)

Figure 10 below reveals CAS table characteristics:


(Figure 10: CAS table characteristics)

Conclusion: Conserve memory with smaller CAS tables

By leveraging the coding techniques in this post, we can lift into CAS any data set previously stored in SASWORK. In addition, we convert all character variables with a length of 16 or greater to the VARCHAR data type, which can significantly reduce the size of our CAS tables.

LEARN MORE | SAS® Cloud Analytic Services 3.5: User’s Guide

How to create a path-based CASLIB to SASWORK for speedier analysis with SAS® Viya® was published on SAS Users.

July 21, 2020
 

When Liverpool clinched the Premier League title with a win on June 25, my household was giddy. We are a family of sports fans, and we especially love baseball (Go, Red Sox!) and soccer. I’m a relative newbie to Liverpool, having cheered for them for the last five years. Having [...]

Premier League champs offer lessons on and off the field was published on SAS Voices by Jenn Chase

July 20, 2020
 

I think we can all agree that lifelong learning is the future for all of us. We need to learn and develop all the time simply to keep abreast. The world is changing fast, and we must change with it. Investing in analytics talent is an investment in your future.

Create a corporate learning culture

Organizations may embrace the concept of data in their work cultures, yet fail to commit to developing the analytics skills their teams need to harness the data. Organizations often don't know what skills their teams have, which skills they lack, or even where they need their employees to go. Corporate training brings new avenues for career development and professional training, and that can have a huge impact on retention.

Companies that understand the importance of continued skill development for their employees will set out a vision and develop a corporate learning culture. Organizations need leaders who can drive a cultural change toward continuous learning. Hiring curious people and rewarding that curiosity develops the right mindset, and revealing employees' knowledge gaps is a good trigger to get the learning started.

When an organization wants to close the talent gap for analytics and AI, SAS can offer support along the whole journey.

Company objectives and goals will determine how the skills development program is shaped. Once the roles and desired skill levels are determined, SAS will provide a Learning Needs Assessment to uncover the skills gaps.

Based on the findings from the assessment, a learning and development program is proposed for the different target audiences, designed to work toward company goals.

With readily available digital assets, the first steps to a learning platform can be realized quickly. By dynamically developing more assets and creating domain- and company-specific materials, continuous learning is encouraged and facilitated.

Power of online learning

Today most of us are forced to work from home due to COVID-19. We know the future of work will look different; we are getting used to the digital channel for communicating and interacting. Though the digitization of learning has been going on for many years, learning content is now moving to the cloud, becoming accessible across multiple devices and teaching environments and often being generated, shared, and continually updated.

Millennials feel most comfortable with this digitization, but the wider workforce will ultimately benefit from access to digital learning assets. Integrated cloud-based platforms enable more than just new computer programs or smartphone apps. Smart organizations are now expanding their use of cloud-based learning to run personalized online courses, small tailored live web sessions, instructional videos, e-coaching, communities, virtual classrooms, and simulation games. SAS is responding to this need by offering several free options for learning, including e-Learning, online tutorials, lab time, SAS Academy for Data Science, and a SAS Learning Subscription that offers more than 100 e-Learning courses. There is something for everyone!

Finding analytics talent within your workforce can be challenging. It can be hard to assess the skills and identify the talent you need for specific solutions. With an unparalleled depth of understanding in the industry, SAS can help identify, cultivate and grow analytics talent in your organization. In addition, we’ll help you to build a talent pipeline in partnership with local academic institutions, if needed.

Do you want to start today with learning essential skills? Take a look at how SAS can help your organization to get ahead, through our learning solutions customized to help you train your team. Or transform it.

Shaping the future of lifelong learning was published on SAS Users.

July 20, 2020
 

I recently showed how to compute within-group multivariate statistics by using the SAS/IML language. However, a principle of good software design is to encapsulate functionality and write self-contained functions that compute and return the results. What is the best way to return multiple statistics from a SAS/IML module?

A convenient way to return several statistics is to use lists. Lists were introduced in SAS/IML 14.2 and extended in SAS/IML 14.3. I have previously written about how to use lists to pass arguments into a SAS/IML module. For parallel processing in the iml action in SAS Viya, lists are an essential data structure.

This article shows two things. First, it shows how to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data. It then shows two ways to return those statistics by using SAS/IML lists. Depending on your application, one way might be more convenient to use than the other.

Returning matrices versus lists

Suppose you want to write a SAS/IML function that computes several statistics for each group in the data. If you are analyzing one variable (univariate data), the statistics are scalar. Therefore, the module can return a matrix in which the rows of the matrix represent the groups in the data and the columns represent the statistics, such as mean, median, variance, skewness, and so forth. This form of the output matches the output from PROC MEANS when you use a CLASS statement to specify the group variable.
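
As a minimal sketch of that univariate layout (the module name UnivGroupStats and the choice of statistics are illustrative, not from the original article), such a function stacks one row of statistics per group:

proc iml;
/* sketch: one row per group; columns are n, mean, and variance */
start UnivGroupStats(y, Group);
   u = unique(Group);
   G = ncol(u);
   stats = j(G, 3, .);
   do k = 1 to G;
      x = y[loc(Group = u[k]), ];               /* obs in k_th group */
      stats[k, ] = nrow(x) || mean(x) || var(x);
   end;
   return( stats );
finish;

use Sashelp.Iris;
read all var {'SepalWidth'} into y;
read all var {'Species'};
close;

s = UnivGroupStats(y, Species);
lbl = unique(Species);
print s[r=lbl c={'n' 'mean' 'var'}];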

For multivariate data, the situation becomes more complex. For example, suppose that for each group you want to compute the number of observations, the mean vector, and the covariance matrix. Although it is possible to pack all this information into a matrix, it requires a lot of bookkeeping to keep track of which row contains which statistics for which group. A simpler method is to pack the multivariate statistics into a list and return the list.

Multivariate statistics for the iris data

Fisher's iris data is often used to demonstrate ideas in multivariate statistics. The following call to PROC SGPLOT creates a scatter plot of two variables (PetalWidth and SepalWidth) and uses the GROUP= option to color the observations by the levels of the Species variable. The plot also overlays 95% prediction ellipses for each group.

title "Iris Data, Colored by Species";
title2 "95% Prediction Ellipse, assuming MVN";
proc sgplot data=Sashelp.Iris;
   ellipse x=PetalWidth y=SepalWidth / group=Species outline fill transparency=0.6;
   scatter x=PetalWidth y=SepalWidth / group=Species transparency=0.2
                       markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid; yaxis grid;
run;

Although PROC SGPLOT computes these ellipses automatically, you might be interested in the multivariate statistics behind the ellipses. The ellipses depend on the number of observations, the mean vector, and the covariance matrix for each group. The following SAS/IML module computes these quantities for each group. The module assembles the statistics into three matrices:

  • The ns vector contains the number of observations. The k_th row contains the number of observations in the k_th group.
  • The means matrix contains the group means. The k_th row contains the mean vector for the k_th group.
  • The covs matrix contains the group covariance matrices. The k_th row contains the covariance matrix for the k_th group. You can use the ROWVEC function to "flatten" a matrix into a row vector. When you want to reconstruct the matrix, you can use the SHAPE function, as discussed in the article "Create an array of matrices in SAS."

These three items are packed into a list and returned by the module.

proc iml;
/* GIVEN Y and a classification variable Group that indicates group membership,
   return a list that contains:
   1. Number of observations in each group
   2. ML estimates of within-group mean
   3. ML estimate of within-group covariance 
   The k_th row of each matrix contains the statistics for the k_th group.
*/
start MLEstMVN(Y, Group);
   d = ncol(Y);
   u = unique(Group);
   G = ncol(u);
   ns    = J(G,   1, 0);           /* k_th row is number of obs in k_th group */
   means = J(G,   d, 0);           /* k_th row is mean of k_th group */
   covs  = J(G, d*d, 0) ;          /* k_th row is cov of k_th group */
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in this group */
      C = (n-1)/n * cov(X);        /* ML estimate of COV does not use the Bessel correction */
      ns[k]      = n;             
      means[k, ] = mean(X);        /* ML estimate of mean */
      covs[k, ]  = rowvec( C );    /* store COV in k_th row */
   end;
   outList = [#'n'=ns, #'mean'=means, #'cov'=Covs, #'group'=u]; /* pack into list */
   return (outList);
finish;
 
/* Example: read the iris data  */
varNames = {'PetalWidth' 'SepalWidth'};
use Sashelp.Iris;
read all var varNames into X;
read all var {'Species'};
close;
 
L = MLEstMVN(X, Species);            /* call function; return named list of results */
ns = L$'n';                          /* extract obs numbers          */
means = L$'mean';                    /* extract mean vectors         */
Covs = L$'cov';                      /* extract covariance matrix    */
lbl = L$'group';
print ns[r=lbl c='n'], 
      means[r=lbl c={'m1' 'm2'}], 
      covs[r=lbl c={C11 C12 C21 C22}];

The multivariate statistics are shown. The k_th row of each matrix contains a statistic for the k_th group. The number of observations and the mean vectors are easy to understand and interpret. The covariance matrix for each group has been flattened into a row vector. You can use the SHAPE function to rearrange the row into a matrix. For example, the covariance matrix for the first group ('Setosa') can be constructed as Setosa_Cov = shape(covs[1,], 2).
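
For example, assuming the covs matrix and the varNames vector from the program above are still in scope, you can rebuild and print the covariance matrix of the first group (the name Setosa_Cov is illustrative):

Setosa_Cov = shape(covs[1, ], 2);   /* reshape the 1 x 4 row into a 2 x 2 covariance matrix */
print Setosa_Cov[r=varNames c=varNames];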

An alternative: A list of lists

It can be convenient to return a list of lists that contains the statistics. Suppose you want to compute m statistics for each of G groups. The function in the previous section returned a list that contains m items, each of which has G rows. (I also included an additional list item, the group levels, but that is optional.) Alternatively, you could return a list that has G sublists, where each sublist contains the m statistics for one group. This is shown in the following function, which returns the same information but organizes it in a different way.

/* Alternative: Return a list of G sublists. Each sublist has only 
   the statistics relevant to the corresponding group.    */
start MLEst2(Y, Group);
   u = unique(Group);
   G = ncol(u);
   outList = ListCreate(G);
 
   do k = 1 to G;
      X = Y[loc(Group=u[k]), ];    /* obs in k_th group */
      n = nrow(X);                 /* number of obs in k_th group */
      SL = [#'n'=n, 
            #'mean'=mean(X),        /* ML estimate of mean */
            #'cov'=(n-1)/n*cov(X)]; /* ML estimate of COV; no need to flatten */
      call ListSetItem(outList, k, SL);   /* set sublist as the k_th item */
   end;
   call ListSetName(outList, 1:G, u);     /* give a name to each sublist */
   return (outList);
finish;
 
L2 = MLEst2(X, Species);
package load ListUtil;  /* load a package to print lists */
call Struct(L2);        /* or: call ListPrint(L2);       */

The output from the STRUCT call gives a hierarchical overview of the L2 list. The leftmost column indicates the hierarchy of the lists. The list has G=3 items, which are themselves lists. Each sublist has a name (the group level) and m=3 items: the number of observations ('n'), the mean vector ('mean'), and the within-group MLE covariance matrix ('cov'). In this output, notice that there is no need to flatten the covariance matrix into a row vector. For some applications, this list format might be easier to work with. If you want, for example, the mean vector for the 'Setosa' group, you can access it by using the notation m = L2$'Setosa'$'mean'. Or you can extract the 'Setosa' list and access the statistics as follows:

SetosaList = L2$'Setosa';    /* get the list of statistics for the 'Setosa' group */
m = SetosaList$'mean';       /* get the mean vector */
C = SetosaList$'cov';        /* get the MLE covariance matrix */
print m[c=varNames], C[c=varNames r=varNames];

Summary

This article shows how to do two things:

  • How to use the UNIQUE-LOC technique to compute multivariate statistics for each group in the data.
  • How to return those statistics by using SAS/IML lists. Two ways to organize the output are presented. The simplest method is to stack the statistics into matrices where the k_th row represents the k_th group. You can then return a list of the matrices. The second method is to return a list of sublists, where the k_th sublist contains the statistics for the k_th group. The first method requires that you "flatten" matrices into row vectors, whereas the second method does not.

The post Compute within-group multivariate statistics and store them in a list appeared first on The DO Loop.

July 16, 2020
 

Ever heard of a Turducken? It's a chicken stuffed inside a duck that's stuffed inside a turkey along with layers of stuffing (which I just learned is referred to as a three-bird roast outside the US and Canada; there's also an English variant known as a gooducken, where the turkey [...]

Utilities, forget the Duck Curve, and get ready for the Turducken Curve was published on SAS Voices by David Pope

July 15, 2020
 

Snack wrappers and noisy parents: Being a virtual intern during COVID-19. I was supposed to be behind a desk downing M&M’s and enjoying the SAS campus and all its glory. But instead, I’m working from home as an external communications intern. Working from home has been a blessing filled with challenges [...]

Snack wrappers and noisy parents: Being a virtual intern during covid-19 was published on SAS Voices by Falesha Brodie

July 15, 2020
 

The multivariate normal distribution is used frequently in multivariate statistics and machine learning. In many applications, you need to evaluate the log-likelihood function in order to compare how well different models fit the data. The log-likelihood for a vector x is the natural logarithm of the multivariate normal (MVN) density function evaluated at x. A probability density function is usually abbreviated as PDF, so the log-density function is also called a log-PDF. This article discusses how to efficiently evaluate the log-likelihood function and the log-PDF. Examples are provided by using the SAS/IML matrix language.

The multivariate normal PDF

A previous article provides examples of using the LOGPDF function in SAS for univariate distributions. Multivariate distributions are more complicated and are usually written by using matrix-vector notation. The multivariate normal distribution in dimension d has two parameters: A d-dimensional mean vector μ and a d x d covariance matrix Σ. The MVN PDF evaluated at a d-dimensional vector x is
\(f(\mathbf{x})= \frac{1}{\sqrt { (2\pi)^d|\boldsymbol \Sigma| } } \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol\mu)^{\rm T} \boldsymbol\Sigma^{-1} ({\mathbf x}-\boldsymbol\mu)\right) \)
where |Σ| is the determinant of Σ. I have previously shown how to evaluate the MVN density in the SAS/IML language, and I noted that the argument to the EXP function involves the expression \({\rm MD}(\mathbf{x}; \boldsymbol\mu, \boldsymbol\Sigma)^2 = (\mathbf{x}-\boldsymbol\mu)^{\rm T} \boldsymbol\Sigma^{-1} (\mathbf{x}-\boldsymbol\mu)\), where MD is the Mahalanobis distance between the point x and the mean vector μ.
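
To make the connection concrete, here is a small sketch (the values of mu, Sigma, and x are made up) that verifies the MAHALANOBIS function against the matrix formula:

proc iml;
mu    = {0 0};
Sigma = {4 1,
         1 9};
x     = {1 2};
MD1 = mahalanobis(x, mu, Sigma);     /* built-in Mahalanobis distance */
v   = x - mu;
MD2 = sqrt( v * inv(Sigma) * v` );   /* sqrt( (x-mu) * inv(Sigma) * (x-mu)` ) */
print MD1 MD2;                       /* the two values agree */
quit;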

Evaluate the MVN log-likelihood function

When you take the natural logarithm of the MVN PDF, the EXP function goes away and the expression becomes the sum of three terms:
\(\log(f(\mathbf{x})) = -\frac{1}{2} [ d \log(2\pi) + \log(|\boldsymbol \Sigma|) + {\rm MD}(\mathbf{x}; \boldsymbol\mu, \boldsymbol\Sigma)^2 ]\)
The first term in the brackets is easy to evaluate, but the second and third terms appear more daunting. Fortunately, the SAS/IML language provides two functions that simplify the evaluation:

  • The LOGABSDET function computes the logarithm of the absolute value of the determinant of a matrix. For a full-rank covariance matrix the determinant is always positive, so the SAS/IML function LogAbsDet(C)[1] returns the log-determinant of a covariance matrix, C.
  • The MAHALANOBIS function in SAS/IML evaluates the Mahalanobis distance. The function is vectorized, which means that you can pass in a matrix that has d columns, and the MAHALANOBIS function will return the distance for each row of the matrix.

Some researchers use -2*log(f(x)) instead of log(f(x)) as a measure of likelihood. You can see why: The -2 cancels with the -1/2 in the formula and makes the values positive instead of negative.

Log likelihood versus log-PDF

I use the terms log-likelihood function and log-PDF function interchangeably, but there is a subtle distinction. The log-PDF is a function of x when the parameters are specified (fixed). The log-likelihood is a function of the parameters when the data are specified. The SAS/IML function in the next section can be used for either purpose. (Note: Some references use the term "log likelihood" to refer only to the sum of the log-PDF scores evaluated at each observation in the sample.)

Example: Compare the log likelihood values for different parameter values

The log-likelihood function has many applications, but one is to determine whether one model fits the data better than another model. The log likelihood depends on the mean vector μ and the covariance matrix, Σ, which are the parameters for the MVN distribution.

Suppose you have some data that you think are approximately multivariate normal. You can use the log-likelihood function to evaluate whether the model MVN(μ₁, Σ₁) fits the data better than an alternative model MVN(μ₂, Σ₂). For example, the Fisher iris data for the SepalLength and SepalWidth variables appear to be approximately bivariate normal and positively correlated, as shown in the following graph:

title "Iris Data and 95% Prediction Ellipse";
title2 "Assuming Multivariate Normality";
proc sgplot data=Sashelp.Iris noautolegend;
   where species="Setosa";
   scatter x=SepalLength y=SepalWidth / jitter;
   ellipse x=SepalLength y=SepalWidth;
run;

The following SAS/IML function defines a function (LogPdfMVN) that evaluates the log-PDF at every observation of a data matrix, given the MVN parameters (or estimates for the parameters). To test the function, the program creates a data matrix from the SepalLength and SepalWidth variables for the observations for which Species="Setosa". The program uses the MEAN and COV functions to compute the maximum likelihood estimates for the data, then calls the LogPdfMVN function to evaluate the log-PDF at each observation:

proc iml;
/* This function returns the log-PDF for a MVN(mu, Sigma) density at each row of X.
   The output is a vector with the same number of rows as X. */
start LogPdfMVN(X, mu, Sigma);
   d = ncol(X);
   log2pi = log( 2*constant('pi') );
   logdet = logabsdet(Sigma)[1];             /* sign of det(Sigma) is '+' */
   MDsq = mahalanobis(X, mu, Sigma)##2;      /* (x-mu)`*inv(Sigma)*(x-mu) */
   Y = -0.5 *( MDsq + d*log2pi + logdet );   /* log-PDF for each obs. Sum it for LL */
   return( Y );
finish;
 
/* read the iris data for the Setosa species */
use Sashelp.Iris where(species='Setosa');
read all var {'SepalLength' 'SepalWidth'} into X;
close;
 
n = nrow(X);           /* assume no missing values */
m = mean(X);           /* maximum likelihood estimate of mu */
S = (n-1)/n * cov(X);  /* maximum likelihood estimate of Sigma */
/* evaluate the log likelihood for each observation */
LL = LogPdfMVN(X, m, S);

Notice that you can find the maximum likelihood estimates (m and S) by using a direct computation. For MVN models, you do not need to run a numerical optimization, which is one reason why MVN models are so popular.

The LogPdfMVN function returns a vector that has the same number of rows as the data matrix. The value of the i_th element is the log-PDF of the i_th observation, given the parameters. Because the parameters for the LogPdfMVN function are the maximum likelihood estimates, the total log likelihood (the sum of the LL vector) should be as large as possible for this data. In other words, if we choose different values for μ and Σ, the total log likelihood will be less. Let's see if that is true for this example. Let's change the mean vector and use a covariance matrix that incorrectly postulates that the SepalLength and SepalWidth variables are negatively correlated. The following statements compute the log likelihood for the alternative model:

/* What if we use "wrong" parameters? */
m2 = {45 30};
S2 = {12 -10,  -10 14};          /* this covariance matrix indicates negative correlation */
LL_Wrong = LogPdfMVN(X, m2, S2); /* LL for each obs of the alternative model */
 
/* The total log likelihood is sum(LL) over all obs */
TotalLL = sum(LL);
TotalLL_Wrong = sum(LL_Wrong);
print TotalLL TotalLL_Wrong;

As expected, the total log likelihood is larger for the first model than for the second model. The interpretation is that the first model fits the data better than the second model.

Although the total log likelihood (the sum) is often used to choose the better model, the log-PDF values of the individual observations are also important. The individual log-PDF values identify which observations are unlikely to come from a distribution with the given parameters.

Visualize the log-likelihood for each model

The easiest way to demonstrate the difference between the "good" and "bad" model parameters is to draw the bivariate scatter plot of the data and color each observation by the log-PDF at that position.
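
One way to produce such a plot (a sketch; the output data set name LLOut is illustrative) is to write the coordinates and the LL vector to a SAS data set and let the COLORRESPONSE= option in PROC SGPLOT color the markers:

/* continue the PROC IML session from above */
Out = X || LL;
create LLOut from Out[colname={'SepalLength' 'SepalWidth' 'LL'}];
append from Out;
close LLOut;
quit;

proc sgplot data=LLOut;
   scatter x=SepalLength y=SepalWidth / colorresponse=LL
           markerattrs=(symbol=CircleFilled size=10);
run;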

The plot for the first model (which fits the data well) is shown below. The observations are colored by the log-PDF value (the LL vector) for each observation. Most observations are blue or blue-green because those colors indicate high values of the log-PDF.

The plot for the second model (which intentionally misspecifies the parameters) is shown below. The observations near (45, 30) are blue or blue-green because that is the location of the specified mean parameter. A prediction ellipse for the specified model has a semimajor axis that slopes from the upper left to the lower right. Therefore, the points in the upper right corner of the plot have a large Mahalanobis distance and a very negative log-PDF. These points are colored yellow, orange, or red. They are "outliers" in the sense that they are unlikely to be observed in a random sample from an MVN distribution that has the second set of parameters.

What is a "large" log-PDF value?

For this example, the log-PDF is negative for each observation, so "large" and "small" can be confusing terms. I want to emphasize two points:

  1. When I say a log-PDF value is "large" or "high," I mean "close to the maximum value of the log-PDF function." For example, -3.1 is a large log-PDF value for these data. Observations that are far from the mean vector are very negative. For example, -40 is a "very negative" value.
  2. The maximum value of the log-PDF occurs when an observation exactly equals the mean vector. Thus the log-PDF will never be larger than -0.5*( d*log(2π) + log(det(Σ)) ); a quick sketch of this computation appears after this list. For these data, the maximum value of the log-PDF is -4.01 when you use the maximum likelihood estimates as MVN parameters.
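
A self-contained sketch of computing this bound, reusing the ML estimate of Σ from the earlier program (not part of the original article):

proc iml;
use Sashelp.Iris where(species='Setosa');
read all var {'SepalLength' 'SepalWidth'} into X;
close;
n = nrow(X);
S = (n-1)/n * cov(X);      /* ML estimate of Sigma, as before */
d = ncol(X);
maxLogPDF = -0.5 * ( d*log(2*constant('pi')) + logabsdet(S)[1] );
print maxLogPDF;           /* the largest possible log-PDF for the MVN(m, S) model */
quit;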

Summary

In summary, this article shows how to evaluate the log-PDF of the multivariate normal distribution. The log-PDF values indicate how likely each observation would be in a random sample, given parameters for an MVN model. If you sum the log-PDF values over all observations, you get a statistic (the total log likelihood) that summarizes how well a model fits the data. If you are comparing two models, the one with the larger log likelihood is the model that fits better.

The post How to evaluate the multivariate normal log likelihood appeared first on The DO Loop.

July 14, 2020
 

In my new book, End-to-End Data Science with SAS: A Hands-On Programming Guide, I use the 1.5 IQR rule to adjust multiple variables. The program below utilizes a macro that loops through a list of variables, makes the necessary adjustments, and creates an output data set.

One of the most popular ways to adjust for outliers is to use the 1.5 IQR rule. The rule is straightforward and easy to understand: for any continuous variable, multiply the interquartile range (IQR) by 1.5 and add that value to the third quartile. Any value above that threshold is a suspected outlier. You can perform the same calculation on the low end by subtracting 1.5 x IQR from the first quartile to find low-end outliers.
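
As a quick worked example with made-up numbers: if Q1 = 10 and Q3 = 30, then IQR = 20, the upper fence is 30 + 1.5 x 20 = 60, and the lower fence is 10 - 1.5 x 20 = -20. A one-off DATA step sketch:

data _null_;
   q1 = 10;  q3 = 30;
   iqr   = q3 - q1;           /* 20 */
   upper = q3 + 1.5*iqr;      /* 60: values above are suspected outliers  */
   lower = q1 - 1.5*iqr;      /* -20: values below are suspected outliers */
   put upper= lower=;
run;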

The process of adjusting for outliers can be tedious if you have several continuous variables that are suspected of having outliers. You need to run PROC UNIVARIATE on each variable to identify its median, 25th percentile, 75th percentile, and interquartile range. You then need a program that identifies values above and below the 1.5 IQR thresholds and overwrites those values with new values at the threshold.
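
For a single variable, that manual first step looks something like this (the data set have and the variable x are hypothetical):

proc univariate data=have noprint;
   var x;
   output out=stats median=median q1=q1 q3=q3 qrange=iqr;
run;

proc print data=stats;
run;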

The following program is a bit complicated, but it automates the process of adjusting a list of continuous variables according to the 1.5 IQR rule. This program consists of three distinct parts:

    1. Create a BASE data set that excludes the variables contained in the &outliers global macro. Then create an OUTLIER data set that contains only the unique identifier ROW_NUM and the outlier variables.
    2. Create an algorithm that loops through each of the outlier variables contained in the global variable &outliers and apply the 1.5 IQR rule to cap each variable’s range according to its unique 1.5 IQR value.
    3. Merge the newly restricted outlier variable with the BASE data set.
/*Step 1: Create BASE and OUTLIER data sets*/
 
%let outliers = /*list of variables*/;
 
DATA MYDATA.BASE;
    SET MYDATA.LOAN_ADJUST (DROP=&outliers.);
    ROW_NUM = _N_;
RUN;
 
DATA outliers;
    SET MYDATA.LOAN_ADJUST (KEEP=&outliers. ROW_NUM);
    ROW_NUM = _N_;
RUN;
 
 /*Step 2: Create loop and apply the 1.5 IQR rule*/
 
%MACRO loopit(mylist);
    %LET n = %SYSFUNC(countw(&mylist));
 
    %DO I=1 %TO &n;
        %LET val = %SCAN(&mylist,&I);
 
        PROC UNIVARIATE DATA = outliers ;
            VAR &val.;
            OUTPUT OUT=boxStats MEDIAN=median QRANGE=iqr;
        run;
 
        data _NULL_;
           SET boxStats;
           CALL symput ('median',median);
           CALL symput ('iqr', iqr);
        run;
 
        %PUT &median;
        %PUT &iqr;
 
        DATA out_&val.(KEEP=ROW_NUM &val.);
            SET outliers;
 
            /* cap values above the high-end 1.5 IQR threshold */
            IF &val. ge &median + 1.5 * &iqr THEN
                &val. = &median + 1.5 * &iqr;
 
            /* cap values below the low-end 1.5 IQR threshold */
            IF &val. le &median - 1.5 * &iqr THEN
                &val. = &median - 1.5 * &iqr;
        RUN;
 
/*Step 3: Merge restricted value to BASE data set*/
 
       PROC SQL;
           CREATE TABLE MYDATA.BASE AS
               SELECT *
               FROM MYDATA.BASE AS a
               LEFT JOIN out_&val. as b
                   on a.ROW_NUM = b.ROW_NUM;
       QUIT;
 
    %END;
%MEND;
 
%LET list = &outliers;
%loopit(&list);

Notes on the outlier adjustment program:

  • A macro variable is created that contains all of the continuous variables that are suspected of having outliers.
  • Separate data sets were created: one that contains all of the outlier variables and one that excludes the outlier variables.
  • A macro program is developed to contain the process of looping through the list of variables.
  • A macro variable (n) is created that counts the number of variables contained in the macro variable.
  • A DO loop is created that starts at the first variable and runs the following program on each variable contained in the macro variable.
  • PROC UNIVARIATE identifies the variable’s median and interquartile range.
  • A macro variable is created to contain the values of the median and interquartile range.
  • A DATA step is created to adjust any values that exceed the 1.5 IQR rule on the high end and the low end.
  • PROC SQL adds the adjusted variables to the BASE data set.

This program might seem like overkill to you. It could be easier to simply adjust outlier variables one at a time, and that is often the case. However, when you have a large number of outlier variables, it is often beneficial to create an algorithm that transforms them efficiently and consistently.

Adjusting outliers with the 1.5 IQR rule was published on SAS Users.