January 11, 2022
 

Last year, I wrote a blog post demonstrating how to use the %Auto_Outliers macro to automatically identify possible data errors. This post demonstrates a different approach—one that is useful for variables for which you can identify a reasonable range of values. For example, you would not expect resting heart rates below 40 or above 100, or adult heights below 45 inches or above 84 inches. Although values outside those ranges might be valid, it is a good idea to check the out-of-range values to see whether they are real or data errors.

In the third edition of my book, Cody's Data Cleaning Techniques, I present two macros: %Errors and %Report. These two macros provide a consolidated error report.

To demonstrate these two macros, I created a data set called Patients. A listing of the first 10 observations is shown below:

Notice that the unique identifier is called Patno and that there are three variables of interest: HR (heart rate), SBP (systolic blood pressure), and DBP (diastolic blood pressure).

The calling arguments for the %Errors macro are:

  • Var=     A variable for which you have pre-defined bounds
  • Low=     The lowest reasonable value for this variable
  • High=    The highest reasonable value for this variable
  • Missing= ERROR or IGNORE. The default is IGNORE, but it is still good practice to include this argument in the call so that anyone reading your program understands how missing values are being handled.

Because you might call this macro for many variables, the two macro variables &Dsn (the data set name) and &Idvar (the ID variable, such as Patno or Subj) are assigned values once, using two %LET statements. You can then call the %Errors macro for each variable of interest. When you are finished, call the %Report macro to see a consolidated report of the possible errors.

Here is an example:

/* Set values for &Dsn and &Idvar with %LET statements */
%let Dsn = Clean.Patients;   
%let Idvar = Patno;
 
%Errors(Var=HR, Low=40, High=100, Missing=error)
%Errors(Var=SBP, Low=80, High=200, Missing=ignore)
%Errors(Var=DBP, Low=60, High=120)   
 
/* When you are finished selecting variables, create the report */
%Report

You are reporting all heart rates below 40 or above 100 (and considering missing values as errors); values of SBP below 80 or above 200 (and ignoring missing values); and values of DBP below 60 or above 120 (also ignoring missing values—using the default value of IGNORE).

Here is the result:

Notice that several patients with missing values for HR are flagged as errors. I should point out that I have violated a cardinal rule of macro programming: never write a macro that has no arguments—it should be a program. However, I liked the idea of calling %Errors and then %Report. Shown below are the two macros:

/****************************************************************
| PROGRAM NAME: ERRORS.SAS  in c:\Books\Cleans\Patients         |
| PURPOSE: Accumulates errors for numeric variables in a SAS    |
|          data set for later reporting.                        |
|          This macro can be called several times with a        |
|          different variable each time. The resulting errors   |
|          are accumulated in a temporary SAS data set called   |
|          Errors.                                              |
| ARGUMENTS: Dsn=    - SAS data set name (assigned with a %LET) |
|            Idvar=  - Id variable (assigned with a %LET)       |
|                                                               |
|            Var     = The variable name to test                |
|            Low     = Lowest valid value                       |
|            High    = Highest valid value                      |
|            Missing = IGNORE (default) Ignore missing values   |
|                      ERROR Missing values flagged as errors   |
|                                                               |
| EXAMPLE: %let Dsn = Clean.Patients;                           |
|          %let Idvar = Patno;                                  |
|                                                               |
|          %Errors(Var=HR, Low=40, High=100, Missing=error)     |
|          %Errors(Var=SBP, Low=80, High=200, Missing=ignore)   |
|          %Errors(Var=DBP, Low=60, High=120)                   |
|          Test the numeric variables HR, SBP, and DBP in data  |
|          set Clean.patients for data outside the ranges       |
|          40 to 100, 80 to 200, and 60 to 120 respectively.    |
|          The ID variable is PATNO and missing values are to   |
|          be flagged as invalid for HR but not for SBP or DBP. |
****************************************************************/
%macro Errors(Var=,    /* Variable to test     */
              Low=,    /* Low value            */
              High=,   /* High value           */
              Missing=IGNORE 
                       /* How to treat missing values         */
                       /* Ignore is the default.  To flag     */
                       /* missing values as errors set        */
                       /* Missing=error                       */);
data Tmp;
   set &Dsn(keep=&Idvar &Var);
   length Reason $ 10 Variable $ 32;
   Variable = "&Var";
   Value = &Var;
   if &Var lt &Low and not missing(&Var) then do;
      Reason='Low';
      output;
   end;
 
   %if %upcase(&Missing) ne IGNORE %then %do;
      else if missing(&Var) then do;
         Reason='Missing';
         output;
      end;
   %end;
 
   else if &Var gt &High then do;
      Reason='High';
      output;
   end;
   drop &Var;
run;
 
   proc append base=Errors data=Tmp;
   run;
 
%mend Errors;

The basic idea of the %Errors macro is to test each value of the variable and, if it is a possible error, use PROC APPEND to add it to a data set called Errors. When the first error is detected, PROC APPEND creates the data set Errors. From then on, each observation in data set Tmp is added to data set Errors.

Most of this macro is straightforward. For readers who are not that comfortable with macro programming, note that the third section (beginning with %if %upcase(&Missing)) generates the ELSE IF test for missing values only when the value of the macro variable &Missing is not equal to IGNORE.
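For example, the call %Errors(Var=HR, Low=40, High=100, Missing=error) resolves to a DATA step like this (a sketch of the generated code, using the %LET values assigned earlier):

data Tmp;
   set Clean.Patients(keep=Patno HR);
   length Reason $ 10 Variable $ 32;
   Variable = "HR";
   Value = HR;
   if HR lt 40 and not missing(HR) then do;
      Reason='Low';
      output;
   end;
   else if missing(HR) then do;   /* generated only because Missing=error */
      Reason='Missing';
      output;
   end;
   else if HR gt 100 then do;
      Reason='High';
      output;
   end;
   drop HR;
run;

With Missing=ignore, the ELSE IF missing(HR) block is simply not generated.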

Below is a listing of the %Report macro:

%macro Report;
   proc sort data=Errors;
      by &Idvar;
   run;
 
   proc print data=Errors;
      title "Error Report for Data Set &Dsn";
      id &Idvar;
      var Variable Value Reason;
   run;
 
   proc delete data=Errors Tmp;
   run;
 
%mend Report;

The %Report macro is mainly a PROC PRINT of the temporary data set Errors. I added PROC DELETE to delete the two temporary data sets Errors and Tmp.

You can cut and paste these macros, or you can download all of the macros, programs, and data sets from Cody's Data Cleaning Techniques Using SAS®, Third Edition, by going to support.sas.com/Cody, searching for the book, and then clicking Example Code and Data. You do not have to buy the book to download the files (although I would be delighted if you did). This is true for all of my books published by SAS Press.

Comments and/or corrections are always welcome.

Two macros for detecting data errors was published on SAS Users.

January 10, 2022
 

Mergers and acquisitions are difficult to pull off, but they can impress when the right strategy is applied. And analyzing your data almost always helps with the right strategy. In the first eight months of 2021, publicly announced M&A activities were valued at more than $3.6 trillion globally and $1.8 [...]

Mergers and acquisitions: coming together with data and analytics was published on SAS Voices by Alex Coop

January 10, 2022
 

The past 20 months of disruptions caused by COVID-19 have been a wake-up call for retailers and consumer goods companies. Unpredictable market trends have caused havoc with categories, brands and products making it harder to predict supply requirements. All of these changes have given rise to the need for consumption [...]

Can demand planning save your supply chain? Find out with this free trial was published on SAS Voices by Charlie Chase

January 10, 2022
 

On this blog, I write about a diverse set of topics that are relevant to statistical programming and data visualization. In a previous article, I presented some of the most popular blog posts from 2021. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst. However, I also write articles that discuss more advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles deserve a second look. I have grouped them into four categories: SAS programming, statistics, optimization, and data simulation.

SAS programming

For years, I've been writing articles about how to accomplish tasks in Base SAS. In addition, I now write about how to program basic tasks in SAS Viya.

Use the DOLIST Syntax to Specify Tick Marks on a Graph

Probability and statistics

A Family of Density Curves for the Inverse Gamma Distribution

Probability and statistics provide the theoretical basis for modern data analysis. You cannot understand data science, machine learning, or artificial intelligence without understanding the basic principles of statistics. Most readers are familiar with common probability distributions, Pearson correlation, and least-squares regression. The following articles discuss some of the lesser-known statistical cousins of these topics:

  • The inverse gamma distribution: To use any probability distribution, you need to know four essential functions: the density, the cumulative probability, the quantiles, and how to generate random variates. You might be familiar with the gamma distribution (see the sketch after this list), but what is the inverse gamma distribution and how do you define the four essential functions in SAS? Similarly, what is the generalized gamma distribution?
  • The Hoeffding D statistic: The Hoeffding D statistic measures the association between two variables. How do you compute the Hoeffding D statistic in SAS, and how do you interpret the results?
  • Weibull regression: In ordinary least squares regression, the response variable is assumed to be modeled by a set of explanatory variables and normally distributed errors. If the error terms are Weibull distributed, you can estimate parameters for the Weibull distribution as part of a regression analysis, but you need to transform the regression estimates to obtain the usual Weibull parameters.
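As a refresher, Base SAS provides the four essential functions for the familiar gamma distribution directly (a minimal sketch; the shape value a=3 and the points x and p are arbitrary choices):

data FourFunctions;
   call streaminit(1234);                /* seed for random variates     */
   a = 3;                                /* shape parameter (arbitrary)  */
   x = 2;  p = 0.5;
   density = pdf("Gamma", x, a);         /* density at x                 */
   cumprob = cdf("Gamma", x, a);         /* cumulative probability at x  */
   quant   = quantile("Gamma", p, a);    /* quantile for probability p   */
   variate = rand("Gamma", a);           /* random variate               */
run;

The linked article shows how to construct the analogous functions for the inverse gamma distribution.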

Optimization

Optimization is at the heart of many statistical techniques, such as maximum likelihood estimation. But sometimes you need to solve a "pure" optimization problem. SAS supports many methods for solving optimization problems.

Multivariate simulation and bootstrapping

It is straightforward to simulate univariate data. It is harder to simulate multivariate data while preserving important relations between the variables, such as correlation. Similarly, it can be challenging to analyze the bootstrap distribution of a multivariate statistic, such as a correlation matrix:

The Geometry of the Iman-Conover Transformation

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2021? If so, leave a comment and tell me what topic you found interesting or useful.

The post 12 blog posts from 2021 that deserve a second look appeared first on The DO Loop.

January 7, 2022
 

Welcome to the sixth installment in my series Getting Started with Python Integration to SAS Viya. In previous posts, I discussed how to connect to the CAS server, how to execute CAS actions, and how to work with the results. Now it's time to generate simple descriptive statistics of a CAS table.

Let's begin by confirming the cars table is loaded into memory. If necessary, you can first run code in SAS Studio to load the sashelp.cars table into memory, as sketched below.
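Here is a minimal sketch of that loading step (the session name mySession is an assumption; PROC CASUTIL copies the client-side table to the Casuser caslib):

cas mySession;                   /* start a CAS session (name is arbitrary) */
proc casutil;
   load data=sashelp.cars        /* client-side SAS data set to load        */
        outcaslib="casuser"      /* target caslib                           */
        casout="cars" replace;   /* name of the resulting in-memory table   */
quit;

With a connection to CAS established, execute the tableInfo action to view the available in-memory tables.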

conn.tableinfo(caslib="casuser")

The results show the cars table is loaded into memory and available for processing. Next, reference the cars table in the variable tbl. Then use the print function to show the value of the variable.

tbl = conn.CASTable('cars', caslib='casuser')
print(tbl)
CASTable('cars', caslib='casuser')

The results show that the tbl variable references the cars table in the CAS server.

Preview the CAS Table

First things first. Remember, the SWAT package blends the world of Pandas and CAS into one. So you can begin with the traditional head method to preview the CAS table.

tbl.head()

The SWAT head method returns five rows from the CAS server to the client as expected.

The Describe Method

Next, let's retrieve descriptive statistics of all numeric columns by using the familiar describe method on the CAS table.

tbl.describe()

The SWAT describe method returns the same descriptive statistics as the Pandas describe method. The only difference is that the SWAT version uses the CAS API to convert the describe method into CAS actions behind the scenes to process the data on the distributed CAS server. CAS processes the data and returns summarized results back to the client as a SASDataFrame, which is a subclass of the Pandas DataFrame. You can now work with the results as you would a Pandas DataFrame.

Summary CAS Action

Instead of using the familiar describe method, let's use a CAS action to do something similar. Here I'll use the summary CAS action.

tbl.summary()


The results of the summary action return a CASResults object (Python dictionary) to the client. The CASResults object contains a single key named Summary with a SASDataFrame as the value. The SASDataFrame shows a variety of descriptive statistics.  While the summary action does not return exactly the same statistics as the describe method, it can provide additional insights into your data.

What if we don't want all the statistics for all of the data?

Selecting Columns and Summary Statistics with the Summary Action

Let's add additional parameters to the summary action. I'll add the inputs parameter to specify the columns to analyze in the CAS server.

tbl.summary(inputs = ['MPG_City','MPG_Highway'])

The results show only the MPG_City and MPG_Highway columns were analyzed.

Next, I'll use the subSet parameter to specify the summary statistics to produce. Here I'll obtain the MEAN, MIN and MAX.

tbl.summary(inputs = ['MPG_City','MPG_Highway'],
            subSet = ['mean','min','max'])

The results show that only the MPG_City and MPG_Highway columns were analyzed and that only the specified summary statistics were returned to the client.

Creating a Calculated Column

Lastly, let's create a calculated column within the summary action. There are a variety of ways to do this. I like to add it as a parameter to the CASTable object. You can do that by specifying the tbl object and then the computedVarsProgram parameter. Within computedVarsProgram you can use SAS assignment statements with most SAS functions. Here we create a new column named MPG_Avg that takes the mean of MPG_City and MPG_Highway. Lastly, add the new column to the inputs parameter.

tbl.computedVarsProgram = 'MPG_Avg = mean(MPG_City, MPG_Highway);'
tbl.summary(inputs = ['MPG_City','MPG_Highway', 'MPG_Avg'],
            subSet = ['mean','min','max'])

In the results I see the calculated column and requested summary statistics.

Summary

The SWAT package blends the world of Pandas and CAS. You can use many of the familiar Pandas methods within the SWAT package, or the flexible, highly optimized CAS actions like summary to easily obtain summary statistics of your data in the massively parallel processing CAS engine.

Additional and related resources

Getting Started with Python Integration to SAS® Viya® - Index
SWAT API Reference
CAS Action Documentation
SAS® Cloud Analytic Services: Fundamentals
SAS Scripting Wrapper for Analytics Transfer (SWAT)
CAS Action! - a series on fundamentals

Getting Started with Python Integration to SAS® Viya® - Part 6 - Descriptive Statistics was published on SAS Users.

January 6, 2022
 

This article was co-written by Nick Johnson, Product Marketing Manager, Microsoft Partnership. Check out his blog profile for more information.

As employees continue to adapt to the reality of remote work — collaborating with teams near and far — it is vital that organizations have the right collaboration and productivity tools at their fingertips to support teams remotely. With over 250 million monthly active users, Microsoft Teams has become the collaboration tool of choice for thousands of organizations, changing the way meetings are conducted and how teams access the documents and data that support their business operations.

SAS and Microsoft are partnering to inspire greater trust and confidence in every decision, by driving innovation and proven AI in the cloud. With a combined product roadmap, SAS and Microsoft are working tirelessly to improve offerings and connectivity between SAS Viya and Microsoft. That’s why we’re especially excited to announce SAS Conversation Designer is now generally available in Microsoft Teams.

Conversational AI enables humans to interact with machines using natural language – text or voice – and instantly get a human-like, intelligent response. And ChatOps – a way of collaborating that connects people with process, tools and automation into a transparent workflow – can enable your teams to work together on complex analytics processes without writing a single line of code. Conversational AI is creating new opportunities for finding insights in your data by simply asking a question in natural language to a chatbot.

Now, you can ask questions of your SAS and open-source data directly from the Microsoft Teams toolbar and share insights directly with your teammates without jumping between application interfaces. Chat-enabled analytics does the work for you by providing data, reports and visualizations through a chat interface in Microsoft Teams.

With SAS Conversation Designer in Teams you can:

    • Build and deploy a chatbot with ease using a low-code visual interface.
    • Get answers and complete tasks using SAS’ industry-leading natural language processing.
    • Access data, reports and visualizations via chat – even run advanced analytics and AI.

Follow the quick start guide below to see how easy it is to build and deploy natural language chatbots in your Microsoft Teams environment.

Get started:

Step 1: To get started, log onto SAS Viya through a web browser using Azure AD for simplified access.

Step 2: SAS Conversation Designer’s visual interface is where you build the chatbot. You can see how key words and phrases, intents, and dialog, shown on a visual pipeline, create the structure for the chatbot.

Step 3: Now that the critical elements of the chatbot are in place, the chatbot can be published and is ready for interaction.

Step 4: Let’s put this bot into Microsoft Teams. Gather the information within SAS Viya for configuring a manifest file, then enter the information into the App Studio in Microsoft Teams.

Step 5: Start a conversation! Your chatbot is ready to provide insights and accelerate collaboration for you and your colleagues.

Learn more

Looking for more on chatbots and our partnership with Microsoft? Check out the resources below:

SAS Conversation Designer

Microsoft Partnership

Make collaboration your superpower with SAS Conversation Designer in Microsoft Teams was published on SAS Users.

January 5, 2022
 

In a hackathon, teams of participants collaborate and compete to find the best solutions to a business or humanitarian challenge using technology. But unlike many traditional hackathons where participants meet in person for a couple of days, the SAS Hackathon is all-digital and lasts for a month. Prior to the [...]

Why hack? 10 reasons why the SAS Hackathon is more than a competition was published on SAS Voices by Einar Halvorsen

January 5, 2022
 

You can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This method is encapsulated in the RANDNORMAL function in SAS/IML software, but you can also perform the computations manually: call the ROOT function to get the Cholesky root, and then use matrix multiplication to correlate multivariate normal (MVN) data, as follows:

%let d = 12;           /* number of variables */
%let N = 500;          /* number of observations */
proc iml;
d = &d;                /* want to generate d variables that are MVN */
N = &N;                /* sample size */
 
/* easy to generate uncorrelated MVN variables */
call randseed(1234);
z = j(d, N);
call randgen(z, "Normal"); /* z is MVN(0,I(d)) */
 
/* use Cholesky root to transform to correlated variables */
Sigma = toeplitz(d:1);   /* example of covariance matrix */
G = root(Sigma);         /* the Cholesky root of the full covariance matrix */
x = G`*z;                /* x is MVN(0, Sigma) */
title "Two Components of MVN(0,Sigma) Data";
call scatter(x[1,], x[2,]) label={x1 x2};

The simulation creates d=12 correlated variables and 500 observations. The first two variables are plotted against each other so that you can see that they are correlated. The other pairs of variables are also correlated according to the entries in the Σ covariance matrix.
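As a quick sanity check (a sketch that is not part of the original program), you can compare the sample covariance of the simulated data with Σ. Because x is a d x N matrix, transpose it before calling the COV function:

/* optional check: the sample covariance is close to Sigma for large N */
SampleCov = cov(x`);                    /* x` has one row per observation */
maxDiff = max(abs(SampleCov - Sigma));  /* small, but nonzero, for N=500  */
print maxDiff;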

Generating a huge number of MVN variables

In the comments to one of my previous articles, a SAS programmer asked whether it is possible to generate MVN data even if the covariance matrix is so large that it cannot be factored by the ROOT function. For example, suppose you want to generate d=20,000 MVN variables. A d x d covariance matrix of that size requires 3 GB of RAM, and on some operating systems you cannot form the Cholesky root of a matrix that large. However, I have developed an algorithm that enables you to deal with the covariance matrix (Σ) as a block matrix.

To introduce the algorithm, represent the Σ matrix as a 2 x 2 block matrix. (See the last section for a more general representation.) For a 2 x 2 block algorithm, you only need to form the Cholesky roots of matrices of size (d/2), which are 10000 x 10000 matrices for this example. Matrices of that size require only 0.75 GB and can be quickly factored. Of course, there is no such thing as a free lunch: The new algorithm is more complicated and requires additional computations, albeit on smaller matrices.

Let's look at how to simulate MVN data by treating Σ as a block matrix with blocks of size d/2. For ease of exposition, assume d is even and each block is of size d/2. However, the algorithm easily handles the case where d is odd: choose the upper blocks to have floor(d/2) rows and the lower blocks to have ceil(d/2) rows. The SAS code (below) handles the general case. (In fact, the algorithm doesn't care about the block size. For any p in (0, 1), one block can be of size pd and the other of size (1-p)d.)

In block form, the covariance matrix is
\(\Sigma = \begin{bmatrix}A & B\\B^\prime & C\end{bmatrix}\)
There is a matrix decomposition called the LDLT decomposition that enables you to write Σ as the product of three block matrices:

\(\begin{bmatrix}A & B\\B^\prime & C\end{bmatrix} = \begin{bmatrix}I & 0\\B^\prime A^{-1} & I\end{bmatrix} \begin{bmatrix}A & 0\\0 & S\end{bmatrix} \begin{bmatrix}I & A^{-1}B\\0 & I\end{bmatrix} \)
where \(S = C - B^\prime A^{-1} B\) is the Schur complement of the block A. There is a theorem that says that if Σ is symmetric positive definite (SPD), then so is every principal submatrix, so A and C are SPD. Thus, we can use the Cholesky decomposition and write \(A = G^\prime_A G_A\) for an upper triangular matrix, \(G_A\). The Schur complement of an SPD matrix is also SPD, so we can likewise write \(S = G^\prime_S G_S\) for an upper triangular matrix \(G_S\).
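You can check this factorization by multiplying out the three block matrices:
\(\begin{bmatrix}I & 0\\B^\prime A^{-1} & I\end{bmatrix} \begin{bmatrix}A & 0\\0 & S\end{bmatrix} \begin{bmatrix}I & A^{-1}B\\0 & I\end{bmatrix} = \begin{bmatrix}A & B\\B^\prime & B^\prime A^{-1} B + S\end{bmatrix} \)
The lower-right block equals C precisely because \(S = C - B^\prime A^{-1} B\).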

Using these definitions, the Cholesky decomposition of Σ can be written in block form as
\(\Sigma = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}G_A & (G^\prime_A)^{-1} B \\ 0 & G_S\end{bmatrix} \)

To simulate MVN data, we let z be uncorrelated MVN(0, I) data and define
\(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix}G^\prime_A & 0\\B^\prime G^{-1}_A & G^\prime_S\end{bmatrix} \begin{bmatrix}z_1 \\ z_2\end{bmatrix} = \begin{bmatrix}G^\prime_A z_1 \\ B^\prime G^{-1}_A z_1 + G^\prime_S z_2\end{bmatrix} \)

The previous equation indicates that you can generate correlated MVN(0, Σ) data if you know the Cholesky decomposition of the upper-left block of Σ (which is GA) and the Cholesky decomposition of the Schur complement (which is GS). Each of the matrices in these equations is of size d/2, which means that the computations are performed on matrices that are half the size of Σ. However, notice that generating the x2 components requires some extra work: the Schur complement involves the inverse of A and the formula for x2 also involves the inverse of GA. These issues are not insurmountable, but it means that the block algorithm is more complicated than the original algorithm, which merely multiplies z by the Cholesky root of Σ.

A SAS program for block simulation of MVN data

The SAS/IML program in a previous section shows how to generate correlated MVN(0, Σ) data (x) from uncorrelated MVN(0, I) data (z) by using the Cholesky root (G) of Σ. This section shows how to get exactly the same x values by performing block operations. For the block operations, each matrix is size d/2, so has 1/4 of the elements of Σ.

The first step is to represent Σ and z in block form.

/* The Main Idea: get the same MVN data by performing block 
   operations on matrices that are size d/2 */
/* 1. Represent Sigma={A B, B` C} and z={z1, z2} in block form */
d1 = ceil(d/2);           /* half the variables */
d2 = d - d1;              /* the other half of the vars */
dp1 = d1 + 1;
 
A = Sigma[1:d1,  1:d1];   /* break up the symmetric (d x d) covariance */
B = Sigma[1:d1,  dp1:d];  /* matrix into blocks of size d/2 */
C = Sigma[dp1:d, dp1:d];
z1 = z[1:d1, ];           /* extract the first d1 components of the MVN(0,I) data */
z2 = z[dp1:d, ];          /* extract the last d2 components of the MVN(0,I) data */

It is easy to generate x1, which contains the first d/2 components of the MVN(0, Σ) simulated data. You simply use the Cholesky decomposition of A, which is the upper-left block of Σ:

/* 2. Compute Cholesky root of A and compute x1 from z1 */
G_A = root(A);            /* Cholesky of upper left block */
x1 = G_A`*z1;             /* generate first half of variables */

It is not as easy to generate x2, which contains the last d/2 components. In block form, \(x_2 = v + u\), where \(v = B^\prime G_A^{-1} z_1\) and \(u = G_S^\prime z_2\), and where \(G_S\) is the Cholesky root of the Schur complement (\(S = G_S^\prime G_S\)). This is shown in the following statements:

/* 3. Generate the second half of variables */
S = C - B`*inv(A)*B;      /* S is the Schur complement */
G_S = root(S);            /* Cholesky root of Schur complement */
v = B`*inv(G_A)*z1;
u = G_S`*z2;
x2 = v + u;
 
/* verify that the computation gives exactly the same simulated data */
print (max(abs(x[1:d1,] - x1)))[L="Diff(x - x1)"];
print (max(abs(x[dp1:d,] - x2)))[L="Diff(x - x2)"];

The output shows that the block computation, which uses multiple matrices of size d/2, gives exactly the same answer as the original computation, which uses matrices of size d.

Analysis of the block algorithm

Now that the algorithm runs on a small example, you can increase the dimensions of the matrices. For example, try rerunning the algorithm with the following parameters:

%let d = 5000;    /* number of variables */
%let N = 1000;    /* number of observations */

When I run this larger program on my PC, it takes about 2.3 seconds to simulate the MVN data by using the full Cholesky decomposition. It takes about 17 seconds to simulate the data by using the block method. Most of the time for the block method is spent on the computation v = B`*inv(G_A)*z1, which takes about 15 seconds. So if you want to improve the performance, you should focus on improving that computation.

You can improve the performance of the block algorithm in several ways:
  • You don't need to find the inverse of A. You have the Cholesky root of A, which is triangular, so you can define H = inv(G_A`)*B and compute the Schur complement as C - H`*H.
  • You don't need to explicitly form any inverses. Ultimately, all these matrices operate on the vector z2, which means that you can simulate the data by using only matrix multiplication and solutions of triangular linear systems, as shown in the sketch after this list.
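For example, here is a sketch of an inverse-free variant of step 3. It uses the TRISOLV function to solve triangular systems with the Cholesky root G_A instead of forming inv(A) or inv(G_A):

/* inverse-free variant of step 3 (sketch) */
H   = trisolv(2, G_A, B);   /* solves G_A`*H = B, so H = inv(G_A`)*B */
S   = C - H`*H;             /* Schur complement, without inv(A)      */
G_S = root(S);              /* Cholesky root of the Schur complement */
v   = H`*z1;                /* equals B`*inv(G_A)*z1                 */
x2  = v + G_S`*z2;          /* same simulated values as before       */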

I tried several techniques, but I could not make the block algorithm competitive with the performance of the original Cholesky algorithm. The issue is that the block algorithm computes two Cholesky decompositions, solves some linear equations, and performs matrix multiplication. Even though each computation is on a matrix that is half the size of the original matrix, the additional computations result in a longer runtime.

Generalization to additional blocks

The algorithm generalizes to other block sizes. I will outline the case for a 3 x 3 block matrix. The general case of k x k blocks follows by induction.

You can write the Σ covariance matrix as a 3 x 3 block matrix. By using the 2 x 2 block algorithm, you can obtain multivariate normal vectors x1 and x2, which correspond to the covariances in the four blocks in the upper-left corner of the matrix. You can then view those upper four blocks as a single (larger) block, thus obtaining a 2 x 2 block matrix in which the upper-left block has size 2d/3 and the lower-right block (Σ33) has size d/3.

There are many details to work through, but the main idea is that you can repeat the computation on this new block matrix where the block sizes are unequal. To proceed, you need the Cholesky decomposition of the large upper-left block, but we have already shown how to compute that in terms of the Cholesky root of Σ11, its inverse, and the Cholesky root of the Schur complement of Σ22. You also need to form the Schur complement of Σ33, but that is feasible because we have the Cholesky decomposition of the large upper-left block.

I have not implemented the details because it is not clear to me whether three or more blocks provides a practical advantage. Even if I work out the details for k=3, the algebra for k > 3 blocks seems to be daunting.

Summary

It is well known that you can use the Cholesky decomposition of a covariance matrix to simulate data from a correlated multivariate normal distribution. This article shows how to break up the task by using a block Cholesky method. The method is implemented for k=2 blocks. The block method requires more matrix operations, but each operation is performed on a smaller matrix. Although the block method takes longer to run, it can be used to simulate a very large number of variables by generating a block of variables at a time, where each block is small enough that the matrices involved can be easily handled.

The block method generalizes to k > 2 blocks, but it is not clear whether more blocks provide any practical benefits.

The post A block-Cholesky method to simulate multivariate normal data appeared first on The DO Loop.