When you write a program that simulates data from a statistical model, you should always check that the simulation code is correct. One way to do this is to generate a large simulated sample, estimate the parameters in the simulated data, and make sure that the estimates are close to the parameters in the simulation. For regression models that include CLASS variables, this task requires a bit of thought. The problem is that categorical variables can have different parameterizations, and the parameterization determines the parameter estimates.

Recently I read a published paper that included a simulation in SAS of data from a logistic model. I noticed that the parameters in the simulation code did not match the parameter estimates in the example. I was initially confused, although I figured it out eventually. This article describes the "parameterization problem" and shows how to ensure that the parameters in the simulation code match the parameter estimates for the analysis.

I have previously written about how to incorporate classification variables into simulations of regression data. If you haven't read that article, it is a good place to begin. You can also read about how to simulate data from a generalized linear model, such as a logistic regression model.

### Simulated explanatory variables with a classification variable

Before demonstrating the problem, let's simulate the explanatory variables. Each simulation will use the same data and the same model but change the way that the model is parameterized. The following SAS DATA step simulates two normal variables and a categorical "treatment" variable that has three levels (Trt=1, 2, or 3). The design is balanced in the treatment variable, which means that each treatment group has the same number of observations.

data SimExplanatory;
call streaminit(54321);
do i = 1 to 2000;                        /* number in each treatment group */
   do Trt = 1 to 3;                      /* Trt = 1, 2, and 3 */
      x1 = rand('Normal');               /* x1 ~ N(0,1) */
      x2 = rand('Normal');               /* x2 ~ N(0,1) */
      LinCont = -1.5 + 0.4*x1 - 0.7*x2;  /* offset = -1.5; params are 0.4 and -0.7 for x1 and x2 */
      output;
   end;
end;
drop i;
run;

The data set contains a variable called LinCont. The LinCont variable is a linear combination of the x1 and x2 variables and a constant offset. All models will use the same offset (-1.5) and the same parameters for x1 and x2 (0.4 and -0.7, respectively).

### Data for a logistic regression model with a CLASS variable

When writing a simulation of data that satisfy a regression model, most programmers assign coefficients for each level of the treatment variable. For example, you might use the coefficients {3, 2, 1}, which means that the group Trt=1 gets 3 additional units of response, the group Trt=2 gets 2 additional units, and the group Trt=3 gets only 1 additional unit. This is a valid way to specify the model, but when you compute the regression estimates, you will discover that they are not close to the parameters in the model. The following DATA step simulates logistic data and uses PROC LOGISTIC to fit a model to the simulated data.

data LogiSim_Attempt;
call streaminit(123);
set SimExplanatory;                       /* read the explanatory variables */
array TrtCoef[3] _temporary_ (3, 2, 1);   /* treatment effects */
/* a straightforward way to generate Y data from a logistic model */
eta = LinCont + TrtCoef[1]*(Trt=1) + TrtCoef[2]*(Trt=2) + TrtCoef[3]*(Trt=3);
p = logistic(eta);                        /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                 /* random realization of Y */
drop eta p;
run;

title "Parameter Estimates Do Not Match Simulation Code";
ods select ParameterEstimates;
proc logistic data=LogiSim_Attempt plots=none;
   class Trt;
   model y(event='1') = Trt x1 x2;
run;

Notice that the simulation uses the expressions (Trt=1), (Trt=2), and (Trt=3). These are binary expressions: for any value of Trt, exactly one of the terms is unity and the others are zero. For example, if Trt=2, then (Trt=2) equals 1, whereas (Trt=1) and (Trt=3) equal 0.

Notice that the parameter estimates for the levels of the Trt variable do not match the parameters in the model. The model specified parameters {3, 2, 1}, but the estimates are close to {1, 0, ???}, where '???' is used to indicate that the ParameterEstimates table does not include a row for the Trt=3 level. Furthermore, the estimate for the Intercept is not close to the parameter value, which is -1.5.

I see many simulations like this. The problem is that the parameterization used by the statistical analysis procedure (here, PROC LOGISTIC) is different from the model's parameterization. The next section shows how to write a simulation so that the parameters in the simulation code and the parameter estimates agree.

### The 'reference' parameterization

You can't estimate all three coefficients of the categorical variable. However, you can estimate the differences between the levels. For example, you can estimate the average difference between the Trt=1 group and the Trt=3 group. One parameterization of regression models is the "reference parameterization." In the reference parameterization, one of the treatment levels is used as a reference level and you estimate the difference between the non-reference effects and the reference effect. The Intercept parameter is the sum of the offset value and the reference effect. The following simulation code simulates data by using a reference parameterization:

data LogiSim_Ref;
call streaminit(123);
set SimExplanatory;
/* assume reference or GLM parameterization */
array TrtCoef[3] _temporary_ (3, 2, 1);   /* treatment effects */
RefLevel = TrtCoef[3];                    /* use last level as reference */
eta = LinCont + RefLevel                  /* constant term is (Intercept + RefLevel) */
      + (TrtCoef[1]-RefLevel)*(Trt=1)     /* param = +2 */
      + (TrtCoef[2]-RefLevel)*(Trt=2)     /* param = +1 */
      + (TrtCoef[3]-RefLevel)*(Trt=3);    /* param = 0 */
p = logistic(eta);                        /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                 /* random realization of Y */
drop eta p;
run;

title "Reference Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Ref plots=none;
   class Trt / param=reference ref=LAST;  /* reference parameterization */
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + RefLevel," which is -0.5. The parameter for the Trt=1 effect is +2, which is "TrtCoef[1]-RefLevel." Similarly, the parameter for the Trt=2 effect is +1, and the parameter for Trt=3 is 0. Notice that the parameter estimates from PROC LOGISTIC agree with these values because I specified the PARAM=REFERENCE option, which tells the procedure to use the same parameterization as the simulation code.

Notice that this model is the same as the previous model because we are simply adding the RefLevel to the constant term and subtracting it from the Trt effect.

### The 'effect' parameterization

In a similar way, instead of subtracting a reference level, you can subtract the mean of the coefficients for the treatment effects. In general, you need to compute a weighted mean, where the weights are the number of observations for each level, but for the case of balanced treatments, you can just use the ordinary arithmetic mean of the coefficients.
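In symbols, if the level Trt=i has $n_i$ observations and coefficient $c_i$, the weighted mean described above is $\bar{c} = \left( \sum_{i=1}^{3} n_i c_i \right) / \left( \sum_{i=1}^{3} n_i \right)$. For a balanced design, all the $n_i$ are equal, so $\bar{c}$ reduces to the arithmetic mean $(c_1 + c_2 + c_3)/3$.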

For these data, the mean of the treatment effects is (3+2+1)/3 = 2. The following simulation adds the mean to the constant term and subtracts it from the Trt effects. This is the "effect parameterization":

data LogiSim_Effect;
call streaminit(123);
set SimExplanatory;
array TrtCoef[3] _temporary_ (3, 2, 1);   /* treatment effects */
/* assume effect parameterization, which is equivalent to using deviations
   from the mean: array TrtDev[3] _temporary_ (+1, 0, -1); */
MeanTrt = 2;                              /* for balanced design, mean=(3+2+1)/3 */
eta = LinCont + MeanTrt                   /* est intercept = (Intercept + MeanTrt) */
      + (TrtCoef[1]-MeanTrt)*(Trt=1)      /* param = +1 */
      + (TrtCoef[2]-MeanTrt)*(Trt=2)      /* param = 0 */
      + (TrtCoef[3]-MeanTrt)*(Trt=3);     /* param = -1 */
p = logistic(eta);                        /* P(Y | explanatory vars) */
y = rand("Bernoulli", p);                 /* random realization of Y */
drop eta p;
run;

title "Effect Parameterization";
ods select ParameterEstimates;
proc logistic data=LogiSim_Effect plots=none;
   class Trt / param=effect ref=LAST;
   model y(event='1') = Trt x1 x2;
run;

In this simulation and analysis, the constant term in the model is "Intercept + MeanTrt," which is 0.5. The parameters for the Trt levels are {+1, 0, -1}. Notice that the PROC LOGISTIC call uses the PARAM=EFFECT option, which tells the procedure to use the same effect parameterization. Consequently, the parameter estimates are close to the parameter values. (These are the same parameter estimates as in the first example, since PROC LOGISTIC uses the effect parameterization by default.)

If you use 0 as an offset value, then the Intercept parameter equals the mean effect of the treatment variable (for these coefficients, the Intercept would be (3+2+1)/3 = 2), which leads to a nice interpretation when there is only one classification variable and no interactions.

### Summary

This article shows how to write the simulation code (the equation of the linear predictor) so that it matches the parameterization used to analyze the data. This article uses PROC LOGISTIC, but you can use the same trick for other procedures and analyses. When the simulation code and the analysis use the same parameterization of classification effects, it is easier to feel confident that your simulation code correctly simulates data from the model that you are analyzing.

The post Simulate regression models that incorporate CLASS parameterizations appeared first on The DO Loop.

I spend my days helping SAS Platform Administrators develop their skills. The best part of my job? Helping them understand the power and importance of SAS as an analytics platform. Watching them move from no knowledge of SAS to feeling confident in their newfound abilities brings me joy.

Most of the future SAS Administrators I teach function as “SAS Administrator and <fill in the blank>.” Their dilemma inspired this post describing a typical day in the life of a SAS Administrator.

Customer needs vary, and SAS adapts with them, so there's really no by-the-book, play-by-play list of the daily activities. Every SAS Administrator's day is a little different, but below are some of the common tasks.

## SAS server management

The day starts with checking your servers. As the admin, I check my servers every day. Without the ability to process their code, your users will not be happy. I know, I know, making users happy might not be high on your bucket list, but it’s a part of the job. Besides, don’t you want the satisfaction of knowing that your finely tuned SAS environment remained finely tuned over night?

Which SAS components you have, and which operating system they run on, determine how you check your servers.

SAS® Viya provides scripts in /etc/init.d that you use to stop, start, restart, and check the status of an individual SAS Viya server and service.

How you run the individual server and service scripts depends on your operating system:

To check status of all servers and services:

sudo /etc/init.d/sas-viya-all-services status

For SAS® 9, there are scripts in <config-directory>/config/Lev1 to control the servers.

To check status of all servers and services:

<config-directory>/config/Lev1/sas.servers {start|stop|restart|status}

The good news is you do not have to run these scripts manually every day. You can schedule them using third party tools so that when you arrive at work, server status greets you.
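For example, on Linux you might schedule the SAS Viya status check with cron. The following crontab entry is a minimal sketch, not a SAS-supplied configuration; the schedule and the log path are assumptions that you would adapt to your site:

# hypothetical entry in root's crontab: check all SAS Viya services at 6:00 a.m.
# on weekdays and append the output to a log (the log path is an example)
0 6 * * 1-5 /etc/init.d/sas-viya-all-services status >> /var/log/sas-viya-status.log 2>&1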

## User management

Many administrators prefer dealing with the servers more than the users. Like I mentioned earlier, users are really our primary reason for being here. Not a day goes by that you will not have to perform some task associated with users. Users come, so you must give access. Users go, so you must remove access. User A is having trouble with his report. User B is having trouble with her code. Solve the issue and you are a genius. If they stump you, no worries! There is always SAS Technical Support for help. With SAS, you are never alone.

## Data management

I often tell my students to become friends with their database administrators (DBAs). Unless the <fill in the blank> part of their “SAS and…” is DBA. Then they should be their own best friend because SAS is all about accessing data of all sizes. If I had a nickel for every story I’ve heard about turf wars between SAS Administrators and DBAs….

The SAS Administrator must make sure the data is accessible. It may not be your job, but it is your responsibility. It doesn’t matter that you had nothing to do with the database password changing overnight and now this morning no one can access tables they could access yesterday. Today you solve everyone’s data access issues.

People often ask me if I get tired of teaching the same administration classes over and over again. I tell them, "No Way!" The content may be the same, but because every customer can use SAS differently, every class is different. I think the same is true for administering SAS. Every day brings the opportunity for learning, fun and excitement. There will be smooth days, when all is well. There will be challenging days, when you wonder what in the world is going on. Regardless, as the SAS Administrator you have many resources to help. There’s a whole SAS Administrator Community if you feel alone, SAS Technical Support if you run into trouble, and SAS Education training if you need a bump or reset to your administration skills. There's also a way to be certified as a SAS Platform Administrator.

A day in the life of the SAS Platform Administrator was published on SAS Users.

The Kronecker product (also called the direct product) is a binary operation that combines two matrices to form a new matrix. The Kronecker product appears in textbooks about the design of experiments and multivariate statistics. The Kronecker product seems intimidating at first, but often one of the matrices in the product has a special form, such as being a matrix of all 1s. In these special cases, the Kronecker product becomes much easier to understand. This article demonstrates some of the most common uses of the Kronecker product, including using it to concatenate matrices and to form block matrices.

### What is the Kronecker product?

The Kronecker product is best understood by thinking about block matrices. If A is an n x p matrix and B is an m x q matrix, then the Kronecker product A ⊗ B is a block matrix that is formed from blocks of B, where each block is multiplied by an element of A:
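In symbols, A ⊗ B is the (nm) x (pq) block matrix

$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1p} B \\ \vdots & \ddots & \vdots \\ a_{n1} B & \cdots & a_{np} B \end{pmatrix}$

where $a_{ij}$ denotes the (i,j)th element of A.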

The SAS/IML language uses the "at sign" (@) to represent the binary operator. Thus, the SAS/IML expression A @ B forms the Kronecker product A ⊗ B.

### The Kronecker product for vectors of 1s and special matrices

If A or B has a special form, the Kronecker product simplifies. First, consider only the case where A is a vector of all 1s or a special matrix. The following statements follow directly from the definition of the Kronecker product.

• If A is a row vector of all 1s, then A ⊗ B is the horizontal concatenation of p copies of B.
• If A is a column vector of all 1s, then A ⊗ B is the vertical concatenation of n copies of B.
• If A is a matrix of all 1s, then A ⊗ B is a block matrix that contains blocks of the matrix B.
• If A is the identity matrix, then A ⊗ B is a block diagonal matrix that contains copies of the matrix B along the diagonal.

Next, let A be an arbitrary matrix but consider only special forms for B:

• If B is a row vector of all 1s, then A ⊗ B contains q copies of the first column of A, followed by q copies of the second column of A, and so forth. I call this sequential replication of columns of A.
• If B is a column vector of all 1s, then A ⊗ B contains m copies of the first row of A, followed by m copies of the second row of A, and so forth. This is sequential replication of rows of A.
• If B is a matrix of all 1s, then A ⊗ B is a block matrix where the (i,j)th block is the constant matrix where all elements are A[i,j].
• If B is an identity matrix, then A ⊗ B is a block matrix where the (i,j)th block is the matrix A[i,j]*I.

The next sections show concrete examples of these eight ways to use the Kronecker product operator.

### Use the Kronecker product for horizontal or vertical concatenation

You can use the Kronecker product to perform horizontal or vertical concatenation. For example, the following SAS/IML program defines two vectors that contain only 1s. The vector w is a row vector and the vector h is a column vector. The program computes w ⊗ B and h ⊗ B for various choices of B. The ODS LAYOUT GRIDDED statement arranges the output in a grid.

proc iml;
/* specify RHS vectors */
w = {1 1};        /* w for 'wide' (row vector) */
h = {1, 1};       /* h for 'high' (col vector) */

/* specify data matrices */
r = {1 2 3};      /* row vector */
c = {1, 2, 3};    /* column vector */
x = {1 2 3,       /* 2 x 3 matrix */
     4 5 6};

/******************************/
/* vector @ B : Concatenation */
/******************************/
wr = w @ r;       /* horiz concat. Same as repeat(r, 1, ncol(w)) */
wx = w @ x;       /* horiz concat. Same as repeat(x, 1, ncol(w)) */
hc = h @ c;       /* vert concat. Same as repeat(c, nrow(h), 1) */
hx = h @ x;       /* vert concat. Same as repeat(x, nrow(h), 1) */

ods layout gridded columns=2 advance=table;
print wr[L='w@r'], wx[L='w@x'], hc[L='h@c'], hx[L='h@x'];
ods layout end;

The output indicates that the Kronecker product with w on the left results in horizontal concatenation. Using h on the left results in vertical concatenation.

### Use the Kronecker product for sequential replication

The previous section showed how to take a sequence of numbers such as {1,2,3} and repeat the entire sequence to form a new sequence such as {1,2,3,1,2,3}. But sometimes you want to repeat each element before repeating the next element. In other words, you might be interested in forming a sequence such as {1,1,2,2,3,3}. You can perform this kind of "sequential replication" by putting the data on the left and the vectors of 1s on the right, as follows:

/***************************************/
/* A @ vector : Sequential replication */
/***************************************/
rw = r @ w;       /* repeat cols sequentially */
xw = x @ w;
ch = c @ h;       /* repeat rows sequentially */
xh = x @ h;

ods layout gridded columns=2 advance=table;
print rw[L='r@w'], xw[L='x@w'], ch[L='c@h'], xh[L='x@h'];
ods layout end;

It is worth mentioning that the second row of output is the transpose of the first row. That is because the transpose operation "distributes" over the Kronecker product operator. That is, for any matrices A and B, it is true that (A ⊗ B)^T = A^T ⊗ B^T. Since r^T = c and w^T = h, it follows that (r ⊗ w)^T = r^T ⊗ w^T = c ⊗ h.
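As a quick sanity check, the following SAS/IML statements (a small sketch that assumes the vectors r, w, c, and h from the program above are still in scope) verify the identity numerically:

/* verify that the transpose of r@w equals c@h; the max absolute difference should be 0 */
diff = T(r @ w) - (c @ h);
print (max(abs(diff)))[L='max abs diff'];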

### Use the Kronecker product to construct block matrices

The Kronecker product is essentially an operation that forms block matrices. When one of the components is a vector of all 1s, then "forming a block matrix" is the same as concatenation. But if A is a binary matrix, then A ⊗ B is a block matrix that has the same structure as A, but each nonzero block is a copy of B.

Probably the most important subcase is that the matrix I(n) ⊗ B is a block diagonal matrix, where each diagonal block is a copy of B. The following list gives some other examples that use the identity or a constant matrix. Let I(n) be the n x n identity matrix and let J(n) be the n x n matrix of all 1s. Then

• I(n) ⊗ B is a block diagonal matrix with copies of B on the diagonal.
• J(n) ⊗ B is a block matrix with a copy of B in each block.
• A ⊗ I(n) is a block matrix where each block is a diagonal matrix.
• A ⊗ J(n) is a block matrix where each block is a constant matrix.
Ix = I(2) @ x; /* diagonal block matrix */ Jx = J(2) @ x; /* 2 x 2 block matrix */ xI = x @ I(2); /* blocks of diagonal matrices */ xJ = x @ J(2); /* blocks of constant matrices */   ods layout gridded columns=2 advance=table; print Ix[L='I@x'], Jx[L='J@x'], xI[L='x@I'], xJ[L='x@J']; ods layout end;

### How is the Kronecker product used in statistics?

Recall that (mathematically) you can only add matrices that have the same dimensions. For example, you can't add a matrix and a vector. However, a common operation in multivariate statistics is to center a data matrix by subtracting the mean of each column from that column. If X is the data matrix, "mean(X)" is a row vector where the i_th element of the vector is the mean of the i_th column. The centering operation is sometimes informally written as "X – mean(X)," even though that expression does not make mathematical sense. Formally, you need to vertically concatenate the mean vector n times, where n is the number of rows in the data matrix. As we have seen, if h is a column vector of length n that contains all 1s, then mean(X) ⊗ h is a matrix and the expression X – mean(X) ⊗ h is a proper mathematical expression that indicates the centering operation.

As described in a previous article, in the early days of SAS, statistical programmers had to use the Kronecker product operator to ensure that all matrix operations were mathematically correct. But that changed in SAS version 8 (circa 1999) when the IML language introduced a shorthand notation that makes it easier to perform row and column operations. Back in the "old days," you saw lots of programs that used the Kronecker operator like this:

/* read data matrix into X */
use Sashelp.Class;
read all var _num_ into X;
close;
mean = X[:,];              /* row vector of means for each column */
h = j(nrow(X), 1, 1);      /* column vector */
CenterX = X - mean @ h;    /* subtract matrices of same dimension */

The modern SAS/IML programmer will typically use the shorthand notation to avoid the Kronecker product:

CenterX = X - mean; /* modern code: Subtract the row vector from each row of X */

Although you no longer need to use the Kronecker product for basic centering operations, the Kronecker operator is still useful for other tasks, such as building block matrices and concatenation. Block diagonal matrices are used in mixed models of longitudinal data (Chapter 12 of Wicklin, 2013).

### Summary

This article shows eight ways that you can use the Kronecker product. The Kronecker operator enables you to perform horizontal and vertical concatenation, perform sequential repetition of rows and columns, and build block matrices with special structure. You can use this article as a "cheat sheet" to remind you about some common uses of the Kronecker operator.

The post 8 ways to use the Kronecker product appeared first on The DO Loop.

SAS Press has added to its selection of free downloadable eBooks with the new SAS and Open-Source Model Management® – Special Collection.

From the description:

“Turn analytical models into business value and smarter decisions with this special collection of papers about SAS Model Management. Without a structured and standardized process to integrate and coordinate all the different pieces of the model life cycle, a business can experience increased costs and missed opportunities. SAS Model Management solutions enable organizations to register, test, deploy, monitor, and retrain analytical models, leveraging any available technology – including open-source models in Python, R, and TensorFlow – into a competitive advantage.”

As we know, the diversity and multitude of tools, programming languages, algorithms, and skills is reshaping model management best practices. The “New Normal” has brought an immediate need to adopt model management solutions that enable users to register, test, deploy, monitor, and retrain models in a scalable and structured way, leveraging their open-source or commercial development tool of choice.

In this eBook, we have carefully selected several papers and best practices from SAS Global Forum 2020 on how to put ML in production and become an expert in each step of the Model Management Lifecycle.

### Key Takeaways

1. How you can put ML in production and turn your models into business value.
2. How SAS openness allows you to leverage any open-source technology.
3. How SAS Model Management solution enables you to register, test, deploy, monitor and retrain analytical models.

### Papers Selected

• The Aftermath: What Happens After You Deploy Your Models and Decisions
By David Duling
• Turning the Crank: A Simulation of Optimizing Model Retraining
By David Duling
• Open-Source Model Management with SAS® Model Manager
By Glenn Clingroth, Hongjie Xin, and Scott Lindauer
• Deploying Models Using SAS® and Open-Source
By Jared Dean
• Cows or Chickens: How You Can Make Your Models into Containers
By Hongjie Xin, Jacky Jia, David Duling, and Chris Toth
• Choose Your Own Adventure: Manage Model Development via a Python IDE
By Jon Walker
• Build Your ML Web Application Using SAS AutoML
By Paata Ugrekhelidze
• Monitoring the Relevance of Predictors for a Model Over Time
By Ming-Long Lam
• Model Validation
By Hans-Joachim Edert and Tamara Fischer

If you are interested in other topics, like Forecasting, Computer Vision, Text Analytics, find more free downloadable eBooks in the Special Collection series at support.sas.com/freesasebooks.

SAS and Open-Source Model Management (free eBook) was published on SAS Users.

In his blog post, How to split one data set into many, Chris Hemedinger showed how to subset or split SAS data sets based on the values of categorical variables. For example, based on a value of variable REGION you may split a data set MARKETING into MARKETING_ASIA, MARKETING_AMERICA, MARKETING_EUROPE, and so on.

In some cases, however, we need to split a large data set into many smaller ones, not by the values of a subsetting variable, but by the number of observations, in order to produce smaller, more manageable data sets. Such an approach can be dictated by restrictions on the data set size imposed by hardware (memory size, transmission channel bandwidth, etc.), processing time, or user interface convenience (e.g., search results displayed by pages).

We might need to split a data set into smaller tables of K or fewer observations each, or to split a data set into S equal (or approximately equal) pieces.

We might need to split a data set into sequentially selected subsets where the first K observations go into the first data set, the second K observations go into the second data set, and so on. Alternatively, we might need to randomly select observations from a data set while splitting it into smaller tables.

This blog post provides possible coding solutions for such scenarios.

## Splitting a data set into smaller data sets sequentially

Let’s say we need to split the data set SASHELP.CARS (number of observations N=428) into several smaller data sets. We will consider the following two sequential observation selection scenarios:

1. Each smaller data set should have a maximum of K observations.
2. There should be S smaller data sets of approximately the same size.

Ideally, we would like to split a data set into smaller data sets of exactly K observations each, but that is possible only when the number of observations N in the original data set is divisible by K. Therefore, we will split it into several smaller data sets of K observations each, except that the last smaller data set will contain the remainder of dividing N by K. For example, splitting N=428 observations with K=100 produces four data sets of 100 observations and a fifth data set of 28 observations.

Similarly, in scenario 2, the first S-1 smaller data sets will have the same size, and the last smaller data set will contain the remaining observations.

Below is a SAS macro code that covers both these scenarios.

%macro split (SRC_DATASET=, OUT_PREFIX=, SPLIT_NUM=, SPLIT_DEF=);
/* Parameters:                                        */
/*   SRC_DATASET - name of the source data set        */
/*   OUT_PREFIX  - prefix of the output data sets     */
/*   SPLIT_NUM   - split number                       */
/*   SPLIT_DEF   - split definition (=SETS or =NOBS)  */

   %local I K S TLIST;
   %let K = 0;   /* initialize so the %IF test below is valid when SPLIT_DEF is invalid */
   %let S = 0;

   /* number of observations &K, number of smaller datasets &S */
   data _null_;
      if 0 then set &SRC_DATASET nobs=N;
      if upcase("&SPLIT_DEF")='NOBS' then do;
         call symputx('K',&SPLIT_NUM);
         call symputx('S',ceil(N/&SPLIT_NUM));
         put "***MACRO SPLIT: Splitting into datasets of no more than &SPLIT_NUM observations";
      end;
      else if upcase("&SPLIT_DEF")='SETS' then do;
         call symputx('S',&SPLIT_NUM);
         call symputx('K',ceil(N/&SPLIT_NUM));
         put "***MACRO SPLIT: Splitting into &SPLIT_NUM datasets";
      end;
      else put "***MACRO SPLIT: Incorrect SPLIT_DEF=&SPLIT_DEF value. Must be either SETS or NOBS.";
      stop;
   run;

   /* terminate macro if nothing to split */
   %if (&K le 0) or (&S le 0) %then %return;

   /* generate list of smaller dataset names */
   %do I=1 %to &S;
      %let TLIST = &TLIST &OUT_PREFIX._&I;
   %end;

   /* split source dataset into smaller datasets */
   data &TLIST;
      set &SRC_DATASET;
      select;
         %do I=1 %to &S;
            when(_n_ <= &K * &I) output &OUT_PREFIX._&I;
         %end;
      end;
   run;

%mend split;

The following are examples of the macro invocations:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);

These invocations will produce the following SAS logs:

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=SET);
***MACRO SPLIT: Incorrect SPLIT_DEF=SET value. Must be either SETS or NOBS.

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=100, SPLIT_DEF=NOBS);
***MACRO SPLIT: Splitting into datasets of no more than 100 observations
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_4 has 100 observations and 15 variables.
NOTE: The data set WORK.CARS_5 has 28 observations and 15 variables.

%split(SRC_DATASET=SASHELP.CARS, OUT_PREFIX=WORK.CARS, SPLIT_NUM=3, SPLIT_DEF=SETS);
***MACRO SPLIT: Splitting into 3 datasets
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.CARS_1 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_2 has 143 observations and 15 variables.
NOTE: The data set WORK.CARS_3 has 142 observations and 15 variables.

## Splitting a data set into smaller data sets randomly

For randomly splitting a data set into many smaller data sets we can use the same approach as above with a slight modification. In essence, we are going to randomly shuffle observations of our source data set first, and then apply the sequential splitting.

In order to implement this, we just need to replace the last data step in the above macro with the following 3 steps:

/* generate random numbers, R */
data;
   set &SRC_DATASET;
   call streaminit(1234);
   R = rand('uniform');
run;

/* sort data in R order */
proc sort;
   by R;
run;

/* split source dataset into smaller datasets */
data &TLIST (drop=R);
   set;
   select;
      %do I=1 %to &S;
         when(_n_ <= &K * &I) output &OUT_PREFIX._&I;
      %end;
   end;
run;

This modified code will produce similar results (with the same information in the SAS log); however, the smaller data sets will have their observations randomly selected from the source data set.

## DATAn naming convention

You may have noticed that in this random splitting code I did not specify data set names in the DATA statement of the first DATA step, in the PROC SORT statement, or in the SET statement of the last DATA step. Not only are these shortcuts possible thanks to SAS’ DATAn naming convention, but the convention is also a very robust way of dynamically assigning temporary data set names. This method is especially useful and appropriate for SAS macros because it guarantees that you do not accidentally overwrite a data set with the same name in the SAS program that invokes your macro. Think about it: if you are a macro developer, you need to make sure that the names of whatever temporary data sets you create within your macro are unique within a SAS session, so that they do not interfere with any data sets that may be created in the calling SAS program outside of your macro.

Here are the defaults in SAS’ DATAn naming convention:

• If you do not specify a name for the output data set in a DATA statement, SAS automatically assigns the default names WORK.DATA1, WORK.DATA2, and so on, to each successive data set that you create.
• If you do not specify a name for the input data set in a SET statement, SAS automatically uses the last data set that was created. SAS keeps track of the most recently created data set through the reserved name _LAST_. When you execute a DATA or PROC step without specifying an input data set, by default, SAS uses the _LAST_ data set.
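As a small illustration of these defaults (a sketch that uses SASHELP.CLASS purely as sample data):

data;            /* no output name: SAS assigns the next default name, e.g. WORK.DATA1 */
   set sashelp.class;
run;

proc print;      /* no DATA= option: SAS prints the _LAST_ data set, i.e. WORK.DATA1 */
run;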

For more information on this useful SAS coding technique, see special data set names and the examples and warnings on using special data set names.

Do you find this post useful? Have you ever split data sets into smaller ones based on a number of observations? Do you use special data set names and DATAn naming convention in your SAS coding? Please share your thoughts in the comments section below.

Splitting a data set into smaller data sets was published on SAS Users.

Last month a SAS programmer asked how to fit a multivariate Gaussian mixture model in SAS. For univariate data, you can use the FMM Procedure, which fits a large variety of finite mixture models. If your company is using SAS Viya, you can use the MBC or GMM procedures, which perform model-based clustering (PROC MBC) or cluster analysis by using the Gaussian mixture model (PROC GMM). The MBC procedure is part of SAS Visual Statistics; the GMM procedure is part of SAS Visual Data Mining and Machine Learning (VDMML).

### The problem: Fitting a Gaussian mixture model

A Gaussian mixture model assumes that each cluster is multivariate normal but allows different clusters to have different within-cluster covariance structures. As in k-means clustering, it is assumed that you know the number of clusters, G. The clustering problem is an "unsupervised" machine learning problem, which means that the observations do not initially have labels that tell you which observation belongs to which cluster. The goal of the analysis is to assign each observation to a cluster ("hard clustering") or a probability of belonging to each cluster ("fuzzy clustering"). I will use the terms "cluster" and "group" interchangeably. In this article, I will only discuss the hard-clustering problem, which is conceptually easier to understand and implement.

The density of a Gaussian mixture with G components is $\sum_{i=1}^G \tau_i f(\mathbf{x}; {\boldsymbol\mu}_i, {\boldsymbol\Sigma}_i)$, where f(x; μi, Σi) is the multivariate normal density of the i_th component, which has mean vector μi and covariance matrix Σi. The values τi are the mixing probabilities, so Σ τi = 1. If x is a d-dimensional vector, you need to estimate τi, μi, and Σi for each of the G groups, for a total of G*(1 + d + d*(d+1)/2) – 1 parameter estimates. The d*(d+1)/2 expression is the number of free parameters in the symmetric covariance matrix and the -1 term reflects the constraint Σ τi = 1. Fortunately, the assumption that the groups are multivariate normal enables you to compute the maximum likelihood estimates directly without doing any numerical optimization.

Fitting a Gaussian mixture model is a "chicken-and-egg" problem because it consists of two subproblems, each of which is easy to solve if you know the answer to the other problem:

• If you know to which group each observation belongs, you can compute maximum likelihood estimates for the mean (center) and covariance of each group. You can also use the number of observations in each group to estimate the mixing probabilities for the finite mixture distribution.
• If you know the center, covariance matrix, and mixing probabilities for each of the G groups, you can use the density function for each group (weighted by the mixing probability) to determine how likely each observation is to belong to a cluster. For "hard clustering," you assign the observation to the cluster that gives the highest likelihood.
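In symbols, the hard-clustering assignment in the second subproblem sends the observation $\mathbf{x}$ to the cluster $k^* = \arg\max_{k} \, [\, \log \tau_k + \log f(\mathbf{x}; {\boldsymbol\mu}_k, {\boldsymbol\Sigma}_k) \,]$, which is exactly the quantity that the E step of the program below evaluates for each observation.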

The expectation-maximization (EM) algorithm is an iterative method that enables you to solve interconnected problems like this. The steps of the EM algorithm are given in the documentation for the MBC procedure, as follows:

1. Use some method (such as k-means clustering) to assign each observation to a cluster. The assignment does not have to be precise because it will be refined. Some data scientists use random assignment as a quick-and-dirty way to initially assign points to clusters, but for hard clustering this can lead to less than optimal solutions.
2. The M Step: Assume that the current assignment to clusters is correct. Compute the maximum likelihood estimates of the within-cluster count, mean, and covariance. From the counts, estimate the mixing probabilities. I have previously shown how to compute the within-group parameter estimates.
3. The E Step: Assume the within-cluster statistics are correct. I have previously shown how to evaluate the likelihood that each observation belongs to each cluster. Use the likelihoods to update the assignment to clusters. For hard clustering, this means choosing the cluster for which the density (weighted by the mixing probabilities) is greatest.
4. Evaluate the mixture log likelihood, which is an overall measure of the goodness of fit. If the log likelihood has barely changed from the previous iteration, assume that the EM algorithm has converged. Otherwise, go to the M step and repeat until convergence. The documentation for PROC MBC provides the log-likelihood function for both hard and fuzzy clustering.
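For hard clustering, the mixture log likelihood monitored in step 4 is the complete-data log likelihood $\mathrm{CDLL} = \sum_{j=1}^{n} \sum_{k=1}^{G} z_{jk} \, [\, \log \tau_k + \log f(\mathbf{x}_j; {\boldsymbol\mu}_k, {\boldsymbol\Sigma}_k) \,]$, where $z_{jk}=1$ when the j_th observation is currently assigned to the k_th cluster and 0 otherwise. In the program that follows, this is the quantity computed as sum(Z # LL).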

This implementation of the EM algorithm performs the M-step before the E-step, but it is still the "EM" algorithm. (Why? Because there is no "ME" in statistics!)

### Use k-means clustering to assign the initial group membership

The first step is to assign each observation to a cluster. Let's use the Fisher Iris data, which we know has three clusters. We will not use the Species variable in any way, but merely treat the data as four-dimensional unlabeled data.

The following steps are adapted from the Getting Started example in PROC FASTCLUS. The call to PROC STDIZE standardizes the data and the call to PROC FASTCLUS performs k-means clustering and saves the clusters to an output data set.

/* standardize and use k-means clustering (k=3) for initial guess */
proc stdize data=Sashelp.Iris out=StdIris method=std;
   var Sepal: Petal:;
run;

proc fastclus data=StdIris out=Clust maxclusters=3 maxiter=100 random=123;
   var Sepal: Petal:;
run;

data Iris;
   merge Sashelp.Iris Clust(keep=Cluster);
   /* for consistency with the Species order, remap the cluster numbers */
   if Cluster=1 then Cluster=2;
   else if Cluster=2 then Cluster=1;
run;

title "k-Means Clustering of Iris Data";
proc sgplot data=Iris;
   ellipse x=PetalWidth y=SepalWidth / group=Cluster;
   scatter x=PetalWidth y=SepalWidth / group=Cluster transparency=0.2
           markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid;
   yaxis grid;
run;

The graph shows the cluster assignments from PROC FASTCLUS. They are similar but not identical to the actual groups of the Species variable. In the next section, these cluster assignments are used to initialize the EM algorithm.

### The EM algorithm for hard clustering

You can write a SAS/IML program that implements the EM algorithm for hard clustering. The M step uses the MLEstMVN function (described in a previous article) to compute the maximum likelihood estimates within clusters. The E step uses the LogPdfMVN function (described in a previous article) to compute the log-PDF for each cluster (assuming MVN).

proc iml;
load module=(LogPdfMVN MLEstMVN);  /* you need to STORE these modules */

/* 1. Read data. Initialize 'Cluster' assignments from PROC FASTCLUS */
use Iris;
varNames = {'SepalLength' 'SepalWidth' 'PetalLength' 'PetalWidth'};
read all var varNames into X;
read all var {'Cluster'} into Group;   /* from PROC FASTCLUS */
close;

nobs = nrow(X);
d = ncol(X);
G = ncol(unique(Group));
prevCDLL = -constant('BIG');   /* set to large negative number */
converged = 0;                 /* iterate until converged=1 */
eps = 1e-5;                    /* convergence criterion */
iterHist = j(100, 3, .);       /* monitor EM iteration */
LL = J(nobs, G, 0);            /* store the LL for each group */

/* EM algorithm: Solve the M and E subproblems until convergence */
do nIter = 1 to 100 while(^converged);
   /* 2. M Step: Given groups, find MLE for n, mean, and cov within each group */
   L = MLEstMVN(X, Group);     /* function returns a list */
   ns = L$'n';
   tau = ns / sum(ns);
   means = L$'mean';
   covs = L$'cov';

   /* 3. E Step: Given MLE estimates, compute the LL for membership
      in each group. Use LL to update group membership. */
   do k = 1 to G;
      LL[ ,k] = log(tau[k]) + LogPDFMVN(X, means[k,], shape(covs[k,],d));
   end;
   Group = LL[ ,<:>];          /* predicted group has maximum LL */

   /* 4. The complete data LL is the sum of log(tau[k]*PDF[,k]).
      For "hard" clustering, Z matrix is 0/1 indicator matrix.
      DESIGN function: https://blogs.sas.com/content/iml/2016/03/02/dummy-variables-sasiml.html */
   Z = design(Group);          /* get dummy variables for Group */
   CDLL = sum(Z # LL);         /* sum of LL weighted by group membership */
   /* compute the relative change in CD LL. Has algorithm converged? */
   relDiff = abs( (prevCDLL-CDLL) / prevCDLL );
   converged = ( relDiff < eps );

   /* monitor convergence; if no convergence, iterate */
   prevCDLL = CDLL;
   iterHist[nIter,] = nIter || CDLL || relDiff;
end;

/* remove unused rows and print EM iteration history */
iterHist = iterHist[ loc(iterHist[,2]^=.), ];
print iterHist[c={'Iter' 'CD LL' 'relDiff'}];

The output from the iteration history shows that the EM algorithm converged in five iterations. At each step of the iteration, the log likelihood increased, which shows that the fit of the Gaussian mixture model improved at each iteration. This is one of the features of the EM algorithm: the likelihood always increases on successive steps.

### The results of the EM algorithm for fitting a Gaussian mixture model

This problem uses G=3 clusters and d=4 dimensions, so there are 3*(1 + 4 + 4*5/2) – 1 = 44 parameter estimates! Most of those parameters are the elements of the three symmetric 4 x 4 covariance matrices. The following statements print the estimates of the mixing probabilities, the mean vector, and the covariance matrices for each cluster. To save space, the covariance matrices are flattened into a 16-element row vector.

/* print final parameter estimates for Gaussian mixture */
GroupNames = strip(char(1:G));
rows = repeat(T(1:d), 1, d);
cols = repeat(1:d, d, 1);
SNames = compress('S[' + char(rows) + ',' + char(cols) + ']');
print tau[r=GroupNames F=6.2],
      means[r=GroupNames c=varNames F=6.2],
      covs[r=GroupNames c=(rowvec(SNames)) F=6.2];

You can merge the final Group assignment with the data and create scatter plots that show the group assignment. One of the scatter plots is shown below:
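If you want to reproduce that graph, the following statements are a minimal sketch (the data set name EMOut and the variable name EMCluster are arbitrary choices, and the sketch assumes the PROC IML program above is still running): they write the final Group vector to a data set, merge it with the Iris data, and plot the result.

/* write the final cluster assignments from the PROC IML session above */
EMCluster = Group;
create EMOut var {"EMCluster"};
append;
close EMOut;
quit;

data IrisEM;
   merge Iris EMOut;   /* both are in the original row order, so merge by position */
run;

title "EM Clustering of Iris Data";
proc sgplot data=IrisEM;
   scatter x=PetalWidth y=SepalWidth / group=EMCluster transparency=0.2
           markerattrs=(symbol=CircleFilled size=10) jitter;
   xaxis grid;
   yaxis grid;
run;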

The EM algorithm changed the group assignments for 10 observations. The most obvious change is the outlying marker in the lower-left corner, which changed from Group=2 to Group=1. The other markers that changed are near the overlapping region between Group=2 and Group=3.

### Summary

This article shows an implementation of the EM algorithm for fitting a Gaussian mixture model. For simplicity, I implemented an algorithm that uses hard clustering (the complete data likelihood model). This algorithm might not perform well with a random initial assignment of clusters, so I used the results of k-means clustering (PROC FASTCLUS) to initialize the algorithm. Hopefully, some of the tricks and techniques in this implementation of the EM algorithm will be useful to SAS programmers who want to write more sophisticated EM programs.

The main purpose of this article is to demonstrate how to implement the EM algorithm in SAS/IML. As a reminder, if you want to fit a Gaussian mixture model in SAS, you can use PROC MBC or PROC GMM in SAS Viya. The MBC procedure is especially powerful: in Visual Statistics 8.5 you can choose from 17 different ways to model the covariance structure in the clusters. The MBC procedure also supports several ways to initialize the group assignments and supports both hard and fuzzy clustering. For more about the MBC procedure, see Kessler (2019), "Introducing the MBC Procedure for Model-Based Clustering."

I learned a lot by writing this program, so I hope it is helpful. You can download the complete SAS program that implements the EM algorithm for fitting a Gaussian mixture model. Leave a comment if you find it useful.