I attended a seminar last week whose purpose was to inform SAS 9 programmers about SAS Viya. I could tell from the programmer's questions that some programmers were confused about three basic topics:

• What are the computing environments in Viya, and how should a programmer think about them?
• What procedures do you get when you order a programming-oriented Viya product such as SAS Visual Statistics or SAS Econometrics? Of these procedures, which are CAS-enabled?
• If you have legacy SAS programs, can you still run them if your company migrates from SAS 9 to SAS Viya?

I am a programmer, so I thought it might be helpful for me to discuss these topics programmer-to-programmer. In a series of articles, I am going to discuss issues that a SAS statistical programmer might face when migrating to Viya from SAS 9. I use the term "SAS 9" to refer to the SAS Workspace Server that runs procedures in traditional products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the most recent version of the "classic" SAS programming environment. It is the version of SAS that existed before SAS Viya was created.

### Clients and servers: Where do computations occur in SAS Viya?

In SAS 9, a procedure runs on the SAS Workspace Server. In SAS 9, the word "client," refers to a program such as Enterprise Guide (EG) or SAS Studio, which runs on a PC and submits code to the SAS Workspace Server. The server computes the results and sends tables and graphs back to the client, which displays them. Typically, the input and output data sets remain on the server.

You can think of SAS Viya as having two main components: the CAS server where the data are stored and the computations are performed, and support for several client languages. A client language enable you to connect to the CAS server and tell it what analyses you want to perform. So, in the world of Viya, "client" no longer refers to a GUI like EG, but to an entire programming environment such as SAS, Python, or R. The purpose of the client software is to connect to CAS, submit actions, and get back results. You then use the capabilities of the client language to display the results as a table or graph. For example, the SAS client uses ODS to display tables and graphs. In Python, you might use matplotlib to graph the results. In R, you might use ggplot. In all cases, you can also use the native capabilities of the client language (DATA step, Pandas, the tidyverse,....) to modify, aggregate, or enhance the output.

I use the SAS client to connect to and communicate with the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills. Others have written about how to use other clients (such as a Python client) to connect to CAS and to call CAS actions.

### What procedures do you get when you order a programming-oriented Viya product?

When you purchase a product in SAS Viya, you get three kinds of computational capabilities:

• Actions, which run on the CAS server. You can call an action from any client language.
• CAS-enabled procedures, which are parsed on the SAS client but call CAS actions "under the covers."
• Legacy SAS procedures that run on the SAS client, just as they do in SAS 9.

Obviously, the CAS-enabled and legacy procedures are only available on the SAS client.

To give an example, SAS Visual Statistic contains action sets (which contain actions), CAS-enabled procedures, and all the procedures in SAS/STAT. All procedures run on the SAS compute server, which is also called the SAS client. However, the CAS-enabled procedures call one or more actions that run on the CAS server, then display the results as ODS tables and graphics.

A CAS-enabled procedure performs very few computations on the client. In contrast, a legacy procedure that is not CAS-enabled performs all of its computations on the SAS client. It does not call any CAS actions. An example of a CAS-enabled procedure is the REGSELECT procedure, which performs linear regression with feature selection. It contains many of the features of the GLMSELECT procedure, which is a traditional regression procedure in SAS/STAT.

### Can you run legacy programs in SAS Viya?

Naturally, SAS 9 statistical programmers want to make sure that their existing programs will run in Viya. That is why SAS Visual Statistics comes with the legacy SAS/STAT procedures. The same is true for SAS/ETS proceduires, which are shipped as part of SAS Econometrics. And the SAS IML product in Viya contains PROC IML, which runs on the SAS client, as well as the newer iml action, which runs in CAS.

So what happens if, for example, you call PROC REG in SAS and ask it to perform a regression on a SAS data set in the WORK libref? PROC REG will do what it has always done. It will run in the SAS environment. It will not run on the CAS server. It will not magically run faster than it used to in SAS 9. The performance of most legacy programs should be comparable to their performance in SAS 9.

There are some exceptions to that rule. Some SAS procedures have been enhanced and now perform better than their SAS 9 counterparts. For example, the SAS IML team has enhanced certain functions in PROC IML in SAS Viya so that they have better performance in SAS Viya than the SAS 9 version of the procedure. The SAS IML development team is focused exclusively on improving performance and adding features in SAS Viya, both PROC IML and the iml action.

Another exception is that some Base SAS procedures were enhanced so that they behave differently depending on the location of the data. Many Base SAS procedures are now hybrid procedures. If you tell them to analyze a CAS table, they will call an action, which runs in CAS, and retrieve the results. If you tell them to analyze a SAS data set, they will run on the SAS client, just like they do in SAS 9.

To make the situation more complicated, some of the legacy Base SAS procedures support features that are not supported in CAS. When you request an option that is not supported in CAS, the procedure will download the data from CAS into SAS and perform the computation on the client. This can be inefficient, so check the documentation before you start using legacy procedures to analyze CAS tables. As a rule, I prefer to use legacy procedures to analyze SAS data sets on the client; I use newer CAS-enabled procedures for analyzing CAS tables.

### Summary

At a recent seminar for SAS 9 programmers, there were lots of questions about SAS Viya and what it means to for a SAS programmer to start programming in Viya. This article is the first of several articles that I intend to write for SAS programmers. I don't know everything, but I hope that other SAS programmers will join me in sharing what they have learned about the process of migrating from SAS 9 to SAS Viya.

If you are a SAS programmer who has general questions about SAS Viya, let me know what you are thinking by leaving a comment. I might not know the answer, but I'll try to find someone who does. For programming questions ("How do I do XYZ in Viya?"), post your question and sample data to the SAS Support Communities. There is a dedicated community for SAS Viya questions, so take advantage of the experts there.

The post What is a CAS-enabled procedure? appeared first on The DO Loop.

In a previous article, I showed how to use the CVEXHULL function in SAS/IML to compute the convex hull of a finite set of planar points. The convex hull is a convex polygon, which is defined by its vertices. To visualize the polygon, you need to know the vertices in sequential order. Fortunately, the CVEXHULL function returns the vertices in counterclockwise order, so you can connect the points in the order they are provided.

But what if the CVEXHULL function did not output the vertices in sequential order? It turns out that you can perform a simple computation that orders the vertices of a convex polygon. This article shows how.

### Connect unsorted vertices of a convex polygon

Let's start by defining six points that form the vertices of a convex polygon. The vertices are not sorted in any way, so if you connect the points in the order given, you get a star-like pattern:

proc iml; P = { 0 2, 6 0, 0 0, 4 -1, 5 2, 2 -1 };   /* create a helper function to connect vertices in the order they are given */ start GraphVertices(P); Poly = P // P[1,]; /* repeat first point to close the polygon */ call series(Poly[,1], Poly[,2]) procopt="aspect=1" option="markers" grid={x y}; finish; title "Connect Unsorted Points"; run GraphVertices(P);

The line plot does not look like a convex polygon because the vertices were not ordered. However, it is not hard to order the vertices of a convex polygon.

### Order vertices of a convex polygon

The key idea is to recall that the centroid of a polygon is the arithmetic average of the coordinates of its vertices. For a convex polygon, the centroid always lies in the interior of the polygon, as shown in the graph to the right. The centroid is displayed as a large dot in the center of the polygon.

You can use the centroid as the origin and construct the vectors from the centroid to each vertex. For each vector, you can compute the angle made with the horizontal axis. You can then sort the angles, which provides a sequential ordering of the vertices of the convex polygon. In the graph at the right, the angles are given in degrees, but you can use radians for all the calculations. For this example, the angles that the vectors make with the positive X axis are 38 degrees, 150 degrees, 187 degrees, and so forth.

The SAS/IML language provides a compact way to compute the centroid and the vectors. You can use the ATAN2 function in SAS to compute the angles, as follows:

C = mean(P); /* centroid of a convex polygon */ v = P - C; /* vectors from centroid to points on convex hull */ radian = atan2( v[,2], v[,1] ); /* angle with X axis for each vector */

At this point, you could sort the vertices by their radian measure. However, the ATAN2 function returns values in the interval (-π π]. If you prefer to order the vertices by using the standard "polar angle," which is in the interval [0, 2π), you can add 2π to any negative angle from the ATAN2 function. You can then use the angles to sort the vertices, as follows:

/* Optional: The ATAN2 function returns values in (-pi, pi]. Convert to values in (0,2*pi] */ pi = constant('pi'); idx = loc( radian<0 ); radian[idx] = radian[idx] + 2*pi;   /* now sort the points by angle */ call sortndx(idx, radian); /* get row numbers that sort the angles */ Q = P[idx,]; /* sort the vertices of the polygon by their angle */ title "Connect Sorted Points"; run GraphVertices(Q);

With this ordering, the vertices are now sorted in sequential order according to the angle each vertex makes with the centroid.

In summary, the CVEXHULL function in SAS/IML returns vertices of a convex polygon in sequential order. But if you are even given the vertices in a random order, you can perform a computation to sort them by using the angle each vertex makes with the centroid.

The post The order of vertices on a convex polygon appeared first on The DO Loop.

Given a cloud of points in the plane, it can be useful to identify the convex hull of the points. The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as its vertices. An example of a convex hull is shown to the right. The convex hull is the polygon that encloses the points.

This article describes the CVEXHULL function in the SAS/IML language, which computes the convex hull for a set of planar points.

### Visualize the convex hull

The graph at the beginning of this article shows the convex hull as a shaded polygon. The original points are overlaid on the polygon and labeled by the observation number. The six points that form the convex hull are colored red. This section shows how to create the graph.

The graph uses the POLYGON statement to visualize the convex hull. This enables you to shade the interior of the convex hull. If you do not need the shading, you could use a SERIES statement, but to get a closed polygon you would need to add the first point to the end of the list of vertices.

To create the graph, you must write the relevant information to a SAS data set so that you can use PROC SGPLOT to create the graph. The following statements write the (x,y) coordinates of the point, the observation numbers (for the data labels), the coordinates of the convex hull vertices (cx, cy), and an ID variable, which is required to use the POLYGON statement. It also creates a binary indicator variable that is used to color-code the markers in the scatter plot:

x = points[,1]; y = points[,2]; obsNum = t(1:nrow(points)); /* optional: use observation numbers for labels */ /* The points on the convex hull are sorted in counterclockwise order. If you use a series plot, you must repeat the first point so that the polygon is closed. For example, use convexHull = convexHull // convexHull[1,]; */ cx = convexHull[,1]; cy = convexHull[,2]; ID = j(nrow(cx),1,1); /* create ID variable for POLYGON statement */ /* create a binary (0/1) indicator variable */ OnHull = j(nrow(x), 1, 0); /* most points NOT vertices of the convex hull */ OnHull[hullIdx] = 1; /* these points are the vertices */   create CHull var {'x' 'y' 'cx' 'cy' 'ID' 'obsNum' 'OnHull'}; append; close; QUIT;

In the graph at the top of this article, vertices of the convex hull are colored red and the other points are blue. When you use the GROUP= option in PROC SGPLOT statements, the group colors might depend on the order of the observations in the data. To ensure that the colors are consistent regardless of the order of the data set, you can use a discrete attribute map to associate colors and values of the grouping variable. For details about using a discrete attribute maps, see Kuhfeld's 2016 article.

To use a discrete attribute map, you need to define it in a SAS data set, read it by using the DATTRMAP= option on the PROC SGPLOT statement, and specify it by using the ATTRID= statement on the SCATTER statement, as follows:

### Step 4: Analyze the bootstrap distribution

You can graph the bootstrap distribution for each pair of variables by using the SGPANEL procedure. The following statements create a basic panel:

proc sgpanel data=BootLong; panelby Label; histogram Corr; run;

However, a goal of this analysis is to compare the Fisher confidence intervals (CIs) to the bootstrap distribution of the correlation estimates. Therefore, it would be nice to overlay the original correlation estimates and Fisher intervals on the histograms of the bootstrap distributions. To do that, you can concatenate the BootLong and the FisherCorr data sets, as follows:

/* STEP 4: Analyze the bootstrap distribution */ /* overlay the original sample estimates and the Fisher confidence intervals */ data BootLongCI; set BootLong FisherCorr(rename=(Var=Var1 WithVar=Var2 Corr=Estimate)); label Corr="Correlation Estimate"; run;   title "Bootstrap Correlation Estimates (B=&NumSamples)"; title2 "Overlay Fisher Confidence Intervals"; proc sgpanel data=BootLongCI; panelby Label / novarname; histogram Corr; refline Estimate / axis=X lineattrs=(color=red); refline LCL / axis=X ; refline UCL / axis=X ; run;

The graph is shown at the top of this article. In the graph, the red lines indicate the original estimates of correlation from the data. The gray vertical lines show the Fisher CIs for the data. The histograms show the distribution of the correlation estimates on 10,000 bootstrap re-samples of the data.

Visually, it looks like the Fisher confidence intervals do a good job of estimating the variation in the correlation estimates. It looks like most of the bootstrap estimates are contained in the 95% CIs.

### The coverage of Fisher's confidence intervals

You can use the bootstrap estimates to compute the empirical coverage probability of the Fisher CIs. That is, for each pair of variables, what proportion of the bootstrap estimates (out of 10,000) are within the Fisher intervals? If all pairs of variables were bivariate normal, the expected proportion would be 95%.

You can use the logical expression (Corr>=LCL & Corr<=UCL) to create a binary indicator variable. The proportion of 1s for this variable equals the proportion of bootstrap estimates that are inside a Fisher interval. The following DATA step creates the indicator variable and the call to PROC FREQ counts the proportion of bootstrap estimates that are in the Fisher interval.

/* how many bootstrap estimates are inside a Fisher interval? */ proc sort data=BootLong; by Label; run;   data Pctl; merge BootLong FisherCorr(rename=(Corr=Estimate)); by Label; label Corr="Correlation Estimate"; InLimits = (Corr>=LCL & Corr<=UCL); run;   proc freq data=Pctl noprint; by Label; tables InLimits / nocum binomial(level='1'); output out=Est binomial; run; proc print data=Est noobs; var Label N _BIN_ L_BIN U_BIN; run;

The table shows empirical estimates for the coverage probability of the Fisher intervals. The _BIN_ column contains estimates of the binary proportion. The L_BIN and U_BIN columns contain confidence intervals for the proportions. You can see that all intervals include more than 90% of the bootstrap estimates, and some are close to 95% coverage. The following graph visualizes the empirical coverage for each pair of variables:

ods graphics / width=400px height=240px; title "Proportion of Bootstrap Estimates in Fisher Intervals"; proc sgplot data=Est noautolegend; scatter x=_BIN_ y=Label / xerrorlower=L_Bin xerrorupper=U_Bin; refline 0.95 / axis=x; xaxis max=1 grid; yaxis grid display=(nolabel) offsetmin=0.1 offsetmax=0.1; run;

The graph indicates that the many Fisher intervals are too wide (too conservative). Their empirical coverage is greater than the 95% coverage that you would expect for bivariate normal data. On the other hand, the Fisher interval for Corr(X2,X4) has lower coverage than expected.

### Summary

This article shows how to perform a bootstrap analysis for correlation among multiple variables. The process of resampling the data (PROC SURVEYSELECT) and generating the bootstrap estimates (PROC CORR with a BY statement) is straightforward. However, to analyze the bootstrap distributions, you need to restructure the statistics by converting the output data from a wide form to a long form. You can use this technique to perform your own bootstrap analysis of correlation coefficients.

Part of the analysis also used the bootstrap distributions to estimate the coverage probability of the Fisher confidence intervals for these data. The Fisher intervals have 95% coverage for normally distributed data. For the iris data, most of the intervals have coverage that is greater than expected, although one interval has lower-than-expected coverage. These results are strongly dependent on the data.

The post Bootstrap correlation coefficients in SAS appeared first on The DO Loop.

For graphing multivariate data, it is important to be able to convert the data between "wide form" (a separate column for each variable) and "long form" (which contains an indicator variable that assigns a group to each observation). If the data are numeric, the wide data can be represented as an N x p matrix. The same data in long form can be represented by two columns and Np rows, where the first column contains the data and the second column identifies that the first N rows belong to the first group, the second N rows belong to the second group, and so on. Many people have written about how to use PROC TRANSPOSE or the SAS DATA step to convert from wide form to long form and vice versa.

The conversion is slightly different for a symmetric matrix because you might want to display only the upper-triangular portion of the matrix. This article examines how to convert between a symmetric correlation matrix (wide form) and a "compressed symmetric" form that only stores the elements in the upper-triangular portion of the symmetric matrix.

### Compressed symmetric storage

When the data represents a symmetric N x N matrix, you can save space by storing only half the data, such as only the upper triangular portion of the matrix. This is called compressed symmetric storage. Familiar examples of symmetric matrices include correlation, covariance, and distance matrices. For example, you can represent the 3 x 3 symmetric matrix A = {3 2 1, 2 5 0, 1 0 9} by storing only the upper triangular entries U = {3, 2, 1, 5, 0, 9}. There are N(N+1)/2 upper triangular elements. For a correlation matrix, you don't need to store the diagonal elements; you only need to store N(N-1)/2 elements.

When you run a correlation analysis by using PROC CORR in SAS, you usually get a square symmetric matrix of correlation estimates. However, if you use the FISHER option to get confidence limits for the correlation coefficients, then you get a table that shows the correlation estimates for each pair of variables. I will demonstrate this point by using the Sashelp.Iris data. To make the discussion general, I will put the variable names into a macro variable so that you can concentrate on the main ideas:

data sample; /* use a subset of the Iris data */ set Sashelp.Iris(where=(Species="Versicolor")); label PetalLength= PetalWidth= SepalLength= SepalWidth=; run;   %let varNames = PetalLength PetalWidth SepalLength SepalWidth; /* count how many variables are in the macro variable */ data _null_; nv = countw("&varNames"); call symputx("numVars", nv); run; %put &=numVars; /* for this example, numVars=4 */   proc corr data=sample fisher(rho0=0.7 biasadj=no) noprob outp=CorrOut; var &varNames; ods select PearsonCorr FisherPearsonCorr; ods output FisherPearsonCorr = FisherPCorr; run;

The usual square symmetric correlation matrix is shown at the top of the image. I have highlighted the elements in the strictly upper triangular portion of the matrix. By using these values, you can re-create the entire matrix. The second table is an alternative display that shows only the six pairwise correlation estimates. The arrangement in the second table can be useful for plotting the pairwise correlations in a bar chart.

It is useful to be able to convert the first table into the format of the second table. It is also useful to be able to augment the second table with row/column information so that each row of the second table is easily associated with the corresponding row and column in the first table.

### Convert a symmetric table to a pairwise list

Let's use the information in the first table to list the correlations for pairs of variables, as shown in the second table. Notice that the call to PROC CORR uses the OUTP= option to write the correlation estimates to a SAS data set. You should print that data set to understand its structure; it contains more than just the correlation estimates! After you understand the structure, the following DATA step will make more sense. The main steps of the transformation are:

• Create a variable named ROW that has the values 1, 2, 3, ..., N. In the following DATA step, I use the MOD function because in a subsequent article, I will use the same DATA step to perform a bootstrap computation in which the data set contains B copies of the correlation matrix.
• Create a variable named COL. For each value of ROW, the COL variable loops over the values ROW+1, ROW+2, ..., N.
• Put the correlation estimates into a single variable named CORR.
• If desired, create a short label that identifies the two variables that produced the correlation estimate.
/* Convert correlations in the OUTP= data set from wide to long form (pairwise statistics) */ data Long; set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1)); length Var2 $32 Label$13; array X[*] &varNames; row = 1 + mod(_N_-1, &NumVars); /* 1, 2, ..., N, 1, 2, ..., N, 1, 2, ... */ do col = row+1 to &NumVars; /* names for columns */ Var2 = vname(X[col]); Corr = X[col]; Label = cats("Corr(X",row,",X",col,")"); output; end; drop _TYPE_ &varNames; run;   proc print data=Long; run;

The ROW and COL variables identify the position of the correlation estimates in the upper-triangular portion of the symmetric correlation matrix. You can write a similar DATA step to convert a symmetric matrix (such as a covariance matrix) in which you also want to display the diagonal elements.

### Identify the rows and columns of a pairwise list

The data for the second table is stored in the FisherPCorr data set. Although it is possible to convert the pairwise list into a dense symmetric matrix, a more useful task is to identify the rows and columns for each entry in the pairwise list, as follows:

/* Write the (row,col) info for the list of pairwise correlations in long form */ data FisherCorr; set FisherPCorr; retain row col 1; if col>=&NumVars then do; row+1; col=row; end; col+1; Label = cats("Corr(X",row,",X",col,")"); run;   proc print data=FisherCorr; var Var WithVar Corr row col Label; run;

From the ROW and COL variables, you can assemble the data into a symmetric matrix. For example, I've written about how to use the SAS/IML language to create a symmetric correlation matrix from the strictly upper-triangular estimates.

### Summary

For a symmetric matrix, you can display the matrix in a wide format (an N x N matrix) or you can display the upper-triangular portion of the matrix in a single column (long format). This article shows how to use the SAS DATA step to convert a correlation matrix into a long format. By adding some additional variables, you can identify each value with a row and column in the upper-triangular portion of the symmetric matrix.

The post Convert a symmetric matrix from wide to long form appeared first on The DO Loop.

One of the benefits of using the SWEEP operator is that it enables you to "sweep in" columns (add effects to a model) in any order. This article shows that if you use the SWEEP operator, you can compute a SSCP matrix and use it repeatedly to estimate any linear regression model that uses those effects.

This post relates to a previous post, in which I showed that you never need to use a permutation matrix to rearrange the order of rows and columns of a matrix. As an application, I used the sum of squares and crossproducts (SSCP) matrix, which is used to estimate the regression coefficients in a least-squares regression model. Being able to permute the rows and columns of the SSCP matrix efficiently means that you can solve the linear regression problem very quickly for any ordering of the columns of the design matrix. But if you use the SWEEP operator, you do not need to permute the SSCP matrix at all!

### The SSCP matrix in PROC REG

Before discussing the SWEEP operator, let's review a little-used feature of PROC REG in SAS. The REG procedure is interactive, which means that you can interactively add or delete effects from a model as part of the model-building process. However, if you are going to explore several different models, you must use the VAR statement to specify all variables that you might conceivably want to use BEFORE the first RUN statement. Why? Because when PROC REG encounters the RUN statement it builds the SSCP matrix for the variables that you have specified. As stated in the documentation, for each subsequent model, "the procedure selects the appropriate crossproducts from the [SSCP] matrix." In other words, PROC REG re-uses that SSCP matrix for every MODEL statement.

Let's look at an example. The following call to PROC REG computes the SSCP matrix for five variables and the intercept effect. It uses that SSCP matrix to compute the parameter estimates:

proc reg data=sashelp.cars plots=NONE; ods select ParameterEstimates; VAR MSRP EngineSize Horsepower MPG_Highway Weight; /* vars in the SSCP */ FULL: model MSRP = EngineSize Horsepower MPG_Highway Weight; /* uses the SSCP */ run;

Because PROC REG is an interactive procedure, the procedure does not exit when it encounters the RUN statement. It remembers the SSCP matrix until the procedure exits. If you submit additional MODEL statements, PROC REG will use the SSCP matrix to estimate the new parameters, as long as the variables were previously specified. For example, you can specify the same model, but list the variables in a different order. Or you can estimate a reduced model that contains only a subset of the variables, as follows:

 PERMUTE: model MSRP = Horsepower Weight EngineSize MPG_Highway; run; PARTIAL: model MSRP = Weight Horsepower; run;quit;

The output for the "PERMUTE" model is not shown, because the estimates are the same as for the "FULL" model, but in a different order. It is important to note that PROC REG estimates the second and third models without re-reading the data. It simply re-uses the original SSCP matrix. As I have previously shown, computing the SSCP matrix is often 90% of the work of computing regression estimates, which means that the second and third models are computed very quickly.

The next section shows that how the SWEEP operator can quickly compute the parameter estimates for any model and for any order of the variables. There is no need to physically permute the rows and columns of the SSCP matrix.

### The SWEEP operator

You can use the SAS/IML language to illuminate how PROC REG can compute one SSCP matrix and re-use it for all subsequent models. The following SAS/IML statements read the data and compute the SSCP matrix. Following Goodnight (1979), the last column contains the response variable, but this is merely a convenience.

proc iml; varNames = {'EngineSize' 'Horsepower' 'MPG_Highway' 'Weight'}; use sashelp.cars; read all var varNames into X; read all var 'MSRP' into Y; close; /* add intercept column and response variable */ X = j(nrow(X), 1, 1) || X; /* the design matrix */ k = ncol(X); /* number of effects */   varNames = 'Intercept' || varNames || 'MSRP'; XY = X || Y; SSCP = XY * XY; /* the only SSCP that we'll need */   /* by using the sweep operator (1st row=mean, last col=param est*/ S = sweep(SSCP, 1:k); rCol = nrow(SSCP); /* the response column */ b = S[1:k, rCol]; /* parameter estimates for regression coefficients */ print b[r=varNames];

The parameter estimates are the same as computed by PROC REG. They were obtained by sweeping the columns of the SSCP matrix in the order 1, 2, ..., 5. You can think of the vector v=1:k as the "identity permutation" of the rows. Then the SWEEP operator for the variables in their original order is the operation S = sweep(SSCP, v).

The SWEEP function in SAS/IML enables you to specify any vector of indices as the second argument. In this way, you can sweep the columns of the SSCP matrix in any order. For example, the "PERMUTE" model from the PROC REG example corresponds to sweeping the SSCP matrix in the order v = {1 3 5 2 4}. As the following example shows, you get the same parameter estimates because the models are the same:

/* sweep original SSCP in the order of v */ v = {1 3 5 2 4}; pVarNames = varNames[,v]; /* effect names in permuted order */ S = sweep(SSCP, v); /* sweep the specified rows in the specified order */ b = S[v, rCol]; /* get the parameter estimates in new order */ print b[r=pVarNames];

As expected, the parameter estimates are the same, but the order of the parameters is different.

You can also estimate parameters for any model that uses a subset of the columns. For example, the model that has the explanatory variables Weight and Horsepower (the fifth and third effects) is computed from the SSCP matrix by sweeping the columns in the order {1 5 3}, as follows:

/* Perform a partial sweep for the model {1 5 3}. No need to form a new design matrix or to extract a portion of the SCCP. */ v = {1 5 3}; pVarNames = varNames[,v]; /* effect names for this model */ S = sweep(SSCP, v); /* sweeps in the variables in the order of v */ b = S[v, rCol]; /* get the parameter estimates */ print b[r=pVarNames];

The output matches the "PARTIAL" model from PROC REG.

### Summary

In a previous post, I showed an efficient way to permute the order of rows and columns of a matrix. But the current article shows that you do not need to permute the elements of an SSCP matrix if you use the SWEEP operator. The SWEEP operator enables you to specify the order in which you want to add effects to the model. From a design matrix, you can compute the SSCP matrix. (This is most of the work!) You can then quickly compute any model that uses the columns of the design matrix in any order.

This useful property of the SWEEP operator is among the many features discussed in Goodnight, J. (1979), "A Tutorial on the SWEEP Operator," The American Statistician. For other references, see the end of my previous article about the SWEEP operator.

The post More on the SWEEP operator for least-square regression models appeared first on The DO Loop.

Do you ever use a permutation matrix to change the order of rows or columns in a matrix? Did you know that there is a more efficient way in matrix-oriented languages such as SAS/IML, MATLAB, and R? Remember the following tip:
Never multiply with a large permutation matrix! Instead, use subscripts to permute rows and columns.

This article explains why it is not necessary to multiply by a permutation matrix in a matrix language. The advice is similar to the tip in the article, "Never multiply with a large diagonal matrix," which discusses an efficient way to use diagonal matrices in matrix computations. That article recommends that you never multiply with a large diagonal matrix. Instead, use elementwise multiplication of rows and columns.

### Permutation matrices

Permutation matrices have many uses, but their primary use in statistics is to rearrange the order of rows or columns in a matrix. A previous article shows how to create a permutation matrix and how to use it to permute the order of rows and columns in a matrix. I ended the article by noting that "there is an easy alternative way" to permute rows and columns: use subscript notation. Subscripts enable you to permute rows and columns efficiently and intuitively. If you use matrix multiplication to permute columns of a matrix, you might not remember whether to multiply the permutation matrix on the left or on the right, but if you use subscripts, it is easy to remember the syntax.

The following example defines a 5 x 5 matrix, A, with integer entries and a function that creates a permutation matrix. (The permutation matrix is the transpose of the matrix that I used in the previous article. I think this definition is easier to use.) If you multiply A on the right (A*Q), you permute the columns. If you multiply A on the left (Q*A), you permute the rows, as shown:

proc iml; A = shape(1:25, 5, 5); /*{ 1 2 3 4 5, 6 7 8 9 10, ... 21 22 23 24 25} */ v = {3 5 2 4 1}; /* new order for the columns or rows */   /* Method 1: Explicitly form the permutation matrix */ /* the following matrix permutes the rows when P is used on the left (P*A) Use P on the right to permute columns. */ start PermutionMatrixT(v); return( I(ncol(v))[ ,v] ); finish; Q = PermutionMatrixT(v); C = A*Q; /* permute the columns */ R = Q*A; /* permute the rows */ ods layout gridded advance=table columns=2; print R[r=v L="Permute Rows"], C[c=v L="Permute Columns"]; ods layout end;

This method explicitly forms a permutation matrix to permute the rows and/or columns. To permute the rows of an N x p matrix, the permutation matrix is an N x N matrix, and the matrix multiplication requires O(N2p) floating-point operations. To permute the columns, the permutation matrix is p x p, and the matrix multiplication requires O(Np2) operations.

### Permutation of indices for subscripts

In contrast, if you permute a matrix by using subscript indices, no floating-point operations are needed. The syntax A[v,] permutes the rows of the matrix A by using the indices in the vector v. The syntax A[,v] permutes the columns of A. The syntax A[v,v] permutes both the rows and the columns. Here is the previous permutation, but without using permutation matrices:

/* Method 2: Specify subscript indices to form the permutation matrix */ C = A[ ,v]; /* permute the columns */ R = A[v, ]; /* permute the rows */ print R[r=v L="Permute Rows"], C[c=v L="Permute Columns"];

The output is the same and is not shown. No computations are required, only copying data from one matrix to another. For v = {3 5 2 4 1}, the first syntax means, "copy the columns in the order, 3, 5, 2, 4, and 1." The second syntax specifies the order of the rows. You don't have to remember which side of A to put the permutation matrix, nor whether to use a transpose operation.

### An application: Order of effects in a regression model

In my previous article, I used a correlation matrix to demonstrate why it is useful to know how to permute rows and columns of a matrix. Another application is the order of columns in a design matrix for linear regression models. Suppose you read in a design matrix where the columns of the matrix are in a specified order. For some statistics (for example, the Type I sequential sum of squares), the order of the variables in the model are important. The order of variables is also used in regression techniques such as variable selection methods. Fortunately, if you have computed the sum of squares and crossproducts matrix (SSCP) for the variables in the original order, it is trivial to permute the matrix to accommodate any other ordering of the variables. If you have computed the SSCP matrix in one order, you can obtain it for any order without recomputing it.

The SSCP matrix is part of the "normal equations" that are used to obtain least-squares estimates for regression parameters. Let's look at an example. Suppose you compute the SSCP matrix for a data matrix that has five effects, as shown below:

/* permute columns or rows of a SSCP matrix (XpX) */ use sashelp.cars; varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'}; read all var varNames into Z; close;   X = Z - mean(Z); /* center the columns */ XpX = X*X; print XpX[r=varNames c=varNames];

Now, suppose you want the SSCP matrix for the same variables in a different order. For example, you want the order {'MPG_City' 'Weight' 'Horsepower' 'MPG_Highway' 'EngineSize'}; which is the permutation {3, 5, 2, 4, 1} of the original ordering. You have two choices: permute the columns of the design matrix and recompute the SSCP or permute the existing SSCP matrix. The first way is intuitive but less efficient. It is shown below:

/* inefficient way: create new design matrix, Y, which permutes columns of X. Then compute SSCP. */ v = {3 5 2 4 1}; Q = PermutionMatrixT(v); Y = X*Q; sscp = Y*Y; /* = Q*X*X*Q = Q*(X*X)*Q */ pVarNames = varNames[,v]; /* get new order for var names */ print sscp[r=pVarNames c=pVarNames];

The second way is a one-liner and does not require any computations. You don't have to form the permutation matrix, Q. You only need to use subscripts to permute the new SSCP:

/* efficient way: don't form Q; just permute the indices */ sscp2 = XpX[v,v]; print sscp2[r=pVarNames c=pVarNames];

The output is exactly the same and is not shown. After you have obtained the SSCP matrix in the new order, you can use it to compute regression statistics. If you are solving the normal equations, you also need to permute the right-hand side of the equations, which will be equal to Q*X`*y.

### Summary

In summary, you rarely need to explicitly form a permutation matrix and use matrix multiplication to permute the rows or columns of a matrix. In most situations, you can use subscript indices to permute row, columns, or both rows and columns.

The post Never multiply with a large permutation matrix appeared first on The DO Loop.