Statistical Programming

November 22, 2021

I attended a seminar last week whose purpose was to inform SAS 9 programmers about SAS Viya. I could tell from the programmers' questions that some of them were confused about three basic topics:

  • What are the computing environments in Viya, and how should a programmer think about them?
  • What procedures do you get when you order a programming-oriented Viya product such as SAS Visual Statistics or SAS Econometrics? Of these procedures, which are CAS-enabled?
  • If you have legacy SAS programs, can you still run them if your company migrates from SAS 9 to SAS Viya?

I am a programmer, so I thought it might be helpful for me to discuss these topics programmer-to-programmer. In a series of articles, I am going to discuss issues that a SAS statistical programmer might face when migrating to Viya from SAS 9. I use the term "SAS 9" to refer to the SAS Workspace Server that runs procedures in traditional products such as Base SAS, SAS/STAT, and SAS/ETS. So "SAS 9" refers to the most recent version of the "classic" SAS programming environment. It is the version of SAS that existed before SAS Viya was created.

Clients and servers: Where do computations occur in SAS Viya?

In SAS 9, a procedure runs on the SAS Workspace Server. In SAS 9, the word "client" refers to a program such as Enterprise Guide (EG) or SAS Studio, which runs on a PC and submits code to the SAS Workspace Server. The server computes the results and sends tables and graphs back to the client, which displays them. Typically, the input and output data sets remain on the server.

You can think of SAS Viya as having two main components: the CAS server, where the data are stored and the computations are performed, and support for several client languages. A client language enables you to connect to the CAS server and tell it what analyses you want to perform. So, in the world of Viya, "client" no longer refers to a GUI like EG, but to an entire programming environment such as SAS, Python, or R. The purpose of the client software is to connect to CAS, submit actions, and get back results. You then use the capabilities of the client language to display the results as a table or graph. For example, the SAS client uses ODS to display tables and graphs. In Python, you might use matplotlib to graph the results. In R, you might use ggplot. In all cases, you can also use the native capabilities of the client language (the DATA step, pandas, the tidyverse, ...) to modify, aggregate, or enhance the output.

I use the SAS client to connect to and communicate with the CAS server. By using a SAS client to communicate with CAS, I can leverage my 25 years of SAS programming knowledge and skills. Others have written about how to use other clients (such as a Python client) to connect to CAS and to call CAS actions.

What procedures do you get when you order a programming-oriented Viya product?

When you purchase a product in SAS Viya, you get three kinds of computational capabilities:

  • Actions, which run on the CAS server. You can call an action from any client language.
  • CAS-enabled procedures, which are parsed on the SAS client but call CAS actions "under the covers."
  • Legacy SAS procedures that run on the SAS client, just as they do in SAS 9.

Obviously, the CAS-enabled and legacy procedures are only available on the SAS client.

To give an example, SAS Visual Statistics contains action sets (which contain actions), CAS-enabled procedures, and all the procedures in SAS/STAT. All procedures run on the SAS compute server, which is also called the SAS client. However, the CAS-enabled procedures call one or more actions that run on the CAS server, then display the results as ODS tables and graphics.

A CAS-enabled procedure performs very few computations on the client. In contrast, a legacy procedure that is not CAS-enabled performs all of its computations on the SAS client. It does not call any CAS actions. An example of a CAS-enabled procedure is the REGSELECT procedure, which performs linear regression with feature selection. It contains many of the features of the GLMSELECT procedure, which is a traditional regression procedure in SAS/STAT.

Can you run legacy programs in SAS Viya?

Naturally, SAS 9 statistical programmers want to make sure that their existing programs will run in Viya. That is why SAS Visual Statistics comes with the legacy SAS/STAT procedures. The same is true for the SAS/ETS procedures, which are shipped as part of SAS Econometrics. And the SAS IML product in Viya contains PROC IML, which runs on the SAS client, as well as the newer iml action, which runs in CAS.

So what happens if, for example, you call PROC REG in SAS and ask it to perform a regression on a SAS data set in the WORK libref? PROC REG will do what it has always done. It will run in the SAS environment. It will not run on the CAS server. It will not magically run faster than it used to in SAS 9. The performance of most legacy programs should be comparable to their performance in SAS 9.

There are some exceptions to that rule. Some SAS procedures have been enhanced and now perform better than their SAS 9 counterparts. For example, the SAS IML team has enhanced certain functions in PROC IML so that they perform better in SAS Viya than in the SAS 9 version of the procedure. The SAS IML development team is focused exclusively on improving performance and adding features in SAS Viya, for both PROC IML and the iml action.

Another exception is that some Base SAS procedures were enhanced so that they behave differently depending on the location of the data. Many Base SAS procedures are now hybrid procedures. If you tell them to analyze a CAS table, they will call an action, which runs in CAS, and retrieve the results. If you tell them to analyze a SAS data set, they will run on the SAS client, just like they do in SAS 9.

To make the situation more complicated, some of the legacy Base SAS procedures support features that are not supported in CAS. When you request an option that is not supported in CAS, the procedure will download the data from CAS into SAS and perform the computation on the client. This can be inefficient, so check the documentation before you start using legacy procedures to analyze CAS tables. As a rule, I prefer to use legacy procedures to analyze SAS data sets on the client; I use newer CAS-enabled procedures for analyzing CAS tables.


At a recent seminar for SAS 9 programmers, there were lots of questions about SAS Viya and what it means for a SAS programmer to start programming in Viya. This article is the first of several articles that I intend to write for SAS programmers. I don't know everything, but I hope that other SAS programmers will join me in sharing what they have learned about the process of migrating from SAS 9 to SAS Viya.

If you are a SAS programmer who has general questions about SAS Viya, let me know what you are thinking by leaving a comment. I might not know the answer, but I'll try to find someone who does. For programming questions ("How do I do XYZ in Viya?"), post your question and sample data to the SAS Support Communities. There is a dedicated community for SAS Viya questions, so take advantage of the experts there.

The post What is a CAS-enabled procedure? appeared first on The DO Loop.

November 17, 2021

In a previous article, I showed how to use the CVEXHULL function in SAS/IML to compute the convex hull of a finite set of planar points. The convex hull is a convex polygon, which is defined by its vertices. To visualize the polygon, you need to know the vertices in sequential order. Fortunately, the CVEXHULL function returns the vertices in counterclockwise order, so you can connect the points in the order they are provided.

But what if the CVEXHULL function did not output the vertices in sequential order? It turns out that you can perform a simple computation that orders the vertices of a convex polygon. This article shows how.

Connect unsorted vertices of a convex polygon

Let's start by defining six points that form the vertices of a convex polygon. The vertices are not sorted in any way, so if you connect the points in the order given, you get a star-like pattern:

proc iml;
P = { 0  2,
      6  0,
      0  0,
      4 -1,
      5  2, 
      2 -1 }; 
/* create a helper function to connect vertices in the order they are given */
start GraphVertices(P);
   Poly = P // P[1,];   /* repeat first point to close the polygon */
   call series(Poly[,1], Poly[,2]) procopt="aspect=1" option="markers" grid={x y};
finish;
title "Connect Unsorted Points";
run GraphVertices(P);

The line plot does not look like a convex polygon because the vertices were not ordered. However, it is not hard to order the vertices of a convex polygon.

Order vertices of a convex polygon

The key idea is to recall that the centroid of a polygon is the arithmetic average of the coordinates of its vertices. For a convex polygon, the centroid always lies in the interior of the polygon, as shown in the graph to the right. The centroid is displayed as a large dot in the center of the polygon.

You can use the centroid as the origin and construct the vectors from the centroid to each vertex. For each vector, you can compute the angle made with the horizontal axis. You can then sort the angles, which provides a sequential ordering of the vertices of the convex polygon. In the graph at the right, the angles are given in degrees, but you can use radians for all the calculations. For this example, the angles that the vectors make with the positive X axis are 38 degrees, 150 degrees, 187 degrees, and so forth.

The SAS/IML language provides a compact way to compute the centroid and the vectors. You can use the ATAN2 function in SAS to compute the angles, as follows:

C = mean(P);           /* centroid of a convex polygon */
v = P - C;             /* vectors from centroid to points on convex hull */
radian = atan2( v[,2], v[,1] );  /* angle with X axis for each vector */

At this point, you could sort the vertices by their radian measure. However, the ATAN2 function returns values in the interval (-π, π]. If you prefer to order the vertices by using the standard "polar angle," which is in the interval [0, 2π), you can add 2π to any negative angle from the ATAN2 function. You can then use the angles to sort the vertices, as follows:

/* Optional: The ATAN2 function returns values in (-pi, pi]. 
   Convert to values in [0, 2*pi) */
pi = constant('pi');
idx = loc( radian<0 );
radian[idx] = radian[idx] + 2*pi;
/* now sort the points by angle */
call sortndx(idx, radian);   /* get row numbers that sort the angles */
Q = P[idx,];                 /* sort the vertices of the polygon by their angle */
title "Connect Sorted Points";
run GraphVertices(Q);

With this ordering, the vertices are now sorted in sequential order according to the angle each vertex makes with the centroid.
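If you want to verify the ordering independently of SAS, the same computation is easy to sketch in Python. The following snippet is a minimal sketch using only the standard library (the variable names are mine, not from the SAS program); it sorts the six example points by the angle of the centroid-to-vertex vectors:

```python
import math

# the six unsorted vertices from the example above
P = [(0, 2), (6, 0), (0, 0), (4, -1), (5, 2), (2, -1)]

# centroid = arithmetic mean of the vertex coordinates
cx = sum(p[0] for p in P) / len(P)
cy = sum(p[1] for p in P) / len(P)

def polar_angle(p):
    """Angle of the centroid-to-p vector, shifted from (-pi, pi] into [0, 2*pi)."""
    a = math.atan2(p[1] - cy, p[0] - cx)
    return a if a >= 0 else a + 2 * math.pi

order = sorted(range(len(P)), key=lambda i: polar_angle(P[i]))
print([i + 1 for i in order])   # 1-based row numbers -> [5, 1, 3, 6, 4, 2]
```

The first vertex in the sorted order, (5, 2), makes an angle of about 38 degrees with the horizontal axis, which agrees with the angles quoted earlier.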

In summary, the CVEXHULL function in SAS/IML returns the vertices of a convex polygon in sequential order. But even if you are given the vertices in a random order, you can sort them by using the angle each vertex makes with the centroid.

The post The order of vertices on a convex polygon appeared first on The DO Loop.

November 15, 2021

Given a cloud of points in the plane, it can be useful to identify the convex hull of the points. The convex hull is the smallest convex set that contains the observations. For a finite set of points, it is a convex polygon that has some of the points as its vertices. An example of a convex hull is shown to the right. The convex hull is the polygon that encloses the points.

This article describes the CVEXHULL function in the SAS/IML language, which computes the convex hull for a set of planar points.

How to compute a 2-D convex hull in SAS

The CVEXHULL function takes an N x 2 data matrix. Each row of the matrix contains the (x,y) coordinates of a point. The CVEXHULL function returns a column vector of length N.

  • The first few elements are positive integers. They represent the rows of the data matrix that form the convex hull.
  • The positive integers are sorted so that you can visualize the convex hull by connecting the points in order.
  • The remaining elements are negative integers. The absolute values of these integers represent the rows of the data matrix that are contained in the convex hull or are on the boundary of the convex hull but are not vertices.

The following example comes from the SAS/IML documentation. The data matrix contains 18 points. Of those, six are the vertices of the convex hull. The output of the CVEXHULL function is shown below:

proc iml; 
points = {0  2, 0.5 2, 1 2, 0.5 1, 0 0, 0.5 0, 1  0, 
          2 -1,   2 0, 2 1,   3 0, 4 1,   4 0, 4 -1, 
          5  2,   5 1, 5 0,   6 0 }; 
/* Find the convex hull:
   - indices on the convex hull are positive
   - the indices for the convex hull are listed first, in sequential order
   - interior indices are negative */
Indices = cvexhull( points ); 
reset wide;
print (Indices`)[L="Indices"];

I have highlighted the first six elements. The indices tell you that the convex hull is formed by the 1st, 5th, 8th, 14th, 18th, and 15th points of the data matrix. You can use the LOC function to find the positive values in the Indices vector. You can use those values to extract the points on the convex hull, as follows:

hullIdx = indices[loc(indices>0)];   /* the positive indices */
convexHull = points[hullIdx, ];      /* extract rows */
print hullIdx convexHull[c={'cx' 'cy'} L=""];

The output shows that the convex hull is formed by the six points (0,2), (0,0), ..., (5,2).
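As an independent check (this is not how CVEXHULL is implemented internally, just a cross-check), you can recompute the hull with the standard monotone-chain algorithm. The following Python sketch uses only the standard library and recovers the same six vertices:

```python
def cross(o, a, b):
    """Z-component of the cross product (a-o) x (b-o); positive for a left turn."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(pts))
    lower, upper = [], []
    for p in pts:                     # build the lower hull, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):           # build the upper hull, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]    # omit the repeated endpoints

points = [(0,2), (0.5,2), (1,2), (0.5,1), (0,0), (0.5,0), (1,0),
          (2,-1), (2,0), (2,1), (3,0), (4,1), (4,0), (4,-1),
          (5,2), (5,1), (5,0), (6,0)]
hull = convex_hull(points)
print(hull)   # six vertices; (0.5,2) and (1,2) lie on the boundary but are excluded
```

Because the <= in the pop condition discards collinear points, the boundary points (0.5, 2) and (1, 2) are not reported as vertices, just as CVEXHULL excludes them from the vertex list.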

Visualize the convex hull

The graph at the beginning of this article shows the convex hull as a shaded polygon. The original points are overlaid on the polygon and labeled by the observation number. The six points that form the convex hull are colored red. This section shows how to create the graph.

The graph uses the POLYGON statement to visualize the convex hull. This enables you to shade the interior of the convex hull. If you do not need the shading, you could use a SERIES statement, but to get a closed polygon you would need to add the first point to the end of the list of vertices.

To create the graph, you must write the relevant information to a SAS data set so that you can use PROC SGPLOT to create the graph. The following statements write the (x,y) coordinates of the points, the observation numbers (for the data labels), the coordinates of the convex hull vertices (cx, cy), and an ID variable, which is required by the POLYGON statement. The statements also create a binary indicator variable that is used to color-code the markers in the scatter plot:

x = points[,1]; y = points[,2]; 
obsNum = t(1:nrow(points));  /* optional: use observation numbers for labels */
/* The points on the convex hull are sorted in counterclockwise order.
   If you use a series plot, you must repeat the first point 
   so that the polygon is closed. For example, use
   convexHull = convexHull // convexHull[1,]; */
cx = convexHull[,1]; cy = convexHull[,2]; 
ID = j(nrow(cx),1,1);        /* create ID variable for POLYGON statement */
/* create a binary (0/1) indicator variable */ 
OnHull = j(nrow(x), 1, 0);   /* most points NOT vertices of the convex hull */
OnHull[hullIdx] = 1;         /* these points are the vertices */
create CHull var {'x' 'y' 'cx' 'cy' 'ID' 'obsNum' 'OnHull'};
append;
close CHull;

In the graph at the top of this article, vertices of the convex hull are colored red and the other points are blue. When you use the GROUP= option in PROC SGPLOT statements, the group colors might depend on the order of the observations in the data. To ensure that the colors are consistent regardless of the order of the data set, you can use a discrete attribute map to associate colors with values of the grouping variable. For details about using discrete attribute maps, see Kuhfeld's 2016 article.

To use a discrete attribute map, you need to define it in a SAS data set, read it by using the DATTRMAP= option on the PROC SGPLOT statement, and specify it by using the ATTRID= statement on the SCATTER statement, as follows:

data DAttrs;                             /* use DATTRMAP=<data set name> */
length MarkerStyleElement $11.;
ID = "HullAttr";                         /* use ATTRID=<ID value> */
Value = 0; MarkerStyleElement = "GraphData1"; output; /* 0 ==> 1st color */
Value = 1; MarkerStyleElement = "GraphData2"; output; /* 1 ==> 2nd color */
run;
title "Points and Convex Hull";
proc sgplot data=CHull DATTRMAP=DAttrs;
   polygon x=cx y=cy ID=ID / fill outline lineattrs=GraphData2;
   scatter x=x y=y / datalabel=obsNum group=OnHull 
                     markerattrs=(symbol=CircleFilled) ATTRID=HullAttr;
run;

The graph is shown at the top of this article. Notice that the points (0.5, 2) and (1, 2) are on the boundary of the convex hull, but they are drawn in blue because they are not vertices of the polygon.


In summary, you can compute a 2-D convex hull by using the CVEXHULL function in SAS/IML software. The output is a set of indices, which you can use to extract the vertices of the convex hull and to color-code markers in a scatter plot.

By the way, there is a hidden message in the graph of the convex hull. Can you see it? It has been hiding in the SAS/IML documentation for more than 20 years.

In closing, I'll mention that a 2-D convex hull is one computation in the general field of computational geometry. The SAS/IML group is working to add additional functionality to the language, including convex hulls in higher dimensions. In your work, do you have specific needs for results in computational geometry? If so, let me know the details in the comments.

The post Two-dimensional convex hulls in SAS appeared first on The DO Loop.

November 1, 2021

A previous article discusses how to use SAS regression procedures to fit a two-parameter Weibull distribution in SAS. The article shows how to convert the regression output into the more familiar scale and shape parameters for the Weibull probability distribution, which are fit by using PROC UNIVARIATE.

Although PROC UNIVARIATE can fit many univariate distributions, it cannot fit a mixture of distributions. For that task, you need to use PROC FMM, which fits finite mixture models. This article discusses how to use PROC FMM to fit a mixture of two Weibull distributions and how to interpret the results. The same technique can be used to fit other mixtures of distributions. If you are going to use the parameter estimates in SAS functions such as the PDF, CDF, and RAND functions, you cannot use the regression parameters directly. You must transform them into the distribution parameters.

Simulate a mixture of Weibull data

You can use the RAND function in the SAS DATA step to simulate a mixture distribution that has two components, each drawn from a Weibull distribution. The RAND function samples from a two-parameter Weibull distribution Weib(α, β) whose density is given by
\(f(x; \alpha, \beta) = \frac{\alpha}{\beta^{\alpha}} x^{\alpha - 1} \exp \left(-\left(\frac{x}{\beta}\right)^{\alpha }\right)\)
where α is a shape parameter and β is a scale parameter. This parameterization is used by most Base SAS functions and procedures, as well as many regression procedures in SAS. The following SAS DATA step simulates data from two Weibull distributions. The first component is sampled from Weib(α=1.5, β=0.8) and the second component is sampled from Weib(α=4, β=2). For the mixture distribution, the probability of drawing from the first distribution is 0.667 and the probability of drawing from the second distribution is 0.333.
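Parameter order is a common source of bugs when moving between languages. For example, Python's standard-library random.weibullvariate(alpha, beta) takes the scale first and the shape second, the reverse of SAS's RAND("Weibull", shape, scale). The following Python sketch (a hedged cross-check, not part of the SAS program) simulates the same mixture and compares the sample mean to the theoretical mixture mean:

```python
import math
import random

random.seed(12345)
N = 100_000
total = 0.0
for _ in range(N):
    if random.random() < 0.667:
        # SAS: rand("Weibull", 1.5, 0.8) means shape=1.5, scale=0.8;
        # Python's weibullvariate takes (scale, shape) -- note the reversed order
        total += random.weibullvariate(0.8, 1.5)
    else:
        total += random.weibullvariate(2.0, 4.0)
sample_mean = total / N

# theoretical mixture mean: sum over components of p_i * scale_i * Gamma(1 + 1/shape_i)
theo_mean = 0.667 * 0.8 * math.gamma(1 + 1/1.5) + 0.333 * 2.0 * math.gamma(1 + 1/4)
print(abs(sample_mean - theo_mean) < 0.02)   # → True
```

With 100,000 draws, the sample mean lands within a small fraction of a percent of the theoretical mixture mean (about 1.085), which confirms that the scale/shape arguments were passed in the right order.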

After generating the data, you can call PROC UNIVARIATE to estimate the parameters for each component. Notice that this fits each component separately. If the parameter estimates are close to the parameter values, that is evidence that the simulation generated the data correctly.

/* sample from a mixture of two-parameter Weibull distributions */
%let N = 3000;
data Have(drop=i);
call streaminit(12345);
array prob [2] _temporary_ (0.667 0.333);
do i = 1 to &N;
   component = rand("Table", of prob[*]);
   if component=1 then
      d = rand("weibull", 1.5, 0.8);  /* C=Shape=1.5; Sigma=Scale=0.8 */
   else
      d = rand("weibull", 4,   2);    /* C=Shape=4; Sigma=Scale=2 */
   output;
end;
run;
proc univariate data=Have;
   class component;
   var d;
   histogram d / weibull NOCURVELEGEND;  /* fit (Sigma, C) for each component */
   ods select Histogram ParameterEstimates Moments;
   ods output ParameterEstimates = UniPE;
   inset weibull(shape scale) / pos=NE; 
run;
title "Weibull Estimates for Each Component";
proc print data=UniPE noobs;
   where Parameter in ('Scale', 'Shape');
   var Component Parameter Symbol Estimate;
run;

The graph shows a histogram for data in each component. PROC UNIVARIATE overlays a Weibull density on each histogram, based on the parameter estimates. The estimates for both components are close to the parameter values. The first component contains 1,970 observations, which is 65.7% of the total sample, so the estimated mixing probabilities are close to the mixing parameters. I used ODS OUTPUT and PROC PRINT to display one table that contains the parameter estimates from the two groups. PROC UNIVARIATE calls the shape parameter c and the scale parameter σ.

Fitting a finite mixture distribution

The PROC UNIVARIATE call uses the Component variable to identify the Weibull distribution to which each observation belongs. If you do not have the Component variable, is it still possible to estimate a two-component Weibull model?

The answer is yes. The FMM procedure fits statistical models for which the distribution of the response is a finite mixture of distributions. In general, the component distributions can be from different families, but this example is a homogeneous mixture, with both components from the Weibull family. When fitting a mixture model, we assume that we do not know which observations belong to which component. We must estimate the mixing probabilities and the parameters for the components. Typically, you need a lot of data and well-separated components for this effort to be successful.

The following call to PROC FMM fits a two-component Weibull model to the simulated data. As shown in a previous article, the estimates from PROC FMM are for the intercept and scale of the error term for a Weibull regression model. These estimates are different from the shape and scale parameters in the Weibull distribution. However, you can transform the regression estimates into the shape and scale parameters, as follows:

title "Weibull Estimates for Mixture";
proc fmm data=Have plots=density;
   model d = / dist=weibull link=log k=2;
   ods select ParameterEstimates MixingProbs DensityPlot;
   ods output ParameterEstimates=PE0;
run;
/* Add the estimates of Weibull scale and shape to the table of regression estimates */
data FMMPE;
set PE0(rename=(ILink=WeibScale));
if Parameter="Scale" then WeibShape = 1/Estimate;
else WeibShape = ._;  /* ._ is one of the 28 missing values in SAS */
run;
proc print data=FMMPE; 
   var Component Parameter Estimate WeibShape WeibScale;
run;

The program renames the ILink column to WeibScale. It also adds a new column (WeibShape) to the ParameterEstimates table. These two columns display the Weibull shape and scale parameter estimates for each component. Despite not knowing which observation came from which component, the procedure provides good estimates for the Weibull parameters. PROC FMM estimates the first component as Weib(α=1.52, β=0.74) and the second component as Weib(α=3.53, β=1.88). It estimates the mixing parameter for the first component as 0.6 and the parameter for the second component as 0.4.
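The transformation itself is just two arithmetic operations. Here is a hedged Python sketch (the helper name is mine; the numbers are the reported estimates from above, used only to round-trip the arithmetic):

```python
import math

def fmm_to_weibull(intercept, reg_scale):
    """Map PROC FMM's Weibull regression estimates (Intercept, Scale)
    to the distribution's (shape, scale) parameters."""
    return 1.0 / reg_scale, math.exp(intercept)

# regression estimates implied by component 1's reported WeibShape=1.52207
# and WeibScale=0.7351 (a round-trip consistency check)
intercept, reg_scale = math.log(0.7351), 1.0 / 1.52207
shape, scale = fmm_to_weibull(intercept, reg_scale)
print(round(shape, 2), round(scale, 2))   # → 1.52 0.74
```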

The PLOTS=DENSITY option on the PROC FMM statement produces a plot of the data and overlays the component and mixture distributions. The plot is shown below and is discussed in the next section.

The graph of the component densities

The PLOTS=DENSITY option produces a graph of the data and overlays the component and mixture distributions. In the graph, the red curve shows the density of the first Weibull component (W1(d)), the green curve shows the density of the second Weibull component (W2(d)), and the blue curve shows the density of the mixture. Technically, only the blue curve is a "true" density that integrates to unity (or 100% on a percent scale). The components are scaled densities. The integral of a component equals the mixing probability, which for these data are 0.6 and 0.4, respectively. The mixture density equals the sum of the component densities.

Look closely at the legend in the plot, which identifies the component curves by the parameter estimates. Notice that the estimates in the legend are the REGRESSION estimates, not the shape and scale estimates for the Weibull distribution. Do not be misled by the legend. If you plot the PDF
density = PDF("Weibull", d, 0.74, 0.66); /* WRONG! */
you will NOT get the density curve for the first component. Instead, you need to convert the regression estimates into the shape and scale parameters for the Weibull distribution. The following DATA step uses the transformed parameter estimates and demonstrates how to graph the component and mixture densities:

/* plot the Weibull component densities and the mixture density */
data WeibComponents;
retain d1 d2;
array WeibScale[2] _temporary_ (0.7351,  1.8820);  /* =exp(Intercept) */
array WeibShape[2] _temporary_ (1.52207, 3.52965); /* =1/Scale */
array MixParm[2]   _temporary_ (0.6, 0.4);
do d = 0.01, 0.05 to 3.2 by 0.05;
   d1 = MixParm[1]*pdf("Weibull", d, WeibShape[1], WeibScale[1]); 
   d2 = MixParm[2]*pdf("Weibull", d, WeibShape[2], WeibScale[2]); 
   Component = "Mixture        "; density = d1+d2; output;
   Component = "Weib(1.52,0.74)"; density = d1;    output; 
   Component = "Weib(3.53,1.88)"; density = d2;    output; 
end;
run;
title "Weibull Mixture Components";
proc sgplot data=WeibComponents;
   series x=d y=density / group=Component;
   keylegend / location=inside position=NE across=1 opaque;
   xaxis values=(0 to 3.2 by 0.2) grid offsetmin=0.05 offsetmax=0.05;
   yaxis grid;
run;

The density curves are the same, but the legend for this graph displays the shape and scale parameters for the Weibull distribution. If you want to reproduce the vertical scale (percent), you can multiply the densities by 100*h, where h=0.2 is the width of the histogram bins.
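You can check numerically that the mixture curve, unlike the scaled component curves, integrates to unity. The following Python sketch (a hedged cross-check using the two-parameter Weibull density defined earlier and simple trapezoidal integration) does exactly that:

```python
import math

def weib_pdf(x, shape, scale):
    """Two-parameter Weibull density with the (shape, scale) parameterization."""
    return (shape / scale) * (x / scale) ** (shape - 1) * math.exp(-(x / scale) ** shape)

def mixture(x):
    # mixing probabilities 0.6 and 0.4 with the transformed FMM estimates
    return 0.6 * weib_pdf(x, 1.52207, 0.7351) + 0.4 * weib_pdf(x, 3.52965, 1.8820)

# trapezoidal rule on [1e-6, 10]; the density is negligible beyond x = 10
n = 50_000
a, b = 1e-6, 10.0
h = (b - a) / n
integral = h * ((mixture(a) + mixture(b)) / 2 + sum(mixture(a + i*h) for i in range(1, n)))
print(round(integral, 3))   # → 1.0
```

Dropping the 0.6 and 0.4 factors makes each component integrate to 1 on its own; including them makes each scaled component integrate to its mixing probability, which is why the component curves in the PROC FMM plot are scaled densities.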

In general, be aware that the PLOTS=DENSITY option produces a graph in which the legend labels refer to the REGRESSION parameters. For example, if you use PROC FMM to fit a mixture of normal distributions, the parameter estimates in the legend are for the mean and the VARIANCE of the normal distributions. However, if you intend to use those estimates in other SAS functions (such as PDF, CDF, and RAND), you must take the square root of the variance to obtain the standard deviation.


This article uses PROC FMM to fit a mixture of two Weibull distributions. The article shows how to interpret the parameter estimates from the procedure by transforming them into the shape and scale parameters for the Weibull distribution. The article also emphasizes that the graph produced by the PLOTS=DENSITY option has a legend that displays the regression parameters, which are not the same as the parameters that are used by the PDF, CDF, and RAND functions.

The post Fit a mixture of Weibull distributions in SAS appeared first on The DO Loop.

October 18, 2021

This article uses an example to introduce genetic algorithms (GAs) for optimization. It discusses two operators (mutation and crossover) that are important in implementing a genetic algorithm, and it describes some of the choices you must make when you implement these operations.

Some programmers love using genetic algorithms. Genetic algorithms are heuristic methods that can be used to solve problems that are difficult to solve by using standard discrete or calculus-based optimization methods. A genetic algorithm tries to mimic natural selection and evolution by starting with a population of random candidates. Candidates are evaluated for "fitness" by plugging them into the objective function. The better candidates are combined to create a new set of candidates. Some of the new candidates experience mutations. Eventually, over many generations, a GA can produce candidates that approximately solve the optimization problem. Randomness plays an important role. Re-running a GA with a different random number seed might produce a different solution.

Critics of genetic algorithms note two weaknesses of the method. First, you are not guaranteed to get the optimal solution. However, in practice, GAs often find an acceptable solution that is good enough to be used. The second complaint is that the user must make many heuristic choices about how to implement the GA. Critics correctly note that implementing a genetic algorithm is as much an art as it is a science. You must choose values for hyperparameters and define operators that are often based on a "feeling" that these choices might result in an acceptable solution.

This article discusses two fundamental parts of a genetic algorithm: the crossover and the mutation operators. The operations are discussed by using the binary knapsack problem as an example. In the knapsack problem, a knapsack can hold W kilograms. There are N objects, each with a different value and weight. You want to maximize the value of the objects you put into the knapsack without exceeding the weight. A solution to the knapsack problem is a 0/1 binary vector b. If b[i]=1, the ith object is in the knapsack; if b[i]=0, it is not.

A brief overview of genetic algorithms

The SAS/IML User's Guide provides an overview of genetic algorithms. The main steps in a genetic algorithm are as follows:

  • Encoding: Each potential solution is represented as a chromosome, which is a vector of values. The values can be binary, integer-valued, or real-valued. (The values are sometimes called genes.) For the knapsack problem, each chromosome is an N-dimensional vector of binary values.
  • Fitness: Choose a function to assess the fitness of each candidate chromosome. This is usually the objective function for unconstrained problems, or a penalized objective function for problems that have constraints. The fitness of a candidate determines the probability that it will contribute its genes to the next generation of candidates.
  • Selection: Choose which candidates become parents to the next generation of candidates.
  • Crossover (Reproduction): Choose how to produce children from parents.
  • Mutation: Choose how to randomly mutate some children to introduce additional diversity.

This article discusses the crossover and the mutation operators.

The mutation operator

The mutation operator is the easiest operation to understand. In each generation, some candidates are randomly perturbed. By chance, some of the mutations might be beneficial and make the candidate more fit. Others are detrimental and make the candidate less fit.

For a binary chromosome, a mutation consists of changing the parity of some proportion of the elements. The simplest mutation operation is to always change k random elements for some hyperparameter k < N. A more realistic mutation operation is to choose the number of sites randomly according to a binomial probability distribution with hyperparameter pmut. Then k is a random variable that differs for each mutation operation.

The following SAS/IML program chooses pmut=0.2 and defines a subroutine that mutates a binary vector, b. In this example, there are N=17 items that you can put into the knapsack. The subroutine first uses the Binom(pmut, N) probability distribution to obtain a random number of sites, k, to mutate. (But if the distribution returns 0, set k=1.) The SAMPLE function then draws k random positions (without replacement), and the values in those positions are changed.

proc iml;
call randseed(12345);
N = 17;                      /* size of binary vector */
ProbMut= 0.2;                /* mutation in 20% of sites */
/* Mutation operator for a binary vector, b. 
   The number of mutation sites k ~ Binom(ProbMut, N), but not less than 1. 
   Randomly sample (without replacement) k sites. 
   If an item is not in knapsack, put it in; if an item is in the sack, take it out. */
start Mutate(b) global(ProbMut);
   N = nrow(b);
   k = max(1, randfun(1, "Binomial", ProbMut, N)); /* how many sites? */
   j = sample(1:N, k, "NoReplace");                /* choose random elements */
   b[j] = ^b[j];                                   /* mutate these sites */
finish;
 
Items = 5:12;                  /* choose items 5-12 */
b = j(N,1,0);  b[Items] = 1;
bOrig = b;
run Mutate(b);
print (bOrig`)[L="Original b" c=(1:N)]
      (b`)[L="Randomly Mutated b" c=(1:N)];

In this example, the original chromosome has a 1 in locations 5:12. The binomial distribution randomly decides to mutate k=4 sites. The SAMPLE function randomly chooses the locations 3, 11, 15, and 17. The parity of those sites is changed. This is seen in the output, which shows that the parity of these four sites differs between the original and the mutated b vector.

Notice that you must choose HOW the mutation operator works, and you must choose a hyperparameter that determines how many sites get mutated. The best choices depend on the problem you are trying to solve. Typically, you should choose a small value for the probability pmut so that only a few sites are mutated.
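If you want to experiment with this operator outside of SAS, it is easy to sketch in Python with NumPy. The function name and defaults below are my own illustrative choices, not part of the IML program:

```python
import numpy as np

def mutate(b, p_mut, rng):
    """Flip k random sites of the binary vector b, where
    k ~ Binomial(len(b), p_mut), but always flip at least one site."""
    n = len(b)
    k = max(1, rng.binomial(n, p_mut))            # how many sites to mutate?
    sites = rng.choice(n, size=k, replace=False)  # choose k random positions
    b = b.copy()
    b[sites] = 1 - b[sites]                       # change the parity of those sites
    return b

rng = np.random.default_rng(12345)
b = np.zeros(17, dtype=int)
b[4:12] = 1                   # items 5-12 (1-based) are in the knapsack
b2 = mutate(b, 0.2, rng)
print(np.sum(b != b2), "sites were mutated")
```

Because the sampled positions are distinct, the number of sites that differ between the original and mutated vectors equals k.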

In the SAS/IML language, there are several built-in mutation operations that you can use. They are discussed in the documentation for the GASETMUT subroutine.

The crossover operator

The crossover operator is analogous to the creation of offspring through sexual reproduction. You, as the programmer, must decide how the parent chromosomes, p1 and p2, will combine to create two children, c1 and c2. There are many choices you can make. Some reasonable choices include:

  • Randomly choose a location s, 1 ≤ s ≤ N. You then split the parent chromosomes at that location and exchange and combine the left and right portions of the parents' chromosomes. One child chromosome is c1 = p1[1:s] // p2[s+1:N] and the other is c2 = p2[1:s] // p1[s+1:N]. Note that each child gets some values ("genes") from each parent.
  • Randomly choose a location s, 1 ≤ s ≤ N. Divide the first chromosome into subvectors of length s and N-s. Divide the second chromosome into subvectors of length N-s and s. Exchange the subvectors of the same length to form the child chromosomes.
  • Randomly choose k locations. Exchange the locations between parents to form the child chromosomes.
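As an illustration of the first choice (a single random split location), here is a hedged Python sketch; the function name is mine:

```python
import numpy as np

def single_point_cross(p1, p2, rng):
    """Split both parents at a random location s and exchange the tails."""
    n = len(p1)
    s = rng.integers(1, n)               # split point: 1 <= s <= n-1
    c1 = np.concatenate([p1[:s], p2[s:]])  # child1 = head of p1, tail of p2
    c2 = np.concatenate([p2[:s], p1[s:]])  # child2 = head of p2, tail of p1
    return c1, c2

rng = np.random.default_rng(1)
p1 = np.array([1, 1, 1, 0, 0, 0])
p2 = np.array([0, 0, 0, 1, 1, 1])
c1, c2 = single_point_cross(p1, p2, rng)
```

Note that at every position, one child inherits the gene from the first parent and the other child inherits the gene from the second parent.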

The following SAS/IML function implements the third crossover method (sometimes called uniform crossover). The method uses a hyperparameter, ProbCross, which is the probability that each location in the chromosome is selected. On average, about N*ProbCross locations will be selected. In the following program, ProbCross = 0.3, so we expect 17(0.3) = 5.1 values to be exchanged between the parent chromosomes to form the children:

start uniform_cross(child1, child2, parent1, parent2) global(ProbCross);
   b = j(nrow(parent1), 1);
   call randgen(b, "Bernoulli", ProbCross); /* 0/1 vector */
   idx = loc(b=1);                    /* locations to cross */
   child1 = parent1;                  /* initialize children */
   child2 = parent2;
   if ncol(idx)>0 then do;            /* exchange values */
      child1[idx] = parent2[idx];     /* child1 gets some from parent2 */
      child2[idx] = parent1[idx];     /* child2 gets some from parent1 */
   end;
finish;
 
ProbCross = 0.3;                               /* crossover ~30% of sites */
Items = 5:12;   p1 = j(N,1,0);  p1[Items] = 1; /* choose items 5-12 */
Items = 10:15;  p2 = j(N,1,0);  p2[Items] = 1; /* choose items 10-15 */
run uniform_cross(c1, c2, p1, p2);
print (p1`)[L="Parent1" c=(1:N)], (p2`)[L="Parent2" c=(1:N)],
      (c1`)[L="Child1" c=(1:N)], (c2`)[L="Child2" c=(1:N)];

I augmented the output to show how the child chromosomes are created from their parents. For this run, the selected locations are 1, 8, 10, 12, and 14. The first child gets all values from the first parent except for the values in these five positions, which are from the second parent. The second child is formed similarly.

When the parent chromosomes resemble each other, the children will resemble the parents. However, if the parent chromosomes are very different, the children might not look like either parent.
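For comparison with the IML routine above, the same uniform-crossover idea can be sketched in Python; the names are illustrative:

```python
import numpy as np

def uniform_cross(p1, p2, p_cross, rng):
    """Exchange values between parents at sites chosen by independent
    Bernoulli(p_cross) draws."""
    swap = rng.random(len(p1)) < p_cross   # True at the crossover sites
    c1 = np.where(swap, p2, p1)            # child1 gets some genes from parent2
    c2 = np.where(swap, p1, p2)            # child2 gets some genes from parent1
    return c1, c2

rng = np.random.default_rng(12345)
n = 17
p1 = np.zeros(n, dtype=int); p1[4:12] = 1   # items 5-12 in the knapsack
p2 = np.zeros(n, dtype=int); p2[9:15] = 1   # items 10-15 in the knapsack
c1, c2 = uniform_cross(p1, p2, 0.3, rng)
```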

Notice that you must choose HOW the crossover operator works, and you must choose a hyperparameter that determines how to split the parent chromosomes. In more sophisticated crossover operations, there might be additional hyperparameters, such as the probability that a subchromosome from a parent gets reversed in the child. There are many heuristic choices to make, and the best choice is not knowable.

In the SAS/IML language, there are many built-in crossover operations that you can use. They are discussed in the documentation for the GASETCRO subroutine.


Genetic algorithms can solve optimization problems that are intractable for traditional mathematical optimization algorithms. But the power comes at a cost. The user must make many heuristic choices about how the GA should work. The user must choose hyperparameters that control the probability that certain events happen during mutation and crossover operations. The algorithm uses random numbers to generate new potential solutions from previous candidates. This article used the SAS/IML language to discuss some of the choices that are required to implement these operations.

A subsequent article discusses how to implement a genetic algorithm in SAS.

The post Crossover and mutation: An introduction to two operations in genetic algorithms appeared first on The DO Loop.

September 13, 2021

Photograph by Poussin Jean, license CC BY-SA 3.0, via Wikimedia Commons

The partition problem has many variations, but recently I encountered it as an interactive puzzle on a computer. (Try a similar game yourself!) The player is presented with an old-fashioned pan-balance scale and a set of objects of different weights. The challenge is to divide (or partition) the objects into two groups. You put one group of weights on one side of the scale and the remaining group on the other side so that the scale balances.

Here's a canonical example of the partition problem for two groups. The weights of six items are X = {0.4, 1.0, 1.2, 1.7, 2.6, 2.7}. Divide the objects into two groups of three items so that each group contains half the weight, which is 4.8 for this example. Give it a try! I'll give a solution in the next section.

As is often the case, there are at least two ways to solve this problem: the brute-force approach and an optimization method that minimizes the difference in weights between the two groups. One advantage of the brute-force approach is that it is guaranteed to find all solutions. However, the brute-force method quickly becomes impractical as the number of items increases.

This article considers brute-force solutions. A more elegant solution will be discussed in a future article.

Permutations: A brute-force approach

Let's assume that the problem specifies the number of items that should be placed in each group. One way to specify groups is to use a vector of ±1 values to encode the group to which each item belongs. For example, if there are six items, the vector c = {-1, -1, -1, +1, +1, +1} indicates that the first three items belong to one group and the last three items belong to the other group.

One way to solve the partition problem for two groups is to consider all permutations of the items. For example, Y = {0.4, 1.7, 2.7, 1.0, 1.2, 2.6} is a permutation of the six weights in the previous section. The vector c = {-1, -1, -1, +1, +1, +1} indicates that the items {0.4, 1.7, 2.7} belong to one group and the items {1.0, 1.2, 2.6} belong to the other group. Both groups have 4.8 units of weight, so Y is a solution to the partition problem.

Notice that the inner product Y`*c = 0 for this permutation. Because c is a vector of ±1 values, the inner product is the difference between the sum of weights in the first group and the sum of weights in the second group. An inner product of 0 means that the sums are the same in both groups.
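In NumPy notation, the balance criterion is just a dot product. This quick check is mine, not part of the original program:

```python
import numpy as np

Y = np.array([0.4, 1.7, 2.7, 1.0, 1.2, 2.6])   # a permutation of the six weights
c = np.array([-1, -1, -1, +1, +1, +1])         # group membership of each position
diff = Y @ c     # difference between the two group sums; 0 means the scale balances
```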

A program to generate all partitions

Let's see how you can use the ALLPERM function in SAS/IML to solve the two-group partition problem. Since the number of permutations grows very fast, let's use an example that contains only four items. The weights of the four items are X = {1.2, 1.7, 2.6, 3.1}. We want two items in each group, so define c = {-1, -1, 1, 1}. We search for a permutation Y = π(X), such that Y`*c = 0. The following SAS/IML program generates ALL permutations of the integers {1,2,3,4}. For each permutation, the sum of the first two weights is compared to the sum of the last two weights. The permutations for which Y`*c = 0 are solutions to the partition problem.

proc iml;
X = {1.2, 1.7, 2.6, 3.1};
c = { -1,  -1,   1,   1};  /* k = 2 */
/* Brute Force Method 1: Generate all permutations of items */
N = nrow(X);
P = allperm(N);            /* all permutations of 1:N */
Y = shape( X[P], nrow(P), ncol(P) );   /* reshape to N! x N matrix */
z = Y * c;                 /* want to find z=0 (or could minimize z) */
idx = loc(abs(z) < 1E-8);  /* indices where z=0 in finite precision */
Soln = Y[idx, ];           /* return all solutions; each row is solution */

This program generates P, a matrix whose rows contain all permutations of four elements. The matrix Y is a matrix where each row is a permutation of the weights. Therefore, Y*c is the vector of all differences. When a difference is zero, the two groups contain the same weights. The following statements count how many solutions are found and print the first solution:

numSolns = nrow(soln);
s = soln[1,];
print numSolns, s[c={'L1' 'L2' 'R1' 'R2'}];

There are a total of eight solutions. One solution is to put the weights {1.2, 3.1} in one group and the weights {1.7, 2.6} in the other group. What are the other solutions? For this set of weights, the other solutions are trivial variations of the first solution. The following statement prints all the solutions:

print soln[c={'L1' 'L2' 'R1' 'R2'}];

I have augmented the output so that it is easier to see the structure. In the first four rows, the values {1.2, 3.1} are in the first group and the values {1.7, 2.6} are in the second group. In the last four rows, the values switch groups. Thus, this method, which is based on generating all permutations, generates many solutions that are qualitatively the same.

Combinations: Another brute-force approach

The all-permutation method generates N! possible partitions and, as we have seen, not all the partitions are qualitatively different. Thus, using all permutations is inefficient. A more efficient (but still brute-force) method is to use combinations instead of permutations. Combinations are essentially a sorted version of a permutation. The values {1, 2, 3}, {2, 1, 3}, and {3, 2, 1} are different permutations, whereas there is only one combination ({1, 2, 3}) that contains these three numbers. If there are six items and you want three items in each group, there are 6! = 720 permutations to consider, but only "6 choose 3" = 20 combinations.
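The difference in the size of the search space is easy to verify:

```python
from math import comb, factorial

print(factorial(6))   # 720 permutations of six items
print(comb(6, 3))     # but only "6 choose 3" = 20 combinations
```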

The following SAS/IML function uses combinations to implement a brute-force solution of the two-group partition problem. The Partition2_Comb function takes a vector of item weights and the number of items (k) that you want in the first group; the second group gets the remaining N-k items. The ALLCOMB function creates the complete set of combinations of k indices. The SETDIF function computes the complementary set of indices. For example, if there are six items and {1, 4, 5} is a set of indices, then {2, 3, 6} is the complementary set of indices. After the various combinations are defined, the equation Y*c = 0 is used to find solutions, if any exist, where the first k elements of c are -1 and the last N-k elements are +1.

/* Brute Force Method 2: Generate all combination of size k, N-k */
start Partition2_Comb(_x, k, tol=1e-8);
   x = colvec(_x);
   N = nrow(x);
   call sort(x);                 /* Optional: standardize the output */
   c = j(k, 1, -1) // j(N-k, 1, 1); /* construct +/-1 vector */
   L = allcomb(N, k);            /* "N choose k" possible candidates in "left" group */
   R = j(nrow(L), N-k, .);
   do i = 1 to nrow(L);
      R[i,] = setdif(1:N, L[i,]);  /* complement indices in "right" group */
   end;
   P = L || R;                   /* combine the left and right indices */
   Y = shape( x[P], nrow(P) );   /* reshape x[P] into an (N choose k) x N matrix */
   z = Y * c;                    /* want to find z=0 (or could minimize z) */
   solnIdx = loc(abs(z) < tol);  /* indices where z=0 in finite precision */
   if ncol(solnIdx) = 0 then 
      soln = j(1, N, .);         /* no solution */
   else
      soln = Y[solnIdx, ];       /* each row is solution */
   return soln;                  /* return all solutions */
finish;
/* test the function on a set that has 11 items, partitioned into groups of 5 and 6 */
x = {0.8, 1.2, 1.3, 1.4, 1.6, 1.8, 1.9, 2.0, 2.2, 2.3, 2.5}; 
soln = Partition2_Comb(x, 5);
numSolns = nrow(soln);
print numSolns, soln[c=(('L1':'L5') || ('R1':'R6'))];

The output shows 13 solutions for a set of 11 items that are partitioned into two groups, one with five items and the other with 11-5=6 items. In each solution, both groups have 9.5 units of weight.
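The same brute-force search over combinations is easy to express in Python, which serves as an independent check of the result; the function name is mine:

```python
from itertools import combinations

def partition2_comb(x, k, tol=1e-8):
    """Brute force: choose k of the items so that the chosen group
    and its complement carry equal total weight."""
    target = sum(x) / 2
    return [c for c in combinations(range(len(x)), k)
            if abs(sum(x[i] for i in c) - target) < tol]

x = [0.8, 1.2, 1.3, 1.4, 1.6, 1.8, 1.9, 2.0, 2.2, 2.3, 2.5]
solns = partition2_comb(x, 5)   # 13 balanced partitions, matching the IML result
print(len(solns))
```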


This article shows some SAS/IML programming techniques for using a brute-force method to solve the partition problem on N items. In the partition problem, you split items into two groups that have k and N-k items so that the weight of the items in each group is equal. This article introduced the solution in terms of all permutations of the items, but implemented a solution in terms of all combinations, which eliminates redundant orderings of the items.

In many applications, you don't need to find all solutions, you just need any solution. A subsequent article will discuss formulating the partition problem as an optimization that seeks to minimize the difference between the weights of the two groups.

The post The partition problem appeared first on The DO Loop.

August 30, 2021

You can use the bootstrap method to estimate confidence intervals. Unlike formulas, which assume that the data are drawn from a specified distribution (usually the normal distribution), the bootstrap method does not assume a distribution for the data. There are many articles about how to use SAS to bootstrap statistics for univariate data and for data from a regression model. This article shows how to bootstrap the correlation coefficients in multivariate data. The main challenge is that the correlation coefficients are typically output in a symmetric matrix. To analyze the bootstrap distribution, you must convert the statistics from a "wide form" (a matrix) to a "long form" (a column), as described in a previous article.

To motivate the bootstrap method for correlations, recall that PROC CORR in SAS supports the FISHER option, which estimates confidence intervals for the correlation coefficients. However, the interval estimates are parametric: they assume that the pairs of variables are bivariate normal. How good are the estimates for real data? You can generate the bootstrap distribution for the correlation coefficients and compare the Fisher confidence intervals to the bootstrap estimates. As a preview of things to come, the graph to the right (click to expand) shows histograms of bootstrap estimates and the Fisher confidence intervals.

This article assumes that the reader is familiar with basic bootstrapping in SAS. If not, see the article, "Compute a bootstrap confidence interval in SAS."

Bootstrap multivariate data in SAS

The basic steps of the bootstrap algorithm are as follows:

  1. Compute a statistic for the original data.
  2. Use PROC SURVEYSELECT to resample (with replacement) B times from the data.
  3. Use BY-group processing to compute the statistic of interest on each bootstrap sample.
  4. Use the bootstrap distribution to obtain estimates of bias and uncertainty, such as confidence intervals.

Let's use a subset of Fisher's Iris data for this example. There are four variables to analyze, so you have to estimate confidence intervals for 4*3/2 = 6 correlation coefficients. For this example, let's use the FISHER option to estimate the confidence intervals. We can then compute the proportion of bootstrap estimates that fall within those confidence intervals, thereby estimating the coverage probability of the Fisher intervals. Does a 95% Fisher interval have 95% coverage for these data? Let's find out!
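Before diving into the SAS implementation, the four steps can be sketched in Python for a single correlation coefficient. This is a generic illustration on synthetic data with a percentile interval, not the Fisher interval that the rest of the article studies:

```python
import numpy as np

rng = np.random.default_rng(12345)
n, B = 50, 2000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)              # synthetic correlated pair
r_orig = np.corrcoef(x, y)[0, 1]              # Step 1: statistic on the original data

r_boot = np.empty(B)
for b in range(B):                            # Steps 2-3: resample and recompute
    idx = rng.integers(0, n, size=n)          # draw n row indices with replacement
    r_boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]

lo, hi = np.quantile(r_boot, [0.025, 0.975])  # Step 4: percentile confidence interval
```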

Step 1: Compute statistics on the original data

The following SAS statements subset the Sashelp.Iris data and call PROC CORR to estimate Fisher's confidence intervals for the six correlation coefficients:

data sample;   /* N = 50 */
   set Sashelp.Iris(where=(Species="Versicolor"));
   label PetalLength= PetalWidth= SepalLength= SepalWidth=;
run;
/* STEP 1: Compute statistics on the original data */
%let varNames = PetalLength PetalWidth SepalLength SepalWidth;
%let numVars=4;
proc corr data=sample fisher(biasadj=no) noprob plots=matrix;
   var &varNames;
   ods output FisherPearsonCorr=FPCorr(keep=Var WithVar NObs Corr Lcl Ucl);
proc print data=FPCorr noobs; run;

The output shows the estimates for the 95% Fisher confidence intervals for the correlation coefficients. Remember, these estimates assume bivariate normality. Let's generate a bootstrap distribution for the correlation statistics and examine how well the Fisher intervals capture the sampling variation among the bootstrap estimates.

Steps 2 and 3: Bootstrap samples and the bootstrap distribution

As shown in other articles, you can use PROC SURVEYSELECT in SAS to generate B bootstrap samples of the data. Each "bootstrap sample" is a sample (with replacement) from the original data. You can then use the BY statement in PROC CORR to obtain B correlation matrices, each estimated from one of the bootstrap samples:

/* STEP 2: Create B=10,000 bootstrap samples */
/* Bootstrap samples */
%let NumSamples = 10000;       /* number of bootstrap resamples */
proc surveyselect data=sample NOPRINT seed=12345
     out=BootSamples         /* the bootstrap samples, analyzed by PROC CORR below */
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;
/* STEP 3. Compute the statistic for each bootstrap sample */
proc corr data=BootSamples noprint outp=CorrOut;
   by Replicate;
   freq NumberHits;
   var &varNames;
run;

For many bootstrap analyses, you can proceed directly to the next step, which is to summarize the bootstrap distribution. However, in this case, the statistics in the CorrOut data set are not in a convenient format. The statistics are stored in matrices, as shown by the following call to PROC PRINT, which prints the correlation estimates for the first two bootstrap samples:

/* look at estimates for first two bootstrap samples */
proc print data=CorrOut noobs;
   where _Type_="CORR" & Replicate<=2;
run;

To analyze the distribution of the correlation estimates, it is useful to reshape the data, as described in a previous article. The article shows how to obtain only the values in the cells that are highlighted in yellow. The article also shows how to add labels that identify the pairs of variables. The following DATA step reshapes the data and adds labels such as "Corr(X1,X2)":

/* Convert correlations in the OUTP= data set from wide to long form */
data BootLong;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32 Label $13;
array X[*] &varNames;
Row = 1 + mod(_N_-1, &NumVars);
do Col = Row+1 to &NumVars;       /* names for columns */
   Var2 = vname(X[Col]);
   Corr = X[Col];
   Label = cats("Corr(X",Row,",X",Col,")");
   output;
end;
drop _TYPE_ &varNames;
run;

The bootstrap estimates are now in a "long format," which makes it easier to analyze and visualize the bootstrap distribution of the estimates.

The next section combines the original estimates and the bootstrap estimates. To facilitate combining the data sets, you need to add the Label column to the data set that contains the Fisher intervals. This was also described in the previous article. Having a common Label variable enables you to merge the two data sets.

/* add labels to the Fisher statistics */
data FisherCorr;
length Label $13;
set FPCorr;
retain row col 1;
if col>=&NumVars then do; row+1; col=row; end;
col+1;
Label = cats("Corr(X",row,",X",col,")");
run;

Step 4: Analyze the bootstrap distribution

You can graph the bootstrap distribution for each pair of variables by using the SGPANEL procedure. The following statements create a basic panel:

proc sgpanel data=BootLong;
   panelby Label;
   histogram Corr;
run;

However, a goal of this analysis is to compare the Fisher confidence intervals (CIs) to the bootstrap distribution of the correlation estimates. Therefore, it would be nice to overlay the original correlation estimates and Fisher intervals on the histograms of the bootstrap distributions. To do that, you can concatenate the BootLong and the FisherCorr data sets, as follows:

/* STEP 4: Analyze the bootstrap distribution */
/* overlay the original sample estimates and the Fisher confidence intervals */
data BootLongCI;
set BootLong FisherCorr(rename=(Var=Var1 WithVar=Var2 Corr=Estimate));
label Corr="Correlation Estimate";
run;
title "Bootstrap Correlation Estimates (B=&NumSamples)";
title2 "Overlay Fisher Confidence Intervals";
proc sgpanel data=BootLongCI;
   panelby Label / novarname;
   histogram Corr;
   refline Estimate / axis=X lineattrs=(color=red);
   refline LCL / axis=X ;
   refline UCL / axis=X ;
run;

The graph is shown at the top of this article. In the graph, the red lines indicate the original estimates of correlation from the data. The gray vertical lines show the Fisher CIs for the data. The histograms show the distribution of the correlation estimates on 10,000 bootstrap re-samples of the data.

Visually, it looks like the Fisher confidence intervals do a good job of estimating the variation in the correlation estimates. It looks like most of the bootstrap estimates are contained in the 95% CIs.

The coverage of Fisher's confidence intervals

You can use the bootstrap estimates to compute the empirical coverage probability of the Fisher CIs. That is, for each pair of variables, what proportion of the bootstrap estimates (out of 10,000) are within the Fisher intervals? If all pairs of variables were bivariate normal, the expected proportion would be 95%.

You can use the logical expression (Corr>=LCL & Corr<=UCL) to create a binary indicator variable. The proportion of 1s for this variable equals the proportion of bootstrap estimates that are inside a Fisher interval. The following DATA step creates the indicator variable and the call to PROC FREQ counts the proportion of bootstrap estimates that are in the Fisher interval.
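The logic of the indicator variable can be illustrated with a quick Python analogue. The numbers below are stand-ins that I chose for illustration, not the iris estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
corr = rng.normal(0.5, 0.1, size=10_000)    # stand-in for bootstrap estimates
LCL, UCL = 0.3, 0.7                         # stand-in interval, about +/- 2 SDs
InLimits = (corr >= LCL) & (corr <= UCL)    # binary indicator, as in the DATA step
coverage = InLimits.mean()                  # proportion of estimates inside the interval
print(coverage)
```

Because the interval spans about two standard deviations on each side of the mean, the empirical coverage is close to 95%.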

/* how many bootstrap estimates are inside a Fisher interval? */
proc sort data=BootLong; by Label; run;
data Pctl;
merge BootLong FisherCorr(rename=(Corr=Estimate));
by Label;
label Corr="Correlation Estimate";
InLimits = (Corr>=LCL & Corr<=UCL);
proc freq data=Pctl noprint;
   by Label;
   tables InLimits / nocum binomial(level='1');
   output out=Est binomial;
proc print data=Est noobs;
   var Label N _BIN_ L_BIN U_BIN;
run;

The table shows empirical estimates for the coverage probability of the Fisher intervals. The _BIN_ column contains estimates of the binary proportion. The L_BIN and U_BIN columns contain confidence intervals for the proportions. You can see that all intervals include more than 90% of the bootstrap estimates, and some are close to 95% coverage. The following graph visualizes the empirical coverage for each pair of variables:

ods graphics / width=400px height=240px;
title "Proportion of Bootstrap Estimates in Fisher Intervals"; 
proc sgplot data=Est noautolegend;
   scatter x=_BIN_ y=Label / xerrorlower=L_Bin xerrorupper=U_Bin;
   refline 0.95 / axis=x;
   xaxis  max=1 grid;
   yaxis grid display=(nolabel) offsetmin=0.1 offsetmax=0.1;
run;

The graph indicates that many of the Fisher intervals are too wide (too conservative). Their empirical coverage is greater than the 95% coverage that you would expect for bivariate normal data. On the other hand, the Fisher interval for Corr(X2,X4) has lower coverage than expected.


This article shows how to perform a bootstrap analysis for correlation among multiple variables. The process of resampling the data (PROC SURVEYSELECT) and generating the bootstrap estimates (PROC CORR with a BY statement) is straightforward. However, to analyze the bootstrap distributions, you need to restructure the statistics by converting the output data from a wide form to a long form. You can use this technique to perform your own bootstrap analysis of correlation coefficients.

Part of the analysis also used the bootstrap distributions to estimate the coverage probability of the Fisher confidence intervals for these data. The Fisher intervals have 95% coverage for normally distributed data. For the iris data, most of the intervals have coverage that is greater than expected, although one interval has lower-than-expected coverage. These results are strongly dependent on the data.

The post Bootstrap correlation coefficients in SAS appeared first on The DO Loop.

August 25, 2021

For graphing multivariate data, it is important to be able to convert the data between "wide form" (a separate column for each variable) and "long form" (which contains an indicator variable that assigns a group to each observation). If the data are numeric, the wide data can be represented as an N x p matrix. The same data in long form can be represented by two columns and Np rows, where the first column contains the data and the second column identifies that the first N rows belong to the first group, the second N rows belong to the second group, and so on. Many people have written about how to use PROC TRANSPOSE or the SAS DATA step to convert from wide form to long form and vice versa.

The conversion is slightly different for a symmetric matrix because you might want to display only the upper-triangular portion of the matrix. This article examines how to convert between a symmetric correlation matrix (wide form) and a "compressed symmetric" form that only stores the elements in the upper-triangular portion of the symmetric matrix.

Compressed symmetric storage

When the data represents a symmetric N x N matrix, you can save space by storing only half the data, such as only the upper triangular portion of the matrix. This is called compressed symmetric storage. Familiar examples of symmetric matrices include correlation, covariance, and distance matrices. For example, you can represent the 3 x 3 symmetric matrix A = {3 2 1, 2 5 0, 1 0 9} by storing only the upper triangular entries U = {3, 2, 1, 5, 0, 9}. There are N(N+1)/2 upper triangular elements. For a correlation matrix, you don't need to store the diagonal elements; you only need to store N(N-1)/2 elements.
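The round trip between the full matrix and the compressed form is easy to demonstrate. Here is a NumPy sketch using the 3 x 3 example above; the variable names are mine:

```python
import numpy as np

A = np.array([[3., 2., 1.],
              [2., 5., 0.],
              [1., 0., 9.]])
iu = np.triu_indices(3)            # positions of the upper triangle (with diagonal)
U = A[iu]                          # compressed form: 3, 2, 1, 5, 0, 9
B = np.zeros((3, 3))               # rebuild the full symmetric matrix
B[iu] = U
B = B + B.T - np.diag(np.diag(B))  # reflect, but don't double the diagonal
```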

When you run a correlation analysis by using PROC CORR in SAS, you usually get a square symmetric matrix of correlation estimates. However, if you use the FISHER option to get confidence limits for the correlation coefficients, then you get a table that shows the correlation estimates for each pair of variables. I will demonstrate this point by using the Sashelp.Iris data. To make the discussion general, I will put the variable names into a macro variable so that you can concentrate on the main ideas:

data sample;   /* use a subset of the Iris data */
   set Sashelp.Iris(where=(Species="Versicolor"));
   label PetalLength= PetalWidth= SepalLength= SepalWidth=;
run;
%let varNames = PetalLength PetalWidth SepalLength SepalWidth;
/* count how many variables are in the macro variable */
data _null_;
nv = countw("&varNames");
call symputx("numVars", nv);
run;
%put &=numVars;   /* for this example, numVars=4 */
proc corr data=sample fisher(rho0=0.7 biasadj=no) noprob outp=CorrOut;
   var &varNames;
   ods select PearsonCorr FisherPearsonCorr;
   ods output FisherPearsonCorr = FisherPCorr;
run;

The usual square symmetric correlation matrix is shown at the top of the image. I have highlighted the elements in the strictly upper triangular portion of the matrix. By using these values, you can re-create the entire matrix. The second table is an alternative display that shows only the six pairwise correlation estimates. The arrangement in the second table can be useful for plotting the pairwise correlations in a bar chart.

It is useful to be able to convert the first table into the format of the second table. It is also useful to be able to augment the second table with row/column information so that each row of the second table is easily associated with the corresponding row and column in the first table.

Convert a symmetric table to a pairwise list

Let's use the information in the first table to list the correlations for pairs of variables, as shown in the second table. Notice that the call to PROC CORR uses the OUTP= option to write the correlation estimates to a SAS data set. You should print that data set to understand its structure; it contains more than just the correlation estimates! After you understand the structure, the following DATA step will make more sense. The main steps of the transformation are:

  • Create a variable named ROW that has the values 1, 2, 3, ..., N. In the following DATA step, I use the MOD function because in a subsequent article, I will use the same DATA step to perform a bootstrap computation in which the data set contains B copies of the correlation matrix.
  • Create a variable named COL. For each value of ROW, the COL variable loops over the values ROW+1, ROW+2, ..., N.
  • Put the correlation estimates into a single variable named CORR.
  • If desired, create a short label that identifies the two variables that produced the correlation estimate.
/* Convert correlations in the OUTP= data set from wide to long form (pairwise statistics) */
data Long;
set CorrOut(where=(_Type_="CORR") rename=(_NAME_=Var1));
length Var2 $32 Label $13;
array X[*] &varNames;
row = 1 + mod(_N_-1, &NumVars);   /* 1, 2, ..., N, 1, 2, ..., N, 1, 2, ... */
do col = row+1 to &NumVars;       /* names for columns */
   Var2 = vname(X[col]);
   Corr = X[col];
   Label = cats("Corr(X",row,",X",col,")");
   output;
end;
drop _TYPE_ &varNames;
run;
proc print data=Long; run;

The ROW and COL variables identify the position of the correlation estimates in the upper-triangular portion of the symmetric correlation matrix. You can write a similar DATA step to convert a symmetric matrix (such as a covariance matrix) in which you also want to display the diagonal elements.
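The ROW/COL bookkeeping in the DATA step can be mimicked in a couple of lines of Python to see the pattern it produces:

```python
n_vars = 4
# enumerate the strictly upper-triangular (row, col) pairs, 1-based
pairs = [(row, col) for row in range(1, n_vars + 1)
                    for col in range(row + 1, n_vars + 1)]
labels = [f"Corr(X{r},X{c})" for r, c in pairs]
print(pairs)    # (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
```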

Identify the rows and columns of a pairwise list

The data for the second table is stored in the FisherPCorr data set. Although it is possible to convert the pairwise list into a dense symmetric matrix, a more useful task is to identify the rows and columns for each entry in the pairwise list, as follows:

/* Write the (row,col) info for the list of pairwise correlations in long form */
data FisherCorr;
set FisherPCorr;
retain row col 1;
if col>=&NumVars then do; row+1; col=row; end;
col+1;
Label = cats("Corr(X",row,",X",col,")");
run;
proc print data=FisherCorr; 
   var Var WithVar Corr row col Label;
run;

From the ROW and COL variables, you can assemble the data into a symmetric matrix. For example, I've written about how to use the SAS/IML language to create a symmetric correlation matrix from the strictly upper-triangular estimates.


For a symmetric matrix, you can display the matrix in a wide format (an N x N matrix) or you can display the upper-triangular portion of the matrix in a single column (long format). This article shows how to use the SAS DATA step to convert a correlation matrix into a long format. By adding some additional variables, you can identify each value with a row and column in the upper-triangular portion of the symmetric matrix.

The post Convert a symmetric matrix from wide to long form appeared first on The DO Loop.

8月 112021

One of the benefits of using the SWEEP operator is that it enables you to "sweep in" columns (add effects to a model) in any order. This article shows that if you use the SWEEP operator, you can compute a SSCP matrix and use it repeatedly to estimate any linear regression model that uses those effects.

This post relates to a previous post, in which I showed that you never need to use a permutation matrix to rearrange the order of rows and columns of a matrix. As an application, I used the sum of squares and crossproducts (SSCP) matrix, which is used to estimate the regression coefficients in a least-squares regression model. Being able to permute the rows and columns of the SSCP matrix efficiently means that you can solve the linear regression problem very quickly for any ordering of the columns of the design matrix. But if you use the SWEEP operator, you do not need to permute the SSCP matrix at all!

The SSCP matrix in PROC REG

Before discussing the SWEEP operator, let's review a little-used feature of PROC REG in SAS. The REG procedure is interactive, which means that you can interactively add or delete effects from a model as part of the model-building process. However, if you are going to explore several different models, you must use the VAR statement to specify all variables that you might conceivably want to use BEFORE the first RUN statement. Why? Because when PROC REG encounters the RUN statement, it builds the SSCP matrix for the variables that you have specified. As stated in the documentation, for each subsequent model, "the procedure selects the appropriate crossproducts from the [SSCP] matrix." In other words, PROC REG re-uses that SSCP matrix for every MODEL statement.

Let's look at an example. The following call to PROC REG computes the SSCP matrix for five variables and the intercept effect. It uses that SSCP matrix to compute the parameter estimates:

proc reg data=sashelp.cars plots=NONE;
   ods select ParameterEstimates;
   VAR MSRP EngineSize Horsepower MPG_Highway Weight;            /* vars in the SSCP */
   FULL: model MSRP = EngineSize Horsepower MPG_Highway Weight;  /* uses the SSCP */
run;

Because PROC REG is an interactive procedure, the procedure does not exit when it encounters the RUN statement. It remembers the SSCP matrix until the procedure exits. If you submit additional MODEL statements, PROC REG will use the SSCP matrix to estimate the new parameters, as long as the variables were previously specified. For example, you can specify the same model, but list the variables in a different order. Or you can estimate a reduced model that contains only a subset of the variables, as follows:

   PERMUTE: model MSRP = Horsepower Weight EngineSize MPG_Highway;
   PARTIAL: model MSRP = Weight Horsepower;
run; quit;

The output for the "PERMUTE" model is not shown, because the estimates are the same as for the "FULL" model, but in a different order. It is important to note that PROC REG estimates the second and third models without re-reading the data. It simply re-uses the original SSCP matrix. As I have previously shown, computing the SSCP matrix is often 90% of the work of computing regression estimates, which means that the second and third models are computed very quickly.

The next section shows how the SWEEP operator can quickly compute the parameter estimates for any model and for any order of the variables. There is no need to physically permute the rows and columns of the SSCP matrix.

The SWEEP operator

You can use the SAS/IML language to illuminate how PROC REG can compute one SSCP matrix and re-use it for all subsequent models. The following SAS/IML statements read the data and compute the SSCP matrix. Following Goodnight (1979), the last column contains the response variable, but this is merely a convenience.

proc iml;
use sashelp.cars;            /* the data set that contains the variables */
varNames = {'EngineSize' 'Horsepower' 'MPG_Highway' 'Weight'};
   read all var varNames into X;
   read all var 'MSRP' into Y;
close;
/* add intercept column and append the response variable */
X = j(nrow(X), 1, 1) || X;   /* the design matrix */
k = ncol(X);                 /* number of effects */
varNames = 'Intercept' || varNames || 'MSRP';
XY = X || Y;
SSCP = XY` * XY;             /* the only SSCP that we'll need */
/* sweep the first k columns; the last column then contains the parameter estimates */
S = sweep(SSCP, 1:k); 
rCol = nrow(SSCP);           /* the response column */
b = S[1:k, rCol];            /* parameter estimates for regression coefficients */
print b[r=(varNames[1:k])];

The parameter estimates are the same as computed by PROC REG. They were obtained by sweeping the columns of the SSCP matrix in the order 1, 2, ..., 5. You can think of the vector v=1:k as the "identity permutation" of the rows. Then the SWEEP operator for the variables in their original order is the operation S = sweep(SSCP, v).

The SWEEP function in SAS/IML enables you to specify any vector of indices as the second argument. In this way, you can sweep the columns of the SSCP matrix in any order. For example, the "PERMUTE" model from the PROC REG example corresponds to sweeping the SSCP matrix in the order v = {1 3 5 2 4}. As the following example shows, you get the same parameter estimates because the models are the same:

/* sweep original SSCP in the order of v */
v = {1 3 5 2 4};
pVarNames = varNames[,v];    /* effect names in permuted order */
S = sweep(SSCP, v);          /* sweep the specified rows in the specified order */
b = S[v, rCol];              /* get the parameter estimates in new order */
print b[r=pVarNames];

As expected, the parameter estimates are the same, but the order of the parameters is different.

You can also estimate parameters for any model that uses a subset of the columns. For example, the model that has the explanatory variables Weight and Horsepower (the fifth and third effects) is computed from the SSCP matrix by sweeping the columns in the order {1 5 3}, as follows:

/* Perform a partial sweep for the model {1 5 3}. No need to form a new
   design matrix or to extract a portion of the SSCP. */
v = {1 5 3};
pVarNames = varNames[,v];    /* effect names for this model */
S = sweep(SSCP, v);          /* sweeps in the variables in the order of v */
b = S[v, rCol];              /* get the parameter estimates */
print b[r=pVarNames];

The output matches the "PARTIAL" model from PROC REG.
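The same pattern is easy to sketch in another matrix language. The following NumPy implementation of the sweep operator is my own sketch following Goodnight (1979), not code from the post, and it uses randomly generated data rather than Sashelp.Cars. It reproduces the workflow above: compute the SSCP matrix once, then sweep it in any order to estimate the full model, a permuted model, or a submodel:

```python
import numpy as np

def sweep(A, pivots):
    """Sweep the symmetric matrix A on the given pivot indices (Goodnight, 1979)."""
    A = np.array(A, dtype=float)
    for k in pivots:
        d = A[k, k]
        A[k, :] /= d                    # divide the pivot row by the pivot
        for i in range(A.shape[0]):
            if i != k:
                c = A[i, k]
                A[i, :] -= c * A[k, :]  # eliminate column k from row i
                A[i, k] = -c / d
        A[k, k] = 1.0 / d
    return A

# Simulated data: intercept + 3 effects
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

XY = np.column_stack([X, y])
SSCP = XY.T @ XY                                 # the only SSCP we'll need

b_full = sweep(SSCP, [0, 1, 2, 3])[:4, -1]       # full model
b_perm = sweep(SSCP, [0, 2, 1, 3])[:4, -1]       # same effects, swept in another order
b_part = sweep(SSCP, [0, 2])[[0, 2], -1]         # submodel: intercept + second effect
```

Sweeping the same set of pivots in a different order yields the same estimates, and a partial sweep estimates the submodel, all from the single SSCP matrix.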


In a previous post, I showed an efficient way to permute the order of rows and columns of a matrix. But the current article shows that you do not need to permute the elements of an SSCP matrix if you use the SWEEP operator. The SWEEP operator enables you to specify the order in which you want to add effects to the model. From a design matrix, you can compute the SSCP matrix. (This is most of the work!) You can then quickly compute any model that uses the columns of the design matrix in any order.

This useful property of the SWEEP operator is among the many features discussed in Goodnight, J. (1979), "A Tutorial on the SWEEP Operator," The American Statistician. For other references, see the end of my previous article about the SWEEP operator.

The post More on the SWEEP operator for least-square regression models appeared first on The DO Loop.

8月 092021

Do you ever use a permutation matrix to change the order of rows or columns in a matrix? Did you know that there is a more efficient way in matrix-oriented languages such as SAS/IML, MATLAB, and R? Remember the following tip:
Never multiply with a large permutation matrix! Instead, use subscripts to permute rows and columns.

This article explains why it is not necessary to multiply by a permutation matrix in a matrix language. The advice is similar to the tip in the article, "Never multiply with a large diagonal matrix," which discusses an efficient way to use diagonal matrices in matrix computations. That article recommends that you never multiply with a large diagonal matrix. Instead, use elementwise multiplication of rows and columns.

Permutation matrices

Permutation matrices have many uses, but their primary use in statistics is to rearrange the order of rows or columns in a matrix. A previous article shows how to create a permutation matrix and how to use it to permute the order of rows and columns in a matrix. I ended the article by noting that "there is an easy alternative way" to permute rows and columns: use subscript notation. Subscripts enable you to permute rows and columns efficiently and intuitively. If you use matrix multiplication to permute columns of a matrix, you might not remember whether to multiply the permutation matrix on the left or on the right, but if you use subscripts, it is easy to remember the syntax.

The following example defines a 5 x 5 matrix, A, with integer entries and a function that creates a permutation matrix. (The permutation matrix is the transpose of the matrix that I used in the previous article. I think this definition is easier to use.) If you multiply A on the right (A*Q), you permute the columns. If you multiply A on the left (Q`*A), you permute the rows, as shown:

proc iml;
A = shape(1:25, 5, 5); /*{ 1  2  3  4  5,
                           6  7  8  9 10,
                          11 12 13 14 15,
                          16 17 18 19 20,
                          21 22 23 24 25} */
v = {3 5 2 4 1};      /* new order for the columns or rows */
/* Method 1: Explicitly form the permutation matrix */
/* the following matrix permutes the rows when Q` is used on the left (Q`*A).
   Use Q on the right (A*Q) to permute the columns. */
start PermutionMatrixT(v);
   return( I(ncol(v))[ ,v] );
finish;
Q = PermutionMatrixT(v);
C = A*Q;    /* permute the columns */
R = Q`*A;   /* permute the rows */
ods layout gridded advance=table columns=2;
   print R[r=v L="Permute Rows"], C[c=v L="Permute Columns"];
ods layout end;

This method explicitly forms a permutation matrix to permute the rows and/or columns. To permute the rows of an N x p matrix, the permutation matrix is an N x N matrix, and the matrix multiplication requires O(N^2 p) floating-point operations. To permute the columns, the permutation matrix is p x p, and the matrix multiplication requires O(N p^2) operations.

Permutation of indices for subscripts

In contrast, if you permute a matrix by using subscript indices, no floating-point operations are needed. The syntax A[v,] permutes the rows of the matrix A by using the indices in the vector v. The syntax A[,v] permutes the columns of A. The syntax A[v,v] permutes both the rows and the columns. Here is the previous permutation, but without using permutation matrices:

/* Method 2: Use subscript indices instead of forming a permutation matrix */
C = A[ ,v];   /* permute the columns */
R = A[v, ];   /* permute the rows */
print R[r=v L="Permute Rows"], C[c=v L="Permute Columns"];

The output is the same and is not shown. No computations are required, only copying data from one matrix to another. For v = {3 5 2 4 1}, the first syntax means "copy the columns in the order 3, 5, 2, 4, and 1." The second syntax specifies the order of the rows. You don't have to remember which side of A to put the permutation matrix on, nor whether to use a transpose operation.
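The same subscripting idiom exists in other matrix languages. For example, here is a NumPy sketch of the same example (my own translation; NumPy indices are zero-based):

```python
import numpy as np

A = np.arange(1, 26).reshape(5, 5)   # same entries as shape(1:25, 5, 5) in IML
v = np.array([3, 5, 2, 4, 1]) - 1    # zero-based version of v = {3 5 2 4 1}

C = A[:, v]          # permute the columns (IML: A[ ,v])
R = A[v, :]          # permute the rows    (IML: A[v, ])
B = A[np.ix_(v, v)]  # permute both rows and columns (IML: A[v,v])
```

As in IML, no matrix multiplication occurs; the indexing operations only copy elements into a new array.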

An application: Order of effects in a regression model

In my previous article, I used a correlation matrix to demonstrate why it is useful to know how to permute rows and columns of a matrix. Another application is the order of columns in a design matrix for linear regression models. Suppose you read in a design matrix whose columns are in a specified order. For some statistics (for example, the Type I sequential sum of squares), the order of the variables in the model is important. The order of variables also matters in regression techniques such as variable selection methods. Fortunately, if you have computed the sum of squares and crossproducts (SSCP) matrix for the variables in the original order, it is trivial to permute the matrix to accommodate any other ordering of the variables. If you have computed the SSCP matrix in one order, you can obtain it for any order without recomputing it.

The SSCP matrix is part of the "normal equations" that are used to obtain least-squares estimates for regression parameters. Let's look at an example. Suppose you compute the SSCP matrix for a data matrix that has five effects, as shown below:

/* permute columns or rows of an SSCP matrix (XpX) */
use sashelp.cars;
varNames = {'EngineSize' 'Horsepower' 'MPG_City' 'MPG_Highway' 'Weight'};
read all var varNames into Z;
close;
X = Z - mean(Z);   /* center the columns */
XpX = X`*X;
print XpX[r=varNames c=varNames];

Now, suppose you want the SSCP matrix for the same variables in a different order. For example, you want the order {'MPG_City' 'Weight' 'Horsepower' 'MPG_Highway' 'EngineSize'}, which is the permutation {3, 5, 2, 4, 1} of the original ordering. You have two choices: permute the columns of the design matrix and recompute the SSCP, or permute the existing SSCP matrix. The first way is intuitive but less efficient. It is shown below:

/* inefficient way: create new design matrix, Y, which permutes columns of X. Then compute SSCP. */
v = {3 5 2 4 1};
Q = PermutionMatrixT(v);
Y = X*Q;
sscp = Y`*Y;               /* = Q`*X`*X*Q = Q`*(X`*X)*Q */
pVarNames = varNames[,v];  /* get new order for var names */
print sscp[r=pVarNames c=pVarNames];

The second way is a one-liner and does not require any computations. You don't have to form the permutation matrix, Q. You only need to use subscripts to permute the new SSCP:

/* efficient way: don't form Q; just permute the indices */
sscp2 = XpX[v,v];
print sscp2[r=pVarNames c=pVarNames];

The output is exactly the same and is not shown. After you have obtained the SSCP matrix in the new order, you can use it to compute regression statistics. If you are solving the normal equations, you also need to permute the right-hand side of the equations, which will be equal to Q`*X`*y.
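As a quick cross-check that subscripting and matrix multiplication give identical results, here is a NumPy sketch (my own, with randomly generated data standing in for the centered design matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # stand-in for a design matrix with 5 effects
X = X - X.mean(axis=0)             # center the columns
XpX = X.T @ X                      # SSCP matrix in the original variable order

v = np.array([3, 5, 2, 4, 1]) - 1  # zero-based version of the permutation {3 5 2 4 1}
Q = np.eye(5)[:, v]                # permutation matrix: I(5)[ ,v], as in the IML module

sscp_mult = Q.T @ XpX @ Q          # permute by matrix multiplication: O(N p^2) flops
sscp_idx  = XpX[np.ix_(v, v)]      # permute by subscripts: no flops, just copying
```

The two permuted matrices are identical, so you can always prefer the subscripting form.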


In summary, you rarely need to explicitly form a permutation matrix and use matrix multiplication to permute the rows or columns of a matrix. In most situations, you can use subscript indices to permute rows, columns, or both rows and columns.

The post Never multiply with a large permutation matrix appeared first on The DO Loop.