Lisa Lucas explains why people, knowledge and tech are vital for contact tracing.
Contact tracing: The vital roles of people, knowledge and technology was published on SAS Voices by Lisa Lucas
Could you pass a test on coronavirus testing terms? Jim Harris can help.
The post Coronavirus testing: The terms you need to know appeared first on The Data Roundtable.
As he watched me unpack yet another Amazon delivery, my husband decided to do an intervention. Scattered around me on the floor were solar panels, water purification tablets, tarps, hunting knives and enough batteries to power New York City for many weeks. It was 2019 and hurricane season was upon [...]
Manufacturing production during the pandemic: Panicked or pragmatic? was published on SAS Voices by Marcia Walker
In my earlier blog post, Changing variable type and variable length in SAS datasets, I showed how you can effectively change variable lengths in a SAS data set. That approach works fine when you need to change the length attribute for a few variables, on a case-by-case basis. But what if you need to change lengths for all character variables in a data set? Or for all data sets in a data library? For example, suppose you need to expand (increase) all your character variable lengths by 50%. Then the case-by-case approach becomes too laborious and inefficient.
Before reading any further, let’s take a quick quiz:
Q: A character variable length attribute represents a number of:
A. Characters
B. Bytes
If your answer is anything but B, it’s incorrect. According to the SAS documentation, length refers to the number of bytes used to store each of the variable's values in a SAS data set. You can use a LENGTH statement to set the length of both numeric and character variables.
It is true, though, that for some older encoding systems (ASCII, ISO/IEC 8859, EBCDIC, etc.) there was no difference between the number of bytes and the number of characters, as those systems were based on exactly one byte per character. They are even called Single Byte Character Sets (SBCS) for that reason. The problem is that they can accommodate a maximum of only 2^8 = 256 symbols, which is not nearly enough to cover all the variety of natural languages, special characters, emojis, etc.
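To make the byte-versus-character distinction concrete, here is a quick illustration in Python rather than SAS (the encoding names are standard; the sample string is just for demonstration):

```python
# The same text occupies a different number of bytes in different
# encodings, even though the number of characters stays the same.
text = "résumé"                    # 6 characters

latin1 = text.encode("latin-1")    # single-byte encoding (SBCS)
utf8 = text.encode("utf-8")        # variable-width encoding (MBCS)

print(len(text))     # 6 characters
print(len(latin1))   # 6 bytes: exactly one byte per character
print(len(utf8))     # 8 bytes: each 'é' takes 2 bytes in UTF-8
```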
For this scenario, let’s consider Internet traffic analysis where your data contains multiple character columns for Internet Protocol addresses (IP addresses) in the 32-bit version 4 format (IPv4, e.g. ‘125.255.50.78’; each of the four octets ranges from 0 to 255). You transition to the newer 128-bit IPv6 standard (e.g. ‘2001:0000:3238:DFE1:0063:0000:0000:FEFB’) and need to modify your data structure to accommodate the new standard with longer character values.
In this scenario, you migrate SAS data sets from older SBCS environments to newer Multi-Byte Character Set (MBCS) encoding environments. In such a case, the ability to increase character variable lengths in bulk with a simple action becomes critical.
Currently, the most commonly used MBCS is Unicode, which is supported by all modern operating systems, databases and web browsers. Of the different flavors of Unicode (UTF-8, UTF-16, UTF-32), the most popular is UTF-8. UTF-8 (8-bit Unicode Transformation Format) is a variable-width character set that uses from 1 to 4 one-byte (8-bit) code units per character; it is capable of encoding 1,112,064 characters, which covers most modern languages, including Arabic and Hebrew characters, hieroglyphs and emojis, as well as many other special characters.
Since each UTF-8 encoded character may require between one and four bytes, and not all SBCS characters are represented by one byte in UTF-8, data migration from SBCS to UTF-8 may cause data truncation and, subsequently, data loss.
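Here is a small Python illustration (not SAS) of how many bytes various characters occupy in UTF-8; it shows why a field sized at one byte per character can truncate after transcoding:

```python
# Each character below needs a different number of bytes in UTF-8.
samples = {
    "A": 1,     # ASCII letter: 1 byte
    "é": 2,     # accented Latin letter: 2 bytes
    "€": 3,     # euro sign: 3 bytes
    "😀": 4,    # emoji: 4 bytes
}
for ch, expected in samples.items():
    n = len(ch.encode("utf-8"))
    print(ch, n)
    assert n == expected
```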
When SAS reads an SBCS-encoded data set and writes its records into a UTF-8-encoded data set, it writes an ERROR message to the log and stops execution:
ERROR: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
When SAS reads an SBCS-encoded data set and produces a UTF-8-encoded printed report only (without generating a UTF-8-encoded output data set), it generates a WARNING message (with a description identical to the above ERROR message) while continuing execution:
WARNING: Some character data was lost during transcoding in the dataset LIBREF.DSNAME. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
Either the ERROR or the WARNING is unacceptable and must be properly addressed.
Whether or not character transcoding is involved, SAS’ CVP Engine is a short and effective answer. CVP stands for Character Variable Padding, which is exactly what this special-purpose engine does: it pads, or expands, character variable lengths by a number of bytes. The CVP engine is part of Base SAS and does not require any additional licensing.
The CVP engine is a read-only engine for SAS data sets only. You can think of it as a magnifying glass: it creates an expanded view of the character data descriptors (lengths) without changing them. Still, we can use the CVP Engine to actually convert a data set, or a whole data library, to its expanded-character-variables version. All we need to do is define our source library as a CVP library, for example:
libname inlib cvp 'c:\source_folder';
Then use PROC COPY to create expanded versions of our original data sets in a target library:
libname outlib 'c:\target_folder';

proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;
Or, if we need to expand character variable lengths for the whole library, then we use the same PROC COPY without the SELECT statement:
proc copy in=inlib out=outlib noclone;
run;
It’s that easy. And the icing on the cake is that the CVP engine automatically adjusts the variables’ format widths to match the expanded byte lengths for all converted character variables.
The CVP Engine is a near-perfect SAS solution to the problem of potential data truncation when data is transcoded during migration from SBCS-based to MBCS-based systems.
To avoid data loss from possible truncation during transcoding, we can use the above code with a slight but important modification: define the target library with the outencoding='UTF-8' option. As a result, our target data is not only expanded lengthwise but properly encoded as well. We then run this modified code in the old SBCS environment before migrating our data sets to the new MBCS environment:
libname inlib cvp 'c:\source_folder';
libname outlib 'c:\utf8_target_folder' outencoding='UTF-8';

proc copy in=inlib out=outlib noclone;
   select dataset1 dataset2;
run;
Again, if you need to expand character variable lengths for the whole library, then you can use the same PROC COPY without the SELECT statement:
proc copy in=inlib out=outlib noclone;
run;
After that we can safely move our expanded, UTF-8-encoded data to the new UTF-8 environment.
There are several options available with the CVP Engine. Here are the most widely used:
CVPBYTES=bytes - specifies the number of bytes by which to expand character variable lengths. The lengths of character variables are increased by adding the specified bytes value to the current length.
Example: libname inlib 'SAS data-library' cvpbytes=5;
The CVPBYTES= option implicitly specifies the CVP engine; that is, if you specify the CVPBYTES= option, you don’t have to specify the CVP engine explicitly, as SAS will use it automatically.
CVPMULTIPLIER=multiplier - specifies a multiplier value that expands character variable lengths. The lengths of character variables are increased by multiplying the current length by the specified multiplier value. You can specify a multiplier value from 1 to 5, or you can specify 0, in which case the CVP engine determines the multiplier automatically.
Example: libname inlib 'SAS data-library' cvpmultiplier=2.5;
The CVPMULTIPLIER= option also implicitly specifies the CVP engine; that is, if you specify the CVPMULTIPLIER= option, you don’t have to specify the CVP engine explicitly, as SAS will use it automatically.
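The arithmetic behind the two options is simple. Here is a hypothetical sketch in Python (the option names CVPBYTES= and CVPMULTIPLIER= come from SAS; the helper function and the round-up behavior for fractional multipliers are my own assumptions, not the SAS implementation):

```python
import math

def expanded_length(current, cvpbytes=None, cvpmultiplier=None):
    """Sketch of the two CVP expansion rules (my own helper, not a SAS API)."""
    if cvpbytes is not None:
        return current + cvpbytes                   # CVPBYTES= adds a fixed byte count
    if cvpmultiplier is not None:
        return math.ceil(current * cvpmultiplier)   # CVPMULTIPLIER= multiplies the length
    return current

print(expanded_length(10, cvpbytes=5))         # 15
print(expanded_length(10, cvpmultiplier=2.5))  # 25
```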
Have you found this blog post useful? Please share your use cases, thoughts and feedback in the comments section below.
Expanding lengths of all character variables in SAS data sets was published on SAS Users.
A previous article about standardizing data in groups shows how to simulate data from two groups. One sample (with n1=20 observations) is simulated from an N(15, 5) distribution whereas a second (with n2=30 observations) is simulated from an N(16, 5) distribution. Although I didn't mention it, due to random sampling variation the sample means of the two groups are close to each other: 14.9 and 15.15. In fact, if you run a t test on these data, the t test concludes that the samples probably came from populations that have the same mean parameter. More formally, the t test fails to reject the null hypothesis that the two group means are the same.
The probability that the test detects this difference (that is, correctly rejects the null hypothesis when the alternative is true) is called the power of the hypothesis test. This article uses PROC POWER to compute the exact power of a t test and then uses a simulation to estimate the power. Along the way, I show a surprising feature of SAS/IML functions.
Let's run the t test on the simulated data. The following SAS DATA step is the same as in the previous article. It simulates the data and calls PROC TTEST to analyze the data. (In this article, all tests are performed at the α=0.05 significance level.) The null hypothesis is that the group means are equal; the alternative hypothesis is that they are not equal. The data are simulated from distributions that have the same variance, so you can use the pooled t test to analyze the data:
/* Samples from N(15,5) and N(16, 5). */
%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);
   output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);
   output;
end;
keep Group x;
run;

proc ttest data=TwoSample plots=none;
   class Group;
   var x;
run;
I blurred the rows of the tables that correspond to the Satterthwaite test so that you can focus on the pooled t test. The pooled test uses the difference between the sample means (-0.25) and the pooled standard deviation (4.86) to construct a t test for the difference of means. The p-value for the two-sided test is 0.8613, which means that the data do not provide enough evidence to reject the null hypothesis.
Because we simulated the data, we know that, in fact, the difference between the population means is not zero. In other words, the t test failed to reject the null hypothesis even though the alternative hypothesis (unequal means) is true. This is called a "Type 2" error. The probability of correctly rejecting the null hypothesis when the alternative is true is called the power of a statistical hypothesis test; the probability of a Type 2 error is 1 minus the power. Typically, the power depends on the sample size and the magnitude of the effect you are trying to detect.
A two-sample t test is one of the tests that is supported by PROC POWER in SAS. You can use PROC POWER to compute the power of the t test for random samples of size n1=20 and n2=30 when the true difference between the means is 1.
/* compute exact power */
proc power;
   twosamplemeans
      power = .       /* missing ==> "compute this" */
      meandiff = 1    /* difference of means */
      stddev = 5      /* N(mu1-mu2, 5) */
      groupns = (20 30);  /* num obs in the samples */
run;
The tables tell you that (with these parameter values) the two-sided t test will reject the null hypothesis only about 10% of the time (power=0.104) at the α=0.05 significance level. Equivalently, the test commits a Type 2 error about 90% of the time. So the simulated sample that failed to reject the null hypothesis is not unusual: this outcome occurs for about 9 out of 10 random samples. Only in 1 out of 10 random samples does the t test conclude that the samples come from populations that have different means.
I previously wrote about how to use simulation to estimate the power of a t test. That article shows how to use the DATA step to simulate many independent samples and then uses BY-group processing in PROC TTEST to analyze each sample. You use the ODS OUTPUT statement to capture the statistics for each analysis. For some samples, the test rejects the null hypothesis; for others, it does not. The proportion of samples that reject the null hypothesis is an estimate of the power of the test.
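For readers who want to see the Monte Carlo idea outside of SAS, here is a small pure-Python sketch of the same simulation. The critical value for a t distribution with 48 degrees of freedom is hard-coded (about 2.0106) because the Python standard library has no t quantile function; the sample sizes and distribution parameters match the SAS example:

```python
import random
import statistics

def pooled_t(x, y):
    """Pooled two-sample t statistic (same formulas as the SAS/IML module)."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pool_var = ((n1 - 1) * statistics.variance(x) +
                (n2 - 1) * statistics.variance(y)) / df
    return (statistics.mean(x) - statistics.mean(y)) / \
           (pool_var * (1 / n1 + 1 / n2)) ** 0.5

T_CRIT = 2.0106   # two-sided 5% critical value for t with 48 df (hard-coded)
B = 2000          # number of simulated experiments

random.seed(321)
rejects = 0
for _ in range(B):
    x = [random.gauss(15, 5) for _ in range(20)]   # group 1 ~ N(15, 5)
    y = [random.gauss(16, 5) for _ in range(30)]   # group 2 ~ N(16, 5)
    if abs(pooled_t(x, y)) > T_CRIT:
        rejects += 1

power = rejects / B
print(power)   # should be close to the exact power, 0.104
```

The proportion of rejections estimates the power, just as in the SAS simulation; a larger B gives a more precise estimate.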
If you can write a SAS/IML function that implements a t test, you can perform the same power computation in PROC IML. The following SAS/IML program (based on Chapter 5 of Wicklin (2013) Simulating Data with SAS) defines a function that computes the t statistic for a difference of means. The function returns a binary value that indicates whether the test rejects the null hypothesis. To test the function, I call it on the same simulated data:
proc iml;
/* SAS/IML function for a pooled t test.
   X : column vector with n1 elements from a population with mean mu_x.
   Y : column vector with n2 elements from a population with mean mu_y.
   Compute the pooled t statistic for the difference of means and return
   whether the statistic rejects the null hypothesis H0: mu_x = mu_y
   at the alpha signif level.
   Formulas from the SAS documentation: https://bit.ly/30rr2GV */
start TTestH0(x, y, alpha=0.05);
   n1 = nrow(X);    n2 = nrow(Y);     /* sample sizes */
   meanX = mean(x); varX = var(x);    /* mean & var of each sample */
   meanY = mean(y); varY = var(y);
   DF = n1 + n2 - 2;
   /* pooled variance: https://blogs.sas.com/content/iml/2020/06/29/pooled-variance.html */
   poolVar = ((n1-1)*varX + (n2-1)*varY) / DF;
   /* Compute test statistic, critical value, and whether we reject H0 */
   t = (meanX - meanY) / sqrt(poolVar*(1/n1 + 1/n2));
   *print (meanX - meanY)[L="Diff (1-2)"] (sqrt(poolVar))[L="PoolSD"] t;
   t_crit = quantile('t', 1-alpha/2, DF);
   RejectH0 = (abs(t) > t_crit);      /* binary 0|1 for 2-sided test */
   return RejectH0;
finish;

use TwoSample;
   read all var 'x' into x where(group=1);
   read all var 'x' into y where(group=2);
close;

reject = TTestH0(x, y);
print reject;
To confirm that the computations are the same as in PROC TTEST, I uncommented the PRINT statement in the module. The computations are the same as for PROC TTEST. The test fails to reject the null hypothesis for these data.
Notice that the SAS/IML function is both concise (less than 10 lines) and mimics the formulas that you see in textbooks or documentation. This "natural syntax" is one of the advantages of the SAS/IML language.
The SAS/IML function looks like a typical function for a univariate analysis, but it contains a surprise. Without making any modifications, you can call the function to run thousands of t tests. The module is written in a vectorized manner: if you send in matrices that have B columns, the module runs a t test for each column and returns a binary row vector that contains B values.
Take a close look at the SAS/IML function. The MEAN and VAR functions operate on columns of a matrix and return row vectors. Consequently, the meanX, meanY, varX, and varY variables are row vectors. Therefore, the poolVar and t variables are also row vectors, and the RejectH0 variable is a binary row vector.
To demonstrate that the function can analyze many samples in a single call, the following statements simulate 100,000 random samples from the N(15,5) and N(16,5) distributions. Each column of the X and Y matrices is a sample. One call to the TTestH0 function computes 100,000 t tests. The "colon" subscript reduction operator computes the mean of the vector, which is the proportion of tests that reject the null hypothesis. As mentioned earlier, this is an estimate of the power of the test.
/* simulation of power */
B = 1e5;            /* number of independent samples (number of cols) */
n1 = 20; n2 = 30;   /* sample sizes (number of rows) */
call randseed(321);
x = j(n1, B);       /* allocate array for samples from Group 1 */
y = j(n2, B);       /* allocate array for samples from Group 2 */
call randgen(x, 'Normal', &mu1, &sigma);   /* X ~ N(mu1,sigma) */
call randgen(y, 'Normal', &mu2, &sigma);   /* Y ~ N(mu2,sigma) */
reject = TTestH0(x, y);   /* result is row vector with B elements */
power = reject[:];
print power[L="Power (B=1E5)"];
The result is very close to the exact power as computed by PROC POWER.
Notice also that the SAS/IML function, when properly written, can test thousands of samples as easily as it tests one sample. The program does not require an explicit loop over the samples.
This article starts with a simulated sample that contains two groups. Although the groups are simulated from distributions that have different means, due to random sampling variation a t test fails to reject the null hypothesis that the means are equal. This is a Type 2 error. For random samples from the specified distributions, PROC POWER shows that the power of the t test is about 0.1, so the test commits a Type 2 error for about 90% of random samples. You can estimate the power by simulating many samples and computing the proportion that reject the null hypothesis. The simulation agrees with the exact power.
I previously wrote a simulation of power by using the DATA step and BY-group processing in PROC TTEST. But in this article, I implement a t test by writing a SAS/IML function. You might be surprised to discover that you can pass many samples to the function, which runs a test on each column of the input matrices.
It is worth mentioning that you can repeat this computation for many parameter values to obtain a power curve. For example, you can vary the difference between the means to determine how the power depends on the magnitude of the difference between the means.
The post Use simulation to estimate the power of a statistical test appeared first on The DO Loop.
The most fundamental concept that students learning introductory SAS programming must master is how SAS handles data. This might seem like an obvious statement, but it is often overlooked by students in their rush to produce code that works. I often tell my class to step back for a moment and "try to think like SAS" before they even touch the keyboard. There are many key topics that students must understand in order to be successful SAS programmers. How does SAS compile and execute a program? What is the built-in loop that SAS uses to process data observation by observation? What are the coding differences when working with numeric and character data? How does SAS handle missing observations?
One concept that is a common source of confusion for students is how to tell SAS to treat rows versus columns. An example that we use in class is how to write a program to calculate a basic descriptive statistic, such as the mean. The approach that we discuss is to identify our goal, rows or columns, and then decide what SAS programming statements are appropriate by thinking like SAS. First, we decide if we want to calculate the mean of an observation (a row) or the mean of a variable (a column). We also pause to consider other issues such as the type of variable, in this case numeric, and how SAS evaluates missing data. Once these concepts are understood we can proceed with an appropriate method: using DATA step programming, a procedure such as MEANS, TABULATE, REPORT or SQL, and so on. For more detailed information about this example there is an excellent user group paper on this topic called "Many Means to a Mean" written by Shannon Pileggi for the Western Users of SAS Software conference in 2017. In addition, The Little SAS® Book and its companion book, Exercises and Projects for the Little SAS® Book, Sixth Edition, address these types of topics in easy-to-understand examples followed up with thought-provoking exercises.
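As a language-neutral illustration of the row-versus-column distinction, here is a Python sketch with made-up data (not SAS code): the mean of an observation combines values across a row, the mean of a variable combines values down a column, and missing values are skipped in both directions, just as SAS's MEAN function and PROC MEANS skip missing values:

```python
# Hypothetical mini-table: 3 observations (rows) of variables X1-X3.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],   # one missing value
    [7.0, 8.0, 9.0],
]

def mean_ignoring_missing(values):
    vals = [v for v in values if v is not None]
    return sum(vals) / len(vals)

# Row-wise means (one per observation) ...
row_means = [mean_ignoring_missing(r) for r in rows]
# ... versus column-wise means (one per variable).
col_means = [mean_ignoring_missing(col) for col in zip(*rows)]

print(row_means)   # [2.0, 5.0, 8.0]
print(col_means)   # [4.0, 5.0, 6.0]
```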
Here is an example of the type of question that our book of exercises and projects uses to address this type of concept.
Avg1 = MEAN(X1,X2,X3);
Avg2 = (X1 + X2 + X3) / 3;

PROC MEANS;
   VAR X1 X2 X3;
RUN;
In the book, we provide solutions for odd-numbered multiple choice and short answer questions, and hints for the programming exercises. Here is the solution for this question:
For more information about The Little SAS Book and its companion book of exercises and projects, check out these blogs:
Learning to think like SAS was published on SAS Users.
Katherine Taylor explains how and why new data improves risk models.
Risk modeling during a crisis requires new data was published on SAS Voices by Katherine Taylor
A common operation in statistical data analysis is to center and scale a numerical variable. This operation is conceptually easy: you subtract the mean of the variable and divide by the variable's standard deviation. Recently, I wanted to perform a slight variation of the usual standardization: standardize the data separately within groups and, optionally, use location and scale values other than the sample statistics.
Although you can write SAS DATA step code to center and scale the data, PROC STDIZE in SAS contains options that handle both these cases.
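As a quick sketch of the underlying arithmetic (in Python, not SAS; the helper function is my own), standardization subtracts a location value and divides by a scale value, which default to the sample statistics but can instead be supplied by the caller, such as known population parameters:

```python
import statistics

def standardize(xs, loc=None, scale=None):
    """Center and scale a list of values. Defaults to the sample mean and
    sample std dev; callers can supply other location/scale parameters."""
    loc = statistics.mean(xs) if loc is None else loc
    scale = statistics.stdev(xs) if scale is None else scale
    return [(x - loc) / scale for x in xs]

data = [14.0, 16.0, 18.0]
z = standardize(data)                   # sample statistics: mean 16, stdev 2
print(z)                                # [-1.0, 0.0, 1.0]
w = standardize(data, loc=15, scale=5)  # known (population) parameters
print(w)                                # [-0.2, 0.2, 0.6]
```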
Let's start by creating some example data. The following DATA step simulates samples drawn from two groups. The distribution of the first group is N(mu1, sigma) and the distribution of the second group is N(mu2, sigma). The first sample size is n1=20; the second is n2=30. Notice that the data are ordered by the Group variable.
%let mu1 = 15;
%let mu2 = 16;
%let sigma = 5;
/* Create two samples N(mu1, sigma) and N(mu2, sigma),
   with sample sizes n1 and n2, respectively */
data TwoSample;
call streaminit(54321);
n1 = 20; n2 = 30;
Group = 1;
do i = 1 to n1;
   x = round(rand('Normal', &mu1, &sigma), 0.01);
   output;
end;
Group = 2;
do i = 1 to n2;
   x = round(rand('Normal', &mu2, &sigma), 0.01);
   output;
end;
keep Group x;
run;

ods graphics / width=500px height=360px;
title "Normal Data";
title2 "N(&mu1, &sigma) and N(&mu2, &sigma)";
proc sgpanel data=TwoSample noautolegend;
   panelby Group / columns=1;
   histogram x;
   density x / type=normal;
   colaxis grid;
run;
The graph shows a comparative histogram of the distribution of each group and overlays the density of the normal curve that best fits the data. You can run PROC MEANS to calculate that the estimates for the first group are mean=14.9 and StdDev=5.64. For the second group, the estimates are mean=15.15 and StdDev=4.27.
Typically, you standardize data by using the sample mean and the sample standard deviation. You can do this by using PROC STDIZE and specify the METHOD=STD method (which is the default method). You can use the BY statement to apply the standardization separately to each group. The following statements standardize the data in each group by using the sample statistics:
/* use PROC STDIZE to standardize by the sample means and sample std dev */
proc stdize data=TwoSample out=ScaleSample method=std;
   by Group;
   var x;
run;

proc means data=ScaleSample Mean StdDev Min Max ndec=2;
   class Group;
   var x;
run;
As expected, each sample now has mean=0 and unit standard deviation. Many multivariate analyses center and scale data in order to adjust for the fact that different variables are measured on different scales. PROC STDIZE has many other options for standardizing data.
A second way to standardize the data is to use the DATA step to center and scale each variable and each group. You can overwrite the contents of each column, or (as I've done below), you can create a new variable that contains the standardized values. You need to specify values for the location and scale parameters. For these simulated data, I know the population values of the location and scale parameters. The following DATA step uses the parameters to center and scale the data:
/* because we know the population parameters,
   we can use them to center and scale the data */
/* DATA step method */
data ScaleData;
array mu[2] (&mu1 &mu2);   /* put parameter values into an array */
set TwoSample;
y = (x - mu[Group]) / &sigma;
run;

proc means data=ScaleData Mean StdDev Min Max ndec=2;
   class Group;
   var y;
run;
Because I use the population parameters instead of the sample estimates, the transformed variable has neither zero mean nor unit variance.
You can perform the same calculations by using PROC STDIZE. If you look up the documentation for the METHOD= option on the PROC STDIZE statement, you will see that the procedure supports the METHOD=IN(ds) method, which enables you to read parameters from a data set. I will focus on the LOCATION and SCALE parameters; see the documentation for other options.
You can specify the parameters in "wide form" or in "long form". In wide form, each column specifies a parameter and you can include multiple rows if you want to perform a BY-group analysis. You can use the LOCATION and SCALE statements to specify the name of the variables that contain the parameters. The following DATA step creates a data set that has three variables: Group, Mu, and Sigma. Each row specifies the location and scale parameter for centering and scaling data in the levels of the Group variable.
data ScaleWide;
Group=1; Mu=&mu1; Sigma=&sigma; output;
Group=2; Mu=&mu2; Sigma=&sigma; output;
run;

proc stdize data=TwoSample out=StdWide method=in(ScaleWide);
   by Group;
   var x;
   location mu;    /* column name in METHOD=IN data set */
   scale sigma;    /* column name in METHOD=IN data set */
run;

proc means data=StdWide Mean StdDev Min Max ndec=2;
   class Group;
   var x;
run;
The output is identical to the output from the previous section and is not shown.
You can also specify the location and scale parameters in long form. For a long-form specification, you need to create a data set that contains a variable named _TYPE_. The parameters for centering and scaling are specified in rows where _TYPE_="LOCATION" and _TYPE_="SCALE", respectively. You can also include one or more BY-group variables (if you are using the BY statement) and one or more columns whose names are the same as the variables you intend to transform.
For example, the example data in this article has a BY-group variable named Group and an analysis variable named X. The following DATA step specifies the values to use to center (_TYPE_='LOCATION') and scale (_TYPE_='SCALE') the X variable for each group:
data ScaleParam;
length _TYPE_ $14;
input Group _TYPE_ x;
datalines;
1 LOCATION 15
1 SCALE     5
2 LOCATION 16
2 SCALE     5
;

proc stdize data=TwoSample out=Scale2 method=in(ScaleParam);
   by Group;
   var x;
run;

proc means data=Scale2;
   class Group;
   var x;
run;
Again, the result (not shown) is the same as when the DATA step is used to center and scale the data.
This article gives examples of using PROC STDIZE in SAS/STAT to center and scale data. Most analysts want to standardize data in groups by using the sample means and sample standard deviations of each group. This is the default behavior of PROC STDIZE. However, you can also center and scale data by using values that are different from the group statistics. For example, you might want to use a grand mean and a pooled standard deviation for the groups. Or, perhaps you are using simulated data and you know the population mean and variance of each group. Regardless, there are three ways to center and scale the data when you know the location and scale parameters: use the DATA step, use PROC STDIZE with a wide-form METHOD=IN data set, or use PROC STDIZE with a long-form METHOD=IN data set.
The post 4 ways to standardize data in SAS appeared first on The DO Loop.
Have you ever seen the "brain teaser" for children that shows a 4 x 4 grid and asks "how many squares of any size are in this grid?" To solve this problem, the reader must recognize that there are sixteen 1 x 1 squares, nine 2 x 2 squares, four 3 x 3 squares, and one 4 x 4 square. Hence there are a total of (16+9+4+1) = 30 squares of any size.
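The counting argument behind the brain teaser can be sketched in a few lines (Python used here purely as a calculator): there are (n - k + 1)^2 positions for a k x k square in an n x n grid.

```python
# Number of k x k squares in an n x n grid: (n - k + 1) ** 2 for each k.
n = 4
counts = [(n - k + 1) ** 2 for k in range(1, n + 1)]
print(counts)        # [16, 9, 4, 1]
print(sum(counts))   # 30
```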
Recently a SAS statistical programmer asked a similar question about submatrices of a matrix. Specifically, the programmer wanted to form square submatrices of consecutive rows and columns. This article shows a few tricks in the SAS/IML language that enable you to generate and compute with all k x k submatrices of a matrix, where the submatrices are formed by using consecutive rows and columns of the original matrix. Although it is possible to consider more general submatrices, in this article, a "submatrix" always refers to one that is formed by using consecutive rows and columns.
I've previously written about several ways to extract submatrices from a matrix. Recall that the "colon operator" (formally called the index creation operator) generates a vector of consecutive integers. You can use the vectors in the row and column positions of the subscripts to extract a submatrix. For example, the following SAS/IML statements define a 4 x 4 matrix and extract the four 3 x 3 submatrices:
A = {4 3 1 6,
     2 4 3 1,
     0 2 4 3,
     5 0 2 4};
/* extract the four 3 x 3 submatrices */
A11 = A[1:3, 1:3];
A12 = A[1:3, 2:4];
A21 = A[2:4, 1:3];
A22 = A[2:4, 2:4];
Manually specifying the row and column numbers is tedious. Let's write a SAS/IML function that will return a k x k submatrix of A by starting at the (i,j)th position of A. Call k the order of the submatrix.
/* Given A, return the square submatrix of order k starting at index (i,j) */
start submat(A, i, j, k);
   return A[ i:(i+k-1), j:(j+k-1) ];
finish;

/* example: 3 x 3 submatrices of A */
A11 = submat(A, 1, 1, 3);
A12 = submat(A, 1, 2, 3);
A21 = submat(A, 2, 1, 3);
A22 = submat(A, 2, 2, 3);

/* https://blogs.sas.com/content/iml/2015/12/02/matrices-graphs-gridded-layout.html */
ods layout gridded columns=4 advance=table;
print A11, A12, A21, A22;
ods layout end;
If A is an n x m matrix, we can ask how many submatrices of order k are in A. Recall that we are only considering submatrices formed by consecutive rows and columns. There are (n – k + 1) sequences of consecutive rows of length k, such as 1:k, 2:(k+1), and so forth. Similarly, there are (m – k + 1) sequences of consecutive columns of length k. So the following SAS/IML function counts the number of submatrices of order k. You can call the function for different k=1, 2, 3, and 4 in order to solve the "Counting Squares" brain teaser:
/* count all submatrices of order k for the matrix A */
start CountAllSubmat(A, k);
   numRows = nrow(A) - k + 1;
   numCols = ncol(A) - k + 1;
   return numRows * numCols;
finish;

n = nrow(A);
numSubmat = j(n, 1, .);
do k = 1 to n;
   numSubmat[k] = CountAllSubmat(A, k);
end;
order = T(1:n);
print order numSubmat;
If you add up the number of squares of each size, you get a total of 30 squares.
You can iterate over all submatrices of a specified order. You can perform a matrix computation on each submatrix (for example, compute its determinant), or you can pack it into a list for future processing. The following SAS/IML function iterates over submatrices and puts each submatrix into a list. The STRUCT subroutine from the ListUtil package can be used to indicate the contents of the list in a concise form.
start GetListSubmat(A, k);
   numSub = CountAllSubmat(A, k);
   L = ListCreate( numSub );          /* create list with numSub items */
   cnt = 1;
   do i = 1 to nrow(A)-k+1;           /* for each row */
      do j = 1 to ncol(A)-k+1;        /* and for each column */
         B = submat(A, i, j, k);      /* submatrix of order k from (i,j)th cell */
         call ListSetItem(L, cnt, B); /* put in list (or compute with B here) */
         cnt = cnt + 1;
      end;
   end;
   return L;
finish;

/* Example: get all matrices of order k=2 */
L = GetListSubmat(A, 2);
package load ListUtil;
call Struct(L);
The output from the STRUCT call shows that there are nine 2 x 2 matrices. The Value1-Value4 columns show the values (in row-major order). For example, the first 2 x 2 submatrix (in the upper left corner of A) is {4 3, 2 4}. The second submatrix (beginning with A[1,2]) is {3 1, 4 3}.
It is straightforward to modify the function to compute the determinant or some other matrix computation.
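For readers outside SAS/IML, the same enumerate-and-compute pattern can be sketched in Python (the generator is my own helper; the matrix A and the 2 x 2 determinant example come from this article):

```python
A = [[4, 3, 1, 6],
     [2, 4, 3, 1],
     [0, 2, 4, 3],
     [5, 0, 2, 4]]

def submatrices(A, k):
    """Yield every k x k submatrix built from consecutive rows and columns."""
    for i in range(len(A) - k + 1):
        for j in range(len(A[0]) - k + 1):
            yield [row[j:j + k] for row in A[i:i + k]]

# Example computation: the determinant of each 2 x 2 submatrix.
dets = [m[0][0]*m[1][1] - m[0][1]*m[1][0] for m in submatrices(A, 2)]
print(len(dets))   # nine 2 x 2 submatrices
print(dets[0])     # det of {4 3, 2 4} is 10
```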
This article shows some tips and techniques for dealing with submatrices of a matrix. The submatrices in this article are formed by using consecutive rows and columns. Thus, they are similar to the "Counting Squares" brain teaser, which requires that you consider sub-squares of order 1, 2, 3, and so forth.
The post Submatrices of matrices appeared first on The DO Loop.