Tips and Techniques

July 20, 2022
 

It isn't easy to draw the graph of a function when you don't know what the graph looks like. To draw the graph by using a computer, you need to choose the domain on which to plot the function: the minimum value (xMin) and the maximum value (xMax) of the plotting interval. When you know a lot about the function, you can make well-informed choices for visualizing its graph. Without that knowledge, you might need several attempts before you discover the domain that best demonstrates the shape of the function. This trial-and-error approach is time-consuming and frustrating.

If the function is the density function for a probability distribution, there is a neat trick you can use to visualize the function. You can use quantiles of the distribution to determine the interval [xMin, xMax]. This article shows that you can set xMin to an extreme-left quantile such as QUANTILE(0.01) and set xMax to an extreme-right quantile such as QUANTILE(0.99). The values 0.01 and 0.99 are arbitrary, but in practice they provide good initial choices for many distributions. After looking at the resulting graph, you can modify those numbers to improve the visualization.

Choose a domain for the lognormal distribution

Here's an example. Suppose you want to graph the PDF or CDF of the standard lognormal distribution. What is the domain of the lognormal distribution? You could look it up on Wikipedia, but mathematically the 0.01 and 0.99 quantiles of the distribution provide an interval that contains 98% of the probability. Thus, it makes sense to initially visualize the function on this domain. The following SAS DATA step evaluates the PDF at 201 evenly spaced points on the interval [xMin, xMax]:

/* Choose xMin and xMax by using quantiles, then use equally spaced intervals in x */
data PDF;
xMin = quantile("LogNormal", 0.01);   /* p = 0.01 */
xMax = quantile("LogNormal", 0.99);   /* p = 0.99 */
dx = (xMax - xMin) / 200;
do x = xMin to xMax by dx;
   pdf = pdf("Lognormal", x); 
   output;
end;
drop xMin xMax dx;
run;
 
title "Visualize PDF by using Quantiles to Determine Domain";
proc sgplot data=PDF;
  series x=x y=PDF;
run;

This graph is pretty good, but not perfect. For the lognormal distribution, you might want xMin to move to the left so that the graph shows more of the shape of the left tail of the distribution. If you know that x=0 is the minimum value for the lognormal distribution, you could set xMin=0. However, you could also let the distribution itself decide the lower bound by decreasing the probability value used for the left quantile. You can change one number to re-visualize the graph. Just change the probability value from 0.01 to a smaller number such as 0.001 or 0.0001. For example, if you use 0.0001, the first statement in the DATA step looks like this:

xMin = quantile("LogNormal", 0.0001);  /* p = 0.0001 */

With that small change, the graph now effectively shows the shape of the PDF for the lognormal distribution:

Visualize discrete densities

You can use the same trick to visualize densities or cumulative probabilities for a discrete probability distribution. For example, suppose you want to visualize the density for the Poisson(20) distribution. If you are familiar with the Poisson distribution, you might know that you need an interval that is symmetric about x=20 (the mean value), but how wide should the interval be? Instead of doing computations in your head, just use the quantile trick, as follows:

/* A discrete example: PDF = Poisson(20) */
data PMF;
xMin = quantile("Poisson", 0.01, 20);   /* p = 0.01 */
xMax = quantile("Poisson", 0.99, 20);   /* p = 0.99 */
do x = xMin to xMax;                    /* x is integer-valued */
   pmf = pdf("Poisson", x, 20);         /* discrete probability mass function (PMF) */
   output;
end;
drop xMin xMax;
run;
 
title "Visualize Discrete PDF by using Quantiles to Determine the Domain";
proc sgplot data=PMF;
  vbar x / response=pmf;
run;

Notice that the DATA step uses integer steps between xMin and xMax because the Poisson distribution is defined only for integer values of x. Notice also that I used a bar chart to visualize the density. Alternatively, you could use a needle chart. Regardless, the program automatically chose the interval [10, 31] as a representative domain. If you want to visualize more of the tails, you can use 0.001 and 0.999 as the input values to the QUANTILE call.
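
If you prefer a needle chart, a minimal sketch (reusing the PMF data set created above) might look like the following:

/* alternative: display the PMF as a needle chart */
title "Poisson(20) PMF as a Needle Chart";
proc sgplot data=PMF;
   needle x=x y=pmf;
run;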

Summary

This article shows a convenient trick for visualizing the PDF or CDF of a distribution function. You can use the quantile function to find quantiles in the left tail (xMin) and in the right tail (xMax) of the distribution. You can visualize the function by graphing it on the interval [xMin, xMax]. I suggest you use the 0.01 and 0.99 quantiles as an initial guess for the domain. You can modify those values if you want to visualize more (or less) of the tails of the distribution.

The post A tip for visualizing the density function of a distribution appeared first on The DO Loop.

July 18, 2022
 

A colleague was struggling to compute a right-tail probability for a distribution. Recall that the cumulative distribution function (CDF) is defined as a left-tail probability. For a continuous random variable, X, with density function f, the CDF at the value x is
    F(x) = Pr(X ≤ x) = ∫ f(t) dt
where the integral is evaluated on the interval (-∞, x). That is, it is the area under the density function "to the left" of x. In contrast, the right-tail probability is the complement of this quantity. The right-tail probability is defined by
    S(x) = Pr(X ≥ x) = 1 – F(x)
It is the area under the density function "to the right" of x.
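
In SAS, the CDF function computes F(x) and the SDF function computes S(x). Here is a minimal sketch that evaluates both tails of the standard normal distribution at x = 1.96 (the distribution and value are chosen only for illustration):

/* the CDF function returns Pr(X <= x); the SDF function returns Pr(X >= x) */
data _null_;
   x = 1.96;
   F = cdf("Normal", x);    /* left-tail probability, approximately 0.975 */
   S = sdf("Normal", x);    /* right-tail probability, approximately 0.025 */
   put F= S=;
run;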

The right-tail probability is also called the right-side or upper-tail probability. The function, S, that gives the right-tail probability is called the survival distribution function (SDF). Similarly, a quantile that is associated with a probability that is close to 0 or 1 might be called an extreme quantile. This article discusses why a naive computation of an extreme quantile can be inaccurate. It shows how to compute extreme probabilities and extreme quantiles by using special functions in SAS: the SQUANTILE function (for extreme-right quantiles) and the SDF function (for right-tail probabilities).

Accurate computations of extreme probabilities

My colleague contacted me to discuss some issues related to computing an extreme quantile for p ≈ 1. To demonstrate, I will use the standard lognormal distribution, whose PDF is shown to the right. (However, this discussion applies equally well to other distributions.) When p is small, the lognormal quantile is close to 0. When p is close to 1, the lognormal quantile is arbitrarily large because the lognormal distribution is unbounded.

The following SAS DATA step uses the QUANTILE function to compute extreme quantiles in the left and right tails of the lognormal distribution. The quantiles are associated with the probabilities p=1E-k and p=1 – 1E-k for k=8, 9, 10, ..., 15.

%let Distrib = Lognormal;
data Have;
do pow = -8 to -15 by -1;
   p = 10**pow;                   
   Lx = quantile("&Distrib", p);   /* left-tail quantile: find x such that CDF(x) = p */
   Rp = 1 - p;                     /* numerically, 1-p = 1 for p < MACEPS */
   Rx = quantile("&Distrib", Rp);  /* right-tail quantile: find x such that CDF(x) = 1-p */
   output;
end;
format Rp 18.15;
run;
 
proc print noobs; run;

The second column shows the probability, p, and the Lx column shows the corresponding left-tail quantile. For example, the first row shows that 1E-8 is the probability that a standard lognormal random variate has a value less than 0.00365. The Rp column shows the probability 1 – p. The Rx column shows the corresponding right-tail quantile. For example, the first row shows that 1 – 1E-8 is the probability that a standard lognormal random variate has a value less than 273.691.

From experimentation, it appears that the QUANTILE function returns a missing value when the probability argument is greater than or equal to 1 – 1E-12. The SAS log displays a note that states
    NOTE: Argument 2 to function QUANTILE('Normal',1) at line 5747 column 9 is invalid.
This means that the function will not compute extreme-right quantiles. Why? Well, I can think of two reasons:

  • When p is very small, the quantity 1 – p does not have full precision. (This is called the "loss-of-precision problem" in numerical analysis.) In fact, if p is less than machine epsilon, the expression 1 – p EXACTLY EQUALS 1 in floating-point computations! Thus, we need a better way to specify probabilities that are close to 1. A short DATA step after this list demonstrates the problem.
  • For unbounded distributions, the CDF function is extremely flat for values in the right tail. Because the pth quantile is obtained by solving for the root of the function CDF(x) - p = 0, the computation of an accurate extreme-right quantile is numerically imprecise for unbounded distributions.
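
To make the first point concrete, the following small sketch of a DATA step shows that 1 – p evaluates to exactly 1 when p is smaller than machine epsilon. (The variable names are arbitrary.)

/* demonstrate the loss-of-precision problem: 1 - p equals 1 when p < machine epsilon */
data LossOfPrecision;
   eps = constant('MACEPS');   /* machine epsilon, about 2.22E-16 */
   p = 1E-17;                  /* p is smaller than machine epsilon */
   q = 1 - p;                  /* the tiny difference cannot be represented; q is exactly 1 */
   isExactlyOne = (q = 1);     /* evaluates to 1 (true) */
run;

proc print data=LossOfPrecision noobs; run;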

Compute extreme quantiles in SAS

The previous paragraph does not mean that you can't compute extreme-right quantiles. It means that you should use a function that is specifically designed to handle probabilities that are close to 1. The special function is the SQUANTILE function, which uses asymptotic expansions and other tricks to ensure that extreme-right quantiles are computed accurately. If x is the value returned by the SQUANTILE function, then 1 – CDF(x) = p. You can, of course, rewrite this expression to see that x satisfies the equation CDF(x) = 1 – p.

The syntax for the SQUANTILE function is almost the same as for the QUANTILE function. For any distribution, QUANTILE("Distrib", 1-p) = SQUANTILE("Distrib", p). That is, you add an 'S' to the function name and replace 1 – p with p. Notice that by specifying p instead of 1 – p, the right-tail probability can be specified more accurately.

The following SAS DATA step is a modification of the previous DATA step. It uses the SQUANTILE function to compute the extreme-right quantiles:

data Want;
do pow = -8 to -15 by -1;
   p = 10**pow;
   Lx = quantile("&Distrib", p);   /* find x such that CDF(x) = p */
   Rp = 1 - p;
   Rx = squantile("&Distrib", p);  /* find x such that 1-CDF(x)=p */
   output;
end;
format Rp 18.15;
drop pow;
run;
 
proc print noobs; run;

The table shows that the SQUANTILE function can compute quantiles that are far in the right tail. The first row of the table shows that the probability is 1E-8 that a lognormal random variate exceeds 273.69. (Equivalently, 1 – 1E-8 is the probability that the variate is less than 273.69.) The last row of the table shows that 1E-15 is the probability that a lognormal random variate exceeds 2811.14.

Compute extreme probabilities in SAS

A related "loss-of-precision problem" affects the computation of right-tail probabilities. If you compute a right-tail probability as 1 – CDF("Distrib", x), the subtraction loses precision because CDF(x) ≈ 1 when x is large. To counter that problem, SAS provides the SDF function, which computes the survival distribution function (SDF). For a given value x, the SDF function returns the right-tail probability S(x) = 1 – CDF(x) directly, without explicitly forming the difference.

The following DATA step computes several one-sided probabilities. For each value of x, the program computes the probability that a random lognormal variate is greater than x:

%let Distrib = lognormal;
data SDF;
do x = 500 to 2500 by 500;              /* extreme quantiles  */
   Rp       = 1 - cdf("&Distrib", x);   /* right-tail probability */
   RpBetter = sdf("&Distrib", x);       /* better computation of right-tail probability */
   output;
end;
run;
 
proc print noobs; run;

The differences between the probabilities in the Rp and RpBetter columns are not as dramatic as the differences for the quantiles in the previous section. The naive expression 1 – CDF(x) still produces an approximately correct value; it merely loses some significant digits because CDF(x) ≈ 1 when x is large. Nevertheless, you can see small differences between the columns, and the RpBetter column will tend to be more accurate when x is large.

Summary

This article shows two tips for working with extreme quantiles and extreme probabilities. When the probability is close to 1, consider using the SQUANTILE function to compute quantiles. When the quantile is large, consider using the SDF function to compute probabilities.

The post Tips for computing right-tail probabilities and quantiles appeared first on The DO Loop.

February 23, 2022
 

It is important to be able to detect whether a numerical matrix is symmetric. Some operations in linear algebra require symmetric matrices. Sometimes, you can use special algorithms to factor a symmetric matrix. In both cases, you need to test a matrix for symmetry.

A symmetric matrix must be square. Mathematically, we say that an n x n matrix, A, is symmetric if A = A^T. In terms of the elements of A, the matrix is symmetric if A[i,j] = A[j,i] for all 1 ≤ i < j ≤ n.

How to test a numerical matrix for symmetry

In numerical linear algebra, you should usually avoid testing floating-point numbers for equality. Instead, ask whether the difference | A[i,j] - A[j,i] | is smaller than some value, δ. But what small value should you choose for δ? Although a value such as δ = 1E-14 is suitable for many matrices, symmetry should not depend on the scaling of the matrix. In other words, the test for the matrices A and k*A should give the same results for any nonzero scalar, k.

One way to ensure that the scale of the matrix does not affect the result is to operate on a standardized matrix. You can use the elements of A to determine the scale factor. Let c = max(|A[i,j]|) for 1 ≤ i, j ≤ n. Define δ = c*sqrt(ε), where ε is machine precision. In SAS, you can use the CONSTANT('MACEPS') and CONSTANT('SQRTMACEPS') function calls to obtain the value of ε and sqrt(ε), respectively. SAS uses 2.22E-16 for machine precision, so sqrt(ε) ≈ 1.5e-8.

In summary, a good test for symmetry is to test whether all pairwise differences satisfy the equation | A[i,j] - A[j,i] | < c*sqrt(ε), where c = max(|A[i,j]|). This is equivalent to scaling A so that the largest element has unit magnitude and comparing the scaled differences with sqrt(ε).

In a matrix language such as SAS/IML, you can easily write a function that tests for symmetry, as follows:

proc iml;
 
start TestSym(A);
   if nrow(A) ^= ncol(A) then return(0);    /* A is not square */
   c = max(abs(A));
   sqrteps = constant('SqrtMacEps');
   return( all( abs(A-A`) < c*sqrteps ) );
finish;

Examples of testing a matrix for symmetry

With that definition, let's see whether the function works on a few example matrices. The first example is a symmetric matrix. Let's see whether the test returns 1 (true) for the matrix and for various scalings of the matrix:

/* A is symmetric. TestSym(A) should return 1 (true). */
A = {1  2  3 -4  5,
     2  3  2  3  3,
     3  2  5 -1  0,
    -4  3 -1  4 -2, 
     5  3  0 -2  1 };
b1 = TestSym(A);
b2 = TestSym(1E6 * A);
b3 = TestSym(1E-8 * A);
print b1 b2 b3;

Success. The TestSym function correctly detects that the matrix is symmetric. Various scalings of the matrix are also deemed symmetric.

What happens if we break the symmetry by making a tiny change to the (3,5) element?

/* break the symmetry by changing an off-diagonal element */
A[3,5] = 1e-6;
d1 = TestSym(A);
d2 = TestSym(1E6 * A);
d3 = TestSym(1E-8 * A);
print d1 d2 d3;

For this matrix, the magnitude of the largest element is c=5. The perturbation to the matrix is 1e-6, which is greater than c*sqrt(ε) ≈ 7.5E-8. Accordingly, the test detects that the perturbed matrix and its scalings are not symmetric. If you rerun the previous statements but use a smaller perturbation such as A[3,5] = 1e-9, the test will declare that the matrix is "numerically symmetric." Tiny differences between A[i,j] and A[j,i] are ignored, where "tiny" is relative to the magnitude of the largest element of A.

Make a symmetric matrix

There are times when you might want to form a symmetric matrix from a matrix that is nearly symmetric. An easy way to do that is to use the matrix B = (A + A^T)/2, as follows:

B = (A + A`)/2;      /* B is always symmetric */
isSym = TestSym(B);
print isSym;

For any square matrix, A, the matrix B = (A + A^T)/2 is symmetric. I call the matrix B the symmetricization of A. Each off-diagonal element B[i,j] is the average of the corresponding elements A[i,j] and A[j,i].

Summary

This article shows how to test a matrix for symmetry in numerical linear algebra. It uses the magnitude of the largest element of the matrix as a scale factor to standardize the computation. This technique is used in several SAS/IML functions that need to determine whether a matrix is symmetric.

Test for skew symmetric matrix

Here is an exercise for the motivated reader. A square matrix, M, is skew-symmetric if it satisfies the equation M = -M`. In other words, M[i,j] = -M[j,i] for all 1 ≤ i, j ≤ n. Can you modify the TestSym function to test a matrix for skew-symmetry to within some numerical precision? Hint: You only need to change one line!
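
If you want to check your work, here is a minimal sketch of one possible solution. It reuses the structure of the TestSym function and assumes the PROC IML session from earlier is still active; only the comparison line changes:

/* sketch of one solution: for a skew-symmetric matrix, A + A` is (numerically) the zero matrix */
start TestSkewSym(A);
   if nrow(A) ^= ncol(A) then return(0);    /* A is not square */
   c = max(abs(A));
   sqrteps = constant('SqrtMacEps');
   return( all( abs(A+A`) < c*sqrteps ) );  /* changed line: compare A + A` instead of A - A` */
finish;

M = {0  2 -3,
    -2  0  1,
     3 -1  0 };
isSkew = TestSkewSym(M);   /* returns 1 (true) */
print isSkew;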

The post Test for a symmetric matrix appeared first on The DO Loop.

February 16, 2022
 

A SAS programmer asked an interesting question: If data in a time series has missing values, can you plot a dashed line to indicate that the response is missing at some times?

A simple way to achieve this is by overlaying two lines. The first line (the "bottom" line in the overlay) is dashed. The second line (the "top" line in the overlay) is solid and uses the BREAK option to add gaps where the series has missing data. The result is shown to the left.

Plotting gaps in the data

An important thing to remember about the SG graphics procedures in SAS is that points and lines are displayed in the same order as the statements that you specify. So, if you use two SERIES statements, the first line is plotted "on the bottom," and the second statement is plotted "on top."

A second important point is that you can use the BREAK option on the SERIES statement to force a break in the line for each missing value of the Y variable. The BREAK option causes the line to appear as multiple line segments that do not "connect through" missing data. If you do not use the BREAK option, the SERIES statement connects each valid data point to the next valid data point.

You can therefore plot two lines. The bottom line is dashed and connects through missing values. The top line is solid and breaks at missing values. This is shown in the following call to PROC SGPLOT:

/* create data that has some missing Y values */
data Have;
do x = 0 to 6.2 by 0.2;
   cnt + 1;
   y = 3 + x/4 + 0.5*sin(x**1.5);
   if cnt=10 | cnt=11 | cnt=20 | cnt=21 | 
      cnt=30 | cnt=40 | cnt=41 then 
      y = .;
   output;
end;
run;
 
title "Series with Gaps Due to Missing Values";
proc sgplot data=Have noautolegend;
   series x=x y=y / lineattrs=GraphData1(pattern=dash);
   series x=x y=y / BREAK lineattrs=GraphData1(pattern=solid thickness=2);
run;

The graph is shown at the top of this article.

Display more information about nonmissing data

There might be times when you want to enhance the series plot by showing more information about the location of the nonmissing data. An easy way to do that is to use the MARKERS option to add markers to the graph. The markers are displayed only at locations for which both X and Y are nonmissing. A second way to visualize the locations of the nonmissing values is to add a fringe plot along the bottom of the line plot, as follows:

/* append the "fringe" data: the X value of points that have nonmissing Y value */
data Want;
set Have Have(rename=(x=t) where=(y ^= .));
keep x y t;
run;
 
title "Series and Fringe Plot";
proc sgplot data=Want noautolegend;
   series x=x y=y / lineattrs=GraphData1(pattern=dash);
   series x=x y=y / markers BREAK 
                    lineattrs=GraphData1(pattern=solid thickness=2);
   fringe t;
run;

This graph makes it easier to see the nonmissing values and the locations of the gaps in the data.

Summary

This article shows a cool trick for using a dashed line to indicate that a time series has missing values. The trick is to overlay a dashed line and a solid line. By using the BREAK option on the solid line, the underlying dashed line shows through and visually indicates that missing values are present in the data.

The post A trick to plot a time series that has missing values appeared first on The DO Loop.

January 3, 2022
 

Last year, I wrote almost 100 posts for The DO Loop blog. My most popular articles were about data visualization, statistics and data analysis, and simulation and bootstrapping. If you missed any of these gems when they were first published, here are some of the most popular articles from 2021:

SAS programming and data visualization

Panel of Regression Diagnostic Plots in SAS

  • Display the first or last observations in data: Whether your data are in a SAS data set or a SAS/IML matrix, this article describes how to print the top rows (and bottom rows!) of a rectangular data set.
  • Customize titles in a visualization of BY groups: Have you ever used the BY statement to graph data across groups, such as Males/Females or Control/Experimental groups? If so, you might want to learn how to use the #BYVAR and #BYVAL keywords to customize the titles that appear on each graph.
  • Reasons to prefer a horizontal bar chart: Bar charts are used to visualize the counts of categorical data. Vertical charts are used more often, but there are advantages to using a horizontal bar chart, especially if you are displaying many categories or categories that have long labels. This article shows how to create a horizontal bar chart in SAS and gives examples for which a horizontal chart is preferable.
  • Why you should visualize distributions: It is common to report the means (or differences between means) for different groups in a study. However, means and standard deviations tell only part of the story. This article shows four examples in which the mean difference between group scores is five points. However, when you plot the data for each example, you discover additional information about how the groups differ.

Simulation and Bootstrapping

Since I am the guy who wrote the book on statistical simulation in SAS, it is not surprising that my simulation articles are popular. Simulation helps analysts understand expected values, sampling variation, and standard errors for statistics.

Bootstrapping Residuals for a Time Series Regression Model

Did you resolve to learn something new in the New Year? Reading these articles requires some effort, but they provide tips and techniques that make the effort worthwhile. So, read (or re-read!) these popular articles from 2021. To ensure you don't miss a single article in 2022, consider subscribing to The DO Loop.

The post Top 10 posts from The DO Loop in 2021 appeared first on The DO Loop.

April 26, 2021
 

SAS/IML programmers often create and call user-defined modules. Recall that a module is a user-defined subroutine or function. A function returns a value; a subroutine can change one or more of its input arguments. I have written a complete guide to understanding SAS/IML modules, which contains many tips for working with SAS/IML modules. Among the tips are two that can trip up programmers who are new to the SAS/IML language:

In the years since I wrote those articles, SAS/IML introduced lists, which are a popular way to pass many arguments to a user-defined module. Lists and matrices behave similarly when passed to a module:

  • If you modify a list in a module, the list is also changed in the calling environment.
  • If you specify an expression such as [A, B], SAS/IML creates a temporary list.

This article shows a subroutine that takes a list and modifies the items in the list. It also looks at what happens if you pass in a temporary list.

Review: Passing a matrix to a module

Before discussing lists, let's review the rules for passing a matrix to a user-defined module. The following user-defined subroutine doubles the elements for its matrix argument:

proc iml;
/* Double the values in X. The matrix X is changed in the calling environment. */
start Double(X);
   X = 2*X;     
finish;
 
r1 = 1:3;
r2 = 4:6;
A = r1 // r2;          /* each row of A is a copy. r1 and r2 do not share memory with A. */
run Double(A);
print A;

As shown by the output, the elements of A are changed after passing A to the Double module. Notice also that although A was constructed from vectors r1 and r2, those vectors are not changed. They were used to initialize the values of A, but they do not share any memory with A.

Now suppose you want to double ONLY the first row of A. The following attempt does NOT double the first row:

A = r1 // r2;          /* each row of A is a copy */
run Double( A[1,] );   /* NOTE: create temporary vector, which is changed but vanishes */
print A;

The result is not shown because the values of A are unchanged. The matrix is not changed because it was not sent into the module. Instead, a temporary matrix (A[1,]) was passed to the Double module. The values of the temporary matrix were changed, but that temporary matrix vanishes, so there is no way to access those modified values. (See the previous article for the correct way to double the first row.)
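
For completeness, here is a minimal sketch of one way to double only the first row. It assumes that the vectors r1 and r2 and the Double module from the earlier statements are still defined. The idea is to copy the row into a named ("permanent") vector, pass that vector to the module, and then assign the result back into A:

A = r1 // r2;
v = A[1,];           /* copy the first row into a permanent vector */
run Double(v);       /* v is changed in the calling environment */
A[1,] = v;           /* write the doubled values back into the first row */
print A;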

Passing lists: The same rules apply

These same rules apply to lists, as I demonstrate in the next example. The following user-defined subroutine takes a list as an argument. It doubles every (numerical) item in the list. This modification affects the list in the calling environment.

/* Double the values of every numerical matrix in a list, Lst. 
   Lst will be changed in the calling environment. */
start DoubleListItems(Lst);
   if type(Lst)^='L' then stop "ERROR: Argument is not a list";
   nItems = ListLen(Lst);
   do i = 1 to nItems;
      if type(Lst$i)='N' then 
         Lst$i = 2*Lst$i;     /* the list is changed in the calling environment */
   end;
finish;
 
A = {1 2 3, 4 5 6};
B = -1:1;
L = [A, B];            /* each item in L is a copy */
run DoubleListItems(L);/* the copies are changed */
print (L$1)[label="L$1"], (L$2)[label="L$2"];

As expected, the items in the list were modified by the DoubleListItems subroutine. The list is modified because we created a "permanent" variable (L) and passed it to the module. If you simply pass in the expression [A, B], then SAS/IML creates a temporary list. The module modifies the items in the temporary list, but you cannot access the modified values because the temporary list vanishes:

/* create temporary list. Items in the list are changed but the list vanishes */
run DoubleListItems( [A, B] );  /* no way to access the modified values! */

Summary

This brief note is a reminder that SAS/IML creates temporary variables for expressions like X[1,] or [A, B]. In most cases, programmers do not need to think about this fact. However, it becomes important if you write a user-defined module that modifies one of its arguments. If you pass a temporary variable, the modifications are made to the temporary variable, which promptly vanishes after the call. To prevent unexpected surprises, always pass in a "permanent" variable to a module that modifies its arguments.

The post On passing a list to a SAS/IML module appeared first on The DO Loop.

January 18, 2021
 

Have you ever heard of the DOLIST syntax? You might know the syntax even if you are not familiar with the name. The DOLIST syntax is a way to specify a list of numerical values to an option in a SAS procedure. Applications include:

  • Specify the end points for bins of a histogram
  • Specify percentiles to be output to a data set
  • Specify tick marks for a custom axis on a graph
  • Specify the location of reference lines on a graph
  • Specify a list of parameters for an algorithm. Examples include smoothing parameters (the SMOOTH= option in PROC LOESS), sample sizes (the NTOTAL= option in PROC POWER), and initial guess for parameters in an optimization (the PARMS statement in PROC NLMIXED and PROC NLIN)

This article demonstrates how to use the DOLIST syntax to specify a list of values in SAS procedures. It shows how to use a single statement to specify individual values and also a sequence of values.

The DOLIST syntax enables you to write a single statement that specifies individual values and one or more sequences of values. The DOLIST syntax should be in the toolbox of every intermediate-level SAS programmer!

The DOLIST syntax in the SAS DATA step

According to the documentation of PROC POWER, the syntax described in this article is sometimes called the DOLIST syntax because it is based on the syntax for the iterative DO loop in the DATA step.

The most common syntax for a DO loop is DO x = start TO stop BY increment. For example, DO x = 10 TO 90 BY 20; iterates over the sequence of values 10, 30, 50, 70, and 90. If the increment is 1, you can omit the BY increment portion of the statement. However, you can also specify values as a comma-separated list, such as DO x = 10, 30, 50, 70, 90;, which generates the same values. What you might not know is that you can combine these two methods. For example, in the following DATA step, the values are specified by using two comma-separated lists and three sequences. For clarity, I have placed each list on a separate line, but that is not necessary:

/* the DOLIST syntax for a DO loop in the DATA step */
data A;
do pctl = 5,                  /* individual value(s)  */
          10 to 50 by 20,     /* a sequence of values */
          54.3, 69.1,         /* individual value(s)  */
          80 to 90 by 5,      /* another sequence     */
          60 to 40 by -20;    /* yet another sequence */
   output;
end;
run;
 
proc print; run;

The output (not shown) is a list of values: 5, 10, 30, 50, 54.3, 69.1, 80, 85, 90, 60, 40. Notice that the values do not need to be in sorted order, although they often are.

The expressions to the right of the equal sign are what I mean by the "DOLIST syntax." You can use the same syntax to specify a list of options in many SAS procedures. When the SAS documentation says that an option takes a "list of values," you can often use a comma-separated list, a space-separated list, and the syntax start TO stop BY increment. (Or a combination of these expressions!) The following sections provide a few examples, but there are literally hundreds of options in SAS that support the DOLIST syntax!

Some procedures (for example, PROC SGPLOT) require the DOLIST values to be in parentheses. Consequently, I have adopted the convention of always using parentheses around DOLIST values, even if the parentheses are not strictly required. As far as I know, it is never wrong to put the DOLIST inside parentheses, and it keeps me from having to remember whether parentheses are required. The examples in this article all use parentheses to enclose DOLIST values.

Histogram bins and percentiles

You can use the DOLIST syntax to specify the endpoints of bins in a histogram. For example, in PROC UNIVARIATE, the ENDPOINTS= option in the HISTOGRAM statement supports a DOLIST. Because histograms use evenly spaced bins, usually you will specify only one sequence, as follows:

proc univariate data=sashelp.cars;
   var weight;
   histogram weight / endpoints=(1800 to 7200 by 600);   /* DOLIST sequence expression */
run;

You can also use the DOLIST syntax to specify percentiles. For example, the PCTLPTS= option on the OUTPUT statement enables you to specify which percentiles of the data should be written to a data set:

proc univariate data=sashelp.cars;
   var MPG_City;
   output out=UniOut pctlpre=P_  pctlpts=(50 75, 95 to 100 by 2.5);  /* DOLIST */
run;

Notice that this example specifies both individual percentiles (50 and 75) and a sequence of percentiles (95, 97.5, 100).

Tick marks and reference lines

The SGPLOT procedure enables you to specify the locations of tick marks on the axis of a graph. Most of the time you will specify an evenly spaced set of values, but (just for fun) the following example shows how you can use the DOLIST syntax to combine evenly spaced values and a few custom values:

title "Specify Ticks on the Y Axis";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=Mpg_City;
   yaxis grid values=(10 to 40 by 5, 50 60); /* DOLIST; commas optional */
run;

As shown in the previous example, the GRID option on the XAXIS and YAXIS statements enables you to display reference lines at each tick location. However, sometimes you want to display reference lines independently from the tick marks. In that case, you can use the REFLINE statement, as follows:

title "Many Reference Lines";
proc sgplot data=sashelp.cars;
   scatter x=Weight y=MPG_City;
   refline (1800 to 6000 by 600, 7000) / axis=x;  /* many reference lines */
run;

Statistical procedures

Many statistical procedures have options that support lists. In most cases, you can use the DOLIST syntax to provide values for the list.

I have already written about how to use the DOLIST syntax to specify initial guesses for the PARMS statement in PROC NLMIXED and PROC NLIN. The documentation for the POWER procedure discusses how to specify lists of values and uses the term "DOLIST" in its discussion.
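
As a reminder of how that looks, the following sketch uses DOLIST values on the PARMS statement in PROC NLIN to define a grid of initial guesses. The data set (Have) and the model are hypothetical; the point is only the syntax of the value lists:

/* hypothetical example: DOLIST values on the PARMS statement define a grid of initial guesses */
proc nlin data=Have;
   parms b0 = 0 to 2 by 0.5     /* a sequence of candidate values */
         b1 = -1 0 1;           /* individual candidate values */
   model y = b0 * exp(b1 * x);  /* assumes variables x and y exist in Have */
run;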

Some statistical procedures enable you to specify multiple parameter values, and the analysis is repeated for each parameter in the list. One example is the SMOOTH= option in the MODEL statement of the LOESS procedure. The SMOOTH= option specifies values of the loess smoothing parameter. The following call to PROC LOESS fits four loess smoothers to the data. The call to PROC SGPLOT overlays the smoothers on a scatter plot of the data:

proc loess data=sashelp.cars plots=none;
   model MPG_City = Weight / smooth=(0.1 to 0.5 by 0.2, 0.75); /* value-list */
   output out=LoessOut P=Pred;
run;
 
proc sort data=LoessOut; by SmoothingParameter Weight; run;
proc sgplot data=LoessOut;
   scatter x=Weight y=MPG_City / transparency=0.9;
   series x=Weight y=Pred / group=SmoothingParameter 
          curvelabel curvelabelpos=min;
run;

Summary

In summary, this article describes the DOLIST syntax in SAS, which enables you to simultaneously specify individual values and evenly spaced sequences of values. A sequence is specified by using the start TO stop BY increment syntax. The DOLIST syntax is valid in many SAS procedures and in the DATA step. In some procedures (such as PROC SGPLOT), the syntax needs to be inside parentheses. For readability, you can use commas to separate individual values and sequences.

Many SAS procedures accept the special syntax even if it is not explicitly mentioned in the documentation. In the documentation for an option, look for terms such as value-list or numlist or value-1 <...value-n>, which indicate that the option supports the DOLIST syntax.

The post The DOLIST syntax: Specify a list of numerical values in SAS appeared first on The DO Loop.

January 11, 2021
 

On The DO Loop blog, I write about a diverse set of topics, including statistical data analysis, machine learning, statistical programming, data visualization, simulation, numerical analysis, and matrix computations. In a previous article, I presented some of my most popular blog posts from 2020. The most popular articles often deal with elementary or familiar topics that are useful to almost every data analyst.

However, among last year's 100+ articles are many that discuss advanced topics. Did you make a New Year's resolution to learn something new this year? Here is your chance! The following articles were fun to write and deserve a second look.

Machine learning concepts

Relationship between a threshold value and true/false negatives and positives

Statistical smoothers

Bilinear interpolation of 12 data values

I write a lot about scatter plot smoothers, which are typically parametric or nonparametric regression models. But a SAS customer wanted to know how to get SAS to perform various classical interpolation schemes such as linear and cubic interpolations:

SAS Viya and parallel computing

SAS is devoting tremendous resources to SAS Viya, which offers a modern analytic platform that runs in the cloud. One of the advantages of SAS Viya is the opportunity to take advantage of distributed computational resources. In 2020, I wrote a series of articles that demonstrate how to use the iml action in Viya 3.5 to implement custom parallel algorithms that use multiple nodes and threads on a cluster of machines. Whereas many actions in SAS Viya perform one and only one task, the iml action supports a general framework for custom, user-written, parallel computations:

The map-reduce functionality in the iml action

  • The map-reduce paradigm is a two-step process for distributing a computation. Every thread runs a function and produces a result for the data that it sees. The results are aggregated and returned. The iml action supports the MAPREDUCE function, which implements the map-reduce paradigm.
  • The parallel-tasks paradigm is a way to run independent computations concurrently. The iml action supports the PARTASKS function, which implements the parallel-tasks paradigm.

Simulation and visualization

Decomposition of a convex polygon into triangles

Generate random points in a polygon

Your turn

Did I omit one of your favorite blog posts from The DO Loop in 2020? If so, leave a comment and tell me what topic you found interesting or useful. And if you missed some of these articles when they were first published, consider subscribing to The DO Loop in 2021.

The post Blog posts from 2020 that deserve a second look appeared first on The DO Loop.

January 4, 2021
 

Last year, I wrote more than 100 posts for The DO Loop blog. In previous years, the most popular articles were about SAS programming tips, statistical analysis, and data visualization. But not in 2020. In 2020, when the world was ravaged by the coronavirus pandemic, the most-read articles were related to analyzing and visualizing the tragic loss and suffering of the pandemic. Here are some of the most popular articles from 2020 in several categories.

The coronavirus pandemic

Relationship between new cases and cumulative cases in COVID-19 infections

Statistical analysis: Regression

Visualization of collinearity diagnostics

Other statistical analyses

Data visualization

Many articles in the previous sections included data visualization, but two popular articles are specifically about data visualization:

Reference lines at values of statistical estimates

Many people claim they want to forget 2020, but these articles provide a few tips and techniques that you might want to remember. So, read (or re-read!) these popular articles from 2020. And if you made a resolution to learn something new this year, consider subscribing to The DO Loop so you don't miss a single article!

The post Top posts from The DO Loop in 2020 appeared first on The DO Loop.