sas/iml studio

11月 092016
 

This article uses graphical techniques to visualize one of my favorite geometric objects: the surface of a three-dimensional torus. Along the way, this article demonstrates techniques that are useful for visualizing more mundane 3-D point clouds that arise in statistical data analysis.


Projections of a 3-D torus: Four visualization techniques for high-dimensional data. #DataViz
Click To Tweet


Define points on a torus

A torus is the product of two circles, so it can be parameterized by two angles, usually called θ and φ. You can create a regular grid of points in the square (θ, φ) ∈ [0, 2π] x [0, 2π] and map the points into Euclidean three-space as shown in the following SAS DATA step:

data Torus;
R = 8;       /* radius to center of tube */
A = 3;       /* radius of tube */
pi = constant('pi');
step = 2*pi/50;
/* create torus as parametric image of [0, 2*pi] x [0,2*pi] */
do theta = 0 to 2*pi-step by step;
   do phi = 0 to 2*pi-step by 2*pi/50;
      x = (R + A*cos(phi)) * cos(theta);
      y = (R + A*cos(phi)) * sin(theta);
      z =      A*sin(phi);
      output;
   end;
end;
keep x y z;
run;
 
title "Projections of a Standard Torus";
proc sgscatter data=Torus;
   matrix x y z;
run;
Projections of torus onto coordinate planes

The SGSCATTER procedure displays the projection of the torus onto the three coordinate planes. In the (x,y) plane you can see that the object has a hole, but the projection onto the other coordinate planes are not very insightful because there is a lot of overplotting.

Rotating scatter plot in SAS/IML Studio

One way to avoid overplotting is to visualize the torus as a 3-D cloud of points. The SAS/IML Studio environment supports a rotating 3-D scatter plot, as shown to the left. In SAS/IML Studio you can interactively rotate the 3-D cloud of points, change the marker colors, and perform other techniques in exploratory data analysis.

Alternatively, if you want to look at planar projections with PROC SGSCATTER, you can rotate the torus so that the projections onto the coordinate planes are not degenerate.

Rotating a cloud of points

My previous article defined SAS/IML functions that create rotation matrices. The following SAS/IML program is almost identical to the program in the previous article, so I will not explain each statement. The program reads the Torus data set, rotates the points in a particular way, and writes the rotated coordinates to a SAS data set.

proc iml;
/* choose rotation matrix as product of canonical rotations */
load module=Rot3D;         /* see http://blogs.sas.com/content/iml/2016/11/07/rotations-3d-data.html */
pi = constant('pi');
Rz = Rot3D(-pi/6, "Z");    /* rotation matrix for (x,y) plane */
Rx = Rot3D(-pi/3, "X");    /* rotation matrix for (y,z) plane */ 
Ry = Rot3D( pi/6, "Y");    /* rotation matrix for (x,z) plane */
P = Rx*Ry*Rz;              /* cumulative rotation */
 
use Torus;                 /* read data (points on a torus) */
read all var {x y z} into M;
close Torus;
 
Rot = M * P`;              /* apply rotation and write to data set */
create RotTorus from Rot[colname={"Px" "Py" "Pz"}];
append from Rot;
close;
QUIT;

You can use PROC SGSCATTER to project these rotated points onto the coordinate planes. The TRANSPARENCY= option creates semi-transparent markers that give the illusion of a ghostly see-through torus. This can be an effective technique for visualizing any point cloud, since it enables you to see regions in which overplotting occurs. The statements also use the COLORRESPONSE= option to color the markers by the (original) Z variable. The COLORRESPONSE= option requires SAS 9.4m3.

data Coords;
merge Torus RotTorus;
run;
 
title "Projections of a Rotated Torus";
proc sgscatter data=Coords;
matrix Px Py Pz / markerattrs=(size=12 symbol=CircleFilled)
       transparency=0.75
       colorresponse=Z  colormodel=ThreeColorRamp; /* COLORRESPONSE= requires 9.4m3 */
run;
Projections of rotated torus onto coordinate planes

The transparent markers serve a second function. The torus is composed of a sequence of circles. Therefore, the last circles (near θ = 2π) will obscure the first circles (near θ = 0) if opaque markers are used. The parametric construction is very apparent if you remove the TRANSPARENCY= option.

If you want to plot without transparency, you should sort the data, which is a standard technique in the graphics community where it is called z-ordering or depth sorting. For example, in the (Px,Py) plane you can sort by the Pz variable so that negative Pz values are plotted first and positive Pz values are plotted on top of the earlier markers. Unfortunately, you can only use this sorting trick to plot one pair of coordinates at a time. The following code demonstrates this trick for the (Px, Py) plane:

proc sort data=Coords out=SortCoords;
   by Pz;
run;
 
title "Projection of a Rotated Torus";
title2 "Sorted by Distance Above Coordinate Plane";
proc sgplot data=SortCoords;
   scatter x=Px y=Py / markerattrs=(size=12 symbol=CircleFilled)
           colorresponse=Z  colormodel=ThreeColorRamp;
run;
Projection of torus onto coordinate plane

In conclusion, PROC SGSCATTER enables you to visualize high-dimensional data via projections onto coordinate planes. I demonstrated the techniques on a 3-D torus, but the techniques apply generally to any high-dimensional data. Visualization tricks in this article include:

  • If the projections of the data onto coordinate planes are degenerate, rotate the points.
  • Use semi-transparency to reduce the effect of overplotting.
  • Use the COLORRESPONSE= option to color the markers by one of the variables. This requires SAS 9.4m4.
  • If you do not want to use transparency, sort the data in a direction perpendicular to the projected plane. That makes the rendering look more realistic.

Your high-dimensional point clouds might not look as cool as this torus, but if you use these visualization tips, the data will be easier to interpret and understand. The SGSCATTER procedure in SAS is an easy-to-use routine for investigating relationships among several variables.

tags: Math, SAS/IML Studio, Statistical Graphics

The post Visualize a torus in SAS appeared first on The DO Loop.

10月 152014
 

A colleague jokingly teases me whenever I write a blog that demonstrates how to write fun and exciting programs by using SAS software. "Why do you get to have all the fun?" he mock-chides.

Today I'm ready to face his ribbing, because this article is about Conway's Game of Life and how to implement cellular automata in the SAS/IML language.

Cellular automata and Conway's Game of Life

Conway's Game of Life is a set of simple rules that give rise to beautiful regular and irregular patterns. The rules describes how cellular automata live or die based on the presence or absence of neighbors on a rectangular grid. This dynamical system is a "game" because it is fun to create some initial distribution of automata and watch the evolution of the system from that initial condition. Like many programmers, I wrap the edges of the grid so that the cells on the right and left sides of the grid are neighbors, and similarly for the cells on the top and bottom edges.

The following rules were used to evolve the automata:

  • Birth Rule: An organism is born into any empty cell that has exactly three living neighbors.
  • Survival Rule: An organism with either two or three neighbors survives from one generation to the next.
  • Death Rule: An organism with four or more neighbors dies from overcrowding. An organism with fewer than two neighbors dies from loneliness.

The Game of Life in SAS/IML Studio

Almost every beginning programmer has attempted to program the Game of Life. In a matrix language such as SAS/IML, the implementation of the rules are particularly easy and compact. I first implemented the Game of Life in SAS/IML Studio in 2001. My program is part of the standard set of demo programs that are distributed with SAS/IML Studio. To access the program from the SAS/IML Studio menu, select File ⇒ Open ⇒ File. Then click Go to Installation directory. Select the Programs and Demos folders, then open the file Life.sx.

When you run the program, you are prompted for the size of a square grid (I like 20 rows and columns) and told to click on the grid to create an initial distribution of the automata. (Hold down the SHIFT key while you click.) The following animated GIF shows 100 steps that evolve from a simple initial condition that contains an "exploder" pattern and a "glider" pattern.

lifeani

The interesting thing about this implementation in SAS/IML Studio is that I didn't program the animation at all. The "live" automata are observations that are selected in the grid. The program figures out which cells are alive and dead at the next generation and then simply tells the in-memory copy of the data to update the selected observations. The plot automatically updates to reflect the newly selected observations. To learn more about dynamically linked graphics and how to select observations in SAS/IML Studio, see my book Statistical Programming with SAS/IML Software, especially Chapters 6 and 10.

Recall that SAS/IML Studio is a free download and can be installed on a Windows PC and used by anyone who has a license for SAS/IML and SAS/STAT software. Unfortunately, SAS/IML Studio is not available for users of the free SAS University Edition. If you do not have access to SAS/IML Studio, here is a version of the Life.sx program. You can use the computational portions of the program in PROC IML, even though you cannot create the IMLPlus graphics.

Creating an animated GIF in SAS/IML Studio

When you run the program in SAS/IML Studio, the graphics are displayed on your PC monitor. So how did I create an animated GIF to embed in this blog post? It was surprisingly easy! In the IMLPlus program, the on-screen graph is controlled through a Java variable called plot. You can use the SaveToFile method to create a bitmap version of the graph. If you call that method within the loop that updates the graph for each new generation of automata, you get a sequence of bitmaps in some local directory, like this:

path = 'C:TempAnimGifLife';       /* directory to store images */
plot.SaveToFile(path+"Life000.bmp");  /* save the initial configuration */
do i = 1 to nGenerations;             /* for each new generation */
   run update_generation( grid );     /* compute new configuration for automata */
   run Grid2Selection( dobj, grid );  /* update the selected observations */
   plot.SetTitleText("Game of Life: Generation = "+char(i,3));  /* update title */
   suffix = trim(putn(i, "Z3."));     /* 001, 002, ..., 100 */
   plot.SaveToFile(path+"Life"+suffix+".bmp"); /* save image to Life001.bmp */
end;

After creating 100 bitmap images in a directory, you can create an animated GIF by using a program that stitches the images together in sequence. I used a freeware Java program called GiftedMotion, which does not require you to install any software. Just download the JAR file to your desktop, double click, and navigate to the directory that contains the images.

tags: IMLPlus, Just for Fun, SAS/IML Studio
6月 132014
 

Today is my 500th blog post for The DO Loop. I decided to celebrate by doing what I always do: discuss a statistical problem and show how to solve it by writing a program in SAS.

Two ways to parameterize the lognormal distribution

I recently blogged about the relationship between the parameters in the lognormal family and the underlying normal family. This inspired me to look closer into how the mean and standard deviation of the normal distribution are related to the mean and standard deviation of the lognormal distribution.

Recall that if X ~ N(μ, σ) is normally distributed with parameters μ and σ, then Y = exp(X) is lognormally distributed with the same parameters. The mean of Y is m = exp(μ + σ2/2) and the variance is v = (exp(σ2) -1) exp(2μ + σ2). As usual, the standard deviation of Y is sqrt(v).

These formulas establish a one-to-one (and therefore invertible) relationship between the central moments (means and standard deviations) of the normal and lognormal distributions. If you have the lognormal parameters (m, sqrt(v)), you can compute the corresponding normal parameters, and vice versa. The formulas for μ and σ as functions of m and v are μ = ln(m2 / φ) and σ = sqrt( ln(φ2 / m2) ) where φ = sqrt(v + m2). You can think of these formulas as defining a function, F, that maps (m, sqrt(v)) values into (μ, σ) values.

lognormxform

What does this mapping tell us about the relationship between the central moments of the normal and lognormal distributions? What is the geometry of the transformation that maps one set of means and standard deviations to the other?

If F is the function that gives (μ, σ) as a function of (m, sqrt(v), you can visualize F by using the graphic to the left. (Click to enlarge.) The grid in the upper part of the picture represents a uniform mesh [1,30] x [1,30] in the domain of F. These are means and standard deviations for lognormal distributions. Vertical lines (m is constant) are drawn in red; horizontal lines (v is constant) are drawn in blue. The corresponding (μ, σ) values (the image of the grid under F) are shown in the bottom part of the picture. The bottom picture shows the mean and standard deviation of the corresponding normal distributions.

By looking at the scale of the graphs, you can see that the transformation is highly contractive on this domain. The function maps a huge rectangle (841 square units) into a little wing-shaped region whose area is about 11 square units. The contraction is especially strong when the lognormal parameters m and v are large. For example the entire right half of the grid in the domain of F (where m ≥ 15) is mapped into a tiny area near (μ,σ)=(3,0.5) in the range.

This fact is not too surprising because the inverse mapping (F–1) is exponential, and therefore highly expansive. Nevertheless, it is illuminating to see the strength of the contraction. For example, the following graphic "zooms in" on an area of the domain that is centered near the point (m, sqrt(v)) = (22.5, 11.5), which is the approximate pre-image of (μ, σ) = (3, 0.5). The parallelogram in the domain of F is mapped to a tiny square that is centered near (μ, σ) = (3, 0.5). On this scale, the mapping is well-approximated by its linearization, which is the Jacobian matrix of first derivatives. The determinant of the Jacobian matrix at the point (m, sqrt(v) = (22.5, 11.5) is about 1/600, which means that the big parallelogram on the left has about 600 times the area of the tiny square on the right. (In the illustration, the square is not drawn to scale.)

lognormxforminv

Implications for data analysis

These pictures and computations tell us a lot about the relationship between the central moments of the normal and the lognormal distribution. The main result is that small changes in the parameters of the normal distribution cause large changes in the mean and standard deviation of the lognormal distribution. This leads to the following corollaries:

  • Small variations in normal data can lead to big difference in the lognormal data. If you simulate a small sample of data from an N(3, 0.5) distribution, you will get a sample moments that vary slightly from the parameter values. For example, the sample mean might range between 2.9 and 3.1, and the sample standard deviation might range between 0.4 and 0.6. However, if you exponentiate the simulated values to form a lognormal distribution, the mean and standard deviation of the lognormal data will vary widely from sample to sample. In short, the mean and standard deviation of lognormal data are very sensitive to variation in the normal samples.
  • For lognormal data with a large mean, the parameter estimates are not sensitive to variation in the data. Suppose you have a small data set with mean 22.5 and the standard deviation 11.5, and you think it lognormally distributed. If you use PROC UNIVARIATE to estimate the parameters for a lognormal fit, you will get estimates that are close to μ=3 and σ=0.5. Because of the small sample size, you might be worried that the parameter estimates aren't very good. But look at it this way: if you draw a different sample whose mean and standard deviation are several units away from those values, the parameter estimates will still be close to μ=3 and σ=0.5! Variations in the lognormal data are not very important when you estimate the parameters of the underlying normal data.

For simplicity, this discussion focused on parameter values near (μ, σ)=(3, 0.5), but you can use the same ideas and pictures to study other regions of parameter space. Moreover, the same ideas apply to other data transformations.

Computational details

The SAS/IML language is an excellent way to compute the quantities and to create the pictures in this article. In particular, I used the dynamically linked graphics in SAS/IML Studio to explore this problem because of the highly flexible graphics in IMLPlus. You can download the complete SAS/IML Studio program that I used to create the images.

The grids in the graphs were created by using the EXPANDGRID function that I discussed in the article how to generate a grid of points in SAS. It is easy to apply the transformation to the grid:

proc iml;
/* lognormal to normal transformation (m, sqrt(v)) --> (mu, sigma) */
start LN2N(p);
   m = p[,1]; v = p[,2]##2;
   phi = sqrt(v + m##2);
   mu = log(m##2 / phi);
   sigma = sqrt(log(phi##2/m##2));
   return( mu || sigma );
finish;
 
m = 1:30;                      /* horizontal grid points */
sqrtV = 1:30;                  /* vertical grid points */
grid = expandgrid(m, sqrtV);   /* make grid (SAS/IML 12.3) */
Fgrid = LN2N(grid);            /* compute image of grid */

You can easily compute a numerical Jacobian matrix (the matrix of first derivatives) by using the NLPFDD function, which computes finite-difference derivatives. You can compute the determinant of the Jacobian matrix by using the DET function. The following statements compute the Jacobian matrix and determinant at the point (22.5, 11.5). The determinant is approximately 1/600.

/* NLPFDD wants a function that returns a column vector */
start LN2Ncol(x);  
   return( T(LN2N(x)) );
finish;
 
/* linearization of lognormal-->normal transformation */
x0 = {22.5 11.5};    /* approximate pre-image of (3, 0.5) */
call nlpfdd(f, J, JpJ, "LN2Ncol", x0, {2 . .});
detJ = det(J);       /* det(J) shows strong local contraction */
print J[c={"dx1" "dx2"} r={"dF1" "dF2"}], detJ;
t_lognormalxform

This article—my 500th blog post—provides a good example of what I try to accomplish with my blog. The problem was inspired by a question from a SAS customer. I analyzed it by writing a SAS/IML program. The analysis requires applications of statistics, multivariate analysis, geometry, and matrix computations. I created some cool graphs. Thanks for reading.

tags: SAS/IML Studio, Statistical Programming
9月 162013
 

SAS has supported calling R from the SAS/IML language since 2009. The interface to R is part of the SAS/IML language. However, there have been so many versions of SAS and R since 2009, that it is hard to remember which SAS release supports which versions of R. The following table lists recent SAS releases and the versions of R that each supports:

There were two releases of R that broke backwards compatibility with prior SAS releases:

  • R 2.12.0, which introduced 64-bit R, changed the locations of certain libraries. To use R 2.12.0 or later, you must use SAS 9.3 or later.
  • R 3.0.0 changed certain aspects of the external API to R. To use R 3.0.0 or later you must use SAS 9.4 or later.
An error message that you might see because of incompatible versions is "An installed version of R could not be found," although this message can also occur for other reasons.

SAS ensures that each SAS release supports backwards compatibility with as many previous R releases as possible. However, after a version of SAS ships, it is impossible to ensure forward compatibility, as evidenced by the R 3.0.0 release. That version of R was released in April 2013; it is supported in SAS 9.4 software, which shipped a few months later.

Some software companies distribute their own versions of R, but SAS does not. Consequently, if the interface to R changes, SAS customers need to use a compatible version of R until they can upgrade to a more recent version of SAS. To reiterate:

  • If you are running a version of SAS prior to SAS 9.4, you should use R 2.15.x or earlier.
  • You can call R 3.0.x by using SAS/IML 12.3 or later. SAS/IML 12.3 shipped in July 2013 as part of SAS 9.4.

I'll close with a few comments about 32-bit and 64-bit versions of SAS and R.

  • If you are using SAS software on a 32-bit edition of Windows, you must install the 32-bit edition of R.
  • If you are using SAS software on a 64-bit edition of Windows, you can install either the 32-bit or the 64-bit edition of R. The 32-bit edition of SAS looks first for the 32-bit edition of R and then for the 64-bit edition. The 64-bit edition of SAS looks first for the 64-bit edition of R and then for the 32-bit edition.

tags: Data Analysis, SAS/IML Studio
9月 162013
 

SAS has supported calling R from the SAS/IML language since 2009. The interface to R is part of the SAS/IML language. However, there have been so many versions of SAS and R since 2009, that it is hard to remember which SAS release supports which versions of R. The following table lists recent SAS releases and the versions of R that each supports:

To date, three releases of R broke backwards compatibility with prior SAS releases:

  • R 2.12.0, which introduced 64-bit R, changed the locations of certain libraries. To use R 2.12.0 or later, you must use SAS 9.3 or later.
  • R 3.0.0 changed certain aspects of the external API to R. To use R 3.0.0 or later you must use SAS 9.4 or later.
  • R 3.0.2 changed an internal detail that SAS was using. SAS/IML 13.1 will support R 3.0.2.
An error message that you might see because of incompatible versions is "An installed version of R could not be found," although this message can also occur for other reasons.

SAS ensures that each SAS release supports backwards compatibility with as many previous R releases as possible. However, after a version of SAS ships, it is impossible to ensure forward compatibility, as evidenced by the R 3.0.0 release. That version of R was released in April 2013; it is supported in SAS 9.4 software, which shipped a few months later.

Some software companies distribute their own versions of R, but SAS does not. Consequently, if the interface to R changes, SAS customers need to use a compatible version of R until they can upgrade to a more recent version of SAS. To reiterate:

  • If you are running a version of SAS prior to SAS 9.4, you should use R 2.15.x or earlier.
  • You can call R 3.0.1 by using SAS/IML 12.3 or later. SAS/IML 12.3 shipped in July 2013 as part of SAS 9.4.
  • You can call R 3.0.2 and later by using SAS/IML 13.1 or later. SAS/IML 13.1 is scheduled to ship in December 2013 as part of SAS 9.4m1.

I'll close with a few comments about 32-bit and 64-bit versions of SAS and R.

  • If you are using SAS software on a 32-bit edition of Windows, you must install the 32-bit edition of R.
  • If you are using SAS software on a 64-bit edition of Windows, you can install either the 32-bit or the 64-bit edition of R. The 32-bit edition of SAS looks first for the 32-bit edition of R and then for the 64-bit edition. The 64-bit edition of SAS looks first for the 64-bit edition of R and then for the 32-bit edition.

tags: Data Analysis, SAS/IML Studio
3月 042013
 

I am pleased to announce that the documentation for the IMLPlus language is now available online. Previously, this resource was available only from within the SAS/IML Studio application. This documentation can now be accessed by anyone, regardless of whether they have installed SAS/IML Studio.

As I have described previously, IMLPlus is an enhanced version of the SAS/IML programming language that includes the ability to create and manipulate dynamically linked statistical graphs. You can use the IMLPlus language from within SAS/IML Studio, which is a free application that is distributed as part of the SAS/IML product.

I have been writing this blog for 2.5 years, but I have written only a few articles about IMLPlus, in part because I was unable to link to the IMLPlus language documentation. Now that there is an online version of the IMLPlus class reference, expect an occasional article that demonstrates features of this rich and powerful extension to the SAS/IML computational language.

If you would like to explore the IMLPlus programming language, first check whether SAS/IML Studio is installed on your Windows PC. To get started with IMLPlus, you can work through the examples in SAS/IML Studio for SAS/STAT Programmers. If you have my book, Statistical Programming with SAS/IML Software, I discuss IMLPlus programming techniques in Chapters 5–10.

tags: IMLPlus, SAS/IML Studio
8月 222011
 

When I write SAS/IML programs, I usually do my development in the SAS/IML Studio environment. Why? There are many reasons, but the one that I will discuss today is the fact that the application is multithreaded and supports multiple programming workspaces.

The advantages of multiple programming workspaces

I am always multitasking, which means that usually I am writing and running multiple programs that are in various stages of completion. For example, today I am working on three projects: a program for my blog (ContaminatedNormal), a customer's program that I am trying to analyze for SAS Technical Support (TechSupportG01832), and a program that I am writing for an upcoming presentation (7_MVNormal). When I am working on my upcoming presentation, my SAS/IML Studio environment might look like the following:

In the image, I drew a red ellipse on the workspace bar, which shows that there are three workspaces. The following image is a close-up of the workspace bar:

The names of the programs appear on the buttons on the workspace bar, and I can switch to a different program by clicking a button.

Each workspace is a separate SAS/IML programming environment. It has its own program window and an independent WORK library for storing temporary data sets. Each program runs independently: a variable named x in one program has no relationship to a variable of the same name in a different program. Consequently, I can develop many SAS/IML programs simultaneously.

The advantages of a multithreaded application

Even more useful is that I can run multiple programs simultaneously! Because the SAS/IML application is multithreaded, I can run arbitrarily many SAS/IML programs concurrently (but there is only one SAS process). This is great if I am running a simulation that requires several minutes to complete. I can run the simulation in one programming workspace and then switch to a different workspace to work on a different program.

For example, suppose that I am hard at work when I get a phone call. The caller asks if I can come up to his office in ten minutes. No problem. I immediately open a prewritten program called Alarm and run it in SAS/IML Studio. I then switch to the Technical Support issue and start working on that program. My workspace bar looks like the following:

The blue triangle on the Alarm button lets me know that the program is running in a separate thread, even while I investigate the Technical Support issue. (If I have access to several SAS servers, I can even tell SAS/IML Studio to run programs on a remote server.) After ten minutes, the Alarm program finishes running and makes a beeping sounds that reminds me of my appointment.

Notice that a multithreaded interface does not imply that the computations are multithreaded. However, SAS also provides multithreaded computations in Base SAS procedures (Ray, 2003) and in analytical procedures (Cohen, 2002).

A SAS/IML program that sounds an alarm

A blog post that doesn't contain any code is so unsatisfying, so here is my SAS/IML program that beeps after a specified length of time. You can run this program in SAS/IML Studio, then switch to a new workspace and continue your work:

/* Demonstrate that SAS/IML Studio is multithreaded */
/* no "proc iml" statement; run in SAS/IML Studio */
delay = 10; /* specify time in minutes (use 1/12 to demo program) */
 
AlarmTime = time()+60*delay;
print "Alarm will ring at " AlarmTime[format=time9.];
printnow; /* force printing to happen immediately */
 
/* pause for this many seconds */
call sleep(60*delay, 1); 
 
/* Time sound the alarm */
tone = repeat(do(400,800,50), 3);
call sound(tone, 0.05);