11月 232016
 

Somewhere in my past I encountered a panel of histograms for small random samples of normal data. I can't remember the source, but it might have been from John Tukey or William Cleveland. The point of the panel was to emphasize that (because of sampling variation) a small random sample might have a distribution that looks quite different from the distribution of the population. The diversity of shapes in the panel was surprising, and it made a big impact on me. About half the histograms exhibited shapes that looked nothing like the familiar bell shape in textbooks.


A small random sample might look quite different than the population #StatWisdom
Click To Tweet


In this article I recreate the panel. In the following SAS DATA step I create 20 samples, each of size N. I think the original panel showed samples of size N=15 or N=20. I've used N=15, but you can change the value of the macro variable to explore other sample sizes. If you change the random number seed to 0 and rerun the program, you will get a different panel every time. The SGPANEL procedure creates a 4 x 5 panel of the resulting histograms. Click to enlarge.

%let N = 15;
data Normal;
call streaminit(93779);
do ID = 1 to 20;
   do i = 1 to &N;
      x = rand("Normal");   output;
   end;
end;
run;
 
title "Random Normal Samples of Size &N";
proc sgpanel data=Normal;
   panelby ID /  rows=4 columns=5 onepanel;
   histogram x;
run;
Panel of histograms for random normal samples (N=15). Due to sampling variation, some histograms are not bell-shaped.

Each sample is drawn from the standard normal distribution, but the panel of histogram reveals a diversity of shapes. About half of the ID values display the typical histogram for normal data: a peak near x=0 and a range of [-3, 3]. However, the other ID values look less typical. The histograms for ID=1, 19, and 20 seem to have fewer negative values than you might expect. The distribution is very flat (almost uniform) for ID=3, 9, 13, and 16.

Because histograms are created by binning the data, they are not always the best way to visualize a sample distribution. You can create normal quantile-quantile (Q-Q) plots to compare the empirical quantiles for the simulated data to the theoretical quantiles for normally distributed data. The following statements use PROC RANK to create the normal Q-Q plots, as explained in a previous article about how to create Q-Q plots in SAS:

proc rank data=Normal normal=blom out=QQ;
   by ID;
   var x;
   ranks Quantile;
run;
 
title "Q-Q Plots for Random Normal Samples of Size &N";
proc sgpanel data=QQ noautolegend;
   panelby ID / rows=4 columns=5 onepanel;
   scatter x=Quantile y=x;
   lineparm x=0 y=0 slope=1;
   colaxis label="Normal Quantiles";
run;
Panel of quantile-quantile plots for random normal samples (N=15), which shows the sampling variation of samples

The Q-Q plots show that the sample distributions are well-modeled by a standard normal distribution, although the deviation in the lower tail is apparent for ID=1, 19, and 20. This panel shows why it is important to use Q-Q plots to investigate the distribution of samples: the bins used to create histograms can make us see shapes that are not really there. The SAS documentation includes a section on how to interpret Q-Q plots.

In conclusion, small random normal samples can display a diversity of shapes. Statisticians understand this sampling variation and routinely caution that "the standard errors are large" for statistics that are computed on a small sample. However, viewing a panel of histograms makes the concept of sampling variation more real and less abstract.

tags: Simulation, Statistical Thinking

The post Sampling variation in small random samples appeared first on The DO Loop.

 Leave a Reply

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

(required)

(required)