December 9, 2020
 

One purpose of principal component analysis (PCA) is to reduce the number of important variables in a data analysis. Thus, PCA is known as a dimension-reduction algorithm. I have written about four simple rules for deciding how many principal components (PCs) to keep.

There are other methods for deciding how many PCs to keep. Recently a SAS customer asked about a method known as Horn's method (Horn, 1965), also called parallel analysis. This is a simulation-based method for deciding how many PCs to keep. If the original data consists of N observations and p variables, Horn's method is as follows:

  • Generate B sets of random data with N observations and p variables. The variables should be normally distributed and uncorrelated. That is, the data matrix, X, is a random sample from a multivariate normal (MVN) distribution with an identity correlation parameter: X ~ MVN(0, I(p)).
  • Compute the corresponding B correlation matrices and the eigenvalues for each correlation matrix.
  • Estimate the 95th percentile of each eigenvalue distribution. That is, estimate the 95th percentile of the largest eigenvalue, the 95th percentile of the second largest eigenvalue, and so forth.
  • Compare the observed eigenvalues to the 95th percentiles of the simulated eigenvalues. If the observed eigenvalue is larger, keep it. Otherwise, discard it.
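
To make these steps concrete, here is a minimal sketch of the brute-force version of Horn's method in SAS/IML. The dimensions (N=50 observations, p=7 variables, B=1000 samples) are illustrative, and, as discussed later in this article, you can avoid simulating the data matrices altogether by sampling from the Wishart distribution.

proc iml;
call randseed(54321);
N = 50;  p = 7;  B = 1000;                 /* illustrative dimensions */
simEigen = j(B, p);                        /* row i = eigenvalues for the i_th sample */
do i = 1 to B;
   X = randnormal(N, j(1,p,0), I(p));      /* X ~ MVN(0, I(p)): uncorrelated normal data */
   simEigen[i,] = T( eigval( corr(X) ) );  /* eigenvalues of the sample correlation matrix */
end;
call qntl(crit, simEigen, 0.95);           /* 95th percentile of each eigenvalue distribution */
print crit[L="95th percentiles of simulated eigenvalues"];
quit;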

I do not know why the adjective "parallel" is used for Horn's analysis. Nothing in the analysis is geometrically parallel to anything else. Although you can use parallel computations to perform a simulation study, I doubt Horn was thinking about that in 1965. My best guess is that Horn's method is a secondary analysis that is performed "off to the side" or "in parallel" to the primary principal component analysis.

Parallel analysis in SAS

You do not need to write your own simulation method to use Horn's method (parallel analysis). Horn's parallel analysis is implemented in SAS (as of SAS/STAT 14.3 in SAS 9.4M5) by using the PARALLEL option in PROC FACTOR. The following call to PROC FACTOR uses data about US crime rates. The data are from the Getting Started example in PROC PRINCOMP.

proc factor data=Crime method=Principal 
   parallel(nsims=1000 seed=54321) nfactors=parallel plots=parallel;
   var Murder Rape Robbery Assault Burglary Larceny Auto_Theft;
run;

The PLOTS=PARALLEL option creates the visualization of the parallel analysis. The solid line shows the eigenvalues for the observed correlation matrix. The dotted line shows the 95th percentile of the simulated data. When the observed eigenvalue is greater than the corresponding 95th percentile, you keep the factor. Otherwise, you discard the factor. The graph shows that only one principal component would be kept according to Horn's method. This graph is a variation of the scree plot, which is a plot of the observed eigenvalues.

The same information is presented in tabular form in the "ParallelAnalysis" table. The first row is the only row for which the observed eigenvalue is greater than the 95th percentile (the "critical value") of the simulated eigenvalues.

Interpretation of the parallel analysis

Statisticians often use statistical tests based on a null hypothesis. In Horn's method, the simulation provides the "null distribution" of the eigenvalues of the correlation matrix under the hypothesis that the variables are uncorrelated. Horn's method says that we should only accept a factor as important if it explains more variance than would be expected from uncorrelated data.

Although the PARALLEL option is supported in PROC FACTOR, some researchers suggest that parallel analysis is valid only for PCA. Saccenti and Timmerman (2017) write, "Because Horn’s parallel analysis is associated with PCA, rather than [common factor analysis], its use to indicate the number of common factors is inconsistent (Ford, MacCallum, & Tait, 1986; Humphreys, 1975)." I am not an expert in factor analysis, but a basic principle of simulation is to ensure that the "null distribution" is appropriate to the analysis. For PCA, the null distribution in Horn's method (eigenvalues of a sample correlation matrix) is appropriate. However, in some common factor models, the important matrix is a "reduced correlation matrix," which does not have 1s on the diagonal.

The advantages of knowing how to write a simulation

Although the PARALLEL option in PROC FACTOR runs a simulation and summarizes the results, there are several advantages to implementing a parallel analysis yourself. For example, you can perform the analysis on the covariance (rather than correlation) matrix. Or you can substitute a robust correlation matrix as part of a robust principal component analysis.

I decided to run my own simulation because I was curious about the distribution of the eigenvalues. The graph that PROC FACTOR creates shows only the upper 95th percentiles of the eigenvalue distribution. I wanted to overlay a confidence band that indicates the distribution of the eigenvalues. The band would visualize the uncertainty in the eigenvalues of the simulated data. How wide is the band? Would you get different results if you use the median eigenvalue instead of the 95th percentile?

Such a graph is shown to the right. The confidence band was created by using a technique similar to the one I used to visualize uncertainty in predictions for linear regression models. The graph shows the distribution of each eigenvalue and connects the distributions with straight lines. The confidence band fits well with the existing graph, even though the X axis is discrete.

Implement Horn's simulation method

Here's an interesting fact about the simulation in Horn's method. Most implementations generate B random samples, X ~ MVN(0, I(p)), but you don't actually NEED the random samples! All you need are the correlation matrices for the random samples. It turns out that you can simulate the correlation matrices directly by using the Wishart distribution. SAS/IML software includes the RANDWISHART function, which simulates matrices from the Wishart distribution. You can transform those matrices into correlation matrices, find the eigenvalues, and compute the quantiles in just a few lines of PROC IML:

/* Parallel Analysis (Horn 1965) */
proc iml;
/* 1. Read the data and compute the observed eigenvalues of the correlation */
varNames = {"Murder" "Rape" "Robbery" "Assault" "Burglary" "Larceny" "Auto_Theft"};
use Crime;
   read all var varNames into X;
   read all var "State";
close;
p = ncol(X);
N = nrow(X);
 
m = corr(X);                  /* observed correlation matrix */
Eigenvalue = eigval(m);       /* observed eigenvalues */
 
/* 2. Generate random correlation matrices from MVN(0,I(p)) data
      and compute the eigenvalues. Each row of W is a p x p scatter 
      matrix for a random sample of size N where X ~ MVN(0, I(p)) */
nSim = 1000;
call randseed(12345);
W = RandWishart(nSim, N-1, I(p));  /* each row stores a p x p matrix */
S = W / (N-1);                /* rescale to form covariance matrix */
simEigen = j(nSim, p);        /* store eigenvalues in rows */
do i = 1 to nSim;
   cov = shape(S[i,], p, p);  /* reshape the i_th row into p x p */
   R = cov2corr(cov);         /* i_th correlation matrix */
   simEigen[i,] = T(eigval(R));  
end;
 
/* 3. find 95th percentile for each eigenvalue */
alpha = 0.05;
call qntl(crit, simEigen, 1-alpha);
 
results = T(1:nrow(Eigenvalue)) || Eigenvalue || crit`;
print results[c={"Number" "Observed Eigenvalue" "Simul Crit Val"} F=BestD8.];

The table is qualitatively the same as the one produced by PROC FACTOR. Both tables are the results of simulations, so you should expect to see small differences in the third column, which shows the 95th percentile of the distributions of the eigenvalues.
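
One of the variations mentioned earlier is to run the parallel analysis on the covariance matrix instead of the correlation matrix. Under that null model, the variables are uncorrelated but keep their observed variances. The following statements are a minimal sketch of that variation (the names Sigma0, simEigCov, and critCov are just illustrative); you can run them inside the same PROC IML session, before the QUIT statement:

S_obs = cov(X);                         /* observed covariance matrix */
EigCov = eigval(S_obs);                 /* observed eigenvalues on the covariance scale */
Sigma0 = diag(var(X));                  /* null model: uncorrelated vars, observed variances */
W2 = RandWishart(nSim, N-1, Sigma0);    /* simulate scatter matrices directly */
simEigCov = j(nSim, p);
do i = 1 to nSim;
   covMat = shape(W2[i,], p, p) / (N-1);   /* i_th simulated covariance matrix */
   simEigCov[i,] = T(eigval(covMat));      /* no cov2corr: stay on the covariance scale */
end;
call qntl(critCov, simEigCov, 1-alpha);    /* 95th percentiles of the null distribution */
print (EigCov || critCov`)[c={"Observed Eigenvalue" "Simul Crit Val"} F=BestD8.];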

Visualize the distribution of eigenvalues

The eigenvalues are stored as rows of the simEigen matrix, so you can estimate the 5th, 10th, ..., 95th percentiles and overlay band plots on the eigenvalue (scree) plot, as follows:

/* 4. Create a graph that illustrates Horn's method: 
   Factor Number vs distribution of eigenvalues. Write results in long form. */
/* 4a. Write the observed eigenvalues and the 95th percentile */
create Horn from results[c={"Factor" "Eigenvalue" "SimulCritVal"}];
   append from results;
close;
 
/* 4b. Visualize the uncertainty in the simulated eigenvalues. For details, see
   https://blogs.sas.com/content/iml/2020/10/12/visualize-uncertainty-regression-predictions.html 
*/
a = do(0.05, 0.45, 0.05);          /* significance levels */
call qntl(Lower, simEigen, a);     /* lower qntls         */
call qntl(Upper, simEigen, 1-a);   /* upper qntls         */
Factor = col(Lower);               /* 1,2,3,...,1,2,3,... */
alpha = repeat(a`, 1, p);             
create EigenQntls var {"Factor" "alpha" "Lower" "Upper"};
   append;
close;
QUIT;
 
proc sort data=EigenQntls;
   by alpha Factor;
run;
data All;
   set Horn EigenQntls;
run;
 
title "Horn's Method (1965) for Choosing the Number of Factors";
title2 "Also called Parallel Analysis";
proc sgplot data=All noautolegend;
   band x=Factor lower=Lower upper=Upper/ group=alpha fillattrs=(color=gray) transparency=0.9;
   series x=Factor y=Eigenvalue / markers name='Eigen' legendlabel='Observed Eigenvalue';
   series x=Factor y=SimulCritVal / markers lineattrs=(pattern=dot) 
          name='Sim' legendlabel='Simulated Crit Value';
   keylegend 'Eigen' 'Sim' / across=1;
run;

The graph is shown in the previous section. The darkest part of the band shows the median eigenvalue. You can see that the "null distribution" of eigenvalues is rather narrow, even though the data contain only 50 observations. I thought perhaps it would be wider. Because the band is narrow, it doesn't matter much whether you choose the 95th percentile as a critical value or some other value (90th percentile, 80th percentile, and so forth). For these data, any reasonable choice for a percentile will still lead to rejecting the second factor and keeping only one principal component. Because the band is narrow, the results will not be unduly affected by whether you use few or many Monte Carlo simulations. In this article, both simulations used B=1000 simulations.
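
If you want to check that claim numerically, you can compute critical values at several percentiles and count how many observed eigenvalues exceed each one. Here is a small sketch (to be run inside the earlier PROC IML session, before the QUIT statement; the names crit90, crit80, and nKeep are illustrative):

call qntl(crit95, simEigen, 0.95);
call qntl(crit90, simEigen, 0.90);
call qntl(crit80, simEigen, 0.80);
nKeep = sum(Eigenvalue > crit95`) || sum(Eigenvalue > crit90`) || sum(Eigenvalue > crit80`);
print nKeep[c={"95th pctl" "90th pctl" "80th pctl"} L="Number of components retained"];

According to the discussion above, each count should be 1 for these data.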

Summary

In summary, PROC FACTOR supports the PARALLEL and PLOTS=PARALLEL options for performing a "parallel analysis," which is Horn's method for choosing the number of principal components to retain. PROC FACTOR creates a table and graph that summarize Horn's method. You can also run the simulation yourself. If you use SAS/IML, you can simulate the correlation matrices directly, which is more efficient than simulating the data. If you run the simulation yourself, you can add additional features to the scree plot, such as a confidence band that shows the null distribution of the eigenvalues.

The post Horn's method: A simulation-based method for retaining principal components appeared first on The DO Loop.

December 9, 2020
 

By definition, managed application services require a high degree of trust. After all, you’re paying someone else to manage your business applications and, in many cases, your data. To help establish that trust, we want to introduce you to our managed application services team – and have them answer some of your questions about our hosted and [...]

Building trust in the cloud with Bryan Harkola was published on SAS Voices by Lindsay Marshall

December 8, 2020
 

COVID-19 has upset nearly every prediction and business plan for 2020 across the planet. Making predictions for 2021 may seem like a fool’s errand at this point, but many trends and consequences are already obvious and emerging from the global pandemic. The last global pandemic of this magnitude, the Spanish [...]

Fools rush in … trends and predictions for 2021 was published on SAS Voices by Leo Sadovy


December 8, 2020
 

This is the second of three posts on our hot-fix process, aimed at helping you better manage your SAS®9 environment through tips and best practices. The first, an overview of the SASHFADD tool, is linked here and at the end of this post.

Having a good understanding of the hot-fix process can help you keep your SAS environment running smoothly. This second installment aims to help you understand hot-fix dependencies by giving you a sneak peek into the report generated by the SASHFADD tool.

SASHFADD tool report

Here's what the SASHFADD tool report looks like:

The citation key for items in the report is listed at the bottom of the page. This blog takes you through an example report for the SAS middle tier to show you how the citation key (shown below) works. For full details about the items in the citation key, see the SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool Usage Guide.

My report, SAS Middle Tier 9.4_M6 for Windows for x64, shown in the display below, contains the following:

  • [1] for Issues Addressed
  • [A] for Alert
  • [C] for Container
  • [D] for Documentation
  • [S] for Security

What do the citations mean to you? Well, you need to click the citations to get a bigger picture of what this hot fix contains. Let's examine each citation in my example report in detail.

Citation key example

When you click the [1] for Issue(s) Addressed, it takes you to a page that lists all the SAS Notes that are being addressed in the fix:

If you scroll to the bottom of that section, it shows you whether there are other hot-fix dependencies or requirements that you need to meet before installing this hot fix. In this example, it shows that you must also install Hot Fix E3I012.

In the list of SAS Notes, any alert SAS Notes are labeled with ALERT. This report includes an [A], and you can see the alert SAS Note marked in the list. This citation is helpful if you want to update your system only for critical issues. If that is your preference, there is a SASHFADD script that will automatically download only the alert hot fixes and any corresponding dependency hot fixes.

The [C] in the report shows that it is a container hot fix. This citation is not clickable; it is there to indicate that you should refer to the documentation for a list of hot fixes in the container. When you click the [D], you see any relevant installation instructions. This page also lists the "member" hot fixes of the D8T013 "container" hot fix:

This page includes important notes along with installation instructions, and sometimes also includes pre-installation and/or post-installation instructions.

In this example, the post-installation instructions refer to a security update, which corresponds with the [S] from the report. Security updates might require additional steps. For example, some security updates require that you reapply them after you install the latest hot fixes. The security update will have its own instructions, so be sure to carefully read and follow all the steps.

Remember that the SASHFADD tool that generates this report can keep you from missing essential dependencies and instructions!

SASHFADD tool tips

Here are a couple of tips about using the SASHFADD tool.

Recommendation: If you ran the SASHFADD report a while ago, it is a good idea to run the SASHFADD report again before you apply your hot fixes so that you have the most up-to-date information. As you can see, when I ran the SASHFADD tool today, E3I012 is on the list as a dependency, not E3I011.

Reminder: If you run the SASHFADD tool manually, make sure that you get the updated SAS9X_HFADD_data.xml file before generating the updated report. Otherwise, you are running the same report you had before. This XML file is available on page 16 in the SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool Usage Guide:

SAS Note versus SASHFADD report for checking dependencies

Sometimes, people access hot fixes via a SAS Note rather than through a SASHFADD report. If you install the hot fix via a SAS Note such as SAS Note 66740, be sure to complete these steps:

  1. Click the Hot Fix tab and click the appropriate link for your release.
  2. Go to the bottom of the hot-fix page and check whether there are any dependent hot fixes.
  3. (This step applies to both methods for checking dependencies – via a SAS Note or via a SASHFADD report.) Before you install the dependent hot fixes, check whether they are already installed by looking at the View Registry report. You should also make sure that the version matches. In this example, you can see that the version matches but that Hot Fix E3I012 is not installed:

The SASHFADD report already listed the above dependent hot fix, so it would be part of my install if I used the SASHFADD report as my guide:

Tip: It is worth noting that when I click the link for E3I011, it shows that E3I011 was replaced with E3I012:

 

The biggest takeaway message that I would like you to get from this blog is this: If you do not install the dependent hot fixes and/or follow the instructions such as the post-installation steps, you will encounter issues that will cause more downtime. This is something that I know we all want to avoid! If you run the SASHFADD tool, it automatically downloads all dependent hot fixes that you are eligible for, which eliminates the need for you to review the list of dependencies from any download page.

Helpful resources

See the following links for the detailed and thorough documentation:

Coming soon: third and final post in the series

The third blog in the series is about getting on a schedule with your hot-fix updates.

Thank you and have a wonderful day!

READ PART ONE | The SAS Hot Fix Analysis, Download and Deployment (SASHFADD) Tool

Understanding the SAS Hot Fix Analysis, Download and Deployment Tool Report was published on SAS Users.

December 7, 2020
 

In my previous blog post, I talked about using PROC CAS to accomplish various data preparation tasks. Since then, my colleague Todd Braswell and I worked through some interesting challenges implementing an Extract, Transform, Load (ETL) process that continuously updates data in CAS. (Todd is really the brains behind getting this to work – I am just along for the ride.)

In a nutshell, the process goes like this:

  1. PDF documents drop into a "receiving" directory on the server. The documents have a unique SubmissionID. Some documents are very large – thousands of pages long.
  2. A Python job runs and converts PDFs into plain text. Python calls an API that performs Optical Character Recognition (OCR) and saves off the output as a CSV file, one row per page of the PDF document.
  3. A SAS program, running in batch, loads the CSV file with new records into a CAS table. SubmissionID is passed to the batch program as a macro variable, which is used as part of the CAS table name.
  4. Records loaded from the CSV file are appended to the Main table. If records with the current SubmissionID already exist in the Main table, they are deleted and replaced with new records.
  5. The Main table is queried by downstream processes, which extract subsets of data, apply model score code, and generate results for the customer.

Continuously update data process flow

Due to the volume of data and the amount of time it takes to OCR large PDFs, the ETL process runs in multiple sessions simultaneously. And here is a key requirement: the Main table is always available, in a promoted state, with all up-to-date records, in order for the model score code to pick up the needed records.

What does "promoted state" mean?

The concept of table scope, which was introduced with the first release of SAS Viya, presents a challenge. CAS tables are in-memory tables that can have one of two "scopes":

  • Session scope – the table exists within the scope of your current CAS session and drops from memory as soon as the session ends. Functionally, this is somewhat similar to the data you write to the WORK library in SAS 9 – once you disconnect, the data drops from the WORK library.
  • Global scope – the table is available to all sessions. If your session ends, you will still have access to it when you start a new session. Your colleagues also maintain access, assuming they have the necessary permissions. For the table to assume global scope, it needs to be promoted.

Common promotion techniques for a table are the DATA STEP, PROC CAS, or PROC CASUTIL. For example:

/*promote a table using DATA STEP*/
*this snippet copies a SAS 9 table into CAS and promotes in one step;
data mycaslib.my_table (promote=yes);
     set mylib.my_table;
     run;
 
/*promote using PROC CASUTIL*/
*this snippet promotes a session-scope CAS table to global scope;
proc casutil incaslib='mycaslib' outcaslib='mycaslib';
     promote casdata='my_table';
     quit;
 
/*Promote using PROC CAS*/
*same as above, this snippet promotes a session-scope table to global scope;
proc cas;
table.promote / 
     caslib='mycaslib'
     targetcaslib='mycaslib' 
     name='my_table' 
     target='my_table';
    quit;

Fun Facts About Table Promotion

You cannot promote a table that has already been promoted. If you need to promote a new version of the same table, you first need to drop the existing table and then promote the new version.
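
For example, here is a minimal sketch of the drop-then-promote pattern by using the table.dropTable and table.promote actions (the table name my_table_new is a placeholder for the new session-scope version):

proc cas;
   /* drop the existing global-scope table ... */
   table.dropTable / caslib='mycaslib' name='my_table' quiet=true;
   /* ... then promote the new session-scope version under the old name */
   table.promote /
      caslib='mycaslib' name='my_table_new'
      targetcaslib='mycaslib' target='my_table';
quit;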

To discover the current table state, use the table.tableInfo action:

proc cas;
     table.tableinfo / 
     caslib='mycaslib' 
     name='main';
     quit;

If you append rows to a promoted table using the DATA STEP append option, the new rows are automatically promoted. For example, in this snippet the mycaslib.main table, which is promoted, remains promoted when the rows from mycaslib.new_rows are appended to it:

data mycaslib.main(append=yes);
     set mycaslib.new_rows;
     run;

When you manipulate a promoted table using the DATA STEP apart from appending rows, it creates a new, session-scope version of the same table. You will have two versions of the table: the global-scope table, which remains unchanged, and the session-scope version which has the changes you implemented. Even if you don't change anything in the table and simply run:

data mycaslib.my_table;
     set mycaslib.my_table;
     run;

in which mycaslib.my_table is promoted, you end up with a promoted and an unpromoted version of this table in the mycaslib library – a somewhat unexpected and hardly desired result. Appendix 1 walks through a quick exercise you can try to verify this.

As you probably guessed, this is where we ran into trouble with our ETL process: the key requirement was for the Main table to remain promoted, yet we needed to continuously update it. The task was simple if we just needed to append the rows; however, we also needed to replace the rows if they already existed. If we tried to delete the existing rows using the DATA STEP, we would have to deal with the changes applied to a session-scope copy of the global-scope table.

Initially, we designed the flow to save off the session-scope table with changes, then drop the (original) global-scope version, and finally reload the up-to-date version. This was an acceptable workaround, but errors started to pop up when a session looked for the Main table to score data, while a different concurrent session reloaded the most up-to-date data. We were uncertain how this would scale as our data grew.

PROC CAS to the rescue!

After much research, we learned about the table.deleteRows action, which allows you to delete rows directly from a global-scope table. The data is never dropped to session-scope – exactly what we needed. Here's an example:

proc cas;
     table.deleteRows /
     table={caslib="mycaslib", name="Main", where="SubmissionID = 12345"};
     quit;

In case you are wondering, the Tables action set also has an update action that modifies rows in place. Putting the pieces together, here is the full flow that keeps the Main table promoted while it is updated:

/*1. Load new rows. SubmissionID macro variable is a parameter passed to the batch program*/
/*New rows are written to the casuser library, but it does not really matter which caslib you choose – 
   we are not persisting them across sessions*/
proc casutil;
       load file="/path_to_new_rows/New_rows_&SubmissionID..csv" outcaslib="casuser" casout="new_rows";
       quit;
/*2. Delete rows with the current SubmissionID */
proc cas;
       table.deleteRows /
       table={caslib="prod", name="Main", where="SubmissionID = &SubmissionID."};
       quit;
/*3. Append new rows*/
data mycaslib.main(append=yes);
	set mycaslib.new_rows;
	run;
/*4. Save the main table to ensure we have a disk backup of in-memory data*/
proc casutil incaslib="prod" outcaslib="prod";
	save casdata="main" replace;
	quit;

Conclusion

We learned how to continuously update data in CAS while ensuring the data remains available to all sessions accessing it asynchronously. We learned that the append option in the DATA step automatically promotes new rows, but manipulating the data in a global-scope table through the DATA step in other ways leads to data being copied to session-scope. Finally, we learned that to ensure the table remains promoted while it is updated, we can fall back on PROC CAS.

Together, these techniques enabled implementation of a robust data flow that overcomes concurrency problems due to multiple processes updating and querying the data.

Acknowledgement

We thank Brian Kinnebrew for his generous help in investigating this topic and for his technical review.

Appendix 1

Try the following exercise to verify that manipulating a promoted table in DATA STEP leads to two copies of the table – session- AND global-scope.

/*copy and promote a sample SAS 9 table*/
data casuser.cars(promote=yes);
     set sashelp.cars;
     run;
/*check the number of rows and confirm that the table is promoted*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit; /*The table is promoted and has 428 rows*/
 
/*delete some rows in the promoted table*/
data casuser.cars;
     set casuser.cars;
     if make='Acura' then delete;
     run;
/*check again – how many rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table has 421 rows, but it is no longer promoted*/
 
/*reset your CAS session*/
/*kill your current CAS session */
cas _all_ terminate;
/*start a new CAS session and assign caslibs*/
cas; 
caslib _all_ assign;
 
/*check again – how many rows does the table have? Is it promoted?*/
proc cas;
     table.tableinfo / caslib='casuser' name='cars';
     quit;  /*The table is promoted and has 428 rows*/

What we see here is that manipulating a global-scope table in a DATA step leads to duplication of data. CAS copies the table to session-scope and applies the changes there. The changes go away if you terminate the session. One way to get around this issue is, instead of trying to overwrite the promoted table, to create a new table, then drop the old table and promote the new table under the old table's name. Otherwise, use the table.update and table.deleteRows actions to update or delete rows in place, as described in this post.
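
For completeness, here is a hedged sketch of the table.update action applied to the promoted cars table from the exercise above (the column, expression, and WHERE clause are just illustrative):

proc cas;
     /* update rows of a global-scope table in place (no session-scope copy) */
     table.update /
          table={caslib='casuser', name='cars', where="Make='Acura'"},
          set={{var='MSRP', value='MSRP*1.1'}};
     quit;

Because the update happens in place, the table keeps its global scope, just like the deleteRows action described earlier.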

Append and Replace Records in a CAS Table was published on SAS Users.

December 7, 2020
 
Spruce (Picea glauca) branches

Image of spruce branch (Picea glauca) by Aleksandrs Balodis, licensed under the Creative Commons Attribution-Share Alike 4.0 International license (CC-BY-SA-4.0).

"O Christmas tree, O Christmas tree, how lovely are your branches!" The idealized image of a Christmas tree is a perfectly straight conical tree with lush branches and no bare spots. Although this ideal exists only on Christmas cards, forest researchers are always trying to develop trees that approach the ideal. And they use serious science and statistics to do it!

Bert Cregg, a forest researcher at Michigan State University, is a Christmas tree researcher who has been featured in Wired magazine. Cregg and his colleagues have published many papers on best practices for growing Christmas trees. One paper that caught my eye is Gooch, Nzokou, and Cregg (2009), "Effect of Indoor Exposure on the Cold Hardiness and Physiology of Containerized Christmas Trees." In this paper, Cregg and his colleagues investigate whether you can buy a live Christmas tree, keep it in the house for the holidays, and then transplant it in your yard. The authors use a statistical technique known as the median lethal dose, or LD50, to describe how bringing a potted Christmas tree into the house can affect its hardiness to freezing temperatures.

This blog post uses Cregg's data and shows how to use SAS to estimate the LD50 statistic. If you are not interested in the details, the last section of this article summarizes the results. Spoiler: An evergreen kept indoors has a reduced ability to withstand freezing temperatures after it is transplanted. In Cregg's experiment, all the trees that had been kept indoors for 10 or more days died within six months of being transplanted.

The problem and the experiment

Containerized (potted) Christmas trees are popular among consumers who want a live Christmas tree but do not want to kill the tree by cutting it down. The idea is to bring the tree indoors during the holidays, then plant it on your property. However, there is a problem with bringing a tree into a house during the winter. Evergreens naturally go dormant in the winter, which enables their needles and buds to withstand freezing temperatures. When you bring a tree into a heated space, it "wakes up." If you later move the tree outside into freezing temperatures, the needles and buds can be damaged by the cold. This damage can kill the tree.

Cregg and his colleagues set up an experiment to understand how the length of time spent indoors affects a Christmas tree's ability to withstand freezing temperatures. Trees were brought indoors for 0, 3, 7, 14, and 20 days. Cuttings from the trees were then exposed to freezing conditions: -3, -6, -9, ..., -30 degrees Celsius. The goal is to estimate the median "lethal dose" for temperature. That is, to estimate the temperature at which half of the cuttings are damaged by the cold. In pharmacological terms, the time spent indoors is a treatment and the temperature is a "dose." The trees that were never brought indoors (0 days) are a control group.

Cregg and his colleagues studied three species of Christmas trees and studied damage to both buds and needles. For brevity, I will only look at the results for buds on the Black Hills Spruce (Picea glauca).

The data and the graphical method

In Table 1 (p. 75), the authors show data for the bud mortality and 50% lethality (LD50) for each treatment group (days indoors) as a function of decreasing temperatures. The data for the spruce are shown in the following SAS DATA step. Each cell in the table represents the percentage of 60 cuttings that showed damage after being exposed to a freezing temperature. By knowing there were 60 cuttings, you can compute how many cuttings were damaged.

/* Bud mortality and LD50. Data from Table 1 of Gooch, Nzokou, & Cregg (2009). */
data SpruceBudDamage;
Ntrials = 60;
input TimeIndoors @;
do Temp = -3 to -30 by -3;
   input BudMort @;
   NEvents = int(NTrials * BudMort / 100);  /* compute NEvents from mortality */
   output;
end;
label Temp = "Temperature (C)";
/* Bud Mortality (percent) as a function of temperature for each treatment */
/* Days    | --------  Temperature (C)  -------- */   
/* Indoors |-3 -6 -9 -12 -15 -18 -21 -24 -27 -30 */
datalines;
   0         0  0  0   0   0  6.7  0   0  20 100
   3         0  0  0   0   0  0    0  30  80 100
   7         0  0  0   0   0  0   10  40 100 100
  14         0  0  0   0   0  0    0  80 100 100
  20         0  0  0   0  20 40  100 100 100 100
;

The paper says that the LD50 "was determined graphically using a pairwise plot of the exposure temperature and the percentage of bud mortality ... for each species." I don't know what that means. It sounds like they did not estimate the LD50 statistically but instead graphed the bud mortality versus the temperature and used linear interpolation (or a smoother) to estimate the temperature at which the mortality is 50%. Here is a graph of the data connected by lines:

title "Bud Mortality in Black Hills Spruce";
proc sgplot data=SpruceBudDamage;
   series x=Temp y=BudMort / markers group=TimeIndoors lineattrs=(thickness=2);
   refline 50 / axis=y lineattrs=(color=red);
   xaxis grid; yaxis grid;
run;

The horizontal red line in the graph is the line of 50% mortality. For each curve, a crude estimate of LD50 is the temperature at which the curve crosses that line. The graphical estimates and the estimates in the paper are shown in the following table. The estimates in the authors' paper are greater than the estimates from my graph, but I do not know why. If you use something other than linear interpolation (for example, a loess smoother), you will get different curves and different estimates.

Although these graphical estimates differ slightly from the published results, the numbers tell the same story. On average, a spruce Christmas tree that is not brought indoors is hardy to about -28C. If you bring a tree indoors for 3 days, it is hardy only to about -25C. The longer a tree is indoors, the more it loses hardiness. For trees that are indoors for 20 days, the median lethal temperature is -18C, or about 10 degrees warmer than for the control group.
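
If you prefer to compute the graphical estimates programmatically rather than read them off the graph, the following DATA step is a minimal sketch (not the authors' method) that linearly interpolates the temperature at which bud mortality crosses 50% for each treatment group:

/* linear interpolation of the 50% crossing for each treatment group */
data LD50Graphical;
   set SpruceBudDamage;
   by TimeIndoors;
   retain prevTemp prevMort;
   if first.TimeIndoors then call missing(prevTemp, prevMort);
   if prevMort < 50 & BudMort >= 50 then do;      /* mortality crosses 50% at this step */
      LD50 = prevTemp + (50 - prevMort)/(BudMort - prevMort) * (Temp - prevTemp);
      output;
   end;
   prevTemp = Temp;  prevMort = BudMort;
   keep TimeIndoors LD50;
run;
 
proc print data=LD50Graphical noobs; run;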

Better estimates of LD50

The graphical estimates are crude. They are based on linear interpolation between two consecutive data points: one for which the mortality is below 50% and the next for which the mortality is above 50%. The estimates ignore all other data. Furthermore, the estimates assume that the mortality is linear between those two points, which is not usually the case. The mortality curve is typically a sigmoid (or S-shaped) curve.

Fortunately, we can use statistics to address these concerns. The usual way to estimate LD50 in SAS is to use PROC PROBIT. For these data, we will perform a separate analysis for each value of the TimeIndoors variable. The INVERSECL option on the MODEL statement estimates the Temperature (and confidence limits) for a range of probability values. You can use the ODS OUTPUT statement to write the statistics to a SAS data set so that you can use PROC SGPLOT to overlay all five curves on a single graph, as follows:

proc probit data=SpruceBudDamage plots=(predpplot);
   by TimeIndoors;
   model NEvents / NTrials = Temp / InverseCL;
   ods exclude ProbitAnalysis;
   ods output ProbitAnalysis=ProbitOut;
run;
 
proc sgplot data=ProbitOut;
   band y=Probability lower=LowerCL upper=UpperCL / group=TimeIndoors transparency=0.5;
   series y=Probability x=Variable / group=TimeIndoors;
   refline 0.50 / axis=y lineattrs=(color=brown);
   xaxis grid values=(-30 to -3 by 3); yaxis grid;
run;

For each treatment group (time spent indoors), the graph shows probability curves for bud mortality as a function of the outdoor temperature. The median lethal temperature is where these curves and inverse confidence intervals intersect the line of 0.5 probability. (They are inverse confidence limits because they are for the value of the X value that produces a given Y value.) You can use PROC PRINT to display the estimates for LD50 for each treatment group:
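
The following PROC PRINT call is a minimal sketch of one way to display those estimates; the variable names match the ODS OUTPUT data set that was used to create the graph, and the row for Probability=0.5 contains the LD50 estimate for each group:

proc print data=ProbitOut noobs label;
   where Probability = 0.5;
   var TimeIndoors Variable LowerCL UpperCL;
   label Variable = "LD50 estimate (Temperature, C)";
run;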

The estimates for LD50 use all the data to model the bud mortality. This method also produces 95% (inverse) confidence limits for the LD50 parameter. For one of the treatment groups (TimeIndoors=14), the confidence limits could not be produced. The documentation for PROC PROBIT discusses why this can happen.

If you want, you can use this table to estimate the differences between the LD50 values.

Summary

In summary, growing beautiful Christmas trees requires a lot of science and statistics. This article analyzes data from an experiment in which potted Christmas trees were brought indoors and later returned outdoors where they can experience freezing temperatures. Trees that are brought indoors lose some of their hardiness and can be damaged by freezing temperatures. This article shows how to use PROC PROBIT in SAS to compute the median lethal temperature (LD50), which is the temperature at which half of the trees would be damaged. After 20 days indoors, a spruce tree will lose about 10 degrees (C) of resistance to freezing temperatures.

If you are thinking about getting a containerized (live) Christmas tree, Cregg (2020) wrote an article about how to take care of it and how to prepare it for transplanting after Christmas. He suggests waiting until spring. In the 2009 experiment, 100% of the trees that had been brought indoors for 10 or more days died within six months of transplanting, as compared to 0% of the control group. This happened even though the outdoor temperature never approached the LD50 level. This experiment was done in Michigan, so the results might not apply to trees in warmer regions.

The post Can you transplant an indoor Christmas tree? appeared first on The DO Loop.

December 2, 2020
 

We've turned some of our most notable predictions for next year into a slide show. Click the orange "next" button to see these 2021 predictions from SAS.   Who’s brave enough to make predictions for next year after the unpredictable year we just had? We are. After all, the disruptive [...]

8 trends to watch for analytics in 2021 was published on SAS Voices by Alison Bolen