sas programming

May 29, 2018
 

The SAS language provides syntax that enables you to quickly specify a list of variables. SAS statements that accept variable lists include the KEEP and DROP statements, the ARRAY statement, and the OF operator for comma-separated arguments to some functions. You can also use variable lists on the VAR statements and MODEL statements of analytic procedures.

This article describes six ways to specify a list of variables in SAS. There is a section in the SAS documentation that describes how to construct lists, but this blog post provides more context and a cut-and-paste example for every syntax. This article demonstrates the following:

  • Use the _NUMERIC_, _CHARACTER_, and _ALL_ keywords to specify variables of a certain type (numeric or character) or all types.
  • Use a single hyphen (-) to specify a range of variables that have a common prefix and a sequential set of numerical suffixes.
  • Use the colon operator (:) to specify a list of variables that begin with a common prefix.
  • Use a double-hyphen (--) to specify a consecutive set of variables, regardless of type. You can also use a variation of this syntax to specify a consecutive set of variables of a certain type (numeric or character).
  • Use the OF operator to specify variables in an array or in a function call.
  • Use macro variables to specify variables that satisfy certain characteristics.

Some companies might discourage the use of variable lists in production code because automated lists can be volatile. If the number and names of variables in your data sets occasionally change, it is safer to manually list the variables that you are analyzing. However, for developing code and constructing examples, lists can be a huge time saver.

Use the _NUMERIC_, _CHARACTER_, and _ALL_ keywords

You can specify all numeric variables in a data set by using the _NUMERIC_ keyword. You can specify all character variables by using the _CHARACTER_ keyword. Many SAS procedures use a VAR statement to specify the variables to be analyzed. When you want to analyze all variables of a certain type, you can use these keywords, as follows:

/* compute descriptive statistics of all numeric variables */
proc means data=Sashelp.Heart nolabels; 
   var _NUMERIC_;          /* _NUMERIC_ is the default */
run;
 
/* display the frequencies of all levels for all character variables */
proc freq data=Sashelp.Heart; 
   tables _CHARACTER_;    /* _ALL_ is the default */
run;

Figure: Use a keyword to specify a list of variables in SAS

One of my favorite SAS programming tricks is to use these keywords in a KEEP or DROP statement (or data set option). For example, the following statements create a new data set that contains all numeric variables and two character variables from the Sashelp.Heart data:

data HeartNumeric;
set Sashelp.Heart(keep=_NUMERIC_            /* all numeric variables */
                       Sex Smoking_Status); /* two character variables */
run;

An example of using the _ALL_ keyword is shown in the section that discusses the OF operator.

Use a hyphen to specify numerical suffixes

In many situations, variables are named with a common prefix and numerical suffix. For example, financial data might have variables that are named Sales2008, Sales2009, ..., Sales2017. In simulation studies, variables often have names such as X1, X2, ..., X50. The hyphen enables you to specify the first and last variable in a list. The first example can be specified as Sales2008-Sales2017. The second example is X1-X50.

The following DATA step creates 10 variables, including the variables x1-x6. Notice that the data set variables are not in alphanumeric order. That is okay. The syntax x1-x6 will select the six variables x1, x2, x3, x4, x5, and x6 regardless of their physical order in the data. The call to PROC REG uses the six variables in a linear regression:

data A;
   retain Y x1 x3 Z x6 x5 x2 W x4 R 0;  /* create 10 variables and one observation; initialize all to 0 */
run;
proc reg data=A plots=none;
   model Y = x1-x6;
run;

The parameter estimates from PROC REG are displayed in the order that you specify in the MODEL statement. However, if you use the SET statement in a DATA step, the variables appear in the original order unless you intentionally reorder the variables:

data B;
   set A(keep=x1-x6);
run;
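
The same numbered-range syntax can also be used in an ARRAY statement and with the OF operator. The following is a minimal sketch, not part of the original example; the data set name Sim and the variables i and total are made up for illustration:

```sas
data Sim;
   array x[6] x1-x6;           /* creates x1-x6 if they do not already exist */
   do i = 1 to 6;
      x[i] = i**2;             /* fill the variables with 1, 4, 9, ... */
   end;
   total = sum(of x1-x6);      /* pass the numbered range list to a function */
   drop i;
run;

proc print data=Sim noobs;
run;
```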

Use the colon operator to specify a prefix

If you want to use variables that have a common prefix but have a variety of suffixes, you can use the colon operator (:), which is a wildcard character that matches any name that begins with a specified prefix. For example, the following DATA step creates a data set that contains eight variables, including five variables that begin with the prefix 'Sales'. The subsequent DATA step drops the variables that begin with the prefix 'Sales':

data A;
   retain Sales17 Y Sales16 Z SalesRegion Sales_new Sales1 R 0; /* 1 obs. Initialize all to 0 */
run;
 
data B;
   set A(drop= Sales: ); /* drop all variables that begin with 'Sales' */
run;
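
A prefix list can also be passed to a function by using the OF operator. The sketch below uses a made-up data set (Demo) and made-up values. It also illustrates a caveat: the prefix matches every name that begins with 'Sales', including SalesRegion, which might not be what you intend.

```sas
data Demo;
   Sales16 = 10;  Sales17 = 20;  SalesRegion = 3;  Y = 1;
   /* Sales: expands to Sales16, Sales17, and SalesRegion (but not Y) */
   Total = sum(of Sales:);
   put Total=;                 /* write the result to the log */
run;
```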

Use a double-hyphen to specify consecutive variables

The previous sections used wildcard characters to match variables that had a specified type or prefix. In the previous sections, you will get the same set of variables regardless of how they might be ordered in the data set. You can use a double-hyphen (--) to specify a consecutive set of variables. The variables you get depend on the order of the variables in the data set.

data A;
   retain Y 0   x3 2   C1 'A'   C2 'BC'
          Z 3   W  4   C4 'D'   C5 'EF'; /* Initialize eight variables */
run;
data B;
   set A(keep=x3--C4);
run;

In this example, the data set B contains the variables x3, C1, C2, Z, W, and C4. If you use the double-hyphen to specify a list, be sure that you know the order of the variables and that this order is never going to change. If the order of the variables changes, your program will behave differently.

You can also specify all variables of a certain type within a range of variables. The syntax Y-numeric-Z specifies all numeric variables between Y and Z in the data set. The syntax Y-character-Z specifies all character variables between Y and Z. For example, the following call to PROC CONTENTS displays the variables (in order) in the Sashelp.Heart data. The call to PROC LOGISTIC specifies all the numeric variables between (and including) the AgeCHDdiag variable and the Smoking variable:

proc contents data=Sashelp.Heart order=varnum ;
run;
 
proc logistic data=Sashelp.Heart;
   model status = AgeCHDdiag-numeric-Smoking;
   ods select ParameterEstimates;
run;

Figure: Use a double-hyphen to specify a contiguous list of variables in SAS
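
To see how the typed ranges behave on a small example, here is a sketch that reuses the eight-variable toy data from earlier in this section. The output data set names NumOnly and CharOnly are made up for illustration. Given the variable order in the data set, the first KEEP= list should select x3, Z, and W, and the second should select C1, C2, C4, and C5.

```sas
data A;
   retain Y 0   x3 2   C1 'A'   C2 'BC'
          Z 3   W  4   C4 'D'   C5 'EF';
run;

data NumOnly;
   set A(keep=x3-numeric-W);       /* numeric variables from x3 through W */
run;

data CharOnly;
   set A(keep=C1-character-C5);    /* character variables from C1 through C5 */
run;

proc contents data=NumOnly  order=varnum; run;
proc contents data=CharOnly order=varnum; run;
```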

Arrays and the OF operator

You can use variable lists to define an array in a SAS DATA step. For example, the following program creates a numerical array named X and a character array named C. The program finds the maximum value in each row and puts that value into the variable named rowMaxNum. The program also creates a variable named Str that contains the concatenation of the character values for each row:

data Arrays;
   set sashelp.Class;
   array X {*} _NUMERIC_;        /* X[1] is 1st var, X[2] is 2nd var, etc */
   array C {*} _CHARACTER_;      /* C[1] is 1st var, C[2] is 2nd var, etc */
   /* use the OF operator to pass values in array to functions */
   rowMaxNum = max(of x[*]);     /* find the max value in this array (row) */
   length Str $30;
   call catx(' ', Str, of C[*]); /* concatenate the strings in this array (row) */
   keep rowMaxNum Str;
run;
 
proc print data=Arrays(obs=4);
run;

Figure: Use a keyword to specify a list of variables to certain SAS functions

You can use the OF operator directly in functions without creating an array. For example, the following program uses the _ALL_ keyword to output the "complete cases" for the Sashelp.Heart data. The program drops any observation that has a missing value for any variable:

data CompleteCases;
  set Sashelp.Heart;
  if cmiss(of _ALL_)=0;  /* output only complete cases for all vars */
run;

Use macro variables to specify a list

The previous sections demonstrate how you can use syntax to specify a list of variables to SAS statements. In contrast, this section describes a technique rather than syntax. It is sometimes the case that the names of variables are in a column in a data set. There might be other columns in the data set that contain characteristics or statistics for the variables. For example, the following call to PROC MEANS creates an output data set (called MissingValues) that contains columns named Variable and NMiss.

proc means data=Sashelp.Heart nolabels NMISS stackodsoutput;
   var _NUMERIC_;
   ods output Summary = MissingValues;
run;
proc print; run;

Figure: Use a macro variable to specify a list of variables in SAS

Suppose you want to keep or drop those variables that have one or more missing values. The following PROC SQL call creates a macro variable (called MissingVarList) that contains a space-separated list of all variables that have at least one missing value. This technique has many applications and is very powerful.

/* Use PROC SQL to create a macro variable (MissingVarList) that contains
   the list of variables that have a property such as missing values */
proc sql noprint;                              
 select Variable into :MissingVarList separated by ' '
 from MissingValues
 where NMiss > 0;
quit;
%put &=MissingVarList;
MISSINGVARLIST=AgeCHDdiag Height Weight MRW Smoking AgeAtDeath Cholesterol

You can now use the macro variable in a KEEP, DROP, VAR, or MODEL statement, such as KEEP=&MissingVarList;
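
The same idea works with any table that stores variable names in a column. As a variation (a swapped-in source of names, not the ODS table used above), the following sketch queries the DICTIONARY.COLUMNS table in PROC SQL to build a list of the character variables in Sashelp.Heart and then uses the list in a TABLES statement. The macro variable name CharVarList is made up for illustration.

```sas
proc sql noprint;
   select name into :CharVarList separated by ' '
   from dictionary.columns
   where libname='SASHELP' and memname='HEART'   /* names are stored in uppercase */
         and type='char';                        /* 'num' selects numeric variables */
quit;
%put &=CharVarList;

proc freq data=Sashelp.Heart;
   tables &CharVarList / missing;   /* use the list like a hand-typed list */
run;
```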

Summary

This article shows six ways to specify a list of variables to SAS statements and functions. The SAS syntax provides keywords (_NUMERIC_, _CHARACTER_, and _ALL_) and operators (hyphen, colon, and double-hyphen) to make it easy to specify a list of variables. You can use the syntax in conjunction with the OF operator to pass a variable list to some SAS functions. Lastly, if the names of variables are stored in a column in a data set, you can use the full power of PROC SQL to create a macro variable that contains variables that satisfy certain criteria.

Do you use shorthand syntax to specify lists of variables? Why or why not? Leave a comment.

The post 6 easy ways to specify a list of variables in SAS appeared first on The DO Loop.

May 21, 2018
 

In a recent blog post, Chris Hemedinger used a scatter plot to show the result of 100 coin tosses. Chris arranged the 100 results in a 10 x 10 grid, where the first 10 results were shown on the first row, the second 10 were shown on the second row, and so on. Placing items along each row before going to the next row is called row-major order.

An implicit formula for arranging items in rows

If you process items sequentially, it is easy to position the items in a grid by using an inductive scheme:

  1. Place the first item at (1, 1).
  2. Assume the nth item is placed at position (r, c). Place the (n+1)st item at position (r, c+1) if there is room on the current row, otherwise place it at (r+1, 1), which is the first element of the next row.

The inductive scheme is also called an implicit or recursive formula because the position of the (n+1)st item is given in terms of the position of the nth item.

For example, suppose that you have 70 items and you want to place 11 items in each row. The inductive algorithm looks like the following:

%let Nx = 11;           /* number of items in row */
data Loc;
label r = "Row" c = "Column";
retain r 1  c 1 item 1;
output;                 /* base case */
do item = 2 to 70;      /* inductive step */
   c + 1;
   if c > &Nx then do;
      r + 1; c = 1;
   end;
   output;
end;
run;
 
title "Position of Items in Grid";
proc sgplot data=Loc;
   text x=c y=r text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05;
   yaxis reverse offsetmin=0.05 offsetmax=0.05;
run;

Figure: Items arranged in a grid in row-major order with 11 items in each row

The inductive algorithm is easy to implement and to understand. However, it does not enable you to easily determine the row and column of the 1,234,567th item if there are 11 items in each row. Nor does it enable you to compute the positions when the index increments by a value greater than 1. To answer these questions, you need to use an explicit or direct formula.

An explicit formula for arranging items in rows

The explicit formula uses the MOD function to compute the column position and integer division to compute the row position. SAS does not have an explicit "integer division operator," but you can emulate it by using the FLOOR function. The following macro definitions encapsulate the formulas:

/* (row, col) for item n if there are Nx items in each row (count from 1),
   assuming row-major order */
%macro ColPos(n, Nx);
   1 + mod(&n.-1, &Nx.)
%mend;
%macro RowPos(n, Nx);
   1 + floor((&n.-1) / &Nx.)
%mend;

The formulas might look strange because they subtract 1, do a calculation, and then add 1. The formulas assume that you want to count the items, rows, and columns beginning with 1. If you prefer to count from 0, then the formulas become MOD(n, Nx) and FLOOR(n/Nx).
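
As a quick check of the one-based formulas, the following sketch computes the position of the 1,234,567th item when there are 11 items in each row. It writes the formulas inline rather than calling the macros, so it can be run on its own. Because 1,234,566 = 11*112,233 + 3, the item lands in row 112,234 and column 4.

```sas
data _null_;
   n  = 1234567;                  /* index of the item (count from 1) */
   Nx = 11;                       /* number of items in each row */
   row = 1 + floor((n-1) / Nx);
   col = 1 + mod(n-1, Nx);
   put row= col=;                 /* row=112234 col=4 */
run;
```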

You can use the formulas to directly compute the position of the odd integers in the range 1–70 when there are 11 items on each row:

%let Nx = 11;
data grid;
do item = 1 to 70 by 2;       /* only odd integers */
   row = %RowPos(item, &Nx);
   col = %ColPos(item, &Nx);
   output;
end;
run;
 
title "Position of Odd Integers in Grid";
proc sgplot data=grid;
   text x=col y=row text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05 label="Column" max=&Nx;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer label="Row";
run;

Figure: Positions of odd integers in a grid in row-major order

Of course, you can also use the direct formula to process items incrementally. The following DATA step computes the positions for 19 observations in the Sashelp.Class data set, where five names are placed in each row:

data gridName;
set sashelp.class;
y = %RowPos(_N_, 5);  /* 5 columns in each row */
x = %ColPos(_N_, 5);
run;
 
title "Position Five Names in Each Row";
proc sgplot data=gridName;
   text x=x y=y text=Name / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.08 offsetmax=0.08;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer;
run;

Figure: Positions of items in a grid with 5 items in each row

The explicit formula is used in the SAS/IML NDX2SUB function, which tells you the row and column information for the nth item in a matrix.

In summary, you can use an implicit formula or an explicit formula to arrange items in rows, where each row contains Nx items. The implicit formula is useful when you are arranging the items sequentially. The explicit formula is ideal when you are randomly accessing the items and you need a direct computation that provides the row and column position.

Finally, if you want to arrange items in column-major order (down the first column, then down the second, ...), you can use similar formulas. The row position of the nth item is 1 + mod(n-1, Ny) and the column position is 1 + floor((n-1) / Ny), where Ny is the number of rows in the grid.
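
The column-major formulas can be sketched in the same way as the row-major example. The value Ny=7 and the data set name gridCol are arbitrary choices for illustration:

```sas
%let Ny = 7;                            /* number of rows in the grid */
data gridCol;
do item = 1 to 70;
   row = 1 + mod(item-1, &Ny);          /* move down the current column */
   col = 1 + floor((item-1) / &Ny);     /* start a new column after Ny items */
   output;
end;
run;

title "Position of Items in Column-Major Order";
proc sgplot data=gridCol;
   text x=col y=row text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05 label="Column";
   yaxis reverse integer offsetmin=0.05 offsetmax=0.05 label="Row";
run;
```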

The post Position items in a grid appeared first on The DO Loop.

May 7, 2018
 

Datasets can present themselves in different ways. Identical data can be arranged differently, often as wide or tall datasets. Generally, the tall dataset is better. Learn how to convert wide data into tall data with PROC TRANSPOSE.
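
As a rough sketch of the idea (this code is not taken from the linked post), the following program converts a small wide data set into a tall one. The data set names Wide and Tall, the variables ID and Q1-Q3, and the new names Quarter and Value are all invented for illustration.

```sas
data Wide;                       /* one row per ID, one column per quarter */
   input ID Q1 Q2 Q3;
   datalines;
1 10 20 30
2 15 25 35
;

/* one row per ID and quarter; the data must be sorted by ID */
proc transpose data=Wide
               out=Tall(rename=(_NAME_=Quarter COL1=Value));
   by ID;
   var Q1-Q3;
run;

proc print data=Tall noobs;
run;
```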

The post "Wide" versus "Tall" data: PROC TRANSPOSE v. the DATA step appeared first on SAS Learning Post.

May 7, 2018
 

Do you periodically delete unneeded global macro variables? You should! Deleting macro variables releases memory and keeps your symbol table clean. Learn about the macro language statement that deletes global macro variables and about the %DELETEALL statement, which can be a lifesaver for macro programmers.
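
For readers who have not seen it, the built-in statement for deleting global macro variables is %SYMDEL. The following minimal sketch uses two throwaway macro variables (tmpA and tmpB, made up for illustration). It does not attempt to reproduce the %DELETEALL macro that the linked post describes.

```sas
%let tmpA = 1;
%let tmpB = 2;
%put _user_;               /* list the user-defined macro variables */

%symdel tmpA tmpB;         /* delete both from the global symbol table */
%symdel tmpA / nowarn;     /* NOWARN suppresses the warning when the variable does not exist */
%put _user_;
```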

The post Deleting global macro variables appeared first on SAS Learning Post.

April 24, 2018
 

SAS variables are variables in the statistics sense, not the computer programming sense. SAS has what many computer languages call “variables”; it just calls them “macro variables.” Knowing the difference between SAS variables and SAS macro variables will help you write more flexible and effective code.
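
A small sketch of the distinction (not taken from the linked post): in the following program, Age is a data set variable that has one value per observation, whereas cutoff is a macro variable whose text is substituted into the program before the DATA step is compiled. The names cutoff and Teens are made up for illustration.

```sas
%let cutoff = 14;            /* macro variable: a single text value */

data Teens;
   set Sashelp.Class;        /* Age is a data set variable (a column) */
   where Age >= &cutoff;     /* resolves to: where Age >= 14; */
run;
```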

The post When a variable is not a variable appeared first on SAS Learning Post.

February 26, 2018
 

My article about the difference between CLASS variables and BY variables in SAS focused on SAS analytical procedures. However, the BY statement is also useful in the SAS DATA step where it is used to merge data sets and to analyze data at the group level. When you use the BY statement in the DATA step, the DATA step creates two temporary indicator variables for each variable in the BY statement. The names of these variables are FIRST.variable and LAST.variable, where variable is the name of a variable in the BY statement. For example, if you use the statement BY Sex, then the names of the indicator variables are FIRST.Sex and LAST.Sex.

This article gives several examples of using the FIRST.variable and LAST.variable indicator variables for BY-group analysis in the SAS DATA step. The first example shows how to compute counts and cumulative amounts for each BY group. The second example shows how to compute the time between the first and last visit of a patient to a clinic, as well as the change in a measured quantity between the first and last visit. BY-group processing in the DATA step is a fundamental operation that belongs in every SAS programmer's tool box.

Use FIRST. and LAST. variables to count the size of groups

The first example uses data from the Sashelp.Heart data set, which contains data for 5,209 patients in a medical study of heart disease. The data are distributed with SAS. The following call to PROC SORT extracts the Smoking_Status and Weight variables and sorts the data by the Smoking_Status variable:

proc sort data=Sashelp.Heart(keep=Smoking_Status Weight)
          out=Heart;
   by Smoking_Status;
run;

Because the data are sorted by the Smoking_Status variable, you can use the FIRST.Smoking_Status and LAST.Smoking_Status temporary variables to count the number of observations in each level of the Smoking_Status variable. (PROC FREQ computes the same information, but does not require sorted data.) When you use the BY Smoking_Status statement, the DATA step automatically creates the FIRST.Smoking_Status and LAST.Smoking_Status indicator variables. As its name implies, the FIRST.Smoking_Status variable has the value 1 for the first observation in each BY group and the value 0 otherwise. (More correctly, the value is 1 for the first record and for records for which the Smoking_Status variable is different than it was for the previous record.) Similarly, the LAST.Smoking_Status indicator variable has the value 1 for the last observation in each BY group and 0 otherwise.

The following DATA step defines a variable named Count and initializes Count=0 at the beginning of each BY group. For every observation in the BY group, the Count variable is incremented by 1. When the last record in each BY group is read, that record is written to the Count data set.

data Count;
   set Heart;                 /* data are sorted by Smoking_Status */
   BY Smoking_Status;         /* automatically creates indicator vars */
   if FIRST.Smoking_Status then
      Count = 0;              /* initialize Count at beginning of each BY group */
   Count + 1;                 /* increment Count for each record */
   if LAST.Smoking_Status;    /* output only the last record of each BY group */
run;
 
proc print data=Count noobs; 
   format Count comma10.;
   var Smoking_Status Count;
run;

Figure: Use FIRST.variable and LAST.variable to count the size of groups

The same technique enables you to accumulate values of a variable within a group. For example, you can accumulate the total weight of all patients in each smoking group by using the following statements:

if FIRST.Smoking_Status then
   cumWt = 0;
cumWt + Weight;

This same technique can be used to accumulate revenue from various sources, such as departments, stores, or regions.
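
For completeness, here is a sketch of how the accumulation statements might fit into a full DATA step. The output data set name TotalWt is made up for illustration. Note that the sum statement skips missing values of Weight, so the total covers only the patients whose weight was recorded.

```sas
proc sort data=Sashelp.Heart(keep=Smoking_Status Weight)
          out=Heart;
   by Smoking_Status;
run;

data TotalWt;
   set Heart;
   by Smoking_Status;
   if FIRST.Smoking_Status then
      cumWt = 0;              /* reset the total at the start of each BY group */
   cumWt + Weight;            /* sum statement: retains cumWt and ignores missing values */
   if LAST.Smoking_Status;    /* output one record per BY group */
   keep Smoking_Status cumWt;
run;

proc print data=TotalWt noobs;
   format cumWt comma12.;
run;
```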

Use FIRST. and LAST. variables to compute duration of treatment

Another common use of the FIRST.variable and LAST.variable indicator variables is to determine the length of time between a patient's first visit and his last visit. Consider the following DATA step, which defines the dates and weights for four male patients who visited a clinic as part of a weight-loss program:

data Patients;
informat Date date7.;
format Date date7. PatientID Z4.;
input PatientID Date Weight @@;
datalines;
1021 04Jan16  302  1042 06Jan16  285
1053 07Jan16  325  1063 11Jan16  291
1053 01Feb16  299  1021 01Feb16  288
1063 09Feb16  283  1042 16Feb16  279
1021 07Mar16  280  1063 09Mar16  272
1042 28Mar16  272  1021 04Apr16  273
1063 20Apr16  270  1053 28Apr16  289
1053 13May16  295  1063 31May16  269
;

For these data, you can sort by the patient ID and by the date of visit. After sorting, the first record for each patient contains the first visit to the clinic and the last record contains the last visit. You can subtract the patient's weight for these dates to determine how much the patient gained or lost during the trial. You can also use the INTCK function to compute the elapsed time between visits. If you want to measure time in days, you can simply subtract the dates, but the INTCK function enables you to compute duration in terms of years, months, weeks, and other time units.

proc sort data=Patients;
   by PatientID Date;
run;
 
data weightLoss;
   set Patients;
   BY PatientID;
   retain startDate startWeight;                 /* RETAIN the starting values */
   if FIRST.PatientID then do;
      startDate = Date; startWeight = Weight;    /* remember the initial values */
   end;
   if LAST.PatientID then do;
      endDate = Date; endWeight = Weight;
      elapsedDays = intck('day', startDate, endDate); /* elapsed time (in days) */
      weightLoss = startWeight - endWeight;           /* weight loss */
      AvgWeightLoss = weightLoss / elapsedDays;       /* average weight loss per day */
      output;                                         /* output only the last record in each group */
   end;
run;
 
proc print noobs; 
   var PatientID elapsedDays startWeight endWeight weightLoss AvgWeightLoss;
run;

Figure: Use FIRST.variable and LAST.variable to compute elapsed time and average quantities in a group

The output data set summarizes each patient's activities at the clinic, including his average weight loss and the duration of his treatment.
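
The earlier remark about measuring duration in other time units can be sketched as follows. The two dates are the first and last visits of patient 1021. By default, the INTCK function counts the number of interval boundaries that are crossed (for example, the first day of each month), so the week and month counts are not simply the number of days divided by 7 or 30.

```sas
data _null_;
   startDate = '04Jan2016'd;
   endDate   = '04Apr2016'd;
   days   = endDate - startDate;                 /* simple subtraction of dates */
   weeks  = intck('week',  startDate, endDate);  /* number of week boundaries crossed */
   months = intck('month', startDate, endDate);  /* number of month boundaries crossed */
   put days= weeks= months=;
run;
```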

Some programmers think that the FIRST.variable and LAST.variable indicator variables require that the data be sorted, but that is not true. The temporary variables are created whenever you use a BY statement in a DATA step. You can use the NOTSORTED option on the BY statement to process records regardless of the sort order.
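
As a sketch of the NOTSORTED option, the following DATA step reads the Sashelp.Class data, which are not sorted by the Sex variable. With NOTSORTED, each run of consecutive observations that have the same value of Sex is treated as a BY group, so the step outputs the length of each run. The names Runs and runLength are made up for illustration.

```sas
data Runs;
   set Sashelp.Class;         /* not sorted by Sex */
   by Sex notsorted;          /* a new group starts whenever Sex changes */
   if FIRST.Sex then
      runLength = 0;
   runLength + 1;
   if LAST.Sex;               /* one record per run of identical values */
   keep Sex runLength;
run;

proc print data=Runs;
run;
```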

Summary

In summary, the BY statement in the DATA step automatically creates two indicator variables. You can use the variables to determine the first and last record in each BY group. Typically the FIRST.variable indicator is used to initialize summary statistics and to remember the initial values of measurements. The LAST.variable indicator is used to output the result of the computations, which often includes simple descriptive statistics such as a sum, difference, maximum, minimum, or average value.

BY-group processing in the DATA step is a common topic that is presented at SAS conferences. Some authors use FIRST.BY and LAST.BY as the names of the indicator variables. For further reading, I recommend the paper "The Power of the BY Statement" (Choate and Dunn, 2007). SAS also provides several samples about BY-group processing in the SAS DATA step.

The post How to use FIRST.variable and LAST.variable in a BY-group analysis in SAS appeared first on The DO Loop.

February 14, 2018
 

When I first learned to program in SAS, I remember being confused about the difference between CLASS statements and BY statements. A novice SAS programmer recently asked when to use one instead of the other, so this article explains the difference between the CLASS statement and BY variables in SAS procedures.

The BY statement and the CLASS statement in SAS both enable you to specify one or more categorical variables whose levels define subgroups of the data. (For simplicity, we consider only a single categorical variable.) The primary difference is that the BY statement computes many analyses, each on a subset of the data, whereas the CLASS statement computes a single analysis of all the data. Specifically,

  • The BY statement repeats an analysis on every subgroup. The subgroups are treated as independent samples. If a BY variable defines k groups, the output will contain k copies of every table and graph, one copy for the first group, one copy for the second group, and so on.
  • The CLASS statement enables you to include a categorical variable as part of an analysis. Often the CLASS variable is used to compare the groups, such as in a t test or an ANOVA analysis. In regression models, the CLASS statement enables you to estimate parameters for the levels of a categorical variable, thereby estimating the effect of each level on the response. Another use of a CLASS variable is to define categories for a classification task, such as a discriminant analysis.

To illustrate the differences between an analysis that uses a BY statement and one that uses a CLASS statement, let's create a subset (called Cars) of the Sashelp.Cars data. The levels of the Origin variable indicate whether a vehicle is manufactured in "Asia", "Europe", or the "USA". For efficiency reasons, most classical SAS procedures require that you sort the data when you use a BY statement. Therefore, a call to PROC SORT creates a sorted version of the data called CarsSorted, which will be used for the BY-group analyses.

data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid'; 
run;
 
proc sort data=Cars out=CarsSorted; 
   by Origin; 
run;

Descriptive statistics for grouped data

When you generate descriptive statistics for groups of data, the univariate statistics are identical whether you use a CLASS statement or a BY statement. What changes is the way that the statistics are displayed. When you use the CLASS statement, you get one table that contains all statistics or one graph that shows the distribution of each subgroup. However, when you use the BY statement you get multiple tables and graphs.

The following statements use the CLASS statement to produce descriptive statistics. PROC UNIVARIATE displays one (paneled) graph that shows a comparative histogram for the vehicles that are made in Asia, Europe, and USA. PROC MEANS displays one table that contains descriptive statistics:

proc univariate data=Cars;
   class Origin;
   var Horsepower;
   histogram Horsepower / nrows=3; /* must use NROWS= to get panel */
   ods select histogram;
run;
 
proc means data=Cars N Mean Std;
   class Origin;
   var Horsepower Weight Mpg_Highway;
run;

In contrast, if you run a BY-group analysis on the levels of the Origin variable, you will see three times as many tables and graphs. Each analysis is preceded by a label that identifies each BY group. Notice that the BY-group analysis uses the sorted data.

proc means data=CarsSorted N Mean Std;
   by Origin;
   var Horsepower Weight Mpg_Highway;
run;

Always remember that the output from a BY statement is equivalent to the output from running the procedure multiple times on subsets of the data. For example, the previous statistics could also be generated by calling PROC MEANS three times, each call with a different WHERE clause, as follows:

proc means N Mean Std data=CarsSorted( where=(origin='Asia') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='Europe') );
   var Horsepower Weight Mpg_Highway;
run;
proc means N Mean Std data=CarsSorted( where=(origin='USA') );
   var Horsepower Weight Mpg_Highway;
run;

In fact, if you ever find yourself repeating an analysis many times (perhaps by using a macro loop), you should consider whether you can rewrite your program to be more efficient by using a BY statement.

Comparing groups: Use the CLASS statement

As a general rule, you should use a CLASS statement when you want to compare or contrast groups. For example, the following call to PROC GLM performs an ANOVA analysis on the horsepower (response variable) for the three groups defined by the Origin variable. The procedure automatically creates a graph that displays three boxplots, one for each group. The procedure also computes parameter estimates for the levels of the CLASS variable (not shown).

proc glm data=Cars; /* by default, create graph with side-by-side boxplots */
   class Origin;
   model Horsepower = Origin / solution;
run;

You can specify multiple variables on the CLASS statement to include multiple categorical variables in a model. Any variables that are not listed on the CLASS statement are assumed to be continuous. Thus the following call to PROC GLM analyzes a model that has one continuous and one classification variable. The procedure automatically produces a graph that overlays the three regression curves on the data:

ods graphics /antialias=on;
title "CLASS Variable Regression: One Model with Multiple Parameters";
proc GLM data=Cars plots=FitPlot;
   class Origin;
   model Horsepower = Origin | Weight / solution;
   ods select ParameterEstimates ANCOVAPlot;
quit;

In contrast, if you use a BY statement, the Origin variable cannot be part of the model but is used only to subset the data. If you use a BY statement, you obtain three different models of the form Horsepower = Weight. You get three parameter estimates tables and three graphs, each showing one regression line overlaid on a subset of the data.
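
The BY-group version of the regression might look like the following sketch, which rebuilds the sorted data so that it can be run on its own. Only the parameter estimates tables are selected, so you should see one table for each level of the Origin variable.

```sas
data Cars;
   set Sashelp.Cars;
   where cylinders in (4,6,8) and type ^= 'Hybrid';
run;

proc sort data=Cars out=CarsSorted;
   by Origin;
run;

title "BY-Group Regression: Three Separate Models";
proc glm data=CarsSorted;
   by Origin;                             /* Origin only subsets the data */
   model Horsepower = Weight / solution;  /* intercept and slope for each group */
   ods select ParameterEstimates;
quit;
```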

Predicted values: CLASS variable versus BY variable

When you use a BY statement and fit three models of the form Horsepower = Weight, the procedure fits a total of six parameters. Notice that when you use the CLASS statement and fit the model Horsepower = Origin | Weight, you also fit six free parameters. It turns out that these two methods produce the same predicted values. In fact, you can combine the parameter estimates (for the GLM parameterization) for the CLASS model to obtain the parameter estimates from the BY-variable analysis, as shown below. Each parameter estimate for the BY-variable models is obtained as the sum of two estimates for the CLASS-variable analysis:

For many regression models, the predicted values for the BY-variable analyses are the same as for a particular model that uses a CLASS variable. As shown above, you can even see how the parameters are related when you use a GLM or reference parameterization. However, the CLASS variable formulation can fit models (such as the equal-slope model Horsepower = Origin Weight) that are not available when you use a BY variable to fit three separate models. Furthermore, the CLASS statement provides parameter estimates so that you can see the effect of the groups on the response variable. It is more difficult to compare the models that are produced by using the BY statement.

Other CLASS-like statements in SAS

Some SAS procedures use other syntax to analyze groups. In particular, the SGPLOT procedure calls classification variables "group variables." If you want to overlay graphs for multiple groups, you can use the GROUP= option on many SGPLOT statements. (Some statements support the CATEGORY= option, which is similar.) For example, to replicate the two-variable regression analysis from PROC GLM, you can use the following statements in PROC SGPLOT:

proc sgplot data=Cars;
   reg y=Horsepower x=Weight / group=Origin; /* Horsepower = Origin | Weight */
run;

Summary

In summary, use the BY statement in SAS procedures when you want to repeat an analysis for every level of one or more categorical variables. The variables define the subsets but are not otherwise part of the analysis. In classical SAS procedures, the data must be sorted by the BY variables. A BY-group analysis can produce many tables and graphs, so you might want to suppress the ODS output and write the results to a SAS data set.
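
One way to suppress the displayed output and capture the BY-group results in a data set is sketched below. It uses the ODS EXCLUDE and ODS OUTPUT statements; the output data set name Stats is made up for illustration.

```sas
proc sort data=Sashelp.Cars out=CarsSorted;
   by Origin;
run;

ods exclude all;                       /* suppress tables on all open destinations */
proc means data=CarsSorted N Mean Std stackodsoutput;
   by Origin;
   var Horsepower Weight Mpg_Highway;
   ods output Summary=Stats;           /* write the statistics to a data set */
run;
ods exclude none;                      /* restore normal output */

proc print data=Stats noobs;
run;
```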

Use the CLASS statement when you want to include a categorical variable in a model. A CLASS statement often enables you to compare or contrast subgroups. For example, in regression models you can evaluate the relative effect of each level on the response variable.

In some cases, the BY statement and the CLASS statement produce identical statistics. However, the CLASS statement enables you to fit a wider variety of models.

The post The difference between CLASS statements and BY statements in SAS appeared first on The DO Loop.

February 9, 2018
 

Want to see my newly minted certified professional badge? Scroll down to take a peek. Yes, I managed to successfully complete the Base SAS Programmer certification exam… with, ahem, flying colors I might add. Here are my tips to tackle the Base SAS certification exam: 1.  Get clear on the [...]

The post Demystifying certification (Part 3): To the finish appeared first on SAS Learning Post.

February 2, 2018
 

Do you know what the #1 fear in North America is? Most people say fear of public speaking or fear of death, but you may just want to consider this new fear upping the charts - the fear of writing a SAS Certification exam! The real test begins before even [...]

The post Demystifying SAS Certification (Part 2): Write it down appeared first on SAS Learning Post.