1月 082020
 

Did I trick you into seeing what this blog is about with its mysterious title? I am going to talk about how to use the FIND function to search text values.

The FIND function searches for substrings in character values. For example, you might want to extract all email addresses ending in .edu from a list of email addresses. If you are a slightly older SAS programmer like me, you may be more familiar with the INDEX function. If you use only two arguments in the FIND function, the first being the string you are searching and the second being the substring you are looking for, the FIND function is identical to the INDEX function. Both of these functions will searczh the string (first argument) for the substring (second argument) and return the position where the substring starts. If the substring is not found, the function returns a zero.

The newer FIND function has several advantages over the older INDEX function. These advantages are realized by the optional third and fourth arguments to the FIND function. These two arguments allow you to specify a starting position for the search and modifiers that allow you to ignore case. You can use either of these two arguments, or both, and the order doesn't matter! How is this possible? The value for the starting position is always a numeric value and the value for the modifier is always a character value. Thus, SAS can always figure out if a value is a starting position or a modifier.

Let's look at an example

Suppose you have a SAS data set called Emails, and each observation in the data set contains a name and an email address.

Here is a listing of the SAS data set Emails:

You want to select all observations where the variable Email_Address contains .edu (ignoring case).

The program below does just that:

*Searching for .edu;
data Education;
   set Emails;
   if find(Email_Address,'.edu','i') then output;
run;
title "Listing of Data Set Education";
proc print data=Education noobs;
run;

The 'i' modifier is an instruction to ignore case. In the listing of Education below, notice that all the .edu addresses are listed, regardless of case.

Not only is the FIND function more flexible than the older INDEX function, the ignore case modifier is really handy.

For more tips on writing code and how to get started in SAS Studio, check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To also learn more about SAS Press, check out the up-and-coming titles and receive exclusive discounts, make sure to subscribe to the SAS Books newsletter.

Adventures of a SAS detective and the fantastic FIND function was published on SAS Users.

1月 082020
 

Many SAS procedures can automatically create a graph that overlays multiple prediction curves and their prediction limits. This graph (sometimes called a "fit plot" or a "sliced fit plot") is useful when you want to visualize a model in which a continuous response variable depends on one continuous explanatory variable and one categorical (classification) variable. You can use the EFFECTPLOT statement in PROC PLM to create similar visualizations of other kinds of regression models. This article shows three ways to create a sliced fit plot: direct from the procedure, by using the EFFECTPLOT statement in PROC PLM, and by writing the predictions to a data set and using PROC SGPLOT to graph the results.

The data for this example is from the PROC LOGISTIC documentation. The response variable is the presence or absence of pain in seniors who are splits into two treatment groups (A and B) and a placebo group (P).

Data Neuralgia;
   input Treatment $ Sex $ Age Duration Pain $ @@;
   datalines;
P F 68  1 No  B M 74 16 No  P F 67 30 No  P M 66 26 Yes B F 67 28 No  B F 77 16 No
A F 71 12 No  B F 72 50 No  B F 76  9 Yes A M 71 17 Yes A F 63 27 No  A F 69 18 Yes
B F 66 12 No  A M 62 42 No  P F 64  1 Yes A F 64 17 No  P M 74  4 No  A F 72 25 No
P M 70  1 Yes B M 66 19 No  B M 59 29 No  A F 64 30 No  A M 70 28 No  A M 69  1 No
B F 78  1 No  P M 83  1 Yes B F 69 42 No  B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No  A F 69 12 No  B F 65 14 No  B M 70  1 No  B M 67 23 No  A M 76 25 Yes
P M 78 12 Yes B M 77  1 Yes B F 69 24 No  P M 66  4 Yes P F 65 29 No  P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No  P F 72 27 No  P F 70 13 Yes A M 75  6 Yes
B F 65  7 No  P F 68 27 Yes P M 68 11 Yes P M 67 17 Yes B M 70 22 No  A M 65 15 No
P F 67  1 Yes A M 67 10 No  P F 72 11 Yes A F 74  1 No  B M 80 21 Yes A F 69  3 No
;

Automatically create a sliced fit plot

Many SAS regression procedures support the PLOTS= option on the PROC statement. For PROC LOGISTIC, the option that creates a sliced fit plot is the PLOTS=EFFECTPLOT option, and you can add prediction limits to the graph by using the CLBAND suboption, as follows:

proc logistic data=Neuralgia alpha=0.2 plots(only)=effectplot(clband);
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
run;

That was easy! The procedure automatically creates a title, legend, axis labels, and so forth. By using the PLOTS= option, you get a very nice plot that shows the predictions and prediction limits for the model.

Create a sliced fit plot by using PROC PLM

One of the nice things about the STORE statement in SAS regression procedures is that it enables you to create graphs and perform other post-fitting analyses without rerunning the procedure. Maybe you intend to examine many models before deciding on the best model. You can run goodness-of-fit statistics for the models and then use PROC PLM to create a sliced fit plot for only the final model. To do this, use the STORE statement in the regression procedure and then "restore" the model in PROC PLM, which can perform several post-fitting analyses, including creating a sliced fit plot, as follows:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   store PainModel / label='Neuralgia Study';
run;
 
proc plm restore=PainModel noinfo;
   effectplot slicefit(x=Age sliceby=Treatment) / clm;
run;

The sliced fit plot is identical to the one that is produced by PROC LOGISTIC and is not shown.

Create a sliced fit plot manually

For many situations, the statistical graphics that are automatically produced are adequate. However, at times you might want to customize the graph by changing the title, the placement of the legend, the colors, and so forth. Sometimes companies mandate color-schemes and fonts that every report must use. For this purpose, SAS supports ODS styles and templates, which you can use to permanently change the output of SAS procedures. However, in many situations, you just want to make a small one-time modification. In that situation, it is usually simplest to write the predictions to a SAS data set and then use PROC SGPLOT to create the graph.

It is not hard to create a sliced fit plot. For these data, you can perform three steps:

  1. Write the predicted values and upper/lower prediction limits to a SAS data set.
  2. Sort the data by the classification variable and by the continuous variable.
  3. Use the BAND statement with the TRANSPARENCY= option to plot the confidence bands. Use the SERIES statement to plot the predicted values.

You can use the full power of PROC SGPLOT to modify the plot. For example, the following statements label the curves, move the legend, and change the title and Y axis label:

proc logistic data=Neuralgia alpha=0.2 noprint;
   class Treatment;
   model Pain(Event='Yes')= Treatment Age;
   /* 1. Use a procedure or DATA step to write Pred, Lower, and Upper limits */
   output out=LogiOut pred=Pred lower=Lower upper=Upper;
run;
 
/* 2. Be sure to SORT! */
proc sort data=LogiOut;
   by Treatment Age;
run;
 
/* 3. Use a BAND statement. If more that one band, use transparency */
title "Predicted Probabilities with 80% Confidence Limits";
title2 "Three Treatment Groups";
proc sgplot data=LogiOut;
   band x=Age lower=Lower upper=Upper / group=Treatment transparency=0.75 name="L";
   series x=Age y=Pred / group=Treatment curvelabel;
   xaxis grid;
   yaxis grid label="Predicted Probability of Pain" max=1;
   keylegend "L" / location=inside position=NW title="Treatment" across=1 opaque;
run;

The graph is shown at the top of this article. It has a customized title, label, and legend. In addition, the curves on this graph differ from the curves on the previous graph. The OUTPUT statement evaluates the model at the observed values of Age and Treatment. Notice that the A and B treatment groups do not have any patients that are over the age of 80, therefore those prediction curves do not extend to the right-hand side of the graph.

In summary, there are three ways to visualize predictions and confidence bands for a regression model in SAS. This example used PROC LOGISTIC, but many other regression procedures support similar options. In most SAS/STAT procedures, you can use the PLOTS= option to obtain a fit plot or a sliced fit plot. More than a dozen procedures support the STORE statement, which enables you to use PROC PLM to create the visualization. Lastly, all regression procedures support some way to output predicted values to a SAS data set. You can sort the data, then use the BAND statement (with transparency) and the SERIES statement to create the sliced fit plot.

The post 3 ways to add confidence limits to regression curves in SAS appeared first on The DO Loop.

1月 062020
 

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

SAS programming tips

The Essential Guide to Binning

  • Create training, testing, and validation data sets:: This post shows how to create training, validation, and test data sets in SAS. This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
  • 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
  • Conditionally append observations to a data set: In preparing data for graphing. you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
  • Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.

Statistical analyses

Geometric Meaning of Kolmogorov's D

Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from <em>The DO Loop</em> in 2019 appeared first on The DO Loop.

1月 032020
 

It is a generally accepted maxim in business that you can’t cost-cut your way to success.  Growth requires attention to the top line, which in turn is largely driven by innovation.  Applying analytics to improving innovation suffers from the same lack of direction as does Analytics for strategy, but with [...]

4 ways to apply analytics to innovation was published on SAS Voices by Leo Sadovy

12月 202019
 

One day, I received a question from the SAS community on how to exclude variables with missing values up to a certain percentage in data analysis. Before I dive into details, another question about how to remove variables with repeated values up to a certain percentage was raised. In fact, this user demands focus on the value characteristics for a single variable, and it determines if the user should include a specific variable in the following data analysis. It reflects the demand of variable selection for wide datasets with noise; it enhances data filtering by enabling variables(cols) selection, in addition to the normal observation(rows) filtering.

DATA Step implementation

If the variable filtering is based on the percentage of the missing values, the most intuitive thought is to use DATA step to calculate the percentage of missing values for each column of the input dataset, and then collect the variable names that meet the threshold conditions before passing them to a global macro variable for subsequent analysis. Since SAS DATA step includes both character and numeric data types, we cannot define one single array to map all columns of the dataset. Instead, we need to define two arrays to map all numeric variables and all character variables respectively.

In order to build a universal reusable SAS macro, we also need to detect the number of numeric and character variables in advance and pass them to the array definition in the next DATA step. In that DATA step, we use two arrays to access dataset variables and count the number of missing values respectively. When the loop reaches the last observation, it collects variable names through the vname() function which meet the threshold condition, and finally place them into a SAS global macro variable &VARLIST. The SAS macro below demonstrates this logic:

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_;  /*get numeric and character variables count */
    set &data (obs=1);
    array _ncols_n _numeric_;
    array _ccols_n $ _character_;
    call symput ('_ncols_n',put(dim(_ncols_n),best.));
    call symput ('_ccols_n',put(dim(_ccols_n),best.));
  run;
  data _null_;
	set &data end=last;
    /*Process numeric types*/
	array _ncols_ _numeric_;	
	array _ncols_count {&_ncols_n};
	retain _ncols_count (&_ncols_n*0);	
	do i=1 to dim(_ncols_);			
	  if _ncols_[i]= .  then _ncols_count[i]=_ncols_count[i] + 1;
	end;
    /*Process character types*/
	array _ccols_ $  _character_;
    array _ccols_count {&_ccols_n};	
	retain _ccols_count (&_ccols_n*0);
	do i=1 to dim(_ccols_);			
	  if _ccols_[i]=" " then _ccols_count[i]=_ccols_count[i]+1;
	end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if last then do;	  
	  length _varlist_ $ 32767;
retain _varlist_ " ";
	  do i=1 to &_ncols_n; 
	    if (_ncols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ncols_[i]) );
        end;				
	  end;
	  do i=1 to &_ccols_n; 
	    if (_ccols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ccols_[i]) );
		end;
	  end;	
	  call symput ("&VARLIST",trim(_varlist_));			
	end;
  run;
%mend; 
/*Exclude variables if missing value pct GE. threshold  &th*/
%macro FilterData(data=, th=, out=);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars);
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

To verify the programing logic, let’s generate a demo dataset WORK.CLASS based on SASHELP.CLASS, the output dataset has different missing value percentage (10.53%, 36.84%, 52.63%) in Sex, Height and Weight columns.

data class; /*To generate demo dataset*/
  set sashelp.class;  
  if age < = 11 then sex=" ";
  if age <= 12 then height=.;
  if age <= 13 then weight=.;  
run;
proc print data=class; run; /*See Figure 1 for output */

Now we can use the following code to filter columns by percentage of missing values, which is greater than 40%; the Weight column will not appear in the output CLASS2 dataset.

 
%FilterData(data=class, th=0.4, out=class2);
proc print data=class2; run; /*See Figure 1 for output */

Figure 1: Data before filtered and after filtered

The calculation here for the percentage of missing values is implemented by the user programming in the DATA step. The missing value count is computed separately in two arrays according to the character type and the numerical type. If you want to do it in a single array, you must use HASH object to store the variable name and its corresponding index in that array. Whenever a missing value is detected, we need to check whether the variable is already in the HASH object; if not, then add it directly. Otherwise, look up the table to find the appropriate array index, and then increase the counter to the corresponding element in that array. The advantage of this approach is we only need one array in later variable names collecting phase. The disadvantage is the use of HASH object increases the programming complexity. The complete implementation based on this logic is as follows, it can completely replace the first version of %FilterCols implementation.

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_; /*Detect variable numbers of dataset*/
    dsid=open("&data.",'i');
    if dsid>0 then do;
      nvars= attrn(dsid, 'NVARS'); 
	  call symput("NVARS", nvars); 
    end;
  run;
  data _null_;
    set &data. end=last; 
    /*Define 2 arrays map to numeric/character vars respectively*/
    array _ncols_ _numeric_;
    array _ccols_ _character_;
 
    if _N_=1 then do; /*HASH Object definition and initialize*/
      length key $ 32;
      declare hash h(); 
	  rc=h.definekey("key");
	  rc=h.definedata("data", "key");
	  rc=h.definedone();
	  call missing(key, data); 
    end;  
    /*Define one array for counting of missing value*/
	array _cols_count[&NVARS] _temporary_; 
	retain _cols_count ( &NVARS * 0); 
    retain _col_id_ 1; 
 
    do i=1 to dim(_ncols_);	  
      if missing( _ncols_[i] ) then do; 
        /*Find mapped array index and modify counter value*/
	    key=vname( _ncols_[i] );  
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();  
		   _col_id_=_col_id_+1;		
	    end;    
        _cols_count[ data ] = _cols_count[ data ] +1; 
	  end;
    end;
    do i=1 to dim(_ccols_); 
      if missing( _ccols_[i] ) then do;
        /*Find mapped array index and modify counter value*/
	    key=vname( _ccols_[i] );
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();	 
		   _col_id_=_col_id_+1;		
	    end;	        
        _cols_count[ data ] = _cols_count[ data ] +1; 	    		
	  end;
    end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if last then do;
      declare hiter iter('h');
      rc = iter.first();  
      do while (rc = 0);   	    
        pct=_cols_count[data]/ _N_;
		if pct >= &th. then do; 
	      length _varlist_ $ 32767; 
retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(key);		  	  
        end; 
        rc = iter.next(); 
      end;	
	  call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
%mend;

PROC FREQ Implementation

In the two macros above, the counting of missing values for each column are implemented by user programming in DATA step, but SAS provides a more powerful frequency statistic step called PROC FREQ. Can we directly use its output for variable filtering? The answer is YES. For example, the following SAS code can list the Frequency, Percent, Cumulative Frequency, and Cumulative Percent of each variable values (including missing values) in the data set.

 
  proc freq data=class;
    table Sex Height Weight /missing;  /*See Figure 2 for output */
  run;

Figure 2: PROC FREQ output

To get the result of PROC FREQ into a dataset, we just need to add ODS Table statement before running PROC FREQ:

 
ods table onewayfreqs=class_freqdata;

Figure 3: OneWayFreqs table class_freqdata

So, the frequency statistics result of PROC FREQ is stored in the class_freqdata dataset. We only need to check the value of variable “percent” for each column. If it is greater than the threshold specified, the variable name is added to the buffer variable _varilist_, otherwise it is ignored; the final value of _varilist_ is placed to the SAS global macro variable &VARLIST for later use. Comparing to the DATA step implementation, the entire code here is much shorter and simpler.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  proc freq data=&data;
    table _all_ /missing;  
  run; 
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var); 
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
        retain _varlist_ " "; 
	    _varlist_=trim(_varlist_) || " "|| trim(var);	
		put var= _varlist_;
	  end;
   end;
   if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Please note that PROC FREQ would generate output to ODS HTML destination by default. If you want to avoid generating ODS HTML output when generating the temporary dataset “class_freqdata”, you can use the ods html close; statement to temporarily close the ODS HTML output and then open it again anytime needed.

Since the output of PROC FREQ already has the frequency information of each variable, we can naturally filter columns based on the percentage of duplicate values. However, it should be noted that a single variable always contains multiple distinct values, and each distinct value (including missing values) has a frequency percentage. If percentage of duplicate value is too high in a specific column, the user may exclude that column from the following analysis. As there are multiple frequency percentages, we just choose the largest one as the duplicate value percentage of that variable. For example, the Sex column of demo dataset CLASS has a missing value of 10.53%, a female of 42.11% and a male of 47.37%, so the percentage of duplicate values in the Sex column should be 47.37%.

The following code filter variables by percentage of duplicate values with a specific threshold. The code logic is almost same as before, but it filters values with non-missing values. So, run the same test code against the demo dataset will remove the Age column by duplicate values percentage threshold, instead of the Weight column by missing values percentage threshold.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ / missing;  
  run; 
  ods html;
  data _null_;
    set &data._freqdata end=last;
var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
    value=vvaluex(var);
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do; /*Search max percentage of duplicate values of a variable*/
	    retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;      
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        	      
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
	end;
    if last then call symput( "&VARLIST", trim( left(_varlist_)));  
  run;
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata ;
  quit;
%mend;

In order to combine these two logics of by missing values and by duplicate values in one macro, we just introduce one flag parameter byduprate. The default value is 0 to filter variables by missing values, and value 1 to filter variables by duplicate values. The complete code is as below:

 
%macro FilterCols(data=, th=0, varlist=, byduprate=0);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
proc freq data=&data;
    table _all_ /missing;  
  run; 
ods html;
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2); 
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”"); 
	value=vvaluex(var);
  %if &byduprate=0 %then %do; /*Filter variables by missing value*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
	    retain _varlist_ " ";
	    _varlist_=trim(_varlist_) || " "|| trim(var);				 
	  end;
   end;
  %end;
  %else %do; /*Filter variables by duplicate value percentage*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do;      
        retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;		
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;              
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
    end;
  %end;
    if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Accordingly, we also need to introduce the parameter byduprate with the same name and behavior in the %FilterData macro that's invoking the %FilterCols macro. The SAS macro %FilterData also has the same default value 0, but it needs to pass the byduprate argument specified by the user to the %FilterCols macro. The enhanced version is as below:

 
%macro FilterData(data=, th=, out=, byduprate=0);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars, byduprate=&byduprate); 
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

Now the final version of the macro %FilterData unified the two kinds of variable filtering logic, it just depends on the different value of the argument byduprate: If byduprate is not specified or the value is 0, it means filter variables by the percentage of missing values, otherwise it is by the percentage of duplicate values. For the following two code snippets, the former drops the Weight column, while the latter drops the Sex column.

 
/*Drop columns with missing value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class2);              
proc print data=class2;run;
 
/*Drop columns with duplicate value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class3, byduprate=1); 
proc print data=class3;run;

Figure 4: Drop column by missing and duplicate value percentage

Summary

This article discusses how to use SAS to filter variables in a dataset based on the percentage of missing values or duplicate values. The missing value statistics can be implemented by either DATA step programming on your own or reusing the existing powerful PROC FREQ with statistics result for missing values and duplicate values. The five macros in this article also demonstrated how to wrap up the existing powerful SAS analytics as a reusable, general-purpose SAS macro for a big project. The final version of macro %FilterData also demonstrated how to unify filtering variables by the percentage of either missing values or duplicate values in one to make calling SAS macro functions convenient and easy to reuse. To learn more, please check out these resources:

How to filter variables by percentage of missing values or duplicate values in SAS was published on SAS Users.

12月 192019
 

From sports and health data to environmental and policy data, our data visualization experts have used SAS technologies to explore and present analyses on hundreds of topics throughout the year. We've selected some of the best in this end-of-year roundup to showcase their skills and SAS technologies, and to demonstrate [...]

Our top 10 data visualizations of 2019 was published on SAS Voices by Alison Bolen

12月 182019
 

How can a solar farm ensure peak energy production? And what factors can be adjusted to optimize production throughout the day, the week and season-by-season? These are just some of the questions that a team of data scientists have asked and answered about the SAS solar farm using data, drones [...]

How drones and computer vision helped boost clean energy production was published on SAS Voices by Briana Ullman

12月 182019
 

How can a solar farm ensure peak energy production? And what factors can be adjusted to optimize production throughout the day, the week and season-by-season? These are just some of the questions that a team of data scientists have asked and answered about the SAS solar farm using data, drones [...]

How drones and computer vision helped boost clean energy production was published on SAS Voices by Briana Ullman