
January 16, 2020
 

This blog is part of a series on SAS Visual Data Mining and Machine Learning (VDMML).

If you're new to SAS VDMML and you want a brief overview of the features available, check out my last blog post!

This blog will discuss types of missing data and how to use imputation in SAS VDMML to improve your predictions.

 
Imputation is an important aspect of data preprocessing that has the potential to make (or break) your model. The goal of imputation is to replace missing values with values that are close to what the missing value might have been. If successful, you will have more data with which the model can learn. However, you could introduce serious bias and make learning more difficult.


When considering imputation, you have two main tasks:

  • Try to identify the missing data mechanism (i.e. what caused your data to be missing)
  • Determine which method of imputation to use

Why is your data missing?

Identifying exactly why your data is missing is often a complex task, as the reason may not be apparent or easily identified. However, determining whether there is a pattern to the missingness is crucial to ensuring that your model is as unbiased as possible, and that you have enough data to work with.

One reason imputation may be used is that many machine learning models cannot handle missing values and instead perform complete case analysis, meaning that any observation with a missing value is excluded.

Why is this a problem? Well, look at the percentage of missing data in this dataset...

Percentage of missing data for columns in HMEQ dataset

View the percentage of missing values for each variable in the Data tab within Model Studio.  Two variables are highlighted to demonstrate a high percentage of missingness (DEBTINC - 21%, DEROG - 11%).

If we use complete case analysis with this data, we will lose at least 21% of our observations! Not good.
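If you want to measure this in your own data before deciding, a quick DATA step can count how many rows complete case analysis would discard. Here is a minimal sketch; it uses SASHELP.HEART as a stand-in for your table:

data _null_;
   set sashelp.heart end=last;
   if cmiss(of _all_) > 0 then ndrop + 1;  /* rows with at least one missing value */
   if last then put ndrop= 'of ' _n_ 'observations would be excluded';
run;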

Missing data can be classified into three types:

  • missing completely at random (MCAR),
  • missing at random (MAR),
  • and missing not at random (MNAR).

Missing completely at random (MCAR) indicates the missingness is unrelated to any of the variables in the analysis -- a rare case for missing data. This means that neither the values that are missing nor the other variables associated with those observations are related to the missingness. In this case, complete case analysis should not produce a biased model. However, if you are concerned about losing data, imputing the missing values is still a valid approach.

Data can also be missing at random (MAR), where the missing value itself is random and unrelated to the missingness, but the missingness is related to the other variables in the analysis. For example, let’s say a survey was sent out in the mail asking participants about their activity levels, and a portion of the surveys from a certain region were lost in the mail. In this case, the missingness is not completely random -- it depends on region -- but it is unrelated to the variable we are interested in, activity level.

Dramatization of potential affected survey participants

An important consideration is that even though the missingness is unrelated to activity level, the data could be biased if participants in that region are particularly fit or lazy. Therefore, if we had data from other people in the affected region, we may be able to use that information when we impute the data. However, imputing MAR data can still bias the results; the extent depends on your data and your method of imputation.

In some cases, the data is missing not at random (MNAR) and the missingness could provide information that is useful to prediction. MNAR data is non-ignorable and must be addressed or your model will be biased.  For example, if a survey asks a sensitive question about a person's weight, a higher proportion of people with a heavier weight may withhold that information -- creating bias within the missing data. One possible solution is to create another level or label for the missing value.
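For MNAR data, the "extra level" idea can be as simple as deriving a category whose value records the fact that the response is missing. A minimal sketch, where the survey data set, the weight variable, and the cutoffs are all hypothetical:

data survey_coded;
   set survey;                                          /* hypothetical survey data set */
   length weight_grp $ 12;
   if missing(weight) then weight_grp='Not reported';   /* missingness becomes its own level */
   else if weight < 150 then weight_grp='Under 150';
   else weight_grp='150 or more';
run;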

The bottom line is that whether you should use imputation for missing values depends completely on your data and what the missing values represent. Be diligent!

If you have determined that imputation may be beneficial for your data, there are several options to choose from. I will not be going into detail on the types of imputation available as the type you use will be highly dependent on your data and goals, but I have listed a few examples below to give you a glimpse of the possibilities.

Examples of types of imputations:

  • Mean, Median
  • Regression
  • Hot-deck
  • Last Observation Carried Forward (LOCF - for longitudinal data)
  • Multiple Imputation
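As a point of reference, the simplest of these can also be done in Base SAS code. The sketch below uses PROC STDIZE with the REPONLY option, which replaces only missing values; the data set and variable names are hypothetical stand-ins for the HMEQ example above:

proc stdize data=work.hmeq out=work.hmeq_imputed
            reponly        /* replace missing values only; leave non-missing values as-is */
            method=mean;   /* or median, minimum, and so on */
   var clage debtinc;
run;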

How to Impute in SAS VDMML

Now, I will show an example of imputation in SAS Viya Model Studio (SAS VDMML).

First, I created a New Project, selecting Data Mining and Machine Learning as the project type and keeping the default template.

On the data tab, I will specify how I want certain features to be imputed. This does not do the imputation for us! Think about the specifications on the data tab as a blueprint for how you want the machine learning pipeline to handle the data.

To start, I selected the CLAGE feature and changed the imputation to "Mean" (see the demonstration below). Then, I selected DELINQ and changed the imputation to "Median". For NINQ, I selected "Minimum" as the imputed value. For DEBTINC and YOJ, I set the imputation for both to Custom Constant Value (50 and 100, respectively).

Selecting to impute the mean value for the CLAGE variable

Now, I will go over to the Pipeline tab and add an Imputation node (this is where the work gets done!).

Adding an Imputation node to the machine learning pipeline

Selecting the Imputation node, I view its options and change the default imputation methods to None. I changed this because we only want the variables previously specified to be imputed. If you run the default imputation instead, SAS VDMML automatically imputes missing values of Class (categorical) variables with the Median and missing values of Interval (numerical) variables with the Mean.

Updating default imputation values to none for Categorical and Interval variables

Looking at the results, we can see which variables imputation was performed on, what value was used, and how many values were replaced for each variable.

Results of running Imputation node on data

And that's how easy it is to impute variables in SAS Model Studio!

Interested in checking out SAS Viya?

Machine Learning Using SAS Viya is a course that teaches machine learning basics, gives instruction on using SAS Viya VDMML, and provides access to the SAS Viya for Learners software, all for $79. This course is the prerequisite for the SAS Certified Specialist in Machine Learning certification. Going through the course myself, I was able to quickly learn how to use SAS VDMML and received a refresher on many data preprocessing tactics and machine learning concepts.

Want to learn more? 

Stay tuned; I will be posting more blogs with examples of specific features in SAS VDMML. If there are any specific features you would like to know more about, leave a comment below!

Missing data? Quickly impute values using SAS Viya was published on SAS Users.

January 16, 2020
 

Using Customer Lifetime Value in your business decision making is often crucial for success. Businesses that are customer-centric often spend thousands of dollars acquiring new customers, “on-boarding” new customers, and retaining those customers. If your business margins are thin, it can be months or quarters before you start to turn a profit on a particular customer. Additionally, some business models segment customers by worth and give higher levels of service to the higher-worth customers. The metric most often used for this is Customer Lifetime Value (CLV). CLV is simply a balance-sheet look at the total cost spent versus the total revenue earned over a customer’s projected tenure, or “life.”

In this blog, we will focus on how a business analyst can build a functional analytical dashboard for a fictional company that is seeing its revenue, margins, and a customer’s lifetime value decrease and what steps they can take to correct that.

We will cover 3 main areas of interest:

  1. First, screenshots of SAS Visual Analytic reports, using Customer Lifetime Value and how you can replicate them.
  2. Next, we will look at the modeling that we did in the report, with explanations of how we used the results in subsequent modeling.
  3. Lastly, we talk about one example of how we scored and deployed the model, and how you can do the same.

Throughout this blog, I will also highlight areas where SAS augments our software with artificial intelligence to improve your experience.

1. State of the company

First, we will look at the state of the company using the dashboard and take note of any problems.

Our dashboard shows the revenue of our company over the last two years as well as a forecast for the next 6 months. We see that revenue has been on the decline in recent years and churns have been erratically climbing higher.

Our total annual revenue was 112M last year with just over 5,000 customers churning.

So far this year, our revenue is tracking low and sits at only 88M, but the bad news is that we have already tripled last year's churn total.

If these trends continue, we stand to lose a third of our revenue!

2. The problems

Now, let’s investigate as to where the problems are and what can be done about them.

If we look at our current metrics, we can see some interesting points worth investigating.

The butterfly chart on the right shows movement between customer loyalty tiers within each region of the country with the number of upgrades (on the right) and downgrades (on the left).

The vector plots show us information over multiple dimensions. These show us the difference between two values and the direction it is heading. For example, on the left, we see that Revenue is pointed downward while churns (x axis) are increasing.

The vector plot on the right shows us the change in margin from year to year as well as the customer lifetime value.

What’s interesting here is that there are two arrows that are pointing up, indicating a rise in customer lifetime value. Indeed, if we were to click on the map, we would see that these two regions are the same two that have a net increase in Loyalty Tier.

This leads me to believe that a customer’s tier is predictive of margin. Let’s investigate it further.

3. Automated Analysis

We will use the Automated Analysis feature within Visual Analytics to quickly give us the drivers of CLV.

This screenshot shows an analysis that SAS Visual Analytics (VA) performed for me automatically. I simply told VA which variable I was interested in analyzing, and within a matter of seconds it ran a series of decision trees to produce this summary. This is an example of how SAS is incorporating AI into our software to improve your experience.

Here we can see that loyalty tier is indeed the most important factor in determining projected annual margin (or CLV).

4. Influential driver

Once identified, the important driver will be explored across other dimensions to assess how influential this driver might be.

A cursory exploration of Loyalty Tier indicates that yes, loyalty tier, particularly Tier 5, has a major influence on revenue, order count, repeat orders, and margin.

5. CLV comparison models

We will create two competing models for CLV and compare them.

Here on our modeling page are two models that I’ve created to predict CLV. The first one is a Linear Regression and the second is a Gradient Boosting model. I've used Model Comparison to tell me that the Linear Regression model delivers a more accurate prediction and so I use the output of that model as input into a recommendation engine.

6. Recommendation engine

Based on our model learnings and the output of the model, we are going to build a recommendation engine to help us determine what to do with each customer.

Represented here, I built a recommendation engine model using the Factorization Machine algorithm.

Once we implement our model, customers are categorized more appropriately and we can see that it has had an impact on revenue and the number of accounts is back on the rise!

Conclusion

Even though Customer Lifetime Value has been around for years, it is still a valuable metric to use in modeling and recommendation engines, as we have seen. We used it in our automated analysis, discovered that it had an impact on revenue, modeled future values of CLV, and then incorporated those results into a recommendation engine that recommended new loyalty tiers for our customers. As a result, we saw positive changes in overall company revenue and churn.

To learn more, please check out these resources:

How to utilize Customer Lifetime Value with SAS Visual Analytics was published on SAS Users.

January 13, 2020
 

Are you ready to get a jump start on the new year? If you’ve been wanting to brush up your SAS skills or learn something new, there’s no time like a new decade to start! SAS Press is releasing several new books in the upcoming months to help you stay on top of the latest trends and updates. Whether you are a beginner who is just starting to learn SAS or a seasoned professional, we have plenty of content to keep you at the top of your game.

Here is a sneak peek at what’s coming next from SAS Press.

For students and beginners

For beginners, we have Exercises and Projects for The Little SAS® Book: A Primer, Sixth Edition by Rebecca Ottesen, Lora Delwiche, and Susan Slaughter, the best-selling workbook companion to The Little SAS Book. The workbook will be updated to match the new The Little SAS® Book: A Primer, Sixth Edition. This hands-on workbook is designed to hone your SAS skills, whether you are a student or a professional.

For data explorers of all levels

This free e-book explores the features of SAS® Visual Data Mining and Machine Learning, powered by SAS® Viya®. Users of all skill levels can visually explore data on their own while drawing on powerful in-memory technologies for faster analytic computations and discoveries. You can manually program with custom code or use the features in SAS® Studio, Model Studio, and SAS® Visual Analytics to automate your data manipulation and modeling. These programs offer a flexible, easy-to-use, self-service environment that can scale on an enterprise-wide level. This book introduces some of the many features of SAS Visual Data Mining and Machine Learning, including programming in the Python interface; new, advanced data mining and machine learning procedures; pipeline building in Model Studio; and model building and comparison in SAS® Visual Analytics.

For health care data analytics professionals

If you work with real world health care data, you know that data from sources like observational studies, pragmatic trials, patient registries, and databases is common and growing in use. Real World Health Care Data Analysis: Causal Methods and Implementation in SAS® by Doug Faries et al. brings together best practices for causal-based comparative effectiveness analyses of real world data in a single location. Example SAS code is provided to make the analyses relatively easy and efficient. The book also presents several emerging topics of interest, including algorithms for personalized medicine, methods that address the complexities of time-varying confounding, extensions of propensity scoring to comparisons among more than two interventions, sensitivity analyses for unmeasured confounding, and implementation of model averaging.

For those at the cutting edge

Are you ready to take your understanding of IoT to the next level? Intelligence at the Edge: Using SAS® with the Internet of Things, edited by Michael Harvey, begins with a brief description of the Internet of Things, how it has evolved over time, and the importance of SAS's role in the IoT space. The book continues with a collection of chapters showcasing SAS's expertise in IoT analytics. Topics include using SAS Event Stream Processing to process real-world events, connectivity, using the ESP Geofence window, applying analytics to streaming data, using SAS Event Stream Processing in a typical IoT reference architecture, the role of SAS Event Stream Manager in managing ESP deployments in an IoT ecosystem, how to use deep learning with Your IoT Digital, accounting for data quality variability in streaming GPS data for location-based analytics, and more!

Keep an eye out for these titles releasing in the next two months! We hope this list helps in your search for a SAS book that will take you to the next step in updating your SAS skills. To learn more about SAS Press, check out our up-and-coming titles, and subscribe to our newsletter to receive exclusive discounts.

Foresight is 2020! New books to take your skills to the next level was published on SAS Users.

January 11, 2020
 
In SAS 9.3 and earlier, the default value of the YEARCUTOFF= option is 1920. This default setting could trigger data integrity issues because any 2-digit years of "20" in dates will be assumed to occur in 1920 instead of 2020. If the intended year in the date is 2020, you must set the YEARCUTOFF= option to a value larger than 1920. Of course, the best alternative is to always specify date values with 4-digit years.

Luckily, SAS makes it easy to change the YEARCUTOFF= option so that it works best for your data. The default value for the YEARCUTOFF= option changed in SAS 9.4 to 1926. This change makes it easier for customers who are still using 2-digit years of "20" to make sure that the date is assigned to 2020. Let's review some of the frequently asked questions that customers ask about how SAS works with 2-digit years.

What is the YEARCUTOFF= option?

The YEARCUTOFF= option lets you specify which century SAS software should assign to dates with 2-digit years.

How do I specify the YEARCUTOFF= option in my SAS programs?

The option is specified in an OPTIONS statement. Here is an example:

options yearcutoff=1930;

You can also specify the option in an autoexec file or a config file. If you don't specify the YEARCUTOFF= option, the SAS system default is used. Remember that 1920 is the default for SAS 9.3 and earlier releases and 1926 is the default for SAS 9.4. (For reference, SAS 9.3 was released in 2011 and SAS 9.4 in 2013.)

How does the YEARCUTOFF= option work?

The YEARCUTOFF= option specifies the first year of a 100-year window in which all 2-digit years are assumed to occur. For example, if the YEARCUTOFF= option is set to 1920, all 2-digit years are assumed to occur between 1920 and 2019. This means that two-digit years from 20 - 99 are assigned a century prefix of "19" and all 2-digit years from 00 - 19 have a century prefix of "20."
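You can see the window in action with a short test. The following sketch reads the same 2-digit date under two YEARCUTOFF= settings and writes the interpreted date to the log:

options yearcutoff=1920;
data _null_;
   d=input('01/01/20', mmddyy8.);
   put 'yearcutoff=1920: ' d= date9.;   /* 01JAN1920 */
run;

options yearcutoff=1926;
data _null_;
   d=input('01/01/20', mmddyy8.);
   put 'yearcutoff=1926: ' d= date9.;   /* 01JAN2020 */
run;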

Which types of date values are affected by the YEARCUTOFF= option?

The YEARCUTOFF= option affects the interpretation of 2-digit years in the following cases:

  • Reading date values from external files
  • Specifying dates or year values in SAS functions
  • Specifying SAS date literals

The YEARCUTOFF= option does not influence the following cases:

  • Processing dates with 4-digit years
  • Processing dates already stored as SAS date values (the number of days since January 1, 1960)
  • Displaying dates with SAS date formats

Which value should I set the YEARCUTOFF= option to?

The optimal value depends on the range of dates in your data. The YEARCUTOFF= option should be set so that the 100-year range encompasses the range of your data values. In general, SAS recommends setting the YEARCUTOFF= option to a value equal to or slightly less than the first year in your data. For example, if the range of dates that you are processing is from 1930 - 2010, a YEARCUTOFF value of 1925 or 1930 would be appropriate. If you set YEARCUTOFF=1925, then all 2-digit years are assumed to be in the 100-year period from 1925 to 2024. If all the dates in your data fall within that range, they will be interpreted correctly.
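Before changing anything, you can check the value currently in effect for your session:

proc options option=yearcutoff;
run;

/* or, inside a program: */
%put YEARCUTOFF is %sysfunc(getoption(yearcutoff));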

What do I do if my dates with 2-digit years span more than 100 years?

The YEARCUTOFF= option cannot reliably assign centuries to 2-digit years if the range of dates for a variable is greater than 100 years. If the date ranges for a variable span more than 100 years, you must either specify the dates with 4-digit years or use DATA step logic to assign a century to each year (perhaps based on the value of another variable).
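Here is a minimal sketch of that DATA step logic. Every name in it is hypothetical: yy2 holds a 2-digit year and era is another variable that distinguishes the centuries:

data dates_fixed;
   set dates_raw;                        /* hypothetical input data set */
   if era='OLD' then year4=1800 + yy2;   /* records known to be from the 1800s */
   else year4=1900 + yy2;                /* everything else is from the 1900s */
run;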

But why does the YEARCUTOFF= option allow only a 100-year span? If the YEARCUTOFF= option allowed for more than a 100-year span, there would be no way to determine which century a 2-digit year should have. For example, let’s assume that YEARCUTOFF=1950 with a 150-year span and your external data file had 2-digit years. In this scenario, your 150-year span would be from 1950 to 2099. Since you have 2-digit years, there would be no way to determine whether a year of 50 was meant to occur in 1950 or in 2050.

How do I change the default setting for all the SAS users at my site?

Setting system default option values is usually done by a site SAS Installation Representative. The recommended method for setting a system default YEARCUTOFF= value is to specify the desired value in the system SAS configuration file. Note that even if you set a default value for all the users at your site, they can override the default value in their SAS programs, in personal autoexec files, in config files, by setting an environment variable, or when invoking SAS software.
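In the configuration file, the option takes the command-line form. A minimal sketch (the file name and location depend on your operating system and installation):

/* in the system SAS configuration file, for example sasv9.cfg */
-yearcutoff 1930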

How do I change the default setting for my own programs if I want a default that is different from the rest of the users at my site?

You can specify personal default values either in a personal configuration file or in an autoexec file. If you specify the value in a personal configuration file, the syntax depends on your operating system and is the same as that for setting the value in the system-wide configuration file on each system. If you use an autoexec file, you can specify the YEARCUTOFF= option in an OPTIONS statement.  Here is an example:

options yearcutoff=1930;

Additional Resources

  • YEARCUTOFF= System Option section in SAS® 9.3 System Options: Reference, Second Edition
  • YEARCUTOFF= System Option section in SAS® 9.4 System Options: Reference, Fifth Edition
  • SAS Note 46368, "The default value for the YEARCUTOFF= system option has changed in SAS® 9.4 and beyond"
  • SAS Note 65307, "You might encounter an issue in which 2-digit year dates have the wrong century in SAS® 9.3 and earlier releases"

Why does my SAS date have the wrong century? was published on SAS Users.

January 8, 2020
 

Did I trick you into seeing what this blog is about with its mysterious title? I am going to talk about how to use the FIND function to search text values.

The FIND function searches for substrings in character values. For example, you might want to extract all email addresses ending in .edu from a list of email addresses. If you are a slightly older SAS programmer like me, you may be more familiar with the INDEX function. If you use only two arguments in the FIND function, the first being the string you are searching and the second being the substring you are looking for, the FIND function is identical to the INDEX function. Both of these functions search the string (first argument) for the substring (second argument) and return the position where the substring starts. If the substring is not found, the function returns a zero.

The newer FIND function has several advantages over the older INDEX function. These advantages are realized by the optional third and fourth arguments to the FIND function. These two arguments allow you to specify a starting position for the search and modifiers that allow you to ignore case. You can use either of these two arguments, or both, and the order doesn't matter! How is this possible? The value for the starting position is always a numeric value and the value for the modifier is always a character value. Thus, SAS can always figure out if a value is a starting position or a modifier.
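A short test illustrates the interchangeable third and fourth arguments. Both calls below search 'Hello hello' for 'hello' starting at position 2 while ignoring case, and both return 7:

data _null_;
   string='Hello hello';
   p1=find(string, 'hello', 'i', 2);   /* modifier first, start position second */
   p2=find(string, 'hello', 2, 'i');   /* start position first, modifier second */
   put p1= p2=;                        /* p1=7 p2=7 */
run;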

Let's look at an example

Suppose you have a SAS data set called Emails, and each observation in the data set contains a name and an email address.

Here is a listing of the SAS data set Emails:

You want to select all observations where the variable Email_Address contains .edu (ignoring case).

The program below does just that:

*Searching for .edu;
data Education;
   set Emails;
   if find(Email_Address,'.edu','i') then output;
run;
title "Listing of Data Set Education";
proc print data=Education noobs;
run;

The 'i' modifier is an instruction to ignore case. In the listing of Education below, notice that all the .edu addresses are listed, regardless of case.

Not only is the FIND function more flexible than the older INDEX function, but the ignore-case modifier is also really handy.

For more tips on writing code and how to get started in SAS Studio, check out my book, Learning SAS by Example: A Programmer’s Guide, Second Edition. You can also download a free book excerpt. To learn more about SAS Press, check out the up-and-coming titles, and subscribe to the SAS Books newsletter to receive exclusive discounts.

Adventures of a SAS detective and the fantastic FIND function was published on SAS Users.

December 20, 2019
 

One day, I received a question from the SAS community about how to exclude variables with missing values above a certain percentage from data analysis. Before I could dive into the details, a related question was raised: how to remove variables whose repeated values exceed a certain percentage. Both requests focus on the value characteristics of a single variable, which determine whether the user should include that variable in the subsequent data analysis. They reflect the demand for variable selection in wide, noisy datasets; such filtering enhances data preparation by enabling variable (column) selection in addition to the normal observation (row) filtering.

DATA Step implementation

If the variable filtering is based on the percentage of missing values, the most intuitive approach is to use a DATA step to calculate the percentage of missing values for each column of the input dataset, and then collect the variable names that meet the threshold condition and pass them to a global macro variable for subsequent analysis. Since a SAS DATA step handles both character and numeric data types, we cannot define one single array to map all columns of the dataset. Instead, we need to define two arrays, one mapping all numeric variables and one mapping all character variables.

In order to build a universal, reusable SAS macro, we also need to detect the numbers of numeric and character variables in advance and pass them to the array definitions in the next DATA step. In that DATA step, we use two arrays to access the dataset variables and count the number of missing values respectively. When the loop reaches the last observation, it uses the VNAME() function to collect the names of the variables that meet the threshold condition and finally places them into the SAS global macro variable &VARLIST. The SAS macro below demonstrates this logic:

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_;  /*get numeric and character variables count */
    set &data (obs=1);
    array _ncols_n _numeric_;
    array _ccols_n $ _character_;
    call symput ('_ncols_n',put(dim(_ncols_n),best.));
    call symput ('_ccols_n',put(dim(_ccols_n),best.));
  run;
  data _null_;
	set &data end=last;
    /*Process numeric types*/
	array _ncols_ _numeric_;	
	array _ncols_count {&_ncols_n};
	retain _ncols_count (&_ncols_n*0);	
	do i=1 to dim(_ncols_);			
	  if _ncols_[i]= .  then _ncols_count[i]=_ncols_count[i] + 1;
	end;
    /*Process character types*/
	array _ccols_ $  _character_;
    array _ccols_count {&_ccols_n};	
	retain _ccols_count (&_ccols_n*0);
	do i=1 to dim(_ccols_);			
	  if _ccols_[i]=" " then _ccols_count[i]=_ccols_count[i]+1;
	end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if last then do;	  
	  length _varlist_ $ 32767;
	  retain _varlist_ " ";
	  do i=1 to &_ncols_n; 
	    if (_ncols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ncols_[i]) );
        end;				
	  end;
	  do i=1 to &_ccols_n; 
	    if (_ccols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ccols_[i]) );
		end;
	  end;	
	  call symput ("&VARLIST",trim(_varlist_));			
	end;
  run;
%mend; 
/*Exclude variables if missing value pct GE. threshold  &th*/
%macro FilterData(data=, th=, out=);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars);
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

To verify the programming logic, let’s generate a demo dataset WORK.CLASS based on SASHELP.CLASS; the output dataset has different missing-value percentages (10.53%, 36.84%, and 52.63%) in the Sex, Height, and Weight columns.

data class; /*To generate demo dataset*/
  set sashelp.class;  
  if age <= 11 then sex=" ";
  if age <= 12 then height=.;
  if age <= 13 then weight=.;  
run;
proc print data=class; run; /*See Figure 1 for output */

Now we can use the following code to filter out columns whose percentage of missing values is greater than or equal to 40%; the Weight column will not appear in the output CLASS2 dataset.

 
%FilterData(data=class, th=0.4, out=class2);
proc print data=class2; run; /*See Figure 1 for output */

Figure 1: Data before filtered and after filtered

The calculation of the percentage of missing values here is implemented with user programming in the DATA step. The missing-value counts are computed separately in two arrays according to character type and numeric type. If you want to do it in a single array, you must use a HASH object to store each variable name and its corresponding index in that array. Whenever a missing value is detected, we check whether the variable is already in the HASH object; if not, we add it directly. Otherwise, we look up the table to find the appropriate array index and then increment the counter in the corresponding element of that array. The advantage of this approach is that we only need one array in the later variable-name collection phase. The disadvantage is that the HASH object increases the programming complexity. The complete implementation based on this logic is as follows; it can completely replace the first version of the %FilterCols macro.

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_; /*Detect variable numbers of dataset*/
    dsid=open("&data.",'i');
    if dsid>0 then do;
      nvars= attrn(dsid, 'NVARS'); 
	  call symput("NVARS", nvars); 
    end;
  run;
  data _null_;
    set &data. end=last; 
    /*Define 2 arrays map to numeric/character vars respectively*/
    array _ncols_ _numeric_;
    array _ccols_ _character_;
 
    if _N_=1 then do; /*HASH Object definition and initialize*/
      length key $ 32;
      declare hash h(); 
	  rc=h.definekey("key");
	  rc=h.definedata("data", "key");
	  rc=h.definedone();
	  call missing(key, data); 
    end;  
    /*Define one array for counting of missing value*/
	array _cols_count[&NVARS] _temporary_ (&NVARS*0); /* temporary array elements are automatically retained */
    retain _col_id_ 1; 
 
    do i=1 to dim(_ncols_);	  
      if missing( _ncols_[i] ) then do; 
        /*Find mapped array index and modify counter value*/
	    key=vname( _ncols_[i] );  
	    if h.find()^=0 then do;
	      data=_col_id_;
	      h.add();
	      _col_id_=_col_id_+1;
	    end;
        _cols_count[ data ] = _cols_count[ data ] +1; 
	  end;
    end;
    do i=1 to dim(_ccols_); 
      if missing( _ccols_[i] ) then do;
        /*Find mapped array index and modify counter value*/
	    key=vname( _ccols_[i] );
	    if h.find()^=0 then do;
	      data=_col_id_;
	      h.add();
	      _col_id_=_col_id_+1;
	    end;
        _cols_count[ data ] = _cols_count[ data ] +1; 	    		
	  end;
    end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if last then do;
      declare hiter iter('h');
      rc = iter.first();  
      do while (rc = 0);   	    
        pct=_cols_count[data]/ _N_;
		if pct >= &th. then do; 
	      length _varlist_ $ 32767;
	      retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(key);		  	  
        end; 
        rc = iter.next(); 
      end;	
	  call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
%mend;

PROC FREQ Implementation

In the two macros above, the counting of missing values for each column is implemented through user programming in the DATA step, but SAS provides a powerful frequency procedure, PROC FREQ. Can we use its output directly for variable filtering? The answer is YES. For example, the following SAS code lists the Frequency, Percent, Cumulative Frequency, and Cumulative Percent of each variable's values (including missing values) in the data set.

 
  proc freq data=class;
    table Sex Height Weight /missing;  /*See Figure 2 for output */
  run;

Figure 2: PROC FREQ output

To get the result of PROC FREQ into a dataset, we just need to add an ODS TABLE statement before running PROC FREQ:

 
ods table onewayfreqs=class_freqdata;

Figure 3: OneWayFreqs table class_freqdata

So, the frequency statistics from PROC FREQ are stored in the class_freqdata dataset. We only need to check the value of the variable "percent" for each column. If it meets the specified threshold, the variable name is added to the buffer variable _varlist_; otherwise it is ignored. The final value of _varlist_ is placed in the SAS global macro variable &VARLIST for later use. Compared to the DATA step implementation, the entire code here is much shorter and simpler.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  proc freq data=&data;
    table _all_ /missing;  
  run; 
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);
	if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var); 
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
        retain _varlist_ " "; 
	    _varlist_=trim(_varlist_) || " "|| trim(var);	
		put var= _varlist_;
	  end;
   end;
   if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Please note that PROC FREQ generates output to the ODS HTML destination by default. If you want to avoid generating ODS HTML output while creating the temporary dataset "class_freqdata", you can use the ods html close; statement to temporarily close the ODS HTML destination and then open it again when needed.

Since the output of PROC FREQ already has the frequency information for each variable, we can naturally filter columns based on the percentage of duplicate values as well. Note, however, that a single variable usually contains multiple distinct values, and each distinct value (including missing) has its own frequency percentage. If the percentage of duplicate values is too high in a specific column, the user may want to exclude that column from the subsequent analysis. Since there are multiple frequency percentages per variable, we simply choose the largest one as the duplicate-value percentage of that variable. For example, the Sex column of the demo dataset CLASS has a missing percentage of 10.53%, female at 42.11%, and male at 47.37%, so the percentage of duplicate values in the Sex column is 47.37%.
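Before wrapping this logic in a macro, you can check the duplicate-value percentage for a single column directly from the PROC FREQ output data set. A minimal sketch against the demo CLASS data set:

proc freq data=class noprint;
   table sex / missing out=sex_freq;   /* one row per distinct value, including missing */
run;

proc sql;
   select max(percent) as dup_pct      /* largest share among the non-missing values */
   from sex_freq
   where not missing(sex);
quit;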

The following code filters variables by the percentage of duplicate values against a specified threshold. The logic is almost the same as before, but it examines the non-missing values. Running the same test code against the demo dataset therefore removes the Sex column by the duplicate-value percentage threshold, instead of the Weight column by the missing-value percentage threshold.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ / missing;  
  run; 
  ods html;
  data _null_;
    set &data._freqdata end=last;
    var=scan(Table, 2);
    if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
    value=vvaluex(var);
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do; /*Search max percentage of duplicate values of a variable*/
	    retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;      
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        	      
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
	end;
    if last then call symput( "&VARLIST", trim( left(_varlist_)));  
  run;
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata ;
  quit;
%mend;

In order to combine these two filtering methods (by missing values and by duplicate values) in one macro, we introduce a flag parameter, byduprate. The default value 0 filters variables by missing values, and the value 1 filters variables by duplicate values. The complete code is as below:

 
%macro FilterCols(data=, th=0, varlist=, byduprate=0);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ /missing;
  run;
  ods html;
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);
	if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var);
  %if &byduprate=0 %then %do; /*Filter variables by missing value*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
	    retain _varlist_ " ";
	    _varlist_=trim(_varlist_) || " "|| trim(var);				 
	  end;
   end;
  %end;
  %else %do; /*Filter variables by duplicate value percentage*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do;      
        retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;		
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;              
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
    end;
  %end;
    if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Accordingly, we also introduce a byduprate parameter with the same name and behavior in the %FilterData macro that invokes %FilterCols. The %FilterData macro has the same default value of 0, and it passes the user-specified byduprate argument through to %FilterCols. The enhanced version is as below:

 
%macro FilterData(data=, th=, out=, byduprate=0);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars, byduprate=&byduprate); 
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

The final version of the %FilterData macro unifies the two kinds of variable-filtering logic; the behavior depends on the value of the byduprate argument. If byduprate is not specified or is 0, variables are filtered by the percentage of missing values; otherwise, they are filtered by the percentage of duplicate values. Of the following two code snippets, the former drops the Weight column, while the latter drops the Sex column.

 
/*Drop columns with missing value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class2);              
proc print data=class2;run;
 
/*Drop columns with duplicate value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class3, byduprate=1); 
proc print data=class3;run;

Figure 4: Drop column by missing and duplicate value percentage

Summary

This article discusses how to use SAS to filter variables in a dataset based on the percentage of missing values or duplicate values. The missing-value statistics can be implemented either with your own DATA step programming or by reusing the powerful PROC FREQ, whose results cover both missing and duplicate values. The five macros in this article also demonstrate how to wrap existing SAS analytics into a reusable, general-purpose SAS macro for a larger project. The final version of the %FilterData macro unifies filtering variables by the percentage of either missing values or duplicate values in one macro, making it convenient to call and easy to reuse. To learn more, please check out these resources:

How to filter variables by percentage of missing values or duplicate values in SAS was published on SAS Users.

December 17, 2019
 

The next time you pick up a book, you might want to pause and think about the work that has gone into producing it – and not just from the authors!

The authors of the SAS classic, The Little SAS Book, Sixth Edition, did just that. The Acknowledgement section in the front of a book is usually short – just a few lines to thank family and significant others. In their sixth edition, authors Lora D. Delwiche and Susan J. Slaughter, took it to the next level and produced The Little SAS Book Family Tree. The authors explained:

“Over the years, many people have helped to make this book a reality. We are grateful to everyone who has contributed both to this edition and to editions in the past. It takes a family to produce a book, including reviewers, copyeditors, designers, publishing specialists, marketing specialists, and of course our editors.”

So what happens after you sign a book contract?

First you will be assigned a development editor (DE) who will answer questions and be with you every step of the way – from writing your sample chapter to publication. Your DE will discuss schedules and milestones as well as give you an authoring template, style guidelines, and any software you need to write your book.

Once you have all the resources you need, you'll write your sample chapter. This will help your DE evaluate your writing style, quality of graphics and output, structure, and any potential production issues.

The next step is submitting a draft manuscript for technical review. You'll get feedback from internal and external subject-matter experts, and then you can revise the manuscript based on this feedback. When you and your editor are satisfied with the technical content, your DE will perform a substantive edit on your manuscript, taking particular care with the structure and flow of your writing.

Once in production, your manuscript will be copy edited, and a production specialist will lay out the pages and make sure everything looks good. A graphic designer will work with you to create a cover that encompasses both our branding and your suggestions. Your book will be published in print and e-book versions and made ready for sale.

Finally, after the book is published, a marketing specialist will start promoting your book through our social media channels and other campaigns.

So the next time you pick up a book, spare a thought for the many people who have worked to make it a reality!

For more information about publishing with SAS or for our full catalogue, visit our online bookstore.

How many people does it take to publish a book? was published on SAS Users.

December 12, 2019
 

Parts 1 and 2 of this blog post discussed exploring and preparing your data using SASPy. To recap, Part 1 discussed how to explore data using the SASPy interface with Python. Part 2 continued with an explanation of how to prepare your data to use it with a machine-learning model. This final installment continues the discussion about preparation by explaining techniques for normalizing your numerical data and one-hot encoding categorical variables.

Normalizing your data

Some of the numerical features in the examples from parts 1 and 2 have different ranges. The variable Age spans 0-100 and Hours_per_week spans 0-80. These ranges affect the calculation of each feature when you apply them to a supervised learner. To ensure the equal treatment of each feature, you need to scale the numerical features.

The following example uses the SAS STDIZE procedure to scale the numerical features. PROC STDIZE is not a standard procedure available in the SASPy library.  However, the good news is you can add any SAS procedure to SASPy! This feature enables Python users to become a part of the SAS community by contributing to the SASPy library and giving them the chance to use the vast number of powerful SAS procedures. To add PROC STDIZE to SASPy, see the instructions in the blog post Adding SAS procedures to the SASPy interface to Python.

After you add the STDIZE to SASPy, run the following code to scale the numerical features. The resulting values will be between 0 and 1.

# Creating a SASPy stat object
stat = sas.sasstat()
# Use the stdize function that was added to SASPy to scale our features
stat_result = stat.stdize(data=cen_data_logTransform,
                         procopts = 'method=range out=Sasuser.afternorm',
			 var = 'age education_num capital_gain capital_loss hours_per_week')

To use the STDIZE procedure in SASPy, we need to specify the method and the output data set in the statement options. For this we use the procopts option, specifying range as our method and setting the out option to a new SAS data set, afternorm.

After running the STDIZE procedure we assign the new data set into a SAS data object.

norm_data = sas.sasdata('afternorm', libref='SASuser')

Now let's verify that we successfully transformed our numerical features:

norm_data.head(obs=5)

The output looks great! You have normalized the numerical features. So, it's time to tackle the last data-preparation step.

One-Hot Encoding

Now that you have adjusted the numerical features, what do you do with the categorical features? The categories Relationship, Race, Sex, and so on are in string format. Statistical models cannot interpret these values, so you need to transform the values from strings into a numerical representation. The one-hot encoding process provides the transformation you need for your data.

To use one-hot encoding, use the LOGISTIC procedure from the SASPy Stat class. SASPy natively includes the LOGISTIC procedure, so you can go straight to coding. To generate the syntax for the code below, I followed the instructions from Usage Note 23217: Saving the coded design matrix of a model to a data set.

stat_proc_log = stat.logistic(data=norm_data, procopts='outdesign=SASuser.test1 outdesignonly',
            cls = "workclass education_level marital_status occupation relationship race sex native_country / param=glm",
	    model = "age = workclass education_level marital_status occupation relationship race sex native_country / noint")

To view the results from this code, create a SAS data object from the newly created data set, as shown in this example:

one_hot_data = sas.sasdata('test1', libref='SASuser')
display(one_hot_data.head(obs=5))

The output:

Our data was successfully one-hot encoded! For future reference, due to SAS’s analytical power, this step is not strictly required. When you include a categorical feature in a CLASS statement, the procedure automatically generates a design matrix with the one-hot encoded feature. For more information, I recommend reading this post about different ways to create a design matrix in SAS.

Finally

You made it to the end of the journey! I hope everyone who reads these blogs can see the value that SASPy brings to the machine-learning community. Give SASPy a try, and you'll see the power it can bring to your SAS solutions.

Stay curious, keep learning, and (most of all) continue innovating.

Machine Learning with SASPy: Exploring and Preparing your data - Part 3 was published on SAS Users.

December 12, 2019
 

Bringing the power of SAS to your Python scripts can be a game changer. An easy way to do that is by using SASPy, a Python interface to SAS allowing Python developers to use SAS® procedures within Python. However, not all SAS procedures are included in the SASPy library. So, what do you do if you want to use those excluded procedures? Easy! The SASPy library contains functionality enabling you to add SAS procedures to the SASPy library. In this post, I'll explain the process.

The basics for adding procedures are covered in the Contributing new methods section of the SASPy documentation. To further assist you, this post expands upon those steps, providing step-by-step details for adding the STDIZE procedure to SASPy. For a hands-on application of the use case, refer to the blog post Machine Learning with SASPy: Exploring and Preparing your data - Part 3.

This is your chance to contribute to the project! You can follow the steps below as a one-off solution, or you can share your work and incorporate it into the SASPy repository.

Prerequisites

Before you add a procedure to SASPy, you need to perform these prerequisite steps:

  1. Identify the SAS product associated with the procedure you want to add, e.g. SAS/STAT, SAS/ETS, SAS Enterprise Miner, etc.
  2. Locate the SASPy file (for example, sasstat.py, sasets.py, and so on) corresponding to the product from step 1.
  3. Ensure you have a current license for the SAS product in question.

Adding a SAS procedure to SASPy

SASPy utilizes Python Decorators to generate the code for adding SAS procedures. Roughly, the process is:

  1. define the procedure
  2. generate the code to add
  3. add the code to the proper SASPy file
  4. (optional) create a pull request to add the procedure to the SASPy repository

Below we'll walk through each step in detail.

Create a set of valid statements

Start a new Python session with Jupyter and create a list of valid arguments for the chosen procedure. You determine the arguments for the procedure by searching for your procedure in the appropriate SAS documentation. For example, the PROC STDIZE arguments are documented in the SAS/STAT® 15.1 User's Guide, in The STDIZE Procedure section:

The STDIZE procedure


For example, I submitted the following command to create a set of valid arguments for PROC STDIZE:

lset = {'STDIZE', 'BY', 'FREQ', 'LOCATION', 'SCALE', 'VAR', 'WEIGHT'}

Call the doc_convert method

The doc_convert method takes two arguments: the set of valid statements (lset) and the procedure name ('STDIZE'). It returns both the method definition (method_stmt) and the docstring markup (markup_stmt).

import saspy
 
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['method_stmt'])
print(saspy.sasdecorator.procDecorator.doc_convert(lset, 'STDIZE')['markup_stmt'])

The command generates the method call and the docstring markup like the following:

def STDIZE(self, data: ['SASdata', str] = None,
   by: [str, list] = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   Python method to call the STDIZE procedure.
 
   Documentation link:
 
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option available for advanced use. It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object

Update SASPy product file

We'll take the output and add it to the appropriate product file (sasstat.py in this case). When you open this file, be sure to open it with administrative privileges so you can save the changes. Prior to adding the code to the product file, perform the following tasks:

  1. add @procDecorator.proc_decorator({}) before the function definition
  2. add the proper documentation link from the SAS Programming Documentation site
  3. add triple quotes (""") to comment out the second section of code
  4. include any additional details others might find helpful

The following output shows the final code to add to the sasstat.py file:

@procDecorator.proc_decorator({})
def STDIZE(self, data: ['SASdata', str] = None,
   by: [str, list] = None,
   location: str = None,
   scale: str = None,
   stdize: str = None,
   var: str = None,
   weight: str = None,
   procopts: str = None,
   stmtpassthrough: str = None,
   **kwargs: dict) -> 'SASresults':
   """
   Python method to call the STDIZE procedure.
 
   Documentation link:
   https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.4&docsetId=statug&docsetTarget=statug_stdize_toc.htm&locale=en
   :param data: SASdata object or string. This parameter is required.
   :parm by: The by variable can be a string or list type.
   :parm freq: The freq variable can only be a string type.
   :parm location: The location variable can only be a string type.
   :parm scale: The scale variable can only be a string type.
   :parm stdize: The stdize variable can be a string type.
   :parm var: The var variable can only be a string type.
   :parm weight: The weight variable can be a string type.
   :parm procopts: The procopts variable is a generic option available for advanced use. It can only be a string type.
   :parm stmtpassthrough: The stmtpassthrough variable is a generic option available for advanced use. It can only be a string type.
   :return: SAS Result Object
   """

Update sasdecorator file with the new method

Alter the sasdecorator.py file by adding stdize in the code on line 29, as shown below.

if proc in ['hplogistic', 'hpreg', 'stdize']:

Important: The update to the sasdecorator file is required only when you add a procedure that has no plot options. The sasstat.py library assumes all procedures produce plots, but PROC STDIZE does not include them. So, you should perform this step ONLY when your procedure does not include plot options. This will more than likely change in a future release, so please follow the GitHub page for any updates.

Document a test for your function

Make sure you write at least one test for the procedure. Then, add the test to the appropriate testing file.

Finally

Congratulations! All done. You now have the knowledge to add even more procedures in the future.

After you add your procedure, I highly recommend you contribute your procedure to the SASPy GitHub library. To contribute, follow the outlined instructions on the Contributing Rules GitHub page.

Adding SAS® procedures to the SASPy interface to Python was published on SAS Users.

December 10, 2019
 
The DATA step has been around for many years, and regardless of how many new SAS® products and solutions are released, it remains a popular way to create and manipulate SAS data sets. The SAS language contains a wide variety of approaches that provide an endless number of ways to accomplish a goal. Whether you are reshaping a data set entirely or simply assigning values to a new variable, there are numerous tips and tricks that you can use to save time and keystrokes. Here are a few that Technical Support offers to customers on a regular basis.

Writing file contents to the SAS® log

Perhaps you are reading a file and seeing “invalid data” messages in the log or you are not sure which delimiter is used between variables. You can use the null INPUT statement to read a small sample of the data. Specify the amount of data that you want to read by adjusting options in the INFILE statement. This example reads one record with 500 bytes:

   data _null_;
      infile 'path' recfm=f lrecl=500 obs=1;
      input;
      list;
   run;

The LIST statement writes the current record to the log. In addition, it includes a ruled line above the record so that it is easier to see the data for each column of the file. If the data contains at least one unprintable character, you will see three lines of output for each record written.

For example:

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+
 
1   CHAR  this is a test.123.dog.. 24
    ZONE  766726726276770333066600
    NUMR  48930930104534912394F7DA

The line beginning with CHAR describes the character that is represented by the two hexadecimal characters underneath. When you see a period in that line, it might actually be a period, but it often means that the hexadecimal characters do not map to a specific key on the keyboard. Also, column 15 for this data is a tab ('09'x). By looking at this record in the log, you can see that the first variable contains spaces and that the file is tab delimited.
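Knowing that, you can read the file with the tab character as the delimiter. A minimal sketch; the path, variable names, and informats are hypothetical:

   data want;
      infile 'path' dlm='09'x truncover;       /* '09'x is the tab character */
      input textvar :$20. numvar idvar :$10.;  /* hypothetical record layout */
   run;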

Writing hexadecimal output for all input

Since SAS® 9.4M6 (TS1M6), the HEXLISTALL option in the LIST statement enables all lines of input data to be written in hexadecimal format to the SAS log, regardless of whether there are unprintable characters in the data:

   data mylib.new / hexlistall;

Removing “invalid data” messages from the log

When SAS reads data into a data set, it is common to encounter values that are invalid based on the variable type or the informat that is specified for a variable. You should examine the SAS log to assess whether the informat is incorrect or whether some records are actually invalid. After you ensure that the data is being read correctly, you can dismiss any “invalid data” messages by using the double question mark (??) modifier after the variable name in question:

   input date ?? mmddyy10.;

You can also use the question mark modifiers when you convert a character variable to numeric with the INPUT function:

   newdate=input(olddate,?? mmddyy10.);

The above syntax reads all values with the MMDDYY10. informat and then dismisses the notes to the log when some values for the OLDDATE variable are invalid.

Sharing files between UNIX and Microsoft Windows operating systems

The end-of-record markers in text files are different for UNIX and Windows operating systems. In the past, sharing a file between these systems required preprocessing the file to change the markers to the desired type or perhaps to specify a delimiter. Now, the TERMSTR= option in the INFILE statement enables you to specify the end-of-record marker of the incoming file.

If you are working in a UNIX environment and you need to read a file that was created in a Windows environment, use the TERMSTR=CRLF option:

   infile 'file-specification'  termstr=crlf ;

If you are in a Windows environment and you need to read a file that was created in a UNIX environment, use this syntax:

   infile 'file-specification'  termstr=lf ;

Adapting array values from an ARRAY statement

The VNAME function makes it very convenient to use the variable names from an ARRAY statement. You most often use an ARRAY statement to obtain the values of numerous variables. In this example, the SQL procedure makes it easy to store the unique values of the variable Product in the macro variable &VARLIST and the number of unique values in the macro variable &CT (another easy tip). Within the DO loop, you obtain the names of the variables from the array, and those values then become variable names.

   proc sql noprint;
      select distinct(product) into :varlist separated by ' '
      from one;
      select count(distinct product) into :ct
      from one;
   quit;
 
   …more DATA step statements…
   array myvar(&ct) $ &varlist;
      do i=1 to &ct;
         if product=vname(myvar(i)) then do;
            myvar(i)=left(put(contract,8.));
         end;
   …more DATA step statements…

Using a mathematical equation in a DO loop

A typical DO loop has a beginning and an end, both represented by integers. Did you know you can use an equation to process data more easily? Suppose that you want to process every four observations as one unit. You can run code similar to the following:

   …more DATA step statements…
   J+1;
   do i=((j*4)-3) to (j*4);
      set data-set-name point=I;
      …more DATA step statements…
   end;

Using an equation to point to an array element

With small data sets, you might want to put all values for all observations into a single observation. Suppose that you have a data set with four variables and six observations. You can create an array to hold the existing variables and also create an array for the new variables. Here is partial code to illustrate how the equation dynamically points to the correct new variable name in the array:

   array old(4) a b c d;
   array test (24) var1-var24;
   retain var1-var24;
   do i=1 to nobs;
      set w nobs=nobs point=i;
      do j=1 to 4;
      test(((i-1)*4)+j)=old(j);
      end;
   end;

Creating a hexadecimal reference chart

How often have you wanted to know the hexadecimal equivalent for a character or vice versa? Sure, you can look up a reference chart online, but you can also create one with a short program. The BYTE function returns values in your computer’s character set. The PUT statement writes the decimal value, hexadecimal value, and equivalent character to the SAS log:

   data a;
      do i=0 to 255;
      x=byte(i);
      put i=z3. i=hex2. x=;
   end;
   run;

When the resulting character is unprintable, such as a carriage return and line feed (CRLF) character, the X value is blank.

Conclusion

Hopefully, these new tips will be useful in your future programming. If you know some additional tips, please comment them below so that readers of this blog can add even more DATA step tips to their arsenal! If you would like to learn more about DATA step, check out these other blogs:

Old reliable: DATA step tips and tricks was published on SAS Users.