sas programming

9月 162019
 

SAS SPDS is lightning fastJust when you think you’ve seen it all, life can surprise you in a big way, making you wonder what else you've missed.

That is what happened when I recently had a chance to work with the SAS® Scalable Performance Data Server, a product that's been around a while, but I never crossed paths with before. I open an SPDS table of a hundred million records in SAS® Enterprise Guide, and I can scroll it as fast as if it were an Excel “baby” spreadsheet of a hundred rows. That’s how powerful it feels, to say nothing about the lighting speed of the queries.

What is the SAS Scalable Performance Data Server?

Also known as the SAS SPD Server (or SPDS), it's a data storage system designed for high-performance data delivery. Its primary purpose is to provide rapid queries of vast amounts of data. We are talking terabytes of data with tables containing billions of rows. SPDS employs parallel storage and efficient indexing, coupled with a multi-threaded server system concurrently processing tasks distributed across multiple processors.

Availability of the SPDS client in SAS® Viya effectively integrates SAS SPDS with SAS Viya, extending functionality of its applications beyond the native Cloud Analytic Services (CAS) where you can continue reaping all the benefits of the SAS SPDS.

SPDS library

In addition to connecting to SPD Server with explicit SQL pass-through, connection to SPD Server with a LIBNAME statement is available as well, for example:

libname mylibref sasspds 'serverdomain' host='nodename_or_ip' service='5400'
                         user='mySPDuserid' password='{SAS003}XXXXXXX...XXX';

This effectively creates an SPDS library, and the tables in that library can be referenced by two-level name mylibref.tablename as if this were a SAS BASE library.

Cluster tables vs. member tables

Besides ordinary data tables, SPDS library offers so called dynamic cluster tables – or clusters for short – enabling transparent access to large amounts of data.

Dynamic cluster tables (cluster tables or clusters) are virtual tables that allow users to access many server tables (member tables) as if they were one table. A dynamic cluster table is a collection of SPD Server tables that are presented to the end-user application as a single table through a metadata layer acting like a view.

Member tables can be added to the cluster as well as replaced and removed from the cluster.

The role of PROC SPDO

PROC SPDO is the SAS procedure for the SPD Server operator interface. It performs a wide range of SPD server, user and table management tasks:

  • create, list, modify, destroy, and undo dynamic cluster tables
  • add, remove, replace, and fix cluster table members
  • add, modify, list, and delete access control lists (ACLs) for server resources
  • define, describe, and remove WHERE constraints on tables for row-level security definition and management
  • issue system commands on server nodes

In addition to PROC SPDO, SPD Server plug-in for SAS® Data Management Console is also available.

Retrieving SPDS library contents

If you open an SPDS library in SAS Enterprise Guide, you won’t be able to tell which table in that library is a member table and which is a cluster table – they all look the same. But in many cases, we need to know what is what. Moreover, for data-driven processing we need to capture the SPDS library objects into a dataset and identify them whether they are clusters or member tables.

Luckily, PROC CONTENTS with OUT= option allows us to do just that. While MEMTYPE column is equal to ‘DATA’ for both, clusters and member tables, there is another, less known column inversely called TYPEMEM that has value of 'DATA' for clusters and blank value ' ' for member tables. The following simple code allows you to retrieve SPDS library objects list into WORK.SPDSTYPES dataset where TABLETYPE column specifies whether it’s a cluster or a member for each library object MEMNAME:

proc contents data=SPDSLIB._all_ out=WORK.ALLOBJECTS (keep=MEMNAME TYPEMEM);
run;
 
proc sort data=WORK.ALLOBJECTS nodupkey;
   by MEMNAME;
run;
 
data WORK.SPDSTYPES;
   set WORK.ALLOBJECT;
   attrib TABLETYPE $7 label='SPDS table type';
   select(TYPEMEM);
      when('DATA') TABLETYPE = 'CLUSTER';
      when('')     TABLETYPE = 'MEMBER';
      otherwise    TABLETYPE = '';
   end;
run;

In this code PROC CONTENTS produces one record per column NAME in every object MEMNAME in the SPDS library; PROC SORT reduces (un-duplicates) this list to one record per MEMNAME; finally, data step creates TABLETYPE column indicating which MEMNAME is CLUSTER and which is MEMBER.

Retrieving SPDS cluster’s member list

In addition to retrieving a list of objects in the SPDS library described above, we also need a way of capturing the content (a list of members) of the cluster itself in order to control removing or replacing its members. PROC SPDO’s CLUSTER LIST statement produces such a list, and its OUT= option allows you to dump that list into a dataset:

proc spdo lib=SPDSLIB;
   cluster list CLUSTER1 out=CLUSTER1_MEMBERS;
   cluster list CLUSTER1 out=CLUSTER2_MEMBERS;
   /* ... */
   cluster list CLUSTER1 out=CLUSTERN_MEMBERS;
quit;

This approach creates one output table per cluster, and you can’t use the same OUT= destination table for different clusters, for they will be overwritten with each subsequent CLUSTER LIST statement, not appended.

If you need to capture contents of several clusters into one dataset, then instead of the above method of outputting each cluster content into separate table and then appending (concatenating) them, the good old ODS OUTPUT with CLUSTERLIST= option allows us to do it in a single step:

ods noresults;
ods output clusterlist=WORK.CLUSTER_MEMS;
proc spdo lib=SPDSLIB;
   cluster list CLUSTER1;
   cluster list CLUSTER2;
   /* ... */
   cluster list CLUSTERN;
quit;
ods output close;
ods results;

As additional bonus ODS NORESULTS suppresses printed output when it’s not needed, e.g. for automatic data-driven processing.

Your thoughts?

What is your experience with SAS SPDS? How might you use it in the future? Please comment below.

How to retrieve contents of SAS® Scalable Performance Data Server library was published on SAS Users.

8月 282019
 

This article is not a tutorial on Hadoop, Spark, or big data. At the same time, no prerequisite knowledge of these technologies is required for understanding. We’ll give you enough background prior to diving into the details. In simplest terms, the Hadoop framework maintains the data and Spark controls and directs data processing. As an analogy, think of Hadoop as a train, big data as the payload, and Spark as the crew driving the train and organizing and distributing the goods.

Big data

I recently read that data volumes are doubling each year. Not that long ago we talked in terms of gigabytes. This quickly turned into terabytes and we’re now in the age of petabytes. The type of data is also changing. Data used to fit neatly into rows and columns. Now, nearly eighty percent of data is unstructured. All these trends and facts have led us to deal with massive amounts of data, aka big data. Maintaining and processing big data required creating technical frameworks. Next, we’ll investigate a couple of these tools.

Hadoop

Hadoop is a technology stack utilizing parallel processing on a distributed filesystem. Hadoop is useful to companies when data sets become so large or complex that their current solutions cannot effectively process the information in a reasonable amount of time. As the data science field has matured over the past few years, so has the need for a different approach to processing data.

Apache Spark

Apache Spark is a cluster-computing framework utilizing both iterative algorithms and interactive/exploratory data analysis. The goal of Spark is to keep the benefits of Hadoop’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce. As a result, this brings to the big data world new applications of data science that were previously too expensive or slow on massive data sets.

Now let’s explore how SAS integrates with these technologies to maximize capturing, managing, and analyzing big data.

SAS capabilities to leverage Spark

SAS provides Hadoop data processing and data scoring capabilities using SAS/ACCESS Interface to Hadoop and In-Database Technologies to Hadoop with MapReduce or Spark as the processing framework. This addresses some of the traditional data management batch processing, huge volumes of extract, transform, load (ETL) data as well as faster, interactive and in-memory processing for quicker response with Spark.

In SAS Viya, SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. All users with SAS/ACCESS Interface to Hadoop can use the serial. Likewise, SAS Data Connect Accelerator to Hadoop can load or save data in parallel between Hadoop and SAS using SAS Embedded Process, as a Hive/MapReduce or Spark job.

Connecting to Spark in a Hadoop Cluster

There are two ways to connect to a Hadoop cluster using SAS/ACCESS Interface to Hadoop, based on the SAS environment: LIBNAME and CASLIB statements.

LIBNAME statement to connect to Spark from MVA SAS

options set=SAS_HADOOP_JAR_PATH="/third_party/Hadoop/jars/lib:/third_party/Hadoop/jars/lib/spark"; 
options set=SAS_HADOOP_CONFIG_PATH="/third_party/Hadoop/conf"; 
 
libname hdplib hadoop server="hadoop.server.com" port=10000 user="hive"
schema='default' properties="hive.execution.engine=SPARK";

Parameters

SAS_HADOOP_JAR_PATH Directory path for the Hadoop and Spark JAR files
SAS_HADOOP_CONFIG_PATH Directory path for the Hadoop cluster configuration files
Libref The hdplib libref specifies the location where SAS will find the data
SAS/ACCESS Engine Name HADOOP option to connect Hadoop engine
SERVER Hadoop Hive server to connect
PORT Listening Hive server Port. 10000 is the default, so it is not required. It is included just in case
USER and PASSWORD Are not always required
SCHEMA Hive schema to access. It is optional; by default, it connects to the “default” schema
PROPERTIES Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 
CASLIB statement to connect from CAS

caslib splib sessref=mysession datasource=(srctype="hadoop", dataTransferMode="auto",username="hive", server="hadoop.server.com", 
hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/conf ig/data/hadoop/lib/spark", 
hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf", schema="default"
platform="spark"
dfdebug="EPALL" 
properties="hive.execution.engine=SPARK");

Parameters

CASLIB Space holder for the specified data access. The splib CAS library specifies the Hadoop data source
sessref Holds the CAS library in a specific CAS session. mysession is the current active CAS session
SRCTYPE Type of data source
DATATRANSFERMODE Type of data movement between CAS and Hadoop. Accepts one of three values – serial, parallel, auto. When AUTO is specified, CAS choose the type of data transfer based on available license in the system. If Data Connect Accelerator to Hadoop has been licensed, parallel data transfer will be used, otherwise serial mode of transfer is used
HADOOPJARPATH Hadoop and Spark JAR files location path on the CAS cluster
HADOOPCONFIGDIR Hadoop configuration files location path on the CAS cluster
PLATFORM Type of Hadoop platform to execute the job or transfer data using SAS Embedded Process. Default value is “mapred” for Hive MapReduce. When using “Spark”, data transfer and job executes as a Spark job
DFDEBUG Used to receive additional information back from SAS Embedded Process transfers data in the SAS log
PROPERTIES Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 

Data Access using Spark

SAS Data Connect Accelerator for Hadoop with the Spark platform option uses Hive as the query engine to access Spark data. Data movement happens between Spark and CAS through SAS generated Scala code. This approach is useful when data already exists in Spark and either needs to be used for SAS analytics processing or moved to CAS for massively parallel data and analytics processing.

Loading Data from Hadoop to CAS using Spark

Processing data in CAS offers advanced data preparation, visualization, modeling and model pipelines, and finally model deployment. Model deployment can be performed using available CAS modules or pushed back to Spark if the data is already in Hadoop.

Load data from Hadoop to CAS using Spark

proc casutil 
      incaslib=splib
      outcaslib=casuser;
      load casdata="gas"
      casout="gas"
      replace;
run;

Parameters

PROC CASUTIL Used to process CAS action routines to process data
INCASLIB Input CAS library to read data
OUTCASLIB Output CAS library to write data
CASDATA Table to load to the CAS in-memory server
CASOUT Output CAS table name

 

We can look at the status of the data load job using Hadoop' resource management and job scheduling application, YARN. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Loading Data from Hadoop to Viya CAS using Spark

In the figure above, the YARN application executed the data load as a Spark job. This was possible because the CASLIB statement had Platform= Spark option specified. The data movement direction, in this case Hadoop to CAS uses the Spark job name, “SAS CAS/DC Input,” where “Input” is data loaded into CAS.

Saving Data from CAS to Hadoop using Spark

You can save data back to Hadoop from CAS at many stages of the analytic life cycle. For example, use data in CAS to prepare, blend, visualize, and model. Once the data meets the business use case, data can be saved in parallel to Hadoop using Spark jobs to share with other parts of the organization.

Using the SAVE CAS action to move data to Hadoop using Spark

proc cas;
session mysession; 
      table.save /
      caslib="splib"
      table={caslib="casuser", name="gas"},
      name="gas.sashdat"
      replace=True;
quit;

Parameters

PROC CAS Used to execute CAS actionsets and actions to process data.
“table” is the actionset and “save” is the action
TABLE Location and name of the source table
NAME Name of the target table saved to the Hadoop library using Spark

 

We can verify the status of saving data from CAS to Hadoop using YARN application. Data from CAS saves as a Hadoop table using, Spark as the execution platform. Furthermore, as SAS Data Connect Accelerator for Hadoop transfers data in parallel, individual Spark executors in each of the Spark executor nodes handles data execution for that specific Hadoop cluster node.

Saving Data from Viya CAS to Hadoop using Spark

Finally, the SAVE data executed as a Spark job. As we can see from YARN, the Spark job named “SAS CAS/DC Output” specifies that the data moves from CAS to Hadoop.

Where we are; where we're going

We have so far traveled across the Spark pond to setup SAS libraries for Spark, Load and Save data from and to Hadoop using Spark. In the next section we’ll look at ways to Score data and execute SAS code inside Hadoop using Spark.

Data and Analytics Innovation using SAS & Spark - part 1 was published on SAS Users.

8月 072019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.

8月 052019
 

Binning transforms a continuous numerical variable into a discrete variable with a small number of values. When you bin univariate data, you define cut point that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name such as 'Low,' 'Medium,' and 'High.' This article uses PROC HPBIN to create bins that are assigned numbers. If you bin the data into k groups, the groups have the integer values 1, 2, 3, ..., k. Missing values are put into the zeroth group.

There are two popular ways to choose the cut points. You can use evenly spaced points, or you can use quantiles of the data. If you use evenly spaced cut points (as in a histogram), the number of observations in each bin will usually vary. Using evenly spaced cut points is called the "bucket binning" method. Alternatively, if you use quantiles as cut points (such as tertiles, quartiles, or deciles), the number of observations in each bin tend to be more balanced. This article shows how to use PROC HPBIN in SAS to perform bucket binning and quantile binning. It also shows how to use the CODE statement in PROC HPBIN to create a DATA step that uses the same cut points to bin future observations.

Create sample data

The following statements create sample data from the Sashelp.Heart data. An ID variable is added to the data so that you can identify each observation. A call to PROC MEANS produces descriptive statistics about two variables: Cholesterol and Systolic blood pressure.

data Heart;
format PatientID Z5.;
set Sashelp.Heart(keep=Sex Cholesterol Systolic);
PatientID = 10000 + _N_;
run;
 
proc means data=Heart nolabels N NMISS Min Max Skewness;
   var Cholesterol Systolic;
run;
Desriptive statistics for two variables

The output shows the range of the data for each variable. It also shows that the Cholesterol variable has 152 missing values. If your analysis requires nonmissing observations, you can use PROC HPIMPUTE to replace the missing values. For this article, I will not replace the missing values so that you can see how PROC HPBIN handles missing values.

Each variable has a skewed distribution, as indicated by the values of the skewness statistic. This usually indicates that equal-length binning will result in bins in the tail of the distribution that have only a few observations.

Use PROC HPBIN to bin data into equal-length bins

A histogram divides the range of the data by using k evenly spaced cutpoints. The width of each bin is (Max – Min) / k. PROC HPBIN enables you to create new variables that indicate to which bin each observation belongs. You can use the global NUMBIN= option on the PROC HPBIN statement to set the default number of bins for each variable. You can use the INPUT statement to specify which variables to bin. You can override the default number of bins by using the NUMBIN= option on any INPUT statement.

Suppose that you want to bin the Cholesterol data into five bins and the remaining variables into three bins.

  • The range of the Cholesterol data is [96, 568], so the width of the five bins that contain nonmissing values will be 94.4.
  • The range of the Systolic data is [82, 300], so the width of the three bins will be 72.66.

The following call to PROC HPBIN bins the variables. The output data set, HeartBin, contains the bin numbers for each observation.

/* equally divide the range each variable (bucket binning) */
proc hpbin data=Heart output=HeartBin numbin=3;  /* global NUMBIN= option */
   input Cholesterol / numbin=5;                 /* override global NUMBIN= option */
   input Systolic;                 
   id PatientID Sex;
run;
Cutpoints and frequency of observations for bucket binning using PROC HPBIN in SAS

Part of the output from PROC HPBIN is shown. (You can suppress the output by using the NOPRINT option.) The first table shows that PROC HPBIN used four threads on my PC to compute the results in parallel. The second table summarizes the transformation that bins the data. For each variable, the second column gives the names of the binned variables in the OUTPUT= data set. The third column shows the cutpoints for each bin. The Frequency and Proportion column show the frequency and proportion (respectively) of observations in each bin. As expected for these skewed variables, bins in the tail of each variable contain very few observations (less than 1% of the total).

The OUTPUT= option creates an output data set that contains the indicator variables for the bins. You can use PROC FREQ to enumerate the bin values and (again) count the number of observations in each bin:

proc freq data=HeartBin;
   tables BIN_Cholesterol BIN_Systolic / nocum;
run;
Frequency of observations in each bin for bucket binning using PROC HPBIN in SAS

Notice that the Cholesterol variable was split into six bins even though the syntax specified NUMBIN=5. If a variable contains missing values, a separate bin is created for them. In this case, the zeroth bin contains the 152 missing values for the Cholesterol variable.

Bucket binning divides the range of the variables into equal-width intervals. For long-tailed data, the number of observations in each bin might vary widely, as for these data. The next section shows an alternative binning strategy in which the width of the bins vary and each bin contains approximately the same number of observations.

Use PROC HPBIN to bin data by using quantiles

You can use evenly-spaced quantiles as cutpoints in an attempt to balance the number of observations in the bins. However, if the data are rounded or have duplicate values, the number of observations in each bin can still vary. PROC HPBIN has two ways methods for quantile binning. The slower method (the QUANTILE option) computes cutpoints based on the sample quantiles and then bins the observations. The faster method (the PSEUDO_QUANTILE option) uses approximate quantiles to bin the data. The following call uses the PSEUDO_QUANTILE option to bin the data into approximately equal groups:

/* bin by quantiles of each variable */
proc hpbin data=Heart output=HeartBin numbin=3 pseudo_quantile;
   input Cholesterol / numbin=5;    /* override global NUMBIN= option */
   input Systolic;                  /* use global NUMBIN= option */
   id PatientID Sex;
   code file='C:/Temp/BinCode.sas'; /* generate scoring code */
run;
Cutpoints and frequency of observations in each bin for quantile binning using PROC HPBIN in SAS

The output shows that the number of observations in each bin is more balanced. For the Systolic variable, each bin has between 1,697 and 1,773 observations. For the Cholesterol variable, each bin contains between 975 and 1,056 observations. Although not shown in the table, the BIN_Cholesterol variable also contains a bin for the 152 missing values for the Cholesterol variable.

Use PROC HPBIN to write DATA step code to bin future observations

In the previous section, I used the CODE statement to specify a file that contains SAS DATA step code that can be used to bin future observations. The statements in the BinCode.sas file are shown below:

*****************      BIN_Systolic     ********************;
length BIN_Systolic 8;
if missing(Systolic) then do; BIN_Systolic = 0; end;
else if Systolic < 124.0086 then do; BIN_Systolic =     1; end;
else if 124.0086 <= Systolic < 140.0098 then do; BIN_Systolic =     2; end;
else if 140.0098 <= Systolic then do; BIN_Systolic =     3; end;
 
*****************      BIN_Cholesterol     ********************;
length BIN_Cholesterol 8;
if missing(Cholesterol) then do; BIN_Cholesterol = 0; end;
else if Cholesterol < 190.0224 then do; BIN_Cholesterol =     1; end;
else if 190.0224 <= Cholesterol < 213.0088 then do; BIN_Cholesterol =     2; end;
else if 213.0088 <= Cholesterol < 234.0128 then do; BIN_Cholesterol =     3; end;
else if 234.0128 <= Cholesterol < 263.0408 then do; BIN_Cholesterol =     4; end;
else if 263.0408 <= Cholesterol then do; BIN_Cholesterol =     5; end;

You can see from these statements that the zeroth bin is reserved for missing values. Nonmissing values will be split into bins according to the approximate tertiles (NUMBIN=3) or quintiles (NUMBIN=5) of the training data.

The following statements show how to use the file that was created by the CODE statement. New data is contained in the Patients data set. Simply use the SET statement and the %INCLUDE statement to bin the new data, as follows:

data Patients;
length Sex $6;
input PatientID Sex Systolic Cholesterol;
datalines;
13021 Male    96 . 
13022 Male   148 242 
13023 Female 144 217 
13024 Female 164 376 
13025 Female .   248 
13026 Male   190 238 
13027 Female 158 326 
13028 Female 188 266 
;
 
data MakeBins;
set Patients;
%include 'C:/Temp/BinCode.sas';   /* include the binning statements */
run;
 
/* Note: HPBIN puts missing values into bin 0 */
proc print data=MakeBins;  run;
Binning new observations by using PROC HPBIN in SAS

The input data can contain other variables (PatientID, Sex) that are not binned. However, the data should contain the Systolic and Cholesterol variables because the statements in the BinCode.sas file refer to those variables.

Summary

In summary, you can use PROC HPBIN in SAS to create a new discrete variable by binning a continuous variable. This transformation is common in machine learning algorithms. Two common binning methods include bucket binning (equal-length bins) and quantile binning (unequal-length bins). Missing values are put into their own bin (the zeroth bin). The CODE statement in PROC HPBIN enables you to write DATA step code that you can use to bin future observations.

The post How to use PROC HPBIN to bin numerical variables appeared first on The DO Loop.

8月 052019
 

Binning transforms a continuous numerical variable into a discrete variable with a small number of values. When you bin univariate data, you define cut point that define discrete groups. I've previously shown how to use PROC FORMAT in SAS to bin numerical variables and give each group a meaningful name such as 'Low,' 'Medium,' and 'High.' This article uses PROC HPBIN to create bins that are assigned numbers. If you bin the data into k groups, the groups have the integer values 1, 2, 3, ..., k. Missing values are put into the zeroth group.

There are two popular ways to choose the cut points. You can use evenly spaced points, or you can use quantiles of the data. If you use evenly spaced cut points (as in a histogram), the number of observations in each bin will usually vary. Using evenly spaced cut points is called the "bucket binning" method. Alternatively, if you use quantiles as cut points (such as tertiles, quartiles, or deciles), the number of observations in each bin tend to be more balanced. This article shows how to use PROC HPBIN in SAS to perform bucket binning and quantile binning. It also shows how to use the CODE statement in PROC HPBIN to create a DATA step that uses the same cut points to bin future observations.

Create sample data

The following statements create sample data from the Sashelp.Heart data. An ID variable is added to the data so that you can identify each observation. A call to PROC MEANS produces descriptive statistics about two variables: Cholesterol and Systolic blood pressure.

data Heart;
format PatientID Z5.;
set Sashelp.Heart(keep=Sex Cholesterol Systolic);
PatientID = 10000 + _N_;
run;
 
proc means data=Heart nolabels N NMISS Min Max Skewness;
   var Cholesterol Systolic;
run;
Desriptive statistics for two variables

The output shows the range of the data for each variable. It also shows that the Cholesterol variable has 152 missing values. If your analysis requires nonmissing observations, you can use PROC HPIMPUTE to replace the missing values. For this article, I will not replace the missing values so that you can see how PROC HPBIN handles missing values.

Each variable has a skewed distribution, as indicated by the values of the skewness statistic. This usually indicates that equal-length binning will result in bins in the tail of the distribution that have only a few observations.

Use PROC HPBIN to bin data into equal-length bins

A histogram divides the range of the data by using k evenly spaced cutpoints. The width of each bin is (Max – Min) / k. PROC HPBIN enables you to create new variables that indicate to which bin each observation belongs. You can use the global NUMBIN= option on the PROC HPBIN statement to set the default number of bins for each variable. You can use the INPUT statement to specify which variables to bin. You can override the default number of bins by using the NUMBIN= option on any INPUT statement.

Suppose that you want to bin the Cholesterol data into five bins and the remaining variables into three bins.

  • The range of the Cholesterol data is [96, 568], so the width of the five bins that contain nonmissing values will be 94.4.
  • The range of the Systolic data is [82, 300], so the width of the three bins will be 72.66.

The following call to PROC HPBIN bins the variables. The output data set, HeartBin, contains the bin numbers for each observation.

/* equally divide the range each variable (bucket binning) */
proc hpbin data=Heart output=HeartBin numbin=3;  /* global NUMBIN= option */
   input Cholesterol / numbin=5;                 /* override global NUMBIN= option */
   input Systolic;                 
   id PatientID Sex;
run;
Cutpoints and frequency of observations for bucket binning using PROC HPBIN in SAS

Part of the output from PROC HPBIN is shown. (You can suppress the output by using the NOPRINT option.) The first table shows that PROC HPBIN used four threads on my PC to compute the results in parallel. The second table summarizes the transformation that bins the data. For each variable, the second column gives the names of the binned variables in the OUTPUT= data set. The third column shows the cutpoints for each bin. The Frequency and Proportion column show the frequency and proportion (respectively) of observations in each bin. As expected for these skewed variables, bins in the tail of each variable contain very few observations (less than 1% of the total).

The OUTPUT= option creates an output data set that contains the indicator variables for the bins. You can use PROC FREQ to enumerate the bin values and (again) count the number of observations in each bin:

proc freq data=HeartBin;
   tables BIN_Cholesterol BIN_Systolic / nocum;
run;
Frequency of observations in each bin for bucket binning using PROC HPBIN in SAS

Notice that the Cholesterol variable was split into six bins even though the syntax specified NUMBIN=5. If a variable contains missing values, a separate bin is created for them. In this case, the zeroth bin contains the 152 missing values for the Cholesterol variable.

Bucket binning divides the range of the variables into equal-width intervals. For long-tailed data, the number of observations in each bin might vary widely, as for these data. The next section shows an alternative binning strategy in which the width of the bins vary and each bin contains approximately the same number of observations.

Use PROC HPBIN to bin data by using quantiles

You can use evenly-spaced quantiles as cutpoints in an attempt to balance the number of observations in the bins. However, if the data are rounded or have duplicate values, the number of observations in each bin can still vary. PROC HPBIN has two ways methods for quantile binning. The slower method (the QUANTILE option) computes cutpoints based on the sample quantiles and then bins the observations. The faster method (the PSEUDO_QUANTILE option) uses approximate quantiles to bin the data. The following call uses the PSEUDO_QUANTILE option to bin the data into approximately equal groups:

/* bin by quantiles of each variable */
proc hpbin data=Heart output=HeartBin numbin=3 pseudo_quantile;
   input Cholesterol / numbin=5;    /* override global NUMBIN= option */
   input Systolic;                  /* use global NUMBIN= option */
   id PatientID Sex;
   code file='C:/Temp/BinCode.sas'; /* generate scoring code */
run;
Cutpoints and frequency of observations in each bin for quantile binning using PROC HPBIN in SAS

The output shows that the number of observations in each bin is more balanced. For the Systolic variable, each bin has between 1,697 and 1,773 observations. For the Cholesterol variable, each bin contains between 975 and 1,056 observations. Although not shown in the table, the BIN_Cholesterol variable also contains a bin for the 152 missing values for the Cholesterol variable.

Use PROC HPBIN to write DATA step code to bin future observations

In the previous section, I used the CODE statement to specify a file that contains SAS DATA step code that can be used to bin future observations. The statements in the BinCode.sas file are shown below:

*****************      BIN_Systolic     ********************;
length BIN_Systolic 8;
if missing(Systolic) then do; BIN_Systolic = 0; end;
else if Systolic < 124.0086 then do; BIN_Systolic =     1; end;
else if 124.0086 <= Systolic < 140.0098 then do; BIN_Systolic =     2; end;
else if 140.0098 <= Systolic then do; BIN_Systolic =     3; end;
 
*****************      BIN_Cholesterol     ********************;
length BIN_Cholesterol 8;
if missing(Cholesterol) then do; BIN_Cholesterol = 0; end;
else if Cholesterol < 190.0224 then do; BIN_Cholesterol =     1; end;
else if 190.0224 <= Cholesterol < 213.0088 then do; BIN_Cholesterol =     2; end;
else if 213.0088 <= Cholesterol < 234.0128 then do; BIN_Cholesterol =     3; end;
else if 234.0128 <= Cholesterol < 263.0408 then do; BIN_Cholesterol =     4; end;
else if 263.0408 <= Cholesterol then do; BIN_Cholesterol =     5; end;

You can see from these statements that the zeroth bin is reserved for missing values. Nonmissing values will be split into bins according to the approximate tertiles (NUMBIN=3) or quintiles (NUMBIN=5) of the training data.

The following statements show how to use the file that was created by the CODE statement. New data is contained in the Patients data set. Simply use the SET statement and the %INCLUDE statement to bin the new data, as follows:

data Patients;
length Sex $6;
input PatientID Sex Systolic Cholesterol;
datalines;
13021 Male    96 . 
13022 Male   148 242 
13023 Female 144 217 
13024 Female 164 376 
13025 Female .   248 
13026 Male   190 238 
13027 Female 158 326 
13028 Female 188 266 
;
 
data MakeBins;
set Patients;
%include 'C:/Temp/BinCode.sas';   /* include the binning statements */
run;
 
/* Note: HPBIN puts missing values into bin 0 */
proc print data=MakeBins;  run;
Binning new observations by using PROC HPBIN in SAS

The input data can contain other variables (PatientID, Sex) that are not binned. However, the data should contain the Systolic and Cholesterol variables because the statements in the BinCode.sas file refer to those variables.

Summary

In summary, you can use PROC HPBIN in SAS to create a new discrete variable by binning a continuous variable. This transformation is common in machine learning algorithms. Two common binning methods include bucket binning (equal-length bins) and quantile binning (unequal-length bins). Missing values are put into their own bin (the zeroth bin). The CODE statement in PROC HPBIN enables you to write DATA step code that you can use to bin future observations.

The post How to use PROC HPBIN to bin numerical variables appeared first on The DO Loop.

7月 172019
 

Problem solving: thinking inside the box

Have you ever tried to pass comma-delimited values to SAS macro or to a SAS macro function? How can SAS distinguish commas separating parameters or arguments from commas separating parts of the values?

Passing comma-delimited value as an argument to a SAS macro function

Let’s say you want to extract the first word from the following string of characters (these words represent column names in the SASHELP.CARS data table):

make, model, type, origin

If you run the following code:

%let firstvar = %scan(make, model, type, origin, 1);

you get is the following ERROR in your SAS log:

ERROR: Macro function %SCAN has too many arguments.

That is because %scan macro function sees and treats those make, model, type and origin as arguments since commas between them are interpreted as argument separators.

Even if you “hide” your comma-delimited value within a macro variable, it still won’t do any good since the macro variable gets resolved during macro compilation before being passed on to a macro or macro function for execution.

%let mylist = make, model, type, origin;
%let firstvar = %scan(&mylist, 1);

You will still get the same ERROR:

ERROR: Macro function %SCAN has too many arguments.

Passing comma-delimited value as a parameter to a SAS macro

Try submitting the following code that passes your macro variable value to a SAS macro as a parameter:

%let mylist = make, model, type, origin;
%macro subset(dsname=, varlist=);
   proc sql;
      select &varlist
      from &dsname;
   quit;
%mend subset;
%subset(dsname=SASHELP.CARS, varlist=&mylist)

You will get another version of the SAS log ERROR:

ERROR: All positional parameters must precede keyword parameters.
NOTE: Line generated by the macro variable "MYLIST".
1                 type, origin
                  ----
                  180
ERROR 180-322: Statement is not valid or it is used out of proper order.

In this case, macro %subset gets as confused as the %scan function above because your macro variable will get resolved during macro compilation, and SAS macro processor will see the macro invocation as:

%subset(dsname=SASHELP.CARS, varlist=make, model, type, origin)

treating each comma as a parameter separator.

All this confusion happens because SAS functions’ arguments and SAS macros’ parameters use commas as their separators, while resolved macro variables introduce their own values’ comma delimiters into the functions/macros constructs’ picture, thus wreaking havoc on your SAS program.

It’s time for a vacation

But don’t panic! To fight that chaos, you need to take a vacation. Not a stay-home, do-nothing vacation, but some serious vacation, with faraway destination and travel arrangements. While real vacation is preferable, an imaginary one would do it too. I mean to start fighting the mess with comma-separated values, pick your destination, book your hotel and flight, and start packing your stuff.

Do you have a “vacation items list”? In my family, we have an individual vacation list for every family member. How many items do you usually take with you? Ten, twenty, a hundred?

Regardless, you don’t show up at the airport checkpoint with a pile of your vacation items. That would’ve been too messy. I don’t think you would be even allowed boarding with an unpacked heap of your stuff. You come to an airport neatly rolling a single item that is called a suitcase. Well, I suppose that some of you may have two of them, but I can’t imagine more than that.

You only started your fantasy vacation, you haven’t even checked in to your flight, but you have already have a solution in your sight, a perfect combine-and-conquer solution for passing comma-delimited values. Even if you have not yet realized that it’s in your plain view.

Thinking inside the box

Forget about “thinking outside the box” metaphor. You can’t solve all your problems with a single strategy. Sometimes, you need to turn your thinking on its head to solve, or even to see the problem.

As for your airport check-in, instead of thinking outside the box, you thought “inside the box” and brought your many items “boxed” as a single item – a suitcase. A container, in a broader sense.

That is exactly how we are going to approach our comma-delimited lists problem. We are going to check them in to a macro or a macro function as a single, boxed item. Just like this:
Passing a comma-separated value to SAS macro or SAS macro function
Or like this:
passing SAS macro variable with comma-separated value to SAS macro or SAS macro function

Not surprisingly, SAS macro language provides a variety of these wonder boxes for many special occasions collectively known as macro quoting functions. Personally, I would prefer calling them “macro masking functions,” as they have nothing to do with “quoting” per se and have everything to do with masking various characters during macro compilation or macro processing. But that is what “macro quoting” means – masking, boxing, - similar to “quoting” a character string to make it a single entity.

Different macro quoting functions mask different special characters (+ - , / ; = etc.) and mnemonics (AND OR GT EQ etc.) so that the macro facility interprets them as text instead of as language symbols.

Here are all 7 SAS macro quoting functions, two of which work at macro compilation - %STR() and %NRSTR(), while other 5 work at macro execution - %QUOTE() and %NRQUOTE(), %BQUOTE() and %NRBQUOTE(), and %SUPERQ().

You may look up what symbols they mask and the timing they apply (macro compilation vs. macro execution) in this macro quoting functions summary. You may also want to look at the following cheat sheet: Deciding When to Use a Macro Quoting Function and Which Function to Use.

As general rule of thumb, use macro quoting functions at compilation time when you mask text constants - (make, model, type, origin); use macro quoting functions at execution time when you mask macro or macro variable references containing & or % - (&mylist).

NOTE: There are many other SAS macro functions that besides their main role also perform macro quoting, e.g. %QSCAN(), %QSUBSTR() and others; they all start with %Q.

Masking commas within a comma-delimited value passed as an argument or a parameter

It turns out that to mask (or to “box”) comma-separated values in a macro function or a SAS macro, any macro quoting function will work. In this case I would suggest using the simplest (and shortest) %STR(). %STR() applies during macro compilation and serves as a perfect “box” for our comma-delimited values to hide (mask) commas to receiving macro function or a macro does not confuse them with its own commas separating arguments / parameters.

With it we can re-write our above examples as:

%let firstvar = %scan(%str(make, model, type, origin), 1);
%put &=firstvar;

SAS log will produce exactly what we expected:

FIRSTVAR=make

Similarly, we can call the above SAS macro as:

%subset(dsname=SASHELP.CARS, varlist=%str(make, model, type, origin) )

It will run without ERRORs and produce a print of the SASHELP.CARS data table with 4 columns specified by the varlist parameter value:

SAS output table as a result of macro run

Masking commas within a macro variable value passed as an argument or parameter

When you assign a comma-delimited list as a value to a macro variable, we want to mask commas within the resolved value during execution. Any of the execution time macro quoting functions will mask comma.

Again, in case of multiple possibilities I would use the shortest one - %QUOTE().

With it we can re-write our above examples as:

%let mylist = make, model, type, origin;
 
%let firstvar = %scan(%quote(&mylist), 1);
 
%subset(dsname=SASHELP.CARS, varlist=%quote(&mylist))

But just keep in mind that the remaining 4 execution time macro quoting functions - %NRQUOTE(), %BQUOTE(), %NRBQUOTE() and %SUPERQ() - will work too.

NOTE: The syntax of the %SUPERQ() function is quite different from the rest of the pack. The %SUPERQ() macro function takes as its argument either a macro variable name without an ampersand or a macro text expression that yields a macro variable name without an ampersand.

Get it going

I realize that macro quoting is not a trivial matter. That is why I attempted to explain its concept on a very simple yet powerful use case. Hope you will expand on this to empower your SAS coding skills.

Passing comma-delimited values into SAS macros and macro functions was published on SAS Users.

7月 102019
 

Posted on behalf of SAS Press author Derek Morgan.


I was sitting in a model railroad club meeting when one of our more enthusiastic young members said, "Wouldn't it be cool if we could make a computer simulation, with trains going between stations and all. We could have cars and engines assigned to each train and timetables and…"

So, I thought to myself, “Timetables… I bet SAS can do that easily… sounds like something fun for Mr. Dates and Times."

As it turns out, the only easy part of creating a timetable is calculating the time. SAS handles the concept of elapsed time smoothly. It’s still addition and subtraction, which is the basis of how dates and times work in SAS. If a train starts at 6:00 PM (64,800 seconds of the day) and arrives at its destination 12 hours (43,200 seconds) later, it arrives at 6:00 AM the next day. The math is start time+duration=end time (108,000 seconds,) which is 6:00 AM the next day. It doesn’t matter which day, that train is always scheduled to arrive at 6:00 AM, 12 hours from when it left.

It got a lot more complicated when the group grabbed onto the idea. One of the things they wanted to do was to add or delete trains and adjust the timing so multiple trains don’t run over the same track at the same time. This wouldn’t be that difficult in SAS; just create an interactive application, but… I’m the only one who has SAS. So how do I communicate my SAS database with the “outside world”? The answer was Microsoft Excel, and this is where it gets thorny.

It’s easy enough to send SAS data to Excel using ODS EXCEL and PROC REPORT, but how could I get Excel to allow the club to manipulate the data I sent?
I used the COMPUTE block in PROC REPORT to display a formula for every visible train column. I duplicated the original columns (with corresponding names to keep it all straight) and hid them in the same spreadsheet. The EXCEL formula code is in line 8.

Compute Block Code:

I also added three rows to the dataset at the top. The first contains the original start time for each train, the second contains an offset, which is always zero in the beginning, while the third row was blank (and formatted with a black background) to separate it from the actual schedule.


Figure 1: Schedule Adjustment File

The users can change the offset to change the starting time of a train (Column C, Figure 2.) The formula in the visible columns adds the offset to the value in each cell of the corresponding hidden column (as long as it isn’t blank.) You can’t simply add the offset to the value of the visible cell, because that would be a circular reference.

The next problem was moving a train to an earlier starting time, because Excel has no concept of negative time (or date—a date prior to the Excel reference date of January 1, 1900 will be a character value in Excel and cause your entire column to be imported into SAS as character data.) Similarly, you can’t enter -1:00 as an offset to move the starting time of our 5:35 AM train to 4:35 AM. Excel will translate “-1:00” as a character value and that will cause a calculation error in Excel. In order to move that train to 4:35 AM, you have to add 23 hours to the original starting time (Column D, Figure 2.)


Figure 2: Adjusting Train Schedules

After the users adjust the schedules, it’s time to return our Excel data to SAS, which creates more challenges. In the screenshot above, T534_LOC is the identifier of a specific train, and the timetable is kept in SAS time values. Unfortunately, PROC IMPORT using DBMS=XLSX brings the train columns into SAS as character data. T534_LOC also imports as the actual Excel value, time as a fraction of a day.


Figure 3: How the Schedule Adjustment File Imports to SAS

While I can fix that by converting the character data to numeric and multiplying by 86,400, I still need the original column name of T534_LOC for the simulation, so I would have to rename each character column and output the converted data to the original column name. There are currently 146 trains spread across 12 files, and that is a lot of work for something that was supposed to be easy! Needless to say, this “little” side project, like most model railroads, is still in progress. However, this exercise in moving time data between Microsoft Excel and SAS gave me even more appreciation for the way SAS handles date and time data.

Figure 4 is a partial sample of the finished timetable file, generated as an RTF file using SAS. The data for trains 534 and 536 are from the spreadsheet in Figure 1.


Figure 4: Partial Sample Timetable

Want to learn more about how to use and manipulate dates, times, and datetimes in SAS? You'll find the answers to these questions and much more in my book The Essential Guide to SAS Dates and Times, Second Edition. Updated for SAS 9.4, with additional functions, formats, and capabilities, the Second Edition has a new chapter dedicated to the ISO 8601 standard and the formats and functions that are new to SAS, including how SAS works with Universal Coordinated Time (UTC). Chapter 1 is available as a free preview here.

For updates on new SAS Press books and great discounts subscribe to the SAS Press New Book Newsletter.

SAS Press Author Derek Morgan on Timetables and Model Trains was published on SAS Users.

7月 032019
 

An informat helps you read data into a SAS data set. SAS supports more than 100 informats. The most common informats are related to dates and times and make it easy to read an input string such as 28JAN2001 and convert it to a SAS date such as 15003. Yet no matter how many built-in informats SAS provides, programmers will inevitably encounter data that are difficult to read. That is why PROC FORMAT in SAS enables you to create a custom informat. A custom informat enables you to process input strings in whatever way is most convenient for your application. This article shows how to create a custom informat in SAS in three easy steps:

  1. Write DATA step code that parses an input string and converts it into a form that is more convenient to work with.
  2. Use PROC FCMP to define a function that implements the same string-to-value conversion. Often you can cut and paste from your DATA step program into PROC FCMP.
  3. Use the INVALUE statement in PROC FORMAT to define a custom informat.

Technically, you can skip Step 1, but I find it much easier to develop and debug a program in the DATA step because I can use the PUT statement or PROC PRINT to look at intermediate results.

A motivating example: Convert durations into seconds

While writing a recent article about how to visualize race times, I researched ways to read times into SAS. For time-of-day data, you can use the TIMEw. or ANYDTTME. informats to read a string in the form HH:MM:SS and convert it to a SAS time value. For example, the time "1:23:45" (AM) is converted to the number of seconds since midnight, which is 5025.

Race times are slightly different because they represents durations of time. Short races such as a 5K are typically recorded in minutes, and sprints are recorded in seconds. Both times contain fractions of seconds, so the data is typically represented as SS.fff or MM:SS.ff. You can use the HHMMSS. informat to read these data and store them as seconds. The following DATA step reads a sprint time (12.345 s) and a 5K time (18:33.24) into a SAS time variable. You can use the HOUR, MINUTE, and SECOND functions to make sure the data are read correctly:

/* You can use the HHMMSS informat to read durations into SAS */
data RaceTimes;
input Time HHMMSS.;               /* read duration = length of time */
format Time TIME10.3;             /* there is no HHMMSS. format, so use TIME. format */
rawTime = Time;
hours = hour(Time);               /* check the results by print hour, min, and sec */
minutes = minute(Time);
seconds = second(Time);
datalines;
12.345
18:33.24 
19:23
;
 
proc print noobs; run;
Use HHMMSS informat to read race times into SAS

The only problem with the HHMMSS. informat is that it reads the string 19:23 as "19 hours, 23 minutes," whereas I prefer "19 minutes, 23 seconds." (It's an ambiguous case. For clarity, the string could be entered as 0:19:23 or 19.23.0.) If a SAS informat does not interpret strings the way you prefer, you can write your own SAS informat. The main challenge is writing DATA step code that parses strings such as "12.345" and "18:33.24" and computes the number of seconds that each string represents. The next section presents a DATA step program that converts these strings to seconds. You can use the program to define a new informat that parses race-time data.

Step 1: Write a DATA step that converts a string into a number

This article is about how to create an informat, not about how to parse strings in SAS, so I'll let the comments in the program speak for themselves. Basically, the following program reads from the right side of a string until it encounters a colon (or the beginning of the string). It extracts the string following the colon and converts it to a number, which is the number of seconds. It repeats the process to find the number of minutes (if any) and the number of hours (if any). It then converts the sum of the hours, minutes, and seconds to seconds and stores it in a variable named Time.

/* Parse strings. Obtain time, in seconds, for durations written in the following forms:
          SS.fff
       MM:SS.fff
   HHH:MM:SS.fff
*/
data RaceTimes(drop= i j z);
length str $13;                     /* HHH.MM.SS.fff */
input str;                          /* read raw string */
s=0; m=0; h=0;
/* read seconds :SS.fff */
i = find(str, ':', -length(str)+1); /* find colon from right */
z = substr(str, i+1);               /* extract seconds */
s = input(z, best6.);               /* convert to number */
/* read minutes :MM: */
if i > 0 then do;
   j = find(str, ':', -(i-1));      /* find colon from right */
   z = substr(str, j+1, i-j-1);     /* extract minutes */
   m = input(z, best6.);            /* convert to number */
   /* read hours HHH:, if any */
   if j > 0 then do;
      z = substr(str, 1, j-1);      /* extract hours */
      h = input(z, best6.);         /* convert to number */
   end;
end;
Time = 3600*h + 60*m + s;           /* convert to seconds */
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89 
1:12:02.89
100:71:00
;
 
proc print data=RaceTimes noobs; 
   format time 10.3;
run;
Convert race times into seconds with SAS

The DATA step parses each string and converts it to a time, in seconds. The output shows the results. This table is the same as you would get by using the HHMMSS. informat to read the data, except that the string 19:23 is now interpreted as "19 minutes, 23 seconds."

Notice that you can see the intermediate expressions such as s, m, and h. This is useful when you are trying to debug the program.

Step 2: Use PROC FCMP to define a function

You can use PROC FCMP to define a custom function in Base SAS. This example defines a function named Duration, which takes a string as an input argument. Copy and paste the DATA step code and return the Time variable. You can use the OPTIONS CMPLIB= system option to tell SAS where to find the Duration function. You can then call the new function from the DATA step, as follows:

proc fcmp outlib=work.functions.MathFuncs; 
function Duration(str $); 
   s=0; m=0; h=0;
   /* read seconds :SS.fff */
   i = find(str, ':', -length(str)+1); /* find colon from right */
   z = substr(str, i+1);               /* extract seconds */
   s = input(z, best6.);               /* convert to number */
   /* read minutes :MM: */
   if i > 0 then do;
      j = find(str, ':', -(i-1));      /* find colon from right */
      z = substr(str, j+1, i-j-1);     /* extract minutes */
      m = input(z, best6.);            /* convert to number */
      /* read hours, if any */
      if j > 0 then do;
         z = substr(str, 1, j-1);      /* extract hours */
         h = input(z, best6.);         /* convert to number */
      end;
   end;
   Time = 3600*h + 60*m + s;           /* convert to seconds */
   return( Time );
endsub; 
quit; 
 
options cmplib=work.functions; /* where to find Duration() function */
data RaceTimes2;
length str $13;
input str;
/* call Duration function to convert string to a time value */
Time = Duration(str);          
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89 
1:12:02.89
100:71:00 
;
Convert race times seconds with SAS

Step 3: Use PROC FORMAT to define a custom informat

I have written several articles about how to use PROC FORMAT to create your own custom formats in SAS. The process of creating an informat is similar, but because parsing strings is complicated it is often best to write the code into an FCMP function and then define the informat in terms of the function. This is done in the following call to PROC FORMAT, which uses the INVALUE statement to define an informat named DURATION.:

/* define the DURATION. informat */
proc format; 
   invalue duration (default=13) other=[Duration()]; 
run; 
 
/* use new DURATION informat to read duration times into SAS */
data RaceTimes3;
input Time : DURATION.;
datalines;
   11.345
 0:11.345
 2:39.66 
18:33.24 
19:23
49:02.89 
72:02.89
1:12:02.89
100:71:00 
;
 
proc print data=RaceTimes3; 
   format Time 10.3;
run;
Use custom informat to read race times into SAS

Success! The program uses the DURATION. informat to read strings of the form "SS.fff", "MM:SS.fff", and "HHH:MM:SS.fff". It converts the strings to seconds. You can use the statement FORMAT Time TIME10.3 in PROC PRINT if you want to format the seconds as hours, minutes, and seconds.

In summary, you can use the INVALUE statement in PROC FORMAT to create a custom informat in SAS. An informat enables you to read a string or pattern and convert it to a value that is easier to work with. Although SAS supports the HHMMSS. informat for reading strings that represent durations of time, I chose to create my own informat that converts race times to seconds. You can use PROC FCMP to define a function that parses a string and returns a numerical value. You can then use PROC FORMAT to define the informat, which calls the FCMP function.

The post Create a custom informat in SAS appeared first on The DO Loop.

7月 032019
 

One of my favorite parts of summer is a relaxing weekend by the pool. Summer is the time I get to finally catch up on my reading list, which has been building over the year. So, if expanding your knowledge is a goal of yours this summer, SAS Press has a shelf full of new titles for you to explore. To help navigate your selection we asked some of our authors what SAS books were on their reading lists for this summer?

Teresa Jade


Teresa Jade, co-author of SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, has already started The DS2 Procedure: SAS Programming Methods at Work by Peter Eberhardt. Teresa reports that the book “is a concise, well-written book with good examples. If you know a little bit about the SAS DATA step, then you can leverage what you know to more quickly get up to speed with DS2 and understand the differences and benefits.”
 
 
 

Derek Morgan

Derek Morgan, author of The Essential Guide to SAS® Dates and Times, Second Edition, tells us his go-to books this summer are Art Carpenter’s Complete Guide to the SAS® REPORT Procedure and Kirk Lafler's PROC SQL: Beyond the Basics Using SAS®, Third Edition. He also notes that he “learned how to use hash objects from Don Henderson’s Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study.”
 

Chris Holland

Chris Holland co-author of Implementing CDISC Using SAS®: An End-to-End Guide, Revised Second Edition, recommends Richard Zink’s JMP and SAS book, Risk-Based Monitoring and Fraud Detection in Clinical Trials Using JMP® and SAS®, which describes how to improve efficiency while reducing costs in trials with centralized monitoring techniques.
 
 
 
 
 

And our recommendations this summer?

Download our two new free e-books which illustrate the features and capabilities of SAS® Viya®, and SAS® Visual Analytics on SAS® Viya®.

Want to be notified when new books become available? Sign up to receive information about new books delivered right to your inbox.

Summer reading – Book recommendations from SAS Press authors was published on SAS Users.

6月 262019
 

"There's a way to do it better - find it." - Thomas A. Edison

Finding a better SAS code

When it comes to SAS coding, this quote by Thomas A. Edison is my best advisor. Time permitting, I love finding better ways of implementing SAS code.

But what code feature means “better” – brevity, clarity or efficiency? It all depends on the purpose of your code. When code is to illustrate a coding concept or technique, clarity is a paramount. However, when processing large data volumes in near real-time, code efficiency becomes critical, not just a luxury or convenience. And brevity won’t hurt in either case. Ideally, your code should be a combination of all three features - brevity, clarity and efficiency.

Parsing a character string

In this blog post we will solve a problem of parsing a character string to find a position of n-th occurrence of a group of characters (substring) in that string.

The closest out-of-box solution to this problem is SAS’ FIND() function. Except this function searches only for a single/first instance of specified substring of characters within a character string. Close enough, and with some do-looping we can easily construct what we want.

After some internet and soul searching to find the Nth occurrence of a substring within a string, I came up with the following DATA STEP code snippet:

   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;

Here, s is a text string (character variable) to be parsed; x is a character variable holding a group of characters that we are searching for within s; p is a position of x value found within s; n is an instance number.

If there is no n-th instance of x within s found, then the code returns p=0.

In this code, each do-loop iteration searches for x within s starting from position p+1 where p is position found in prior iteration: p = find(s,x,p+1);.

Notice, if there is no prior-to n instance of x within s, the do-loop ends prematurely, based on until(p=0) condition, thus cutting the number of loops to the minimal necessary.

Reverse string search

Since find() function allows for a string search in a reverse direction (from right to left) by making the third augment negative, the above code snippet can be easily modified to do just that: find Nth instance (from right to left) of a group of characters within a string. Here is how you can do that:

   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;

The difference here is that we start from position length(s)+1 instead of 0, and each iteration searches substring x within string s starting from position –(p-1)=-p+1 from right to left.

Testing SAS code

You can run the following SAS code to test and see how these searches work:

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p = 0;
   do i=1 to n until(p=0); 
      p = find(s, x, p+1);
   end;
   put p=;
 
   /* from right to left */
   p = length(s) + 1;
   do i=1 to n until(p=0); 
      p = find(s, x, -p+1);
   end;
   put p=;
run;

FINDNTH() function

We can also combine the above left-to-right and right-to-left searches into a single user-defined SAS function by means of SAS Function Compiler (PROC FCMP) procedure:

proc fcmp outlib=sasuser.functions.findnth;
   function findnth(str $, sub $, n);
      p = ifn(n>=0,0,length(str)+1);
      do i=1 to abs(n) until(p=0);
         p = find(str,sub,sign(n)*p+1);
      end;
      return (p);
   endsub;
run;

We conveniently named it findnth() to match the Tableau FINDNTH(string, substring, occurrence) function that returns the position of the nth occurrence of substring within the specified string, where the occurrence argument defines n.

Except our findnth() function allows for both, positive (for left-to-right searches) as well as negative (for right-to-left searches) third argument while Tableau’s function only allows for left-to-right searches.

Here is an example of the findnth() function usage:

options cmplib=sasuser.functions;
data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* from left to right */
   p=findnth(s,x,n);
   put p=;
 
   /* from right to left */
   p=findnth(s,x,-n);
   put p=;
run;

Using Perl regular expression

As an alternative solution I also implemented SAS code for finding n-th occurrence of a substring within a string using Perl regular expression (regex or prx):

data a;
   s='AB bhdf +BA s Ab fs ABC Nfm AB ';
   x='AB';
   n=3;
 
   /* using regex */
   xid = prxparse('/'||x||'/o');
   p = 0;
   do i=1 to n until(p=0);
      from = p + 1;
      call prxnext(xid, p + 1, length(s), s, p, len);
   end;
   put p=;
run;

However, efficiency benchmarking tests demonstrated that the above solutions using FIND() function or FINDNTH() SAS user-written function run roughly twice faster than this regex solution.

Challenge

Can you come up with an even better solution to the problem of finding Nth instance of a sub-string within a string? Please share your thoughts and solutions with us. Thomas A. Edison would have been proud of you!

Finding n-th instance of a substring within a string was published on SAS Users.