SAS programming

January 11, 2020
 
In SAS 9.3 and earlier, the default value of the YEARCUTOFF= option is 1920. This default setting could trigger data integrity issues because any 2-digit years of "20" in dates will be assumed to occur in 1920 instead of 2020. If the intended year in the date is 2020, you must set the YEARCUTOFF= option to a value larger than 1920. Of course, the best alternative is to always specify date values with 4-digit years.

Luckily, SAS makes it easy to change the YEARCUTOFF= option so that it works best for your data. The default value for the YEARCUTOFF= option changed in SAS 9.4 to 1926. This change makes it easier for customers who are still using 2-digit years of "20" to make sure that the date is assigned to 2020. Let's review some of the frequently asked questions that customers ask about how SAS works with 2-digit years.

What is the YEARCUTOFF= option?

The YEARCUTOFF= option lets you specify which century SAS software should assign to dates with 2-digit years.

How do I specify the YEARCUTOFF= option in my SAS programs?

The option is specified in an OPTIONS statement. Here is an example:

options yearcutoff=1930;

You can also specify the option in an autoexec file or a config file. If you don't specify the YEARCUTOFF= option, the SAS system default is used. Remember that 1920 is the default for SAS 9.3 and earlier releases and 1926 is the default for SAS 9.4. (For reference, SAS 9.3 was released in 2011. The first release of SAS 9.4 was released in 2013.)

How does the YEARCUTOFF= option work?

The YEARCUTOFF= option specifies the first year of a 100-year window in which all 2-digit years are assumed to occur. For example, if the YEARCUTOFF= option is set to 1920, all 2-digit years are assumed to occur between 1920 and 2019. This means that two-digit years from 20 - 99 are assigned a century prefix of "19" and all 2-digit years from 00 - 19 have a century prefix of "20."
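As a quick sketch, the following steps show how the same 2-digit-year date literal resolves under two different YEARCUTOFF= settings:

options yearcutoff=1920;
data _null_;
   d = '01jan20'd;          /* 2-digit year "20" */
   put d date9.;            /* prints 01JAN1920: the window is 1920-2019 */
run;

options yearcutoff=1926;    /* the SAS 9.4 default */
data _null_;
   d = '01jan20'd;
   put d date9.;            /* prints 01JAN2020: the window is 1926-2025 */
run;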

Which types of date values are affected by the YEARCUTOFF= option?

The YEARCUTOFF= option affects the interpretation of 2-digit years in the following cases:

  • Reading date values from external files
  • Specifying dates or year values in SAS functions
  • Specifying SAS date literals

The YEARCUTOFF= option does not influence the following cases:

  • Processing dates with 4-digit years
  • Processing dates already stored as SAS date values (the number of days since January 1, 1960)
  • Displaying dates with SAS date formats

Which value should I set the YEARCUTOFF= option to?

The optimal value depends on the range of dates in your data. The YEARCUTOFF= option should be set so that the 100-year range encompasses the range of your data values. In general, SAS recommends setting the YEARCUTOFF= option to a value equal to or slightly less than the first year in your data. For example, if the range of dates that you are processing is from 1930 - 2010, a YEARCUTOFF value of 1925 or 1930 would be appropriate. If you set YEARCUTOFF=1925, then all 2-digit years are assumed to be in the 100-year period from 1925 to 2024. If all the dates in your data fall within that range, they will be interpreted correctly.

What do I do if my dates with 2-digit years span more than 100 years?

The YEARCUTOFF= option cannot reliably assign centuries to 2-digit years if the range of dates for a variable is greater than 100 years. If the date ranges for a variable span more than 100 years, you must either specify the dates with 4-digit years or use DATA step logic to assign a century to each year (perhaps based on the value of another variable).
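For example, here is a minimal sketch (with hypothetical variable names) of DATA step logic that assigns the century based on the value of another variable:

data dates;
   input yy generation $;
   if generation = 'parent' then year = 1900 + yy;
   else year = 2000 + yy;
   date = mdy(1, 1, year);   /* build a SAS date value from the 4-digit year */
   format date date9.;
   datalines;
45 parent
05 child
;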

But why does the YEARCUTOFF= option allow only a 100-year span? If the YEARCUTOFF= option allowed for more than a 100-year span, there would be no way to determine which century a 2-digit year should have. For example, let’s assume that YEARCUTOFF=1950 with a 150-year span and your external data file had 2-digit years. In this scenario, your 150-year span would be from 1950 to 2100. Since you have 2-digit years, there would be no way to determine if the year 00 was meant to occur during 2000 or 2100.

How do I change the default setting for all the SAS users at my site?

Setting system default option values is usually done by a site SAS Installation Representative. The recommended method for setting a system default YEARCUTOFF= value is to specify the desired value in the system SAS configuration file. Note that even if you set a default value for all the users at your site, they can override the default value in their SAS programs, in personal autoexec files, in config files, by setting an environment variable, or when invoking SAS software.
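For example, in a system SAS configuration file (such as sasv9.cfg), the option is written in command-line form; a sketch:

-yearcutoff 1930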

How do I change the default setting for my own programs if I want a default that is different from the rest of the users at my site?

You can specify personal default values either in a personal configuration file or in an autoexec file. If you specify the value in a personal configuration file, the syntax depends on your operating system and is the same as that for setting the value in the system-wide configuration file on each system. If you use an autoexec file, you can specify the YEARCUTOFF= option in an OPTIONS statement.  Here is an example:

options yearcutoff=1930;

Additional Resources

  • YEARCUTOFF= System Option section in SAS® 9.3 System Options: Reference, Second Edition
  • YEARCUTOFF= System Option section in SAS® 9.4 System Options: Reference, Fifth Edition
  • SAS Note 46368, "The default value for the YEARCUTOFF= system option has changed in SAS® 9.4 and beyond"
  • SAS Note 65307, "You might encounter an issue in which 2-digit year dates have the wrong century in SAS® 9.3 and earlier releases"

Why does my SAS date have the wrong century? was published on SAS Users.

January 6, 2020
 

Last year, I wrote more than 100 posts for The DO Loop blog. The most popular articles were about SAS programming tips for data analysis, statistical analysis, and data visualization. Here are the most popular articles from 2019 in each category.

SAS programming tips

The Essential Guide to Binning

  • Create training, testing, and validation data sets: This post shows how to create training, validation, and test data sets in SAS (see the sketch after this list). This technique is popular in data science and machine learning because you typically want to fit the model on one set of data and then evaluate the goodness of fit by using a different set of data.
  • 5 reasons to use PROC FORMAT to recode variables in SAS: Often SAS programmers use PROC SQL or the SAS DATA step to create new data variables to recode raw data. This is not always necessary. It is more efficient to use PROC FORMAT to recode the raw data. Learn five reasons why you should use PROC FORMAT to recode variables.
  • Conditionally append observations to a data set: In preparing data for graphing, you might want to add additional data to the end of a data set. (For example, to plot reference lines or text.) You can do this by using one DATA step to create the new observations and another DATA step to merge the new observations to the end of the original data. However, you can also use the END= option and end-of-file processing to append new observations to data. Read about how to use the END= option to append data. The comments at the end of the article show how to perform similar computations by using a hash table, PROC SQL, and a DOW loop.
  • Use PROC HPBIN to bin numerical variables: In machine learning, it is common to bin numerical variables into a set of discrete values. You can use PROC HPBIN to bin multiple variables in a single pass through the data. PROC HPBIN can bin data into equal-length bins (called bucket binning) or by using quantiles of the data. This article and the PROC FORMAT article are both referenced in my Essential Guide to Binning in SAS.
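Below is a rough sketch of the train/validation/test split mentioned in the first item: a simple random 60/20/20 split in one DATA step (not necessarily the exact method used in the linked post).

data train validate test;
   set sashelp.cars;                      /* replace with your own data */
   call streaminit(12345);                /* fix the seed for reproducibility */
   u = rand('uniform');
   if u < 0.6 then output train;          /* 60% training */
   else if u < 0.8 then output validate;  /* 20% validation */
   else output test;                      /* 20% test */
   drop u;
run;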

Statistical analyses

Geometric Meaning of Kolmogorov's D

Data visualization

Visualize an Interaction Effect

I always enjoy learning new programming methods, new statistical ideas, and new data visualization techniques. If you like to learn new things, too, read (or re-read!) these popular articles from 2019. Then share this page with a friend. I hope we both have many opportunities to learn and share together in the new year.

The post Top posts from The DO Loop in 2019 appeared first on The DO Loop.

December 12, 2019
 

Parts 1 and 2 of this blog post discussed exploring and preparing your data using SASPy. To recap, Part 1 discussed how to explore data using the SASPy interface with Python. Part 2 continued with an explanation of how to prepare your data to use it with a machine-learning model. This final installment continues the discussion about preparation by explaining techniques for normalizing your numerical data and one-hot encoding categorical variables.

Normalizing your data

Some of the numerical features in the examples from parts 1 and 2 have different ranges. The variable Age spans from 0-100 and Hours_per_week from 0-80. These ranges affect the calculation of each feature when you apply them to a supervised learner. To ensure the equal treatment of each feature, you need to scale the numerical features.

The following example uses the SAS STDIZE procedure to scale the numerical features. PROC STDIZE is not included in the SASPy library by default. However, the good news is that you can add any SAS procedure to SASPy! This feature enables Python users to become part of the SAS community by contributing to the SASPy library and gives them the chance to use the vast number of powerful SAS procedures. To add PROC STDIZE to SASPy, see the instructions in the blog post Adding SAS procedures to the SASPy interface to Python.

After you add STDIZE to SASPy, run the following code to scale the numerical features. The resulting values will be between 0 and 1.

# Create a SASPy stat object
stat = sas.sasstat()
# Use the stdize method that was added to SASPy to scale the numerical features
stat_result = stat.stdize(data=cen_data_logTransform,
                          procopts='method=range out=Sasuser.afternorm',
                          var='age education_num capital_gain capital_loss hours_per_week')

To use the STDIZE procedure in SASPy, we need to specify the method and the output data set in the procedure options. We do this with the "procopts" argument, specifying range as the method and directing the "out" option to a new SAS data set, afternorm.

After running the STDIZE procedure, we assign the new data set to a SAS data object:

norm_data = sas.sasdata('afternorm', libref='SASuser')

Now let's verify that the numerical features were transformed successfully:

norm_data.head(obs=5)
The output looks great! You have normalized the numerical features. So, it's time to tackle the last data-preparation step.

One-Hot Encoding

Now that you have adjusted the numerical features, what do you do with the categorical features? The categories Relationship, Race, Sex, and so on are in string format. Statistical models cannot interpret these values, so you need to transform the values from strings into a numerical representation. The one-hot encoding process provides the transformation you need for your data.

To use one-hot encoding, use the LOGISTIC procedure from the SASPy Stat class. SASPy natively includes the LOGISTIC procedure, so you can go straight to coding. To generate the syntax for the code below, I followed the instructions from Usage Note 23217: Saving the coded design matrix of a model to a data set.

stat_proc_log = stat.logistic(data=norm_data,
                              procopts='outdesign=SASuser.test1 outdesignonly',
                              cls="workclass education_level marital_status occupation relationship race sex native_country / param=glm",
                              model="age = workclass education_level marital_status occupation relationship race sex native_country / noint")

To view the results from this code, create a SAS data object from the newly created data set, as shown in this example:

one_hot_data = sas.sasdata('test1', libref='SASuser')
display(one_hot_data.head(obs=5))

The output shows the first few rows of the coded design matrix with the one-hot encoded columns.

Our data was successfully one-hot encoded! For future reference, thanks to SAS's analytical power, this step is not required: when you include a categorical feature in a CLASS statement, the procedure automatically generates a design matrix with the one-hot encoded feature. For more information, I recommend reading this post about different ways to create a design matrix in SAS.

Finally

You made it to the end of the journey! I hope everyone who reads these blogs can see the value that SASPy brings to the machine-learning community. Give SASPy a try, and you'll see the power it can bring to your SAS solutions.

Stay curious, keep learning, and (most of all) continue innovating.

Machine Learning with SASPy: Exploring and Preparing your data - Part 3 was published on SAS Users.

November 12, 2019
 

The t-test is a very useful test that compares one variable (perhaps blood pressure) between two groups. T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. It is used to determine whether there is a significant difference between the means of two groups. With all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis. There are several SAS Studio tasks that include options to test this assumption. Let's use the t-test task as an example.

You start by selecting:

Tasks and Utilities > Tasks > Statistics > t Tests

On the DATA tab, select the Cars data set in the SASHELP library. Next request a Two-sample test, with Horsepower as the Analysis variable and Cylinders as the Groups variable. Use a filter to include only 4- or 6-cylinder cars. It should look like this:

On the OPTIONS tab, check the box for Tests for normality as shown below.
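If you prefer code to the point-and-click task, here is a rough sketch of an equivalent analysis (the task generates more elaborate code than this): PROC UNIVARIATE with the NORMAL option runs the tests for normality in each group, and PROC TTEST runs the two-sample test.

proc univariate data=sashelp.cars normal;
   where Cylinders in (4, 6);   /* include only 4- or 6-cylinder cars */
   class Cylinders;
   var Horsepower;
run;

proc ttest data=sashelp.cars;
   where Cylinders in (4, 6);
   class Cylinders;
   var Horsepower;
run;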

All the tests for normality for both 4-cylinder and 6-cylinder cars reject the null hypothesis that the data values come from a population that is normally distributed. (See the figure below.)

Should you abandon the t-test results and run a nonparametric test analysis such as a Wilcoxon Rank Sum test that does not require normal distributions?

This is the point where many people make a mistake. You cannot simply look at the results of the tests for normality to decide if a parametric test is valid or not. Here is the reason: When you have large sample sizes (in this data set, there were 136 4-cylinder cars and 190 6-cylinder cars), the tests for normality have more power to reject the null hypothesis and often result in p-values less than .05. When you have small sample sizes, the tests for normality will not be significant unless there are drastic departures from normality. It is with small sample sizes where departures from normality are important.

The bottom line is that the tests for normality often lead you to make the wrong decision. You need to look at the distributions and decide if they are somewhat symmetrical. The central limit theorem states that the sampling distribution of means will be normally distributed if the sample size is sufficiently large. "Sufficiently large" is a judgment call. If the distribution is symmetrical, you may perform a t-test with sample sizes as small as 10 or 20.

The figure below shows you the distribution of horsepower for 4- and 6-cylinder cars.

With the large sample sizes in this data set, you should feel comfortable in using a t-test. The results, shown below, are highly significant.

If you are in doubt of your decision to use a parametric test, feel free to check the box for a nonparametric test on the OPTIONS tab. Running a Wilcoxon Rank Sum test (a nonparametric alternative to a t-test), you also find a highly significant difference in horsepower between 4- and 6-cylinder cars. (See the figure below.)

You can read more about assumptions for parametric tests in my new book, A Gentle Introduction to Statistics Using SAS Studio.

For a sneak preview check out the free book excerpt. You can also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts. To do so, make sure to subscribe to the newsletter.

Testing the Assumption of Normality for Parametric Tests was published on SAS Users.

November 6, 2019
 

I have been programming SAS for a LONG time and have never seen much in the way of programming standards. For example, most SAS programmers indent DATA and PROC statements (I like three spaces). Most programmers do not like to see more than one statement on a line and most agree that there should be blank lines between program boundaries (DATA and PROC steps).

I thought I would share some of my thoughts on programming standards, with the hope that others will chime in with their ideas.

    • I like to indent all the statements in a DO group or DO loop. If there are nested groups, each one gets indented as well.
    • I prefer variable names in proper case.
    • I am not a fan of camel-case. For example, I prefer Weight_Kg to WeightKg. The reason that some programmers like camel-case is that SAS will automatically split a variable name at a capital letter in some headings.
    • I like my TITLE statements in open code, not inside a PROC. To me, that makes sense because TITLE statements are global.
    • There should be no conversion messages (character to numeric or numeric to character) in the SAS log. For example, use Num = INPUT(Char_Num,12.); instead of Num = 1*Char_Num;. The latter statement forces an automatic character-to-numeric conversion and places a message in the log.
    • I always use the statement ODS NOPROCTITLE;. This eliminates the default SAS procedure name at the top of the output.
    • Although fewer and fewer people are reading raw text data, I like my @ signs to all line up in my INPUT statement.
    • I like to use the /* and */ comments to define all macro variables. For example:
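A minimal sketch of this style (the macro and its parameters here are hypothetical examples):

/* Macro parameters:
   dsn=   name of the input data set
   n_obs= number of observations to print */
%macro print_head(dsn=, n_obs=5);
   proc print data=&dsn(obs=&n_obs);
   run;
%mend print_head;

/* Call the macro with named parameters */
%print_head(dsn=sashelp.cars, n_obs=10)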

Notice that I prefer named parameters in my macros, instead of positional parameters.

If this seems like too much work, SAS Studio has an automatic formatting tool that can help standardize your programs. For example, look at the code below:

Really ugly, right? Here is how you can use the automatic formatting tool in SAS Studio.

When you click this icon, the program now looks like this:

That's pretty much the way I would write it. By the way, if you don't like how SAS Studio formatted your code, press Ctrl+Z to undo it.

For more tips on writing code and how to get started in SAS Studio, check out my book, Learning SAS by Example: A Programmer's Guide, Second Edition. You can also download a free book excerpt. To learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts, make sure to subscribe to the SAS Books newsletter.

Making your SAS code more readable was published on SAS Users.

October 29, 2019
 

Thank you to Lora Delwiche and Susan Slaughter for providing the following information:

Six editions is a lot! If you had told us back when we wrote the first edition of The Little SAS Book that someday we would write a sixth, we would have wondered how we could possibly find that much to say. After all, it is supposed to be The Little SAS Book, isn’t it? But the developers at SAS are constantly hard at work inventing new and better ways of analyzing and visualizing data. And some of those ways turn out to be so fundamental that they belong even in a little book about SAS.

Interface independence

One of the biggest changes to SAS software in recent years is the proliferation of interfaces. SAS programmers have more choices than ever before. Previous editions contained some sections specific to the SAS windowing environment (also called Display Manager). We wrote this edition for all SAS programmers whether you use SAS Studio, SAS Enterprise Guide, the SAS windowing environment, or run in batch. That sounds easy, but it wasn’t. There are differences in how SAS behaves with different interfaces, and these differences can be very fundamental. In particular, the system option that sets the rules for names of variables varies depending on how you run SAS. So old sections had to be rewritten, and we added a whole new section showing how to use variable names containing blanks and special characters.

New ways to read and write Microsoft Excel files

Previous editions already covered how to read and write Microsoft Excel files, but SAS developers have created new ways that are even better. This edition contains new sections about the XLSX LIBNAME engine and the ODS EXCEL destination.

More PROC SQL

From the very first edition, The Little SAS Book always covered PROC SQL. But it was in an appendix, and over time we noticed that most people ignore appendices. So for this edition, we removed the appendix and added new sections on using PROC SQL to:

• Subset your data
• Join data sets
• Add summary statistics to a data set
• Create macro variables with the INTO clause

For people who are new to SQL, these sections provide a good introduction; for people who already know SQL, they provide a model of how to leverage SQL in your SAS programs.

Updates and additions throughout the book

Almost every section in this edition has been changed in some way. We added new options, made sure everything is up-to-date, and ran every example in every SAS interface noting any differences. For example, PROC SGPLOT has some new options, the default ODS style for PDF has changed, and the LISTING destination behaves differently in different interfaces. Here’s a short list, in no particular order, of new or expanded topics in the sixth edition:

• More examples with permanent SAS data sets, CSV files, or tab-delimited files
• More log notes throughout the book showing what to look for
• LIKE or sounds-like (=*) operators in WHERE statements
• CROSSLIST, NOCUM, and NOPRINT options in PROC FREQ
• Grouping data with a user-defined format and the PUT function
• Iterative DO groups
• DO WHILE and DO UNTIL statements
• %DO statements

Even though we have added a lot to this edition, it is still a little book. In fact, this edition is shorter than the last by 12 pages! We think this is the best edition yet. For a sneak preview, check out the free book excerpt. You can also learn more about SAS Press, check out the up-and-coming titles, and receive exclusive discounts by subscribing to the newsletter.

The Little SAS Book 6.0: The best-selling SAS book gets even better was published on SAS Users.

September 23, 2019
 

Although I do not typically blog about undocumented SAS options, I'll make an exception this time. For many years, I have known that the CONTENTS and COMPARE procedures support the BRIEF and SHORT options, but I always forget which option goes with which procedure. For the record, here are the documented options: the CONTENTS procedure supports the SHORT option, and the COMPARE procedure supports the BRIEF option.

SAS provides an autocomplete feature that programmers can use to remind themselves which options are supported in each procedure. Both SAS Studio and SAS Enterprise Guide support the autocomplete feature, which displays a list of options as you type SAS code. Unfortunately, I turn off that feature because I find it distracting. Consequently, I cannot remember whether to use SHORT or BRIEF.

My "solution" to this dilemma is to randomly guess an option, run the program, and fix the error if I guessed wrong. I should be wrong 50% of the time, but I noticed that I am wrong only about 25% of the time. Upon investigating, I discovered that some seemingly "wrong" code actually works. Although it is not documented, it turns out that the COMPARE procedure supports the SHORT option as an alias for BRIEFSUMMARY.

This is awesome! No longer do I need to randomly guess the option. I can use the SHORT option in both PROC CONTENTS and PROC COMPARE! As a bonus, the SHORT option is also supported by PROC OPTIONS and PROC DATASETS.

If you have never used the SHORT option before, it's a great time saver because it produces condensed output, as shown in the following examples. For PROC CONTENTS, I like to use the options VARNUM SHORT when I want to display a list of the variables in a data set:

proc contents data=Sashelp.Cars varnum SHORT;      /* SHORT is documented option */
run;

For PROC COMPARE, I use the undocumented SHORT option (which is an alias for BRIEF) when I want to display whether two data sets are equal:

data cars; set Sashelp.cars; run;
 
proc compare base=sashelp.cars compare=cars SHORT; /* SHORT is not documented, but it works */
run;

As I mentioned, the SHORT option is also supported in PROC OPTIONS. For completeness, here is an example that writes the values of several options to the SAS log:

proc options option=(linesize pagesize memsize _LAST_) SHORT; run;    /* SHORT is documented option */
 LINESIZE=75 PAGESIZE=24 MEMSIZE=17179869184 _LAST_=WORK.CARS

In short (see what I did there?), you can use the SHORT option in several Base SAS procedures. If you can't remember that the SHORT option applies to PROC CONTENTS and the BRIEF option applies to PROC COMPARE, just use the SHORT option in both procedures and in PROC OPTIONS, too.

The post Use the SHORT option in Base SAS procedures to reduce output appeared first on The DO Loop.

September 16, 2019
 

Just when you think you've seen it all, life can surprise you in a big way, making you wonder what else you've missed.

That is what happened when I recently had a chance to work with the SAS® Scalable Performance Data Server, a product that's been around a while but that I had never crossed paths with before. I open an SPDS table of a hundred million records in SAS® Enterprise Guide, and I can scroll it as fast as if it were an Excel "baby" spreadsheet of a hundred rows. That's how powerful it feels, to say nothing about the lightning speed of the queries.

What is the SAS Scalable Performance Data Server?

Also known as the SAS SPD Server (or SPDS), it's a data storage system designed for high-performance data delivery. Its primary purpose is to provide rapid queries of vast amounts of data. We are talking terabytes of data with tables containing billions of rows. SPDS employs parallel storage and efficient indexing, coupled with a multi-threaded server system concurrently processing tasks distributed across multiple processors.

The availability of the SPDS client in SAS® Viya effectively integrates SAS SPDS with SAS Viya, extending the functionality of its applications beyond the native Cloud Analytic Services (CAS), so you can continue reaping all the benefits of SAS SPDS.

SPDS library

In addition to connecting to SPD Server with explicit SQL pass-through, connection to SPD Server with a LIBNAME statement is available as well, for example:

libname mylibref sasspds 'serverdomain' host='nodename_or_ip' service='5400'
                         user='mySPDuserid' password='{SAS003}XXXXXXX...XXX';

This effectively creates an SPDS library, and the tables in that library can be referenced by the two-level name mylibref.tablename, as if this were a Base SAS library.

Cluster tables vs. member tables

Besides ordinary data tables, an SPDS library offers so-called dynamic cluster tables (or clusters for short), enabling transparent access to large amounts of data.

Dynamic cluster tables (cluster tables or clusters) are virtual tables that allow users to access many server tables (member tables) as if they were one table. A dynamic cluster table is a collection of SPD Server tables that are presented to the end-user application as a single table through a metadata layer acting like a view.

Member tables can be added to the cluster as well as replaced and removed from the cluster.

The role of PROC SPDO

PROC SPDO is the SAS procedure for the SPD Server operator interface. It performs a wide range of SPD server, user and table management tasks:

  • create, list, modify, destroy, and undo dynamic cluster tables
  • add, remove, replace, and fix cluster table members
  • add, modify, list, and delete access control lists (ACLs) for server resources
  • define, describe, and remove WHERE constraints on tables for row-level security definition and management
  • issue system commands on server nodes

In addition to PROC SPDO, SPD Server plug-in for SAS® Data Management Console is also available.

Retrieving SPDS library contents

If you open an SPDS library in SAS Enterprise Guide, you won't be able to tell which table in that library is a member table and which is a cluster table – they all look the same. But in many cases, we need to know what is what. Moreover, for data-driven processing we need to capture the SPDS library objects into a data set and identify whether they are clusters or member tables.

Luckily, PROC CONTENTS with the OUT= option allows us to do just that. While the MEMTYPE column is equal to 'DATA' for both clusters and member tables, there is another, lesser-known column, inversely named TYPEMEM, that has a value of 'DATA' for clusters and a blank value (' ') for member tables. The following simple code retrieves the SPDS library object list into the WORK.SPDSTYPES data set, where the TABLETYPE column specifies whether each library object MEMNAME is a cluster or a member:

proc contents data=SPDSLIB._all_ out=WORK.ALLOBJECTS (keep=MEMNAME TYPEMEM);
run;
 
proc sort data=WORK.ALLOBJECTS nodupkey;
   by MEMNAME;
run;
 
data WORK.SPDSTYPES;
   set WORK.ALLOBJECTS;
   attrib TABLETYPE length=$7 label='SPDS table type';
   select(TYPEMEM);
      when('DATA') TABLETYPE = 'CLUSTER';
      when('')     TABLETYPE = 'MEMBER';
      otherwise    TABLETYPE = '';
   end;
run;

In this code, PROC CONTENTS produces one record per column NAME in every object MEMNAME in the SPDS library; PROC SORT reduces (un-duplicates) this list to one record per MEMNAME; finally, the DATA step creates the TABLETYPE column indicating which MEMNAME is a CLUSTER and which is a MEMBER.

Retrieving SPDS cluster’s member list

In addition to retrieving a list of objects in the SPDS library described above, we also need a way of capturing the content (a list of members) of the cluster itself in order to control removing or replacing its members. PROC SPDO’s CLUSTER LIST statement produces such a list, and its OUT= option allows you to dump that list into a dataset:

proc spdo lib=SPDSLIB;
   cluster list CLUSTER1 out=CLUSTER1_MEMBERS;
   cluster list CLUSTER2 out=CLUSTER2_MEMBERS;
   /* ... */
   cluster list CLUSTERN out=CLUSTERN_MEMBERS;
quit;

This approach creates one output table per cluster, and you can’t use the same OUT= destination table for different clusters, for they will be overwritten with each subsequent CLUSTER LIST statement, not appended.

If you need to capture the contents of several clusters into one data set, then instead of the above method of outputting each cluster's content into a separate table and then appending (concatenating) them, the good old ODS OUTPUT statement with the CLUSTERLIST= option allows us to do it in a single step:

ods noresults;
ods output clusterlist=WORK.CLUSTER_MEMS;
proc spdo lib=SPDSLIB;
   cluster list CLUSTER1;
   cluster list CLUSTER2;
   /* ... */
   cluster list CLUSTERN;
quit;
ods output close;
ods results;

As an additional bonus, ODS NORESULTS suppresses printed output when it's not needed, for example, for automated data-driven processing.

Your thoughts?

What is your experience with SAS SPDS? How might you use it in the future? Please comment below.

How to retrieve contents of SAS® Scalable Performance Data Server library was published on SAS Users.

August 28, 2019
 

This article is not a tutorial on Hadoop, Spark, or big data. At the same time, no prerequisite knowledge of these technologies is required for understanding. We’ll give you enough background prior to diving into the details. In simplest terms, the Hadoop framework maintains the data and Spark controls and directs data processing. As an analogy, think of Hadoop as a train, big data as the payload, and Spark as the crew driving the train and organizing and distributing the goods.

Big data

I recently read that data volumes are doubling each year. Not that long ago we talked in terms of gigabytes. This quickly turned into terabytes and we’re now in the age of petabytes. The type of data is also changing. Data used to fit neatly into rows and columns. Now, nearly eighty percent of data is unstructured. All these trends and facts have led us to deal with massive amounts of data, aka big data. Maintaining and processing big data required creating technical frameworks. Next, we’ll investigate a couple of these tools.

Hadoop

Hadoop is a technology stack utilizing parallel processing on a distributed filesystem. Hadoop is useful to companies when data sets become so large or complex that their current solutions cannot effectively process the information in a reasonable amount of time. As the data science field has matured over the past few years, so has the need for a different approach to processing data.

Apache Spark

Apache Spark is a cluster-computing framework utilizing both iterative algorithms and interactive/exploratory data analysis. The goal of Spark is to keep the benefits of Hadoop’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce. As a result, this brings to the big data world new applications of data science that were previously too expensive or slow on massive data sets.

Now let’s explore how SAS integrates with these technologies to maximize capturing, managing, and analyzing big data.

SAS capabilities to leverage Spark

SAS provides Hadoop data processing and data scoring capabilities using SAS/ACCESS Interface to Hadoop and In-Database Technologies to Hadoop with MapReduce or Spark as the processing framework. This addresses traditional batch data management processing of huge volumes of extract, transform, load (ETL) data, as well as faster, interactive, in-memory processing for quicker response with Spark.

In SAS Viya, SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. All users with SAS/ACCESS Interface to Hadoop can use the serial data transfer mode. Likewise, SAS Data Connect Accelerator to Hadoop can load or save data in parallel between Hadoop and SAS using SAS Embedded Process, as a Hive/MapReduce or Spark job.

Connecting to Spark in a Hadoop Cluster

There are two ways to connect to a Hadoop cluster using SAS/ACCESS Interface to Hadoop, based on the SAS environment: LIBNAME and CASLIB statements.

LIBNAME statement to connect to Spark from MVA SAS

options set=SAS_HADOOP_JAR_PATH="/third_party/Hadoop/jars/lib:/third_party/Hadoop/jars/lib/spark"; 
options set=SAS_HADOOP_CONFIG_PATH="/third_party/Hadoop/conf"; 
 
libname hdplib hadoop server="hadoop.server.com" port=10000 user="hive"
schema='default' properties="hive.execution.engine=SPARK";

Parameters

  • SAS_HADOOP_JAR_PATH: Directory path for the Hadoop and Spark JAR files
  • SAS_HADOOP_CONFIG_PATH: Directory path for the Hadoop cluster configuration files
  • Libref: The hdplib libref specifies the location where SAS will find the data
  • SAS/ACCESS engine name: The HADOOP option connects with the Hadoop engine
  • SERVER: Hadoop Hive server to connect to
  • PORT: Listening Hive server port. 10000 is the default, so it is not required; it is included just in case
  • USER and PASSWORD: Not always required
  • SCHEMA: Hive schema to access. It is optional; by default, it connects to the "default" schema
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 
CASLIB statement to connect from CAS

caslib splib sessref=mysession datasource=(srctype="hadoop", dataTransferMode="auto", username="hive", server="hadoop.server.com",
   hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark",
   hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf", schema="default",
   platform="spark",
   dfdebug="EPALL",
   properties="hive.execution.engine=SPARK");

Parameters

  • CASLIB: Space holder for the specified data access. The splib CAS library specifies the Hadoop data source
  • SESSREF: Holds the CAS library in a specific CAS session. mysession is the current active CAS session
  • SRCTYPE: Type of data source
  • DATATRANSFERMODE: Type of data movement between CAS and Hadoop. Accepts one of three values: serial, parallel, auto. When AUTO is specified, CAS chooses the type of data transfer based on the available license in the system. If Data Connect Accelerator to Hadoop has been licensed, parallel data transfer is used; otherwise, serial mode is used
  • HADOOPJARPATH: Hadoop and Spark JAR files location path on the CAS cluster
  • HADOOPCONFIGDIR: Hadoop configuration files location path on the CAS cluster
  • PLATFORM: Type of Hadoop platform used to execute the job or transfer data with SAS Embedded Process. The default value is "mapred" for Hive MapReduce. When "spark" is used, data transfer and jobs execute as Spark jobs
  • DFDEBUG: Used to receive additional information in the SAS log about the data that SAS Embedded Process transfers
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 

Data Access using Spark

SAS Data Connect Accelerator for Hadoop with the Spark platform option uses Hive as the query engine to access Spark data. Data movement happens between Spark and CAS through SAS generated Scala code. This approach is useful when data already exists in Spark and either needs to be used for SAS analytics processing or moved to CAS for massively parallel data and analytics processing.

Loading Data from Hadoop to CAS using Spark

Processing data in CAS offers advanced data preparation, visualization, modeling and model pipelines, and finally model deployment. Model deployment can be performed using available CAS modules or pushed back to Spark if the data is already in Hadoop.

Load data from Hadoop to CAS using Spark

proc casutil 
      incaslib=splib
      outcaslib=casuser;
      load casdata="gas"
      casout="gas"
      replace;
run;

Parameters

  • PROC CASUTIL: Used to process CAS action routines to process data
  • INCASLIB: Input CAS library to read data
  • OUTCASLIB: Output CAS library to write data
  • CASDATA: Table to load to the CAS in-memory server
  • CASOUT: Output CAS table name

 

We can look at the status of the data load job using Hadoop's resource management and job scheduling application, YARN. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

Loading Data from Hadoop to Viya CAS using Spark

In the figure above, the YARN application executed the data load as a Spark job. This was possible because the CASLIB statement had the PLATFORM="spark" option specified. The data movement direction, in this case Hadoop to CAS, is reflected in the Spark job name, "SAS CAS/DC Input," where "Input" means data is loaded into CAS.

Saving Data from CAS to Hadoop using Spark

You can save data back to Hadoop from CAS at many stages of the analytic life cycle. For example, use data in CAS to prepare, blend, visualize, and model. Once the data meets the business use case, data can be saved in parallel to Hadoop using Spark jobs to share with other parts of the organization.

Using the SAVE CAS action to move data to Hadoop using Spark

proc cas;
session mysession; 
      table.save /
      caslib="splib"
      table={caslib="casuser", name="gas"},
      name="gas.sashdat"
      replace=True;
quit;

Parameters

  • PROC CAS: Used to execute CAS actionsets and actions to process data. "table" is the actionset and "save" is the action
  • TABLE: Location and name of the source table
  • NAME: Name of the target table saved to the Hadoop library using Spark

 

We can verify the status of saving data from CAS to Hadoop using the YARN application. Data from CAS saves as a Hadoop table using Spark as the execution platform. Furthermore, because SAS Data Connect Accelerator for Hadoop transfers data in parallel, the individual Spark executors on each of the Spark executor nodes handle data execution for their specific Hadoop cluster node.

Saving Data from Viya CAS to Hadoop using Spark

Finally, the save operation executed as a Spark job. As we can see from YARN, the Spark job named "SAS CAS/DC Output" specifies that the data moves from CAS to Hadoop.

Where we are; where we're going

We have so far traveled across the Spark pond to set up SAS libraries for Spark and to load and save data between Hadoop and SAS using Spark. In the next part, we'll look at ways to score data and execute SAS code inside Hadoop using Spark.

Data and Analytics Innovation using SAS & Spark - part 1 was published on SAS Users.

August 7, 2019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.
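For example, here is a rough sketch of bucket binning with PROC HPBIN, which splits a variable from Sashelp.Cars into five equal-width bins:

proc hpbin data=sashelp.cars numbin=5 bucket;
   input MPG_City;
run;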

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints.
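As a rough sketch, PROC RANK with the GROUPS= option performs quantile binning; GROUPS=4 uses the quartiles as cutpoints:

proc rank data=sashelp.cars groups=4 out=work.quartile_bins;
   var MPG_City;
   ranks MPG_City_Bin;   /* bin identifiers 0, 1, 2, 3 */
run;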

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.
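As a sketch, one common approach is a user-defined format whose ranges are the cutpoints. The wind-speed thresholds below (in mph) approximate the Saffir-Simpson categories, and the data set and variable names are hypothetical:

proc format;
   value hurricane
      low -< 74  = 'Not a hurricane'
      74  -< 96  = 'Category 1'
      96  -< 111 = 'Category 2'
      111 -< 130 = 'Category 3'
      130 -< 157 = 'Category 4'
      157 - high = 'Category 5';
run;

/* Apply the format to bin a hypothetical wind-speed variable */
proc freq data=storms;
   format WindSpeedMph hurricane.;
   tables WindSpeedMph;
run;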

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.