March 30, 2017
 

Users frequently ask how to plot their data as markers on a map. There are several ways to do this using SAS software. If you're a Visual Analytics user, you can do it using a point-and-click interface. But if you're a coder, you might need a little help... In this [...]

The post Plotting markers on a map at zip code locations, using GMap or SGplot appeared first on SAS Learning Post.

March 29, 2017
 

In 1990, the internet took on its most recognizable form. It brought connection, knowledge and speed that was previously inaccessible. Fast-forward 27 years, and I get asked a lot about the most recent form of the internet – the internet of things (IoT). And while I think the current possibilities [...]

3 best practices to prepare for the future of IoT was published on SAS Voices by Oliver Schabenberger

March 29, 2017
 

Lists are collections of objects. SAS/IML 14.2 supports lists as a way to store matrices, data tables, and other lists in a single object that you can pass to functions. SAS/IML lists automatically grow if you add new items to them and shrink if you remove items. You can also modify existing items. You can use indices to refer to items of a list, or you can assign names to the items to create a named list.

Create a list

You can create a list by using the ListSetItem subroutine to assign a value to each item:

proc iml;
L = ListCreate(2);                  /* L is two-item list */
call ListSetItem(L, 1, 1:3);        /* 1st item is 1x3 vector */
call ListSetItem(L, 2, {4 3, 2 1}); /* 2nd item is 2x2 matrix */

The arguments for the ListSetItem subroutine are list, index, and value, which is the same order that you use when you assign a value to a matrix element, such as A[2] = 5.

If you do not know how many items the list will contain, or if you later want to add additional items, you can use the ListAddItem subroutine:

X = {3 1 4, 2 5 3};                 /* create a 2x3 matrix */
call ListAddItem(L, X);             /* add 3rd item; list grows */

Notice that the syntax for the ListAddItem subroutine only requires the list and the value because the new item is always added to the end of the list. At this point, the list contains three items, as shown below:

[Figure: Items in a SAS/IML list can be different shapes and sizes]

Insert, modify, and delete list items

A convenient feature of SAS/IML lists is that they automatically resize. You can insert new items into any position in the list. You can also replace an existing item or delete an item. The following statements demonstrate modifying an existing list.

call ListInsertItem(L, 2, -1:1);    /* insert new item before item 2; list grows */
call ListSetItem(L, 1, {A B, C D}); /* replace 1st item with 2x2 character matrix */
call ListDeleteItem(L, 3);          /* delete 3rd item; list shrinks */

When you insert or delete an item at position k, existing higher-indexed items are also renumbered. (Conceptually, when you delete an item, the later items "move to the left.") After the previous sequence of operations, the list contains the following items:

[Figure: You can insert, modify, and delete items in a SAS/IML list]
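
One way to confirm the renumbering is to query the list directly. The following is a minimal sketch that repeats the operations above in a fresh session and then uses the ListLen and ListGetItem functions to inspect the result. The variable names n and item2 are chosen only for illustration.

```sas
proc iml;
L = ListCreate(2);                      /* two-item list */
call ListSetItem(L, 1, 1:3);            /* 1st item is 1x3 vector */
call ListSetItem(L, 2, {4 3, 2 1});     /* 2nd item is 2x2 matrix */
call ListAddItem(L, {3 1 4, 2 5 3});    /* append a 3rd item (2x3 matrix) */

call ListInsertItem(L, 2, -1:1);        /* insert new item before item 2 */
call ListSetItem(L, 1, {A B, C D});     /* replace 1st item */
call ListDeleteItem(L, 3);              /* delete 3rd item */

n = ListLen(L);                         /* number of items now in the list */
item2 = ListGetItem(L, 2);              /* copy of the item at position 2 */
print n, item2;
quit;
```

Printing the length and one item after each call is a simple way to see how the positions shift.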

Create named lists

In the previous section, the list acted like a dynamic array: it stored items that you could access by using indices such as 1, 2, and 3. For some applications, it makes more sense to name the list items and refer to the items by their names rather than their positions. This is similar to "structs" in some languages, where you can refer to members of a struct by name.

For example, suppose a teacher wants to store information about students in her classes. For each student, she might want to store the student's name, class, and test scores. The following statements create a SAS/IML list that has three items named "Name", "Class", and "Scores". The items are then assigned values:

Student = ListCreate({"Name" "Class" "Scores"});  /* create named list with 3 items */
call ListSetItem(Student, "Name", "Ron");         /* set "name" value for student */
call ListSetItem(Student, "Class", "Statistics"); /* set "class" value  for student */
Tests = {100 97 94 100 100};                      /* test scores */
call ListSetItem(Student, "Scores", Tests);       /* set "scores" value for student */

Although you can still use positional indices to access list items ("Ron" is the first item, "Statistics" is the second item, ...), you can also use the names of the items. For example, to extract the test scores into a SAS/IML matrix or vector, you can use the ListGetItem function:

s = ListGetItem(Student, "Scores");               /* get test scores from list */

Lists are for storing items, not for computing. You cannot add, subtract, or multiply lists. However, the example shows that you can extract an item into a matrix and subsequently perform algebraic operations on the matrix.
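
As a small illustration of that last point, the sketch below rebuilds the named list, extracts the scores, and computes on the resulting matrix. The variable names avg and best are hypothetical.

```sas
proc iml;
Student = ListCreate({"Name" "Class" "Scores"});  /* named list with 3 items */
call ListSetItem(Student, "Name", "Ron");
call ListSetItem(Student, "Class", "Statistics");
call ListSetItem(Student, "Scores", {100 97 94 100 100});

s = ListGetItem(Student, "Scores");   /* s is an ordinary 1x5 matrix */
avg  = s[:];                          /* mean of all elements */
best = max(s);                        /* largest score */
print avg best;
quit;
```

The subscript reduction operator s[:] is used here instead of the MEAN function because MEAN computes column means, which for a row vector would simply return the vector itself.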

Lists of lists

Lists can contain sublists. In this way you can represent nested or hierarchical data. For example, the teacher might want to store information about several students. If information about each student is stored in a list, then she can use a list of lists to store information about multiple students.

The following statements create a list called All. The first student ("Ron") is added to the list, which means that the item is a copy of the Student list. Then information about a second student ("Karl") is stored in the Student list and copied into the All list as a second item. This process could continue until all students are stored in the list.

All = ListCreate();                  /* empty list */
call ListAddItem(All, Student);      /* add "Ron" to list */
 
call ListSetItem(Student, "Name", "Karl"); 
call ListSetItem(Student, "Class", "Calculus");
call ListSetItem(Student, "Scores", {90 92 84 70 80});  
call ListAddItem(All, Student);      /* add "Karl" to list */

At this point, the All list contains two items. Each item is a list that contains three items.
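
To read data back out of a nested list, you can extract a sublist first and then extract items from it. The following self-contained sketch loops over the All list and prints each student's name and average score. The names Stu, scores, and avg are illustrative.

```sas
proc iml;
Student = ListCreate({"Name" "Class" "Scores"});
call ListSetItem(Student, "Name", "Ron");
call ListSetItem(Student, "Class", "Statistics");
call ListSetItem(Student, "Scores", {100 97 94 100 100});

All = ListCreate();                     /* empty list */
call ListAddItem(All, Student);         /* copy of "Ron" */

call ListSetItem(Student, "Name", "Karl");
call ListSetItem(Student, "Class", "Calculus");
call ListSetItem(Student, "Scores", {90 92 84 70 80});
call ListAddItem(All, Student);         /* copy of "Karl" */

do i = 1 to ListLen(All);
   Stu    = ListGetItem(All, i);        /* i-th sublist */
   name   = ListGetItem(Stu, "Name");   /* item of the sublist, by name */
   scores = ListGetItem(Stu, "Scores");
   avg    = scores[:];                  /* mean of the scores */
   print name avg;
end;
quit;
```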

Summary

SAS/IML 14.2 supports lists, which are containers that store other objects. Lists dynamically grow or shrink as necessary. You can index items by their position in the list (using indices) or you can create a named list and reference items by their names. You can use built-in functions to extract items from a list or add new items to a list. You can insert, delete, and modify list items. You can represent hierarchical data by using lists of lists. For more information about SAS/IML lists, see the chapter in the SAS/IML documentation. The documentation shows how lists can be used to emulate other data structures, such as stacks, queues, and trees.

The post Lists: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.

March 29, 2017
 

There is a well-known Russian saying that goes “Если нельзя, но очень хочется, то можно.” The English translation of it can span anywhere from “If you can’t, but want it badly, then you can” to “If you shouldn’t, but want it badly, then you should” to “If you may not, but want it badly, then you may.” Depending on your situation, any possible combination of “may,” “can,” or “should” may apply. You can even replace “want” with “need” to get a slightly different flavor.

There are known means of modifying variable attributes with PROC DATASETS, but they are limited to variable name, format, informat, and label. But what if we want/need to modify a variable length, or change a variable type? And I am not talking about creating a new variable with a different length or converting a numeric variable value into a character value of another variable. We want to change variable type and/or variable length in place, without adding new variables. If you believe it can’t be done, read the first paragraph again.

Imagine that we have two data tables that we need to concatenate into one table. However, there is one common variable that is of different type in each table – in the first table it is numeric, but in the second table it is character.

Sample data

Let’s create some sample data to emulate our situation by running the following SAS code:

libname sasdl 'C:PROJECTS_BLOG_SASchanging_variable_type_and_length_in_sas_datasets';
 
/* create study2016 data table */
data sasdl.study2016;
	length subjid dob 8 state $2 start_date end_date 8;
	infile datalines truncover;
	input subjid dob : mmddyy10. state start_date : mmddyy10. end_date : mmddyy10.;
	format dob start_date end_date mmddyy10.;
	datalines;
123456 08/15/1960 MD 01/02/2016
234567 11/13/1970 AL 05/12/2016 12/30/2016
;
 
/* create study2017 data table */
data sasdl.study2017;
	length subjid $6 dob 4 state $2 start_date end_date 4;
	infile datalines truncover;
	input subjid dob : mmddyy10. state start_date : mmddyy10. end_date : mmddyy10.;
	format dob start_date end_date mmddyy10.;
	datalines;
987654 03/15/1980 VA 02/13/2017
876543 11/13/1970 NC 01/11/2017 01/30/2017
765432 12/15/1990 NY 03/14/2017
;

The produced data tables will look as follows:

Table STUDY2016:
[Figure: SAS data table 1]

Table STUDY2017:
[Figure: SAS data table 2]

If we look at the tables’ variable properties, we will see that the subjid variable is of different type in these two data tables: it is of type Numeric (length of 8) in STUDY2016 and of type Character (length of 6) in STUDY2017:

[Figure: SAS variables properties]

Also, notice that variables dob, start_date, and end_date, although of the same Numeric type, have different length attributes - 8 in the STUDY2016 table, and 4 in the STUDY2017 table.
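
If you prefer to check these attributes in code rather than in a properties window, one option is to query the DICTIONARY.COLUMNS table. This is a minimal sketch that assumes the sasdl libref and the two tables created above.

```sas
/* list type and length of every variable in both tables */
proc sql;
	select memname, name, type, length
	from dictionary.columns
	where libname eq 'SASDL'
	  and memname in ('STUDY2016', 'STUDY2017')
	order by name, memname;
quit;
```

Sorting by variable name places the two definitions of each variable next to each other, which makes mismatches in type or length easy to spot.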

Data table concatenating problem

If we try to concatenate these two tables using PROC APPEND, SAS will generate an ERROR:

proc append base=sasdl.study2016 data=sasdl.study2017;
run;
 
NOTE: Appending SASDL.STUDY2017 to SASDL.STUDY2016.
WARNING: Variable subjid not appended because of type mismatch.
WARNING: Variable dob has different lengths on BASE and DATA
         files (BASE 8 DATA 4).
WARNING: Variable start_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
WARNING: Variable end_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
ERROR: No appending done because of anomalies listed above.
       Use FORCE option to append these files.
NOTE: 0 observations added.

Even if we do use the FORCE option as the ERROR message suggests, the result will be disappointing:

proc append base=sasdl.study2016 data=sasdl.study2017 force;
run;
 
NOTE: Appending SASDL.STUDY2017 to SASDL.STUDY2016.
WARNING: Variable subjid not appended because of type mismatch.
WARNING: Variable dob has different lengths on BASE and DATA
         files (BASE 8 DATA 4).
WARNING: Variable start_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
WARNING: Variable end_date has different lengths on BASE and
         DATA files (BASE 8 DATA 4).
NOTE: FORCE is specified, so dropping/truncating will occur.

The resulting data table will have missing values for the appended subjid:

[Figure: Missing values in SAS table after PROC APPEND]

Solution

In order to concatenate these tables, we must make the mismatching variable subjid of the same type in both data tables, either Character or Numeric. Making them both of Character type seems more robust, since it would allow for the values to contain both digit and non-digit characters. But if you know for sure that the value contains only digits, making them both Numeric works just as well.
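
For the Numeric alternative, a minimal sketch might convert the character subjid in STUDY2017 by using the INPUT function. The output table name work.study2017_num and the temporary name v1 are hypothetical, and the sketch writes to the WORK library so the original table is left untouched.

```sas
/* sketch: character-to-numeric conversion of subjid */
data work.study2017_num (drop=v1);
	length subjid 8;                          /* new numeric subjid, defined first */
	set sasdl.study2017 (rename=(subjid=v1)); /* old character subjid becomes v1 */
	subjid = input(v1, ?? 6.);                /* ?? suppresses error notes for non-digit values */
run;
```

Any value that is not a valid number becomes a missing value, which is why this route is only safe when the identifiers are known to contain digits only.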

Let’s say we decide to make them of Character type. Also note that our numeric variables representing dates (dob, start_date and end_date) are of different lengths: they are length 8 in STUDY2016 and length 4 in STUDY2017. Let’s make them the same length as well. From the standpoint of numerical accuracy in SAS for the dates, a length of 4 seems to be quite adequate to represent them accurately.

Let’s apply all our modifications to the STUDY2016 data set. Even though we are going to rebuild the data set in order to modify variable type and length, we are going to preserve the variable order so it feels like we just modified those variable attributes.

Here is how it can be done.

/* create macrovariable varlist containing a list of variable names */
proc sql noprint;
	select name into :varlist separated by ' '
	from sashelp.vcolumn
	where upcase(libname) eq 'SASDL' and upcase(memname) eq 'STUDY2016';
quit;
 
/* modify variable type and length */
data sasdl.study2016 (drop=v1-v4);
	retain &varlist; *<-- preserve variable order ;
	length subjid $6 dob start_date end_date 4; *<-- define new types/lengths ;
	format dob start_date end_date mmddyy10.;   *<-- recreate formats ;
	set sasdl.study2016 (rename=(subjid=v1 dob=v2 start_date=v3 end_date=v4));
	subjid = put(v1,6.); *<-- redefine subjid variable ;
	dob = v2;            *<-- redefine dob variable ;
	start_date = v3;     *<-- redefine start_date variable ;
	end_date = v4;       *<-- redefine end_date variable ;
run;
 
/* make sure new concatenated file (study_all) does not exist */
proc datasets library=sasdl nolist;
	delete study_all;
quit;
 
/* append both (study2016 and study2017) to study_all */
proc append base=sasdl.study_all data=sasdl.study2016;
run;
proc append base=sasdl.study_all data=sasdl.study2017;
run;

In this code, first, using proc sql and SAS view sashelp.vcolumn, we create a macro variable varlist to hold the list of all the variable names in our table, sasdl.study2016.

Then in the data step, we use a retain statement to preserve the variable order. When we read the sasdl.study2016 dataset using the set statement, we rename our variables-to-be-modified to some temporary names (e.g. v1 – v4) which we eventually drop in the data statement.

Then we re-assign the values of those temporary variables to the original variable names, thereby essentially creating new variables with new type and length. Since these new variables are named exactly as the old ones, the effect is as if their type and length attributes were modified, while in fact the whole table was rebuilt and replaced. Problem solved.

When we concatenate the data tables we create a new table sasdl.study_all. Before concatenating our two tables using proc append twice, we use proc datasets to delete that new table first. Even if the table does not exist, proc datasets will at least attempt to delete it. With all the seeming redundancy of this step, you will definitely appreciate it when you try running this code more than one time.
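
As an alternative sketch, not the approach used above: once the variable types match, a single DATA step that lists both tables in the SET statement also concatenates them. Because the DATA step replaces the output table on every run, no separate delete step is needed.

```sas
/* concatenate by listing both tables in one SET statement */
data sasdl.study_all;
	set sasdl.study2016 sasdl.study2017;
run;
```

With this form, variable attributes such as length are taken from the first table listed, so the modified STUDY2016 should come first. PROC APPEND remains the better choice when the base table is large, since it does not reread the base table.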

Changing variable type and variable length in SAS datasets was published on SAS Users.

March 28, 2017
 

How many meteorites have hit the earth in the last 4,000 years? Where have they landed? And which ones were the biggest? Can we show all of this information - and more in an intuitive data visualization? It turns out NASA provides public data about recorded meteorite impacts on earth all the [...]

How to design a meteorite infographic using NASA data and SAS was published on SAS Voices by Falko Schulz

March 28, 2017
 

I have been using the SAS Viya environment for just over six months now and I absolutely love it.  As a long-time SAS coder and data scientist I’m thrilled with the speed and greater accuracy I’m getting out of a lot of the same statistical techniques I once used in SAS9.  So why would a data scientist want to switch over to the new SAS Viya platform? The simple response is “better, faster answers.”  Some features are inherent in the SAS Viya architecture and provide advantages across the platform, and there are also benefits specific to different products.  So, let me try to distinguish between these.

SAS Viya Platform Advantages

To begin, I want to talk about the SAS Viya platform advantages.  For data processing, SAS Viya uses something called the CAS (Cloud Analytic Services) server – which takes the place of the SAS9 workspace server.  You can still use your SAS9 installation, as SAS has made it easy to work between SAS9 and SAS Viya using SAS/CONNECT, a feature that will be automated later in 2017.

Parallel Data Loads

One thing I immediately noticed was the speed with which large data sets are loaded into SAS Viya memory.  Using Hadoop, we can stage input files in either HDFS or Hive, and CAS will lift that data in parallel into its pooled memory area.  The same data conversion is occurring, like what happened in SAS9, but now all available processors can be applied to load the input data simultaneously.  And speaking of RAM, not all of the data needs to fit exactly into memory as it did with the LASR and HPA procedures, so much larger data sets can be processed in SAS Viya than you might have been able to handle before.

Multi-threaded DATA step

After initially loading data into SAS Viya, I was pleased to learn that the SAS DATA step is multi-threaded.  Most of your SAS9 programs will run ‘as is’; however, the multi-processing really only kicks in when the system finds explicit BY statements or partition statements in the DATA step code.  Surprisingly, you no longer need to sort your data before using BY statements in Procs or DATA steps.  That’s because there is no Proc Sort anymore – sorting is a thing of the past, which certainly takes some getting used to in SAS Viya.  So all of those times where I had to first sort data before I could use it, and then execute one or more DATA steps, now transform into a simpler code stream.   Steven Sober has some excellent code examples of the DATA step running in full-distributed mode in his recent article.
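
To make the idea concrete, here is a minimal sketch of a BY-group DATA step running against CAS tables without a prior sort. It assumes a SAS Viya environment in which you can start a CAS session and write to the CASUSER caslib. The session name mysess, the libref mycas, and the table names are hypothetical.

```sas
cas mysess;                                /* start a CAS session */
libname mycas cas caslib=casuser;          /* libref that points at the CASUSER caslib */

data mycas.cars;                           /* load a sample table into CAS */
	set sashelp.cars;
run;

data mycas.first_by_make;                  /* input and output are both CAS tables */
	set mycas.cars;
	by make;                               /* no PROC SORT beforehand */
	if first.make;                         /* keep one row per make */
	thread = _threadid_;                   /* record which thread processed the row */
run;
```

The thread column is included only to show how the work was spread out. Since row order within a BY group is not guaranteed in a distributed table, which row counts as "first" for each make may differ between runs.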

Open APIs

While all of SAS Viya’s graphical user interfaces are designed with consistency of look and feel in mind, the R&D teams have designed the platform to allow almost any front end or REST service to submit commands and receive results from either CAS or its corresponding micro-service architecture.  Something new I had to learn was the concept of a CAS action set.  A CAS action set consists of a number of separate actions, which can be executed singly or with other actions belonging to the same set.  The cool thing about CAS actions is that there is one for almost any task you can think of (kind of like a blend between functions and Procs in SAS9).  In fact, all of the visual interfaces SAS deploys use CAS actions behind the scenes, and most GUIs will automatically generate code for you if you do not want to write it.

But the real beauty of CAS actions is that you can submit them through different coding interfaces using the open Application Programming Interfaces (APIs) that SAS has written to support external languages like Python, Java, Lua and R (check out GitHub on this topic).  The standardization aspect of using the same CAS action within any type of external interface looks like it will pay huge dividends to anyone investing in this approach.

Write it once, re-use it elsewhere

I think another feature that old and new users alike will adore is the “write-it-once, re-use it” paradigm that CAS actions support.  Here’s an example of code that was used in Proc CAS, and then used in a Jupyter notebook using Python, followed by an R/REST example.

Proc CAS

proc cas;
dnnTrain / table={name  = 'iris_with_folds'
                   where = '_fold_ ne 19'}
 modelWeights = {name='dl1_weights', replace=true}
 target = "species"
 hiddens = {10, 10} acts={'tanh', 'tanh'}
 sgdopts = {miniBatchSize=5, learningRate=0.1, 
                  maxEpochs=10};
run;

 

Python API

s.dnntrain(table = {'name': 'iris_with_folds',
                    'where': '_fold_ ne 19'},
   modelweights = {'name': 'dl1_weights', 'replace': True},
   target  = "species",
   hiddens  = [10, 10], acts=['tanh', 'tanh'],
   sgdopts  = {'miniBatchSize': 5, 'learningRate': 0.1,
               'maxEpochs': 10})

 

R API

cas.deepNeural.dnnTrain(s,
  table = list(name = 'iris_with_folds',
               where = '_fold_ ne 19'),
  modelweights = list(name = 'dl1_weights', replace = TRUE),
  target = "species",
  hiddens = c(10, 10), acts = c('tanh', 'tanh'),
  sgdopts = list(miniBatchSize = 5, learningRate = 0.1,
                 maxEpochs = 10))

 

See how nearly identical each of these three are to one another?  That is the beauty of SAS Viya.  Using a coding approach like this means that I do not need to rely exclusively on finding SAS coding talent anymore.  Younger coders who usually know several open source languages take one look at this, understand it, and can easily incorporate it into what they are already doing.  In other words, they can stay in coding environments that are familiar to them, whilst learning a few new SAS Viya objects that dramatically extend and improve their work.

Analytics Procedure Advantages

Auto-tuning

Next, I want to address some of the advantages in the newer analytics procedures.  One really great new capability is the auto-tuning feature for some machine learning modeling techniques, specifically (extreme) gradient boosting, decision tree, random forest, support vector machine, factorization machine and neural network.  This capability, the automatic tuning of the major option settings required by most iterative machine learning techniques, is hard to find in the open source community.  Data scientists call these settings ‘hyperparameters’, and SAS has built-in optimizing routines that try different values and pick the best ones for you (in parallel!).  The process takes longer to run initially, but, wow, the increase in accuracy without going through the normal model build trial-and-error process is worth it for this amazing feature!

Extreme Gradient Boosting, plus other popular ML techniques

Admittedly, xgboost has been in the open source community for a couple of years already, but SAS Viya has its own extreme[1] gradient boosting CAS action (‘gbtreetrain’) and accompanying procedure (Gradboost).  Both are very close to what Chen (2015, 2016) originally developed, yet have some nice enhancements sprinkled throughout.  One huge bonus is the auto-tuning feature I mentioned above.  Another set of enhancements includes: 1) a more flexible tree-splitting methodology that is not limited to CART (binary tree-splitting), and 2) automatic handling of nominal input variables, versus the ‘one-hot-encoding’ you need to perform in most open source tools.  Plus, there are lots of other detailed option settings for fine tuning and control.

In SAS Viya, all of the popular machine learning techniques are there as well, and SAS makes it easy for you to explore your data, create your own model tournaments, and generate score code that is easy to deploy.  Model management is currently done through SAS9 (at least until the next SAS Viya release later this year), but good, solid links are provided between SAS Viya and SAS9 to make transferring tasks and output fairly seamless.  Check out the full list of SAS Viya analytics available as of March 2017.

In-memory forecasting

It is hard to beat SAS9 Forecast Server with its unique 12 patents for automatic diagnosing and generating forecasts, but now all of those industry-leading innovations are also available in SAS Viya’s distributed in-memory environment. And by leveraging SAS Viya’s optimized data shuffling routines, time series data does not need to be sorted, yet it is quickly and efficiently distributed across the shared memory array. The new architecture also has given us a set of new object packages to make more efficient use of the data and run faster than anything witnessed before. For example, we have seen a job with 1.5 million weekly time series and three years of history, which took 130 hours running single-machine and single-threaded, finish in 5 minutes on a 40 core networked array with 10 threads per core. Accurately forecasting 870 Gigabytes of information, in 5 minutes?!? That truly is amazing!

Conclusions

Though I first ventured into SAS Viya with some trepidation, it soon became clear that the new platform would fundamentally change how I build models and analytics.  In fact, the jumps in performance and the reductions in time spent on routine work have been so compelling for me that I am having a hard time thinking about going back to a pure SAS9 environment.  For me it’s all about getting “better, faster answers,” and SAS Viya allows me to do just that.   Multi-threaded processing is the way of the future and I want to be part of that, not only for my own personal development, but also because it will help me achieve things for my customers they may not have thought possible before.  If you have not done so already, I highly recommend that you sign up for a free trial and check out the benefits of SAS Viya for yourself.


[1] The definition of ‘extreme’ refers only to the distributed, multi-threaded aspect of any boosting technique.

References

Chen, Tianqi and Carlos Guestrin, “XGBoost: Reliable Large-scale Tree Boosting System”, 2015

Chen, Tianqi and Carlos Guestrin, “XGBoost: A Scalable Tree Boosting System”, 2016

Using the Robust Analytics Environment of SAS Viya was published on SAS Users.

March 28, 2017
 

Did you know that PROC SQL captures the record count for a result set in a special automatic macro variable? When you create a subset of data to include in a report, it's a nice touch to add a record count and other summaries as an eye-catcher to the report title. I often see the following pattern in SAS programs, which adds an extra step to get a record count:

proc sql noprint;
 
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 /* count the records in the result */
 select count(model) into :resultcount
  from result;
quit;
 
title "Summary of Cars from Asia: &resultcount. records";
proc means data=result;
run;

This creates a report with an informative title like this:

Here's the tip. Instead of including a SELECT INTO step that's going to make another pass through the data, you can rely on the &SQLOBS automatic macro variable. This variable holds the record "result set" count from the most recent SELECT clause.

proc sql noprint;
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 %let resultcount=&sqlobs;
quit;
 
title "Summary of Cars from Asia: &resultcount. records";
proc means data=result;
run;

Because SAS replaces the value with each subsequent SELECT clause, it's important to assign it to another macro variable immediately if you intend to use it later. Here's the result:

Not only is this more efficient, but SAS automatically trims the whitespace from the SQLOBS variable so that it looks better in a TITLE statement. If you're using SELECT INTO to populate macro variables for other reasons, you can use the TRIMMED keyword to achieve the same effect.

proc sql noprint;
 
 create table result 
  as select * from sashelp.cars
  where origin='Asia';
 
 %let resultcount=&sqlobs;
 
 select avg(mpg_highway) into: AvgMpg TRIMMED
  from result;
 
quit;
 
title "Summary of Cars from Asia: &resultcount. records, &AvgMpg. MPG Average";
proc means data=result;
run;
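
The point about SQLOBS being replaced by each subsequent statement can be seen in a short sketch. Here two tables are created in one PROC SQL step and each count is saved immediately after its own statement. The table and macro variable names are chosen only for illustration.

```sas
proc sql noprint;
 create table asia
  as select * from sashelp.cars
  where origin='Asia';
 %let n_asia=&sqlobs;      /* save the count before the next statement runs */

 create table europe
  as select * from sashelp.cars
  where origin='Europe';
 %let n_europe=&sqlobs;    /* SQLOBS now reflects the second result set */
quit;

%put NOTE: Asia=&n_asia Europe=&n_europe (SQLOBS is now &sqlobs);
```

If the first %LET were moved below the second CREATE TABLE statement, both macro variables would hold the count for the Europe table.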


The post How many records are in that PROC SQL result? appeared first on The SAS Dummy.

March 27, 2017
 

Here in the US, it's the nationwide men's college basketball tournament season! Therefore let's use some data from the previous years' tournaments to sharpen our analytics & visualization skills... But before we get started, I must mention (brag?) that my alma mater, NC State University, won this tournament in 1983. [...]

The post Let's analyze 32 years of basketball tournament data! appeared first on SAS Learning Post.
