12月 142017
 

The goal of all types of analytics is to provide business insight. Consider that: Descriptive analytics provides the business with insight on what happened in the past and what is happening now. Predictive analytics provides the business with insight on the probability of what will happen in the future. Prescriptive analytics provides the [...]

The post How a data-driven business supports analytics goals appeared first on The Data Roundtable.

12月 132017
 

SAS Data Preparation 2.1 is now available and it includes the ability to perform data quality transformations on your data using the definitions from the SAS Quality Knowledge Base (QKB).

The SAS Quality Knowledge Base is a collection of files which store data and logic that define data quality operations such as parsing, standardization, and generating match codes to facilitate fuzzy matching based on geographic locales. SAS software products reference the QKB when performing data quality transformations on your data.  These products include: SAS Data Integration Studio, SAS DataFlux Data Management Studio/Server, SAS code via dqprocs, SAS MDM, SAS Data Loader for Hadoop, SAS Event Stream Processing, and now SAS Data Preparation which is powered by SAS Viya.

Out-of-the-box QKB definitions include the ability to perform data quality operations on items such as Name, Address, Phone, and Email.

SAS Data Preparation 2.1

SAS Data Preparation – Data Quality Transformations

The following are the data quality transformations available in SAS Data Preparation:

  • Casing – case a text string in upper, lower, or proper case. Example using the Proper (Organization) case definition – input: sas institute   output: SAS Institute.
  • Parsing – break up a text string into its tokens. Example using the Name parse definition – input: James Michael Smith   output: James (Given Name token), Michael (Middle Name token), and Smith (Family Name token).
  • Field extraction – extract relevant tokens from a text string. Example using a custom created extraction definition for Clothing information – input: The items purchased were a small red dress and a blue shirt, large   output: dress; shirt (Item token), red; blue (Color token), and small; large (Size token).
  • Gender analysis – guess the gender of a text string. Example using the Name gender analysis definition – input: James Michael Smith   output: M (abbreviation for Male).
  • Identification analysis – guess the type of data for a text string. Example using the Contact Info identification analysis definition:
  • Match codes – generate a code to fuzzy match text strings. Example using the Name match definition at a sensitivity of 85:
    For more information on match codes, view this YouTube video on The Power of the SAS® Matchcode.
  • Standardize – put a text string into a common format. Example using the Phone standardization definition – input: 9196778000   output: (919) 677 8000.

Note:  While all the examples above are using definitions from the English (United States) locale in the SAS Quality Knowledge Base for Contact Information, QKBs are available for dozens of locales.

You can also customize the definitions in the QKB using SAS DataFlux Data Management Studio.  This allows you to update the out-of-the-box QKB definitions or create your own data types and definitions to suit your project needs.  For example, you may need to create a definition to extract the clothing information from a free-form text field as shown the Field extraction example.  These customized definitions can then be used in SAS Data Preparation as part of your data quality transformations.  For more information on Customizing the QKB, you can view this YouTube video.

For more information on the SAS Quality Knowledge Base (QKB), you can view its documentation.

SAS Data Preparation 2.1: Data quality transformations was published on SAS Users.

12月 132017
 

The SAS language is large. Even after 20+ years of using SAS, there are many features that I have never used. Recently it became necessary for me to learn about DICTIONARY tables in PROC SQL (and the associated SASHELP views) because I needed to programmatically obtain the text for the current value of the system title in SAS. I had heard a lot about DICTIONARY tables, but this was my first time using them in a program.

This article discusses DICTIONARY tables and shows how to use the DICTIONARY.Titles table to obtain the current value of titles and footnotes in SAS.

DICTIONARY tables and titles

DICTIONARY tables are documented in the PROC SQL documentation. They are special read-only PROC SQL tables that contain information about the current state of SAS, including the state of libraries, data sets, and system options. The documentation lists all DICTIONARY tables, and I determined that I needed to look at the DICTIONARY.Titles table in PROC SQL or (if I needed to use another SAS procedure) the SASHELP.VTitle view, which contains the same information.

After finding out which table to use, I wanted to display the contents of the table. The following call to PROC SQL displays the structure of the table (names and types of variables) and the contents. Equivalently, you can use PROC CONTENTS and PROC PRINT to show similar information for the view SASHELP.VTitle. The output shows the table for a new SAS session:

proc sql;
describe table Dictionary.Titles;         /* writes to SAS log */
select * from Dictionary.Titles;          /* display table */
quit;
 
proc contents data=Sashelp.VTitle;  run;  
proc print data=Sashelp.VTitle;     run;

The table contains three variables:

  • The TYPE variable is a one-character variable with the values 'T' (for title) or 'F' (for footnote).
  • The NUMBER variable is a numeric variable. The value '1' indicates the value of the TITLE1 or FOOTNOTE1 global statements. The value '2' indicates the value of the TITLE2 or FOOTNOTE2 statements, and so on.
  • The TEXT variable is a 256-character variable that contains the value of a title or footnote. The output shows that when you first start SAS, the TITLE1 statement is set to "The SAS System."

You can run an example to see how the contents of the view change after you submit TITLE and FOOTNOTE statements:

title "Normal Distribution";    /* alias for TITLE1 statement */
title2 "mu = 0; sigma = 1;";
footnote "N = 100";             /* alias for FOOTNOTE1 statement */
 
proc print data=Sashelp.VTitle;  run;

Putting a title into a macro variable

The structure of the DICTIONARY.Titles table implies that you can use a WHERE clause to subset the table. For example, the clause WHERE Type="T" & Number=1 selects only the row for the TITLE1 statement. Similarly, the clause WHERE Type="F" & Number=2 selects the row for the FOOTNOTE2 statement. You can use the SELECT INTO :MacroVar statement in PROC SQL to put data into a macro variable, as follows:

PROC SQL noprint;
select Text into :TitleText TRIMMED  /* put the trimmed value into a macro */
  from Dictionary.Titles
  where Type="T" & Number=1;
quit;
%put &=TitleText;
TITLETEXT=Normal Distribution

You need to be a little careful when using this technique in production code. If you ask for the TITLE2 information when that title is not set, then the WHERE clause will return an empty table. For example, if you clear the TITLE1 statement and rerun the previous PROC SQL statement you will see that the SAS log displays NOTE: No rows were selected and value of the TitleText macro is not updated.

One way to handle this potential problem is to use the %LET statement to set the macro variable to an empty value before you call PROC SQL. If the macro variable is empty after PROC SQL runs, then the requested title or footnote is not set. The following macro uses this technique to set a macro variable to the value of the Nth title, where you can specify the parameter N:

/* Get the N_th title into the &TitleText macro variable */
%macro GetTitle(Number=1);
%global TitleText;
%let TitleText = ;           /* value is empty if TITLEn does not exist */
PROC SQL noprint;
select Text into :TitleText TRIMMED    /* value is set if TITLEn exists */
  from Dictionary.Titles
  where Type="T" & Number=&Number;
quit;
%mend;
 
/* test the %GetTitle macro */
title "Lognormal Distribution";   /* set TITLE1; clear TITLE2 */
%GetTitle();
%put Title1 = "&TitleText";      /* text of title1 */
%GetTitle(Number=2);
%put Title2 = "&TitleText";      /* text of title1 */
Title1 = "Lognormal Distribution"
Title2 = ""

This basic macro is sufficient for my purposes. Feel free to propose improvements in the comments. Also, let me know how you use DICTIONARY tables in your work.

If you would like to learn more about DICTIONARY tables, the following two references will get you started. Many papers have been written about DICTIONARY tables and views. An internet search of the form
sas proceedings "dictionary table"
will reveal some of the papers from SAS conferences.

The post How to get the current TITLE in SAS appeared first on The DO Loop.

12月 132017
 

I'm a big fan of the Import Data task in SAS Enterprise Guide, especially for its support of text-based files (CSV, tab delimited, fixed width, and more). There's no faster method for generating SAS code that reads your data exactly the way you need it. I use the tool so often that I take for granted some of its neatest features, and I forget that many new users (and even veteran users) might not know about them. In this article, I'll review a few of the cool things that this task can do for you.

Read fixed-width text files into SAS

We think of CSV files (and...alas...Excel files) as the main standard for data exchange among systems, but many legacy systems still produce and consume fixed-width text data formats. The SAS DATA step is a perfect tool for reading these files, but defining the columns and their properties can be tedious. The "Fixed columns" option on the Import Data task can make this job simple.

Suppose that you're beginning with a spec like this:

And a raw data file like this:

You can use the Import Data wizard to define the boundaries of your columns by adding boundary lines with just click-and-drag operations. Beginning with the File->Import Data task, select your source text file and advance to the second page of the wizard. When you select "Fixed columns" as the input text format, you'll see a layout ruler that looks like this:

Click at the column boundaries (referring to your original spec!) and drag the rule lines as needed to define those column boundaries. Then click Next, and fill out details for the column names and types:

Which then tells the Import Data task how to generate the proper INPUT statements:

When you click Finish, you end up with a data set that's ready for business:

Modify the properties for multiple columns -- with one step

Here's a click-saving trick. Sometimes you have an input data file that contains many columns that share the same properties: type, length, and SAS format. It can be tedious to click and modify the properties of each column that you want to import. There's a shortcut on the Define Field Attributes page of the wizard that you can use to change the attributes for several columns at the same time. Simply SHIFT+Click to select multiple column definitions on the page, then click Modify.... The "Field Attributes for Multiple Selections" window appears, and you can change the necessary attributes just once and apply to the many items you picked.

This trick works as you import any text file or Excel file.

Create SAS program code that you can reuse anywhere

In a previous article I described how the Import Data task works "behind the scenes." Some of the magic that the task performs is not captured in SAS code, and that can present a challenge when you want to reuse this work in other settings -- for example, in a batch process or in a larger SAS program. However, with a couple of tweaks you can coerce the Import Data task into creating SAS code that you can almost just "lift and shift," as is.

The first option is hidden under the Performance window, labeled as "Bypass the data cleansing process." By default, the Import Data task reformats your input text file to normalize it for a cleaner import step. While doing no harm, most of the time this step isn't needed -- especially if your original data file is well formed. And since this step changes the input file, it's isn't repeatable outside of this task. My first tip for the best reusable code: click Performance... on the first page of the wizard, then select the "Bypass.." checkbox. That guarantees that the code will be formulated to read your original raw file. (Note that the Performance button is available only when importing text files, not Excel files.)

The second option you'll want to change is related to this, but you'll find it on the final page with the Advanced Options. Select "Generalize import step to run outside of SAS Enterprise Guide." This ensures that the task won't attempt any behind-the-scenes monkey business with your original file -- everything is captured in the DATA step that the task generates. Well, almost everything...

The one missing piece, a confounding factor when you select a local text file to import on a remote SAS Workspace session, is the transfer of the local file to the remote server. SAS Enterprise Guide copies the file for you -- behind the scenes -- and there is no SAS code to represent this step.

You can take control of even this step, though, if you make use of the Copy Files task (now available for you on the Tasks->Data menu). You can then copy the file from a local source folder, and land it wherever you want on the SAS server. Modify your newly repurposed Import Data code to pull from that server-based destination, giving you more control over the individual steps in the import process.

Learn more about importing text files

If you're new to importing data into SAS, whether using a SAS program or SAS Enterprise Guide, you might learn some of the basics from these video tutorials that were produced by SAS instructors:

12月 132017
 

I'm a big fan of the Import Data task in SAS Enterprise Guide, especially for its support of text-based files (CSV, tab delimited, fixed width, and more). There's no faster method for generating SAS code that reads your data exactly the way you need it. I use the tool so often that I take for granted some of its neatest features, and I forget that many new users (and even veteran users) might not know about them. In this article, I'll review a few of the cool things that this task can do for you.

Read fixed-width text files into SAS

We think of CSV files (and...alas...Excel files) as the main standard for data exchange among systems, but many legacy systems still produce and consume fixed-width text data formats. The SAS DATA step is a perfect tool for reading these files, but defining the columns and their properties can be tedious. The "Fixed columns" option on the Import Data task can make this job simple.

Suppose that you're beginning with a spec like this:

And a raw data file like this:

You can use the Import Data wizard to define the boundaries of your columns by adding boundary lines with just click-and-drag operations. Beginning with the File->Import Data task, select your source text file and advance to the second page of the wizard. When you select "Fixed columns" as the input text format, you'll see a layout ruler that looks like this:

Click at the column boundaries (referring to your original spec!) and drag the rule lines as needed to define those column boundaries. Then click Next, and fill out details for the column names and types:

Which then tells the Import Data task how to generate the proper INPUT statements:

When you click Finish, you end up with a data set that's ready for business:

Modify the properties for multiple columns -- with one step

Here's a click-saving trick. Sometimes you have an input data file that contains many columns that share the same properties: type, length, and SAS format. It can be tedious to click and modify the properties of each column that you want to import. There's a shortcut on the Define Field Attributes page of the wizard that you can use to change the attributes for several columns at the same time. Simply SHIFT+Click to select multiple column definitions on the page, then click Modify.... The "Field Attributes for Multiple Selections" window appears, and you can change the necessary attributes just once and apply to the many items you picked.

This trick works as you import any text file or Excel file.

Create SAS program code that you can reuse anywhere

In a previous article I described how the Import Data task works "behind the scenes." Some of the magic that the task performs is not captured in SAS code, and that can present a challenge when you want to reuse this work in other settings -- for example, in a batch process or in a larger SAS program. However, with a couple of tweaks you can coerce the Import Data task into creating SAS code that you can almost just "lift and shift," as is.

The first option is hidden under the Performance window, labeled as "Bypass the data cleansing process." By default, the Import Data task reformats your input text file to normalize it for a cleaner import step. While doing no harm, most of the time this step isn't needed -- especially if your original data file is well formed. And since this step changes the input file, it's isn't repeatable outside of this task. My first tip for the best reusable code: click Performance... on the first page of the wizard, then select the "Bypass.." checkbox. That guarantees that the code will be formulated to read your original raw file. (Note that the Performance button is available only when importing text files, not Excel files.)

The second option you'll want to change is related to this, but you'll find it on the final page with the Advanced Options. Select "Generalize import step to run outside of SAS Enterprise Guide." This ensures that the task won't attempt any behind-the-scenes monkey business with your original file -- everything is captured in the DATA step that the task generates. Well, almost everything...

The one missing piece, a confounding factor when you select a local text file to import on a remote SAS Workspace session, is the transfer of the local file to the remote server. SAS Enterprise Guide copies the file for you -- behind the scenes -- and there is no SAS code to represent this step.

You can take control of even this step, though, if you make use of the Copy Files task (now available for you on the Tasks->Data menu). You can then copy the file from a local source folder, and land it wherever you want on the SAS server. Modify your newly repurposed Import Data code to pull from that server-based destination, giving you more control over the individual steps in the import process.

Learn more about importing text files

If you're new to importing data into SAS, whether using a SAS program or SAS Enterprise Guide, you might learn some of the basics from these video tutorials that were produced by SAS instructors:

12月 132017
 

PROC FREQ is one of the most popular procedures in the SAS language.  It is mostly used to describe frequency distribution of a variable or combination of variables in contingency tables.  However, PROC FREQ has much more functionality than that.  For an overview of all that it can do, see an introduction of the SAS documentation. SAS Viya does not have PROC FREQ, but that doesn’t mean you can’t take advantage of this procedure when working with BIG DATA. SAS 9.4m5 integration with SAS Viya allows you to summarize the data appropriately in CAS, and then pass the resulting summaries to PROC FREQ to get any of the statistics that it is designed to do, from your favorite SAS 9.4 coding client. In this article, you will see how easy it is to work with PROC FREQ in this manner.

These steps are necessary to accomplish this:

  1. Define the CAS environment you will be using from a SAS 9.4m5 interface.
  2. Identify variables and/or combination of variables that define the dimensionality of contingency tables.
  3. Load the table to memory that will need to be summarized.
  4. Summarize the data with a CAS enable procedure, use PROC FEDSQL for high cardinality cases.
  5. Write the appropriate PROC FREQ syntax using the WEIGHT statement to input cell count data.

Step 1: Define the CAS environment

Before you start writing any code, make sure that there is an _authinfo file in the appropriate user directory in your system (for example, c:\users\<userid>$$ with the following information:

host <cas server name> port <number> user "<cas user id>" password "<cas user password>"

This information can be found by running the following statement in the SAS Viya environment from SAS Studio:

cas;  
caslib _all_ assign;

 

Then, in your SAS 9.4m5 interface, run the following program to define the CAS environment that you will be working on:

options cashost=" " casport=;
cas mycas user=;
libname mycas cas;
/** Set the in memory shared library if you will be using any tables already promoted in CAS **/
libname public cas caslib=public;

 

PROC FREQ FOR BIG DATA

Figure 1

Figure 1 shows the log and libraries after connecting to the CAS environment that will be used to deal with Big Data summarizations.

Step 2: Identify variables

The variables here are those which will be use in the TABLE option in PROC FREQ.  Also, any numeric variable that is not part of TABLE statement can be used to determine the input cell count.

Step 3: Load tale to memory

There are two options to do this.  The first one is to load the table to the PUBLIC library in the SAS Viya environment directly.  The second option is to load it from your SAS 9.4m5 environment into CAS. Loading data into CAS can be as easy as writing a DATA step or using other more efficient methods depending on where the data resides and its size.  This is out of scope of this article.

Step 4: Summarize the data with a CAS enable procedure

Summarizing data for many cross tabulations can become very computing expensive, especially with Big Data. SAS Viya offers several ways to accomplish this (i.e. PROC MEANS, PROC MDSUMMARY, PROC FEDSQL). There are ways to optimize performance for high cardinality summarization when working with CAS.  PROC FEDSQL is my personal favorite since it has shown good performance.

Step 5: Write the PROC FREQ

When writing the PROC FREQ syntax make sure to use the WEIGHT statement to instruct the algorithm to use the appropriate cell count in the contingency tables. All feature and functionality of PROC FREQ is available here so use it to its max! The DATA option can point to the CAS library where your summarized table is, so there will be some overhead for data transfer to the SAS 9 work server. But, most of the time the summarized table is small enough that the transfer may not take long.  An alternative to letting PROC FREQ do the data transfer is doing a data step to bring the data from CAS to a SAS base library in SAS 9 and then running PROC FREQ with that table as the input.

Example

Figure 2 below shows a sample program on how to take advantage of CAS from SAS 9.4m5 to produce different analyses using PROC FREQ.

PROC FREQ for big data

Figure 2

 

Figure 3 shows the log of the summarization in CAS for a very large table in 1:05.97 minutes (over 571 million records with a dimensionality of 163,964 for all possible combinations of the values of feature, date and target).  The PROC FREQ shows three different ways to use the TABLE statement to produce the desired output from the summarized table which took only 1.22 seconds in the SAS 9.4m5 environment.

PROC FREQ for big data

Figure 3

Program

 

/** Connect to Viya Environment **/
options cashost="xxxxxxxxxxx.xxx.xxx.com" casport=5570;
cas mycas user=calara;
libname mycas cas datalimit=all;
 
/** Set the in memory shared library **/
libname public cas caslib=public datalimit=all;
 
/** Define macro variables **/
/* CAS Session where the FEDSQL will be performed */
%let casref=mycas;
 
/* Name of the output table of the frequency counts */
%let outsql=sql_sum;
 
/* Variables for cross classification cell counts */
%let dimvars=date, feature, target;
 
/* Variable for which frequencies are needed */
%let cntvar=visits;
 
/* Source table */
%let intble=public.clicks_summary;
 
proc FEDSQL sessref=&casref.;
	create table &outsql. as select &dimvars., count(&cntvar.) as freq 
		from &intble. group by &dimvars.;
	quit;
run;
 
proc freq data=&casref..&outsql.;
	weight freq;
	table date;
	table date*target;
	table feature*date / expected cellchi2 norow nocol chisq noprint;
	output out=ChiSqData n nmiss pchi lrchi;
run;

PROC FREQ for Big Data was published on SAS Users.

12月 112017
 
Self-similar Christmas tree created in SAS

Happy holidays to all my readers! My greeting-card to you is an image of a self-similar Christmas tree. The image (click to enlarge) was created in SAS by using two features that I blog about regularly: matrix computations and ODS statistical graphics.

Self-similarity in Kronecker products

I have previously shown that the repeated Kronecker product of a binary matrix leads to self-similar structures.

Specifically, if M is a binary matrix then the Kronecker product M@M is a block matrix in which each 1 in the original matrix is replaced with a copy of M and each 0 is replaced by a zero block. (Here '@' is the Kronecker product (or direct product) operator, which is implemented in SAS/IML software.) If you repeat this process three or more times and create a heat map of the product matrix, the self-similar structure is apparent.

For example, the following 5 x 7 binary matrix has a 1 in positions that approximate the shape of a Christmas tree. The direct product M@M@M is a matrix that has 53 = 125 rows and 73 = 343 columns. If you use the HEATMAPDISC subroutine in SAS/IML to create a heat map of the direct product, you obtain the following image in which a green cell represents a 1:

proc iml;
M = {0 0 0 1 0 0 0,
     0 0 1 1 1 0 0,
     0 1 1 1 1 1 0,
     1 1 1 1 1 1 1,
     0 0 0 1 0 0 0};
Tree = M @ M @ M;     /* M is 5 x 7, so Tree is 125 x 343 */
 
ods graphics / width=300px height=450px  ANTIALIASMAX=50000;
call heatmapdisc(Tree) title="Happy Holidays to All my Readers!"
     colorramp={White Green} displayoutlines=0 ShowLegend=0;

That's not a bad result for three SAS/IML statements! You can see that the "tree" has a self-similar structure: the large tree is composed of 17 smaller trees, each of which is composed of 17 mini trees.

Adding a star

My readers are worth extra effort, so I want to put a star on the top of the tree. One way to do this is to use a scatter plot and plot a special observation by using a star symbol. To plot the "tree" by using a scatter plot, it is necessary to "unpack" the 125 x 343 matrix into a list of (x, y) values (this is a wide-to-long conversion of the matrix). You can use the NDX2SUB function to extract the row and column of each 1 in the matrix, as follows:

/* extract (row, column) pairs for matrix elements that are '1' */
idx = loc(Tree = 1);                /* indices for the 1s */
s = ndx2sub(dimension(Tree), idx);  /* convert indices to subscripts */
create Tree from s[c={"RNames" "CNames"}];
append from s;
close;
 
call symputx("XPos", range(s[,2])/2); /* midrange = horiz position of star */
quit;

The previous statements create a SAS data set called TREE that contains the (x, y) positions of the 1s in the direct product matrix. The statements also saved the midrange value to a macro variable. The following SAS procedures create the location of the star, concatenate the two data sets, and create a scatter plot of the result, which is shown at the top of this article.

data star;
x = &XPos; y = 0; 
run;
 
data Tree2;
set Tree star;
run;
 
ods graphics / width=600px height=800px  ANTIALIASMAX=50000;
title "Happy Holidays";
proc sgplot data=Tree2 noborder noautolegend;
scatter x=CNames y=RNames / markerattrs=(color=ForestGreen symbol=SquareFilled size=3);
scatter x=x y=y / markerattrs=(symbol=StarFilled size=20)             /* yellow star */
                  filledoutlinedmarkers markerfillattrs=(color=yellow);
yaxis reverse display=none;
xaxis display=none;
run;

Previous SAS-created Christmas cards

All year long I blog about how to use SAS for serious pursuits like statistical modeling, data analysis, optimization, and simulation. Consequently, I enjoy my occasional forays into a fun and frivolous topic such as how to create a greeting card with SAS. If you like seeing geeky SAS-generated images, here is a list of my efforts from previous years:

Wherever you are and whatever holiday you celebrate, may you enjoy peace and happiness this season and in the New Year.

The post A self-similar Christmas tree appeared first on The DO Loop.

12月 092017
 

From national parks and healthcare to taxes and nutrition, federal civilian agencies feature an incredibly large and diverse set of missions. These agencies oversee almost every aspect of American life with an endless sea of projects, programs and general oversight. But, as Deloitte Consulting’s Mark Urbanczyk said during a recent [...]

How federal civilian agencies can improve outcomes with analytics was published on SAS Voices by Steve Bennett