Could data governance policies for analytics be the foundation for a model governance program?

The post The convergence of data and model governance appeared first on The Data Roundtable.

September 19, 2019

Recent updates to SAS Grid Manager introduced many interesting new features, including the ability to handle more diverse workloads. In this post, we'll take a look at the steps required to get your SAS Grid Manager environment set up to accept jobs from outside of traditional SAS clients. We'll demonstrate the process of submitting Python code for execution in the SAS Grid.

Obviously, we need a SAS Grid Manager (SAS 9.4 Maintenance 6 or later) environment to be installed and configured. Once the grid is deployed, there's not a whole lot more to do on the SAS side. SAS Workspace Servers need to be configured for launching on the grid by converting them to load balanced Workspace Servers, changing the algorithm to 'Grid', and then selecting the relevant checkbox in **SASApp – Logical Workspace Server** properties to launch on the grid as shown below.

The only other things that might need configuring are, if applicable, .authinfo files and Grid Option Sets. Keep reading for more information on these.

In this example scenario, the client is the Windows desktop machine where I will write my Python code. SAS is not installed here. Rather, we will deploy and use **Jupyter Notebook** as our IDE to write Python code, which we'll then submit to the SAS Grid. We can get Jupyter Notebook by installing **Anaconda**, which is a free, bundled distribution of Python and R that comes with a solid package management system. (The following steps are courtesy of my colleague, Allan Tham.)

First, we need to download and install Anaconda.

Once deployed, we can open **Anaconda Navigator** from the Start menu and from there, we can launch Jupyter Notebook and create a notebook.

Now it's time to configure the connection from Python to our SAS environment. To do this, we use the SAS-developed open-source **SASPy** module for Python, which is responsible for converting Python code to SAS code and running it in SAS. Installing SASPy is simple. First, we need to download and install a Java Runtime Environment as a prerequisite. After that, we can launch **Anaconda Prompt** (a command line interface) from the Start menu and use pip to install the SASPy package.

```
pip install saspy
```

Connection parameters are defined in a file named sascfg.py. In fact, the best practice is to use sascfg.py as a template, and create a copy of it, which should be named sascfg_personal.py. SASPy will read and look for connections in both files. For the actual connection parameters, we need to specify the connection method. This depends largely on your topology.

In my environment, I used a Windows client to connect to a Linux server using a standard IOM connection. The most appropriate SASPy *configuration definition* is therefore '**winiomlinux**', which relies on a Java-based connection method to talk to the SAS workspace. This needs to be defined in the sascfg_personal.py file.

```python
SAS_config_names=['winiomlinux']
```

We also need to specify the parameters for this connection definition as shown below.

```python
# build out a local classpath variable to use below for Windows clients
# CHANGE THE PATHS TO BE CORRECT FOR YOUR INSTALLATION
cpW  =  "C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.svc.connection.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\log4j.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.security.sspi.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.core.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\saspyiom.jar"
```

Note that we are referencing a number of JAR files in the classpath that normally come with a SAS deployment. Since we don't have SAS on the client machine, we can copy these from the Linux SAS server. In my lab, these were copied from /SASDeploymentManager/9.4/products/deploywiz__94526__prt__xx__sp0__1/deploywiz/.

Further down in the file, we can see the connection parameters to the SAS environment. Update the path to the Java executable, the SAS compute server host, and the port for the Workspace Server.

```python
winiomlinux = {'java'      : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
               'iomhost'   : 'sasserver01.mydemo.sas.com',
               'iomport'   : 8591,
               'encoding'  : 'latin1',
               'classpath' : cpW
              }
```

**Note**: thanks to a recent contribution from the community, SASPy now also supports a native Windows connection method that doesn't require Java. Instead, it uses the SAS Integration Technologies client -- a component that is free to download from SAS. You'll already have this component if you've installed SAS Enterprise Guide, SAS Add-In for Microsoft Office, or Base SAS for Windows on your client machine. See instructions for this configuration in the SASPy documentation.

With the configuration set, we can import the SASPy module in our Jupyter Notebook.

Although we're importing sascfg_personal.py explicitly here, we can just call sascfg.py, as SASPy actually checks for sascfg_personal.py and uses that if it finds it. If not, it uses the default sascfg.py.
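The fallback logic is easy to picture with a small sketch: try the personal file first, then fall back to the default. The module names below are stand-ins chosen so the sketch runs anywhere; SASPy's actual loader differs in its details.

```python
import importlib

def load_first_available(candidates):
    """Return the first module in `candidates` that imports successfully."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("no candidate config module found")

# SASPy prefers sascfg_personal.py and falls back to sascfg.py; stdlib
# module names stand in here so the sketch is runnable anywhere.
cfg = load_first_available(["sascfg_personal_missing_example", "json"])
```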

To connect to the grid from Jupyter Notebook using SASPy, we need to instantiate a SASsession object and specify the configuration definition (e.g. winiomlinux). You'll get prompted for credentials, and then a message indicating a successful connection will be displayed.

It's worth noting that we could also specify the configuration file to use when we start a session by specifying:

```python
sas = saspy.SASsession(cfgfile='/sascfg_personal.py')
```

Behind the scenes, SAS will find an appropriate grid node on which to launch a Workspace Server session. The SASPy module works in any SAS 9.4 environment, regardless of whether it is a grid environment or not. It simply runs in a Workspace Server; in a SAS Grid environment, it just happens to be a Grid-launched Workspace Server.

Now let's execute some code in the Workspace Server session we have launched. Note that not all Python methods are supported by SASPy, but because the module is open source, anyone can add to or update the available methods. Refer to the API doc for more information.

Taking another example from Allan, let's view some of the content in a SASHELP data set.

The output is generated through ODS and converted (by SASPy) to a Pandas data frame for display in Jupyter Notebook.

The SAS Workload Orchestrator web interface allows us to view the IOM connections SASPy established from our client machine. We can see the grid node on which the Workspace Server was launched and view some basic information. The job will remain in a RUNNING state until the connection is terminated (i.e., the SASsession is closed by calling the endsas() method). By the same token, creating multiple SASsessions will result in multiple grid jobs running.

To see more details about what actually runs, Workspace Server logging must first be enabled. When we run myclass.head(), which displays the first 5 rows of the data set, we see the following written to the Workspace Server log.

```
2019-08-14T02:47:34,713 INFO [00000011] :sasdemo - 239  ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
2019-08-14T02:47:34,713 INFO [00000011] :sasdemo - 239 ! ods graphics on / outputfmt=png;
2019-08-14T02:47:34,779 INFO [00000011] :sasdemo - NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
2019-08-14T02:47:34,780 INFO [00000011] :sasdemo - 240  ;*';*";*/;
2019-08-14T02:47:34,782 INFO [00000045] :sasdemo - 241  data _head ; set sashelp.prdsale (obs=5 ); run;
2019-08-14T02:47:34,782 INFO [00000045] :sasdemo -
2019-08-14T02:47:34,783 INFO [00000045] :sasdemo - NOTE: There were 5 observations read from the data set SASHELP.PRDSALE.
2019-08-14T02:47:34,784 INFO [00000045] :sasdemo - NOTE: The data set WORK._HEAD has 5 observations and 5 variables.
2019-08-14T02:47:34,787 INFO [00000045] :sasdemo - NOTE: DATA statement used (Total process time):
2019-08-14T02:47:34,787 INFO [00000045] :sasdemo - real time     0.00 seconds
2019-08-14T02:47:34,787 INFO [00000045] :sasdemo - cpu time      0.00 seconds
2019-08-14T02:47:34,787 INFO [00000045] :sasdemo -
```

We can see the converted code, which in this case is a DATA step that creates a WORK table from the PRDSALE table with obs=5. Scrolling down, we also see the code that prints the output to ODS and converts it to a data frame.

The sascfg_personal.py file has an option for specifying an **authkey**, which is an identifier that maps to a set of credentials in an .authinfo file (or _authinfo on Windows). This can be leveraged to eliminate the prompting for credentials. For example, if your authinfo file looks like:

```
IOM_GELGrid_SASDemo user sasdemo password lnxsas
```

your configuration definition in your sascfg_personal.py should look like:

```python
winiomlinux = {'java'      : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
               'iomhost'   : 'sasserver01.mydemo.sas.com',
               'iomport'   : 8591,
               'encoding'  : 'latin1',
               'authkey'   : 'IOM_GELGrid_SASDemo',
               'classpath' : cpW
              }
```

There are special rules for how to secure the authinfo file (making it readable only by you), so be sure to refer to the instructions.
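For reference, each authinfo line is a key followed by field/value pairs, so it can be pulled apart mechanically. The helper below is hypothetical (SASPy does its own parsing); it just makes the format explicit.

```python
def parse_authinfo_line(line):
    """Split an authinfo line into its key and field/value pairs.
    Format: <authkey> field1 value1 field2 value2 ..."""
    tokens = line.split()
    key = tokens[0]
    fields = dict(zip(tokens[1::2], tokens[2::2]))
    return key, fields

key, creds = parse_authinfo_line("IOM_GELGrid_SASDemo user sasdemo password lnxsas")
```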

What if you want your code to run in the grid with certain parameters or options by default? For instance, say you want all Python code to be executed in a particular grid queue. SASPy can do this by leveraging Grid Option Sets. The process is outlined here, but in short, a new SASPy 'SAS Application' has to be configured in SAS Management Console, which is then used to map to the Grid Option Set (created using the standard process).

My sincere thanks to Allan Tham and Greg Wootton for their valued contributions to this post.

Please refer to the official SAS Grid documentation for more information on *SAS Grid Manager in SAS 9.4M6*.

Thank you for reading. I hope the information provided in this post has been helpful. Please feel free to comment below to share your own experiences.

Using Python to run jobs in your SAS Grid was published on SAS Users.

September 19, 2019

September 18, 2019

In a scatter plot that displays many points, it can be important to visualize the density of the points. Scatter plots (indeed, all plots that show individual markers) can suffer from overplotting, which means that the graph does not indicate how many observations are at a specific (*x, y*) location.
You can overcome that problem by moving from a "point plot" to an "area plot" such as a heat map or a contour plot. These plots do a good job communicating density, but they typically do not show individual points, which can be a drawback if you are interested in displaying outliers.

This article shows the following four methods of visualizing the density of bivariate data:

- Use a heat map to visualize the density. The individual markers are not shown, but outliers are visible.
- Use a contour map to visualize the density. The individual markers are not shown.
- Use a scatter plot to show the markers. Use transparency to visualize the density of points.
- Use a scatter plot to show the markers. Use color to visualize the density of points.

The following SAS DATA step extracts data from the Sashelp.Heart data set and will be used to create all graphs. The data are measurements of the systolic blood pressure (the "top number") and cholesterol levels of 5,057 patients in a heart study. For convenience, the Systolic variable is renamed X and the Cholesterol variable is renamed Y:

```sas
data Have;
   set sashelp.Heart(keep=Systolic Cholesterol);
   rename Systolic=x Cholesterol=y;
   if cmiss(of _NUMERIC_)=0;   /* keep only complete cases */
run;
```

The simplest way to visualize the density of bivariate data is to bin the data and use color to indicate the number of observations in each bin. I've written many articles about how to perform two-dimensional binning in SAS. However, if you only want to visualize the data, you can use the HEATMAP statement in PROC SGPLOT, which automatically bins the data and creates a heat map of the counts:

```sas
title "Use Heat Map to Show Density";
title2 "Individual Observations Are Not Shown";
proc sgplot data=Have;
   styleattrs wallcolor=verylightgray;
   heatmap x=x y=y;
   xaxis grid; yaxis grid;
run;
```

The graph shows that the region that has the highest density is near (x,y) = (125, 225). The pink and red rectangles in the graph contain 20 or more observations. The outliers in the graph are visible.

Because the default color ramp includes white as one of its colors, I changed the color of the background to light gray. You can use the COLORMODEL= option on the HEATMAP statement to define and use other color ramps. I also used the default binning options for the heat map. You can use the NXBINS= and NYBINS= options to control the number of bins. Alternatively, you can use the XBINSIZE= and YBINSIZE= options to specify the size of the bins in data units.

For large data sets, the default number of bins might be too large. For most data, 200 bins in each direction shows the density on a fine scale, so you can use the options NXBINS=200 NYBINS=200. Of course, you can use smaller numbers (like 50) to visualize the density on a coarser scale.
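For intuition, the counting that underlies a heat map can be sketched in a few lines of Python. This is not the SGPLOT implementation, just the idea of assigning points to an nx-by-ny grid of bins and counting occupancy.

```python
from collections import Counter

def bin_counts_2d(points, x_range, y_range, nx, ny):
    """Count points per (ix, iy) cell over an nx-by-ny grid of bins."""
    (xmin, xmax), (ymin, ymax) = x_range, y_range
    counts = Counter()
    for x, y in points:
        # Clamp the top edge into the last bin, as binning routines usually do.
        ix = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        iy = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[(ix, iy)] += 1
    return counts

# Three points clustered near (125, 225) plus one outlier:
pts = [(125, 225), (126, 224), (124, 226), (190, 400)]
counts = bin_counts_2d(pts, (100, 200), (100, 500), nx=10, ny=10)
```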

The previous graph is somewhat "chunky." That is because the heat map is a density estimate that is created by binning the data. The result depends on the bin size and the anchor locations. You can obtain a smoother estimate by using a kernel density estimate (KDE). In SAS, PROC KDE enables you to obtain a smooth density estimate. The procedure also creates a contour plot of the resulting density estimate, as shown in the following call:

```sas
title "Use Contour Map to Show Density";
title2 "Individual Observations Are Not Shown";
proc kde data=Have;
   bivar x y / bwm=0.25 ngrid=250 plots=contour;
run;
```

The graph shows the density estimate. Unfortunately, the units of density (proportion of data per square unit) are not intuitive. Notice that individual observations (including outliers) are not visible in the contour plot. To get very small "hot spots," I used BWM=0.25 to select a small kernel bandwidth. (BWM stands for "bandwidth multiplier.") If you use the default value (BWM=1), you will get one large region of high density instead of several smaller regions. Similarly, the NGRID=250 option computes the KDE on a fine grid. You will probably need to play with the BWM= option to determine which value is best for your needs.

PROC KDE creates the contour plot with a default title, but you can use the ODS Graphics Editor to change the plot title.

I've previously written about how to use a scatter plot and marker transparency to indicate density. The value for the transparency parameter depends on the data. You might have to experiment with several values (0.75, 0.8, 0.9, ...) until you find the right balance between showing individual observations and showing the density.

```sas
title "Use Transparency in a Scatter Plot";
title2 "Individual Observations Shown, but Faint";
proc sgplot data=Have;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled) transparency=0.9;
   xaxis grid; yaxis grid;
run;
```

In the scatter plot, you can see that many data values are rounded to multiples of 5 and 10. You can see clear striations at the values 110, 120, 130, 140, 150, and 160. Recall that the X variable is the systolic blood pressure for patients. Evidently, there is a tendency to round these numbers to the nearest 5 or 10.

In this plot, you can see the individual outliers, but they are quite faint because the TRANSPARENCY= option is set to 0.9. If you increase the parameter further to 0.95 or higher, it becomes increasingly difficult to see the individual markers. Although you can see that the highest density region is near (x,y) = (125, 225), it is not clear how to associate a number with the density. Does a certain (x,y) value have 10 points overplotted? Or 20? Or 50? It's impossible to tell.

The last method borrows ideas from each of the previous methods. I was motivated by some graphs I saw by Lukas Kremer and Simon Anders (2019). My implementation in SAS is quite different from their implementation in R, but their graphs provided the inspiration for my graphs. The idea is to draw a scatter plot where the individual markers are colored according to the density of the region. Kremer and Anders use a count of neighbors that are within a given distance from each point, which is an expensive computation. It occurred to me that you can use a density estimate instead. Because you can compute a KDE by using a fast Fourier transform, it is a much faster computation. In addition, the KDE procedure supports the SCORE statement, which provides a convenient way to associate a density value with every observation. This is implemented in the following statements:

```sas
title "Use Scatter Plot to Show Density";
title2 "Individual Observations Shown";
proc kde data=Have;
   bivar x y / bwm=0.25 ngrid=250 plots=none;
   score data=Have out=KDEOut(rename=(value1=x value2=y));
run;

proc sort data=KDEOut;   /* Optional: but a good idea */
   by density;
run;

proc sgplot data=KDEOut;
   label x= y=;
   styleattrs wallcolor=verylightgray;
   scatter x=x y=y / colorresponse=density colormodel=ThreeColorRamp
                     markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;
```

The graph is a fusion of the other three graphs. Visually, it looks very similar to the heat map. However, it uses colors based on a density scale (like the contour plot) and it plots individual markers (like the scatter plot).
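Stripped of the SAS specifics, the scoring-and-sorting idea can be sketched in pure Python: evaluate a Gaussian kernel density estimate at each point, then sort ascending so the densest points are drawn last (on top). The bandwidths below are arbitrary illustrative values, and a real implementation would use an FFT-based KDE as PROC KDE does rather than this brute-force sum.

```python
import math

def kde_score(points, query, hx, hy):
    """Gaussian kernel density estimate at `query` with bandwidths (hx, hy)."""
    total = 0.0
    for px, py in points:
        total += math.exp(-0.5 * (((query[0] - px) / hx) ** 2
                                  + ((query[1] - py) / hy) ** 2))
    return total / (len(points) * 2 * math.pi * hx * hy)

# Three clustered points plus one outlier:
pts = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (5.0, 5.0)]
# Sort ascending by density so dense points are plotted on top:
scored = sorted(pts, key=lambda p: kde_score(pts, p, 1.0, 1.0))
```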

In summary, this article shows four ways to visualize the density of bivariate data.

- The heat map provides a nice compromise between seeing the density and seeing outliers. As a bonus, the heat map shows the density in terms of counts per bin, which is an intuitive quantity.
- The contour plot of density does not show individual markers, which can be an advantage if you want to focus on the bulk of the data and ignore outliers. You can use the BWM= option to control the scale of the bandwidth parameter for estimating the density.
- The transparent scatter plot shows outliers but doesn't show the density very well.
- The colored scatter plot combines the strengths (and weaknesses) of the contour plot and the transparent scatter plot. By using colors to indicate density, the scatter plot is easier to interpret than the transparent scatter plot.

For most situations, I recommend the heat map, although I also like the scatter plot that assigns colors to markers based on density. What are your thoughts? Do you prefer one of these graphs? Or perhaps you prefer a different graph for visualizing the density of bivariate data? Leave a comment.

The post 4 ways to visualize the density of bivariate data appeared first on The DO Loop.

September 17, 2019

September 16, 2019

Just when you think you’ve seen it all, life can surprise you in a big way, making you wonder what else you've missed.

That is what happened when I recently had a chance to work with the SAS® Scalable Performance Data Server, a product that's been around a while but that I had never crossed paths with before. I open an SPDS table of a *hundred million* records in SAS® Enterprise Guide, and I can scroll it as fast as if it were an Excel “baby” spreadsheet of a hundred rows. That’s how powerful it feels, to say nothing of the lightning speed of the queries.

Also known as the SAS SPD Server (or SPDS), it's a data storage system designed for high-performance data delivery. Its primary purpose is to provide rapid queries of vast amounts of data. We are talking terabytes of data with tables containing billions of rows. SPDS employs parallel storage and efficient indexing, coupled with a multi-threaded server system concurrently processing tasks distributed across multiple processors.

Availability of the SPDS client in SAS® Viya effectively integrates SAS SPDS with SAS Viya, extending the functionality of Viya applications beyond the native Cloud Analytic Services (CAS) so that you can continue reaping all the benefits of SAS SPDS.

In addition to connecting to SPD Server with explicit SQL pass-through, connection to SPD Server with a LIBNAME statement is available as well, for example:

```sas
libname mylibref sasspds 'serverdomain'
        host='nodename_or_ip'
        service='5400'
        user='mySPDuserid'
        password='{SAS003}XXXXXXX...XXX';
```

This effectively creates an SPDS library, and the tables in that library can be referenced by the two-level name mylibref.tablename as if it were a Base SAS library.

Besides ordinary data tables, an SPDS library offers so-called dynamic cluster tables – or **clusters** for short – enabling transparent access to large amounts of data.

Dynamic cluster tables (cluster tables or clusters) are virtual tables that allow users to access many server tables (member tables) as if they were one table. A dynamic cluster table is a collection of SPD Server tables that are presented to the end-user application as a single table through a metadata layer acting like a view.

Member tables can be added to the cluster as well as replaced and removed from the cluster.

PROC SPDO is the SAS procedure for the SPD Server operator interface. It performs a wide range of SPD Server, user, and table management tasks:

- create, list, modify, destroy, and undo dynamic cluster tables
- add, remove, replace, and fix cluster table members
- add, modify, list, and delete access control lists (ACLs) for server resources
- define, describe, and remove WHERE constraints on tables for row-level security definition and management
- issue system commands on server nodes

In addition to PROC SPDO, SPD Server plug-in for SAS® Data Management Console is also available.

If you open an SPDS library in SAS Enterprise Guide, you won’t be able to tell which table in that library is a member table and which is a cluster table – they all look the same. But in many cases, we need to know what is what. Moreover, for data-driven processing we need to capture the SPDS library objects into a dataset and identify them whether they are clusters or member tables.

Luckily, PROC CONTENTS with the OUT= option allows us to do just that. While the MEMTYPE column is equal to 'DATA' for both clusters and member tables, there is another, lesser-known column with the inverse name, **TYPEMEM**, which has a value of **'DATA'** for clusters and a blank value for member tables:

```sas
proc contents data=SPDSLIB._all_ out=WORK.ALLOBJECTS (keep=MEMNAME TYPEMEM);
run;

proc sort data=WORK.ALLOBJECTS nodupkey;
   by MEMNAME;
run;

data WORK.SPDSTYPES;
   set WORK.ALLOBJECTS;
   attrib TABLETYPE $7 label='SPDS table type';
   select(TYPEMEM);
      when('DATA') TABLETYPE = 'CLUSTER';
      when('')     TABLETYPE = 'MEMBER';
      otherwise    TABLETYPE = '';
   end;
run;
```

In this code, PROC CONTENTS produces one record per column NAME in every object MEMNAME in the SPDS library; PROC SORT reduces (de-duplicates) this list to one record per MEMNAME; finally, the DATA step creates the TABLETYPE column indicating which MEMNAME is a CLUSTER and which is a MEMBER.
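The classification logic itself is independent of SAS: de-duplicate by member name and map TYPEMEM to a label. Here is a Python sketch of the same logic with mocked PROC CONTENTS output (one row per column, as described above); the table names are made up for illustration.

```python
def classify_spds_objects(contents_rows):
    """contents_rows: (MEMNAME, TYPEMEM) pairs, one per column, as PROC
    CONTENTS OUT= would produce. Returns {MEMNAME: table type}."""
    table_type = {}
    for memname, typemem in contents_rows:     # later rows overwrite duplicates
        if typemem == "DATA":
            label = "CLUSTER"
        elif typemem == "":
            label = "MEMBER"
        else:
            label = ""                         # mirror the SAS OTHERWISE branch
        table_type[memname] = label
    return table_type

rows = [("SALES_ALL", "DATA"), ("SALES_ALL", "DATA"),   # cluster: one row per column
        ("SALES_2019", ""), ("SALES_2019", "")]          # member table
result = classify_spds_objects(rows)
```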

In addition to retrieving a list of objects in the SPDS library described above, we also need a way of capturing the content (a list of members) of the cluster itself in order to control removing or replacing its members. PROC SPDO’s CLUSTER LIST statement produces such a list, and its OUT= option allows you to dump that list into a dataset:

```sas
proc spdo lib=SPDSLIB;
   cluster list CLUSTER1 out=CLUSTER1_MEMBERS;
   cluster list CLUSTER2 out=CLUSTER2_MEMBERS;
   /* ... */
   cluster list CLUSTERN out=CLUSTERN_MEMBERS;
quit;
```

This approach creates one output table per cluster. You can't reuse the same OUT= destination table for different clusters, because it would be overwritten by each subsequent CLUSTER LIST statement rather than appended to.

If you need to capture contents of several clusters into one dataset, then instead of the above method of outputting each cluster content into separate table and then appending (concatenating) them, the good old ODS OUTPUT with CLUSTERLIST= option allows us to do it in a single step:

```sas
ods noresults;
ods output clusterlist=WORK.CLUSTER_MEMS;
proc spdo lib=SPDSLIB;
   cluster list CLUSTER1;
   cluster list CLUSTER2;
   /* ... */
   cluster list CLUSTERN;
quit;
ods output close;
ods results;
```

As an additional bonus, ODS NORESULTS suppresses printed output when it's not needed, e.g., for automated data-driven processing.

What is your experience with SAS SPDS? How might you use it in the future? Please comment below.

How to retrieve contents of SAS® Scalable Performance Data Server library was published on SAS Users.

September 16, 2019

A moving average is a statistical technique that is used to smooth a time series. My colleague, Cindy Wang, wrote an article about the Hull moving average (HMA), which is a time series smoother that is sometimes used as a technical indicator by stock market traders. Cindy showed how to compute the HMA in SAS Visual Analytics by using a GUI "formula builder" to compute new columns in a data table. This article shows how to implement the Hull moving average in the SAS/IML language. The SAS/IML language is an effective way to implement new (or existing) smoothers for time series because it is easy to use lags and rolling windows to extract values from a time series.

I have previously written about how to compute common moving averages in SAS/ETS (PROC EXPAND) and the DATA step. I have also shown how to compute a weighted moving average (WMA) in the SAS/IML language, which includes simple moving averages and exponential moving averages. This article shows how to implement the Hull moving average formula in SAS by leveraging the WMA function in the previous article.

Every choice of weights in a weighted moving average leads to a different smoother.
This article uses weights that are linear in time.
Specifically, if *Y* is a time series, the weighted moving average at time index *t* is obtained by the formula

WMA(*t*, *n*) = (1·*Y*[*t* − *n* + 1] + 2·*Y*[*t* − *n* + 2] + 3·*Y*[*t* − *n* + 3] + ... + *n*·*Y*[*t*]) / (1 + 2 + ... + *n*)

In general, the divisor in the formula is the sum of whatever weights are used.
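Read literally, the formula transcribes into a few lines of Python. This is an illustration of the arithmetic only, not the SAS/IML implementation used in this post; it computes full windows only, so the first *n* − 1 values are skipped.

```python
def wma(y, n):
    """Weighted moving average with linear weights 1..n (full windows only)."""
    denom = n * (n + 1) // 2            # 1 + 2 + ... + n
    out = []
    for t in range(n - 1, len(y)):
        window = y[t - n + 1 : t + 1]   # the n most recent values through time t
        out.append(sum((i + 1) * v for i, v in enumerate(window)) / denom)
    return out

# For y = [1, 2, 3, 4] and n = 2, each value is (1*prev + 2*curr) / 3:
result = wma([1, 2, 3, 4], 2)
```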

Although this article focuses on the Hull moving average, the techniques are broadly applicable to other moving average computations.

Cindy's article contains formulas that show how to compute the Hull moving average (HMA). The HMA is the moving average of a linear combination of two other weighted moving averages, one on a shorter time scale and the other on a longer time scale.
Given a time series, Y, choose a lag time, *n*, which is sometimes called the window length. You can compute the Hull moving average by computing four quantities:

- Compute X_{short} as the weighted moving average of Y by using a short window of length *n*/2.
- Compute X_{long} as the weighted moving average of Y by using the longer window of length *n*.
- Define the series X = 2·X_{short} − X_{long}.
- The Hull moving average of Y is the weighted moving average of X by using a window of length sqrt(*n*).
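For illustration, the four steps can also be sketched in pure Python. This is not the SAS/IML implementation shown in this post: it computes full windows only (dropping leading values instead of filling them in), and the trailing-end alignment of the two smoothers is a simplification I've assumed here.

```python
import math

def wma(y, n):
    """Linear-weight moving average; full windows only (first n-1 points dropped)."""
    denom = n * (n + 1) / 2
    return [sum((i + 1) * v for i, v in enumerate(y[t - n + 1 : t + 1])) / denom
            for t in range(n - 1, len(y))]

def hma(y, n):
    """Hull moving average of y with lag n, following the four steps above."""
    x_short = wma(y, round(n / 2))          # short smoother (n/2)
    x_long = wma(y, n)                      # longer smoother (n)
    # Align the two series on their trailing ends before combining (assumption):
    m = min(len(x_short), len(x_long))
    x = [2 * s - l for s, l in zip(x_short[-m:], x_long[-m:])]
    return wma(x, round(math.sqrt(n)))      # final smoother of length sqrt(n)

# Sanity check: a constant series stays constant under every step.
res = hma([10.0] * 12, 4)
```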

The following SAS/IML program implements the Hull moving average. The WMA function is explained in my previous blog post. The HMA function computes the Hull moving average by calling the WMA function three times, twice on Y and once on X. It requires only four statements, one for each computed quantity:

```sas
proc iml;
/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs: y    column vector of length N >= k
           wt   column vector of weights. wt[k] is the weight for the most
                recent data value; wt[1] is for k time units ago.
   The function internally standardizes the weights so that sum(wt)=1.
   Example calls:
   WMA = WMA(y, 1:5);         is the weighted moving average of the most recent 5 points
   SMA = WMA(y, {1 1 1 1 1}); is the simple moving average of the most recent 5 points
   See https://blogs.sas.com/content/iml/2016/02/03/rolling-statistics-sasiml.html
*/
start WMA(y, wt);
   w = colvec(wt) / sum(wt);   /* standardize weights so that sum(w)=1 */
   k = nrow(w);                /* length of lag */
   MA = j(nrow(y), 1, .);
   /* handle first k values separately */
   do i = 1 to k-1;
      wIdx = k-i+1:k;                                /* index for previous i weights */
      MA[i] = sum(wt[wIdx]#y[1:i]) / sum(wt[wIdx]);  /* weighted average */
   end;
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;            /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );    /* weighted sum of k data values */
   end;
   return ( MA );
finish;

/* Hull moving average HMA(y, nLag) at each time point.
   All computations use linear weights 1:k for various choices of k. */
start HMA(y, nLag);
   Xshort = WMA(y, 1:round(nLag/2));   /* short smoother (nLag/2) */
   Xlong  = WMA(y, 1:nLag);            /* longer smoother (nLag) */
   X = 2*Xshort - Xlong;               /* combination of smoothers */
   HMA = WMA(X, 1:round(sqrt(nLag)));  /* Hull moving average (length sqrt(nLag)) */
   return HMA;
finish;

/* test on a simple time series */
y = T({0 1 0 1 3 2 3 3});
nLag = 4;
WMA = WMA(y, 1:nLag);
HullMA = HMA( y, nLag );
print y WMA HullMA;
```

The table shows the weighted moving average and the Hull moving average applied to a simple time series with eight observations. The smoothed values are shown. As a general rule, the Hull moving average tends to be smoother than a raw weighted moving average. For a given lag parameter, it responds more quickly to changes in Y.

However, it is important to realize that the HMA is not a smoother of Y. Rather, it smooths a linear combination of other smoothers. Consequently, the value of the HMA at time *t* can be outside of the range of the series {Y_{1}, Y_{2}, ..., Y_{t}}. This is seen in the last observation in the table. The HMA has the value 3.08 even though the range of Y is [0, 3]. This is shown in the following graph, which plots the series, the weighted moving average, and the Hull moving average.

This section reproduces the graphs in Cindy's blog post by applying the Hull moving average to the monthly closing price of IBM stock. The following statements compute three different moving averages (Hull, simple, and weighted) and use PROC SGPLOT to overlay the simple and Hull moving averages on a scatter plot that shows the closing price and a high-low plot that shows the range of stock values for each month.

```sas
use Sashelp.Stocks where(stock='IBM');   /* read stock data for IBM */
   read all var {'Close'} into y;
close;
nLag = 5;
SMA = WMA(y, j(1,nLag,1));   /* simple MA   */
WMA = WMA(y, nLag);          /* weighted MA */
HMA = HMA(y, nLag);          /* Hull MA     */
create out var {SMA WMA HMA};  append;  close;   /* write moving averages to SAS data set */
QUIT;

data All;   /* merge data and moving averages */
   merge Sashelp.Stocks(where=(stock='IBM')) out;
run;

ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=All noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel);
   yaxis grid max=200 label="IBM Closing Price";
run;
```

It is well known that the simple moving average (green line) lags the data. As Cindy points out, the Hull moving average (orange line) responds more quickly to changes in the data. She compares the two smoothers in the months after June 1992 to emphasize that the HMA is closer to the data than the simple moving average.

In summary, this article shows how to implement a custom time series smoother in SAS. The example uses the Hull moving average and reproduces the computations and graphs shown in Cindy's article. The techniques apply to any custom time series smoother and time series computation. The SAS/IML language makes it easy to compute rolling-window computations because you can easily access segments of a time series.

The post The Hull moving average: Implement a custom time series smoother in SAS appeared first on The DO Loop.

September 13, 2019

Time series decomposition is an important technique for time series analysis, especially for seasonal adjustment and trend strength measurement. Decomposition deconstructs a time series into several components, each representing a certain pattern or characteristic. This post shows you how to use SAS® Visual Analytics to visually show the decomposition of a time series so that you can understand more about its underlying patterns.

Time series decomposition generally splits a time series into three components: 1) a trend-cycle component, which can be further decomposed into trend and cycle components; 2) a seasonal component; and 3) a residual component. The components are combined in either an additive or a multiplicative fashion.

In additive decomposition, the cyclical, seasonal, and residual components are absolute deviations from the trend component and do not depend on the trend level. In multiplicative decomposition, the cyclical, seasonal, and residual components are relative deviations from the trend. In both cases the trend component keeps the same scale as the original series, so we often see very different magnitudes for the seasonal, cyclical, and residual components when comparing them with the trend component.
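The arithmetic behind the two modes, and behind the log transformation discussed in this post, is easy to demonstrate. Here is a tiny Python illustration with made-up component values (not output from any SAS procedure): taking logs turns a multiplicative combination of components into an additive one.

```python
import math

# Hypothetical component values at one time point (illustrative only).
# In multiplicative decomposition, the seasonal and residual components
# are relative deviations (ratios around 1), not absolute ones.
trend_cycle, seasonal, residual = 100.0, 1.2, 0.98

# Multiplicative: the original value is the product of the components
original = trend_cycle * seasonal * residual

# Taking logs converts the product into a sum (an additive decomposition)
log_sum = math.log(trend_cycle) + math.log(seasonal) + math.log(residual)

print(round(original, 4))            # 117.6
print(round(math.exp(log_sum), 4))   # 117.6 -- the same value, recovered from the sum
```

This is exactly why a log-transformed multiplicative series can be displayed, and verified, with simple sums.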

SAS provides several procedures for time series decomposition; this post uses PROC TIMESERIES. The first step is to decide between additive and multiplicative decomposition. PROC TIMESERIES provides multiplicative (MODE=MULT), additive (MODE=ADD), pseudo-additive (MODE=PSEUDOADD), and log-additive (MODE=LOGADD) decomposition. You can also keep the default MODE=MULTORADD option and let SAS decide based on the features of your data. Conveniently, you can always use a log transformation whenever you need to change a multiplicative relationship into an additive one. The plot option in PROC TIMESERIES can produce graphs of the generated trend-cycle, seasonal, and residual components. In this post, however, I output the OUTDECOMP= data set from PROC TIMESERIES, load the data, and visualize the decomposed series with SAS Visual Analytics to understand more about their patterns.

As an example, I decompose the time series in the SASHELP.AIR data set, which contains monthly international air travel data from January 1949 through December 1960, as pictured below:

We see an obvious upward trend and significant seasonality in the original series, with increasingly intense fluctuation around the trend. This indicates that a multiplicative decomposition of the trend and seasonality components is more appropriate. I generated the decomposed components using this SAS code. I did not explicitly specify the MODE= option, letting SAS use the default MODE=MULTORADD. Since the values in this time series are strictly positive, SAS ultimately applies MODE=MULT and generates the decomposed series in the OUTDECOMP= data set (see details in the documentation).
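The SAS program itself is only linked above, but the classical multiplicative decomposition that PROC TIMESERIES performs can be sketched in a few lines of Python. The sketch below is illustrative, not a reimplementation of the procedure: it uses a small synthetic series with period 4 (instead of 12, as in SASHELP.AIR), estimates the trend with a centered moving average, and recovers the seasonal ratios by averaging the detrended values by season.

```python
period = 4
pattern = [0.8, 1.1, 1.3, 0.8]        # true seasonal ratios (they average to 1)
y = [(10 + t) * pattern[t % period] for t in range(12)]   # linear trend x seasonality

def centered_ma(series, p):
    """Centered moving average of even length p (a '2 x p-MA'):
    the two endpoints get half weight so each season is covered exactly once."""
    half = p // 2
    out = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        out[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / p
    return out

trend = centered_ma(y, period)        # trend-cycle estimate

# Seasonal component: average the detrended ratios y/trend by season
ratios = {s: [] for s in range(period)}
for t, tr in enumerate(trend):
    if tr is not None:
        ratios[t % period].append(y[t] / tr)
seasonal = [sum(r) / len(r) for r in (ratios[s] for s in range(period))]

print([round(s, 2) for s in seasonal])   # close to the true pattern, e.g. [0.8, 1.09, 1.31, 0.81]
```

The recovered ratios are close to, but not exactly, the true pattern because the moving-average trend absorbs a little of the seasonality; production procedures refine this estimate further.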

When you load the data set into SAS Visual Analytics, it is straightforward to draw a time series plot for each of the decomposed series.

Note that the magnitudes of the Trend-Cycle-Seasonal and Trend-Cycle components are much larger than those of the Seasonal, Irregular and Cycle components. The upward trend and increasing volatility of the Trend-Cycle-Seasonal component reveal an obvious multiplicative composition of Trend-Cycle and Seasonal components. The formula should be: **Trend-Cycle-Seasonal Component = Trend-Cycle Component * Seasonal Component.**

I can easily make the log transformation of the decomposition series using the calculated item in SAS Visual Analytics, and accordingly show the additive relationship of the transformed series. The visualization below shows the additive relationship of the log transformation of the Trend-Cycle-Seasonal component with the log transformations of Trend-Cycle component and Seasonal component, which is the equivalent of the pre-transformed multiplicative relationship.

In the visualization below, the moss-green line series at the bottom of the chart shows the Log Seasonal component, with each vertical line representing its value. The lines at the top show that adding the value of the mint-green vertical lines (the Log Seasonal component) to the orange line series (the Log Trend-Cycle component) yields the pine-green line series (the Log Trend-Cycle-Seasonal component).

In the list table, note that the value of the calculated item 'Trend-Cycle Component * Seasonal Component' equals the 'Trend-Cycle-Seasonal Component' value highlighted in blue, which confirms that the 'Trend-Cycle Component' and the 'Seasonal Component' multiply to give the 'Trend-Cycle-Seasonal Component.' Likewise, the sum of the calculated item 'Log Trend-Cycle Component' and the 'Log Seasonal Component' equals the value of the 'Log Trend-Cycle-Seasonal Component' highlighted in light green. These verify the multiplicative and additive relationships, respectively.

Beyond the multiplicative decomposition above, we can find more multiplicative and additive relationships among the original series and the decomposed series. Here are the formulas:

**Original Series = Trend-Cycle-Seasonal Component * Irregular Component**

**Seasonal-Irregular Component = Seasonal Component * Irregular Component**

**Original Series = Seasonal Adjusted Series * Seasonal Component**

**Trend-Cycle Component = Trend Component + Cycle Component ^{1}**

**[** ^{1} Note: Despite setting the MODE=MULT option, SAS PROC TIMESERIES uses the Hodrick-Prescott filter, which always decomposes the trend-cycle component into the trend component and cycle component in an additive fashion. **]**
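Those relationships are easy to sanity-check with toy numbers. The short Python snippet below uses hypothetical component values (levels for trend and cycle, ratios for seasonal and irregular; none of these come from the AIR decomposition) and verifies each identity, including the additive trend-cycle split noted in the footnote.

```python
# Hypothetical component values for one observation (illustrative only)
trend, cycle = 250.0, 10.0            # levels
seasonal, irregular = 1.15, 0.99      # multiplicative ratios around 1

trend_cycle = trend + cycle                        # additive split (see footnote)
tcs = trend_cycle * seasonal                       # Trend-Cycle-Seasonal component
original = trend_cycle * seasonal * irregular      # original series value
seasonal_irregular = seasonal * irregular          # Seasonal-Irregular component
seasonally_adjusted = trend_cycle * irregular      # series with seasonality removed

# Each identity from the list above holds:
assert abs(original - tcs * irregular) < 1e-9
assert abs(original - seasonally_adjusted * seasonal) < 1e-9
assert abs(original - trend_cycle * seasonal_irregular) < 1e-9
print(round(original, 2))   # 296.01
```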

Because the decomposed data set from any time series has the fixed structure shown below, we can easily apply these SAS Visual Analytics visualizations to the decomposed series from different time series. Simply point the report at the new data set: all the calculated items are inherited, and the new data is applied to the visualizations automatically. That is what I like most about visualizing time series decomposition in SAS Visual Analytics.

Let's compare the multiplicative and additive decompositions of the same series. Note that the Trend-Cycle component (as well as the Trend and Cycle components) is the same for both decompositions, while the Seasonal component is decomposed differently in the multiplicative and additive cases.

In the screenshot below, we see that the two seasonal components have a similar style of seasonal fluctuation, but their values differ greatly between the multiplicative and additive decompositions. The different decomposition methods also lead to different Trend-Cycle-Seasonal, Irregular, and Seasonal-Irregular components. In addition, some patterns still remain in the Irregular component from the additive decomposition.

In the multiplicative decomposition, by contrast, the Irregular component looks more like random noise. Thus, multiplicative decomposition is a better choice than additive decomposition for the SASHELP.AIR time series.

PROC TIMESERIES provides classical decomposition of time series, and SAS has other procedures that can perform more complex decompositions. If you want to visualize time series decomposition in a way you like, give SAS Visual Analytics a try!

How to Visualize Time Series Decomposition using SAS Visual Analytics was published on SAS Users.

September 11, 2019

I often use axis tables in PROC SGPLOT in SAS to add a table of text to a graph so that the table values are aligned with the data. But axis tables are not the only way to display tabular data in a graph. You can also use the TEXT statement, which supports many features that are not available in axis tables, such as rotating the text. Recently I saw some graphs by Bull, et al. (Fig 3, *Nature*, 2019) in which a table was presented in an interesting way.
Part of the graph is reproduced at the right.

The complete graph includes ages from 18 to 45 years old, so there are many horizontal categories and they are very close together. For each age, the graph shows the mean luteal-phase length for women in a study about menstrual cycles. (The luteal-phase length is the number of days between ovulation and menstruation.) The numbers indicate the number of women (first number) and the number of menstrual cycles (second number) from which the mean is calculated. The vertical bars are a 95% confidence interval (CI) for the mean.

This article uses the SGPLOT procedure in SAS to create three graphs that are similar to this one:

- When the numbers in a table are small (a few digits), you can use the standard X axis table to show the numbers. However, an X axis table isn't effective for this particular graph because the numbers have too many digits to fit into the available horizontal space.
- One alternative (used by Bull, et al.) is to rotate the text.
- A second alternative is to rotate the entire plot and use a Y axis table to present the data table.

The authors did not make the data publicly available, so I estimated values from the chart to create some similar-looking data. I did not want to type in the means and CIs for all 28 age groups, so I stopped at Age=26.

```sas
/* Data based on Fig 3 of Bull, et al. (2019) "Real-world menstrual cycle
   characteristics of more than 600,000 menstrual cycles"
   npj Digital Medicine https://www.nature.com/articles/s41746-019-0152-7 */
data menstrual;
input Age mean low high Users Cycles;
label mean = "Luteal length";
datalines;
18 12.1  11.3  12.8  46   123
19 12.07 11.75 12.3  354  1082
20 12.2  11.97 12.33 811  2547
21 12.18 12.0  12.25 1535 4925
22 12.27 12.15 12.4  2425 8786
23 12.25 12.2  12.3  3527 13579
24 12.28 12.22 12.39 4693 19749
25 12.3  12.24 12.39 5966 27000
26 12.33 12.3  12.42 7718 35845
;

ods graphics / width=270px height=480px;   /* make sure there isn't much room between age groups */

/* first attempt: Use XAXISTABLE to position text that shows Users and Cycles for each age */
title "Show Table of Counts";
title2 "XAXISTABLE Statement";
proc sgplot data=menstrual noautolegend;
   scatter x=Age y=mean / yerrorlower=low yerrorupper=high errorbarattrs=GraphData1;
   xaxistable Users Cycles / x=Age location=inside;   /* can use POSITION=TOP */
   xaxis grid min=18 offsetmin=0.1 offsetmax=0.1;
   yaxis grid;
run;
```

For graphs like this, the XAXISTABLE statement should usually be the first statement you try because it is so simple to use. In the XAXISTABLE statement, you list each variable that you want to display (Users and Cycles) and specify the variable for the horizontal positions (Age). The result is shown above. You can see that for these data, the cells in the axis table overlap each other and are unreadable because the distance between groups is so small.

Clearly, this first attempt does not work for this table. And because this table is intended to fit on one slide or piece of paper, you cannot simply make the graph wider to prevent the overlap. Instead, an alternative approach is required.

The authors chose to display the Users and Cycles data above each age group by using rotated text. To do this in SAS, you need to add two new variables to the data set. The first is a character variable that contains the comma-separated values of Users and Cycles. The second is the location for the text, in data coordinates. The following DATA step uses the CATX function to concatenate the data values into a comma-separated string. The height of the text string is set to 13 for all groups in this example, but you could make the height depend on the value of the mean if you prefer.

```sas
/* second attempt: Use TEXT and ROTATE=90 to position text */
data menstrual2;
set menstrual;
length Labl $20;
Labl = catx(", ", Users, Cycles);   /* concatenate: "123, 4567" */
Height = 13;                        /* this variable can depend on Age */
run;

title2 "TEXT Statement, ROTATE=90";
proc sgplot data=menstrual2 noautolegend;
   scatter x=Age y=mean / yerrorlower=low yerrorupper=high errorbarattrs=GraphData1;
   text x=Age y=Height text=Labl / position=right rotate=90 backfill fillattrs=(color=white);
   yaxis grid offsetmin=0.1;
   xaxis grid min=18 offsetmin=0.1 offsetmax=0.1;
run;
```

When you use the ROTATE= option to display rotated text, it is important to understand how the text is positioned relative to the coordinates that you specify. The coordinates (in this case, (Age, Height)) determine an anchor point. The POSITION= option specifies how the text is positioned relative to the anchor point BEFORE the rotation. Therefore, the combination POSITION=RIGHT ROTATE=90 results in text that is positioned above the anchor point.

There is another option, which is to rotate the entire graph. Rather than specify Age as the horizontal variable and using vertical bars for the CIs, you can specify Age as the vertical variable and use horizontal bars. This results in a graphic that will be long rather than wide. There should be enough horizontal space to include two columns of text that show the data for Users and Cycles. Because a printed page (in portrait mode) is longer than it is wide, the graph will probably fit on a standard sheet of paper. However, it might not fit on a slide, which is wider than it is tall.

The following call to PROC SGPLOT creates a rotated version of the graph. In addition to rotating the graph, I add alternating bands of gray so that the reader can more easily associate intervals with age groups.

```sas
/* third attempt: Rotate plot, use YAXISTABLE, add alternating bands */
ods graphics / width=480px height=300px;
title2 "YAXISTABLE Statement";

%macro HalfWidth(nCat);
   %sysevalf(0.5/&nCat)
%mend;

proc sgplot data=menstrual noautolegend;
   scatter y=Age x=mean / xerrorlower=low xerrorupper=high errorbarattrs=GraphData1;
   yaxistable Users Cycles / y=Age location=inside position=left valueattrs=(size=9);
   yaxis grid reverse type=discrete discreteorder=data fitpolicy=none
         offsetmin=%HalfWidth(9) offsetmax=%HalfWidth(9)   /* half of 1/k, where k=number of categories */
         colorbands=even colorbandsattrs=(color=gray transparency=0.9);
   xaxis grid;
run;
```

By rotating the graph, the table of numbers is easier to read. The graph for the full data will be somewhat long, but that is not usually a problem for the printed page or for HTML. The main drawback is that long graphs might not fit on a slide for a presentation. A second drawback is that the authors wanted to show that the luteal-phase length depends on age, and it is traditional to plot independent variables (age) horizontally and dependent variables (luteal length) vertically.

In summary, this article shows three ways to add tabular data to a scatter plot with error bars. The first way is to use the XAXISTABLE statement, which works when the table entries are not too wide relative to the horizontal spacing between groups. The second way is to rotate the text, as done in the *Nature* article. The third way is to rotate the plot so that the error bars are shown horizontally rather than vertically.
This third presentation is further enhanced by adding alternating bands of color to help the reader distinguish the age categories. (You can use alternating color bands for the XAXISTABLE, too.)

All three methods are useful in various circumstances, so remember to consider all three methods when you design graphs like this.

To learn more about using horizontal and vertical axis tables in SAS, see
Chapter 3 of Warren Kuhfeld's free e-book *Advanced ODS Graphics Examples*.

The post Axis tables versus rotated text: How to display a wide table in a small graph appeared first on The DO Loop.

September 10, 2019

I recently had the incredible opportunity to attend **SAS Global Forum** in Dallas as a presenter and New SAS Professional Award recipient. At the conference, I was able to learn more about SAS features and applications, share my knowledge of SAS applications in the clinical trials space, and make new professional connections.

Here are 11 reasons why you should consider applying for this award, too.

**1) Free registration & conference hotel:** The obvious perk for award winners is the waived fees associated with the cost of attending the conference, including the registration fee, pre-conference tutorial, and free stay at the conference hotel for award winners who are also presenters. As a junior-level employee, it can be difficult to convince your department to allow you to travel to a conference, but it makes it a lot easier to pitch the idea when an award covers most of the costs.

**2) See a new city:** I arrived at the conference a day early, so I was able to take advantage of my time in Dallas to see the city. I walked around downtown, toured the Dallas Arboretum and Botanical Garden, and ate some delicious barbeque. SAS Global Forum 2020 will be held in Washington D.C., so there will be plenty of sights to see there as well.

**3) Receive guidance from a mentor:** Award recipients who publish and present a paper are eligible to be matched with a mentor through the Presenter Mentoring Program. My mentor, Chris Battiston, was incredibly friendly, helpful, and personable. He provided advice on presentations to attend, public speaking tips, and even referred me for an opportunity to fly out to Canada as an invited speaker at the SAS Canada Health Users group conference. Having a mentor helped set my expectations for the conference and make a plan to maximize my experience.

**4) Open doors to additional opportunities: **This award, and my associated presentations, provided me with a huge boost in my credibility and the publicity around my work. As a direct result of presenting at this conference and receiving the award, I received invitations to speak on the main stage in front of 5,000+ people at SAS Global Forum 2019, to attend the SAS Canada Health Users Group as an invited speaker, to serve as a panel speaker at the Research Triangle SAS Users Group, and attend SAS Global Forum 2020 as an invited speaker. I also had opportunities to meet Jim Goodnight and other SAS executives, which was an incredible honor.

**5) Speak with SAS employees:** Have a question about a SAS procedure? At SAS Global Forum, you can ask your question to the actual developers of those procedures in The Quad. The Quad is a large exhibit and demo area with dozens of SAS booths as well as the conference sponsors. At the booths, I spoke to quite a few representatives from SAS and learned about the variety of areas where SAS is making an impact. I learned about the features and functions of SAS Viya, efforts at SAS to make data visualization accessible to those who are visually impaired, the rationale behind moving the certification exams to a performance-based format, and the free SAS-supported software platform to teach coding to children at a young age.

**6) Free swag:** Not the most important reason, but still an awesome bonus! I walked away from the conference with two free t-shirts, a backpack from the Pinnacle Solutions sponsor booth, and many trinkets, pens, and notepads collected from the various booths.

**7) Have fun:** There were quite a few events at the conference that were a lot of fun! It was easy to meet people because everyone at the event was so friendly. There were happy hour events, lunch networking groups where you could sit with a table of people based on common interests, escape rooms, get-togethers for SAS regional user groups, and a big party for all conference attendees on the last night. It is a great opportunity to spend time with the people you meet at the conference.

**8) Practice public speaking skills & teach others:** Presenting at the conference is a great opportunity to practice speaking in front of a large group and to teach other professionals about some aspect of SAS. As a "New" SAS professional, it may sound daunting to come up with a topic that would be useful for a more experienced audience, but you'd be surprised at the number of people who attend the conference with no knowledge of many of the base procedures. Additionally, conference attendees find it incredibly valuable to learn about how SAS can be used to solve a problem or how an existing common task can be programmed more efficiently. My topic was "Using PROC SQL to Generate Shift Tables More Efficiently", and it taught programmers and statisticians a shorter way to produce shift tables, which are commonly used to present categorical longitudinal data. Because of the preparation I put in to present at the conference, I left the event as a much more confident speaker than I had ever been before.

**9) Learn something new:** At the conference, you'll have the opportunity to attend sessions on virtually any topic you can think of that is related to SAS. Most of the talks I attended were related to statistics because the topic aligns with my job description as a Biostatistician. Some of the topics I learned about were Bayesian analysis, missing data, survival analysis, clinical graphs, and artificial intelligence. Additionally, the conference allows you opportunities to ask specific questions about any SAS procedure or task you’re struggling with. A resource available at the conference is the “Code Doctors” table in The Quad, where you can ask programming questions to SAS experts. I had the opportunity to serve as a “Resident” for the Code Doctors program and was able to observe and help those who needed advice.

**10) Increase visibility within your company:** I was the only attendee from my company out of those working in my office, but there were several senior-level IQVIA employees from other regions in attendance, and I had the opportunity to meet them and spend time with them at the conference. I work at a very large company and would not have had the opportunity to meet these coworkers otherwise, so it was an excellent chance to increase my visibility even within my company. Additionally, I’ve had opportunities to apply the knowledge I gained from the conference at work and to share advice with coworkers based on what I learned.

**11) Make new connections:** Perhaps the most important reason to attend SAS Global Forum as a New SAS Professional is the connections you make at the conference. There are opportunities to meet people at all stages of their careers who use SAS for statistical analysis. Despite working in different industries, I found that many conference attendees used the same procedures and dealt with the same issues that I did, and I truly felt a sense of community among the long-time attendees. Like most of the programmers, analysts, and statisticians in attendance, my day-to-day work is done in a solitary environment on the computer. Although there is teamwork within project teams, there is not a great amount of face-to-face interaction. I love connecting with other people, and this conference gave me the opportunity to meet others working in similar positions.

The New SAS Professional Award is perfect for those with the potential to become a leader in their field and who are looking for more opportunities to present their ideas, to network and make connections, and to learn from experts.

This experience has allowed me to expand my skills and network and has served as a launchpad for my career. My attendance at the conference has allowed me to feel a greater sense of community with other SAS users and to serve as a representative of the "next generation" of SAS professionals. I encourage you to submit your abstract by **September 30th** and your award application by **November 1st** if this seems like the right opportunity for you. More details about this award and other award and scholarship opportunities are available on the SAS Global Forum 2020 website.

11 Reasons to Apply for the New SAS Professional Award was published on SAS Users.