9月 232019
 

The SAS Global Forum 2020 call for content is open until Sept. 30, 2019. Are you thinking of submitting a paper? If so, we have a few tips adapted from The Global English Style Guide that will help your paper shine. By following Global English guidelines, your writing will be clearer and easier to understand, which can boost the effectiveness of your communications.

Even if you’re not planning on submitting a paper and producing technical information is not your primary job function, being aware of Global English guidelines can help you communicate more effectively with your colleagues from around the world.

1. Use Short Sentences

Short sentences are less likely to contain ambiguities or complexities. For task-oriented information, try to limit your sentences to 20 words. If you have written a long sentence, break it up into two or more shorter sentences.

2. Use Complete Sentences

Incomplete sentences can be confusing for non-native speakers because the order of sentence parts is different in other languages. In addition, incomplete sentences can cause machine-translation software to produce garbled results. For example, the phrases below are fragments that may cause issues for readers:

Original: Lots of info here. Not my best, but whatever. Waiting to hear back until I do anything.
Better: There is lots of info here. It’s not my best work, but I am waiting to hear back until I do anything.

An extremely common location to encounter sentence fragments is in the introductions to lists. For example, consider the sentence introducing the list below.

The programs we use for analysis are:
• SAS Visual Analytics
• SAS Data Mining and Machine Learning
• SAS Visual Investigator

If you use an incomplete sentence to introduce a list, consider revising the sentence to be complete, then continue on to the list, as shown below.

We use the following programs in our analysis:
• SAS Visual Analytics
• SAS Data Mining and Machine Learning
• SAS Visual Investigator

3. Untangle Long Noun Phrases

A noun phrase can be a single noun, or it can consist of a noun plus one or more preceding words such as articles, pronouns, adjectives, and other nouns. For example, the following sentence contains a noun phrase with 6 words:

The red brick two-story apartment building was on fire.

Whenever possible, limit noun phrases to no more than three words while maintaining comprehensibility.

4. Expand -ED Verbs That Follow Nouns Whenever Possible

A past participle is the form of a verb that usually ends in -ed. It can be used as both the perfect and past tense of verbs as well as an adjective. This double use can be confusing for non-native English speakers. Consider the following sentence:

This is the algorithm used by the software.

In this example, the word “used” is an adjective, but it may be mistaken for a verb. Avoid using -ed verbs in ambiguous contexts. Instead, add words such as “that” or switch to the present tense to help readers interpret your meaning. A better version of the previous example sentence follows.

This is the algorithm that is used by the software.

5. Always Revise -ING Verbs That Follow Nouns

The name of this tip could have been written as “Always Revise -ING Verbs Following Nouns.” But that’s exactly what we want to avoid! If an -ING word immediately follows and modifies a noun, then either expand it or eliminate it. These constructions are ambiguous and confusing.

6. Use “That” Liberally

The word “that” is your friend! In English, the word “that” is often omitted before a relative clause. If it doesn’t feel unnatural or forced, try to include “that” before these clauses, as shown in the following example.

Original: The file you requested could not be located.

Better: The file that you requested could not be located.

7. Choose Simple, Precise Words That Have a Limited Range of Meanings

We are not often accustomed to thinking about the alternative meanings for the words we use. But consider that many words have multiple meanings. If translated incorrectly, these words could make your writing completely incomprehensible and possibly ridiculous. Consider the alternate meanings of a few of the words in the following sentences:

When you hover over the menu, a box appears.

We are deploying containers in order to scale up efficiency.

8. Don’t Use Slang, Idioms, Colloquialisms, or Figurative Language

In the UK, retail districts are called “high street.” In the US, they are called “main street.” This is just one example of how using colloquialisms can cause confusion. And Brits and Americans both speak the same language!

Especially in formal communications, keep your writing free of regional slang and idioms that cannot be easily understood by non-native English speakers. Common phrases such as “under the weather,” “piece of cake,” or “my neck of the woods” make absolutely no sense when translated literally.

We hope these Global English tips help you write your Global Forum paper, or any other communications you might produce as part of your work. For more helpful tips, read The Global English Style Guide: Writing Clear, Translatable Documentation for a Global Market by John R. Kohl.

Top 8 Global English Guidelines was published on SAS Users.

9月 232019
 

Although I do not typically blog about undocumented SAS options, I'll make an exception this time. For many years, I have known that the CONTENTS and COMPARE procedures support the BRIEF and SHORT options, but I always forget which option goes with which procedure. For the record, here are the documented options:

SAS provides an autocomplete feature that programmers can use to remind themselves which options are supported in each procedure. Both SAS Studio and SAS Enterprise Guide support the autocomplete feature, which displays a list of options as you type SAS code. Unfortunately, I turn off that feature because I find it distracting. Consequently, I cannot remember whether to use SHORT or BRIEF.

My "solution" to this dilemma is to randomly guess an option, run the program, and fix the error if I guessed wrong. I should be wrong 50% of the time, but I noticed that I am wrong only about 25% of the time. Upon investigating, I discovered that some seemingly "wrong" code actually works. Although it is not documented, it turns out that the COMPARE procedure supports the SHORT option as an alias for BRIEFSUMMARY.

This is awesome! No longer do I need to randomly guess the option. I can use the SHORT option in both PROC CONTENTS and PROC COMPARE! As a bonus, the SHORT option is also supported by7 PROC OPTIONS and PROC DATASETS.

If you have never used the SHORT option before, it's a great time saver because it produces condensed output, as shown in the following examples. For PROC CONTENTS, I like to use the options VARNUM SHORT when I want to display a list of the variables in a data set:

proc contents data=Sashelp.Cars varnum SHORT;      /* SHORT is documented option */
run;

For PROC COMPARE, I use the undocumented SHORT option (which is an alias for BRIEF) when I want to display whether two data sets are equal:

data cars; set Sashelp.cars; run;
 
proc compare base=sashelp.cars compare=cars SHORT; /* SHORT is not documented, but it works */
run;

As I mentioned, the SHORT option is also supported in PROC OPTIONS. For completeness, here is an example that writes the values of several options to the SAS log:

proc options option=(linesize pagesize memsize _LAST_) SHORT; run;    /* SHORT is documented option */
 LINESIZE=75 PAGESIZE=24 MEMSIZE=17179869184 _LAST_=WORK.CARS

In short (see what I did there?), you can use the SHORT option in several Base SAS procedures. If you can't remember that the SHORT option applies to PROC CONTENTS and the BRIEF option applies to PROC COMPARE, just use the SHORT option in both procedures and in PROC OPTIONS, too.

The post Use the SHORT option in Base SAS procedures to reduce output appeared first on The DO Loop.

9月 202019
 

Put simply, data literacy is the ability to derive meaning from data. That seems like a straightforward proposition, but, in truth, finding relationships in data can be fraught with complexities, including: Understanding where the data came from, including the lineage or source of that data. Ensuring that the data meet compliance [...]

Why you should care about data literacy was published on SAS Voices by Tom Fisher

9月 192019
 

Recent updates to SAS Grid Manager introduced many interesting new features, including the ability to handle more diverse workloads. In this post, we'll take a look at the steps required to get your SAS Grid Manager environment set up accept jobs from outside of traditional SAS clients. We'll demonstrate the process of submitting Python code for execution in the SAS Grid.

Preparing your SAS Grid

Obviously, we need a SAS Grid Manager (SAS 9.4 Maintenance 6 or later) environment to be installed and configured. Once the grid is deployed, there's not a whole lot more to do on the SAS side. SAS Workspace Servers need to be configured for launching on the grid by converting them to load balanced Workspace Servers, changing the algorithm to 'Grid', and then selecting the relevant checkbox in SASApp – Logical Workspace Server properties to launch on the grid as shown below.

The only other things that might need configuring are, if applicable, .authinfo files and Grid Option Sets. Keep reading for more information on these.

Preparing your client machine

In this example scenario, the client is the Windows desktop machine where I will write my Python code. SAS is not installed here. Rather, we will deploy and use Jupyter Notebook as our IDE to write Python code, which we'll then submit to the SAS Grid. We can get Jupyter Notebook by installing Anaconda, which is a free, bundled distribution of Python and R that comes with a solid package management system. (The following steps are courtesy of my colleague. Allan Tham.)

First, we need to download and install Anaconda.

Once deployed, we can open Anaconda Navigator from the Start menu and from there, we can launch Jupyter Notebook and create a notebook.

Now it's time to configure the connection from Python to our SAS environment. To do this, we use the SAS-developed open-source SASPy module for Python, which is responsible for converting Python code to SAS code and running it in SAS. Installing SASPy is simple. First, we need to download and install a Java Runtime Environment as a prerequisite. After that, we can launch Anaconda Prompt (a command line interface) from the Start menu and use pip to install the SASPy package.

pip install saspy

Connection parameters are defined in a file named sascfg.py. In fact, the best practice is to use sascfg.py as a template, and create a copy of it, which should be named sascfg_personal.py. SASPy will read and look for connections in both files. For the actual connection parameters, we need to specify the connection method. This depends largely on your topology.

In my environment, I used a Windows client to connect to a Linux server using a standard IOM connection. The most appropriate SASPy configuration definition is therefore 'winiomlinux', which relies on a Java-based connection method to talk to the SAS workspace. This needs to be defined in the sascfg_personal.py file.

SAS_config_names=['winiomlinux']

We also need to specify the parameters for this connection definition as shown below.

# build out a local classpath variable to use below for Windows clients   CHANGE THE PATHS TO BE CORRECT FOR YOUR INSTALLATION 
cpW  =  "C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.svc.connection.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\log4j.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.security.sspi.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.core.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\saspyiom.jar"

Note that we are referencing a number of JAR files in the classpath that normally come with a SAS deployment. Since we don't have SAS on the client machine, we can copy these from the Linux SAS server. In my lab, these were copied from /SASDeploymentManager/9.4/products/deploywiz__94526__prt__xx__sp0__1/deploywiz/.

Further down in the file, we can see the connection parameters to the SAS environment. Update the path to the Java executable, the SAS compute server host, and the port for the Workspace Server.

winiomlinux = {'java'   : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
            'iomhost'   : 'sasserver01.mydemo.sas.com',
            'iomport'   : 8591,
            'encoding'  : 'latin1',
            'classpath' : cpW
            }

Note: thanks to a recent contribution from the community, SASPy now also supports a native Windows connection method that doesn't require Java. Instead, it uses the SAS Integration Technologies client -- a component that is free to download from SAS. You'll already have this component if you've installed SAS Enterprise Guide, SAS Add-In for Microsoft Office, or Base SAS for Windows on your client machine. See instructions for this configuration in the SASPy documentation.

With the configuration set, we can import the SASPy module in our Jupyter Notebook.

Although we're importing sascfg_personal.py explicitly here, we can just call sascfg.py, as SASPy actually checks for sascfg_personal.py and uses that if it finds it. If not, it uses the default sascfg.py.

To connect to the grid from Jupyter Notebook using SASPy, we need to instantiate a SASsession object and specify the configuration definition (e.g. winiomlinux). You'll get prompted for credentials, and then a message indicating a successful connection will be displayed.

It's worth nothing that we could also specify the configuration file to use when we start a session here by specifying:

sas = saspy.SASsession(cfgfile='/sascfg_personal.py')

Behind the scenes, SAS will find an appropriate grid node on which to launch a Workspace Server session. The SASPy module works in any SAS 9.4 environment, regardless of whether it is a grid environment or not. It simply runs in a Workspace Server; in a SAS Grid environment, it just happens to be a Grid-launched Workspace Server.

Running in the Grid

Now let's execute some code in the Workspace Server session we have launched. Note that not all Python methods are supported by SASPy, but because the module is open source, anyone can add to or update the available methods. Refer to the API doc for more information.

In taking another example from Allan, let's view some of the content in a SASHELP data set.

The output is in ODS, which is converted (by SASPy) to a Pandas data frame to display it in Jupyter Notebook.

Monitoring workloads from SAS Grid Manager

SAS Workload Orchestrator Web interface allows us to view the IOM connections SASPy established from our client machine. We can see the grid node on which the Workspace Server was launched, and view some basic information. The job will remain in a RUNNING state until the connection is terminated (i.e. the SASsession is closed by calling the endsas() method. By the same token, creating multiple SASsessions will result in multiple grid jobs running.

To see more details about what actually runs, Workspace Server logs need to first be enabled. When we run myclass.head(), which will display the first 5 rows of the data set, we see the following written to Workspace Server log.

2019-08-14T02:47:34,713 INFO  [00000011] :sasdemo - 239        ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
2019-08-14T02:47:34,713 INFO  [00000011] :sasdemo - 239      ! ods graphics on / outputfmt=png;
2019-08-14T02:47:34,779 INFO  [00000011] :sasdemo - NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
2019-08-14T02:47:34,780 INFO  [00000011] :sasdemo - 240        ;*';*";*/;
2019-08-14T02:47:34,782 INFO  [00000045] :sasdemo - 241        data _head ; set sashelp.prdsale (obs=5 ); run;
2019-08-14T02:47:34,782 INFO  [00000045] :sasdemo -
2019-08-14T02:47:34,783 INFO  [00000045] :sasdemo - NOTE: There were 5 observations read from the data set SASHELP.PRDSALE.
2019-08-14T02:47:34,784 INFO  [00000045] :sasdemo - NOTE: The data set WORK._HEAD has 5 observations and 5 variables.
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo - NOTE: DATA statement used (Total process time):
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -       real time           0.00 seconds
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -       cpu time            0.00 seconds
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -

We can see the converted code, which in this case was a data step which creates a work table based on PRDSALE table with obs=5. Scrolling down, we also see the code that prints the output to ODS and converts it to a data frame.

Additional Options

Authinfo files

The sascfg_personal.py file has an option for specifying an authkey, which is an identifier that maps to a set of credentials in an .authinfo file (or _authinfo on Windows). This can be leveraged to eliminate the prompting for credentials. For example, if your authinfo file looks like:

IOM_GELGrid_SASDemo user sasdemo password lnxsas

your configuration defintion in your sascfg_personal.py should look like:

winiomlinux = {'java'   : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
            'iomhost'   : 'sasserver01.mydemo.sas.com',
            'iomport'   : 8591,
            'encoding'  : 'latin1',
            'authkey' : 'IOM_GELGrid_SASDemo'
            'classpath' : cpW
            }

There are special rules for how to secure the authinfo file (making it readable only by you), so be sure to refer to the instructions.

Grid Option Sets

What if you want your code to run in the grid with certain parameters or options by default? For instance, say you want all Python code to be executed in a particular grid queue. SASPy can do this by leveraging Grid Option Sets. The process is outlined here, but in short, a new SASPy 'SAS Application' has to be configured in SAS Management Console, which is then used to the map to the Grid Options Set (created using the standard process).

More Information

My sincere thanks to Allan Tham and Greg Wootton for their valued contributions to this post.

Please refer to the official SAS Grid documentation for more information on SAS Grid Manager in SAS 9.4M6.

Thank you for reading. I hope the information provided in this post has been helpful. Please feel free to comment below to share your own experiences.

Using Python to run jobs in your SAS Grid was published on SAS Users.

9月 182019
 

In a scatter plot that displays many points, it can be important to visualize the density of the points. Scatter plots (indeed, all plots that show individual markers) can suffer from overplotting, which means that the graph does not indicate how many observations are at a specific (x, y) location. You can overcome that problem by moving from a "point plot" to an "area plot" such as a heat map or a contour plot. These plots do a good job communicating density, but they typically do not show individual points, which can be a drawback if you are interested in displaying outliers.

This article shows the following four methods of visualizing the density of bivariate data:

  1. Use a heat map to visualize the density. The individual markers are not shown, but outliers are visible.
  2. Use a contour map to visualize the density. The individual markers are not shown.
  3. Use a scatter plot to show the markers. Use transparency to visualize the density of points.
  4. Use a scatter plot to show the markers. Use color to visualize the density of points.

The following SAS DATA step extracts data from the Sashelp.Heart data set and will be used to create all graphs. The data are measurements of the systolic blood pressure (the "top number") and cholesterol levels of 5,057 patients in a heart study. For convenience, the Systolic variable is renamed X and the Cholesterol variable is renamed Y:

data Have;
   set sashelp.Heart(keep=Systolic Cholesterol);
   rename Systolic=x Cholesterol=y;
   if cmiss(of _NUMERIC_)=0;    /* keep only complete cases */
run;

Use a heat map to visualize the density

The simplest way to visualize the density of bivariate data is to bin the data and use color to indicate the number of observations in each bin. I've written many articles about how to perform two-dimensional binning in SAS. However, if you only want to visualize the data, you can use the HEATMAP statement in PROC SGPLOT, which automatically bins the data and creates a heat map of the counts:

title "Use Heat Map to Show Density";
title2 "Individual Observations Are Not Shown";
proc sgplot data=Have;
   styleattrs wallcolor=verylightgray;
   heatmap x=x y=y;
   xaxis grid; yaxis grid;
run;

The graph shows that the region that has the highest density is near (x,y) = (125, 225). The pink and red rectangles in the graph contain 20 or more observations. The outliers in the graph are visible.

Because the default color ramp includes white as one of its colors, I changed the color of the background to light gray. You can use the COLORMODEL= option on the HEATMAP statement to define and use other color ramps. I also used the default binning options for the heat map. You can use the NXBINS= and NYBINS= options to control the number of bins. Alternatively, you can use the XBINSIZE= and YBINSIZE= options to specify the size of the bins in data units.

For large data sets, the default number of bins might be too large. For most data, 200 bins in each direction shows the density on a fine scale, so you can use the options NXBINS=200 NYBINS=200. Of course, you can use smaller numbers (like 50) to visualize the density on a coarser scale.

Use a contour map to visualize the density

The previous graph is somewhat "chunky." That is because the heat map is a density estimate that is created by binning the data. The result depends on the bin size and the anchor locations. You can obtain a smoother estimate by using a kernel density estimate (KDE). In SAS, PROC KDE enables you to obtain a smooth density estimate. The procedure also creates a contour plot of the resulting density estimate, as shown in the following call:

title "Use Contour Map to Show Density";
title2 "Individual Observations Are Not Shown";
proc kde data=Have;
  bivar  x y / bwm=0.25 ngrid=250 plots=contour;
run;

The graph shows the density estimate. Unfortunately, the units of density (proportion of data per square unit) are not intuitive. Notice that individual observations (including outliers) are not visible in the contour plot. To get very small "hot spots," I used the BWM=0.25 to select a small kernel bandwidth. (BWM stands for "bandwidth multiplier.) If you use the default value (BWM=1), you will get one large region of high density instead of several smaller regions. Similarly, the NGRID=250 option computes the KDE on a fine grid. You will probably need to play with the BWM= option to determine which value is best for your needs.

PROC KDE creates the contour plot with a default title, but you can use the ODS Graphics Editor to change the plot title.

Use a scatter plot and transparency to visualize the density

I've previously written about how to use a scatter plot and marker transparency to indicate density. The value for the transparency parameter depends on the data. You might have to experiment with several values (0.75, 0.8, 0.9, ...) until you find the right balance between showing individual observations and showing the density.

title "Use Transparency in a Scatter Plot";
title2 "Individual Observations Shown, but Faint";
proc sgplot data=Have;
   scatter x=x y=y / markerattrs=(symbol=CircleFilled) transparency=0.9;
   xaxis grid; yaxis grid;
run;

In the scatter plot, you can see that the many data values are rounded to multiples of 5 and 10. You can see clear striations at the values 110, 120, 130, 140, 150, and 160. Recall that the X variable is the systolic blood pressure for patients. Evidentially, there is a tendency to round these numbers to the nearest 5 or 10.

In this plot, you can see the individual outliers, but they are quite faint because the TRANSPARENCY= option is set to 0.9. If you increase the parameter further to 0.95 or higher, it becomes increasingly difficult to see the individual markers. Although you can see that the highest density region in near (x,y) = (125, 225), it is not clear how to associate a number to the density. Does a certain (x,y) value have 10 points overplotted? Or 20? Or 50? It's impossible to tell.

Use a scatter plot and color to visualize the density

The last method borrows ideas from each of the previous methods. I was motivated by some graphs I saw by Lukas Kremer and Simon Anders (2019). My implementation in SAS is quite different from their implementation in R, but their graphs provided the inspiration for my graphs. The idea is to draw a scatter plot where the individual markers are colored according to the density of the region. Kremer and Anders use a count of neighbors that are within a given distance from each point, which is an expensive computation. It occurred to me that you can use a density estimate instead. Because you can compute a KDE by using a fast Fourier transform, it is a much faster computation. In addition, the KDE procedure supports the SCORE statement, which provides a convenient way to associate a density value with every observation. This is implemented in the following statements:

title "Use Scatter Plot to Show Density";
title2 "Individual Observations Shown";
proc kde data=Have;
  bivar  x y / bwm=0.25 ngrid=250 plots=none;
  score data=Have out=KDEOut(rename=(value1=x value2=y));
run;
 
proc sort data=KDEOut;   /* Optional: but a good idea */
   by density;
run;
 
proc sgplot data=KDEOut;
   label x= y=;
   styleattrs wallcolor=verylightgray;
   scatter x=x y=y / colorresponse=density colormodel=ThreeColorRamp
                      markerattrs=(symbol=CircleFilled);
   xaxis grid; yaxis grid;
run;

The graph is a fusion of the other three graphs. Visually, it looks very similar to the heat map. However, it uses colors based on a density scale (like the contour plot) and it plots individual markers (like the scatter plot).

Summary

In summary, this article shows four ways to visualize the density of bivariate data.

  • The heat map provides a nice compromise between seeing the density and seeing outliers. As a bonus, the heat map shows the density in terms of counts per bin, which is an intuitive quantity.
  • The contour plot of density does not show individual markers, which can be an advantage if you want to focus on the bulk of the data and ignore outliers. You can use the BWM= option to control the scale of the bandwidth parameter for estimating the density.
  • The transparent scatter plot shows outliers but doesn't show the density very well.
  • The colored scatter plot combines the strengths (and weaknesses) of the contour plot and the transparent scatter plot. By using colors to indicate density, the scatter plot is easier to interpret than the transparent scatter plot.

For most situations, I recommend the heat map, although I also like the scatter plot that assigns colors to markers based on density. What are your thoughts? Do you prefer one of these graphs? Or perhaps you prefer a different graph for visualizing the density of bivariate data? Leave a comment.

The post 4 ways to visualize the density of bivariate data appeared first on The DO Loop.

9月 162019
 

SAS SPDS is lightning fastJust when you think you’ve seen it all, life can surprise you in a big way, making you wonder what else you've missed.

That is what happened when I recently had a chance to work with the SAS® Scalable Performance Data Server, a product that's been around a while, but I never crossed paths with before. I open an SPDS table of a hundred million records in SAS® Enterprise Guide, and I can scroll it as fast as if it were an Excel “baby” spreadsheet of a hundred rows. That’s how powerful it feels, to say nothing about the lighting speed of the queries.

What is the SAS Scalable Performance Data Server?

Also known as the SAS SPD Server (or SPDS), it's a data storage system designed for high-performance data delivery. Its primary purpose is to provide rapid queries of vast amounts of data. We are talking terabytes of data with tables containing billions of rows. SPDS employs parallel storage and efficient indexing, coupled with a multi-threaded server system concurrently processing tasks distributed across multiple processors.

Availability of the SPDS client in SAS® Viya effectively integrates SAS SPDS with SAS Viya, extending functionality of its applications beyond the native Cloud Analytic Services (CAS) where you can continue reaping all the benefits of the SAS SPDS.

SPDS library

In addition to connecting to SPD Server with explicit SQL pass-through, connection to SPD Server with a LIBNAME statement is available as well, for example:

libname mylibref sasspds 'serverdomain' host='nodename_or_ip' service='5400'
                         user='mySPDuserid' password='{SAS003}XXXXXXX...XXX';

This effectively creates an SPDS library, and the tables in that library can be referenced by two-level name mylibref.tablename as if this were a SAS BASE library.

Cluster tables vs. member tables

Besides ordinary data tables, SPDS library offers so called dynamic cluster tables – or clusters for short – enabling transparent access to large amounts of data.

Dynamic cluster tables (cluster tables or clusters) are virtual tables that allow users to access many server tables (member tables) as if they were one table. A dynamic cluster table is a collection of SPD Server tables that are presented to the end-user application as a single table through a metadata layer acting like a view.

Member tables can be added to the cluster as well as replaced and removed from the cluster.

The role of PROC SPDO

PROC SPDO is the SAS procedure for the SPD Server operator interface. It performs a wide range of SPD server, user and table management tasks:

  • create, list, modify, destroy, and undo dynamic cluster tables
  • add, remove, replace, and fix cluster table members
  • add, modify, list, and delete access control lists (ACLs) for server resources
  • define, describe, and remove WHERE constraints on tables for row-level security definition and management
  • issue system commands on server nodes

In addition to PROC SPDO, SPD Server plug-in for SAS® Data Management Console is also available.

Retrieving SPDS library contents

If you open an SPDS library in SAS Enterprise Guide, you won’t be able to tell which table in that library is a member table and which is a cluster table – they all look the same. But in many cases, we need to know what is what. Moreover, for data-driven processing we need to capture the SPDS library objects into a dataset and identify them whether they are clusters or member tables.

Luckily, PROC CONTENTS with OUT= option allows us to do just that. While MEMTYPE column is equal to ‘DATA’ for both, clusters and member tables, there is another, less known column inversely called TYPEMEM that has value of 'DATA' for clusters and blank value ' ' for member tables. The following simple code allows you to retrieve SPDS library objects list into WORK.SPDSTYPES dataset where TABLETYPE column specifies whether it’s a cluster or a member for each library object MEMNAME:

proc contents data=SPDSLIB._all_ out=WORK.ALLOBJECTS (keep=MEMNAME TYPEMEM);
run;
 
proc sort data=WORK.ALLOBJECTS nodupkey;
   by MEMNAME;
run;
 
data WORK.SPDSTYPES;
   set WORK.ALLOBJECT;
   attrib TABLETYPE $7 label='SPDS table type';
   select(TYPEMEM);
      when('DATA') TABLETYPE = 'CLUSTER';
      when('')     TABLETYPE = 'MEMBER';
      otherwise    TABLETYPE = '';
   end;
run;

In this code PROC CONTENTS produces one record per column NAME in every object MEMNAME in the SPDS library; PROC SORT reduces (un-duplicates) this list to one record per MEMNAME; finally, data step creates TABLETYPE column indicating which MEMNAME is CLUSTER and which is MEMBER.

Retrieving SPDS cluster’s member list

In addition to retrieving a list of objects in the SPDS library described above, we also need a way of capturing the content (a list of members) of the cluster itself in order to control removing or replacing its members. PROC SPDO’s CLUSTER LIST statement produces such a list, and its OUT= option allows you to dump that list into a dataset:

proc spdo lib=SPDSLIB;
   cluster list CLUSTER1 out=CLUSTER1_MEMBERS;
   cluster list CLUSTER1 out=CLUSTER2_MEMBERS;
   /* ... */
   cluster list CLUSTER1 out=CLUSTERN_MEMBERS;
quit;

This approach creates one output table per cluster, and you can’t use the same OUT= destination table for different clusters, for they will be overwritten with each subsequent CLUSTER LIST statement, not appended.

If you need to capture contents of several clusters into one dataset, then instead of the above method of outputting each cluster content into separate table and then appending (concatenating) them, the good old ODS OUTPUT with CLUSTERLIST= option allows us to do it in a single step:

ods noresults;
ods output clusterlist=WORK.CLUSTER_MEMS;
proc spdo lib=SPDSLIB;
   cluster list CLUSTER1;
   cluster list CLUSTER2;
   /* ... */
   cluster list CLUSTERN;
quit;
ods output close;
ods results;

As additional bonus ODS NORESULTS suppresses printed output when it’s not needed, e.g. for automatic data-driven processing.

Your thoughts?

What is your experience with SAS SPDS? How might you use it in the future? Please comment below.

How to retrieve contents of SAS® Scalable Performance Data Server library was published on SAS Users.

9月 162019
 

A moving average is a statistical technique that is used to smooth a time series. My colleague, Cindy Wang, wrote an article about the Hull moving average (HMA), which is a time series smoother that is sometimes used as a technical indicator by stock market traders. Cindy showed how to compute the HMA in SAS Visual Analytics by using a GUI "formula builder" to compute new columns in a data table. This article shows how to implement the Hull moving average in the SAS/IML language. The SAS/IML language is an effective way to implement new (or existing) smoothers for time series because it is easy to use lags and rolling windows to extract values from a time series.

I have previously written about how to compute common moving averages in SAS/ETS (PROC EXPAND) and the DATA step. I have also shown how to compute a weighted moving average (WMA) in the SAS/IML language, which includes simple moving averages and exponential moving averages. This article shows how to implement the Hull moving average formula in SAS by leveraging the WMA function in the previous article.

Every choice of weights in a weighted moving average leads to a different smoother. This article uses weights that are linear in time. Specifically, if Y is a time series, the weighted moving average at time index t is obtained by the formula
WMA(t, n) = 1*Y[tn + 1] + 2*y[tn + 2] + 3*Y[tn + 3] + ... + n*Y[t]) / (1 + 2 + ... + n)
In general, the divisor in the formula is the sum of whatever weights are used.

Although this article focuses on the Hull moving average, the techniques are broadly applicable to other moving average computations.

Hull moving average

Cindy's article contains formulas that show how to compute the Hull moving average (HMA). The HMA is the moving average of a linear combination of two other weighted moving averages, one on a shorter time scale and the other on a longer time scale. Given a time series, Y, choose a lag time, n, which is sometimes called the window length. You can compute the Hull moving average by computing four quantities:

  1. Compute Xshort as the weighted moving average of Y by using a short window of length n/2.
  2. Compute Xlong as the weighted moving average of Y by using the longer window of length n.
  3. Define the series X = 2* Xshort – Xlong.
  4. The Hull moving average of Y is the weighted moving average of X by using a window of length sqrt(n).

The following SAS/IML program implements the Hull moving average. The WMA function is explained in my previous blog post. The HMA function computes the Hull moving average by calling the WMA function three times, twice on Y and once on X. It requires only four statements, one for each computed quantity:

proc iml;
/* Weighted moving average of k data values.
   First k-1 values are assigned the weighted mean of all preceding values.
   Inputs:     y     column vector of length N >= k
               wt    column vector of weights. w[k] is weight for most recent 
                      data; wt[1] is for k time units ago.  The function 
                     internally standardizes the weights so that sum(wt)=1.
   Example calls: 
   WMA  = WMA(y, 1:5);          is weighted moving average of most recent 5 points
   SMA  = WMA(y, {1 1 1 1 1});  is simple moving average of 
   See https://blogs.sas.com/content/iml/2016/02/03/rolling-statistics-sasiml.html
*/
start WMA(y, wt);
   w = colvec(wt) / sum(wt);         /* standardize weights so that sum(w)=1 */
   k = nrow(w);                      /* length of lag */
   MA = j(nrow(y), 1, .);
   /* handle first k values separately */
   do i = 1 to k-1;
      wIdx = k-i+1:k;                /* index for previous i weights */
      MA[i] = sum(wt[wIdx]#y[1:i]) / sum(wt[wIdx]);  /* weighted average */
   end;
   /* main computation: average of current and previous k-1 data values */
   do i = k to nrow(y);
      idx = (i-k+1):i;               /* rolling window of k data points */
      MA[i] = sum( w#y[idx] );       /* weighted sum of k data values */
   end;
   return ( MA );
finish;
 
/* Hull moving average HMA(y, nLag) at each time point. All computations
   use linear weights 1:k for various choices of k. */
start HMA(y, nLag);
   Xshort = WMA(y, 1:round(nLag/2)); /* short smoother (nLag/2) */
   Xlong = WMA(y, 1:nLag);           /* longer smoother (nLag) */
   X = 2*Xshort - Xlong;             /* combination of smoothers */
   HMA = WMA(X, 1:round(sqrt(nLag)));/* Hull moving average (length sqrt(nLag)) */
   return HMA;
finish;
 
/* test on simple time series */
y = T({0 1 0 1 3 2 3 3});
nLag = 4;
WMA = WMA(y, 1:nLag);  
HullMA = HMA( y, nLag );  
print y WMA HullMA;
Hull moving average compared with weighted moving average

The table shows the weighted moving average and the Hull moving average applied to a simple time series with eight observations. The smoothed values are shown. As a general rule, the Hull moving average tends to be smoother than a raw weighted moving average. For a given lag parameter, it responds more quickly to changes in Y.

However, it is important to realize that the HMA is not a smoother of Y. Rather, it smooths a linear combination of other smoothers. Consequently, the value of the HMA at time t can be outside of the range of the series {Y1, Y1, ..., Yt}. This is seen in the last observation in the table. The HMA has the value 3.08 even though the range of Y is [0, 3]. This is shown in the following graph, which plots the series, the weighted moving average, and the Hull moving average.

Graph of Hull moving average and weighted moving average applied to an example time series

The Hull moving average applied to stock data

This section reproduces the graphs in Cindy's blog post by applying the Hull moving average to the monthly closing price of IBM stock. The following statements compute three different moving averages (Hull, simple, and weighted) and use PROC SGPLOT to overlay the simple and Hull moving averages on a scatter plot that shows the closing price and a high-low plot that shows the range of stock values for each month.

use Sashelp.Stocks where(stock='IBM');       /* read stock data for IBM */
read all var {'Close'} into y; close;
 
nLag = 5;
SMA = WMA(y, j(1,nLag,1));                   /* simple MA */
WMA = WMA(y, nLag);                          /* weighted MA */
HMA = HMA(y, nLag);                          /* Hull MA */
create out var {SMA WMA HMA}; append; close; /* write moving averages to SAS data set */
QUIT;
 
data All;                                    /* merge data and moving averages */
   merge Sashelp.Stocks(where=(stock='IBM')) out;
run;
 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=All noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;
Hull moving average and wimple moving average applied to IBM stock price (1986-2006)

It is well known that the simple moving average (green line) lags the data. As Cindy points out, the Hull moving average (orange line) responds more quickly to changes in the data. She compares the two smoothers in the months after June 1992 to emphasize that the HMA is closer to the data than the simple moving average.

In summary, this article shows how to implement a custom time series smoother in SAS. The example uses the Hull moving average and reproduces the computations and graphs shown in Cindy's article. The techniques apply to any custom time series smoother and time series computation. The SAS/IML language makes it easy to compute rolling-window computations because you can easily access segments of a time series.

The post The Hull moving average: Implement a custom time series smoother in SAS appeared first on The DO Loop.