August 28, 2019
 

The Moving Average (MA) is a common indicator in stock, securities and futures trading in financial markets for gauging momentum and confirming trends. An MA is often used to smooth out short-term fluctuations and show the long-term trend, but most MA indicators lag considerably when signaling a trend change. To capture trend reversals faster, several newer MA indicators have been developed that detect trend changes more quickly, and of those, the Hull Moving Average (HMA) is one of the most popular. This post demonstrates its advantage.

A closer look at HMA

Developed by Alan Hull, the HMA responds faster and is thus a more useful signal than other MA indicators, while remaining relatively smooth as it signals a change. Commonly used MA indicators include the Simple Moving Average (SMA), the Weighted Moving Average (WMA) and so on. The SMA calculates the arithmetic mean of the prices, which gives each value equal weight. The WMA averages the values using predetermined weights.

Since moving averages are computed from prior data, all MA indicators suffer a significant drawback: they are lagging indicators. Even with a shorter-period moving average, which has less lag than one with a longer period, a stock price may drop sharply before the MA indicator signals the trend change. The Hull Moving Average (HMA) uses weighted moving averages and the square root of the period instead of the actual period itself, which makes it more responsive to the most recent price activity while maintaining smoothness.

According to Alan Hull, the formula for HMA is:
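Written out for a period of length \(n\) (with the fractional quantities rounded to integers), Hull's definition is:

\[ \mathrm{HMA}(n) \;=\; \mathrm{WMA}_{\sqrt{n}}\Bigl( 2\,\mathrm{WMA}_{n/2}(\text{price}) \;-\; \mathrm{WMA}_{n}(\text{price}) \Bigr) \]

that is, a WMA with period \(\sqrt{n}\) applied to the series \(2\,\mathrm{WMA}_{n/2} - \mathrm{WMA}_{n}\).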

We see that the major computing components in the HMA are three WMAs. Referring to the specification here, the corresponding WMA formula is shown below. In the WMA formula, the weight of each price value depends on the position of the value and on the period length: the more recent the value, the higher its weight, and the shorter the period, the higher the weights.
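For reference, the weighted moving average of the \(n\) most recent prices \(p_1, p_2, \ldots, p_n\) (with \(p_n\) the most recent) is

\[ \mathrm{WMA}(n) \;=\; \frac{\sum_{i=1}^{n} i\,p_i}{\sum_{i=1}^{n} i} \;=\; \frac{n\,p_n + (n-1)\,p_{n-1} + \cdots + 1\cdot p_1}{n + (n-1) + \cdots + 1}, \]

so the most recent prices receive the largest weights, and each weight makes up a larger share of the total when the period is shorter.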

HMA in action

In the remainder of this post, I will show how to calculate the HMA of a stock price using calculated items in SAS Visual Analytics and show that the HMA gives faster upward/downward signals than the SMA. I use the data from SASHELP.STOCK with 'IBM' as an example. The data needs to be sorted by date, and a column (named 'tid') added to hold the sequence number, before loading it into SAS Visual Analytics for calculation. The data preparation code can be found here. After loading the data into SAS Visual Analytics, we can start creating the calculated items. Here, I set the period length to 5 (i.e., n = 5 in the formula) and calculate the HMA for the 'Close' price of the IBM stock as an example.
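A minimal sketch of that preparation might look like the following (the linked code may differ):

/* Sketch of the data preparation: keep the IBM rows, sort by date,
   and add a 'tid' sequence-number column before loading into SAS Visual Analytics. */
proc sort data=sashelp.stock(where=(Stock='IBM')) out=ibm;
   by Date;
run;

data ibm;
   set ibm;
   tid = _N_;   /* row sequence number, used as the weight in the WMA calculations */
run;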

1. Calculate the first WMA like so...

... using the AggregateCells operator in SAS Visual Analytics. I name it 'WMA(5/2 days)'. Note that I've rounded (5/2) to an integer of 3; that is, the aggregation starts from the row two before the current one (-2) and ends at the current row (0). The corresponding formula for the calculated item 'WMA(5/2 days)' in SAS VA is:

AggregateCells(_Sum_, ( 'Close'n * 'tid'n ), default, CellIndex(current, -2), CellIndex(current, 0)) / AggregateCells(_Sum_, 'tid'n, default, CellIndex(current, -2), CellIndex(current, 0))

 

2. Similarly, calculate the second WMA in SAS Visual Analytics:

Name it 'WMA(5 days)'. The corresponding formula is:
AggregateCells(_Sum_, ( 'Close'n * 'tid'n ), default, CellIndex(current, -4), CellIndex(current, 0)) / AggregateCells(_Sum_, 'tid'n, default, CellIndex(current, -4), CellIndex(current, 0))

3. Now we calculate the HMA, which computes the third WMA from the two WMAs obtained in the calculations above. In SAS Visual Analytics, if we directly apply a similar approach for this last WMA, it shows a message that the operands require a group. So here, I need a workaround to make the aggregation work.

4. To work around the problem, I create an aggregated item named 'sumtid' that carries the row sequence number in aggregated form. To do this, first create a calculated item named 'One' with the constant value 1; then use the AggregateCells operator to create 'sumtid', which yields the current row number: AggregateCells(_Sum_, 'One'n, default, CellIndex(start, 0), CellIndex(current, 0)).

5. Now we can compute the HMA in the same way as the previous two WMAs. Name it 'HMA for close (5 days)'. Since int(√5) = 2, the starting position of the aggregation is set to the previous row (-1) and the ending position is set to the current row (0). Note that the operands now use the aggregated item 'sumtid'. The formula for the 'HMA for close (5 days)' item is:

AggregateCells(_Sum_, ( ( ( 2 * 'WMA(5/2 days)'n ) - 'WMA(5 days)'n ) * 'sumtid'n ), default, CellIndex(current, -1), CellIndex(current, 0)) / AggregateCells(_Sum_, 'sumtid'n, default, CellIndex(current, -1), CellIndex(current, 0))

So far, we've created the Hull Moving Average of the IBM stock Close price and saved it in the calculated item 'HMA for close (5 days)'. We can easily draw its time series plot in SAS Visual Analytics. Now, I'll create a Simple Moving Average named 'SMA for the close (5 days)' with equal weights, and then compare it with the HMA. The formula for 'SMA for the close (5 days)' is: AggregateCells(_Average_, 'Close'n, default, CellIndex(current, -4), CellIndex(current, 0))

Now let's visualize 'SMA for the close (5 days)' and 'HMA for close (5 days)' respectively. In the charts below, each grey vertical bar shows the monthly price range of the IBM stock, and the red lines correspond to the SMA and the HMA, respectively. The upper SMA line shows a constant lag behind price changes and poor smoothness. The bottom HMA line keeps up rapidly with price activity while maintaining good smoothness.

Below is a comparison of 'SMA for the close (5 days)', 'HMA for close (5 days)' and the Close price. Besides smoothing out some fluctuations in the Close price, the HMA indeed gives a better signal than the SMA in indicating a turning point when there is an upward/downward trend reversal. Note the obvious lag of the SMA compared to the HMA. For example, compare the trends around the reference line in the visualization below. The Close price reached a local peak in Jun 1992 and started to decline in Jul 1992. The HMA quickly reflected the downward turn with a one-month lag, at Aug 1992, while the SMA still showed a rising trend at that point. The SMA needed one more month of lag before it gave the reversal signal.

Now it's easy to understand why the HMA is a better indicator than the SMA for signaling a reversal point. What has been your experience with HMA?

How to Calculate Hull Moving Average in SAS Visual Analytics was published on SAS Users.

August 28, 2019
 

A SAS programmer asked an intriguing question on the SAS Support Communities: Can you use SAS to create a graph that shows how the elements in a box-and-whiskers plot relate to the data? The SAS documentation has several examples that explain how to read a box plot. One of the documentation images is shown to the right. The programmer wanted a program that creates the image.

In response to the question, a Communities member shared Graph Template Language (GTL) code that uses hard-coded values for the statistics (Q1, Median, Q3, etc.) and overlays explanatory text and arrows onto a box plot. I wondered how hard it would be to dynamically position the explanatory text and lines based on a particular set of data. In other words, given any set of data, can you write a program that automatically positions the explanatory text next to the box plot for those data?

The answer is yes. While developing the program, I learned about axis tables and annotation in the SGPLOT procedure. This article shows how to use the YAXISTABLE statement to overlay a table of text on a box plot (or other plots). The article also shows how to use the SG annotation facility to overlay curves, arrows, and other features. You can use these ideas to augment and decorate your own SAS graphs. My primary goal is not to reproduce the graph in the documentation, but rather to demonstrate general techniques that are relevant in many situations.

The creation of the graph is divided into five steps:

  1. Plan the graph.
  2. Define the data.
  3. Compute values for the whiskers, outliers, and mean value.
  4. Compute values for quartiles and fences.
  5. Create an annotation data set for lines, arrows, and text.

Step 1: Plan the graph

For graphs like this, it is always best to decide what SGPLOT statements you can use and how to arrange the underlying data object for the graph. There are eight statistics that are important for understanding a box plot. To reduce the likelihood of overlapping text, I decided to split them into two groups, one on the right and one on the left. The graph that I will eventually produce is shown to the right. I want to use three SGPLOT statements and an annotation data set:

  • The box plot. For this, I need a variable, x, that contains the data. The VBOX statement will display the box plot.
  • The features on the right are mostly related to observations. They include the high and low whiskers, and any outliers and "far outliers." Because the mean is often close to the median (whose label appears on the left), I'll also plot the mean on the right to avoid overlapping text. A YAXISTABLE statement will display these words on the right. The variables for this axis table will be called _TYPE_ and _VALUE_, for reasons that will become apparent.
  • The features on the left are related to quantiles. The Q1, median, and Q3 values are self-explanatory. The upper fence is value Q3 + 1.5*IQR, where IQR = Q3 - Q1 is the interquartile range. The lower fence is value Q1 - 1.5*IQR. A second YAXISTABLE statement will display these words on the left. The variables for this axis table will be called Stat and Value2. (You can also define the "upper far fence" by Q3 + 3*IQR and the "lower far fence" by Q1 - 3*IQR.)
  • Annotation. To keep it simple, I will show how to draw dashed lines by using an annotation data set. You could also use annotation to display arrows and additional text.

Step 2: Define the data

Assume the data are in a variable named X in a data set named HAVE. Because PROC BOXPLOT (used in the next step) requires a Group variable, you need to add a constant variable named GROUP to the data. The following data simulates normally distributed data and adds three outliers:

/* Step 2. Create example data */
data Have(keep= Group x);
call streaminit(1);
Group=1;         /* required for PROC BOXPLOT */
do i = 1 to 40;
   x = rand("Normal", 0, 1.5);  /* normal data */
   output;
end;
x = 6; output;   /* upper outlier      */
x = 10; output;  /* far upper outliers */
x = -4; output;  /* lower outlier      */
run;

Step 3: Compute whiskers and outliers

The easiest way to compute the whiskers and outliers is to use the OUTBOX= option in PROC BOXPLOT. It writes a SAS data set that contains two variables, _TYPE_ and _VALUE_, which hold the values for many of the features and statistics that are displayed by the box plot.

/* Step 3: Compute values for box plot features: Mean, lower/upper whiskers, outliers */
proc boxplot data=Have;
   plot x*Group / boxstyle=schematic outbox=outbox;
run;
 
proc print data=outbox;
  where _Type_  in ('MIN', 'MAX' 'MEAN' 'HIGH' 'LOW' 'FARHIGH' 'FARLOW' 'HIWHISKR' 'LOWHISKR');
  var _Type_ _Value_;
run;

Unfortunately, the output from the OUTBOX= option does not always contain the value of the whiskers (although it does for this example). If the box plot does not contain a lower outlier, the _TYPE_='LOWHISKR' observation does not exist and you need to use the value of the 'MIN' observation instead. Similarly, if there is no upper outlier, the _TYPE_='HIWHISKR' observation does not exist and you need to use the value of the 'MAX' observation instead. I have previously discussed how to handle this situation, and the full SAS code shows how to create a new data set (called Outbox2) that contains all the needed information.

If you merge Outbox2 data with the original data, you can use the YAXISTABLE statement in PROC SGPLOT to overlay the box plot and a table of box plot features. The YAXISTABLE statement is great for aligning text with data values. (At first, I thought to use the TEXT statement to display the text, but the TEXT statement requires a coordinate system for the horizontal variable, which is not applicable for this graph.) The YAXISTABLE statement only requires vertical coordinates and places the text in the margins.

/* Step 3a. Solve a potential problem. See link to full code. */
data outBox2;
set outBox(where=(_Type_  in ('MIN', 'MAX' 'MEAN' 
                  'HIGH' 'LOW' 'FARHIGH' 'FARLOW' 'HIWHISKR' 'LOWHISKR')))
    END=EOF;                     /* EOF is temporary indicator variable */
/* ...more... */
run;
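If you prefer not to follow the link, the following sketch shows one possible way to fill in the elided step; it assumes that a missing whisker observation should be replaced by the corresponding 'MIN' or 'MAX' value, and the author's actual code may differ:

/* Sketch only. Pass 1: record whether the whisker observations exist */
data _null_;
   set outbox end=eof;
   retain hasLow hasHigh 0;
   if _Type_ = 'LOWHISKR' then hasLow  = 1;
   if _Type_ = 'HIWHISKR' then hasHigh = 1;
   if eof then do;
      call symputx('hasLow',  hasLow);
      call symputx('hasHigh', hasHigh);
   end;
run;
 
/* Pass 2: keep the features of interest; reuse MIN/MAX as whisker values when needed */
data outBox2;
   set outBox(where=(_Type_ in ('MIN' 'MAX' 'MEAN' 'HIGH' 'LOW'
                                'FARHIGH' 'FARLOW' 'HIWHISKR' 'LOWHISKR')));
   if _Type_ = 'MIN' then do;
      if &hasLow then delete;            /* MIN duplicates a low outlier that is already labeled */
      else _Type_ = 'LOWHISKR';          /* no low outliers: MIN is the lower whisker */
   end;
   else if _Type_ = 'MAX' then do;
      if &hasHigh then delete;
      else _Type_ = 'HIWHISKR';
   end;
run;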
 
data Schematic1;  merge outBox2 Have;  run;
 
/* OPTIONAL: How are we doing? Display box plot with axis table on right */
ods graphics/ width=300px height=400px;
title "Schematic Box Plot";
proc sgplot data=Schematic1;
   vbox x;
   yaxistable _Type_  / y=_Value_ nolabel valueattrs=(size=12) location=inside;
   yaxis display=none;            /* OPTIONAL: Suppress Y ticks and values */
   xaxis offsetmin=0 offsetmax=0;
run;

Step 4: Compute quartiles, fences, and IQR

As explained previously, the upper and lower fences are computed by using the first and third quartiles and the interquartile range. The coordinates of the fences are not produced by PROC BOXPLOT. The following statements use PROC MEANS to compute the quartiles, then use a DATA step to compute the IQR and the locations of the fences. A call to PROC TRANSPOSE converts the data set from wide to long form so that it can be merged with the data set from the previous section. Lastly, a second YAXISTABLE statement is added to the PROC SGPLOT call to display the quartiles and fences on the left side of the graph.

/* Step 4: Use PROC MEANS and a DATA step to compute the quantile stats */
proc means data=Have Q1 Median Q3 noprint;
   var x;
   output out=Q Q1=Q1 Median=Median Q3=Q3;
run;
 
option validvarname=ANY;    /* permit 'Upper Fence'n to be a var name */
data IQR;
set Q;
IQR = Q3 - Q1;
'Upper Fence'n = Q3 + 1.5*IQR;
'Lower Fence'n = Q1 - 1.5*IQR;
drop _TYPE_ _FREQ_ IQR;
run;
 
proc transpose data=IQR out=IQR2(rename=(COL1=Value2)) name=Stat;
run;
 
data Schematic2;
   merge Schematic1 IQR2(keep=Stat Value2);
run;
 
/* OPTIONAL: How are we doing? Display box plot with two axis tables */
title "Schematic Box Plot: Second Draft";
proc print data=IQR2; run;
proc sgplot data=Schematic2;
   vbox x;
   yaxistable _Type_  / y=_Value_ nolabel valueattrs=(size=10) location=inside;
   yaxistable Stat  / y=Value2 nolabel valueattrs=(size=10) location=inside
                      position=left valuejustify=right;
   yaxis display=none;            /* OPTIONAL: Suppress Y ticks and values */
   xaxis offsetmin=0 offsetmax=0;
run;

Step 5: Create an annotation data set

If you want to overlay additional curves, arrows, or text on the graph, use an SG annotation data set. For a gentle introduction to SG annotation, see Dan Heath's 2011 SAS Global Forum paper about SG annotation. For a more complete exposition, including many cut-and-paste examples, see Chapter 3 of Warren Kuhfeld's free e-book Advanced ODS Graphics Examples, which also contains an excellent chapter about axis tables.

To keep this article from becoming too long and complicated, I will use SG annotation merely to add two dashed horizontal lines that indicate the locations of the lower and upper fences. When you use annotation, it is important to choose good coordinate systems. For this example, you can use the WallPercent coordinate system ([0, 100]) for the horizontal direction and the DataValue coordinate system for the vertical direction. I used the interval [35, 65] to draw the horizontal lines. The vertical positions of the lines come from the IQR2 data set, which was created in the previous section.

/* Step 5: Create annotation data set */
data anno; 
retain Function 'Line' 
       x1Space x2Space 'WallPercent  '
       y1Space y2Space 'DataValue    '
       LinePattern 2               /* short dashed line */
       x1 35 
       x2 65 
       y1 y2 0;
set IQR2(where=(upcase(Stat) contains 'FENCE'));
y1 = Value2; y2 = Value2;
run;
 
title "Schematic Box Plot: Final Version";
proc sgplot data=Schematic2 sganno=anno;
   vbox x;
   yaxistable _Type_  / y=_Value_ nolabel valueattrs=(size=10) location=inside;
   yaxistable Stat  / y=Value2 nolabel valueattrs=(size=10) location=inside
                      position=left valuejustify=right;
   yaxis display=none;            /* OPTIONAL: Suppress Y ticks and values */
   xaxis offsetmin=0.2 offsetmax=0;
run;

The graph is complete! The final graph is shown in the "Plan the Graph" section. The challenge in creating this graph is not the SGPLOT syntax (which is simple) but is computing all the coordinate values and arranging them in a rectangular format.

Does the program handle other data sets?

My goal was to augment a box plot with text that explains the important statistical features of the plot and that is flexible enough to work for an arbitrary data set. Let's see how well the program works on other data that are not so symmetrical. The following data set defines the X variable as the MPG_City values for the vehicles in the Sashelp.Cars data set. These data are positively skewed.

data Have(keep=x Group);
   set Sashelp.Cars(rename=(MPG_City=x));
   label x = "MPG_City";
   Group = 1;
run;

If you run the program on this data, you obtain the following graph. I commented out the YAXIS statement in the SGPLOT call so that the Y axis ticks and values are displayed.

The graph is typical of skewed data. There are many outliers and the text for the outliers can overlap other text. Nevertheless, the program does a fair job of displaying the important statistical features of the box plot for these data.

In summary, this article shows how to use PROC BOXPLOT, PROC MEANS, and the DATA step to compute data-dependent quantities that represent the statistical features of a box plot. You can use the YAXISTABLE statement to display explanatory text at various data values. You can use SG annotation to display lines, arrows, and other decorations. For simplicity, I hard-coded the data set name and the variable name in this program. It is straightforward to encapsulate the computations into a SAS macro that would create an annotated box plot for any numerical variable.

You can download the full SAS code in this article.

The post Annotate features of a schematic box plot in SGPLOT appeared first on The DO Loop.

August 28, 2019
 

This article is not a tutorial on Hadoop, Spark, or big data. At the same time, no prerequisite knowledge of these technologies is required for understanding. We’ll give you enough background prior to diving into the details. In simplest terms, the Hadoop framework maintains the data and Spark controls and directs data processing. As an analogy, think of Hadoop as a train, big data as the payload, and Spark as the crew driving the train and organizing and distributing the goods.

Big data

I recently read that data volumes are doubling each year. Not that long ago, we talked in terms of gigabytes. This quickly turned into terabytes, and we're now in the age of petabytes. The type of data is also changing: data used to fit neatly into rows and columns, but now nearly eighty percent of data is unstructured. All these trends have led us to deal with massive amounts of data, aka big data. Maintaining and processing big data requires purpose-built technical frameworks. Next, we'll investigate a couple of these tools.

Hadoop

Hadoop is a technology stack utilizing parallel processing on a distributed filesystem. Hadoop is useful to companies when data sets become so large or complex that their current solutions cannot effectively process the information in a reasonable amount of time. As the data science field has matured over the past few years, so has the need for a different approach to processing data.

Apache Spark

Apache Spark is a cluster-computing framework utilizing both iterative algorithms and interactive/exploratory data analysis. The goal of Spark is to keep the benefits of Hadoop’s scalable, distributed, fault-tolerant processing framework, while making it more efficient and easier to use. Using in-memory distributed computing, Spark provides capabilities over and above the batch model of Hadoop MapReduce. As a result, this brings to the big data world new applications of data science that were previously too expensive or slow on massive data sets.

Now let’s explore how SAS integrates with these technologies to maximize capturing, managing, and analyzing big data.

SAS capabilities to leverage Spark

SAS provides Hadoop data processing and data scoring capabilities using SAS/ACCESS Interface to Hadoop and In-Database Technologies to Hadoop, with MapReduce or Spark as the processing framework. This addresses traditional batch data management, such as processing huge volumes of extract, transform, load (ETL) data, as well as faster, interactive, in-memory processing for quicker responses with Spark.

In SAS Viya, SAS/ACCESS Interface to Hadoop includes SAS Data Connector to Hadoop. All users with SAS/ACCESS Interface to Hadoop can use this serial data transfer. Likewise, SAS Data Connect Accelerator to Hadoop can load or save data in parallel between Hadoop and SAS using the SAS Embedded Process, as a Hive/MapReduce or Spark job.

Connecting to Spark in a Hadoop Cluster

There are two ways to connect to a Hadoop cluster using SAS/ACCESS Interface to Hadoop, based on the SAS environment: LIBNAME and CASLIB statements.

LIBNAME statement to connect to Spark from MVA SAS

options set=SAS_HADOOP_JAR_PATH="/third_party/Hadoop/jars/lib:/third_party/Hadoop/jars/lib/spark"; 
options set=SAS_HADOOP_CONFIG_PATH="/third_party/Hadoop/conf"; 
 
libname hdplib hadoop server="hadoop.server.com" port=10000 user="hive"
schema='default' properties="hive.execution.engine=SPARK";

Parameters

  • SAS_HADOOP_JAR_PATH: Directory path for the Hadoop and Spark JAR files
  • SAS_HADOOP_CONFIG_PATH: Directory path for the Hadoop cluster configuration files
  • Libref: The hdplib libref specifies the location where SAS will find the data
  • SAS/ACCESS engine name: HADOOP, the option that connects using the Hadoop engine
  • SERVER: The Hadoop Hive server to connect to
  • PORT: The port the Hive server listens on. 10000 is the default, so it is not required; it is included here just in case
  • USER and PASSWORD: Not always required
  • SCHEMA: The Hive schema to access. It is optional; by default, the connection uses the "default" schema
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform
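
Once the libref is assigned, Hive tables can be read like SAS data sets. The following is a sketch only; the table name 'gas' is assumed here for illustration (it appears later in this article):

/* Sketch: read from a Hive table through the hdplib libref.
   With hive.execution.engine=SPARK, Hive runs the query as a Spark job. */
data work.gas_sample;
   set hdplib.gas(obs=100);
run;
 
proc sql;
   select count(*) from hdplib.gas;
quit;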

 
CASLIB statement to connect from CAS

caslib splib sessref=mysession datasource=(srctype="hadoop", dataTransferMode="auto", username="hive", server="hadoop.server.com", 
hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib:/opt/sas/viya/config/data/hadoop/lib/spark", 
hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf", schema="default",
platform="spark",
dfdebug="EPALL",
properties="hive.execution.engine=SPARK");

Parameters

  • CASLIB: The name for the specified data access. The splib CAS library points to the Hadoop data source
  • SESSREF: Associates the CAS library with a specific CAS session. mysession is the current active CAS session
  • SRCTYPE: Type of data source
  • DATATRANSFERMODE: Type of data movement between CAS and Hadoop. Accepts one of three values: serial, parallel, auto. When AUTO is specified, CAS chooses the type of data transfer based on the available license in the system. If Data Connect Accelerator to Hadoop is licensed, parallel data transfer is used; otherwise, serial transfer is used
  • HADOOPJARPATH: Location of the Hadoop and Spark JAR files on the CAS cluster
  • HADOOPCONFIGDIR: Location of the Hadoop configuration files on the CAS cluster
  • PLATFORM: Type of Hadoop platform used to execute the job or transfer data using the SAS Embedded Process. The default value is "mapred" for Hive MapReduce. When "spark" is specified, data transfers and jobs execute as Spark jobs
  • DFDEBUG: Used to return additional information about the SAS Embedded Process data transfer in the SAS log
  • PROPERTIES: Hadoop properties. Choosing SPARK for the property hive.execution.engine enables SAS Viya to use Spark as the execution platform

 

Data Access using Spark

SAS Data Connect Accelerator for Hadoop with the Spark platform option uses Hive as the query engine to access Spark data. Data movement happens between Spark and CAS through SAS generated Scala code. This approach is useful when data already exists in Spark and either needs to be used for SAS analytics processing or moved to CAS for massively parallel data and analytics processing.

Loading Data from Hadoop to CAS using Spark

Processing data in CAS offers advanced data preparation, visualization, modeling and model pipelines, and finally model deployment. Model deployment can be performed using available CAS modules or pushed back to Spark if the data is already in Hadoop.

Load data from Hadoop to CAS using Spark

proc casutil 
      incaslib=splib
      outcaslib=casuser;
      load casdata="gas"
      casout="gas"
      replace;
run;

Parameters

  • PROC CASUTIL: Procedure used to run CAS action routines that process data
  • INCASLIB: Input CAS library to read data from
  • OUTCASLIB: Output CAS library to write data to
  • CASDATA: Table to load to the CAS in-memory server
  • CASOUT: Output CAS table name

 

We can look at the status of the data load job using Hadoop's resource management and job scheduling application, YARN. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes.

Loading Data from Hadoop to Viya CAS using Spark

In the figure above, the YARN application executed the data load as a Spark job. This was possible because the CASLIB statement specified the PLATFORM="spark" option. The data movement direction, in this case Hadoop to CAS, is reflected in the Spark job name, "SAS CAS/DC Input," where "Input" means data loaded into CAS.

Saving Data from CAS to Hadoop using Spark

You can save data back to Hadoop from CAS at many stages of the analytic life cycle. For example, use data in CAS to prepare, blend, visualize, and model. Once the data meets the business use case, data can be saved in parallel to Hadoop using Spark jobs to share with other parts of the organization.

Using the SAVE CAS action to move data to Hadoop using Spark

proc cas;
session mysession; 
      table.save /
      caslib="splib"
      table={caslib="casuser", name="gas"},
      name="gas.sashdat"
      replace=True;
quit;

Parameters

  • PROC CAS: Procedure used to execute CAS action sets and actions that process data. Here "table" is the action set and "save" is the action
  • TABLE: Location and name of the source table
  • NAME: Name of the target table saved to the Hadoop library using Spark

 

We can verify the status of saving data from CAS to Hadoop using the YARN application. The data from CAS is saved as a Hadoop table using Spark as the execution platform. Furthermore, because SAS Data Connect Accelerator for Hadoop transfers data in parallel, the individual Spark executor on each Spark executor node handles the data for its specific Hadoop cluster node.

Saving Data from Viya CAS to Hadoop using Spark

Finally, the save of the data executed as a Spark job. As we can see from YARN, the Spark job named "SAS CAS/DC Output" indicates that the data moved from CAS to Hadoop.

Where we are; where we're going

So far we have traveled across the Spark pond to set up SAS libraries for Spark and to load and save data between Hadoop and CAS using Spark. In the next part of this series, we'll look at ways to score data and execute SAS code inside Hadoop using Spark.

Data and Analytics Innovation using SAS & Spark - part 1 was published on SAS Users.

August 27, 2019
 

I was born in a country (Brazil) where voting is mandatory. Most of my family still lives there, and now that I live in the US, they ask me about American politics all the time. One thing that often catches them by surprise is that not only is voting not [...]

Examining voter registration data with SAS Visual Analytics was published on SAS Voices by Frank Silva

August 26, 2019
 

The marketing industry has never had greater access to data than it does today. However, data alone does not drive your marketing organization. Decisions do. And with all the recent hype regarding the potential of AI, a successful cross-channel campaign is propelled by a personalized, data-driven approach injected with machine [...]

SAS Customer Intelligence 360: Hybrid marketing and analytic's last mile [Part 1] was published on Customer Intelligence Blog.

August 26, 2019
 

Most SAS programmers know how to use PROC APPEND or the SET statement in the DATA step to unconditionally append new observations to an existing data set. However, sometimes you need to scan the data to determine whether or not to append observations. In this situation, many SAS programmers choose one of the following methods:

  • Inside a DATA step, use the SYMPUT call to create a macro variable that indicates whether to append observations. After the DATA step ends, use %IF-%THEN processing to check the value of the macro variable and conditionally append the observations.
  • Use the DATA step to determine whether to append data and append data in the same DATA step. This is especially useful if the values for the new observations depend on the data that you scanned.

This article shows the second method. It shows how to use the SAS DATA step to scan through observations and remember certain values. If a condition is met, it uses the values to append new observations to the end of the data by using end-of-file processing.
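For contrast, here is a minimal, self-contained sketch of the first (macro-variable) method; the data set, condition, and observation to append are hypothetical and chosen only for illustration:

/* Sketch of the macro-variable approach: scan the data, set a flag,
   then conditionally append with %IF-%THEN. All names here are hypothetical. */
data base; set sashelp.class; run;          /* data to scan and (maybe) append to */
 
data extra;                                  /* observation to append if the condition holds */
   length Name $8 Sex $1;
   Name='Marker'; Sex=' '; Age=.; Height=.; Weight=.;
run;
 
data _null_;                                 /* scan the data and remember the result */
   set base end=eof;
   retain AnyTall 0;
   if Height > 65 then AnyTall = 1;          /* example condition */
   if eof then call symputx('AnyTall', AnyTall);
run;
 
%macro AppendIfNeeded;                       /* %IF-%THEN requires macro context (before 9.4M5) */
   %if &AnyTall %then %do;
      proc append base=base data=extra force; run;
   %end;
%mend AppendIfNeeded;
%AppendIfNeeded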

A motivating example

Often SAS programmers need to implement complicated data-dependent logic. A simple example is "If the XYZ variable contains a certain value but doesn't contain a different value, then do something." On the SAS discussion forums, the experts often suggest scanning through the data with a DATA step and keeping one or more "flag variables" that indicate which conditions have been satisfied. At the end of the DATA step, you can look at the values of the flag variables to determine what action to take.

Last week I encountered a situation where I needed to conditionally append observations to input data. Although the solution is easy if you use the SAS/IML language (which enables you to scan an entire vector of values), I needed to solve the problem by using the DATA step, which processes only one observation at a time. The problem had the following form:

  • The data are in long form. The character variable TYPE always contains the values 'Min' and 'Max'. The numeric variable VALUE contains numbers.
  • The TYPE variable might or might not contain the values 'LowVal' and 'HighVal'. If they do appear, they always appear after the 'Min' and 'Max' values.

The goal, summarized by the figure to the right, is to create an output data set that always contains the four values 'Min', 'Max', 'LowVal', and 'HighVal'. The following list describes how to generate the 'LowVal' and 'HighVal' observations if they do not exist.

  • If TYPE='HighVal' does not appear in the data, create it and copy the value of the TYPE='Max' observation to use for the VALUE variable.
  • Similarly, if TYPE='LowVal' does not appear in the data, create it. Copy the value of the TYPE='Min' observation to use for the VALUE variable.

The figure shows the four situations that can occur. The input data set always contains the 'Min' and 'Max' values but can contain none, one, or both of the other values. The goal is to produce the data set on the right, which always contains all four values. The next section presents a solution, so stop reading here if you want to solve the problem on your own!

The END= option and end-of-file processing

To solve the problems, I used two facts about the SAS DATA step:

  1. You can use the END= option on the SET statement to create a temporary binary indicator variable that has the value 1 for only the last observation of the input data.
  2. The SAS DATA step contains an implicit loop over all observations in the input data. If you do not use an OUTPUT statement, the DATA step performs an implicit output for each observation. However, if the program contains an OUTPUT statement anywhere in the program, then the implicit output is disabled. Therefore, whenever you use an OUTPUT statement, you must use other OUTPUT statements whenever you want to write an observation to the output data set.

The following program scans the input data. It remembers the values of the 'Min' and 'Max' observations, in case it needs them. It uses indicator variables to determine whether the data contains the 'LowVal' and 'HighVal' observations. After the input data are read, the program uses an end-of-file indicator variable (EOF) to determine whether or not to add observations for 'LowVal' and 'HighVal'.

Because the program uses an OUTPUT statement to (conditionally) create new observations, you must also put an OUTPUT statement after the SET statement to ensure that the original observations are all written.

/* Include all 4 test cases. Use WHERE clause to test each case. */
data Test;
length Type $7;
input Group Type $ Value;
datalines;
1 Min      -3
1 Max       3
1 LowVal   -2
1 HighVal   2
2 Min      -3
2 Max       3
2 HighVal   2
3 Min      -3
3 Max       3
3 LowVal   -2
4 Min      -3
4 Max       3
;
 
/* Input order is always 'Min' and 'Max' optionally followed by 
   'LowVal' and 'HighVal', if they exist. */
%let dsname = Test(where=(Group=4));    /* use 1,2,3,4 to test all cases */
 
data Want;
drop HighFound LowFound Min Max;   /* temporary variables */
retain HighFound LowFound 0        /* binary indicator variables: Initialize to 0 (false) */
       Min Max .;                  /* VALUE of 'Min' and 'Max' obs: Initialize to missing */
set &dsname end=EOF;               /* EOF is temporary indicator variable */
output;                            /* need OUTPUT because of EOF processing */
if Type = 'Min' then 
   min = Value;              /* remember the Min value */
else if Type = 'Max' then
   max = Value;              /* remember the Max value */
else if Type = 'LowVal' then 
   LowFound = 1;             /* Low value found; no need to create it */
else if Type = 'HighVal' then 
   HighFound = 1;            /* High value found; no need to create it */
 
/* end-of-file processing: conditionally append new observations */
if EOF then do;
   if ^LowFound then do;     /* Low value not found. Add it. */
      Type = "LowVal"; Value = Min; output;
   end;
   if ^HighFound then do;    /* High value not found. Add it. */
      Type = "HighVal"; Value = Max; output;
   end;
end;
run;
 
proc print data=Want; 
   var Group Type Value;
run;

The result is shown for input data that contains only the 'Min' and 'Max' observations but not the 'LowVal' or 'HighVal' observations. The output shows that the 'LowVal' and 'HighVal' observations were correctly appended to the input data, and that values for the VALUE column were copied from the 'Min' and 'Max' observations, respectively. You can verify that the other three input data sets are also correctly handled.

Use caution with the DELETE and subsetting IF statements

When performing end-of-file processing, be careful if you use a DELETE statement or a subsetting IF statement. For details and examples, see "The Perils of End-of-File Processing when Subsetting Data" (Landry, 2009). Landry summarizes the problem as follows (p. 2): "The problem occurs only when the last record in the dataset is deleted.... The reason this happens is that when a record is deleted..., SAS stops processing and returns to the next iteration of the DATA step. Thus, any executable statements placed after the [DELETE or subsetting IF statements] do not get executed."
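Here is a tiny, self-contained illustration of that pitfall (not from Landry's paper; the data set and condition are chosen only for demonstration):

/* Sashelp.Class is sorted by Name, so 'William' is the last observation.
   Because DELETE stops processing that iteration, the end-of-file block never runs. */
data demo;
   set sashelp.class end=eof;
   if Name = 'William' then delete;                      /* deletes the LAST record */
   if eof then put 'NOTE: end-of-file processing ran';   /* never written to the log */
run;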

Summary

In summary, this article shows how to use the SAS DATA step to conditionally add observations to the end of the input data. This is useful for data-dependent logic when the observations that you need to append depend on the values of the data. You can perform end-of-file processing by using the END= option on the SET statement to create an indicator variable that has the value 1 for the last observation in the input data. You can use the OUTPUT statement to append additional observations, but remember that you also need to use the OUTPUT statement after the SET statement if you want to output the original data.

Do you have a favorite way to conditionally append data? Do you know of other potential pitfalls with end-of-file processing? Leave a comment.

The post Conditionally append observations to a SAS data set appeared first on The DO Loop.

August 24, 2019
 

This summer the Accessibility and Applied Assistive Technology team at SAS launched a new course that teaches students with visual impairments how to independently analyze data, which is a critical skill that all students need for success in college and their careers. However, many students with visual impairments don’t have [...]

College-bound students with visual impairments learn to independently analyze data was published on SAS Voices by Lisa Morton

August 22, 2019
 

Editor’s note: This is the first article in a series by Conor Hogan, a Solutions Architect at SAS, on SAS and database and storage options on cloud technologies. This article covers the SAS offerings available to connect to and interact with the various database options available in Amazon Web Services.

As companies move their computing to the cloud, they are also moving their storage to the cloud. Just like compute in the cloud, data storage in the cloud is elastic and responds to demand, and you pay only for what you use. As more technologies move to a cloud-based architecture, companies must consider questions like: Where is my data going to be stored? Do I want a hybrid solution? What cloud storage options do I have? What storage solution best meets my business needs? Another question requiring an answer is: Is the software I use cloud-ready? In the case of SAS, the answer is yes! SAS offers various cloud deployment patterns on various cloud providers and supports integration with cloud storage services.

This is part one in a series covering database as a service (DBaaS) and storage offerings from Amazon Web Services (AWS). Microsoft Azure and Google Cloud Platform will be covered in future articles. The goal is to supply a breakdown of these services to better understand the business requirements of these offerings and how they relate to SAS. I will focus primarily on SAS Data Connectors as part of SAS Viya, but all the same functionality is available using a SAS/ACCESS Interface in SAS 9.4. SAS In-Database Technologies in SAS Viya are called SAS Data Connect Accelerators and are synonymous with the SAS Embedded Process.

SAS integration with AWS

SAS has extended SAS Data Connectors and SAS In-Database Technologies support to Amazon Web Services database variants. A database running in AWS is much like your on-premise database, except that Amazon manages the software and hardware. Amazon's DBaaS offerings take care of the scalability and high availability of the database with minimal user input. SAS integrates with your cloud database even if SAS is running on-premise or with a different cloud provider.

AWS databases

Amazon offers database service technologies familiar to users. It is important to understand the new terminology and how the different database services best meet the demands of your specific application. Many common databases already in use are being refactored and provided as service offerings to customers in AWS. The advantages for customers are clear: no hardware to manage and no software to install. Databases that scale automatically to meet demand, and software that updates itself and creates backups automatically, mean customers can spend more time creating value from their data and less time managing their infrastructure.

For the rest of this article I cover various database management systems, the AWS offering for each database type, and SAS integration. First let's consider the diagram below depicting a decision flow chart to determine integration points between AWS database services and SAS.

Integration points between AWS database services and SAS

Trace your path in the diagram and read on to learn more about the connection details.

Relational Database Management System (RDBMS)

In the simplest possible terms, an RDBMS is a collection of managed tables with rows and columns. You can divide relational databases into two functional groups: online transaction processing (OLTP) and online analytical processing (OLAP). These two methods serve distinct purposes and are optimized depending on how you plan to use the data in the database.

Transactional Databases (OLTP)

Transactional databases are good at processing reads, inserts, updates and deletes. These queries usually have minimal complexity and arrive in large volumes. Transactional databases are not optimized for business intelligence or reporting. Data processing typically involves gathering input information, processing the data, and updating existing data to reflect the collected and processed information. Transactional databases prevent two users from changing the same data at the same time. Examples include order entry, retail sales, and financial transaction systems. Amazon offers several types of transactional database services. You can organize Amazon Relational Database Service (RDS) into three categories: enterprise licenses, open source, and cloud native.

Enterprise License

Many customers already have workloads built around enterprise databases. Amazon provides a turn-key enterprise solution for customers not looking to break their relationship with enterprise vendors or refactor their existing workflows. AWS offers Oracle and Microsoft SQL Server as turn-key enterprise solutions in RDS. Both offerings include the required software license; however, Oracle also allows you to "Bring Your Own License" (BYOL). SAS has extended SAS Data Connector support for both cloud variants. You can use your existing license for SAS Data Connector to Oracle or SAS Data Connector to Microsoft SQL Server to interact with these RDS databases.
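
As a rough sketch of what such a connection might look like from MVA SAS, the Oracle engine LIBNAME simply points at the RDS endpoint; the endpoint, service name, schema, and credentials below are placeholders:

/* Sketch only: connect to an Oracle database running in Amazon RDS.
   The endpoint, service name, schema, and credentials are placeholders. */
libname orards oracle
   path="//myinstance.xxxxxxxx.us-east-1.rds.amazonaws.com:1521/ORCL"
   user=myuser password="XXXXXXXX" schema=myschema;
 
proc datasets lib=orards; quit;   /* list the tables visible through the libref */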

Remember you can install and manage your own database on a virtual machine if there is not an available database as a service offering. The traditional backup and update responsibilities are left to the customer in this case. For example, both SAS Data Connector to Teradata and SAS Data Connect Accelerator for Teradata are supported for Teradata installed on AWS.

Open Source

Amazon provides service offerings for common open source databases like MySQL, MariaDB, and PostgreSQL. SAS has extended SAS Data Connector support for all these cloud variants. You can use your existing license for SAS Data Connector to MySQL to connect to either RDS MySQL or RDS MariaDB, and SAS Data Connector to PostgreSQL to interface with RDS PostgreSQL.

Cloud Native

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud, combining the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. SAS has extended SAS Data Connector support for Amazon Aurora. You can use your existing license for SAS Data Connector to MySQL to connect to Aurora MySQL, or SAS Data Connector to PostgreSQL to interface with Aurora PostgreSQL.

Analytical Databases (OLAP)

Analytical databases are optimized for read performance. These databases work best with complex queries in smaller volumes. When working with an analytical database, you typically analyze multidimensional data interactively from multiple perspectives. Redshift is the analytical database service offered by Amazon. SAS has a dedicated product, SAS Data Connector to Amazon Redshift, that was purpose-built for analytics workloads running in the Amazon cloud.
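
A rough sketch of a Redshift LIBNAME connection follows; the cluster endpoint, database, table, and credentials are placeholders, and the available options can vary by release:

/* Sketch only: connect to an Amazon Redshift cluster.
   Endpoint, database name, table, and credentials are placeholders. */
libname rs redshift
   server="mycluster.xxxxxxxx.us-east-1.redshift.amazonaws.com"
   port=5439 database=mydb
   user=myuser password="XXXXXXXX";
 
proc sql;
   select count(*) from rs.sales;   /* 'sales' is a hypothetical table */
quit;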

NoSQL Databases

A non-relational or NoSQL database is any database not conforming to the relational database model. These databases are more easily scalable across a cluster of machines. NoSQL databases are a natural fit for the cloud because their loose dependencies make the data easier to distribute and scale. Each type of NoSQL database is designed to solve a specific business problem. Some of the most common data structures are key-value, column, document, and graph databases. Here is a brief overview of each.

Key-Value Database

A key-value database stores data as a collection of key-value pairs. The key acts as a unique identifier for each record. Amazon’s key-value database as a service is DynamoDB. SAS interacts with DynamoDB using industry standard ODBC or JDBC drivers.

Columnar Database

Data in a traditional relational database is organized by rows. Columnar databases instead store and retrieve data by column, which saves valuable time and network I/O for analytical queries. Redshift is the columnar database service offered by Amazon. SAS has a dedicated product, SAS Data Connector to Amazon Redshift, that was purpose-built for this database.

Document Database

A document database queries data in documents, typically stored in JSON format. DocumentDB is the document database service offering from Amazon. SAS interacts with DocumentDB using industry-standard ODBC or JDBC drivers. DocumentDB is MongoDB-compatible, which means existing MongoDB drivers and tools work with DocumentDB. SAS is currently building out functionality to support SAS Data Connector to MongoDB, and you should expect that to extend further into DocumentDB as well.

Graph Database

Amazon Neptune is the graph database service designed to work with complex hierarchies of interconnected data. Databases of this type query relationships in the data and reduce the number of table joins. SAS interacts with Amazon Neptune using industry-standard ODBC or JDBC drivers.

Hadoop

The traditional deployment of Hadoop is changing dramatically with the cloud. Traditional Hadoop vendors may have a tough time keeping up with the service offerings available in the cloud. Hadoop still offers reliable replicated storage across nodes and powerful parallel processing of large jobs without much data movement. Amazon offers Elastic MapReduce (EMR) as its Hadoop-as-a-service offering. Amazon EMR supports both SAS Data Connector to Hadoop and SAS Data Connect Accelerator for Hadoop.

Finally

It is important to think about the use case for your database and the type of data that you plan to store before you select an AWS database service. Understanding your workloads is critical to getting the right performance and cost. When dealing with cloud databases, always remember that you are charged not only for the storage you use but also for the data you move out of the database. Doing analysis and reporting on your data may require data transfer, so be aware of these costs and think about how you can lower them by caching frequently accessed data or keeping it on-premise.

Additional Resources

  1. Support for Databases in SAS® Viya® 3.4
  2. Support for Cloud and Database Variants in SAS® 9.4

Accessing Databases in the Cloud – SAS Data Connectors and Amazon Web Services was published on SAS Users.