Yinliang Wu

6月 032020
 

In natural language processing, word vectors play a key role in making technologies such as machine translation and speech recognition possible. A word vector is a row of numeric values where each point captures a dimension of the word’s meaning. Each value represents how closely it relates to the concept behind that dimension, so the semantics of the word is embedded across the dimensions of the vector. Since similar words have similar vectors, representing words as vectors like this would simplify and unify vectors' operations.

Word vectors are generated by a training performed word-word co-occurrence statistics on a large corpus. You can use pre-trained word vectors like GloVe, provided by Stanford University.

Let's talk about how to transform word vector tables from long to wide in SAS, so we can potentially get sentence vectors to process further. Suppose we generate word vectors from the following 3 sentences:

Jack went outside.
Jill likes to draw in the afternoon.
Tony is a boy.

Each word has 2 numeric values (Vector1, Vector2), each value represents how closely the word relates to the concept defined by that dimension. The value numbers (here VNUM=2) may range from hundreds to thousands in real text analysis scenarios.

Long word vector table

The sample code below generates an upper sample table and sorts it for further processing.

data HAVE;
  length Word $ 45;
  input SentenceID Word Vector1-Vector2; /*300+*/
datalines;
1	Jacky 	   0.24011	 0.400996 
1	went	  -0.047581	 0.868716 
1	outside	  -1.197891	 1.162238
2	Jill	  -0.199579	 0.251252
2	likes	  -1.935640	-0.288264
2	to	  -0.526053	-1.143420
2	draw	  -0.736289	-0.794812
2	in 	  -2.757234	 0.506639
2	the	  -0.736289	-0.794812
2	afternoon -0.047581	 0.868716
3	Tony 	   0.34032	 0.600983 
3	is	   0.147531	 0.968817
3	a	   1.347543	 2.568323
3	boy       -3.257891      3.172238
run; 
proc sort data=HAVE;
  by SentenceID;
run;
proc print data=have;run;

If we want to transform the upper long table to a wide table as seen below, how can we do this as efficiently and simply as possible? The upper 14 words belong to 3 sentences that would result in the following 3 rows with 22 columns (1 + WNUM + WNUM x VNUM=1 + 7 + 7 x 2 = 22).

Wide word vector table

Please note that we can calculate the max word number (WNUM) in a sentence at runtime with SAS code below. For the upper case, the value of WNUM is 7.

proc sql noprint;
  select max(count) into :wnum from (
    select count(Word) as count from HAVE group by SentenceID 
  );
quit;

In fact, we don’t need any SAS PROC to handle this kind of transformation. A SAS Data step provides an efficient and convenient way to transform data. The key is to use an ARRAY to map all word vectors from the source table, and then define two ARRAYs to store output words and vectors in a wide style. These two arrays for output words and vectors need to be RETAIN during the implicit loop and KEEP for OUTPUT while it reaches the last SentenceId.

You can see the full SAS code below with detailed comments.

/*Long table to Wide table*/
%let vnum=2; /*vector numbers for a word*/
%let wnum=7; /*max word number in a sentence*/
data WANT;
  set HAVE;
  by Sentenceid;
  array _vector_ [*] vector:;         /*Map to source vectors*/
 
  array _word [ %eval(1*&wnum)] $ 45; /*Array to store WORD in wide table*/
  array _vector [ %eval(&wnum*&vnum)];/*Array to store VECTORS in wide table*/
  retain _word: _vector:;             /*RETAIN during the implicit loop*/
 
  retain _offset_ 0;                  /*Offset of a WORD in a sentence, base 0*/
  if first.Sentenceid then do;
    call missing(of _word[*]);
	call missing(of _vector[*]);
    _offset_=0;
  end;
  else _offset_=_offset_+1;
 
  _word[ _offset_+1 ]=word;           /*Cache current word to array WORD at [ _offset_+1]*/
  do i=1 to dim(_vector_);            /*Cache each vectors to array VECTORS at [_offset_* &vnum +i]*/
    _vector[_offset_* &vnum +i]=_vector_[i]; 
  end;
  keep Sentenceid _word: _vector: ;   /*Keep for output when it hit last.Sentenceid*/
 
  if last.Sentenceid then output;     /*Output the cached WORD and VECTORS*/
run;
 
proc print data=want;run;

Accordingly, if we need to transform a word vector back from wide style to long style, we need to generate &WNUM rows x &VNUM columns for each sentence, and it’s the reversed process for upper logic. The full SAS code with detailed comments is listed below:

/*Wide table to Long table*/
data HAVE2;
  set WANT; 
 
  array _word [*] _word:;           /*Array _word[] mapping to WORD in wide table*/
  array _vector_ [*] _vector:;     /*Array _vector[] mapping to VECTORS in wide table*/
 
  length Word $ 45;                 /*Output Word in the long table*/
  array Vector[&vnum];              /*Output Vectors in the long table*/
  do i=1 to &wnum;                  /*Unpack word from array _word[]*/       
    word=_word[i]; 
	if word=" " then continue;
    do j=1 to &vnum;                /*Unpack vectors from array _vector[]*/
	  oo= (j+&vnum * (i-1)); 
      Vector[j]=_vector_[j + &vnum *(i-1)];
    end;
	keep Sentenceid Word Vector:;
	output;                          /*One row in wide table generate &wnum rows[]*/
  end;
run;
proc print data=HAVE2;run;

To wrap the upper bi-directional transformation process for general repurposing in text analysis, we provide two SAS MACROs listed below:

%Long2Wide(data=Have, vnum=2, wnum=7, sid=SentenceId, word=Word, out=Want);
proc print data=Want;run;
 
%Wide2Long(data=Want, vnum=2, wnum=7, sid=Sentenceid, out=Have2, outword=Word, outvector=Vector);
proc print data=Have2;run;

We have demonstrated how to transform a word vector table from a long style to a wide style (or vice versa) efficiently with a SAS DATA step. We have also provided two well-wrapped SAS MACROs for general re-use purposes. To learn more, please check out these additional resources:

Transform word vector tables from long to wide was published on SAS Users.

12月 202019
 

One day, I received a question from the SAS community on how to exclude variables with missing values up to a certain percentage in data analysis. Before I dive into details, another question about how to remove variables with repeated values up to a certain percentage was raised. In fact, this user demands focus on the value characteristics for a single variable, and it determines if the user should include a specific variable in the following data analysis. It reflects the demand of variable selection for wide datasets with noise; it enhances data filtering by enabling variables(cols) selection, in addition to the normal observation(rows) filtering.

DATA Step implementation

If the variable filtering is based on the percentage of the missing values, the most intuitive thought is to use DATA step to calculate the percentage of missing values for each column of the input dataset, and then collect the variable names that meet the threshold conditions before passing them to a global macro variable for subsequent analysis. Since SAS DATA step includes both character and numeric data types, we cannot define one single array to map all columns of the dataset. Instead, we need to define two arrays to map all numeric variables and all character variables respectively.

In order to build a universal reusable SAS macro, we also need to detect the number of numeric and character variables in advance and pass them to the array definition in the next DATA step. In that DATA step, we use two arrays to access dataset variables and count the number of missing values respectively. When the loop reaches the last observation, it collects variable names through the vname() function which meet the threshold condition, and finally place them into a SAS global macro variable &VARLIST. The SAS macro below demonstrates this logic:

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_;  /*get numeric and character variables count */
    set &data (obs=1);
    array _ncols_n _numeric_;
    array _ccols_n $ _character_;
    call symput ('_ncols_n',put(dim(_ncols_n),best.));
    call symput ('_ccols_n',put(dim(_ccols_n),best.));
  run;
  data _null_;
	set &data end=last;
    /*Process numeric types*/
	array _ncols_ _numeric_;	
	array _ncols_count {&_ncols_n};
	retain _ncols_count (&_ncols_n*0);	
	do i=1 to dim(_ncols_);			
	  if _ncols_[i]= .  then _ncols_count[i]=_ncols_count[i] + 1;
	end;
    /*Process character types*/
	array _ccols_ $  _character_;
    array _ccols_count {&_ccols_n};	
	retain _ccols_count (&_ccols_n*0);
	do i=1 to dim(_ccols_);			
	  if _ccols_[i]=" " then _ccols_count[i]=_ccols_count[i]+1;
	end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if last then do;	  
	  length _varlist_ $ 32767;
retain _varlist_ " ";
	  do i=1 to &_ncols_n; 
	    if (_ncols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ncols_[i]) );
        end;				
	  end;
	  do i=1 to &_ccols_n; 
	    if (_ccols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ccols_[i]) );
		end;
	  end;	
	  call symput ("&VARLIST",trim(_varlist_));			
	end;
  run;
%mend; 
/*Exclude variables if missing value pct GE. threshold  &th*/
%macro FilterData(data=, th=, out=);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars);
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

To verify the programing logic, let’s generate a demo dataset WORK.CLASS based on SASHELP.CLASS, the output dataset has different missing value percentage (10.53%, 36.84%, 52.63%) in Sex, Height and Weight columns.

data class; /*To generate demo dataset*/
  set sashelp.class;  
  if age < = 11 then sex=" ";
  if age <= 12 then height=.;
  if age <= 13 then weight=.;  
run;
proc print data=class; run; /*See Figure 1 for output */

Now we can use the following code to filter columns by percentage of missing values, which is greater than 40%; the Weight column will not appear in the output CLASS2 dataset.

 
%FilterData(data=class, th=0.4, out=class2);
proc print data=class2; run; /*See Figure 1 for output */

Figure 1: Data before filtered and after filtered

The calculation here for the percentage of missing values is implemented by the user programming in the DATA step. The missing value count is computed separately in two arrays according to the character type and the numerical type. If you want to do it in a single array, you must use HASH object to store the variable name and its corresponding index in that array. Whenever a missing value is detected, we need to check whether the variable is already in the HASH object; if not, then add it directly. Otherwise, look up the table to find the appropriate array index, and then increase the counter to the corresponding element in that array. The advantage of this approach is we only need one array in later variable names collecting phase. The disadvantage is the use of HASH object increases the programming complexity. The complete implementation based on this logic is as follows, it can completely replace the first version of %FilterCols implementation.

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_; /*Detect variable numbers of dataset*/
    dsid=open("&data.",'i');
    if dsid>0 then do;
      nvars= attrn(dsid, 'NVARS'); 
	  call symput("NVARS", nvars); 
    end;
  run;
  data _null_;
    set &data. end=last; 
    /*Define 2 arrays map to numeric/character vars respectively*/
    array _ncols_ _numeric_;
    array _ccols_ _character_;
 
    if _N_=1 then do; /*HASH Object definition and initialize*/
      length key $ 32;
      declare hash h(); 
	  rc=h.definekey("key");
	  rc=h.definedata("data", "key");
	  rc=h.definedone();
	  call missing(key, data); 
    end;  
    /*Define one array for counting of missing value*/
	array _cols_count[&NVARS] _temporary_; 
	retain _cols_count ( &NVARS * 0); 
    retain _col_id_ 1; 
 
    do i=1 to dim(_ncols_);	  
      if missing( _ncols_[i] ) then do; 
        /*Find mapped array index and modify counter value*/
	    key=vname( _ncols_[i] );  
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();  
		   _col_id_=_col_id_+1;		
	    end;    
        _cols_count[ data ] = _cols_count[ data ] +1; 
	  end;
    end;
    do i=1 to dim(_ccols_); 
      if missing( _ccols_[i] ) then do;
        /*Find mapped array index and modify counter value*/
	    key=vname( _ccols_[i] );
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();	 
		   _col_id_=_col_id_+1;		
	    end;	        
        _cols_count[ data ] = _cols_count[ data ] +1; 	    		
	  end;
    end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if last then do;
      declare hiter iter('h');
      rc = iter.first();  
      do while (rc = 0);   	    
        pct=_cols_count[data]/ _N_;
		if pct >= &th. then do; 
	      length _varlist_ $ 32767; 
retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(key);		  	  
        end; 
        rc = iter.next(); 
      end;	
	  call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
%mend;

PROC FREQ Implementation

In the two macros above, the counting of missing values for each column are implemented by user programming in DATA step, but SAS provides a more powerful frequency statistic step called PROC FREQ. Can we directly use its output for variable filtering? The answer is YES. For example, the following SAS code can list the Frequency, Percent, Cumulative Frequency, and Cumulative Percent of each variable values (including missing values) in the data set.

 
  proc freq data=class;
    table Sex Height Weight /missing;  /*See Figure 2 for output */
  run;

Figure 2: PROC FREQ output

To get the result of PROC FREQ into a dataset, we just need to add ODS Table statement before running PROC FREQ:

 
ods table onewayfreqs=class_freqdata;

Figure 3: OneWayFreqs table class_freqdata

So, the frequency statistics result of PROC FREQ is stored in the class_freqdata dataset. We only need to check the value of variable “percent” for each column. If it is greater than the threshold specified, the variable name is added to the buffer variable _varilist_, otherwise it is ignored; the final value of _varilist_ is placed to the SAS global macro variable &VARLIST for later use. Comparing to the DATA step implementation, the entire code here is much shorter and simpler.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  proc freq data=&data;
    table _all_ /missing;  
  run; 
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var); 
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
        retain _varlist_ " "; 
	    _varlist_=trim(_varlist_) || " "|| trim(var);	
		put var= _varlist_;
	  end;
   end;
   if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Please note that PROC FREQ would generate output to ODS HTML destination by default. If you want to avoid generating ODS HTML output when generating the temporary dataset “class_freqdata”, you can use the ods html close; statement to temporarily close the ODS HTML output and then open it again anytime needed.

Since the output of PROC FREQ already has the frequency information of each variable, we can naturally filter columns based on the percentage of duplicate values. However, it should be noted that a single variable always contains multiple distinct values, and each distinct value (including missing values) has a frequency percentage. If percentage of duplicate value is too high in a specific column, the user may exclude that column from the following analysis. As there are multiple frequency percentages, we just choose the largest one as the duplicate value percentage of that variable. For example, the Sex column of demo dataset CLASS has a missing value of 10.53%, a female of 42.11% and a male of 47.37%, so the percentage of duplicate values in the Sex column should be 47.37%.

The following code filter variables by percentage of duplicate values with a specific threshold. The code logic is almost same as before, but it filters values with non-missing values. So, run the same test code against the demo dataset will remove the Age column by duplicate values percentage threshold, instead of the Weight column by missing values percentage threshold.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ / missing;  
  run; 
  ods html;
  data _null_;
    set &data._freqdata end=last;
var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
    value=vvaluex(var);
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do; /*Search max percentage of duplicate values of a variable*/
	    retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;      
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        	      
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
	end;
    if last then call symput( "&VARLIST", trim( left(_varlist_)));  
  run;
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata ;
  quit;
%mend;

In order to combine these two logics of by missing values and by duplicate values in one macro, we just introduce one flag parameter byduprate. The default value is 0 to filter variables by missing values, and value 1 to filter variables by duplicate values. The complete code is as below:

 
%macro FilterCols(data=, th=0, varlist=, byduprate=0);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
proc freq data=&data;
    table _all_ /missing;  
  run; 
ods html;
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2); 
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”"); 
	value=vvaluex(var);
  %if &byduprate=0 %then %do; /*Filter variables by missing value*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
	    retain _varlist_ " ";
	    _varlist_=trim(_varlist_) || " "|| trim(var);				 
	  end;
   end;
  %end;
  %else %do; /*Filter variables by duplicate value percentage*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do;      
        retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;		
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;              
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
    end;
  %end;
    if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Accordingly, we also need to introduce the parameter byduprate with the same name and behavior in the %FilterData macro that's invoking the %FilterCols macro. The SAS macro %FilterData also has the same default value 0, but it needs to pass the byduprate argument specified by the user to the %FilterCols macro. The enhanced version is as below:

 
%macro FilterData(data=, th=, out=, byduprate=0);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars, byduprate=&byduprate); 
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

Now the final version of the macro %FilterData unified the two kinds of variable filtering logic, it just depends on the different value of the argument byduprate: If byduprate is not specified or the value is 0, it means filter variables by the percentage of missing values, otherwise it is by the percentage of duplicate values. For the following two code snippets, the former drops the Weight column, while the latter drops the Sex column.

 
/*Drop columns with missing value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class2);              
proc print data=class2;run;
 
/*Drop columns with duplicate value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class3, byduprate=1); 
proc print data=class3;run;

Figure 4: Drop column by missing and duplicate value percentage

Summary

This article discusses how to use SAS to filter variables in a dataset based on the percentage of missing values or duplicate values. The missing value statistics can be implemented by either DATA step programming on your own or reusing the existing powerful PROC FREQ with statistics result for missing values and duplicate values. The five macros in this article also demonstrated how to wrap up the existing powerful SAS analytics as a reusable, general-purpose SAS macro for a big project. The final version of macro %FilterData also demonstrated how to unify filtering variables by the percentage of either missing values or duplicate values in one to make calling SAS macro functions convenient and easy to reuse. To learn more, please check out these resources:

How to filter variables by percentage of missing values or duplicate values in SAS was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.