Yinliang Wu

12月 232021
 

As the leader in analytics, SAS provides a very rich set of SAS functions and CALL routines for programmers. There is a total of 693 uniquely named SAS functions besides the hundreds of SAS PROC steps. These SAS functions not only cover general functions such as character/string processing, mathematical, permutation & combination, probability, search and sorting, date and time, etc., but also includes a powerful domain-specific function set involving variable information and control, external files and system extensions, web services and tools, descriptive statistics, financial calculations, geodetic distance & ZIP code, Git and other fields. This article reviews the financial functions provided by the SAS system and demonstrates how to use them to do financial analysis and simplify your decisions. Functions with the same name can be invoked easily during programming with SAS DATA step, DS2, FedSQL and CAS environment, which will improve your programs' portability.

Overview

SAS provides 94 financial functions in total, which are mainly used to calculate financial values, such as interest, periodic payments, depreciation, and prices for European options on stocks and futures. The 53 financial functions conforming to industry standards are packaged in 1 general function called FINANCE. Each finance function has a unique string identifier for the first parameter of the FINANCE function. These functions have to be invoked via the FINANCE function. For example, we can use irr=finance("IRR", -400,100,200,300); to compute the internal rate of return (IRR) for a series of cash flows.

The 53 FINANCE functions are compatible with the 53 financial functions provided by Microsoft Excel for Office 365. In fact, the list of financial functions available in the latest Excel for Office 365 software contains 56 functions. Among the 3 extra functions, PDURATION and RRI are specific functions for Excel 2013, and STOCKHISTORY is a specific function for Office 365. These 53 financial functions are designed for financial calculation such as depreciation, maturation, accrued interest, net present value, periodic savings, internal rates of return, etc. They can be categorized as investment, depreciation, Internal rate of return, financial value, and yield calculations. The complete list of SAS FINANCE functions is shown in Table 1.

Category Subcategory Function
Investment Calculation [12] Future Value [2] FV, FVSCHEDULE
Payment [4] PMT, IPMT, PPMT, ISPMT
Present Value [3] PV, NPV, XNPV
Annual interest rate [2] EFFECT, NOMINAL
Number of periods [1] NPER
Depreciation [7] AMORDEGRC, AMORLINC, DB, DDB, SLN, SYD, VDB
Internal rate of return [4] IRR, RATE, MIRR, XIRR
Financial value and yield [30] Accrued/Cumulative Interest & Principal [4] ACCRINT, ACCRINTM, CUMIPMT, CUMPRINC
Coupon days/date and Number [6] COUPDAYBS, COUPDAYS, COUPDAYSNC, COUPNCD, COUPNUM, COUPPCD
Yield [6] INTRATE, ODDFYIELD, ODDLYIELD, YIELD, YIELDDISC, YIELDMAT
Treasury bill [3] TBILLEQ, TBILLPRICE, TBILLYIELD
Duration [2] DURATION, MDURATION
Security [7] DISC, ODDFPRICE, ODDLPRICE, PRICE, PRICEDISC, PRICEMAT, RECEIVED
Dollar Price [2] DOLLARDE, DOLLARFR

Table 1: 53 FINANCE XXX functions in one FINANCE function

Another 40 independently named SAS financial functions are used for investment or loan calculation, cash flow, depreciation, and pricing models. For example, we can use irr =irr(1,-400,100,200,300); to compute the internal rate of return (IRR) as a percentage for a series of cash flows.

There are 7 functions (IRR, NPV, PMT, PPMT, IPMT, CUMIPMT, CUMPRINC) whose names are the same as the industry standard FINANCE functions, but they have different sign-in results or even different calculation logic. For example, FINANCE PMT is used to calculate the periodic payment of an annuity, but the PMT function is used to calculate the periodic payment for a constant payment loan or the periodic savings for a future balance. Annuity and periodic savings are different in anchoring market interest rates, interest calculation, liquidity, and binding, so the same arguments return different results for them. The list is shown in Table 2.

Category Function
Investment/Loan [16] IRR/INTRR, NPV/NETPV, SAVING, SAVINGS,
NOMRATE, EFFRATE, COMPOUND,
PMT, PPMT, IPMT, CUMIPMT, CUMPRINC, MORT, TIMEVALUE
Depreciation [10] DACCDB, DACCDBSL, DACCSL, DACCSYD, DACCTAB
DEPDB, DEPDBSL, DEPSL, DEPSYD, DEPTAB
Cash flow [6] CONVX, CONVXP, DUR, DURP, PVP, YIELDP
Pricing Models [8] Stocks: BLKSHCLPRC/BLKSHPTPRC, GARKHCLPRC/GARKHPTPRC, MARGRCLPRC/MARGRPTPRC
Futures: BLACKCLPRC/BLACKPTPRC

Table 2: 40 uniquely named financial functions

To keep semantic consistency, these industry-standard financial functions usually have the same parameter names, and some of the most important parameters are:

  • Interest rate (rate): The interest rate or discount rate of an investment or loan.
  • Number of periods (nper): The total number of payment periods for investment or loan.
  • Payment (pmt): The regular payment that is made each period for investment or loan, the value cannot change over the life of the annuity. Typically, it contains principal and interest, but not other fees and taxes.
  • Present value (pv): The current value of an investment or loan. The present value of an investment is the value at the beginning of the investment period, and the present value of the loan is the value of the principal borrowed.
  • Future value (fv): The value of an investment or loan after all payments have occurred, or a cash balance that you want to attain after the last payment is made.
  • Payments due Type (type): A flag indicates if payments are due at the beginning of the period. The omitted type or 0 means payments are due at the end of the period, 1 indicates payments are due at the beginning of the period.
  • Period (period): Specify the period for interest rate or depreciation calculation. The period must be between 1 and the number of periods (nper).
  • Day count basis (basis): Optional, specify the type of day count basis to use. 0-represents US (NASD) 30/360, default; 1-represents Actual/actual; 2-represents Actual /360; 3-represents Actual /365; 4-represents European 30/360.
  • Frequency (frequency): Specify the number of coupon payments per year, 1-annual payment; 2-semiannual payment; 4-quarterly payment.

Below, we use some simple examples to show how to use these SAS financial functions to calculate and analyze mortgages.

Mortgage calculation

The average interest rate of first home mortgage in China is 5.46% since September 2021, and the average interest rate of second home mortgages is 5.83%, up 23 and 29 basis points respectively from 2020. Assuming that the first housing loan amount is 2 million yuan (CNY), the loan periods is 25 years. If the mortgage type is Constant Payment Mortgage (CPM), how much will the monthly payment and total payment be?

The FINANCE PMT function can be used to calculate the periodic payment of an annuity, it can be used to calculate the constant payment for Constant Payment Mortgage by default. That is, the payment is a constant for all periods, the same amount of the loan, including principal and interest, is repaid every month. The required parameters of the FINANCE PMT function are the interest rate (rate), the number of periods (nper), and the loan amount as the present value (pv). The optional parameters are the future value (fv) and the payment due type (type).

The FINANCE PPMT function can be used to calculate the payment on the principal during a given period for an investment or a loan based on a constant periodic payment and a constant interest rate. The required parameters are interest rate (rate), the payment period (period), number of periods (nper), and the loan amount as the present value (pv), optional parameters are future value (fv) and payment due type (type). Since FINANCE PPMT returns the monthly principal, we can calculate the monthly interest and the principal balance (pb) that need to be repaid. The complete SAS program is as follows:

 /*Mortgage: Constant Payment Mortgage*/
data Mortgage1;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  pmt = FINANCE('pmt', rate, nper, pv); /*Monthly payment*/
  pb=pv;           /*Principal Balance*/
  do period = 1 to nper;
    ppmt=FINANCE('ppmt', rate, period, nper, pv); /*Payment on the principal*/
    mi=pmt- ppmt; /*Payment on the interest*/
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
run;
 
title 'Mortgage: Constant Payment Mortgage';
proc print label;
run;

The results are as follows (Figure 1):

Figure 1: Details of Constant Payment Mortgage payment

To show the total payment and the total interest, we use PROC REPORT to generate a summary table as Figure 2. The total payment for the Constant Payment Mortgage is 3,670,206.07, the total interest is 1,670,206.07, and the ratio of interest to principal is 83.51%.

proc report;
  column period periodid pmt ppmt mi pb;
  format period 3.0 pmt ppmt  mi pb comma15.2;
  define period / order noprint;
  define pb / order;
  rbreak after / summarize style=[font_weight=bold];
 
  define periodid / computed "Period" center; 
  compute periodid / character length=20;
    periodid=period ;
  endcomp;
  compute after;
    if _break_ in ('_RBREAK_') then periodid='Total';
  endcomp;
run;

Figure 2: Constant Payment Mortgage payment schedule

If the mortgage type is Constant Amortization Mortgage (CAM), what will happen to its monthly payment and interest expenses? A Constant Amortization Mortgage divides the total amount of the loan into equal parts during the payment periods and the lender repays the same amount of principal every month and the interest incurred on the remaining loan of that month. As a result, the monthly payment amount is not constant, the initial payment is high, but the interest component of the payments declines rapidly over time. For this type of mortgage, the monthly payment on the principal (ppmt) is constant; we only need to calculate the current payment on the interest (mi) generated by the principal balance (pb) to get the current total payment (pmt) for the given period. The calculation can be accomplished without calling SAS FINANCE functions. The program is as follows:

/*Mortgage: Constant Amortization Mortgage*/
data Mortgage2;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  ppmt=-pv/nper;   /*Monthly Payment on the principal*/
  pb=pv;           /*Principal Balance*/
  do period=1 to nper;
    mi=-pb*rate;   /*Payment on the interest*/
    pmt=ppmt+ mi;
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
RUN;
 
title 'Mortgage: Constant Amortization Mortgage';
PROC PRINT;
RUN;

We use the same PROC REPORT program above to generate the summary table as Figure 3. In the table, the total payment for the Constant Amortization Mortgage is 3,369,550.00, the total interest is 1,369,550.00, and the ratio of interest to principal is 68.48%. This indicates that using this type of mortgage can save 300,656.07 in interest expenses.

Figure 3: Constant Amortization Mortgage payment schedule

To compare the monthly payment, monthly interest, and their cumulative value changes with the repayment periods for two mortgage types, we use PROC SGPANEL to draw the following figures. The code is as follows:

/*Merge two mortgage data with type=1/2 in long format*/
data Mortgage;
  set Mortgage1 Mortgage2;  
  retain type 1;
  if period=1 and _N_^=1 then type=type+1;
  pmt=abs(pmt);
  mi=abs(mi);
  ppmt=abs(ppmt);
run;
/*Create format for mortgage type*/
proc format;
  value typefmt 1="Constant Payment Mortgage"
2="Constant Amortization Mortgage";
run;
data Mortgage;
  set Mortgage;
  accuppmt+ppmt;
  accupmt+pmt;
  if first.type then do; 
	accuppmt=ppmt;
	accupmt=pmt;
  end;
  by type;
  label accuppmt="Cumulative monthly payment on principal" accupmt="Cumulative monthly payment";
  format type typefmt.;
run;
 
ods graphics / width=800px height=300px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2;
 
  vline period / response=pmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=pmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;
  vbar period / response=ppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;  
  colaxis label="Period" type=linear  values=(1 to 300 by 12  ) valueattrs=(color=gray size=12pt) grid;
run;
 
ods graphics / width=800px height=500px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2 ;
 
  vline period / response=accupmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=accupmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;;
  vbar period / response=accuppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;;
  colaxis label="Period" type=linear  values=(1 to 300 by 12  )  valueattrs=(color=gray size=11pt) grid;  
run;

In the following figures, the y-axis represents the monthly payment or the cumulative monthly payment, and the x-axis represents the payment period. The orange represents the principal component and the gray represents the interest component; the two constitute the monthly payment.

In Figure 4, the monthly payment of CPM (top), and the monthly payment on principal of CAM (bottom) are fixed values, which appear as a horizontal line and do not change over time. Correspondingly, in Figure 5, the cumulative monthly payment of CPM (top) and the cumulative monthly payment on principal (bottom) show linear growth.

Figure 4: Monthly payment of CPM vs CAM over period

Figure 5: Cumulative monthly payment of CPM vs CAM over period

Since the monthly payment of CPM is constant, the principal component increases month by month, but the interest component decreases accordingly. For CAM, the monthly payment and the interest component decreases rapidly month by month, but the principal component is constant.

CPM brings less stress to lenders and it is suitable for people whose income is expected to rise, and the lender is only willing to bear a lower monthly payment. CPM is also suitable for those people who have other channels to manage their own money to achieve greater benefits than mortgage expenditures. It is also suitable for those people who have a plan to sell real estate after a certain period. The disadvantage of CPM is that the overall interest expense is relatively high.

CAM is suitable for people who have higher income but whose income is flat or declining in the future. These people hope to put the payment stress on now rather than in the future. Declining payments may be appropriate to match a deflationary environment, but its high initial payment and rapid paydown of principal reduces leverage too fast. Furthermore, some people may complain the constantly changing payment is difficult to budget for and manage.

Summary

We have summarized the design and implementation of SAS functions in financial calculations. We then provided examples to calculate and analyze the monthly payment, interest, and principal for CPM/CAM mortgages. Finally, we compared the monthly payment, principal, interest, and cumulative value changes to a period for two mortgage types.

SAS financial functions review and mortgage payment analysis was published on SAS Users.

12月 232021
 

As the leader in analytics, SAS provides a very rich set of SAS functions and CALL routines for programmers. There is a total of 693 uniquely named SAS functions besides the hundreds of SAS PROC steps. These SAS functions not only cover general functions such as character/string processing, mathematical, permutation & combination, probability, search and sorting, date and time, etc., but also includes a powerful domain-specific function set involving variable information and control, external files and system extensions, web services and tools, descriptive statistics, financial calculations, geodetic distance & ZIP code, Git and other fields. This article reviews the financial functions provided by the SAS system and demonstrates how to use them to do financial analysis and simplify your decisions. Functions with the same name can be invoked easily during programming with SAS DATA step, DS2, FedSQL and CAS environment, which will improve your programs' portability.

Overview

SAS provides 94 financial functions in total, which are mainly used to calculate financial values, such as interest, periodic payments, depreciation, and prices for European options on stocks and futures. The 53 financial functions conforming to industry standards are packaged in 1 general function called FINANCE. Each finance function has a unique string identifier for the first parameter of the FINANCE function. These functions have to be invoked via the FINANCE function. For example, we can use irr=finance("IRR", -400,100,200,300); to compute the internal rate of return (IRR) for a series of cash flows.

The 53 FINANCE functions are compatible with the 53 financial functions provided by Microsoft Excel for Office 365. In fact, the list of financial functions available in the latest Excel for Office 365 software contains 56 functions. Among the 3 extra functions, PDURATION and RRI are specific functions for Excel 2013, and STOCKHISTORY is a specific function for Office 365. These 53 financial functions are designed for financial calculation such as depreciation, maturation, accrued interest, net present value, periodic savings, internal rates of return, etc. They can be categorized as investment, depreciation, Internal rate of return, financial value, and yield calculations. The complete list of SAS FINANCE functions is shown in Table 1.

Category Subcategory Function
Investment Calculation [12] Future Value [2] FV, FVSCHEDULE
Payment [4] PMT, IPMT, PPMT, ISPMT
Present Value [3] PV, NPV, XNPV
Annual interest rate [2] EFFECT, NOMINAL
Number of periods [1] NPER
Depreciation [7] AMORDEGRC, AMORLINC, DB, DDB, SLN, SYD, VDB
Internal rate of return [4] IRR, RATE, MIRR, XIRR
Financial value and yield [30] Accrued/Cumulative Interest & Principal [4] ACCRINT, ACCRINTM, CUMIPMT, CUMPRINC
Coupon days/date and Number [6] COUPDAYBS, COUPDAYS, COUPDAYSNC, COUPNCD, COUPNUM, COUPPCD
Yield [6] INTRATE, ODDFYIELD, ODDLYIELD, YIELD, YIELDDISC, YIELDMAT
Treasury bill [3] TBILLEQ, TBILLPRICE, TBILLYIELD
Duration [2] DURATION, MDURATION
Security [7] DISC, ODDFPRICE, ODDLPRICE, PRICE, PRICEDISC, PRICEMAT, RECEIVED
Dollar Price [2] DOLLARDE, DOLLARFR

Table 1: 53 FINANCE XXX functions in one FINANCE function

Another 40 independently named SAS financial functions are used for investment or loan calculation, cash flow, depreciation, and pricing models. For example, we can use irr =irr(1,-400,100,200,300); to compute the internal rate of return (IRR) as a percentage for a series of cash flows.

There are 7 functions (IRR, NPV, PMT, PPMT, IPMT, CUMIPMT, CUMPRINC) whose names are the same as the industry standard FINANCE functions, but they have different sign-in results or even different calculation logic. For example, FINANCE PMT is used to calculate the periodic payment of an annuity, but the PMT function is used to calculate the periodic payment for a constant payment loan or the periodic savings for a future balance. Annuity and periodic savings are different in anchoring market interest rates, interest calculation, liquidity, and binding, so the same arguments return different results for them. The list is shown in Table 2.

Category Function
Investment/Loan [16] IRR/INTRR, NPV/NETPV, SAVING, SAVINGS,
NOMRATE, EFFRATE, COMPOUND,
PMT, PPMT, IPMT, CUMIPMT, CUMPRINC, MORT, TIMEVALUE
Depreciation [10] DACCDB, DACCDBSL, DACCSL, DACCSYD, DACCTAB
DEPDB, DEPDBSL, DEPSL, DEPSYD, DEPTAB
Cash flow [6] CONVX, CONVXP, DUR, DURP, PVP, YIELDP
Pricing Models [8] Stocks: BLKSHCLPRC/BLKSHPTPRC, GARKHCLPRC/GARKHPTPRC, MARGRCLPRC/MARGRPTPRC
Futures: BLACKCLPRC/BLACKPTPRC

Table 2: 40 uniquely named financial functions

To keep semantic consistency, these industry-standard financial functions usually have the same parameter names, and some of the most important parameters are:

  • Interest rate (rate): The interest rate or discount rate of an investment or loan.
  • Number of periods (nper): The total number of payment periods for investment or loan.
  • Payment (pmt): The regular payment that is made each period for investment or loan, the value cannot change over the life of the annuity. Typically, it contains principal and interest, but not other fees and taxes.
  • Present value (pv): The current value of an investment or loan. The present value of an investment is the value at the beginning of the investment period, and the present value of the loan is the value of the principal borrowed.
  • Future value (fv): The value of an investment or loan after all payments have occurred, or a cash balance that you want to attain after the last payment is made.
  • Payments due Type (type): A flag indicates if payments are due at the beginning of the period. The omitted type or 0 means payments are due at the end of the period, 1 indicates payments are due at the beginning of the period.
  • Period (period): Specify the period for interest rate or depreciation calculation. The period must be between 1 and the number of periods (nper).
  • Day count basis (basis): Optional, specify the type of day count basis to use. 0-represents US (NASD) 30/360, default; 1-represents Actual/actual; 2-represents Actual /360; 3-represents Actual /365; 4-represents European 30/360.
  • Frequency (frequency): Specify the number of coupon payments per year, 1-annual payment; 2-semiannual payment; 4-quarterly payment.

Below, we use some simple examples to show how to use these SAS financial functions to calculate and analyze mortgages.

Mortgage calculation

The average interest rate of first home mortgage in China is 5.46% since September 2021, and the average interest rate of second home mortgages is 5.83%, up 23 and 29 basis points respectively from 2020. Assuming that the first housing loan amount is 2 million yuan (CNY), the loan periods is 25 years. If the mortgage type is Constant Payment Mortgage (CPM), how much will the monthly payment and total payment be?

The FINANCE PMT function can be used to calculate the periodic payment of an annuity, it can be used to calculate the constant payment for Constant Payment Mortgage by default. That is, the payment is a constant for all periods, the same amount of the loan, including principal and interest, is repaid every month. The required parameters of the FINANCE PMT function are the interest rate (rate), the number of periods (nper), and the loan amount as the present value (pv). The optional parameters are the future value (fv) and the payment due type (type).

The FINANCE PPMT function can be used to calculate the payment on the principal during a given period for an investment or a loan based on a constant periodic payment and a constant interest rate. The required parameters are interest rate (rate), the payment period (period), number of periods (nper), and the loan amount as the present value (pv), optional parameters are future value (fv) and payment due type (type). Since FINANCE PPMT returns the monthly principal, we can calculate the monthly interest and the principal balance (pb) that need to be repaid. The complete SAS program is as follows:

 /*Mortgage: Constant Payment Mortgage*/
data Mortgage1;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  pmt = FINANCE('pmt', rate, nper, pv); /*Monthly payment*/
  pb=pv;           /*Principal Balance*/
  do period = 1 to nper;
    ppmt=FINANCE('ppmt', rate, period, nper, pv); /*Payment on the principal*/
    mi=pmt- ppmt; /*Payment on the interest*/
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
run;
 
title 'Mortgage: Constant Payment Mortgage';
proc print label;
run;

The results are as follows (Figure 1):

Figure 1: Details of Constant Payment Mortgage payment

To show the total payment and the total interest, we use PROC REPORT to generate a summary table as Figure 2. The total payment for the Constant Payment Mortgage is 3,670,206.07, the total interest is 1,670,206.07, and the ratio of interest to principal is 83.51%.

proc report;
  column period periodid pmt ppmt mi pb;
  format period 3.0 pmt ppmt  mi pb comma15.2;
  define period / order noprint;
  define pb / order;
  rbreak after / summarize style=[font_weight=bold];
 
  define periodid / computed "Period" center; 
  compute periodid / character length=20;
    periodid=period ;
  endcomp;
  compute after;
    if _break_ in ('_RBREAK_') then periodid='Total';
  endcomp;
run;

Figure 2: Constant Payment Mortgage payment schedule

If the mortgage type is Constant Amortization Mortgage (CAM), what will happen to its monthly payment and interest expenses? A Constant Amortization Mortgage divides the total amount of the loan into equal parts during the payment periods and the lender repays the same amount of principal every month and the interest incurred on the remaining loan of that month. As a result, the monthly payment amount is not constant, the initial payment is high, but the interest component of the payments declines rapidly over time. For this type of mortgage, the monthly payment on the principal (ppmt) is constant; we only need to calculate the current payment on the interest (mi) generated by the principal balance (pb) to get the current total payment (pmt) for the given period. The calculation can be accomplished without calling SAS FINANCE functions. The program is as follows:

/*Mortgage: Constant Amortization Mortgage*/
data Mortgage2;
  rate=0.0546/12; /*Interest rate: convert yearly 5.46% to monthly*/
  nper=25*12;     /*Number of periods: Term in month, convert 25 years to 300 months*/
  pv=2000000;     /*Present value: Loan amount - the value of the principal borrowed 2M*/
 
  ppmt=-pv/nper;   /*Monthly Payment on the principal*/
  pb=pv;           /*Principal Balance*/
  do period=1 to nper;
    mi=-pb*rate;   /*Payment on the interest*/
    pmt=ppmt+ mi;
    pb=pb+ppmt;
    output;
  end;
 
  label rate="Interest Rate (monthly)" nper="Number of periods (month)" pv="Loan amount" 
        period="Period" pmt = "Monthly payment" ppmt="Payment on principal" mi="Payment on interest" pb="Principal balance";
RUN;
 
title 'Mortgage: Constant Amortization Mortgage';
PROC PRINT;
RUN;

We use the same PROC REPORT program above to generate the summary table as Figure 3. In the table, the total payment for the Constant Amortization Mortgage is 3,369,550.00, the total interest is 1,369,550.00, and the ratio of interest to principal is 68.48%. This indicates that using this type of mortgage can save 300,656.07 in interest expenses.

Figure 3: Constant Amortization Mortgage payment schedule

To compare the monthly payment, monthly interest, and their cumulative value changes with the repayment periods for two mortgage types, we use PROC SGPANEL to draw the following figures. The code is as follows:

/*Merge two mortgage data with type=1/2 in long format*/
data Mortgage;
  set Mortgage1 Mortgage2;  
  retain type 1;
  if period=1 and _N_^=1 then type=type+1;
  pmt=abs(pmt);
  mi=abs(mi);
  ppmt=abs(ppmt);
run;
/*Create format for mortgage type*/
proc format;
  value typefmt 1="Constant Payment Mortgage"
2="Constant Amortization Mortgage";
run;
data Mortgage;
  set Mortgage;
  accuppmt+ppmt;
  accupmt+pmt;
  if first.type then do; 
	accuppmt=ppmt;
	accupmt=pmt;
  end;
  by type;
  label accuppmt="Cumulative monthly payment on principal" accupmt="Cumulative monthly payment";
  format type typefmt.;
run;
 
ods graphics / width=800px height=300px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2;
 
  vline period / response=pmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=pmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;
  vbar period / response=ppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;  
  colaxis label="Period" type=linear  values=(1 to 300 by 12  ) valueattrs=(color=gray size=12pt) grid;
run;
 
ods graphics / width=800px height=500px;
title "Mortgage: CPM vs CAM";
proc sgpanel data=Mortgage;
  panelby type / novarname rows=2 ;
 
  vline period / response=accupmt lineattrs=(color=Cxa0a0a0);
  vbar period / response=accupmt nooutline fillattrs=(color=cxE0E0E0) FILLTYPE=SOLID transparency= 0.5;;
  vbar period / response=accuppmt nooutline fillattrs=(color=DarkOrange) FILLTYPE=SOLID transparency= 0.5;;
  colaxis label="Period" type=linear  values=(1 to 300 by 12  )  valueattrs=(color=gray size=11pt) grid;  
run;

In the following figures, the y-axis represents the monthly payment or the cumulative monthly payment, and the x-axis represents the payment period. The orange represents the principal component and the gray represents the interest component; the two constitute the monthly payment.

In Figure 4, the monthly payment of CPM (top), and the monthly payment on principal of CAM (bottom) are fixed values, which appear as a horizontal line and do not change over time. Correspondingly, in Figure 5, the cumulative monthly payment of CPM (top) and the cumulative monthly payment on principal (bottom) show linear growth.

Figure 4: Monthly payment of CPM vs CAM over period

Figure 5: Cumulative monthly payment of CPM vs CAM over period

Since the monthly payment of CPM is constant, the principal component increases month by month, but the interest component decreases accordingly. For CAM, the monthly payment and the interest component decreases rapidly month by month, but the principal component is constant.

CPM brings less stress to lenders and it is suitable for people whose income is expected to rise, and the lender is only willing to bear a lower monthly payment. CPM is also suitable for those people who have other channels to manage their own money to achieve greater benefits than mortgage expenditures. It is also suitable for those people who have a plan to sell real estate after a certain period. The disadvantage of CPM is that the overall interest expense is relatively high.

CAM is suitable for people who have higher income but whose income is flat or declining in the future. These people hope to put the payment stress on now rather than in the future. Declining payments may be appropriate to match a deflationary environment, but its high initial payment and rapid paydown of principal reduces leverage too fast. Furthermore, some people may complain the constantly changing payment is difficult to budget for and manage.

Summary

We have summarized the design and implementation of SAS functions in financial calculations. We then provided examples to calculate and analyze the monthly payment, interest, and principal for CPM/CAM mortgages. Finally, we compared the monthly payment, principal, interest, and cumulative value changes to a period for two mortgage types.

SAS financial functions review and mortgage payment analysis was published on SAS Users.

6月 252021
 

In many programming languages, there is a function named eval() that can be used to evaluate an expression and return the result at run time. For example, in Python, the eval() function parses the expression passed to it and runs a Python expression or code within the program. Even Python eval() supports more optional parameters, such as global/local dictionary for runtime context, but the goal for evaluation remains the same.

When an expression is stored as a character variable of an SAS observation, it suggests the running context is the SAS program execution phase, so the internal data structure like Input Buffer (IB), Program Data Vectors (PDV) and Descriptor Information of the output database are all accessible. Furthermore, all SAS system functions are also accessible in an expression, just like the raw SAS code exposed to SAS compiler and runtime. In the following SAS data set, what the user wants is to get the real value from the expression in variable C. It is different from the commonly used Calculated Column, which uses the same computing rule to expand a new variable for all observations. Here the column C has a different computing rule for each observation; the user's expected result is the column D.

%let MVAR=2;
data a; 
  length a b 8 c $ 255;
  a=3; b=4; c='a**2+b**2'; output; /* Arithmetic or Logical */  
  a=7; b=22; c='b/a * &MVAR'; output; /* SAS Macro Var*/
  a=113; b=355; c='cos(b/a)'; output; /* Trigonometric */
  a=0; b=1; c='cdf("NORMAL", 0, a, b)'; output; /* Probability */
run;
proc print;run;

What solutions ahead?

Someone might want a solution that parses the expression in variable C, and then try to rebuild the abstract syntax tree and try to evaluate it from bottom to top. I can’t say this solution is totally wrong but it’s very complex and too hard to generate a general solution. If we have an SAS function eval() in DATA Step, then we can easily use the following code to achieve the goal. Unfortunately, SAS did not provide the eval() function.

 
data b; 
  set a;
  d=eval(c);
run;

SAS provides Macro function %eval, %sysevalf to evaluate arithmetic and logical expressions using integer or floating-point arithmetic. SAS also provides function resolve() (not supported in the DATA Step that runs in CAS) to return the resolved value of the argument after it has been processed by the macro facility. Anyway, resolve() can’t access the values of macro variable assigned by symput() or symputn() at program execution phase. So, for the SAS code as below, it outputs c=150 and d=200. For simple arithmetic and logical expression, we can use the tranwrd() function to replace all variable names with real values, and then use resolve('%sysevalf('|| cats(expression_with_real_values) || ')') to get evaluated result, but it limited to only SAS functions supported by %eval() and %sysevalf() macro functions.

%let a=100;
%let b=50;
data _null_;
  call symput("b", 200);  
  c=resolve("%eval(&a+&b)");
  put c=;
  d=symget("b");
  put d=; 
run;

Now let’s return to the key question we raised: How can we implement the eval() function in SAS Data Step? The best solution we found so far is to use the dynamic code generation mechanism in SAS, and then submit the code to SAS for parsing and execution. In this way, we retrieve string expression from a variable to a real expression in SAS code successfully. We don’t care what the valid SAS expression is, we totally convert it to the code snippets and submit it for execution. Syntax check, reference to PDV variables, internal functions, and even SAS Macro variables are all supported. Yes, it’s a smart and concise implementation for general purposes.

Due to each observation having a different computing rule, we need to use _N_ to control the observation mapping. So, for the existing dataset A, we can use the following code to achieve the goal. The key secret is to use CALL EXECUTE to generate SAS code dynamically and delay execution after the code is ready. In the output dataset B, we have a numeric column D with the value evaluated with the character expression from column C. You can see the output on the RIGHT part of the Figure 1.

data _null_;
  set a end=last;
  if _n_=1 then call execute("data b; set a;");
  call execute( "if _N_=" || compress(_N_) || " then d=" || trim(c) || "; ");
  if last then call execute("run;");
run;
proc print; run;

Wrap up as a reusable SAS macro

We also can wrap up the above SAS code in a reusable SAS macro, so this logic can be used anywhere you want. The %evalvar macro has four arguments: ds= and var= are the input dataset and columns with expression to be evaluated. And outds= and outvar= are the output dataset and columns to store result. We also can specify same value for ds= and outds=, so the user just expands the existing dataset with an additional evaluated column.

%macro EvalVar(ds=, var=, outds=, outvar=);
data _null_;
  set &ds end=last;
  if _n_=1 then call execute("data &outds; set &ds;");
  call execute( "if _n_=" || compress(_n_) || " then &outvar=" || &var || ";");
  if last then call execute("run;");
 run;
%mend; 
 
%EvalVar(ds=a, var=c, outds=b, outvar=d);
proc print; run;

By the way, I also had a temporary SAS code generation solution implemented via the %include SAS macro. The temporary SAS code will be deleted automatically when the SAS session is closed. The sample code is also attached here for your information.

filename tmpfile temp;
data _null_;
  file tmpfile;
  set a end=last;  
  if _N_=1 then put "data b; set a;";
  put "if _N_ =" _N_ "then d=" c ";";
  if last then 	put "run;";
run;
%include tmpfile;
proc print; run;

Summary

In this article, we talk about how to evaluate SAS expressions in Data Step dynamically. The expression parsing and execution are totally handled by SAS at program execution phase. It avoids handling abstract syntax tree parsing and evaluation ourselves on it. We introduce two dynamic code generation implementations via call execute or %include. We also use _N_ to control observation mapping due to each observation has different computing rules. This implementation can reflect the beauty of simplicity in SAS before SAS provides system function eval() one day.

How to evaluate SAS expression in DATA Step dynamically was published on SAS Users.

6月 032020
 

In natural language processing, word vectors play a key role in making technologies such as machine translation and speech recognition possible. A word vector is a row of numeric values where each point captures a dimension of the word’s meaning. Each value represents how closely it relates to the concept behind that dimension, so the semantics of the word is embedded across the dimensions of the vector. Since similar words have similar vectors, representing words as vectors like this would simplify and unify vectors' operations.

Word vectors are generated by a training performed word-word co-occurrence statistics on a large corpus. You can use pre-trained word vectors like GloVe, provided by Stanford University.

Let's talk about how to transform word vector tables from long to wide in SAS, so we can potentially get sentence vectors to process further. Suppose we generate word vectors from the following 3 sentences:

Jack went outside.
Jill likes to draw in the afternoon.
Tony is a boy.

Each word has 2 numeric values (Vector1, Vector2), each value represents how closely the word relates to the concept defined by that dimension. The value numbers (here VNUM=2) may range from hundreds to thousands in real text analysis scenarios.

Long word vector table

The sample code below generates an upper sample table and sorts it for further processing.

data HAVE;
  length Word $ 45;
  input SentenceID Word Vector1-Vector2; /*300+*/
datalines;
1	Jacky 	   0.24011	 0.400996 
1	went	  -0.047581	 0.868716 
1	outside	  -1.197891	 1.162238
2	Jill	  -0.199579	 0.251252
2	likes	  -1.935640	-0.288264
2	to	  -0.526053	-1.143420
2	draw	  -0.736289	-0.794812
2	in 	  -2.757234	 0.506639
2	the	  -0.736289	-0.794812
2	afternoon -0.047581	 0.868716
3	Tony 	   0.34032	 0.600983 
3	is	   0.147531	 0.968817
3	a	   1.347543	 2.568323
3	boy       -3.257891      3.172238
run; 
proc sort data=HAVE;
  by SentenceID;
run;
proc print data=have;run;

If we want to transform the upper long table to a wide table as seen below, how can we do this as efficiently and simply as possible? The upper 14 words belong to 3 sentences that would result in the following 3 rows with 22 columns (1 + WNUM + WNUM x VNUM=1 + 7 + 7 x 2 = 22).

Wide word vector table

Please note that we can calculate the max word number (WNUM) in a sentence at runtime with SAS code below. For the upper case, the value of WNUM is 7.

proc sql noprint;
  select max(count) into :wnum from (
    select count(Word) as count from HAVE group by SentenceID 
  );
quit;

In fact, we don’t need any SAS PROC to handle this kind of transformation. A SAS Data step provides an efficient and convenient way to transform data. The key is to use an ARRAY to map all word vectors from the source table, and then define two ARRAYs to store output words and vectors in a wide style. These two arrays for output words and vectors need to be RETAIN during the implicit loop and KEEP for OUTPUT while it reaches the last SentenceId.

You can see the full SAS code below with detailed comments.

/*Long table to Wide table*/
%let vnum=2; /*vector numbers for a word*/
%let wnum=7; /*max word number in a sentence*/
data WANT;
  set HAVE;
  by Sentenceid;
  array _vector_ [*] vector:;         /*Map to source vectors*/
 
  array _word [ %eval(1*&wnum)] $ 45; /*Array to store WORD in wide table*/
  array _vector [ %eval(&wnum*&vnum)];/*Array to store VECTORS in wide table*/
  retain _word: _vector:;             /*RETAIN during the implicit loop*/
 
  retain _offset_ 0;                  /*Offset of a WORD in a sentence, base 0*/
  if first.Sentenceid then do;
    call missing(of _word[*]);
	call missing(of _vector[*]);
    _offset_=0;
  end;
  else _offset_=_offset_+1;
 
  _word[ _offset_+1 ]=word;           /*Cache current word to array WORD at [ _offset_+1]*/
  do i=1 to dim(_vector_);            /*Cache each vectors to array VECTORS at [_offset_* &vnum +i]*/
    _vector[_offset_* &vnum +i]=_vector_[i]; 
  end;
  keep Sentenceid _word: _vector: ;   /*Keep for output when it hit last.Sentenceid*/
 
  if last.Sentenceid then output;     /*Output the cached WORD and VECTORS*/
run;
 
proc print data=want;run;

Accordingly, if we need to transform a word vector back from wide style to long style, we need to generate &WNUM rows x &VNUM columns for each sentence, and it’s the reversed process for upper logic. The full SAS code with detailed comments is listed below:

/*Wide table to Long table*/
data HAVE2;
  set WANT; 
 
  array _word [*] _word:;           /*Array _word[] mapping to WORD in wide table*/
  array _vector_ [*] _vector:;     /*Array _vector[] mapping to VECTORS in wide table*/
 
  length Word $ 45;                 /*Output Word in the long table*/
  array Vector[&vnum];              /*Output Vectors in the long table*/
  do i=1 to &wnum;                  /*Unpack word from array _word[]*/       
    word=_word[i]; 
	if word=" " then continue;
    do j=1 to &vnum;                /*Unpack vectors from array _vector[]*/
	  oo= (j+&vnum * (i-1)); 
      Vector[j]=_vector_[j + &vnum *(i-1)];
    end;
	keep Sentenceid Word Vector:;
	output;                          /*One row in wide table generate &wnum rows[]*/
  end;
run;
proc print data=HAVE2;run;

To wrap the upper bi-directional transformation process for general repurposing in text analysis, we provide two SAS MACROs listed below:

%Long2Wide(data=Have, vnum=2, wnum=7, sid=SentenceId, word=Word, out=Want);
proc print data=Want;run;
 
%Wide2Long(data=Want, vnum=2, wnum=7, sid=Sentenceid, out=Have2, outword=Word, outvector=Vector);
proc print data=Have2;run;

We have demonstrated how to transform a word vector table from a long style to a wide style (or vice versa) efficiently with a SAS DATA step. We have also provided two well-wrapped SAS MACROs for general re-use purposes. To learn more, please check out these additional resources:

Transform word vector tables from long to wide was published on SAS Users.

12月 202019
 

One day, I received a question from the SAS community on how to exclude variables with missing values up to a certain percentage in data analysis. Before I dive into details, another question about how to remove variables with repeated values up to a certain percentage was raised. In fact, this user demands focus on the value characteristics for a single variable, and it determines if the user should include a specific variable in the following data analysis. It reflects the demand of variable selection for wide datasets with noise; it enhances data filtering by enabling variables(cols) selection, in addition to the normal observation(rows) filtering.

DATA Step implementation

If the variable filtering is based on the percentage of the missing values, the most intuitive thought is to use DATA step to calculate the percentage of missing values for each column of the input dataset, and then collect the variable names that meet the threshold conditions before passing them to a global macro variable for subsequent analysis. Since SAS DATA step includes both character and numeric data types, we cannot define one single array to map all columns of the dataset. Instead, we need to define two arrays to map all numeric variables and all character variables respectively.

In order to build a universal reusable SAS macro, we also need to detect the number of numeric and character variables in advance and pass them to the array definition in the next DATA step. In that DATA step, we use two arrays to access dataset variables and count the number of missing values respectively. When the loop reaches the last observation, it collects variable names through the vname() function which meet the threshold condition, and finally place them into a SAS global macro variable &VARLIST. The SAS macro below demonstrates this logic:

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_;  /*get numeric and character variables count */
    set &data (obs=1);
    array _ncols_n _numeric_;
    array _ccols_n $ _character_;
    call symput ('_ncols_n',put(dim(_ncols_n),best.));
    call symput ('_ccols_n',put(dim(_ccols_n),best.));
  run;
  data _null_;
	set &data end=last;
    /*Process numeric types*/
	array _ncols_ _numeric_;	
	array _ncols_count {&_ncols_n};
	retain _ncols_count (&_ncols_n*0);	
	do i=1 to dim(_ncols_);			
	  if _ncols_[i]= .  then _ncols_count[i]=_ncols_count[i] + 1;
	end;
    /*Process character types*/
	array _ccols_ $  _character_;
    array _ccols_count {&_ccols_n};	
	retain _ccols_count (&_ccols_n*0);
	do i=1 to dim(_ccols_);			
	  if _ccols_[i]=" " then _ccols_count[i]=_ccols_count[i]+1;
	end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if last then do;	  
	  length _varlist_ $ 32767;
retain _varlist_ " ";
	  do i=1 to &_ncols_n; 
	    if (_ncols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ncols_[i]) );
        end;				
	  end;
	  do i=1 to &_ccols_n; 
	    if (_ccols_count[i]/_N_) >= &th then do;
		  _varlist_=trim(_varlist_) || " " || trim( vname(_ccols_[i]) );
		end;
	  end;	
	  call symput ("&VARLIST",trim(_varlist_));			
	end;
  run;
%mend; 
/*Exclude variables if missing value pct GE. threshold  &th*/
%macro FilterData(data=, th=, out=);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars);
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

To verify the programing logic, let’s generate a demo dataset WORK.CLASS based on SASHELP.CLASS, the output dataset has different missing value percentage (10.53%, 36.84%, 52.63%) in Sex, Height and Weight columns.

data class; /*To generate demo dataset*/
  set sashelp.class;  
  if age < = 11 then sex=" ";
  if age <= 12 then height=.;
  if age <= 13 then weight=.;  
run;
proc print data=class; run; /*See Figure 1 for output */

Now we can use the following code to filter columns by percentage of missing values, which is greater than 40%; the Weight column will not appear in the output CLASS2 dataset.

 
%FilterData(data=class, th=0.4, out=class2);
proc print data=class2; run; /*See Figure 1 for output */

Figure 1: Data before filtered and after filtered

The calculation here for the percentage of missing values is implemented by the user programming in the DATA step. The missing value count is computed separately in two arrays according to the character type and the numerical type. If you want to do it in a single array, you must use HASH object to store the variable name and its corresponding index in that array. Whenever a missing value is detected, we need to check whether the variable is already in the HASH object; if not, then add it directly. Otherwise, look up the table to find the appropriate array index, and then increase the counter to the corresponding element in that array. The advantage of this approach is we only need one array in later variable names collecting phase. The disadvantage is the use of HASH object increases the programming complexity. The complete implementation based on this logic is as follows, it can completely replace the first version of %FilterCols implementation.

 
%macro FilterCols(data=, th=0, varlist=);
  data _null_; /*Detect variable numbers of dataset*/
    dsid=open("&data.",'i');
    if dsid>0 then do;
      nvars= attrn(dsid, 'NVARS'); 
	  call symput("NVARS", nvars); 
    end;
  run;
  data _null_;
    set &data. end=last; 
    /*Define 2 arrays map to numeric/character vars respectively*/
    array _ncols_ _numeric_;
    array _ccols_ _character_;
 
    if _N_=1 then do; /*HASH Object definition and initialize*/
      length key $ 32;
      declare hash h(); 
	  rc=h.definekey("key");
	  rc=h.definedata("data", "key");
	  rc=h.definedone();
	  call missing(key, data); 
    end;  
    /*Define one array for counting of missing value*/
	array _cols_count[&NVARS] _temporary_; 
	retain _cols_count ( &NVARS * 0); 
    retain _col_id_ 1; 
 
    do i=1 to dim(_ncols_);	  
      if missing( _ncols_[i] ) then do; 
        /*Find mapped array index and modify counter value*/
	    key=vname( _ncols_[i] );  
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();  
		   _col_id_=_col_id_+1;		
	    end;    
        _cols_count[ data ] = _cols_count[ data ] +1; 
	  end;
    end;
    do i=1 to dim(_ccols_); 
      if missing( _ccols_[i] ) then do;
        /*Find mapped array index and modify counter value*/
	    key=vname( _ccols_[i] );
	    if h.find()^=0 then do;         	    
          data=_col_id_;
		   h.add();	 
		   _col_id_=_col_id_+1;		
	    end;	        
        _cols_count[ data ] = _cols_count[ data ] +1; 	    		
	  end;
    end; 
    /*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if last then do;
      declare hiter iter('h');
      rc = iter.first();  
      do while (rc = 0);   	    
        pct=_cols_count[data]/ _N_;
		if pct >= &th. then do; 
	      length _varlist_ $ 32767; 
retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(key);		  	  
        end; 
        rc = iter.next(); 
      end;	
	  call symput( "&VARLIST", trim( left(_varlist_)));
    end;
  run;
%mend;

PROC FREQ Implementation

In the two macros above, the counting of missing values for each column are implemented by user programming in DATA step, but SAS provides a more powerful frequency statistic step called PROC FREQ. Can we directly use its output for variable filtering? The answer is YES. For example, the following SAS code can list the Frequency, Percent, Cumulative Frequency, and Cumulative Percent of each variable values (including missing values) in the data set.

 
  proc freq data=class;
    table Sex Height Weight /missing;  /*See Figure 2 for output */
  run;

Figure 2: PROC FREQ output

To get the result of PROC FREQ into a dataset, we just need to add ODS Table statement before running PROC FREQ:

 
ods table onewayfreqs=class_freqdata;

Figure 3: OneWayFreqs table class_freqdata

So, the frequency statistics result of PROC FREQ is stored in the class_freqdata dataset. We only need to check the value of variable “percent” for each column. If it is greater than the threshold specified, the variable name is added to the buffer variable _varilist_, otherwise it is ignored; the final value of _varilist_ is placed to the SAS global macro variable &VARLIST for later use. Comparing to the DATA step implementation, the entire code here is much shorter and simpler.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  proc freq data=&data;
    table _all_ /missing;  
  run; 
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
	value=vvaluex(var); 
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
        retain _varlist_ " "; 
	    _varlist_=trim(_varlist_) || " "|| trim(var);	
		put var= _varlist_;
	  end;
   end;
   if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Please note that PROC FREQ would generate output to ODS HTML destination by default. If you want to avoid generating ODS HTML output when generating the temporary dataset “class_freqdata”, you can use the ods html close; statement to temporarily close the ODS HTML output and then open it again anytime needed.

Since the output of PROC FREQ already has the frequency information of each variable, we can naturally filter columns based on the percentage of duplicate values. However, it should be noted that a single variable always contains multiple distinct values, and each distinct value (including missing values) has a frequency percentage. If percentage of duplicate value is too high in a specific column, the user may exclude that column from the following analysis. As there are multiple frequency percentages, we just choose the largest one as the duplicate value percentage of that variable. For example, the Sex column of demo dataset CLASS has a missing value of 10.53%, a female of 42.11% and a male of 47.37%, so the percentage of duplicate values in the Sex column should be 47.37%.

The following code filter variables by percentage of duplicate values with a specific threshold. The code logic is almost same as before, but it filters values with non-missing values. So, run the same test code against the demo dataset will remove the Age column by duplicate values percentage threshold, instead of the Weight column by missing values percentage threshold.

 
%macro FilterCols(data=, th=0, varlist=);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
  proc freq data=&data;
    table _all_ / missing;  
  run; 
  ods html;
  data _null_;
    set &data._freqdata end=last;
var=scan(Table, 2);  
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”");
    value=vvaluex(var);
/*Filter out the variables meet the conditions and place them in the &VARLIST global macro variable*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do; /*Search max percentage of duplicate values of a variable*/
	    retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;      
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        	      
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
	end;
    if last then call symput( "&VARLIST", trim( left(_varlist_)));  
  run;
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata ;
  quit;
%mend;

In order to combine these two logics of by missing values and by duplicate values in one macro, we just introduce one flag parameter byduprate. The default value is 0 to filter variables by missing values, and value 1 to filter variables by duplicate values. The complete code is as below:

 
%macro FilterCols(data=, th=0, varlist=, byduprate=0);
  ods table onewayfreqs=&data._freqdata; /*FREQ out to temp dataset*/
  ods html close;
proc freq data=&data;
    table _all_ /missing;  
  run; 
ods html;
  data _null_;
	set &data._freqdata end=last;
	var=scan(Table, 2); 
if getoption("LOCALE")="ZH_CN" then var=scan(Table, 2, "“”"); 
	value=vvaluex(var);
  %if &byduprate=0 %then %do; /*Filter variables by missing value*/
	if  value=" " or compress(value)="." then  do; 
      if percent/100>= &th then do;
	    length _varlist_ $ 32767;
	    retain _varlist_ " ";
	    _varlist_=trim(_varlist_) || " "|| trim(var);				 
	  end;
   end;
  %end;
  %else %do; /*Filter variables by duplicate value percentage*/
    if ^(value=" " or compress(value)=".") then do;  
      if  percent > maxpct then do;      
        retain maxpct 0;
        retain maxvar;
        maxpct=percent;  
	    maxvar=var;		
      end; 
      if (_N_>1 and var ^= lastvar) or last then do;              
        if maxpct/100 >= &th. then do;
          length _varlist_ $ 32767;
          retain _varlist_ " ";
          _varlist_=trim(_varlist_) || " "|| trim(maxvar);	        
        end;
        maxpct=percent;
	    maxvar=var;
      end;
      lastvar=var;
      retain lastvar;
    end;
  %end;
    if last then call  symput( "&VARLIST", trim( left(_varlist_)));
  run; 
/*Delete the temp dataset in WORK library*/
  proc datasets nolist;
    delete &data._freqdata;
  quit;
%mend;

Accordingly, we also need to introduce the parameter byduprate with the same name and behavior in the %FilterData macro that's invoking the %FilterCols macro. The SAS macro %FilterData also has the same default value 0, but it needs to pass the byduprate argument specified by the user to the %FilterCols macro. The enhanced version is as below:

 
%macro FilterData(data=, th=, out=, byduprate=0);
  %global myvars;
  %FilterCols(data=&data, th=&th, varlist=myvars, byduprate=&byduprate); 
  data &out;
    set &data; 
    %if %str(&myvars)^=%str() %then %do; drop &myvars; %end;
  run;
%mend;

Now the final version of the macro %FilterData unified the two kinds of variable filtering logic, it just depends on the different value of the argument byduprate: If byduprate is not specified or the value is 0, it means filter variables by the percentage of missing values, otherwise it is by the percentage of duplicate values. For the following two code snippets, the former drops the Weight column, while the latter drops the Sex column.

 
/*Drop columns with missing value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class2);              
proc print data=class2;run;
 
/*Drop columns with duplicate value percentage >= 40% */
%FilterData(data=class, th=0.4, out=class3, byduprate=1); 
proc print data=class3;run;

Figure 4: Drop column by missing and duplicate value percentage

Summary

This article discusses how to use SAS to filter variables in a dataset based on the percentage of missing values or duplicate values. The missing value statistics can be implemented by either DATA step programming on your own or reusing the existing powerful PROC FREQ with statistics result for missing values and duplicate values. The five macros in this article also demonstrated how to wrap up the existing powerful SAS analytics as a reusable, general-purpose SAS macro for a big project. The final version of macro %FilterData also demonstrated how to unify filtering variables by the percentage of either missing values or duplicate values in one to make calling SAS macro functions convenient and easy to reuse. To learn more, please check out these resources:

How to filter variables by percentage of missing values or duplicate values in SAS was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.

11月 282019
 

With time series data analysis, we can apply moving average methods to predict data points without seasonality. This includes Simple Average (SA), Simple Moving Average (SMA), Weighted Moving Average (WMA), Exponential Moving Average (EMA), etc. For series with a trend but without seasonality, we can use linear, non-linear and autoregressive prediction models. In practice, we often use these kinds of moving average methods for series data with large variance, e.g. financial and stock market data.

Since introducing the window concept, SMA reflects historical data with a lag problem, and all points in the same window have the same importance to the current point. Weights are introduced to reflect the different importances of points in a window, and it leads to WMA and EMA. WMA gives greater weight to recent observations while less weight to the longer-term observations. The weights decrease in a linear fashion. If the sequence fluctuations are not large, the same weight is given to the observation and WMA degenerates into a SMA and the WMA becomes an EMA when weights decrease exponentially. However, these traditional moving average methods ave a lag problem due to introducing the smooth window concept and its results are less responsive to the current value. To gain a better balance between lag reduction and curve smoothing in a moving average, we must get help from some smarter mathematics algorithm.

What is Hull Moving Average?

Hull Moving Average (HMA) is a fast moving average method with low lag developed by Australia investor Alan Hull in 2005. It’s known to almost eliminate lag altogether and manages to improve smoothing at the same time. Longer period HMAs can be used as an indicator to identify trend, and shorter period HMAs can be used as entry signals in the direction of the prevailing trend. Due to lag problems in all moving average methods, it’s not suggested to be used as a reverse signal alone. Anyway, we can use different window HMAs to build a more robust trading strategy. E.g., 52-week HMA for trend indicator, and 13-week HMA for entry signal. Here is how it works:

  • Long period HMA identifies the trend: rising HMA indicates the prevailing trend is rising, it’s better to enter long positions; falling HMA indicate the prevailing trend is falling, it’s better to enter short positions.
  • Short period HMA identifies the entry signals in the direction of the prevailing trend: When the prevailing trend is rising, HMA goes up to indicate a long entry signal. When the prevailing trend is falling, HMA goes down to indicate a short entry signal.

Both long & short period HMAs in bullish mode generate Long Signal, and both in bearish mode generate Short Signal. So Long Trades get a buy signal after a Long Signal occurs, while Short Trades get a sell signal after a Short Signal occurs. Long Trades get a Sell Signal when Long or Short Trend is no longer bullish and Short Trades get a buy signal when Long or Short trend is no longer bearish. Figure 1 is a sample for this HMA based trading strategy.

In fact, HMA is quite simple. Its calculation only includes three WMA steps:

  1. Calculate WMA of original sequence X using window length n ;
  2. Calculate WMA of X using window length n/2;
  3. Calculate WMA of derived sequence Y using window length Sqrt(n), while Y is the new series of 2 * WMA(x, n/2) – WMA(x, n). So HMA(x, n) formula can be defined as below:

HMA(x, n) = WMA( 2*WMA(x, n/2) − WMA(x, n), sqrt(n) )

HMA makes the moving average line more responsive to current value and keeps the smoothing of the curve line. It almost eliminates lag altogether and manages to improve smoothing at the same time.

How does HMA achieve the perfect balance?

We know SMA can gain the curve line of an average value for historical data falling in the current window, but it has the constant lag of at least a half window. The output may also have poor smoothness. If we embed SMA multiple times to get the average of the averages, it would be smoother but would increase lag at the same time.

Suppose we have this series: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. The SMA at the last point is 5.5 with window length n=10, i.e. SUM[1...10]/10
is much different than the real value 10. If we reduce the window length to n=5, the SMA at the last point is 8, i.e. SUM[6…10]/5. If we have the second SMA with half size window and add the difference between these two SMAs, then we can get 8 + (8-5.5)=10.5, which is very close to the real value 10 and will eliminate lag. Then we can use SMA with specific window length again to reduce that slight overcompensation and improve smoothness. HMA uses linear WMA instead of SMA, and the last window length is sqrt(n). This is the key trick with using HMA.

Figure 2 below shows the difference between HMA(x, 4) and embed SMA twice SMA( SMA(x,4),4).

Figure 2: Comparison of SMA( SMA(x, 4), 4) and HMA(x, 4).

In stock trading, there are lots of complicated technical indicators derived from stock historical data, such as Ease of Movement indicators (EMV), momentum indicators (MTM), MACD indicators, Energy indicator CR, KDJ indicators, Bollinger (BOL) and so on. Its essence is like feature engineering extraction before neural network training. The fundamental purpose is to find some intrinsic property from price fluctuation. The Hull moving average itself was discovered and applied to real trading practices, which gives us a glimpse to the complication of building an automatic financial trading strategy under the veil. It’s just the beginning, the reflex in trading and aging of models may both bring uncertainty, so the effectiveness of quantitative trading strategies is a very complex topic which needs to be validated through real practice.

HMA in SAS Code

SAS was created for data analysis however, and my colleagues go into detail about how data analysis connects with HMA. Check out Cindy Wang's How to Calculate Hull Moving Average in SAS Visual Analytics and Rick Wicklin's The Hull moving average: Implement a custom time series smoother in SAS. I would like to share how to implement both HMA/WMA with only 25 lines of code in this article. You can use the macro anywhere to build HMA series with specific window lengths.

First, implement Weighted Moving Average (WMA) with SAS macro as shown below (only 11 lines). The macro has four arguments:

  • WMA_N       WMA window length
  • INPUT          Input variable name
  • OUT             Output variable name, default is input name with _WMA_n suffix
  • OUTLABEL The label of output variable, default is “input name WMA(n)
%macro AddWMA(WMA_N=4, INPUT=close, OUT=&INPUT._WMA_&WMA_N., OUTLABEL=&INPUT. WMA(&WMA_N.));
  retain _sumx_&INPUT._WMA_&WMA_N.;
  _sumx_&INPUT._WMA_&WMA_N.=sum (_sumx_&INPUT._WMA_&WMA_N., &INPUT., -lag&WMA_N.( &INPUT. )) ;
 
  retain _sum_&INPUT._WMA_&WMA_N.;
  _sum_&INPUT._WMA_&WMA_N.=sum (_sum_&INPUT._WMA_&WMA_N., &WMA_N * &INPUT., -lag( _sumx_&INPUT._WMA_&WMA_N.  )) ;
 
  retain _sumw_&INPUT._WMA_&WMA_N.;
  if _N_ <= &WMA_N then _sumw_&INPUT._WMA_&WMA_N. = sum (_sumw_&INPUT._WMA_&WMA_N., &WMA_N, -_N_, 1);
 
  &OUT=_sum_&INPUT._WMA_&WMA_N. / _sumw_&INPUT._WMA_&WMA_N.;
 
  drop _sumx_&INPUT._WMA_&WMA_N. _sumw_&INPUT._WMA_&WMA_N. _sum_&INPUT._WMA_&WMA_N.;
  label &OUT="&OUTLABEL"; 
%mend;

We must watch out for the trick in the upper implementation of WMA with linear weights. In fact, there is no do-loop in the code, and it also doesn’t perform repeat computations for weights and values. We just keep the last sum of all values in the window (See Ln2-3), and the sum for WMA is the last sum for WMA plus window length times the value subtracted from it (See Ln4-5). For weight sum computation, we just need to pay attention to the points less than the window length(See Ln6-7).

Second, implement HMA with three times WMA invocation as below, (only 9 lines). The macro has four arguments:

  • HMA_N       HMA window length
  • INPUT         Input variable name
  • OUT            Output variable name, default is input name with _HMA_n suffix
  • OUTLABEL The label of output variable, default is “input name HMA(n)
%macro AddHMA(HMA_N=5, INPUT=close, OUT=&INPUT._HMA_&HMA_N., outlabel=&INPUT. HMA(&HMA_N.));
  %AddWMA(WMA_N=&HMA_N, INPUT=&INPUT.);
 
  %AddWMA(WMA_N=%sysfunc(round(&HMA_N./2)), INPUT=&INPUT.);
 
  &INPUT._HMA_&HMA_N._DELTA=2 * &INPUT._WMA_%sysfunc(round(&HMA_N./2))  - &INPUT._WMA_&HMA_N;
  %AddWMA(WMA_N=%sysfunc(round(%sysfunc(sqrt(&HMA_N)))), INPUT=&INPUT._HMA_&HMA_N._DELTA);
 
  rename &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))=&OUT;
  label &INPUT._HMA_&HMA_N._DELTA_WMA_%sysfunc(round(%sysfunc(sqrt(&HMA_N.))))="&outlabel"; 
 
  drop &INPUT._WMA_&HMA_N &INPUT._WMA_%sysfunc(round(&HMA_N./2)) &INPUT._HMA_&HMA_N._DELTA;
%mend;

The upper code is very intuitive as the first two lines perform WMA with window n and n/2 for X, then the next two lines generate new series Y and performs WMA with window sqrt(n). Other codes are just to improve macro flexibility and to drop temp variables, etc.

To compare HMA with SMA result, we must also implement SMA with the 7 lines of code as shown below. It has the same argument as WMA/HMA above, but SMA is an unnecessary part of HMA macro %ADDHMA implmentation.

 
%macro AddMA(MA_N=5, INPUT=close, out=&INPUT._MA_&MA_N., outlabel=&INPUT. MA(&MA_N.));
  retain _sum_&INPUT._MA_&MA_N.;
  _sum_&INPUT._MA_&MA_N.=sum (_sum_&INPUT._MA_&MA_N., &INPUT., -lag&MA_N.( &INPUT. )) ;
  &out=_sum_&INPUT._MA_&MA_N. / min(_n_, &MA_N.);
 
  drop _sum_&INPUT._MA_&MA_N.;
  label &out="&outlabel";
%mend;

Now we can test all the above implementations as shown below. First is a simple case we described above:

data a;
  input x @@;
  %AddMA( input=x, MA_N=5, out=SMA, outlabel=%str(SMA)); 
  %AddHMA( input=x, HMA_N=5, out=HMA, outlabel=%str(HMA))  
datalines;
1 2 3 4 5 6 7 8 9 10
;
proc print data=a label;
  format _all_ 8.2;
run;

The output is listed below, the HMA result is much more close to the real value x than SMA. It’s more responsive and less in lag.

Figure 3: HMA output sample

Now we can use the HMA macro to check IBM stock data in the SAS system dataset SASHELP.Stocks and draw trends in the plot to compare HMA(x, 5) and SMA(x, 5) output.

data Stocks_HMA;
  set sashelp.Stocks(where=(stock='IBM'));
  %AddHMA( HMA_N=5, out=HMA, outlabel=%str(HMA))
  %AddMA(MA_N=5, out=SMA, outlabel=%str(SMA));
run; 
ods graphics / width=800px height=300px;
title "Comparison of Simple and Hull Moving Average";
proc sgplot data=Stocks_HMA noautolegend;
   highlow x=Date high=High low=Low / lineattrs=(color=LightGray);
   scatter x=Date y=Close / markerattrs=(color=LightGray);
   series x=Date y=SMA / curvelabel curvelabelattrs=(color=DarkGreen) lineattrs=(color=DarkGreen);
   series x=Date y=HMA / curvelabel curvelabelattrs=(color=DarkOrange) lineattrs=(color=DarkOrange);
   refline '01JUN1992'd / axis=x label='June 1992' labelloc=inside;
   xaxis grid display=(nolabel); 
   yaxis grid max=200 label="IBM Closing Price";
run;

Figure 4: Comparison of SMA and HMA

Summary

We have talked about what Hull Moving Average is, how it works, and worked through a super simple SAS Macro implementation for HMA, WMA and SMA with only 25 lines of SAS code. This HMA implementation makes it quite simple and efficient to calculate HMA in SAS Data step without SAS/IML and Visual Analytics. In fact, it shows how easy process rolling-window computation for time series data can be in SAS, and opens a new window to build complicated technical indicators for stock trading systems yourself.

Learn more

Implementing HMA, WMA & SMA with 25 lines of SAS code was published on SAS Users.