8月 292018
 
Local polynomial kernel regression  in SAS

A SAS programmer recently asked me how to compute a kernel regression in SAS. He had read my blog posts "What is loess regression" and "Loess regression in SAS/IML" and was trying to implement a kernel regression in SAS/IML as part of a larger analysis. This article explains how to create a basic kernel regression analysis in SAS. You can download the complete SAS program that contains all the kernel regression computations in this article.

A kernel regression smoother is useful when smoothing data that do not appear to have a simple parametric relationship. The following data set contains an explanatory variable, E, which is the ratio of air to fuel in an engine. The dependent variable is a measurement of certain exhaust gasses (nitrogen oxides), which are contributors to the problem of air pollution. The scatter plot to the right displays the nonlinear relationship between these variables. The curve is a kernel regression smoother, which is developed later in this article.

data gas;
label NOx = "Nitric oxide and nitrogen dioxide";
label E = "Air/Fuel Ratio";
input NOx E @@;
datalines;
4.818  0.831   2.849  1.045   3.275  1.021   4.691  0.97    4.255  0.825
5.064  0.891   2.118  0.71    4.602  0.801   2.286  1.074   0.97   1.148
3.965  1       5.344  0.928   3.834  0.767   1.99   0.701   5.199  0.807
5.283  0.902   3.752  0.997   0.537  1.224   1.64   1.089   5.055  0.973
4.937  0.98    1.561  0.665
;

What is kernel regression?

Kernel regression was a popular method in the 1970s for smoothing a scatter plot. The predicted value, ŷ0, at a point x0 is determined by a weighted polynomial least squares regression of data near x0. The weights are given by a kernel function (such as the normal density) that assigns more weight to points near x0 and less weight to points far from x0. The bandwidth of the kernel function specifies how to measure "nearness." Because kernel regression has some intrinsic problems that are addressed by the loess algorithm, many statisticians switched from kernel regression to loess regression in the 1980s as a way to smooth scatter plots.

Because of the intrinsic shortcomings of kernel regression, SAS does not have a built-in procedure to fit a kernel regression, but it is straightforward to implement a basic kernel regression by using matrix computations in the SAS/IML language. The main steps for evaluating a kernel regression at x0 are as follows:

  1. Choose a kernel shape and bandwidth (smoothing) parameter: The shape of the kernel density function is not very important. I will choose a normal density function as the kernel. A small bandwidth overfits the data, which means that the curve might have lots of wiggles. A large bandwidth tends to underfit the data. You can specify a bandwidth (h) in the scale of the explanatory variable (X), or you can specify a value, H, that represents the proportion of the range of X and then use h = H*range(X) in the computation. A very small bandwidth can cause the kernel regression to fail. A very large bandwidth causes the kernel regression to approach an OLS regression.
  2. Assign weights to the nearby data: Although the normal density is infinite in support, in practice, observations farther than five bandwidths from x0 get essentially zero weight. You can use the PDF function to easily compute the weights. In the SAS/IML language, you can pass a vector of data values to the PDF function and thus compute all weights in a single call.
  3. Compute a weighted regression: The predicted value, ŷ0, of the kernel regression at x0 is the result of a weighted linear regression where the weights are assigned as above. The degree of the polynomial in the linear regression affects the result. The next section shows how to compute a first-degree linear regression; a subsequent section shows a zero-degree polynomial, which computes a weighted average.

Implement kernel regression in SAS

To implement kernel regression, you can reuse the SAS/IML modules for weighted polynomial regression from a previous blog post. You use the PDF function to compute the local weights. The KernelRegression module computes the kernel regression at a vector of points, as follows:

proc iml;
/* First, define the PolyRegEst and PolyRegScore modules 
   https://blogs.sas.com/content/iml/2016/10/05/weighted-regression.html 
   See the DOWNLOAD file.   
*/
 
/* Interpret H as "proportion of range" so that the bandwidth is h=H*range(X)
   The weight of an observation at x when fitting the regression at x0 
   is given by the f(x; x0, h), which is the density function of N(x0, h) at x.
   This is equivalent to (1/h)*pdf("Normal", (X-x0)/h) for the standard N(0,1) distribution. */
start GetKernelWeights(X, x0, H);
   return pdf("Normal", X, x0, H*range(X));   /* bandwidth h = H*range(X) */
finish;
 
/* Kernel regression module. 
   (X,Y) are column vectors that contain the data.
   H is the proportion of the data range. It determines the 
     bandwidth for the kernel regression as h = H*range(X)
   t is a vector of values at which to evaluate the fit. By defaul t=X.
*/
start KernelRegression(X, Y, H, _t=X, deg=1);
   t = colvec(_t);
   pred = j(nrow(t), 1, .);
   do i = 1 to nrow(t);
      /* compute weighted regression model estimates, degree deg */
      W = GetKernelWeights(X, t[i], H);
      b = PolyRegEst(Y, X, W, deg);    
      pred[i] = PolyRegScore(t[i], b);  /* score model at t[i] */
   end;
   return pred;
finish;

The following statements load the data for exhaust gasses and sort them by the X variable (E). A call to the KernelRegression module smooths the data at 201 evenly spaced points in the range of the explanatory variable. The graph at the top of this article shows the smoother overlaid on a scatter plot of the data.

use gas;  read all var {E NOx} into Z;  close;
call sort(Z);   /* for graphing, sort by X */
X = Z[,1];  Y = Z[,2];
 
H = 0.1;        /* choose bandwidth as proportion of range */
deg = 1;        /* degree of regression model */
Npts = 201;     /* number of points to evaluate smoother */
t = T( do(min(X), max(X), (max(X)-min(X))/(Npts-1)) ); /* evenly spaced x values */
pred =  KernelRegression(X, Y, H, t);                  /* (t, Pred) are points on the curve */

Nadaraya–Watson kernel regression

Kernel regression in SAS by using a weighted average

If you use a degree-zero polynomial to compute the smoother, each predicted value is a locally-weighted average of the data. This smoother is called the Nadaraya–Watson (N-W) kernel estimator. Because the N-W smoother is a weighted average, the computation is much simpler than for linear regression. The following two modules show the main computation. The N-W kernel smoother on the same exhaust data is shown to the right.

/* Nadaraya-Watson kernel estimator, which is a locally weighted average */
start NWKerReg(Y, X, x0, H);
   K = colvec( pdf("Normal", X, x0, H*range(X)) );
   return( K`*Y / sum(K) );
finish;
 
/* Nadaraya-Watson kernel smoother module. 
   (X,Y) are column vectors that contain the data.
   H is the proportion of the data range. It determines the 
     bandwidth for the kernel regression as h = H*range(X)
   t is a vector of values at which to evaluate the fit. By default t=X.
*/
start NWKernelSmoother(X, Y, H, _t=X);
   t = colvec(_t);
   pred = j(nrow(t), 1, .);
   do i = 1 to nrow(t);
      pred[i] = NWKerReg(Y, X, t[i], H);
   end;
   return pred;
finish;

Problems with kernel regression

As mentioned earlier, there are some intrinsic problems with kernel regression smoothers:

  • Kernel smoothers treat extreme values of a sample distribution differently than points near the middle of the sample. For example, if x0 is the minimum value of the data, the fit at x0 is a weighted average of only the observations that are greater than x0. In contrast, if x0 is the median value of the data, the fit is a weighted average of points on both sides of x0. You can see the bias in the graph of the Nadaraya–Watson estimate, which shows predicted values that are higher than the actual values for the smallest and largest values of the X variable. Some researchers have proposed modified kernels to handle this problem for the N-W estimate. Fortunately, the local linear estimator tends to perform well.
  • The bandwidth of kernel smoothers cannot be too small or else there are values of x0 for which prediction is not possible. If d = x[i+1] – x[i]is the gap between two consecutive X data values, then if h is smaller than about d/10, there is no way to predict a value at the midpoint (x[i+1] + x[i]) / 2. The linear system that you need to solve will be singular because the weights at x0 are zero. The loess method avoids this flaw by using nearest neighbors to form predicted values. A nearest-neighbor algorithm is a variable-width kernel method, as opposed to the fixed-width kernel used here.

Extensions of the simple kernel smoother

You can extend the one-dimensional example to more complex situations:

  • Higher Dimensions: You can compute kernel regression for two or more independent variables by using a multivariate normal density as the kernel function. In theory, you could use any correlated structure for the kernel function, but in practice most people use a covariance structure of the form Σ = diag(h21, ..., h2k), where hi is the bandwidth in the i_th coordinate direction. An even simpler covariance structure is to use the same bandwidth in all directions, which can be evaluated by using the one-dimensional standard normal density evaluated at || x - x0 || / h. Details are left as an exercise.
  • Automatic selection of the bandwidth: The example in this article uses a specified bandwidth, but as I have written before, it is better to use a fit criterion such as GCV or AIC to select a smoothing parameter.

In summary, you can use a weighted polynomial regression to implement a kernel smoother in SAS. The weights are provided by a kernel density function, such as the normal density. Implementing a simple kernel smoother requires only a few statements in the SAS/IML matrix language. However, most modern data analysts prefer a loess smoother over a kernel smoother because the loess algorithm (which is a varying-width kernel algorithm) solves some of the issues that arise when trying to use a fixed-width kernel smoother with a small bandwidth. You can use PROC LOESS in SAS to compute a loess smoother.

The post Kernel regression in SAS appeared first on The DO Loop.

8月 292018
 

The concept of "current working directory" is important within any SAS program that reads or creates external files. In SAS, when you reference a file location with a relative path (for example, "./projects/mydata.pdf"), that file reference resolves to an absolute path by way of the working directory. You can control the initial working directory by modifying the shell scripts that launch the SAS process, or by specifying the simple SAS macro that allows you to learn the current working directory. The macro uses a trick to assign a SAS fileref to the current path ('.'), grab the full path of that fileref by using Read the article for the full source (it's only about 7 lines). Here's how you would use it:

56         %put Current path is %curdir;
Current path is C:\WINDOWS\system32

As you might infer from my example here, I'm running this on a managed Windows environment. Most users cannot write to the "C:\WINDOWS\system32" path (and would not want to), so any relative file paths in my SAS code would cause errors. Maybe you've seen something like this:

25         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
ERROR: Insufficient authorization to access C:\WINDOWS\system32\test.html.
ERROR: No body file. HTML output will not be created.

If I want to use a relative path, I need to change the current working directory. Fortunately, there's a simple way to do that.

Change the current directory in SAS

Use the

/* working path for my projects */
%let rc = %sysfunc(dlgcdir('u:/projects'));
 
ods html file="./test.html";
proc print data=sashelp.class; run;
ods html close;

I can use my account-specific environment variables to make these paths work for all users. For example, on Windows I can reference the USERPROFILE environment variable. (On Unix, I can use the HOME environment variable instead.)

/* working path for my projects */
%let user = %sysget(USERPROFILE);
%let rc = %sysfunc(dlgcdir("&user./Documents"));
 
/* create an output data folder if needed */
options dlcreatedir;
libname outdata "./data";
 
ods html file="./test.html";
data outdata.class;
 set sashelp.class;
run;
proc print data=outdata.class; run;
ods html close;

Here's my log output. Notice how the HTML file and the output data folder are both created at locations relative to my home directory.

25         /* working path for my projects */
26         %let user = %sysget(USERPROFILE);
27         %let rc = %sysfunc(dlgcdir("&user./Documents"));
NOTE: The current working directory is now "C:\Users\sascrh\Documents".
28         
29         options dlcreatedir;
30         libname outdata "./data";
NOTE: Library OUTDATA was created.
NOTE: Libref OUTDATA was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: C:\Users\sascrh\Documents\data
31         
32         ods html file="./test.html";
NOTE: Writing HTML Body file: ./test.html
33         data outdata.class;
34          set sashelp.class;
35         run;

If using SAS Enterprise Guide, you can add DLGCDIR function steps to the startup statements that run when you connect to SAS, ensuring that your working directory starts in a valid location for SAS output. You can specify those statements in Tools->Options->SAS Programs, "Submit SAS code when server is connected." A SAS administrator can also add code to the AUTOEXEC file that runs when the SAS session begins, thus helping to manage this for larger groups of SAS users.

See also

The post Manage the current directory within your SAS program appeared first on The SAS Dummy.

8月 272018
 

My buddy Chris recently blogged about accessing the IoT data from an M&M jar being monitored in one of the breakrooms at SAS. Now I'm going to take things a step further and analyze that data with some graphs. Grab a snack, and follow along, as we dig into this [...]

The post IoT: Monitoring the M&M supply in our breakroom appeared first on SAS Learning Post.

8月 272018
 

A frequent topic on SAS discussion forums is how to check the assumptions of an ordinary least squares linear regression model. Some posts indicate misconceptions about the assumptions of linear regression. In particular, I see incorrect statements such as the following:

  • Help! A histogram of my variables shows that they are not normal! I need to apply a normalizing transformation before I can run a regression....
  • Before I run a linear regression, I need to test that my response variable is normal....

Let me be perfectly clear: The variables in a least squares regression model do not have to be normally distributed. I'm not sure where this misconception came from, but perhaps people are (mis)remembering an assumption about the errors in an ordinary least squares (OLS) regression model. If the errors are normally distributed, you can prove theorems about inferential statistics such as confidence intervals and hypothesis tests for the regression coefficients. However, the normality-of-errors assumption is not required for the validity of the parameter estimates in OLS. For the details, the Wikipedia article on ordinary least squares regression lists four required assumptions; the normality of errors is listed as an optional fifth assumption.

In practice, analysts often "check the assumptions" by running the regression and then examining diagnostic plots and statistics. Diagnostic plots help you to determine whether the data reveal any deviations from the assumptions for linear regression. Consequently, this article provides a "getting started" example that demonstrates the following:

  1. The variables in a linear regression do not need to be normal for the regression to be valid.
  2. You can use the diagnostic plots that are produced automatically by PROC REG in SAS to check whether the data seem to satisfy some of the linear regression assumptions.

By the way, don't feel too bad if you misremember some of the assumptions of linear regression. Williams, Grajales, and Kurkiewicz (2013) point out that even professional statisticians sometimes get confused.

An example of nonnormal data in regression

Consider this thought experiment: Take any explanatory variable, X, and define Y = X. A linear regression model perfectly fits the data with zero error. The fit does not depend on the distribution of X or Y, which demonstrates that normality is not a requirement for linear regression.

For a numerical example, you can simulate data such that the explanatory variable is binary or is clustered close to two values. The following data shows an X variable that has 20 values near X=5 and 20 values near X=10. The response variable, Y, is approximately five times each X value. (This example is modified from an example in Williams, Grajales, and Kurkiewicz, 2013.) Neither variable is normally distributed, as shown by the output from PROC UNIVARIATE:

/* For n=1..20, X ~ N(5, 1). For n=21..40, X ~ N(10, 1).
   Y = 5*X + e, where e ~ N(0,1) */
data Have;
input X Y @@;
datalines;
 3.60 16.85  4.30 21.30  4.45 23.30  4.50 21.50  4.65 23.20 
 4.90 25.30  4.95 24.95  5.00 25.45  5.05 25.80  5.05 26.05 
 5.10 25.00  5.15 26.45  5.20 26.10  5.40 26.85  5.45 27.90 
 5.70 28.70  5.70 29.35  5.90 28.05  5.90 30.50  6.60 33.05 
 8.30 42.50  9.00 45.50  9.35 46.45  9.50 48.40  9.70 48.30 
 9.90 49.80 10.00 48.60 10.05 50.25 10.10 50.65 10.30 51.20 
10.35 49.80 10.50 53.30 10.55 52.15 10.85 56.10 11.05 55.15 
11.35 55.95 11.35 57.90 11.40 57.25 11.60 57.95 11.75 61.15 
;
 
proc univariate data=Have;
   var x y;
   histogram x y / normal;
run;

There is no need to "normalize" these data prior to performing an OLS regression, although it is always a good idea to create a scatter plot to check whether the variables appear to be linearly related. When you regress Y onto X, you can assess the fit by using the many diagnostic plots and statistics that are available in your statistical software. In SAS, PROC REG automatically produces a diagnostic panel of graphs and a table of fit statistics (such as R-squared):

/* by default, PROC REG creates a FitPlot, ResidualPlot, and a Diagnostics panel */
ods graphics on;
proc reg data=Have;
   model Y = X;
quit;

The R-squared value for the model is 0.9961, which is almost a perfect fit, as seen in the fit plot of Y versus X.

Using diagnostic plots to check the assumptions of linear regression

You can use the graphs in the diagnostics panel to investigate whether the data appears to satisfy the assumptions of least squares linear regression. The panel is shown below (click to enlarge).

The first column in the panel shows graphs of the residuals for the model. For these data and for this model, the graphs show the following:

  • The top-left graph shows a plot of the residuals versus the predicted values. You can use this graph to check several assumptions: whether the model is specified correctly, whether the residual values appear to be independent, and whether the errors have constant variance (homoscedastic). The graph for this model does not show any misspecification, autocorrelation, or heteroscedasticity.
  • The middle-left and bottom-left graphs indicate whether the residuals are normally distributed. The middle plot is a normal quantile-quantile plot. The bottom plot is a histogram of the residuals overlaid with a normal curve. Both these graphs indicate that the residuals are normally distributed. This is evidence that you can trust the p-values for significance and the confidence intervals for the parameters.

In summary, I wrote this article to addresses two points:

  1. To dispel the myth that variables in a regression need to be normal. They do not. However, you should check whether the residuals of the model are approximately normal because normality is important for the accuracy of the inferential portions of linear regression such as confidence intervals and hypothesis tests for parameters. (A colleague mentioned to me that standard errors and hypothesis tests tend to be robust to this assumption, so a modest departure from normality is often acceptable.)
  2. To show that the SAS regression procedures automatically provide many graphical diagnostic plots that you can use to assess the fit of the model and check some assumptions for least squares regression. In particular, you can use the plots to check the independence of errors, the constant variance of errors, and the normality of errors.

References

There have been many excellent books and papers that describe the various assumptions of linear regression. I don't feel a need to rehash what has already been written, In addition to the Wikipedia article about ordinary linear regression, I recommend the following:

The post On the assumptions (and misconceptions) of linear regression appeared first on The DO Loop.

8月 242018
 

I know a lot of you have been programming in SAS for a long time, which is awesome! However, when you do something for a long time, sometimes you get set in your ways and you miss out on new ways of doing things.

Although the COUNT and CAT functions have been around for a while now, I see a lot of customer code that is counting and concatenating text strings the "old-fashioned" way. In this article, I would like to introduce you to the COUNT, COUNTW, CATS and CATX functions. These functions make certain tasks much simpler, like counting words in a string and concatenating text together.

Counting words or text occurrences

First let's take a look at the COUNT and COUNTW functions.
The

Data a;
  Contributors='The Big Company INC, The Little Company, ACME Incorporated,    Big Data Co, Donut Inc.';
  Num=count(contributors,'inc','i');  /* the 'i' modifier means to ignore case*/
  Put num=;
Run;

When we examine the SAS log, we can see that NUM has a value of 3.

Num=3
NOTE: The data set WORK.A has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds

The

/* DON'T USE - use COUNTW instead */
data a(drop=done i);
  x='a#b#c#d#e';
  do until(done);
    i+1;
    y=scan(x,i,'#');
    if y='' then done=1;
    else output;
  end;
  run;

I realize this code isn't terrible, but I try to avoid DO UNTIL/WHILE loops if I can. There is always the possibility of going into an infinite loop.
The COUNTW function eliminates the need for a DO UNTIL/WHILE loop.

Here is an example of logic that I use all the time. In this example, I have a macro variable that contains a list of values that I want to loop through. I can use the COUNTW function to easily loop through each file listed in the resolved value of &FILE_NAMES. The code then uses the file name on the DATA statement and the INFILE statement.

%let file_names=01JAN2018.csv 01FEB2018.csv 01MAR2018.csv 01APR2018.csv;
%macro test(files);
 
%do i=1 %to %sysfunc(countw(&file_names,%str( )));
  %let file=%scan(&file_names,&i,%str( ));
  data _%scan(&file,1,.);
    infile "c:\my files\&file";
    input region $ manager $ sales;
  run;
%end;
%mend;
%test(&file_names)

The log is too large to list here, but you can see one of the generated DATA steps in the MPRINT output of this snapshot of the log.

MPRINT(TEST):   data _01JAN2018;
MPRINT(TEST):   infile "c:\my files\01JAN2018.csv";
MPRINT(TEST):   input region $ manager $ sales;
MPRINT(TEST):   run;

This data step will be generated for each file listed.

Counting strings within another text string should be easy to do. The COUNT functions definitely make this a reality!

Concatenating strings in SAS

Now that we know how to COUNT text in SAS, let me show you how to CAT in SAS with the CATS and CATX functions.

Back in the old days, I had hair(!) and we concatenated text strings using double pipes syntax.

  X=var1||var2||var3;

This syntax is not too bad, but what if VARn has trailing blanks? Prior to SAS Version 9 you had to remove the trailing blanks from each value. Also, if the text was right justified, you had to left justify the text. This complicates the syntax:

X=trim(left(var1))||trim(left(var2))||trim(left(var3));

You can now accomplish the same thing using CATS. The
data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=cats(var1,var2,var3);
  put x=;
run;

VAR1-VAR3 have a length of 12, which means each value contains trailing blanks. By using the CATS function, all trailing blanks are removed and the text is concatenated without any spaces between the text. Here is the result of the above PUT statement.

x=abc123xyz

Another common need when concatenating text together is to create a delimited string. This can now be done using the CATX function. The

data a;
  length var1 var2 var3 $12;
  var1='abc';
  var2='123';
  var3='xyz';
  x=catx(',',var1,var2,var3);
  put x=;
run;

This syntax creates a comma separated list with all leading and trailing blanks removed. Here is the result of the PUT statement.

x=abc,123,xyz

Just how the COUNT functions making counting text in SAS easier, the CAT functions make concatenating strings so much easier. For more explanation and examples of these CAT* functions, see this paper by Louise Hadden, Purrfectly Fabulous Feline Functions (because they are CAT functions, get it?).

Before I let you go, let me point out that in addition to the COUNT, COUNTW, CATS and CATX functions, there are also the COUNTC, CAT, CATQ and CATT functions that provide even more functionality. These functions are not used as often, so I haven't discussed them here. Please How to COUNT CATs in SAS was published on SAS Users.

8月 242018
 

Jim Harris warns against allowing your data lake to become a poorly managed and ungoverned data dumping ground.

The post Data lake management for analytics appeared first on The Data Roundtable.

8月 242018
 

Dynamic programming is a powerful technique to implement algorithms, and is often used to solve complex computational problems. Some are applications are world-changing, such as aligning DNA sequences; others are more "everyday," such as spelling correction. If you search for "dynamic programming," you will find lots of materials including sample programs written in other programming languages, such as Java, C, and Python etc., but there isn't any SAS sample program. SAS is a powerful language, and of course SAS can do it! This article will show you how to write your dynamic programming function including the SPEDIS function. The purpose of this article is to demonstrate how you can implement such a function in a SAS dynamic programming method.

What is edit distance?

Edit distance is a metric used to measure dissimilarity between two strings, by matching one string to the other through insertions, deletions, or substitutions.

What is minimum edit distance?

Minimum edit distance measures the dissimilarity between two strings through the least number of edit operations.
Taking two strings "BAD" and "BED" as an example. These two words have multiple match possibilities. One solution is an edit distance of 1, resulting from one substitution from letter "A" to "E".

Another solution might be deleting letter "A" from "BAD" then inserting letter "E" between "B" and "D", which results the edit distance of 2

There are several variants of edit distance, depending on the cost of edit operation. For example, given a string pair "BAD" and "BED", if each operation has cost of 1, then its edit distance is 1; if we set substitution cost as 2 (Levenshtein edit distance), then its edit distance is 2.

What is dynamic programming?

The standard method used to solve minimum edit distance uses a dynamic programming algorithm.

Dynamic programming is a method used to resolve complex problems by breaking it into simpler sub-problems and solving these recursively. Partial solutions are saved in a big table, so it can be quickly accessed for successive calculations while avoiding repetitive work. Through this process of building on each preceding result, we eventually solve the original, challenging problem efficiently. Many difficult issues can be resolved using this method.

Here's the algorithm that solves Levenshtein edit distance through dynamic programming:

The following image shows the annotated SAS program that implements the algorithm. The complete code (which you can copy and test for yourself) is at the end of this article.
SPEDIS algorithm

To demonstrate the edit distance function usage and validate the intermediate edit distance matrix table, I used a string pair "INTENTION" and "EXECUTION" that I copied from Stanford's class material as example. (This same resource also shows how the technique applies to DNA sequence alignment.)

options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

The Levenshtein edit distance between "INTENTION" and "EXECUTION" is 8.

The edit distance table as follows.

N 8 9 10 11 12 11 10 9 8
O 7 8 9 10 11 10 9 8 9
I 6 7 8 9 10 9 8 9 10
T 5 6 7 8 9 8 9 10 11
N 4 5 6 7 8 9 10 11 10
E 3 4 5 6 7 8 9 10 9
T 4 5 6 7 8 7 8 9 8
N 3 4 5 6 7 8 7 8 7
I 2 3 4 5 6 7 6 7 8
  E X E C U T I O N

 

More dynamic programming applications

Dynamic programming is a powerful technique, and it can be used to solve many complex computation problems. Anna Di and I are presenting a paper to the PharmaSUG 2018 China conference to demonstrate how to align DNA sequences with SAS FCMP and SAS Viya. If you are interested in this topic, please look for our paper after the conference proceedings are published.

Appendix: Complete SAS program for editDistance function

proc fcmp outlib=work.funcs.spedis;
function editDistance(query $, keyword $);
   array distance[1,1]/nosymbols;    
   m = length(query);
   n = length(keyword);   
   call dynamic_array(distance, m+1, n+1); 
   do i=1 to m+1;
      do j=1 to n+1;
         distance[i, j]=-1;  
      end;
   end;
 
   i_max=m;
   j_max=n;
 
   dist = edDistRecursive(query, keyword, distance, m, n);
 
   do i=i_max to 1 by -1;
      do j=1 to j_max;
         put distance[i, j] best3. @;  
      end;
      put;
   end;
 
   return (dist);
endsub; 
 
function edDistRecursive(query $, keyword $, distance[*,*], m, n);
   outargs distance;
 
   if m = 0 then
      return (n);
   if n = 0 then
      return (m);
 
   if distance[m,n] >= 0 then
      return (distance[m,n]);
   if (substr(query,m,1) = substr(keyword,n,1)) then
      delta = 0;
   else
      delta = 2;
 
   ans = min(edDistRecursive(query, keyword, distance, m - 1, n - 1) + delta, 
             edDistRecursive(query, keyword, distance, m - 1, n) + 1, 
             edDistRecursive(query, keyword, distance, m, n - 1) + 1);
   distance[m,n] = ans;
   return (distance[m,n]);   
endsub;
run;
quit; 
 
options cmplib=work.funcs; 
data test;  
   infile cards missover;
   input word1 : $20. word2 : $20.;
   d=editDistance(word1,word2); 
   put d=;
cards;
INTENTION EXECUTION
;
run;

Dynamic programming with SAS FCMP was published on SAS Users.

8月 232018
 

Showing the most popular jobs in each state is interesting (as I showed in my previous two blogs 1, 2) ... but not that interesting. How about something a little more quirky?!? ... Let's determine the most disproportionately popular job in each state! Their Map I got the idea for [...]

The post The most disproportionately popular job in each state appeared first on SAS Learning Post.

8月 232018