3月 212018
 

Having been involved in some capacity in more than 100 Customer Intelligence (CI) sales and implementation cycles, one of the most common questions perspective clients ask me is: “What do we need to do to make this project successful?” Over the years, I have distilled six themes that ultimately determine [...]

Six factors that determine customer intelligence project success was published on Customer Intelligence Blog.

3月 212018
 
Pseudocode for the conjugate gradient method for solving a linear system (from Wikipedia)

I often claim that the "natural syntax" of the SAS/IML language makes it easy to implement an algorithm or statistical formula as it appears in a textbook or journal. The other day I had an opportunity to test the truth of that statement. A SAS programmer wanted to implement the conjugate gradient algorithm, which is an iterative method for solving a system of equations with certain properties. I looked up the Wikipedia article about the conjugate gradient method and saw the following text:

"The algorithm is detailed below for solving Ax = b where A is a real, symmetric, positive-definite matrix. The input vector x0 can be an approximate initial solution or 0." The text was followed by pseudocode (in the box; click to enlarge)

The conjugate gradient method in SAS/IML

I used the pseudocode to implement the conjugate gradient method in SAS/IML. (The method is explained further in the next section.) I chose not to use the 'k' and 'k+1' notation but merely to overwrite the old values of variables with new values. In the program, the variables x, r, and p are vectors and the variable A is a matrix.

/* Linear conjugate gradient method as presented in Wikipedia:
   https://en.wikipedia.org/wiki/Conjugate_gradient_method
   Solve the linear system A*x = b, where A is a symmetric positive definite matrix.
   The algorithm converges in at most n iterations, where n is the dimension of A.
 
   This function requires an initial guess, x0, which can be the zero vector.
   The function does not verify that the matrix A is symmetric positive definite.
   This module returns a matrix mX= x1 || x2 || ... || xn 
   whose columns contain the iterative path from x0 to the approximate solution vector xn.  */
proc iml;
start ConjGrad( x0, A, b, Tolerance=1e-6 );
  x = x0;                            /* initial guess */
  r = b - A*x0;                      /* residual */
  p = r;                             /* initial direction */
  mX = j(nrow(A), nrow(A), .);       /* optional: each column is result of an iteration */
  do k = 1 to ncol(mX) until ( done );
    rpr = r`*r;  Ap = A*p;           /* store partial results */
    alpha =  rpr / (p`*Ap);          /* step size */
    x = x + alpha*p;   mX[,k] = x;   /* remember new guess */
    r = r - alpha*Ap;                /* new residual */
    done = (sqrt(rpr) < Tolerance);  /* stop if ||r|| < Tol */
    beta = (r`*r) / rpr;             /* new stepsize */
    p = r + beta*p;                  /* new direction */
  end;
  return ( mX );     /* I want to return the entire iteration history, not just x */
finish;

The SAS/IML program is easy to read and looks remarkably similar to the pseudocode in the Wikipedia article. This is in contrast to lower-level languages such as C in which the implementation looks markedly different from the pseudocode.

Convergence of the conjugate gradient method

The conjugate gradient method is an iterative method to find the solution of a linear system A*x=b, where A is a symmetric positive definite n x n matrix, b is a vector, and x is the unknown solution vector. (Recall that symmetric positive definite matrices arise naturally in statistics as the crossproduct matrix (or covariance matrix) of a set of variables.) The beauty of the conjugate gradient method is twofold: it is guaranteed to find the solution (in exact arithmetic) in at most n iterations, and it requires only simple operations, namely matrix-vector multiplications and vector additions. This makes it ideal for solving system of large sparse systems, because you can implement the algorithm without explicitly forming the coefficient matrix.

It is fun to look at how the algorithm converges from the initial guess to the final solution. The following example converges gradually, but I know of examples for which the algorithm seems to make little progress for the first n – 1 iterations, only to make a huge jump on the final iteration straight to the solution!

Recall that you can use a Toeplitz matrix to construct a symmetric positive definite matrix. The following statements define a banded Toeplitz matrix with 5 on the diagonal and specify the right-hand side of the system. The zero vector is used as an initial guess for the algorithm. The call to the ConjGrad function returns a matrix whose columns contain the iteration history for the method. For this problem, the method requires five iterations to converge, so the fifth column contains the solution vector. You can check that the solution to this system is (x1, x2, 3, x4, x5) = (-0.75, 0, 1.5, 0.5, -0.75), either by performing matrix multiplication or by using the SOLVE function in IML to compute the solution vector.

A = {5 4 3 2 1,          /* SPD Toeplitz matrix */
     4 5 4 3 2, 
     3 4 5 4 3,
     2 3 4 5 4,
     1 2 3 4 5};
b = {1, 3, 5, 4, 2};     /* right hand side */
n = ncol(A);
x0 = j(n,1,0);           /* the zero vector */
traj = ConjGrad( x0, A, b );
x = traj[ ,n];           /* for this problem, solution is in last column */

It is instructive to view how the iteration progresses from an initial guess to the final solution. One way to view the iterations is to compute the Euclidean distance between each partial solution and the final solution. You can then graph the distance between each iteration and the final solution, which decreases monotonically (Hestenes and Stiefel, 1952).

Distance = sqrt( (traj - x)[##, ] ); /* || x[j] - x_Soln || */
Iteration = 1:n;
title "Convergence of Conjugate Gradient Method";
call series(Iteration, Distance) grid={x y} xValues=Iteration 
     option="markers" label={"" "Distance to Solution"};

Notice in the distance graph that the fourth iteration almost equals the final solution. You can try different initial guesses to see how the guess affects the convergence.

In addition to the global convergence, you can visualize the convergence in each coordinate. Because the vectors live in high-dimensional space, it impossible to draw a scatter plot of the iterative solutions. However, you can visualize each vector in "parallel coordinates" as a sequence of line segments that connect the coordinates in each variable. In the following graph, each "series plot" represents a partial solution. The curves are labeled by the iteration number. The blue horizontal line represents the initial guess (iteration 0). The partial solution after the first iteration is shown in red and so on until the final solution, which is displayed in black. You can see that the third iteration (for this example) is close to the final solution. The fourth partial solution is so close that it cannot be visually distinguished from the final solution.

Convergence of the conjugate gradient method

In summary, this post shows that the natural syntax of the SAS/IML language makes it easy to translate pseudocode into a working program. The article focuses on the conjugate gradient method, which solves a symmetric, positive definite, linear system in at most n iterations. The article shows two ways to visualize the convergence of the iterative method to the solution.

The post The conjugate gradient method appeared first on The DO Loop.

3月 192018
 

About once a month I see a question on the SAS Support Communities that involves what I like to call "computations with combinations." A typical question asks how to find k values (from a set of p values) that maximize or minimize some function, such as "I have 5 variables, and for each observation I want to find the largest product among any 3 values."

These types of problems are specific examples of a single abstract problem, as follows:

  1. From a set of p values, generate all subsets that contain k < p elements. Call the subsets Y1, Y2, ..., Yt, where t equals "p choose k". (In SAS, use the COMB function to compute the number of combinations: t = comb(p,k).)
  2. For each subset, evaluate some function on the subset. Call the values z1, z2, ..., zt.
  3. Return some statistic of the zi. Often the statistics is a maximum or minimum, but it could also be a mean, variance, or percentile.

This is an "exhaustive" method that explicitly generates all subsets, so clearly this technique is impractical for large values of p. The examples that I've seen on discussion forums often use p ≤ 10 and small values of k (often 2, 3, or 4). For parameters in this range, an exhaustive solution is feasible.

This general problem includes "leave-one-out" or jackknife estimates as a special case (k = p – 1), so clearly this formulation is both general and powerful. This formulation also includes the knapsack problem in discrete optimization. In the knapsack problem, you have p items and a knapsack that can hold k items. You want to choose the items so that the knapsack holds as much value as possible. The knapsack problem maximizes the sum of the values whereas the general problem in this article can handle nonlinear functions of the values.

Example data

You can use the following DATA set to simulate integer data with a specified number of columns and rows. I use the relatively new "Integer" distribution to generate uniformly distributed integers in the range [-3, 9].

%let p = 5;      /* number of variables */
%let NObs = 6;   /* number of observations */
data have(drop=i j);
call streaminit(123);
array x[&p];
do i = 1 to &NObs;
   do j = 1 to dim(x);
      x[j] = rand("Integer", -3, 9); /* SAS 9.4M4 */
   end;
   output;
end;
;
 
proc print data=have; run;
Sample data

Computing with combinations in SAS/IML

For p = 5 and k = 3, the problem is: "For each observation of the 5 variables, find the largest product among any 3 values." In the SAS/IML language, you can solve problems like this by using the ALLCOMB function to generate all combinations of size k from the index set {1,2,...,p}. These values are indices that you can use to reference each combination of values. You can evaluate your function on each combination and then compute the max, min, mean, etc. For example, the following SAS/IML statements generate all combinations of 3 values from the set {1, 2, 3, 4, 5}:

proc iml;
p = 5; k = 3;
c = allcomb(p, k);  /* combinations of p items taken k at a time */
print c;
All combinations of 5 elements taken 3 at a time

A cool feature of the SAS/IML language is that you can use these values as column subscripts! In particular, the expression X[i, c] generates all 3-fold combinations of values in the i_th row. You can then use the SHAPE function to reshape the values into a matrix that has 3 columns, as follows:

/* Example: find all combination of elements in the first row */
varNames = "x1":"x5";
use have; read all var varNames into X; close;
Y = X[1, c];                 /* all combinations of columns for 1st row */
M = shape(Y, nrow(c), k);    /* reshape so each row has k elements */
prod = M[, #];               /* product of elements across columns */
print M prod;
A function evaluated each subset of size 3

Notice that each row of the matrix M contains k = 3 elements of Y. There are "5 choose 3" = 10 possible ways to choose 3 items from a set of 5, so the M matrix has 10 rows. Notice that you can use a subscript reduction operator (#) to compute the product of elements for each combination of elements. The maximum three-value product for the first row of data is 24.

The following loop performs this computation for each observation. The result is a vector that contains the maximum three-value product of each row. The original data and the results are then displayed side by side:

/* for each row and for X1-X4, find maximum product of three elements */
result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];                  /* get i_th row and all combinations of coloumns */
   M = shape(Y, nrow(c), k);     /* reshape so each row has k elements */
   result[i] = max( M[,#] );     /* max of product of rows */
end;
print X[colname=varNames] result[L="maxMul"];
Maximum product of three elements from a set of 5

Of course, if the computation for each observation is more complicated than in this example, you can define a function that computes the result and then call the module like this: result[i]= MyFunc(M);

Generate all combinations in the DATA step

You can perform a similar computation in the DATA step, but it requires more loops. You can use the ALLCOMB function (or the LEXCOMBI function) to generate all k-fold combinations of the indices {1, 2, ..., p}. You should call the ALLCOMB function inside a loop from 1 to NCOMB(p, k). Inside the loop, you can evaluate the objective function on each combination of data values. Many DATA step functions such as MAX, MIN, SMALLEST, and LARGEST accept arrays of variables, so you probably want to store the variables and the indices in arrays. The following DATA step contains comments that describe each step of the program:

%let p = 5;
%let k = 3;
%let NChooseK = %sysfunc(comb(&p,&k));  /* N choose k */
data Want(keep=x1-x&p maxProd);
set have;
array x[&p] x1-x&p;        /* array of data */
array c[&k];               /* array of indices */
array r[&NChoosek];        /* array of results for each combination */
ncomb = comb(&p, &k);      /* number of combinations */
do i=1 to &k; c[i]=0; end; /* zero the array of indices before first call to ALLCOMB */
do j = 1 to ncomb;
   call allcombi(&p, &k, of c[*]);  /* generate j_th combination of indices */
   /* evaluate function of the array {x[c[1]], x[c[2]], ..., x[c[k]]} */
   r[j] = 1;               /* initialize product to 1 */
   do i=1 to &k; r[j] = r[j] * x[c[i]]; end;  /* product of j_th combination */
end;
maxProd = max(of r[*]);    /* max of products */
output;
run;
 
proc print data=Want; run;
Maximum product of three elements from a set of five

The DATA step uses an array (R) of values to store the result of the function evaluated on each subset. For a MAX or MIN computation, this array is not necessary because you can keep track of the current MAX or MIN inside the loop over combinations. However, for more general problems (for example, find the median value), an array might be necessary.

In summary, this article shows how to solve a general class of problems. The general problem generates all subsets of size k from a set of size p. For each subset, you evaluate a function and produce a statistic. From among the "p choose k" statistics, you then choose the max, min, or some other measure. This article shows how to solve these problems efficiently in the SAS/IML language or in the SAS DATA step. Because this is a "brute force" technique, it is limited to small values of p. I suggest p ≤ 25.

The post Compute with combinations: Maximize a function over combinations of variables appeared first on The DO Loop.

3月 172018
 

This is a continuation of my previous blog post on SAS Data Studio and the Code transform. In this post, I will review some additional examples of using the Code transform in a SAS Data Studio data plan to help you prepare your data for analytic reports and/or models.

Create a Unique Identifier Example

The DATA step code below combines the _THREADID_ and the _N_ variables to create a UniqueID for each record.

SAS Data Studio Code Transform

The variable _THREADID_ returns the number that is associated with the thread that the DATA step is running in a server session. The variable _N_ is an internal system variable that counts the iterations of the DATA step as it automatically loops through the rows of an input data set. The _N_ variable is initially set to 1 and increases by 1 each time the DATA step loops past the DATA statement. The DATA step loops past the DATA statement for every row that it encounters in the input data. Because the DATA step is a built-in loop that iterates through each row in a table, the _N_ variable can be used as a counter variable in this case.

_THREADID_ and _N_ are variables that are created automatically by the SAS DATA step and saved in memory. For more information on automatic DATA step variables refer to its

Cluster Records Example

The DATA step code below combines the _THREADID_ and the counter variables to create a unique ClusterNum for each BY group.

This code uses the concept of FIRST.variable to increase the counter if it is the beginning of a new grouping. FIRST.variable and LAST.variable are variables that CAS creates for each BY variable. CAS sets FIRST.variable when it is processing the first observation in a BY group, and sets LAST.variable when it is processing the last observation in a BY group. These assignments enable you to take different actions, based on whether processing is starting for a new BY group or ending for a BY group. For more information, refer to the topic

De-duplication Example

The DATA step code below outputs the last record of each BY group; therefore, de-duplicating the data set by writing out only one record per grouping.

Below are the de-duplication results on the data set used in the previous Cluster Records Example section.

For more information about DATA step, refer to the

Below is the resulting customers2.xlsx file in the Public CAS library.

For more information on the available action sets, refer to the SAS® Cloud Analytic Services 3.3: CASL Reference guide.

For more information on SAS Data Studio and the Code transform, please refer to this SAS Data Studio Code Transform (Part 2) was published on SAS Users.

3月 172018
 

SAS Data Studio is a new application in SAS Viya 3.3 that provides a mechanism for performing simple, self-service data preparation tasks to prepare data for use in SAS Visual Analytics or other applications. It is accessed via the Prepare Data menu item or tile on SAS Home. Note: A user must belong to the Data Builders group in order to have access to this menu item.

In SAS Data Studio, you can either select to create a new data plan or open an existing one. A data plan starts with a source table and consists of transforms (steps) that are performed against that table. A plan can be saved and a target table can be created based on the transformations applied in the plan.

SAS Data Studio Code Transform

SAS Data Studio

In a previous blog post, I discussed the Data Quality transforms in SAS Studio.  This post is about the Code transform which enables you to create custom code to perform actions or transformations on a table. To add custom code using the Code transform, select the code language from the drop-down menu, and then enter the code in the text box.  The following code languages are available: CASL or DATA step.

Code Transform in SAS Data Studio

Each time you run a plan, the table and library names might change. To avoid errors, you must use variables in place of table and caslib names in your code within SAS Data Studio. Indicating variables in place of table and library names eliminates the possibility that the code will fail due to name changes.  Errors will occur if you use literal values. This is because session table names can change during processing.  Use the following variables:

  • _dp_inputCaslib – variable for the input CAS library name.
  • _dp_inputTable – variable for the input table name.
  • _dp_outputCaslib – variable for the output CAS library name.
  • _dp_outputTable –  variable for the output table name.

Note: For DATA step only, variables must be enclosed in braces, for example, data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});.

The syntax of “varname”n is needed for variable names with spaces and/or special characters.  Refer to the Avoiding Errors When Using Name Literals help topic for more Information.  There are also several

CASL Code Example

The CASL code example above uses the ActionSet fedSQL to create a summary table of counts by the standardized State value.  The results of this code are pictured below.

Results from CASL Code Example

For more information on the available action sets, refer to the SAS® Cloud Analytic Services 3.3: CASL Reference guide.

DATA Step Code Example

In this DATA step code example above, the BY statement is used to group all records with the same BY value. If you use more than one variable in a BY statement, a BY group is a group of records with the same combination of values for these variables. Each BY group has a unique combination of values for the variables.  On the CAS server, there is no guarantee of global ordering between BY groups. Each DATA step thread can group and order only the rows that are in the same data partition (thread).  Refer to the help topic

Results from DATA Step Code Example

For more information about DATA step, refer to the In my next blog post, I will review some more code examples that you can use in the Code transform in SAS Data Studio. For more information on SAS Data Studio and the Code transform, please refer to this SAS Data Studio Code Transform (Part 1) was published on SAS Users.

3月 152018
 

fraud detectionIn the medical field, an autopsy is valuable because it helps you understand the cause of death. But, what’s more valuable is identifying the leading indicators of an illness so that you can address it before the Grim Reaper comes knocking. Best in class organizations are taking a similar approach to their fraud detection, shifting from a purely hindsight view to insights and even foresight – getting out in front of the fraud before it happens, revenue is lost, reputation damaged and regulators apply even more pressures.

Proactively detecting fraud isn’t easy though. There is the nature of the challenge itself: Fraud is a behavioral problem and one that is dynamic, complex and often sophisticated. Then, there is the data challenge – lots of it and in many different formats, including structured and unstructured. Next is the analytics. There are many techniques available, and some might be good, and others not. Finally, the technology. There is no shortage of solutions, but they can be expensive and organizations need to beware of ending up with a collection of siloed, single-point solutions that don’t tell the full story.

That said, unless you’re willing to close your business, which is the only surefire way to get to 0% fraud, you’ve got to tackle it.

How to tackle fraud?

For starters, I advise leaders to define their risk appetite and tolerance. What is the level of risk that you – and the organization – can live with? If you can live with 5%, let’s say, then that’s your true North and benchmark to measure against. Once the risk appetite is set, next comes the balancing act of strategic long-term view and tactical short-term needs plus balancing fraud prevention against the customer experience, and more. Then, make sure you have the data, technology, people, processes, governance and analytics in place to continuously measure and refine.

What we are seeing today is that analytics is a key component of moving fraud detection from hindsight to foresight. It starts with dividing risk into three classes. The first is what you know. I have fraud, it’s happening, and I can put business rules in place to detect it. It’s a repeatable pattern that usually responds well to the “if x, then y” formula. The second class is what you do not know.  This is about anomaly detections and can often be found by highlighting things that don’t happen often, but stand out when they do. The third, and most challenging class, is when you don’t even know what you’re looking for. Is it a needle in a haystack? Maybe a rusty nail? This is where AI and ML come in play.

Applying best-in-class tools allows organizations to ingest enormous sets of data, including text, voice, social, structured and unstructured data. Adding best-in-class analytics helps to sort the noise from signals, and advanced analytics including Artificial Intelligence, Machine Learning and Natural Language Processing enable organizations to move faster, by processing in real time, and benefit from iterative learning, where humans help models become smarter and smarter until they can improve themselves every single time. And, of course, the best solutions provide an end-to-end analytics lifecycle from data to analytics to insights.

There’s no question that fraud is complex and challenging, but unless you’re willing to send your business to the morgue – and close your doors forever – you’ve got to tackle it. And, thanks to advances in analytics, we can help stop fraud before it starts.

Find out more at the SAS Global User Forum 2018

Join Constantine Boyadjiev for his “Suspect Behavior Identification through Sentiment Analysis and Communication Surveillance” Breakout Session at SAS Global Forum 2018 April 10 at 3 p.m. in Mile High Ballroom Theater C.

 

 

 

 

Move fraud detection from hindsight to insight to foresight was published on SAS Users.

3月 142018
 

One of my favorite magazines, Significance, printed an intriguing image of a symmetric matrix that shows repetition in a song's lyrics. The image was created by Colin Morris, who has created many similar images. When I saw these images, I knew that I wanted to duplicate the analysis in SAS!

Visualize repetition in lyrics: A simple example

The analysis is easy. Suppose that a song (or any text source) contains N words. Define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. Now visualize the matrix by using a heat map: Black indicates cells where the matrix is 1 and white for 0. A SAS program that performs this analysis is available at the end of this article.

To illustrate this algorithm, consider the nursery rhyme, "Row, Row, Row the Boat":

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream.

There are 18 words in this song. Words 1–3 are repeated, as are words 1-–13. You can use the SAS DATA steps to read the words of the song into a variable and use other SAS functions to strip out any punctuation. You can then use SAS/IML software to construct and visualize the repetition matrix. The details are shown at the end of this article.

Visualize repetition in song lyrics: 'Row, Row, Row Your Boat'

The repetition matrix for the song "Row, Row, Row, Your Boat" is shown to the right. For this example I could have put the actual words along the side and bottom of the matrix, but that is not feasible for songs that have hundreds of words. Instead, the matrix has a numerical axis where the number indicates the position of each word in the song.

Every repetition matrix has 1s on the diagonal. In this song, the words "row" and "merrily" are repeated. Consequently, there is a 3 x 3 block of 1s at the top left and a 4 x 4 block of 1s in the middle of the matrix. (Click to enlarge.)

As mentioned, this song has very little repetition. One way to quantify the amount of repetition is to compute the proportion of 1s in the upper triangular portion of the repetition matrix. The upper triangular portion of an N x N matrix has N(N–1)/2 elements. For this song, N=18, so there are 153 cells and 9 of them are 1s. Therefore the "repetition score" is 9 / 153 = 0.059.

Another simple example of repetition in song lyrics

I wrote a SAS/IML function that creates and visualizes the repetition matrix and returns the repetition score. In order to visualize songs that might have hundreds of words, I suppress the outlines (the grid) in the heat map. To illustrate the output of the function, the following image visualizes the words of the song "Here We Go Round the Mulberry Bush":

Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
Visualize repetition in song lyrics: 'Here We Go Round the Mulberry Bush'

The repetition score for this song is 0.087. You can see diagonal "stripes" that correspond to the repeating phrases "here we go round" and "the mulberry bush". In fact, if you study only the first seven rows, you can "see" almost the entire structure of the song. The first seven words contain all lyrics except for four words ("so", "early", "in", "morning").

Visualize Classic Song Lyrics

Let's visualize the repetitions in the lyrics of several classic songs.

Hey Jude (The Beatles)

When I saw Morris's examples, the first song I wanted to visualize was "Hey Jude" by the Beatles. Not only does the title phrase repeat throughout the song, but the final chorus ("Nah nah nah nah nah nah, nah nah nah, hey Jude") repeats more than a dozen times. This results in a very dense block in the lower right corner of the repetition matrix and a very high repetition score of 0.183. The following image visualizes "Hey Jude":

Visualize repetition in song lyrics: 'Hey Jude'

Love Shack (The B-52s)

The second song that I wanted to visualize was "Love Shack" by The B-52s. In addition to a title that repeats almost 40 times, the song contains a sequence near the end in which the phrase "Bang bang bang on the door baby" is alternated with various interjections. The following visualization of the repetition matrix indicates that there is a lot of variation interspersed with regular repetition. The repetition score is 0.035.

Visualize repetition in song lyrics: 'Love Shack'

Call Me (Blondie)

Lastly, I wanted to visualize the song "Call Me" by Blondie. This classic song has only 241 words, yet the title is repeated 41 times! In other words, about 1/3 of the song consists of those two words! Furthermore, there is a bridge in the middle of the song in which the phrase "oh oh oh oh oh" is alternated with other phrases (some in Italian and French) that appear only once in the song. The repetition score is 0.077. The song is visualized below:

Visualize repetition in song lyrics: 'Call Me'

How to create a repetition matrix in SAS

If you think this is a fun topic, you can construct these images yourself by using SAS. If you discover a song that has an interesting repetition matrix, post a comment!

Here's the basic idea of how to construct and visualize a repetition matrix. First, use the DATA step to read each word, use the COMPRESS function to remove any punctuation, and standardize the input by transforming all words to lowercase:
data Lyrics;
length word $20;
input word @@;
word = lowcase( compress(word, ,'ps') ); /* remove punctuation and spaces */
datalines;
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
;

In SAS/IML software you can use the ELEMENT function to find the locations in the i_th row that have the value 1. After you construct a repetition matrix, you can use the HEATMAPDISC subroutine to display it. For example, the following SAS/IML program reads the words of the song into a vector and visualizes the repetition matrix. It also returns the repetition score, which is the proportion of 1s in the upper triangular portion of the matrix.

ods graphics / width=500 height=500 NXYBINSMAX=1000000;
proc iml;
/* define a function that creates and visualizes the repetition matrix */
start VizLyrics(DSName, Title);
   use (DSName); read all var _CHAR_ into Word;  close;
   N = nrow(Word);
   M = j(N,N,0);                       /* allocate N x N matrix */
   do i = 1 to N;
      M[,i] = element(Word, Word[i]);  /* construct i_th row */
   end;
   run heatmapdisc(M) title=Title
       colorramp={white black} displayoutlines=0 showlegend=0;
 
   /* compute the proportion of 1s in the upper triangular portion of the matrix */
   upperIdx = loc(col(M)>row(M));
   return ( M[upperIdx][:] );  /* proportion of words that are repeated */
finish;
 
score = VizLyrics("Lyrics", "Here We Go Round the Mulberry Bush");
print score;

If you want to reproduce the images in this post, you can download the SAS program for this article. In addition, the program creates repetition matrices for "We Didn't Start the Fire" (Billy Joel) and a portion of Martin Luthor King Jr.'s "I Have a Dream" speech. You can modify the program and enter lyrics for your favorite songs.

The post Visualize repetition in song lyrics appeared first on The DO Loop.

3月 132018
 

SAS Visual Analytics 8.2 introduces the Hidden Data Role. This role can accept one or more category or date data items which will be included in the query results but will not be displayed with the object. You can use this Hidden Data Role in:

  • Mapping Data Sources.
  • Color-Mapped Display Rules.
  • External Links.

Note that this Hidden Data Role is not available for all Objects and cannot be used as both a Hidden Data Role and Data tip value, it can only be assigned to one role.

In this example, we will look at how to use the Hidden Data Role for an External Link.

Here are a few applications of this example:

  • You want to show an index of available assets, and you have a URL to point directly to that asset.
  • Your company sells products, you want to show a table summary of product profit but have a URL that points to each Product’s development page.
  • As the travel department, you want to see individual travel reports rolled up to owner, but have a URL that can link out to each individual report.

The applications are endless when applied to our customer needs.

In my blog example, I have NFL data for Super Bowl wins. I have attached two columns of URLs for demonstration purposes:

  • One URL is for each Super Bowl event, so I have 52 URLs, one for each row of data.
  • The second URL is for each winning team. There have been 20 unique Super Bowl winning teams, so I have 20 unique URLs.

Hidden Data Role in SAS Visual Analytics

In previous versions of SAS Visual Analytics, if you wanted to link out to one of these URLs, you would have to include it in the visualization like in the List Table shown above. But now, using SAS Visual Analytics 8.2, you can assign a column containing these URLs to the Hidden Data Role and it will be available as an External URL.

Here is our target report. We want to be able to link to the Winning Team’s website.

In Visual Analytics 8.2, for the List Table, assign the Winning Team URL column to the Hidden Data Role.

Then, for the List Table, create a new URL Link Action. Give the Action a name and leave the URL section blank. This is because my data column contains a fully qualified URL. If you were linking to a destination and only needed to append a name value pair, then you could put in the partial URL and pass the parameter value, but that’s a different example.

That is using the column which has 20 URLs that matches the winning team in the Hidden Data Role. Now, what if we use the column that has the 52 URLs that link out to the individual Super Bowl events?

That’s right, the cardinality of the Hidden Data Role item does impact the object. Even though the Hidden data item is not visible on the Object, remember it is included in the results query; and therefore, the cardinality of the Hidden data item impacts the aggregation of the data.

Notice that some objects will just present an information warning that a duplicate classification of the data has caused a conflict.

In conclusion, the Hidden Data Role is an exciting addition to the SAS Visual Analytics 8.2 release. I know you'll enjoy and benefit from it.

The power behind a Hidden Data Role in SAS Visual Analytics was published on SAS Users.