July 16, 2018
 

My colleague Robert Allison recently blogged about using the diameter of Texas as a unit of measurement. The largest distance across Texas is about 801 miles, so Robert wanted to find the set of all points such that the distance from the point to Texas is less than or equal to 801 miles.

Robert's implementation was complicated by the fact that he was interested in points on the round earth that are within 801 miles from Texas as measured along a geodesic. However, the idea of "thickening" or "inflating" a polygonal shape is related to a concept in computational geometry called the offset polygon or the inflated polygon. A general algorithm to inflate a polygon is complicated, but this article demonstrates the basic ideas that are involved. This article discusses offset regions for convex and nonconvex polygons in the plane. The article concludes by drawing a planar region for a Texas-shaped polygon that has been inflated by the diameter of the polygon. And, of course, I supply the SAS programs for all computations and images.

Offset regions as a union of circles and rectangles

Assume that a simple polygon is defined by listing its vertices in counterclockwise order. (Recall that a simple polygon is a closed, nonintersecting shape that has no holes.) You can define the offset region of radius r as the union of the following shapes:

  • The polygon itself
  • Circles of radius r centered at each vertex
  • Rectangles of width r positioned outside the polygon along each edge

The following graphic shows the offset region (r = 0.5) for a convex "house-shaped" polygon. The left side of the image shows the polygon with an overlay of circles centered at each vertex and outward-pointing rectangles along each edge. The right side of the graphic shows the union of the offset regions (blue) and the original polygon (red):

The image on the right shows why the process is sometimes called "inflating" a polygon. For a convex polygon, the edges are pushed out by a distance r and the vertices become rounded. For large values of r, the offset region becomes a nearly circular blob, although the boundary is always a union of line segments and circular arcs.

You can draw a similar image for a nonconvex polygon. The inflated region near a convex (left-turning) vertex looks the same as before. However, at the nonconvex (right-turning) vertices, the circles do not contribute to the offset region. Computing the offset region for a nonconvex polygon is tricky because nonlocal effects can occur when the distance r is greater than the minimum distance between vertices. For example, the following graphic shows a nonconvex polygon that has two "prongs." Let r0 be the distance between the prongs. When you inflate the polygon by an amount r > r0/2, the offset region can contain a hole, as shown. Furthermore, the boundary of the offset region is not a simple polygon. For larger values of r, the hole can disappear. This demonstrates why it is difficult to construct the boundary of an offset region for a nonconvex polygon.

Inflating a Texas-shaped polygon

The shape of the Texas mainland is nonconvex. I used PROC GREDUCE on the MAPS.US data set in SAS to approximate the shape of Texas by a 36-sided polygon. The polygon is in a standardized coordinate system and has a diameter (maximum distance between vertices) of r = 0.2036. I then constructed the inflated region by using the same technique as shown above. The polygon and its inflated region are shown below.
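If you want to experiment with a similar approximation yourself, here is a minimal sketch (not the program I used; the DENSITY cutoff and the final standardization to 36 vertices are left as assumptions) that uses PROC GREDUCE to thin the Texas boundary in the MAPS.US data set:

/* Sketch: add density levels to the US map, then keep a coarse subset
   of the Texas boundary. Adjust the density cutoff to control how many
   vertices remain in the polygon.                                      */
proc greduce data=maps.us out=us_reduced;
   id state;
run;

data texas;
   set us_reduced;
   where state = stfips("TX") and density <= 1;   /* Texas FIPS code = 48 */
run;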

The image on the left, which shows 36 circles and 36 rectangles, is almost indecipherable. However, the image on the right is almost an exact replica of the region that appears in Robert Allison's post. Remember, though, that the distances in Robert's article are geodesic distances on a sphere whereas these distances are Euclidean distances in the plane. For the planar problem, you can classify a point as within the offset region by testing whether it is inside the polygon itself, inside any of the 36 rectangles, or within a distance r of a vertex. That computation is relatively fast because it is linear in the number of vertices in the polygon.
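To make the classification idea concrete, here is a minimal SAS/IML sketch (my own, not the program behind the original computation). For a point outside the polygon, lying inside one of the edge rectangles or vertex circles is equivalent to lying within distance r of the polygon boundary, so the sketch simply computes the minimum distance from a point to the edges; the point-in-polygon test for interior points is omitted:

proc iml;
/* distance from point q (1x2) to the segment [a, b] */
start DistPointSegment(q, a, b);
   v = b - a;                          /* edge vector                  */
   t = (q-a)*v` / (v*v`);              /* projection parameter         */
   t = max(0, min(1, t));              /* clamp to the segment         */
   proj = a + t*v;                     /* closest point on the segment */
   return( sqrt(ssq(q - proj)) );
finish;

/* minimum distance from q to the boundary of polygon P (n x 2, CCW order) */
start MinDistToPolygon(q, P);
   n = nrow(P);
   d = j(n, 1, .);
   do i = 1 to n;
      k = mod(i, n) + 1;               /* next vertex (wraps around)   */
      d[i] = DistPointSegment(q, P[i,], P[k,]);
   end;
   return( min(d) );
finish;

/* example: unit square, offset radius r = 0.5 */
P = {0 0, 1 0, 1 1, 0 1};
q = {1.3 0.4};   r = 0.5;
inOffset = (MinDistToPolygon(q, P) <= r);   /* 1 ==> q is in the offset region */
print inOffset;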

I don't want to dwell on the computation, but I do want to mention that it requires fewer than 20 SAS/IML statements! The key part of the computation uses vector operations to construct the outward-facing normal vector of length r for each edge of the polygon. If v is the vector that connects the i-th and (i+1)-th vertices of the polygon, then the outward-facing normal vector is given by the concise vector expression r * (v / ||v||) * M, where M is a rotation matrix that rotates by 90 degrees. You can download the SAS program that computes all the images in this article.
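As a small illustration of that formula (a sketch only; the downloadable program is more complete), the following SAS/IML statements compute the outward normal of length r for each edge of a polygon whose vertices are listed in counterclockwise order:

proc iml;
P = {0 0, 2 0, 2 1, 1 2, 0 1};          /* example polygon, CCW order          */
r = 0.5;
M = {0 -1,                               /* right-multiplying a row vector by M */
     1  0};                              /* rotates it 90 degrees clockwise,    */
                                         /* which points outward for CCW order  */
n = nrow(P);
normals = j(n, 2, .);
do i = 1 to n;
   k = mod(i, n) + 1;                    /* next vertex (wraps around)          */
   v = P[k,] - P[i,];                    /* edge vector                         */
   normals[i,] = r * (v / sqrt(ssq(v))) * M;   /* r * unit(v) * M               */
end;
/* the rectangle along edge i has corners P[i,], P[k,],
   P[k,]+normals[i,], and P[i,]+normals[i,]                                     */
print normals;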

In conclusion, you can use a SAS program to construct the offset region for an arbitrary simple polygon. The offset region is the union of circles, rectangles, and the original polygon, which means that it is easy to test whether an arbitrary point in the plane is in the offset region. That is, you can test whether any point is within a distance r to an arbitrary polygon.

The post Offset regions: Find all points within a specified distance from a polygon appeared first on The DO Loop.

July 14, 2018
 

One of the biggest challenges we’ve been hearing from customers lately is that they need help operationalizing analytics to extend the value of their modeling efforts. In many cases, they’ve hired smart data analysts who can transform data and create sophisticated models, but their work is used for a single [...]

7 tips for operationalizing analytics was published on SAS Voices by Randy Guard

July 13, 2018
 

In part one of this blog posting series, we took an introductory tour of recommendation systems, digital marketing, and SAS Customer Intelligence 360. Helping users of your website or mobile properties find items of interest is useful in almost any situation. This is why the concept of personalized marketing is [...]

SAS Customer Intelligence 360: The Digital Shapeshifter of Recommendation Systems [Part 2] was published on Customer Intelligence Blog.

July 12, 2018
 

Word Mover's Distance (WMD) is a distance metric used to measure the dissimilarity between two documents. Its application in text analytics was introduced in 2015 by a research group from Washington University. The group's paper, From Word Embeddings To Document Distances, was published at the 32nd International Conference on Machine Learning (ICML). In the paper, they demonstrated that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates on eight real-world document classification data sets.

They leveraged word embeddings and WMD to classify documents. The biggest advantage of this method over traditional methods is its ability to incorporate the semantic similarity between individual word pairs (e.g., "President" and "Obama") into the document distance metric. Traditionally, one way to handle semantically similar words is to provide a synonym table so that the algorithm can merge words with the same meaning into a representative word before measuring document distance; otherwise, you cannot get an accurate dissimilarity result. However, maintaining synonym tables requires continuous effort from human experts, which makes it time-consuming and expensive. Additionally, the semantic meaning of words depends on the domain, and a general synonym table does not work well across different domains.

Definition of Word Mover's Distance

WMD measures the dissimilarity between two documents as the minimum (weighted) cumulative cost required to move all words from one document to the other. The distance is calculated by solving the following linear program:

   minimize    Σ_{i,j} T_ij · c(i, j)
   subject to  Σ_j T_ij = d_i    for every word i in document d,
               Σ_i T_ij = d'_j   for every word j in document d',
               T_ij ≥ 0,

where

  • T_ij denotes how much of word i in document d travels to word j in document d';
  • c(i, j) denotes the cost of "traveling" from word i in document d to word j in document d'; here the cost is the two words' Euclidean distance in the word2vec embedding space;
  • if word i appears c_i times in document d, its normalized weight is d_i = c_i / Σ_k c_k.

WMD is a special case of the earth mover's distance (EMD) metric, which is computed by solving a well-known transportation problem.
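For reference, the transportation problem that the PROC OPTMODEL step below solves can be written as follows (a sketch in my notation; the division by the total demand corresponds to the sumY divisor in the code):

\[
\mathrm{EMD}(x,y) \;=\; \frac{\displaystyle \min_{f \ge 0} \sum_{i,j} c_{ij}\, f_{ij}}{\displaystyle \sum_{j} u_j}
\quad \text{subject to} \quad
\sum_{i} f_{ij} = u_j \;\; \forall j,
\qquad
\sum_{j} f_{ij} \le w_i \;\; \forall i,
\]

where w_i is the weight (supply) of node x_i, u_j is the weight (demand) of node y_j, and c_ij is the cost on the arc from x_i to y_j.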

How to Calculate Earth Mover's Distance with SAS

SAS/OR is the tool to solve transportation problems. Figure-1 shows a transportation example with four nodes and the distances between nodes, which I copied from this Earth Mover's Distance document. The objective is to find the minimum-cost flow from {x1, x2} to {y1, y2}. Now let's see how to solve this transportation problem using SAS/OR.

The weights of nodes and distances between nodes are given below.

Figure-1 A Transportation Problem

data x_set;
input _node_ $ _sd_;
datalines;
x1 0.74
x2 0.26
;
 
data y_set;
input _node_ $ _sd_;
datalines;
y1 0.23
y2 0.51
;
 
data arcdata;
input _tail_ $ _head_ $ _cost_;
datalines;
x1 y1 155.7
x1 y2 252.3
x2 y1 292.9
x2 y2 198.2
;
 
proc optmodel;
set xNODES;
num w {xNODES};
 
set yNODES;
num u {yNODES};
 
set <str,str> ARCS;
num arcCost {ARCS};
 
read data x_set into xNODES=[_node_] w=_sd_;
read data y_set into yNODES=[_node_] u=_sd_;
read data arcdata into ARCS=[_tail_ _head_] arcCost=_cost_;
 
var flow {<i,j> in ARCS} >= 0;
impvar sumY = sum{j in yNODES} u[j];
min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];
 
solve with lp / algorithm=ns scale=none logfreq=1;
print flow;
quit;

The SAS/OR solution is shown in Table-1, and the EMD is the objective value: 203.26756757.

Table-1 EMD Calculated with SAS/OR

The flow data that I got from SAS/OR, shown in Table-2, is the same as the diagram posted in the aforementioned Earth Mover's Distance document.

Table-2 Flow data of SAS/OR

Figure-2 Flow Diagram of Transportation Problem
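As a quick arithmetic check (my computation, not part of the original output), a feasible flow that attains the reported objective sends 0.23 units along x1→y1, 0.25 units along x1→y2, and 0.26 units along x2→y2:

(0.23*155.7 + 0.25*252.3 + 0.26*198.2) / (0.23 + 0.51)
   = (35.811 + 63.075 + 51.532) / 0.74
   = 150.418 / 0.74
   ≈ 203.2676

This flow satisfies both demands (y1 = 0.23, y2 = 0.51) without exceeding either supply (x1 = 0.74, x2 = 0.26).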

 

How to Calculate Word Mover's Distance with SAS

The paper From Word Embeddings To Document Distances also proposed a metric called Relaxed Word Mover's Distance (RWMD), which removes the second constraint of the WMD linear program in order to reduce the amount of computation. Since we need to read word embedding data, I will show you how to calculate the RWMD of two documents with SAS Viya.

/* start CAS server */
cas casauto host="host.example.com" port=5570;
libname sascas1 cas;
 
/* load documents into CAS */
data sascas1.documents;
infile datalines delimiter='|' missover;
length text varchar(300);
input text$ did;
datalines;
Obama speaks to the media in Illinois.|1
The President greets the press in Chicago.|2
;
run;
 
/* create stop list*/
data sascas1.stopList;
infile datalines missover;
length term $20;
input term$;
datalines;
the
to
in
;
run;
 
/* load word embedding model */
proc cas;
loadtable path='datasources/glove_100d_tab_clean.txt'
caslib="CASTestTmp"
importOptions={
fileType="delimited",
delimiter='\t',
getNames=True,
guessRows=2.0,
varChars=True
}
casOut={name='glove' replace=True};
run;
quit;
 
%macro calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);
/* text parsing and aggregation */
proc cas;
textParse.tpParse/
table = {name="&textDS",where="&documentID=&doc1_id or &documentID=&doc2_id"}
docId="&documentID",
language="&language",
stemming=False,
nounGroups=False,
tagging=False,
offset={name="outpos", replace=1},
text="&text";
run;
 
textparse.tpaccumulate/
parent={name="outparent1", replace=1}
language="&language",
offset='outpos',
stopList={name="&stoplist"},
terms={name="outterms1", replace=1},
child={name="outchild1", replace=1},
reduce=1,
cellweight='none',
termWeight='none';
run;
quit;
 
/* terms of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc_terms",replace=true}
query="
select outparent1.*,_term_
from outparent1
left join outterms1
on outparent1._termnum_ = outterms1._termnum_
where _Document_=&doc1_id or _Document_=&doc2_id;
"
;
run;
quit;
 
/* term vectors and counts of the two test documents */
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc1_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc1_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc2_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc1_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc1_termcount", replace=true};
run;
quit;
 
proc cas;
loadactionset "fedsql";
execdirect casout={name="doc2_termvects",replace=true}
query="
select word2vect.*
from &word2VectDS word2vect, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
execdirect casout={name="doc2_terms",replace=true}
query="
select doc_terms.*
from &word2VectDS, doc_terms
where _Document_=&doc1_id and lowcase(term) = _term_;
"
;
run;
 
simple.groupBy / table={name="doc2_terms"}
inputs={"_Term_", "_Count_"}
aggregator="n"
casout={name="doc2_termcount", replace=true};
run;
quit;
 
/* calculate Euclidean distance between words */
data doc1_termvects;
set sascas1.doc1_termvects;
run;
 
data doc2_termvects;
set sascas1.doc2_termvects;
run;
 
proc iml;
use doc1_termvects;
read all var _char_ into lterm;
read all var _num_ into x;
close doc1_termvects;
 
use doc2_termvects;
read all var _char_ into rterm;
read all var _num_ into y;
close doc2_termvects;
 
d = distance(x,y);
 
lobs=nrow(lterm);
robs=nrow(rterm);
d_out=j(lobs*robs, 3, BlankStr(20));   /* character matrix with element length 20 */
do i=1 to lobs;
do j=1 to robs;
d_out[(i-1)*robs+j,1]=lterm[i];
d_out[(i-1)*robs+j,2]=rterm[j];
d_out[(i-1)*robs+j,3]=cats(d[i,j]);
end;
end;
 
create distance from d_out;
append from d_out;
close distance;
run;quit;
 
/* calculate RWMD between documents */
data x_set;
set sascas1.doc1_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data y_set;
set sascas1.doc2_termcount;
rename _term_=_node_;
_weight_=_count_;
run;
 
data arcdata;
set distance;
rename col1=_tail_;
rename col2=_head_;
length _cost_ 8;
_cost_= col3;
run;
 
proc optmodel;
set xNODES;
num w {xNODES};
 
set yNODES;
num u {yNODES};
 
set <str,str> ARCS;
num arcCost {ARCS};
 
read data x_set into xNODES=[_node_] w=_weight_;
read data y_set into yNODES=[_node_] u=_weight_;
read data arcdata into ARCS=[_tail_ _head_]
arcCost=_cost_;
 
var flow {<i,j> in ARCS} >= 0;
impvar sumY = sum{j in yNODES} u[j];
min obj = (sum {<i,j> in ARCS} arcCost[i,j] * flow[i,j])/sumY;
 
con con_y {j in yNODES}: sum {<i,(j)> in ARCS} flow[i,j] = u[j];
/* con con_x {i in xNODES}: sum {<(i),j> in ARCS} flow[i,j] <= w[i];*/
 
solve with lp / algorithm=ns scale=none logfreq=1;
 
call symput('obj', strip(put(obj,best.)));
create data flowData
from [i j]={<i, j> in ARCS: flow[i,j].sol > 0}
col("cost")=arcCost[i,j]
col("flowweight")=flow[i,j].sol;
run;
quit;
 
%put RWMD=&obj;
%mend calculateRWMD;
 
%calculateRWMD
(
textDS=documents,
documentID=did,
text=text,
language=English,
stopList=stopList,
word2VectDS=glove,
doc1_id=1,
doc2_id=2
);

The RWMD value is 5.0041121662.

Now let's have a look at the flow values.

proc print data=flowdata;
run;quit;

The WMD method not only measures document similarity, but also explains why the two documents are similar by visualizing the flow data.
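If you want a quick visualization (a sketch of my own, not from the original post), the flowData table created by the macro contains the columns i, j, and flowweight, so a heat map shows which word pairs carry the flow:

proc sgplot data=flowdata;
   heatmapparm x=i y=j colorresponse=flowweight / colormodel=TwoColorRamp;
run;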

Besides document similarity, WMD is also useful for measuring sentence similarity. Check out this article about sentence similarity and try it with SAS.

How to calculate Word Mover's Distance with SAS was published on SAS Users.

July 11, 2018
 

SAS works with utilities all over the world, helping these companies capitalize on the value of analytics and become "digital utilities." When we talk to utilities, we look at analytics use cases across: assets and operations; customers; portfolio; and corporate operations (see diagram below).  In this fourth post of my four-part [...]

Analytics use cases for utilities: Portfolio analytics was published on SAS Voices by David Pope

July 11, 2018
 

In a previous article, I showed how to find the intersection (if it exists) between two line segments in the plane. There are some fun problems in probability theory that involve intersections of line segments. One is "What is the probability that two randomly chosen chords of a circle intersect?" This article shows how to create a simulation in SAS to estimate the probability.

An exact answer

For this problem, a "random chord" is defined as the line segment that joins two points chosen at random (with uniform probability) on the circle. The probability that two random chords intersect can be derived by using a simple counting argument. Suppose that you pick four points at random on the circle and label them p1, p2, p3, and p4 according to their polar angles. As illustrated by the following graphic, there are only three ways to pair the four points into two chords, and the chords intersect for exactly one of the pairings (the one that pairs p1 with p3 and p2 with p4, so that the endpoints alternate around the circle). Because the four points are exchangeable, each pairing is equally likely, so the probability that two random chords intersect is 1/3.

Possible arrangements of two random chords in the circle

A simulation in SAS

You can create a simulation to estimate the probability that two random chords intersect. The intersection of two segments can be detected by using either of the two SAS/IML modules in my article about the intersection of line segments. The following simulation generates four angles chosen uniformly at random in the interval (0, 2π). It converts those points to (x,y) coordinates on the unit circle. It then computes whether the chord between the first two points intersects the chord between the third and fourth points. It repeats this process 100,000 times and reports the proportion of times that the chords intersect.

proc iml;
/* Find the intersection between 2D line segments [p1,p2] and [q1,q2].
   This function assumes that the line segments have different slopes (A is nonsingular) */
start IntersectSegsSimple(p1, p2, q1, q2);
   b = colvec(q1 - p1); 
   A = colvec(p2-p1) || colvec(q1-q2); /* nonsingular when segments have different slopes */
   x = solve(A, b);                    /* x = (s,t) */
   if all(0<=x && x<=1) then           /* if x is in [0,1] x [0,1] */
      return (1-x[1])*p1 + x[1]*p2;    /* return intersection */
   else                                /* otherwise, segments do not intersect */
      return ({. .});                  /* return missing values */
finish;
 
/* Generate two random chords on the unit circle.
   Simulate the probability that they intersect  */
N = 1e5;
theta = j(N, 4);
call randseed(123456);
call randgen(theta, "uniform", 0, 2*constant('pi'));
intersect = j(N,1,0);
do i = 1 to N;
   t = theta[i,]`;                 /* 4 random U(0, 2*pi) */
   pts = cos(t) || sin(t);         /* 4 pts on unit circle */
   p1 = pts[1,];    p2 = pts[2,];
   q1 = pts[3,];    q2 = pts[4,];
   intersect[i] = all(IntersectSegsSimple(p1, p2, q1, q2) ^= .);
end;
 
prob = mean(intersect);
print prob;

This simulation produces an estimate that is close to the exact probability of 1/3.

Connection to Bertrand's Paradox

This problem has an interesting connection to Bertrand's Paradox. Bertrand's paradox shows the necessity of specifying the process that is used to define the random variables in a probability problem. It turns out that there are multiple ways to define "random chords" in a circle, and the different definitions can lead to different answers to probability questions. See the Wikipedia article for an example.

For the definition of "random chords" in this problem, the density of the endpoints is uniform on the circle. After you make that choice, other distributions are determined. For example, the distribution of the lengths of 1,000 random chords is shown below. The lengths are NOT uniformly distributed! The theoretical density of the chord lengths is overlaid on the distribution of the sample. If you change the process by which chords are randomly chosen (for example, you force the lengths to be uniformly distributed), you might also change the answer to the problem, as shown in Bertrand's Paradox.

Distribution of lengths of random chords in the unit circle, generated by choosing uniformly distributed endpoints
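If you want to reproduce that distribution, the following SAS/IML sketch (my addition, using the same uniform-endpoint process) simulates 1,000 chord lengths. A standard calculation shows that the length has density 2/(π·sqrt(4 − L²)) on (0, 2), which has mean 4/π ≈ 1.273, so the sample mean provides a quick check:

proc iml;
/* simulate lengths of chords whose endpoints are uniform on the unit circle */
call randseed(54321);
N = 1000;
theta = j(N, 2);
call randgen(theta, "uniform", 0, 2*constant('pi'));
len = sqrt( (cos(theta[,1]) - cos(theta[,2]))##2 +
            (sin(theta[,1]) - sin(theta[,2]))##2 );
print (mean(len))[label="Sample mean"] (4/constant('pi'))[label="Theoretical mean"];

create chordLengths var {"len"};  append;  close chordLengths;
quit;

proc sgplot data=chordLengths;
   histogram len / binwidth=0.1;
   xaxis label="Chord length";
run;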

The post The probability that two random chords of a circle intersect appeared first on The DO Loop.

July 10, 2018
 

Did you know that 80 percent of an analytics life cycle is time spent on data preparation? For many SAS users and administrators, data preparation is what you live and breathe day in and day out.

Your analysis is only as good as your data, and that's why we wanted to shine a light on the importance of data preparation. I reached out to some of our superstar SAS users in the Friends of SAS community (for Canadian SAS customers and partners) for the inside scoop on different kinds of data preparation tasks they deal with on a daily basis.

Meet Kirby Wu, Actuarial Analyst for TD Insurance

At TD Insurance, SAS is used by many different teams for many different functions.

Kirby uses SAS mainly for its data preparation capabilities. This includes joining tables, cleaning data, summarizing, segmenting, and then sharing the ready-to-use data with the appropriate departments. A day in the life of Kirby includes tackling massive data sets containing billions of claim records. He needs powerful software to perform ETL (extract, transform and load) tasks and manage this data, and that's where SAS comes in.

"SAS is the first step in our job to access good quality data," says Kirby. "Being an actuary, we use SAS to not only pick up data, but to do profiling, tech inquires and, most importantly, for data quality control purposes. Then we present the data to various teams to take advantage of the findings to improve the business."

Many actuaries have some basic SAS skills to understand the data set. Once they output the data, it is shared across departments and teams for others to make use of.

Prior to his work at TD Insurance, Kirby also used SAS for analytics. He ran GLM analysis where he encountered huge data sets. "When data comes out, we want to understand it, as well as performing statistical analysis on it," says Kirby. "SAS largely influences what direction to go in, and what variable we think is good to use."

Kirby left us with four reasons why he prefers SAS for data preparation.

  1. SAS is an enterprise solution, and the application itself is tried and proven.
  2. Working in insurance, there are many security concerns dealing with sensitive data. SAS provides reassurance in terms of data security.
  3. SAS has been serving the market for many years, and its capabilities and reputation are timeless.
  4. SAS offers a drag-and-drop GUI, as well as programming interfaces for users of varying skill levels.

Meet John Lam of CIBC

Since John joined the bank 15 years ago, he has been using SAS for ETL processes. John saves both time and money using Base SAS when he solves complex ETL tasks in his day-to-day work, and is mainly responsible for data preparation.

John accesses data from multiple source systems and transforms it for business consumption. He works on the technical side and passes on the transformed data within the same organization to the business side. The source data typically comes in at the beginning of each month, but the number of files varies month to month.

"SAS is a great tool," shares John. "The development time is a lot less and helps us save a lot of time on many projects."

John also shared with us some past experiences with complex issues where SAS would have come in handy. He once encountered a situation where he needed to calculate the length of time it would take for someone to receive benefits. However, this calculation method is very complicated and varies greatly depending on how the gap is structured.

"When I look back, the process that took us two to three weeks would have only taken us two to three days if we had used SAS," says John. "SAS would have provided a less complex way of figuring out the problem using date functions!"

Meet Horst Wolter, Manager at TD Bank

"My bread and butter is Base SAS," explains Horst. "The bank has data all over the place in multiple platforms and in multiple forms. We encounter a lot of data – from mainframe to Unix to PC, and flat files or mainframe SAS data sets."

Regardless of the platform or data he is dealing with, users always request slices and dices of the data. Horst takes all available data and finds ways of matching and merging different pieces together to create something that is relevant and easy to understand.

The majority of the work Horst does is with credit card data. "I check database views that have millions of rows, including historical data."

The bank deals with millions of customers over many years, resulting in many records. Needless to say, the sizes of data he deals with are quite large! Accessing, processing and managing this data for business insight is a battle SAS helps Horst fight every day.

Sharing Is Caring!

How are you using SAS? Share in a few sentences in the comments!

About Friends of SAS

If you’re not familiar with Friends of SAS, it is an exclusive online community available only to our Canadian SAS customers and partners to recognize and show our appreciation for their affinity to SAS. Members complete activities called "challenges" and earn points that can be redeemed for rewards. There are opportunities to build powerful connections, gain privileged access to SAS resources and events, and boost your learning and development of SAS, all in a fun environment.

Interested in learning more about Friends of SAS? Feel free to email Natasha.Ulanowski@sas.com or Martha.Casanova@sas.com with any questions or to get more details.

No typical SAS user: how three professionals prep data using SAS was published on SAS Users.