May 22, 2019
 

This blog shows how the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules. Because of these capabilities, highly customized models can be developed in VTA. The rules used in this blog are basic. Developing linguistic rules and accurately categorizing documents requires subject matter expertise and an understanding of the grammatical structure of the language(s) used.

I will use a data set that contains information on 1527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings. Two customized categories will be developed: one for Children and the other for Sports movies. Because we are familiar with movie classification and MPAA ratings, it will be relatively easy to understand the rules used in this blog. The overall objective of the blog is to show how to formulate basic rules, so their use can be extended to other fields.

SAS Visual Text Analytics (VTA) is the SAS offering designed to effectively extract insights from unstructured data at large scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and Linguistic Rules. Currently, VTA supports 30 languages, and it has an open architecture supporting third-party programming interfaces.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations in which the insights found in one iteration are used in the next. With regard to the linguistic rules, one must determine whether the new rules are an improvement over the ones used in previous iterations, and find how many true positives and false positives are matched by the new rules. This process should be repeated until one obtains the precision required.
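
For reference, precision here is the standard measure: precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives matched by a rule.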

Initial Text Analysis Using Visual Analytics

Because Visual Analytics (VA) and VTA are highly integrated, the initial Text Analysis can be done in VA.

Every Text Analytics dataset must have a unique identifier associated with each document. In my blog, Discover Main Topics on #MLKDayofService Tweets Using SAS Visual Text Analytics, I showed how to set a “Unique Row Identifier”, and how to work with the nodes in the Pipeline.

In Visual Analytics (VA) one can do the initial analysis of text data, see the Word Cloud, and a list of topics. In the Options menu, I set Maximum of Topics to Generate = 7.

The photo above shows that there are 364 movies with the term “kid” in the Topic “+show,+kid, +rate,+movie”. We could build a category that groups movies appropriate for kids.

There are 203 documents with the Topic related to science fiction. If I wanted to have a category for Sports movies, however, I would have to build it myself, because sports terms appear in fewer documents.

Create a Visual Text Analytics Project

In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the Text Analysis process. I described in detail how to work with the nodes in my MLKDayofService blog mentioned above. Briefly, from the SAS Home menu select the action Build Models, which takes you to SAS Model Studio, where you select and create a New Project.

The photo above shows the data role assignments done in the Data tab.

Notice that there is a Unique Row Identifier for each document, the Text Variable to analyze is Review, and two variables are used as Category: MPAARating and mainGenre. Later on, VTA will use these two variables to automatically create categories and their Boolean rules. Title doesn’t have a role, but I want it to be displayed to facilitate the analysis.

Movies are already classified according to their main category (mainGenre). I want to see the Boolean rules that VTA automatically generates for each category, and whether I can create new concepts and categories that improve on the initial categorization. For example, I would like:

  1. to find children movies that don’t contain violence,
  2. to find movies that are related to Sports,
  3. to read the reviews of my favorite old movies, and
  4. to find movies whose reviews mention some of my favorite movie directors.

Method

I ran two pipelines. The first one had the default pipeline settings, with the option Include predefined concepts enabled for the Concepts node. The objective was to see the rules associated with the genres “Sports”, “Animated” and “Family”, the movies matched by these rules, as well as the ones that shouldn’t have been matched. In the second pipeline, I developed LITI and Boolean rules with the objective of improving the default categorizations automatically produced in the first pipeline.

In the next sections I will describe how the new categories were built. In real business situations, sometimes we will have pre-defined categories available to us, and other times we will come up with categories that satisfy the business objectives after analyzing many documents.

Customized Concepts built in the Concept Node

In the second pipeline, I developed three customized concepts. I will use one of them, “MySports”, to build a new category later on.

Basic Boolean operators are used to define new concepts and categories. AND/NOT operators are applied to the whole document. There are other operators that search within the same sentence (SENT), the same paragraph (PARA) or within a given number of terms (DIST).

# Any line that starts with “#” is a comment
# Use CLASSIFIER to match a literal sequence
# Use CONCEPT_RULE to use Boolean and proximity operators. The term extracted should use _c{ }

MySports Concept

I wrote this rule, which matches 98 documents, most of them related to Sports movies, with few false positives. This rule will match a document if any of the terms sport, baseball, tennis, football, basketball, racetrack appear anywhere in the document (movie review) but the terms gambling, buddy or sporting do not appear anywhere in the document.

I will use this MySports CONCEPT_RULE to build the new Sports category:

CONCEPT_RULE: (AND, (OR, "_c{sport@}","_c{baseball}","_c{tennis}","_c{football}","_c{basketball}", "_c{racetrack}"),(NOT,"sporting"),(NOT, "gambling"),(NOT,"buddy"))

filmmakersInReview Concept

I built this concept just to illustrate how to use a pre-defined concept, in this case nlpPerson:

CONCEPT_RULE: (DIST_10,(OR,"filmmaker","director","film producer","producer","movie maker"),"_c{nlpPerson}")

favoriteMovies Concept

I built this concept to match my favorite old movies and one of my favorite directors. The first CONCEPT_RULE will match documents that contain the terms Stanley Kubrick and 2001 in the same sentence. The second CONCEPT_RULE will match documents that contain the terms Stanley Kubrick and A Space Odyssey anywhere in the document. Both CONCEPT_RULEs will only extract the first term "Stanley Kubrick":

CLASSIFIER:A Space Odyssey
CLASSIFIER:The Sound of Music
CLASSIFIER:Il Postino
CONCEPT_RULE:(SENT,"_c{Stanley Kubrick}","2001")
CONCEPT_RULE:(AND,"_c{Stanley Kubrick}","A Space Odyssey")

New Concepts in Parsing Text Node

The customized concepts developed in the Concepts node are passed to the Text Parsing Node. Notice the Terms football, sport, sports, baseball and the new Role MySports in the Kept Terms window, as well as the documents matched to the term "football":

Customized Categories in the Categories Node

In the second pipeline I developed new categories, using as a starting point the rules associated with the genres “Sports”, “Animated” and “Family”.

Sports Category

The input data has only 3 movies in the Sports category; it is difficult to generate a meaningful rule with such a small dataset. Once the first pipeline is run, there are a total of 8 matched movies, which include the 3 original ones and 5 that are not related to sports. The automatically generated rule is:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"))

For the second pipeline, I developed the MySports rule in the Concepts node as mentioned above, and wrote this Boolean rule in the Categories node:

(OR,(AND,"crowd-pleaser"),(AND,"conor"),(AND,"_x000d_stupid"),"[MySports]")

The new rule matches 90 movies, most of them related to Sports. For the next iteration, one would need to look at the matched movies that don’t relate to Sports and the Sports movies that were not matched, and improve on the rule above.

ChildrenMovies Category

In the second pipeline, I combined the rules for the Family and Animation categories which were automatically produced in the first pipeline.

For the Family category, there were 6 movies matched by this rule:

(OR,(AND,(OR,"adults","adult"),"oz"))

It matched “People vs Larry Flynt”, which prompted me to use the terms “murder” and “obscenity” in the new rule.
The Animation category had 66 matched movies and the automatically generated rule was:

(OR,(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))
I decided to modify these two rules. In the second pipeline, I used this new rule:

(OR,(AND,(NOT,(OR,"adults","adult","suitable for children","rated R","strip@","suck@","crude humor","gore","horror","murder","obscenity","drug use@")),"Wizard of Oz"),(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))

This produced 73 matched movies, only two of which were rated “R”. Therefore, I removed both the Animation and the Family categories and created the new category childrenMovies.

Again, to determine if the new rules are an improvement over the previous ones, we must find out how many true positives and false positives are matched by the new rules, and repeat the process until we obtain the precision required.

Conclusion

Because the automatically generated concepts and categories in Visual Text Analytics (VTA) can be refined using LITI and Boolean rules, highly customized models can be developed in VTA.

As in all analytical projects, the discovery process in Text Analytics projects requires several iterations in which the insights found in one iteration are used in the next. With regard to the linguistic rules, one must determine whether the new rules are an improvement over the ones used in previous iterations, and find how many true positives and false positives are matched by the new rules. This process should be repeated until one obtains the precision required.

Many thanks to Teresa Jade and Biljana Belamaric Wilsey for reviewing the linguistic rules used in this blog. For more information about Visual Text Analytics, please check out:

Analysis of Movie Reviews using Visual Text Analytics was published on SAS Users.

May 22, 2019
 

Analytics can bring tremendous value to a business. However, investing in an analytics solution is often easier said than done. The key challenge is demonstrating the value of analytics without first investing in technology and resources. The solution? Take a results-based approach to establish the value of analytics by using [...]

Establish the benefits of analytics before investing in a solution was published on SAS Voices by David Annis.

May 22, 2019
 

The eigenvalues of a matrix are not easy to compute. It is remarkable, therefore, that with relatively simple mental arithmetic, you can obtain bounds for the eigenvalues of a matrix of any size. The bounds are provided by using a marvelous mathematical result known as Gershgorin's Disc Theorem. For certain matrices, you can use Gershgorin's theorem to quickly determine that the matrix is nonsingular or positive definite.

The Gershgorin Disc Theorem appears in Golub and van Loan (p. 357, 4th Ed; p. 320, 3rd Ed), where it is called the Gershgorin Circle Theorem. The theorem states that the eigenvalues of any N x N matrix, A, are contained in the union of N discs in the complex plane. The center of the i-th disc is the i-th diagonal element of A. The radius of the i-th disc is the sum of the absolute values of the off-diagonal elements in the i-th row. In symbols,
D_i = { z ∈ C : |z - A_ii| ≤ r_i }
where r_i = Σ_{j ≠ i} |A_ij|. Although the theorem holds for matrices with complex values, this article only uses real-valued matrices.

An example of Gershgorin discs is shown to the right. The discs are shown for the following 4 x 4 symmetric matrix:
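
A = [ 200   30  -15    5
       30  100    5    5
      -15    5   55    0
        5    5    0   15 ]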

At first glance, it seems inconceivable that we can know anything about the eigenvalues without actually computing them. However, two mathematical theorems tell us quite a lot about the eigenvalues of this matrix, just by inspection. First, because the matrix is real and symmetric, the Spectral Theorem tells us that all eigenvalues are real. Second, the Gershgorin Disc Theorem says that the four eigenvalues are contained in the union of the following discs:

  • The first row produces a disc centered at x = 200. The disc has radius |30| + |-15| + |5| = 50.
  • The second row produces a disc centered at x = 100 with radius |30| + |5| + |5| = 40.
  • The third row produces a disc centered at x = 55 with radius |-15| + |5| + |0| = 20.
  • The last row produces a disc centered at x = 25 with radius |5| + |5| + |0| = 10.

Although the eigenvalues for this matrix are real, the Gershgorin discs are in the complex plane. The discs are visualized in the graph at the top of this article. The true eigenvalues of the matrix are shown inside the discs.

For this example, each disc contains an eigenvalue, but that is not true in general. (For example, the matrix A = {1 −1, 2 −1} does not have any eigenvalues in the disc centered at x=1.) What is true, however, is that disjoint unions of discs must contain as many eigenvalues as the number of discs in each disjoint region. For this matrix, the discs centered at x=25 and x=200 are disjoint. Therefore they each contain an eigenvalue. The union of the other two discs must contain two eigenvalues, but, in general, the eigenvalues can be anywhere in the union of the discs.

The visualization shows that the eigenvalues for this matrix are all positive. That means that the matrix is not only symmetric but also positive definite. You can predict that fact from the Gershgorin discs because no disc intersects the negative X axis.

Of course, you don't have to perform the disc calculations in your head. You can write a program that computes the centers and radii of the Gershgorin discs, as shown by the following SAS/IML program, which also computes the eigenvalues for the matrix:

proc iml;
A = { 200  30 -15  5,
       30 100   5  5,
      -15   5  55  0, 
        5   5   0 15};
 
evals = eigval(A);                 /* compute the eigenvalues */
center = vecdiag(A);               /* centers = diagonal elements */
radius = abs(A)[,+] - abs(center); /* sum of abs values of off-diagonal elements of each row */
discs = center || radius || round(evals,0.01);
print discs[c={"Center" "Radius" "Eigenvalue"} L="Gershgorin Discs"];

Diagonally dominant matrices

For this example, the matrix is strictly diagonally dominant. A strictly diagonally dominant matrix is one for which the magnitude of each diagonal element exceeds the sum of the magnitudes of the other elements in the row. In symbols, |A_ii| > Σ_{j ≠ i} |A_ij| for each i. Geometrically, this means that no Gershgorin disc intersects the origin, which implies that the matrix is nonsingular. So, by inspection, you can determine that this matrix is nonsingular.
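
If you want to verify this numerically rather than by inspection, here is a small SAS/IML sketch (it assumes the PROC IML session and the matrix A defined above are still active):

d = abs(vecdiag(A));       /* magnitudes of the diagonal elements                */
r = abs(A)[,+] - d;        /* Gershgorin radii (sums of off-diagonal magnitudes) */
isDominant = all(d > r);   /* 1 if every disc excludes the origin                */
print isDominant;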

Gershgorin discs for correlation matrices

The Gershgorin theorem is most useful when the diagonal elements are distinct. For repeated diagonal elements, it might not tell you much about the location of the eigenvalues. For example, all diagonal elements for a correlation matrix are 1. Consequently, all Gershgorin discs are centered at (1, 0) in the complex plane. The following graph shows the Gershgorin discs and the eigenvalues for a 10 x 10 correlation matrix. The eigenvalues of any 10 x 10 correlation matrix must be real and in the interval [0, 10], so the only new information from the Gershgorin discs is a smaller upper bound on the maximum eigenvalue.
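
To see this numerically, the following sketch builds a correlation matrix from simulated data (not the 10 x 10 matrix shown in the graph) and compares the Gershgorin upper bound with the largest eigenvalue:

proc iml;
call randseed(1);
x = j(100, 10, .);
call randgen(x, "Normal");           /* 100 observations, 10 variables           */
R = corr(x);                         /* 10 x 10 correlation matrix               */
radius = abs(R)[,+] - 1;             /* all diagonal elements equal 1            */
GershgorinBound = 1 + max(radius);   /* upper bound for the largest eigenvalue   */
maxEigen = max(eigval(R));
print GershgorinBound maxEigen;      /* the bound is typically much less than 10 */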

Gershgorin discs for unsymmetric matrices

Gershgorin's theorem can be useful for unsymmetric matrices, which can have complex eigenvalues. The SAS/IML documentation contains the following 8 x 8 block-diagonal matrix, which has two pairs of complex eigenvalues:

A = {-1  2  0       0       0       0       0  0,
     -2 -1  0       0       0       0       0  0,
      0  0  0.2379  0.5145  0.1201  0.1275  0  0,
      0  0  0.1943  0.4954  0.1230  0.1873  0  0,
      0  0  0.1827  0.4955  0.1350  0.1868  0  0,
      0  0  0.1084  0.4218  0.1045  0.3653  0  0,
      0  0  0       0       0       0       2  2,
      0  0  0       0       0       0      -2  0 };

The matrix has four smaller Gershgorin discs and three larger discs (radius 2) that are centered at (-1,0), (2,0), and (0,0), respectively. The discs and the actual eigenvalues of this matrix are shown in the following graph. Not only does the Gershgorin theorem bound the magnitude of the real part of the eigenvalues, but it is clear that the imaginary part cannot exceed 2. In fact, this matrix has eigenvalues -1 ± 2 i, which are on the boundary of one of the discs, which shows that the Gershgorin bound is tight.

Conclusions

In summary, the Gershgorin Disc Theorem provides a way to visualize the possible location of eigenvalues in the complex plane. You can use the theorem to provide bounds for the largest and smallest eigenvalues.

I was never taught this theorem in school. I learned it from a talented mathematical friend at SAS. I use this theorem to create examples of matrices that have particular properties, which can be very useful for developing and testing software.

This theorem also helped me to understand the geometry behind "ridging", which is a statistical technique in which positive multiples of the identity are added to a nearly singular X`X matrix. The Gershgorin Disc Theorem shows that the effect of ridging a matrix is to translate all of the Gershgorin discs to the right, which moves the eigenvalues away from zero while preserving their relative positions.

You can download the SAS program that I used to create the images in this article.

Further reading

There are several papers on the internet about Gershgorin discs. It is a favorite topic for advanced undergraduate projects in mathematics.

The post Gershgorin discs and the location of eigenvalues appeared first on The DO Loop.

May 21, 2019
 

If you spend any time working with maps and spatial data, having a fundamental understanding of coordinate systems and map projections becomes necessary.  It’s the foundation of how spatial data and maps work.  These areas invariably evoke trepidation and some angst, even in the most seasoned map professional.  And rightfully so: it can get complicated quickly. Fortunately, most of those worries can be set aside when creating maps with SAS Visual Analytics, without requiring a degree in Geodesy.

Visual Analytics includes several different coordinate system definitions configured out-of-the-box.  Like the Predefined geography types (see Fundamental of SAS Visual Analytics geo maps), they are selected from a drop-down list during the geography variable setup.  With the details handled by VA, all you need to know is what coordinate space your data uses and select the appropriate one.

The four Coordinate spaces included with VA are:

  1. World Geodetic System (WGS84)
    Area of coverage: World.  Used by GPS navigation systems and NATO military geodetic surveying.  This is the VA default and should work in most situations.
  2. Web Mercator
    Area of coverage: World.  Format used by Google maps, OpenStreetMap, Bing maps and other web map providers.
  3. British National Grid (OSGB36)
    Area of coverage: United Kingdom – Great Britain, Isle of Man
  4. Singapore Transverse Mercator (SVY21)
    Area of coverage: Singapore onshore/offshore

But what if your data does not use one of these?  For those situations, VA also supports custom coordinate spaces.  With this option, you can specify the definition of your desired coordinate space using industry standard formats for EPSG codes or Proj4 strings.  Before we get into the details of how to use custom coordinate spaces in VA, let’s take a step back and review the basics of coordinate spaces and projections.

Background

A coordinate space is simply a grid designed to cover a specific area of the Earth.  Some have global coverage (WGS84, the default in VA) and others cover relatively small areas (SVY21/Singapore Transverse Mercator).  Each coordinate space is defined by several parameters, including but not limited to:

  • Center coordinates (origin)
  • Coverage area (‘bounds’ or ‘extent’)
  • Unit of measurement (feet or meters)

Comparison of coordinate space definitions included in Visual Analytics -- Source: http://epsg.io

The image above compares the four coordinate space definitions included with VA.  The two on the right, BNG and Singapore Transverse Mercator, have a limited extent.  A red rectangle outlines the area of coverage for each region.  The two on the left, WGS84 and Mercator, are both world maps.  At first glance, they may appear to have the same coverage area, but they are not interchangeable.  The origin for both is located at the intersection of the Equator and the Prime Meridian.  However, the similarities end there.  Notice the extent for WGS84 covers the entire latitude range, from -90 to +90.  Mercator on the other hand, covers from -85 to +85 latitude, so the first 5 degrees from each Pole are not included.  Another difference is the unit of measurement.  WGS84 is measured in un-projected degrees, which is indicative of a spherical Geographic Coordinate System (GCS).  Mercator uses meters, which implies a Projected Coordinate System (PCS) used for a flat surface, ie. a screen or paper.

The projection itself is a complex mathematical operation that transforms the spherical surface of the GCS into the flat surface of the PCS.  This transformation introduces distortion in one or more qualities of the map: shape, area, direction, or distance.  The process of map projection compares to peeling an orange. Removing the peel and placing it on a flat surface will cause parts of it to stretch, tear or separate as it flattens. The same thing happens to a map projection.

A flat map will always have some degree of distortion.  The amount of distortion depends on the projection used.  Select a projection that minimizes the distortion in the areas most important to the map.  For example, are you creating a navigation map where direction is critical?  How about a World map to compare land mass of various countries?  Or maybe a local map of Municipality services where all factors are equally important?  These decisions are important if you are collecting and creating your data set from the field.  But, if you are using existing data sets, chances are that decision has already been made for you.  It then becomes a task of understanding what coordinate system was selected and how to use it within VA.

Using a Custom Coordinate Space in VA

When using VA’s custom coordinate space option, it is critical the geography variable and the dataset use the same coordinate space.  This tells VA how to align the grid used by the data with the grid used by the underlying map.  If they align, the data will be placed at the expected location.  If they don’t align, the data will appear in the wrong location or may not be displayed at all.

Illustration of aligning the map and data grids

To illustrate the process of using a custom coordinate space in VA, we will be creating a custom region map of the Oklahoma City School Districts.  The data can be found on the Oklahoma City Open Data Portal.  We will use the Esri shapefile format.  As you may recall from a previous blog post, Creating custom region maps with SAS Visual Analytics, the first step is to import the Esri shapefile data into a SAS dataset.
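
For reference, here is a minimal sketch of that import step (the shapefile path and output dataset name are placeholders, not the names used for this project):

/* import the unzipped Esri shapefile into a SAS dataset */
proc mapimport datafile="/data/okc/School_Districts.shp"
               out=work.okc_districts;
run;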

Once the shapefile has been successfully imported into SAS, we then must determine the coordinate system of the data.  While WGS84 is common and will work in many situations, it should not be assumed.  The first place to look is at the source, the data provider.  Many Open Data portals will have the coordinate system listed along with the metadata and description of the dataset.  But when using an Esri shapefile, there is an easier way to find what we need.

Locate the directory where you unzipped the original shapefile.  Inside of that directory is a file with a .prj extension.  This file defines the projection and coordinate system used by the shapefile.  Below are the contents of our .prj file with the first parameter highlighted.  We are only interested in this value.  Here, you can see the data has been defined in the Oklahoma State Plane coordinate system -- not in VA’s default WGS84.  So, we must use a custom coordinate system when defining the geography variable.

PROJCS["NAD_1983_StatePlane_Oklahoma_North_FIPS_3501_Feet",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101004]],PRIMEM["Greenwich",0],UNIT["Degree",0.0174532925199433]], PROJECTION["Lambert_Conformal_Conic"],PARAMETER["False_Easting",1968500],PARAMETER["False_Northing",0],PARAMETER["Central_Meridian",-98],PARAMETER["Standard_Parallel_1",35.5666666666667],PARAMETER["Standard_Parallel_2",36.7666666666667],PARAMETER["Scale_Factor",1],PARAMETER["Latitude_Of_Origin",35],UNIT["Foot_US",0.304800609601219]]

Next, we need to look up the Oklahoma State Plane coordinate system to find a definition VA understands.  From the main page of the SpatialReference.org website, type ‘Oklahoma State Plane’ into the search box. Four results are returned.  Compare the results with the string highlighted above.  You can see the third option is what we are looking for: NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.

Selecting the appropriate definition based on the .prj file contents

To get the definitions we need for VA, click the third link for the option NAD 1983 StatePlane Oklahoma North FIPS 3501 Feet.  Here you will see a grey box with a bulleted list of links.  Each of these links represent a definition for the Oklahoma StatePlane coordinate space.

Visual Analytics supports two of the listed formats, EPSG and Proj4.  EPSG stands for European Petroleum Survey Group, an organization that publishes a database of coordinate system and projection information.  The syntax of this format is epsg:<number> or esri:<number>, where <number> is a 4-6 digit code for the desired coordinate system.  In our case, the format we need is the title of the page:

ESRI:102724

The second format supported by VA is Proj4, the third link in the image above.  This format consists of a string of space-delimited name value pairs.  The Oklahoma StatePlane proj4 definition we are interested in is:

+proj=lcc +lat_1=35.56666666666667 +lat_2=36.76666666666667 +lat_0=35 +lon_0=-98 +x_0=600000.0000000001 +y_0=0 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs

Now that we have identified the coordinate system used by our data set and looked up its definition, we are ready to configure VA to use it.

Using a Projected Coordinate System definition in VA

The following section assumes you are familiar with custom region maps and setting up a polygon provider.  If not, see my previous post on that process, Creating custom region maps with SAS Visual Analytics.  The first step in setting up a geography variable for a custom region map is to start with the polygon provider.  At the bottom of the ‘Edit Polygon Provider’ window, there is an ‘Advanced’ section that is collapsed by default.  Expand it to see the Coordinate Space option.  By default, it is populated with the value EPSG:4326, which is the EPSG code for WGS84.  Since our Oklahoma City School District data does not use WGS84, we need to replace this value with the EPSG code that we looked up from SpatialReference.org (ESRI:102724).

Using the same Custom Coordinate definition for Polygon provider and geography variable

Next, we must make sure to configure the geography variable itself with the same coordinate space as the polygon provider.  On the ‘Edit Geography Item’ window, the Coordinate Space option is the last item.  Again, we must change this from the default WGS84 to ESRI:102724.  From the dropdown list, select the option ‘Custom’.  A new entry box appears where we can enter the custom coordinate space definition.  If configured correctly, you should see your map in the preview thumbnail and a 100% mapped indicator.

Congratulations!  The setup was successful.  Now, simply click OK and drag the geography variable to the canvas.  VA’s auto-map feature will recognize it and display the custom region map.

In this post, I showed how to identify the coordinate system of your Esri shapefile data, lookup its epsg and proj4 definitions, and configure VA to use it via the Custom Coordinate space option.  While the focus was on a custom region map, the technique also applies to Custom Coordinate maps, minus the polygon provider setup.  The support of custom coordinate spaces in VA allow the mapping of practically any spatial dataset, giving you a new level of power and flexibility in your mapping efforts.

Essentials of Map Coordinate Systems and Projections in Visual Analytics was published on SAS Users.

May 20, 2019
 

Recently I wrote about how to compute the Kolmogorov D statistic, which is used to determine whether a sample has a particular distribution. One of the beautiful facts about modern computational statistics is that if you can compute a statistic, you can use simulation to estimate the sampling distribution of that statistic. That means that instead of looking up critical values of the D statistic in a table, you can estimate the critical value by using empirical quantiles from the simulation.

This is a wonderfully liberating result! No longer are we statisticians constrained by the entries in a table in the appendix of a textbook. In fact, you could claim that modern computation has essentially killed the standard statistical table.

Obtain critical values by using simulation

Before we compute anything, let's recall a little statistical theory. If you get a headache thinking about null hypotheses and sampling distributions, you might want to skip the next two paragraphs!

When you run a hypothesis test, you compare a statistic (computed from data) to a hypothetical distribution (called the null distribution). If the observed statistic is way out in a tail of the null distribution, you reject the hypothesis that the statistic came from that distribution. In other words, the data does not seem to have the characteristic that you are testing for. Statistical tables use "critical values" to designate when a statistic is in the extreme tail. A critical value is a quantile of the null distribution; if the observed statistic is greater than the critical value, then the statistic is in the tail. (Technically, I've described a one-tailed test.)

One of the uses for simulation is to approximate the sampling distribution of a statistic when the true distribution is not known or is known only asymptotically. You can generate a large number of samples from the null hypothesis and compute the statistic on each sample. The union of the statistics approximates the true sampling distribution (under the null hypothesis) so you can use the quantiles to estimate the critical values of the null distribution.

Critical values of the Kolmogorov D distribution

You can use simulation to estimate the critical value for the Kolmogorov-Smirnov statistical test for normality. For the data in my previous article, the null hypothesis is that the sample data follow a N(59, 5) distribution. The alternative hypothesis is that they do not. The previous article computed a test statistic of D = 0.131 for the data (N = 30). If the null hypothesis is true, is that an unusual value to observe? Let's simulate 40,000 samples of size N = 30 from N(59,5) and compute the D statistic for each. Rather than use PROC UNIVARIATE, which computes dozens of statistics for each sample, you can use the SAS/IML computation from the previous article, which is very fast. The following simulation runs in a fraction of a second.

/* parameters of reference distribution: F = cdf("Normal", x, &mu, &sigma) */
%let mu    = 59;
%let sigma =  5;
%let N     = 30;
%let NumSamples = 40000;
 
proc iml;
call randseed(73);
N = &N;
i = T( 1:N );                           /* ranks */
u = i/N;                                /* ECDF height at right-hand endpoints */
um1 = (i-1)/N;                          /* ECDF height at left-hand endpoints  */
 
y = j(N, &NumSamples, .);               /* columns of Y are samples of size N */
call randgen(y, "Normal", &mu, &sigma); /* fill with random N(mu, sigma)      */
D = j(&NumSamples, 1, .);               /* allocate vector for results        */
 
do k = 1 to ncol(y);                    /* for each sample:                   */
   x = y[,k];                           /*    get sample x ~ N(mu, sigma)     */
   call sort(x);                        /*    sort sample                     */
   F = cdf("Normal", x, &mu, &sigma);   /*    CDF of reference distribution   */
   D[k] = max( F - um1, u - F );        /*    D = max( D_minus, D_plus )      */
end;
 
title "Monte Carlo Estimate of Sampling Distribution of Kolmogorov's D Statistic";
title2 "N = 30; N_MC = &NumSamples";
call histogram(D) other=
     "refline 0.131 / axis=x label='Sample D' labelloc=inside lineattrs=(color=red);";

The test statistic is right smack dab in the middle of the null distribution, so there is no reason to doubt that the sample is distributed as N(59, 5).

How big would the test statistic need to be to be considered extreme? To test the hypothesis at the α significance level, you can compute the 1 – α quantile of the null distribution. The following statements compute the critical value for α = 0.05 and N = 30:

/* estimate critical value as the 1 - alpha quantile */
alpha = 0.05;
call qntl(Dcrit_MC, D, 1-alpha);
print Dcrit_MC;

The estimated critical value for a sample of size 30 is 0.242. This compares favorably with the exact critical value from a statistical table, which gives Dcrit = 0.2417 for N = 30.

You can also use the null distribution to compute a p value for an observed statistic. The p value is estimated as the proportion of statistics in the simulation that exceed the observed value. For example, if you observe data that has a D statistic of 0.28, the estimated p value is obtained by the following statements:

Dobs = 0.28;                        /* hypothetical observed statistic */
pValue = sum(D >= Dobs) / nrow(D);  /* proportion of simulated values that equal or exceed Dobs */
print Dobs pValue;

This same technique works for any sample size, N, although most tables list critical values only for N ≤ 30. For N > 35, you can use the following asymptotic formulas, developed by Smirnov (1948), which depend only on α:
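
One commonly used form of Smirnov's approximation is D_crit(α, N) ≈ sqrt( -ln(α/2) / (2N) ). For α = 0.05 this is approximately 1.358 / sqrt(N), which even for N = 30 gives about 0.248, in good agreement with the exact value of 0.2417 quoted above.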

The Kolmogorov D statistic does not depend on the reference distribution

It is reasonable to assume that the results of this article apply only to a normal reference distribution. However, Kolmogorov proved that the sampling distribution of the D statistic is actually independent of the reference distribution. In other words, the distribution (and critical values) are the same regardless of the continuous reference distribution: beta, exponential, gamma, lognormal, normal, and so forth. That is a surprising result, which explains why there is only one statistical table for the critical values of the Kolmogorov D statistic, as opposed to having different tables for different reference distributions.
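
If you want to verify this numerically, here is a small sketch that repeats the Monte Carlo computation above with an exponential reference distribution; the estimated critical value is essentially unchanged:

proc iml;
call randseed(73);
N = 30;  NumSamples = 40000;
i = T(1:N);   u = i/N;   um1 = (i-1)/N;
y = j(N, NumSamples, .);
call randgen(y, "Exponential");         /* samples drawn from the reference distribution */
D = j(NumSamples, 1, .);
do k = 1 to NumSamples;
   x = y[,k];   call sort(x);
   F = cdf("Exponential", x);           /* CDF of the same reference distribution */
   D[k] = max( F - um1, u - F );
end;
call qntl(Dcrit, D, 0.95);
print Dcrit;                            /* again approximately 0.24 for N = 30, alpha = 0.05 */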

In summary, you can use simulation to estimate the critical values for the Kolmogorov D statistic. In a vectorized language such as SAS/IML, the entire simulation requires only about a dozen statements and runs extremely fast.

The post Critical values of the Kolmogorov-Smirnov test appeared first on The DO Loop.

May 17, 2019
 

As a publishing house inside of SAS, we often hear: “Does anyone want to read books anymore?” — especially from technical programmers who are “too busy” to read. About a quarter of American adults (24%) say they haven’t read a book in whole or in part in the past year, whether in print, electronic or audio form. In addition, leisure reading is at an all-time low in the US. However, we know that as literacy has expanded throughout the world, it has also helped reduce inequalities across and within countries. Over the years many articles have been published about how books will soon become endangered species, but can we let that happen when we know the important role books play in education?

At SAS, curiosity and life-long learning are part of our culture. All employees are encouraged to grow their skill set and never stop learning! While different people do have different preferred learning styles, statistics show that reading is critical to the development of life-long learners, something we agree with at SAS Press:

  • In a study completed at Yale University, researchers studied 3,635 people older than 50 and found that those who read books for 30 minutes daily lived an average of 23 months longer than nonreaders or magazine readers. The study stated that the practice of reading books creates a cognitive engagement that improves a host of different things including vocabulary, cognitive skills, and concentration. Reading can also affect empathy, social perception, and emotional intelligence, which all help people stay on the planet longer.
  • Vocabulary is notoriously resistant to aging, and having a vast one, according to researchers from Spain’s University of Santiago de Compostela, can significantly delay the manifestation of mental decline. When a research team at the university analyzed vocabulary test scores of more than 300 volunteers ages 50 and older, they found that participants with the lowest scores were between three and four times more at risk of cognitive decay than participants with the highest scores.
  • One international study of long-term economic trends among nations found that, along with math and science, “reading performance is strongly and significantly related to economic growth.”

Putting life-long learning into practice

Knowing the importance of reading, not only in adult life-long learning, SAS has been working hard to improve reading proficiency in young learners — which often ties directly to the number of books in the home, the number of times parents read to young learners, and the amount the adults around them read themselves.

High-quality Pre-K lays the foundation for third-grade reading proficiency which is critical to future success in a knowledge-driven economy. — Dr. Jim Goodnight

With all the research pointing to why reading is so important to improving your vocabulary and mental fortitude, it seems only fitting that learning SAS through our example-driven, in-depth books would prove natural.

So to celebrate #endangeredspecies day and help save what some call an “endangered species,” let’s think about:

  • What SAS books have you promised yourself you would read this year?
  • What SAS books will you read to continue your journey as a life-long learner?
  • What book do you think will get you to the next level of your SAS journey?

Let us know in the comments: what SAS book improved your love of SAS and took you on a life-long learning journey?

For almost thirty years SAS Press has published books by SAS users for SAS users. Want to find out more about SAS Press? For more about our books and some more of our SAS Press fun, subscribe to our newsletter. You’ll get all the latest news and exclusive newsletter discounts. Also, check out all our new SAS books at our online bookstore.

Other Resources:
About SAS: Education Outreach
About SAS: Reading Proficiency
Poor reading skills stymie children and the N.C. economy by Dr. Jim Goodnight

Do books count as endangered species? was published on SAS Users.

May 17, 2019
 
Did you know that you can run Lua code within Base SAS? This functionality has been available since the SAS® 9.4M3 (TS1M3) release. With the LUA procedure, you can submit Lua statements from an external Lua script or just submit the Lua statements using SAS code. In this blog, I will discuss what PROC LUA can do as well as show some examples. I will also talk about a package that provides a Lua interface to SAS® Cloud Analytic Services (CAS).

What Is Lua?

Lua is a lightweight, embeddable scripting language. You can use it in many different applications, from gaming to web applications. You might already have written Lua code that you would like to run within SAS, and PROC LUA enables you to do so.
With PROC LUA, you can perform these tasks:

  • run Lua code within a SAS session
  • call most SAS functions within Lua statements
  • call functions that are created using the FCMP procedure within Lua statements
  • submit SAS code from Lua
  • call CAS actions

PROC LUA Examples

Here is a look at the basic syntax for PROC LUA:

proc lua <infile='file-name'> <restart> <terminate>;

Suppose you have a file called my_lua.lua or my_lua.luc that contains Lua statements, and it is in a directory called /local/lua_scripts. You would like to run those Lua statements within a SAS session. You can use PROC LUA along with the INFILE= option and specify the file name that identifies the Lua source file (in this case, it is my_lua). The Lua file name within your directory must contain the .lua or .luc extension, but do not include the extension within the file name for the INFILE= option. A FILENAME statement must be specified with a LUAPATH fileref that points to the location of the Lua file. Then include the Lua file name for the INFILE= option, as shown here:

filename luapath '/local/lua_scripts';
proc lua infile='my_lua';

This example executes the Lua statements contained within the file my_lua.lua or my_lua.luc from the /local/lua_scripts directory.

If there are multiple directories that contain Lua scripts, you can list them all in one FILENAME statement:

filename luapath ('directory1', 'directory2', 'directory3');

The RESTART option resets the state of Lua code submissions for a SAS session. The TERMINATE option stops maintaining the Lua code state in memory and terminates the Lua state when PROC LUA completes.

The syntax above discusses how to run an external Lua script, but you can also run Lua statements directly in SAS code.

Here are a couple of examples that show how to use Lua statements directly inside PROC LUA:

Example 1

   proc lua; 
   submit; 
      local names= {'Mickey', 'Donald', 'Goofy', 'Minnie'} 
      for i,v in ipairs(names) do 
         print(v) 
   end 
   endsubmit; 
   run;

Here is the log output from Example 1:

NOTE: Lua initialized.
Mickey
Donald
Goofy
Minnie
NOTE: PROCEDURE LUA used (Total process time):
      real time           0.38 seconds
      cpu time            0.10 seconds

Example 2

   proc lua;
   submit;
      dirpath=sas.io.assign("c:\\test")
      dir=dirpath:opendir()
      if dir:has("script.txt") then print ("exists")
      else print("doesn't exist")
      end
   endsubmit;
   run;

Example 2 checks to see whether an external file called script.txt exists in the c:\test directory. Notice that two slashes are needed to specify the backslash in the directory path. One backslash would represent an escape character.

All Lua code must be contained between the SUBMIT and ENDSUBMIT statements.

You can also submit SAS code within PROC LUA by calling the SAS.SUBMIT function. The SAS code must be contained within [[ and ]] brackets. Here is an example:

   proc lua; 
   submit;
      sas.submit [[proc print data=sashelp.cars; run; ]]
   endsubmit;
   run;
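
You can also call SAS functions directly from Lua statements, as mentioned in the task list above. Here is a minimal sketch; it assumes that SAS functions are exposed through the sas table (for example, sas.upcase for the SAS UPCASE function):

   proc lua;
   submit;
      -- assumes SAS functions are available as sas.<function-name>
      local s = sas.upcase("hello")   -- call the SAS UPCASE function from Lua
      print(s)                        -- prints HELLO to the SAS log
   endsubmit;
   run;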

Using a Lua Interface with CAS

Available to download is a package called SWAT, which stands for SAS Scripting Wrapper for Analytics Transfer. This is a Lua interface for CAS. After you download this package, you can load data into memory and apply CAS actions to transform, summarize, model, and score your data.

The package can be downloaded from this Downloads page: SAS Lua Client Interface for Viya. After you download the SWAT package, there are some requirements for the client machine to use Lua with CAS:

  1. You must use a 64-bit version of either Lua 5.2 or Lua 5.3 on Linux.

    Note: If your deployment requires newer Lua binaries, visit http://luabinaries.sourceforge.net/.
    Note: Some Linux distributions do not include the required shared library libnuma.so.1. It can be installed with the numactl package supplied by your distribution's package manager.

  2. You must install the third-party package dependencies middleclass (4.0+), csv, and ee5_base64, which are all included with a SAS® Viya® installation.

For more information about configuration, see the Readme file that is included with the SWAT download.

I hope this blog post has helped you understand the possible ways of using Lua with SAS. If you have other SAS issues that you would like me to cover in future blog posts, please comment below.

To learn more about PROC LUA, check out these resources:

Using the Lua programming language within Base SAS® was published on SAS Users.

May 17, 2019
 

In the article Serverless functions and SAS Viya - a good match I discussed using serverless functions to deliver SAS Viya applications. Ignoring all the buzz words, a serverless function boils down to a set of REST APIs. So, if you tried the example you are now a REST API developer 🙂 .

The serverless function allowed the application developer to do the following:

  1. Define what the end user must supply to the function. A good application developer will try to make the request simple and easy to understand.
  2. Return to the end user a response easily consumed by the client's program. Again, a good application developer would make sure the response satisfies most common usage scenarios.
  3. Hide all the details of what it took to satisfy the users request.

This blog discusses using GraphQL to achieve the same goals. First, I will briefly discuss GraphQL, where it fits in with SAS Viya application integration, and how to create GraphQL-based applications. I also provide a series of examples based on real-world scenarios.

The images below display a high level comparison of the approaches between serverless and GraphQL.

serverless and GraphQL process flow

serverless and GraphQL process flow

Steps in the GraphQL flow

  1. A GraphQL server replaces the AWS API Gateway.
  2. The code that runs in the GraphQL server is referred to as "resolvers" - as the name implies, resolvers are used by the GraphQL server to execute user requests.
  3. The resolvers make the necessary REST API calls to the SAS Viya Server.

All of the code in this article resides in the restaf-graphql-demo GitHub repository. If you are not familiar with GraphQL please review the links at the end of this article before proceeding.

Why GraphQL?

Some smart folks at Facebook created GraphQL to solve problems they encountered using standard REST APIs. Companies like Github, Netflix, PayPal, The New York Times and many others are adopting GraphQL.

Some of the key motivators are:

  1. Users define and request what they need, following exact specifications
  2. A convenient way to front existing systems (REST-based or not) and databases with a Developer Experience friendly API
  3. Returning only the requested information reduces the data transferred - important for reducing network traffic
  4. GraphQL is less "chatty" - where a REST API requires multiple trips to the server, GraphQL can accomplish the same task in one round trip

Why GraphQL for SAS Viya application developers?

While the general GraphQL characteristics listed above are important, GraphQL is also a useful technology for developers creating applications integrated with SAS Viya.

  1. GraphQL is a ready-made vehicle for SAS users to deliver their applications as the next generation "stored process" developed with the data step+procedures, CAS Language (CASL) statements, custom CASL actions and SAS REST APIs.
  2. GraphQL is a great way for front-end and back-end developers to communicate.
  3. Developers can code to an agreed contract as specified by the GraphQL schema.
  4. Front-end developers can be confident what they get is exactly what they asked for.

Writing the GraphQL-based applications

The GraphQL queries used in this article are examples for demonstration purposes only and not "standards or strict guidelines" to follow. The code in the GitHub repository and the examples outlined below will help you jump-start your excellent adventure in GraphQL and SAS Viya applications.

The high-level steps for writing an application using GraphQL query are:

SAS Viya Side

  1. SAS programmers, data analysts and data scientists develop their intellectual property with SAS programs written with SAS procedures, CAS actions, the data step and the CASL language.

Server Side

  1. Build the GraphQL schema and define the queries (see this for examples). In relation to SAS Viya, the schema describes the input and output of the SAS programs.
    • Make sure you have discussed this with the UI developers and the SAS programmers
  2. Write the resolvers - GraphQL server will call this code to resolve the requests by the user (see this for examples).
  3. Register both of these with the GraphQL server.

Client Side

  1. You can build the web apps in the normal way with these characteristics:
    • These apps will call a single end point (/graphql) with a POST method.
    • The payload is the GraphQL query
    • The response will match the query and are easily accessible

The image below shows the flow of a GraphQL-based application. User's queries are sent to the GraphQL server. The server parses the queries and calls the appropriate resolver (your code) to obtain the values for the requested fields. In this project the resolvers use restaf to make REST API calls to SAS Viya.

GraphQL-based application process flow

GraphQL-based application process flow

The rest of the blog discusses a few examples. All these examples are available in the repository. I chose to write the examples using JavaScript since it is one of the languages I am familiar with and can write reasonably decent code in. You can develop GraphQL-based SAS Viya applications in all the popular languages of today.

Example 1: Scoring a loan from client app

In this example, a data scientist working for a bank has created a model to score a loan applicant's eligibility. The scientist outlines the following requirements:

  1. The user can only enter the desired loan amount and their current assets. All the other parameters needed for scoring have set values. All the values must be passed to the SAS code as a dictionary named _args_.
  2. Since the scientist wants to run A/B experiments the location and name of the scoring model's astore must be passed in as dictionary named _appEnv_.
  3. The code developed by the data scientist is below. The score returns as a dictionary.

    {score= <value>}

SAS Code

I wrote the SAS program in this example in CASL.

loadactionset "astore";

  /* convert arguments to a cas table */
/* _args_  and _appEnv_ are  generated by caslBase - see caslBase for details */

/* CASL function to convert a dictionary to a cas table  see lib/argsToTable.js for details*/
argsToTable(_args_, 'casuser', 'INPUTDATA' );

/* score */
action astore.score /
    table  = { caslib= 'casuser', name = 'INPUTDATA' } 
    rstore = { caslib= _appEnv_.astore.caslib,  name=_appEnv_.astore.name }
    casout  = { caslib = 'casuser', name = 'OUTPUTDATA' replace= TRUE};

/* fetch results */
action table.fetch r = result /
    table = { caslib = 'casuser', name = 'OUTPUTDATA' } ;

/* extract the score and send it as a dictionary */
score = result.Fetch[1].P_BAD;
scoreo= {score= score};
send_response(scoreo);

Key points to note:

  1. The resolver creates and prepends two CASL dictionaries _args_ and _appEnv_.
  2. The CASL program returns the result using the send_response function.
    • One of the cool things is that CASL allows the programmer to customize the returned value. In this example, the score is extracted into a dictionary.

Schema

Based on the requirement the schema is as shown below:

type Query {
   scoreLoan(amount: Int assets: Int) : Float
}

Key Point:

  1. The two values the user specifies are defined as the filter parameters to the query.

Application

scoreLoan

Key point:

  1. The user enters the two values the data scientist requires.

Client code

async function runScore(amount, assets){
    let payload = {
        query: `query {
            scoreLoan(amount: ${amount} assets: ${assets} )
        }`
    }

    let config = {
        url            : host + '/graphql',
        withCredentials: true,
        method         : 'POST',
        data           : payload
    }

    let r = await axios(config);
    return r.data.data.scoreLoan;
}

Key points:

  1. The payload is the GraphQL query.
  2. I use the POST method.
  3. The end point is /graphql - this is the only endpoint the application will use.
  4. The response is available as r.data.data.scoreLoan
  5. Note the simplicity of the client code to access the GraphQL server and obtain the results.

Resolver

let caslBase = require('../lib/caslBase');

module.exports = async function scoreLoan (_, args, context) {
    let { store } = context;
    let input = {
        JOB    : 'J1',
        CLAGE  : 100, 
        CLNO   : 20, 
        DEBTINC: 20, 
        DELINQ : 2, 
        DEROG  : 0, 
        MORTDUE: 4000, 
        NINQ   : 1,
        YOJ    : 10
    };

    input.LOAN  = args.amount;
    input.VALUE = args.assets;

    let env = {
        astore: {
            caslib: 'Public',
            name  : 'GRADIENT_BOOSTING___BAD_2'
        }
    }
    let result = await caslBase(store,['argsToTable.casl', 'score.casl'], input, env);
    let score = result.items('results', 'score');
    
    return score;

}

Key points:

  1. As required, the default values for the other parameters are added to the user input.
  2. The resolver contains the location and name of the model.
  3. The names of the SAS code are passed to caslBase - this allows the code to read the SAS code from a repository.
  4. The caslBase function calls the jsonToDict to convert the json parameters to CASL dictionary and passes it on to CAS along with the code.
  5. The user receives the resulting score.

Example 2: Reporting wine production to management

The TwoBit winery management wants a simple report to view the production of different wines by year. They want to be able to pick the year range and the wines in which they are interested. The data shown below is for the TwoBit Winery. The goal is to query for selected wines and filter on years.

The data for the winery is listed below.

 
Obs  year  cabernet  merlot  pinot  chardonnay  twobit
  1  2000        10      20     30          40      50
  2  2001         5      10     15           5       0
  3  2002         6       7     11          12      13
  4  2003         5       8      0           0      50
  5  2004        11       5      7           8     100
  6  2005         1       1      0           0    1000
  7  2006         0       0      0           0    3000

 

SAS Code

The SAS experts at the company created the following SAS code to meet management's request. Note that for demo purposes the wine data is created inline.

data wineList;  
 input year cabernet merlot pinot chardonnay twobit ;  
 cards;  
 2000 10 20 30 40 50   
 2001 5 10 15 5 0  
 2002 6 7 11 12 13  
 2003 5 8 0 0 50 
 2004 11 5 7 8 100  
 2005 1  1 0 0 1000  
 2006 0 0 0 0 3000  
;;;; 
run;  
/* _selections_ macro variable was generated in the src/lib/getSelections function */
data wine ;  
    set winelist( where= (year GE &from && year LE &to)); 
    keep &_selections_; 
    run;  
ods html style=barrettsblue;  
    proc print data=wine;run;  
ods html close;run ;

Key points to note:

  1. The code requires macro variables &from, &to and &_selections_ be set before this code executes.
  2. The name of the returned table is wine.

Schema

type Query {
    wineProduction(from: Int, to: Int): WineProduction
}

type WineProduction {
    """
    An array containing wine production
    """
    wines: [WineList]

    """
    ODS output and Log output
    """
    report: SASResults
}

type WineList {
    year      : Int
    cabernet  : Int
    merlot    : Int
    pinot     : Int
    chardonnay: Int
    twobit    : Int
}

type WineProductionCas {
    wines: [WineList]
}

type SASResults {
    """
    ODS output from the server
    """
    ods: String
    """
    Log output from the server
    """
    log: String
}

Key points:

  1. As required, the year range is specified through the from and to filter arguments on the query.
  2. As required, the user can pick the wines in which they are interested by selecting the corresponding fields (an example query is shown after this list).
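
For example, a query that requests only cabernet and merlot production for 2001 through 2004, plus the SAS log, could look like this (a hand-written illustration, not code from the repository):

{
    wineProduction(from: 2001, to: 2004) {
        wines {
            year
            cabernet
            merlot
        }
        report {
            log
        }
    }
}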

Application

The application is shown below.

Client code

The relevant client code is shown below (see the repository for the full program).

let gqString = `query userQuery($from: Int, $to: Int) {
    results: wineProduction(from: $from, to: $to) {
        wines {
            ${wineList}
        }
        ${reportList}
    }
}`;

let payload = {
    url   : host + '/graphql',
    method: 'POST',
    data  : {
        query    : gqString,
        variables: {
            from: fromYear.value,
            to  : toYear.value
        }
    }
};

setReportValues(null);
setResultValues(null);

axios(payload)
    .then(r => {
        let res = r.data.data.results;
        // Simple to extract the results
        setResultValues(res.wines);
        if (res.report != null) {
            setReportValues(res.report);
        }
    })
    .catch(e => alert(e));

Key points:

  1. The GraphQL query string is sent as the payload (wineList and reportList are strings computed earlier in the program from the user's selections; a sketch of how they might be built follows this list).
  2. The endpoint is again /graphql with a POST method.
  3. This snippet also shows the preferred way to send filter values: as GraphQL variables rather than values interpolated into the query string.
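
Conceptually, the two strings could be assembled from the user's selections along these lines. The selections array and the wantReport flag are assumptions for illustration, not the variables used in the actual program.

// e.g. selections = ['cabernet', 'merlot'] gathered from the UI checkboxes
let wineList   = ['year', ...selections].join(' ');
// request the SASResults fields only when the user wants the report
let reportList = wantReport ? 'report { ods log }' : '';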

Resolver

The root resolver is shown below.

let getProgram    = require('../lib/getProgram');
let getSelections = require('../lib/getSelections');
let spBase        = require('../lib/spBase');

module.exports = async function wineProduction (_, args, context, info){
    let {store} = context;

    // read source - reads in the sas program
    let src = await getProgram(store, ['wines.sas']); 

    // update args with the wine list specified by the user
    let selections = getSelections(info, 'wines', args);

   // execute the sas code with compute server and get results
    let resultSummary = await spBase(store, selections.args, src);
    
    // resultSummary is now passed to the resolvers for wines and results fields.
    return resultSummary;
}

Key points:

  1. Code from the GitHub repo uses winelist.js to resolve the list of wines.
  2. Code from sasresults.js, sasOds.js and sasLog.js returns the ODS output and the SAS log.
  3. The SAS code is read from a repository using the getProgram function.
Example 3: List SAS Visual Analytics reports
Another common use case is retrieving information about reports developed with SAS Visual Analytics. The GraphQL query below returns the list of reports, who last modified each one, and when. This example uses the reports REST API.

Query

{
    reports {
        name
        modifiedBy
        modifiedOn
   }
}

Creating a UI for this is left as an exercise for the reader (meaning I did not get around to writing it 🙂). The returned results look something like this:

{
    "data": {
        "reports": [
            {
                "name": "Application Activity",
                "modifiedBy": "SAS Supplied",
                "modifiedOn": "2018-04-20T14:24:05.258Z"
            },
            {
                "name": "CAS Activity",
                "modifiedBy": "SAS Supplied",
                "modifiedOn": "2018-06-08T20:21:14.727Z"
            }
...

Resolver

module.exports = async function reports (_, args, context) {
    let { store } = context;
    // locate the reports service and retrieve the list
    let reports = store.getService('reports');
    let list = await getList(store, reports);
    return list;
}

async function getList(store, reports) {
    let reportsList = await store.apiCall(reports.links('reports'));
    if (reportsList.itemsList().size === 0) {
        return [];
    }
    // map each report to the fields defined in the schema
    let r = reportsList.itemsList().map(name => {
        let t = {
            name      : name,
            modifiedBy: reportsList.items(name, 'data', 'modifiedBy'),
            modifiedOn: reportsList.items(name, 'data', 'modifiedTimeStamp')
        };
        return t;
    });
    return r;
}
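
The schema behind this query is not shown in the post; a definition consistent with the query and resolver above might look like this (the ReportInfo type name and the String field types are assumptions):

type Query {
    reports: [ReportInfo]
}

type ReportInfo {
    name      : String
    modifiedBy: String
    modifiedOn: String
}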

Example 4: Getting the URL and image of a specific report
The query below obtains the URL used to display the interactive report, along with the SVG image, for a specific report.

Query

{
    report(name: "Application Activity") {
        url
        image
    }
}

The returned value will be along these lines:

{
  "data": {
    "report": {
      "url": "http://superuser.com/?reportUri=/reports/reports/ecec39ad-994f-4055-8e40-4360f410bc6e...",
      "image": "{the svg of the image}"
    }
  }
}
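
As before, a schema definition consistent with this query might look like the following (the ReportDetail type name and the String field types are assumptions; the image field carries the SVG markup as text):

type Query {
    report(name: String): ReportDetail
}

type ReportDetail {
    url  : String
    image: String
}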

Resolver

There are three resolvers associated with this query: the root resolver and the resolvers for the image and url fields. For the sake of brevity, I will not review them here; please visit the code in the repository.

In conclusion

The examples above cover some basic scenarios for SAS Viya applications.

  1. Using CAS actions
  2. Using traditional data step and procs
  3. Obtaining ODS output
  4. Working with SAS Visual Analytics

The simplicity of the client code and of the resolvers is what makes GraphQL attractive for writing SAS Viya applications. You can also exploit other features in SAS Viya using the same pattern. Further, you can use the examples in the repository as a starting point for your own use cases; the resolvers and helper functions are written to be reusable with minimal effort. The instructions are in the README file in the repository. If you create interesting schemas and resolvers for SAS Viya, please share them with the SAS user community.

Opinion

Like all new technologies, GraphQL has its proponents and detractors. Many people also get caught up in low-value arguments about whether GraphQL is better or worse than REST. I do not follow these discussions closely, since you should use the best tool for the job.

I find GraphQL most attractive when developing a back end for SAS Viya applications. Both front-end and back-end developers benefit from the clear definition of the schema. Having well-supported GraphQL servers from Apollo and Facebook also makes GraphQL easier to adopt.

Useful links

There is a growing number of resources from which to learn and borrow. Below is a small starter list.

  1. graphql.org
  2. Apollo
  3. Relay
  4. GraphQL Concepts Visualized by Dhaivat Pandya
  5. GraphQL tutorial from TutorialsPoint
  6. How to GraphQL

GraphQL and SAS Viya applications - a good match was published on SAS Users.

5月 172019
 

Competition in customer experience management has never been as challenging as it is now. Customers spend more money in aggregate, but less per brand. The average size of a single purchase has decreased, partly because competitive offers are just one click away. Predicting offer relevance to potential (and existing) customers plays a [...]

SAS Customer Intelligence 360: Factorization machines, visual analytics, and personalized marketing [Part 2] was published on Customer Intelligence Blog.