September 28, 2019
 

This article continues a series that began with Machine learning with SASPy: Exploring and preparing your data (part 1). Part 1 showed you how to explore data using SASPy with Python. Here, in part 2, you will learn how to begin to prepare your data to use it within a machine-learning model.

Review part 1 if needed and ensure you still have the ADULT data set ready to use. (The data set is available from the UCI Machine Learning Repository.) If not, take some time to download and explore the data again, as described in part 1.

Preparing your data

Preparing data is a necessary step before applying the data to a model. There are string values, skewed data, and missing data points to consider. Be sure to handle missing values in the data set first, so that you can move on to the other methods.

For this exercise, you will explore how to transform skewed features using SASPy and Pandas.

First, you must separate the income data from the data set, because the income feature will later become your target variable to model.

Drop the income data and turn the pandas data frame back into a SAS data object, with the following code:
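A minimal sketch of this step follows, assuming the pandas data frame from part 1 is named adult, the income column is named income, and the SASPy session object is named sas (all three names are assumptions for illustration):

# Separate the target variable and drop it from the features
income_raw = adult['income']
features_raw = adult.drop('income', axis=1)

# Turn the pandas data frame back into a SAS data object
adult_data = sas.df2sd(features_raw, table='adult_features')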

Now, let's take a second look at the numerical features. You will use SASPy to create a histogram of all numerical features. Typically, the Matplotlib library is used, but SASPy provides great opportunities to visualize the data.
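A sketch of this step, assuming the SAS data object created above is named adult_data; the column names below follow the ADULT data set, but the exact names depend on how you imported the data:

# Plot a histogram for each numeric feature with SASPy
numeric_features = ['AGE', 'FNLWGT', 'EDUCATION_NUM', 'CAPITAL_GAIN',
                    'CAPITAL_LOSS', 'HOURS_PER_WEEK']
for col in numeric_features:
    adult_data.hist(col)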

The following graphs represent the expected output.

Taking a look at the numerical features, two values stick out. CAPITAL_GAIN and CAPITAL_LOSS are highly skewed. Highly skewed features can affect your model, because many models assume the data are approximately normally distributed. To fix this, you will apply a logarithmic transformation using pandas and then visualize the change using SASPy.

Transforming skewed features

First, you need to change the SAS data object back into a pandas data frame and assign the skewed features to a list variable:
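A sketch of this step, continuing with the same assumed names:

# Change the SAS data object back into a pandas data frame
df = sas.sd2df('adult_features')

# Assign the skewed features to a list variable
skewed = ['CAPITAL_GAIN', 'CAPITAL_LOSS']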

Then, use pandas to apply the logarithmic transformation and convert the pandas data frame back into a SAS data object:
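For example (a sketch; log(x + 1) is used so that zero values remain valid):

import numpy as np

# Apply the logarithmic transformation to the skewed features
df[skewed] = df[skewed].apply(lambda x: np.log(x + 1))

# Convert the pandas data frame back into a SAS data object
adult_log = sas.df2sd(df, table='adult_log')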

Display transformed data

Now, you are ready to visualize these changes using SASPy. In the previous section, you used histograms to display the data. To display this transformation, you will use the SASPy SASUTIL class. Specifically, you will use a procedure typically used in SAS, the UNIVARIATE procedure.

To use the SASUTIL class with SASPy, you first need to create a Python object that uses the SASUTIL class:
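For example:

# Create a Python object that uses the SASUTIL class
util = sas.sasutil()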

 

Now, use the univariate function from SASPy:
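A sketch of the call, assuming the transformed SAS data object is named adult_log; the keyword arguments mirror UNIVARIATE statements, so the exact names may vary by SASPy version:

# Run the UNIVARIATE procedure on the transformed features
uni = util.univariate(data=adult_log, var='CAPITAL_GAIN CAPITAL_LOSS')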

 

Using the UNIVARIATE procedure, you can set axis limits to the output histograms so that you can see the data in a clearer format. After running the selected code, you can use the dir() function to verify successful submission:
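For example (attribute names are illustrative):

# Verify successful submission by listing the members of the results object
dir(uni)
# Individual results can then be displayed, for example: uni.HISTOGRAM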

 

Here is the output:

 

 

 

The function calculates various descriptive statistics and plots. However, for this example, the focus is on the histogram.

 

Here are the results:

Wrapping up

You have now transformed the skewed data. Pandas applied the logarithmic transformation and SASPy displayed the histograms.

Up next

In the next and final article of this series, you will continue preparing your data by normalizing numerical features and one-hot encoding categorical features.

Machine learning with SASPy: Exploring and preparing your data (part 2) was published on SAS Users.

September 26, 2019
 

In part 1 of this post, we looked at setting up Spark jobs from SAS Cloud Analytic Services (CAS) to load and save data to and from Hadoop. Now we are moving on to the next step in the analytic cycle, scoring data in Hadoop and executing SAS code as a Spark job. The Spark scoring jobs execute using SAS In-Database Technologies for Hadoop.

The integration of the SAS Embedded Process and Hadoop allows scoring code to run directly on Hadoop. As a result, publishing and scoring of both DS2 and DATA step models occur inside Hadoop. Furthermore, access to Spark data exists through the SAS Workspace Server or the SAS Compute Server using SAS/ACCESS to Hadoop, or from the CAS server using SAS Data Connectors.

Scoring Data from CAS using Spark

SAS PROC SCOREACCEL provides an interface to the CAS server for DS2 and DATA step model publishing and scoring. Model code is published from CAS to Spark and then executed via the SAS Embedded Process.

PROC SCOREACCEL supports a file interface for passing the model components (model program, format XML, and analytic stores). The procedure reads the specified files and passes their contents on to the model-publishing CAS action. In this case, the files must be visible from the SAS client.

The CAS publishModel and runModel actions publish the model and execute scoring in Spark:

 
%let CLUSTER="/opt/sas/viya/config/data/hadoop/lib:
  /opt/sas/viya/config/data/hadoop/lib/spark:/opt/sas/viya/config/data/hadoop/conf";
proc scoreaccel sessref=mysess1;
   publishmodel
      target=hadoop
      modelname="simple01"
      modeltype=DS2
      /* filelocation=local */
      programfile="/demo/code/simple.ds2"
      username="cas"
      modeldir="/user/cas"
      classpath=&CLUSTER.
   ;
   runmodel
      target=hadoop
      modelname="simple01"
      username="cas"
      modeldir="/user/cas"
      server="hadoop.server.com"
      intable="simple01_scoredata"
      outtable="simple01_outdata"
      forceoverwrite=yes
      classpath=&CLUSTER.
      platform=SPARK
   ;
quit;

In the PROC SCOREACCEL example above, a DS2 model is published to Hadoop and executed with the Spark processing engine. The CLASSPATH parameter specifies the path to the Hadoop cluster configuration and library files. The input and output tables, simple01_scoredata and simple01_outdata, already exist on the Hadoop cluster.

Score data in Spark from CAS using SAS Scoring Accelerator

SAS Scoring Accelerator execution status in YARN as a Spark job from CAS

As you can see in the image above, the model is scored in Spark using the SAS Scoring Accelerator. The Spark job name reflects the input and output tables.

Scoring Data from MVA SAS using Spark

Steps to run a scoring model in Hadoop:

  1. Create a traditional scoring model using SAS Enterprise Miner or an analytic store scoring model, generated using SAS Factory Miner HPFOREST or HPSVM components.
  2. Specify the Hadoop connection attributes: %let indconn= user=myuserid;
  3. Use the INDCONN macro variable to provide credentials to connect to the Hadoop HDFS and Spark. Assign the INDCONN macro before running the %INDHD_PUBLISH_MODEL and the %INDHD_RUN_MODEL macros.
  4. Run the %INDHD_PUBLISH_MODEL macro.
  5. With traditional model scoring, the %INDHD_PUBLISH_MODEL macro performs multiple tasks using some of the files created by the SAS Enterprise Miner Score Code Export node. Using the scoring model program (the score.sas file), the properties file (the score.xml file), and (if the training data includes SAS user-defined formats) a format catalog, the macro performs the following tasks:

    • translates the scoring model into the sasscore_modelname.ds2 file, which runs scoring inside the SAS Embedded Process
    • takes the format catalog, if available, and produces the sasscore_modelname.xml file. This file contains user-defined formats for the published scoring model.
    • uses SAS/ACCESS Interface to Hadoop to copy the sasscore_modelname.ds2 and sasscore_modelname.xml scoring files to HDFS
  6. Run the %INDHD_RUN_MODEL macro.

The %INDHD_RUN_MODEL macro initiates a Spark job that uses the files generated by the %INDHD_PUBLISH_MODEL to execute the DS2 program. The Spark job stores the DS2 program output in the HDFS location specified by either the OUTPUTDATADIR= argument or by the element in the HDMD file.

Here is an example:

 
option set=SAS_HADOOP_CONFIG_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/conf";
option set=SAS_HADOOP_JAR_PATH="/opt/sas9.4/Config/Lev1/HadoopServer/lib:/opt/sas9.4/Config/Lev1/HadoopServer/lib/spark";
 
%let scorename=m6sccode;
%let scoredir=/opt/code/score;
option sastrace=',,,d' sastraceloc=saslog;
option set=HADOOPPLATFORM=SPARK;
 
%let indconn = %str(USER=hive HIVE_SERVER='hadoop.server.com');
%put &indconn;
%INDHD_PUBLISH_MODEL( dir=&scoredir., 
        datastep=&scorename..sas,
        xml=&scorename..xml,
        modeldir=/sasmodels,
        modelname=m6score,
        action=replace);
 
%INDHD_RUN_MODEL(inputtable=sampledata, 
        outputtable=sampledata9score, 
        scorepgm=/sasmodels/m6score/m6score.ds2, 
        trace=yes, 
        platform=spark);
Score data in Spark from MVA SAS using Spark

SAS Scoring Accelerator execution status in YARN as a Spark job from MVA SAS

To execute the job in Spark, either set the HADOOPPLATFORM= option to SPARK or set PLATFORM= to SPARK inside the INDHD_RUN_MODEL macro. The SAS Scoring Accelerator uses SAS Embedded Process to execute the Spark job with the job name containing the input table and output table.

Executing user-written DS2 code using Spark

User-written DS2 programs can be complex. When running inside a database, a code accelerator execution plan might require multiple phases. Because of Scala program generation that integrates with the SAS Embedded Process program interface to Spark, the many phases of a Code Accelerator job reduce to a single Spark job.

In-Database Code Accelerator

The SAS In-Database Code Accelerator on Spark is a combination of generated Scala programs, Spark SQL statements, HDFS file access, and DS2 programs. The SAS In-Database Code Accelerator for Hadoop enables the publishing of user-written DS2 thread or data programs to Spark, where they execute in parallel and exploit Spark’s massively parallel processing. Examples of DS2 thread programs include large transpositions, computationally complex programs, scoring models, and BY-group processing.

Below is a table of required DS2 options to execute as a Spark job.

Option            Required value
DS2ACCEL          YES
HADOOPPLATFORM    SPARK

 

There are six different ways to run the code accelerator inside Spark, called CASES. The generation of the Scala program by the SAS Embedded Process Client Interface depends on how the DS2 program is written. In the following example, we are looking at Case 2, which is a thread and a data program, neither of them with a BY statement:

 
proc ds2 ds2accel=yes;
   thread work.workthread / overwrite=yes;
      method run();
         set hdplib.cars;
         output;
      end;
   endthread;
   run;

   data hdplib.carsout (overwrite=yes);
      dcl thread work.workthread m;
      dcl double count;
      keep count make model;
      method run();
         set from m;
         count+1;
         output;
      end;
   enddata;
   run;
quit;

The entire DS2 program runs in two phases. The DS2 thread program runs during Phase One, and its tasks execute in parallel. The DS2 data program runs during Phase Two using a single task.

Finally

With SAS Scoring Accelerator and Spark integration, users have the power and flexibility to process and score modeled data in Spark. SAS Code Accelerator and Spark integration takes the flexibility further to process any Spark data using DS2 code. Furthermore, it is now possible for businesses to respond to use cases immediately and with higher reliability in the big data space.

Data and Analytics Innovation using SAS & Spark - part 2 was published on SAS Users.

September 26, 2019
 

Mirror, mirror on the wall, whose conference presentations are the best of all?

Ok, well it doesn’t quite go that way in the fairy tale, but remakes and reimagining of classic tales have been plentiful in books (see The Shadow Queen), on the big screen (see Maleficent, which is about to get a sequel), on the little screen (see the seven seasons of Once upon a Time) and even on stage and screen (see Into the Woods). So, why not take some liberties in the service of analytics?

For this blog, I have turned our analytics mirror inward and gazed at the social media messages from four SAS conferences: SAS Global Forum 2018 in Denver, Analytics Experience 2018 in San Diego, Analytics Experience 2018 in Milan, and the 2019 Analyst Conference in Naples. While simply counting retweets could provide insight into what was popular, I wanted to look deeper to answer the question: What SAS conference presenters were most praised in social media and how? Information extraction, specifically fact extraction, could help with answering those questions.

Data preparation

Once upon a time, in a land far far away, there was a collection of social media messages, mostly Tweets, that the SAS social media department was kind enough to provide. I didn’t do much in terms of data preparation. I was only interested in unique messages, so I used Excel to remove duplicates based on the “Message” column.

Additionally, I kept only messages for which the language was listed as English, using the “language” column that was already provided in the data. SAS Text Analytics products support 33 languages, but for the purposes of this investigation I chose to focus on English only because the presentations were in English. Then, I imported this data, which was about 4,400 messages, into SAS Visual Text Analytics to explore it and create an information extraction model.

While exploring the data, I noticed that most of the tweets were in fact positive. Additionally, negation, such as “not great” for example, was generally absent. I took this finding into consideration while building my information extraction model: the rules did not have to account for negation, which made for a simpler model. No conniving sorcerer to battle in this tale!

Information extraction model

The magic wand here was SAS Visual Text Analytics. I created a rather simple concepts model with a top-level concept named posPerson, which was extracting pairs of mentions of presenters and positive words occurring within two sentences of the mentions of presenters. The model included several supporting concepts, as shown in this screenshot from SAS Visual Text Analytics concepts node.

Before I explain a little bit about each of the concepts, it is useful to understand how they are related together in the hierarchy represented in the following diagram. The lower-level concepts in the diagram are referenced in the rules of the higher-level ones.

Extending predefined concepts

The magic wand already came with predefined concepts such as nlpPerson and nlpOrganization (thanks, fairy godmother, ahem, SAS linguists). These concepts are included with Visual Text Analytics out of the box and allow users to tap into the knowledge of the SAS linguists for identifying person and organization names. Because Twitter handles, such as @oschabenberger and @randyguard, are not included in these predefined concepts, I expanded the predefined concepts with custom ones. The custom concepts for persons and organizations, customPerson and customOrg, referenced matches from the predefined concepts in addition to rules for combining the symbol @ from the atSymbol concept and various Twitter handles known to belong to persons and organizations, respectively. Here is the simple rule in the atSymbol concept that helps to accomplish this task:

CLASSIFIER:@ 

The screenshot below shows how the atSymbol concept and the personHandle concept are referenced together in the customPerson concept rule and produce matches, such as @RobertoVerganti and @mabel_pooe. Note also how the nlpPerson concept is referenced to produce matches, such as Oliver Schabenberger and Mary Beth Moore, in the same customPerson concept.
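A sketch of what those customPerson rules can look like, based on the description above (LITI CONCEPT rules reference other concepts by name; these illustrative rules are not reproduced from the actual model):

CONCEPT:nlpPerson
CONCEPT:atSymbol personHandle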

If you are interested to learn more about information extraction rules like the ones used in this blog, check out the book SAS Text Analytics for Business Applications: Concept Rules for Information Extraction Models, which my colleagues Teresa Jade and Michael Wallis co-authored with me. It’s a helpful guide for using your own magic wand for information extraction!

Exploratory use of the Sandbox

Visual Text Analytics also comes with its own crystal ball: the Sandbox feature. In the Sandbox, I refined the concept rules iteratively and was able to run the rules for each concept faster than running the entire model. Gazing into this crystal ball, I could quickly see how rule changes for one concept impacted matches.

In an exploratory step, I made the rules in the personHandle concept as general as possible, using part of speech tags such as :N (noun) and :PN (proper noun) in the definitions. As I explored the matches to those rules, I was able to identify matches that were actually organization handles, which I then added as CLASSIFIER rules to the orgHandle concept by double-clicking on a word or phrase and right-clicking to add that string to a rule.

I noticed that some handles were very similar to each other and REGEX rules more efficiently captured the possible combinations. Consult the book referenced above if you’re interested in understanding more about different rule types and how to use them effectively. After moving the rules to the Edit Concept tab, the rules for orgHandle included some like the ones shown below.
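For illustration, the orgHandle rules can combine CLASSIFIER and REGEX definitions along these lines (hypothetical examples, not the actual rules):

CLASSIFIER:@SASsoftware
REGEX:@SAS[A-Za-z0-9_]+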

Automatic concept rule generation

Turning now to the second part of the original question, which was what words and phrases people used to praise the presenters, the answers came from two custom concepts: posAdj and posPhrase. The posAdj concept had rules that captured adjectives with positive sentiment, such as the following:
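For illustration, CLASSIFIER rules along these lines capture positive adjectives (a hypothetical subset based on terms that appear later in the results; the actual rules, including the automatically generated ones discussed below, differed):

CLASSIFIER:great
CLASSIFIER:inspiring
CLASSIFIER:exciting
CLASSIFIER:important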

Most of these were captured from the text of the messages in the same manner as the person and organization Twitter handles.

But, the first two were created automatically by way of black magic! When I selected a term from the Textual Elements, as you can see below for the term “great”, the system automatically created the first rule in the concept above, including also the comparative form, “greater,” and the superlative, “greatest.” This is black magic harnessing the power of stemming or lemmatization.

The concept posPhrase built onto the posAdj concept by capturing the nouns that typically follow the adjectives in the first concept as well as a few other strings that have a positive connotation.

Filtering with global rules

Because the rules created overlapping matches, I took advantage of a globalRule concept, which allowed me to distinguish between the poisoned apples and the edible ones. Global rules served the following purposes:

  1. to remove matches from the more generally defined customPerson concept that were also matched for the customOrg concept
  2. to remove matches from the posAdj concept (such as “good”) that were also matched in the posPhrase concept (such as “good talk”)
  3. to remove false positives

As an example of a false positive, consider the following rule:

REMOVE_ITEM:(ALIGNED, "Data for _c{posAdj}", "Data for Good") 

Because the phrase “Data for Good” is a name of a program, the word “good” should not be taken into consideration in evaluating the positive mention. Therefore, the REMOVE_ITEM rule stated that when the posAdj concept match “good” is part of the phrase “Data for Good,” it should be removed from the posAdj concept matches.

Automatic fact rule generation

The top-most concept in the model, posPerson, took advantage of a magic potion called automatic fact rule building, which is another new feature added to the Visual Text Analytics product in the spring of 2019. This feature was used to put together matches from the posAdj and posPhrase concepts with matches from the customPerson concept without constructing the rule myself. It is a very useful feature for newer rule writers who want to explore the use of fact rules.

As input into the cauldron to make this magic potion, I selected the posAdj and customPerson concepts. These are the concepts I wanted the system to relate as facts.

I ran the node and inspected the autogenerated magic potion, i.e. the fact rule.

Then I did the same thing with the posPhrase and customPerson concepts. Each of the two rules that were created by Visual Text Analytics contained the SENT operator.

But I wanted to expand the context of the related concepts and tweaked the recipe a bit by replacing SENT with SENT_2 in order to look for matches within two sentences instead of one. I also replaced the names of the arguments, which the rule generation algorithm called concept1 and concept2, with ones that were more relevant to the task at hand, person and pos. Thus, the following rules were created:

PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posAdj}")
PREDICATE_RULE:(person, pos):(SENT_2, "_person{customPerson}", "_pos{posPhrase}")

Results

So, what did the magic mirror show? Out of the 4,400 messages, I detected a reference to a person in about 1,650 (37%). In nearly 600 of the messages (14%) I extracted a positive phrase and in over 300 (7%) at least one positive adjective. Finally, only 7% (321) of the messages contained both a reference to a person and a positive comment within two sentences of each other.

I changed all but the posPerson and globalRule concepts to “supporting” so they don’t produce results and I can focus only on the relevant results. This step was akin to adjusting the mirror to focus only on the most important things and tuning out the background. You can learn more about this and other SAS Visual Text Analytics features in the User Guide.

Switching from the interactive view to the results view of the concepts node, I viewed the transactional output table.

With one click, I exported and opened this table in Visual Analytics in order to answer the questions of which presenters were mentioned most often and in the context of which positive words or phrases.

Visualization

With all of the magic items and preparation out of the way, I was ready to build a sparkly palace for my findings; that is, a report in Visual Analytics. On the left, I added a treemap of the most common matches for the person argument. On the right, I added a word cloud with the most common matches for the pos argument and connected it with the treemap on the left. In both cases I excluded missing values in order to focus on the extracted information. With my trees and clouds in place, I turned to the bottom of the report. I added and connected a list table with the message, which was the entire input text, and keywords, which included the span of text from the match for the first argument to the match for the last argument, for an easy reference to the context for the above visualizations.

Based on the visualization on the left, the person with the most positive social media messages was SAS Chief Operating Officer (COO), Dr. Oliver Schabenberger, who accounted for 12% of the messages that contained both a person and a positive comment. His lead was followed by the featured presenters at the Milan conference, Roberto Verganti, Anders Indset and Giles Hutchins. Next most represented were the featured presenters at the San Diego conference, Robyn Benincasa and Dr. Tricia Wang.

Looking at the visualization on the right, some of the most common phrases expressing praise for all the presenters were “important,” “well done,” “great event,” and “exciting.” Quite a few phrases also contain the term “inspiring,” such as “inspiring videos,” “inspiring keynote,” “inspiring talk,” “inspiring speech,” etc.

Because of the connections that I set up in Visual Analytics between these visualizations, if I want to look at what positive phrases were most commonly associated with each presenter, I can click on their name in the treemap on the left; as a result, the word cloud on the right as well as the list table on the bottom will filter out data from other presenters. For example, the view for Oliver Schabenberger shows that the most common positive phrase associated with tweets about him was “great discussion.”

Conclusions

It is not surprising that the highest accolades in this experiment went to SAS’ COO since he participated in all four conferences and therefore had four times the opportunity to garner positive messages. Similarly, the featured presenters probably had larger audiences than breakout-session speakers, allowing these presenters more opportunities to be mentioned in social media messages. In this case, the reflection in the mirror is not too surprising. And they all lived happily ever after.

What tale does your social media data tell?

Learn more

A data fairy tale: Which speakers are the best received at conferences? was published on SAS Users.

September 25, 2019
 

One of the strengths of the SAS/IML language is its flexibility. Recently, a SAS programmer asked how to generalize a program in a previous article. The original program solved one optimization problem. The reader said that she wants to solve this type of problem 300 times, each time using a different set of parameters. Essentially, she wants to loop over a set of parameter values, solve the corresponding optimization problem, and save each solution to a data set.

Yes, you can do this in SAS/IML, and the technique is not limited to solving a series of optimization problems. Any time you have a parameterized family of problems, you can implement this idea. First, you figure out how to solve the problem for one set of parameters, then you can loop over many sets of parameters and solve each problem in turn.

Solve the problem one time

As I say in my article, "Ten tips before you run an optimization," you should always develop and debug solving ONE problem before you attempt to solve MANY problems. This section describes the problem and solves it for one set of parameters.

For this article, the goal is to find the values (x1, x2) that maximize a function of two variables:
F(x1, x2; a, b) = 1 – (x1 – a)^2 – (x2 – b)^2 + exp(–(x1^2 + x2^2)/2)
The function has two parameters, (a, b), which are not involved in the optimization. The function is a sum of two terms: a quadratic function and an exponentially decreasing function that looks like the bivariate normal density.

Because the exponential term rapidly approaches zero when (x1, x2) moves away from the origin, this function is a perturbation of a quadratic function. The maximum will occur somewhat close to the value (x1, x2) = (a, b), which is the value for which the quadratic term is maximal. The following graph shows a contour plot for the function when (a, b) = (-1, 2). The maximum value occurs at approximately (x1, x2) = (-0.95, 1.9).

The following program defines the function in the SAS/IML language. In the definition, the parameters (a, b) are put in the GLOBAL statement because they are constant during each optimization. The SAS/IML language provides several nonlinear optimization routines. This program uses the Newton-Raphson optimizer (the NLPNRA subroutine) to solve the problem for (a, b) = (-1, 2).

proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
a = -1; b = 2;      /* set GLOBAL parameters */
/* test functions for one set of parameters */
opt = {1,           /* find maximum of function   */
       2};          /* print a little bit of output */
x0 = {0 0};         /* initial guess for solution */
call nlpnra(rc, result, "Func", x0, opt);    /* find maximal solution */
print result[c={"x1" "x2"}];

As claimed earlier, when (a, b) = (-1, 2), the maximum value of the function occurs at approximately (x1, x2) = (-0.95, 1.9).

Solve the problem for many parameters

After you have successfully solved the problem for one set of parameters, you can iterate over a sequence of parameters. The following DATA step specifies five sets of parameters, but it could specify 300 or 3,000. These parameters are read into a SAS/IML matrix. When you solve a problem many times in a loop, it is a good idea to suppress any output. The SAS/IML program suppresses the tables and iteration history for each optimization step and saves the solution and the convergence status:

/* define a set of parameter values */
data Params;
input a b;
datalines;
-1  2
 0  1
 0  4
 1  0
 3  2
 ;
 
proc iml;
start Func(x) global (a, b);
   return 1 - (x[,1] - a)##2 - (x[,2] - b)##2 +
            exp(-0.5*(x[,1]##2 + x[,2]##2));
finish;
 
/* read parameters into a matrix */
varNames = {'a' 'b'}; 
use Params;  read all var varNames into Parms;  close;
 
/* loop over parameters and solve problem */
opt = {1,          /* find maximum of function   */
       0};         /* no print */
Soln = j(nrow(Parms), 2, .);
returnCode = j(nrow(Parms), 1, .);
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1];  b = Parms[i, 2]; 
   /* For now, use same guess for every parameter. */
   x0 = {0 0};        /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;   /* save convergence status and solution vector */
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The output shows the parameter values (a,b), the status of the optimization, and the optimal solution for (x1, x2). The third column (returnCode) has the value 3 or 6 for these optimizations. You can look up the exact meaning of the return codes, but the main thing to remember is that a positive return code indicates a successful optimization. A negative return code indicates that the optimization terminated without finding an optimal solution. For this example, all five problems were solved successfully.

Choosing an initial guess based on the parameters

If the problem is not solved successfully for a certain parameter value, it might be that the initial guess was not very good. It is often possible to approximate the objective function to obtain a good initial guess for the solution. This not only helps ensure convergence, but it often improves the speed of convergence. If you want to solve 300 problems, having a good guess will speed up the total time.

For this example, the objective functions are a perturbation of a quadratic function. It is easy to show that the quadratic function has an optimal solution at (x1, x2) = (a, b), and you can verify from the previous output table that the optimal solutions are close to (a, b). Consequently, if you use (a, b) as an initial guess, rather than a generic value like (0, 0), then each problem will converge in only a few iterations. For this problem, the GetInitialGuess function returns (a, b), but in general the function would return a value that depends on the parameters, or even solve a simpler set of equations.

/* Choose an initial guess based on the parameters. */
start GetInitialGuess(Parms);
   return Parms;  /* for this problem, the solution is near (x1,x2)=(a,b) */
finish;
 
do i = 1 to nrow(Parms);
   /* assign GLOBAL variables */
   a = Parms[i, 1]; 
   b = Parms[i, 2]; 
   /* Choose an initial guess based on the parameters. */
   x0 = GetInitialGuess(Parms[i,]); /* initial guess for solution */
   call nlpnra(rc, result, "Func", x0, opt);
   returnCode[i] = rc;
   Soln[i,] = result;
end;
print Parms[c=varNames] returnCode Soln[c={x1 x2} format=Best8.];

The solutions and return codes are similar to the previous example. You can see that changing the initial guess changes the convergence status and slightly changes the solution values.

For this simple problem, choosing a better initial guess provides only a small boost to performance. If you use (0,0) as an initial guess, you can solve 3,000 problems in about 2 seconds. If you use the GetInitialGuess function, it takes about 1.8 seconds to solve the same set of optimizations. For other problems, providing a good initial guess might be more important.

In conclusion, the SAS/IML language makes it easy to solve multiple optimization problems, where each problem uses a different set of parameter values. To improve convergence, you can sometimes use the parameter values to compute an initial guess.

The post Solve many optimization problems appeared first on The DO Loop.

September 23, 2019
 

The SAS Global Forum 2020 call for content is open until Sept. 30, 2019. Are you thinking of submitting a paper? If so, we have a few tips adapted from The Global English Style Guide that will help your paper shine. By following Global English guidelines, your writing will be clearer and easier to understand, which can boost the effectiveness of your communications.

Even if you’re not planning on submitting a paper and producing technical information is not your primary job function, being aware of Global English guidelines can help you communicate more effectively with your colleagues from around the world.

1. Use Short Sentences

Short sentences are less likely to contain ambiguities or complexities. For task-oriented information, try to limit your sentences to 20 words. If you have written a long sentence, break it up into two or more shorter sentences.

2. Use Complete Sentences

Incomplete sentences can be confusing for non-native speakers because the order of sentence parts is different in other languages. In addition, incomplete sentences can cause machine-translation software to produce garbled results. For example, the phrases below are fragments that may cause issues for readers:

Original: Lots of info here. Not my best, but whatever. Waiting to hear back until I do anything.
Better: There is lots of info here. It’s not my best work, but I am waiting to hear back until I do anything.

An extremely common location to encounter sentence fragments is in the introductions to lists. For example, consider the sentence introducing the list below.

The programs we use for analysis are:
• SAS Visual Analytics
• SAS Data Mining and Machine Learning
• SAS Visual Investigator

If you use an incomplete sentence to introduce a list, consider revising the sentence to be complete, then continue on to the list, as shown below.

We use the following programs in our analysis:
• SAS Visual Analytics
• SAS Data Mining and Machine Learning
• SAS Visual Investigator

3. Untangle Long Noun Phrases

A noun phrase can be a single noun, or it can consist of a noun plus one or more preceding words such as articles, pronouns, adjectives, and other nouns. For example, the following sentence contains a noun phrase with 6 words:

The red brick two-story apartment building was on fire.

Whenever possible, limit noun phrases to no more than three words while maintaining comprehensibility.

4. Expand -ED Verbs That Follow Nouns Whenever Possible

A past participle is the form of a verb that usually ends in -ed. It can be used as both the perfect and past tense of verbs as well as an adjective. This double use can be confusing for non-native English speakers. Consider the following sentence:

This is the algorithm used by the software.

In this example, the word “used” is an adjective, but it may be mistaken for a verb. Avoid using -ed verbs in ambiguous contexts. Instead, add words such as “that” or switch to the present tense to help readers interpret your meaning. A better version of the previous example sentence follows.

This is the algorithm that is used by the software.

5. Always Revise -ING Verbs That Follow Nouns

The name of this tip could have been written as “Always Revise -ING Verbs Following Nouns.” But that’s exactly what we want to avoid! If an -ING word immediately follows and modifies a noun, then either expand it or eliminate it. These constructions are ambiguous and confusing.

6. Use “That” Liberally

The word “that” is your friend! In English, the word “that” is often omitted before a relative clause. If it doesn’t feel unnatural or forced, try to include “that” before these clauses, as shown in the following example.

Original: The file you requested could not be located.

Better: The file that you requested could not be located.

7. Choose Simple, Precise Words That Have a Limited Range of Meanings

We are not often accustomed to thinking about the alternative meanings for the words we use. But consider that many words have multiple meanings. If translated incorrectly, these words could make your writing completely incomprehensible and possibly ridiculous. Consider the alternate meanings of a few of the words in the following sentences:

When you hover over the menu, a box appears.

We are deploying containers in order to scale up efficiency.

8. Don’t Use Slang, Idioms, Colloquialisms, or Figurative Language

In the UK, retail districts are called “high street.” In the US, they are called “main street.” This is just one example of how using colloquialisms can cause confusion. And Brits and Americans both speak the same language!

Especially in formal communications, keep your writing free of regional slang and idioms that cannot be easily understood by non-native English speakers. Common phrases such as “under the weather,” “piece of cake,” or “my neck of the woods” make absolutely no sense when translated literally.

We hope these Global English tips help you write your Global Forum paper, or any other communications you might produce as part of your work. For more helpful tips, read The Global English Style Guide: Writing Clear, Translatable Documentation for a Global Market by John R. Kohl.

Top 8 Global English Guidelines was published on SAS Users.

September 23, 2019
 

Although I do not typically blog about undocumented SAS options, I'll make an exception this time. For many years, I have known that the CONTENTS and COMPARE procedures support the BRIEF and SHORT options, but I always forget which option goes with which procedure. For the record, here are the documented options:

• PROC CONTENTS supports the SHORT option, which displays a condensed list of the variables in a data set.
• PROC COMPARE supports the BRIEF option (an alias for BRIEFSUMMARY), which produces a short summary of the comparison.

SAS provides an autocomplete feature that programmers can use to remind themselves which options are supported in each procedure. Both SAS Studio and SAS Enterprise Guide support the autocomplete feature, which displays a list of options as you type SAS code. Unfortunately, I turn off that feature because I find it distracting. Consequently, I cannot remember whether to use SHORT or BRIEF.

My "solution" to this dilemma is to randomly guess an option, run the program, and fix the error if I guessed wrong. I should be wrong 50% of the time, but I noticed that I am wrong only about 25% of the time. Upon investigating, I discovered that some seemingly "wrong" code actually works. Although it is not documented, it turns out that the COMPARE procedure supports the SHORT option as an alias for BRIEFSUMMARY.

This is awesome! No longer do I need to randomly guess the option. I can use the SHORT option in both PROC CONTENTS and PROC COMPARE! As a bonus, the SHORT option is also supported by PROC OPTIONS and PROC DATASETS.

If you have never used the SHORT option before, it's a great time saver because it produces condensed output, as shown in the following examples. For PROC CONTENTS, I like to use the options VARNUM SHORT when I want to display a list of the variables in a data set:

proc contents data=Sashelp.Cars varnum SHORT;      /* SHORT is documented option */
run;

For PROC COMPARE, I use the undocumented SHORT option (which is an alias for BRIEF) when I want to display whether two data sets are equal:

data cars; set Sashelp.cars; run;
 
proc compare base=sashelp.cars compare=cars SHORT; /* SHORT is not documented, but it works */
run;

As I mentioned, the SHORT option is also supported in PROC OPTIONS. For completeness, here is an example that writes the values of several options to the SAS log:

proc options option=(linesize pagesize memsize _LAST_) SHORT; run;    /* SHORT is documented option */
 LINESIZE=75 PAGESIZE=24 MEMSIZE=17179869184 _LAST_=WORK.CARS

In short (see what I did there?), you can use the SHORT option in several Base SAS procedures. If you can't remember that the SHORT option applies to PROC CONTENTS and the BRIEF option applies to PROC COMPARE, just use the SHORT option in both procedures and in PROC OPTIONS, too.

The post Use the SHORT option in Base SAS procedures to reduce output appeared first on The DO Loop.

September 20, 2019
 

Put simply, data literacy is the ability to derive meaning from data. That seems like a straightforward proposition, but, in truth, finding relationships in data can be fraught with complexities, including: Understanding where the data came from, including the lineage or source of that data. Ensuring that the data meet compliance [...]

Why you should care about data literacy was published on SAS Voices by Tom Fisher.

September 19, 2019
 

Recent updates to SAS Grid Manager introduced many interesting new features, including the ability to handle more diverse workloads. In this post, we'll take a look at the steps required to get your SAS Grid Manager environment set up to accept jobs from outside of traditional SAS clients. We'll demonstrate the process of submitting Python code for execution in the SAS Grid.

Preparing your SAS Grid

Obviously, we need a SAS Grid Manager (SAS 9.4 Maintenance 6 or later) environment to be installed and configured. Once the grid is deployed, there's not a whole lot more to do on the SAS side. SAS Workspace Servers need to be configured for launching on the grid by converting them to load balanced Workspace Servers, changing the algorithm to 'Grid', and then selecting the relevant checkbox in SASApp – Logical Workspace Server properties to launch on the grid as shown below.

The only other things that might need configuring are, if applicable, .authinfo files and Grid Option Sets. Keep reading for more information on these.

Preparing your client machine

In this example scenario, the client is the Windows desktop machine where I will write my Python code. SAS is not installed here. Rather, we will deploy and use Jupyter Notebook as our IDE to write Python code, which we'll then submit to the SAS Grid. We can get Jupyter Notebook by installing Anaconda, which is a free, bundled distribution of Python and R that comes with a solid package management system. (The following steps are courtesy of my colleague, Allan Tham.)

First, we need to download and install Anaconda.

Once deployed, we can open Anaconda Navigator from the Start menu and from there, we can launch Jupyter Notebook and create a notebook.

Now it's time to configure the connection from Python to our SAS environment. To do this, we use the SAS-developed open-source SASPy module for Python, which is responsible for converting Python code to SAS code and running it in SAS. Installing SASPy is simple. First, we need to download and install a Java Runtime Environment as a prerequisite. After that, we can launch Anaconda Prompt (a command line interface) from the Start menu and use pip to install the SASPy package.

pip install saspy

Connection parameters are defined in a file named sascfg.py. In fact, the best practice is to use sascfg.py as a template, and create a copy of it, which should be named sascfg_personal.py. SASPy will read and look for connections in both files. For the actual connection parameters, we need to specify the connection method. This depends largely on your topology.

In my environment, I used a Windows client to connect to a Linux server using a standard IOM connection. The most appropriate SASPy configuration definition is therefore 'winiomlinux', which relies on a Java-based connection method to talk to the SAS workspace. This needs to be defined in the sascfg_personal.py file.

SAS_config_names=['winiomlinux']

We also need to specify the parameters for this connection definition as shown below.

# build out a local classpath variable to use below for Windows clients   CHANGE THE PATHS TO BE CORRECT FOR YOUR INSTALLATION 
cpW  =  "C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.svc.connection.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\log4j.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.security.sspi.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\sas.core.jar"
cpW += ";C:\\ProgramData\\Anaconda3\\Lib\\site-packages\\saspy\\java\\saspyiom.jar"

Note that we are referencing a number of JAR files in the classpath that normally come with a SAS deployment. Since we don't have SAS on the client machine, we can copy these from the Linux SAS server. In my lab, these were copied from /SASDeploymentManager/9.4/products/deploywiz__94526__prt__xx__sp0__1/deploywiz/.

Further down in the file, we can see the connection parameters to the SAS environment. Update the path to the Java executable, the SAS compute server host, and the port for the Workspace Server.

winiomlinux = {'java'   : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
            'iomhost'   : 'sasserver01.mydemo.sas.com',
            'iomport'   : 8591,
            'encoding'  : 'latin1',
            'classpath' : cpW
            }

Note: thanks to a recent contribution from the community, SASPy now also supports a native Windows connection method that doesn't require Java. Instead, it uses the SAS Integration Technologies client -- a component that is free to download from SAS. You'll already have this component if you've installed SAS Enterprise Guide, SAS Add-In for Microsoft Office, or Base SAS for Windows on your client machine. See instructions for this configuration in the SASPy documentation.

With the configuration set, we can import the SASPy module in our Jupyter Notebook.
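For example:

# Import the SASPy module; connection definitions are read from
# sascfg_personal.py, falling back to sascfg.py
import saspy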

Although we're importing sascfg_personal.py explicitly here, we can just call sascfg.py, as SASPy actually checks for sascfg_personal.py and uses that if it finds it. If not, it uses the default sascfg.py.

To connect to the grid from Jupyter Notebook using SASPy, we need to instantiate a SASsession object and specify the configuration definition (e.g. winiomlinux). You'll get prompted for credentials, and then a message indicating a successful connection will be displayed.
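A sketch of this step, using the configuration definition created earlier:

# Instantiate a SASsession using the 'winiomlinux' definition;
# SASPy prompts for credentials unless an authkey is configured
sas = saspy.SASsession(cfgname='winiomlinux')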

It's worth noting that we could also specify the configuration file to use when we start a session here by specifying:

sas = saspy.SASsession(cfgfile='/sascfg_personal.py')

Behind the scenes, SAS will find an appropriate grid node on which to launch a Workspace Server session. The SASPy module works in any SAS 9.4 environment, regardless of whether it is a grid environment or not. It simply runs in a Workspace Server; in a SAS Grid environment, it just happens to be a Grid-launched Workspace Server.

Running in the Grid

Now let's execute some code in the Workspace Server session we have launched. Note that not all Python methods are supported by SASPy, but because the module is open source, anyone can add to or update the available methods. Refer to the API doc for more information.

In taking another example from Allan, let's view some of the content in a SASHELP data set.
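A sketch of this step, matching the SASHELP.PRDSALE table that appears in the Workspace Server log excerpt later in this post:

# Bind a SASdata object to SASHELP.PRDSALE and display the first five rows
myclass = sas.sasdata('prdsale', libref='sashelp')
myclass.head()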

The output is in ODS, which is converted (by SASPy) to a Pandas data frame to display it in Jupyter Notebook.

Monitoring workloads from SAS Grid Manager

The SAS Workload Orchestrator web interface allows us to view the IOM connections SASPy established from our client machine. We can see the grid node on which the Workspace Server was launched, and view some basic information. The job will remain in a RUNNING state until the connection is terminated (i.e., the SASsession is closed by calling the endsas() method). By the same token, creating multiple SASsessions will result in multiple grid jobs running.

To see more details about what actually runs, Workspace Server logs need to first be enabled. When we run myclass.head(), which will display the first 5 rows of the data set, we see the following written to Workspace Server log.

2019-08-14T02:47:34,713 INFO  [00000011] :sasdemo - 239        ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
2019-08-14T02:47:34,713 INFO  [00000011] :sasdemo - 239      ! ods graphics on / outputfmt=png;
2019-08-14T02:47:34,779 INFO  [00000011] :sasdemo - NOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1
2019-08-14T02:47:34,780 INFO  [00000011] :sasdemo - 240        ;*';*";*/;
2019-08-14T02:47:34,782 INFO  [00000045] :sasdemo - 241        data _head ; set sashelp.prdsale (obs=5 ); run;
2019-08-14T02:47:34,782 INFO  [00000045] :sasdemo -
2019-08-14T02:47:34,783 INFO  [00000045] :sasdemo - NOTE: There were 5 observations read from the data set SASHELP.PRDSALE.
2019-08-14T02:47:34,784 INFO  [00000045] :sasdemo - NOTE: The data set WORK._HEAD has 5 observations and 5 variables.
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo - NOTE: DATA statement used (Total process time):
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -       real time           0.00 seconds
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -       cpu time            0.00 seconds
2019-08-14T02:47:34,787 INFO  [00000045] :sasdemo -

We can see the converted code, which in this case was a DATA step that creates a WORK table based on the PRDSALE table with obs=5. Scrolling down, we also see the code that prints the output to ODS and converts it to a data frame.

Additional Options

Authinfo files

The sascfg_personal.py file has an option for specifying an authkey, which is an identifier that maps to a set of credentials in an .authinfo file (or _authinfo on Windows). This can be leveraged to eliminate the prompting for credentials. For example, if your authinfo file looks like:

IOM_GELGrid_SASDemo user sasdemo password lnxsas

your configuration defintion in your sascfg_personal.py should look like:

winiomlinux = {'java'   : 'C:\\Program Files\\Java\\jre1.8.0_221\\bin\\java',
            'iomhost'   : 'sasserver01.mydemo.sas.com',
            'iomport'   : 8591,
            'encoding'  : 'latin1',
            'authkey' : 'IOM_GELGrid_SASDemo',
            'classpath' : cpW
            }

There are special rules for how to secure the authinfo file (making it readable only by you), so be sure to refer to the instructions.

Grid Option Sets

What if you want your code to run in the grid with certain parameters or options by default? For instance, say you want all Python code to be executed in a particular grid queue. SASPy can do this by leveraging Grid Option Sets. The process is outlined here, but in short, a new SASPy 'SAS Application' has to be configured in SAS Management Console, which is then used to map to the Grid Option Set (created using the standard process).

More Information

My sincere thanks to Allan Tham and Greg Wootton for their valued contributions to this post.

Please refer to the official SAS Grid documentation for more information on SAS Grid Manager in SAS 9.4M6.

Thank you for reading. I hope the information provided in this post has been helpful. Please feel free to comment below to share your own experiences.

Using Python to run jobs in your SAS Grid was published on SAS Users.