August 14, 2019
 

Many programmers are familiar with "short-circuit" evaluation in an IF-THEN statement. Short-circuit evaluation means that a program does not evaluate the remainder of a logical expression if the value of the expression is already logically determined. The SAS DATA step supports short-circuiting for simple logical expressions in IF-THEN statements and WHERE clauses (Polzin 1994, p. 1579; Gilsen 1997). For example, in the following logical-AND expression, the condition for the variable Y does not need to be checked if the condition for variable X is false:

data _null_;
set A end=eof;
if x>0 & y>0 then   /* Y is evaluated only if X>0 */
   count + 1;
if eof then 
   put count=;
run;

Order the conditions in a logical statement by likelihood

SAS programmers can optimize their IF-THEN and WHERE clauses if they can estimate the probability of each separate condition in a logical expression:

  • In a logical AND statement, put the least likely events first and the most likely events last. For example, suppose you want to find patients at a VA hospital who are male, over the age of 50, and have kidney cancer. You know that kidney cancer is a rare form of cancer. You also know that most patients at the VA hospital are male. To optimize a WHERE clause, you should put the least probable conditions first:
    WHERE Disease="Kidney Cancer" & Age>50 & Sex="Male";
  • In a logical OR statement, put the most likely events first and the least likely events last. For example, suppose you want to find all patients at a VA hospital who are either male, or over the age of 50, or have kidney cancer. To optimize a WHERE clause, you should use
    WHERE Sex="Male" | Age>50 | Disease="Kidney Cancer";

The SAS documentation does not discuss the conditions for which a logical expression does or does not short circuit. Polzin (1994, p. 1579) points out that when you put function calls in the logical expression, SAS evaluates certain function calls that produce side effects. Common functions that have side effects include random number functions and user-defined functions (via PROC FCMP) that have output arguments. The LAG and DIF functions can also produce side effects, but it appears that expressions that involve the LAG and DIF functions are short-circuited. You can force a function evaluation by calling the function prior to an IF-THEN statement. You can use nested IF-THEN/ELSE statements to ensure that functions are not evaluated unless prior conditions are satisfied.
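
For example, here is a minimal sketch of the nested IF-THEN/ELSE approach. The function myFunc is hypothetical and stands in for any function with side effects:

data _null_;
set A end=eof;
if x > 0 then do;            /* outer condition is checked first */
   if myFunc(y) > 0 then     /* myFunc is called only when x > 0 */
      count + 1;
end;
if eof then
   put count=;
run;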

Logical ligatures

The SAS/IML language does not support short-circuiting in IF-THEN statements, but it performs several similar optimizations that are designed to speed up your code execution. One optimization involves the ANY and ALL functions, which test whether any or all (respectively) elements of a matrix satisfy a logical condition. A common usage is to test whether a missing value appears anywhere in a vector, as shown in the following SAS/IML statement:

bAny = any(y = .);   /* TRUE if any element of Y is missing */
/* The complementary test, bAll = all(y ^= .), is TRUE if no element is missing */

The SAS/IML language treats simple logical expressions like these as a single function call, not as two operations. I call this a logical ligature because two operations are combined into one. (A computer scientist might just call this a parser optimization.)

You might assume that the expression ANY(Y=.) is evaluated by using a two-step process. In the first step, the Boolean expression y=. is evaluated and the result is assigned to a temporary binary matrix, which is the same size as Y. In the second step, the temporary matrix is sent to the ANY function, which evaluates the binary matrix and returns TRUE if any element is nonzero. However, it turns out that SAS/IML does not use a temporary matrix. The SAS/IML parser recognizes that the expression inside the ANY function is a simple logical expression. The program can evaluate the function by looking at each element of Y and returning TRUE as soon as it finds a missing value. In other words, it short-circuits the computation. If no value is missing, the expression evaluates to FALSE.

Short circuiting an operation can save considerable time. In the following SAS/IML program, the vector X contains 100 million elements, all equal to 1. The vector Y also contains 100 million elements, but the first element of the vector is a missing value. Consequently, the computation for Y is essentially instantaneous whereas the computation for X takes a tenth of a second:

proc iml;
numReps = 10;      /* run computations 10 times and report average */
N = 1E8;           /* create vector with 100 million elements */
x = j(N, 1, 1);    /* all elements of x equal 1 */
y = x; y[1] = .;   /* the first element of x is missing */
 
/* the ALL and ANY functions short-circuit when the 
   argument is a simple logical expression */
/* these function calls examine only the first elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(y = .);   /* TRUE for y[1] */
   bAll = all(y ^= .);  /* FALSE because y[1] is missing */
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];
 
/* these function calls must examine all elements */
t0 = time();
do i = 1 to numReps;
   bAny = any(x = .);   
   bAll = all(x ^= .);
end;
t = (time() - t0) / numReps;
print t[F=BEST6.];

Although the evaluation of X does not short circuit, it still uses the logical ligature to evaluate the expression. Consequently, the evaluation is much faster than the naive two-step process that is shown explicitly by the following statements, which require about 0.3 seconds and twice as much memory:

   /* two-step process: slower */
   b1 = (y=.);                /* form the binary vector */
   bAny = any(b1);            /* TRUE for y[1] */

In summary, the SAS DATA step uses short-circuit evaluation in IF-THEN statements and WHERE clauses that use simple logical expressions. If the expression contains several subexpressions, you can optimize your program by estimating the probability that each subexpression is true. In the SAS/IML language, the ANY and ALL functions not only short circuit, but when their argument is a simple Boolean expression, the language treats the function call as a logical ligature and evaluates the call in an efficient manner that does not require any temporary memory.

Short circuits can be perplexing if you don't expect them. Equally confusing is expecting a statement to short circuit, but it doesn't. If you have a funny story about short-circuit operators, in any language, leave a comment.

The post Short-circuit evaluation and logical ligatures in SAS appeared first on The DO Loop.

August 13, 2019
 

Across all industries, organizations are adopting a cloud-first strategy. What does it mean to be cloud-first? Broadly speaking, cloud first means using shared infrastructure instead of building and hosting your own private storage facility, systems, etc. Benefits of adopting a cloud-first strategy include cost savings and productivity improvements. However, what [...]

Five considerations for your cloud-first strategy was published on SAS Voices by Kevon Hayati

August 13, 2019
 

Raw data doesn’t change an organization, and neither do analytics on their own. It’s making decisions based on that data and the results of analytics that drives change through a company. Every decision is important and influences an organization. Thousands of decisions need to be made every day and many decisions are dependent on other decisions in an interconnected network.

SAS Intelligent Decisioning combines business rules management, decision processing, real-time event detection, decision governance and analytics to automate and manage decisions across the enterprise. It supports customer-facing activities such as personalized marketing and next-best action, plus decisions affecting customers, including credit services and fraud prevention.

Overview

Business rules

An integrated business rule management platform enables fast rule construction, testing, governance and integration within decision flows. You can manage rule versions for tracking and governance. The solution allows users to create complex business logic supported by sophisticated functions and integration with Lookup Tables.

Decision flows

A graphical drag-and-drop interface allows users to build decisions with minimal programming effort. Decisions are created in a decision flow that orchestrates business rules, analytical models, database access, custom code objects and more.

Graphical editor to create decisions

Further, it is possible to test and maintain different versions of decisions and business rules before deploying them for production real-time or batch execution.

The high-performance, real-time Micro Analytic Service (MAS) engine can handle more than 5,000 real-time transactions per second with response times of 10 milliseconds per transaction. The REST interface for calling decisions or business rules in real time provides simple integration with most third-party applications.
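
To make that integration concrete, here is a hedged sketch of a real-time call issued from SAS with PROC HTTP. The host name, module name (myDecision), step name, input variable, and access token are all placeholders; check the SAS Viya REST API documentation for the exact endpoint in your release:

/* A sketch of a real-time decision call via the MAS REST interface.
   Host, module, step, input name, and &access_token are placeholders. */
filename resp temp;
proc http
   url="https://myserver.example.com/microanalyticScore/modules/myDecision/steps/execute"
   method="POST"
   in='{"inputs":[{"name":"customerId","value":12345}]}'
   out=resp;
   headers "Content-Type"="application/json"
           "Authorization"="Bearer &access_token";
run;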

Monitor test results through Decision Path tracking

New Features

The latest release of SAS Intelligent Decisioning recently became available, and I’d like to highlight some of its new features.

SQL Query Node

Users can now submit SQL directly into a SQL Query node without supplying any additional coding logic. The SQL Query node supports SELECT, INSERT, DELETE and UPDATE.

To link an SQL statement to a decision, you point tables and columns to the decision variables by enclosing the variable names in curly brackets, as shown in the sketch below. Intelligent Decisioning will then automatically pass data into the SQL as appropriate.
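
Here is a minimal sketch of what such a query might look like; the table, column, and decision-variable names are hypothetical:

SELECT CreditScore, RiskSegment
FROM RISK.CUSTOMER_SCORES
WHERE CustomerId = {Customer_Id}

At run time, the value of the decision variable Customer_Id is substituted into the query before it is executed.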

If you query data via a select statement, the result is returned in a Datagrid. A Datagrid is a data type for an object in Intelligent Decisioning and represents data in a table format that belongs to a single record.

Datagrids are used in many places in Intelligent Decisioning and there is a rich set of Datagrid functions to access and work with data in a Datagrid.

Python Code Node

Intelligent Decisioning provides an environment that aims to minimize the need to write code to build decisions. But if necessary, it is possible to submit code. Intelligent Decisioning supports writing code in Python as part of a decision flow. Data from a decision flow can be passed into the Python code and return values will be passed back from Python into the decision flow.

To enable coding in Python, a Python execution environment needs to be installed alongside Intelligent Decisioning. If a decision flow contains a Python Code Node, the Python code will automatically be executed in the Python environment as part of the overall decision.
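
As a rough illustration, a Python code node wraps your logic in a function whose inputs and outputs are mapped to decision variables. The sketch below is illustrative only; the exact conventions (function name, how outputs are declared) depend on your release, so check the Intelligent Decisioning documentation:

def execute(credit_score, income):
    'Output: approved, limit'            # declares the values returned to the flow
    approved = 1 if credit_score > 650 else 0
    limit = income * 0.25 if approved == 1 else 0.0
    return approved, limit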

Decision Flow containing Python code node

A code editor in Intelligent Decisioning allows you to edit your Python code within the environment.

A Python code editor is part of Intelligent Decisioning

Decision Node

Decision flows can call other decision flows. This opens the way to designing and building modular decisions with “pluggable” components. You can also build reusable decisions which are called by different decision flows. Building decisions in such a modular way makes it easier to read and maintain decision flows.

Drill down from one decision to the next

Treatments

Treatments are lists of attributes with fixed or dynamic values.

Treatments are used to define offers to present to a customer as a result of an inbound marketing campaign. Or treatments can be used as parameter lists to control engine settings. There are numerous use cases for treatments.

Treatment attribute list

To determine if a treatment is valid for a decision, you can set Eligibility Rules to decide when a treatment will be used. For audit reasons and to track changes over time, you can also have different versions of a treatment.

To utilize treatments, you group them together in treatment groups, which can then be called from a decision flow.

Conclusion

Managing and analyzing high volumes of data to make thousands of decisions every day in an automated fashion, and applying analytics to real-time customer interactions, require a sophisticated and complete solution like SAS Intelligent Decisioning. It enables users to create, test, version, and trace analytically driven decisions, all in one solution.

By making smarter decisions, organizations become more efficient. As mentioned in the beginning: Data doesn’t change your organization, decisions do!

Learn more

Video: SAS Intelligent Decisioning | Product Overview
Documentation: SAS Intelligent Decisioning
Product: SAS Intelligent Decisioning

SAS Intelligent Decisioning: Intro and Update was published on SAS Users.

August 12, 2019
 

In part one of this blog series, we introduced the automation of AI (i.e., artificial intelligence) as a multifaceted and evolving topic for marketing and segmentation. After a discussion on maximizing the potential of a brand's first-party data, a machine learning method incorporating natural language explanations was provided in the context [...]

SAS Customer Intelligence 360: Automated AI and segmentation [Part 2] was published on Customer Intelligence Blog.

August 12, 2019
 
What is this math good for, anyway?
     –Every student, everywhere

I am a professional applied mathematician, yet many of the mathematical and statistical techniques that I use every day are not from advanced university courses but are based on simple ideas taught in high school or even in grade school. I've written many blog posts in which the solution to an interesting problem requires little more than high-school math. Even when the solution requires advanced techniques, high-school math often provides the basis for solving the problem.

In celebration of the upcoming school year, here are 12 articles that show connections between advanced topics and mathematical ideas that are introduced in high-school or earlier. If you have a school-age child, read some of the articles to prepare yourself for the inevitable mid-year whine, "But, Mom/Dad, when will I ever use this stuff?!"

Grade-school math

Obviously, most adults use basic arithmetic, fractions, decimals, and percents, but here are some less obvious uses of grade-school math:

High-school math

Algebra, linear transformations, geometry, and trigonometry are the main topics in high school mathematics. These topics are the bread-and-butter of applied mathematics:

  • Linear transformations: Anytime you create a graph on the computer screen, a linear transformation transforms the data from physical units (weight, cost, time) into pixel values. Although modern graphical software performs that linear transformation for you, a situation in which you have to manually apply a linear transformation is when you want to display data in two units (pounds and kilograms, dollars and Euros, etc). You can use a simple linear transformation to align the tick labels on the two axes.
  • Intersections: In high school, you learn to compute the intersection between two lines. You can extend the problem to find the intersection of two line segments. I needed that result to solve a problem in probability theory.
  • Solve a system of equations: In Algebra II, you learn to solve a system of equations. Solving linear systems is the foundation of linear algebra. Solving nonlinear systems is among the most important skills in applied mathematics.
  • Find the roots of a nonlinear equation: Numerically finding the roots of a function is taught in pre-calculus. It is the basis for all "inverse problems" in which you want to find inputs to a function that produce a specified output. In statistics, a common "inverse problem" is finding the quantile of a cumulative probability distribution.
  • Binomial coefficients: In algebra, many teachers use the mnemonic FOIL (First, Outer, Inner, Last) to teach students how to compute the quadratic expansion of (a + b)^2. Later, students learn the binomial expansion of an arbitrary power, (a + b)^n. The coefficients in this expansion are called the binomial coefficients and appear in Pascal's triangle as well as in many discrete probability distributions such as the negative binomial and hypergeometric distributions.
  • Pythagorean triples: In trigonometry, a huge number of homework problems involve using right triangles with side lengths that are proportional to (3, 4, 5) and (5, 12, 13). These are two examples of Pythagorean triples: right triangles whose side lengths are all integers. It turns out that you can use linear transformations to generate all primitive triples from the single triple (3, 4, 5). A high school student can understand this process, although the process is most naturally expressed in terms of matrix multiplication, which is not always taught in high school.

High-school statistics

Many high schools offer a unit on probability and statistics, and some students take AP Statistics.

Einstein famously said, "everything should be made as simple as possible, but not simpler." It is surprising to me how often an advanced technique can be simplified and explained by using elementary math. I don't claim that "everything I needed to know about math I learned in kindergarten," but I often return to elementary techniques when I describe how to solve non-elementary problems.

What about you? What are some elementary math or statistics concepts that you use regularly in your professional life? Are there fundamental topics that you learned in high school that are deeper and more useful than you realized at the time? Leave a comment.

The post The math you learned in school: Yes, it’s useful! appeared first on The DO Loop.


August 9, 2019
 

Opening Plenary session, Esri UC 2019

Several of my colleagues and I attended the annual Esri User Conference last month in San Diego - along with 18,000 other Geo professionals.  It was a busy week of meetings, seminars and talks about the latest in GIS and Spatial technologies.  The days were long and exhausting, but it was also exciting and a ton of fun.  As we continue to process, plan and prepare to integrate some of these technologies into SAS Visual Analytics, I thought it would be beneficial to highlight the Esri features available in VA today.

One topic that received a lot of questions during this year’s SAS Global Forum in Dallas was that of Geocoding.  Geocoding is the process of transforming text address data into numeric latitude and longitude values.  Once the latitude and longitude are known, they can be mapped and analyzed spatially.  SAS has offered geocoding capabilities for quite some time as a part of SAS/GRAPH.  Beginning with SAS 9.4M5, PROC GEOCODE is part of Base SAS.  See my colleague’s blog posts here and here for more information on geocoding from Base SAS.
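
As a quick sketch of what that looks like in code, the following step performs ZIP-centroid geocoding against the SASHELP.ZIPCODE lookup table. The input data set CUSTOMERS (with a variable named ZIP) is a hypothetical example:

/* ZIP-centroid geocoding in Base SAS; CUSTOMERS is a hypothetical input
   data set that contains a variable named ZIP */
proc geocode method=zip data=customers out=customers_geo
             lookup=sashelp.zipcode;
run;
/* CUSTOMERS_GEO gains X (longitude) and Y (latitude) variables */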

But Geocoding is no longer limited to just Base SAS.  You can also geocode from within Visual Analytics, thanks to the integration with the Esri geocoding API.  This feature is part of the Esri Premium agreement, and became available in VA 8.3.   Esri premium features require an existing relationship and credentials with Esri.  This post assumes that relationship exists and your credentials have been validated.  I will discuss the details of the Esri premium features in a future post, but for today the focus is how to use the Esri Geocoding feature from VA with a real-world data set.

1. Getting the data into Visual Analytics

We will be using point data from the City of Dallas for the Public Library branch locations.  You can download the .csv file from the Dallas Open Data portal.  After downloading, it must be imported into VA for geocoding.

  • From the Data tab in VA, select Import > Local File
  • Navigate to the location of the Dallas library .csv file and select it
  • Adjust the default settings, if desired, and click the ‘Import Item’ button
  • Once you see the green success message, the data has been imported into VA and is ready to be geocoded. Click the ‘Cancel’ button

Message indicating successful data import

2. Selecting the data columns to geocode on

Accessing the Geocoding feature in VA follows a similar process to the steps we just performed to import the .csv file.

  • From the Data tab in VA, select Import > Esri > Geocode. Here, you must select the location of the newly imported library data set.  This path will vary depending upon the configuration of your VA instance.  For my installation, it is located at cas-shared-default > Public folder > CITY_OF_DALLAS_LIBRARY_LOCATIONS.  Once located, click the 'Select' button
  • The Geocoding Import window will open. This window should look familiar.  The top half is the same as the Import window we just used to get the .csv file into VA.  Essentially, the geocoding process is a new data import.  It sends the selected columns to Esri via a REST API call.  The response contains the corresponding latitude and longitude values we desire.  They are added to our existing data set, which is imported into VA as a new geocoded data set.  The name of the new data set will have _GEO_CODE appended to the end of the original data set name.  This name can be modified as desired.

Geocoding selection dialog window

  • At the bottom of the Geocoding Import window are two list boxes, Available items and Selected items. The Available items box on the left contains all columns in the data set.  Select the column(s) containing the address information you wish to geocode.  Double click or click the right arrow to move them to the Selected items window on the right.  In this example, we are using the Address column.
  • VA concatenates the selected column(s) to generate a sample address for geocoding. Clicking the ‘Test’ button returns coordinates for the sample address and a score representing the confidence level of the results.  In the screenshot above, our score is 71/100 for the test address.  Not bad, but it could be better.  More on this a bit later.
  • To finish the geocoding process, click the ‘Import Item’ button at the top of the page, as we did with the original .csv file import. This time, you will be presented with a new dialog window.  Geocoding, like other Esri premium features, requires the use of credits.  This dialog indicates how many Esri credits will be used by the geocoding process; credits will also be discussed in detail in a future post.

Esri credit usage alert dialog

For now, select 'Yes' to continue.  When you see the green success message, the operation is complete.  We are now ready to map our Dallas Library locations.  Click 'Ok' to open the new geocoded data set.

3. Create the geography variable and display the map

Next, we need to create our geography variable from the new geocoded data set.  As part of the geocoding process, four new columns have been added to the new data set: esri_latitude, esri_longitude, esri_score, esri_address.  We only need the esri_latitude and esri_longitude columns for our map.

  • Select the Branch Name category variable and change its Classification to Geography
  • For Geography data type, select Custom Coordinates
  • Select esri_latitude for Latitude
  • Select esri_longitude for Longitude
  • Click 'OK'
  • Drag the Branch Name geography variable to the canvas to create the map

Map of non-unique geocoded addresses

What happened??  Our data set contains Dallas Public library locations, so why are the data points spread across the world?  It’s all in the data.  If you look at the original data a bit deeper, you will notice the Address field we selected for the geocoding only contains the street number and street name of the library location.  It does not contain enough information to make it unique.  Therefore, during the geocoding process, the first instance of that address will be considered a match, regardless of where it is actually located.

Detailed view of incorrect geocoded address

In the image above for the Preston Royal branch, its street number and name were a perfect match to a location in Eugene, Oregon.  Not quite what we were looking for.  So, how do we fix this?  Making our addresses unique requires a simple addition to the source .csv file.

Column selection to ensure unique addresses for geocoding

We need to add a ‘City’ and ‘State’ column to the original .csv file with the values of ‘Dallas’ and ‘Texas’ assigned to all entries.  This will ensure each address is unique and within our area of interest.  Re-import the new .csv file and geocode it using the Address, City and State columns.  The result?  A confidence score of a perfect 100.  Much better than our first attempt!  This will now give us the map we desire for the Dallas Public Library locations.
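
If you would rather script this step than edit the file by hand, a short Base SAS sketch could add the constant columns and write a new file. The file paths below are placeholders, and PROC IMPORT's column guesses depend on the layout of the downloaded .csv file:

/* A sketch: add constant City and State columns, then write a new .csv.
   File paths are hypothetical. */
proc import datafile='/data/CITY_OF_DALLAS_LIBRARY_LOCATIONS.csv'
            out=libraries dbms=csv replace;
run;

data libraries;
   set libraries;
   length City $ 6 State $ 5;
   City  = 'Dallas';
   State = 'Texas';
run;

proc export data=libraries outfile='/data/DALLAS_LIBRARIES_FULL_ADDRESS.csv'
            dbms=csv replace;
run;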

 

Final geocoded map of Dallas Library branches

In this post, I used real-world data to illustrate two things: the importance of knowing your data set, and how to geocode address information in SAS Visual Analytics.  Public data sets are a great resource but need to be used with a critical eye.  They may still need additional cleansing in order to work for your situation.

The geocoding feature is one example of the premium Esri features currently available in VA.  In future posts, I will go into more detail on other Esri features, what makes these features ‘premium’ and examples of how to use them.  Stay tuned!

Esri integration with SAS Visual Analytics: Geocoding was published on SAS Users.

August 8, 2019
 

Some key components of CASL are the action statements. These statements perform tasks that range from configuring the connection to the server, to summarizing large amounts of data, to processing image files. Each action has its own purpose. However, there is some overlapping functionality between actions. For example, more than one action can summarize numeric variables.
This blog looks at three actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS. Each of these actions generates summary statistics. Though there might be more actions that generate the same statistics, these three are a good place to start as you learn CASL.

Create a CAS table for these examples

The following step generates a table called mydata, stored in the casuser caslib, that will be used for the examples in this blog.

cas;
libname myuser cas caslib='casuser';

data myuser.mydata;
   length color $8;
   array X{100};
   do k=1 to 9000;
      do i=1 to 50;
         X{i} = rand('Normal', 0, 4000);
      end;
      do i=51 to 100;
         X{i} = rand('Normal', 100000, 1000000);
      end;
      if x1 < 0 then color='red';
      else if x1 < 3000 then color='blue';
      else color='green';
      output;
   end;
run;

SIMPLE.SUMMARY

The purpose of the Simple Analytics action set is to perform basic analytical functions. One of the actions in the action set is the SUMMARY action, used for generating descriptive statistics like the minimum, maximum, mean, and sum.
This example demonstrates obtaining the sum, mean, and n statistics for five variables (x1–x5) and grouping the results by color. The numeric input variables are specified in the INPUTS parameter. The desired statistics are specified in the SUBSET parameter.

proc cas;
   simple.summary / 
      inputs={"x1","x2","x3","x4","x5"},
      subset={"sum","mean","n"},
      table={caslib="casuser",name="mydata",groupBy={"color"}},
      casout={caslib="casuser", name="mydata_summary", replace=true};
run;
	table.fetch /
		table={caslib="casuser",name="mydata_summary" };
run;
quit;

The SUMMARY action creates a table that is named mydata_summary. The TABLE.FETCH action is included to show the contents of the table.

The mydata_summary table can be used as input for other actions, its variable names can be changed, or it can be transposed. Now that you have the summary statistics, you can use them however you need to.

AGGREGATION.AGGREGATE

Many SAS® procedures have been CAS-enabled, which means you can use a CAS table as input. However, specifying a CAS table does not mean all of the processing takes place on the CAS server. Not every statement, option, or statistic is supported on the CAS server for every procedure. You need to be aware of what is not supported so that you do not run into issues if you choose to use a CAS-enabled procedure. In the documentation, refer to the CAS processing section to find the relevant details.
When a procedure is CAS-enabled, it means that, behind the scenes, it is submitting an action. The MEANS and SUMMARY procedure steps submit the AGGREGATION.AGGREGATE action.
With PROC MEANS it is common to use a BY or CLASS statement and ask for multiple statistics for each analysis variable, even different statistics for different variables. Here is an example:

proc means sum data=myuser.mydata noprint;
  by color;
   var x1 x2 x3;
   output out=test(drop=_type_ _freq_) sum(x1 x3)=x1_sum x3_sum
   max(x2)=x2_max std(x3)=x3_std;
run;

The AGGREGATE action produces the same statistics and the same structured output table as PROC MEANS.

proc cas;
	aggregation.aggregate / 
		table={name="mydata",caslib="casuser",groupby={"color"}}
      casout={name="mydata_aggregate", caslib='casuser', replace=true}
      varspecs={{name='x1', summarysubset='sum', columnnames={'x1_sum'}}, 
                {name='x2', agg='max', columnnames={'x2_max'}},
                {name='x3', summarysubset={'sum','std'},
                columnnames={'x3_sum','x3_std'}}}
      savegroupbyraw=true, savegroupbyformat=false, raw=true;
run;
quit;

The VARSPECS parameter might be confusing. It is where you specify the variables that you want to generate statistics for, which statistics to generate, and what the resulting column should be called. Check the documentation: depending on the desired statistic, you need to use either SUMMARYSUBSET or AGG arguments.

If you are using the GROUPBY parameter, you most likely want to specify the SAVEGROUPBYRAW=TRUE parameter. Otherwise, you must list every GROUPBY variable in the VARSPECS parameter. Also, the SAVEGROUPBYFORMAT=FALSE parameter prevents the output from containing _f versions (formatted versions) of all of the GROUPBY variables.

DATAPREPROCESS.RUSTATS

The RUSTATS action, in the Data Preprocess action set, computes univariate statistics, centralized moments, quantiles, and frequency distribution statistics. This action is extremely useful when you need to calculate percentiles. If you ask for percentiles from a procedure, all of the data will be moved to the compute server and processed there, not on the CAS server.
This example has an extra step. Actions require a list of variables, which can be cumbersome when you want to generate summary statistics for more than a handful of variables. Macro variables are a handy way to insert a list of strings, variable names in this case, without having to enter all of the names yourself. The SQL procedure step generates a macro variable containing the names of all of the numeric variables. The macro variable is referenced in the INPUTS parameter.
The RUSTATS action has TABLE and INPUTS parameters like the previous actions. The REQUESTPACKAGES parameter is the parameter that allows for a request for percentiles.
The example also contains a bonus action, TRANSPOSE.TRANSPOSE. The goal is to have a final table, mydata_rustats2, with a structure like PROC MEANS would generate. The tricky part is the COMPUTEDVARSPROGRAM parameter.
The table generated by the RUSTATS action has a column called _Statistic_ that contains the name of the statistic. However, it contains “Percentile” multiple times. A different variable, _Arg1_, contains the value of the percentiles (1, 10, 20, and so on). The values of _Statistic_ and _Arg1_ need to be combined, and that new combined value generates the new variable names in the final table.
The COMPUTEDVARS parameter specifies that the name of the new variable will hold the concatenation of _Statistic_ and _Arg1_. The COMPUTEDVARSPROGRAM parameter tells CAS how to create the values for NEWID. The NEWID value is then used in the ID parameter to make the new variable names—pretty cool!

proc sql noprint;
   select quote(strip(name)) into :numvars separated by ','
   from dictionary.columns
   where libname='MYUSER' and memname='MYDATA' and type='num';
quit;
 
proc cas;
   dataPreprocess.rustats /
      table={name="mydata", caslib="casuser"}
      inputs={&numvars}
      requestpackages={{percentiles={1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99},
                        scales={"std"}}}
      casoutstats={name="mydata_rustats", caslib="casuser"};
 
   transpose.transpose /
      table={caslib='casuser', name="mydata_rustats", groupby={"_variable_"},
             computedvars={{name="newid", format="$20."}},
             computedvarsprogram="newid=strip(_statistic_)||compress(strip(_arg1_),'.-');"}
      transpose={"_value_"}
      id={"newid"}
      casOut={caslib='casuser', name="mydata_rustats2", replace=true};
run;
quit;

Here is a small portion of the final table. Remember, you can use the TABLE.FETCH action to view the table contents as well.

Summary

Summarizing numeric data is an important step in analyzing your data. CASL provides multiple actions that generate summary statistics. This blog provided a quick overview of three of those actions: SIMPLE.SUMMARY, AGGREGATION.AGGREGATE, and DATAPREPROCESS.RUSTATS.
The wonderful part of so many choices is that you can decide which one best fits your needs. Summarizing your data with actions also ensures that all of the processing occurs on the CAS server and that you are taking full advantage of its capabilities.
Be sure to use the DROPTABLE action to delete any tables that you do not want taking up space in memory:

proc cas;
	table.droptable / caslib='casuser' name='mydata' quiet=true;
	table.droptable / caslib='casuser' name='mydata_summary' quiet=true;
	table.droptable / caslib='casuser' name='mydata_aggregate' quiet=true;
	table.droptable / caslib='casuser' name='mydata_rustats' quiet=true;
	table.droptable / caslib='casuser' name='mydata_rustats2' quiet=true;
quit;
cas casauto terminate;

Summarization in CASL was published on SAS Users.

August 7, 2019
 
2-D binning of counts

Do you want to bin a numeric variable into a small number of discrete groups? This article compiles a dozen resources and examples related to binning a continuous variable. The examples show both equal-width binning and quantile binning. In addition to standard one-dimensional techniques, this article also discusses various techniques for 2-D binning.

SAS procedures that support binning include the HPBIN, IML, KDE, RANK, and UNIVARIATE procedures.

Equal-width binning in SAS

The simplest binning technique is to form equal-width bins, which is also known as bucket binning. If a variable has the range [Min, Max] and you want to split the data into k equal-width bins (or buckets), each bin will have width (Max - Min) / k.
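
As a minimal sketch, PROC HPBIN performs bucket binning. The choice of the SASHELP.CARS data and the MPG_City variable here is illustrative, not from the original article:

/* Bucket (equal-width) binning: split MPG_City into 5 bins of equal width */
proc hpbin data=sashelp.cars output=cars_binned numbin=5 bucket;
   input mpg_city;
run;
/* CARS_BINNED contains the bin assignment for each observation */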

Quantile binning in SAS

In bucket binning, some bins have more observations than others. This enables you to estimate the density of the data, as in a histogram. However, you might want all bins to contain about the same number of observations. In that case, you can use quantiles of the data as cutpoints. If you want four bins, use the 25th, 50th, and 75th percentiles as cutpoints. If you want 10 bins, use the sample deciles as cutpoints. Here are several resources for quantile binning:
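
Alongside those resources, here is a minimal sketch of quantile binning with PROC RANK, again using SASHELP.CARS as an illustrative data set:

/* Quantile binning: assign each observation to one of 4 equal-count bins */
proc rank data=sashelp.cars out=cars_q groups=4;
   var mpg_city;
   ranks bin;     /* bin takes the values 0, 1, 2, and 3 */
run;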

Binning by using arbitrary cutpoints in SAS

Sometimes you need to bin based on scientific standards or business rules. For example, the Saffir-Simpson hurricane scale uses specific wind speeds to classify a hurricane as Category 1, Category 2, and so forth. In these cases, you need to be able to define custom cutpoints and assign observations to bins based on those cutpoints.
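
One idiomatic way to implement custom cutpoints is a user-defined format. The sketch below uses the Saffir-Simpson category boundaries (wind speeds in knots):

/* Custom cutpoints via a user-defined format (wind speed in knots) */
proc format;
   value hurricane
      64 - 82    = 'Category 1'
      83 - 95    = 'Category 2'
      96 - 112   = 'Category 3'
      113 - 136  = 'Category 4'
      137 - high = 'Category 5';
run;

You can then assign observations to bins by associating the format with a wind-speed variable (for example, format wind_kts hurricane.; in a DATA step or procedure).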

2-D binning and bivariate histograms in SAS

A histogram is a visualization of a univariate equal-width binning scheme. You can perform similar computations and visualizations for two-dimensional data. If your goal is to understand the density of continuous bivariate data, you might want to use a bivariate histogram rather than a scatter plot (which, for large samples, suffers from overplotting).
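
As a sketch of the computation, you can compute a bin index in each dimension and then count the observations in each cell. The data set, variable ranges, and grid size below are arbitrary choices:

/* 2-D equal-width binning: bin indices in each dimension, then cell counts.
   The ranges (10-60 mpg, 70-500 hp) and the 10 x 10 grid are assumptions. */
data bins2d;
   set sashelp.cars;
   kx = 10; ky = 10;
   binX = max(min(floor(kx * (mpg_city   - 10) / (60  - 10)) + 1, kx), 1);
   binY = max(min(floor(ky * (horsepower - 70) / (500 - 70)) + 1, ky), 1);
run;

proc freq data=bins2d noprint;
   tables binX*binY / out=counts2d;   /* cell counts for a bivariate histogram */
run;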

In summary, this guide provides many links to programs and examples that bin data in SAS. Whether you want to use equal-width bins, quantile bins, or two-dimensional bins, hopefully, you will find an example to get you started. If I've missed an important topic, or if you have a favorite binning method that I have not covered, leave a comment.

The post The essential guide to binning in SAS appeared first on The DO Loop.