May 25, 2018
 

Have you ever tried to plot data on a map of Antarctica ... and been thoroughly frustrated or confused?!? If you're that person, or even a seasoned map maker wanting to hone your skills, then this blog post is for you! But first, here is a picture to get you [...]

The post Plotting data on Antarctica - a mapping challenge! appeared first on SAS Learning Post.

May 24, 2018
 

Poverty. It's a bit difficult to define who lives in poverty - I guess it's a relative thing, and depends on the standard of living of the people around you. Today we're going to take a look at the child poverty rates in several 'rich' countries (such as the United [...]

The post Comparing child poverty in 26 rich countries appeared first on SAS Learning Post.

May 23, 2018
 

In August 2017, Britta Gross spoke about General Motors’ perspective on bringing electric vehicles (and their derivatives) to market. Her point of view reaffirmed GM's research on consumer awareness of electric vehicles (only 60 percent) and consumer adoption concerns with this emerging technology. She also revealed the portfolio of cars [...]

Could analytics improve electric vehicle adoption? was published on SAS Voices by Lonnie Miller

May 23, 2018
 

Path analysis is an exploration of a chain of consecutive events that a given user or cohort performs during a set period while using a website, online game or mobile app (although other use cases can apply outside of digital analytics). As a subset of behavioral analytics, path analysis is [...]

SAS Customer Intelligence 360: Path analysis for re-engagement was published on Customer Intelligence Blog.

May 23, 2018
 
Figure: Butterfly plot of cholesterol by gender in SAS

This article shows how to construct a butterfly plot in SAS. A butterfly plot (also called a butterfly chart) is a comparative bar chart or histogram that displays the distribution of a variable for two subpopulations. A butterfly plot for the cholesterol readings of 5,057 patients in a medical study is shown above, where the distribution for males is displayed on the left side of the plot and the distribution for females is displayed on the right. The main contribution of this article is showing how to bin a continuous variable in SAS to form a butterfly chart.

The butterfly plot is similar to a comparative histogram because both enable you to compare the distribution of a continuous variable for subpopulations. The comparative histogram uses a panel to visualize several subpopulations, where each row of the panel represents a level of a classification variable. In contrast, the butterfly plot is limited to two levels and displays the distributions back-to-back.

Bin a continuous variable for each classification level

In a previous blog post, I constructed a butterfly chart that compares voice versus text usage by decade of age for cell phones. Similarly, there is a SAS Sample that shows how to create a butterfly chart for types of cancers by gender. In both of these examples, the butterfly chart is a comparative bar chart because the distribution shown is for a discrete variable ("decade of age" or "type of cancer"). This section shows how to start with a continuous variable (cholesterol) and bin it into intervals. You can then use the previous techniques to visualize the counts in each interval for each gender.

You can bin a continuous variable by using the BIN and TABULATE functions in SAS/IML or by using the OUTHIST= option on the HISTOGRAM statement in PROC UNIVARIATE. The following statements create an output data set (OutHist) that contains the counts of males and females whose cholesterol readings fall within bins of width 20. The bin width and the centers of the intervals are chosen automatically by PROC UNIVARIATE, but you can use the MIDPOINTS= option (shown in the comments) to control the placement of the intervals.

proc univariate data=Sashelp.Heart;
   class Sex;
   var Cholesterol;
   histogram cholesterol / nrows=2 outhist=OutHist 
                          /* midpoints=(80 to 560 by 40) */ /* control bin widths and locations */
                           odstitle="Cholesterol by Gender";
   ods select histogram;
run;

Figure: Comparative histogram in SAS: Cholesterol by Gender
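
The SAS/IML route mentioned earlier (the BIN and TABULATE functions) can be sketched as follows for one gender. This is a minimal sketch that assumes a SAS/IML license. The cut points (70 to 570 by 20) are an assumption chosen to cover the data; they do not necessarily match the bins that PROC UNIVARIATE selects automatically.

```sas
proc iml;
/* read the cholesterol values for one level of the classification variable */
use Sashelp.Heart where(Sex="Female");
read all var {Cholesterol} into x;
close Sashelp.Heart;

x = x[loc(x ^= .)];                  /* drop missing values */
cutpts = do(70, 570, 20);            /* assumed cut points; bins of width 20 */
b = bin(x, cutpts);                  /* index of the bin that contains each value */
call tabulate(binIdx, count, b);     /* distinct bin indices and their counts */

midpt = colvec(cutpts[binIdx]) + 10; /* left endpoint plus half the bin width */
count = colvec(count);
print midpt count;
quit;
```

TABULATE reports only the bin indices that occur in the data, so empty bins do not appear in the printed table. Repeating the steps with Sex="Male" gives the second set of counts.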

Create a butterfly plot in SAS

The OutHist data set is in "long form." You need to convert it to "wide form" in order to construct a butterfly plot. The following SAS code performs these tasks:

  • Use a DATA step and WHERE clauses to convert the data from long to wide format.
  • Multiply the counts for the males by -1. The negative counts will be plotted on the left side of the butterfly plot.
  • Define a format that will display the absolute values of the counts. The axis for the formatted variables contains zero in the middle and increases in both directions.
  • Use two HBAR statements to plot the back-to-back bar charts for males and females.

/* convert data from long format to wide format */
data Butterfly;
   keep Cholesterol Males Females;
   label Males= Females= Cholesterol=; /* remove labels */
   merge OutHist(where=(sex="Female") rename=(_COUNT_=Females _MIDPT_=Cholesterol))
         OutHist(where=(sex="Male")   rename=(_COUNT_=Males   _MIDPT_=Cholesterol));
   by Cholesterol;
   Males = -Males;                     /* trick: reverse the direction of male counts */
run;
 
/* define format that displays the absolute value of a number */
proc format;
   picture positive low-<0="000,000"
   0<-high="000,000";
run;
 
ods graphics / reset;
title "Butterfly Plot of Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   hbar Cholesterol / response=Males   legendlabel="Males";
   hbar Cholesterol / response=Females legendlabel="Females";
   xaxis label="Count" grid 
         min=-520 max=520 values=(-500 to 500 by 100) valueshint;
   yaxis label="Cholesterol" discreteorder=data;
run;

The graph is shown at the top of this article. You can see that the mode of the distribution is higher for males, and the distribution for males also has a longer tail.
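
To see why the negated male counts still appear as positive numbers on the axis, you can apply the picture format to a few test values. The following is a small sketch; the test values are arbitrary and are not taken from the data.

```sas
/* same format as in the article: a picture format displays the digits of the
   absolute value unless a PREFIX= option adds a sign */
proc format;
   picture positive low-<0="000,000"
                    0<-high="000,000";
run;

data _null_;
   do x = -1234, -50, 50, 1234;      /* arbitrary test values */
      put x= best8. " is displayed as " x positive.;
   end;
run;
```

The log should show the negative and positive values formatted identically, which is what lets the axis read as increasing counts in both directions.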

A butterfly fringe plot

The butterfly plot is usually displayed with horizontal bars, as shown. However, you could use the VBAR statement to get a rotated version of the butterfly plot. As mentioned earlier, you can also use the MIDPOINTS= option on the HISTOGRAM statement to change the width of the histogram bins.
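
A rotated version might look like the following sketch. It assumes that the Butterfly data set and the POSITIVE. format from the previous section already exist; because the bars are vertical, the axis options from the earlier example move from the XAXIS statement to the YAXIS statement and vice versa.

```sas
title "Rotated Butterfly Plot of Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   vbar Cholesterol / response=Males   legendlabel="Males";    /* bars point down */
   vbar Cholesterol / response=Females legendlabel="Females";  /* bars point up */
   yaxis label="Count" grid
         min=-520 max=520 values=(-500 to 500 by 100) valueshint;
   xaxis label="Cholesterol" discreteorder=data;
run;
```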

One useful variation in the butterfly plot is to use a very small bin width and replace the bar chart with a high-low plot. This creates a graph that I call a butterfly fringe plot. Recall that the usual fringe plot (also called a "rug plot") places a tick mark on an axis to show the distribution of data values. A fringe plot can suffer from overplotting when more than one observation has the same value. With a butterfly fringe plot, some ticks are higher than others (to represent repeated values) and bars that point up represent one binary value whereas bars that point down represent the other. The butterfly fringe plot provides a more complete visualization of the distribution of data for two levels of a response or classification variable.

ods graphics / width=640px height=180px;
title "Butterfly Fringe Plot: Cholesterol Counts By Gender";
proc sgplot data=Butterfly;
   format Males Females positive.;
   highlow x=Cholesterol low=Males high=Females;
   refline 0 / axis=y;
   inset "Males"   / position=BottomLeft;
   inset "Females" / position=TopLeft;
   yaxis label="Count" grid;
   xaxis label="Cholesterol" discreteorder=data;
run;

Figure: Butterfly fringe plot of cholesterol by gender in SAS

The butterfly fringe plot was created by using a bin width of 5 for the cholesterol variable. That is, midpoints=(50 to 560 by 5).
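
For completeness, the binning and reshaping steps can be rerun with the narrower bins. The following sketch reuses the earlier code with the MIDPOINTS= list given above. The data set names OutHist5 and Butterfly5 are hypothetical, chosen here so that the earlier data sets are not overwritten.

```sas
/* bin the cholesterol values by using intervals of width 5 */
proc univariate data=Sashelp.Heart;
   class Sex;
   var Cholesterol;
   histogram Cholesterol / nrows=2 midpoints=(50 to 560 by 5) outhist=OutHist5;
   ods select histogram;
run;

/* convert from long to wide format and negate the male counts, as before */
data Butterfly5;
   keep Cholesterol Males Females;
   label Males= Females= Cholesterol=; /* remove labels */
   merge OutHist5(where=(Sex="Female") rename=(_COUNT_=Females _MIDPT_=Cholesterol))
         OutHist5(where=(Sex="Male")   rename=(_COUNT_=Males   _MIDPT_=Cholesterol));
   by Cholesterol;
   Males = -Males;
run;
```

You would then specify DATA=Butterfly5 on the PROC SGPLOT statement that contains the HIGHLOW statement.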

In summary, this article shows how to create a butterfly plot for a continuous variable and a binary classification variable. When you bin the continuous variable, you obtain counts for each interval. You can then graph the counts back-to-back to form the butterfly chart. An interesting variation is the butterfly fringe plot, which combines a butterfly chart and a fringe plot.

The post A butterfly plot for comparing distributions appeared first on The DO Loop.

May 22, 2018
 

SAS Viya is our latest extension of the SAS Platform and is interoperable with SAS® 9.4. Designed to bring analytics to the enterprise, it seamlessly scales for data of any size, type, speed and complexity. It was also a star at this year’s SAS Global Forum 2018. In this series of articles, we will review several of the most interesting SAS Viya talks from the event. Our first installment reviews Hadley Christoffels’ talk, A Need For Speed: Loading Data via the Cloud.

You can read all the articles in this series or check out the individual interviews by clicking on the titles below:
Part 1: Technology that gets the most from the Cloud.


Technology that gets the most from the Cloud

Few would argue about the value the effective use of data can bring an organization. Advancements in analytics, particularly in areas like artificial intelligence and machine learning, allow organizations to analyze more complex data and deliver faster, more accurate results.

However, in his SAS Global Forum 2018 paper, A Need For Speed: Loading Data via the Cloud, Hadley Christoffels, CEO of Boemska, reminded the audience that 80% of an analyst’s time is still spent on the data. Getting insight from your data is where the magic happens, but the real value of powerful analytical methods like artificial intelligence and machine learning can only be realized once the data is loaded. As Christoffels put it, “you shorten the load cycle the quicker you get to value.”

Data management is critical and is still the most common area of investment in analytical software, which makes it a primary responsibility of today’s data scientist. “Before you can get to any value the data has to be collected, has to be transformed, has to be enriched, has to be cleansed and has to be loaded before it can be consumed.”

Benefits of cloud adoption

The cloud can help, to a degree. According to Christoffels, “cloud adoption has become a strategic imperative for enterprises.” The advantages of moving to a cloud architecture are many, but the two greatest are elasticity and scalability.

Elasticity, as Christoffels defines it, allows you to dynamically provision or remove virtual machines (VMs). Scalability refers to increasing or decreasing capacity within existing infrastructure, either by scaling vertically (moving the workload to a bigger or smaller VM) or by scaling horizontally (provisioning additional VMs and distributing the application load between them).

“I can stand up VMs in a matter of seconds, I can add more servers when I need it, I can get a bigger one when I need it and a smaller one when I don’t, but, especially when it comes to horizontal scaling, you need technology that can make the most of it.” Cloud-readiness and multi-threaded processing make SAS® Viya® the perfect tool to take advantage of the benefits of “clouding up.”

SAS® Viya® can address complex analytical challenges and speed up data management processes. “If you have software that can only run on a single instance, then scaling horizontally means nothing to you because you can’t make use of that multi-threaded, parallel environment,” Christoffels said, adding that SAS Viya is one of the technologies that can take advantage of such an environment.

Challenges you need to consider

According to Christoffels, when you move your processing to the cloud, it is important to understand and address existing performance challenges and to determine whether the move will meet your business needs in an agile manner. Inefficiencies on premises are annoying; inefficiencies in the cloud are annoying and costly, since you pay for that resource.

It’s not the best use of the architecture to take what you have on premises and just shift it. “Finding and improving and eliminating inefficiencies is a massive part in cutting down the time data takes to load.”

Boemska, Christoffels’ company, has tools to help businesses find inefficiencies and understand the impact users have on the environment, including:

  1. Real-time diagnostics that look at CPU usage, memory usage, SAS workload, etc.
  2. Insight and comparison, which provide a historical view over a given time frame. This is essential when trying to optimize and shave off costly time when working in the cloud.
  3. Utilization reports to better understand how the platform is used.

Optimizing inefficiencies with SAS Viya

But scaling vertically and horizontally on cloud-based infrastructure to speed up the loading and data management process solves only part of the problem. Christoffels said that SAS Viya capabilities complete the picture and that SAS Viya offers a number of benefits in a cloud infrastructure. Code amendments that make use of the new techniques now available in SAS Viya, such as the multi-threaded DATA step or CAS Action Sets, can be extremely powerful.

One simple example of the benefits of SAS Viya, Christoffels said, is that with in-memory processing, PROC SORT is a procedure that’s no longer needed. SAS Viya does “grouping on the fly,” meaning you can remove sort routines from existing programs, which, in itself, can cut down processing time significantly.
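
The talk summary does not include code, but the idea of "grouping on the fly" can be illustrated with a BY-group DATA step that runs on a CAS table without a preceding PROC SORT. This is a minimal sketch, not Christoffels' example. It assumes a SAS Viya environment with a configured CAS server; the session name (mySess), the libref (mycas), and the table names are hypothetical.

```sas
cas mySess;                /* start a CAS session (hypothetical session name) */
libname mycas cas;         /* CAS engine libref bound to the active session */

data mycas.heart;          /* load a copy of the sample data into CAS */
   set Sashelp.Heart;
run;

/* Count the observations in each group. No PROC SORT precedes the BY
   statement: the DATA step in CAS groups the rows as it runs. */
data mycas.countBySex;
   set mycas.heart;
   by Sex;
   if first.Sex then n = 0;
   n + 1;
   if last.Sex then output;
   keep Sex n;
run;
```

In SAS 9.4, the same BY statement on an unsorted data set would stop with an error, so the program would need a PROC SORT step first.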

For a SAS programmer, several things are “significant,” according to Christoffels: SAS Viya can run multithreaded, you don’t have to do these sorts, grouping is handled on the fly, and multithreaded capability is built into how you deal with tables.

Conclusion

Data preparation and load processes have a direct impact on how quickly applications can begin and subsequently complete. Many organizations are using the Cloud platform to speed up the process, but to take full advantage of the infrastructure you have to apply the right software technology. SAS Viya enables the full realization of Cloud benefits through performance improvements, such as the transposing of data and the transformation of data using the DATA step or CAS Action Sets.

Additional Resources

SAS Global Forum Video: A Need For Speed: Loading Data via the Cloud
SAS Global Forum 2018 Paper: A Need For Speed: Loading Data via the Cloud
SAS Viya
SAS Viya Products


Technology that gets the most from the Cloud was published on SAS Users.

May 22, 2018
 

Andy Dufresne, the wrongly convicted character in The Shawshank Redemption, provocatively asks the prison guard early in the film: “Do you trust your wife?” It’s a dead serious question regarding avoiding taxes on a recent financial windfall that had come the guard's way, and leads to events that eventually win [...]

AI and trust was published on SAS Voices by Leo Sadovy

May 21, 2018
 

In a recent blog post, Chris Hemedinger used a scatter plot to show the result of 100 coin tosses. Chris arranged the 100 results in a 10 x 10 grid, where the first 10 results were shown on the first row, the second 10 were shown on the second row, and so on. Placing items along each row before going to the next row is called row-major order.

An implicit formula for arranging items in rows

If you process items sequentially, it is easy to position the items in a grid by using an inductive scheme:

  1. Place the first item at (1, 1).
  2. Assume the nth item is placed at position (r, c). Place the (n+1)st item at position (r, c+1) if there is room on the current row; otherwise, place it at (r+1, 1), which is the first element of the next row.

The inductive scheme is also called an implicit or recursive formula because the position of the (n+1)st item is given in terms of the position of the nth item.

For example, suppose that you have 70 items and you want to place 11 items in each row. The inductive algorithm looks like the following:

%let Nx = 11;           /* number of items in row */
data Loc;
label r = "Row" c = "Column";
retain r 1  c 1 item 1;
output;                 /* base case */
do item = 2 to 70;      /* inductive step */
   c + 1;
   if c > &Nx then do;
      r + 1; c = 1;
   end;
   output;
end;
run;
 
title "Position of Items in Grid";
proc sgplot data=Loc;
   text x=c y=r text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05;
   yaxis reverse offsetmin=0.05 offsetmax=0.05;
run;

Figure: Items arranged in a grid in row-major order with 11 items in each row

The inductive algorithm is easy to implement and to understand. However, it does not enable you to easily determine the row and column of the 1,234,567th item if there are 11 items in each row. Nor does it enable you to compute the positions when the index increments by a value greater than 1. To answer these questions, you need to use an explicit or direct formula.

An explicit formula for arranging items in rows

The explicit formula uses the MOD function to compute the column position and integer division to compute the row position. SAS does not have an explicit "integer division operator," but you can emulate it by using the FLOOR function. The following macro definitions encapsulate the formulas:

/* (row, col) for item n if there are Nx items in each row (count from 1),
   assuming row-major order */
%macro ColPos(n, Nx);
   1 + mod(&n.-1, &Nx.)
%mend;
%macro RowPos(n, Nx);
   1 + floor((&n.-1) / &Nx.)
%mend;

The formulas might look strange because they subtract 1, do a calculation, and then add 1. These formulas assume that you want to count the items, rows, and columns beginning with 1. If you prefer to count from 0, the formulas become MOD(n, Nx) and FLOOR(n/Nx).
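
As a quick check, the explicit formulas answer the earlier question about the 1,234,567th item when there are 11 items in each row. The following sketch repeats the macro definitions so that it runs on its own; the variable names in the DATA step are arbitrary.

```sas
%macro ColPos(n, Nx);
   1 + mod(&n.-1, &Nx.)
%mend;
%macro RowPos(n, Nx);
   1 + floor((&n.-1) / &Nx.)
%mend;

data _null_;
   n = 1234567;
   row = %RowPos(n, 11);   /* 1 + floor(1234566/11) = 1 + 112233 */
   col = %ColPos(n, 11);   /* 1 + mod(1234566, 11)  = 1 + 3 */
   put row= col=;          /* writes row=112234 col=4 to the log */
run;
```

The arithmetic is easy to confirm by hand: 11 × 112,233 = 1,234,563, which leaves a remainder of 3.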

You can use the formulas to directly compute the position of the odd integers in the range 1–70 when there are 11 items on each row:

%let Nx = 11;
data grid;
do item = 1 to 70 by 2;       /* only odd integers */
   row = %RowPos(item, &Nx);
   col = %ColPos(item, &Nx);
   output;
end;
run;
 
title "Position of Odd Integers in Grid";
proc sgplot data=grid;
   text x=col y=row text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05 label="Column" max=&Nx;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer label="Row";
run;

Figure: Positions of odd integers in a grid in row-major order

Of course, you can also use the direct formula to process items incrementally. The following DATA step computes the positions for the 19 observations in the Sashelp.Class data set, where five names are placed in each row:

data gridName;
set sashelp.class;
y = %RowPos(_N_, 5);  /* 5 columns in each row */
x = %ColPos(_N_, 5);
run;
 
title "Position Five Names in Each Row";
proc sgplot data=gridName;
   text x=x y=y text=Name / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.08 offsetmax=0.08;
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer;
run;

Figure: Positions of items in a grid with 5 items in each row

The explicit formula is used in the SAS/IML NDX2SUB function, which tells you the row and column information for the nth item in a matrix.
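
The following sketch compares NDX2SUB with the explicit formulas for a single index. It assumes a SAS/IML license; the grid dimensions (7 rows, 11 columns) and the index 25 are arbitrary choices for illustration.

```sas
proc iml;
dims = {7 11};                 /* assumed grid: 7 rows, 11 items in each row */
n = 25;                        /* arbitrary item index */
s = ndx2sub(dims, n);          /* (row, col) subscripts from NDX2SUB */
/* the explicit row-major formulas, written in matrix syntax */
formula = (1 + floor((n-1)/dims[2])) || (1 + mod(n-1, dims[2]));
print s, formula;              /* the two results should agree */
quit;
```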

In summary, you can use an implicit formula or an explicit formula to arrange items in rows, where each row contains Nx items. The implicit formula is useful when you are arranging the items sequentially. The explicit formula is ideal when you are randomly accessing the items and you need a direct computation that provides the row and column position.

Finally, if you want to arrange items in column-major order (down the first column, then down the second, and so on), you can use similar formulas. The row position of the nth item is 1 + mod(n-1, Ny) and the column position is 1 + floor((n-1) / Ny), where Ny is the number of rows in the grid.
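
The column-major formulas can be sketched in the same style as the earlier examples. The value Ny = 7 and the data set name gridCM are assumptions made for illustration.

```sas
%let Ny = 7;                  /* assumed number of rows in the grid */
data gridCM;
do item = 1 to 70;
   row = 1 + mod(item-1, &Ny);          /* move down the current column */
   col = 1 + floor((item-1) / &Ny);     /* start a new column every Ny items */
   output;
end;
run;

title "Position of Items in Column-Major Order";
proc sgplot data=gridCM;
   text x=col y=row text=item / textattrs=(size=12) position=center strip;
   xaxis integer offsetmin=0.05 offsetmax=0.05 label="Column";
   yaxis reverse offsetmin=0.05 offsetmax=0.05 integer label="Row";
run;
```

With 70 items and 7 rows, the items fill 10 columns, with items 1–7 in the first column and items 64–70 in the last.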

The post Position items in a grid appeared first on The DO Loop.