Sep 05, 2018
 

Are you curious? Do you have a passion for science, technology, engineering and math (STEM)? Do you enjoy robotics or statistics? Do you like to solve hard problems? If you answered yes to any of the above, you might have what it takes to be a data scientist. Recently, I [...]

Why should you learn data science? Two students’ perspectives was published on SAS Voices by Georgia Mariani

Sep 05, 2018
 

Typically, when filters are applied in SAS Visual Analytics, they affect all the records and aggregations in linked objects. For example, in the typical sales report below, applying a filter changes all the measures of the linked objects.

With this kind of filtering, it becomes difficult to calculate measures that require a different level of aggregation. In the image above, the expectation is that ‘Total Customers’ should not change regardless of the ‘Region’, ‘State’, ‘Category’ and ‘Subcategory’ control selections. ‘Total Customers (Geo)’ should change based only on the ‘Region’ and ‘State’ control selections. ‘Total Customers (Geo and Prod)’ should change based on all of the controls mentioned above. In the example above, only the ‘Total Customers (Geo and Prod)’ calculation is correct.

We will learn to create measures with different levels of aggregation by using the ‘Customer Penetration’ measure as an example.

          Customer Penetration = (Distinct customers at selected geography and product level) / (Distinct customers at selected geography level)
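
For example (hypothetical numbers): if 2,000 distinct customers made a purchase in the selected region and state, and 500 of them bought the selected category, the customer penetration is 500 / 2,000 = 25%.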

Selective filtering may be used to create similar reports, such as dealer participation or sales contribution. The section below walks through the creation of a customer penetration report with selective filtering.

Customer penetration using SAS Visual Analytics 8.2 (selective filtering)

Customer penetration is used to analyze whether marketing and sales strategies are working. Managers often use customer penetration or dealer participation measures, along with other measures, to gauge the popularity of a product, category or brand.

The report requirement here is that the numerator of the ‘Customer Penetration’ formula should be filtered based on the region and state list control selections, while the denominator should be filtered based on the region, state, category and subcategory list control selections. This is not the same as filtering the whole table through common list controls: in general, if you link a table with a control, all the measures in that table are filtered by the value(s) selected in the control. Instead of linking controls and tables, we will use control parameters to achieve our objective.

Assume we have a customer transaction table whose variables include ‘Customer_ID’, ‘Region’, ‘State’, ‘Category’ and ‘SubCategory’.

Before we move on, be ready with the basic report as shown in the image below:

Once you are ready with the report shown in the image above, create parameters for ‘Region’, ‘State’, ‘Category’ and ‘SubCategory’:

Region Parameter

State Parameter

Category Parameter

SubCategory Parameter

Now create the following two calculated items derived from ‘Customer_ID’:

Geo_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography levels; the remaining rows are filled with missing values.
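
The calculated item itself is defined in the VA expression editor (the original screenshots are not reproduced here). As a rough DATA step sketch of the equivalent logic, with hypothetical table names and hypothetical macro variables pRegion and pState standing in for the ‘Region’ and ‘State’ control parameters (single-value selections assumed for simplicity):

data transactions_geo;              /* hypothetical output table */
   set transactions;                /* hypothetical transaction table */
   length Geo_Customer_ID $ 32;
   /* keep Customer_ID only for rows that match the selected geography */
   if Region = "&pRegion" and State = "&pState" then
      Geo_Customer_ID = Customer_ID;
   else
      Geo_Customer_ID = " ";        /* character missing value */
run;

‘Geo_and_Prod_Customer_ID’ below follows the same pattern, with conditions for the ‘Category’ and ‘Subcategory’ parameters added.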

Geo_and_Prod_Customer_ID
Equivalent to ‘Customer_ID’, but populated only for the selected geography and product levels; the remaining rows are filled with missing values.

Create the following two aggregated measures:

Total Customers (Geo)
This is the distinct count of ‘Geo_Customer_ID’ minus 1; the subtraction removes the count contributed by the missing value.

Total Customers (Geo and Prod)
This is the distinct count of ‘Geo_and_Prod_Customer_ID’ minus 1; again, the subtraction removes the count contributed by the missing value.
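
For reference, a rough PROC SQL sketch of the same two aggregations, assuming both calculated columns exist in the hypothetical transactions_geo table from the DATA step sketch above. Note that COUNT(DISTINCT ...) in PROC SQL excludes missing values, so no subtraction is needed there; the VA Distinct aggregation counts the missing level, hence the minus 1 in the measures above:

proc sql;
   select count(distinct Geo_Customer_ID)          as Total_Customers_Geo,
          count(distinct Geo_and_Prod_Customer_ID) as Total_Customers_Geo_and_Prod
   from transactions_geo;   /* hypothetical table from the sketch above */
quit;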

Now you can create an aggregated measure ‘Customer Penetration’.

Customer Penetration = Total Customers (Geo and Prod) / Total Customers (Geo)

The final report will look like this:

Comparative images with default and selective filtering implementation:

If you compare the two images, you will see the difference in the highlighted measures: in the first image the aggregation level reflects selective filtering, while in the second image the aggregation level is uniform.

Note – ‘Total Customers’ is the count of distinct ‘Customer_ID’ values, i.e., the total customer count is independent of the geography and product hierarchy selections.

Conclusion

This process shows how to use control parameters in ‘If Then Else…’ statements to create a variable (calculated item) with character values. You can utilize this feature in several other applications – this is just one way to use parameters to fulfil a business requirement.

Selective filtering in SAS Visual Analytics 8.2 was published on SAS Users.

Sep 05, 2018
 

Hybrid computers that marry CPUs and devices like GPUs and FPGAs are the fastest computers, but they are hard to program. This post explains how deep learning (DL) greatly simplifies programming hybrid computers.

The post Speeding Up Your Analytics with Machine Learning (Part 1) appeared first on SAS Learning Post.

Sep 04, 2018
 

In the SAS/IML language, you can only concatenate vectors that have conforming dimensions. For example, to horizontally concatenate two vectors X and Y, the symbols X and Y must have the same number of rows. If not, the statement Z = X || Y will produce an error: ERROR: Matrices do not conform to the operation.
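
For example, the following vectors (made up for illustration) do not conform for horizontal concatenation because they have different numbers of rows:

proc iml;
x = {1, 2};      /* 2 x 1 column vector */
y = {1, 2, 3};   /* 3 x 1 column vector */
/* z = x || y;      uncommenting this line produces:
   ERROR: Matrices do not conform to the operation. */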

The other day I wanted to concatenate multiple vectors of different sizes into a single matrix. I decided to use missing values to pad the "short" vectors. To make the operation reusable, I developed an IML module that accepts up to 16 arguments, 15 of them optional. The module uses a few advanced techniques. This article presents the module and discusses programming techniques that you can use to systematically process all arguments passed to a SAS/IML module. Even if you never need to use the module, the techniques that implement it are useful.

I wrote the module because I wanted to store multiple statistics of different sizes in a single matrix. For example, you can use this technique to store the data, parameter estimates, and the result of a hypothesis test. You can use a list to store the values, but for my application a matrix was more convenient. I've used a similar construction to store an array of matrices in a larger matrix.

A module to concatenate vectors of different lengths

Although "parameter" and "argument" are often used interchangeably, for this article I will try to be consistent by using the following definitions. A parameter is a local variable in the declaration of the function. An argument is the actual value that is passed to the function. Parameters are known at compile time and never change; arguments are known at run time and are potentially different every time the function is called.

Let's first see what the module does, then we'll discuss how it works. The module is defined as follows:

proc iml;   
/* Pack k vectors into a matrix, 1 <= k <= 16. Each vector becomes a 
   column. The vectors must be all numeric or all character.        */
start MergeVectors(X1,   X2=,  X3=,  X4=,  X5=,  X6=,  X7=,  X8=,  
                   X9=, X10=, X11=, X12=, X13=, X14=, X15=, X16= );
   ParmList = "X1":"X16";      /* Names of params. Process in order. */
   done = 0;                   /* flag. Set to 1 when empty parameter found */
   type = type(X1);            /* type of first arg; all args must have this type */
   maxLen = 0;                 /* for character args, find longest LENGTH */
   N = 0;                      /* find max number of elements in the args */
 
   /* 1. Count args and check for consistent type (char/num). Stop at first empty arg */
   do k = 1 to ncol(ParmList) until(done);
      arg = value( ParmList[k] );    /* get value from name */
      done = IsEmpty(arg);           /* if empty matrix, then exit loop */
      if ^done then do;              /* if not empty matrix... */
         if type(arg)^= type then    /* check type for consistency */
            STOP "ERROR: Arguments must have the same type";
         maxLen = max(maxLen,nleng(arg));   /* save max length of char matrices */
         N = max(N,prod(dimension(arg)));   /* save max number of elements */
      end;
   end;
 
   numArgs = k - 1;                  /* How many args were supplied? */
   if type="N" then init = .;        /* if args are numeric, use numeric missing */
   else init = BlankStr(maxLen);     /* if args are character, use character missing */
   M = j(N, numArgs, init);          /* allocate N x p matrix of missing values */
 
   /* 2. Go through the args again. Fill i_th col with values of i_th arg */
   do i = 1 to numArgs;              
      arg = value( ParmList[i] );    /* get i_th arg */
      d = prod(dimension(arg));      /* count number of elements */
      M[1:d,i] = arg[1:d];           /* copy into the i_th column */
   end;
   return( M );                      /* return matrix with args packed into cols */
finish;
 
/* test the module */
M = MergeVectors(-1:1, T(1:4),  0:1);
print M;

In the example, the function is passed three arguments of different sizes. The function returns a matrix that has three columns; the i-th column contains the values of the i-th argument. The matrix contains four rows, which is the number of elements in the second argument. Missing values pad the columns that have fewer than four elements.
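
Tracing the module logic by hand, the test call above should print a 4 x 3 matrix in which the three arguments appear as columns, padded with missing values:

              M
        -1    1    0
         0    2    1
         1    3    .
         .    4    .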

How to process all arguments to a function

The module in the previous section is defined to take one required parameter and 15 optional parameters. The arguments to the function are processed by using the following techniques:

  • The ParmList vector contains the names of the parameters: ParmList = "X1":"X16".
  • The module loops over the parameters and uses the VALUE function to get the value of the argument that is passed in (see the snippet after this list).
  • The ISEMPTY function detects the first parameter that was not given a value, which tells the module how many arguments were passed in. At the same time, the module finds the maximum number of elements among the arguments.
  • The module allocates a result matrix, M, that initially contains all missing values. The module iterates over the arguments and fills the columns of M with their values.
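
As a minimal, standalone illustration of the VALUE-function technique (the variable names here are arbitrary):

proc iml;
x = {1 2 3};
name = "x";        /* the name of a matrix, stored as a string */
y = value(name);   /* y is a copy of the matrix whose name is "x" */
print y;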

In summary, it is sometimes useful to pack several vectors of different sizes into a single matrix. But even if you never need this functionality, the structure of this module shows how to process an arbitrary number of optional arguments where each argument is processed in an identical manner.

The post Store vectors of different lengths in a matrix appeared first on The DO Loop.

Aug 31, 2018
 

The Western Users of SAS Software 2018 conference is coming to Sacramento, CA, September 5-7.  I have been to a lot of SAS conferences, but WUSS is always my favorite because it is big enough for me to learn a lot, but small enough to be really friendly.

On Wednesday, I will once again present SAS Essentials, a whirlwind introduction to SAS programming in just three hours specially designed for people who are new to SAS.

If you come, I hope you will catch my presentations. If you want a preview, or if you can’t come, click the links below to download the papers.

How SAS Thinks: SAS Basics I

Introduction to DATA Step Programming: SAS Basics II

Introduction to SAS Procedures: SAS Basics III

I hope to see you there!

 

Aug 31, 2018
 

If you're good at games like Wheel of Fortune, Scrabble, or Words with Friends, you've probably figured out that certain letters appear more often than others. But do you have a cool way to figure out which letters appear most & least frequently? How about using a computer to plot [...]

The post Which keyboard keys do you use most frequently? appeared first on SAS Learning Post.

Aug 30, 2018
 

A combination of SAS Grid Manager and SAS Viya can change the game for IT leaders looking to take on peak computing demands without sacrificing reliability or driving higher costs.  Maybe that’s why we fielded so many questions about SAS Grid Manager and SAS Viya in our recent webinar  about how the two can work together to process massive volumes of data – fast.

Participants asked us so many great questions that we wanted to share the answers here, assuming that you may have the same questions.  This is the first of two blog posts focusing on some of the very best questions we received.  Stay tuned for more soon – and if you don’t see your own burning questions posed here, just post your question in the comments and we’ll respond.

1. Do SAS Grid Manager and SAS Viya need to be collocated in the same data center?

And if they’re in different data centers, can SAS Grid Manager and SAS Viya communicate with one another? Functionally, separate data centers shouldn’t be a concern: if there’s network connectivity between the two, they can communicate. Just be mindful of a few things:

  • The size of data being processed in each environment.
  • How much data is going back and forth between the two.
  • The impact all that data movement can have on response times and overall performance.

In addition to performance, there's a greater sensitivity around data handling in physically separated deployments. The emergence of stricter data protection regulations increases the complexity of compliance when moving data between locations with different legal jurisdictions. It will be important to consider the additional performance implications of encryption of the transferred data as well. Ultimately, having compute as close as possible to the data it needs results in less complexity and better performance.

In this case it is important to remember the old saying, “just because you can doesn’t mean that you should.” When SAS Grid Manager and SAS Viya are collocated, they can share the same data. To be clear, there are implications of sharing the same data, such as SAS data sets: for example, a data set cannot be open in both a process running on the grid and the SAS Viya analytics server at the same time. If your business processes can accommodate this requirement, then sharing the same physical copies of data in storage may save your organization money and ease compliance efforts. I also cannot sufficiently stress the need to complete a proof of concept with production data volumes and job complexity to compare the performance of hosting SAS Grid Manager and SAS Viya in the same data center versus in geographically separated data centers.

2. Should SAS Viya and SAS Grid Manager 9.4M5 run on the same OS for integration purposes?

From an integration perspective, they can definitely run on different operating systems. In fact, as of SAS 9.4M5, you can access a CAS (Cloud Analytic Services) server from Solaris, AIX, or 64-bit Windows. The CAS server itself will run on Linux – and soon we’ll roll out SAS Viya for Windows Server.

Short version: Crossing operating systems does not present a functional problem.  Those of you who have used SAS in heterogeneous environments know that there are performance implications when processing data that is not native to the running session.  You should carefully consider the performance implications of deploying in a heterogeneous topology before committing to a mixed environment.

3. How do SAS Viya and SAS Grid Manager compare in terms of complexity – particularly in the context of platform administration?

They’re actually very comparable. Some details vary, but in many cases the underlying concepts and the effort required are similar – especially when you reach the level of multi-node administration, keeping multiple hosts patched, and similar issues.

4. SAS Grid Manager requires high-performance storage. Do we need to have that same level of storage (such as IBM General Parallel File System) for SAS Viya? 

No – SAS Viya relies on Cloud Analytic Services (CAS), so it doesn’t have the same storage requirements as SAS Grid Manager. It’s more like what you’d find in a Hadoop environment: the CAS reference architecture is mainly a collection of nodes with local storage, which lets SAS Viya memory-map data to disk so that jobs needing more resources than the total RAM can use the disk cache to continue running. SAS Viya can ingest data serially or in parallel, so customers that can use cost-efficient distributed file systems can move data into CAS in parallel.

Note that the shared file system that is part of an existing SAS Grid Manager environment could be further leveraged as a means to share data between the SAS Grid and SAS Viya environments.

SAS’ recommended I/O throughput for SAS Grid Manager deployments is based upon years of experience with customers who were unsatisfied with the performance of their chosen storage. The resulting best practice is one that minimizes performance complaints and allows customers to process very large data in the timeliest manner. If SAS Viya is deployed with a multi-node analytics server (MPP mode), then a shared file system is required. The SAS Viya Cloud Analytic Services (CAS) server has been designed with Network File System (NFS) in mind.

Our customers get the best value from environments built with a blend of storage solutions, including shared file systems for job/user/application concurrency as well as less expensive distributed storage for workloads that may not require concurrency, such as large machine learning and AI training problems. The latter is where SAS Viya shines.

These were all great questions that we thought deserved more detail than we could offer in a webinar – and there are more!  Soon we’ll post a second set of questions that you can use to inform your work with SAS Grid Manager and SAS Viya.  In the meantime, feel free to post any further questions in the comment section of this post.  We’ll answer them quickly.

4 FAQs about SAS Grid Manager and SAS Viya was published on SAS Users.

Aug 29, 2018
 
[Figure: Local polynomial kernel regression in SAS]

A SAS programmer recently asked me how to compute a kernel regression in SAS. He had read my blog posts "What is loess regression" and "Loess regression in SAS/IML" and was trying to implement a kernel regression in SAS/IML as part of a larger analysis. This article explains how to create a basic kernel regression analysis in SAS. You can download the complete SAS program that contains all the kernel regression computations in this article.

A kernel regression smoother is useful for smoothing data that do not appear to have a simple parametric relationship. The following data set contains an explanatory variable, E, which is the ratio of air to fuel in an engine. The dependent variable is a measurement of certain exhaust gases (nitrogen oxides), which are contributors to the problem of air pollution. The scatter plot to the right displays the nonlinear relationship between these variables. The curve is a kernel regression smoother, which is developed later in this article.

data gas;
label NOx = "Nitric oxide and nitrogen dioxide";
label E = "Air/Fuel Ratio";
input NOx E @@;
datalines;
4.818  0.831   2.849  1.045   3.275  1.021   4.691  0.97    4.255  0.825
5.064  0.891   2.118  0.71    4.602  0.801   2.286  1.074   0.97   1.148
3.965  1       5.344  0.928   3.834  0.767   1.99   0.701   5.199  0.807
5.283  0.902   3.752  0.997   0.537  1.224   1.64   1.089   5.055  0.973
4.937  0.98    1.561  0.665
;

What is kernel regression?

Kernel regression was a popular method in the 1970s for smoothing a scatter plot. The predicted value, ŷ0, at a point x0 is determined by a weighted polynomial least squares regression of data near x0. The weights are given by a kernel function (such as the normal density) that assigns more weight to points near x0 and less weight to points far from x0. The bandwidth of the kernel function specifies how to measure "nearness." Because kernel regression has some intrinsic problems that are addressed by the loess algorithm, many statisticians switched from kernel regression to loess regression in the 1980s as a way to smooth scatter plots.
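
In symbols, for a degree-1 (local linear) fit, the coefficients at x0 minimize a kernel-weighted sum of squares. In LaTeX notation (my notation, consistent with the description above):

$$ \min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{n} K\!\left(\frac{x_i - x_0}{h}\right) \bigl( y_i - \beta_0 - \beta_1 (x_i - x_0) \bigr)^2 $$

where K is the kernel density and h is the bandwidth; the predicted value ŷ0 is the fitted value of this weighted regression at x0.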

Because of the intrinsic shortcomings of kernel regression, SAS does not have a built-in procedure to fit a kernel regression, but it is straightforward to implement a basic kernel regression by using matrix computations in the SAS/IML language. The main steps for evaluating a kernel regression at x0 are as follows:

  1. Choose a kernel shape and bandwidth (smoothing) parameter: The shape of the kernel density function is not very important. I will choose a normal density function as the kernel. A small bandwidth overfits the data, which means that the curve might have lots of wiggles. A large bandwidth tends to underfit the data. You can specify a bandwidth (h) in the scale of the explanatory variable (X), or you can specify a value, H, that represents the proportion of the range of X and then use h = H*range(X) in the computation. A very small bandwidth can cause the kernel regression to fail. A very large bandwidth causes the kernel regression to approach an OLS regression.
  2. Assign weights to the nearby data: Although the normal density has infinite support, in practice, observations farther than five bandwidths from x0 get essentially zero weight. You can use the PDF function to easily compute the weights. In the SAS/IML language, you can pass a vector of data values to the PDF function and thus compute all weights in a single call.
  3. Compute a weighted regression: The predicted value, ŷ0, of the kernel regression at x0 is the result of a weighted linear regression where the weights are assigned as above. The degree of the polynomial in the linear regression affects the result. The next section shows how to compute a first-degree linear regression; a subsequent section shows a zero-degree polynomial, which computes a weighted average.

Implement kernel regression in SAS

To implement kernel regression, you can reuse the SAS/IML modules for weighted polynomial regression from a previous blog post. You use the PDF function to compute the local weights. The KernelRegression module computes the kernel regression at a vector of points, as follows:

proc iml;
/* First, define the PolyRegEst and PolyRegScore modules 
   https://blogs.sas.com/content/iml/2016/10/05/weighted-regression.html 
   See the DOWNLOAD file.   
*/
 
/* Interpret H as "proportion of range" so that the bandwidth is h=H*range(X)
   The weight of an observation at x when fitting the regression at x0 
   is given by the f(x; x0, h), which is the density function of N(x0, h) at x.
   This is equivalent to (1/h)*pdf("Normal", (X-x0)/h) for the standard N(0,1) distribution. */
start GetKernelWeights(X, x0, H);
   return pdf("Normal", X, x0, H*range(X));   /* bandwidth h = H*range(X) */
finish;
 
/* Kernel regression module. 
   (X,Y) are column vectors that contain the data.
   H is the proportion of the data range. It determines the 
     bandwidth for the kernel regression as h = H*range(X)
   t is a vector of values at which to evaluate the fit. By default, t=X.
*/
start KernelRegression(X, Y, H, _t=X, deg=1);
   t = colvec(_t);
   pred = j(nrow(t), 1, .);
   do i = 1 to nrow(t);
      /* compute weighted regression model estimates, degree deg */
      W = GetKernelWeights(X, t[i], H);
      b = PolyRegEst(Y, X, W, deg);    
      pred[i] = PolyRegScore(t[i], b);  /* score model at t[i] */
   end;
   return pred;
finish;

The following statements load the data for exhaust gasses and sort them by the X variable (E). A call to the KernelRegression module smooths the data at 201 evenly spaced points in the range of the explanatory variable. The graph at the top of this article shows the smoother overlaid on a scatter plot of the data.

use gas;  read all var {E NOx} into Z;  close;
call sort(Z);   /* for graphing, sort by X */
X = Z[,1];  Y = Z[,2];
 
H = 0.1;        /* choose bandwidth as proportion of range */
deg = 1;        /* degree of regression model */
Npts = 201;     /* number of points to evaluate smoother */
t = T( do(min(X), max(X), (max(X)-min(X))/(Npts-1)) ); /* evenly spaced x values */
pred =  KernelRegression(X, Y, H, t);                  /* (t, Pred) are points on the curve */

Nadaraya–Watson kernel regression

[Figure: Kernel regression in SAS by using a weighted average]

If you use a degree-zero polynomial to compute the smoother, each predicted value is a locally-weighted average of the data. This smoother is called the Nadaraya–Watson (N-W) kernel estimator. Because the N-W smoother is a weighted average, the computation is much simpler than for linear regression. The following two modules show the main computation. The N-W kernel smoother on the same exhaust data is shown to the right.

/* Nadaraya-Watson kernel estimator, which is a locally weighted average */
start NWKerReg(Y, X, x0, H);
   K = colvec( pdf("Normal", X, x0, H*range(X)) );
   return( K`*Y / sum(K) );
finish;
 
/* Nadaraya-Watson kernel smoother module. 
   (X,Y) are column vectors that contain the data.
   H is the proportion of the data range. It determines the 
     bandwidth for the kernel regression as h = H*range(X)
   t is a vector of values at which to evaluate the fit. By default t=X.
*/
start NWKernelSmoother(X, Y, H, _t=X);
   t = colvec(_t);
   pred = j(nrow(t), 1, .);
   do i = 1 to nrow(t);
      pred[i] = NWKerReg(Y, X, t[i], H);
   end;
   return pred;
finish;
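
To evaluate the N-W smoother on the same grid as the local linear fit, a minimal call (reusing the X, Y, H, and t values defined earlier) looks like this:

predNW = NWKernelSmoother(X, Y, H, t);   /* N-W predictions at the grid points */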

Problems with kernel regression

As mentioned earlier, there are some intrinsic problems with kernel regression smoothers:

  • Kernel smoothers treat extreme values of a sample distribution differently than points near the middle of the sample. For example, if x0 is the minimum value of the data, the fit at x0 is a weighted average of only the observations that are greater than x0. In contrast, if x0 is the median value of the data, the fit is a weighted average of points on both sides of x0. You can see the bias in the graph of the Nadaraya–Watson estimate, which shows predicted values that are higher than the actual values for the smallest and largest values of the X variable. Some researchers have proposed modified kernels to handle this problem for the N-W estimate. Fortunately, the local linear estimator tends to perform well.
  • The bandwidth of kernel smoothers cannot be too small or else there are values of x0 for which prediction is not possible. If d = x[i+1] – x[i] is the gap between two consecutive X data values, then if h is smaller than about d/10, there is no way to predict a value at the midpoint (x[i+1] + x[i]) / 2. The linear system that you need to solve will be singular because the weights at x0 are zero. The loess method avoids this flaw by using nearest neighbors to form predicted values. A nearest-neighbor algorithm is a variable-width kernel method, as opposed to the fixed-width kernel used here.

Extensions of the simple kernel smoother

You can extend the one-dimensional example to more complex situations:

  • Higher Dimensions: You can compute kernel regression for two or more independent variables by using a multivariate normal density as the kernel function. In theory, you could use any correlated structure for the kernel function, but in practice most people use a covariance structure of the form Σ = diag(h1², ..., hk²), where hi is the bandwidth in the i-th coordinate direction. An even simpler covariance structure is to use the same bandwidth in all directions, which can be evaluated by using the one-dimensional standard normal density evaluated at || x - x0 || / h. Details are left as an exercise; see the sketch after this list.
  • Automatic selection of the bandwidth: The example in this article uses a specified bandwidth, but as I have written before, it is better to use a fit criterion such as GCV or AIC to select a smoothing parameter.
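
As a rough sketch of the equal-bandwidth multivariate case (the module name and the single bandwidth h are illustrative assumptions, not from the original article), assuming X is an n x 2 data matrix and x0 is a 1 x 2 row vector:

/* Sketch: weights from a radial 2-D kernel with a common bandwidth h.
   Requires SAS/IML 12.1 or later for the matrix-vector subtraction X - x0. */
start GetKernelWeights2D(X, x0, h);
   d = sqrt( ((X - x0)##2)[,+] );   /* Euclidean distance from each row of X to x0 */
   return pdf("Normal", d/h) / h;   /* standard normal density at scaled distances */
finish;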

In summary, you can use a weighted polynomial regression to implement a kernel smoother in SAS. The weights are provided by a kernel density function, such as the normal density. Implementing a simple kernel smoother requires only a few statements in the SAS/IML matrix language. However, most modern data analysts prefer a loess smoother over a kernel smoother because the loess algorithm (which is a varying-width kernel algorithm) solves some of the issues that arise when trying to use a fixed-width kernel smoother with a small bandwidth. You can use PROC LOESS in SAS to compute a loess smoother.

The post Kernel regression in SAS appeared first on The DO Loop.