In the first post of a two-part series, Phil Simon plays point-counterpoint with himself.
The post The case for automating data management appeared first on The Data Roundtable.
In the first post of a two-part series, Phil Simon plays point-counterpoint with himself.
The post The case for automating data management appeared first on The Data Roundtable.
When my colleague, Robert Allison, blogged about visualizing the Mandelbrot set, I was reminded of a story from the 1980s, which was the height of the fractal craze. A research group in computational mathematics had been awarded a multimillion-dollar grant to purchase a supercomputer. When the supercomputer arrived and got set up, what was the first program they ran? It was a computation of the Mandelbrot set! They wanted to see how fast they could generate pretty pictures with their supercomputer!
Although fractals are no longer in vogue, the desire for fast computations never goes out of style. In a matrix-vector language such as SAS/IML, you can achieve fast computations by vectorizing the computations, which means operating on large quantities of data and using matrix-vector computations. Sometimes this paradigm requires rethinking an algorithm. This article shows how to vectorize the Mandelbrot set computation. The example provides important lessons for statistical computations, and, yes, you get to see a pretty picture at the end!
As a reminder, the Mandelbrot set is a visualization of a portion of the complex plane. If c is a complex number, you determine the color for c by iterating a complex quadratic map z <- z^{2} + c, starting with z=0. You keep track of how many iterations it takes before the iteration diverges to infinity, which in practice means that |z| exceeds some radius, such as 5. The parameter values for which the iteration remains bounded belong to the Mandelbrot set. Other points are colored according to the number of iterations before the trajectory exceeded the specified radius.
The classical computation of the Mandelbrot set iterates over the parameter values in a grid, as follows:
For each value c in a grid: Set z = 0 For i = 1 to MaxIter: Apply the quadratic map, f, to form the i_th iteration, z_i = f^i(z; c). If z_i > radius, stop and remember the number of iterations. End; If the trajectory did not diverge, assume parameter value is in the Mandelbrot set. End; |
An advantage of this algorithm is that it does not require much memory, so it was great for the PCs of the 1980s! For each parameter, the color is determined (and often plotted), and then the parameter is never needed again.
In the classical algorithm, all computations are scalar computations. The outer loop is typically over a large number, K, of grid points. The maximum number of iterations is typically 100 or more. Thus, the algorithm performs 100*K scalar computations in order to obtain the colors for the image. For a large grid that contains K=250,000 parameters, that's 25 million scalar computations for the low-memory algorithm.
A vectorized version of this algorithm inverts the two loops. Instead of looping over the parameters and iterating each associated quadratic map, you can store the parameter values in a grid and iterate all K trajectories at once by applying the quadratic map in a vectorized manner. The vectorized algorithm is:
Define c to be a large grid of parameters. Initialize a large grid z = 0, which will hold the current state. For i = 1 to MaxIter: For all points that have not yet diverged: Apply the quadratic map, f, to z to update the current state. If any z > radius, assign the number of iterations for those parameters. End; End; If a trajectory did not diverge, assume parameter value is in the Mandelbrot set. |
The vectorized algorithm performs the same computations as the scalar algorithm, but each iteration of the quadratic map operates on a huge number of current states (z). Also, each all states are checked for divergence by using a single call to a vector operation. There are many fewer function calls, which reduces overhead costs, but the vectorized program uses a lot of memory to store all the parameters and states.
Let's see how the algorithm might be implemented in the SAS/IML language. First, define vectorized functions for performing the complex quadratic map and the computation of the complex magnitude (absolute value). Because SAS/IML does not provide built-in support for complex numbers, each complex number is represented by a 2-D vector, where the first column is the real part and the second column is the imaginary part.
ods graphics / width=720px height=640px NXYBINSMAX=1200000; /* enable large heat maps */ proc iml; /* Complex computations: z and c are (N x 2) matrices that represent complex numbers */ start Magnitude(z); return ( z[,##] ); /* |z| = x##2 + y##2 */ finish; /* complex iteration of quadratic mapping z -> z^2 + c For complex numbers z = x + iy and c = a + ib, w = z^2 + c is the complex number (x^2 - y^2 + a) + i(2*x*y + b) */ start Map(z, c); w = j(nrow(z), 2, 0); w[,1] = z[,1]#z[,1] - z[,2]#z[,2] + c[,1]; w[,2] = 2*z[,1]#z[,2] + c[,2]; return ( w ); finish; |
The next block of statements defines a grid of parameters for the parameter, c:
/* Set parameters: initial range for grid of points and number of grid divisions Radius for divergence and max number of iterations */ nRe = 541; /* num pts in Real (horiz) direction */ nIm = 521; /* num pts in Imag (vert) direction */ re_min = -2.1; re_max = 0.6; /* range in Real direction */ im_min = -1.3; im_max = 1.3; /* range in Imag direction */ radius = 5; /* when |z| > radius, iteration has diverged */ MaxIter = 100; /* maximum number of iterations */ d_Re = (Re_max - Re_min) / (nRe-1); /* horiz step size */ d_Im = (Im_max - Im_min) / (nIm-1); /* vert step size */ Re = do(re_min, re_max, d_Re); /* evenly spaced array of horiz values */ Im = do(im_min, im_max, d_Im); /* evenly spaced array of vert values */ c = expandgrid( re, im); /* make into 2D grid */ z = j(nrow(c), 2, 0); /* initialize z = 0 for grid of trajectories */ iters = j(nrow(c), 1, .); /* for each parameter, how many iterations until diverge */ |
In this example, the parameters for the mapping are chosen in the rectangle with real part [-2.1, 0.6] and imaginary part [-1.3, 1.3]. The parameter grid contains 541 horizontal values and 521 vertical values, for a total of almost 282,000 parameters.
The vectorized program will iterate all 282,000 mappings by using a single call to the Map function. After each iteration, all trajectories are checked to see which ones have left the disk of radius 5 (diverged). The iters vector stores the number of iterations until leaving the disk. During the iterations, a missing value indicates that the trajectory has not yet left the disk. At the end of the program, all trajectories that have not left the disk are set to the maximum number of iterations.
There are two efficiencies you can implement. First, you can pre-process some of the parameters that are known to be inside the "head" or "body" of the bug-shaped Mandelbrot set. Second, for each iteration, you only need to apply the quadratic map to the points that have not yet diverged (that is, iters is missing for that parameter). This is implemented in the following program statements:
/* Pre-process parameters that are known to be in Mandelbrot set. Params in "head": inside circle of radius 1/8 centered at (-1, 0) Params in "body": inside rect [-0.5, 0.2] x [-0.5, 0.5] */ idxHead = loc( (c - {-1 0})[,##] < 0.0625 ); if ncol(idxHead) > 0 then iters[idxHead] = MaxIter; idxBody = loc(c[,1]> -0.5 & c[,1]< 0.2 & c[,2]> -0.5 & c[,2]< 0.5); if ncol(idxBody) > 0 then iters[idxBody] = MaxIter; /* compute the Madelbrot set */ done = 0; /* have all points diverged? */ do i = 1 to MaxIter until(done); j = loc(iters=.); /* find the points that remain bounded */ w = z[j,]; a = c[j,]; /* keep only states and parameters tht have not diverged */ w = Map(w, a); /* map those points forward one iteration */ z[j,] = w; /* update current state */ mag = Magnitude(w); /* compute magnitude of current states */ diverged = (mag >= radius) & (iters[j] = .); /* which points diverged on this iteration? */ idx = loc(diverged); /* indices of the diverged points */ if ncol(idx)> 0 then /* if diverged, remember the iteration number */ iters[j[idx]] = i; if all(diverged > 0) then done = 1; end; /* Points that never diverged are part of the Mandelbrot set. Assign them MaxIter */ idx = loc(iters=.); if ncol(idx)>0 then iters[idx] = MaxIter; |
The last step is to assign colors that visualize the Mandelbrot set. Whereas Robert Allison used a scatter plot, I will use a heat map. The iters vector contains the number of iterations before divergence, or MaxIters if the trajectory never diverged. You can assign a color ramp to the number of iterations, or, as I've done below, to a log-transformation of the iteration number. I've previously discussed several ways to assign colors to a heat map.
/* Color parameter space by using LOG10(number of iterations) */ numIters = log10( shape(iters, nRe, nIm) ); /* shape iterations into matrix */ mandelbrot = numIters`; /* plot num iterations for each parameter */ call heatmapcont(mandelbrot) xValues=Re yValues=Im ColorRamp="Temperature" displayoutlines=0 showlegend=0 title="Mandelbrot Set from Vector Computations"; |
On my PC, the Mandelbrot computation takes about a second. This is not blazingly fast, but it is faster than the low-memory, non-vectorized, computation. If you care about the fastest speeds, use the DATA step, as shown in Robert's blog post.
I'm certainly not the first person to compute the Mandelbrot set, so what's the point of this exercise? The main purpose of this article is to show how to vectorize an iterative algorithm that performs the same computation many times for different parameter values. Rather than iterate over the set of parameter values and perform millions of scalar computations, you create an array (or matrix) of parameter values and perform a small number of vectorized computations. In most matrix-vector programming languages, a vectorized computation is more efficient than looping over each parameter value. There are few function calls and each call performs high-level computations on large matrices and vectors.
Statistical programmers don't usually have a need to compute fractals, but the ideas in this program apply to time series computations, root-finding, optimization, and any computation where you compute the same quantity for many parameter values. In short, although the Mandelbrot set is fun to compute, the ability to vectorize computations in a matrix language is a skill that rewards you with more than pretty pictures. And fast computations never go out of style.
The post Vectorize the computation of the Mandelbrot set in a matrix language appeared first on The DO Loop.
In a previous post, Zero to SAS in 60 Seconds- SAS Machine Learning on SAS Analytics Cloud, I documented my experience with a SAS free trial on the SAS Analytics Cloud. Well, the engineers at SAS have been busy and created another free trial. The new trial covers SAS Event Stream Processing (ESP).
This time last year (when just starting at SAS), I only knew ESP as extrasensory perception. I'm more enlightened now. Working through this exercise introduced me to how event stream processing is a powerful and effective tool for analyzing data using machine learning and streaming analytics to uncover insights for real-time decision making. In a nutshell, you create a model, stream your data, process the results, and make timely decisions based on the results.
The trial uses SAS ESPPy, allowing you to embed an ESP project inside a Python pipeline. To see ESPPy in action take a look at this video. To learn more about ESP and IoT see this article on the SAS Communities Library. In this article I chronicle my journey through the trial while introducing key concepts and operations of ESP.
The process to register and initial login are identical to the machine learning article. You must have a SAS Profile to participate in the trial. The only difference is you need to follow this link to sign up for the ESP trial. Please refer to the machine learning article for detailed steps of signing up and logging in.
As part of the environment management, it is important to continuously monitor the operation of the solar panels to optimize configuration parameters, detect potential equipment failure, and accurately forecast the amount of energy generated. Factors considered include panel angles, time of day, seasons, and weather patterns as the energy generated depends directly of the amount of sun available to the panels.
The ESP project in this demo is pre-loaded in the trial and is run through a Jupyter notebook. The project shows the monitoring of energy (kWh) and power (kW) generated during a specific time interval eliminating localized outlier effects and triggering alerts when there is a pre-defined difference in the energy generated between subsequent time intervals.
Take two minutes and watch this video on how SAS uses SAS software to create a work of art with solar farm data.Disclaimer: no sheep were harmed during data collection or writing of this article.
Once logged into the trial, you see the Applications screen.
The Data and Team options in the left pane behave exactly as those in the machine learning trial. These sections allow you to access data and manage a multi-user system. Select the SAS Event Stream Processing icon to start a JupyterLab session.
I will not go into the details of JupyterLab here. The left pane contains menus, file management, and other options. The pane on the right displays three options:
Python 3 Notebook - a blank Jupyter notebook - documents that combine live, runnable code with narrative text (Markdown), equations (LaTeX), images, interactive visualizations and other rich output
Python 3 Console - a blank Python console - code consoles enable you to run code interactively in a kernel
Text File - basic text editor - enables you to edit text files in JupyterLab
For this article we're going to follow along and interact with the pre-loaded demo Solar Farm ESP project. To locate the Jupyter notebook double click the demo directory from the left pane.
Next select Event_Stream_Processing. Before proceeding with the demo, I'd highly suggest opening the README.ipynb file.
Here you will find overview and environment organization information for the trial. The trial uses SAS ESPPy for designing, testing, and deploying projects on ESP Servers.
Before starting the trial, I needed a little background on event stream processing. I located the SAS ESP product documentation. I recommend referring to it for details on the ESP model, objects, and workflow.
To access the demo, double click the demo directory from the left pane. The trial comes with five pre-loaded demos. Feel free to try any/all of them. Double click on ESP Basic Project - Solar Farm.ipynb to display the Solar Farm notebook. The notebook walks you through the ESP model creation and execution. To run a command place the cursor in a command cell and select the 'Run' button (triangle-shaped button at the top of the notebook). If no response returns when running the cell block, assume the commands ran successfully.
Below is a brief description of the steps in the project:
While not included in the demo, the streaming data would pass through the filter and if a threshold breach occurs, an alert is created. Considering the graph above, alerts could very well have occurred just before 1:15 pm (IntkW drops from 185 to 150) and just before 2:30 pm (IntkW drops from 125 to 35).
Now that you have a taste of ESP, feel free to step through the rest of the demos. You may also load your own data and create your own ESP models. Feel free to share your experience and what you create by leaving a comment.
SAS Event Stream Processing on SAS Analytics Cloud - my journey was published on SAS Users.
While it appears to be a simple web site widget, the "Recommended by SAS" sidebar is made possible by an application of the full Analytics Life Cycle. This includes data collection and prep, model building and test, API wrappers with a gateway for monitoring, model deployment in containers with orchestration in Kubernetes, and model assessment using feedback from click actions on the recommendations. We built this by using a combination of SAS analytics and open source tools -- see the SAS Global Forum paper by my colleague, Jared Dean, for the full list of ingredients.
Jared and I have been working for over a year to bring this recommendation engine to life. We discussed it at SAS Global Forum 2018, and finally near the end of 2018 it went into production on communities.sas.com. The engine scores user visits for new recommendations thousands of times per day. The engine is updated each day with new data and a new scoring model.
Now that the recommendation engine is available, Jared and I met again in front of the camera. This time we discussed how the engine is working and the efforts required to get into production. Like many analytics projects, the hardest part of the journey was that "last mile," but we (and the entire company, actually) were very motivated to bring you a live example of SAS analytics in action. You can watch the full video at (where else?) communities.sas.com. The video is 17 minutes long -- longer than most "explainer"-type videos. But there was a lot to unpack here, and I think you'll agree there is much to learn from the experience. Not ready to binge on our video? I'll use the rest of this article to cover some highlights.
The approach of our recommendation engine is based upon your viewing behavior, especially as compared to the behavior of others in the community. With this approach, we don't need to capture much information about you personally, nor do we need information about the content you're reading. Rather, we just need the unique IDs (numbers) for each topic that is viewed, and the ID (again, a number) for the logged-in user who viewed it. One benefit of this approach is that we don't have to worry about surfacing any personal information in the recommendation API that we'll ultimately build. That makes the conversation with our IT and Legal colleagues much easier.
Our communities platform captures details about every action -- including page views -- that happens on the site. We use SAS and the community platform APIs to fetch this data every day so that we can build reports about community activity and health. We now save off a special subset of this data to feed our recommendation engine. Here's an example of the transactions we're using. It's millions of records, covering nearly 100,000 topics and nearly 150,000 active users.
Starting with these records, Jared uses SAS DATA step to prep the data for further analysis and a pass through the algorithm he selected: factorization machines. As Jared explains in the video, this algorithm shines when the data are represented in sparse matrices. That's what we have here. We have thousands of topics and thousands of community members, and we have a record for each "view" action of a topic by a member. Most members have not viewed most of the topics, and most of the topics have not been viewed by most members. With today's data, that results in a 13 billion cell matrix, but with only 3.3 million view events. Traditional linear algebra methods don't scale to this type of application.
Jared uses PROC FACTMAC (part of SAS Visual Data Mining and Machine Learning) to create an analytics store (ASTORE) for fast scoring. Using the autotuning feature, the FACTMAC selects the best combination of values for factors and iterations. And Jared caps the run time to 3600 seconds (1 hour) -- because we do need this to run in a predictable time window for updating each day.
proc factmac data=mycas.weighted_factmac outmodel=mycas.factors_out; autotune maxtime=3600 objective=MSE TUNINGPARAMETERS=(nfactors(init=20) maxiter(init=200) learnstep(init=0.001) ) ; input user_uid conversation_uid /level=nominal; target rating /level=interval; savestate rstore=mycas.sascomm_rstore; run; |
To update the model with new data each day and then deploy the scoring model as an ASTORE, Jared uses multiple SAS Viya environments. These SAS Viya environments need to "live" only for a short time -- for building the model and then for scoring data. We use Docker containers to spin these up as needed within the cloud environment hosted by SAS IT.
Jared makes the distinction between the "building container," which hosts the full stack of SAS Viya and everything that's needed to prep data and run FACTMAC, and the "scoring container", which contains just the ASTORE and enough code infrastructure (include the SAS Micro Analytics Service, or MAS) to score recommendations. This scoring container is lightweight and is actually run on multiple nodes so that our engine scales to lots of requests. And the fact that it does just the one thing -- score topics for user recommendations -- makes it an easier case for SAS IT to host as a service.
To access the scoring service, Jared built a simple API using a Python Flask app. The API accepts just one input: the user ID (a number). It returns a list of recommendations and scores. Here's my Postman snippet for testing the engine.
To provision this API as a hosted service that can be called from our community web site, we use an API gateway tool called Apigee. Apigee allows us to control access with API keys, and also monitors the performance of the API. Here's a sample performance report for the past 7 days.
In addition to this dashboard for reporting, we have integrated proactive alerts into Microsoft Teams, the tool we use for collaboration on this project. I scheduled a SAS program that tests the recommendations API daily, and the program then posts to a Teams channel (using the Teams API) with the results. I want to share the specific steps for this Microsoft Teams integration -- that's a topic for another article. But I'll tell you this: the process is very similar to the technique I shared about publishing to a Slack channel with SAS.
To make it easier to track recommendation clicks, we added special parameters to the recommended topics URLs to capture the clicks as Google Analytics "events." Here's what that data looks like within the Google Analytics web reporting tool:
You might know that I use SAS with the Google Analytics API to collect web metrics. I've added a new use case for that trick, so now I collect data about the "SAS Recommended Click" events. Each click event contains the unique ID of the recommendation score that the engine generated. Here's what that raw data looks like when I collect it with SAS:
With the data in SAS, we can use that to monitor the health/success of the model in SAS Model Manager, and eventually to improve the algorithm.
This project has been exciting from Day 1. When Jared and I saw the potential for using our own SAS Viya products to improve visitor experience on our communities, we committed ourselves to see it through. Like many analytics applications, this project required buy-in and cooperation from other stakeholders, especially SAS IT. Our friends in IT helped with the API gateway and it's their cloud infrastructure that hosts and orchestrates the containers for the production models. Putting models into production is often referred to as "the last mile" of an analytics project, and it can represent a difficult stretch. It helps when you have the proper tools to manage the scale and the risks.
We've all learned a lot in the process. We learned how to ask for services from IT and to present our case, with both benefits and risks. And we learned to mitigate those risks by applying security measures to our API, and by limiting the execution scope and data of the API container (which lives outside of our firewall).
Thanks to extensive preparation and planning, the engine has been running almost flawlessly for 8 months. You can experience it yourself by visiting SAS Support Communities and logging in with your SAS Profile. The recommendations that you see will be personal to you (whether they are good recommendations...that's another question). We have plans to expand the engine's use to anonymous visitors as well, which will significantly increase the traffic to our little API. Stay tuned!
The post Building a recommendation engine with SAS appeared first on The SAS Dummy.
Years ago I saw a line of SAS code that was really puzzling. It was a statement that started with:
if 0 then … ;
What? This was a statement that would always be evaluated as false. Why would anyone write such a statement? Recently, I was discussing with a good friend of mine a macro I wrote where I used an IF statement that was always false.
In some programming languages, the "IF 0" trick ensures that the code that follows would never be executed. But with the DATA step, SAS actually does still evaluate the "zeroed-out" code during the compile phase.
Let's see why this can actually be useful.
The problem I was trying to solve was to assign the number of observations in a data set to a macro variable. First, the code, then the explanation.
if 0 then set Oscar Nobs=Number_of_Obs;
call symputx('Number',Number_of_Obs);
stop;
run;
The SET option Nobs= places the number of observations in data set Oscar in a variable I called Number_of_Obs. This happens at compile time, before any observations are read from data set Oscar. The CALL SYMPUTX routine takes two arguments. The first is the name of a macro variable (Number) and the second is the name of a SAS variable (Number_of_Obs). The call routine assigns the value of the SAS variable to the Macro variable.
Most data sets stop executing when you reach the end-of-file on any data set. Because you are not reading any values from data set Oscar, you need a STOP statement to prevent the Data Step from looping. When this Data Step executes, the number of observations in data set Oscar is assigned to the macro variable called Number.
Now you, too, can write programs using this useful trick. The macro where I used this trick can be found in my book, Cody's Data Cleaning techniques, 3rd edition. You can see a copy of this macro (and all the other macros described in the book) by going to my author web site: support.sas.com/cody, scrolling down to the Data Cleaning book and clicking on the link to see example and data. One of the macros that use this trick is called HighLow.sas. As an alternative, you can buy the book!
By the way, you can download programs and data from any of my books by using this same method and it is free!
Books by Ron Cody See them all!A DATA step compiler trick to get the record count was published on SAS Users.
How do you explain flat-line forecasts to senior management? Or, do you just make manual overrides to adjust the forecast? When there is no detectable trend or seasonality associated with your demand history, or something has disrupted the trend and/or seasonality, simple time series methods (i.e. naïve and simple [...]
How do I explain a flat-line forecast to senior management? was published on SAS Voices by Charlie Chase
SAS supports more than 25 common probability distributions for the PDF, CDF, QUANTILE, and RAND functions. Of course, there are infinitely many distributions, so not every possible distribution is supported. If you need a less-common distribution, I've shown how to extend the functionality of Base SAS (by using PROC FCMP) or the SAS/IML language by defining your own functions. This article shows how to implement the Gumbel distribution in SAS. The density functions for four parameter values of the Gumbel distribution are shown to the right. As I recently discussed, the Gumbel distribution is used to model the maximum (or minimum) in a sample of size n.
If you are going to work with a probability distribution, you need at least four essential functions: The CDF (cumulative distribution), the PDF (probability density), the QUANTILE function (inverse CDF), and a way to generate random values from the distribution. Because the RAND function in SAS supports the Gumbel distribution, you only need to define the CDF, PDF, and QUANTILE functions.
These functions are trivial to implement for the Gumbel distribution because each has an explicit formula. The following list defines each function. To match the documentation for the RAND function and PROC UNIVARIATE (which both support the Gumbel distribution), μ is used for the location parameter and σ is used for the scale parameter:
The RAND function in Base SAS and the RANDGEN function in SAS/IML support generating random Gumbel variates. Alternatively, you could generate u ~ U(0,1) and then use the QUANTILE formula to transform u into a random Gumbel variate.
Using these definitions, you can implement the functions as inline functions, or you can create a function call, such as in the following SAS/IML program:
proc iml; start CDFGumbel(x, mu, sigma); z = (x-mu)/sigma; return exp(-exp(-z)); * CDF of Gumbel distribution; finish; start PDFGumbel(x, mu, sigma); z = (x-mu)/sigma; return exp(-z-exp(-z)) / sigma; finish; start QuantileGumbel(p, mu, sigma); return mu - sigma # log(-log(p)); finish; /* Example: Call the PDFGumbel function */ mu = 3.09; /* params for max value of a sample of size */ sigma = 0.286; /* n=1000 from the std normal distrib */ x = do(2, 5, 0.025); PDF = PDFGumbel(x, mu, sigma); title "Gumbel PDF"; call series(x, PDF) grid={x y}; |
The graph shows the density for the Gumbel(3.09, 0.286) distribution, which models the distribution of the maximum value in a sample of size n=1000 drawn from the standard normal distribution. You can see that the maximum value is typically between 2.5 and 4.5, which values near 3 being the most likely.
I recently discussed the best way to draw a family of curves in SAS by using PROC SGPLOT. The following DATA step defines the PDF and CDF for a family of Gumbel distributions. You can use this data set to display several curves that indicate how the distribution changes for various values of the parameters.
data Gumbel; array n[4] _temporary_ (10 100 1000 1E6); /* sample size */ array mu[4] _temporary_ (1.28 2.33 3.09 4.75); /* Gumbel location parameter */ array beta [4] _temporary_ (0.5 0.36 0.29 0.2); /* Gumbel scale parameter */ do i = 1 to dim(beta); /* parameters in the outer loop */ m = mu[i]; b = beta[i]; Params = catt("mu=", m, "; beta=", b); /* concatenate parameters */ do x = 0 to 6 by 0.01; z = (x-m)/b; CDF = exp( -exp(-z)); /* CDF for Gumbel */ PDF = exp(-z - exp(-z)) / b; /* PDF for Gumbel */ output; end; end; drop z; run; /* Use ODS GRAPHICS / ATTRPRIORITY=NONE if you want to force the line attributes to vary in the HTML destination. */ title "The Gumbel(mu, beta) Cumulative Distribution"; proc sgplot data=Gumbel; label cdf="Probability"; series x=x y=cdf / group=Params lineattrs=(thickness=2); keylegend / position=right; xaxis grid; yaxis grid; run; title "The Gumbel(mu, beta) Probability Density"; proc sgplot data=Gumbel; label pdf="Density"; series x=x y=pdf / group=Params lineattrs=(thickness=2); keylegend / position=right; xaxis grid offsetmin=0 offsetmax=0; yaxis grid; run; title "The Gumbel(mu, beta) Quantile Function"; proc sgplot data=Gumbel; label cdf="Probability"; series x=cdf y=x / group=Params lineattrs=(thickness=2); keylegend / position=right sortorder=reverseauto; xaxis grid; yaxis grid; run; |
The graph for the CDF is shown. A graph for the PDF is shown at the top of this article. I augmented the PDF curve with labels that show the sample size for which the Gumbel distribution is associated. The leftmost curve is the distribution of the maximum in a sample of size n=10; the rightmost curve is for a sample of size one million.
For completeness, I will mention that you can use the RAND function to generate random values and use PROC UNIVARIATE to fit Gumbel parameters to data. For example, the following DATA step generates 5,000 random variates from Gumbel(1.28, 0.5). The call to PROC UNIVARIATE fits a Gumbel model to the data. The estimates for (μ, σ) are (1.29, 0.51).
data GumbelRand; call streaminit(12345); mu = 1.28; beta = 0.5; do i = 1 to 5000; x = rand("Gumbel", mu, beta); output; end; keep x; run; proc univariate data=GumbelRand; histogram x / Gumbel; run; |
In summary, although the PDF, CDF, and QUANTILE functions in SAS do not support the Gumbel distribution, it is trivial to implement these functions. This article visualizes the PDF and CDF of the Gumbel distributions for several parameter values. It also shows how to simulate random values from a Gumbel distribution and how to fit the Gumbel model to data.
The post Implement the Gumbel distribution in SAS appeared first on The DO Loop.
Guest blogger Khari Villela shares tips to help you skip common pitfalls of building a data lake.
The post 3 tips for building a better data lake appeared first on The Data Roundtable.
Is 4 an extreme value for the standard normal distribution? In high school, students learn the famous 68-95-99.7 rule, which is a way to remember that 99.7 percent of random observation from a normal distribution are within three standard deviations from the mean. For the standard normal distribution, the probability that a random value is bigger than 3 is 0.0013. The probability that a random value is bigger than 4 is even smaller: about 0.00003 or 3 x 10^{-5}.
So, if you draw randomly from a standard normal distribution, it must be very rare to see an extreme value greater than 4, right? Well, yes and no. Although it is improbable that any ONE observation is more extreme than 4, if you draw MANY independent observations, the probability that the sample contains an extreme value increases with the sample size. If p is the probability that one observation is less than 4, then p^{n} is the probability that n independent observations are all less than 4. Thus 1 – p^{n} is the probability that some value is greater than 4. You can use the CDF function in SAS to compute these probabilities in a sample of size n:
/* What is a "large value" of a normal sample? The answer depends on the size of the sample. */ /* Use CDF to find probability that a random value from N(0,1) exceeds 4 */ proc iml; P_NotGT4 = cdf("Normal", 4); /* P(x < 4) */ /* Probability of an extreme obs in a sample that contains n independent observations */ n = {1, 100, 1000, 10000}; /* sample sizes */ P_NotGT4 = P_NotGT4**n; /* P(all values are < 4) */ P_GT4 = 1 - P_NotGT4; /* P(any value is > 4) */ print n P_NotGT4 P_GT4; |
The third column of the table shows that the probability of seeing an observation whose value is greater than 4 increases with the size of the sample. In a sample of size 10,000, the probability is 0.27, which implies that about one out of every four samples of that size will contain a maximum value that exceeds 4.
You can use a simulation to approximate the distribution of the maximum value of a normal sample of size n. For definiteness, choose n = 1,000 and sample from a standard normal distribution N(0,1). The following SAS/IML program simulates 5,000 samples of size n and computes the maximum value of each sample. You can then graph the distribution of the maximum values to understand how the maximum value varies in random samples of size n.
/* What is the distribution of the maximum value in a sample of size n drawn from the standard normal distribution? */ call randseed(12345); n = 1000; /* sample size */ numSim = 5000; /* number of simulations */ x = j(n, numSim); /* each column will be a sample */ call randgen(x, "Normal"); /* generate numSim samples */ max = x[<>, ]; /* find max of each sample (column) */ Title "Distribution of Maximum Value of a Normal Sample"; title2 "n = 1000"; call histogram(max); /* compute some descriptive statistics */ mean = mean(max`); call qntl(Q, max`, {0.25 0.5 0.75}); print (mean // Q)[rowname={"Mean" "P25" "Median" "P75"}]; |
Based on this simulation, the expected maximum value of a sample of size n = 1,000 is about 3.2. The table indicates that about 25% of the samples have a maximum value that is less than 3. About half have a maximum value less than 3.2. About 75% of the samples have a maximum value that is less than 3.4. The graph shows the distribution of maxima in these samples. The maximum value of a sample ranged from 2.3 to 5.2.
The distribution of a maximum (or minimum) value in a sample is studied in an area of statistics that is known as extreme value theory. It turns out that you can derive the sampling distribution of the maximum of a sample by using the Gumbel distribution, which is also known as the "extreme value distribution of Type 1." You can use the Gumbel distribution to describe the distribution of the maximum of a sample of size n. The Gumbel distribution actually describes the maximum for many distributions, but for simplicity I will only refer to the normal distribution.
This section does two things. First, it uses PROC UNIVARIATE to fit the parameters of a Gumbel distribution to the maximum values from the simulated samples. The Gumbel(3.07, 0.29) distribution is the distribution that maximizes the likelihood function for the simulated data. Second, it uses a theoretical formula to show that a Gumbel(3.09, 0.29) distribution is the distribution that models the maximum of a normally distributed sample of size n = 1,000. Thus, the results from the simulation and the theory are similar.
You can write the 5,000 maximum values from the simulation to a SAS data set and use PROC UNIVARIATE to estimate the MLE parameters for the Gumbel distribution, as follows:
create MaxVals var "max"; append; close; QUIT; /* Fit a Gumbel distribution, which models the distribution of maximum values */ proc univariate data=MaxVals; histogram max / gumbel; ods select Histogram ParameterEstimates FitQuantiles; run; |
The output from PROC UNIVARIATE shows that the Gumbel(3.07, 0.29) distribution is a good fit to the distribution of the simulated maxima. But where do those parameter values come from? How are the parameters of the Gumbel distribution related to the sample size of the standard normal distribution?
It was difficult to find an online reference that shows how to derive the Gumbel parameters for a normal sample of size n. I finally found a formula in the book Extreme Value Distributions (Kotz and Nadarajah, 2000, p. 9). For any cumulative distribution F that satisfies certain conditions, you can use the quantile function of the distribution to estimate the Gumbel parameters. The result is that the location parameter (μ) is equal to μ = F^{-1}(1-1/n) and the scale parameter (σ) is equal to σ = F^{-1}(1-1/(ne)) - μ, where e is the base of the natural logarithm. The following SAS/IML program uses the (1 – 1/n)th quantile of the normal distribution to derive the Gumbel parameters for a normal sample of size n:
proc iml; n = 1000; /* Compute parameters of Gumbel distribution when the sample is normal of size n. SAS calls the parameters (mu, sigma). Wikipedia calls them (mu, beta). Other references use (alpha, beta). */ mu_n = quantile("Normal", 1-1/n); /* location parameter */ sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */ print n mu_n sigma_n; /* what is the mean (expected value) and median of this Gumbel distribution? */ gamma = constant("Euler"); /* Euler–Mascheroni constant = 0.5772157 */ mean = mu_n + gamma*sigma_n; /* expected value of maximum */ median = mu_n - sigma_n * log(log(2)); /* median of maximum value distribution */ print n mean median; |
Notice that these parameters are very close to the MLE estimates from the simulated normal samples. The results tell you that the expected maximum in a standard normal sample is 3.26 and about 50% of samples will have a maximum value of 3.19 or less.
You can use these same formulas to find the expected and median values of the maximum in samples of other sizes:
/* If the sample size is n, what is expected maximum */ n = {1E4, 2E4, 1E5, 2E5, 1E6, 2E6}; mu_n = quantile("Normal", 1-1/n); /* location parameter */ sigma_n = quantile("Normal", 1-1/n*exp(-1)) - mu_n; /* scale parameter */ mean = mu_n + gamma*sigma_n; /* expected value of maximum */ median = mu_n - sigma_n * log(log(2)); /* meadian of maximum value distribution */ print n mean median; |
The table shows that you would expect to see a maximum value of 4 in a sample of size 20,000. If there are two million observations in a sample, you would expect to see a maximum value of 5!
You can graph this data over a range of sample sizes. The following graph shows the expected value of the maximum value in a sample of size n (drawn from a standard normal distribution) for large values of n.
You can create similar images for quantiles. The p_th quantile for the Gumbel distribution is q = mu_n - sigma_n log(-log(p)).
So, is 4 an unlikely value for the standard normal distribution? Yes, but for sufficiently large samples that value is likely to be observed. You can use the Gumbel distribution to model the distribution of the maximum in a normal sample of size n to determine how likely it is that the sample contains an extreme value. The larger the sample, the more likely it is to observe an extreme value. Although I do not discuss the general case, the Gumbel distribution can also model the maximum value for samples drawn from some non-normal distributions.
The post Extreme values: What is an extreme value for normally distributed data? appeared first on The DO Loop.
This blog answers three questions related to using PROC RANK with groups and ties. Note that question two actually provide an alternative for using the DATA step when PROC RANK cannot provide what you need.
Most of us just accept the ranked values that are calculated by PROC RANK. But have you ever tried to figure out the calculation that happens behind the code? When you use the GROUPS= option in the PROC RANK statement, the values are assigned to groups that range from 0 to the number-of-groups minus 1, based on tied values. If you have many tied values, you might not obtain the number of groups that you request because observations with the same value are assigned to the same group.
The formula for calculating group values is as follows:
FLOOR(rank*k/(n+1))
In this formula:
Consider the following example. If you want to see the original value as well as the ranking, use both the VAR and RANKS statements in your PROC RANK step. The RANKS variable contains the rank value. If you use a BY statement, the raw rankings start over in each BY group.
Example 1
data; input x; cards; 1 1 1 1 1 1 8 8 ; run; proc rank data=test groups=5 out=test_rank ties=mean /* low and high */; var x; ranks rank_x; run; proc print data=test_rank; run; |
Output
x | rank_x |
---|---|
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
1 | 1 |
8 | 4 |
8 | 4 |
The following table shows the variable values, the raw ranks, the ties values, and the resulting ranks:
X | RAW_RANK | TIES=MEAN | TIES=LOW | TIES=HIGH | RANK_MEAN | RANK_LOW | RANK_HIGH |
---|---|---|---|---|---|---|---|
1 | 1 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 2 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 3 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 4 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 5 | 3.5 | 1 | 6 | 1 | 0 | 3 |
1 | 6 | 3.5 | 1 | 6 | 1 | 0 | 3 |
8 | 7 | 7.5 | 7 | 8 | 4 | 3 | 4 |
8 | 8 | 7.5 | 7 | 8 | 4 | 3 | 4 |
Using the formula that is shown previously, k=5 and n=8. Since TIES=MEAN, you sum the raw ranks of the tied values of X and divide by the number of observations. For X=1, the rank is (1+2+3+4+5+6)=21/6=3.5. For X=2, the rank is (7+8)=15/2=7.5. Similarly, if you use TIES=LOW, for X=1, the rank is 1; for X=2, the rank is 7. Finally, if you use TIES=HIGH, for X=1, the rank is 6; for X=2, the rank is 8. When you insert those values into the formula for the first observation, you obtain the following results:
TIES=MEAN: Floor(3.5*5/9)=Floor(1.9)=1
TIES=LOW: Floor(1*5/9)=Floor(0.5)=0
TIES=HIGH: Floor(6*5/9)=Floor(3.3)=3
Suppose that you want to create groups that have the same number of observations in each one, regardless of tied values. PROC RANK cannot do this. However, you can use the DATA step to accomplish this task.
You need to sort the data set by the ranking variable and then use the same formula in the DATA step, as shown below.
Example 2
proc sort data=test; by x; run; data ranks; set test nobs=numobs; group=floor(_n_*5/(numobs+1)); run; proc print data=ranks; run; |
Output
x | group |
---|---|
1 | 0 |
1 | 1 |
1 | 1 |
1 | 2 |
1 | 2 |
1 | 3 |
8 | 3 |
8 | 4 |
When you use the GROUPS= option, the values that are assigned to the groups start at 0. There is no way to indicate for PROC RANK to start the groups at 1. However, once you have the data set with the ranked values, you can add 1 using DATA step logic, as shown in this example:
Example 3
data test; input y; cards; 11 22 10 15 25 ; run; proc sort data=test; by y; run; proc rank data=test out=test_rank1 groups=3; var y; ranks rank_y; run; data test_rank1; set test_rank1; rank_y+1; run; proc print data=test_rank1; run; |
Output
y | rank_y |
---|---|
10 | 1 |
11 | 2 |
15 | 2 |
22 | 3 |
25 | 3 |
PROC RANK has many statistical applications, such as helping you understand how your data is distributed. Hopefully, this blog has provided you with a better understanding of how ranks are determined when tied values are present in the data and when you want to assign those ranks to groups.
How the RANK procedure calculates ranks with groups and ties was published on SAS Users.