January 3, 2017
 

How many of you have been given a SAS data set with variables such as Age, Height, and Weight and some or all of them were stored as character values instead of numeric?  Probably EVERYONE! Yes, we all know how to do the old "swap and drop" (rename and convert), but […]

The post Character to Numeric Conversion in SAS appeared first on SAS Learning Post.

January 2, 2017
 

Data analysis can be used for many things ... how about finding other beers you might like, so you don't keep drinking the same old brand every time? Hang on tight - I think we're about to make a beer run! I recently read an interesting article on the Flowingdata website, […]

The post Beer: Finding your favorite that you didn't know about! appeared first on SAS Learning Post.

December 29, 2016
 

This SAS Jedi is very excited about the SAS 9.4 M4 release, which brought many wonderful gifts just in time for Christmas. So in the interest of extending the Christmas spirit, I'm going to blog about some of my favorites! I've long loved the SAS DO statement variant which allows […]

The post SAS Jedi Christmas - SAS 9.4 M4 DS2 Do Loop Upgrade appeared first on SAS Learning Post.

December 28, 2016
 

How can you generate data that contains outliers in a simulation study? The contaminated normal distribution is a simple but useful distribution you can use to simulate outliers. The distribution is easy to explain and understand, and it is also easy to implement in SAS.

What is a contaminated normal distribution?

The contaminated normal distribution was originally studied by John Tukey in the 1940s and '50s. As I say in my book Simulating Data with SAS (2013, p. 119), "the contaminated normal distribution is a specific instance of a two-component mixture distribution in which both components are normally distributed with a common mean.... This results in a distribution with heavier tails than normality. This is also a convenient way to generate data with outliers."

Specifically, a contaminated normal distribution is a mixture of two normal distributions with mixing probabilities (1 - α) and α, where typically 0 < α ≤ 0.1. You can write the density of a contaminated normal distribution in terms of the component densities. Let φ(x; μ, σ) denote the density of the normal distribution with mean μ and standard deviation σ. Then the contaminated normal density is
f(x) = (1 - α)φ(x; μ, σ) + α φ(x; μ, λσ)
where λ > 1 is a parameter that determines the standard deviation of the wider component.

The idea is that the "main" distribution (φ(x; μ, σ)) is slightly "contaminated" by a wider distribution. Tukey (1960) uses λ=3 as a scale multiplier. This article uses α = 0.1, which represents 10% "contamination." Tukey reports that when λ=3 and α=0.1, "the two constituents contribute equal amounts to the variance of the contaminated distribution." In the following sections, μ=0 and σ=1, so that the uncontaminated component is the standard normal distribution.
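
Tukey's remark about equal variance contributions can be checked with a little arithmetic. Because both components share the same mean, the variance of the mixture is the weighted sum of the component variances, (1 - α)σ² + α(λσ)². The following DATA step is a minimal sketch of that check; the variable names (v_main, v_contam, v_total) are chosen here for illustration and do not come from Tukey or from the book:

```sas
/* Sketch: variance contributions of the two components (common mean) */
data _null_;
   alpha  = 0.1;                        /* contamination proportion */
   lambda = 3;                          /* scale multiplier */
   sigma  = 1;                          /* std dev of the main component */
   v_main   = (1-alpha)*sigma**2;       /* contribution of N(0,1) */
   v_contam = alpha*(lambda*sigma)**2;  /* contribution of N(0,3) */
   v_total  = v_main + v_contam;        /* variance of the mixture */
   put v_main= v_contam= v_total=;
run;
```

With these parameter values, each component should contribute about 0.9 to a total variance of about 1.8, which is consistent with Tukey's statement.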

The density of the contaminated normal distribution

The following SAS DATA step constructs the density of a contaminated normal distribution as the linear combination of a N(0,1) and a N(0,3) density. The call to the SGPLOT procedure plots the density and the component densities:

%let alpha = 0.1;
%let lambda = 3;
data CNPDF;
label Y1="N(0,1)" Y2="N(0,3)";
do x = -3*&lambda to 3*&lambda by 0.1;
   Y1 = pdf("Normal", x, 0, 1);             /* std normal component */
   Y2 = pdf("Normal", x, 0, 1*&lambda);     /* contamination */
   CN = (1-&alpha)*Y1 + &alpha*Y2;          /* contaminated normal */
   output;
end;
run;
 
title "Contaminated Normal Distribution";
title2 "alpha = &alpha; lambda = &lambda";
proc sgplot data=CNPDF;
   label CN = "0.9 N(0,1) + 0.1 N(0,3)";
   series x=x y=Y1;
   series x=x y=Y2;
   series x=x y=CN / lineattrs=(thickness=3);
   xaxis grid values=(-9 to 9);
   yaxis grid;
run;

[Figure: Contaminated normal density]

As shown in the graph, the contaminated normal distribution (shown with a thick line) has heavier tails than the "uncontaminated" normal component.

Random samples from the contaminated normal distribution

The book Simulating Data with SAS (2013, p. 120) provides an algorithm for simulating data from a contaminated normal distribution. The algorithm is a special case of simulating from a mixture distribution. For each observation, you choose the wider (contaminating) component with probability α and the standard normal component with probability 1 - α, and then generate a value from whichever component is chosen, as follows:

%let alpha= 0.1;               /* level of contamination */
%let lambda = 3;               /* magnitude of contamination */
%let N = 100;                  /* size of sample */
data CNRand(keep=x contaminate);
call streaminit(12345);
do i = 1 to &N;
   contaminate = rand("Bernoulli", &alpha);
   if contaminate then
      x = rand("Normal", 0, &lambda);
   else
      x = rand("Normal", 0, 1);
   output;
end;
run;
 
proc sgplot data=CNRand;
   histogram x;
   density x / type=normal(mu=0 sigma=1) name="normal";
   density x / type=kernel name="kernel";
   fringe x / group=contaminate  lineattrs=(thickness=2);
   keylegend "kernel" "normal" / location=inside position=topright across=1;
   yaxis offsetmin=0.035;
run;

[Figure: Random sample from a contaminated normal distribution]

The histogram shows the distribution of the simulated sample. A kernel density estimate is overlaid, as is the density for the uncontaminated N(0,1) component. A fringe plot (also called a "rug plot") is shown underneath the histogram so that you can see the actual values in the sample.

The data that are generated by the uncontaminated component are shown in one color; the data from the contaminated component are shown in a different color. Whereas the standard normal density rarely produces data values for which |x| > 3, the sample contains four values that exceed 3 in magnitude. As you can see, all four "extreme" values are generated from the contaminated component. However, the contaminated component also generates two values that are not so extreme.

CDF and quantiles for the contaminated normal distribution

You can easily compute the cumulative distribution (and therefore probabilities) of the contaminated normal (CN) distribution as a linear combination of the component CDFs. For example, if you want to know the probability that a random observation from the CN distribution exceeds 3 in magnitude, you can compute that probability as follows:

data Prob;
   x = 3;
   leftP  = (1-&alpha)*cdf("Normal", -x, 0, 1) +
                &alpha*cdf("Normal", -x, 0, &lambda);
   totalP = 2*leftP;     /* distribution is symmetric */
run;
proc print noobs;run;

[Output table: CDF of contaminated normal distribution]

The table shows that the probability is 0.034. Almost all of that probability comes from the contaminated component of the distribution.
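
To see how the probability splits between the two components, you can compute each term of the mixture separately. The following sketch uses the same parameters (α = 0.1, λ = 3) but sets them as DATA step variables so that it runs on its own; the data set and variable names (TailParts, mainP, contamP, share) are illustrative assumptions:

```sas
/* Sketch: split P(|X| > 3) into the contribution of each component */
data TailParts;
   alpha = 0.1;  lambda = 3;  x = 3;
   mainP   = 2*(1-alpha)*cdf("Normal", -x, 0, 1);   /* from N(0,1) */
   contamP = 2*alpha*cdf("Normal", -x, 0, lambda);  /* from N(0,3) */
   totalP  = mainP + contamP;                       /* P(|X| > 3) */
   share   = contamP / totalP;       /* fraction due to contamination */
run;
proc print data=TailParts noobs; run;
```

The N(0,1) term should be roughly 0.0024 and the N(0,3) term roughly 0.0317, so the wider component accounts for about 93% of the total.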

The quantile function of the CN distribution does not have a closed-form solution. Given a probability P, the quantile of the CN is the value of x for which P = (1-α)CDF(x; μ, σ) + αCDF(x ; μ, λσ). You can use a root-finding method to find the quantile. For example, you can use the FROOT function in SAS/IML, as follows:

proc iml;
start CNQuantile(x) global (alpha, lambda, Prob);
   return Prob - ( (1-alpha)*CDF("Normal", x, 0, 1) + 
                      alpha *CDF("Normal", x, 0, lambda) );
finish;
 
Prob = 0.01708;
alpha = 0.1;
lambda = 3;
x = froot("CNQuantile", {-9 0});  /* find quantile for Prob */
print x;

[Output table: Quantile of the contaminated normal distribution]

The computation finds that the quantile for 0.01708 is about -3.
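
If SAS/IML is not available, the same quantile can be approximated in a DATA step. The sketch below swaps in plain bisection for FROOT, which works because the CN cumulative distribution is increasing in x. The bracket [-9, 0], the tolerance, and the iteration cap are assumptions chosen for this example:

```sas
/* Sketch: bisection for the CN quantile (a substitute for FROOT) */
data CNQuantileBisect;
   alpha = 0.1;  lambda = 3;  Prob = 0.01708;
   lo = -9;  hi = 0;            /* bracket: CDF(lo) < Prob < CDF(hi) */
   do iter = 1 to 60 while (hi - lo > 1e-8);
      x = (lo + hi)/2;
      F = (1-alpha)*cdf("Normal", x, 0, 1) +
             alpha *cdf("Normal", x, 0, lambda);
      if F < Prob then lo = x;  /* quantile lies to the right of x */
      else hi = x;              /* quantile lies at or to the left of x */
   end;
   x = (lo + hi)/2;             /* midpoint of the final bracket */
   put "Quantile estimate: " x=;
run;
```

The estimate should agree with the FROOT result (about -3) to within the tolerance.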

Not all researchers agree that a contaminated normal distribution is an appropriate model for non-Gaussian data. Gleason (1993, JASA) provides an overview of the history of the CN distribution, discusses how the parameters contribute to the elongation of the tail, and compares the CN with other long-tailed distributions.

tags: Outliers, Simulation

The post The contaminated normal distribution appeared first on The DO Loop.

December 28, 2016
 

SAS temporary arrays are an underutilized jewel in the SAS toolbox. I find that many beginning to intermediate SAS programmers are not familiar with temporary arrays. The good news is that there is nothing complicated about them and they are very useful. First of all, what is a temporary array? […]

The post SAS Temporary Arrays, Not Just for Experts appeared first on SAS Learning Post.

December 27, 2016
 
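The Risk Intelligence Department at Ant Financial is hiring for data mining and machine learning roles, based in Shanghai or Hangzhou.
Job description:
Primarily data mining in the field of internet finance risk control.
Requirements:
(1) Coding ability (Python or Java)
(2) Practical experience applying data mining and machine learning (in the internet industry)
(3) Familiarity with Hadoop, TensorFlow, or Spark
(4) Enthusiasm for the work
Feel free to get in touch through the site.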

 
Posted at 8:46 PM
December 27, 2016
 

It’s no secret that the US energy landscape has undergone massive changes in recent years: the emergence of cost-effective renewables, the natural gas revolution, the wide-scale penetration of intelligence across energy delivery networks, and soon a new resident at 1600 Pennsylvania Avenue. All of these changes are impacting different pockets of […]

An era of promise and uncertainty for the oil and gas industry was published on SAS Voices.

December 27, 2016
 

In a previous post of this series, we saw how to configure SAS Studio to better manage user preferences in SAS Grid environments. There are additional settings that an administrator can use to properly configure a multi-user environment; as you may imagine, these options deserve special consideration when SAS Studio is deployed in SAS Grid environments.

SAS Studio R&D and product management often collect customer feedback and suggestions, especially during events such as SAS Global Forum. We received several requests for SAS Studio to provide administrators with the ability to globally set various options. The goal is to eliminate the need to have all users define them in their user preferences or elsewhere in the application. To support these requests, SAS Studio 3.5 introduced a new configuration option, webdms.globalSettings. This setting specifies the location of a directory containing XML files used to define these global options.

Tip #1

How can I manage this option?

The procedure is the same as we have already seen for the webdms.studioDataParentDirectory property. They are both specified in the config.properties file in the configuration directory for SAS Studio. Refer to the previous blog for additional details, including considerations for environments with clustered mid-tiers.

Tip #2

How do I configure this option?
By default, this option points to the directory path !SASROOT/GlobalStudioSettings. SASROOT translates to the directory where SAS Foundation binaries are installed, such as /opt/sas/sashome/SASFoundation/9.4 on Unix or C:/Program Files/SASHome/SASFoundation/9.4/ on Windows. It is possible to change the webdms.globalSettings property to point to any chosen directory.

The SAS Studio 3.6 documentation provides an additional key detail: in a multi-machine environment, the GlobalStudioSettings directory must be on the machine that hosts the workspace servers used by SAS Studio. In grid environments, this means that the location should be on shared storage accessible by every node.
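
As an illustration, an entry in config.properties that points the option at shared storage might look like the following. The property name comes from the text above; the path /shared/sas/GlobalStudioSettings is a hypothetical example, not a default or recommended location:

```properties
# config.properties in the SAS Studio configuration directory
# Hypothetical shared-storage path; replace with a directory that
# every grid node can reach.
webdms.globalSettings=/shared/sas/GlobalStudioSettings
```

Whatever directory is chosen, it would need to be readable from every node that can host a workspace server.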

Tip #3

Configuring Global Folder Shortcuts

In SAS Studio, end users can create folder shortcuts from the Files and Folders section in the navigation pane. An administrator might want to create global shortcuts for all the users, so that each user does not have to create these shortcuts manually. This is achieved by creating a file called shortcuts.xml in the location specified by webdms.globalSettings, as detailed in

SAS Studio repositories are an easy way to share tasks and snippets between users. An administrator may want to configure one or multiple centralized repositories and make them available to everyone. SAS Studio users could add these repositories through their Preferences window, but it’s easier to create global repositories that are automatically available from the Tasks and Utilities and Snippets sections. Again, this is achieved by creating a file called repositories.xml in the location specified by webdms.globalSettings, as detailed in

tags: SAS Administrators, SAS Grid Manager, SAS Professional Services, sas studio

More SAS Studio Tips for SAS Grid Manager Administrators: Global Settings was published on SAS Users.