R

March 7, 2018
 

The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to [...]

The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.

May 24, 2017
 

According to Hyndman and Fan ("Sample Quantiles in Statistical Packages," TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.

Definitions of sample quantiles

Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ ... ≤ x[N], and suppose that you are interested in estimating the pth quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.

Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common:

  • The value p and the sample size N are used to determine two adjacent data values, x[j] and x[j+1]. The quantile estimate will be in the closed interval between those data points. For the previous example, the quantile estimate would be in the closed interval between x[6] and x[7].
  • For many methods, a fractional quantity is used to determine an interpolation parameter, λ. For the previous example, the fractional quantity is (Np - j) = (6.4 - 6) = 0.4. If you use λ = 0.4, then an estimate of the 64th percentile would be the value 40% of the way between x[6] and x[7].
  • Each definition has a parameter m, 0 ≤ m ≤ 1, which determines how the method interpolates between adjacent data points. In general, the methods define the index j by using j = floor(Np + m). The previous example used m=0, but other choices include m=0.5 or values of m that depend on p.

Thus a general formula for quantile estimates is q = (1 - λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.
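As a concrete illustration, here is a minimal R sketch of that general recipe (a hypothetical helper for illustration only, not one of the nine published definitions; the edge cases for very small or very large p are discussed in the next paragraph):

# General recipe: j = floor(N*p + m), lambda = N*p + m - j,
# estimate = (1 - lambda)*x[j] + lambda*x[j+1]
quantile_general <- function(x, p, m = 0) {
  x <- sort(x)
  N <- length(x)
  j <- floor(N * p + m)
  lambda <- N * p + m - j
  if (j < 1)  return(x[1])     # p very close to 0
  if (j >= N) return(x[N])     # p very close to 1
  (1 - lambda) * x[j] + lambda * x[j + 1]
}
quantile_general(c(0, 1, 1, 1, 2, 2, 2, 4, 5, 8), p = 0.64)   # interpolates between x[6] and x[7]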

You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary. The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1 and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.

Compute all nine sample quantile definitions in SAS

SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS to the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.

SAS definitions of sample quantiles
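In R, for example, you can request any of the nine H&F definitions through the type argument of the quantile function. A minimal sketch, using the 10-observation data set that appears later in this post:

# Sample quantiles for all nine H&F definitions via base R's quantile()
x <- c(0, 1, 1, 1, 2, 2, 2, 4, 5, 8)
p <- c(0.25, 0.50, 0.64, 0.75)
sapply(1:9, function(t) quantile(x, probs = p, type = t))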

It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 - c.

The following table presents the parameters for computing the four sample quantile definitions that are not natively supported in SAS:

Definitions of sample quantiles that are not natively supported in SAS

Visualizing the definitions of sample quantiles

Visualization of nine definitions of sample quantiles, from Hyndman and Fan (1996)

You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F. The differences between the definitions are most evident for small data sets and when there is a large "gap" between one or more adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, ..., 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.

In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.

You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.

You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method, whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.

Comparison of the default quantile estimates in SAS and R on a tiny data set
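You can reproduce the comparison numerically in R (a quick sketch; per the table above, type=2 corresponds to the SAS default and type=7 to the R default):

x <- c(0, 1, 1, 1, 2, 2, 2, 4, 5, 8)
p <- c(0.10, 0.25, 0.50, 0.64, 0.90)
rbind(SAS_default = quantile(x, probs = p, type = 2),   # data values or averages of adjacent values
      R_default   = quantile(x, probs = p, type = 7))   # linear interpolation within the data range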

Does the definition of sample quantiles matter?

As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.

However, in practice analysts don't worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.

The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantiles are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.

Comparison of the default quantile estimates in SAS and R on a larger data set

So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.

In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.

SAS program to compute nine sample quantiles.

The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.

July 31, 2015
 

Last week, SAS released the 14.1 version of its analytics products, which are shipped as part of the third maintenance release of SAS 9.4. If you run SAS/IML programs from a 64-bit Windows PC, you might be interested to know that you can now create matrices with about 2^31 ≈ 2 billion elements, provided that your system has enough RAM. (On Linux operating systems, this feature has been available since SAS 9.3.)

A numerical matrix with 2 billion elements requires 16 GB of RAM. In terms of matrix dimensions, this corresponds to a square numerical matrix that has approximately 46,000 rows and columns. I've written a handy SAS/IML program to determine how much RAM is required to store a matrix of a given size.
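The arithmetic is simple because a dense numeric matrix stores 8 bytes per element. Here is a quick R sketch of the same calculation (not the SAS/IML program mentioned above):

# RAM (in GB) required to store a dense n_rows x n_cols matrix of doubles
ram_gb <- function(n_rows, n_cols = n_rows) {
  n_rows * n_cols * 8 / 2^30
}
ram_gb(46000)   # about 15.8 GB for a 46,000 x 46,000 matrix
ram_gb(25000)   # about 4.7 GB, the example used later in this post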

If you are running 64-bit SAS on Windows, this article describes how to set an upper limit for the amount of memory that SAS allocates for large matrices.

The MEMSIZE option

The amount of memory that SAS can allocate depends on the value of the MEMSIZE system option, which has a default value of 2GB on Windows. Many SAS sites do not override the default value, which means that SAS cannot allocate more than 2 GB of system memory.

You can run PROC OPTIONS to display the current value of the MEMSIZE option.

proc options option=memsize value;
run;
Option Value Information For SAS Option MEMSIZE
    Value: 2147483648
    Scope: SAS Session
    How option value set: Config File
    Config file name:
            C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg

The value 2,147,483,648 is shown in the SAS log. The value is, unfortunately, reported in bytes; it corresponds to 2 GB. Unless you change the MEMSIZE option, you will not be able to allocate a square matrix with more than about 16,000 rows and columns. For example, unless SAS can allocate 5 GB or more of RAM, the following SAS/IML program will produce an error message:

proc iml; 
/* allocate 25,000 x 25,000 matrix, which requires 4.7 GB */
x = j(25000, 25000, 0);


ERROR: Unable to allocate sufficient memory.

You can use the MEMSIZE system option to permit SAS to allocate a greater amount of system memory. SAS does not grab this memory and hold onto it. Instead, the MEMSIZE option specifies a maximum value for dynamic allocations.

The MEMSIZE option only applies when you launch SAS, so if SAS is currently running, save your work and exit SAS before continuing.

Changing the command-line invocation for SAS

If you run SAS locally on your PC, you can add the -MEMSIZE command-line option to the shortcut that you use to invoke SAS. This example uses "12G" to permit SAS to allocate up to 12 GB of RAM, but you can use different numbers, such as 8G or 16G.

  1. Locate the "SAS 9.4" icon on your Desktop or the "SAS 9.4" item on the Start menu.
  2. Right-click on the shortcut and select Properties
  3. A dialog box appears. Edit the Target field and insert -MEMSIZE 12G at the end of the current text, as shown in the image.

  memsize
  4. Click OK.

Every time you use this shortcut to launch SAS, the SAS process can allocate up to 12 GB of RAM. You can also specify -MEMSIZE 0, which permits allocations up to 80% of the available RAM. Personally, I do not use -MEMSIZE 0 because it permits SAS to consume most of the system memory, which does not leave much for other applications. I rarely permit SAS to use more than 75% of my RAM.

After editing the shortcut, launch SAS and call PROC OPTIONS. This time you should see something like the following:

Option Value Information For SAS Option MEMSIZE
    Value: 12884901888
    Scope: SAS Session
    How option value set: SAS Session Startup Command Line

SAS configuration files

A drawback of the command-line approach is that it only applies to a SAS session that is launched from the shortcut that you modified. In particular, it does not apply to launching SAS by double-clicking on a .sas or .sas7bdat file.

An alternative is to create or edit a configuration file. The SAS documentation has long and complete instructions about how to edit the sasv9.cfg file that sets the system options for SAS when SAS is launched.

SAS 9 creates two default configuration files during installation. Both configuration files are named SASV9.CFG. I suggest that you edit the one in !SASHOME\SASFoundation\9.4, which on many installations is C:\Program Files\SASHome\SASFoundation\9.4. By default, that configuration file has a -CONFIG option that points to a language-specific configuration file. Put the -MEMSIZE option and any other system options after the -CONFIG option, as follows:

-config "C:Program FilesSASHomeSASFoundation9.4nlsensasv9.cfg"
-RLANG
-MEMSIZE 12G

Notice that I also put the -RLANG option in this sasv9.cfg file. The -RLANG system option specifies that SAS/IML software can interface with the R language.

If you now double-click on a .sas file to launch SAS, PROC OPTIONS reports the following information:

Option Value Information For SAS Option MEMSIZE
    Value: 12884901888
    Scope: SAS Session
    How option value set: Config File
    Config file name:
            C:\Program Files\SASHome\SASFoundation\9.4\SASV9.CFG

If you add multiple system options to the configuration file, you might want to go back to the SAS 9.4 Properties dialog box (in the previous section) and edit the Target value to point to the configuration file that you just edited.

Remote SAS servers

If you connect to a remote SAS server and submit SAS/IML programs through SAS/IML Studio, SAS Enterprise Guide, or SAS Studio, a SAS administrator has probably provided a configuration file that specifies how much RAM can be allocated by your SAS process. If you need a larger limit, discuss the situation with your SAS administrator.

Final thoughts on big matrices

You can create SAS/IML matrices that have millions of rows and hundreds of columns. However, you need to recognize that the cost of many matrix computations grows cubically with the matrix dimension. For example, many computations on an n x n matrix require on the order of n^3 floating point operations. Consequently, although you might be able to create extremely large matrices, computing with them can be very time consuming.

In short, allocating a large matrix is only the first step. The wise programmer will time a computation on a sequence of smaller problems as a way of estimating the time required to tackle The Big Problem.
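Here is a hedged R sketch of that strategy, using a dense linear solve as a stand-in for an O(n^3) computation (the same idea applies to a SAS/IML program):

# Time the computation on a sequence of smaller sizes, then use the cubic
# trend to extrapolate the cost of the full-size problem.
sizes <- c(1000, 2000, 4000)
times <- sapply(sizes, function(n) {
  A <- matrix(rnorm(n * n), n, n)
  system.time(solve(A))[["elapsed"]]   # roughly O(n^3) work
})
fit <- lm(log(times) ~ log(sizes))     # the slope should be close to 3
exp(predict(fit, newdata = data.frame(sizes = 20000)))   # rough estimate for a 20,000 x 20,000 solve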

tags: 14.1, R, SAS Programming

The post Large matrices in SAS/IML 14.1 appeared first on The DO Loop.

May 13, 2015
 

I hadn't played with SAS/IML for a while, but I came back to it when I needed to read some R-format data.

Technically, .Rdata is not a data format. It's rather a big container that holds a bunch of R objects:

Rdata

In this example, when the .RData file is loaded, three objects are included, of which 'data' (the 'real' data) and 'desc' (the data-description portion) are of interest.
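On the R side, a quick way to see what an .RData container holds is to note that load() returns the names of the objects it restores. A minimal sketch (the path and the object names 'data' and 'desc' come from the example below):

objs <- load("C:/data/w5/R data sets for 5e/GPA1.RData")
print(objs)   # the names of the objects stored in the file, e.g. "data" and "desc"
str(data)     # the 'real' data set
str(desc)     # the data-description table (variable names and labels)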

SAS/IML offers a nice interface to call R commands, which can be used to read the R-format data:

proc iml;
submit / R;
  load("C:/data/w5/R data sets for 5e/GPA1.RData")
endsubmit;

call ImportDataSetFromR("work.GPA1", "data");
call ImportDataSetFromR("work.GPA1desc", "desc");

quit;

data _null_;
set GPA1desc end = eof ;
i+1;
II=left(put(i,3.));
call symputx('var'||II,variable);
call symputx('label'||II,label);
if eof then call symputx('n',II);
run;

%macro labelit;
data gpa1;
set gpa1;

    label
%do i=1 %to &n;
&&var&i = &&label&i
%end;
;
run;
%mend;

%labelit

April 2, 2014
 

Last year I gave a talk at SESUG 2013 on list manipulation in SAS using a collection of function-like macros. Today I discovered in my recently upgraded SAS 9.4 that I can play with lists natively, which means I can create a list, slice a list, and do other list operations in DATA steps! This is not documented yet (which means it will not be supported by the software vendor), and I can see a warning message in the Log window like “WARNING: List object is preproduction in this release”. It is still somewhat limited, so use it at your own risk (and, of course, for fun). Adding such a versatile list object will definitely make SAS programmers more powerful. I will keep watching its further development.

*************Update********

Some readers emailed me that they couldn't get the expected results I showed here. I think it's best to check your own system:

I. Make sure you use the latest SAS software. I only tested on a 64-bit Windows 7 machine with SAS 9.4 TS1M1:

SAS94

II. Make sure all hotfixes were applied (You can use this SAS Hot Fix Analysis, Download and Deployment Tool).

hotfix

*************Update End********

The following are some quick experiments; I will report more after further research:

1. Create a List

It’s easy to create a list:

data _null_;
a = ['apple', 'orange', 'banana'];
put a;
run;

The output in the Log window:

list1

You can also convert a string to a list:

data _null_;
a = 'SAS94';
b = list(a);
put b;
run;

list2

2. Slice a List

Slicing a list is also pretty straightforward, like in R and Python:

data _null_;
a = ['apple', 'orange', 'banana'];
b = a[0];
c = a[:-1];
d = a[1:2];
put a;
put b;
put c;
put d;
run;

list3
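For comparison, here is roughly how the same list creation and element access look in R (a quick sketch; note that R lists are 1-based, unlike the 0-based indexing shown above):

a <- list("apple", "orange", "banana")
a[[1]]          # first element: "apple"
a[1:2]          # sub-list with the first two elements
a[-length(a)]   # everything except the last element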

3. List is Immutable in SAS!?

I was feeling quite comfortable playing with list operations in SAS, but then a weird thing happened. I tried to change a value in a list:

data _null_;
a = ['apple', 'orange', 'banana'];
a[0] = 'Kivi';
put a;
run;

Unexpectedly, I got an error:

list4

Huh, I need to create a new list to hold such a modification? This is funny.

Based on my quick exploration, the list object in SAS is pretty intuitive from a programmer's point of view. But since it's undocumented and I don't know how long it will stay in the "preproduction" phase, just be careful about using it in your production work.

Personally, I find it very exciting to "hack" such wonderful list features in SAS 9.4. If well implemented, it will easily beat R and Python (which pride themselves on supporting rich data types and objects) as a scripting language for SAS programmers. I will keep updating this page.

December 31, 2013
 
Just some tips:

options(error=recover): tells R to launch a debugging session when an error occurs, and you can choose which call frame to debug.


options(show.error.locations=TRUE): lets R show the source line number where the error occurred.

Something else:

Use traceback() to locate where the last error occurred, and then use browser() to run the function again and check what is wrong.
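Here is a toy sketch of that workflow (a deliberately buggy function for illustration; in an interactive session the error drops you into recover(), and traceback() shows the call stack afterwards):

options(show.error.locations = TRUE)   # report source line numbers with errors
options(error = recover)               # on error, choose which call frame to inspect

g <- function(x) sqrt(x) + not_defined   # deliberately buggy: 'not_defined' does not exist
f <- function(x) g(x)

f(4)          # triggers the error; recover() lets you browse the frames of f() and g()
traceback()   # afterwards, shows the call stack: f(4) -> g(x)

# To re-run the function under the debugger, insert browser() and call it again:
g <- function(x) { browser(); sqrt(x) + not_defined }
# f(4)   # execution now pauses inside g() so you can inspect x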
December 25, 2013
 


## first explain what is type="terms":  If type="terms" is selected, a matrix of predictions 
## on the additive scale is produced, each column giving the deviations from the overall mean
## (of the original data's response, on the additive scale), which is given by the attribute "constant".
 
set.seed(9999)
x1=rnorm(10)
x2=rnorm(10)
y=rnorm(10)
lmm=lm(y~x1+x2)
predlm = predict(lmm, type="terms")   # assign the result so the checks below can use predlm[,1] and predlm[,2]
 
lmm$coefficient[1]+lmm$coefficient[2]*x1+mean(lmm$coefficient[3]*x2)-mean(y)-predlm[,1]
lmm$coefficient[1]+lmm$coefficient[3]*x2+mean(lmm$coefficient[2]*x1)-mean(y)-predlm[,2]
December 24, 2013
 
library(ROCR)
library(Hmisc)
 
## calculate AUC from the package ROCR and compare with it from Hmisc
 
# method 1: from ROCR
data(ROCR.simple)
pred = prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf = performance(pred, 'tpr', 'fpr')   # true positive rate vs. false positive rate
plot(perf, colorize=T)
 
perf2 = performance(pred, 'auc')
auc = unlist(slot(perf2, 'y.values'))    # this is the AUC
 
# method 2: from Hmisc
rcorrstat = rcorr.cens(ROCR.simple$predictions, ROCR.simple$labels)
rcorrstat   # 1st element is the AUC (C index), 2nd is the Accuracy Ratio (Gini coefficient, PowerStat, or Somers' D)
December 10, 2013
 
It seems that the combination of R and Hadoop is a must-have toolkit for people working with both statistics and large data sets.

An aggregation example

The Hadoop version used here is Cloudera's CDH4, and the underlying Linux OS is CentOS 6. The data used is a simulated sales data set from a training course by Udacity. The format of each line of the data set is: date, time, store name, item description, cost, and method of payment. The six fields are separated by tabs. Only two fields, store and cost, are used to aggregate the cost by store.
A typical MapReduce job contains two R scripts: mapper.R and reducer.R.
Mapper.R
#!/usr/bin/env Rscript
# Run in batch mode via Rscript (don't use a path like /usr/bin/R)

options(warn=-1)

# Read tab-separated records from stdin and write tab-separated output to stdout
input = file("stdin", "r")
while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
    fields = unlist(strsplit(currentLine, "\t"))
    # Make sure the line has six fields
    if (length(fields) == 6) {
        cat(fields[3], fields[5], "\n", sep="\t")
    }
}
close(input)
Reducer.R
#!/usr/bin/env Rscript

options(warn=-1)
salesTotal = 0
oldKey = ""

# Loop over the mapper output, which arrives as tab-separated key-value pairs
input = file("stdin", "r")
while (length(currentLine <- readLines(input, n=1, warn=FALSE)) > 0) {
    data_mapped = unlist(strsplit(currentLine, "\t"))
    if (length(data_mapped) != 2) {
        # Something has gone wrong; skip this record
        next
    }

    thisKey = data_mapped[1]
    thisSale = as.double(data_mapped[2])

    if (!identical(oldKey, "") && !identical(oldKey, thisKey)) {
        cat(oldKey, salesTotal, "\n", sep="\t")
        oldKey = thisKey
        salesTotal = 0
    }

    oldKey = thisKey
    salesTotal = salesTotal + thisSale
}

# Emit the total for the last key
if (!identical(oldKey, "")) {
    cat(oldKey, salesTotal, "\n", sep="\t")
}

close(input)

Testing

Before running MapReduce, it is better to test the scripts with some Linux commands.
# Make the R scripts executable
chmod +x mapper.R
chmod +x reducer.R
ls -l

# Strip out a small file to test
head -500 purchases.txt > test1.txt
cat test1.txt | ./mapper.R | sort | ./reducer.R

Execution

One way is to specify all the paths explicitly and start the MapReduce streaming job.
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
    -mapper mapper.R -reducer reducer.R \
    -file mapper.R -file reducer.R \
    -input myinput \
    -output joboutput
Or we can use the alias under CDH4, which saves a lot of typing.
hs mapper.R reducer.R myinput joboutput
Overall, the MapReduce job driven by R runs smoothly. The Hadoop JobTracker can be used to monitor or diagnose the overall process.

RHadoop or streaming?

RHadoop is a package developed by Revolution Analytics that allows users to run MapReduce jobs directly from R, and it is surely a much more popular way to integrate R and Hadoop. However, the package is currently evolving quickly and has complicated dependencies. As an alternative, streaming functionality is built into Hadoop and supports all programming languages, including R. If a proper installation of RHadoop poses a challenge, then streaming is a good starting point.
November 29, 2013
 

It's the time of year to give thanks. As a programmer, e-book reader, blog writer, and web surfer, I should express my sincere appreciation for the following hardware and software (I use the majority of them on a daily basis, and most of them are free):

0. Hardware

Lenovo Thinkpad W520 (This is not free): my workhorse machine, now replaced by W530.

1. Google Stuff

Google Chrome (Windows and Android apps): The first thing to do when getting a new machine is to use IE to download Chrome; then I can just keep moving.

Gmail, Google Maps, Google Voice, Google Talk, Google Search, Google Drive, Google Keep: just can’t live without them!

Google Nexus 7 (This is not free): first generation. This year I got four tablets, all the same Nexus 7 (that's a long story...), and now I keep only one, and I love it: compared to the iPad, it's much more portable and fits in my back pocket; compared to the iPad mini, it's much cheaper! This tablet is heavily used for reading and offline navigation.

2. Windows Applications Supporting Tabs

I'm a big fan of "tabs" and feel much more comfortable when everything opens in tabs: web pages, text files, PDF files, Microsoft Office files (Word, Excel, PPT), and even folders:

Clover as replacement for Windows Explore;

Google Chrome as replacement for Internet Explore (IE);

Foxit Reader as replacement for Adobe Reader;

Office Tab (This is not free) to open Microsoft Office Word, Excel, and PowerPoint files in tabs;

MTPuTTY to open PuTTY sessions in tabs.

3. Programming and Accessories

Notepad++: the best of the best for opening and editing all the text files within a folder.

Vim: my only choice on Unix boxes; it also serves as a humble SAS IDE.

SAS (This is not free): I'm primarily a SAS programmer and I live on it; it's my first choice for data manipulation and reporting.

R and RStudio: R is hot, and this February I took an R programming course on Coursera, Computing for Data Analysis, by Dr. Roger Peng of the Johns Hopkins Biostatistics Department. It's nice to learn a new language, and I'm now well prepared to jump into the R vs. SAS debate. Also, RStudio is an elegant IDE.

PowerShell: since I finished a project in PowerShell (almost on a learn-by-doing basis), I must say it deserves the effort to learn.

Perl: I learned and used a little bit of Perl this year. It's ugly but good enough to get work done, and most importantly, it's pre-installed on every Unix machine I work with. Vim + Perl seem like the perfect pair for surviving in any Unix world. For many good reasons I should keep learning Python, but Python currently doesn't help me make any bucks.

Cygwin: My W520 has Windows 7 installed, and Cygwin offers a nice collection of the Unix tools I'm interested in.

GitHub: I put my collection of SAS utilities on GitHub.

Beyond Compare (This is not free): the best for file/folder comparison.

4. Reading and Writing

Calibre: my first choice for e-book management, along with the Android app, Calibre Companion (this is not free).

Feedly: Google Reader is dead, and Feedly is now even better than Google Reader was! My first online resource collector.

Moon+ Reader Pro: my favorite reading app on my phone and tablet, better than the Kindle app and any others, with nice Dropbox support.

Windows Live Writer: the only Windows Live product left on my machine; I use it to write every one of my blog posts. The best offline writer (its developers also produce RStudio).

5. Miscellaneous

Dropbox: it’s even better than Google Drive!

Pandora: I leave it open when at home.

Password Safe: I dumped hundreds of passwords there…