Ken Kleinman

December 2, 2016
 
About two years ago we published a quick and easy guide to setting up your own RStudio server in the cloud using the Docker service and Digital Ocean. The process is incredibly easy-- about the only cumbersome part is retyping a random password. Today the excitement in virtual private servers is that Amazon is getting into the market, with their Lightsail product. They are not undercutting Digital Ocean entirely-- in fact, their prices look to be just about identical. But Amazon's interface may have some advantages for you, so here's how to get Docker and RStudio running with Amazon Lightsail.



1. Log in to Lightsail

2. Create an Instance: choose "Base OS" as the image type, then Ubuntu (16.04 LTS as of this writing)

3. Name it what you like

4. Wait for boot up. Once it's running, click "connect" under the three dots. This opens a console window where you are already logged in, saving some headache vs. Digital Ocean.

5. Time for console commands. Type: sudo apt-get install docker.io and then answer Y when asked to confirm the installation.

6. Type: sudo service docker start

7. Now you can start your docker/rstudio container. See our earlier blog post or this link for resources. Shortcuts:

a. Plain RStudio: sudo docker run -d -p 8787:8787 rocker/rstudio

b. All of Hadleyverse: sudo docker run -d -p 8787:8787 rocker/hadleyverse

c. Custom password: sudo docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse

d. Enable root: sudo docker run -d -p 8787:8787 -e ROOT=TRUE rocker/rstudio

8. Important! While the container is starting, go back to the Lightsail tab in your browser and click the three dots on the "Running" instance, then choose Manage. Click on the Networking tab. In the table of two enabled ports, click the plus sign ("Add another"). Leave "Custom" and "All" under "Application" and "Protocol", respectively, and change the port range to 8787. Save.

9. The public IP is printed on that Networking page. Cut and paste it into your browser with :8787 appended. Your username and password are both rstudio, unless you changed them. To allow additional users onto your cloud server, see this page.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, other than as noted above, the aggregator is violating the terms by which we publish our work.
January 18, 2016
 
I love GitHub for version control and collaboration, though I'm no master of it. And the tools for integrating git and GitHub with RStudio are just amazing boons to productivity.

Unfortunately, my University-supplied computer does not play well with GitHub. Various directories are locked down, and I can't push or pull to GitHub directly from RStudio. I can't even use install_github() from the devtools package, which is needed for loading Shiny applications up to Shinyapps.io. I lived with this for a bit, using git from the desktop and rsconnect from a home computer. But what a PIA.

Then I remembered I know how to put RStudio in the cloud-- why not install git there, and make that be my GitHub solution?

It works great. The steps are below. In setting it up, I discovered that Digital Ocean has changed their set-up a little bit, so I've updated the earlier post as well.

1. Go to Digital Ocean and sign up for an account. By using this link, you will get a $10 credit. (Full disclosure: I will also get a $25 credit once you spend $25 real dollars there.) The reason to use this provider is that they have a system ready to run with Docker already built in, which makes it easy. In addition, their prices are quite reasonable. You will need to use a credit card or PayPal to activate your account, but you can play for a long time with your $10 credit-- the cheapest machine is $.007 per hour, up to a $5 per month maximum.

2. On your Digital Ocean page, click "Create droplet". Click on "One-click Apps" and select "Docker (1.9.1 on 14.04)". (The numbers in the parentheses are the Docker and Ubuntu versions, and might change over time.) Then choose a size (meaning cost/power) of machine and the region closest to you. You can ignore the other settings. Give your new computer an arbitrary name. Then click "Create Droplet" at the bottom of the page.

3. It takes a few seconds for the droplet to spin up. Then you should see your droplet dashboard. If not, click "Droplets" from the top bar. Under "More", click "Access Console". This brings up a virtual terminal to your cloud computer. Log in (your username is root) using the password that Digital Ocean sent you when the droplet spun up.

4. Start your RStudio container by typing: docker run -d -p 8787:8787 -e ROOT=TRUE rocker/hadleyverse

You can replace hadleyverse with rstudio if you like, for a quicker first-time installation, but many R users will want enough of Hadley Wickham's packages that it makes sense to install this version. The -e ROOT=TRUE is crucial for our approach to installing git into the container, but see the comment from Petr Simicek below for another way to do the same thing.

5. Log in to your Cloud-based RStudio. Find the IP address of your cloud computer on the droplet dashboard, and append :8787 to it, and just put it into your browser. For example: http://135.104.92.185:8787. Log in as user rstudio with password rstudio.

6. Install git, inside the Docker container. Inside RStudio, click Tools -> Shell.... Note: you have to use this shell; it's not the same as the droplet terminal. Type: sudo apt-get update and then sudo apt-get install git-core to install git.

git likes to know who you are. To set git up, from the same shell prompt, type git config --global user.name "Your Handle" and git config --global user.email "an.email@somewhere.edu"

7. Close the shell, and in RStudio, set things up to work with GitHub: Go to Tools -> Global Options -> Git/SVN. Click on create RSA key. You don't need a name for it. Create it, close the window, then view it and copy it.

8. Open GitHub, go to your Profile, click "Edit Profile", "SSH keys". Click "Add key", and just paste in the stuff you copied from RStudio in the previous step.

You're done! To clone an existing repository from GitHub to your cloud machine, open a new project in RStudio, and select Version Control, then Git, and paste in the URL that GitHub provides. Then work away!

December 31, 2015
 
I've been working on a Shiny app and wanted to display some math equations. It's possible to use LaTeX to show math using MathJax, as shown in this example from the makers of Shiny. However, by default, MathJax does not allow in-line equations, because the dollar sign is used so frequently. But I needed to use in-line math in my application. Fortunately, the folks who make MathJax show how to enable the in-line equation mode, and the Shiny documentation shows how to write raw HTML. Here's how to do it.

R

Here I replicated the code from the official Shiny example linked above. The magic code is inserted into ui.R, just below withMathJax().
## ui.R

library(shiny)

shinyUI(fluidPage(
  title = 'MathJax Examples with in-line equations',
  withMathJax(),
  # section below allows in-line LaTeX via $ in MathJax
  tags$div(HTML("<script type='text/x-mathjax-config'>
    MathJax.Hub.Config({
      tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
    });
    </script>")),
  helpText('An irrational number $\\sqrt{2}$
            and a fraction $1-\\frac{1}{2}$'),
  helpText('and a fact about $\\pi$: $\\frac2\\pi = \\frac{\\sqrt2}2 \\cdot
            \\frac{\\sqrt{2+\\sqrt2}}2 \\cdot
            \\frac{\\sqrt{2+\\sqrt{2+\\sqrt2}}}2 \\cdots$'),
  uiOutput('ex1'),
  uiOutput('ex2'),
  uiOutput('ex3'),
  uiOutput('ex4'),
  checkboxInput('ex5_visible', 'Show Example 5', FALSE),
  uiOutput('ex5')
))



## server.R

library(shiny)

shinyServer(function(input, output, session) {
  output$ex1 <- renderUI({
    withMathJax(helpText('Dynamic output 1: $\\alpha^2$'))
  })
  output$ex2 <- renderUI({
    withMathJax(
      helpText('and output 2 $3^2+4^2=5^2$'),
      helpText('and output 3 $\\sin^2(\\theta)+\\cos^2(\\theta)=1$')
    )
  })
  output$ex3 <- renderUI({
    withMathJax(
      helpText('The busy Cauchy distribution
                $\\frac{1}{\\pi\\gamma\\,\\left[1 +
                \\left(\\frac{x-x_0}{\\gamma}\\right)^2\\right]}\\!$'))
  })
  output$ex4 <- renderUI({
    invalidateLater(5000, session)
    x <- round(rcauchy(1), 3)
    withMathJax(sprintf("If $X$ is a Cauchy random variable, then
                         $P(X \\leq %.03f ) = %.03f$", x, pcauchy(x)))
  })
  output$ex5 <- renderUI({
    if (!input$ex5_visible) return()
    withMathJax(
      helpText('You do not see me initially: $e^{i \\pi} + 1 = 0$')
    )
  })
})

Give it a try (or check out the Shiny app at https://r.amherst.edu/apps/nhorton/mathjax/)! One caveat is that the other means of in-line display, as shown in the official example, doesn't work when the MathJax HTML is inserted as above.

January 6, 2015
 
In the US, it's typical to borrow a fairly substantial portion of the cost of a new house from a bank. The cost of these loans, the mortgage rate, varies over time depending on what the financial wizards see in their crystal balls. What this means is that when mortgage rates go down, the cost of living in your own house magically decreases: you take out a new loan at the lower rate, use it to pay off your old loan, and then make payments only on the cheaper new loan. You can find mortgage calculators on the web very easily-- if you don't mind them collecting your data and bombarding you with ads once their cookies start tracking you.

Instead, you can use SAS or R to calculate what you might pay for a new loan with various posted rates. There are some sophisticated tools available for either package if you're interested in the remaining principal or the proportion of each payment that's principal. Here, we just want to check the monthly payment.

R
We'll begin by writing a little function to calculate the monthly payment from the principal, interest rate (in per cent), and term (in years) of the loan. This is basic stuff, but the code here is adapted from a function written by Thomas Girke of UC Riverside.
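For reference, the function below implements the standard fixed-rate annuity formula: with monthly interest rate $J = \mbox{rate}/1200$ and $N = 12 \cdot \mbox{term}$ monthly payments on principal $P$, the monthly payment is

$$M = \frac{P \, J}{1 - (1+J)^{-N}}.$$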

mortgage <- function(principal=300000, rate=3.875, term=30) {
  J <- rate/(12 * 100)              # monthly interest rate
  N <- 12 * term                    # number of monthly payments
  M <- principal*J/(1-(1+J)^(-N))   # monthly payment (annuity formula)
  monthPay <<- M                    # also store the result globally
  return(monthPay)
}
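As a quick sanity check, calling the function with its default arguments (a $300,000 loan at 3.875% for 30 years) gives

mortgage()
[1] 1410.711

which is consistent with the $275,000 loans below, since the formula is linear in the principal.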
To compare the monthly costs for a series of loans offered by a local bank, we'll input the bank's loans as a data frame. To save typing, we'll use the rep() function to generate the term of the loan and the points.

offers = data.frame(
principal = rep(275000, times=9),
term = rep(c(30,20,15), each=3),
points = rep(c(0,1,2), times=3),
rate = c(3.875, 3.75, 3.5, 3.625, 3.5, 3.375, 3, 2.875, 2.75))

> offers

  principal term points  rate
1    275000   30      0 3.875
2    275000   30      1 3.750
3    275000   30      2 3.500
4    275000   20      0 3.625
5    275000   20      1 3.500
6    275000   20      2 3.375
7    275000   15      0 3.000
8    275000   15      1 2.875
9    275000   15      2 2.750
(Points are an up-front cost a borrower can pay to lower the mortgage rate for the loan.) With the data and function in hand, it's easy to add the monthly cost to the data frame:

offers$monthly = with(offers, mortgage(rate=rate, term=term, principal=principal))

> offers

  principal term points  rate  monthly
1    275000   30      0 3.875 1293.152
2    275000   30      1 3.750 1273.568
3    275000   30      2 3.500 1234.873
4    275000   20      0 3.625 1612.610
5    275000   20      1 3.500 1594.889
6    275000   20      2 3.375 1577.282
7    275000   15      0 3.000 1899.100
8    275000   15      1 2.875 1882.611
9    275000   15      2 2.750 1866.210
In theory, each of these loans is fairly priced, and the borrower should choose based on the monthly cost they can afford, as well as whether they see more value in having money in hand to spend on a better quality of life, or in investing it in savings, or in paying off their house sooner. Financial professionals often discuss things like the total dollars spent or the total spent on interest vs. principal, as well.
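As a rough illustration of that tradeoff, here's a minimal sketch of the break-even time for paying one point on the 30-year loan, using the mortgage() function and the rates above; it ignores investment returns, taxes, and the chance of moving or refinancing early:

point_cost = 0.01 * 275000   # one point = 1% of the principal, up front
saving = mortgage(principal=275000, rate=3.875, term=30) -
  mortgage(principal=275000, rate=3.75, term=30)   # monthly savings
point_cost / saving   # about 140 months (almost 12 years) to break even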

SAS
The SAS/ETS package provides the LOAN procedure, which can calculate the detailed analyses mentioned above. For simple calculations like this one, we can use the mort function in the data step. It will find and return the missing one of the four parameters-- principal, payment, rate, and term. To enter the data in a manner similar to R, we'll use array statements and do loops.

data t;
  principal = 275000;
  array te [3] (30, 20, 15);
  array po [3] (0, 1, 2);
  array ra [9] (.03875, .0375, .035, .03625, .035,
                .03375, .03, .02875, .0275);
  do i = 1 to 3;
    do j = 1 to 3;
      term = te[i];
      points = po[j];
      rate = ra[3 * (i-1) + j];
      monthly = mort(principal, ., rate/12, term*12);
      output;
    end;
  end;
run;

proc print noobs data = t;
var principal term points rate monthly; run;

principal    term    points      rate     monthly

   275000      30         0    0.03875    1293.15
   275000      30         1    0.03750    1273.57
   275000      30         2    0.03500    1234.87
   275000      20         0    0.03625    1612.61
   275000      20         1    0.03500    1594.89
   275000      20         2    0.03375    1577.28
   275000      15         0    0.03000    1899.10
   275000      15         1    0.02875    1882.61
   275000      15         2    0.02750    1866.21
December 3, 2014
 
In 2012, we presented a post showing how to run RStudio in the cloud on an Amazon server. There were 7 steps, including one with 7 sub-steps, one of which had 6 sub-sub-steps. It was still pretty easy, for what it was-- an effectively free computer in the cloud to run R on.

Today, we show the modern-- 3 years later!-- way to get the same result, only this approach is much easier, and the resulting installation includes all the best goodies of RStudio, including Markdown -> PDF and Hadley Wickham's packages pre-installed. Update, 2016: Digital Ocean has changed their set-up slightly. Check out the first step or two of this post in place of the first two steps below, if you're just starting out.

The approach builds on Docker, an infrastructure that saves start-up time and overhead, as well as efforts led by Dirk Eddelbuettel and Carl Boettiger to develop a Docker application of R. This project is called Rocker, and interested readers are encouraged to read the details. But if you want to just get up and running, here are the simple steps to get going.



1. Go to Digital Ocean and sign up for an account. By using this link, you will get a $10 credit. (Full disclosure: Ken will also get a $25 credit once you spend $25 real dollars there.) The reason to use this provider is that they have a system ready to run with Docker already built in. In addition, their prices are quite reasonable. You will need to use a credit card or PayPal to activate your account, but you can play for a long time with your $10 credit-- the cheapest machine is $.007 per hour, up to a $5 per month maximum.

2. On your Digital Ocean page, click "Create droplet". Then choose an (arbitrary) name, a size (meaning cost/power) of machine, and the region closest to you. You can ignore the settings. Under "Select Image", choose the "Applications" tab and select "Docker (1.3.2 on 14.04)". (The numbers in the parentheses are the Docker and Ubuntu version, and might change over time.) Then click "Create Droplet" at the bottom of the page.

3. It takes about a minute for the machine to start up. When it's ready, click the "Console Access" button. This opens a text terminal to your Ubuntu machine, inside your web page. Press enter to get a prompt, and log in (your username is root) using the password that was sent to your e-mail. You'll have to change the password.

4a. To start a terminal session of R, type
docker run --rm -ti rocker/r-base
You should see a bunch of messages about pulling and downloading, but eventually you will get the ">" prompt-- you can run R in here, but who would want to?

4b. To get RStudio server running, type
docker run -d -p 8787:8787 rocker/rstudio
But this is really not where you want to be. Instead, run the following command, to get a set-up that includes more useful packages installed in and with R.
docker run -d -p 8787:8787 rocker/hadleyverse


5. Use it! The IP address of your server is displayed below the terminal where you typed in your docker command. Open a new browser tab and go to the address http://(ip address):8787. For example: http://135.104.92.185:8787. You'll see the RStudio login screen, and can enter "rstudio" (without the quotes) as the username and password. The system is well tuned enough that you can open a new file --> markdown --> PDF and immediately click "Knit PDF", and see the example document beautifully presented back to you in moments.

That's it. It's still way cooler than sliced bread. Let us know if you try it, and if you run into any trouble. Oh, and if you're feeling creeped out by the standard username and password in your RStudio, you can set them from your docker command as follows.
docker run -d -p 8787:8787 -e USER=ken -e PASSWORD=ken rocker/hadleyverse
Other customization details and further information can be found on this Rocker page.

Update
I should perhaps have noted that what you are running here is in fact RStudio Server, and that you can allow additional users on your RStudio using instructions found here.

October 22, 2014
 
Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.

Obs    ID    CESD1    CESD2    CESD3    CESD4    CESD

  1     1        7        .        8        5      49
  2     2       11        .        .        .      30
  3     3       14        .        .       49      39
  4     4       44        .        .       20      15
  5     5       26       27       15       28      39
...


Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and a column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

Obs    ID    time    CESD    cesd_tv

  1     1       1      49          7
  2     1       2      49          .
  3     1       3      49          8
  4     1       4      49          5
...


In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

R
We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function. This operates on the column names as character vectors in a hopefully obvious way.
load("c:/book/savedfile")

library(dplyr)
wide = select(ds, id, starts_with("cesd"))

Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.
long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
v.names = "cesd_tv",
idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]

       id cesd time cesd_tv
1.49.1  1   49    1       7
1.49.2  1   49    2      NA
1.49.3  1   49    3       8
1.49.4  1   49    4       5

In the preceding, the varying parameter is a list of columns which vary over time, while the idvar columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.
c1 = subset(wide, select= c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"

c2 = subset(wide, select= c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"

c3 = subset(wide, select= c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"

c4 = subset(wide, select= c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"

long = rbind(c1,c2,c3,c4)
long[long$id==1,]

     id cesd cesd_tv time
1     1   49       7    1
454   1   49      NA    2
907   1   49       8    3
1360  1   49       5    4

This is cumbersome, but effective.

More interesting is to use the tools provided by dplyr and tidyr.
library(tidyr)
gather(wide, key=names, value=cesd_tv, cesd1,cesd2,cesd3,cesd4) %>%
mutate(time = as.numeric(substr(names,5,5))) %>%
arrange(id,time) -> long

head(long)

  id cesd names cesd_tv time
1  1   49 cesd1       7    1
2  1   49 cesd2      NA    2
3  1   49 cesd3       8    3
4  1   49 cesd4       5    4
5  2   30 cesd1      11    1
6  2   30 cesd2      NA    2

The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameters. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with a row for every column in the original data frame, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate() function is like the base R function transform(), but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.

The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the mutate() function should be read as taking the result of the gather() as its input data frame, and sending its output data frame into the arrange() function, as shown explicitly below. For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as long = gather(...) or long <- gather(...), etc.
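To make the piping concrete, here is the same pipeline written with ordinary nested calls, reading inside-out; both versions produce an identical data frame:

# equivalent to the piped version above, without %>%
long2 = arrange(
  mutate(
    gather(wide, key=names, value=cesd_tv, cesd1, cesd2, cesd3, cesd4),
    time = as.numeric(substr(names, 5, 5))),
  id, time)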

SAS
In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating along the way the convenient colon operator, which performs the same function provided by starts_with() in dplyr.

data all;
set "c:/book/help.sas7bdat";
run;

data wide;
set all;
keep id cesd:;
run;

The simpler way to make the desired data set is with the transpose procedure. Here the by statement keeps the variables listed in that statement from being transposed. The notsorted options save us from having to actually sort the data. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_) and value (SAS: COL1), though these can be changed.

proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;
run;

data long;
set long_a;
time = substr(_name_, 5);
rename col1=cesd_tv;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs    ID    time    CESD    cesd_tv

  1     1      1       49          7
  2     1      2       49          .
  3     1      3       49          8
  4     1      4       49          5

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.

data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs    ID    time    CESD    cesd_tv

  1     1      1       49          7
  2     1      2       49          .
  3     1      3       49          8
  4     1      4       49          5


September 30, 2014
 
As we discuss in section 6.1.4 of the second edition, R and SAS handle categorical variables and their parameterization in models quite differently. SAS treats them on a procedure-by-procedure basis, which leads to some odd differences in capabilities and default parameterizations. For example, in the logistic procedure, the default is effect cell coding, while in the genmod procedure-- which also fits logistic regression-- the default is reference cell coding. Meanwhile, many procedures can only accommodate reference cell coding.

In R, in contrast, categorical variables can be designated as "factors", with the parameterization stored as an attribute of the factor.
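As a minimal illustration of that attribute (base R only, not from the original example):

f = factor(c("a", "b", "c"))
contrasts(f)                  # default: treatment (reference cell) coding
contrasts(f) = contr.sum(3)   # switch to effect (sum-to-zero) coding
attr(f, "contrasts")          # the parameterization now travels with the factor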

In section 6.1.4, we demonstrate how the parameterization of a factor can be easily changed on the fly, in R, in lm(), glm(), and aov(), using the contrasts= option in those functions. Here we show how to set the attribute more generally, for use in functions that don't accept the option. This post was inspired by a question from Julia Kuder, of Brigham and Women's Hospital.

SAS
We begin by simulating censored survival data as in Example 7.30. We'll also export the data to use in R.

data simcox;
  beta1 = 2;
  lambdat = 0.002;   * baseline hazard;
  lambdac = 0.004;   * censoring hazard;
  do i = 1 to 10000;
    x1 = rantbl(0, .25, .25, .25);
    linpred = exp(-beta1*(x1 eq 4));
    t = rand("WEIBULL", 1, lambdat * linpred);   * time of event;
    c = rand("WEIBULL", 1, lambdac);             * time of censoring;
    time = min(t, c);    * which came first?;
    censored = (c lt t);
    output;
  end;
run;

proc export data=simcox replace
outfile="c:/temp/simcox.csv"
dbms=csv;
run;

Now we'll fit the data in SAS, using effect coding.

proc phreg data=simcox;
class x1 (param=effect);
model time*censored(0)= x1 ;
run;

We reproduce the rather unexciting results here for comparison with R.

                         Parameter    Standard
Parameter        DF       Estimate       Error

x1        1      1        -0.02698     0.03471
x1        2      1        -0.01211     0.03437
x1        3      1        -0.05940     0.03458


R
In R we read the data in, then use the C() function to assign the contr.sum contrast to a version of the x1 variable that we save as a factor. Once that is done, we can fit the proportional hazards regression with the desired contrast.

simcox <- read.csv("c:/temp/simcox.csv")
library(survival)   # for Surv() and coxph()
sc2 = transform(simcox, x1.eff = C(as.factor(x1), contr.sum(4)))
effmodel <- coxph(Surv(time, censored) ~ x1.eff, data = sc2)
summary(effmodel)
We excerpt the relevant output to demonstrate equivalence with SAS.

            coef exp(coef) se(coef)
x1.eff1 -0.02698   0.97339  0.03471
x1.eff2 -0.01211   0.98797  0.03437
x1.eff3 -0.05940   0.94233  0.03458
August 18, 2014
 
In Example 8.40, side-by-side histograms, we showed how to generate histograms for some continuous variable, for each level of a categorical variable in a data set. An anonymous reader asked how we would do this if both the variables were continuous. Keep the questions coming!

SAS
The SAS solution we presented relied on the sgpanel procedure. There, the panelby statement names a variable for which each distinct value will generate a panel. If there are many values, for example for a continuous variable, there will be many panels generated, which is probably not the desired result. As far as we know, there is no option to automatically categorize a continuous panel variable in proc sgpanel. If this is required, a two-step approach will be needed to first make groups of one of the variables.

We do that below using proc rank. In this approach, the groups option gives the number of groups required, and the ranks statement names a new variable to hold the group indicator. Once the groups are made, the same code demonstrated earlier can be used. (This is an example of "it's never too late to learn"-- I used to do this via a sort and a data step with implied variables, until I realized that there had to be a way to do it via a procedure. --KK)

In this setting, the panels are another approach to the data we examine in a scatterplot. As an example, we show the physical competency score (PCS) grouped by sextile of the mental competency score (MCS) in the HELP data set.
proc rank data = 'c:\book\help.sas7bdat' groups = 6 out = catmcs;
  var mcs;
  ranks mcs_sextile;
run;

title "Histograms of PCS by sextile of MCS";
proc sgpanel data = catmcs;
  panelby mcs_sextile / columns = 3 rows = 2;
  histogram pcs;
run;
We also demonstrate the columns and rows options to the panelby statement, which allow control over the presentation of the panel results. The graphic produced is shown above.

R
Our R solution in the earlier entry used the lattice package (written by Deepayan Sarkar) to plot a formula such as histogram(~a | b). A simple substitution of a continuous covariate b into that syntax will generate a panel for each distinct value of the covariate, which is rarely what's wanted: a factor is expected. In the package, an implementation of Trellis graphics, the term "shingles" is used to approach the notion of categorizing a continuous variable for making panels. The function equal.count() is provided to make the (possibly overlapping) categories of the variable, and uses the panel headers to suggest the ranges of the continuous covariate included in each panel.
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(lattice)
histogram(~ pcs | equal.count(mcs),
          main = "Histograms of PCS by shingle of MCS",
          index.cond = list(c(4,5,6,1,2,3)), data = ds)
Note that the default ordering of panels in lattice is left to right, bottom to top. The index.cond option here re-orders the panels to go from left to right, top to bottom.

The default behavior of equal.count() is to allow some overlap between the categories, which is a little odd. In addition, there is a good deal of visual imprecision in the method used to identify the panels-- there's no key given, and the only indicator of the shingle value is the shading of the title bars. A more precise method would be to use the quantile() function manually, as we demonstrated in example 8.7, the Hosmer and Lemeshow goodness-of-fit test. We show here how the mutate() function in Hadley Wickham's dplyr package can be used to add a new variable to a data frame.

require(dplyr)
ds = mutate(ds, cutmcs = cut(ds$mcs,
       breaks = quantile(ds$mcs, probs = seq(0, 1, 1/6)), include.lowest = TRUE))
histogram(~ pcs | cutmcs, main = "Histograms of PCS by sextile of MCS",
          index.cond = list(c(4,5,6,1,2,3)), data = ds)
This shows the exact values of the bin ranges in the panel titles, surely a better use of that space. Minor differences in the histograms are due to the overlapping categories included in the previous version.

Finally, we also show the approach one might use with the ggplot2 package, an implementation of Leland Wilkinson's Grammar of Graphics, coded by Hadley Wickham. The package includes the useful cut_number() function, which does something similar to the cut(..., breaks=quantile(...)) construction we showed above. In ggplot2, "facets" are analogous to the shingles used in lattice.
library(ggplot2)
ds = mutate(ds, cutmcsgg = cut_number(ds$mcs, n=6))
ggplot(ds, aes(pcs)) + geom_bar() +
facet_wrap(~cutmcsgg) + ggtitle("Histograms of PCS by sextile of MCS")
Roughly, we can read the syntax to state: 1) make a plot from the ds dataset in which the primary analytic variable will be pcs; 2) make histograms; 3) make facets of the cutmcsgg variable; 4) add a title. Since the syntax is a little unusual, Hadley provides the qplot() function, a wrapper which operates more like traditional functions. An identical plot to the above can be generated with qplot() as follows:
qplot(data=ds,x=pcs, geom="bar", facets= ~cutmcsgg, 
main="Histograms of PCS by sextile of MCS")


August 12, 2014
 
As of today, the second edition of "SAS and R: Data Management, Statistical Analysis, and Graphics" is shipping from CRC Press, Amazon, and other booksellers. There are lots of additional examples from this blog, new organization, and other features we hope you'll find useful. Thanks for your support. We'll be continuing to blog.

Now, on to today's main course.
For cyclical data, it's sometimes useful to generate rolling averages-- the average of some number of recent measurements, usually one full cycle. For example, for retail sales, one might want the rolling average of the most recent week. The rolling average will dampen the effects of repeated patterns but still show the location of the data.

In keeping with our habit of plotting personal data (e.g., Example 8.11, Example 8.12, Example 10.1, Example 10.2), I'll use my own weight recorded over the past 6 months. After reading about "alternate day dieting" in The Atlantic, I decided to try the diet described in the book by Varady. I've never really tried to diet for weight loss before, but this diet has worked really well for me over the past six months. The basics are that you eat 500 calories every other day (diet days) and on the non-diet days you eat what you want. There's a little science supporting the approach. I can't really recommend the book, unfortunately, unless you're a fan of the self-help style.

As you can imagine, one's weight tends to fluctuate pretty wildly between diet days and non-diet days. The cycle is just two days, but to get a sense of my weight at any given time, it might be best to use the rolling average of the past, say, four days.

The beginning of the data, available from http://www.amherst.edu/~nhorton/sasr2/datasets/weight.txt, follows.
1/11/14 219
1/12/14 NA
1/13/14 219
1/14/14 NA
1/15/14 221.8
1/16/14 218
...


R
As you can tell from the NAs, I compiled the data with the intent to read it into R.
> weights = read.table("c:/temp/weight.txt")
> head(weights)

       V1    V2
1 1/11/14 219.0
2 1/12/14    NA
3 1/13/14 219.0
4 1/14/14    NA
5 1/15/14 221.8
6 1/16/14 218.0
Note, though, that the date values are just character strings (read in as a factor variable), and not so useful as read in.
> str(weights)
'data.frame': 161 obs. of 2 variables:
$ V1: Factor w/ 161 levels "1/11/14","1/12/14",..: 1 2 3 4 5 6 7 8 9 10 ...
$ V2: num 219 NA 219 NA 222 ...
The lubridate package contributed by the invaluable Hadley Wickham contains functions to make it easier to use dates in R. Here, I use its mdy() function to convert character values into R dates.
library(lubridate)
with(weights, plot(V2 ~ mdy(V1),
xlim = c(mdy("1/1/14"),mdy("6/30/14")),
ylab="Weight", xlab="Date"))
The simple plot has enough values that you can clearly see the trend of weight loss over time, and perhaps the rolling average exercise is somewhat misplaced here. To calculate the rolling average, I adapted (below) the lag function from section 2.2.18 (2nd edition; 1.4.17 in the 1st ed.)-- this is a simpler version that does not check for errors. The result of lag(x,k) is a vector with the first k values missing and with the remaining values being the beginning values of x. Thus the ith value of lag(x,k) is x[i-k]. To get the rolling average, I just take the mean of several lags. Here I use the rowMeans() function to do it for all the values at once. The lines() function adds the rolling average to the plot.
lag = function(x, k) {
  # shift x by k places: prepend k missing values, drop the last k values
  return(c(rep(NA, k), x[1:(length(x) - k)]))
}

y = weights$V2
# four-day trailing average: today's weight and the previous three days
ra = rowMeans(
  matrix(c(y, lag(y,1), lag(y,2), lag(y,3)), ncol = 4, byrow = FALSE),
  na.rm = TRUE)

lines(mdy(weights$V1), ra)
The final plot is shown above. Note that the initial values of the lagged vector are missing, as are weights for several dates throughout this period. The na.rm=TRUE option causes rowMeans() to return the mean of the observed values-- equivalent to a single imputation of the mean of the observed values, which perhaps Nick will allow me in this setting (note from NH: I don't have major issues with this). There are also two periods where I failed to record weights for four days running. For these periods, rowMeans() returns NaN, or "Not a Number". This is usefully converted to regions in the plot where the running average line is not plotted. Compare, for instance, with the default SAS behavior shown below. For the record, I was ill in early May and had little appetite regardless of my dieting schedule.
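As an aside, base R can compute the same trailing average in one line with stats::filter()-- a minimal alternative sketch, with the caveat that filter() propagates missing values rather than averaging the observed ones, so it leaves a gap wherever any of the four days is unrecorded:

ra2 = as.numeric(stats::filter(y, rep(1/4, 4), sides = 1))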

SAS
The data can be easily read with the input statement. The mmddyy7. informat tells SAS that the data in the first field are as many as 7 characters long and should be read as dates. SAS will store them as SAS dates (section 2.4 in the 2nd edition; 1.6 in the 1st edition). As the data are read in, I use the lagk functions (lag(), lag2(), lag3(); section 2.2.18, 2nd edition; 1.4.17 in the 1st ed.) to recall the values from recent days and calculate the rolling average as I go.
data weights;
  infile "c:\temp\weight.txt";
  input date mmddyy7. weight;
  ra = mean(weight, lag(weight), lag2(weight), lag3(weight));
run;
Note that the input statement expects the weight values to be numbers, and interprets the NAs in the data as "Invalid data". It inserts missing values into the data set, which is what we desire. The mean function provides the mean of the non-missing values; when the weight and all of the lagged values of weight are missing, it returns a missing value. With the rolling average in hand, I can plot the observed weights and the rolling average. To print readable dates rather than raw SAS date values, use the format statement to tell SAS that the date variable should be printed using the date. format.
symbol1 i = none v = dot c = blue;
symbol2 i = j v = none c = black w = 5;
proc gplot data = weights;
  plot (weight ra)*date / overlay;
  format date date.;
run;
The results are shown below. The main difference from the R plot is that the gaps in my recording do not appear in the line. The SAS symbol statement, the equivalent of the lines() function, more or less, does not encounter NaNs, but only missing values, and so it connects the points. I think R's behavior is more appropriate here-- there's no particular reason to suppose a linear interpolation between the observed data points is best, and so the line ought to be missing.


June 30, 2014
 
In our last entry, we demonstrated how to simulate data from a logistic regression with an interaction between a dichotomous and a continuous covariate. In this entry we show how to use the simulation to estimate the power to detect that interaction. This is a simple, elegant, and powerful idea: simply simulate data under the alternative, and count the proportion of times the null is rejected. This is an estimate of power. If we lack infinite time to simulate data sets, we can also generate confidence intervals for the proportion.

R
In R, extending the previous example is almost trivially easy. The coef() function, applied to a glm summary object, returns a matrix with the parameter estimates, standard errors, test statistics, and p-values. In one statement, we can extract the p-value for the interaction and return an indicator of a rejected null hypothesis. (This line is commented on below.) Then the routine is wrapped as a trivial function.
logist_inter = function() {
  c = rep(0:1, each = 50)   # sample size is 100
  x = rnorm(100)
  lp = -3 + 2*c*x
  link_lp = exp(lp)/(1 + exp(lp))
  y = (runif(100) < link_lp)

  log.int = glm(y ~ as.factor(c)*x, family = binomial)
  reject = ifelse(coef(summary(log.int))[4, 4] < .05, 1, 0)
  # The coef() function above gets the parameter estimates; the [4,4]
  # element is the p-value for the interaction.
  return(reject)
}
Running the function many times is also trivial, using the replicate() function.
pow1 = replicate(100, logist_inter())
The result is an array of 1s and 0s. To get the estimated power and confidence limits, we use the binom.test() function.
binom.test(sum(pow1), 100)
The test gives a p-value against the null hypothesis that the probability of rejection is 0.5, which is not interesting. The interesting part is at the end.

95 percent confidence interval:
0.3219855 0.5228808
sample estimates:
probability of success
0.42
It would be simple to adjust this code to allow a change in the number of subjects or of the effect sizes, etc.
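For instance, here is a minimal sketch of such an adjustment (logist_inter2 is a hypothetical variant, not part of the original code), taking the sample size and the interaction effect size as arguments:

logist_inter2 = function(n = 100, beta_int = 2) {
  c = rep(0:1, each = n/2)
  x = rnorm(n)
  lp = -3 + beta_int*c*x
  link_lp = exp(lp)/(1 + exp(lp))
  y = (runif(n) < link_lp)
  log.int = glm(y ~ as.factor(c)*x, family = binomial)
  # as above, [4,4] is the p-value for the interaction term
  return(ifelse(coef(summary(log.int))[4, 4] < .05, 1, 0))
}

# e.g., estimated power and confidence interval with 200 subjects:
binom.test(sum(replicate(100, logist_inter2(n = 200))), 100)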

SAS
In SAS, generating the data is no trouble, but evaluating the power programmatically requires several relatively cumbersome steps. To generate multiple data sets, we include the data generation loop from the previous entry within another loop. (Note that the number of observations has also been reduced vs. the previous entry.)

data test;
  do ds = 1 to 100;    * 100 data sets;
    do i = 1 to 100;   * 100 obs/data set;
      c = (i gt 50);
      x = normal(0);
      lp = -3 + 2*c*x;
      link_lp = exp(lp)/(1 + exp(lp));
      y = (uniform(0) lt link_lp);
      output;
    end;
  end;
run;

Then we fit all of the models at once, using the by statement. Here, the ODS system suppresses voluminous output and is also used to capture the needed results in a single data set. The name of the piece of output that holds the parameter estimates (parameterestimates) can be found with the ods trace on statement.
ods select none;
ods output parameterestimates= int_ests;
proc logistic data = test ;
by ds;
class c (param = ref desc);
model y(event='1') = x|c;
run;
ods exclude none;

The univariate procedure can be used to count the number of times the null hypothesis of no interaction would be rejected. To do this, we use the loccount option to request a table of location counts, and the mu0 option to specify that the location of interest is 0.05. As above, since our goal is to use the count programmatically, we also extract the result into a data set. If you're following along at home, it's probably worth your while to print out some of this data to see what it looks like.

ods output locationcounts=int_power;
proc univariate data = int_ests loccount mu0=.05;
where variable = "x*c";
var probchisq;
run;
For example, while the locationcounts data set reports the number of observations above and below 0.05, it also reports the number not equal to 0.05. This is not so useful, and we need to exclude this row from the next step. We do that with a where statement. Then proc freq gives us the proportion and (95%) confidence limits we need, using the binomial option to get the confidence limits and the weight statement to convey the fact that the count variable represents the number of observations.

proc freq data = int_power;
where count ne "Num Obs ^= Mu0";
tables count / binomial;
weight value;
run;
Finally, we find our results:
                Binomial Proportion
              Count = Num Obs < Mu0

Proportion                  0.4000
ASE                         0.0490
95% Lower Conf Limit        0.3040
95% Upper Conf Limit        0.4960

Exact Conf Limits
95% Lower Conf Limit        0.3033
95% Upper Conf Limit        0.5028
We estimate our power at only 40%, with a 95% confidence interval of (30%, 50%). This agrees closely enough with R: we don't need to narrow the interval to know that we'll need a larger sample size.
