proc transpose

May 7, 2018
 

Datasets can present themselves in different ways. Identical data can be arranged differently, often as wide or tall datasets. Generally, the tall dataset is better. Learn how to convert wide data into tall data with PROC TRANSPOSE.

The post "Wide" versus "Tall" data: PROC TRANSPOSE v. the DATA step appeared first on SAS Learning Post.

November 18, 2014
 
A recent post pointed us to a great talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.

In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.

R

We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.

For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a permutation test (see also section 5.4.5 of the book). 

beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20)
water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20)

ds = data.frame(y = c(beer, water),
x = c(rep("beer", length(beer)), rep("water", length(water))))
require(mosaic)
obsdiff = compareMean(y ~ x, data=ds)
nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds)
histogram(~ result, xlab="permutation differences", data=nulldist)
ladd(panel.abline(v=obsdiff, col="red", lwd=2))

> obsdiff
[1] -4.377778
> tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist)

TRUE FALSE
0.1 99.9

The do() operator evaluates the expression on the right hand side a specified number of times.  In this case we shuffle (or permute) the group indicators. 

We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by chance: if the null hypothesis that the groups were equal were true, results more extreme than this would happen only about 1 time in 1,000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value by including the observed statistic in both the numerator and the denominator: (1 + 1)/(999 + 1) = 0.002.
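To make that arithmetic concrete, here is a minimal sketch of the calculation (assuming the obsdiff value and the nulldist data frame, with its result column, created by the do() call above):

extreme = sum(abs(nulldist$result) >= abs(obsdiff))   # permutation stats at least as extreme as observed
pval = (extreme + 1)/(999 + 1)                         # add 1 to numerator and denominator for the observed statistic
pval                                                   # about 0.002 here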

For those not invested in the mosaic package, base R functions can be used to perform this analysis. We present a version here that begins after making the data vectors.
alldata = c(beer, water)
labels = c(rep("beer", length(beer)), rep("water", length(water)))
obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"])

> obsdiff
[1] -4.377778

The sample() function re-orders the labels, effectively implementing the null supposition that the observed number of bites could have arisen under either the water or the beer drinking regimen.
resample_labels = sample(labels)
resample_diff = mean(alldata[resample_labels=="beer"]) -
mean(alldata[resample_labels=="water"])

resample_diff
[1] 1.033333

In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){
resample_labels = sample(labs)
resample_diff = mean(data[resample_labels=="beer"]) -
mean(data[resample_labels=="water"])
return(resample_diff)
}

nulldist = replicate(9999,resamp_means(alldata,labels))

hist(nulldist, col="cyan")
abline(v = obsdiff, col = "red")

The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) that are at least as extreme as the observed statistic:
alldiffs = c(obsdiff,nulldist)
p = sum(abs(alldiffs) >= abs(obsdiff)) / 10000


SAS

The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double trailing @ (@@) and the infile datalines statement, which allows us to specify a delimiter (other than a space) when reading data directly in a data step. This lets us copy the data from the R code. To identify the water and beer regimen subjects, we use the _n_ automatic variable that SAS creates but does not save with the data.

The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use call symput to make it into a macro variable for later use.
data bites;
if _n_ le 18 then drink="water";
else drink="beer";
infile datalines delimiter=',';
input bites @@;
datalines;
21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20
27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24
28, 24, 29, 21, 21, 18, 27, 20
;
run;

proc summary nway data = bites;
class drink;
var bites;
output out=obsmeans mean=mean;
run;

proc transpose data = obsmeans out=om2;
var mean;
id drink;
run;

data om3;
set om2;
obsdiff = beer-water;
call symput('obsdiff',obsdiff);
run;

proc print data = om3; var obsdiff; run;

Obs obsdiff
1 4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)

To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand;
set bites;
randorder = uniform(0);
run;

proc sort data = rerand; by randorder; run;

data rerand2;
set rerand;
if _n_ le 18 then redrink = "water";
else redrink = "beer";
run;

proc summary nway data = rerand2;
class redrink;
var bites;
output out=rerand3 mean=mean;
run;

proc transpose data = rerand3 out=rerand4;
var mean;
id redrink;
run;

data rrdiff;
set rerand4;
rrdiff = beer-water;
run;

proc print data = rrdiff; var rrdiff; run;

Obs rrdiff
1 -1.73778
One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting rrdiff values into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The groups option samples groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the summary and transpose procedures as before, with the addition of a by replicate statement, to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run;

proc summary nway data = ssresamp;
by replicate;
class groupid;
var bites;
output out=ssresamp2 mean=mean;
run;

proc transpose data = ssresamp2 out=ssresamp3 prefix=group;
by replicate;
var mean;
id groupid;
run;
To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4;
set ssresamp3;
diff = group2 - group1;
exceeds = abs(diff) ge &obsdiff;
run;

proc means data = ssresamp4 sum; var exceeds; run;

The MEANS Procedure
Analysis Variable : exceeds
Sum
9.0000000

proc sgplot data = ssresamp4;
histogram diff;
refline &obsdiff /axis=x lineattrs=(thickness=2 color=red);
run;
The p-value is 0.001 (= (9+1)/10000).


October 22, 2014
 
Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.

Obs    ID    CESD1    CESD2    CESD3    CESD4    CESD

  1     1        7        .        8        5      49
  2     2       11        .        .        .      30
  3     3       14        .        .       49      39
  4     4       44        .        .       20      15
  5     5       26       27       15       28      39
...


Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and a column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

Obs    ID    time    CESD    cesd_tv

  1     1       1      49          7
  2     1       2      49          .
  3     1       3      49          8
  4     1       4      49          5
...


In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

R
We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function, which operates on the column names as character vectors in a hopefully obvious way.
load("c:/book/savedfile")

library(dplyr)
wide = select(ds, id, starts_with("cesd"))

Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.
long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
v.names = "cesd_tv",
idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]

id cesd time cesd_tv
1.49.1 1 49 1 7
1.49.2 1 49 2 NA
1.49.3 1 49 3 8
1.49.4 1 49 4 5

In the preceding, the varying parameter is a list of columns which vary over time, while the idvar columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.
c1 = subset(wide, select= c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"

c2 = subset(wide, select= c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"

c3 = subset(wide, select= c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"

c4 = subset(wide, select= c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"

long = rbind(c1,c2,c3,c4)
long[long$id==1,]

id cesd cesd_tv time
1 1 49 7 1
454 1 49 NA 2
907 1 49 8 3
1360 1 49 5 4

This is cumbersome, but effective.

More interesting is to use the tools provided by dplyr and tidyr.
library(tidyr)
gather(wide, key=names, value=cesd_tv, cesd1,cesd2,cesd3,cesd4) %>%
mutate(time = as.numeric(substr(names,5,5))) %>%
arrange(id,time) -> long

head(long)

id cesd names cesd_tv time
1 1 49 cesd1 7 1
2 1 49 cesd2 NA 2
3 1 49 cesd3 8 3
4 1 49 cesd4 5 4
5 2 30 cesd1 11 1
6 2 30 cesd2 NA 2

The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameters. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with a row for each listed column, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate() function is like the base R function transform() but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R: the input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.

The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the last function into the next function as the first argument. This can cut down on the use of nested parentheses and may make reading R code easier for some folks. The effect of the piping is that the mutate() function should be read as taking the result of the gather() as its input data frame, and sending its output data frame into the arrange() function. For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this set of piping rules, but Nick and many R users would prefer to write this as long = gather... or long <- gather.. , etc.
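To make the equivalence concrete, the same pipeline can be written step by step with ordinary assignment (a simple re-expression of the gather()/mutate()/arrange() calls above; dplyr and tidyr are assumed to be loaded):

long = gather(wide, key=names, value=cesd_tv, cesd1, cesd2, cesd3, cesd4)
long = mutate(long, time = as.numeric(substr(names, 5, 5)))   # pull the digit out of "cesd1"..."cesd4"
long = arrange(long, id, time)                                # sort by subject and time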

SAS
In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating along the way the convenient colon operator, which performs the same function as starts_with() in dplyr.

data all;
set "c:/book/help.sas7bdat";
run;

data wide;
set all;
keep id cesd:;
run;

The simpler way to make the desired data set is with the transpose procedure. Here the by statement forces the variables listed in it not to be transposed, and the notsorted option saves us from having to actually sort the data. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_) and value (SAS: COL1), though these can be changed.

proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;
run;

data long;
set long_a;
time = substr(_name_, 5);
rename col1=cesd_tv;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs ID time CESD cesd_tv

1 1 1 49 7
2 1 2 49 .
3 1 3 49 8
4 1 4 49 5

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.

data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;
run;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;
run;

Obs ID time CESD cesd_tv

1 1 1 49 7
2 1 2 49 .
3 1 3 49 8
4 1 4 49 5


January 15, 2014
 
Hello, folks! I'm pleased to report that Nick and I have turned in the manuscript for the second edition of SAS and R: Data Management, Statistical Analysis, and Graphics. It should be available this summer. New material includes some of our more popular blog posts, plus reproducible analysis, RStudio, and more.

To celebrate, here's a new example. Parenthetically, I was fortunate to be able to present my course: R Boot Camp for SAS users at Boston University last week. One attendee cornered me after the course. She said: "Ken, R looks great, but you use SAS for all your real work, don't you?" Today's example might help a SAS diehard to see why it might be helpful to know R.

OK, the example: A colleague contacted me with a typical "5-minute" question. She needed to write a convincing power calculation for the sensitivity-- the probability that a test returns a positive result when the disease is present-- for a fixed number of cases with the disease. I don't know how well this has been explored in the peer-reviewed literature, but I suggested the following process:
1. Guess at the true underlying sensitivity
2. Name a lower bound (less than the truth) which we would like the observed CI to exclude
3. Use basic probability results to report the probability of exclusion, marginally across the unknown number of observed positive tests.

This is not actually a power calculation, of course, but it provides some information about the kinds of statements that it's likely to be possible to make.

R

In R, this is almost trivial. We can get the probability of observing x positive tests simply, using the dbinom() function applied to a vector of numerators and the fixed denominator. Finding the confidence limits is a little trickier. Well, finding them is easy, using lapply() on binom.test(), but extracting them requires using sapply() on the results from lapply(). Then it's trivial to generate a logical vector indicating whether the value we want to exclude is in the CI or not, and the sum of the probabilities, over the numbers of positive tests whose CI includes this value, is our desired result.
> truesense = .9
> exclude = .6
> npos = 20
> probobs = dbinom(0:npos,npos,truesense)
> cis = t(sapply(lapply(0:npos,binom.test, n=npos),
function(bt) return(bt$conf.int)))
> included = cis[,1] < exclude & cis[,2] > exclude
> myprob = sum(probobs*included)
> myprob
[1] 0.1329533
(Note that I calculated the inclusion probability, not the exclusion probability.)
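If the exclusion probability is wanted instead, it is simply the complement of the inclusion probability computed above:

1 - myprob   # probability that the CI does not contain 0.6; about 0.87 here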

Of course, the real beauty and power of R is how simple it is to turn this into a function:
> probinc = function(truesense, exclude, npos) {
probobs = dbinom(0:npos,npos,truesense)
cis = t(sapply(lapply(0:npos,binom.test, n=npos),
function(bt) return(bt$conf.int)))
included = cis[,1] < exclude & cis[,2] > exclude
return(sum(probobs*included))
}
> probinc(.9,.6,20)
[1] 0.1329533


SAS

My SAS process took about 4 times as long to write.
I begin by making a data set with a variable recording both the number of events (positive tests) and non-events (false negatives) for each possible value. These serve as weights in the proc freq I use to generate the confidence limits.
%let truesense = .9;
%let exclude = .6;
%let npos = 20;

data rej;
do i = 1 to &npos;
w = i; event = 1; output;
w = &npos - i; event = 0; output;
end;
run;

ods output binomialprop = rej2;
proc freq data = rej;
by i;
tables event /binomial(level='1');
weight w;
run;
Note that I repeat the proc freq for each number of events using the by statement. After saving the results with the ODS system, I have to use proc transpose to make a table with one row for each number of positive tests-- before this, every statistic in the output has its own row.
proc transpose data = rej2 out = rej3;
where name1 eq "XL_BIN" or name1 eq "XU_BIN";
by i;
id name1;
var nvalue1;
run;
In my fourth data set, I can find the probability of observing each number of events and multiply this by my logical test of whether the CI included my target value or not. But here there is another twist. The proc freq approach won't generate a CI for both the situation where there are 0 positive tests and the setting where all are positive in the same run. My solution was to omit the case with 0 positives from the do loop above, but now I need to account for that possibility. Here I use the end= option on the set statement to figure out when I've reached the case where all tests are positive (sensitivity = 1). I can then reflect those limits (1 minus the upper and lower bounds) to get the confidence limits for the case with 0 events. Then I'm finally ready to sum up the probabilities associated with the numbers of positive tests where the CI includes the target value.
data rej4;
set rej3 end = eof;
prob = pdf('BINOMIAL',i,&truesense,&npos);
prob_include = prob * ((xl_bin < &exclude) and (xu_bin > &exclude));
output;
if eof then do;
prob = pdf('BINOMIAL',0,&truesense,&npos);
prob_include = prob * (((1 - xu_bin) < &exclude) and ((1 - xl_bin) > &exclude));
output;
end;
run;

proc means data = rej4 sum;
var prob_include;
run;
Elegance is a subjective thing, I suppose, but to my eye, the R solution is simple and graceful, while the SAS solution is rather awkward. And I didn't even make a macro out of it yet!

May 21, 2012
 
In examples 9.30 and 9.31 we explored corrections for multiple testing and then extracted p-values adjusted by the Benjamini and Hochberg (or FDR) procedure. In this post we'll develop a simulation to explore the impact of "strong" and "weak" control of the family-wise error rate offered by multiple comparison corrections. Loosely put, weak control procedures may fail when some of the null hypotheses are actually false, in that the remaining (true) nulls may be rejected more than the nominal proportion of the time.

For our simulation, we'll develop flexible code to generate some p-values from false nulls and others from true nulls. We'll assume that the true nulls have p-values distributed uniform (0,1); the false nulls will have p-values distributed uniform with a user-determined maximum. We'll also allow the number of tests overall and the number of false nulls to be set.

SAS
In SAS, a macro does the job. It accepts the user parameters described above, then generates false and true nulls for each desired simulation. With the data created, we can use proc multtest to apply the FDR procedure, with the ODS system saving the results. Note how the by statement allows us to replicate the analysis for each simulated set of p-values without creating a separate data set for each one. (Also note that we do not use proc sort before that by statement-- this can be risky, but works fine here.)

%macro fdr(nsims=1, ntests = 20, nfalse=10, howfalse=.01);
ods select none;
data test;
do sim = 1 to &nsims;
do i = 1 to &ntests;
raw_p = uniform(0) *
( ((i le &nfalse) * &howfalse ) + ((i gt &nfalse) * 1 ) );
output;
end;
end;
run;

ods output pvalues = __pv;
proc multtest inpvalues=test fdr;
by sim;
run;

With the results in hand (still within the macro), we need to do some massaging to make them usable. First we'll recode the rejections (assuming a 0.05 alpha level) so that non-rejections are 0 and rejections are 1/number of tests. That way we can just sum across the results to get the proportion of rejections. Next, we transpose the data to get each simulation in a row (section 1.5.4). (The data output from proc multtest has nsims*ntests rows; after transposing, there are nsims rows.) Finally, we can sum across the rows to get the proportion of tests rejected in each simulated family of tests. The results are shown in a table made with proc freq.

data __pv1;
set __pv;
if falsediscoveryrate lt 0.05 then fdrprop = 1/&ntests;
else fdrprop =0;
run;

proc transpose data = __pv1 (keep =sim fdrprop) out = pvals_a;
by sim; run;

data pvals;
set pvals_a;
prop = sum(of col1 - col&ntests);
run;
ods select all;

proc freq data = pvals; tables prop; run;
%mend fdr;

%fdr(nsims = 1000, ntests = 20, nfalse = 10, howfalse=.001);

                                    Cumulative    Cumulative
  prop    Frequency     Percent      Frequency       Percent
-------------------------------------------------------------
   0.5          758       75.80            758         75.80
  0.55          210       21.00            968         96.80
   0.6           27        2.70            995         99.50
  0.65            5        0.50           1000        100.00

So one or more true nulls were rejected about 24% of the time (the 21.0% + 2.7% + 0.5% of simulated families where prop exceeded 0.5), which seems like a lot. Multiple comparison procedures with "strong" control of the familywise error rate will reject them only 5% of the time. Building this simulation as a macro facilitates exploring the effects of the multiple comparison procedures in a variety of settings.

R
As in example 9.31, the R code is rather simpler, though perhaps a bit opaque. To make the p-values, we generate them first for all of the tests with false nulls, then for all of the tests with true nulls. The matrix() function reads these in by column by default, meaning that the first nfalse columns get the nsims*nfalse observations. The apply() function generates the FDR p-values for each row of the data set. The t() function just transposes the resulting matrix so that we get back a row for each simulation. As in the SAS version, we'll count each rejection as 1/ntests and non-rejections as 0; we do this with the ifelse() call. Then we sum across the simulations with another call to apply() and show the results with a simple table.

checkfdr = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
raw_p = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
fdr = t(apply(raw_p, 1, p.adjust, "fdr"))
reject = ifelse(fdr<.05, 1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}

> checkfdr(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6 0.65
0.755 0.210 0.032 0.003

The results are reassuringly similar to those from SAS. In this R code, it's particularly simple to try a different test-- just replace "fdr" in the p.adjust() call. Here's the result with the Hochberg test, which has strong control.

checkhoch = function(nsims=1, ntests=100, nfalse=0, howfalse=0.001) {
pvals = matrix(c(runif(nfalse * nsims) * howfalse,
runif((ntests-nfalse) * nsims)), nrow=nsims)
hochberg = t(apply(pvals, 1, p.adjust,"hochberg"))
reject = ifelse(hochberg<.05,1/ntests,0)
prop = apply(reject, 1, sum)
prop.table(table(prop))
}

> checkhoch(nsims=1000, ntests=20, nfalse=10, howfalse=.001)
prop
0.5 0.55 0.6
0.951 0.046 0.003

With this procedure, one or more of the true nulls is rejected an appropriate 4.9% of the time (0.046 + 0.003). For the most part, we feel more comfortable using multiple testing procedures with "strong control".


April 18, 2012
 
In the following data, group 1 is the control group and groups 2 to 6 are test groups. For each group we recorded the visits. First we want to calculate the ratio of each group versus group 1. After we get the ratios, we compare the ratio for each day with the ratio for the sample to check whether it is higher or lower than the sample ratio.
To calculate the ratios, the intuitive approach is to use an array. Before using the array, we need to transpose the data set.

proc transpose data=report_visits_h out=report_visits_v prefix=group;
run;

proc print data=report_visits_v ;
title "-----title print of report_visits_v ";
run;
title "";


The output (not shown here) indicates that we should drop the first row:

proc sql;
select count(1) into :m_n from report_visits_v;
quit;

%let n_m_n=%eval(&m_n-1);
%let m_pct=%eval(&n_m_n-1);

data report_visits_v;
set report_visits_v;
if _n_>=2;
drop _name_;
run;

proc print data=report_visits_v;
title "------title trans to horizontal";
run;
title "";


Then we can calculate the ratios (here we use group: to list all variables whose names begin with group):

data report_visits_v;
set report_visits_v;
array a1{*} group: ; ** array with dynamic # of dim;
array a2{*} ratio1-ratio6;
do i=1 to dim(a1);
a2{i}=a1{i}/a1{1};
end;
format ratio2-ratio6 6.5;
drop i;
run;

proc print data=report_visits_v width=min;
run;

With the ratios computed (output not shown), we transfer the data back to horizontal form, so that each row corresponds to one group.

proc transpose data=report_visits_v out=report_visits_h2;
run;

data orig_records;
set report_visits_h2;
if _n_<=6 then output orig_records;
rename _name_=group;
run;

data ratio;
set report_visits_h2;
if _n_>6 then output ratio;
run;

data ratio(drop=_name_);
retain grp_over_grp1;
set ratio;
grp_over_grp1=substr(_name_,6,1);
run;

proc print data=ratio;
title "------title print of ratio";
run;
title "";

The data set ratio now has one row per group (print not shown). Next we calculate the percentage change of the ratio for each day compared with the sample ratio. (If the macro variable n_m_n had been created without %eval, it would not work here: its resolution would be character text, which could not be used as the array dimension or to indicate how many columns there are.)

data pct;
set ratio;
pct_over_sample = '';
array a3{*} col1-col&n_m_n; ** compare with condition without %eval for n_m_n;
array pct{&n_m_n}; ** array with dim number being a macro variable;
do i=1 to dim(a3);
pct{i}=(a3{i}-a3{1})/a3{1};
end;
format pct2-pct&n_m_n percentn12.6;
keep pct: ;
run;

proc print data=pct;
title "-----title print of percentage change";
run;


The output (not shown) contains the percentage changes. That is what we want.
January 22, 2011
 


New in SAS 9.2, the TRANSPOSE procedure accepts multiple IDs in its ID statement. The values of more than one ID variable are automatically concatenated to form the new variable names. Previously, Proc Transpose allowed only one ID, so in SAS 9.1 and earlier versions the concatenation of variable names had to be done with a DATA step array. This change brings more flexibility for reshaping data into any desired structure. In this case, a small file with date, gender, and 3 credit records is transformed into a flatter data structure keyed only by date. Gender moves from a row value into the column names, and consequently several new variables are created that combine the old variables credit and gender.

A variety of methods in SAS can accomplish this reshaping. The coding follows a simple principle: first accumulate the numeric values in a single column; then expand them, using the accompanying IDs as the new variable names. As usual, Proc SQL is a natural first choice for aggregating data with its Group By clause; in this example, splitting and recombining the data does the trick but requires some more coding, and building a macro can make it more efficient. If the intention is simply to report, Proc Report can display the sub-categories with very little code, although the variable names then have to be re-defined in the following steps.

A two-step transpose with Proc Transpose is an intuitive way to change the data structure horizontally, and it extends readily to even more complex data structures.
*******(0) INPUT RAW DATA***********;
data raw;
format date mmddyy10.;
input sex: $1. Date: mmddyy10. Credit1 Credit2 Credit3;
cards;
M       01/01/2011    600     610  650
M       01/02/2011    500     510   730
F       01/01/2011    700     710    820
F       01/02/2011    400     410   500
;
run;

*********(1)TWO-STEP TRANSPOSE***************;
******NOTE: only works in SAS 9.2*********;
proc sort data=raw out=raw_s; 
   by date sex;
run;

proc transpose data=raw_s out=raw_t;
   var credit:;
   by date sex;
run;

proc sort data=raw_t out=interim; 
   by date; 
run;

proc transpose data=interim out=final1(drop=_name_);
   var col1;
   by Date;
   id sex _name_ ;
run;

********(2)DATA STEP ARRAY: AN ALTERNATIVE FOR TWO-STEP TRANSPOSE******;
data arrout;
   set raw;
   array cr[*] credit:;
   do i=1 to dim(cr);
      cred=cr[i];
      cred_name=cats(sex,vname(cr[i]));
      output;
   end;
   keep date cred cred_name;
run;

proc sort data=arrout out=arrout_s; by date;run;
/*NOTE: a following transpose would be better than another data step array*/
proc transpose data=arrout_s out=final2(drop=_name_);
   by date;
   var cred;
   id cred_name;
run;

*********(3)PROC SQL AND MACRO********;
%macro reshape(max);
   %do i=1 %to &max;
      proc sql;
      create table out&i as
      select a.date,  a.credit&i  as Mcredit&i , b.credit&i  as    Fcredit&i
      from raw(where=(sex='M')) as a , raw(where=(sex='F'))  as b
      where a.date=b.date
      ;quit;
   %end;
   data final3;
   %do j=1 %to &max;
      set out&j;
   %end;
run;
%mend reshape;
%reshape(3);

******(4)PROC REPORT*********;
proc report data=raw nowd out=final4 ;
   column date sex,(credit:);
   define date/group;
   define sex/across;
run;

*****(5) ANOTHER EXAMPLE******;
data have;
    input id $ num;
    cards;
    A 1
    A 2
    A 3
    A 4
    B 1
    B 2
    B 3
    B 4
    B 5
    C 1
    ;
run;

proc transpose data=have out=temp;
    by id;
    var num;
run;

proc transpose data=temp out=want(drop=_name_);
    by _name_;
    id id;
    var col:;
run;
*********End of the program****Tested on 21Jan2011********;
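For readers who also work in R, here is a rough sketch (not from the original post) of how the same multiple-ID reshape might look with the tidyr package; the data frame below and the pivot_longer()/unite()/pivot_wider() chain are illustrative assumptions rather than the post author's code.

library(dplyr)
library(tidyr)

raw <- data.frame(
  sex  = c("M", "M", "F", "F"),
  date = as.Date(c("2011-01-01", "2011-01-02", "2011-01-01", "2011-01-02")),
  credit1 = c(600, 500, 700, 400),
  credit2 = c(610, 510, 710, 410),
  credit3 = c(650, 730, 820, 500))

final <- raw %>%
  pivot_longer(starts_with("credit"), names_to = "credit", values_to = "value") %>%
  unite("name", sex, credit, sep = "") %>%             # builds names like "Mcredit1", "Fcredit3"
  pivot_wider(names_from = name, values_from = value) %>%
  arrange(date)                                         # one row per date, as in the SAS versions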


June 17, 2010
 
The question comes from http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1005d&L=sas-l&F=&S=&P=11947

proc sort data=indat out=one;
by subjid vist;
run;
proc transpose data=one out=two ;
var result;
by subjid vist;
run;
proc transpose data=one out=three;
var unit;
by subjid vist;
run;

proc sql;
select a.subjid, a.vist, a.col1 as dpb, b.col1 as dpbunit, a.col2 as sbp, b.col2 as sbpunit, a.col3 as hr, b.col3 as hrunit
from two as a, three as b
where a.subjid=b.subjid and a.vist=b.vist;
quit;

April 25, 2010
 
"http://www.mitbbs.com/article_t/Statistics/31224273.html"
Derive a new variable in a data step --> reshape the data with proc transpose --> use proc sql to generate the report

data one;
input good P1 P2 P3 P4 P5 P6 P7 P8 P9 P10;
if sum(of good P1-P10)=11 then yes=1;
else yes=0;
cards;
1 1 0 0 1 0 1 1 0 1 0
0 0 0 1 0 1 1 0 1 0 1
1 1 1 1 0 1 1 0 1 0 1
1 1 1 1 1 1 1 1 1 1 1
;
run;

proc sort data=one; by good yes; run;
Proc transpose data=one out=two name=P;
var P1-P10;
by good yes;
run;

Proc Sql ;
create table three as
select good, p, yes,
sum( col1 , col2) as count
from two
order by p, yes
;quit;

November 23, 2009
 
Data step array: flexible (conditional execution is possible) and code-efficient, but loop-oriented, so you need to be comfortable with the OUTPUT statement.
Proc Transpose: intuitive and simple, but with very limited options, and no conditions can be set.

********************HOW TO GENERATE A MULTIPLICATION TABLE*******************;
**********OPTION1: COMBINED METHOD (DATA STEP SIMULATION -- PROC TRANSPOSE -- DATA STEP ARRAY)**********;
**********STEP1: DATA STEP SIMULATION;
data one;
do i=1 to 9;
do j=1 to 9;
v=i*j;
output;
end;
end;
run;
**********STEP2: PROC TRANSPOSE;
proc transpose data=one out=two(drop=_name_);
var v;
by i notsorted; *Since i is ordered, no sorting is needed with command 'notsorted';
run;
**********STEP3: DATA STEP ARRAY;
data three;
set two;
array a[9] col1-col9;
do p = 1 to 9;
if _n_ lt p then a[p] = .;
end;
drop i p;
run;
**********END OF OPTION 1**********;

**********OPTION2: SINGLE STEP**********;
data four;
array a[9] col1-col9;
do i = 1 to 9; *Generate outer layer loop;
do j= 1 to 9; *Generate inner layer loop;
if i ge j then a[j]=i*j; *Horizontally pass values to the nine columns as long as the condition is satisfied;
end;
output; *Each loop one row is outputed;
end;
drop i j;
run;
**********END OF OPTION 2**********;
********************END OF SAS CODE*******************;