proc freq

December 13, 2017

PROC FREQ is one of the most popular procedures in the SAS language.  It is mostly used to describe the frequency distribution of a variable or combination of variables in contingency tables.  However, PROC FREQ has much more functionality than that; for an overview of all that it can do, see the introduction in the SAS documentation. SAS Viya does not have PROC FREQ, but that doesn’t mean you can’t take advantage of this procedure when working with BIG DATA. SAS 9.4M5 integration with SAS Viya allows you to summarize the data appropriately in CAS and then pass the resulting summaries to PROC FREQ to get any of the statistics that it is designed to produce, from your favorite SAS 9.4 coding client. In this article, you will see how easy it is to work with PROC FREQ in this manner.

These steps are necessary to accomplish this:

  1. Define the CAS environment you will be using from a SAS 9.4M5 interface.
  2. Identify the variables and/or combinations of variables that define the dimensionality of the contingency tables.
  3. Load the table that needs to be summarized into memory.
  4. Summarize the data with a CAS-enabled procedure; use PROC FEDSQL for high-cardinality cases.
  5. Write the appropriate PROC FREQ syntax, using the WEIGHT statement to input cell-count data.

Step 1: Define the CAS environment

Before you start writing any code, make sure that there is an _authinfo file in the appropriate user directory in your system (for example, c:\users\<userid>) with the following information:

host <cas server name> port <number> user "<cas user id>" password "<cas user password>"

This information can be found by running the following statement in the SAS Viya environment from SAS Studio:

caslib _all_ assign;


Then, in your SAS 9.4m5 interface, run the following program to define the CAS environment that you will be working on:

options cashost=" " casport=;
cas mycas user=;
libname mycas cas;
/** Set the in memory shared library if you will be using any tables already promoted in CAS **/
libname public cas caslib=public;



Figure 1

Figure 1 shows the log and libraries after connecting to the CAS environment that will be used to deal with Big Data summarizations.

Step 2: Identify variables

The variables here are those that will be used in the TABLES statement in PROC FREQ.  Also, any numeric variable that is not part of the TABLES statement can be used to determine the input cell count.

Step 3: Load the table to memory

There are two options for doing this.  The first is to load the table into the PUBLIC library in the SAS Viya environment directly.  The second is to load it from your SAS 9.4M5 environment into CAS. Loading data into CAS can be as easy as writing a DATA step, or you can use other, more efficient methods depending on where the data resides and its size.  That is out of the scope of this article.
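As a minimal sketch of the DATA step approach (work.clicks is a hypothetical source table; the mycas libref is the one defined in Step 1):

```sas
/* Copy a SAS 9 table into CAS through the CAS engine libref.     */
/* work.clicks is an illustrative table name, not from the post.  */
data mycas.clicks;
   set work.clicks;
run;
```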

Step 4: Summarize the data with a CAS-enabled procedure

Summarizing data for many cross tabulations can become computationally expensive, especially with Big Data. SAS Viya offers several ways to accomplish this (e.g., PROC MEANS, PROC MDSUMMARY, PROC FEDSQL). There are ways to optimize performance for high-cardinality summarization when working with CAS.  PROC FEDSQL is my personal favorite since it has shown good performance.
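As one example of a CAS-enabled alternative, PROC MDSUMMARY can produce the same kind of group-by summaries in CAS. This is only a sketch; the table and variable names are assumptions borrowed from the sample program shown later in the article:

```sas
/* Sketch: group-by summarization in CAS with PROC MDSUMMARY.     */
/* public.clicks_summary, visits, date, feature, and target are   */
/* taken from the later FEDSQL example for illustration.          */
proc mdsummary data=public.clicks_summary;
   var visits;
   groupby date feature target / out=mycas.mdsum;
run;
```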

Step 5: Write the PROC FREQ

When writing the PROC FREQ syntax, make sure to use the WEIGHT statement to instruct the algorithm to use the appropriate cell count in the contingency tables. All features and functionality of PROC FREQ are available here, so use them to the max! The DATA option can point to the CAS library where your summarized table is, so there will be some overhead for data transfer to the SAS 9 work server. But most of the time the summarized table is small enough that the transfer may not take long.  An alternative to letting PROC FREQ do the data transfer is using a DATA step to bring the data from CAS into a Base SAS library in SAS 9 and then running PROC FREQ with that table as the input.
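That DATA step alternative can be sketched like this (the table and libref names are assumed from the sample program below):

```sas
/* Pull the small summarized table down from CAS into SAS 9 WORK, */
/* then run PROC FREQ entirely on the local SAS 9 server.         */
data work.sql_sum;
   set mycas.sql_sum;
run;

proc freq data=work.sql_sum;
   weight freq;
   tables date*target;
run;
```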


Figure 2 below shows a sample program demonstrating how to take advantage of CAS from SAS 9.4M5 to produce different analyses using PROC FREQ.


Figure 2


Figure 3 shows the log of the summarization in CAS for a very large table, completed in 1:05.97 minutes (over 571 million records with a dimensionality of 163,964 for all possible combinations of the values of feature, date, and target).  The PROC FREQ output shows three different ways to use the TABLES statement to produce the desired output from the summarized table, which took only 1.22 seconds in the SAS 9.4M5 environment.


Figure 3



/** Connect to Viya Environment **/
options cashost="" casport=5570;
cas mycas user=calara;
libname mycas cas datalimit=all;
/** Set the in memory shared library **/
libname public cas caslib=public datalimit=all;
/** Define macro variables **/
/* CAS Session where the FEDSQL will be performed */
%let casref=mycas;
/* Name of the output table of the frequency counts */
%let outsql=sql_sum;
/* Variables for cross classification cell counts */
%let dimvars=date, feature, target;
/* Variable for which frequencies are needed */
%let cntvar=visits;
/* Source table */
%let intble=public.clicks_summary;
proc fedsql sessref=&casref.;
	create table &outsql. as select &dimvars., count(&cntvar.) as freq 
		from &intble. group by &dimvars.;
quit;

proc freq data=&casref..&outsql.;
	weight freq;
	tables date;
	tables date*target;
	tables feature*date / expected cellchi2 norow nocol chisq noprint;
	output out=ChiSqData n nmiss pchi lrchi;
run;

PROC FREQ for Big Data was published on SAS Users.

November 1, 2017

A ghoulish Halloween Boo to all my readers! Hope my costume freaks you out, but even if it doesn't, I’m positive PROC FREQ will in a few amazing ways! Today’s Programming 2: Data Manipulation Techniques class asked about the power of PROC FREQ. Since I stopped to explain some of its benefits to [...]

The post 3 freaky ways PROC FREQ can work in your favor appeared first on SAS Learning Post.

May 20, 2016

PROC FREQ is often the first choice when you want to generate basic frequency counts, but it is often the last choice when compared to other statistical reporting procedures. People sometimes consider PROC FREQ last because they think they have little or no control over the appearance of the output. For example, PROC FREQ does not allow style options within the syntax, which the REPORT and TABULATE procedures do allow. Also, you cannot control the formats or headings with statements in the procedure step.

Sometimes, a simple frequency (via a one-way table) is all you want, and you don’t want to have to create an output data set just to add a format. A one-way table in PROC FREQ is unique also in that it includes a cumulative count and percent. These calculations cannot be done in other procedures without additional code or steps. However, there is a simple way to make some basic modifications to your table. By adding a PROC TEMPLATE step to modify either the Base.Freq.OneWayFreqs or the Base.Freq.OneWayList table template, you can change the formats of the statistics, move the label of the variable, change the labels of the statistics, and suppress the Frequency Missing row that appears below the table. These changes apply to all output destinations, including the traditional listing output.  You can also use PROC TEMPLATE to make small modifications to the nodes that are generated by one-way tables in the table of contents for non-listing ODS destinations.

Customize the Formats and Labels for Statistics

If you want to modify the format for the statistics Frequency, Cumulative Frequency, Percent, and Cumulative Percent, you can use the FORMAT= option in PROC TEMPLATE. This option accepts SAS formats and user-defined formats. If you want to change the statistic heading, you can use the HEADER= option.

The following example uses both the FORMAT= and the HEADER= statements to format the statistics values and to change the Frequency heading to Count. This example also changes the Percent heading to Pct.

proc format;
     picture pctfmt (round) other='009.9%';
run;

proc template; 
     edit Base.Freq.OneWayList;
        /* The FORMAT= and HEADER= values below are illustrative */
        edit frequency;  
           header='Count';
           format=comma12.;
        end;
        edit cumfrequency;
           format=comma12.;
        end;
        edit percent;
           header='Pct';
           format=pctfmt.;
        end;
        edit cumpercent;
           format=pctfmt.;
        end;
     end;
run;

data class;
     set sashelp.class;
     wgt=2.5;   /* illustrative weight variable for the WEIGHT statement */
run;

proc freq data=class;
     tables age;
     weight wgt; 
run;

This code generates the following table:

FREQ procedure's one-way tables01

Move the Variable Label

If a variable has a label, it is centered, by default, at the top of the table, and the variable name appears above the column of values.

FREQ procedure's one-way tables02

If you want to use the label instead of the name above the column, you can edit the HEADER= value in PROC TEMPLATE, as shown in the following example:

proc template;
   edit Base.Freq.OneWayList;             
        edit h1; 
                /* Set text to a blank instead of VARLABEL. */ 
             text " ";
        end;
        edit FVariable;                                           
             header=varlabel;   /* use the variable's label as the column heading */
        end;
   end;
run;

proc freq data=sashelp.class;
   label age='Age in Years';
   tables age;
run;

This code generates the following table, which replaces the default location of the label with blank text and moves the label so that it is above the column of the variable's values, as shown in this example:

FREQ procedure's one-way tables03

Suppress the Frequency Missing= Row

If a variable has missing values and you do not include the MISSING option in the TABLES statement of PROC FREQ, the output includes the frequency of missing values (Frequency Missing=) below the table with the number of missing values.

The following table shows the default output:

FREQ procedure's one-way tables04

To suppress this line without including missing values in the table output, you can modify the table template, as follows:

data test;
   /* illustrative data containing a missing value */
   input c1 $;
   datalines;
a
b
.
;
run;

proc template;
     edit Base.Freq.OneWayFreqs; 

   /* This is the default setting:                     */ 
   /*  define f2;                                      */
   /*     text "Frequency Missing =" fMissing -12.99; */
   /*     print = pfoot2;                              */
   /*  end;                                            */
        edit f2; 
           text " ";
        end;
     end;
run;

proc freq data=test;
   table c1;
run;

The table that is generated by this code does not contain the Frequency Missing= row at the bottom of the table.

FREQ procedure's one-way tables05

Customize the Table of Contents

For a one-way table, the default table of contents looks like this:

FREQ procedure's one-way tables06

To control the first node, The Freq Procedure, you can use the ODS PROCLABEL statement.  You can change the text of this node, but you cannot eliminate the node. You can remove the One-Way Frequencies node using PROC TEMPLATE with the CONTENTS= statement. The following example changes the text of the The Freq Procedure node, and it eliminates the One-Way Frequencies node.

proc template;
   edit Base.Freq.OneWayFreqs;
      contents=off;   /* suppress the One-Way Frequencies node */
   end;
run;

ods listing close;
ods pdf file='test.pdf';

ods proclabel='My One-Way Table';
proc freq data=sashelp.class;
   tables age;
run;

ods pdf close;
ods listing;

Here is the modified table of contents:

FREQ procedure's one-way tables07

(For additional customization of the table of contents, you need to consider using the DOCUMENT procedure.  The SAS Global Forum 2011 paper Let's Give 'Em Something to TOC about: Transforming the Table of Contents of Your PDF File, by Bari Lawhorn, contains useful information and examples that illustrate how to use PROC DOCUMENT, PROC TEMPLATE, and other SAS statements and options to customize your TOC.)

To restore the default table template, run the following code:

proc template;
delete Base.Freq.OneWayFreqs;
delete Base.Freq.OneWayList;
run;

The examples above show how you can customize PROC FREQ one-way tables with a little help from PROC TEMPLATE, which offers several easy ways to enhance their format and appearance.

Seeing the FREQ procedure's one-way tables in a new light was published on SAS Users.

October 19, 2013
This week's SAS tip is from Art Carpenter and his latest book Carpenter's Guide to Innovative SAS Techniques. A SAS user since 1977, Art is an integral and highly respected member of the user community. He's also the author of several outstanding SAS books--and you can explore them here. The following excerpt is from SAS [...]
August 27, 2013

I planned to extend noninferiority testing in one of my statistical notes, Equivalence Testing and TOST (Two One-Sided Tests), since a noninferiority test is simply one half of an equivalence test. Today I’m glad to find an even better explanation:

SAS Usage Note 48616: Design and analysis of noninferiority studies

You will love it if you are interested in this subject (bookmark this link since it looks unfinished). A few comments:

  • It’s better not to use a two-sided test for noninferiority testing. In PROC TTEST, you can add an option like “side=U” or “side=L”.
  • If you do like a two-sided test for noninferiority testing, it’s OK, but DO remember to assign an alpha value of 0.1, since the default value is 0.05 (if you’re lost by this statement, read this post again). It’s pretty tricky, and I have seen lots of mistakes due to ignoring this point.
  • In note 48616, it is said that only tests of means are supported by PROC TTEST. Actually, PROC TTEST can also test ratios (a log transformation will be applied) with a “dist=lognormal” option.
  • To make decisions in noninferiority, superiority, and equivalence testing, the lower and/or upper limits are compared with the margin(s) rather than with the p-value. The two approaches are of course equivalent, but the former is much more accessible and meaningful for non-statistician researchers.
April 13, 2013
This week's SAS tip is from Ron Cody and his popular book SAS Statistics by Example. Ron is the author of many bestselling SAS books. If you like this week's excerpt, visit his author page for additional bonus content from his books! The following excerpt is from SAS Press author Ron [...]
October 8, 2012
In some settings it may be necessary to recode a categorical variable with character values into a variable with numeric values. For example, the matching macro we discussed in example 7.35 will only match on numeric variables. One way to convert character variables to numeric values is to determine which values exist, then write a possibly long series of conditional tests to assign numbers to the values. Surely there's a better way?

In SAS, Rick Wicklin offers an IML solution and links to a macro with the same function. But if you're not an IML coder, and you don't want to investigate a macro solution, it's simple enough to do with data steps. We'll begin by making some fake data.
data test;
do i = 1 to 100;
   cat = "meow";
   if i gt 30 then cat = "Purr";
   if i gt 70 then cat = "Hiss";
   output;
end;
run;
To make the new variable, we'll just sort (section 1.5.6) the data on the categorical variable we want to convert, then use the set ds; by x; syntax to keep track of when a new value is encountered in the data. It's hard to believe that we've never demonstrated this useful syntax before-- perhaps we just can't find it today. The set ds; by x; syntax makes new temporary variables first.x and last.x that are equal to 1 for the first and last observations of each new level of x, respectively, and 0 otherwise. When we find a new value, we'll increase a counter by 1; the counter is our new numeric-valued variable.
proc sort data = test; by cat; run;

data catize;
set test;
by cat;
retain catnum 0;
if first.cat then catnum = catnum + 1;
run;

/* check the result */
proc freq data = catize;
tables cat * catnum;
run;
The table also shows the recoding values.
                      Table of cat by catnum

          cat       catnum

          Frequency|
          Percent  |
          Row Pct  |
          Col Pct  |      1 |      2 |      3 |  Total
          ---------+--------+--------+--------+
          Hiss     |     30 |      0 |      0 |     30
                   |  30.00 |   0.00 |   0.00 |  30.00
                   | 100.00 |   0.00 |   0.00 |
                   | 100.00 |   0.00 |   0.00 |
          ---------+--------+--------+--------+
          Purr     |      0 |     40 |      0 |     40
                   |   0.00 |  40.00 |   0.00 |  40.00
                   |   0.00 | 100.00 |   0.00 |
                   |   0.00 | 100.00 |   0.00 |
          ---------+--------+--------+--------+
          meow     |      0 |      0 |     30 |     30
                   |   0.00 |   0.00 |  30.00 |  30.00
                   |   0.00 |   0.00 | 100.00 |
                   |   0.00 |   0.00 | 100.00 |
          ---------+--------+--------+--------+
          Total         30       40       30      100
                     30.00    40.00    30.00   100.00

We begin by making the data. To convert to numbers, we use the labels option of the factor() function, feeding it the sequence of numbers between 1 and however many different values there are. Note that we find this count using the factor() function again. There's probably a better way of doing this, but it's a little bit amusing to code it this way. Then we have numbers, but they're stored as a factor. We can get them out with a call to as.numeric().
cat = c(rep("meow",30), rep("Hiss",30), rep("Purr",40))
catn1 = factor(cat, labels=(1:length(levels(factor(cat)))))
catn = as.numeric(catn1)
table(catn, cat)

     cat
catn  Hiss meow Purr
   1    30    0    0
   2     0   30    0
   3     0    0   40

There's a warning in the documentation for factor() that the values are assigned in location-specific fashion, so the table should be used to establish how the codes were assigned.  For the record, the use cases for this kind of recoding in R may be more strained than the SAS example given above.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
September 24, 2012

A guy notices a bunch of targets scattered over a barn wall, and in the center of each, in the "bulls-eye," is a bullet hole. "Wow," he says to the farmer, "that’s pretty good shooting. How’d you do it?" "Oh," says the farmer, "it was easy. I painted the targets after I shot the holes." – An Old Joke

A note on how to get confidence intervals for the difference between independent binomial proportions using SAS PROC FREQ (tested in SAS 9.3M1), with benchmarks taken from Prof. Newcombe’s paper:

Interval estimation for the difference between independent proportions: comparison of eleven methods.


Key findings (SAS output):


1. SAS PROC FREQ offers eight methods (including four of the 11 in Prof. Newcombe’s paper) to compute confidence intervals for the difference between independent binomial proportions:

    • Wald #1
    • Wald (Corrected) #2
    • Exact
    • Exact (FM Score)
    • Newcombe Score #10
    • Newcombe Score (Corrected) #11
    • Farrington-Manning
    • Hauck-Anderson

Note that method #1 is the most popular one in textbooks, while #10 might be the most widely used method in industry.

2. There is a big discrepancy in the outputs of the so-called “exact” method between SAS PROC FREQ and Prof. Newcombe’s paper. It seems they used different methods (in the same “exact” family) under the same name. This needs further investigation.

The SAS code:

The 2×2 table:

        response0 response1
group1      56      14
group2      48      32

data ci;
    input grp res count;
    datalines;
1 0 56
1 1 14
2 0 48
2 1 32
;
run;

Newcombe Score
ods select  PdiffCLs;
ods output PdiffCLs=ci1;
proc freq data=ci order=data;
    tables grp*res /riskdiff (CL=(EXACT FM HA WILSON WALD));
    weight count;
    exact riskdiff;
run;

Exact (FM Score)
ods select  PdiffCLs;
ods output PdiffCLs=ci2;
proc freq data=ci order=data;
    tables grp*res /riskdiff (CL=(EXACT));
    weight count;
    exact riskdiff(fmscore);
run;

Newcombe Score (Corrected)
Wald (Corrected)
ods select  PdiffCLs;
ods output PdiffCLs=ci3;
proc freq data=ci order=data;
    tables grp*res /riskdiff (CORRECT CL=(WILSON WALD));
    weight count;
    exact riskdiff;
run;

/*put them all*/
data CI_all;
    set ci3 ci2 ci1;
    CI='('||compress(put(LowerCL,8.4))||', '||compress(put(UpperCL,8.4))||')';
run;

proc sql;
    select
        case Type
            when "Wald" then "a"
            when "Wald (Corrected)" then "b"
            when "Exact" then "c"
            when "Exact (FM Score)" then "d"
            when "Newcombe Score" then "e"
            when "Newcombe Score (Corrected)" then "f"
            when "Farrington-Manning" then "g"
            when "Hauck-Anderson" then "h"
            else "i"
        end as order,
        Type "method",
        CI "95% Confidence Limits"
    from CI_all
    order by order;
quit;


You may also be interested in this post on calculating confidence intervals for a single proportion:

Statistical Notes (3): Confidence Intervals for Binomial Proportion Using SAS

using the following 11 methods:

      • 1.  Simple asymptotic, Without CC | Wald
      • 2.  Simple asymptotic, With CC      
      • 3.  Score method, Without CC | Wilson       
      • 4.  Score method, With CC    
      • 5. Binomial-based, ‘Exact’ | Clopper-Pearson              
      • 6.  Binomial-based, Mid-p      
      • 7.  Likelihood-based            
      • 8.  Jeffreys    
      • 9.  Agresti-Coull,z^2/2 successes          
      • 10. Agresti-Coull,2 successes and 2 fail           
      • 11. Logit                                               
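Several of these intervals can be requested directly from PROC FREQ in a single call via the BINOMIAL option. This is a sketch with made-up counts (30 successes out of 100), not data from the linked note:

```sas
/* Hypothetical data: 30 successes out of 100 trials */
data one;
   input res $ count;
   datalines;
success 30
failure 70
;
run;

/* Request Wald, Wilson (score), Clopper-Pearson (exact),   */
/* Jeffreys, and Agresti-Coull intervals at once.           */
proc freq data=one order=data;
   tables res / binomial(cl=wald wilson exact jeffreys agresticoull);
   weight count;
run;
```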
September 21, 2012

A recently updated post on Freakonomics, Beware the Weasel Word “Statistical” in Statistical Significance!, seemed to attempt to challenge frequentist statistics with Bayesian statistics. I have done no research on Bayesian methods and won’t jump into the debate. I’d rather use this case to apply the Dragon’s Teeth and Fleas logic of hypothesis testing (at least I think the frequentist approach is pretty intuitive).

A story goes with this post: Bob won 20 out of a total of 30 rounds against Alice in a children’s card game, “War”. War itself is a totally random game, so everyone is supposed to have a 50% chance of winning in a fair game. Since Bob won 20 out of 30, the question arises: is he cheating?

To answer this question, just follow the Dragon’s Teeth and Fleas logic of hypothesis testing:

  • First, assume they are dragon’s teeth; in this case, we assume it is a fair game in which everyone has a probability of 0.5 of winning.
  • Under this assumption, we know the number of wins follows a binomial distribution with probability 0.5. Now we check Bob’s 20 wins out of 30 against this distribution: is this outcome extreme, or is it a flea?
  • We then get the p-value, calculated as P(X>=20), which is the chance of winning 20 times or more out of 30 games (as extreme or even more extreme): 0.0494. Here we only care about winning, so it is a one-sided test. Since this p-value is less than the default significance level of 0.05 (IT IS A FLEA!), we reject the null and conclude this is more likely not a fair game (Bob is cheating, from this statistical point of view).
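The one-sided p-value above can also be computed directly from the binomial CDF; here is a quick sketch:

```sas
/* P(X >= 20) for X ~ Binomial(n=30, p=0.5),      */
/* computed as 1 - P(X <= 19).                    */
data pval;
   p = 1 - cdf('binomial', 19, 0.5, 30);
   put p=;   /* about 0.0494, matching the text */
run;
```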


That’s the whole story. Then the post goes, “Although the procedure is mathematically well defined, it is nonsense.” The following is the first reason listed in the post:

To calculate the probability of Bob’s winning at least 20 games, you use not only the data that you saw (that Bob won 20 games out of 30) but also the imaginary data of Bob winning 21 out of 30 games, or 22 out of 30 games, all the way up to Bob winning all 30 games. The idea of imaginary data is oxymoronic.

In my opinion, the so-called “imaginary data” are just the data filling out the binomial distribution (under the assumption of a true null hypothesis) we built before, and they are not oxymoronic at all (sorry, I use “data” as a plural). 

What do you think? I can’t go much further here due to my limited knowledge of Bayesian statistics.

————— Appendix: SAS Code and Output —————

data a;
    input output count;
    datalines;
0 20
1 10
;
run;

proc freq data=a;
    tables output /binomialc (p=0.5) alpha=0.05;
    exact binomial;
    weight count;
run;