December 2, 2011
 

Recently the "SAS Sample of the Day" was a Knowledge Base article with an impressively long title:

Sample 42165: Using a stored process to eliminate duplicate values caused by multiple group memberships when creating a group-based, identity-driven filter in SAS® Information Map Studio

"Wow," I thought. "This is the longest title on a SAS Sample that I have ever seen!"

This got me wondering whether anyone has run statistics on the SAS Knowledge Base. It would be interesting, I thought, to see a distribution of the length of the titles, to see words that appear most frequently in titles, and so forth.

I enlisted the aid of my friend Chris Hemedinger, who is no dummy at reading data into SAS. A few minutes later, Chris had assembled a SAS data set that contained the titles of roughly 2,250 SAS Samples.

The length of titles

The first statistic I looked at was the length of the title, which you can compute by using the LENGTH function. A quick call to PROC UNIVARIATE and—presto!—the analysis is complete:

proc univariate data=SampleTitles;
   var TitleLength;
   histogram TitleLength;
run;
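The PROC UNIVARIATE call assumes that the data set already contains a TitleLength variable. A minimal sketch of that preliminary step might look like the following. It assumes the titles are stored in a character variable named Title, which is the name used in the word-frequency code later in this article:

```sas
/* Sketch: add the length of each title to the data set.
   Assumes SampleTitles has a character variable named Title. */
data SampleTitles;
   set SampleTitles;
   TitleLength = length(Title);   /* LENGTH does not count trailing blanks */
run;
```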

The table of basic statistical measures shows that the median title length is about 50 characters, and the middle 50% of titles have between 39 and 67 characters. Statistically speaking, a "typical" SAS Sample title has 50 characters, such as this one: "Calculating rolling sums and averages using arrays." A histogram of the title lengths indicates that the distribution has a long right tail.

The shortest title is the pithy "Heat Maps," which contains only nine characters. The longest title is the mouth-filling behemoth mentioned at the beginning of this article, which tips the scales at an impressive 173 characters and crushes the nearest competitor, which has a mere 149 characters.
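If you want to look up the quartiles and the extreme titles yourself, one possible approach is sketched below. It assumes the Title and TitleLength variables described earlier; the data set name SortedTitles is a hypothetical name chosen for this illustration:

```sas
/* Sketch: summary statistics for the title lengths */
proc means data=SampleTitles n min q1 median q3 max;
   var TitleLength;
run;

/* Sketch: sort so that the longest titles come first.
   SortedTitles is a hypothetical output data set name. */
proc sort data=SampleTitles out=SortedTitles;
   by descending TitleLength;
run;

/* Print the five longest titles */
proc print data=SortedTitles(obs=5);
   var TitleLength Title;
run;
```

Sorting without the DESCENDING option would list the shortest titles first instead.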

Frequency of words that appear most often in SAS Samples

The next task was to investigate the frequency of words in the titles. Which words appear most often? The visual result of this investigation is a Wordle word cloud, shown at the beginning of this article. (In the word cloud, capitalization matters, so Using and using both appear.) As you might have expected, SAS and PROC are used frequently, as are action words such as use/using and create/creating. Nouns such as data, variable, example, and documentation also appear frequently.

You can do a frequency analysis of the words in the titles by using the COUNTW, SCAN, and SUBSTR functions to decompose the titles into words. The following SAS code excludes certain simple words (such as "a," "the," and "to") and runs PROC FREQ to perform a frequency analysis on the words that remain. The UPCASE function is used to combine words that differ only in capitalization:

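data words;
   keep Word;
   set SampleTitles;
   length Word $20;
   count = countw(title);                 /* number of words in the title */
   do i = 1 to count;
      Word = scan(title, i);              /* extract the i-th word */
      if substr(Word,1,3)="SAS" then Word="SAS"; /* get rid of (R) symbol */
      /* skip common words and single digits; also skip "by" and "By",
         but keep "BY" when it is written in all capitals */
      if upcase(Word) NOT IN ("A" "THE" "TO" "WITH" "FOR" "IN" "OF"
              "AND" "FROM" "AN" "ON" "THAT" "OR" "WHEN"
              "1" "2" "3" "4" "5" "6" "7" "8" "9")
         & Word NOT IN ("by" "By") then do;
         Word = upcase(Word);             /* combine words that differ only in case */
         output;
      end;
   end;
run;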
 
proc freq data=words order=freq noprint;
tables Word / out=FreqOut(where=(count>=50));
run;
 
ods graphics / height=1200 width=750;
proc sgplot data=FreqOut;
dot Word / response=count categoryorder=respdesc;
xaxis values=(0 to 650 by 50) grid fitpolicy=rotate;
run;

As is often the case, the distribution of frequencies decreases quickly and then has a long tail. The graph shows the frequency counts of terms that appear in titles at least 50 times.
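To see how COUNTW and SCAN decompose a title, it can help to run them on a single string. The following is a minimal sketch that uses the "typical" title quoted earlier and writes each word to the SAS log. It relies on the default delimiters, which include the blank and the period, so the trailing period is not part of the last word:

```sas
/* Sketch: split one title into words and write each word to the log */
data _null_;
   length Title $100 Word $20;
   Title = "Calculating rolling sums and averages using arrays.";
   n = countw(Title);            /* number of words, default delimiters */
   do i = 1 to n;
      Word = scan(Title, i);     /* i-th word of the title */
      put i= Word=;
   end;
run;
```

Both functions accept an optional argument that lists the delimiters, which is worth specifying if the defaults split words in a way you do not want.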

tags: Data Analysis, Just for Fun, Statistical Graphics
