July 13, 2010
 
Contributed by Angela Hall, author of the Real Business Intelligence for Real Users blog, an acclaimed blog for tips and tricks on SAS BI. Angela works for SAS Professional Services as a Technical Architect and holds an MBA with a Technology Management focus and a BA in Statistics. You can also follow her on Twitter.

In mid-November 2008 I added some Google Analytics functionality to Real Business Intelligence for Real Users to track usage. It’s fun to see where people are landing from search engines or what they might have bookmarked for future use. Of course the data are hardly complete (tracking started three years after the blog’s birth, and it doesn’t include RSS feed statistics), but it’s still interesting! Let’s take a look.

1. http://sas-bi.blogspot.com
The main page of the blog is the coolest place to read up on my latest SAS BI adventures and discoveries. Most blogs appear in reverse chronological order, so readers get the latest updates and rarely head back to years-old information.

2. Free SAS Training Material
What can I say? It’s FREE! Since I wrote this blog post there have been almost 2000 views of this single page! This also lists other free stuff to help learn SAS (such as User Group Papers).

SAS Training also offers some wonderful training courses for BI users, which may not be free, but they certainly are a great value.

3. So How Do You Know What is Installed?
Tasked with managing a SAS Version 9 installation and looking for a good mechanism to see what is installed? This post is the source!

4. METALIB Update
Administrators can grant Write Metadata access to SAS Enterprise Guide users so that SAS Management Console isn’t the only mechanism to update data table structures within the metadata. Here is a code mechanism to do this in 9.1. And in 9.2 there is a cool new task – check the comment from Chris Hemedinger on that one!

The SAS Enterprise Guide Administration course and the SAS Platform Administration Fast Track course provided by SAS Training also cover this topic.

5. Creating a Distinct Count Value in SAS Information Map Studio
How about that for content diversity? The top 5 blog posts include training, administration, and technical content! This is the technical one – how to get distinct counts of qualitative values. It’s a simple solution, but something that Web Report Studio users need.

This topic is also covered in Creating Business Intelligence for Your Organization 1 and Creating Business Intelligence for Your Organization 2, or the Creating Business Intelligence for Your Organization Fast Track. All three of these courses are provided by SAS Training.

Knowing what is popular is half the answer (and an easy answer to obtain). This helps me understand where to further elaborate on my blog. The next piece is to find out what is missing. Is there anything that you cannot locate out there (whether within SAS documentation, SAS Technical Papers, SAS Training, SAS Global Forum Proceedings, Discussion Forums, SAS blogs, etc)?
July 12, 2010
 
In example 8.1, we considered some simple tests for the randomness of the digits of Pi. Here we develop a different test and implement it.

If each digit appears in each place with equal and independent probability, then the gaps between recurrences of a digit should follow Pr(gap = x) = 0.9^x * 0.1: the probability that the digit recurs immediately is 0.1, the probability that it recurs after one intervening digit is 0.09, and so forth. Our test will compare the observed distribution of gaps to the null hypothesis described by the equation above. We'll just look at the gaps between occurrences of the numeral 1.
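The null probabilities are easy to tabulate in any language; here is a quick Python sketch (an aside for readers working in neither SAS nor R) of the first few values:

```python
# Null hypothesis: P(gap = x) = 0.9**x * 0.1 for independent, uniform digits.
probs = [0.9 ** x * 0.1 for x in range(5)]
print([round(p, 4) for p in probs])  # [0.1, 0.09, 0.081, 0.0729, 0.0656]
```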

R

The piby1 object, created in example 8.1, is a vector of length 10,000,000 containing the digits of Pi. We use the grep() function (section 1.4.6) to identify the places in the vector where the string "1" appears. Then we subtract the adjacent locations to find the distance between them, subtracting 1 to count the spaces.

pos1 = grep("1", piby1)
gap = pos1[2:length(pos1)] - pos1[1:(length(pos1) - 1)] - 1


> head(gap)
[1] 1 33 2 8 18 25

Recalling that Pi begins 3.141, the first value, 1, in the gap vector accurately reflects that there is a single non-1 digit between the first and second occurrences of the numeral 1. There are then 33 non-1 digits before another 1 turns up.
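The same gap extraction is easy to mimic outside of R; here is a hypothetical Python sketch using just the first 40 digits from the file (which begins "1415..."):

```python
# First 40 digits of Pi from the downloaded file.
digits = "1415926535897932384626433832795028841971"

# 0-based positions of the numeral 1, then the gaps between recurrences.
pos1 = [i for i, d in enumerate(digits) if d == "1"]
gaps = [b - a - 1 for a, b in zip(pos1, pos1[1:])]
print(gaps)  # [1, 33, 2]
```

This reproduces the first three values of the gap vector shown above.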

We next need to calculate the probabilities of each gap length under the null. We'll do that with the : shorthand for the seq() function (section 1.11.3). We'll lump all gaps larger than 88 digits together into one bin.

probgap = 0.9^(0:88) * .1
probgap[89] = 1 - sum(probgap[1:88])

How unusual was that 33-digit gap between the second and third 1?

> probgap[34]
[1] 0.003090315

Under the null, that gap should happen only about once in every 300 appearances.

Next, we tabulate the observed gaps, lump the large ones together similarly, then run the one-way chi-squared test.

gaptable = table(factor(gap, levels = 0:max(gap)))  # tabulate, keeping zero-count gap lengths
obsgap = c(gaptable[1:88], sum(gaptable[89:length(gaptable)]))
chisq.test(obsgap, p=probgap)


Chi-squared test for given probabilities
data: obsgap
X-squared = 82.3844, df = 88, p-value = 0.6488

This test, more stringent than the simple test for equal frequency of appearance, also reveals no violation of the null.
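The chi-squared statistic itself is just a sum over bins, which is worth seeing spelled out. A hypothetical Python sketch of the computation, using a toy example rather than the Pi data:

```python
def chisq_stat(observed, probs):
    """One-way chi-squared statistic: sum of (O - E)^2 / E,
    where E = total count * null probability for each bin."""
    total = sum(observed)
    return sum((o - total * p) ** 2 / (total * p)
               for o, p in zip(observed, probs))

# Toy example: 100 coin flips against a fair-coin null.
print(chisq_stat([55, 45], [0.5, 0.5]))  # 1.0
```

chisq.test() in R and proc freq in SAS compute exactly this statistic before looking up the p-value.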


SAS

In SAS, this operation is much more complex, at least when limited to data steps. Possibly it would be easier with proc sql.

We begin by finding the spaces between appearances of the numeral 1, using a retain statement to "remember" the number of observations since the last time a 1 was observed.

data gapout;
set test;
retain gap 0;
target=1;
if digit eq target then do;
output;
gap=0;
end;
else gap = gap + 1;
run;


Next, we tabulate the number of times each gap appeared, using proc freq to save the result in a data set (section 2.3.1).

proc freq data=gapout (firstobs=2) noprint;
tables gap / out=c1gaps;
run;


Next, we calculate the expected probabilities, simultaneously lumping the larger gaps together. Here we manually summed the cases-- this would be a somewhat more involved procedure to automate. The final if statement (section 1.5.1) prevents outputting lines with gaps larger than 88.

data c1gaps2;
set c1gaps;
retain sumprob 0;
if _n_ lt 89 then do;
prob = .9**(_n_ - 1) * .1;
sumprob = sumprob + prob;
end;
else if _n_ eq 89 then do;
prob = 1 - sumprob;
count=93;
end;
if _n_ le 89;
run;

As involved as the above may seem, the real problem in SAS is to get the null probabilities into proc freq for the test. SAS isn't set up to accept a data set or variable in that space. One option would be to insert it manually, but it may be worthwhile showing how to do it via a global macro variable.

First, we use proc transpose (section 1.5.4) to make a single observation with all of the probabilities in it.

proc transpose data=c1gaps2 out=c1gaps3;
var prob;
run;


Next we make a global macro variable to store the printed values of the probabilities, and use the call symput function (section A.8.2) to put the values into the variable; the final trick is to use the of syntax (section 1.11.4) to print all of the probabilities at once.

%global explist;

data probs;
set c1gaps3;
call symput ("explist", catx(' ', of col1-col89));
run;
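Under the hood, the macro variable is nothing more than the probabilities flattened into one space-separated string. In another language the same flattening might look like this (a hypothetical Python sketch, not part of the SAS workflow):

```python
# Flatten a list of null probabilities into one space-separated string,
# analogous to what call symput/catx builds in the explist macro variable.
probs = [0.9 ** x * 0.1 for x in range(3)]
explist = " ".join(format(p, ".6f") for p in probs)
print(explist)  # 0.100000 0.090000 0.081000
```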


We're finally ready to run the test. The global macro is called in the tables statement to define the expected probabilities. The weight statement tells SAS that there are count observations in each row.

proc freq data=c1gaps2;
tables gap / chisq testp= (&explist);
weight count;
run;
The FREQ Procedure

Chi-Square 82.3844
DF 88
Pr > ChiSq 0.6488

Sample Size = 999332
July 11, 2010
 
Folks, here is the answer from Andrew Karp (direct link):

Allow me to weigh in on this topic. It comes up a lot when I give SAS training classes. First, RUN and QUIT are both "explicit step boundaries" in the SAS programming language. PROC and DATA are "implied step boundaries."

Example 1: Two explicit step boundaries.

DATA NEW;
SET OLD;
C = A + B;
RUN;

PROC PRINT DATA=NEW;
RUN;

In this example, both the data and the proc steps are explicitly "ended" by their respective RUN statements.

Example 2: No explicit step boundaries.

DATA NEW;
SET OLD;
C = A + B;

PROC PRINT DATA=NEW;

In this example, the data step is implicitly terminated by the PROC statement. But there is no step boundary for the PROC PRINT step/task, so it will not terminate unless/until the SAS supervisor "receives" a step boundary. Some PROCs support what is called RUN-group processing. These include...

[[ This is a content summary only. Visit my website for full links, other content, and more! ]]
 Posted at 4:08 AM
July 10, 2010
 
You would think that in a discipline focused on information retrieval, where understanding the meaning of words and phrases is critical, everyone would know the difference between a taxonomy, an ontology, a thesaurus, and semantics. Oddly, this is not always the case.

It's not that the terms are poorly defined; the terms are extremely well defined - with fun words like entity, hyponym, or zeugma. Now you have some new words for Scrabble. I think, could be wrong, the confusion comes from the fact that information retrieval relies on each term in the acronym TOTS.

Some asides:
  • The phrase “I think, could be wrong,” is a zeugma: the pronoun "I" links "I think" with "could be wrong"; the I is implied in "could be wrong."

  • The game Scrabble is an example of an entity, a separate and distinct object or concept.

  • Information Retrieval is a hyponym of the more general term Search.

Now, I will use one of my favorite guilty pleasures, tater tots, to help explain the differences between a taxonomy, an ontology, a thesaurus, and semantics: aka TOTS.

I recently read an article about new, trendy gourmet tots: everything from blue cheese tots to truffle tots. Another trendy thing is foraging for wild food. Depending on where you live this might include: clams, wild mushrooms, herbs, thistle, blueberries and much more. In order to understand the definitions contained in the acronym TOTS, I will try to discover a recipe for the ultimate trendy food "Gourmet forage tater tots," and define what I'm doing as I go.

To start, how do search engines and text analytics products know what I am talking about when I type the word “tots”? Am I talking about Tater Tots? Or an 8th grade class? What are the semantics of what I am talking about in the document?
Continue reading "TOTS: Taxonomy, Ontology, Thesauri and Semantics"
July 7, 2010
 
Here's something you don't expect to hear from a banking executive: "The best thing that happened is the financial crisis."

Of course, Tonny Rabjerg is not your standard banking executive. He's the Vice President of CRM Systems at Danske Bank. "I know the financial crisis is not good for the bank, but for me it is good, because there has never been more focus on customers than before. Typically, banks think risk and credit are more important, but without the customer, risk and credit don't matter," says Rabjerg.

So, how is Rabjerg taking advantage of this new focus on the customer? He's leading a shift in his organization from customer relationship management (CRM) to personal customer management (PCM), and committing five to seven years to the project. The difference between CRM and PCM involves one-to-me marketing instead of one-to-one marketing and personal product presentations instead of campaign-based sales. "It's about matching the customer's requirements before he needs it," says Rabjerg. "I want to make sure we make it easy for him to take out a car loan before he goes to a car dealer."

Philippe Wallez, General Manager of Marketing, is leading similar customer-centric programs at ING Belgium. His full-scale direct-marketing project, which started in 2007, has transformed the way the bank communicates with its 2.7 million customers. Projects include targeted marketing and street advertising campaigns that put the brand's orange logos and themes directly on the backs of consumers. "If we are forced to communicate online, we will be forced to simplify," says Wallez. "Even on banking social networks, customers don't talk about banking. They're talking about their homes and their cars and their financial concerns."

In 2006, ING Belgium conducted one direct marketing campaign per week. Last year, the marketing team conducted at least ten campaigns per day. How did they do it? They hired business analysts, campaign analysts, marketers, digital marketers and direct mailers. They developed a new campaign process, a new data platform, brought in new tools and established new customer data ownership policies.

The bank now has a global client contact strategy, and the marketing department reports results directly to the board every week.

The benefits of ING's strategy include:
  • Fully automated service.
  • Simple sales migrate to direct channels.
  • More time for advice and sales in branch network.
  • Increased advice efficiency through direct-marketing-generated leads.
For such a wide-scale project, Wallez recommends strategy above all else. "Whatever strategy you use, you have to have a strategy. Consistently focus on strategy and free up resources," he says. "It's not easy but it's possible."

[Cross-posted from the sascom voices blog]
July 7, 2010
 
I use this blog to talk about all of the great information that is available to you on support.sas.com. I point out features that you may have missed (like the RSS feeds or Discussion Forums); I answer questions that you send via our feedback form (see the Q&A category); and I give you hints for searching and navigating.

While many of the SAS resources are available to you on support.sas.com, locating and understanding them can be a taxing effort. If you like video, sit back, relax, and watch these two new videos from SAS. Each video was produced to help you get more from your SAS investment. One provides an overview of the resources that are available for our customers, and the second gives instructions on how to contact SAS Technical Support.

Discover SAS Customer Resources
The customer resources video includes a message from Stacy Hobson about customer loyalty as well as information about the SAS Users Groups, Publications, Training, Contracts, R&D, Technical Support, and support.sas.com. You can watch this video from the Community page on support.sas.com site. The video is also available on the SAS Software YouTube channel.

Contacting SAS Technical Support
Did you know that SAS customers have unlimited access to SAS Technical Support? Watch the video to find out when and how to contact Technical Support. The video is available on the Support page of support.sas.com.
July 6, 2010
 
Do the digits of Pi appear in a random order? If so, the trillions of digits of Pi that have been calculated could serve as a useful random number generator. This post was inspired by this entry on Matt Asher's blog.

Generating pseudo-random numbers is a key piece of much of modern statistical practice, whether for Markov chain Monte Carlo applications or simpler simulations of what raw data would look like under given circumstances.

Generating sufficiently random pseudo-random numbers is not trivial, and many methods exist for doing so. Unfortunately, there's no way to prove that a series of pseudo-random numbers is indistinguishable from a truly random series of numbers. Instead, there are simply what amount to ad hoc tests, any of which might be failed by an insufficiently random series of numbers.

The first 10 million digits of Pi can be downloaded from here. Here, we explore a couple of simple ad hoc checks of randomness. We'll try something a little trickier in the next entry.


SAS

We start by reading in the data. The file contains one logical line with a length of 10,000,000. We tell SAS to read a digit-long variable and hold the place in the line. For later use, we'll also create a variable with the order of each digit, using the _n_ implied variable (section 1.4.15).


data test;
infile "c:\ken\pi-10million.txt" lrecl=10000000;
input digit 1. @@;
order = _n_;
run;


We can do a simple one-way chi-square test for equal probability of each digit in proc freq.


proc freq data = test;
tables digit/ chisq;
run;

Chi-Square 2.7838
DF 9
Pr > ChiSq 0.9723
Sample Size = 10000000


We didn't display the counts for each digit, but none was more than 0.11% away from the expected 1,000,000 occurrences.

Another simple check would be to assess autocorrelation. We can do this in proc autoreg. The dw=2 option calculates the Durbin-Watson statistic for adjacent and alternating residuals. We limit the observations to 1,000,000 digits for compatibility with R.


proc autoreg data=test (obs = 1000000);
model digit = / dw=2 dwprob;
run;
Durbin-Watson Statistics

Order DW Pr < DW Pr > DW
1 2.0028 0.9175 0.0825
2 1.9996 0.4130 0.5870
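The Durbin-Watson statistic itself is simple to compute by hand: for an intercept-only model, the residuals are just the values minus their mean, and DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), with values near 2 indicating no lag-1 autocorrelation. A hypothetical Python sketch:

```python
def durbin_watson(x):
    """Durbin-Watson statistic for the residuals of an intercept-only model."""
    mean = sum(x) / len(x)
    e = [v - mean for v in x]
    num = sum((b - a) ** 2 for a, b in zip(e, e[1:]))
    return num / sum(v * v for v in e)

print(durbin_watson([1, 2, 3, 4]))  # 0.6 (steady upward trend -> DW < 2)
print(durbin_watson([1, 2, 1, 2]))  # 3.0 (alternating values -> DW > 2)
```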

We might want to replicate this set of tests for series of 4 digits instead. To do this, we just tell the data step to read the line in 4-digit chunks.

data test4;
infile "c:\ken\pi-10million.txt" lrecl=10000000;
input digit4 4. @@;
order = _n_;
run;

proc freq data = test4;
tables digit4/ chisq;
run;
Chi-Square 9882.9520
DF 9999
Pr > ChiSq 0.7936

Sample Size = 2500000

proc autoreg data=test4 (obs = 1000000);
model digit4 = / dw=3 dwprob;
run;
Durbin-Watson Statistics

Order DW Pr < DW Pr > DW
1 2.0014 0.7527 0.2473
2 1.9976 0.1181 0.8819
3 2.0007 0.6397 0.3603

So far, we see no evidence of a lack of randomness.

R

In R, we use the readLines() function to create a 10,000,000-digit scalar object. In the following line we split the digits using the strsplit() function (as in section 6.4.1). This results in a list object, to which the as.numeric() function (which forces the digit characters to be read as numeric, section 1.4.2) cannot be applied directly. The unlist() function converts the list into a vector first, so that as.numeric() will work. Then the chisq.test() function performs the one-way chi-squared test.


mypi = readLines("c:/ken/pi-10million.txt", warn=FALSE)
piby1 = as.numeric(unlist(strsplit(mypi,"")))
chisq.test(table(piby1), p=rep(0.1, 10))

This generates the following output:

Chi-squared test for given probabilities

data: table(piby1)
X-squared = 2.7838, df = 9, p-value = 0.9723


Alternatively, it's trivial to write a function to automatically test for equal probabilities of all categories.


onewaychi = function(datavector){
datatable = table(datavector)
expect = rep(length(datavector)/length(datatable),length(datatable))
chi2 = sum(((datatable - expect)^2)/expect)
p = 1- pchisq(chi2,length(datatable)-1)
return(p)
}

> onewaychi(piby1)
[1] 0.972252



The Durbin-Watson test can be generated by the dwtest function, from the lmtest package. Using all 10,000,000 digits causes an error, so we use only the first 1,000,000.


> library(lmtest)
> dwtest(lm(piby1[1:1000000] ~ 1))

Durbin-Watson test

data: lm(piby1[1:1e+06] ~ 1)
DW = 2.0028, p-value = 0.9176
alternative hypothesis: true autocorrelation is greater than 0


To examine the digits in groups of 4, we read the digit vector as a matrix with 4 columns, then multiply each column by its place value and add the columns together. Alternatively, we could use the paste() function (section 1.4.5) to glue the digits together as a character string, then use as.numeric() to convert back to numbers.


> pimat = matrix(piby1, ncol = 4,byrow=TRUE)
> head(pimat)
[,1] [,2] [,3] [,4]
[1,] 1 4 1 5
[2,] 9 2 6 5
[3,] 3 5 8 9
[4,] 7 9 3 2
[5,] 3 8 4 6
[6,] 2 6 4 3

> piby4 = pimat[,1] * 1000 + pimat[,2] * 100 +
+ pimat[,3] * 10 + pimat[,4]
> head(piby4)
[1] 1415 9265 3589 7932 3846 2643

# alternate approach
# piby4_v2 = as.numeric(paste(pimat[,1], pimat[,2],
# pimat[,3], pimat[,4], sep=""))
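Either approach is easy to mirror in other languages; for instance, a hypothetical Python sketch of the same 4-digit grouping on the first 16 digits from the file:

```python
digits = "1415926535897932"  # first 16 digits from the downloaded file
piby4 = [int(digits[i:i + 4]) for i in range(0, len(digits), 4)]
print(piby4)  # [1415, 9265, 3589, 7932]
```

This matches the first four values of head(piby4) above.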

> onewaychi(piby4)
[1] 0.7936358

> dwtest(lm(piby4[1:1000000] ~ 1))
Durbin-Watson test

data: lm(piby4[1:1e+06] ~ 1)
DW = 2.0014, p-value = 0.753
alternative hypothesis: true autocorrelation is greater than 0
July 2, 2010
 
I'm happy to report that I have achieved my SAS Certified Base Programmer credential. w00t! I took my first SAS programming course on March 10th and passed my exam on June 18th (so I guess a more accurate title would say 3 months and 8 days!).

I thought I would share a couple of notes from my experience. But first, I should point out that there are other great perspectives on certification preparation techniques. First, there is 10-year SAS programming vet Gongwei Chen's SAS Global Forum paper describing his 4 months of preparation for the Base and Advanced SAS programming credentials. There is also the PROC CERTIFY project from SAS Publishing employees Stacey Hamilton and Christine Kjellberg, two self-described non-techies tackling certification with the assistance of SAS Publishing books such as the Certification Prep Guide.

So, what's my background? I have a degree in engineering. I'm a veteran of the telecom industry with more than a dozen years spent in network engineering and technical education. I've achieved data networking certification in the form of Cisco's CCNA. I consider myself a techie, dabbling in computers, web, and database technologies. I'm not a programmer, but I have programmed, learning FORTRAN in college and then C and VBA afterward. So in terms of technical/SAS background, I guess I'm somewhere on the continuum between Gongwei on one side and Stacey and Christine on the other.

Why did I seek SAS certification? Two reasons. First, I'm new to SAS and, being a bit of a techie, really wanted to learn SAS programming. Second, I work on the certification team here at SAS and wanted to experience SAS certification first hand. Ethically, I would have to do this quickly so that I achieved my certification before I did any work on the Base Programming exam. Wouldn't really be fair to see the exam before taking the exam, now would it?

How did I prepare? Although I am a long-time fan of self-paced learning, I thought I would do something different this time and give classroom training a try. Specifically, I took the Programming 1 and Programming 2 courses, did a bunch of studying, wrote a bunch of programs, and then used the Certification e-Practice exam to ensure that I was ready to sit for the exam. These component courses may also be available to you at a discount via a Best Value Bundle, depending on your location.

Who do I think can benefit from my perspective? Maybe you would like to make a career change and take advantage of the 1000+ SAS related job postings on the major job boards. Maybe you already work with SAS code and would like to become more efficient with your coding time and gain some recognition of your SAS programming skills. Or maybe you just got your first SAS programming job and need to ramp-up quickly.

In my next post, I'll share more specifics about how I studied and prepared for the exam. I hope you will find my perspective helpful.
July 1, 2010
 
We’re pleased to offer another segment of “The Nuts & Bolts of Social Media,” this time with my esteemed colleague, Justin Huntsman. In this short video, Deb talks to Justin about how he’s integrating social media into his campaigns. During the discussion, Justin provides some great ideas about how to ease social media “onto your docket” and balance traditional marketing with social media marketing.

Some of the most interesting points come from the idea that the key to balancing traditional and social marketing activities is to establish your goals before you start. Related points include:

  • As you engage in both traditional and social marketing, if you begin with a goal, you’ll find your efforts align themselves automatically.
  • If you don’t begin with a goal, you likely will be making tradeoffs that in retrospect will end up costing you more than you imagined.

Integrating and balancing traditional and social marketing efforts is critical for marketers to be successful today. Justin offers an effective way to approach it by getting yourself to stop thinking of the two worlds as independent responsibilities, and also to think of social media as a means and not an end.

Click on the screen below to tune in – it’s a short interview packed with some good insights.