July 14, 2010
 
Thanks for your feedback on this topic. As promised, I am circling back on the Advertising: To Gate or Not to Gate post I wrote in late April. As you'll recall, I initiated a little experiment to offer one of our assets via an online ad without registration. For a period of almost two months, we ran ad units with 1 to 1 Media, American Marketing Association, BusinessWeek Online, destinationCRM and The Wall Street Journal Online.

The results are in!

And they're not what I expected (which prompts me to want to tinker some more). For the time being, here is what we found when we compared the two trial periods:

  • The Click-Through Rate (CTR) went up from 0.09% to 0.50%
  • The Conversion Rate (CVR) went down from 3.32% to 2.65%

To put those numbers into plain English:

  • When we removed registration, the proportion of viewers who clicked through rose more than five-fold, from 0.09% to 0.50%.
  • At the same time, the subset of viewers who clicked through and then also downloaded the whitepaper shrank by about 20%, from 3.32% to 2.65%.
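Those two changes are easy to quantify; here is a quick back-of-the-envelope check (Python, illustrative only):

```python
# Before (gated) vs. after (ungated) rates from the campaign
ctr_gated, ctr_ungated = 0.0009, 0.0050   # CTR: 0.09% -> 0.50%
cvr_gated, cvr_ungated = 0.0332, 0.0265   # CVR: 3.32% -> 2.65%

ctr_lift = ctr_ungated / ctr_gated        # ~5.6x more clicks per view
cvr_drop = 1 - cvr_ungated / cvr_gated    # ~20% fewer downloads per click
```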

The CTR results were unexpected because we did not change the wording of the ad to say “no registration required,” so unless I'm missing something, there was no outward indication that would prompt more clicks. The CVR results are also unexpected, because this should be like the bowl of Halloween candy left on the porch by folks who aren't home. Remove the doorbell, and the first group of trick-or-treaters hits the jackpot; in the case of our ad, though, it didn't play out that way.

The “after” numbers (no registration) were confirmed in the following weeks, as the proportions remained the same. My preliminary conclusion from these results is that if our objective with online ads is to drive awareness, it's more effective to offer assets ungated. Conversely, if we are looking to generate leads, it makes more sense to continue gating the assets in the ads we place.
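Another way to read the results is downloads per impression, the product of CTR and CVR. A rough calculation (the 100,000 impression count is hypothetical, chosen only to make the numbers concrete):

```python
impressions = 100_000   # hypothetical volume, for illustration only

# downloads = impressions x CTR x CVR
gated_downloads   = impressions * 0.0009 * 0.0332   # roughly 3 downloads
ungated_downloads = impressions * 0.0050 * 0.0265   # roughly 13 downloads
```

So the ungated ad produced roughly four times as many downloads per impression; the catch is that none of those downloads came with contact information, which is why gating still wins for lead generation.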

If you will recall, the asset we're promoting in these advertisements is a Webcast summary paper titled Tips from the Trade - Competing on Web Analytics, from a Webcast of the same name (Webcast Link). The Webcast featured a discussion with author Eric Peterson of Web Analytics Demystified and SAS customer Office Depot, moderated by SAS' own Michele Eggers.

I am going to see if we can run another test, this time changing the wording in the ads and on the landing pages to see if that drives a different result, and I'll let you know what happens. Until then, let me know whether you expected these results. Have you had similar experiences? Share your thoughts when you have a chance. Thanks!
July 13, 2010
 

At the end of my first blog post, I said I would explain more about those literature-analyzing skills and what all that literature stuff has to do with SAS. I showed my “skills list” and discussed how learning to write a SAS program could be compared to analyzing a work of literature. The first item on my list was
“Understand the work.”


Moby-Dick and Finnegans Wake have the reputation of being great literature, but they also have the reputation of being difficult. By the time you're into the beginning chapters of either book, you might not feel like you're reading great literature. Both books can seem like they're not worth the effort to keep going. SAS is like that too. You get that first red ERROR in the SAS log that says “Statement is not valid or it is used out of proper order” or “Expecting a (” and you really wonder whether you should keep going. Yes, it is worth the effort to keep going. But even before you get that first error message, or start coding your program, or start writing your analytical paper, you have to understand the work.


In the P.O.E.M. (Professional Organization of English Majors) world, this step does involve actually reading Moby-Dick and Finnegans Wake (and more than once!). But that's not the end of it: you don't really understand the work until you've got all the particulars about it under your belt. Merely reading a work of literature, after all, would be no different from reading a beach book. When you're a Lit major, you have to know the Who, What, When, Where, and Why of every novel you read.


When you write an analytical paper on a work of literature, these are examples of the 5 “W’s” that pertain to the actual work itself and its author. But then the paper that you’re writing has its own set of questions that must be answered. Who am I writing this paper for? What is the paper about? What question do I have to answer? What case do I have to argue? What character do I have to explain? How am I going to research the paper? How am I going to write the paper? How am I going to deliver the paper?


Now, let's talk about the “W” questions and how they apply to a SAS project or program. At least you don't have to read Moby-Dick and Finnegans Wake before you write your program!


Continue reading "Understand the Work"
July 13, 2010
 
In the 2010 SASware Ballot®, a dedicated PROC for Randomized SVD was among the options. While an official SAS PROC will not be available in the immediate future (and certainly not in older SAS releases), it is fairly simple to implement this algorithm using existing SAS/STAT procedures.

Randomized SVD is useful for large-scale, high-dimensional data mining problems, for instance text mining. In Base SAS and SAS/STAT, the lack of sparse matrix operations puts any serious text mining task, such as the LSI or NMF algorithms, at the edge of infeasibility. Randomized SVD provides an economical alternative by sacrificing a little accuracy, which is bounded under the three sampling schemes proposed by the authors [1]; the code below demonstrates sampling scheme 1.
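Before the SAS version, the algorithm itself is easy to state: sample s rows with probability proportional to their squared norms, rescale them, and take the SVD of the small sample. A minimal NumPy sketch of sampling scheme 1 (the function name and structure are mine, for illustration; this is not a translation of the SAS code below):

```python
import numpy as np

def randomized_svd_rows(A, s, k, seed=0):
    """Drineas-Mahoney sampling scheme 1: approximate the top-k SVD of A
    by sampling s rows with probability proportional to squared row norms."""
    rng = np.random.default_rng(seed)
    row_norms_sq = (A ** 2).sum(axis=1)
    p = row_norms_sq / row_norms_sq.sum()        # p_i = ||A_i||^2 / ||A||_F^2
    idx = rng.choice(A.shape[0], size=s, replace=True, p=p)
    S = A[idx] / np.sqrt(s * p[idx])[:, None]    # rescale so E[S'S] = A'A
    # The right singular vectors of the small matrix S approximate those of A.
    _, sing, Vt = np.linalg.svd(S, full_matrices=False)
    return sing[:k], Vt[:k]
```

On an exactly rank-1 matrix the top singular value comes out exactly, independent of which rows are sampled; in general the approximation error is bounded in terms of s and the Frobenius norm, as shown in [1].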



/* Randomized SVD with sampling scheme 1. */
%let dim=2048;
%let nobs=1e4;
%let s=256;
data matrix;
     array _x{*} x1-x&dim;
     do id=1 to &nobs;
     do _j=1 to dim(_x); _x[_j]=sin(mod(id, _j))+rannor(id); end;
  output;
  drop _j;
  end;
run;

%let datetime_start = %sysfunc(TIME()) ;
%let time=%sysfunc(datetime(), datetime.); %put &time;
data seed;
     array _x{*} x1-x&dim;
     do _j=1 to dim(_x); _x[_j]=0; end;
  output;
  stop;
run;

proc fastclus data=matrix  seed=seed      out=norm(keep=ID DISTANCE)
              maxiter=0    maxclusters=1  noprint  replace=none;
     var x1-x&dim;
run;
data normv/ view=normv;
     set norm(keep=DISTANCE);
     DISTANCE2=DISTANCE**2;
     drop DISTANCE;
run;
proc means data=normv noprint;
     var DISTANCE2;
     output  out=matrixnorm  sum(DISTANCE2)=Frobenius_sqr;
run;
data prob;
     set matrixnorm ;
  retain Frobenius_sqr;
  do until (eof);
     set norm  end=eof;
  _rate_=DISTANCE**2/Frobenius_sqr;
  keep ID _rate_;
  output;
  end;
run;

data matrixv/view=matrixv;
     merge matrix  prob(keep=_rate_);
run;

ods select none;
proc surveyselect data=matrixv  out=matrixsamp(drop=SamplingWeight  ExpectedHits  NumberHits _rate_)  
                   sampsize=&s  method=pps_wr   outhits  ;
     size _rate_;
run;
ods select all;

proc transpose data=matrixsamp  out=matrixsamp;
     var x1-x&dim;
run;

proc princomp data=matrixsamp  outstat=testv(where=(_type_ in ("USCORE")))  
              noint  cov  noprint;
  var col1-col&s;
run;
data testV_t/view=testV_t;
     retain _TYPE_ 'PARMS';
  set testv(drop=_TYPE_);
run;     

proc score data=matrixsamp   score=testV_t  type=parms  
           out=SW(keep=ID Prin:);
    var   col1-col&s;
run;

data seed;
     array _s{*}  prin1-prin&s;
  do _j=1 to dim(_s); _s[_j]=0; end;
  drop _j; output; stop;
run;

proc fastclus data=SW   seed=seed  maxiter=0  maxc=1  replace=NONE   out=SW2(drop=CLUSTER)  noprint;
     var prin1-prin&s;
run;

data HHT;
     set SW2;
  array _x{*}  prin1-prin&s;
  do _j=1 to dim(_x); _x[_j]=(_x[_j]/distance)**2; end;
  drop _j  distance;
run;

proc transpose data=HHT  out=HHT2(drop=_LABEL_);      
run;
data HHT2; 
     _TYPE_='PARMS';
     set HHT2; 
  rename COL1-COL&dim=x1-x&dim;
run;

proc score data=matrix   score=HHT2  type=parms  out=P(drop=x:);
     var x:;
run;     

%let time=%sysfunc(datetime(), datetime.); %put &time;
%put PROCESSING TIME:  %sysfunc(putn(%sysevalf(%sysfunc(TIME())-&datetime_start.),mmss.)) (mm:ss) ;
options notes source;


Reference:
[1] P. Drineas and M. W. Mahoney, "Randomized Algorithms for Matrices and Massive Data Sets," Proc. of the 32nd Annual Conference on Very Large Data Bases (VLDB), p. 1269, 2006.
 Posted at 11:06 AM

July 13, 2010
 
Contributed by Angela Hall, author of the Real Business Intelligence for Real Users blog, an acclaimed blog for tips and tricks on SAS BI. Angela works for SAS Professional Services as a Technical Architect and holds an MBA with a Technology Management focus and a BA in Statistics. You can also follow her on Twitter.

In mid-November 2008, I added some Google Analytics functionality to Real Business Intelligence for Real Users to track usage. It's fun to see where people are landing from search engines, or what they might have bookmarked for future use. Of course, the data is hardly complete (tracking started three years after the blog's birth, and it doesn't include RSS feed statistics), but it's still interesting! Let's take a look.

1. http://sas-bi.blogspot.com
The main page of the blog is the coolest place to read up on my latest SAS BI adventures and discoveries. Most blogs run in reverse chronological order, so readers get the latest updates and rarely head back to years-old information.

2. Free SAS Training Material
What can I say? It’s FREE! Since I wrote this blog post there have been almost 2000 views of this single page! This also lists other free stuff to help learn SAS (such as User Group Papers).

SAS Training also offers some wonderful training courses for BI users, which may not be free, but they certainly are a great value.

3. So How Do You Know What is Installed?
Tasked with managing a SAS Version 9 installation and looking for a good way to see what is installed? This is the source!

4. METALIB Update
Administrators can grant Write Metadata access to SAS Enterprise Guide users so that SAS Management Console isn’t the only mechanism to update data table structures within the metadata. Here is a code mechanism to do this in 9.1. And in 9.2 there is a cool new task – check the comment from Chris Hemedinger on that one!

The SAS Enterprise Guide Administration course and the SAS Platform Administration Fast Track course provided by SAS Training also cover this topic.

5. Creating a Distinct Count Value in SAS Information Map Studio
How about that for content diversity? The top 5 blog posts include training, administration, and technical content! This is the technical one: how to get counts of distinct qualitative values. It's a simple solution, but something that Web Report Studio users need.

This topic is also covered in Creating Business Intelligence for Your Organization 1 and Creating Business Intelligence for Your Organization 2, or the Creating Business Intelligence for Your Organization Fast Track. All three of these courses are provided by SAS Training.

Knowing what is popular is half the answer (and an easy answer to obtain). This helps me understand where to further elaborate on my blog. The next piece is to find out what is missing. Is there anything that you cannot locate out there (whether within SAS documentation, SAS Technical Papers, SAS Training, SAS Global Forum Proceedings, Discussion Forums, SAS blogs, etc)?
July 12, 2010
 
In example 8.1, we considered some simple tests for the randomness of the digits of Pi. Here we develop a different test and implement it.

If each digit appears in each place with equal and independent probability, then the gaps between recurrences of a digit should follow Pr(gap = x) = 0.9^x * 0.1: the probability that the digit recurs immediately is 0.1, the probability that it recurs after one intervening digit is 0.09, and so forth. Our test will compare the observed distribution of gaps to the null hypothesis described by the equation above. We'll just look at the gaps between occurrences of the numeral 1.
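The null distribution is just a geometric distribution with success probability 0.1. A quick sanity check of the formula (a Python sketch, separate from the R and SAS code that follows):

```python
# Pr(gap = x) = 0.9**x * 0.1 for x = 0, 1, 2, ...
p = [0.9**x * 0.1 for x in range(2000)]

total = sum(p)   # should be essentially 1 over all gap lengths
p_0 = p[0]       # immediate recurrence: 0.1
p_33 = p[33]     # a 33-digit gap: about 0.003
```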

R

The piby1 object, created in example 8.1, is a vector of length 10,000,000 containing the digits of Pi. We use the grep() function (section 1.4.6) to identify the places in the vector where the string "1" appears. Then we subtract the adjacent locations to find the distance between them, subtracting 1 to count the spaces.

pos1 = grep("1", piby1)
gap = pos1[2:length(pos1)] - pos1[1:(length(pos1) - 1)] - 1


> head(gap)
[1] 1 33 2 8 18 25

Recollecting that Pi begins 3.141, the first value, 1, in the gap vector reflects accurately that there is a single non-1 digit between the first and second occurrences of the numeral 1. There are then 33 non-1 digits before another 1 turns up.
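The same check can be reproduced by hand on the first 100 decimal digits of Pi (a Python sketch; the digit string is typed in directly rather than taken from the example 8.1 data):

```python
# First 100 digits of Pi after the decimal point
digits = (
    "1415926535897932384626433832795028841971"
    "6939937510582097494459230781640628620899"
    "86280348253421170679"
)

pos1 = [i for i, d in enumerate(digits) if d == "1"]   # where '1' occurs
gaps = [b - a - 1 for a, b in zip(pos1, pos1[1:])]     # intervening digits
# gaps[:6] == [1, 33, 2, 8, 18, 25], matching head(gap) above
```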

We next need to calculate the probabilities of each gap length under the null. We'll do that with the : shorthand for the seq() function (section 1.11.3). We'll lump all gaps of 88 or more digits together into one bin.

probgap = 0.9^(0:88) * .1
probgap[89] = 1 - sum(probgap[1:88])

How unusual was that 33-digit gap between the second and third 1?

> probgap[34]
[1] 0.003090315

Under the null, that gap should happen only about once in every 300 appearances.

Next, we tabulate the observed gaps with table(), lump the large ones similarly, and run the one-way chi-squared test.

gaptable = table(gap)
obsgap = c(gaptable[1:88], sum(gaptable[89:length(gaptable)]))
chisq.test(obsgap, p=probgap)


Chi-squared test for given probabilities
data: obsgap
X-squared = 82.3844, df = 88, p-value = 0.6488

This test, more stringent than the simple test for equal frequency of appearance, also reveals no violation of the null.


SAS

In SAS, this operation is much more complex, at least when limited to data steps. Possibly it would be easier with proc sql.

We begin by finding the spaces between appearances of the numeral 1, using a retain statement to "remember" the number of observations since the last time a 1 was observed.

data gapout;
set test;
retain gap 0;
target=1;
if digit eq target then do;
output;
gap=0;
end;
else gap = gap + 1;
run;
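The retain logic in that data step amounts to a stateful counter; a Python analogue (illustrative, with a hypothetical helper name) looks like this:

```python
def gaps_between(digits, target="1"):
    """Emit the gap (digits since the last hit) before each occurrence of target."""
    gap, out = 0, []
    for d in digits:
        if d == target:
            out.append(gap)   # like OUTPUT; then reset the retained counter
            gap = 0
        else:
            gap += 1          # like gap = gap + 1
    return out
```

As in the SAS step, the first emitted value counts the digits before the first occurrence, which is why the tabulation that follows skips it with firstobs=2.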


Next, we tabulate the number of times each gap appeared, using proc freq to save the result in a data set (section 2.3.1).

proc freq data=gapout (firstobs=2) noprint;
tables gap / out=c1gaps;
run;


Next, we calculate the expected probabilities, simultaneously lumping the larger gaps together. Here we manually summed the cases; automating this would be somewhat more involved. The final if statement (section 1.5.1) prevents outputting lines with gaps larger than 88.

data c1gaps2;
set c1gaps;
retain sumprob 0;
if _n_ lt 89 then do;
prob = .9**(_n_ - 1) * .1;
sumprob = sumprob + prob;
end;
else if _n_ eq 89 then do;
prob = 1 - sumprob;
count=93;
end;
if _n_ le 89;
run;

As involved as the above may seem, the real problem in SAS is to get the null probabilities into proc freq for the test. SAS isn't set up to accept a data set or variable in that space. One option would be to insert it manually, but it may be worthwhile showing how to do it via a global macro variable.

First, we use proc transpose (section 1.5.4) to make a single observation with all of the probabilities in it.

proc transpose data=c1gaps2 out=c1gaps3;
var prob;
run;


Next we make a global macro variable to store the printed values of the probabilities, and use the call symput function (section A.8.2) to put the values into the variable; the final trick is to use the of syntax (section 1.11.4) to print all of the probabilities at once.

%global explist;

data probs;
set c1gaps3;
call symput ("explist", catx(' ', of col1-col89));
run;


We're finally ready to run the test. The global macro is called in the tables statement to define the expected probabilities. The weight statement tells SAS that there are count observations in each row.

proc freq data=c1gaps2;
tables gap / chisq testp= (&explist);
weight count;
run;
The FREQ Procedure

Chi-Square 82.3844
DF 88
Pr > ChiSq 0.6488

Sample Size = 999332
July 11, 2010
 
Folks, here is the answer from Andrew Karp (direct link):

Allow me to weigh in on this topic. It comes up a lot when I give SAS training classes. First, RUN and QUIT are both "explicit step boundaries" in the SAS programming language. PROC and DATA are "implied step boundaries."

Example 1: Two explicit step boundaries.

DATA NEW;
   SET OLD;
   C = A + B;
RUN;

PROC PRINT DATA=NEW;
RUN;

In this example, both the DATA and the PROC steps are explicitly "ended" by their respective RUN statements.

Example 2: No explicit step boundaries.

DATA NEW;
   SET OLD;
   C = A + B;

PROC PRINT DATA=NEW;

In this example, the DATA step is implicitly terminated by the PROC statement. But there is no step boundary for the PROC PRINT step/task, so it will not terminate unless/until the SAS supervisor "receives" a step boundary. Some PROCs support what is called RUN-group processing. These include...

[[ This is a content summary only. Visit my website for full links, other content, and more! ]]
 Posted at 4:08 AM
July 10, 2010
 
You would think that in a discipline focused on information retrieval, where understanding the meaning of words and phrases is critical, everyone would know the difference between a taxonomy, an ontology, a thesaurus and semantics. Oddly, this is not always the case.

It's not that the terms are poorly defined; the terms are extremely well defined, with fun words like entity, hyponym, or zeugma. Now you have some new words for Scrabble. I think, could be wrong, the confusion comes from the fact that information retrieval relies on each term in the acronym TOTS.

Some asides:
  • The phrase “I think, could be wrong” is a zeugma: the single subject "I" links both "think" and "could be wrong"; the "I" is implied in "could be wrong."

  • The game Scrabble is an example of an entity, a separate and distinct object or concept.

  • Information Retrieval is a hyponym of the more general term Search.

Now, I will use one of my favorite guilty pleasures, tater tots, to help explain the differences between a taxonomy, an ontology, a thesaurus and semantics: aka TOTS.

I recently read an article about new, trendy gourmet tots: everything from blue cheese tots to truffle tots. Another trendy thing is foraging for wild food. Depending on where you live, this might include clams, wild mushrooms, herbs, thistle, blueberries and much more. In order to understand the definitions contained in the acronym TOTS, I will try to discover a recipe for the ultimate trendy food, "Gourmet forage tater tots," and define what I'm doing as I go.

To start, how do search engines and text analytics products know what I am talking about when I type the word “tots”? Am I talking about Tater Tots? Or an 8th-grade class? What are the semantics of what I am talking about in the document?
Continue reading "TOTS: Taxonomy, Ontology, Thesauri and Semantics"
July 7, 2010
 
Here's something you don't expect to hear from a banking executive: "The best thing that happened is the financial crisis."

Of course, Tonny Rabjerg is not your standard banking executive. He's the Vice President of CRM Systems at Danske Bank. "I know the financial crisis is not good for the bank, but for me it is good, because there has never been more focus on customers than before. Typically, banks think risk and credit are more important, but without the customer, risk and credit don't matter," says Rabjerg.

So, how is Rabjerg taking advantage of this new focus on the customer? He's leading a shift in his organization from customer relationship management (CRM) to personal customer management (PCM), and committing five to seven years to the project. The difference between CRM and PCM involves one-to-me marketing instead of one-to-one marketing, and personal product presentations instead of campaign-based sales. "It's about matching the customer's requirements before he needs it," says Rabjerg. "I want to make sure we make it easy for him to take out a car loan before he goes to the car dealer."

Philippe Wallez, General Manager of Marketing, is leading similar customer-centric programs at ING Belgium. His full-scale direct-marketing project, which started in 2007, has transformed the way the bank communicates with its 2.7 million customers. Projects include targeted marketing and street advertising campaigns that put the brand's orange logos and themes directly on the backs of consumers. "If we are forced to communicate online, we will be forced to simplify," says Wallez. "Even on banking social networks, customers don't talk about banking. They talk about their homes and their cars and their financial concerns."

In 2006, ING Belgium conducted one direct marketing campaign per week. Last year, the marketing team conducted at least ten campaigns per day. How did they do it? They hired business analysts, campaign analysts, marketers, digital marketers and direct mailers. They developed a new campaign process and a new data platform, brought in new tools, and established new customer data ownership policies.

The bank now has a global client contact strategy, and the marketing department reports results directly to the board every week.

The benefits of ING's strategy include:
  • Fully automated service.
  • Simple sales migrate to direct channels.
  • More time for advice and sales in branch network.
  • Increase advice efficiency through direct marketing generated leads.
For such a wide-scale project, Wallez recommends strategy above all else. "Whatever strategy you use, you have to have a strategy. Consistently focus on strategy and free up resources," he says. "It's not easy but it's possible."

[Cross-posted from the sascom voices blog]