062017
 

After reading my article about how to use BY-group processing to run 1000 regression models, a SAS programmer asked whether it is possible to reorder the output of a BY-group analysis. The answer is yes: you can use the DOCUMENT procedure to replay a portion of your output in any order you like. It is a feature of the SAS ODS system that is not very well known to statisticians. Kevin Smith (2012) has called PROC DOCUMENT "the most underutilized tool in the ODS toolbox."

The end of this article contains links to several papers that explain how PROC DOCUMENT works. I was first introduced to PROC DOCUMENT by Warren Kuhfeld, who uses it to modify the graphical output that comes from statistical procedures.

Reordering output (the topic of this article) is much easier than modifying the output. There is a SAS Sample that shows how to use PROC DOCUMENT to reorder BY groups, but I will show a simpler variation. The SAS Sample mentions that you can also use SAS macros to order the output from BY-group analyses, but the macro method is less efficient, computationally speaking.

Use PROC DOCUMENT to reorder SAS output

Suppose you use multiple procedures in an analysis and each procedure uses BY groups. The SAS output will consist of all the BY groups for the first procedure followed by all the BY groups for the second procedure. Sometimes it is preferable to reorder the output so that the results from all procedures are grouped by the BY-group levels.

A basic reordering of the SAS output uses the following steps:

  1. Use the ODS DOCUMENT statement to store the tables and graphs from the analyses into a destination called the document.
  2. Use the LIST statement in PROC DOCUMENT to examine the names of the tables and graphs within the document.
  3. Use the REPLAY statement in PROC DOCUMENT to send the tables and graphs to other ODS destinations in whatever order you choose.

Step 1: Store the output in a document

The following statements generate descriptive statistics for six clinical variables and for each level of the Smoking_Status variable. The Smoking_Status variable indicates whether a patient in this study is a non-smoker, a light smoker, and so on. PROC MEANS generates statistics for the continuous variables; PROC FREQ generates counts for the levels of the discrete variables. The ODS DOCUMENT statement writes all of the ODS output information to a SAS document (technically an item store) called 'doc':

proc sort data=sashelp.heart out=heart; /* sort for BY group processing */
   by smoking_status;
run;
 
ods document name=doc(write);      /* wrap ODS DOCUMENT around the output */
   proc means data=heart;
      by smoking_status;
      var Cholesterol Diastolic Systolic;         /* 1 table, results for three vars */
   run;
 
   proc freq data=heart;
      by smoking_status;
      tables BP_Status Chol_Status Weight_Status; /* 3 tables, one for each of three vars */
   run;
ods document close;                /* close the document */

Step 2: Examine the names of the objects in the document

At this point, the output is captured in an ODS document. Each object is assigned a name that is closely related to the ODS name that is displayed if you use ODS TRACE ON. You can use PROC DOCUMENT to list the contents of the SAS item store. This will tell you the names of the ODS objects in the document.

PROC DOCUMENT is an interactive procedure, which means that the procedure continues running until you submit the QUIT statement. You can submit a group of statements followed by a RUN statement to execute the procedure interactively. The following statement runs a LIST statement, but does not exit the procedure:
proc document name=doc(read);
   list / levels=all bygroups;  /* add column for each BY group var */
run;

The output shows the table names. The tables from PROC MEANS begin with "Means" and end with the name of the table, which is "Summary." (The references explain the purpose of the many "#1" strings, which are called sequence numbers.) Similarly, the tables from PROC FREQ begin with "Freq" and end with "OneWayFreqs." Notice that the listing includes the BY-group variable as a column. You can use this variable to list or replay only certain BY groups. For example, to list only the tables for non-smokers, you can run the following PROC DOCUMENT statements:

   /* caret (^) is "current directory," not "negation"! */
   list ^ (where=(Smoking_Status="Non-smoker")) / levels=all bygroups;
run;

The output shows that there are four tables that correspond to the output for the "Non-smoker" level of the BY group.

Step 3: Replay the output in any order

The previous section showed that you can use a WHERE clause to list all the tables (and graphs) in a BY group. The REPLAY statement also supports a WHERE clause, and you can use it to send only those tables to the open ODS destinations. The DOCUMENT procedure is still running, so the following statements produce the output:

   replay ^ (where=(Smoking_Status="Non-smoker")); /* levels=all is default */
run;

The output is not shown, but it consists of all tables for which Smoking_Status="Non-smoker," regardless of which procedure created the output.

Notice that the "Non-smoker" group was the fifth level (in alphabetical order) of the Smoking_Status variable. You can use multiple REPLAY statements to send output for other levels in any order. For example, you might want to output the tables only for smokers, ordered by how much tobacco the patients consume. The following statements output the tables for smokers, grouped by tobacco usage:

   /* replay output in a different order */
   replay ^(where=(Smoking_Status="Light (1-5)"));
   replay ^(where=(Smoking_Status="Moderate (6-15)"));
   replay ^(where=(Smoking_Status="Heavy (16-25)"));
   replay ^(where=(Smoking_Status="Very Heavy (> 25)"));
run;
quit;

For brevity I have left out many details about the syntax of PROC DOCUMENT, but I hope the main idea is clear: You can store the results of multiple computations in a document and then replay any subset of the results in any order. The references in the next section contain many useful tips and techniques to help you get the most out of PROC DOCUMENT.

References

For an innovative use of PROC DOCUMENT to modify statistical graphics, see the papers and books by Warren Kuhfeld, such as Kuhfeld (2016) "Highly Customized Graphs Using ODS Graphics."

The post Reorder the output from a BY-group analysis in SAS appeared first on The DO Loop.

062017
 

We like to think of analytics as a logical discipline, where investment decisions are consistently rational and predictable. Not so. Customer organisations are driven by all the usual complex and sometimes contradictory forces. Business users want analytics to help them make good customer-facing decisions as quickly as possible, while [...]

How to make analytics modernization work in practice was published on SAS Voices by Caroline Hermon

042017
 

hivemall

1. Introduction to Hivemall

Hivemall is a machine learning library implemented as Hive UDFs. It is very convenient in industrial practice, letting data scientists quickly build machine learning model prototypes and deploy them in real applications.



2. Using Hivemall

2.1 Input format

(1) Classification model input format

Feature format: feature ::= <index>:<weight> or <index>

Index 0 is reserved for the bias variable, e.g. 10:3.4 123:0.5 34567:0.231. The index can also be text, e.g. "height:1.5" "length:2.0". Quantitative variables must carry an index; you can add one with select add_feature_index(array(3,4.0,5)) from dual;. Categorical variables may omit the weight, i.e. feature ::= <index>.

Feature hashing: when there are very many features (more than 16,777,216), or the feature names are long text strings that consume a lot of memory, consider feature hashing:


-- feature is v0.3.2 or before
concat(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), ":", extract_weight("xxxxxxx-yyyyyy-weight:55.3"))

-- feature is v0.3.2-1 or later
feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xxxxxxx-yyyyyy-weight:55.3"))

Label format in binary classification:


<label> ::= 0 | 1

Label format in multi-class classification: the label can be any class identifier (an integer or a string).

(2) Regression model input format: <target> ::= <float> (a real-valued target)

(3) Helper functions


select feature("weight", 55.0);
weight:55.0

select extract_feature("weight:55.0"), extract_weight("weight:55.0");
weight | 55.0

select feature_index(array("10:0.2","7:0.3","9"));
[10,7,9]

select 
  convert_label(-1), convert_label(1), convert_label(0.0f), convert_label(1.0f)
from 
  dual;
 0.0f | 1.0f | -1 | 1

Quantitative features


Create sparse quantitative features
select quantitative_features(array("apple","value"),1,120.3);
["apple:1.0","value:120.3"]

Categorical features


Create sparse categorical features
select categorical_features(
  array("is_cat","is_dog","is_lion","is_pengin","species"),
  1, 0, 1.0, true, "dog"
);
["is_cat#1","is_dog#0","is_lion#1.0","is_pengin#true","species#dog"]

Prepare the training data table


select 
  rowid() as rowid,
  concat_array(
    array("bias:1.0"),
    categorical_features( 
      array("id", "name"),
      id, name
    ),
    quantitative_features(
      array("height", "weight"),
      height, weight
    )
  ) as features, 
  click_or_not as label
from
  table;
2.2 Feature engineering

Min-max normalization


select min(target), max(target)
from (
select target from e2006tfidf_train 
-- union all
-- select target from e2006tfidf_test 
) t;
-7.899578 -0.51940954
set hivevar:min_target=-7.899578;
set hivevar:max_target=-0.51940954;

create or replace view e2006tfidf_train_scaled 
as
select 
  rowid,
  rescale(target, ${min_target}, ${max_target}) as target, 
  features
from 
  e2006tfidf_train;

Z-score normalization


select avg(target), stddev_pop(target)
from (
select target from e2006tfidf_train 
-- union all
-- select target from e2006tfidf_test 
) t;
-3.566241460963296 0.6278076335455348
set hivevar:mean_target=-3.566241460963296;
set hivevar:stddev_target=0.6278076335455348;

create or replace view e2006tfidf_train_scaled 
as
select 
  rowid,
  zscore(target, ${mean_target}, ${stddev_target}) as target, 
  features
from 
  e2006tfidf_train;

Feature hashing


select feature_hashing('aaa');
> 4063537

select feature_hashing('aaa','-features 3');
> 2

select feature_hashing(array('aaa','bbb'));
> ["4063537","8459207"]

select feature_hashing(array('aaa','bbb'),'-features 10');
> ["7","1"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'));
> ["4063537:1.0","4063537","8459207:2.0"]

select feature_hashing(array(1,2,3));
> ["11293631","3322224","4331412"]

select feature_hashing(array('1','2','3'));
> ["11293631","3322224","4331412"]

select feature_hashing(array('1:0.1','2:0.2','3:0.3'));
> ["11293631:0.1","3322224:0.2","4331412:0.3"]

select feature_hashing(features), features from training_fm limit 2;

> ["1803454","6630176"]   ["userid#5689","movieid#3072"]
> ["1828616","6238429"]   ["userid#4505","movieid#2331"]

select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));

> ["1828616:3.3","6238429:4.999","6238429"]

TF-IDF computation


Define macro functions
create temporary macro max2(x INT, y INT)
if(x>y,x,y);

-- create temporary macro idf(df_t INT, n_docs INT)
-- (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

create temporary macro tfidf(tf FLOAT, df_t INT, n_docs INT)
tf * (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

Prepare the data
create external table wikipage (
  docid int,
  page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
  docid, 
  word
from
  wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
  not is_stopword(word);
Compute TF
create or replace view term_frequency 
as
select
  docid, 
  word,
  freq
from (
select
  docid,
  tf(word) as word2freq
from
  wikipage_exploded
group by
  docid
) t 
LATERAL VIEW explode(word2freq) t2 as word, freq;
Compute DF
create or replace view document_frequency
as
select
  word, 
  count(distinct docid) docs
from
  wikipage_exploded
group by
  word;
-- set the total number of documents
select count(distinct docid) from wikipage;
set hivevar:n_docs=3;
Compute TF-IDF
create or replace view tfidf
as
select
  tf.docid,
  tf.word, 
  -- tf.freq * (log(10, CAST(${n_docs} as FLOAT)/max2(1,df.docs)) + 1.0) as tfidf
  tfidf(tf.freq, df.docs, ${n_docs}) as tfidf
from
  term_frequency tf 
  JOIN document_frequency df ON (tf.word = df.word)
order by 
  tfidf desc;

docid  word     tfidf
1       justice 0.1641245850805637
3       knowledge       0.09484606645205085
2       action  0.07033910867777095
1       law     0.06564983513276658
1       found   0.06564983513276658
1       religion        0.06564983513276658
1       discussion      0.06564983513276658

Convert to feature variables
select
  docid, 
  -- collect_list(concat(word, ":", tfidf)) as features -- Hive 0.13 or later
  collect_list(feature(word, tfidf)) as features -- Hivemall v0.3.4 & Hive 0.13 or later
  -- collect_all(concat(word, ":", tfidf)) as features -- before Hive 0.13
from 
  tfidf
group by
  docid;

Feature vectorization


select
  id,
  vectorize_features(
    array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), 
    age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
  ) as features,
  y
from
  train
limit 2;

> 1       ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"]   1
> 2       ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"]   1
2.3 Model evaluation

AUC


with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;

Precision and recall


with truth as (
  select userid, collect_set(itemid) as truth
  from dummy_truth
  group by userid
),
rec as (
  select
    userid,
    map_values(to_ordered_map(score, itemid, true)) as rec,
    cast(count(itemid) as int) as max_k
  from dummy_rec
  group by userid
)
select
  -- rec = [1,3,2,6], truth = [1,2,4] for each user

  -- Recall@k
  recall(t1.rec, t2.truth, t1.max_k) as recall,
  recall(t1.rec, t2.truth, 2) as recall_at_2,

  -- Precision@k
  precision(t1.rec, t2.truth, t1.max_k) as precision,
  precision(t1.rec, t2.truth, 2) as precision_at_2,

  -- MAP
  average_precision(t1.rec, t2.truth, t1.max_k) as average_precision,
  average_precision(t1.rec, t2.truth, 2) as average_precision_at_2,

  -- AUC
  auc(t1.rec, t2.truth, t1.max_k) as auc,
  auc(t1.rec, t2.truth, 2) as auc_at_2,

  -- MRR
  mrr(t1.rec, t2.truth, t1.max_k) as mrr,
  mrr(t1.rec, t2.truth, 2) as mrr_at_2,

  -- NDCG
  ndcg(t1.rec, t2.truth, t1.max_k) as ndcg,
  ndcg(t1.rec, t2.truth, 2) as ndcg_at_2
from rec t1
join truth t2 on (t1.userid = t2.userid)
;

3. Machine learning models

3.1 Binary classification

http://hivemall.incubator.apache.org/userguide/binaryclass/titanicrf.html http://hivemall.incubator.apache.org/userguide/regression/kddcup12tr2dataset.html

3.2 Regression

http://hivemall.incubator.apache.org/userguide/regression/e2006_dataset.html

3.3 Collaborative filtering

http://hivemall.incubator.apache.org/userguide/recommend/itembasedcf.html




 
042017
 

The Obama administration made great strides in improving the government’s use of information technology over the past eight years, and now it's up to the Trump administration to expand upon it. Let’s look at five possible Trump administration initiatives that can take government’s use of information technology to the next [...]

5 Tech initiatives for the Trump administration was published on SAS Voices by Steve Bennett

032017
 

There aren’t many things that keep me awake at night but let me share a recent example with you. I’ve been grappling with how to help a local SAS team respond to a customer’s request for a “generic enterprise analytics architecture.” As background, this customer organization had recently embarked on [...]

How to apply design thinking to your analytics architecture was published on SAS Voices by Paul Gittins

032017
 

Word on the proverbial technology street is that online learning is ideal for modern learners, and, at SAS, we wholeheartedly agree. In 2016, more than 80,000 SAS users trained in an online or blended learning format. In 2017, we kicked off the year with the announcement of our new SAS [...]

The post Modern learners meet modern training at SAS Global Forum 2017 appeared first on SAS Learning Post.

032017
 

Editor's note: The following post is from Emma Warrillow, Chief DiGGer at Data Insight Group Inc. (DiG). Emma is a featured speaker at SAS Global Forum 2017 and recently named as one of the Top Women in Direct Marketing by Direct Marketing News. Learn more about Emma.   “I need [...]

The post That analyst is certifiable! appeared first on SAS Learning Post.

022017
 

A common request we receive about SAS Global Certification practice exams is to make them as close as possible to the actual exams so that future test takers get a more accurate feel for the overall exam experience. Good news. Our new practice exams come as close to the “real thing” [...]

The post New practice exam options for SAS Global Certification appeared first on SAS Learning Post.

022017
 

SAS Global Forum 2017 is just a month away and, if you’re a SAS administrator, it’s a great place to meet your peers, share your experiences and attend presentations on SAS administration tips and tricks.

SAS Global Forum 2017 takes place in Orlando, FL, April 2-5. You can find more information at https://www.sas.com/en_us/events/sas-global-forum/sas-global-forum-2017.html. This schedule is for the entire conference and includes pre- and post-conference events.

If you’re an administrator, though, I wanted to highlight a few events that would be of particular interest to you:

On Sunday, April 2nd from 2-4 pm there is a “Helping the SAS Administrator Succeed” event. More details can be found here.

On Monday, April 3rd from 6:30-8:00 pm the SAS Users Group for Administrators (SUGA) will be hosting a Community Linkup, with panelists on hand to help answer questions from SAS administrators. Location will be in the Dolphin Level – Asia 4.

There are two post-conference tutorials for the SAS Administrators:

Introduction to SAS Grid Manager, Wednesday, April 5th from 2:30-6:30pm
SAS Metadata Security, Thursday, April 6th from 8:00am-noon
More details can be found here.

For a list of the papers on the topic of SAS Administration, you can visit this link. You will see that SAS Administration has been broken down into Architecture, Deployment, SAS Administration, and Security subtopic areas.

Some of the key papers under each sub-topic area are:

Architecture
Twelve Cluster Technologies Available in SAS 9.4
Deploying SAS on Software-Defined and Virtual Storage Systems
Shared File Systems:  Determining the Best Choice for your Distributed SAS Foundation Applications
Do You have a Disaster Recovery Plan for Your SAS Infrastructure

Deployment
Pillars of a Successful SAS Implementation with Lessons from Boston Scientific
Getting the Latest and Greatest from SAS 9.4: Best Practices for Upgrades and Migrations
Migrating Large, Complex SAS Environments: In-Place versus New Build

SAS Administration
SAS Metadata Security 201:  Security Basics for a New Administrator
SAS Environment Manager: Advanced Topics
The Top Ten SAS Studio Tips for SAS Grid Manager Administrators
Implementing Capacity Management Policies on a SAS LASR Analytic Server Platform: Can You Afford Not To?
Auditing in SAS Visual Analytics
SAS Viya: What it Means for SAS Administration

Security
Guidelines for Protecting Your Computer, Network, and Data from Malware Threats
Getting Started with Designing and Implementing a SAS® 9.4 Metadata and File System Security Design
SAS® Metadata Security 301: Auditing your SAS Environment
SAS® Users Audit: An Automated Approach to Metadata Reporting
SAS® Metadata Security

In addition to the breakout sessions, there is an Administration Super Demo station where short presentations will be given. The schedule for these presentations is:

Sunday, April 2nd:
17:00     Shared File Systems for SAS Grid Manager
18:00     Where to Place SAS WORK in your SAS Grid Infrastructure

Monday, April 3rd:
11:00     Hands-on Secure Socket Layer Configuration for SAS 9.4 Environment Manager
12:00     Introduction to Configuring SAS Metadata Security for Multitenancy
13:00     SAS Viya Overview
14:00     Accelerate your SAS Programs with GPUs
15:00     Authentication and Identity Management with SAS Viya

Tuesday, April 4th:
11:00     Accelerate your SAS Programs with GPUs
12:00     Accelerating your Analytics Adoption with the Analytics Fast Track
13:00     New Deployment Experience for SAS
14:00     Managing Authorization in SAS Viya
15:00     Clustering in SAS Viya
16:00     A Docker Container Toolbox for the Data Scientist

As you can see, there is a lot for SAS administrators to learn, and plenty of opportunities to socialize with fellow SAS administrators.

Here’s to seeing you in sunny Florida next month.

P.S. SAS administrators don’t have to go to SAS Global Forum to get help administering their environment. In addition to SAS Global Forum and the SUGA group mentioned above, you can find out more information on resources for administrators in this blog. You can also visit our new webpage devoted just to users who administer their organization’s SAS environment. You can find that page here.

Resources for SAS Administrators at SAS Global Forum 2017 … and beyond was published on SAS Users.