bicloud

Aug 21, 2015
 

from statistics to data mining

Basic industry requirements for data mining:

(1) Fundamentals (statistics + data mining algorithms)

(2) Coding skills (R programming, SQL)

(3) Excel + PPT skills

(4) Project experience

1 Fundamentals

With a solid statistics background in place, it pays to study data mining material. For readers coming from statistics, the following resources are recommended for careful study:

(1) Book: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf

(2) Book: An Introduction to Statistical Learning with Applications in R, http://web.stanford.edu/~hastie/local.ftp/Springer/ISLR_print4.pdf

(3) Book: 统计学习方法 (Statistical Learning Methods, Li Hang), http://book.douban.com/subject/10590856/

(4) Paper: a survey of data mining algorithms, "Top 10 Algorithms in Data Mining", http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf

2 Programming skills

Statistics graduates are not held to the same coding standard as software engineers, but some programming skill is required. Statisticians doing data mining today mostly use R; work carefully through one book and practice hands-on.

Reference: http://www.revolutionanalytics.com/r-language-resources

Book: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

3 Excel + PPT

Practice hands-on, and practice often.

4 Project experience

Work on small projects using public datasets. Once the basics above are in place, download datasets from kaggle.com (they come from real companies and are worth studying) and enter data mining competitions to build practical skills, coding ability, and project management experience. Producing some papers along the way is even better.

Wish you good luck!




Jul 7, 2015
 

Netflix Prize: ensemble learning

Blending a set of CF algorithms outperforms any single CF algorithm.

SVD, neighborhood-based models, RBM, asymmetric factor models, global effects

An ensemble method combines the predictions of different algorithms (the ensemble) to obtain a final prediction. The combination of different predictions into a final prediction is also referred to as "blending".

Blending methods: neural network blending, bagged gradient boosted decision trees, kernel ridge regression.

Blending is not limited to collaborative filtering; it applies to any supervised learning task.

1 Collaborative filtering algorithms

2 Blending techniques

Blending predictions is a supervised machine learning problem. Each input vector xi is the F-dimensional vector of the predictions in the ensemble.

Train each single CF algorithm on the training data, then blend on the probe set: split the probe set into a training part and a test part, and evaluate the blending result on the latter.

Evaluation

3 Blending example

from __future__ import division

import numpy as np

import load_data

from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


# evaluation metric: logloss
def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition."""
    attempt = np.clip(attempt, epsilon, 1.0 - epsilon)
    return -np.mean(actual * np.log(attempt) + (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':
    np.random.seed(0)  # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    X, y, X_submission = load_data.load()

    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(y, n_folds))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    print "Creating train and test sets for blending."

    # out-of-fold predictions of each base model become the blend features
    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print j, clf
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print "Fold", i
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            clf.fit(X_train, y_train)
            y_submission = clf.predict_proba(X_test)[:, 1]
            dataset_blend_train[test, j] = y_submission
            dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]
        dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)

    print
    print "Blending."
    clf = LogisticRegression()
    clf.fit(dataset_blend_train, y)
    y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

    print "Linear stretch of predictions to [0,1]"
    y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())

    print "Saving Results."
    np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')

 



Posted at 7:12 PM
Jun 16, 2015
 

Reading notes: TencentRec: Real-time Stream Recommendation in Practice

Real-time recommendation at big-data scale must overcome three challenges: data volume, real-time processing, and accuracy.

Big data covers user data and business data; real-time processing is built on Storm; the algorithms are mainly item-based CF, content-based, and demographic, extended with real-time features and business-specific innovations.

Highlights

1 Traditional recommender systems that analyze data and update models at regular time intervals, e.g., hours or days, cannot meet the real-time demands.

Often, real-time user intent reflects actual user needs better; most offline computation is prediction, and most of it is inaccurate. Traditional recommender systems cannot make fast responses to users' preference changes and capture the users' real-time interests, thus resulting in bad recommendation results. This matches my own experience.

2 Problems for real-time recommender systems: system performance, data sparsity and implicit feedback, and algorithm design

3 Main work of Tencent's real-time recommender system:

Implement traditional item-based, content-based, and demographic algorithms at big-data scale, and apply them across Tencent's businesses.

4 System architecture

1) Platform choice

Storm was chosen for its real-time computation, high scalability, and strong fault tolerance.

2) Data access interface

 

3) Data storage

5 Algorithm design

Industrial practice weighs ease of use, accuracy, and ROI.

1) Item-based CF

Handle implicit feedback, update incrementally, and prune to reduce computation cost.

There are various types of user behaviors in our scenario, including click, browse, purchase, share, comment, etc.

Implicit behaviors are converted into explicit ratings.

Incremental update

Update workflow

We utilize the Hoeffding bound theory and develop a real-time pruning technique.
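The notes only name the Hoeffding bound, so the following is a rough illustration rather than TencentRec's actual rule: with probability 1 − δ, the true mean of n observations lies within ε = sqrt(R² ln(1/δ) / 2n) of the observed mean, so a candidate item pair can be pruned once its observed similarity plus ε cannot reach the similarity threshold. The function names, threshold, and candidate layout are all assumptions:

```python
import math

def hoeffding_bound(value_range, n, delta=1e-7):
    """Hoeffding bound: with probability 1 - delta, the true mean of n
    observations lies within epsilon of the observed mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def prune(candidates, threshold, delta=1e-7):
    """Drop candidate item pairs whose observed mean similarity is so far
    below `threshold` that, by the Hoeffding bound, the true similarity is
    unlikely to ever reach it.  `candidates` maps pair -> (mean, n, range)."""
    kept = {}
    for pair, (mean, n, value_range) in candidates.items():
        eps = hoeffding_bound(value_range, n, delta)
        if mean + eps >= threshold:   # still has a chance to be similar
            kept[pair] = (mean, n, value_range)
    return kept

candidates = {("a", "b"): (0.6, 500, 1.0),
              ("a", "c"): (0.05, 500, 1.0)}
print(prune(candidates, threshold=0.3))
```

With 500 observations the confidence radius is already about 0.13, so the (a, c) pair at 0.05 can be discarded while (a, b) survives.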

2) Handling data sparsity

We develop two mechanisms to solve the data sparsity problem, including the demographic clustering and the demographic based complement.

3) Real-time filtering

Method 1: use a sliding time window and filter data by session.

Method 2: use the most recent behaviors as recommendation seeds. Besides the sliding window mechanism, we propose a real-time personalized filtering technique to serve the individual users' real-time demands. For each user, we record the recent k items that he is interested in.
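A minimal sketch of the recent-k personalized filter described above; the value of k, the function names, and the filtering direction (dropping items the user just touched) are assumptions, not details from the paper:

```python
from collections import defaultdict, deque

K = 3  # number of recent items kept per user (assumed value)

# per-user ring buffer: the deque silently evicts the oldest item at K
recent_items = defaultdict(lambda: deque(maxlen=K))

def record_interest(user, item):
    """Track the k most recent items each user showed interest in."""
    recent_items[user].append(item)

def filter_recommendations(user, ranked_items):
    """Drop items the user just interacted with; the rest keep their order."""
    seen = set(recent_items[user])
    return [it for it in ranked_items if it not in seen]

for item in ["i1", "i2", "i3", "i4"]:
    record_interest("u1", item)          # deque keeps only i2, i3, i4
print(filter_recommendations("u1", ["i1", "i2", "i5"]))
```

The same recent-k buffer could just as well serve as the seed set for real-time item-based recommendations.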

 

6 System architecture

7 Applications

Tencent Video, Yixun, Tencent Literature, WeChat, Dianping, Tencent News, Qzone, etc.

Reference:

TencentRec: Real-time Stream Recommendation in Practice

Takeaways:

1) Incremental computation for item-based CF and demographic-based algorithms, with pruning

2) System performance

 

 



Posted at 5:38 PM
Jun 16, 2015
 

Visual Search at Pinterest

Pinterest's industrial practice of visual search:

1 Build a cost-effective visual search system

2 Use visual search technology to build content-based recommender systems

 

Two challenges:

(1) Controlling development cost and human-resource cost, e.g., feature computation and building an efficient visual search system

(2) ROI of the commercial application, i.e., improving user engagement

Two contributions:

1) Built a cost-effective visual search system on open-source tools, developed object detection and localization algorithms, and implemented incremental feature computation and a distributed indexing and retrieval system.

2) Applied visual search to related-content recommendation and look-alike products, i.e., Related Pins and Similar Looks.

 

Visual search architecture

1 Image feature representation

Image features include basic features and deep features; the deep features are computed with convolutional neural networks (CNNs), implemented with the open-source Caffe toolkit.

2 Two-step object detection and localization

First, a classification model built on text features assigns the image to a category; then the object detection algorithm is applied within that category to find objects. (Deep learning.)


3 Click prediction

A CNN is trained on user click data to predict, e.g., a user's click-through rate on a pin. We trained a CNN to learn a mapping from images to the probability of a user bringing up the close-up view or clicking through to the content.

Both CUR and CTR are helpful for applications like search ranking, recommendation systems and ads targeting since we often need to know which images are more likely to get attention from users based on their visual content. (Personalized ranking.)

 

Incremental image feature computation

Features are computed incrementally, keyed by version and date.

It incrementally updates the collection of features under two main change scenarios: new images uploaded to Pinterest, and feature evolution (features added/modified by engineers).

Search infrastructure

Technologies used

1) Computing: AWS, Hadoop, HBase, GPU clusters

2) Machine learning: deep learning with convolutional neural networks (Caffe), SVM

Takeaways:

Visual search for e-commerce shopping guidance, e.g., search-by-image scenarios

Reference:

Visual Search at Pinterest


Posted at 11:01 AM
Nov 5, 2014
 

Paper: The Browsemaps: Collaborative Filtering at LinkedIn

LinkedIn's collaborative filtering architecture supports rapid development, deployment, and computation of CF algorithms.

Lessons from LinkedIn's collaborative filtering system:

A picture is worth a thousand words.

Tall oaks grow from little acorns.

One hand washes the other.

You can’t get blood out of a stone

 

A chain is only as strong as its weakest link


G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, Jan. 2003.

Amazon uses a neighborhood-based approach for its product recommendations [9].

 

J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The YouTube video recommendation system. In RecSys, pages 293–296, 2010.

YouTube combines covisitation statistics with a user’s personal activity on the site to show additional videos to watch [4].

 

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.

Netflix employs matrix factorization methods for its movie recommendations [8].

 

M. A. Hasan, N. Parikh, G. Singh, and N. Sundaresan. Query suggestion for e-commerce sites. In WSDM, pages 765–774, 2011.

eBay applies collaborative filtering as a component in their query suggestions [5].

 

K. Ali andW. van Stam. Tivo: Making show recommendations using a distributed collaborative filtering architecture. KDD, pages 394–401, 2004.

Tivo uses correlation-based similarity to suggest shows to watch [2].

 

N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In RecSys, pages 165–172, 2011.

Yahoo! applies a factorization approach for song recommendations on its music property [7].

 

A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW, pages 271–280, 2007.

Google uses a linear combination of memory- and model-based collaborative filtering to showcase additional stories for the user to read in their news product [3].



Posted at 7:42 PM
Oct 16, 2014
 

Practical Lessons from Predicting Clicks on Ads at Facebook: combining decision trees with logistic regression outperforms either method on its own by over 3%, an improvement with significant impact on overall system performance.

Right feature + Right model

The click prediction system needs to be robust and adaptive, and capable of learning from massive volumes of data.

At Facebook, ads are not associated with a query, but instead specify demographic and interest targeting.

Evaluation metrics

Normalized Entropy: Normalized Cross-Entropy is equivalent to the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click through rate (CTR) for every impression.

Calibration is the ratio of the average estimated CTR and empirical CTR.

Area-Under-ROC (AUC) is also a pretty good metric for measuring ranking quality without considering calibration.
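The first two metrics can be written down directly from their definitions. This is a straightforward reading of the definitions above, not Facebook's code; the toy labels and predictions are made up:

```python
import numpy as np

def normalized_entropy(y_true, p_pred):
    """NE = average log loss of the model divided by the log loss of a
    constant predictor that always outputs the background CTR."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    ll = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    p = y_true.mean()                       # background CTR
    ll_bg = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return ll / ll_bg

def calibration(y_true, p_pred):
    """Ratio of average estimated CTR to empirical CTR (ideal: 1.0)."""
    return np.mean(p_pred) / np.mean(y_true)

y = np.array([1, 0, 0, 1])
p = np.array([0.9, 0.2, 0.1, 0.8])
print(normalized_entropy(y, p), calibration(y, p))
```

Lower NE is better (below 1.0 means the model beats the background-CTR predictor), while calibration should sit close to 1.0.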

Prediction model structure


Hybrid model structure. Input features are transformed by means of boosted decision trees. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. Boosted decision trees prove to be very powerful feature transforms.

Feature transforms

1 Discretize continuous variables via binning.

2 Cross features. For categorical features, the brute-force approach consists in taking the Cartesian product. If the input features are continuous, one can do joint binning, using for example a k-d tree.

Boosted decision tree based transformation is a supervised feature encoding that converts a real-valued vector into a compact binary-valued vector.
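The hybrid structure can be reproduced with scikit-learn: each sample is mapped to the tuple of leaf indices it reaches in a boosted tree ensemble, that tuple is one-hot encoded into a sparse binary vector, and a linear classifier is trained on top. This is the standard scikit-learn recipe, not Facebook's implementation; the dataset and hyperparameters are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
# separate subsets for the trees and the linear model, to limit overfitting
X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_tree, y_tree)

# gbdt.apply(X) returns, per sample, the leaf index hit in each tree;
# one-hot encoding turns those indices into a sparse binary vector
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(gbdt.apply(X_tree)[:, :, 0])

lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(gbdt.apply(X_lr)[:, :, 0]), y_lr)

print(lr.predict_proba(enc.transform(gbdt.apply(X_lr[:5])[:, :, 0]))[:, 1])
```

Note the split: the trees see one half of the data and the linear model the other, since re-using the tree training data for the linear stage would leak the leaf assignments.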

 

Online learning data/model flows: the online learning model.

From: http://dl.acm.org/citation.cfm?id=2648589

Trick: feature transformation handled by a model; right model + right features.


Posted at 9:09 AM
Sep 24, 2014
 

Modeling Impression Discounting in Large-scale Recommender Systems

摘要

User behavioral analysis and user feedback (both explicit and implicit) modeling are crucial for the improvement of any online recommender system.

The main contributions of this paper include:

(1) large-scale analysis of impression data from LinkedIn and KDD Cup (data analysis);

(2) novel anti-noise regression techniques, and its application to learn four different impression discounting functions including linear decay, inverse decay, exponential decay, and quadratic decay;

(3) applying these impression discounting functions to LinkedIn’s “People You May Know” and “Suggested Skills Endorsements” recommender systems.
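The four discounting shapes named in contribution (2) can be sketched as simple functions of an impression signal (e.g. ImpCount). The coefficients below are placeholders, since in the paper they are fitted by the anti-noise regression, and `discounted_score` is an illustrative composition, not the paper's aggregation model:

```python
import numpy as np

def linear_decay(x, a=1.0, b=0.1):
    return np.maximum(a - b * x, 0.0)

def inverse_decay(x, a=1.0):
    return a / (1.0 + x)

def exponential_decay(x, a=1.0, b=0.5):
    return a * np.exp(-b * x)

def quadratic_decay(x, a=1.0, b=0.05):
    return np.maximum(a - b * x ** 2, 0.0)

def discounted_score(score, imp_count, decay=inverse_decay):
    """Multiply the recommender's score by a discount derived from past
    impressions (here: how many times the item was already shown)."""
    return score * decay(imp_count)

print(discounted_score(0.8, imp_count=3))  # 0.8 * 1/(1+3) = 0.2
```

In the paper the discounts for the different signals are then combined by linear or multiplicative aggregation before being applied on top of the recommender's score.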

1. Introduction

In the impression discounting problem we aim to maximize conversion of recommended items generated by a recommender system by applying a discounting factor, derived from past impressions, on top of scores generated by the recommender system.

Two challenges:

(1) How to combine impression and feedback data to build an effective response model?

(2) How can the model be applied to improve the performance of existing recommender systems? Relevant signals include: the number of times an item is impressed or recommended to a user; when the item was impressed; and the frequency of user visits to the site or of the user seeing any of the recommended items.

The main contributions of this paper are as follows:

(1) Perform large scale correlation studies between impression signals and conversion rate of impressed items;

(2) Design effective impression discounting models based on linear/ multiplicative aggregation, and propose novel anti-noise regression model to deal with the data sparsity problem;

(3) Evaluate these regression models on real-world recommendation systems such as “People You May Know” and “Suggested Skills Endorsements” to demonstrate their effectiveness both in offline analysis and in online systems by A/B testing.

2. IMPRESSION DATA IN LARGE-SCALE RECOMMENDER SYSTEMS

(1) Formalizing Impressions

T = (user, item, conversion, [behavior1, behavior2, …], t, R)

where user is the user id; item is the recommended item; conversion is a boolean indicating whether the user acted on the item; the behaviors are the observed interaction types; t is the impression time; and R is the recommendation score.

Behaviors follow a "see-think-do" procedure:

  1. LastSeen: the time gap between the user's most recent impression and the current one;
  2. ImpCount: the total number of historical impressions before the current one;
  3. Position: the position at which the item is shown;
  4. UserFreq: how frequently the user interacts with the recommender system.

Impression sequence: group records by (user, item) to obtain a sequence for each (user, item) pair; the sequence can be ordered by time.

Global conversion rate and local conversion rate (the defining formulas appear as figures in the paper).


3. IMPRESSION DISCOUNTING MODEL

(1) Conventions

Behavior List, Conversion Rate, Observation, Support of Observation, Observation Space

(2) Impression Discounting Framework

The impression discounting framework (shown as a figure in the paper).

(3) Exploratory Analysis

Workflow (figure).



(4) Aggregated Discounting Model

Linear Aggregation Model

Multiplicative Aggregation Model

Impression Discounting Factor

(5) Anti-Noise Regression Model

Outlier Observation Detection

Density Weighted Regression

4. Model Evaluation

RMSE


Summary: a feedback model that post-processes recommendation results, down-weighting recommended items with low response rates.


Posted at 11:18 AM
Aug 6, 2014
 

The item-based temporal collaborative filtering algorithm consists of two steps:

(1) Data preparation: extract product click logs, search click logs, and basic product information. Remove noisy data (e.g., users who click tens of thousands of products in a day) and records with missing values, then build the temporal click-stream data, i.e., each user's product interactions per day ordered by click time. This yields records of the form <user id, product id, click time, click date>.

(2) Algorithm construction: from the preprocessed temporal click data, implement the algorithm on the ODPS platform to compute correlations between products.

The temporal collaborative filtering algorithm

Step 1: compute correlations between products

Map phase

Input: key (record id), value (user id, product id, click time, click date)

Output: key (user id, click date, click time), value (product id, click time)

Reduce phase:

Input: key (user id, click date, click time), value (product id, click time)

Output: ODPS Record (click date, product 1, product 2, correlation score)

Correlation scoring algorithm:

  1. For each user and each day, build the click sequence in ascending time order: <product 1, click time 1>, <product 2, click time 2>, …, <product n, click time n>;
  2. Within a user's click sequence, two products are considered related if the time difference between their clicks is at most two hours; this yields the valid temporal product sequences <product 1, product 2, product 3, …, product n>;
  3. Compute the correlation between products with a formula (shown as an image in the original post) in which score is the correlation value and delta is the difference in click-order positions between the two products.
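The per-user, per-day scoring step can be sketched as follows. Since the original scoring formula is an image, the `1 / delta` decay used here is an assumed form that merely decreases with the position gap, and the two-hour window is taken from the text:

```python
from itertools import combinations

TWO_HOURS = 2 * 3600

def pair_scores(clicks):
    """clicks: list of (product_id, unix_time) for one user and one day,
    sorted by click time.  Emits (product1, product2) -> score for products
    clicked at most two hours apart, where score decays as 1/delta with the
    difference delta in click-order positions (assumed decay form)."""
    scores = {}
    for (i, (p1, t1)), (j, (p2, t2)) in combinations(enumerate(clicks), 2):
        if t2 - t1 <= TWO_HOURS:
            delta = j - i
            scores[(p1, p2)] = scores.get((p1, p2), 0.0) + 1.0 / delta
    return scores

clicks = [("a", 0), ("b", 1800), ("c", 3600), ("d", 999999)]
print(pair_scores(clicks))
```

Steps 2 and 3 of the pipeline would then average these scores per day and per month, which maps directly onto the reduce phases described below.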

Step 2: compute the daily average correlation between each product pair, to reduce noise such as unintentional click sequences

Map phase

Input: key (record id), value (click date, product 1, product 2, correlation score)

Output: key (click date, product 1, product 2), value (correlation score)

Reduce phase

Input: key (click date, product 1, product 2), value (correlation score)

Output: ODPS Record (product 1, product 2, mean correlation score)

Step 3: compute the monthly correlation between each product pair, reducing noise and removing anomalous correlations

Map phase

Input: key (record id), value (product 1, product 2, correlation score)

Output: key (product 1, product 2), value (correlation score)

Reduce phase

Input: key (product 1, product 2), value (correlation score)

Output: ODPS Record (product 1, product 2, mean correlation score)

References:

http://blog.sina.com.cn/s/blog_61c463090102uwqu.html

http://www.aliyun.com/ odps 


Posted at 3:14 PM
Jul 9, 2014
 

Recommending the most relevant search keywords set to users not only enhances the search engine’s hit rate, but also helps the user to find the desired information more quickly.

Computing query similarity

1 SIMILARITY GRAPH

Similarity graph for 1 session (left) and 2 sessions (right)

Computation method (the formula appears as a figure in the paper)


2 CONTENT BASED SIMILARITY


cwi(p) and cwi(q) are the weights of the i-th common keyword in the queries p and q respectively, and wi(p) and wi(q) are the weights of the i-th keyword in the queries p and q respectively.

SF-IDF, which is the search frequency multiplied by the inverse document frequency.
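A sketch of SF-IDF weighting combined with a cosine-style content similarity. The paper's exact formula is an image in the source post, so the similarity below is an assumed reading (product of weights over common keywords, normalized by the full weight vectors), and all names and counts are illustrative:

```python
import math

def sf_idf_weights(query_terms, search_freq, doc_freq, n_docs):
    """SF-IDF: search frequency of a term times its inverse document
    frequency.  search_freq / doc_freq map term -> count."""
    return {t: search_freq.get(t, 0) * math.log(n_docs / (1 + doc_freq.get(t, 0)))
            for t in query_terms}

def content_similarity(wp, wq):
    """Cosine-style similarity over the keyword weight vectors of two
    queries: common keywords drive the numerator, full vectors normalize."""
    common = set(wp) & set(wq)
    num = sum(wp[t] * wq[t] for t in common)
    den = (math.sqrt(sum(w * w for w in wp.values())) *
           math.sqrt(sum(w * w for w in wq.values())))
    return num / den if den else 0.0

wp = sf_idf_weights(["cheap", "flights"], {"cheap": 10, "flights": 40},
                    {"cheap": 100, "flights": 50}, n_docs=10000)
wq = sf_idf_weights(["flights", "paris"], {"flights": 40, "paris": 5},
                    {"flights": 50, "paris": 20}, n_docs=10000)
print(content_similarity(wp, wq))
```

Queries that share no keywords score 0; queries dominated by the same high-SF-IDF keyword score close to 1.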

3 MERGE

When using the first, consecutive-query-based method, we get clusters that reflect all users' consecutive search behavior (collaborative filtering); when using the second, content-based method, we can group together queries that have similar composition.


Reference: Mining Search Engine Query Logs for Query Recommendation

 


Posted at 7:42 PM
Jul 1, 2014
 
<失控> (Out of Control, Kevin Kelly)
kk.org


1 Innovation happens at the edges, in the less-optimized regions
2 The more complex the problem, the simpler the final model. Fitting the data exactly is not hard, but if you actually do it, any success is mere luck; generalization is the key
3 The future of control: partnership, collaborative control, human-machine hybrid control
4 An army of ants, too dumb to measure and too short-sighted to see far, can still quickly find the shortest path across rugged terrain
5 The creativity of evolution is boundless; it can exceed human design ability
6 Distributed, decentralized, collaborative, and adaptive
7 Thinking is acting, and acting is thinking
8 Shed all preconceptions; grasping the collective nature of complex things takes a kind of awareness one might call the hive mind. Let go of every fixed and certain belief
9 Self-replication, self-governance, limited self-replication, moderate evolution, and local learning
10 Machines are becoming biological, and biology is becoming engineered
<胡适文选> (Selected Essays of Hu Shih)


1 Do not trust anything without evidence
2 Study more problems, talk less about "isms"; all valuable thought starts from this or that concrete problem
3 Teach people to doubt before believing, to examine before believing, to believe only with sufficient evidence
4 Make bold hypotheses, and verify them carefully
<信息简史> (The Information, James Gleick)



Posted at 6:49 PM