# From Statistics to Data Mining

Basic industry requirements for data mining:

(1) Theoretical foundations (statistics + data mining algorithms)

(2) Coding ability (R programming, SQL)

(3) Excel + PPT skills

(4) Project experience

1 Theoretical foundations

(1) Book: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, http://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_print10.pdf

(2) Book: An Introduction to Statistical Learning with Applications in R, http://web.stanford.edu/~hastie/local.ftp/Springer/ISLR_print4.pdf

(3) Book: Statistical Learning Methods (统计学习方法), http://book.douban.com/subject/10590856/

(4) Paper: a survey of data mining algorithms, Top 10 Algorithms in Data Mining, http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf

2 Programming ability

3 Excel + PPT

4 Project experience

Wish you good luck!


Netflix Prize: ensemble learning

Blending a set of CF algorithms outperforms any single CF algorithm.

SVD, neighborhood-based models, RBM, asymmetric factor models, global effects

An ensemble method combines the predictions of different algorithms (the ensemble) to obtain a final prediction. The combination of different predictions into a final prediction is also referred to as "blending".

Blending methods: neural network blending, bagged gradient boosted decision trees, kernel ridge regression;

Blending is not limited to collaborative filtering; it can be used in any supervised learning task.

1 Collaborative filtering algorithms

2 Blending techniques

Blending predictions is a supervised machine learning problem. Each input vector xi is the F-dimensional vector of the predictions in the ensemble.

3 Blending example

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression


# Evaluation metric: logloss
def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition."""
    attempt = np.clip(attempt, epsilon, 1.0 - epsilon)
    return -np.mean(actual * np.log(attempt) +
                    (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':
    np.random.seed(0)  # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    # X, y (training data) and X_submission (test data) are assumed to
    # have been loaded beforehand.
    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learning_rate=0.05, subsample=0.5,
                                       max_depth=6, n_estimators=50)]

    print("Creating train and test sets for blending.")
    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print(j, clf)
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print("Fold", i)
            X_train, y_train = X[train], y[train]
            X_test, y_test = X[test], y[test]
            clf.fit(X_train, y_train)
            # Out-of-fold predictions become features for the blender.
            dataset_blend_train[test, j] = clf.predict_proba(X_test)[:, 1]
            dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:, 1]
        dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)

    print("Blending.")
    clf = LogisticRegression()
    clf.fit(dataset_blend_train, y)
    y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

    print("Linear stretch of predictions to [0,1]")
    y_submission = (y_submission - y_submission.min()) / \
                   (y_submission.max() - y_submission.min())

    print("Saving Results.")
    np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')
```


Highlights

1 Traditional recommender systems that analyze data and update models at regular time intervals, e.g., hours or days, cannot meet real-time demands.

2 Challenges for real-time recommender systems: system performance, data sparsity and implicit feedback, and algorithm design

3 Main work of Tencent's real-time recommender system:

4 System architecture

1) Platform selection

2) Data access interface

3) Data storage

5 Algorithm design

1) Item-based CF

There are various types of user behaviors in our scenario, including click, browse, purchase, share, comment, etc.

We utilize Hoeffding bound theory and develop a real-time pruning technique.
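The paper's exact pruning rule is not reproduced in these notes; the sketch below only illustrates, under assumed names and parameters, how a Hoeffding bound turns a sample count into an error radius that can justify pruning an item pair from a top-k similarity list.

```python
import math

def hoeffding_bound(value_range, n, delta=1e-6):
    """With probability 1 - delta, the true mean of a variable with
    range `value_range`, estimated from n observations, lies within
    this epsilon of the observed mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_prune(observed_sim, kth_best_sim, value_range, n, delta=1e-6):
    """Prune an item pair if, even in the best case allowed by the
    bound, its similarity cannot overtake the current k-th best."""
    eps = hoeffding_bound(value_range, n, delta)
    return observed_sim + eps < kth_best_sim - eps
```

The bound shrinks as more behavior events are observed, so pairs that look hopeless after enough data can be dropped from the real-time computation with a probabilistic guarantee.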

2) Handling data sparsity

We develop two mechanisms to solve the data sparsity problem, including demographic clustering and demographic-based complement.

3) Real-time filtering mechanism

6 System architecture

7 Applications

TencentRec: Real-time Stream Recommendation in Practice

1) Incremental computation for item-based CF and demographic-based CF, with pruning

2) System performance


Visual Search at Pinterest

Pinterest's industrial practice of visual search

1 Building an efficient visual search system

2 Building content-based recommender systems on top of visual search technology

(1) Controlling development and staffing costs, e.g., for feature computation and an efficient visual search implementation

(2) ROI for commercial applications; improving user engagement

1) Build an efficient visual search system from open-source tools, develop object detection and localization algorithms, and implement incremental feature computation

2) Apply visual search to related-content recommendation and visually similar product search, i.e., Related Pins and Similar Looks

1 Image feature representation

2 Two-step object detection and localization


3 Click prediction

Both CUR and CTR are helpful for applications like search ranking, recommendation systems and ads targeting, since we often need to know which images are more likely to get attention from users based on their visual content. Personalized ranking.

It incrementally updates the collection of features under two main change scenarios: new images uploaded to Pinterest, and feature evolution (features added/modified by engineers).

2) Machine learning: deep learning with convolutional neural networks (Caffe), plus SVM



Paper: The Browsemaps: Collaborative Filtering at LinkedIn

A picture is worth a thousand words.

Tall oaks grow from little acorns.

One hand washes the other.

You can’t get blood out of a stone.

A chain is only as strong as its weakest link.

G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, Jan. 2003.

Amazon uses a neighborhood-based approach for its product recommendations [9].

J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The YouTube video recommendation system. In RecSys, pages 293–296, 2010.

YouTube combines covisitation statistics with a user’s personal activity on the site to show additional videos to watch [4].

Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.

Netflix employs matrix factorization methods for its movie recommendations [8].

M. A. Hasan, N. Parikh, G. Singh, and N. Sundaresan. Query suggestion for e-commerce sites. In WSDM, pages 765–774, 2011.

eBay applies collaborative filtering as a component in their query suggestions [5].

K. Ali and W. van Stam. TiVo: Making show recommendations using a distributed collaborative filtering architecture. KDD, pages 394–401, 2004.

TiVo uses correlation-based similarity to suggest shows to watch [2].

N. Koenigstein, G. Dror, and Y. Koren. Yahoo! music recommendations: Modeling music ratings with temporal dynamics and item taxonomy. In RecSys, pages 165–172, 2011.

Yahoo! applies a factorization approach for song recommendations on its music property [7].

A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In WWW, pages 271–280, 2007.

Google uses a linear combination of memory- and model-based collaborative filtering to showcase additional stories for the user to read in their news product [3].


Practical Lessons from Predicting Clicks on Ads at Facebook combines decision trees with logistic regression, outperforming either of these methods on its own by over 3%, an improvement with significant impact on overall system performance.

Right feature + Right model

The click prediction system needs to be robust and adaptive, and capable of learning from massive volumes of data.

Normalized Entropy (Normalized Cross-Entropy) is the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression.

Calibration is the ratio of the average estimated CTR to the empirical CTR.

Area-Under-ROC (AUC) is also a good metric for measuring ranking quality without considering calibration.
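The two calibration-aware metrics above can be sketched directly from their definitions; this is a minimal NumPy version (not Facebook's code), where `y_true` is a 0/1 label array and `p_pred` holds the predicted CTRs.

```python
import numpy as np

def normalized_entropy(y_true, p_pred):
    """Average log loss per impression, divided by the log loss of a
    model that always predicts the background CTR. Lower is better."""
    p_pred = np.clip(p_pred, 1e-15, 1 - 1e-15)
    ll = -np.mean(y_true * np.log(p_pred) +
                  (1 - y_true) * np.log(1 - p_pred))
    p = y_true.mean()  # background CTR
    ll_background = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return ll / ll_background

def calibration(y_true, p_pred):
    """Ratio of average estimated CTR to empirical CTR; 1.0 is ideal."""
    return p_pred.mean() / y_true.mean()
```

A model that always outputs the background CTR scores exactly 1.0 on normalized entropy, which is what makes the metric insensitive to the base rate.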

Hybrid model structure. Input features are transformed by means of boosted decision trees. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. Boosted decision trees prove to be very powerful feature transforms.

1 Discretize continuous variables via binning

2 Cross features. For categorical features, the brute-force approach consists in taking the Cartesian product. If the input features are continuous, one can do joint binning, using for example a k-d tree.
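A toy NumPy illustration of both tricks (the values and bin edges are made up): discretize a continuous feature, then take the Cartesian product of two categorical features by combining their codes into one id.

```python
import numpy as np

# 1. Discretize a continuous variable into bins.
ages = np.array([15, 22, 37, 45, 63])
age_bucket = np.digitize(ages, bins=[18, 30, 40, 50])  # values in 0..4

# 2. Brute-force Cartesian product with a second categorical feature:
genders = np.array([0, 1, 1, 0, 1])
cross = age_bucket * 2 + genders  # one id per (bucket, gender) pair
```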

The boosted-decision-tree-based transformation is a supervised feature encoding that converts a real-valued vector into a compact binary-valued vector.
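This hybrid structure can be sketched with scikit-learn on synthetic data (the paper's own trees and learner are not public): each boosted tree's leaf index becomes a categorical feature, which is one-hot encoded into a sparse binary vector and fed to logistic regression.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 1. Fit boosted trees; each tree maps an input to one of its leaves.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=6,
                                  learning_rate=0.05, subsample=0.5,
                                  random_state=0)
gbdt.fit(X, y)

# 2. apply() returns the leaf index reached in every tree -- one
# categorical feature per tree.
leaves = gbdt.apply(X)[:, :, 0]

# 3. One-hot encode the leaf indices into a sparse binary vector and
# train a sparse linear classifier on top.
enc = OneHotEncoder()
X_binary = enc.fit_transform(leaves)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_binary, y)
```

In production the trees would be refreshed infrequently while the linear layer is retrained online, which is exactly why the paper separates the two stages.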

Online Learning Data/Model Flows (the online learning model).

From: http://dl.acm.org/citation.cfm?id=2648589

Trick: feature transformation handled by a model, i.e., right model + right features.


# Abstract

User behavioral analysis and user feedback (both explicit and implicit) modeling are crucial for the improvement of any online recommender system.

The main contributions of this paper include:

(1) large-scale analysis of impression data from LinkedIn and KDD Cup (data analysis);

(2) novel anti-noise regression techniques, and their application to learning four different impression discounting functions: linear decay, inverse decay, exponential decay, and quadratic decay;

(3) applying these impression discounting functions to LinkedIn’s “People You May Know” and “Suggested Skills Endorsements” recommender systems.
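The shapes of the four discounting functions named above can be sketched as follows; the coefficients `a` and `b` are placeholders (in the paper they are fit by the anti-noise regression), and `x` stands for an impression feature such as ImpCount or LastSeen.

```python
import math

# Hypothetical parameterizations of the four decay shapes; each maps an
# impression feature x to a multiplicative discount on the original score.
def linear_decay(x, a=0.1, b=1.0):
    return max(0.0, b - a * x)

def inverse_decay(x, a=1.0, b=1.0):
    return b / (a * x + 1.0)

def exponential_decay(x, a=0.5, b=1.0):
    return b * math.exp(-a * x)

def quadratic_decay(x, a=0.01, b=1.0):
    return max(0.0, b - a * x * x)

# Discounted score: score'(user, item) = score(user, item) * decay(x)
```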

# 1. Introduction

In the impression discounting problem we aim to maximize conversion of recommended items generated by a recommender system by applying a discounting factor, derived from past impressions, on top of scores generated by the recommender system.

(1) How can user impression and feedback data be combined to build an effective response model?

(2) How can the model be applied to improve the performance of existing recommender systems? Relevant signals include the number of times an item is impressed or recommended to a user, when the item was impressed, and the frequency of user visits to the site or of the user seeing any of the recommended items.

The main contributions of this paper are as follows:

(1) Perform large scale correlation studies between impression signals and conversion rate of impressed items;

(2) Design effective impression discounting models based on linear/multiplicative aggregation, and propose a novel anti-noise regression model to deal with the data sparsity problem;

(3) Evaluate these regression models on real-world recommendation systems such as “People You May Know” and “Suggested Skills Endorsements” to demonstrate their effectiveness both in offline analysis and in online systems by A/B testing.

# 2. IMPRESSION DATA IN LARGE-SCALE RECOMMENDER SYSTEMS

## (1) Formalizing Impressions

T = (user; item; conversion; [behavior1; behavior2; …]; t; R)

1. LastSeen, the time gap between the user's most recent impression and the current impression;
2. ImpCount, the total number of historical impressions before the current one;
3. Position, the position at which the item is displayed;
4. UserFreq, how frequently the user interacts with the recommender system;

Impression Sequence: group records with the same (user, item) to obtain a sequence, which can be ordered by time.

Global conversion rate;

Local conversion rate.

# 3. IMPRESSION DISCOUNTING MODEL

## (1) Conventions

Behavior List, Conversion Rate, Observation, Support of Observation, Observation Space

## (4) Aggregated Discounting Model

Linear Aggregation Model

Multiplicative Aggregation Model

Impression Discounting Factor

## (5) Anti-Noise Regression Model

Outlier Observation Detection

Density Weighted Regression

RMSE



(1) Extract the basic data: user item-click logs, search-click logs, and basic item information. Then remove noisy data (e.g., users who click tens of thousands of items per day) and records with missing values, and build time-ordered click-stream data, i.e., each user's item actions per day sorted by click time. This yields the following schema: <user id, item id, click time, click date>;

(2) Time-series collaborative filtering module: based on the time-ordered item-click dataset from the preprocessing stage, implement the algorithm on the ODPS platform to compute item-to-item relatedness.

Map phase

Reduce phase:

1. For each user and each day, build click-time sequence pairs sorted in ascending order, i.e., <item 1, click time 1>, <item 2, click time 2>, …, <item n, click time n>;
2. Within each user's click sequence, if the click-time difference between two items is no more than two hours, the two items are considered related; this yields valid time-ordered item sequences <item 1, item 2, item 3, …, item n>;
3. Compute the relatedness between items with the following formula:

where score denotes the relatedness score and delta the difference between the time-ordered positions of any two items
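The formula itself did not survive extraction, so the following is only a hypothetical instantiation consistent with the description: within the two-hour window, pairs that are closer in click order receive higher scores. The `1/delta` weighting and all names here are assumptions, not the original formula.

```python
from collections import defaultdict

def pair_scores(click_sequence, max_gap_seconds=2 * 3600):
    """Score item pairs from one user's time-ordered click sequence.

    click_sequence: list of (item_id, click_timestamp), sorted by time.
    The 1/delta weight (delta = difference of the two items' positions
    in the sequence) is a stand-in for the lost formula.
    """
    scores = defaultdict(float)
    for a in range(len(click_sequence)):
        for b in range(a + 1, len(click_sequence)):
            item_a, t_a = click_sequence[a]
            item_b, t_b = click_sequence[b]
            if t_b - t_a > max_gap_seconds:
                break  # sequence is time-sorted; later pairs are wider
            delta = b - a
            scores[(item_a, item_b)] += 1.0 / delta
    return scores
```

In the MapReduce version, the Map phase would emit these per-user pair scores and the Reduce phase would sum them per item pair.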

Map phase

Reduce phase

Map phase

Reduce phase

http://blog.sina.com.cn/s/blog_61c463090102uwqu.html

http://www.aliyun.com/ (ODPS)


Recommending the most relevant set of search keywords to users not only enhances the search engine's hit rate, but also helps users find the desired information more quickly.

1 Similarity graph

Similarity Graph for 1 session (left) and 2 sessions (right)

2 Content-based similarity

cwi(p) and cwi(q) are the weights of the i-th common keyword in queries p and q respectively, and wi(p) and wi(q) are the weights of the i-th keywords in queries p and q respectively.

SF-IDF, which is the search frequency multiplied by the inverse document frequency.
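The notes define the weights but not the similarity formula itself; the cosine-style combination below is an assumption consistent with those definitions (common-keyword weights in the numerator, full keyword weights in the denominator), with illustrative names throughout.

```python
import math

def content_similarity(cw_p, cw_q, w_p, w_q):
    """Cosine-style similarity between two queries.

    cw_p, cw_q: SF-IDF weights of the common keywords in p and q
    (aligned by keyword); w_p, w_q: weights of all keywords in p and q.
    """
    num = sum(a * b for a, b in zip(cw_p, cw_q))
    den = (math.sqrt(sum(a * a for a in w_p)) *
           math.sqrt(sum(b * b for b in w_q)))
    return num / den if den else 0.0
```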

3 Merge

When using the first, consecutive-query-based method, we get clusters that reflect all users' consecutive search behavior, as in collaborative filtering, while using the second, content-based method, we can group together queries that have similar composition.


Out of Control (失控)
kk.org

1 Innovation tends to happen at the margins, in the less optimized regions
2 The more complex the problem, the simpler the final model. Fitting the data perfectly is not hard, but if you actually do it, then in the end you…

3 The future of control: partnership, co-control, human-machine hybrid control
4 An ant colony, dumb and unable to measure, short-sighted and unable to see far, can still quickly find the shortest path across rugged ground
5 Evolution's creativity is boundless; it can surpass human design ability
6 Distributed, decentralized, collaborative, and adaptive
7 Thinking is acting, and acting is thinking
8 To grasp the collective nature of complex things requires a kind of awareness that may be called the hive mind; let go of all preconceptions and every fixed and certain belief
9 Self-replication, self-governance, limited self-replication, moderate evolution, and local learning
10 Machines are becoming biological, and biology is becoming engineered
Selected Essays of Hu Shih (胡适文选)

1 Do not trust anything without evidence
2 Study more problems, talk less about "isms"; all valuable ideas start from this or that concrete problem
3 I want to teach people to doubt before believing, to examine before believing, and to believe only with sufficient evidence
4 Make bold hypotheses; verify them carefully
The Information (信息简史)
