bicloud


Downloading the TensorFlow course lecture notes with Python

# -*- coding: utf-8 -*-
# @DATE    : 2017/3/25 11:08
# @Author  : 
# @File    : pdf_download.py

import os
import shutil
import requests
from bs4 import BeautifulSoup
import urllib2
import urlparse

def download_file(url, file_folder):
    file_name = url.split("/")[-1]
    file_path = os.path.join(file_folder, file_name)
    r = requests.get(url=url, stream=True)
    with open(file_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)
    r.close()
    return file_path

def get_pdfs(url, root_url, file_folder):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    cnt = 0
    for link in soup.find_all("a"):
        file_url = link.get("href")
        if not file_url or not file_url.endswith(".pdf"):
            continue
        # resolve relative links against the syllabus page URL
        file_url = urlparse.urljoin(url, file_url)
        file_name = download_file(file_url, file_folder)
        print("downloading {} -> {}".format(file_url, file_name))
        cnt += 1
    print("downloaded {} pdfs".format(cnt))

def main():
    root_url = "http://web.stanford.edu/class/cs20si/lectures/"
    course_url = "http://web.stanford.edu/class/cs20si/syllabus.html"
    file_folder = "./course_note"
    if os.path.exists(file_folder):
        shutil.rmtree(file_folder)
    os.mkdir(file_folder)
    get_pdfs(course_url, root_url, file_folder)

if __name__ == "__main__":
    main()

 

Learning mkdocs, a documentation tool

While writing project documentation recently, I found mkdocs to be very lightweight: it is based on Markdown and generates static HTML that can simply be hosted on a server, which makes it very convenient and practical.

1. Introduction

mkdocs is a tool for building project documentation from Markdown; the generated static HTML files can be deployed on a server and served directly.

2. Installing mkdocs

pip install mkdocs


$ mkdocs --version
mkdocs, version 0.16.1

3.1 Usage

Create a new project with mkdocs new <project name>. This generates a configuration file, mkdocs.yml, and a Markdown file, index.md, under the docs folder.


$ mkdocs new deeplearning
INFO    -  Creating project directory: deeplearning
INFO    -  Writing config file: deeplearning/mkdocs.yml
INFO    -  Writing initial docs: deeplearning/docs/index.md
$ cd deeplearning/
$ ll
total 8
drwxr-xr-x  3   staff   102B  3 17 11:11 docs
-rw-r--r--  1   staff    19B  3 17 11:11 mkdocs.yml

Start the development server built into mkdocs:


 $ mkdocs serve
INFO    -  Building documentation...
INFO    -  Cleaning site directory
[I 170317 11:35:02 server:283] Serving on http://127.0.0.1:8000
[I 170317 11:35:02 handlers:60] Start watching changes
[I 170317 11:35:02 handlers:62] Start detecting changes

Open http://localhost:8000 in a browser to view the documentation.



3.2 Editing the documentation

(1) Edit the pages

(2) Edit the configuration file


$ cat mkdocs.yml
site_name: 深度学习
pages:
    - home: index.md
    - content: deeplearning.md
    - about: about.md




 

hivemall

1. Introduction to Hivemall

Hivemall is a machine learning library implemented as Hive UDFs. It is very convenient in industrial practice and lets data scientists quickly build machine learning prototypes and deploy them to real applications.



2. Using Hivemall

2.1 Input format

(1) Input format for classification models

Feature format: feature ::= <index>:<weight> or <index>

Index 0 is reserved for the bias variable, e.g. 10:3.4 123:0.5 34567:0.231. Text can also be used as the variable index, e.g. "height:1.5" "length:2.0". Features may be quantitative or categorical: a quantitative variable must carry an index (e.g. select add_feature_index(array(3, 4.0, 5)) from dual;), while a categorical variable may omit the weight, i.e. feature ::= <index>.

Feature hashing: when there are very many features (more than 16,777,216), or when feature indexes are long text strings that would consume a lot of memory, consider using feature hashing.


-- feature is v0.3.2 or before
concat(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), ":", extract_weight("xxxxxxx-yyyyyy-weight:55.3"))

-- feature is v0.3.2-1 or later
feature(mhash(extract_feature("xxxxxxx-yyyyyy-weight:55.3")), extract_weight("xxxxxxx-yyyyyy-weight:55.3"))

Label format in binary classification (the target variable):

<label> ::= 0 | 1

Label format in multi-class classification: the label can be any primitive type, e.g. an integer or a string class name.

(2) Input format for regression models: the target is a real number, <target> ::= <float>

(3) Helper functions


select feature("weight", 55.0);
weight:55.0

select extract_feature("weight:55.0"), extract_weight("weight:55.0");
weight | 55.0

select feature_index(array("10:0.2","7:0.3","9"));
[10,7,9]

select 
  convert_label(-1), convert_label(1), convert_label(0.0f), convert_label(1.0f)
from 
  dual;
 0.0f | 1.0f | -1 | 1

Quantitative features


Create sparse quantitative features
select quantitative_features(array("apple","value"),1,120.3);
["apple:1.0","value:120.3"]

Categorical features


Create sparse categorical features
select categorical_features(
  array("is_cat","is_dog","is_lion","is_pengin","species"),
  1, 0, 1.0, true, "dog"
);
["is_cat#1","is_dog#0","is_lion#1.0","is_pengin#true","species#dog"]

Prepare the training data table


select 
  rowid() as rowid,
  concat_array(
    array("bias:1.0"),
    categorical_features( 
      array("id", "name"),
      id, name
    ),
    quantitative_features(
      array("height", "weight"),
      height, weight
    )
  ) as features, 
  click_or_not as label
from
  table;

2.2 Feature engineering

Min-max normalization


select min(target), max(target)
from (
select target from e2006tfidf_train 
-- union all
-- select target from e2006tfidf_test 
) t;
-7.899578 -0.51940954
set hivevar:min_target=-7.899578;
set hivevar:max_target=-0.51940954;

create or replace view e2006tfidf_train_scaled 
as
select 
  rowid,
  rescale(target, ${min_target}, ${max_target}) as target, 
  features
from 
  e2006tfidf_train;

Z-score normalization


select avg(target), stddev_pop(target)
from (
select target from e2006tfidf_train 
-- union all
-- select target from e2006tfidf_test 
) t;
-3.566241460963296 0.6278076335455348
set hivevar:mean_target=-3.566241460963296;
set hivevar:stddev_target=0.6278076335455348;

create or replace view e2006tfidf_train_scaled 
as
select 
  rowid,
  zscore(target, ${mean_target}, ${stddev_target}) as target, 
  features
from 
  e2006tfidf_train;

Feature hashing


select feature_hashing('aaa');
> 4063537

select feature_hashing('aaa','-features 3');
> 2

select feature_hashing(array('aaa','bbb'));
> ["4063537","8459207"]

select feature_hashing(array('aaa','bbb'),'-features 10');
> ["7","1"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'));
> ["4063537:1.0","4063537","8459207:2.0"]

select feature_hashing(array(1,2,3));
> ["11293631","3322224","4331412"]

select feature_hashing(array('1','2','3'));
> ["11293631","3322224","4331412"]

select feature_hashing(array('1:0.1','2:0.2','3:0.3'));
> ["11293631:0.1","3322224:0.2","4331412:0.3"]

select feature_hashing(features), features from training_fm limit 2;

> ["1803454","6630176"]   ["userid#5689","movieid#3072"]
> ["1828616","6238429"]   ["userid#4505","movieid#2331"]

select feature_hashing(array("userid#4505:3.3","movieid#2331:4.999", "movieid#2331"));

> ["1828616:3.3","6238429:4.999","6238429"]

TF-IDF computation


Define macro functions
create temporary macro max2(x INT, y INT)
if(x>y,x,y);

-- create temporary macro idf(df_t INT, n_docs INT)
-- (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

create temporary macro tfidf(tf FLOAT, df_t INT, n_docs INT)
tf * (log(10, CAST(n_docs as FLOAT)/max2(1,df_t)) + 1.0);

Data preparation
create external table wikipage (
  docid int,
  page string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

cd ~/tmp
wget https://gist.githubusercontent.com/myui/190b91a3a792ccfceda0/raw/327acd192da4f96da8276dcdff01b19947a4373c/tfidf_test.tsv

LOAD DATA LOCAL INPATH '/home/myui/tmp/tfidf_test.tsv' INTO TABLE wikipage;

create or replace view wikipage_exploded
as
select
  docid, 
  word
from
  wikipage LATERAL VIEW explode(tokenize(page,true)) t as word
where
  not is_stopword(word);

Compute TF
create or replace view term_frequency 
as
select
  docid, 
  word,
  freq
from (
select
  docid,
  tf(word) as word2freq
from
  wikipage_exploded
group by
  docid
) t 
LATERAL VIEW explode(word2freq) t2 as word, freq;

Compute DF
create or replace view document_frequency
as
select
  word, 
  count(distinct docid) docs
from
  wikipage_exploded
group by
  word;
-- set the total number of documents
select count(distinct docid) from wikipage;
set hivevar:n_docs=3;

Compute TF-IDF
create or replace view tfidf
as
select
  tf.docid,
  tf.word, 
  -- tf.freq * (log(10, CAST(${n_docs} as FLOAT)/max2(1,df.docs)) + 1.0) as tfidf
  tfidf(tf.freq, df.docs, ${n_docs}) as tfidf
from
  term_frequency tf 
  JOIN document_frequency df ON (tf.word = df.word)
order by 
  tfidf desc;

docid  word     tfidf
1       justice 0.1641245850805637
3       knowledge       0.09484606645205085
2       action  0.07033910867777095
1       law     0.06564983513276658
1       found   0.06564983513276658
1       religion        0.06564983513276658
1       discussion      0.06564983513276658

Convert to feature variables
select
  docid, 
  -- collect_list(concat(word, ":", tfidf)) as features -- Hive 0.13 or later
  collect_list(feature(word, tfidf)) as features -- Hivemall v0.3.4 & Hive 0.13 or later
  -- collect_all(concat(word, ":", tfidf)) as features -- before Hive 0.13
from 
  tfidf
group by
  docid;

Feature vectorization


select
  id,
  vectorize_features(
    array("age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome"), 
    age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
  ) as features,
  y
from
  train
limit 2;

> 1       ["age:39.0","job#blue-collar","marital#married","education#secondary","default#no","balance:1756.0","housing#yes","loan#no","contact#cellular","day:3.0","month#apr","duration:939.0","campaign:1.0","pdays:-1.0","poutcome#unknown"]   1
> 2       ["age:51.0","job#entrepreneur","marital#married","education#primary","default#no","balance:1443.0","housing#no","loan#no","contact#cellular","day:18.0","month#feb","duration:172.0","campaign:10.0","pdays:-1.0","poutcome#unknown"]   1

2.3 Model evaluation

AUC


with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
)
select auc(prob, label) as auc
from (
  select prob, label
  from data
  DISTRIBUTE BY floor(prob / 0.2)
  SORT BY prob DESC
) t;

Precision and recall


with truth as (
  select userid, collect_set(itemid) as truth
  from dummy_truth
  group by userid
),
rec as (
  select
    userid,
    map_values(to_ordered_map(score, itemid, true)) as rec,
    cast(count(itemid) as int) as max_k
  from dummy_rec
  group by userid
)
select
  -- rec = [1,3,2,6], truth = [1,2,4] for each user

  -- Recall@k
  recall(t1.rec, t2.truth, t1.max_k) as recall,
  recall(t1.rec, t2.truth, 2) as recall_at_2,

  -- Precision@k
  precision(t1.rec, t2.truth, t1.max_k) as precision,
  precision(t1.rec, t2.truth, 2) as precision_at_2,

  -- MAP
  average_precision(t1.rec, t2.truth, t1.max_k) as average_precision,
  average_precision(t1.rec, t2.truth, 2) as average_precision_at_2,

  -- AUC
  auc(t1.rec, t2.truth, t1.max_k) as auc,
  auc(t1.rec, t2.truth, 2) as auc_at_2,

  -- MRR
  mrr(t1.rec, t2.truth, t1.max_k) as mrr,
  mrr(t1.rec, t2.truth, 2) as mrr_at_2,

  -- NDCG
  ndcg(t1.rec, t2.truth, t1.max_k) as ndcg,
  ndcg(t1.rec, t2.truth, 2) as ndcg_at_2
from rec t1
join truth t2 on (t1.userid = t2.userid)
;

3. Machine learning models

3.1 Binary classification

http://hivemall.incubator.apache.org/userguide/binaryclass/titanicrf.html
http://hivemall.incubator.apache.org/userguide/regression/kddcup12tr2dataset.html

3.2 Regression

http://hivemall.incubator.apache.org/userguide/regression/e2006_dataset.html

3.3 Collaborative filtering

http://hivemall.incubator.apache.org/userguide/recommend/itembasedcf.html




 
A small experiment: predicting stock prices with an RNN

RNN example

This builds a time series prediction model on top of a basic RNN to predict the trend of a stock price. It is intended purely as a learning exercise for RNNs.

Code


# -*- coding: utf-8 -*-
# @DATE    : 2017/2/14 17:50
# @Author  : 
# @File    : stock_predict.py

import os
import sys
import datetime

import tensorflow as tf
import pandas as pd
import numpy as np
from yahoo_finance import Share
import matplotlib.pyplot as plt

from utils import get_n_day_before, date_2_str


class StockRNN(object):
    def __init__(self, seq_size=12, input_dims=1, hidden_layer_size=12, stock_id="BABA", days=365, log_dir="stock_model/"):
        self.seq_size = seq_size
        self.input_dims = input_dims
        self.hidden_layer_size = hidden_layer_size
        self.stock_id = stock_id
        self.days = days
        self.data = self._read_stock_data()["Adj_Close"].astype(float).values
        self.log_dir = log_dir

    def _read_stock_data(self):
        stock = Share(self.stock_id)
        end_date = date_2_str(datetime.date.today())
        start_date = get_n_day_before(200)
        # print(start_date, end_date)

        his_data = stock.get_historical(start_date=start_date, end_date=end_date)
        stock_pd = pd.DataFrame(his_data)
        stock_pd["Adj_Close"] = stock_pd["Adj_Close"].astype(float)
        stock_pd.sort_values(["Date"], inplace=True, ascending=True)
        stock_pd.reset_index(inplace=True)
        return stock_pd[["Date", "Adj_Close"]]

    def _create_placeholders(self):
        with tf.name_scope(name="data"):
            self.X = tf.placeholder(tf.float32, [None, self.seq_size, self.input_dims], name="x_input")
            self.Y = tf.placeholder(tf.float32, [None, self.seq_size], name="y_input")

    def init_network(self, log_dir):
        print("Init RNN network")
        self.log_dir = log_dir
        self.sess = tf.Session()
        self.summary_op = tf.summary.merge_all()
        self.saver = tf.train.Saver()
        self.summary_writer = tf.summary.FileWriter(self.log_dir, self.sess.graph)
        self.sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(self.log_dir)
        if ckpt and ckpt.model_checkpoint_path:
            self.saver.restore(self.sess, ckpt.model_checkpoint_path)
            print("Model restore")

        self.coord = tf.train.Coordinator()
        self.threads = tf.train.start_queue_runners(self.sess, self.coord)

    def _create_rnn(self):
        W = tf.Variable(tf.random_normal([self.hidden_layer_size, 1], name="W"))
        b = tf.Variable(tf.random_normal([1], name="b"))
        with tf.variable_scope("cell_d"):
            cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_layer_size)
        with tf.variable_scope("rnn_d"):
            outputs, states = tf.nn.dynamic_rnn(cell, self.X, dtype=tf.float32)

        W_repeated = tf.tile(tf.expand_dims(W, 0), [tf.shape(self.X)[0], 1, 1])
        out = tf.matmul(outputs, W_repeated) + b
        out = tf.squeeze(out)
        return out

    def _data_prepare(self):
        self.train_x = []
        self.train_y = []
        # data
        data = np.log1p(self.data)
        for i in xrange(len(data) - self.seq_size - 1):
            self.train_x.append(np.expand_dims(data[i: i + self.seq_size], axis=1).tolist())
            self.train_y.append(data[i + 1: i + self.seq_size + 1].tolist())

    def train_pred_rnn(self):

        self._create_placeholders()

        y_hat = self._create_rnn()
        self._data_prepare()
        loss = tf.reduce_mean(tf.square(y_hat - self.Y))
        train_optim = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
        feed_dict = {self.X: self.train_x, self.Y: self.train_y}

        saver = tf.train.Saver(tf.global_variables())
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for step in xrange(1, 20001):
                _, loss_ = sess.run([train_optim, loss], feed_dict=feed_dict)
                if step % 100 == 0:
                    print("{} {}".format(step, loss_))
            saver.save(sess, self.log_dir + "model.ckpt")

            # prediction
            prev_seq = self.train_x[-1]
            predict = []
            for i in range(5):
                next_seq = sess.run(y_hat, feed_dict={self.X: [prev_seq]})
                predict.append(next_seq[-1])
                prev_seq = np.vstack((prev_seq[1:], next_seq[-1]))
            predict = np.exp(predict) - 1
            print(predict)
            self.pred = predict

    def visualize(self):
        pred = self.pred
        plt.figure()
        plt.legend(prop={'family': 'SimHei', 'size': 15})
        plt.plot(list(range(len(self.data))), self.data, color='b')
        plt.plot(list(range(len(self.data), len(self.data) + len(pred))), pred, color='r')
        plt.title(u"{}股价预测".format(self.stock_id), fontproperties="SimHei")
        plt.xlabel(u"日期", fontproperties="SimHei")
        plt.ylabel(u"股价", fontproperties="SimHei")
        plt.savefig("stock.png")
        plt.show()


if __name__ == "__main__":
    stock = StockRNN()
    # print(stock.read_stock_data())
    log_dir = "stock_model"
    stock.train_pred_rnn()
    stock.visualize()

Run results: the predicted BABA stock price for the next 5 days (as of 2017-02-27):


[ 104.23436737  103.82189941  103.59770966  103.43360138  103.29838562]

Some of the recent data (date, adjusted close):

2017-02-08  103.57
2017-02-09  103.339996
2017-02-10  102.360001
2017-02-13  103.099998
2017-02-14  101.589996
2017-02-15  101.550003
2017-02-16  100.82
2017-02-17  100.519997
2017-02-21  102.120003
2017-02-22  104.199997
2017-02-23  102.459999
2017-02-24  102.949997



 

Forecasting at Scale

1. Facebook time series forecasting

Facebook has open-sourced a time series forecasting algorithm, Prophet. It is based on an additive model and supports non-linear trend forecasting, change points, periodicity, seasonality, holidays, and more.

It is based on an additive model where non-linear trends are fit with yearly and weekly seasonality, plus holidays. It works best with daily periodicity data with at least one year of historical data. Prophet is robust to missing data, shifts in the trend, and large outliers.

Time series forecasting comes up very frequently in practical work, e.g. forecasting business growth and setting business targets, defining product KPIs, and predicting future UV, PV, and so on.

2. The time series forecasting framework



3. The algorithm

The additive model:

y(t) = g(t) + s(t) + h(t) + ε_t

where:

g(t) is the growth function, which fits the non-periodic part of the time series;

s(t) captures periodic changes such as weekly or yearly seasonality;

h(t) captures the effect of holidays or one-off events on the forecast;

ε_t is the error term.

4. Example


# -*- coding: utf-8 -*-
# @DATE    : 2017/2/25 18:18
# @Author  : 
# @File    : fb_example1.py

import pandas as pd
import numpy as np
from fbprophet import Prophet

data_df = pd.read_csv("data/example_wp_peyton_manning.csv")
data_df["y"] = np.log(data_df["y"])
print(data_df.head())
print(data_df.tail())

# fit the model, model params
# growth = 'linear',
# changepoints = None,
# n_changepoints = 25,
# yearly_seasonality = True,
# weekly_seasonality = True,
# holidays = None,
# seasonality_prior_scale = 10.0,
# holidays_prior_scale = 10.0,
# changepoint_prior_scale = 0.05,
# mcmc_samples = 0,
# interval_width = 0.80,
# uncertainty_samples = 1000
m = Prophet()
m.fit(data_df)

# make prediction
data_future = m.make_future_dataframe(periods=30)
print(data_future.tail())
pred_res = m.predict(data_future)
print(pred_res[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())

# visualization
m.plot(pred_res)

Run results:


           ds         y
0  2007-12-10  9.590761
1  2007-12-11  8.519590
2  2007-12-12  8.183677
3  2007-12-13  8.072467
4  2007-12-14  7.893572
              ds          y
2900  2016-01-16   7.817223
2901  2016-01-17   9.273878
2902  2016-01-18  10.333775
2903  2016-01-19   9.125871
2904  2016-01-20   8.891374
STAN OPTIMIZATION COMMAND (LBFGS)
init = user
save_iterations = 1
init_alpha = 0.001
tol_obj = 1e-12
tol_grad = 1e-08
tol_param = 1e-08
tol_rel_obj = 10000
tol_rel_grad = 1e+07
history_size = 5
seed = 1691376609
initial log joint probability = -19.4685
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       7977.57   0.000941357       431.339      0.3404      0.3404      134   
     199        7988.7   0.000894011       356.862       0.739       0.739      241   
     299       7996.29    0.00359033       180.856           1           1      358   
     399       8000.11   0.000546236       205.358     0.09131      0.7253      481   
     499       8002.89    0.00024026        99.613           1           1      608   
     514       8003.11   5.25911e-05       135.817   7.646e-07       0.001      671  LS failed, Hessian reset 
     580       8003.41   3.04884e-05       92.4947    1.88e-07       0.001      798  LS failed, Hessian reset 
     599       8003.49   8.15685e-05        83.046      0.6885      0.6885      821   
     607        8003.5   2.60204e-05       67.9783   1.712e-07       0.001      874  LS failed, Hessian reset 
     654       8003.64   0.000118504       280.906   6.562e-07       0.001      973  LS failed, Hessian reset 
     699       8003.75   2.52751e-06       58.0645      0.3238           1     1029   
     705       8003.75   4.61033e-07       59.0008      0.2964           1     1037   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance
             ds
2930 2016-02-15
2931 2016-02-16
2932 2016-02-17
2933 2016-02-18
2934 2016-02-19
             ds      yhat  yhat_lower  yhat_upper
2930 2016-02-15  8.021739    7.371417    8.641458
2931 2016-02-16  7.710504    7.079853    8.334700
2932 2016-02-17  7.448298    6.849103    8.012131
2933 2016-02-18  7.370376    6.724225    8.004908
2934 2016-02-19  7.305117    6.683996    8.001754

Process finished with exit code 0

5. References

Facebook Prophet

https://facebookincubator.github.io/prophet/

PS: In day-to-day work, forecasting GMV, sales volume, PV, and so on can borrow Facebook's time series approach by incorporating seasonality, holidays, and promotional events (e.g. Double 11 and Double 12); a rough sketch of this follows.
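
As a hedged illustration (not part of the original example), promotion days such as Double 11 can be passed to Prophet through its holidays parameter; the dates below are placeholders:


# Hypothetical sketch: model promotion days (e.g. Double 11) as the holiday term h(t).
import pandas as pd
from fbprophet import Prophet

promotions = pd.DataFrame({
    "holiday": "double11",
    "ds": pd.to_datetime(["2015-11-11", "2016-11-11"]),
    "lower_window": 0,   # extra affected days before each event
    "upper_window": 1,   # extra affected days after each event
})

m = Prophet(holidays=promotions)
# data_df must contain the usual "ds" (date) and "y" (value) columns, as in the example above
# m.fit(data_df)
# pred_res = m.predict(m.make_future_dataframe(periods=30))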


 
Python collections module: Counter and OrderedDict

# -*- coding: utf-8 -*-
# @DATE : 2017/2/10 10:47
# @File : collection_usage.py

import collections

# Counter initialization
print(collections.Counter(["a", "b", "a", "b", "a", "c"]))
print(collections.Counter({"a": 2, "b": 3, "c": 1}))
print(collections.Counter(a=2, b=3, c=1))

c = collections.Counter()
print(c)
c.update("abababc")
print(c)
c.update({"a": 1, "d": 5})
print(c)

# counter access
c = collections.Counter("ababac")
for letter in "abcde":
    print("{}: {}".format(letter, c[letter]))

# elements
c = collections.Counter("extremly")
print(c)
print(list(c.elements()))

# most common
c = collections.Counter()
with open("WordFilter.py", "r") as f:
    for line in f:
        c.update(line.strip().replace(" ", "").lower())

for letter, count in c.most_common(10):
    print("{}: {}".format(letter, count))

# arithmetic operations
c1 = collections.Counter(["a", "b", "c", "a", "b", "b"])
c2 = collections.Counter("alphabet")

print(c1)
print(c2)
print(c1 + c2)
print(c1 - c2)
print(c1 & c2)
print(c1 | c2)


# ordered dict
d = {}
d["a"] = "A"
d["b"] = "B"
d["c"] = "C"
d["d"] = "D"
d["e"] = "E"
for k, v in d.items():
    print("{}, {}".format(k, v))

print("Ordered Dict")
d = collections.OrderedDict()
d["a"] = "A"
d["b"] = "B"
d["c"] = "C"
d["d"] = "D"
d["e"] = "E"
for k, v in d.items():
    print("{}, {}".format(k, v))




Counter({'a': 3, 'b': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter()
Counter({'a': 3, 'b': 3, 'c': 1})
Counter({'d': 5, 'a': 4, 'b': 3, 'c': 1})
a: 3
b: 2
c: 1
d: 0
e: 0
Counter({'e': 2, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1})
['e', 'e', 'm', 'l', 'r', 't', 'y', 'x']
e: 58
r: 56
t: 55
i: 48
o: 48
s: 48
d: 39
f: 36
n: 35
l: 33
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Counter({'b': 2, 'c': 1})
Counter({'a': 2, 'b': 1})
Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
a, A
c, C
b, B
e, E
d, D
Ordered Dict
a, A
b, B
c, C
d, D
e, E

 

Applications of deep learning in healthcare

1. Deep learning

With the continuing development of big data and data analysis techniques, data-driven models based on machine learning are gradually being adopted in healthcare. Deep learning, one of the most rapidly growing and most powerful machine learning tools, is expected to have a disruptive impact on the future of artificial intelligence. This post surveys applications of deep learning in healthcare, summarizing current techniques and looking ahead to future applications.

2. Deep learning papers in health



3. Deep learning network architectures

Statistics on the use of deep learning algorithms in healthcare applications:



3.1 Deep Neural Networks

A common architecture for classification and regression, with two or more hidden layers.

3.2 Convolutional neural networks

Suited to two-dimensional data such as images; each convolutional filter maps the 2D input into a stack of feature maps (a 3D output). The design is inspired by biological vision.

3.3 Recurrent neural networks

Suited to sequential data; the output depends on earlier outputs, and parameters are shared across time steps.

3.4 Deep autoencoder

Unsupervised learning, mainly used for feature transformation; the number of input nodes equals the number of output nodes. A minimal sketch is given below.
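
As a rough illustration (not from the survey), a minimal autoencoder can be sketched as follows; the layer sizes are arbitrary placeholders:


# Hypothetical autoencoder sketch: the decoder output has the same width as the
# input, and training minimizes reconstruction error without any labels.
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, n_input=100, n_code=10):
        super(AutoEncoder, self).__init__()
        self.encoder = nn.Linear(n_input, n_code)   # compress to a low-dimensional code
        self.decoder = nn.Linear(n_code, n_input)   # reconstruct back to the input width

    def forward(self, x):
        code = F.relu(self.encoder(x))
        return self.decoder(code)

# reconstruction loss: nn.MSELoss()(model(x), x)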

3.5 Generative adversarial networks

Unsupervised learning; a generative adversarial network learns to produce data that follows the same distribution as the original data.

4. Deep learning software packages



5. Applications of deep learning in healthcare



6. References

Deep Learning for Health Informatics. Daniele Ravì, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, and Guang-Zhong Yang, Fellow, IEEE.


 

A collection of PyTorch resources

PyTorch has a numpy-like syntax and is very efficient; developing deep learning algorithms with it is quick and convenient, and it runs on both CPU and GPU. PyTorch builds neural network graphs dynamically, which improves the reusability of network structures.


from __future__ import print_function
import torch

x = torch.Tensor(5, 3)

x = torch.rand(5, 3)

x

 0.6528  0.0866  0.6087
 0.8979  0.9326  0.6805
 0.0554  0.8358  0.0058
 0.2011  0.7566  0.2541
 0.4476  0.9304  0.9173
[torch.FloatTensor of size 5x3]

x.size()

torch.Size([5, 3])

y = torch.rand(5, 3)

x + y

 0.6891  0.9336  1.5928
 1.4953  1.4822  1.5263
 0.6218  1.7133  0.2795
 0.8883  0.7674  0.9020
 1.2286  1.7552  1.7655
[torch.FloatTensor of size 5x3]

torch.add(x, y)

 0.6891  0.9336  1.5928
 1.4953  1.4822  1.5263
 0.6218  1.7133  0.2795
 0.8883  0.7674  0.9020
 1.2286  1.7552  1.7655
[torch.FloatTensor of size 5x3]

# define an output tensor
result = torch.Tensor(5, 3)
torch.add(x, y, out=result)

 0.6891  0.9336  1.5928
 1.4953  1.4822  1.5263
 0.6218  1.7133  0.2795
 0.8883  0.7674  0.9020
 1.2286  1.7552  1.7655
[torch.FloatTensor of size 5x3]

result

 0.6891  0.9336  1.5928
 1.4953  1.4822  1.5263
 0.6218  1.7133  0.2795
 0.8883  0.7674  0.9020
 1.2286  1.7552  1.7655
[torch.FloatTensor of size 5x3]

y.add_(x)

 0.6891  0.9336  1.5928
 1.4953  1.4822  1.5263
 0.6218  1.7133  0.2795
 0.8883  0.7674  0.9020
 1.2286  1.7552  1.7655
[torch.FloatTensor of size 5x3]

x[:1]

 0.6528  0.0866  0.6087
[torch.FloatTensor of size 1x3]

x

 0.6528  0.0866  0.6087
 0.8979  0.9326  0.6805
 0.0554  0.8358  0.0058
 0.2011  0.7566  0.2541
 0.4476  0.9304  0.9173
[torch.FloatTensor of size 5x3]

x[:, 1]

 0.0866
 0.9326
 0.8358
 0.7566
 0.9304
[torch.FloatTensor of size 5]

# convert a torch Tensor to a numpy array
a = torch.ones(5)
print(a)
b = a.numpy()
print(b)

 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]
 
[ 1.  1.  1.  1.  1.]

a.add_(1)
print(a)
print(b)
#share their underlying memory locations, and changing one will change the other. 

 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]
 
[ 2.  2.  2.  2.  2.]

# converting a numpy array to a torch Tensor
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)

[ 2.  2.  2.  2.  2.]
 
 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]

b

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]

#Autograd: automatic differentiation framework 
from torch.autograd import Variable
x = Variable(torch.ones(2, 2), requires_grad=True)
x

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

y = x + 2
y

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

y.creator


z = y * y * 3

z

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]

out = z.mean()
out

Variable containing:
 27
[torch.FloatTensor of size 1]

out.backward()

x.grad

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]

x = torch.randn(3)
x = Variable(x, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

y

Variable containing:
 0.7411
-1.6021
-0.7232
[torch.FloatTensor of size 3]

gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)

x.grad

Variable containing:
 0.2000
 2.0000
 0.0002
[torch.FloatTensor of size 3]

# neural networks
import torch.nn as nn
import torch.nn.functional as F
 
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)  # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)  # an affine operation: y = Wx + b
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))  # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # If the size is a square you can only specify a single number
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
 
    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension 
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
 
net = Net()
net

Net (
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear (400 -> 120)
  (fc2): Linear (120 -> 84)
  (fc3): Linear (84 -> 10)
)

params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])

input = Variable(torch.randn(1, 1, 32, 32))
out = net(input)
out

Variable containing:
-0.0021  0.0284 -0.0429  0.1176 -0.0347 -0.0097  0.0857 -0.0272  0.0875 -0.0659
[torch.FloatTensor of size 1x10]

net.zero_grad() # zeroes the gradient buffers of all parameters 
out.backward(torch.randn(1, 10))  # backprops with random gradients

output = net(input)
target = Variable(torch.range(1, 10))  # a dummy target, for example
criterion = nn.MSELoss()
loss = criterion(output, target)
loss

Variable containing:
 38.3687
[torch.FloatTensor of size 1]

# For illustration, let us follow a few steps backward 
print(loss.creator)  # MSELoss
print(loss.creator.previous_functions[0][0])  # Linear
print(loss.creator.previous_functions[0][0].previous_functions[0][0])  # ReLU


# now we shall call loss.backward(), and have a look at conv1's bias gradients before and after the backward. 
net.zero_grad() # zeroes the gradient buffers of all parameters 
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)
loss.backward()
print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
Variable containing:
 0
 0
 0
 0
 0
 0
[torch.FloatTensor of size 6]
 
conv1.bias.grad after backward
Variable containing:
1.00000e-02 *
  0.4278
 -7.8947
  7.0225
 -1.5944
  9.4956
 -1.3269
[torch.FloatTensor of size 6]

import torch.optim as optim
# create your optimizer 
optimizer = optim.SGD(net.parameters(), lr=0.01)
 
# in your training loop: 
optimizer.zero_grad() # zero the gradient buffers 
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update 

# train an image classifier
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                               ])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
 
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                          shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz

# functions to show an image 
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
def imshow(img):
    img = img / 2 + 0.5 # unnormalize 
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1,2,0)))

# show some random training images 
dataiter = iter(trainloader)
images, labels = dataiter.next()
 
# print images 
imshow(torchvision.utils.make_grid(images))
# print labels 
print(' '.join('%5s'%classes[labels[j]] for j in range(4)))

plane  frog  bird horse




# build the network structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool  = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16*5*5, 120)
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)
 
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
 
net = Net()

# define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()  # use a Classification Cross-Entropy loss
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# train the network
for epoch in range(2):  # loop over the dataset multiple times
 
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs 
        inputs, labels = data
 
        # wrap them in Variable 
        inputs, labels = Variable(inputs), Variable(labels)
 
        # zero the parameter gradients 
        optimizer.zero_grad()
 
        # forward + backward + optimize 
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()        
        optimizer.step()
 
        # print statistics 
        running_loss += loss.data[0]
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' % (epoch+1, i+1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training')

[1,  2000] loss: 2.161
[1,  4000] loss: 1.829
[1,  6000] loss: 1.676
[1,  8000] loss: 1.573
[1, 10000] loss: 1.510
[1, 12000] loss: 1.470
[2,  2000] loss: 1.402
[2,  4000] loss: 1.374
[2,  6000] loss: 1.326
[2,  8000] loss: 1.322
[2, 10000] loss: 1.308
[2, 12000] loss: 1.293
Finished Training

dataiter = iter(testloader)
images, labels = dataiter.next()
 
# print images 
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

GroundTruth:    cat  ship  ship plane




outputs = net(Variable(images))
 
# the outputs are energies for the 10 classes. 
# Higher the energy for a class, the more the network 
# thinks that the image is of the particular class 
 
# So, let's get the index of the highest energy 
_, predicted = torch.max(outputs.data, 1)
 
print('Predicted: ', ' '.join('%5s' % classes[predicted[j][0]] for j in range(4)))

Predicted:    cat   car  ship plane

correct = 0
total = 0
for data in testloader:
    images, labels = data
    outputs = net(Variable(images))
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum()
 
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))

Accuracy of the network on the 10000 test images: 55 %

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
for data in testloader:
    images, labels = data
    outputs = net(Variable(images))
    _, predicted = torch.max(outputs.data, 1)
    c = (predicted == labels).squeeze()
    for i in range(4):
        label = labels[i]
        class_correct[label] += c[i]
        class_total[label] += 1

for i in range(10):
    print('Accuracy of %5s : %2d %%' % (classes[i], 100 * class_correct[i] / class_total[i]))

Accuracy of plane : 62 %
Accuracy of   car : 83 %
Accuracy of  bird : 45 %
Accuracy of   cat : 28 %
Accuracy of  deer : 42 %
Accuracy of   dog : 36 %
Accuracy of  frog : 62 %
Accuracy of horse : 66 %
Accuracy of  ship : 69 %
Accuracy of truck : 56 %

References
PyTorch API documentation: http://pytorch.org/docs/
PyTorch tutorial notebook: https://github.com/pytorch/tutorials/blob/master/Deep Learning with PyTorch.ipynb
PyTorch for former Torchies: https://github.com/pytorch/tutorials/blob/master/Introduction to PyTorch for former Torchies.ipynb


 

Applying deep learning to Related Pins

1. Background

Pinterest's Related Pins feature is an item-to-item recommendation system based on collaborative filtering. Traditionally, the relatedness between pins was computed from co-occurrence in user actions such as saving or clicking pins into boards. This post instead computes pin relatedness with deep learning, and builds a relatedness recommendation system that scales better to big data; compared with the traditional algorithm, the deep learning based approach improved user engagement by 5%.



2. How the algorithm works

2.1 The traditional co-occurrence algorithm

Pin relatedness is computed from user behavior: pins saved to the same board are more strongly related, so relatedness is computed from co-occurrence within each board. Potential drawbacks:
(1) Board fragmentation: a single user interest may be spread across multiple boards.
(2) Board granularity: each board has its own theme, and the pins within a board may be described at different granularities, e.g. animals versus wild animals.
(3) Board topic drift: user interests change over time, and so do the pins within a board; a fitness board, for example, may drift towards diet.

2.2 The pin2vec algorithm

The idea comes from word2vec: pin vectors are learned from the context of a user's pin activity (a rough sketch is included at the end of this section).
(1) Training data preparation




(2) Neural network structure




(3) Model results
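
As a rough, hypothetical sketch of the pin2vec idea (not Pinterest's actual implementation, which trains embeddings at much larger scale), each user's pin activity can be treated as a sentence of pin ids and fed to a skip-gram word2vec model. The pin ids below are made up:


# Hypothetical pin2vec sketch using gensim's word2vec in skip-gram mode.
# Each user session is a "sentence"; each pin id is a "word".
from gensim.models import Word2Vec

sessions = [
    ["pin_12", "pin_98", "pin_37", "pin_12"],   # one user's activity context
    ["pin_98", "pin_55", "pin_37"],
]

model = Word2Vec(sessions, size=64, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("pin_37", topn=2))  # pins that occur in similar contexts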



3. References

Applying deep learning to Related Pins
https://engineering.pinterest.com/blog/applying-deep-learning-related-pins?utm_campaign=Revue newsletter&utm_medium=Newsletter&utm_source=revue


 

Google's 43 rules of machine learning in industrial practice


1. Don’t be afraid to launch a product without machine learning.

Machine learning is not a silver bullet, and a product does not necessarily need it; the point is to solve the problem, not to do machine learning for its own sake.

2. First, design and implement metrics.

First, define and design evaluation metrics so you can measure results, compare, and iterate. Machine learning without metrics is just hand-waving.

3. Choose machine learning over a complex heuristic.

For complex problems where you would otherwise build an elaborate heuristic, use machine learning instead.

4. Keep the first model simple and get the infrastructure right.

Get the machine learning infrastructure right first so it is easy to extend, and start with a simple model that is easy to implement and run.

5. Test the infrastructure independently from the machine learning.

The infrastructure should be testable on its own, independently of the model.

6. Be careful about dropped data when copying pipelines.

Copy-and-paste is a trap for the lazy; watch out for data that gets dropped when copying pipelines.

7. Turn heuristics into features, or handle them externally.

Turn heuristic insights into features.

8. Know the freshness requirements of your system.

Remember to refresh the model system; models degrade over time.

9. Detect problems before exporting models.

10. Watch for silent failures.

11. Give feature columns owners and documentation.

Documentation, documentation, documentation!

12. Don’t overthink which objective you choose to directly optimize.

Focus on the core business objective.

13. Choose a simple, observable and attributable metric for your first objective.

The objective should be quantifiable.

14. Starting with an interpretable model makes debugging easier.

15. Separate Spam Filtering and Quality Ranking in a Policy Layer.

16. Plan to launch and iterate.

Iterate on feature selection; nothing succeeds in one shot, so keep experimenting.

17. Start with directly observed and reported features as opposed to learned features.

18. Explore with features of content that generalize across contexts.

19. Use very specific features when you can.

20. Combine and modify existing features to create new features in human­ understandable ways.

21. The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.

22. Clean up features you are no longer using.

23. You are not a typical end user.

24. Measure the delta between models.

25. When choosing models, utilitarian performance trumps predictive power.

26. Look for patterns in the measured errors, and create new features.

27. Try to quantify observed undesirable behavior.

28. Be aware that identical short­-term behavior does not imply identical long­-term behavior.

29. The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

30. Importance weight sampled data, don't arbitrarily drop it!

31. Beware that if you join data from a table at training and serving time, the data in the table may change.

32. Re­use code between your training pipeline and your serving pipeline whenever possible.

33. If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.

34. In binary classification for filtering (such as spam detection or determining interesting e­mails), make small short­ term sacrifices in performance for very clean data.

35. Beware of the inherent skew in ranking problems.

36. Avoid feedback loops with positional features.

37. Measure Training/Serving Skew.

38. Don’t waste time on new features if unaligned objectives have become the issue.

39. Launch decisions are a proxy for long­term product goals.

40. Keep ensembles simple.

41. When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.

42. Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.

43. Your friends tend to be the same across different products. Your interests tend not to be.