[Study Notes] Machine Learning: Large Scale Machine Learning

If the model of a machine learning algorithm is chosen appropriately so that it has low bias, then pairing it with a large dataset can greatly improve accuracy.

A large dataset is not always helpful, however, and it greatly increases the amount of computation. Before deciding to train on a large dataset, first confirm whether the current model is actually in a position to benefit from more data.

Whether the algorithm has the low bias property can be checked by plotting its learning curves.

If it does not have low bias, first adjust the model (for example, add more features or more hidden units in a neural network) until it does. Only then can a large dataset be used effectively to improve accuracy.

Once the algorithm has low bias and we decide to use a large dataset to improve accuracy, the remaining difficulty is that the computational cost grows sharply. The rest of these notes covers techniques for speeding up the computation.

Stochastic gradient descent


Take linear regression as an example. The gradient descent introduced earlier is also called batch gradient descent.

Batch gradient descent:
repeat until convergence {
\( \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j} \) (simultaneous update \( \theta_{j} \) for \( j = 0, ..., n \) )
}
Here \(m\) is the number of training examples; the summation over all \(m\) training examples must be computed before \(\theta_j\) is updated once.
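
A minimal NumPy sketch of this batch update for linear regression (the names batch_gradient_descent, X, y, alpha are my own, not from the lecture; X is assumed to already contain the \(x_0 = 1\) column):

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    # X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets.
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        # Sum (h(x^(i)) - y^(i)) * x^(i)_j over all m examples before updating.
        gradient = (X.T @ (X @ theta - y)) / m
        theta = theta - alpha * gradient  # simultaneous update of every theta_j
    return theta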

Stochastic gradient descent:
1. Randomly shuffle (reorder) training examples.

2. repeat until convergence {
for \(i := 1, ..., m\) {
\( \theta_{j} := \theta_{j} - \alpha (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j} \) (simultaneous update \( \theta_{j} \) for \( j = 0, ..., n \) )
}
}
The factor \(\frac{1}{m}\) is dropped from the update; since \(m\) is a constant, dropping it does not change the direction of convergence, and its effect can be compensated for by adjusting \(\alpha\).

Every single training example now triggers one update of \(\theta_j\).

When the number of training examples is large, stochastic gradient descent converges much faster than batch gradient descent.
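
A corresponding sketch of stochastic gradient descent under the same assumptions; note the initial shuffle and the per-example update with no \(\frac{1}{m}\) factor:

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        # 1. Randomly shuffle (reorder) the training examples.
        for i in rng.permutation(m):
            # 2. Update theta using only the i-th example.
            error = X[i] @ theta - y[i]
            theta = theta - alpha * error * X[i]
    return theta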

Mini-batch gradient descent


Batch gradient descent: use all \(m\) examples in each iteration

Stochastic gradient descent: use \(1\) example in each iteration

Mini-batch gradient descent: use \(b\) examples in each iteration
\(b\): mini-batch size, typically chosen between 2 and 100.
Mini-batch gradient descent:
Assume \(b = 10, m = 1000\)

Repeat {
for \(i=1, 11, 21, 31, ..., 991\) {
\( \theta_{j} := \theta_{j} - \alpha \frac{1}{10} \sum_{k=i}^{i+9} (h_{\theta}(x^{(k)})-y^{(k)})x^{(k)}_{j} \) (simultaneous update \( \theta_{j} \) for \( j = 0, ..., n \) )
}
}
Mini-batch gradient descent can take advantage of vectorization, so it sometimes converges somewhat faster than stochastic gradient descent.
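
A sketch of the mini-batch variant with \(b = 10\); the gradient for each mini-batch is computed with a single vectorized expression rather than an inner loop (names are again illustrative only):

import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.01, b=10, num_epochs=10):
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_epochs):
        for start in range(0, m, b):  # i = 1, 11, 21, ..., 991 in the notes
            Xb, yb = X[start:start + b], y[start:start + b]
            # Vectorized gradient over the b examples of this mini-batch.
            gradient = (Xb.T @ (Xb @ theta - yb)) / len(yb)
            theta = theta - alpha * gradient
    return theta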

Stochastic gradient descent convergence


To check whether batch gradient descent is converging:
Plot \(J_{train}(\theta)\) as a function of the number of iterations of gradient descent.

\(J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2 \)

As the iterations proceed, \(J_{train}(\theta)\) should gradually decrease.

To check whether stochastic gradient descent is converging:
\(cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2 \)

During learning, compute \(cost(\theta, (x^{(i)}, y^{(i)}))\) before updating \(\theta\) using \((x^{(i)}, y^{(i)})\).

Every 1000 iterations (say), plot \(cost(\theta, (x^{(i)}, y^{(i)}))\) averaged over the last 1000 examples processed by algorithm.

If learning curve is flat, may need to change parameters of the model.

If learning curve diverges, try to decrease the learning rate (use smaller \(\alpha\)).

If the curve is too noisy, try increasing the number of examples averaged over.

Stochastic gradient descent may never settle at the global minimum; instead it may keep wandering around it.

Learning rate \(\alpha\) is typically held constant. Can slowly decrease \(\alpha\) over time if we want \(\theta\) to converge.
E.g. \(\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}\)
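
The monitoring and learning-rate decay ideas above can be sketched as follows (the window size of 1000 and the constants const1, const2 are just the example values mentioned; plotting is left out):

import numpy as np

def sgd_with_monitoring(X, y, const1=1.0, const2=1000.0, window=1000):
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    costs, averaged_costs = [], []
    for iteration, i in enumerate(np.random.default_rng(0).permutation(m), start=1):
        alpha = const1 / (iteration + const2)  # slowly decreasing learning rate
        error = X[i] @ theta - y[i]
        costs.append(0.5 * error ** 2)         # cost(theta, (x^(i), y^(i))) before updating theta
        theta = theta - alpha * error * X[i]
        if iteration % window == 0:            # every 1000 iterations (say)
            averaged_costs.append(np.mean(costs[-window:]))
    return theta, averaged_costs               # plot averaged_costs to judge convergence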

Online Learning


Learn from data that keeps streaming in.

Learn from one user's preference at a time: update the parameter \(\theta\) with that example, then discard the example.

Online learning 可以 adapt to changing user preferences.

Example 1: a shipping service website. The user specifies an origin and a destination, the website automatically quotes a price (asking price), and the user chooses to use your service (\(y=1\)) or not (\(y=0\)).
We design features \(x\) that describe the user, the origin, the destination, and the asking price.

We want to learn \( p(y=1|x;\theta) \) to optimize the price.
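
Since \(y\) is binary, one natural choice (an assumption on my part; the notes only state that we learn \(p(y=1|x;\theta)\)) is a logistic regression model updated online, one example at a time, after which the example is discarded:

import numpy as np

def online_update(theta, x, y, alpha=0.1):
    # x: feature vector (user properties, origin, destination, asking price); y: 0 or 1.
    prediction = 1.0 / (1.0 + np.exp(-(theta @ x)))  # p(y = 1 | x; theta)
    return theta - alpha * (prediction - y) * x      # update, then drop the example

Each time a user responds to a quote, online_update is called once with that \((x, y)\) pair, so the model keeps adapting as user preferences change.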
Example 2: product search
For example, a user searches for "Android phone 8MP camera". The store carries 100 phones, and each page shows 10 of them.

\(x\): features of phone, how many words in user query match name of phone, how many words in query match description of phone, etc.

\(y = 1\) if user clicks on link. \(y = 0\) otherwise.

We want to learn \(p(y = 1 | x; \theta) \) (the predicted click-through rate, CTR) so we can show the 10 phones the user is most likely to click on.
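
Assuming \(\theta\) has been learned as above, the 10 phones to display could be picked by ranking all candidates by predicted CTR; phone_features below is a hypothetical (100, n+1) array of per-phone feature vectors for the current query:

import numpy as np

def top_k_by_predicted_ctr(theta, phone_features, k=10):
    predicted_ctr = 1.0 / (1.0 + np.exp(-(phone_features @ theta)))  # p(y = 1 | x; theta)
    return np.argsort(predicted_ctr)[::-1][:k]  # indices of the k most clickable phones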
Other examples: choosing special offers to show the user; selecting news articles the user is interested in; product recommendations; ...

Map reduce and data parallelism


Batch gradient descent:

\( \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j} \) (simultaneous update \( \theta_{j} \) for \( j = 0, ..., n \) )



For example, with \(m = 400\) and 4 machines computing in parallel, split the 400 terms of the summation across the 4 machines so that each machine computes 100 of them, and then add the partial sums together at the end.

Ignoring overhead, the computation time is then cut to \( \frac{1}{4} \) of the original.
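
A single-machine simulation of the idea (a real deployment would use a cluster framework; the shard loop below stands in for the four machines):

import numpy as np

def map_reduce_gradient(X, y, theta, num_machines=4):
    m = X.shape[0]
    shards = np.array_split(np.arange(m), num_machines)  # e.g. 400 examples -> 4 shards of 100
    # "Map": each machine sums (h(x^(i)) - y^(i)) * x^(i) over its own shard.
    partial_sums = [X[s].T @ (X[s] @ theta - y[s]) for s in shards]
    # "Reduce": a central server adds the partial sums and divides by m.
    return sum(partial_sums) / m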

Further reading


[Coursera] Machine Learning: Large Scale Machine Learning, by Andrew Ng, Stanford University