[Study Notes] Machine Learning: Large Scale Machine Learning

If the model is chosen well so that the machine learning algorithm has low bias, then training it on a large dataset can greatly improve accuracy.

Take linear regression as an example: the gradient descent introduced earlier is also called batch gradient descent.

repeat until convergence {
$$\theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}$$ (simultaneous update $$\theta_{j}$$ for $$j = 0, ..., n$$ )
}
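The batch update above can be sketched in NumPy (a minimal sketch; the design matrix `X` is assumed to carry a leading column of ones for the intercept term $$x_{0} = 1$$):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) target vector.
    Each iteration uses all m examples to compute the gradient.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # h_theta(x^(i)) - y^(i) for every example, vectorized over the set
        errors = X @ theta - y
        # Simultaneous update of every theta_j
        theta -= alpha * (X.T @ errors) / m
    return theta
```

On a noiseless line such as $$y = 1 + 2x$$, this recovers $$\theta \approx (1, 2)$$.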

Stochastic gradient descent:

1. Randomly shuffle (reorder) the training examples.

2. repeat until convergence {
for $$i := 1, ..., m$$ {
$$\theta_{j} := \theta_{j} - \alpha (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}$$ (simultaneous update $$\theta_{j}$$ for $$j = 0, ..., n$$ )
}
}
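The two steps above can be sketched as follows (again assuming `X` carries a leading column of ones; `num_epochs` is the number of passes over the data, an illustrative choice):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10, seed=0):
    """Stochastic gradient descent: update theta after every single example.

    X: (m, n+1) design matrix with a leading column of ones; y: (m,) targets.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        # Step 1: randomly shuffle (reorder) the training examples
        order = rng.permutation(m)
        for i in order:
            # Step 2: gradient of the cost on one example (x^(i), y^(i))
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta
```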

Batch gradient descent: use all $$m$$ examples in each iteration

Stochastic gradient descent: use $$1$$ example in each iteration

Mini-batch gradient descent: use $$b$$ examples in each iteration
$$b$$: mini-batch size, typically between 2 and 100.
Assume $$b = 10, m = 1000$$

Repeat {
for $$i=1, 11, 21, 31, ..., 991$$ {
$$\theta_{j} := \theta_{j} - \alpha \frac{1}{10} \sum_{k=i}^{i+9} (h_{\theta}(x^{(k)})-y^{(k)})x^{(k)}_{j}$$ (simultaneous update $$\theta_{j}$$ for $$j = 0, ..., n$$ )
}
}
Mini-batch gradient descent can take advantage of vectorization, so it sometimes converges a little faster than stochastic gradient descent.
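A mini-batch sketch: each update is a single vectorized matrix-vector product over the $$b$$ examples in the batch, which is where the speed advantage over per-example updates comes from.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, b=10, num_epochs=50, seed=0):
    """Mini-batch gradient descent with mini-batch size b.

    Each update averages the gradient over b consecutive examples, so the
    inner computation stays vectorized (one matrix product per batch).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            batch = order[start:start + b]
            # Vectorized gradient over the b examples of this mini-batch
            errors = X[batch] @ theta - y[batch]
            theta -= alpha * (X[batch].T @ errors) / len(batch)
    return theta
```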

Stochastic gradient descent convergence

For batch gradient descent we could plot $$J_{train}(\theta)$$ as a function of the number of iterations, but computing $$J_{train}(\theta)$$ requires a pass over all $$m$$ examples, which is expensive when $$m$$ is large.

$$J_{train}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

$$cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)})-y^{(i)})^2$$

During learning, compute $$cost(\theta, (x^{(i)}, y^{(i)}))$$ before updating $$\theta$$ using $$(x^{(i)}, y^{(i)})$$.

Every 1000 iterations (say), plot $$cost(\theta, (x^{(i)}, y^{(i)}))$$ averaged over the last 1000 examples processed by the algorithm.
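The monitoring procedure above (compute the per-example cost before each update, then average over a window of recent examples) might be sketched as follows; the `window` parameter and the function name are illustrative choices, not part of the original notes:

```python
import numpy as np

def sgd_with_cost_monitor(X, y, alpha=0.01, num_epochs=5, window=1000, seed=0):
    """SGD that records cost(theta, (x^(i), y^(i))) BEFORE each update,
    then averages the recorded costs over every `window` examples processed.

    Returns theta and the list of averaged costs (the learning curve).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    recent, averages = [], []
    for _ in range(num_epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error ** 2)   # cost before updating theta
            theta -= alpha * error * X[i]
            if len(recent) == window:
                # One point on the plotted learning curve
                averages.append(sum(recent) / window)
                recent = []
    return theta, averages
```

Plotting `averages` against the number of windows processed gives the learning curve described above.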

If the learning curve is flat, the parameters of the model may need to be changed.

If the learning curve diverges, try decreasing the learning rate (use a smaller $$\alpha$$).

If the curve is too noisy, try increasing the number of examples averaged over.

Stochastic gradient descent may never reach the global minimum; instead it may keep wandering around it.

The learning rate $$\alpha$$ is typically held constant. We can slowly decrease $$\alpha$$ over time if we want $$\theta$$ to converge.
E.g. $$\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}$$

Online Learning

Online learning processes a continuous stream of examples (each example is used to update $$\theta$$ once and is then discarded), so it can adapt to changing user preferences. Example: a site that sells phones wants to predict whether a user will click on a given search result.

$$x$$: features of the phone, e.g. how many words in the user's query match the name of the phone, how many words match its description, etc.

$$y = 1$$ if user clicks on link. $$y = 0$$ otherwise.
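A minimal sketch of one online-learning step, here using logistic regression for the click-prediction setup above (the function name and learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_update(theta, x, y, alpha=0.1):
    """One online-learning step: update theta from a single (x, y) pair.

    After the update the example can be discarded; nothing is stored.
    x: feature vector (e.g. how well the query matches the phone listing);
    y: 1 if the user clicked on the link, 0 otherwise.
    """
    # Same update shape as stochastic gradient descent,
    # with the logistic hypothesis h_theta(x) = sigmoid(theta^T x)
    error = sigmoid(x @ theta) - y
    return theta - alpha * error * x
```

In production this update would run inside a loop over the incoming stream of user interactions, one example at a time.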

Map reduce and data parallelism

The batch gradient descent update involves a sum over all $$m$$ examples. With map-reduce, the training set is split across several machines: each machine computes the partial sum of $$(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}$$ over its own share of the data, and a master server adds up the partial sums and applies the update:

$$\theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}$$ (simultaneous update $$\theta_{j}$$ for $$j = 0, ..., n$$ )
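The map-reduce idea, splitting the gradient sum across machines so that each computes a partial sum which a master then combines, can be sketched as follows (the machines are simulated with array splits; names are illustrative):

```python
import numpy as np

def partial_gradient(theta, X_part, y_part):
    """'Map' step: one machine computes the gradient sum over its own
    share of the training examples."""
    errors = X_part @ theta - y_part
    return X_part.T @ errors

def mapreduce_gradient_step(theta, X, y, alpha, num_machines=4):
    """'Reduce' step: add up the partial sums from every machine,
    then perform the usual batch gradient descent update."""
    m = X.shape[0]
    parts = zip(np.array_split(X, num_machines),
                np.array_split(y, num_machines))
    total = sum(partial_gradient(theta, Xp, yp) for Xp, yp in parts)
    return theta - alpha * total / m
```

Because addition is associative, the result is identical to a single-machine batch update; the benefit is that each machine only touches $$m / \text{num\_machines}$$ examples.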