[Study Notes] Machine Learning: Neural Networks: Learning

We apply neural networks to classification. The structure of the network is described with the following notation:
Training set: { \( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \) }

\( L \): total number of layers in the network

\( s_l \): number of units (not counting bias unit) in layer \(l\).

\( K \): number of output units/classes.

\( (h_{\Theta}(x))_k \): a hypothesis that results in the \(k^{th}\) output.
We consider the following two types of classification problems:

1. Binary classification
\( y \) = 0 or 1

1 output unit (K = 1)
2. Multi-class classification (K classes)
\( y \in \mathbb{R}^K \)
Example: \( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \) for cat, \( \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \) for dog, \( \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \) for fox (see the Octave sketch below).
K output units
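As a minimal illustration of this one-hot encoding in Octave (the variable names and label values below are assumptions, not from the course code), integer class labels can be expanded into the corresponding output vectors like this:
% Hypothetical setup: y is an m x 1 vector of integer labels in 1..num_labels.
num_labels = 3;          % e.g. 1 = cat, 2 = dog, 3 = fox
y = [1; 3; 2; 1];        % assumed example labels
I = eye(num_labels);
Y = I(y, :);             % each row of Y is the one-hot vector for one example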

Cost Function


Cost Function for Regularized Logistic Regression:

\( J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) \log(1- h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \)

Cost Function for Neural Network:

\( J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)} \log((h_{\Theta}(x^{(i)}))_k) + (1-y_k^{(i)}) \log(1- (h_{\Theta}(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{j,i}^{(l)})^2 \)

Note:

The double sum adds up the logistic regression costs calculated for each cell in the output layer.

The triple sum adds up the squares of all the individual Θ values in the entire network (the bias weights, where \(i = 0\), are excluded).

Octave implementation:
% Forward propagation (bias units are added to each layer's activations)
a1 = [ones(m, 1) X];
z2 = a1 * Theta1';
a2 = sigmoid(z2);
a2 = [ones(m, 1) a2];
z3 = a2 * Theta2';
htheta = sigmoid(z3);

% Sum the logistic costs over the K output units
J = 0;
for k = 1:num_labels
    yk = (y == k);               % one-vs-all labels for class k
    hthetak = htheta(:, k);
    Jk = sum(-yk .* log(hthetak) - (1 - yk) .* log(1 - hthetak));
    J = J + Jk;
end
J = J / m;

% Regularization term (the bias columns of Theta1 and Theta2 are excluded)
regularization = lambda / (2 * m) * (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + regularization;

Backpropagation Algorithm


The goal is to find a set of parameters \(\Theta\) that minimizes \(J(\Theta)\). The backpropagation algorithm computes \( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) \).

\( \delta_j^{(l)} \): "error" of node \(j\) in layer \(l\).

\( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) = a_j^{(l)}\delta_i^{(l+1)} \)

Training set: { \( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \) }

Backpropagation Algorithm:

Set \( \Delta_{i,j}^{(l)} = 0 \) (for all \(l, i, j\)).

For \( i = 1 \) to \( m \) (iterate through the training examples \((x^{(i)}, y^{(i)})\)):
Set \(a^{(1)} = x^{(i)} \)

Perform forward propagation to compute \(a^{(l)}\) for \(l = 2, 3, ..., L\)

Use \(y^{(i)}\) to compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)

Compute \(\delta^{(L-1)}, \delta^{(L-2)}, ... \delta^{(2)} \) using \( \delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)} ) .* a^{(l)} .* (1-a^{(l)}) \)

\( \Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)}\delta_i^{(l+1)} \) or, with vectorization, \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \)
\( D_{i,j}^{(l)} := \frac{1}{m}(\Delta_{i,j}^{(l)}+\lambda \Theta_{i,j}^{(l)}) \), if \(j \neq 0\).

\( D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)} \), if \(j = 0\).

\( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) = D_{i,j}^{(l)} \)

Octave implementation:
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

for t = 1:m
    % delta_3: error at the output layer for example t
    for k = 1:num_labels
        yk = y(t) == k;
        delta_3(k) = htheta(t, k) - yk;
    end
    % delta_2: error at the hidden layer; the bias entry is dropped afterwards
    delta_2 = Theta2' * delta_3' .* sigmoidGradient([1, z2(t, :)])';
    delta_2 = delta_2(2:end);

    % Accumulate the gradient contributions of example t
    Theta1_grad = Theta1_grad + delta_2 * a1(t, :);
    Theta2_grad = Theta2_grad + delta_3' * a2(t, :);
end

Theta1_grad = Theta1_grad / m;
Theta2_grad = Theta2_grad / m;

% Regularize every column except the bias column
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
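The code above calls sigmoidGradient, which is not defined in this note. Since the derivative of the sigmoid is \( g'(z) = g(z)(1 - g(z)) \) (this is exactly the \( a^{(l)} .* (1 - a^{(l)}) \) factor in the \(\delta\) formula), a minimal sketch of such a helper could be:
function g = sigmoidGradient(z)
    % Element-wise derivative of the sigmoid evaluated at z
    s = 1.0 ./ (1.0 + exp(-z));
    g = s .* (1 - s);
end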

Implementation Note: Unrolling Parameters


In a neural network, the parameters \(\Theta\) and the partial derivatives of the cost function \(D\) are matrices.

However, optimization functions operate on vectors.

Therefore we need to convert between matrices and vectors.

For example, for a 4-layer neural network with \(s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1\), we have

\( \Theta^{(1)} \in \mathbb{R}^{10 \times 11}, \Theta^{(2)} \in \mathbb{R}^{10 \times 11}, \Theta^{(3)} \in \mathbb{R}^{1 \times 11} \)

\( D^{(1)} \in \mathbb{R}^{10 \times 11}, D^{(2)} \in \mathbb{R}^{10 \times 11}, D^{(3)} \in \mathbb{R}^{1 \times 11} \)

In Octave, we can unroll Theta1, Theta2, Theta3 as follows, so that they can be passed to fminunc(@costFunction, initialTheta, options):
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
Inside the implementation of costFunction, we reshape the vector back into matrices so that vectorized computation can be used:
function [jval, gradientVec] = costFunction(thetaVec)
Theta1 = reshape(thetaVec(1:110),10,11)

Theta2 = reshape(thetaVec(111:220),10,11)

Theta3 = reshape(thetaVec(221:231),1,11)
Use forward propagation and backpropagation to compute \( D^{(1)}, D^{(2)}, D^{(3)} \) and \( J(\Theta) \).

Unroll \( D^{(1)}, D^{(2)}, D^{(3)} \) and return them as gradientVec.
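Putting the pieces together, a minimal sketch of the optimizer call might look like the following (initialTheta1, initialTheta2, initialTheta3, and the MaxIter value are assumptions; costFunction is the function sketched above):
% Unroll the randomly initialized parameters into one vector
initialTheta = [initialTheta1(:); initialTheta2(:); initialTheta3(:)];

% 'GradObj', 'on' tells fminunc that costFunction also returns the gradient
options = optimset('GradObj', 'on', 'MaxIter', 100);

% Minimize J(Theta); optTheta is returned as an unrolled vector
[optTheta, optCost] = fminunc(@costFunction, initialTheta, options);

% Reshape the optimized vector back into the weight matrices
Theta1 = reshape(optTheta(1:110), 10, 11);
Theta2 = reshape(optTheta(111:220), 10, 11);
Theta3 = reshape(optTheta(221:231), 1, 11);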

Gradient Checking


Gradient checking can be used to verify that our implementation of backpropagation is correct.

We first consider the simple case: \( \theta \in \mathbb{R} \)
Numerical estimation of gradient:


\( \frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon} \)

\( \epsilon \) is typically chosen to be on the order of \(10^{-4}\).

Octave implementation:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
Next, we consider the actual case: \( \Theta \in \mathbb{R}^n \)
\( \Theta = (\theta_1, \theta_2, ..., \theta_n) \)

\( \frac{\partial}{\partial \theta_1}J(\Theta) \approx \frac{J(\theta_1+\epsilon, \theta_2, \theta_3, ..., \theta_n)-J(\theta_1-\epsilon, \theta_2, \theta_3, ..., \theta_n)}{2\epsilon} \)

\( \frac{\partial}{\partial \theta_2}J(\Theta) \approx \frac{J(\theta_1, \theta_2+\epsilon, \theta_3, ..., \theta_n)-J(\theta_1, \theta_2-\epsilon, \theta_3, ..., \theta_n)}{2\epsilon} \)

\( ... \)

\( \frac{\partial}{\partial \theta_n}J(\Theta) \approx \frac{J(\theta_1, \theta_2, \theta_3, ..., \theta_n+\epsilon)-J(\theta_1, \theta_2, \theta_3, ..., \theta_n-\epsilon)}{2\epsilon} \)

Octave implementation:
for i = 1:n,
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;
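Once gradApprox has been computed, it should be compared against the gradient produced by backpropagation. A small sketch of that comparison (DVec here stands for the unrolled backpropagation gradient and is an assumed variable name):
% A very small relative difference (e.g. below 1e-9) indicates that
% backpropagation and the numerical estimate agree.
relativeDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);
disp(relativeDiff);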

Random Initialization


In a neural network, initializing all of the \(\Theta\) parameters to zero does not work: after every gradient descent update, all the nodes within a given layer would keep computing the same value.

Initializing each \(\Theta_{i,j}^{(l)}\) to a random value in the interval \([-\epsilon, \epsilon]\) avoids this problem; this is called symmetry breaking.

Octave implementation:
% Each element of Theta1 is a random value in [-INIT_EPSILON, INIT_EPSILON]
Theta1 = rand(10,11)*(2*INIT_EPSILON)-INIT_EPSILON;
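The other weight matrices are initialized in the same way. A small helper along these lines (the function name and the value of INIT_EPSILON are assumptions) keeps the initialization in one place:
function W = randInitializeWeights(L_in, L_out)
    % Returns an L_out x (L_in + 1) weight matrix (the +1 is the bias column),
    % with every element drawn uniformly from [-INIT_EPSILON, INIT_EPSILON]
    INIT_EPSILON = 0.12;   % assumed small value; any epsilon of this order works
    W = rand(L_out, L_in + 1) * (2 * INIT_EPSILON) - INIT_EPSILON;
end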

Summary


When training a neural network, the first step is to decide on the network architecture (the connectivity pattern between neurons): how many layers there are, and how many units each layer has.
Number of input units: dimension of the features \(x^{(i)}\)

Number of output units: number of classes

Number of hidden units per layer: usually more is better (for example, a small multiple of the number of features), but not so many that the computation becomes too expensive.

Number of hidden layers: the default is 1. If there is more than 1, it is recommended that every hidden layer have the same number of units.
Steps for training a neural network:
1. Randomly initialize the weights

2. Implement forward propagation to get \(h_{\Theta}(x^{(i)})\) for any \(x^{(i)}\)

3. Implement code to compute cost function \(J(\Theta)\)

4. Implement backpropagation to compute partial derivatives \(\frac{\partial}{\partial \Theta_{j,k}^{(l)}}J(\Theta)\)

5. Use gradient checking to confirm that the backpropagation works. Then disable gradient checking.

6. Use gradient descent or an advanced optimization method together with backpropagation to minimize the cost function \(J(\Theta)\) as a function of the weights \(\Theta\).
When we perform forward and back propagation, we loop on every training example:
for i = 1:m,
Perform forward propagation and backpropagation using example \( (x^{(i)}, y^{(i)}) \)

(Get activations \( a^{(l)} \) and delta terms \( \delta^{(l)} \) for \(l = 2,...,L\) )
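After training, a prediction for each example is obtained by running forward propagation and taking the output unit with the largest activation. A minimal sketch, assuming htheta is the m x K matrix of output activations computed as in the cost-function code above and y holds integer labels 1..K:
% p(i) is the predicted class (1..K) for example i
[~, p] = max(htheta, [], 2);
accuracy = mean(double(p == y)) * 100;   % training-set accuracy in percent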

Articles in This Series


[學習筆記] 機器學習: 簡介 (Machine learning: Introduction)

[學習筆記] 機器學習: 單變數線性回歸 (Machine learning: Linear Regression with One Variable)

[學習筆記] 機器學習: 多變數線性回歸 (Machine learning: Linear Regression with Multiple Variables)

[學習筆記] 機器學習: 邏輯回歸 (Machine learning: Logistic Regression)

[學習筆記] 機器學習: 正規化 (Machine learning: Regularization)

[學習筆記] 機器學習: 神經網路的結構 (Machine Learning: Neural Networks: Representation)

[學習筆記] 機器學習: 神經網路的學習 (Machine Learning: Neural Networks: Learning)

[學習筆記] 機器學習: 應用上的建議 (Machine Learning: Advice for Applying Machine Learning)

[學習筆記] 機器學習: 系統設計 (Machine Learning: System Design)

[學習筆記] 機器學習: 支援向量機 (Machine Learning: Support Vector Machines)

[學習筆記] 機器學習: 非監督式學習 (Machine Learning: Unsupervised Learning)

[學習筆記] 機器學習: 維度縮減 (Machine Learning: Dimensionality Reduction)

[學習筆記] 機器學習: 異常偵測 (Machine Learning: Anomaly Detection)

[學習筆記] 機器學習: 推薦系統 (Machine Learning: Recommender Systems)

[學習筆記] 機器學習: 大量資料機器學習 (Machine Learning: Large Scale Machine Learning)

Further Reading


[Coursera] Machine Learning: Neural Networks: Learning, by Andrew Ng, Stanford University

Stanford’s CS231n: backprop notes