[Study Notes] Machine Learning: Neural Networks: Learning

We apply neural networks to classification. The structure of the network is shown in the figure below.


Notation:
Training set: { \( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \) }

\( L \): total number of layers in the network

\( s_l \): number of units (not counting bias unit) in layer \(l\).

\( K \): number of output units/classes.

\( (h_{\Theta}(x))_k \): a hypothesis that results in the \(k^{th}\) output.
We consider the following two types of classification problems:

1. Binary classification
\( y \) = 0 or 1

1 output unit (K = 1)
2. Multi-class classification (K classes)
\( y \in \mathbb{R}^K \)
Example: \( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \) for cat, \( \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \) for dog, \( \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \) for fox (a one-hot encoding; see the sketch just below).
K output units
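
A minimal Octave sketch of this one-hot encoding, assuming the labels are stored as an m-by-1 vector y of integers 1..K (the variable names are illustrative):
% Each row of Y is the K-dimensional one-hot vector for one example.
Y = eye(num_labels)(y, :);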

Cost Function


Cost Function for Regularized Logistic Regression:

\( J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) \log(1- h_{\theta}(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \)

Cost Function for Neural Network:

\( J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)} \log((h_{\Theta}(x^{(i)}))_k) + (1-y_k^{(i)}) \log(1- (h_{\Theta}(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{j,i}^{(l)})^2 \)

Note:

The double sum adds up the logistic regression costs calculated for each unit in the output layer.

The triple sum adds up the squares of all the individual Θ values in the entire network (the bias terms, i.e. the first column of each \(\Theta^{(l)}\), are not regularized).

Octave implementation:
% Forward propagation, vectorized over all m training examples
a1 = [ones(m, 1) X];      % add the bias column to the input layer
z2 = a1 * Theta1';
a2 = sigmoid(z2);
a2 = [ones(m, 1) a2];     % add the bias column to the hidden layer
z3 = a2 * Theta2';
htheta = sigmoid(z3);     % m x num_labels matrix of hypotheses

% Unregularized cost: accumulate the logistic cost over the K output units
J = 0;
for k = 1:num_labels
    yk = (y == k);        % binary vector: 1 where the label equals class k
    hthetak = htheta(:, k);
    Jk = sum(-yk .* log(hthetak) - (1 - yk) .* log(1 - hthetak));
    J = J + Jk;
end
J = J / m;

% Regularization term: skip the bias column (column 1) of each Theta
regularization = lambda / (2 * m) * (sum(sum(Theta1(:, 2:end) .^ 2)) + sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + regularization;

Backpropagation Algorithm


The goal is to find a set of parameters \(\Theta\) that minimizes \(J(\Theta)\). The backpropagation algorithm computes \( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) \).

\( \delta_j^{(l)} \): "error" of node \(j\) in layer \(l\).

\( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) = a_j^{(l)}\delta_i^{(l+1)} \)

Training set: { \( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \) }

Backpropagation Algorithm:

Set \( \Delta_{i,j}^{(l)} = 0 \) (for all \(l, i, j\)).

For \( i = 1 \) to \( m \) (iterate over the training examples \((x^{(i)}, y^{(i)})\)):
Set \(a^{(1)} = x^{(i)} \)

Perform forward propagation to compute \(a^{(l)}\) for \(l = 2, 3, ..., L\)

Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)

Compute \(\delta^{(L-1)}, \delta^{(L-2)}, ... \delta^{(2)} \) using \( \delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)} ) .* a^{(l)} .* (1-a^{(l)}) \)

\( \Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)}\delta_i^{(l+1)} \) or with vectorization, \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T \)
\( D_{i,j}^{(l)} := \frac{1}{m}(\Delta_{i,j}^{(l)}+\lambda \Theta_{i,j}^{(l)}) \), if \(j \neq 0\).

\( D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)} \), if \(j = 0\).

\( \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) = D_{i,j}^{(l)} \)

Octave implementation (a1, a2, z2, and htheta come from the forward pass above):
% Accumulators for the gradients (the Delta matrices)
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

delta_3 = zeros(1, num_labels);   % preallocate the output-layer error
for t = 1:m
    % Output-layer error: delta_3 = h_theta(x^(t)) - y^(t) (one-hot)
    for k = 1:num_labels
        yk = y(t) == k;
        delta_3(k) = htheta(t, k) - yk;
    end
    % Hidden-layer error; then drop the bias element
    delta_2 = Theta2' * delta_3' .* sigmoidGradient([1, z2(t, :)])';
    delta_2 = delta_2(2:end);

    % Accumulate Delta^(l) := Delta^(l) + delta^(l+1) * (a^(l))'
    Theta1_grad = Theta1_grad + delta_2 * a1(t, :);
    Theta2_grad = Theta2_grad + delta_3' * a2(t, :);
end

Theta1_grad = Theta1_grad / m;
Theta2_grad = Theta2_grad / m;

% Regularize all terms except the bias column (column 1)
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + lambda / m * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + lambda / m * Theta2(:, 2:end);
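
If the optimizer expects a single gradient vector (see the unrolling note below), the two gradient matrices can be unrolled the same way as the parameters; a minimal sketch:
grad = [Theta1_grad(:); Theta2_grad(:)];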

Implementation Note: Unrolling Parameters


In a neural network, the parameters \(\Theta\) and the partial derivatives of the cost function \(D\) are matrices.

However, the optimization functions operate on vectors.

Therefore we need to convert between matrices and vectors.

For example, for a 4-layer neural network with \(s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1\), we have

\( \Theta^{(1)} \in \mathbb{R}^{10 \times 11}, \Theta^{(2)} \in \mathbb{R}^{10 \times 11}, \Theta^{(3)} \in \mathbb{R}^{1 \times 11} \)

\( D^{(1)} \in \mathbb{R}^{10 \times 11}, D^{(2)} \in \mathbb{R}^{10 \times 11}, D^{(3)} \in \mathbb{R}^{1 \times 11} \)

In Octave, we can unroll Theta1, Theta2, and Theta3 as follows, so that they can be passed to fminunc(@costFunction, initialTheta, options):
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
Inside the implementation of costFunction, we reshape the vector back into matrices so that vectorization can be used:
function [jval, gradientVec] = costFunction(thetaVec)
Theta1 = reshape(thetaVec(1:110),10,11)

Theta2 = reshape(thetaVec(111:220),10,11)

Theta3 = reshape(thetaVec(221:231),1,11)
Use forward propagation and backpropagation to compute \( D^{(1)}, D^{(2)}, D^{(3)} \) and \( J(\Theta) \).

Unroll \( D^{(1)}, D^{(2)}, D^{(3)} \) and return them as gradientVec (see the sketch below).
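
A minimal sketch of that last step, assuming D1, D2, and D3 hold the gradient matrices \( D^{(1)}, D^{(2)}, D^{(3)} \):
gradientVec = [D1(:); D2(:); D3(:)];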

Gradient Checking


Gradient checking can be used to verify that our backpropagation implementation is correct.

We first consider the simple case: \( \theta \in \mathbb{R} \)
Numerical estimation of gradient:


\( \frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon} \)

\( \epsilon \) is usually chosen to be on the order of \(10^{-4}\).

Octave implementation:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
Next we consider the actual case: \( \Theta \in \mathbb{R}^n \)
\( \Theta = (\theta_1, \theta_2, ..., \theta_n) \)

\( \frac{\partial}{\partial \theta_1}J(\Theta) \approx \frac{J(\theta_1+\epsilon, \theta_2, \theta_3, ..., \theta_n)-J(\theta_1-\epsilon, \theta_2, \theta_3, ..., \theta_n)}{2\epsilon} \)

\( \frac{\partial}{\partial \theta_2}J(\Theta) \approx \frac{J(\theta_1, \theta_2+\epsilon, \theta_3, ..., \theta_n)-J(\theta_1, \theta_2-\epsilon, \theta_3, ..., \theta_n)}{2\epsilon} \)

\( ... \)

\( \frac{\partial}{\partial \theta_n}J(\Theta) \approx \frac{J(\theta_1, \theta_2, \theta_3, ..., \theta_n+\epsilon)-J(\theta_1, \theta_2, \theta_3, ..., \theta_n-\epsilon)}{2\epsilon} \)

Octave implementation:
for i = 1:n,
    % Perturb only the i-th parameter by +/- EPSILON
    thetaPlus = theta;
    thetaPlus(i) = thetaPlus(i) + EPSILON;
    thetaMinus = theta;
    thetaMinus(i) = thetaMinus(i) - EPSILON;
    gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
end;
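
As a sketch of the final comparison, assuming DVec is the unrolled gradient produced by backpropagation (the name is illustrative): a very small relative difference (e.g. on the order of \(10^{-9}\)) suggests the implementation is correct.
% Relative difference between the numerical and backpropagation gradients
relativeDiff = norm(gradApprox - DVec) / norm(gradApprox + DVec);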

Random Initialization


In a neural network, initializing all the parameters to 0 does not work: after each gradient descent update, every node in a given layer would compute the same value.

This problem can be avoided by initializing each \(\Theta_{i,j}^{(l)}\) to a random value in the interval \([-\epsilon, \epsilon]\); this is called symmetry breaking.

Octave implementation:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % uniform values in [-INIT_EPSILON, INIT_EPSILON]
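
The remaining weight matrices are initialized the same way; a sketch, assuming the 4-layer architecture from the unrolling example above (\(\Theta^{(2)} \in \mathbb{R}^{10 \times 11}\), \(\Theta^{(3)} \in \mathbb{R}^{1 \times 11}\)):
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;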

Summary


When training a neural network, first decide on the network architecture (the connectivity pattern between neurons): how many layers, and how many units in each layer.
Number of input units: dimension of features \(x^{(i)}\)

Number of output units: number of classes

Number of hidden units per layer: usually more is better; a common choice is a small multiple of the number of features, but not so many that the computation becomes too expensive.

Number of hidden layers: the default is 1. If there is more than 1, it is recommended that every hidden layer have the same number of units. (A concrete sketch of these choices follows this list.)
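
A minimal sketch of fixing these architecture choices in Octave (the specific sizes are hypothetical, e.g. 400 input features, one hidden layer of 25 units, and 10 classes):
input_layer_size  = 400;   % hypothetical: dimension of the features x^(i)
hidden_layer_size = 25;    % hypothetical: units in the single hidden layer
num_labels        = 10;    % hypothetical: number of output classes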
Steps for training a neural network:
1. Randomly initialize the weights

2. Implement forward propagation to get \(h_{\Theta}(x^{(i)})\) for any \(x^{(i)}\)

3. Implement code to compute cost function \(J(\Theta)\)

4. Implement backpropagation to compute partial derivatives \(\frac{\partial}{\partial \Theta_{j,k}^{(l)}}J(\Theta)\)

5. Use gradient checking to confirm that the backpropagation works. Then disable gradient checking.

6. Use gradient descent or an advanced optimization algorithm with backpropagation to minimize the cost function \(J(\Theta)\) with respect to the weights \(\Theta\) (a sketch follows below).
When we perform forward propagation and backpropagation, we loop over every training example:
for i = 1:m,
Perform forward propagation and backpropagation using example \( (x^{(i)}, y^{(i)}) \)

(Get activations \( a^{(l)} \) and delta terms \( \delta^{(l)} \) for \(l = 2,...,L\) )
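
A rough sketch of step 6, assuming initialTheta is the unrolled random initialization and costFunction is implemented as in the unrolling note above (returning both the cost and the unrolled gradient):
options = optimset('GradObj', 'on', 'MaxIter', 50);   % MaxIter of 50 is an arbitrary example
[optTheta, optCost] = fminunc(@costFunction, initialTheta, options);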

Further Reading


[Coursera] Machine Learning: Neural Networks: Learning, by Andrew Ng, Stanford University