[Study Notes] Machine Learning: Regularization

Model of Supervised Learning



Notation:
h: hypothesis function

n: number of features

m: number of training examples

\( x^{(i)} \): the column vector of all the feature inputs of the \( i^{th} \) training example

\( x^{(i)}_{j} \): value of feature \(j\) in the \( i^{th} \) training example

The goal of supervised learning is for the learning algorithm to learn a function h from the training set, such that h(x1, x2, ..., xn) predicts y as accurately as possible.

When the y we want to predict is a continuous value, the learning problem is called a regression problem.

When the y we want to predict is a discrete value, the learning problem is called a classification problem.

Overfitting


If we have too many features (x1, x2, ..., xn), the hypothesis function learned from the training set may predict the examples in the training set very accurately (i.e. \( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 \approx 0 \)), yet fail to predict new examples accurately.

Take the linear regression in the figure below as an example: the hypothesis function on the left underfits, with larger prediction error (high bias); the one in the middle fits just right; the one on the right overfits and makes wildly fluctuating predictions (high variance).


The same holds for the logistic regression in the figure below: the hypothesis function on the left underfits, with larger prediction error (high bias); the one in the middle fits just right; the one on the right overfits and makes wildly fluctuating predictions (high variance).


Ways to Avoid Overfitting


1. Reduce the number of features.
a. Manually select which features to keep and which to discard.

b. Use a model selection algorithm to decide which features to keep and which to discard.

The drawback of this approach is that discarding features also discards information that might have been useful for prediction.
2. Regularization.
Keep all the features, but reduce the magnitude of the parameters θj.

This approach works well when we have many features, each of which contributes a little to predicting y.

Intuition for Cost Function of Linear Regression



In the figure above, the hypothesis function on the left fits just right, while the one on the right overfits.

Therefore, if we adjust the cost function so that θ3 and θ4 are forced to become very small, the hypothesis function on the right will become close to the one on the left. One way to do this is to add large penalty terms on θ3 and θ4:

\( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 + 1000\theta_3^2 + 1000\theta_4^2 \)
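
With penalty coefficients as large as 1000, any appreciable value of θ3 or θ4 dominates the cost, so minimizing J(θ) drives both toward zero. Assuming the overfit hypothesis in the figure is the quartic \( h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 \), the learned function then effectively reduces to the quadratic

\( h_\theta(x) \approx \theta_0 + \theta_1 x + \theta_2 x^2 \)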

Cost Function for Regularized Linear Regression


\( J(\theta) = \frac{1}{2m} [\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2] \)
\( \lambda \sum_{j=1}^{n}\theta_j^2 \): regularization term

\( \lambda \): regularization parameter

If λ is chosen too large, the model will underfit; in the extreme case, the hypothesis function degenerates into the equivalent of hθ(x) = θ0.

If λ is chosen too small, the regularization has little effect and the model can still overfit.

Note: the summation in the regularization term does not include θ0.
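
A minimal Octave sketch of this cost function follows; the function name linearRegCostFunctionReg is my own, and X is assumed to be the m × (n+1) design matrix whose first column is all ones (so theta(1) corresponds to θ0):

function [J] = linearRegCostFunctionReg(theta, X, y, lambda)
m = length(y);                          % number of training examples
h = X * theta;                          % predictions h_theta(x^(i)) for all examples
sq_err = sum((h - y) .^ 2);             % sum of squared errors
reg = lambda * sum(theta(2:end) .^ 2);  % regularization term; theta(1) (= theta_0) is skipped
J = (sq_err + reg) / (2 * m);
end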

Regularized Linear Regression


Cost Function:
\( J(\theta) = \frac{1}{2m} [\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^{n}\theta_j^2] \)
Gradient Descent:
repeat until convergence {
\( \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{0} \)

\( \theta_{j} := \theta_{j} - \alpha [(\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j}) + \frac{\lambda}{m}\theta_j] \qquad (j = 1, ..., n) \)
}

This can equivalently be written as:

repeat until convergence {
\( \theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{0} \)

\( \theta_{j} := \theta_{j}(1- \alpha\frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j} \qquad (j = 1, 2, ..., n) \)
}

Since \( 1- \alpha\frac{\lambda}{m} \lt 1 \), every update first shrinks θj by this factor and then applies the usual (unregularized) gradient step.
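
One full iteration of the update above, in vectorized Octave (a sketch; it assumes X includes the leading column of ones, and that alpha and lambda have already been chosen):

h = X * theta;                                                % m x 1 vector of predictions
grad = (X' * (h - y)) / m;                                    % unregularized gradient, (n+1) x 1
theta_new = theta * (1 - alpha * lambda / m) - alpha * grad;  % shrink, then gradient step
theta_new(1) = theta(1) - alpha * grad(1);                    % theta_0 is not regularized
theta = theta_new;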
Normal Equation:
\( x^{(i)} =
\begin{bmatrix}
x_0^{(i)}
\\x_1^{(i)}
\\ \vdots
\\x_n^{(i)}
\end{bmatrix}_{(n+1) \times 1} \in \mathbb{R}^{n+1},
X = \begin{bmatrix}
(x^{(1)})^T
\\(x^{(2)})^T
\\ \vdots
\\(x^{(m)})^T
\end{bmatrix}_{m \times (n+1)},
y =
\begin{bmatrix}
y^{(1)}
\\y^{(2)}
\\ \vdots
\\y^{(m)}
\end{bmatrix}_{m \times 1}
\)

\( \theta = (X^TX + \lambda L)^{-1}X^Ty \)

where \( L =
\begin{bmatrix}
0 & & & &
\\ & 1 & & &
\\ & & 1 & &
\\ & & & \ddots &
\\ & & & & 1
\end{bmatrix}_{(n+1) \times (n+1)}
\)

If m ≤ n (no more training examples than features), \( X^TX \) is non-invertible (singular); even so, \( X^TX + \lambda L \) is invertible for λ > 0.
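
A minimal Octave sketch of the regularized normal equation, assuming X is the m × (n+1) design matrix with a leading column of ones:

p = size(X, 2);                             % p = n + 1 parameters, including theta_0
L = eye(p);
L(1, 1) = 0;                                % theta_0 is not regularized
theta = (X' * X + lambda * L) \ (X' * y);   % solve the linear system rather than forming the inverse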

Intuition for Cost Function of Logistic Regression



In the figure above, the pink hypothesis function fits just right, while the blue one overfits.

Therefore, if we adjust the cost function so that the value of θ5 becomes very small, the hypothesis function will be close to the pink curve.

Cost Function for Regularized Logistic Regression


\( J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)} log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) log(1- h_{\theta}(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \)

Note: the summation in the regularization term does not include θ0.

Octave implementation:

function [J] = costFunctionReg(theta, X, y, lambda)
m = length(y); % number of training examples
n = size(X,2); % number of parameters (including theta_0)
J = 0;
% unregularized cross-entropy cost, accumulated over all training examples
for i = 1:m;
  J = J - y(i)*log(sigmoid(X(i,:)*theta)) - (1-y(i))*log(1-sigmoid(X(i,:)*theta));
end;
% regularization term: sum of theta_j^2 for j >= 1; theta(1) (= theta_0) is skipped
r_term = 0;
for j = 2:n;
  r_term = r_term + theta(j)^2;
end;
r_term = r_term * lambda / 2;
J = (J + r_term) / m;
end
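
The two loops above can also be vectorized; a sketch under the same assumptions (X includes the column of ones, and sigmoid is available):

h = sigmoid(X * theta);                            % m x 1 vector of predictions
J = (-y' * log(h) - (1 - y)' * log(1 - h)) / m ...
    + lambda / (2 * m) * sum(theta(2:end) .^ 2);   % theta(1) (= theta_0) is not penalized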

Regularized Logistic Regression


Gradient Descent:
Repeat {

\( \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \)

\( \theta_j := \theta_j - \alpha [\frac{1}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1, 2, ..., n) \)

}

Octave implementation:

function [grad] = gradient(theta, X, y, lambda)
m = length(y); % number of training examples
n = size(X,2); % number of parameters (including theta_0)
% unregularized gradient: X' * (h_theta(x) - y), summed over all examples
grad = X'*(sigmoid(X*theta)-y);
% add the regularization term lambda*theta_j for j >= 1; theta(1) (= theta_0) is skipped
for j = 2:n;
  grad(j) = grad(j) + lambda*theta(j);
end;
grad = grad/m;
end
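
Both functions can be handed to an optimizer such as fminunc. A usage sketch, assuming a small wrapper (the name costGradReg and the 400-iteration limit are my own choices) that returns the cost and gradient together as fminunc expects:

% costGradReg.m: combine the two functions above into the [cost, gradient] form fminunc expects
function [J, grad] = costGradReg(theta, X, y, lambda)
  J = costFunctionReg(theta, X, y, lambda);
  grad = gradient(theta, X, y, lambda);
end

% driver script (X, y, lambda assumed to be in the workspace):
options = optimset('GradObj', 'on', 'MaxIter', 400);
initial_theta = zeros(size(X, 2), 1);
[theta, J] = fminunc(@(t) costGradReg(t, X, y, lambda), initial_theta, options);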

Further Reading


[Coursera] Machine Learning: Regularization, by Andrew Ng, Stanford University