[Study Notes] Machine Learning: Classification

Model of Supervised Learning



The goal of supervised learning is for the learning algorithm to learn, from the training set, a function h such that h(x1, x2, ..., xn) predicts y as accurately as possible.

Notation:
h: hypothesis function

n: number of features

m: number of training examples

\( x^{(i)} \): the column vector of all the feature inputs of the \( i^{th} \) training example

\( x^{(i)}_{j} \): value of feature \(j\) in the \( i^{th} \) training example

When the y we want to predict is a continuous value, the learning problem is called a regression problem.

When the y we want to predict is a discrete value, the learning problem is called a classification problem.

Binary Classification (or Two Class Classification)


Examples:
email: spam / not spam?

Online transactions: fraudulent / legitimate?

Tumor: malignant / benign?
\( y \in \{0, 1\} \)
0: "Negative Class" (e.g. not spam / legitimate / benign)

1: "Positive Class" (e.g. spam / fraudulent / malignant)
Linear regression is not well suited to classification problems, because the relationship we are trying to learn is not a linear function.
Take the figure below as an example. When the training set contains only the eight points on the left, the fitted hypothesis is the pink line, which happens to classify every example correctly.

However, once the point on the far right is added to the training set, the fitted hypothesis becomes the blue line. Adding more information actually lowers the prediction accuracy, which is why linear regression is a poor fit for classification problems.


For a binary classification problem, y = 0 or 1.

For linear regression, \( h_\theta (x) \) can be greater than 1 or less than 0, which can be very far from the possible values of y.

For logistic regression, \( 0 \leqslant h_\theta (x) \leqslant 1 \).
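
The point can be checked numerically. Below is a minimal Octave sketch with made-up 1-D tumor-size data (chosen only for illustration): thresholding a linear-regression fit at 0.5 works on the original points, but adding a single far-right malignant example flattens the line, shifts the 0.5 crossing so that one malignant tumor is misclassified, and produces predictions above 1.

% Hypothetical tumor sizes; 0 = benign, 1 = malignant (illustration only)
X = [ones(8,1), [1 2 3 4 6 7 8 9]'];   % design matrix with intercept column
y = [0 0 0 0 1 1 1 1]';

theta = pinv(X' * X) * X' * y;           % linear regression via the normal equation
(X * theta >= 0.5) == y                  % all 1's: every example classified correctly

% Add one very large malignant tumor: the fitted line flattens, the example
% at x = 6 now falls below the 0.5 threshold, and h_theta(30) exceeds 1.
X2 = [X; 1 30];  y2 = [y; 1];
theta2 = pinv(X2' * X2) * X2' * y2;
(X2 * theta2 >= 0.5) == y2               % no longer all 1's
[1 30] * theta2                          % greater than 1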

Hypothesis Function for Binary Classification


We need \( h_\theta (x) \) to satisfy the condition \( 0 \leqslant h_\theta (x) \leqslant 1 \).

Sigmoid Function (Logistic Function):
\( h_\theta (x) = g( \theta^T x ) \)

\( z = \theta^T x \)

\( g(z) = \frac{1}{1 + e^{-z}} \)


\( h_\theta(x) \): the estimated probability that y = 1 on input x
Example: If \( x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ tumorSize \end{bmatrix} \)

\( h_\theta(x) = 0.7 \) means there is a 70% chance that the tumor is malignant.
\( h_{\theta}(x) = P(y=1|x;\theta) \): "the probability that y = 1, given x, parameterized by θ"

\( P(y=0|x;\theta) + P(y=1|x;\theta) = 1 \): since y can only be 0 or 1, the two probabilities must sum to 1.

\( P(y=0|x;\theta) = 1 - P(y=1|x;\theta) \)
Sigmoid function in Octave:

function g = sigmoid(z)
  % Compute the sigmoid of z element-wise; z may be a scalar, vector, or matrix.
  g = 1 ./ (1 + exp(-z));
end
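
A quick sanity check (assuming the function above is saved as sigmoid.m on the Octave path): sigmoid(0) returns 0.5, and because only element-wise operators are used, sigmoid([-100 0 100]) returns approximately [0 0.5 1].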

Decision Boundary


\( h_\theta (x) = g( \theta^T x ) \)

\( z = \theta^T x \)

\( g(z) = \frac{1}{1 + e^{-z}} \)


Suppose we predict "y = 1" if \( h_{\theta}(x) \geqslant 0.5 \):
\( g(z) \geqslant 0.5 \) when \( z \geqslant 0 \)

\( h_\theta (x) = g(\theta^T x) \geqslant 0.5 \) when \( \theta^T x \geqslant 0 \)
predict "y = 0" if \( h_{\theta}(x) \lt 0.5 \)
\( g(z) \lt 0.5 \) when \( z \lt 0 \)

\( h_\theta (x) = g(\theta^T x) \lt 0.5 \) when \( \theta^T x \lt 0 \)
Example of a linear decision boundary.

\( h_\theta (x) = g(\theta^T x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2) \)

Predict "\(y=1\)" if \(\theta^T x = -3 + x_1 + x_2 \geqslant 0\)
Example of a non-linear decision boundary.

\( h_\theta (x) = g(\theta^T x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2) \)

Predict "\(y=1\)" if \(\theta^T x = -1 + x_1^2 + x_2^2 \geqslant 0\)

Logistic Regression Model



Training set: { \( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \) }

Features: \( x = \begin{bmatrix} x_{0} \\ x_{1} \\ \vdots \\ x_{n} \end{bmatrix} \in \mathbb{R}^{n+1} \)

\( y \in \{0, 1\} \)

\( h_{\theta}(x) = \frac{1}{1+e^{-\theta^Tx}} \)

How to choose parameters \( \theta \)?

If we reused the cost function from linear regression, the different form of \( h_{\theta}(x) \) would make the cost function non-convex, with many local minima.

We therefore need a cost function suited to logistic regression:
\( J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y^{(i)}) \)

\( Cost(h_{\theta}(x), y) = - log(h_{\theta}(x)) \) if \( y = 1 \)

\( Cost(h_{\theta}(x), y) = - log(1- h_{\theta}(x)) \) if \( y = 0 \)
When \( y = 1 \), the relationship between \( Cost(h_{\theta}(x), y) \) and \( h_{\theta}(x) \) looks like the figure below:

When \( h_{\theta}(x) = 1 \), \( cost = 0 \).

When \( h_{\theta}(x) \rightarrow 0 \), \( cost \rightarrow \infty \).
When \( y = 0 \), the relationship between \( Cost(h_{\theta}(x), y) \) and \( h_{\theta}(x) \) looks like the figure below:

When \( h_{\theta}(x) = 0 \), \( cost = 0 \).

When \( h_{\theta}(x) \rightarrow 1 \), \( cost \rightarrow \infty \).
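
A quick numerical check of this intuition in Octave (the sample values of \( h_\theta(x) \) are arbitrary, chosen only to show the trend):

h = [0.99 0.9 0.5 0.1 0.01];
cost_y1 = -log(h)      % cost when y = 1: near 0 for h close to 1, blows up as h -> 0
cost_y0 = -log(1 - h)  % cost when y = 0: near 0 for h close to 0, blows up as h -> 1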

Simplified Logistic Regression Cost Function


\( J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y^{(i)}) \)

\( Cost(h_{\theta}(x), y) = - log(h_{\theta}(x)) \) if \( y = 1 \)

\( Cost(h_{\theta}(x), y) = - log(1- h_{\theta}(x)) \) if \( y = 0 \)

The cost function can be written more compactly as a single expression:

\( Cost(h_{\theta}(x), y) = -y log(h_{\theta}(x)) - (1-y) log(1- h_{\theta}(x)) \)

\( J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y^{(i)}) \)

\( J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)} log(h_{\theta}(x^{(i)})) + (1-y^{(i)}) log(1- h_{\theta}(x^{(i)})) \right] \)

Vectorized implementation:

\( h = g(X \theta) = \frac{1}{1+e^{-X \theta}} \)

\( J(\theta) = \frac{1}{m}(-y^T log(h) - (1-y)^T log (1-h)) \)

Octave implementation of the cost function:

function [J] = costFunction(theta, X, y)
  m = length(y); % number of training examples
  J = 0;
  for i = 1:m
    % accumulate the cost of the i-th training example
    J = J - y(i)*log(sigmoid(X(i,:)*theta)) - (1-y(i))*log(1-sigmoid(X(i,:)*theta));
  end
  J = J / m;
end
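
The same cost can also be computed without the loop, following the vectorized formula above (a sketch; the name costFunctionVec is just for this note, and sigmoid.m from earlier is assumed to be on the path):

function [J] = costFunctionVec(theta, X, y)
  m = length(y);           % number of training examples
  h = sigmoid(X * theta);  % m x 1 vector of predictions, h = g(X*theta)
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end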

Gradient Descent


Repeat {

\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)

}

Taking the partial derivatives of \( J(\theta) \) gives:

Repeat {

\( \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} \)

}

Vectorized implementation:

\( \theta := \theta - \frac{\alpha}{m} \sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} \)

\( \theta := \theta - \frac{\alpha}{m} X^T ( g(X \theta) - y) \)

Octave implementation of the gradient:

function [grad] = gradient(theta, X, y)
  m = length(y);                        % number of training examples
  grad = X' * (sigmoid(X*theta) - y);   % X' * (g(X*theta) - y)
  grad = grad / m;
end
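
Putting the pieces together, a minimal gradient descent loop might look like the sketch below (alpha and num_iters are hypothetical settings to be tuned; costFunction and gradient are the functions defined above):

function [theta, J_history] = gradientDescent(theta, X, y, alpha, num_iters)
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - alpha * gradient(theta, X, y);  % simultaneous update of all theta_j
    J_history(iter) = costFunction(theta, X, y);    % J(theta) should decrease every iteration
  end
end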

Optimization Algorithms


Given the cost function \( J(\theta) \), we want to find the \( \theta \) that minimizes \( J(\theta) \).

Conjugate gradient, BFGS, and L-BFGS are optimization algorithms that typically converge faster than gradient descent.

Octave provides these optimization algorithms.

First, we need to supply a function that, given \( \theta \), computes \( J(\theta) \) and \( \frac{\partial}{\partial \theta_j} J(\theta) \).

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then we can use Octave's "fminunc()" optimization routine:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
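
For logistic regression in particular, the placeholder function above can be filled in with the cost and gradient derived earlier. A sketch (the name logisticCostFunction is just for this note; save it in its own .m file):

function [jVal, gradient] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % J(theta)
  gradient = (1/m) * (X' * (h - y));                      % vector of partial derivatives
end

Then, at the Octave prompt, with the design matrix X and labels y already in the workspace, an anonymous function binds them so fminunc sees a function of θ alone:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t)(logisticCostFunction(t, X, y)), initialTheta, options);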

Multiclass Classification


Binary Classification: \( y \in \{0, 1\} \), i.e. 2 classes

Multiclass Classification: \( y \in \{0, 1, ..., n\} \), i.e. n+1 classes


The one-vs-all approach handles multiclass classification by decomposing it into n+1 binary classification problems.

For each class i, we train a logistic regression classifier \( h_\theta^{(i)}(x) \) that predicts the probability that \( y = i \).

\( h_\theta^{(i)}(x) = P (y = i|x; \theta) \)


Given a new input x, we predict its class by picking the i for which \( h_\theta^{(i)}(x) \) (i.e. \( P (y = i|x; \theta) \)) is largest.
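
A minimal Octave sketch of this prediction step, assuming the classes are relabeled 1..K for Octave's 1-based indexing and that a hypothetical matrix all_theta stores the learned parameters of classifier i in its i-th row:

function p = predictOneVsAll(all_theta, X)
  % X: m x (n+1) design matrix (first column all ones)
  % all_theta: K x (n+1) matrix, one row of parameters per class
  probs = sigmoid(X * all_theta');  % m x K matrix of P(y = i | x; theta)
  [~, p] = max(probs, [], 2);       % for each example, pick the class with the highest probability
end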

Further Reading


[Coursera] Machine Learning: Logistic Regression, by Andrew Ng, Stanford University