[Study Notes] Machine Learning: Linear Regression with Multiple Variables

Model of Supervised Learning



The goal of supervised learning is for the learning algorithm to learn, from the training set, a function h such that h(x1, x2, ..., xn) predicts y as accurately as possible.

Notation:
h: hypothesis function

n: number of features

m: number of training examples

\( x^{(i)} \): the column vector of all the feature inputs of the \( i^{th} \) training example

\( x^{(i)}_{j} \): value of feature \(j\) in the \( i^{th} \) training example
When the y to be predicted is a discrete value, the learning problem is called a classification problem.

When the y to be predicted is a continuous value, the learning problem is called a regression problem.
Example: given data on m houses (floor area, number of bedrooms, number of floors, age, and sale price), predict the sale price y of a house whose floor area is x1, number of bedrooms is x2, number of floors is x3, and age is x4.
Training set: the floor area, number of bedrooms, number of floors, age, and sale price of the m houses

x1: floor area

x2: number of bedrooms

x3: number of floors

x4: age of the house

n: 4

y: sale price

Multivariate Linear Regression


\( h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} \)
\( \theta_{0}, \theta_{1}, \theta_{2}, ..., \theta_{n} \): parameters
For convenience of notation, define \(x_{0} = 1, x =
\begin{bmatrix}
x_{0}
\\ x_{1}
\\ \vdots
\\ x_{n}
\end{bmatrix}
\in \mathbb{R}^{n+1}, \theta =
\begin{bmatrix}
\theta_{0}
\\ \theta_{1}
\\ \vdots
\\ \theta_{n}
\end{bmatrix}
\in \mathbb{R}^{n+1}
\)

The hypothesis can then be written as:

\( h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} = \theta^{T}x \)
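In Octave, the hypothesis for a single example is then just an inner product. A minimal sketch (the feature and parameter values below are purely illustrative):

x = [1; 2104; 5; 1; 45];       % [x0; x1; x2; x3; x4] for one house
theta = [0; 0.1; 0; 0; 0];     % illustrative parameter values
h = theta' * x;                % h_theta(x) = theta^T * x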

Gradient Descent for Linear Regression with Multiple Variables


Cost function:
\( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 \)

m: number of training examples
Cost function in Octave:

function J = computeCost(X, y, theta)
  % Compute the linear regression cost J(theta).
  % X: m x (n+1) design matrix (first column is x0 = 1)
  % y: m x 1 vector of target values
  m = length(y);
  J = 0;
  for i = 1:m
    J = J + (X(i,:) * theta - y(i))^2;   % squared error of the i-th example
  end
  J = J / (2 * m);
end
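The same cost can also be computed without the loop by using vectorized operations. A sketch, equivalent to the function above:

function J = computeCostVectorized(X, y, theta)
  % J(theta) = (1/2m) * sum of squared errors, computed as an inner product
  m = length(y);
  errors = X * theta - y;             % m x 1 vector of prediction errors
  J = (errors' * errors) / (2 * m);
end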
Gradient Descent:
repeat until convergence {
\( \theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta) \) (simultaneous update for every \( j = 0, ..., n \))
}

Gradient Descent for Linear Regression:
repeat until convergence {
\( \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x^{(i)}_{j} \) (simultaneous update \( \theta_{j} \) for \( j = 0, ..., n \) )
}
Gradient Descent for Linear Regression in Octave:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  % Run gradient descent for num_iters iterations with learning rate alpha.
  m = length(y);
  n = length(theta);
  J_history = zeros(num_iters, 1);       % cost after each iteration (useful for debugging)
  for iter = 1:num_iters
    delta = zeros(n, 1);                 % accumulated gradient
    for i = 1:m
      delta = delta + (X(i,:) * theta - y(i)) * X(i,:)';
    end
    theta = theta - (alpha / m) * delta;            % simultaneous update of all theta_j
    J_history(iter) = computeCost(X, y, theta);     % record J(theta) for this iteration
  end
end
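A minimal usage sketch with the house-price data from the example later in these notes; the learning rate and iteration count are made-up values, and the features are mean-normalized (as described in the next section) so that gradient descent converges:

data = [2104 5 1 45 460; 1416 3 2 40 232; 1534 3 2 30 315; 852 2 1 36 178];
features = data(:, 1:4);
y = data(:, 5);
mu = mean(features);
sigma = std(features);
X = [ones(size(features, 1), 1), (features - mu) ./ sigma];   % x0 column + scaled features
initial_theta = zeros(size(X, 2), 1);
[theta, J_history] = gradientDescent(X, y, initial_theta, 0.1, 400);
J = computeCost(X, y, theta);                                 % final cost after training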

Gradient Descent in Practice


Feature Scaling: Get every feature into approximately the range \( -1 \leqslant x_i \leqslant 1 \).
Idea: make sure features are on a similar scale (this makes gradient descent converge more quickly).

Mean Normalization: Replace \( x_i \) with \( x_i - \mu_i \) to make features have approximately zero mean (Do not apply to \( x_0 = 1 \) )

General formula with feature scaling and mean normalization:
\( x_i := \frac{x_i - \mu_i}{s_i} \)
\( \mu_i \): mean of all the values for feature \( i \)

\( s_i \): standard deviation (or max - min) of all the values for feature \( i \)
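A sketch of both steps as an Octave function (X here is assumed to contain only the feature columns, without the x0 = 1 column):

function [X_norm, mu, sigma] = featureNormalize(X)
  % Mean normalization and feature scaling: x_i := (x_i - mu_i) / s_i
  mu = mean(X);                 % row vector of per-feature means
  sigma = std(X);               % row vector of per-feature standard deviations
  X_norm = (X - mu) ./ sigma;   % broadcasts across the rows of X
end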

Make sure gradient descent is working properly.
\( J(\theta) \) should decrease after every iteration.


If \( J(\theta) \) is not decreasing with every iteration, the learning rate \( \alpha \) was chosen too large; use a smaller \( \alpha \).


It can be shown mathematically that if \( \alpha \) is sufficiently small, \( J(\theta) \) decreases on every iteration.

However, if \( \alpha \) is too small, gradient descent converges very slowly.

Automatic convergence test: declare convergence if \( J(\theta) \) decreases by less than some threshold value E in one iteration, where E is a small value such as \( 10^{-3} \). (In practice, however, it is difficult to choose an appropriate E.)
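A sketch of such a test in Octave, using the J_history vector returned by gradientDescent above (the threshold value is only illustrative):

epsilon = 1e-3;                      % threshold E
for iter = 2:length(J_history)
  if (J_history(iter-1) - J_history(iter)) < epsilon
    printf('Converged after %d iterations\n', iter);
    break;
  end
end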

Features and Polynomial Regression


Multiple features can be combined into one.
For example, when predicting house price from width and depth, one option is to treat width and depth as 2 separate features:

\( h_{\theta}(x) = \theta_0 + \theta_1*width + \theta_2*depth \)

Another option is to combine width and depth into a single feature, area:

\( area = width * depth \)

\( h_{\theta}(x) = \theta_0 + \theta_1*area \)
The hypothesis function does not have to be linear; to better fit the data, it can also be a quadratic, cubic, or square-root function (or any other form).
For example, to fit data such as house price versus size, we could set the hypothesis function to

\( h_{\theta}(x) = \theta_0 + \theta_1*size + \theta_2*size^2 \)

or

\( h_{\theta}(x) = \theta_0 + \theta_1*size + \theta_2\sqrt{size} \)
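A sketch of how such polynomial features could be built in Octave (the size values are illustrative); feature scaling becomes important here because \( size^2 \) is on a much larger scale than \( size \):

house_size = [2104; 1416; 1534; 852];                 % illustrative house sizes
m = length(house_size);
X_poly = [ones(m, 1), house_size, house_size .^ 2];   % columns: x0, size, size^2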

Computing Parameters Analytically


Gradient descent finds the minimum of the cost function \( J(\theta) \) iteratively, step by step, whereas the normal equation solves for the minimizing \( \theta \) directly in closed form.

First, consider the case where \( \theta \) is a scalar (rather than a vector):
\( \theta \in \mathbb{R} \)

Suppose the cost function is \( J(\theta) = a\theta^2 + b\theta + c \)


To find the minimum of \( J(\theta) \), set \( \frac{d}{d\theta} J(\theta) = 0 \)

Solve for \( \theta \).
Differentiating \( J(\theta) \) gives the slope of the tangent line, and the minimum is reached where that slope is 0.
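For the quadratic cost above (assuming \( a > 0 \)), this gives:

\( \frac{d}{d\theta} J(\theta) = 2a\theta + b = 0 \;\Rightarrow\; \theta = -\frac{b}{2a} \)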
When \( \theta \) is an (n+1)-dimensional vector:
\( \theta \in \mathbb{R}^{n+1} \)

Suppose the cost function is \( J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})^2 \)
m: number of training examples

training examples: \( (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}) \)

n: number of features
To find the minimum of \( J(\theta) \), set \( \frac{\partial}{\partial \theta_j} J(\theta) = 0 \) (for every \( j \))

Solve for \( \theta_0, \theta_1, ..., \theta_n \).

\( x^{(i)} =
\begin{bmatrix}
x_0^{(i)}
\\x_1^{(i)}
\\\vdots
\\x_n^{(i)}
\end{bmatrix}_{(n+1)*1} \in \mathbb{R}^{n+1},
X = \begin{bmatrix}
(x^{(1)})^T
\\(x^{(2)})^T
\\\vdots
\\(x^{(m)})^T
\end{bmatrix}_{m*(n+1)},
y =
\begin{bmatrix}
y^{(1)}
\\y^{(2)}
\\\vdots
\\y^{(m)}
\end{bmatrix}_{m*1}
\)
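Writing all of these partial-derivative equations in matrix form and solving them yields the normal equation (a one-line sketch of the derivation):

\( \nabla_{\theta} J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0 \;\Rightarrow\; X^T X \theta = X^T y \)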

\( \theta = (X^TX)^{-1}X^Ty \)
In Octave: pinv(X'*X)*X'*y
Example.


\( X =
\begin{bmatrix}
1 & 2104 & 5 & 1 & 45
\\1 & 1416 & 3 & 2 & 40
\\1 & 1534 & 3 & 2 & 30
\\1 & 852 & 2 & 1 & 36
\end{bmatrix}_{m*(n+1)}, y =
\begin{bmatrix}
460
\\232
\\315
\\178
\end{bmatrix}_{m*1}
\)
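A sketch of solving this example in Octave with the normal equation (no feature scaling is required here):

X = [1 2104 5 1 45; 1 1416 3 2 40; 1 1534 3 2 30; 1 852 2 1 36];
y = [460; 232; 315; 178];
theta = pinv(X' * X) * X' * y;    % theta = (X^T X)^(-1) X^T y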

Comparison of gradient descent and normal equation


Gradient Descent | Normal Equation
Need to choose \( \alpha \) | No need to choose \( \alpha \)
Needs many iterations | No need to iterate
\( O(kn^2) \) | \( O(n^3) \), need to calculate the inverse of \( X^TX \)
Works well when n is large | Slow if n is very large (e.g. \( n \geqslant 10000 \))

Normal Equation Noninvertibility


\( \theta = (X^TX)^{-1}X^Ty \)

If \( X^TX \) is noninvertible, the likely causes are:

1. Some features are linearly dependent (redundant features).

2. There are too many features (e.g. \( n \geqslant m \)).

In Octave, both inv and pinv can compute the inverse of an invertible matrix; when \( X^TX \) is noninvertible, using pinv (the pseudoinverse) still yields a usable \( \theta \).
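A small illustration with made-up data in which two features are linearly dependent, so \( X^TX \) is singular:

X = [1 1 2; 1 2 4; 1 3 6];        % third column is 2 * second column, so X'*X is singular
y = [1; 2; 3];
theta = pinv(X' * X) * X' * y;    % pinv still returns a usable theta
% inv(X' * X) would instead warn that the matrix is singular.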

Further Reading


[Coursera] Machine Learning: Multivariate Linear Regression, by Andrew Ng, Stanford University