[Study Notes] Machine Learning: Dimensionality Reduction

Dimensionality reduction is an unsupervised learning problem.

Dimensionality reduction can be applied to data compression (data compression).
Data compression not only helps use less RAM or disk space, it also helps speed up the learning algorithm.


In the figure above, features \(x_1\) and \(x_2\) are highly correlated, so they can be combined into a single feature.



In the figure above, each \(x^{(i)}\) is projected onto the blue line, reducing the data from 2 dimensions to 1 dimension.

\(x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R} \)

\(x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R} \)

...

\(x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R} \)
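A minimal Octave sketch of this projection idea, using hypothetical data X and a hand-picked unit vector u standing in for the blue line:

X = [2 3; 4 5; 6 7];      % hypothetical m x 2 data, one example x^(i) per row
u = [1; 1] / sqrt(2);     % unit vector for the projection direction (the "blue line")
z = X * u;                % m x 1 vector: z(i) is the 1-D coordinate of x^(i) along u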

Dimensionality reduction can also be applied to data visualization.
Data visualization helps us understand the data better, which in turn helps us choose an effective learning algorithm for it.

For example, when there are 50 features, it is hard to visualize the data, and therefore hard to understand it. Applying dimensionality reduction to cut the number of features down to 2 or 3 makes it possible to visualize the data with a 2D or 3D plot.
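As a rough Octave sketch, assuming Z is a hypothetical \(m \times 2\) matrix of reduced examples produced by PCA (described later), one example per row:

scatter(Z(:, 1), Z(:, 2));   % 2D plot of the reduced data
xlabel('z_1');               % first reduced feature
ylabel('z_2');               % second reduced feature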

Principal Component Analysis


Problem formulation:


Reduce from 2 dimensions to 1 dimension: Find a vector \(u^{(1)} \in \mathbb{R}^n\) onto which to project the data so as to minimize the projection error.

Reduce from \(n\) dimensions to \(k\) dimensions: Find \(k\) vectors \(u^{(1)}, u^{(2)}, ..., u^{(k)}\) onto which to project the data so as to minimize the projection error.
PCA is not linear regression.


Linear Regression: predict \(y\) from \(x^{(i)} \in \mathbb{R}^n\). It minimizes the vertical (squared) distance between the actual \(y^{(i)}\) and the predicted value.

PCA: reduce the number of features from \(n\) to \(k\). It minimizes the orthogonal projection error, and there is no special variable \(y\) to predict.
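Written as objectives, the contrast looks roughly like this (a sketch using the standard formulations, not taken verbatim from the course):

\( \text{Linear regression: } \min_{\theta} \frac{1}{m}\sum_{i=1}^{m}(\theta^T x^{(i)} - y^{(i)})^2 \)

\( \text{PCA: } \min_{u^{(1)}, ..., u^{(k)}} \frac{1}{m}\sum_{i=1}^{m}||x^{(i)} - x_{approx}^{(i)}||^2 \)

where \(x_{approx}^{(i)}\) is the projection of \(x^{(i)}\) onto the subspace spanned by \(u^{(1)}, ..., u^{(k)}\).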

Principal Component Analysis algorithm


Target: reduce the data from \(n\) dimensions to \(k\) dimensions

Data preprocessing:
Training set: \(x^{(1)}, x^{(2)}, ..., x^{(m)}\)

Preprocessing (feature scaling/mean normalization):
\(\mu_j\): mean of all the values for feature j
\(\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)} \)
\(s_j\): standard deviation or (max - min) of all the values for feature j

Replace each \(x_j^{(i)}\) with \(\frac{x_j^{(i)}-\mu_j}{s_j}\).
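A vectorized Octave sketch of this preprocessing step, assuming the training examples are stored as rows of an \(m \times n\) matrix X (a hypothetical variable name):

mu = mean(X);               % 1 x n row vector of feature means mu_j
s = std(X);                 % 1 x n row vector of standard deviations s_j
X_norm = (X - mu) ./ s;     % mean-normalize and scale every feature (uses broadcasting)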
Compute "co-variance" matrix:
\( Sigma = \frac{1}{m}\sum_{i=1}^{n}(x^{(i)})(x^{(i)})^T \)
\( x^{(i)} \): \(n\)x\(1\) matrix

\( (x^{(i)})^T \): \(1\)x\(n\) matrix

\( Sigma \): \(n\)x\(n\) matrix
Compute "eigenvectors" \(u^{(1)}, u^{(2)}, ..., u^{(n)}\) of matrix \(Sigma\):
Octave implementation: [U, S, V] = svd(Sigma);
\( U = [u^{(1)} \: u^{(2)} \: ... \: u^{(n)}] \in \mathbb{R}^{n \times n} \)
\(U_{reduce}\) = first \(k\) "eigenvectors".
Octave implementation: Ureduce = U(:,1:k);
\( U_{reduce} = [u^{(1)} \: u^{(2)} \: ... \: u^{(k)}] \in \mathbb{R}^{n \times k} \)
Project the data from \(n\) dimensions to \(k\) dimensions:
Octave implementation: z = Ureduce'*x;
\( z^{(i)} = [u^{(1)} \: u^{(2)} \: ... \: u^{(k)}]^T x^{(i)} = \begin{bmatrix} (u^{(1)})^T \\ (u^{(2)})^T \\ \vdots \\ (u^{(k)})^T \end{bmatrix} x^{(i)} \)

\( [u^{(1)} \: u^{(2)} \: ... \: u^{(k)}] \): \(n \times k\) matrix

\( [u^{(1)} \: u^{(2)} \: ... \: u^{(k)}]^T \): \(k \times n\) matrix

\( x^{(i)} \): \(n \times 1\) matrix

\( z^{(i)} \): \(k \times 1\) matrix

Note: \(z_j = (u^{(j)})^T x\)
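Putting the steps above together, a minimal Octave sketch (assuming X_norm is the \(m \times n\) matrix of preprocessed examples from the previous step, one example per row, and k is the target number of dimensions, chosen beforehand). The vectorized covariance (1/m) X'X equals the per-example sum when the examples are the rows of X_norm:

m = size(X_norm, 1);                    % number of training examples
Sigma = (1 / m) * (X_norm' * X_norm);   % n x n covariance matrix (vectorized form of the sum)
[U, S, V] = svd(Sigma);                 % columns of U are the eigenvectors u^(1), ..., u^(n)
Ureduce = U(:, 1:k);                    % n x k matrix of the first k eigenvectors
Z = X_norm * Ureduce;                   % m x k; row i is (z^(i))^T = (x^(i))^T * Ureduce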

Reconstruction from compressed representation




PCA (project from \(n\) dimensions to \(k\) dimensions): \(z^{(i)} = U_{reduce}^T x^{(i)} \)

Reconstruction (from \(k\) dimensions back to \(n\) dimensions): \(x_{approx}^{(i)} = U_{reduce} z^{(i)} \)
\( U_{reduce}: n \times k \) matrix

\( z^{(i)}: k \times 1 \) matrix

\(x_{approx}^{(i)}: n \times 1 \) matrix
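Continuing the Octave sketch above (with Z and Ureduce as computed earlier), the reconstruction for all examples at once would be:

X_approx = Z * Ureduce';   % m x n; row i is (x_approx^(i))^T = (z^(i))^T * Ureduce'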

Choosing the number of principal components


Average squared projection error: \(\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}-x_{approx}^{(i)}||^2 \)

Total variation in the data: \( \frac{1}{m}\sum_{i=1}^{m}||x^{(i)}||^2 \)

Choose the smallest \(k\) (the number of principal components) such that \( \frac{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}-x_{approx}^{(i)}||^2}{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}||^2} \leq 0.01 \sim 0.15\).

This retains 85% ~ 99% of the variance.

Typically, one chooses \(k\) so that 99% of the variance is retained.

Octave implementation:
[U, S, V] = svd(Sigma);

\(1 - \frac{\sum_{i=1}^{k}s_{ii}}{\sum_{i=1}^{n}s_{ii}} = \frac{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}-x_{approx}^{(i)}||^2}{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}||^2} \)

The svd function only needs to be called once; then increase \(k\) step by step until the condition is satisfied.
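A short Octave sketch of this procedure, using the diagonal of the matrix S from the single svd call (variable names are assumptions):

sv = diag(S);                      % the diagonal entries s_11, ..., s_nn of S
retained = cumsum(sv) / sum(sv);   % retained(k) = fraction of variance kept with k components
k = find(retained >= 0.99, 1);     % smallest k that retains at least 99% of the variance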

Advice for applying PCA


Speeding up supervised learning
Training set: \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)}) \)
For example, if the input is a 100 pixel x 100 pixel image, then \(x^{(i)} \in \mathbb{R}^{10000}\)
Extract input:
Apply PCA on the unlabeled dataset: \(x^{(1)}, x^{(2)}, ..., x^{(m)} \in \mathbb{R}^{10000} \rightarrow z^{(1)}, z^{(2)}, ..., z^{(m)} \in \mathbb{R}^{1000}\)
New training set: \((z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), ..., (z^{(m)}, y^{(m)}) \)

Note: Mapping \(x^{(i)} \rightarrow z^{(i)}\) should be defined by running PCA only on the training set (not on cross validation and test sets). This mapping can be applied as well to the examples \(x_{cv}^{(i)}\) and \(x_{test}^{(i)}\) in the cross validation and test sets.
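For example, a sketch in Octave (X_train, X_cv, and X_test are hypothetical matrices with one example per row; mu, s, and Ureduce are learned from the training set only and then reused):

mu = mean(X_train);                                               % means from the training set
s = std(X_train);                                                 % standard deviations from the training set
X_train_norm = (X_train - mu) ./ s;
Sigma = (1 / size(X_train, 1)) * (X_train_norm' * X_train_norm);  % covariance of the training set
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);                                              % k chosen beforehand
% apply the same mu, s, and Ureduce to the cross validation and test sets
Z_train = X_train_norm * Ureduce;
Z_cv    = ((X_cv   - mu) ./ s) * Ureduce;
Z_test  = ((X_test - mu) ./ s) * Ureduce;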
Do not use PCA to prevent overfitting
Using \(z^{(i)}\) instead of \(x^{(i)}\) reduces the number of features from \(n\) to \(k\) (\( k < n\)).

With fewer features, the model is less likely to overfit.

This might work, but it is not a good way to deal with overfitting; regularization is the better choice. PCA throws away information without looking at the labels \(y\), so it may discard some important features.
Do not treat PCA as a step that must always be included when designing a machine learning system.
Consider applying PCA only at the optimization stage, and observe how applying PCA affects accuracy.

Do not apply PCA right from the start just to reduce the number of features; doing so may throw away some important features.

Further Reading


[Coursera] Machine Learning: Dimensionality Reduction, by Andrew Ng, Stanford University