[Study Notes] Machine Learning: Anomaly Detection

Anomaly detection finds which points in a data set are outliers (far from the mean). Applications include finding defects and irregularities in a production process and detecting credit card fraud.

Problem Motivation




Dataset: \(\{ x^{(1)}, x^{(2)}, ..., x^{(m)}\}\)

Assume the examples in the dataset are normal (non-anomalous) examples.

For a new example \(x_{test}\), we want to know whether it is anomalous.

The approach is to build a model \(p(x)\) from the dataset:
\(p(x) \lt \epsilon \rightarrow \) anomaly

\(p(x) \geq \epsilon \rightarrow \) OK

Applications of anomaly detection:
Fraud detection:
\(x^{(i)}\): features of user \(i\)'s activities

Model \(p(x)\) from dataset.

Identify unusual users by checking which have \(p(x) \lt \epsilon\)
Monitoring computers in a data center.
\(x^{(i)}\): features of machine \(i\)

\(x_1\): memory use

\(x_2\): number of disk accesses per second

\(x_3\): CPU load

\(x_4\): network traffic

Check whether a computer is in an abnormal state (it may have crashed) and alert the system administrator to investigate further.
Manufacturing
Check whether a finished product is anomalous and needs further inspection.

Gaussian (normal) distribution


\(x \in \mathbb{R}\)

If \(x\) is distributed Gaussian with mean \(\mu\), variance \(\sigma^2\), this can be denoted as \( x \sim \mathcal{N}(\mu,\,\sigma^{2}) \).
\(\mathcal{N}\): normal

\(\sim\): distributed as

\(\sigma\): standard deviation
\(p(x; \mu , \sigma^{2}) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^{2}}}\)
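As a sketch, the density above can be written directly in Python (standard library only; the function name `gaussian_pdf` is my own):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The density peaks at x = mu with height 1 / (sqrt(2*pi) * sigma).
print(gaussian_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989
```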



Parameter estimation
Given a dataset: \( \{ x^{(1)}, x^{(2)}, ..., x^{(m)} \} \), \(x^{(i)} \in \mathbb{R}\)

Assume \(x^{(i)} \sim \mathcal{N}(\mu,\,\sigma^{2}) \).

How to estimate \(\mu\) and \(\sigma^{2}\)?

\( \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)} \)

\( \sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2 \)
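The two estimates above are a few lines of Python (the function name `estimate_params` is my own):

```python
def estimate_params(xs):
    """Maximum-likelihood estimates of mu and sigma^2 for a 1-D dataset."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m  # note: 1/m, not 1/(m-1)
    return mu, sigma2

mu, sigma2 = estimate_params([1.0, 2.0, 3.0])
print(mu, sigma2)  # 2.0 and 2/3
```

Note the \(\frac{1}{m}\) factor: this is the maximum-likelihood estimate, not the unbiased \(\frac{1}{m-1}\) sample variance; for large \(m\) the difference is negligible.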

Anomaly Detection Algorithm


Given a training set: \( \{ x^{(1)}, x^{(2)}, ..., x^{(m)} \} \)
\(x^{(i)} \in \mathbb{R}^{n}\)
Assume:
\(x_1 \sim \mathcal{N}(\mu_1,\,\sigma_1^{2}) \).

\(x_2 \sim \mathcal{N}(\mu_2,\,\sigma_2^{2}) \).

...

\(x_n \sim \mathcal{N}(\mu_n,\,\sigma_n^{2}) \).
Density estimation:
\( p(x) \\
= p(x_1; \mu_1, \sigma_1^2) p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) \\
= \prod_{j=1}^{n}p(x_j; \mu_j, \sigma_j^2) \)
Anomaly detection algorithm:
1. Choose features \(x_j\) that might be indicative of anomalous examples.

2. Fit parameters \(\mu_1, \mu_2, ..., \mu_n, \sigma_1^2, \sigma_2^2, ..., \sigma_n^2\)
\( \mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)} \)

\( \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)}-\mu_j)^2 \)
3. Given a new example \(x\), compute \(p(x)\):
\( p(x) = \prod_{j=1}^{n}p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_j}e^{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}} \)

Anomaly if \(p(x) \lt \epsilon\)
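The three steps above can be sketched in Python (standard library only; `fit`, `p`, and `is_anomaly` are my own names):

```python
import math

def fit(X):
    """Step 2: estimate per-feature mu_j and sigma_j^2 from a training set X
    (a list of feature vectors, all assumed normal)."""
    m, n = len(X), len(X[0])
    mus = [sum(x[j] for x in X) / m for j in range(n)]
    sigma2s = [sum((x[j] - mus[j]) ** 2 for x in X) / m for j in range(n)]
    return mus, sigma2s

def p(x, mus, sigma2s):
    """Step 3: p(x) = product over features of the per-feature Gaussian density."""
    prob = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        prob *= math.exp(-(xj - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return prob

def is_anomaly(x, mus, sigma2s, epsilon):
    """Flag x as an anomaly if p(x) < epsilon."""
    return p(x, mus, sigma2s) < epsilon

# Toy training set clustered near the origin; a faraway point gets flagged.
X = [[0.1, -0.2], [-0.1, 0.0], [0.0, 0.2], [0.0, 0.0]]
mus, sigma2s = fit(X)
print(is_anomaly([5.0, 5.0], mus, sigma2s, 1e-3))  # True
print(is_anomaly([0.0, 0.0], mus, sigma2s, 1e-3))  # False
```

In practice the product of many densities underflows quickly, so implementations often sum \(\log p(x_j)\) instead of multiplying \(p(x_j)\).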

Developing and evaluating an anomaly detection system


Assume we have some labeled data, of anomalous and non-anomalous examples.
\( y = 0 \) if normal

\( y = 1 \) if anomalous
Training set: \( x^{(1)}, x^{(2)}, ..., x^{(m)} \)

Cross validation set: \( (x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), ..., (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})}) \)

Test set: \( (x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), ..., (x_{test}^{(m_{test})}, y_{test}^{(m_{test})}) \)

Example:
10000 good (normal) engines

20 flawed engines (anomalous)

\(\rightarrow\)

Training set: 6000 good engines (\(y=0\))

Cross validation set: 2000 good engines (\(y=0\)), 10 anomalous (\(y=1\))

Test set: 2000 good engines (\(y=0\)), 10 anomalous (\(y=1\))
Algorithm evaluation:
Fit model \(p(x)\) on training set \( \{ x^{(1)}, x^{(2)}, ..., x^{(m)} \} \) (unlabeled, assumed-normal examples)

On a cross validation set / test set example \(x\), predict
\(y=\begin{cases}
1, \quad \text{if } p(x) \lt \epsilon \quad \text{(anomaly)} \\
0, \quad \text{if } p(x) \geq \epsilon \quad \text{(normal)}
\end{cases}\)
Because the data is very skewed (anomalous examples are relatively rare), classification accuracy is not a suitable evaluation metric.
For example, a model that always predicts normal already achieves very high accuracy.
Possible evaluation metrics:
True positive, false positive, false negative, true negative

Precision / Recall

\(F_1\)-score
We can also use the cross validation set to choose a suitable \(\epsilon\).
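Combining the last two points, here is a sketch of choosing \(\epsilon\) by maximizing \(F_1\) on the cross validation set (pure Python; `f1_score` and `select_epsilon` are my own names):

```python
def f1_score(y_true, y_pred):
    """F1 from predicted labels, treating y=1 (anomaly) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, candidates):
    """Pick the epsilon (from a list of candidates) that maximizes F1
    on the cross validation set, given p(x) for each CV example."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        f1 = f1_score(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Three normal CV examples with large p(x), two anomalies with tiny p(x):
p_cv = [0.9, 0.8, 0.7, 0.001, 0.002]
y_cv = [0, 0, 0, 1, 1]
print(select_epsilon(p_cv, y_cv, [0.01, 0.75]))  # (0.01, 1.0)
```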

Anomaly detection vs. supervised learning



Anomaly detection:
Very small number of positive examples (\(y=1\)); 0-20 is common.

Large number of negative examples (\(y=0\)).

There are many different types of anomalies, so it is hard for an algorithm to learn from so few positive examples what anomalies look like.

Future anomalies may look nothing like any of the anomalies seen so far.

Example applications: fraud detection, manufacturing, monitoring machines in a data center.

Supervised learning:
Large number of positive and negative examples.

There are enough positive examples for the algorithm to learn what positive examples look like.

Future positive examples are likely to be similar to the ones in the training set.

Example applications: email spam classification, weather prediction (sunny/rainy...), cancer classification.

Choosing what features to use


Look at the histogram of each feature; if its shape does not look Gaussian, try transforming the feature so that it does.
Example: \( x_1 \leftarrow \log(x_1) \), \( x_2 \leftarrow \log(x_2 + 1)\), \( x_3 \leftarrow \sqrt{x_3}\), \( x_4 \leftarrow x_4^{\frac{1}{3}}\)

In Octave, if x is a vector, hist(x, 50) plots a histogram of \(x\) with 50 bins.

hist(x.^0.5, 50) plots a histogram of \(x^{0.5}\).
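The example transforms above, written out in Python (standard library only; `transform_features` is a made-up helper):

```python
import math

def transform_features(x1, x2, x3, x4):
    """Transforms that often make skewed, non-negative features look more
    Gaussian. log requires x > 0; log(x + 1) is safe at x = 0."""
    return (math.log(x1),       # x1 <- log(x1)
            math.log(x2 + 1),   # x2 <- log(x2 + 1)
            math.sqrt(x3),      # x3 <- sqrt(x3)
            x4 ** (1 / 3))      # x4 <- cube root of x4
```

A good workflow is to re-plot the histogram after each candidate transform and keep the one whose shape looks most Gaussian.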
Error analysis for anomaly detection
We want:
\(p(x)\) large for normal examples \(x\).

\(p(x)\) small for anomalous examples \(x\).
Most common problem:
\(p(x)\) is comparable (say, both large) for normal and anomalous examples.
Approach:
Choose features that can distinguish anomalous examples from normal examples by analyzing characteristics of anomalous examples.
Example: Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly.

\(x_1\): memory use of computer

\(x_2\): number of disk accesses per sec

\(x_3\): CPU load

\(x_4\): network traffic

\(x_5\): \(\frac{\text{CPU load}}{\text{network traffic}}\) can catch the anomaly where CPU load is high but network traffic is low, for example, a CPU stuck in an infinite loop.
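A tiny illustration, with made-up numbers, of why the ratio feature helps:

```python
# Hypothetical readings (made-up numbers): (name, cpu_load, network_traffic)
machines = [
    ("healthy-1", 0.60, 0.55),
    ("healthy-2", 0.30, 0.25),
    ("stuck-cpu", 0.95, 0.01),  # high CPU, almost no traffic: infinite loop?
]

for name, cpu, net in machines:
    x5 = cpu / net  # new feature: CPU load / network traffic
    print(name, round(x5, 2))
# The stuck machine's x5 (about 95) stands out clearly, whereas on cpu
# alone (0.95 vs 0.60) it would be much harder to separate.
```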

Further Reading


[Coursera] Machine Learning: Anomaly Detection, by Andrew Ng, Stanford University