[Study Notes] Machine Learning: Recommender Systems

Many online shopping sites automatically recommend products we might be interested in. This is an application of recommender systems, which derive these recommendations automatically from the associations between users and products.

Take predicting users' ratings of movies as an example. In the table below, known user ratings are marked on a 0-5 scale and unrated entries are marked ?. We want to predict what ratings users would give to the movies they have not yet rated.



Content based recommendations


Problem formulation
\( r(i, j)\): 1 if user \(j\) has rated movie \(i\) (0 otherwise)

\( y^{(i, j)}\): rating given by user \(j\) to movie \(i\) (defined only if \(r(i, j) = 1\))

\( \theta^{(j)} \): parameter vector for user \(j\)

\( x^{(i)} \): feature vector for movie \(i\)

\( n_u \): number of users

\( n_m \): number of movies

\(m^{(j)}\): number of movies rated by user \(j\)

For user \(j\) and movie \(i\), the predicted rating is \( (\theta^{(j)})^T x^{(i)} \).
For each (user \(j\), movie \(i\)) pair, the rating is predicted with linear regression.
Assuming we know the features \( x^{(i)} \) of every movie, we want to learn each user's \( \theta^{(j)} \).
Using the table above as an example, we could design 2 features: romance and action.

The first 3 movies correlate strongly with romance; the last 2 correlate strongly with action.

The first 2 users give high ratings to romance movies; the last 2 give high ratings to action movies.
Optimization algorithm:
To learn \(\theta^{(j)}\) (the parameters for user \(j\)) given \( x^{(1)}, x^{(2)}, ..., x^{(n_m)} \):
minimize cost function: \(\frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{k=1}^{n}(\theta_k^{(j)})^2 \)
To learn \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\) given \( x^{(1)}, x^{(2)}, ..., x^{(n_m)} \):
minimize cost function: \(\frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 \)
Use gradient descent to find the \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\) that minimize the cost function:
\(\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y ^{(i,j)}) x_k^{(i)} \quad (\text{for } k = 0) \)

\(\theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y ^{(i,j)}) x_k^{(i)} + \lambda \theta_k^{(j)}) \quad (\text{for } k \neq 0) \)
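The per-user updates above can be sketched in NumPy. This is a minimal illustration, not the course's reference implementation; the function name, hyperparameters, and the toy data in the usage below are assumptions.

```python
import numpy as np

def learn_user_params(X, y, rated, alpha=0.01, lam=1.0, iters=1000):
    """Learn one user's parameter vector theta by gradient descent.

    X     : (n_m, n+1) movie feature matrix, first column all ones (x_0 = 1)
    y     : (n_m,) ratings this user gave (only entries with rated=True are used)
    rated : (n_m,) boolean mask, True where r(i, j) = 1
    """
    theta = np.zeros(X.shape[1])
    Xr, yr = X[rated], y[rated]          # keep only movies this user rated
    for _ in range(iters):
        err = Xr @ theta - yr            # (theta^(j))^T x^(i) - y^(i,j)
        grad = Xr.T @ err                # sum over rated movies
        grad[1:] += lam * theta[1:]      # regularize every component except theta_0
        theta -= alpha * grad
    return theta
```

For instance, a user who rates the three romance movies highly and the two action movies 0 ends up with a theta whose predicted ratings favor romance features.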

Collaborative filtering


Content-based recommendations assume we know the features \( x^{(i)} \) of every movie and learn each user's \( \theta^{(j)} \) from them.

But we do not necessarily know each movie's features \( x^{(i)} \). An alternative approach is to assume we know each user's \( \theta^{(j)} \) and learn the movie features \( x^{(i)} \) from that.

Optimization algorithm:
To learn \( x^{(i)} \) given \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)} \):
minimize cost function: \(\frac{1}{2} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y ^ {(i,j)})^2 + \frac{\lambda}{2}\sum_{k=1}^{n} (x_k^{(i)})^2 \)
To learn \( x^{(1)}, x^{(2)}, ..., x^{(n_m)} \) given \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)} \):
minimize cost function: \(\frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y ^ {(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 \)
Collaborative filtering:
Combine the two approaches:
Given \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}\), estimate \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\)

Given \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\), estimate \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}\)
First guess a set of \(\theta\), use it to estimate a set of \(x\), use that to re-estimate \(\theta\), and so on, until the estimates converge.
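This back-and-forth can be sketched as follows. Since each half of the alternation (all the \(\theta^{(j)}\) given \(x\), or all the \(x^{(i)}\) given \(\theta\)) decomposes into independent regularized linear regressions, this sketch solves each one in closed form; the function names, hyperparameters, and the small ratings matrix in the test are illustrative assumptions.

```python
import numpy as np

def ridge_rows(F, Y, R, lam):
    """One half of the alternation: for each column j of Y, fit a ridge
    regression of the observed ratings Y[rated, j] on the rows of F.

    F: (m, n) known vectors; Y: (m, p) ratings; R: (m, p) 0/1 rated mask.
    Returns W: (p, n) so that F @ W.T approximates Y on rated entries.
    """
    n = F.shape[1]
    W = np.zeros((Y.shape[1], n))
    for j in range(Y.shape[1]):
        rated = R[:, j].astype(bool)
        A = F[rated].T @ F[rated] + lam * np.eye(n)   # always invertible
        W[j] = np.linalg.solve(A, F[rated].T @ Y[rated, j])
    return W

def alternate(Y, R, n_features=2, lam=0.1, rounds=20, seed=0):
    """Guess Theta, estimate X from it, re-estimate Theta, ... repeatedly.

    Y: (n_m, n_u) ratings; R: (n_m, n_u) rated mask.
    """
    rng = np.random.default_rng(seed)
    Theta = rng.normal(scale=0.5, size=(Y.shape[1], n_features))  # initial guess
    for _ in range(rounds):
        X = ridge_rows(Theta, Y.T, R.T, lam)   # given Theta, estimate movie features
        Theta = ridge_rows(X, Y, R, lam)       # given X, estimate user parameters
    return X, Theta
```

After a few rounds, `X @ Theta.T` reproduces the observed ratings closely.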

Collaborative filtering algorithm


Optimization objective:
Given \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}\), we can estimate \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\):
minimize cost function \( J(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}) = \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 \)
Given \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\), we can estimate \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}\):
minimize cost function \( J(x^{(1)}, x^{(2)}, ..., x^{(n_m)}) = \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y ^ {(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 \)
Estimating \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}\) and \(\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\) simultaneously:
minimize cost function \( J(x^{(1)}, x^{(2)}, ..., x^{(n_m)}, \theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}(\theta_k^{(j)})^2 \)
Collaborative filtering algorithm:
1. Initialize \(x^{(1)}, x^{(2)}, ..., x^{(n_m)}, \theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}\) to small random values.

2. Minimize \( J(x^{(1)}, x^{(2)}, ..., x^{(n_m)}, \theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_u)}) \) using gradient descent (or an advanced optimization algorithm)
\( x_k^{(i)} := x_k^{(i)} - \alpha (\sum_{j:r(i,j)=1}((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)} + \lambda x_k^{(i)}) \)

\( \theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum_{i:r(i,j)=1}((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)} + \lambda \theta_k^{(j)}) \)

for every \( j=1, 2, ..., n_u \) and \( i=1, 2, ..., n_m \)
3. For a user with parameters \(\theta\) and a movie with (learned) features \(x\), predict a rating of \(\theta^T x\).
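The three steps above can be sketched in NumPy. This is a minimal illustration; the learning rate, iteration count, random-init scale, and the toy ratings matrix in the test are assumptions.

```python
import numpy as np

def collaborative_filtering(Y, R, n_features=2, alpha=0.005, lam=0.1,
                            iters=3000, seed=0):
    """Steps 1-2: random init, then gradient descent on the joint cost J(X, Theta).

    Y: (n_m, n_u) ratings; R: (n_m, n_u) with R[i, j] = 1 if user j rated movie i.
    No x_0 = 1 bias feature here: every feature is learned, so all are regularized.
    """
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    X = rng.normal(scale=0.1, size=(n_m, n_features))       # step 1
    Theta = rng.normal(scale=0.1, size=(n_u, n_features))
    for _ in range(iters):                                   # step 2
        E = (X @ Theta.T - Y) * R         # errors, zeroed on unrated entries
        X_grad = E @ Theta + lam * X      # gradient w.r.t. each x^(i)
        Theta_grad = E.T @ X + lam * Theta  # gradient w.r.t. each theta^(j)
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta
```

Step 3 is then a single matrix product: the predicted rating of user \(j\) for movie \(i\) is `(X @ Theta.T)[i, j]`.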

Vectorization: Low rank matrix factorization


The matrix of predicted ratings

\( \begin{bmatrix}
(\theta^{(1)})^T x^{(1)} & (\theta^{(2)})^T x^{(1)} & \cdots & (\theta^{(n_u)})^T x^{(1)} \\
(\theta^{(1)})^T x^{(2)} & (\theta^{(2)})^T x^{(2)} & \cdots & (\theta^{(n_u)})^T x^{(2)} \\
\vdots & & & \vdots \\
(\theta^{(1)})^T x^{(n_m)} & (\theta^{(2)})^T x^{(n_m)} & \cdots & (\theta^{(n_u)})^T x^{(n_m)} \\
\end{bmatrix} \)

= \( X \Theta^T \)

where:
\( X = \begin{bmatrix}
(x^{(1)})^T \\
(x^{(2)})^T \\
... \\
(x^{(n_m)})^T
\end{bmatrix} \)

\( \Theta = \begin{bmatrix}
(\theta^{(1)})^T \\
(\theta^{(2)})^T \\
... \\
(\theta^{(n_u)})^T
\end{bmatrix} \)
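In code, this factorization computes every predicted rating in one matrix product. The small `X` and `Theta` below are hypothetical learned values for illustration (3 movies, 2 users, 2 features):

```python
import numpy as np

X = np.array([[0.9, 0.1],      # rows are movie feature vectors (x^(i))^T
              [1.0, 0.0],
              [0.1, 1.0]])
Theta = np.array([[5.0, 0.0],  # rows are user parameter vectors (theta^(j))^T
                  [0.0, 5.0]])

P = X @ Theta.T  # (n_m, n_u); entry (i, j) is (theta^(j))^T x^(i)
# P[0, 0] = 0.9*5.0 + 0.1*0.0 = 4.5
```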

Finding related products


Shopping sites often recommend related products to the user.

With the collaborative filtering algorithm, we can learn a feature vector \(x^{(i)} \in \mathbb{R}^n\) for each product \(i\).

How do we find the products \(j\) related to product \(i\)?

For example, to find the 5 products most related to product \(i\):
find the 5 products \(j\) with the smallest \( ||x^{(i)}-x^{(j)}|| \) (the smallest feature difference from product \(i\)).
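A minimal sketch of this lookup (the function name and the example feature matrix are illustrative):

```python
import numpy as np

def most_related(X, i, k=5):
    """Indices of the k products whose learned feature vectors are closest
    to product i, i.e. with the smallest ||x^(i) - x^(j)||."""
    d = np.linalg.norm(X - X[i], axis=1)  # distance from product i to each product
    d[i] = np.inf                         # exclude product i itself
    return np.argsort(d)[:k]
```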

Further reading


[Coursera] Machine Learning: Recommender Systems, by Andrew Ng, Stanford University