Coursera Machine Learning Week3の学習

に引き続き、Coursera Machine Learning Week3を受講した。

　Week3の講義を受けると次のことを学ぶことができた。

分類問題を解決する一手法であるロジスティック回帰のアルゴリズム
- 予測関数、目的関数、最急降下法まで
機械学習における過学習とは何か
過学習を防ぐための正則化という手法について

　以下講義を受けながら書いたメモを置いておく。

Classification and Representation

Classification

線形回帰を分類問題に単純に適用してみるとうまく分類できない
ロジスティック回帰という手法を使うことで、分類問題を解くことができる

Hypothesis Representation

ロジスティック回帰で、あるxが与えられた時y=1となる確率を求める
ロジスティック回帰の予測関数は次のとおり

$\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}$

Decision Boundary

ロジスティック回帰によって、決定境界を見つけるというイメージ
決定境界を決めるような $\theta$ を得られたら、あとは $\theta^Tx$ を計算することで、yの値を決めることができる

$\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}$

Logistic Regression Model

Cost Function

線形回帰と同一のcost functionを利用してしまうと、局所最適が多数できるような形になり、最急降下法が利用できない
代わりに次のようなcost functionを利用する

$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}$

$\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}$

Simplified Cost Function and Gradient Descent

ロジスティック回帰のcost functionを1行に圧縮すると
$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))$ ]
となり、それをベクター化すると
$\begin{align*} & h = g(X\theta)\newline & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}$
となる。

最急降下法のアルゴリズムは
$\begin{align*}& Repeat \; \lbrace \newline & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \newline & \rbrace\end{align*}$
となり、これもベクター化すると
$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$
となる。

Advanced Optimization

正直あまり理解が進まなかったので、気が向いたらもう一回見る

Solving the Problem of Overfitting

The Problem of Overfitting

Modelのfeatureが少なすぎてうまく線形回帰問題や分類問題を解けない時、underfitting(high bias)と呼ぶ
逆にfeatureが多すぎて、学習として与えたものにはフィットするが、予測に使えないModelになってしまったものをoverfitting(high variance)という
対策としては
- 1. featureを減らす
  - 手動で減らす <- 選択が難しいことが多い
  - model selection algorithmを利用する
- 2. 正則化を行う
  - 特定のパラメータの重みを減らす(?)

Cost Function

正則化を行う際、Cost Functionは以下のようになる。
- $\dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2$
λは正則化パラメータと呼ぶ
これを使うことで不必要なfeatureの重みを減らすことができ、overfitting(過学習)を防げる

Regularized Linear Regression

線形回帰で正則化を行う際の最急降下法の公式。 $\theta_0$ は対象としないため、別の式となっている。

$\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace \end{align*}$

これを変形すると、

$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$

となり、 $\theta_j(1 - \alpha\frac{\lambda}{m})$ の項から、イテレーションごとに $\theta_j$ を減らす処理を繰り返していることが分かる。

正規方程式で正則化を行う場合なら以下のとおり。

$\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}$

Regularized Logistic Regression

ロジスティック回帰にも同様に正則化を入れてみる。ロジスティック回帰のCost Functionに対して、正則化のための項を入れると以下のようになる。

$J(\theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)})) ] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$