Machine Learning Lecture 07 Kernels

2022/04/04 21:58 posted in Stanford CS229
Tags:  #Stanford CS229


  • optimization problem
  • representer theorem
  • kernels
  • example of kernels

In the optimal margin classifier, we want to find the decision boundary, parameterized by w and b, that maximizes the geometric margin.

\[\begin{aligned} \max _{\gamma, w, b} \ & \gamma \\ \text { s.t. } & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq \gamma, \quad i=1, \ldots, n \\ & \|w\|=1 \end{aligned} \]

Because rescaling \((w, b)\) does not change the decision boundary, we are free to fix the scaling so that \(\|w\| = \frac 1 \gamma\) (equivalently, the functional margin is 1). Maximizing \(\gamma = \frac{1}{\|w\|}\) is then the same as minimizing \(\frac 1 2 \|w\|^2\):

\[\begin{aligned} \min _{w, b} & \frac{1}{2}\|w\|^{2} \\ \text { s.t. } & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq 1, \quad i=1, \ldots, n \end{aligned} \]
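The scale invariance that justifies this rescaling is easy to check: multiplying \((w, b)\) by any positive constant leaves the classifier's predictions unchanged. A minimal numpy sketch (the toy numbers are arbitrary, chosen only for illustration):

```python
import numpy as np

# a toy separator and a few points; the values are arbitrary
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.2], [-0.5, 1.0], [0.3, -2.0]])

c = 7.3                                              # any positive rescaling constant
pred_original = np.sign(X @ w + b)
pred_rescaled = np.sign(X @ (c * w) + c * b)

print(np.array_equal(pred_original, pred_rescaled))  # True: same decision boundary
```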

What we assume is

\[w = \sum^n_{i=1} \alpha_i y^{(i)} x^{(i)} \]

That is, w can be expressed as a linear combination of the training examples, where the labels satisfy \(y^{(i)} = \pm1\).

The representer theorem proves that we can make this assumption without losing any performance: the optimal w always lies in the span of the training examples.

Intuition 2 of the representer theorem

w sets the direction of the decision boundary (it is normal to the separating hyperplane), and changing b moves the decision boundary back and forth along that direction.

Let's assume \(w = \sum^n_{i=1} \alpha_iy^{(i)}x^{(i)}\)

\[\begin{aligned} \min _{w, b} & \frac{1}{2}\|w\|^{2} = \frac 1 2 w^Tw\\ \text { s.t. } & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq 1, \quad i=1, \ldots, n \end{aligned} \]

That is to say, we are going to solve

\[\begin{aligned} & \min _{\alpha, b} \frac 1 2 \left(\sum^n_{i=1} \alpha_iy^{(i)}x^{(i)}\right)^T \left( \sum^n_{j=1} \alpha_jy^{(j)}x^{(j)}\right)\\ = &\min _{\alpha, b} \frac 1 2 \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)}{x^{(i)}}^T x^{(j)} \end{aligned} \]

in which \({x^{(i)}}^T x^{(j)}\) can be written as the inner product \(\left\langle x^{(i)}, x^{(j)}\right\rangle\).

In general, \(\langle x, z\rangle = x^Tz\) denotes the inner product of two vectors x and z.
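As a sanity check, here is a small numpy sketch (with arbitrary toy values) verifying that \(\frac 1 2 w^Tw\) computed from the explicit w equals the double sum written in terms of the Gram matrix of inner products:

```python
import numpy as np

# toy data: rows of X are training examples x^{(i)}, labels y^{(i)} in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1, -1, 1, 1, -1])
alpha = rng.random(5)                  # arbitrary nonnegative coefficients

w = (alpha * y) @ X                    # w = sum_i alpha_i y^{(i)} x^{(i)}

K = X @ X.T                            # Gram matrix: K[i, j] = <x^{(i)}, x^{(j)}>
v = alpha * y
double_sum = v @ K @ v                 # sum_{i,j} alpha_i alpha_j y^{(i)} y^{(j)} <x^{(i)}, x^{(j)}>

print(np.isclose(0.5 * w @ w, 0.5 * double_sum))   # True
```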

The constraint \( y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq 1\) will become

\[\begin{aligned} y^{(i)}\left({\left(\sum^n_{j=1} \alpha_jy^{(j)}x^{(j)}\right)}^{T} x^{(i)}+b\right) \geq 1 \\ y^{(i)}\left(\sum^n_{j=1} \alpha_jy^{(j)}\left\langle x^{(j)}, x^{(i)}\right\rangle+b\right) \geq 1 \end{aligned} \]

Dual Optimization Problem

\[\begin{aligned} \max _{\alpha} & W(\alpha)=\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y^{(i)} y^{(j)} \alpha_{i} \alpha_{j}\left\langle x^{(i)}, x^{(j)}\right\rangle \\ \text { s.t. } & \alpha_{i} \geq 0, \quad i=1, \ldots, n \\ & \sum_{i=1}^{n} \alpha_{i} y^{(i)}=0 \end{aligned} \]

This dual problem can be derived from the primal problem above using Lagrangian duality from convex optimization theory.
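A minimal sketch of solving this dual numerically, assuming a small, linearly separable toy dataset and scipy's generic SLSQP solver (the data values are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# tiny separable toy dataset
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                         # K[i, j] = <x^{(i)}, x^{(j)}>
P = np.outer(y, y) * K              # P[i, j] = y^{(i)} y^{(j)} <x^{(i)}, x^{(j)}>

def neg_dual(alpha):                # negate W(alpha) so we can minimize
    return 0.5 * alpha @ P @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y^{(i)} = 0
               method="SLSQP")
alpha = res.x

# recover the primal solution from the support vectors (alpha_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
print(w, b)
```

In practice the dual is solved with specialized algorithms such as SMO rather than a generic solver, but the optimization problem is the same.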

Kernel Trick

  1. Write the algorithm in terms of inner products \(\left\langle x^{(i)}, x^{(j)}\right\rangle\).
  2. Let there be a mapping from x to \(\phi(x)\).

    x may be 1D or 2D, but \(\phi(x)\) could be 100,000-dimensional or even infinite-dimensional.
  3. Find a way to compute \(K(x,z) = \phi(x)^T\phi(z)\) without ever computing \(\phi(x)\) explicitly.
  4. Replace \(\langle x, z\rangle\) in the algorithm with \(K(x, z)\).

One example of a kernel
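A standard example: take \(x, z \in \mathbb{R}^d\) and the kernel \(K(x, z) = (x^Tz)^2\). Expanding the square shows it is the inner product of feature maps whose coordinates are all pairwise products \(x_i x_j\):

\[K(x, z)=\left(\sum_{i=1}^{d} x_{i} z_{i}\right)\left(\sum_{j=1}^{d} x_{j} z_{j}\right)=\sum_{i=1}^{d} \sum_{j=1}^{d}\left(x_{i} x_{j}\right)\left(z_{i} z_{j}\right)=\phi(x)^{T} \phi(z) \]

where \(\phi(x)\) has \(d^2\) coordinates of the form \(x_i x_j\). Computing \(\phi(x)^T\phi(z)\) explicitly takes \(O(d^2)\) time, but computing \((x^Tz)^2\) takes only \(O(d)\).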


SVM = optimal margin classifier + kernel trick
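As a quick illustration of this combination, here is a sketch using scikit-learn (assuming it is installed; the toy data is arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-class data that is not linearly separable (inside vs. outside the unit circle)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

linear_svm = SVC(kernel="linear").fit(X, y)    # optimal margin classifier only
rbf_svm = SVC(kernel="rbf").fit(X, y)          # + Gaussian (RBF) kernel trick

print(linear_svm.score(X, y), rbf_svm.score(X, y))  # the kernelized SVM fits this data far better
```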

How to make kernels?

  • If x, z are similar, \(K(x,z) = \phi(x)^T\phi(z)\) is large.
    • The inner product of two vectors pointing in similar directions is large.
  • If x, z are dissimilar, \(K(x,z) = \phi(x)^T\phi(z)\) is small.

If we think of the kernel function as a similarity measure, we can design other similarity measures and use them as kernels. One example is the **Gaussian kernel**:

\[K(x, z)=\exp \left(-\frac{\|x-z\|^{2}}{2 \sigma^{2}}\right) \]

  • Linear kernel: \(K(x,z) = x^Tz\), with \(\phi(x)= x\).
  • Gaussian kernel: \(\phi(x)\in \mathbb{R}^\infty\), i.e. the corresponding feature map is infinite-dimensional.
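A minimal numpy sketch of the Gaussian kernel as a similarity measure (the points and \(\sigma\) are arbitrary toy values): nearby points give values near 1, distant points give values near 0.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(gaussian_kernel(x, x + 0.01))   # ~1.0: very similar points
print(gaussian_kernel(x, x + 10.0))   # ~0.0: very dissimilar points
```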