SVM:
1) the case where the data is separable
2) non-separable data needs L1 regularization (slack variables)
solving SVM with:
Lagrange dual function for optimization
kernel for feature transformation
SMO for optimization of a multivariable function with constraints (the dual function)
remark:
in the discussion,
$w \in R^D$ instead of $R^{D+1}$; $w_0$ is extracted out as $b$ now.
$y_n \in \{-1, 1\}$ instead of 0 and 1.
separable case:
decision plane: $f(x) = w^T x + b = 0$
classify by the sign of $f(x_n)$ (i.e. $f(x_n) \gtrless 0$)
for a training sample $(x_n, y_n)$:
$y_n f(x_n) > 0$ if the classification is correct; rescaling $w, b$ lets us demand $y_n f(x_n) \ge 1$
margin: 'distance' from a point to the plane,
negative if the point is wrongly classified.
assume separable: all points are correctly classified.
for a point:
$$r_n = \frac{y_n (w^T x_n + b)}{\|w\|}$$
for the classifier:
$$r = \min\{r_n,\ n = 1, \dots, N\}$$
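as a tiny numpy sketch of both definitions (the data, `w`, and `b` here are made-up toy values):

```python
import numpy as np

# toy separable data (made-up values, just to exercise the formulas)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

# per-point geometric margin: r_n = y_n (w^T x_n + b) / ||w||
r_n = y * (X @ w + b) / np.linalg.norm(w)

# margin of the classifier: the worst (smallest) per-point margin
r = r_n.min()
print(r_n, r)
```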
goal: maximize $r$ w.r.t. $w$ => an optimization problem with constraints
transform the problem to make life easier:
minimize $g(w) = \|w\|^2$ subject to $y_n (w^T x_n + b) \ge 1$ => generate the Lagrange dual function to solve it
Lagrangian:
note: the sign of the constraint terms must follow the definition of the Lagrangian problem
maximize $\mathrm{Dual}(\alpha)$ + constraints => the $\alpha_n$ are the Lagrange multipliers.
- $\mathrm{Dual}(\alpha)$ is a lower bound of $g(w)$; to get the minimum of $g(w)$ we instead find the maximum of the dual function
to ensure the maximum equals the minimum (strong duality), we need the KKT conditions
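for reference, the standard hard-margin derivation of this step (textbook form with the conventional $\frac{1}{2}$ factor; filled in here since the notes skip it):

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_n \alpha_n \big[ y_n (w^T x_n + b) - 1 \big], \quad \alpha_n \ge 0$$

setting $\partial L / \partial w = 0$ and $\partial L / \partial b = 0$ gives $w = \sum_n \alpha_n y_n x_n$ and $\sum_n \alpha_n y_n = 0$; substituting back:

$$\mathrm{Dual}(\alpha) = \sum_n \alpha_n - \frac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, x_n^T x_m$$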
in the final function, $x$ appears only through inner products $x_n^T x_m$, so we can substitute them with a kernel function
to solve the dual function + constraints, we use the SMO algorithm: fix all but two of the $\alpha_i$ (the equality constraint $\sum_n \alpha_n y_n = 0$ means a single $\alpha$ cannot move alone), then optimize over the picked pair $(\alpha_1, \alpha_2)$ => after substituting the constraint, a quadratic function of one variable, solvable in closed form (see the sketch below)
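a minimal sketch of the core pair update, following the usual simplified-SMO formulas (the pair-selection heuristics and outer loop are omitted; `K` is assumed to be a precomputed kernel matrix):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C, eps=1e-8):
    """One SMO step: optimize (alpha_i, alpha_j) with all other alphas fixed.
    K is the precomputed kernel (Gram) matrix; returns updated (alpha, b)."""
    # prediction errors E_n = f(x_n) - y_n under the current alpha, b
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # box bounds [L, H] that keep 0 <= alpha <= C and sum_n alpha_n y_n fixed
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature of the 1-D quadratic
    if eta < eps or L >= H:
        return alpha, b                      # no progress possible on this pair

    # unconstrained optimum in alpha_j, clipped to the box, then alpha_i follows
    a_j = float(np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H))
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)

    # recompute b so the KKT conditions hold for the updated pair
    b_i = b - E_i - y[i] * (a_i - alpha[i]) * K[i, i] - y[j] * (a_j - alpha[j]) * K[i, j]
    b_j = b - E_j - y[i] * (a_i - alpha[i]) * K[i, j] - y[j] * (a_j - alpha[j]) * K[j, j]
    alpha[i], alpha[j] = a_i, a_j
    b = b_i if 0 < a_i < C else (b_j if 0 < a_j < C else (b_i + b_j) / 2.0)
    return alpha, b
```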
kernel choice:
can be linear, polynomial, or RBF (Gaussian)
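a quick sketch of these three (the hyperparameters `gamma`, `d`, `c` are assumed knobs, not values from the notes):

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, d=3, c=1.0):
    # degree-d polynomial; c trades off low- vs high-order terms
    return (a @ b + c) ** d

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian/RBF; gamma controls the width
    return np.exp(-gamma * np.sum((a - b) ** 2))
```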
non-separable case:
introduce slack variables $\xi_n$:
$$\xi_n = \begin{cases} 0 & \text{if } y_n (w^T x_n + b) \ge 1 \\ |y_n - (w^T x_n + b)| & \text{if } x_n \text{ is classified wrong or too close to the plane} \end{cases}$$
the idea is that we allow wrong classification, but these points should pay for it, i.e. they should increase the $g(w)$ we are trying to minimize (add a term to it, hence we need to set a weight factor $C$).
as for the correctly classified points, we ignore them, just like in the separable case.
it can also be viewed as an L1 regularization, since this term finally shows up as $C \sum_n \xi_n$. later we will find that the value of $C$ introduces a trade-off between tolerating wrongly classified data and ensuring that most examples have a functional margin of at least 1.
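a minimal sketch of the resulting soft-margin objective (note that with $y_n \in \{-1, 1\}$ the slack above collapses to the hinge form $\max(0,\ 1 - y_n(w^T x_n + b))$):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Soft-margin primal: g(w) plus C times the total slack."""
    # xi_n = max(0, 1 - y_n (w^T x_n + b)): zero for points that are
    # correct with margin >= 1, positive for the ones that must "pay"
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w @ w + C * xi.sum()
```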
maximize margin
we perform the same steps as in the separable case and finally get the same dual in the Lagrange multipliers $\alpha_i$, with $C$ (the weight factor) now appearing as the box constraint $0 \le \alpha_i \le C$
notice that in the whole procedure, $x$ can be substituted by $\Phi(x)$, i.e. a transformation of $x$; via $k(a, b) = \Phi(a)^T \Phi(b)$ we can skip the step of defining the $\Phi$ function and use the kernel trick.
the reason SVM is ... sparse ...
is that:
after solving for the $\alpha_i$, many are 0: if a point is classified correctly with margin > 1, its constraint term ($\ge 0$) is inactive, so by complementary slackness $\alpha_k = 0$. i.e. most of the points can be discarded in the computation after SMO.
prediction function:
$$f(x_{pred}) = \sum_i \alpha_i y_i k(x_i, x_{pred}) + b$$
if $\alpha_i = 0$ we can discard that term.
then we use SMO to solve for the $\alpha_i$.
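a sketch of this prediction (any kernel from the earlier sketch plugs in as `kernel`; the `1e-8` cutoff for "alpha is zero" is an assumed tolerance):

```python
import numpy as np

def svm_predict(x_new, alpha, y, X, b, kernel):
    """sign( sum_i alpha_i y_i k(x_i, x_new) + b ); zero-alpha terms drop out."""
    sv = np.where(alpha > 1e-8)[0]          # keep only the support vectors
    s = sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in sv)
    return np.sign(s + b)
```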
solving b:
1) inferred from a single support vector
2) more stable solution: average over all support vectors:
$$b = \frac{1}{N_S} \sum_{n \in S} \Big( y_n - \sum_{m \in S} \alpha_m y_m k(x_n, x_m) \Big)$$
where $n, m \in S$ and $N_S = |S|$ is the total number of support vectors
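a sketch of the averaged formula; note that in the soft-margin case the outer average is usually restricted to the on-margin vectors ($0 < \alpha_n < C$), which is what this version assumes:

```python
import numpy as np

def solve_b(alpha, y, K, C, tol=1e-8):
    """Average b over the on-margin support vectors; K is the kernel
    (Gram) matrix over the training set."""
    S = np.where(alpha > tol)[0]                        # all support vectors
    M = np.where((alpha > tol) & (alpha < C - tol))[0]  # on-margin SVs
    # b = (1/|M|) sum_{n in M} ( y_n - sum_{m in S} alpha_m y_m k(x_n, x_m) )
    return np.mean([y[n] - np.sum(alpha[S] * y[S] * K[n, S]) for n in M])
```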
reading off support vectors after SMO,
based on the value of $\alpha_n$:
if $\alpha_n = 0$: correctly classified point with margin > 1 (not a support vector)
if $\alpha_n = C$: wrongly classified point, or correct but margin < 1
if $0 < \alpha_n < C$: exactly on the line margin = 1 (correct side)
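the same three cases as a small helper (assuming `alpha` and `C` come from the solved dual; `tol` is an assumed tolerance for float noise):

```python
import numpy as np

def split_by_alpha(alpha, C, tol=1e-8):
    """Group training points by their solved alpha (tol absorbs float noise)."""
    non_sv    = np.where(alpha <= tol)[0]                       # correct, margin > 1
    bound_sv  = np.where(alpha >= C - tol)[0]                   # wrong, or margin < 1
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]  # exactly on the margin
    return non_sv, bound_sv, margin_sv
```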
also mention:
perceptron algorithm - learning a binary classifier:
$f(x) = 1$ if $w \cdot x + b > 0$ and $f(x) = 0$ otherwise
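a minimal sketch of the classic mistake-driven training loop for this $f$ (labels in $\{0, 1\}$; `epochs` is an assumed cap):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Classic perceptron: f(x) = 1 if w.x + b > 0 else 0, labels y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            pred = 1 if x_n @ w + b > 0 else 0
            # update only on mistakes, nudging the plane toward/away from x_n
            w += (y_n - pred) * x_n
            b += (y_n - pred)
    return w, b
```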
multi-class SVM:
$$L = \frac{1}{N} \sum_i \sum_{j \ne y_i} \max\big(0,\ f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta\big) + \lambda \sum_k \sum_l W_{k,l}^2$$
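a numpy sketch of this loss (shapes assumed: `X` is N×D, `W` is D×K, `y` holds integer class labels; `delta` and `lam` stand in for $\Delta$ and $\lambda$):

```python
import numpy as np

def multiclass_svm_loss(W, X, y, delta=1.0, lam=1e-3):
    """Multiclass hinge loss with L2 regularization, as in the formula above."""
    scores = X @ W                                    # f(x_i, W), shape (N, K)
    correct = scores[np.arange(len(y)), y][:, None]   # f(x_i, W)_{y_i}
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0.0               # skip the j = y_i terms
    return margins.sum() / len(y) + lam * np.sum(W ** 2)
```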