SVM:
1) the case where the data is separable
2) non-separable data needs L1 regularization (slack variables)
solving SVM with:
Lagrange dual function for optimization
kernel for feature transformation
SMO for optimization of a multivariable function with constraints (the dual function)
remark:
in the discussion,
$w \in R^D$ instead of $R^{D+1}$; $w_0$ is extracted out as $b$ now.
$y_n \in \{-1, 1\}$ instead of 0 and 1.
separable case:
decision plane: $f(x) = w^T x + b = 0$
classify by the sign of $f(x_n)$ (i.e. $f(x_n) \gtrless 0$)
for a training sample $(x_n, y_n)$:
$y_n f(x_n) > 0$ if the classification is correct; rescaling $w, b$ lets us demand $y_n f(x_n) \ge 1$
margin: 'distance' from a point to the plane,
negative if the point is wrongly classified.
assume separable: all points are correctly classified.
for a point:
$$r_n = \frac{y_n (w^T x_n + b)}{\|w\|}$$
for the classifier:
$$r = \min\{r_n,\ n = 1, \dots, N\}$$
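as a tiny numpy sketch of both definitions (the data, `w`, and `b` here are made-up toy values):

```python
import numpy as np

# toy separable data (made-up values, just to exercise the formulas)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

# per-point geometric margin: r_n = y_n (w^T x_n + b) / ||w||
r_n = y * (X @ w + b) / np.linalg.norm(w)

# margin of the classifier: the worst (smallest) per-point margin
r = r_n.min()
print(r_n, r)
```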
goal: maximize $r$ w.r.t. $w$ => an optimization problem with constraints
transform the problem to make life easier:
minimize $g(w) = \|w\|^2$ subject to $y_n (w^T x_n + b) \ge 1$ => generate the Lagrange dual function to solve it
Lagrangian:
note: the sign of the constraint terms must follow the definition of the Lagrangian problem
maximize $\mathrm{Dual}(\alpha)$ + constraints => the $\alpha_n$ are the Lagrange multipliers.
- $\mathrm{Dual}(\alpha)$ is a lower bound of $g(w)$; to get the minimum of $g(w)$ we instead find the maximum of the dual function
to ensure the maximum equals the minimum (strong duality), we need the KKT conditions
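for reference, the standard hard-margin derivation of this step (textbook form with the conventional $\frac{1}{2}$ factor; filled in here since the notes skip it):

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_n \alpha_n \big[ y_n (w^T x_n + b) - 1 \big], \quad \alpha_n \ge 0$$

setting $\partial L / \partial w = 0$ and $\partial L / \partial b = 0$ gives $w = \sum_n \alpha_n y_n x_n$ and $\sum_n \alpha_n y_n = 0$; substituting back:

$$\mathrm{Dual}(\alpha) = \sum_n \alpha_n - \frac{1}{2} \sum_n \sum_m \alpha_n \alpha_m y_n y_m \, x_n^T x_m$$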
in the final function, $x$ appears only through inner products $x_n^T x_m$, so we can substitute them with a kernel function
to solve the dual function + constraints, we use the SMO algorithm: fix all but two of the $\alpha_i$ (the equality constraint $\sum_n \alpha_n y_n = 0$ means a single $\alpha$ cannot move alone), then optimize over the picked pair $(\alpha_1, \alpha_2)$ => after substituting the constraint, a quadratic function of one variable, solvable in closed form (see the sketch below)
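a minimal sketch of the core pair update, following the usual simplified-SMO formulas (the pair-selection heuristics and outer loop are omitted; `K` is assumed to be a precomputed kernel matrix):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, b, C, eps=1e-8):
    """One SMO step: optimize (alpha_i, alpha_j) with all other alphas fixed.
    K is the precomputed kernel (Gram) matrix; returns updated (alpha, b)."""
    # prediction errors E_n = f(x_n) - y_n under the current alpha, b
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # box bounds [L, H] that keep 0 <= alpha <= C and sum_n alpha_n y_n fixed
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature of the 1-D quadratic
    if eta < eps or L >= H:
        return alpha, b                      # no progress possible on this pair

    # unconstrained optimum in alpha_j, clipped to the box, then alpha_i follows
    a_j = float(np.clip(alpha[j] + y[j] * (E_i - E_j) / eta, L, H))
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)

    # recompute b so the KKT conditions hold for the updated pair
    b_i = b - E_i - y[i] * (a_i - alpha[i]) * K[i, i] - y[j] * (a_j - alpha[j]) * K[i, j]
    b_j = b - E_j - y[i] * (a_i - alpha[i]) * K[i, j] - y[j] * (a_j - alpha[j]) * K[j, j]
    alpha[i], alpha[j] = a_i, a_j
    b = b_i if 0 < a_i < C else (b_j if 0 < a_j < C else (b_i + b_j) / 2.0)
    return alpha, b
```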
kernel choice:
can be linear, polynomial, or RBF (Gaussian)
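a quick sketch of these three (the hyperparameters `gamma`, `d`, `c` are assumed knobs, not values from the notes):

```python
import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, d=3, c=1.0):
    # degree-d polynomial; c trades off low- vs high-order terms
    return (a @ b + c) ** d

def rbf_kernel(a, b, gamma=0.5):
    # Gaussian/RBF; gamma controls the width
    return np.exp(-gamma * np.sum((a - b) ** 2))
```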
non-separable case:
introduce slack variables $\xi_n$:
$$\xi_n = \begin{cases} 0 & \text{if } y_n (w^T x_n + b) \ge 1 \\ |y_n - (w^T x_n + b)| & \text{if } x_n \text{ is classified wrong or too close to the plane} \end{cases}$$
the idea is that we allow wrong classification, but these points should pay for it, i.e. they should increase the $g(w)$ we are trying to minimize (add a term to it, hence we need to set a weight factor $C$).
as for the correctly classified points, we ignore them, just like in the separable case.
it can also be viewed as an L1 regularization, since this term finally shows up as $C \sum_n \xi_n$. later we will find that the value of $C$ introduces a trade-off between tolerating wrongly classified data and ensuring that most examples have a functional margin of at least 1.
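a minimal sketch of the resulting soft-margin objective (note that with $y_n \in \{-1, 1\}$ the slack above collapses to the hinge form $\max(0,\ 1 - y_n(w^T x_n + b))$):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Soft-margin primal: g(w) plus C times the total slack."""
    # xi_n = max(0, 1 - y_n (w^T x_n + b)): zero for points that are
    # correct with margin >= 1, positive for the ones that must "pay"
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return w @ w + C * xi.sum()
```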
maximize margin
we perform the same steps as in the separable case and finally get the same dual in the Lagrange multipliers $\alpha_i$, with $C$ (the weight factor) now appearing as the box constraint $0 \le \alpha_i \le C$
notice that in the whole procedure, $x$ can be substituted by $\Phi(x)$, i.e. a transformation of $x$; via $k(a, b) = \Phi(a)^T \Phi(b)$ we can skip the step of defining the $\Phi$ function and use the kernel trick.
the reason SVM is ... sparse ...
is that:
after solving for the $\alpha_i$, many are 0: if a point is classified correctly with margin > 1, its constraint term ($\ge 0$) is inactive, so by complementary slackness $\alpha_k = 0$. i.e. most of the points can be discarded in the computation after SMO.
prediction function:
$$f(x_{pred}) = \sum_i \alpha_i y_i k(x_i, x_{pred}) + b$$
if $\alpha_i = 0$ we can discard that term.
then we use SMO to solve for the $\alpha_i$.
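a sketch of this prediction (any kernel from the earlier sketch plugs in as `kernel`; the `1e-8` cutoff for "alpha is zero" is an assumed tolerance):

```python
import numpy as np

def svm_predict(x_new, alpha, y, X, b, kernel):
    """sign( sum_i alpha_i y_i k(x_i, x_new) + b ); zero-alpha terms drop out."""
    sv = np.where(alpha > 1e-8)[0]          # keep only the support vectors
    s = sum(alpha[i] * y[i] * kernel(X[i], x_new) for i in sv)
    return np.sign(s + b)
```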
solving b:
1) inferred from a single support vector
2) more stable solution: average over all support vectors:
$$b = \frac{1}{N_S} \sum_{n \in S} \Big( y_n - \sum_{m \in S} \alpha_m y_m k(x_n, x_m) \Big)$$
where $n, m \in S$ and $N_S = |S|$ is the total number of support vectors
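a sketch of the averaged formula; note that in the soft-margin case the outer average is usually restricted to the on-margin vectors ($0 < \alpha_n < C$), which is what this version assumes:

```python
import numpy as np

def solve_b(alpha, y, K, C, tol=1e-8):
    """Average b over the on-margin support vectors; K is the kernel
    (Gram) matrix over the training set."""
    S = np.where(alpha > tol)[0]                        # all support vectors
    M = np.where((alpha > tol) & (alpha < C - tol))[0]  # on-margin SVs
    # b = (1/|M|) sum_{n in M} ( y_n - sum_{m in S} alpha_m y_m k(x_n, x_m) )
    return np.mean([y[n] - np.sum(alpha[S] * y[S] * K[n, S]) for n in M])
```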
reading off support vectors after SMO,
based on the value of $\alpha_n$:
if $\alpha_n = 0$: correctly classified point with margin > 1 (not a support vector)
if $\alpha_n = C$: wrongly classified point, or correct but margin < 1
if $0 < \alpha_n < C$: exactly on the line margin = 1 (correct side)
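the same three cases as a small helper (assuming `alpha` and `C` come from the solved dual; `tol` is an assumed tolerance for float noise):

```python
import numpy as np

def split_by_alpha(alpha, C, tol=1e-8):
    """Group training points by their solved alpha (tol absorbs float noise)."""
    non_sv    = np.where(alpha <= tol)[0]                       # correct, margin > 1
    bound_sv  = np.where(alpha >= C - tol)[0]                   # wrong, or margin < 1
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]  # exactly on the margin
    return non_sv, bound_sv, margin_sv
```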
also mention:
perceptron algorithm - learning a binary classifier:
$f(x) = 1$ if $w \cdot x + b > 0$ and $f(x) = 0$ otherwise
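a minimal sketch of the classic mistake-driven training loop for this $f$ (labels in $\{0, 1\}$; `epochs` is an assumed cap):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Classic perceptron: f(x) = 1 if w.x + b > 0 else 0, labels y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            pred = 1 if x_n @ w + b > 0 else 0
            # update only on mistakes, nudging the plane toward/away from x_n
            w += (y_n - pred) * x_n
            b += (y_n - pred)
    return w, b
```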
multi-class SVM:
$$L = \frac{1}{N} \sum_i \sum_{j \ne y_i} \max\big(0,\ f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta\big) + \lambda \sum_k \sum_l W_{k,l}^2$$
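a numpy sketch of this loss (shapes assumed: `X` is N×D, `W` is D×K, `y` holds integer class labels; `delta` and `lam` stand in for $\Delta$ and $\lambda$):

```python
import numpy as np

def multiclass_svm_loss(W, X, y, delta=1.0, lam=1e-3):
    """Multiclass hinge loss with L2 regularization, as in the formula above."""
    scores = X @ W                                    # f(x_i, W), shape (N, K)
    correct = scores[np.arange(len(y)), y][:, None]   # f(x_i, W)_{y_i}
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0.0               # skip the j = y_i terms
    return margins.sum() / len(y) + lam * np.sum(W ** 2)
```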