course notes: pattern classification and machine learning @EPFL, Switzerland
linear regression
simple and multiple regression
basis function expansion: polynomial => generated features
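A minimal MATLAB sketch of the polynomial expansion, assuming a 1-D input column vector x and a chosen degree; the helper name poly_expand is hypothetical.
function tX = poly_expand(x, degree)
    % build [1, x, x.^2, ..., x.^degree] as generated features
    N = length(x);
    tX = ones(N, 1);            % bias column
    for d = 1:degree
        tX = [tX, x(:).^d];     % append the d-th power as a new feature column
    end
end
Usage: tX = poly_expand(x, 3) gives an N x 4 design matrix.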
cost function
desired properties: 1. symmetric 2. penalize large and very large mistakes
statistical (robustness to outliers) vs computational (convexity) tradeoff
- MSE
- MAE
- convex, so no local minima (in general possibly many global minima), but not differentiable everywhere
- Huber loss
- more robust to outliers, still convex and differentiable
- Tukey's bisquare loss
- ideal for dealing with outliers, but not differentiable and not convex (loss sketches below)
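A minimal MATLAB sketch of these losses evaluated on the residual vector e = y - tX*beta; the Huber threshold delta and the 1/2 scaling on the MSE are assumptions chosen to match the gradient used later.
e = y - tX*beta;                          % residuals
mse = mean(e.^2) / 2;                     % MSE
mae = mean(abs(e));                       % MAE
delta = 1.0;                              % Huber threshold (assumed)
huber = mean( (abs(e) <= delta) .* (e.^2 / 2) ...
            + (abs(e) >  delta) .* (delta * (abs(e) - delta/2)) );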
Gradient descent
learning
: given a cost function, find $\beta$ that minimizes the cost
: learning posed as an optimization problem
optimization method
1. grid search
pseudo-code: O(x^D * N*D), where x = length(beta0), the number of candidate values per dimension; each MSE evaluation costs O(N*D) (sketch below)
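A minimal grid-search sketch for D = 2, assuming beta0 is a row vector of candidate values reused for both coordinates and the cost is MSE; all variable names are illustrative.
best_cost = inf;
for b1 = beta0                        % candidates for beta(1)
    for b2 = beta0                    % candidates for beta(2)
        e = y - tX*[b1; b2];          % O(ND) per candidate
        cost = e'*e / (2*length(y));  % MSE
        if cost < best_cost
            best_cost = cost;
            beta = [b1; b2];
        end
    end
end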
2. gradient descent: batch vs stochastic
example: when the cost function is MSE (pseudo-code below)
implementation issues
: stopping criteria - convergence assessment
- positive definite Hessian
: step size: line search
: feature normalization - gradient descent is sensitive to ill-conditioning
pseudo-code, using MSE as the cost function: O(N*D*maxIters)
for i = 1:maxIters
    e = y - tX*beta          // residual, O(ND)
    g = -(1/N) * tX' * e     // gradient of the MSE cost, O(ND)
    beta = beta - alpha * g  // gradient step with step size alpha
end
SGD in neural networks: learning rate lr (~0.001) and momentum (~0.9); update rule sketched below
hyperparameters
batch size
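A minimal sketch of one SGD-with-momentum update, assuming the gradient is computed on a mini-batch (tXb, yb) of size B with the MSE cost; v is the velocity term and the values below are the typical defaults noted above.
lr = 0.001;  momentum = 0.9;
v = zeros(size(beta));                  % velocity, initialized once
% for each mini-batch (tXb, yb):
g = -(1/B) * tXb' * (yb - tXb*beta);    % mini-batch gradient
v = momentum*v - lr*g;                  % accumulate velocity
beta = beta + v;                        % parameter update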
3. Least squares
normal equations, closed-form solution
if using MSE as the cost function:
g = -(1/N) * tX' * (y - tX*beta) = 0
closed form (normal equations): tX'*tX*beta = tX'*y, so beta = inv(tX'*tX)*tX'*y
beta = inv(tX'*tX)*(tX'*y)   // O(N*D^2) for tX'*tX, O(D^3) for the inverse, O(N*D) for tX'*y
% in MATLAB
beta = pinv(tX'*tX)*(tX'*y)   % pseudo-inverse, O(N*D^2 + D^3)
beta = (tX'*tX) \ (tX'*y)     % preferred: solve the normal equations directly
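A small self-contained check of the closed-form solution on synthetic data; N, D, beta_true and the noise level are arbitrary illustrative choices.
N = 100;  D = 3;
tX = [ones(N,1), randn(N, D-1)];      % design matrix with a bias column
beta_true = [1; -2; 0.5];
y = tX*beta_true + 0.1*randn(N,1);    % noisy observations
beta = (tX'*tX) \ (tX'*y);            % least-squares estimate, close to beta_true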
Maximum likelihood
Assume: data is generated from a probability distribution
assume a Gaussian distribution + independent samples ~~ MSE (derivation sketched below)
likelihood of observing y given X and beta: p(y|X, beta)
take log => log-likelihood
maximizing the likelihood (Gaussian distribution) = minimizing the MSE cost function
maximum likelihood estimator - least squares
another interpretation: find the model under which the observed data is most likely to have been generated
properties of MLE
consistent (under some conditions)
asymptotically normal (variance related to the Fisher information)
efficient: achieves the Cramér-Rao lower bound
assume a Laplace distribution ~~ MAE
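A sketch of the Gaussian case referenced above, assuming i.i.d. noise with fixed variance $\sigma^2$ (the Laplace case is analogous with $|y_n - x_n^\top\beta|$ in place of the square):
$$\log p(y \mid X, \beta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top\beta, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top\beta)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$
so maximizing the log-likelihood over $\beta$ is the same as minimizing the MSE cost.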
overfitting
Occam's Razor: simpler is better
regularization
shrinkage, weight decay, early stopping
ridge regression - fights overfitting - interpretable as a MAP estimator
extends the linear model: nonlinear basis functions
$\text{cost} = L(\beta) + \frac{\lambda}{2N} \sum_j \beta_j^2$
$g := -\frac{1}{N}\, tX^\top (y - tX\beta) + \frac{\lambda}{N}\, \beta$ (with $L$ = MSE)
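A minimal ridge-regression sketch in MATLAB, assuming the MSE data term above; lambda is the regularization strength (the N factors cancel when the gradient is set to zero), and in practice the bias column is often left unpenalized.
D = size(tX, 2);
lambda = 0.1;                                       % regularization strength (assumed)
beta_ridge = (tX'*tX + lambda*eye(D)) \ (tX'*y);    % closed-form ridge solution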
Maximum a posteriori estimate
estimate $\beta$ as the maximizer of the product of likelihood and prior
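A short sketch of how this recovers ridge regression, assuming the Gaussian likelihood above and a zero-mean Gaussian prior on $\beta$ (the prior choice is the assumption here):
$$\beta_{\text{MAP}} = \arg\max_{\beta}\; p(y \mid X, \beta)\, p(\beta), \qquad p(\beta) = \mathcal{N}(\beta \mid 0, \tau^2 I)$$
Taking the negative log turns this into the MSE cost plus a term proportional to $\sum_j \beta_j^2$, i.e. the ridge cost with $\lambda = \sigma^2 / \tau^2$.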
cross validation
generalization error
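A minimal K-fold cross-validation sketch in MATLAB for estimating the generalization error of least squares; K, the random fold assignment, and the use of MSE as the error measure are assumptions.
K = 5;
N = size(tX, 1);
idx = randperm(N)';                     % shuffled row indices
fold_id = mod(0:N-1, K)' + 1;           % assign each shuffled row to a fold 1..K
err = zeros(K, 1);
for k = 1:K
    te = idx(fold_id == k);             % held-out rows for this fold
    tr = idx(fold_id ~= k);             % remaining rows for training
    beta_k = (tX(tr,:)'*tX(tr,:)) \ (tX(tr,:)'*y(tr));
    e = y(te) - tX(te,:)*beta_k;
    err(k) = mean(e.^2) / 2;            % test MSE on the held-out fold
end
cv_error = mean(err);                   % cross-validation estimate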
Bias-variance decomposition
train error
in-sample test error
test error
expected test error = noise variance + bias^2 + variance
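The decomposition written out, assuming data generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$ and an estimator $\hat f$ trained on a random training set:
$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}$$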
classification
logistic regression
probabilistic model: Bernoulli distribution
probability of y given by $\sigma(tX\beta)$ (plus a $\lambda$ penalty term if regularized); take the log to get the log-likelihood and maximize it - equivalently, minimize the negative log-likelihood $L(\beta)$
$\nabla L(\beta) = tX^\top\,[\sigma(tX\beta) - y]$
no closed-form normal equations
=> can't solve it like least squares
=> use gradient descent or Newton's method
convexity
Hessian: $H(\beta) = tX^\top S\, tX$ (if penalized, add $\lambda I_D$)
$S$: diagonal, with $S_{nn} = \sigma(x_n^\top\beta)\,[1 - \sigma(x_n^\top\beta)]$
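A minimal Newton's-method sketch in MATLAB for unpenalized logistic regression, using the gradient and Hessian above; the fixed number of iterations and the inline sigmoid are assumptions, and building S as a dense N x N matrix is only reasonable for small N.
sigmoid = @(z) 1 ./ (1 + exp(-z));
beta = zeros(size(tX, 2), 1);
for it = 1:20                       % fixed number of Newton steps (assumed)
    p = sigmoid(tX*beta);           % sigma(tX*beta)
    g = tX' * (p - y);              % gradient of the negative log-likelihood
    S = diag(p .* (1 - p));         % diagonal matrix S
    H = tX' * S * tX;               % Hessian
    beta = beta - H \ g;            % Newton update
end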
Generalized linear model
exponential family distributions
Bernoulli distribution, Gaussian distribution
properties: $A'(\eta) = \mu$, $A''(\eta) = \sigma^2$ (the variance)
maximum log-likelihood estimation for exponential family distributions
=> normal equation: $X^\top\,[\,g^{-1}(\eta) - y\,] = 0$, with $\eta = X\beta$
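The exponential-family form with the Bernoulli case as a worked example (natural parameter $\eta$, log-partition function $A(\eta)$):
$$p(y \mid \eta) = h(y)\, \exp\big(\eta\, y - A(\eta)\big)$$
For Bernoulli with mean $\mu$: $\eta = \log\frac{\mu}{1-\mu}$, $A(\eta) = \log(1 + e^{\eta})$, so $A'(\eta) = \sigma(\eta) = \mu$ and $A''(\eta) = \mu(1-\mu)$, the variance.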