machine learning course notes

Pattern Classification and Machine Learning @ EPFL, Switzerland

linear regression

simple and multiple regression

basis function expansion: polynomial => generated features (see the sketch below)
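
A minimal MATLAB sketch of polynomial feature expansion, assuming x is a column vector of N scalar inputs; the variable names and the degree are illustrative, not from the course:

    degree = 3;                       % illustrative choice
    N = length(x);
    tX = ones(N, degree + 1);         % first column is the constant feature
    for d = 1:degree
        tX(:, d + 1) = x .^ d;        % d-th power of the input as a new feature
    end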

cost function

desired properties: (1) symmetric, (2) penalizes large and very large mistakes
trade-off between statistical properties (robustness to outliers) and computational properties (convexity)

  • MSE
  • MAE
    • in general many global minima; not differentiable everywhere, but no bad local minima
  • Huber loss
    • more robust to outliers; still convex and differentiable (see the code sketch after this list)
  • Tukey's bisquare loss
    • ideal in the presence of outliers, but not differentiable everywhere and not convex
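
A minimal MATLAB sketch of the first three cost functions, written in terms of the residuals for a given beta; the Huber threshold delta is an illustrative choice:

    e = y - tX * beta;                   % residuals
    N = length(e);
    costMSE = (e' * e) / (2 * N);        % mean squared error (1/2 factor so the gradient is -(1/N)*tX'*e)
    costMAE = sum(abs(e)) / N;           % mean absolute error
    delta = 1;                           % Huber threshold (illustrative)
    small = abs(e) <= delta;             % quadratic region
    huber = 0.5 * e.^2 .* small + delta * (abs(e) - 0.5 * delta) .* (~small);
    costHuber = sum(huber) / N;          % Huber cost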

Gradient descent

learning
: given a cost function, find the $\beta$ that minimizes the cost
: learning is posed as an optimization problem

optimization methods

1. grid search

pseudo-code, complexity roughly O(x^D · N·D), where x is the number of candidate values per parameter (e.g. length(beta0)) and D the number of parameters:
for i1 = 1:length(beta0)                   % beta0, beta1, ... are vectors of candidate values
    for i2 = 1:length(beta1)
        % ... one nested loop per additional parameter
        beta = [beta0(i1); beta1(i2)];
        e = y - tX * beta;                 % residuals, O(N*D)
        cost(i1, i2) = (e' * e) / (2 * N); % MSE cost at this grid point (1/(2N) convention, matching the gradient below), O(N)
    end
end

2. gradient descent: batch vs stochastic

example: the case where the cost function is MSE

implementation issues
: stopping criterion - convergence assessment (e.g. check that the Hessian is positive definite)
: step size - can be set by line search
: feature normalization - gradient descent is sensitive to ill-conditioning

pseudo-code, using MSE as the cost function, complexity O(N·D) per step (O(N·D·maxIters) overall):

    for i = 1:maxIters
        e = y - tX * beta;            % residuals, O(N*D)
        g = -(1/N) * (tX' * e);       % gradient of the MSE cost, O(N*D)
        beta = beta - alpha * g;      % update with step size alpha
    end

SGD in neural networks: learning rate lr (~0.001) and momentum (~0.9)

hyperparameter
: batch size
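
A minimal sketch of SGD with momentum on the MSE cost, assuming N data points (tX, y), a mini-batch size B and an iteration budget maxIters; all of these names and the typical values are illustrative:

    lr = 0.001;  mom = 0.9;                % learning rate and momentum
    v = zeros(size(beta));                 % momentum buffer
    for it = 1:maxIters
        idx = randperm(N, B);              % sample a mini-batch of B examples
        e = y(idx) - tX(idx, :) * beta;    % residuals on the mini-batch
        g = -(1/B) * (tX(idx, :)' * e);    % stochastic gradient of the MSE cost
        v = mom * v - lr * g;              % momentum update
        beta = beta + v;                   % parameter update
    end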

3. least squares

normal equations, closed-form solution

if MSE is used as the cost function, set the gradient to zero:
g = -(1/N) * tX' * (y - tX*beta) = 0
closed form (normal equations): tX'*tX*beta = tX'*y, so beta = inv(tX'*tX)*tX'*y

beta = inv(tX' * tX) * (tX' * y)    % O(N*D^2 + D^3)
% in MATLAB, numerically preferable:
beta = pinv(tX' * tX) * (tX' * y)   % pseudo-inverse, handles a singular tX'*tX
beta = (tX' * tX) \ (tX' * y)       % or solve the normal equations directly


Maximum likelihood

Assume: data is generated from a probability distribution

assume the noise follows a Gaussian distribution + samples are independent ~~ MSE

likelihood of observing y given X and beta: p(y|X, beta)

take the log => log-likelihood
maximizing the likelihood function (Gaussian) = minimizing the MSE cost function (sketched below)
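
A short derivation of the equivalence, assuming i.i.d. Gaussian noise with variance $\sigma^2$:

$p(y|X, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^T \beta, \sigma^2)$
$\log p(y|X, \beta) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^T \beta)^2 + \text{const}$

so maximizing the log-likelihood over $\beta$ is the same as minimizing $\sum_n (y_n - x_n^T \beta)^2$, i.e. the MSE cost.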

maximum likelihood estimator - least squares

another interpretation: find the model under which the observed data is most likely to have been generated

properties of the MLE

consistent (under some conditions)
asymptotically normal (asymptotic variance given by the inverse Fisher information)
efficient: asymptotically achieves the Cramér-Rao lower bound

assume the noise follows a Laplace distribution ~~ MAE
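
The analogous step for i.i.d. Laplace noise with scale $b$:

$\log p(y|X, \beta) = -\frac{1}{b} \sum_{n=1}^{N} |y_n - x_n^T \beta| + \text{const}$

so maximizing this log-likelihood is the same as minimizing the MAE cost.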


overfitting

Occam's razor: simpler is better

regularization

shrinkage, weight decay, early stopping

ridge regression - fights overfitting - can be interpreted as a MAP estimator

extend the linear model: nonlinear basis functions
$cost = L(\beta) + \frac{\lambda}{2N} \sum_j \beta_j^2$
$g = -\frac{1}{N} X^T (y - X\beta) + \frac{\lambda}{N} \beta$ (gradient when L is the MSE cost; the offset term is usually not penalized)
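
A minimal sketch of the resulting closed-form ridge solution; the value of lambda is illustrative:

    lambda = 0.1;                                      % regularization strength (illustrative)
    D = size(tX, 2);
    beta = (tX' * tX + lambda * eye(D)) \ (tX' * y);   % ridge normal equations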

Maximum a posteriori estimate

estimate $\beta$ as the maximizer of the product of likelihood and prior, i.e. the mode of the posterior
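
Why ridge regression is a MAP estimator: a sketch assuming a Gaussian likelihood and a zero-mean Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$:

$\hat{\beta}_{MAP} = \arg\max_\beta \; p(y|X, \beta)\, p(\beta) = \arg\min_\beta \; \frac{1}{2\sigma^2} \|y - X\beta\|^2 + \frac{1}{2\tau^2} \|\beta\|^2$

which is the ridge cost with $\lambda \propto \sigma^2 / \tau^2$.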

cross validation

generalization error
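
A minimal K-fold cross-validation sketch for estimating the generalization error of ridge regression, assuming tX, y and a regularization strength lambda are given; K and the random split are illustrative choices:

    K = 5;                                       % number of folds (illustrative)
    N = length(y);
    D = size(tX, 2);
    idx = randperm(N);                           % shuffle the data once
    foldSize = floor(N / K);
    err = zeros(K, 1);
    for k = 1:K
        teIdx = idx((k-1)*foldSize + 1 : k*foldSize);   % held-out fold
        trIdx = setdiff(idx, teIdx);                    % remaining folds for training
        beta = (tX(trIdx,:)' * tX(trIdx,:) + lambda * eye(D)) \ (tX(trIdx,:)' * y(trIdx));
        e = y(teIdx) - tX(teIdx,:) * beta;
        err(k) = (e' * e) / (2 * length(teIdx));        % test MSE on the held-out fold
    end
    cvError = mean(err);                         % cross-validation estimate of the generalization error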


Bias-variance decomposition

train error
in-sample test error
test error
expected test error = noise variance + bias$^2$ + prediction variance
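
Written out, for a true function $f$, noise $\epsilon$ with variance $\sigma^2$ (so $y = f(x) + \epsilon$) and a prediction $\hat{f}$ learned from a random training set:

$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \sigma^2 + \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2 + \mathrm{Var}\big(\hat{f}(x)\big)$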


classification

logistic regression

probabilistic model: Bernoulli distribution

model the probability of y: $p(y_n = 1 \mid x_n, \beta) = \sigma(x_n^T \beta)$ (plus a penalty with weight $\lambda$ if regularized); take the log to get the log-likelihood $L(\beta)$ and maximize it.
$\nabla L = X^T [\sigma(X\beta) - y]$ (gradient of the negative log-likelihood)
setting the gradient to zero gives no closed-form normal equations (the sigmoid is nonlinear)
=> can't use least squares
=> use gradient descent or Newton's method

convexity

Hessian: $H(\beta) = X^T S X$ (plus $\lambda I_D$ if penalized), where $S$ is diagonal with entries $S_{nn} = \sigma(x_n^T\beta)\,[1 - \sigma(x_n^T\beta)]$; since these entries are non-negative, $H$ is positive semi-definite and the cost is convex.
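
A minimal Newton's-method sketch for (unpenalized) logistic regression using the gradient and Hessian above, assuming labels y in {0,1}; maxIters and the zero initialization are illustrative:

    sigmoid = @(a) 1 ./ (1 + exp(-a));
    D = size(tX, 2);
    beta = zeros(D, 1);                 % initialization
    for it = 1:maxIters
        p = sigmoid(tX * beta);         % predicted probabilities
        g = tX' * (p - y);              % gradient of the negative log-likelihood
        S = diag(p .* (1 - p));         % diagonal weight matrix
        H = tX' * S * tX;               % Hessian
        beta = beta - H \ g;            % Newton step
    end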


Generalized linear model

exponential family distribution

examples: Bernoulli distribution, Gaussian distribution

properties: $A'(\eta) = \mu$, $A''(\eta) = \sigma^2$ (the mean and the variance)

maximum likelihood estimation for an exponential family distribution, with linear predictor $\eta = X\beta$

=> optimality condition (the analogue of the normal equations): $X^T [g^{-1}(X\beta) - y] = 0$, where $g^{-1}$ is the inverse link function mapping $\eta$ to the mean
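
The exponential family form assumed here, with the Bernoulli distribution as an example (natural parameter $\eta$, log-partition function $A$):

$p(y|\eta) = h(y)\, \exp\big(\eta\, y - A(\eta)\big)$

Bernoulli: $p(y|\mu) = \mu^y (1-\mu)^{1-y} = \exp\big(y \log\tfrac{\mu}{1-\mu} - \log\tfrac{1}{1-\mu}\big)$, so $\eta = \log\tfrac{\mu}{1-\mu}$, $A(\eta) = \log(1 + e^\eta)$, and $A'(\eta) = \sigma(\eta) = \mu$, i.e. the inverse link is the sigmoid.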


curse of dimensionality