course notes: pattern classification and machine learning @EPFL, Switzerland
linear regression
simple and multiple regression
basis function expansion: polynomial => generated features
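A minimal MATLAB sketch of the polynomial expansion, assuming a 1-D input column vector x and a chosen degree; the helper name poly_expand is hypothetical.
function tX = poly_expand(x, degree)
    % build [1, x, x.^2, ..., x.^degree] as generated features
    N = length(x);
    tX = ones(N, 1);            % bias column
    for d = 1:degree
        tX = [tX, x(:).^d];     % append the d-th power as a new feature column
    end
end
Usage: tX = poly_expand(x, 3) gives an N x 4 design matrix.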
cost function
desired properties: 1. symmetric 2. penalize large and very large mistakes
statistical (robustness to outliers) vs computational (convexity) tradeoff
- MSE
- MAE
- convex, so no local minima (in general possibly many global minima), but not differentiable everywhere
- Huber loss
- more robust to outliers, still convex and differentiable
- Tukey's bisquare loss
- ideal for dealing with outliers, but not differentiable and not convex (loss sketches below)
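A minimal MATLAB sketch of these losses evaluated on the residual vector e = y - tX*beta; the Huber threshold delta and the 1/2 scaling on the MSE are assumptions chosen to match the gradient used later.
e = y - tX*beta;                          % residuals
mse = mean(e.^2) / 2;                     % MSE
mae = mean(abs(e));                       % MAE
delta = 1.0;                              % Huber threshold (assumed)
huber = mean( (abs(e) <= delta) .* (e.^2 / 2) ...
            + (abs(e) >  delta) .* (delta * (abs(e) - delta/2)) );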
Gradient descent
learning
: given a cost function, find $\beta$ that minimizes the cost
: learning posed as an optimization problem
optimization method
1. grid search
pseudo-code: O(x^D * N*D), where x = length(beta0), the number of candidate values per dimension; each MSE evaluation costs O(N*D) (sketch below)
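A minimal grid-search sketch for D = 2, assuming beta0 is a row vector of candidate values reused for both coordinates and the cost is MSE; all variable names are illustrative.
best_cost = inf;
for b1 = beta0                        % candidates for beta(1)
    for b2 = beta0                    % candidates for beta(2)
        e = y - tX*[b1; b2];          % O(ND) per candidate
        cost = e'*e / (2*length(y));  % MSE
        if cost < best_cost
            best_cost = cost;
            beta = [b1; b2];
        end
    end
end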
2. gradient descent: batch vs stochastic
example: when the cost function is MSE (pseudo-code below)
implementation issues
: stopping criteria - convergence assessment
- positive definite Hessian
: step size: line search
: feature normalization - gradient descent is sensitive to ill-conditioning
pseudo-code, using MSE as the cost function: O(N*D*maxIters)
for i = 1:maxIters
    e = y - tX*beta          // residual, O(ND)
    g = -(1/N) * tX' * e     // gradient of the MSE cost, O(ND)
    beta = beta - alpha * g  // gradient step with step size alpha
end
SGD in neural networks: learning rate lr (~0.001) and momentum (~0.9); update rule sketched below
hyperparameters
batch size
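A minimal sketch of one SGD-with-momentum update, assuming the gradient is computed on a mini-batch (tXb, yb) of size B with the MSE cost; v is the velocity term and the values below are the typical defaults noted above.
lr = 0.001;  momentum = 0.9;
v = zeros(size(beta));                  % velocity, initialized once
% for each mini-batch (tXb, yb):
g = -(1/B) * tXb' * (yb - tXb*beta);    % mini-batch gradient
v = momentum*v - lr*g;                  % accumulate velocity
beta = beta + v;                        % parameter update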
3. Least squares
normal equations, closed-form solution
if using MSE as the cost function:
g = -(1/N) * tX' * (y - tX*beta) = 0
closed form (normal equations): tX'*tX*beta = tX'*y, so beta = inv(tX'*tX)*tX'*y
beta = inv(tX'*tX)*(tX'*y)   // O(N*D^2) for tX'*tX, O(D^3) for the inverse, O(N*D) for tX'*y
% in MATLAB
beta = pinv(tX'*tX)*(tX'*y)   % pseudo-inverse, O(N*D^2 + D^3)
beta = (tX'*tX) \ (tX'*y)     % preferred: solve the normal equations directly
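A small self-contained check of the closed-form solution on synthetic data; N, D, beta_true and the noise level are arbitrary illustrative choices.
N = 100;  D = 3;
tX = [ones(N,1), randn(N, D-1)];      % design matrix with a bias column
beta_true = [1; -2; 0.5];
y = tX*beta_true + 0.1*randn(N,1);    % noisy observations
beta = (tX'*tX) \ (tX'*y);            % least-squares estimate, close to beta_true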
Maximum likelihood
Assume: data is generated from a probability distribution
assume a Gaussian distribution + independent samples ~~ MSE (derivation sketched below)
likelihood of observing y given X and beta: p(y|X, beta)
take log => log-likelihood
maximizing the likelihood (Gaussian distribution) = minimizing the MSE cost function
maximum likelihood estimator - least squares
another interpretation: find the model under which the observed data is most likely to have been generated
properties of MLE
consistent (under some conditions)
asymptotically normal (variance related to the Fisher information)
efficient: achieves the Cramér-Rao lower bound
assume a Laplace distribution ~~ MAE
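A sketch of the Gaussian case referenced above, assuming i.i.d. noise with fixed variance $\sigma^2$ (the Laplace case is analogous with $|y_n - x_n^\top\beta|$ in place of the square):
$$\log p(y \mid X, \beta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top\beta, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - x_n^\top\beta)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$
so maximizing the log-likelihood over $\beta$ is the same as minimizing the MSE cost.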
overfitting
Occam's Razor: simpler is better
regularization
shrinkage, weight decay, early stopping
ridge regression - fights overfitting - interpretable as a MAP estimator
extends the linear model: nonlinear basis functions
$\text{cost} = L(\beta) + \frac{\lambda}{2N} \sum_j \beta_j^2$
$g := -\frac{1}{N}\, tX^\top (y - tX\beta) + \frac{\lambda}{N}\, \beta$ (with $L$ = MSE)
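A minimal ridge-regression sketch in MATLAB, assuming the MSE data term above; lambda is the regularization strength (the N factors cancel when the gradient is set to zero), and in practice the bias column is often left unpenalized.
D = size(tX, 2);
lambda = 0.1;                                       % regularization strength (assumed)
beta_ridge = (tX'*tX + lambda*eye(D)) \ (tX'*y);    % closed-form ridge solution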
Maximum a posteriori estimate
estimate $\beta$ as the maximizer of the product of likelihood and prior
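A short sketch of how this recovers ridge regression, assuming the Gaussian likelihood above and a zero-mean Gaussian prior on $\beta$ (the prior choice is the assumption here):
$$\beta_{\text{MAP}} = \arg\max_{\beta}\; p(y \mid X, \beta)\, p(\beta), \qquad p(\beta) = \mathcal{N}(\beta \mid 0, \tau^2 I)$$
Taking the negative log turns this into the MSE cost plus a term proportional to $\sum_j \beta_j^2$, i.e. the ridge cost with $\lambda = \sigma^2 / \tau^2$.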
cross validation
generalization error
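A minimal K-fold cross-validation sketch in MATLAB for estimating the generalization error of least squares; K, the random fold assignment, and the use of MSE as the error measure are assumptions.
K = 5;
N = size(tX, 1);
idx = randperm(N)';                     % shuffled row indices
fold_id = mod(0:N-1, K)' + 1;           % assign each shuffled row to a fold 1..K
err = zeros(K, 1);
for k = 1:K
    te = idx(fold_id == k);             % held-out rows for this fold
    tr = idx(fold_id ~= k);             % remaining rows for training
    beta_k = (tX(tr,:)'*tX(tr,:)) \ (tX(tr,:)'*y(tr));
    e = y(te) - tX(te,:)*beta_k;
    err(k) = mean(e.^2) / 2;            % test MSE on the held-out fold
end
cv_error = mean(err);                   % cross-validation estimate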
Bias-variance decomposition
train error
in-sample test error
test error
expected test error = noise variance + bias^2 + variance
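The decomposition written out, assuming data generated as $y = f(x) + \varepsilon$ with noise variance $\sigma^2$ and an estimator $\hat f$ trained on a random training set:
$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\sigma^2}_{\text{noise}} + \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}$$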
classification
logistic regression
probabilistic model: Bernoulli distribution
probability of y given by $\sigma(tX\beta)$ (plus a $\lambda$ penalty term if regularized); take the log to get the log-likelihood and maximize it - equivalently, minimize the negative log-likelihood $L(\beta)$
$\nabla L(\beta) = tX^\top\,[\sigma(tX\beta) - y]$
no closed-form normal equations
=> can't solve it like least squares
=> use gradient descent or Newton's method
convexity
Hessian: $H(\beta) = tX^\top S\, tX$ (if penalized, add $\lambda I_D$)
$S$: diagonal, with $S_{nn} = \sigma(x_n^\top\beta)\,[1 - \sigma(x_n^\top\beta)]$
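A minimal Newton's-method sketch in MATLAB for unpenalized logistic regression, using the gradient and Hessian above; the fixed number of iterations and the inline sigmoid are assumptions, and building S as a dense N x N matrix is only reasonable for small N.
sigmoid = @(z) 1 ./ (1 + exp(-z));
beta = zeros(size(tX, 2), 1);
for it = 1:20                       % fixed number of Newton steps (assumed)
    p = sigmoid(tX*beta);           % sigma(tX*beta)
    g = tX' * (p - y);              % gradient of the negative log-likelihood
    S = diag(p .* (1 - p));         % diagonal matrix S
    H = tX' * S * tX;               % Hessian
    beta = beta - H \ g;            % Newton update
end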
Generalized linear model
exponential family distributions
Bernoulli distribution, Gaussian distribution
properties: $A'(\eta) = \mu$, $A''(\eta) = \sigma^2$ (the variance)
maximum log-likelihood estimation for exponential family distributions
=> normal equation: $X^\top\,[\,g^{-1}(\eta) - y\,] = 0$, with $\eta = X\beta$
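The exponential-family form with the Bernoulli case as a worked example (natural parameter $\eta$, log-partition function $A(\eta)$):
$$p(y \mid \eta) = h(y)\, \exp\big(\eta\, y - A(\eta)\big)$$
For Bernoulli with mean $\mu$: $\eta = \log\frac{\mu}{1-\mu}$, $A(\eta) = \log(1 + e^{\eta})$, so $A'(\eta) = \sigma(\eta) = \mu$ and $A''(\eta) = \mu(1-\mu)$, the variance.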