readning of mechine leaning

reference book I am currently reading…

  1. PRML
  2. machine learning, a probalistic approach: murphy
  3. the elements of statistical learning:HTF
  4. Andrew's notes

life is so hard.

  1. logistic regression is not applied for regression
  2. least mean square is a rule(the update rule for gradient descent)
  3. least square is actually means: using MSE distance function, it can be solve by Gradient descent or normal equation
  4. maximum likelihood exrimate of $\theta$ is solve by minimize the negative of log-likehood function of $\theta$
  5. likehood may be view intuitively as the likelihood of what y will be given x but when talking MLE it is actually the likelihood of what parameters will be gien x (and y, or distrubution of y)
  6. sigmoid function is a magic(fourier is also a magic): $\sigma(x)’ =\sigma(x)(1-\sigma(x))$
  7. if you hold strong prior belief, you will have high bias and low variance, the world will not affect you…
  8. MLE can be really silly.. due to overfitting. see here
  9. under a set of assumptions(Gaussian, independent) least squares regression could be derived as the maximum likelihood estimator

preprocessed

for input may be preprocessed

- like fixed size of the image
- also called feature extraction
- speed up computation
- easier to solve problem

reference book

  1. PRML:
  2. machine learning, a probalistic approach: murphy
  3. the elements of statistical learning:HTF
  4. Andrew's notes

OVERVIEW

  • for the data input type we can define the learning process as supervised or unsupervised
  • for the type of prediction, we can define the case as classification(discrete output) or regression(continuous output)
  • for the fitting model we choose, we can have linear model(basis function expansion), or rigistic regression(not regression but for classification actually)
  • analyze the model and find parameter – cost function:

    • derectly define cost function based on distance such as MAE, MSE…

      • whether the cost function is robust, ie, how they react to the outlier will also affect our model – statistic property
      • whether the cost function is convex relate whether we can reach the minimum if we count on the gradient to solve the minimum – computational property
      • just based on the cost function above is not enough, we can also take another approach..
    • define from probabilistic view of machine learning, we try to seek support for our modeling precess, analyze the model and tuning the model,

      • probability theory can be applied during the prediction, model selection, measurement to be performed

      • for regression, we focus on the distribution of the output data or error

        • by assume the error following the Gaussian distribution, the value we try to estimate/predict also follow the Gaussian distribution which take the y_prediction as mean.
      • while for classification, we take the Bayisan theorm – and then generate the likehood function
      • we can use the joint probability assuming the data is indenpendent
    • WHY cost function so important:

      • since that the parameter is choosen by the score return by cost function
  • consider whether the parameter is fixed or changed with the size of data, the model can be parametric(linear regression(oridinary), logistic…) or nonparametric(Locally weighted linear regression(only meet at andrew’s, not the linear-regress we saw),KNN)
  • for optimation numerically ie. solve the minimum of function in which we regard the parameter instead of the data as variable, we have: gradient descent, grid search, least square(closed form soluction, Gram matrix), newton method. for the solver, we need to consider the compuration complexity and convexity
    • implementation issue: – see reference: non-linear programming
    • gradient descent: feature normalization, direction(batch gradient descent, stochastic), stepsize(fixed, line-search:backtracking), convexity and converge, stopping criteria(optimal condition, positive definite)

  • supervised learning
    • classification- discrete output
    • regression- continues output
    • generalization – analyzing model
      • := make predictions on novel input
      • generalization error = expected value of the misclassification rate when averaged over future data, can never know but can be approximate by: misclassification rate on a large independenttest set
  • unsupervised learning
    • goal: knowledge discovery or discover interesting sreuctre in data in murphy
    • visualization
    • clustering
    • density estimation

      -reinforcement learning
    • find suirable actions to take in a given stituation in order to maximized the reward

linear model

  • function which is linear in unknow parameters are called linear model
    • for the polynomial model:
    • $y(x, \mathbf{\beta}) = \beta^0 + \beta_1 x + \beta_2 x^2 + …. + \beta_M x^M$
    • notices that the basis function(polynomial function) is nonliear in input x but linear in unknow parameter $\beta$
    • implement issue/ model selection:[detail in ovrifitting part]
      • choose the value of coefficents/parameters/weights is determined by fitting data - minimizing error function - cost function
      • choose the order/degree M - model selection/comparision
        • overfitting
        • regulazation
  • in linear regression: take $\phi$ as veriable – linear; but the $\phi$ function itself can be nonliear

probability theory

  • 提出probability 是想更加科学一些
  • expectation
    • def: weighted averages of funtions
    • if we are given a finite number N of points drawn from the probability distribution or probability density
      • $E[f] ~- 1/N sum_{n=1,2…N}f(x_n)$
    • consider expectation of functions of several variables eg f(x,y)
      • expectation have subscript with repesct to different variable:
      • $E_x[f(x,y)]$ and $E_y[f(x,y)]$
  • variance: $Var[x] = E[x^2] - E[x]^2$
    • consider functions of several variables eg f(x,y)
      • covariance: $cov[x,y] = E_{x,y}[{x - E[x]}{y^T - E[y^T]}]$

background: interpretation of probabilities

frequentist interpretation

  • popular: classical or frequentist way
    • the probability P of an uncertain event A, P(A) is defined by the frequency of that event based on previous observations
    • In the frequentist this view of the world, θ is not random—it just happens to be unknown—and it’s our job to come up with statistical procedures (such as maximum likelihood) to try to estimate this parameter.(NG)

Bayesian interpretation

  • another point of view: Bayesian view - probability provide a quanrification of uncertainty
    • ads: can be used to model our uncertainty of the event without long term frequency
    • for future event, we do not have historical database thus can not count the frequency.
    • can measure the belief in a statement $a$ based on some ‘knowledge’, denote as P(a|K), different K can generate different P(a|K) and even same K can have different P(a|K) – the belief is subjective

Bayes rule

  • consider conditional probabilities
  • $P(A|B) = \frac {P(B|A) P(A)}{P(B)}$
    • interpretation: updating our belief about a hypothesis A in the light of new evidence B
      • in likelihood, it is, output brief of y/A given B/input values+paramters
    • P(A|B): posterior belief
    • P(A): prior belief
    • P(B|A): likelihood, ie the B(our model) will occur if A(the output value of the sample data) is true.
    • P(B) is computed by: $\sum_{i=1,2,…}P(B|A_i)P(A_i)$ by marginalisation.
    • if you still remember the ‘cancer-test examples’ in statistic course, then p(A) is p(cancer), p(B|A) is p(positive|cancer), p(A|B) is p(cancer|positive) which is the value patient care for(but in fact we dont know)

different view of likelihood function

  • likelihood function: $P(\mathbf{y}|\mathbf{\beta})$

from frequentist way of interpretation:

  • parameter $\beta$ is a fixed parameter, the value is determined by ‘estimator’
  • A widely used frequentist estimator is maximum likelihood, in which we set the value that maximizes the likelihood function
  • ie. choosing $\beta$ s.t. probability of the observed data is maximized(seems to be the MAP estimate introduced in Murphy’s book, see below)
  • in practice, use negative log of likelihood function = log-likelihood:= error function(monotonically decreasing)
  • One approach to determining frequentist error bars is the bootstrap,
    • s1: 就是在已有的dataset(size N)里面random弄出L个dataset(size N) by drawing data from 已有的dataset(抽取方式是,可以重复抽,可以有的没有被抽中)
    • s2: looking at the variability of predictions between the different bootstrap data sets. then evaluate the accuracy of the estimates of the parameter
  • drawback: may lead to extreme conclusion if the dataset is bad, eg, a fair-looking coin is tossed three times and lands heads each time. in this case, we will generate parameter $\beta$ to make P(lands head) = 1

from Bayesian viewpoint:

in machine learning, Bayes theorem is used to convert a priot ptobability $P(A) = P(\mathbf{\beta})$ into a posterior probability $P(A|B) = P(\mathbf{\beta}|y)$ by incorpoating the evidence provided by the observed data

  • for $\mathbf{\beta}$ in the polynormial curve fitting model, we can take an approach with Bayes theorem:
  • $P(\mathbf{\beta} | \mathbf{y}) =\frac{ P(\mathbf{y}|\mathbf{\beta}) p(\mathbf{\beta})}{P(\mathbf{y})} $
    • given data {y_1,y_2,…}, we want to know the $\beta$, cant get directly. $P(\mathbf{\beta} | \mathbf{y})$:= posterior probability
    • $P(\beta)$:= prior probability; our assumption of $\beta$
    • ${P(\mathbf{y})}$:= normalization constant since the given data is fixed
    • $P(\mathbf{y}|\mathbf{\beta})$:= likelihood function;
      • can be view as function of parameter $\beta$
      • not a probability distrubution, so intergral w.r.t $\beta$ not nessary = 1
    • state Bayes theorem as : posterior 8 likelihood × prior, consider all of these as function of parameters $\beta$
      • about the choice of the parameters murphy have detail description in chapter3 but should notice that murphy have different notation for the relation above…
        • intergrate both side base on $\beta$: $p(y)= \integral p(y|\beta)p(\beta)d\beta$
        • issue: particularly the need to marginalize (sum or integrate) over the whole of parameter space

Murphy’s explanation – MAP(maximum a posterior) estimate

  • look at probability is conditional on the test input x, as well as the training set D, and the form of model M that we use to make predictions
  • Given a probabilistic output, we can always compute our “best guess” as to the “true label” using:
    • $\hat y = \hat f(\mathbf{x} = argmax_{c=1..C} p(y = c|\mathbf{x},D,M)$
    • This corresponds to the most probable class label, and is called the mode of the distribution p(y|x,D); it is also known as a MAP estimate(maximum a posterior).
  • note that p(y|x,D) is the probrbility distribution of y given x and D (and in case of in the situation of model selection, M should also be clearified, write as p(y|x,D,M))

Murphy’s definition of probabilistic model in machine learning

when defining a models, if the number of parameter is:

  1. grow with the amount of training data => non-parametric model

    • flexible but computation cost high
    • eg: KNN
      • KNN probabilistic model:
      • $p(y=c|\mathbf{x},D,K) = 1/K \Sigma_{i in N_K(x,D)} II(y_i=c)$
      • NK(x,D)are the (indices of the)Knearest points toxin DandI(e)is the indicator function
      • issure: fail to work in high D due to curse of dimensionality[SEE BELOW]
  2. fixed => paramteric

    • ads: fast to apply
    • dis: strong assumption about the nature of data distribute
    • regression
      • linear regression:
        • model: $y(x)=w^Tx + \episilon$
        • assume $e~N(\mu,\sigma^2)$ connect linear regre and Gaussian:
          • model: $p(y|\mathbf{x},\mathbf{w},\sigma^2)=N(t|\mathbf{w}^T \phi(\mathbf{x}),\sigma^2(x))$
          • $\phi(\mathbf{x})$ basis function expansion, usually it is a vector of different degree of x – in polynomial regression
    • classification:
      • logistic regression, ber-distribute, use sigmoid function as basis function.
        • define decision rule + basi function expansion ->create decision boundary

the curse of dimensionality (murphy ch1)

  • KNN fail to work in high D
  • explanation:
    • lets say, apply KNN to uniformly distributed points in 3 dimensional unit cube grow a hyper-cube around x untill it contain 0.5 of the data points, $V(hyper-cube)=e_3(0.5)^3=0.5$, $e_3(0.5) = 0.5^{1/3}$,
    • for D dimension, given disired faction f, length of cube around x is $e_D(f)=f^{1/D}$
    • $e_{10}(0.1)=0.8$: D=10, want 0.1 of the datas, length=0.8, ie, need to extend the cube 80% along each dimension around x, which means that, the ‘nearest nerighbor’ is indeed not that near since that on any dimendion, their distance can be 0.8 and the largest distance is only 1;
  • combat by making assumption about the nature of data distribution – know as inductive bias embodied in parametric model

overfitting and model selection

  • overfit is an issue raised in both unsurpervised learning and surpervised learning during the selection of model
  • PRML said by adopting Bayesian approach, the overfitting problem can be avoided and the num of parameters can even greatlt exceed the num of data point
  • overfitting problam can be regards as a general property of maximum likelihood, the matrix tend to be always in ill-condition since that the data points can not be completely in linear relation in reality.

overfitting in linear regression

  • in linear regression, if choose polynormial as basis function, then the degress of the polynormial function distinct the model, the higher the degree is, it is more likely to overfit, which will lead to high error on the test set, since we use this error to estimate the generalization error, it means that the generalization increase
    • as the degree raise, the complexity of the model increase, the matrx will tend to be ill condition, and the solution of parameters will become more sensitive to the training data, which is really bad!
    • also notice the fact that, as degree increase, the value of parameter also increase dramatically, in the example of PRML, D=1,w0=0.19 while for D=9,w9=125201 !
    • commend someone said when we are not sure about the model, we should always choose the simplier model to fit the data
    • commend we can still have some way to use complex model and reduce the generalization error

solution1: shrinkage method / regularization to fight with overfitting

  • usually we will try to penilize the parameters of the model, which perform lift of the eigrnvalue, in this way, we are actually trying to simplify the models, (thinking that when the panelized parameter $\lambda$ bacomre very large, the parameters, except $\w_0$ which is nor panelized, are all going to zero, so we are using a very simple model in fact).

    • involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values
    • add $\lambda / 2 | \mathbf{\beta} |^2$ $\lambda$ is the importance of the regulazation term;
      • if use quadric regularizer - call ridge regression
      • other regulazation such as weight decay in neural network
    • $\lambda$ can suppressed over-fitting, reduce value of weight, but if $\lambda$ too large, weight goes to 0, will lead to poor fit
    • we should also notice that the panelized parameter $\lambda$ could very a lot with the change the degree, from my experiment, it can very in the 1 ~ 10^7 range as the degree goes from about 2 to 10. So when perform cross validation on the selection of the model, more specifically, on the decision of penalized perameter and the degree, we need to pay attention about this
    • after adding regularization, the complexity of the model should not be measured simply by the number of parameters or the degrees

overfitting in KNN,

  • it also happens that when K is small, it tends to overfit.

speed-accuracy-complexity tradeoff

in the following notes, I will focus on the comparison of the tradeoff

  • indicator function: $II(e)=1,if e is true; =2 if e is false$

distribution

multinomial distribution

  • MODEL: tossing K side die
  • $\mathbf{x}=(x_1,…,x_D)$, $x_j$ the number of times side x of the die occurs
  • notice that $N = sum(x_1,x_2,…x_D)$
  • ANOTHER MODEL: from D color, choose fill in N place(not consider the order, ie,red in place1+blue in place2=red in 2+blue in 1), $x_j$ the number of color j shows in the image. probability that j is chosen = $\theta_j$
  • $Mu(\mathbf{x}|N,\mathbf(\theta)) := \binom{N}{x_1} \binom{N-x_1}{x_2}..\binom{N-x_1-x_2-…}{x_D} * \theta_1^{x_1} \theta_2^{x_2}…\theta_D^{x_D}$
  • $= \binom{N}{x_1 x_2 ..x_D} \prod{\substack{j=1,2,…D}} \theta_j^{x_j}$

multi-noulli

  • simply let N = 1;
  • $Mu(\mathbf{x}|1,\mathbf(\theta)) := \prod{\substack{j=1,2,…D}} \theta_j^{II(x_j=1)}$ II for 0,1 case

information theory

  • deta compression/ source coding, concerned with: compact data representation and errorless transmission and storing

entropy

  • for a random variable Y(with K state) with ditribution p, the entropy of Y is denoted by
  • $\mathbb{H}(Y)$ or $\mathbb{H}(p)$
  • $:= - \sum_{\substack{k=1,2,….K}} p(Y=k) log_2 p(Y=k)$
  • is a measure of its uncertainty
    • say if $X \in {1,…,5}$ then if p = [1/5,1/5,1/5,1/5,1/5] the abs(entropy) reach maximum and in this case, entropy(<0) is minimum, therefore uncertainty is largest when p = [1,0,0,0,0] the netropy reach minimum - no uncertainty

feature selection

motivation: having too many features and some of them are actuallt invariant, feature selection is belonging to model selection as well, introduced in notes 5 of Andrew NG

regulazation

- solve ill posed problem/ overfitting
1. when talking about MLE which is alway facing overfitting problem, the MAP and baysian estimation is improving it with prior belif which act as regulazation 

regular expression help to replace the wrong mathjex:
eg find: (mathbf)((.))
replace with \1{\2}