MLE & MAP & Bayesian Estimation

The linear regression model can be interpreted from a probabilistic point of view:
you will find it ‘magical’ that least squares appears in exactly the same form as the maximum likelihood estimate.
Also notice that ridge regression can be approached the same way, as MAP estimation with a Gaussian prior on the parameters.
I went through a hard time struggling with the terms probability and likelihood
and their relation. This serves as a reference. I think I may need to clean it up after I give a summary of GLMs and the exponential family.
MLE: maximum likelihood estimate
MAP: maximum a posteriori

MLE

difference between likelihood and probability

| | likelihood | probability (density) |
|---|---|---|
| what is held fixed / known | the data $x$ | the parameter $\theta$ |
| a function of | $\theta$ | $x$ |
| use | estimate $\theta$ from the given data $x$ – MLE | |

  • if we have a probability model with parameters $\theta$, then note that (by Bayes' rule)
    $$p(\theta|y)=\frac{p(y|\theta)\,p(\theta)}{p(y)}$$

  • pack all parameters into a single vector $\theta = \{\alpha,\sigma^2\}$ and write the function:
    $$f(y|x;\theta)$$ (Andrew suggests we write $f(y|x;\theta)$ instead of $f(y|x,\theta)$ since $\theta$ is not random) (it is said that the roles of $y$ and $\theta$ are interchangeable)
    1. When we view $p(\mathbf{y}|X;\theta)$ as a density/probability/distribution of $\mathbf{y}$, i.e., as a function varying in $\mathbf{y}$: given fixed values of $X$ and $\theta$, what is the distribution of $\mathbf{y}$? – the uncertainty
    2. When we view it as a function of $\theta$, holding $X$ and $\mathbf{y}$ constant: given the model relating each $y^{(i)}$ to $x^{(i)}$, what is the best guess/estimate of $\theta$? Then it becomes a likelihood function of $\theta$
    • $$L(\theta)=L(\theta;X,\mathbf{y})=p(\mathbf{y}|X;\theta)= \prod_{i=1}^{N}p(y^{(i)}|x^{(i)};\theta)$$ (pay attention to the difference between $L$ and $p$)
    • Then we turn to the principle of maximum likelihood, which says: choose $\theta$ so as to make the data as probable as possible! So we should now choose $\theta$ to maximize $L(\theta)$
      • let's look into $L(\theta)$: if we assume the errors are i.i.d. and follow a Gaussian distribution with variance $\sigma^2$, then we know the distribution of $y^{(i)}$ given $x^{(i)}$ (same shape as the error $e^{(i)}$), which lets us expand $p(y^{(i)}|x^{(i)};\theta)$ – keep in mind!!
      • we also take the log for easier calculation, which is now called the log-likelihood $l(\theta)$, and finding the max turns into finding the min of $$\frac{1}{2}\sum_{i=1}^{N}\big(e^{(i)}\big)^2$$
        • SURPRISE! the form is the same as $J(\theta)$, the MSE cost function we derived earlier!
        • which can be solved by gradient descent or least squares (a numerical sketch follows after the quote below)
      • additionally, for two candidates $\theta_1$ and $\theta_2$, the likelihood ratio is used to measure their relative likelihood
    • comment: this view of the parameters as being constant-valued but unknown is taken in frequentist statistics

      (from Andrew's notes, kept for further understanding) To summarize: Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ. This is thus one set of assumptions under which least-squares regression can be justified as a very natural method that’s just doing maximum likelihood estimation. (Note however that the probabilistic assumptions are by no means necessary for least-squares to be a perfectly good and rational procedure, and there may—and indeed there are—other natural assumptions that can also be used to justify it.) Note also that, in our previous discussion, our final choice of θ did not depend on what $\sigma^2$ was, and indeed we’d have arrived at the same result even if $\sigma^2$ were unknown. We will use this fact again later, when we talk about the exponential family and generalized linear models.
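
A minimal numerical sketch of this point (my own toy example, not from Andrew's notes; the data, the fixed $\sigma$, and the crude grid search are assumptions made purely for illustration): under the i.i.d. Gaussian-error assumption, the $\theta$ that maximizes the log-likelihood $l(\theta)$ is the same $\theta$ that minimizes the least-squares cost.

```python
# Toy check (illustration only): maximizing the Gaussian log-likelihood
# picks the same theta as the closed-form least-squares solution.
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-3, 3, size=N)
X = np.column_stack([np.ones(N), x])                 # design matrix with intercept
theta_true = np.array([1.0, 2.0])
sigma = 0.5                                          # noise std, assumed known here
y = X @ theta_true + rng.normal(0, sigma, size=N)    # y = theta^T x + e,  e ~ N(0, sigma^2)

# least-squares / MSE solution (closed form via the normal equations)
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(theta):
    # l(theta) = -N/2 log(2 pi sigma^2) - 1/(2 sigma^2) * sum (y - X theta)^2,
    # so argmax_theta l(theta) = argmin_theta 1/2 sum (e^(i))^2
    resid = y - X @ theta
    return -N / 2 * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

# crude grid search for the maximizer of l(theta)
grid = np.linspace(-1, 4, 201)
theta_mle = max(((a, b) for a in grid for b in grid),
                key=lambda ab: log_likelihood(np.array(ab)))

print("least squares  :", theta_ls)
print("grid-search MLE:", np.array(theta_mle))   # close to theta_ls, limited only by the grid spacing
```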


| MLE | MAP |
|---|---|
| find the $\theta$ that maximizes $p(x \mid \theta)$ | view $p(\theta \mid x) \propto p(\theta)\, p(x \mid \theta)$, then find the $\theta$ that maximizes $p(\theta \mid x)$, which equals maximizing $p(\theta)\, p(x \mid \theta)$ |

In the linear regression model $f(y|x;\alpha,\sigma)$, adding regularization (called a prior belief from the probabilistic point of view) gives ridge regression, so we can view ridge regression as a MAP estimator.
disadvantage of plain MLE: overfitting, explained below
  • taken from here: MLE has the problem of overfitting the data; the variance of the parameter estimates is high, or put another way, the outcome of the parameter estimate is sensitive to random variations in the data (which becomes pathological with small amounts of data). To deal with this, it usually helps to add regularisation to MLE (i.e., reduce variance by introducing bias into the estimate).

  • In maximum a posteriori (MAP), this regularisation is achieved by assuming that the parameters themselves are also (in addition to the data) drawn from a random process. The prior beliefs about the parameters determine what this random process looks like.


MAP

  • in contrast with MLE, we have the maximum-a-posteriori or MAP estimate,

    • which is the $\theta$ that maximizes $p(θ | x)$.
    • Since x is fixed, this is equivalent to maximizing $p(θ) p(x | θ)$, the product of the prior probability of θ with the likelihood of θ.
    • the prior probability/belief is exactly the regularization term in ridge regression, so if the prior belief is strong, we get high bias and low variance (the data affect the estimate only a little) – see the sketch after this list
  • for an infinite amount of data, MAP gives the same result as MLE (as long as the prior is non-zero everywhere in parameter space);

  • for an infinitely weak prior belief (i.e., uniform prior), MAP also gives the same result as MLE.
  • MLE can be silly: for example, if we toss a coin twice and get heads both times, then MLE says you will always get heads in the future. Bayesians have a cleverer explanation, taking into account the prior assumption that heads comes up with probability 0.5
  • MAP is the foundation for Naive Bayes classifiers
  • MAP is applied in spam filters while MLE cannot be
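
A small sketch of the MLE-vs-MAP contrast on ridge regression (my own toy example; the data, the prior standard deviation `tau`, and all variable names are made-up assumptions). With Gaussian noise of variance $\sigma^2$ and a zero-mean Gaussian prior $\mathcal{N}(0,\tau^2 I)$ on the weights, maximizing the posterior is the same as minimizing the ridge objective with $\lambda = \sigma^2/\tau^2$:

```python
# Sketch: MAP estimation with a Gaussian prior on the weights reproduces the
# ridge-regression closed form theta = (X^T X + lambda I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0, 0.5])
sigma, tau = 1.0, 0.5                      # noise std and prior std (assumed known)
y = X @ theta_true + rng.normal(0, sigma, size=N)

lam = sigma**2 / tau**2                    # regularization strength implied by the prior

# MLE (ordinary least squares): maximize p(y | X; theta)
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP: maximize p(theta | y, X) proportional to p(y | X, theta) p(theta),
# i.e. minimize ||y - X theta||^2 + lam ||theta||^2  (ridge)
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("MLE / OLS  :", theta_mle)
print("MAP / ridge:", theta_map)           # shrunk toward the prior mean 0
```

With a strong prior (small `tau`, hence large `lam`) the MAP estimate is pulled hard toward zero (high bias, low variance); as `tau` grows the prior flattens out and MAP converges to the MLE, matching the bullet points above.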

In Naive Bayes classifiers, we assume the features are conditionally independent given the class,
and use empirical probabilities:
prediction $= \arg\max_{c} P(C = c \mid X = x) = \arg\max_{c} P(X = x \mid C = c)\,P(C = c)$ (the two argmaxes agree since $P(X = x)$ does not depend on $c$)
example:
$P(\text{spam}\mid\text{words}) \propto \prod_{i=1}^{N}P(\text{word}_i\mid\text{spam})\,P(\text{spam})$
$P(\neg\text{spam}\mid\text{words}) \propto \prod_{i=1}^{N}P(\text{word}_i\mid\neg\text{spam})\,P(\neg\text{spam})$
Whichever one is bigger wins (a toy implementation is sketched below).
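
A toy implementation of exactly this comparison (the vocabulary, the four training messages, and the function names are all made up for illustration; I also add Laplace/add-one smoothing, which the notes do not mention, so that an unseen word does not zero out the product):

```python
# Toy naive Bayes spam filter over a made-up vocabulary (illustration only).
# Scores P(class) * prod_i P(word_i | class) for both classes and picks the larger.
import math
from collections import Counter

train = [
    ("spam",     "win money now".split()),
    ("spam",     "free money offer".split()),
    ("not_spam", "meeting schedule today".split()),
    ("not_spam", "project meeting notes".split()),
]

classes = ["spam", "not_spam"]
vocab = {w for _, words in train for w in words}
word_counts = {c: Counter() for c in classes}
class_counts = Counter()
for c, words in train:
    class_counts[c] += 1
    word_counts[c].update(words)

def log_score(words, c):
    # log P(c) + sum_i log P(word_i | c), with add-one (Laplace) smoothing
    logp = math.log(class_counts[c] / sum(class_counts.values()))
    total = sum(word_counts[c].values())
    for w in words:
        logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    words = text.split()
    return max(classes, key=lambda c: log_score(words, c))

print(classify("free money today"))   # -> spam
print(classify("project meeting"))    # -> not_spam
```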

  • disadvantage: sampling is important; things may blow up if we train on data that is mostly spam and test on data that is mostly non-spam (our $P(\text{spam})$ is WRONG) – but we can perform cross-validation to avoid this

  • a modification of NB: model the joint conditional distribution of the features instead of assuming independence

  • decision surface of Naive Bayes: $P(c \mid \text{word}) = P(\text{word} \mid c)\,I(\text{word}) + P(\neg\text{word} \mid c)\,I(\neg\text{word})$

Bayesian estimation

  • using Bayes’ rule, come up with a distribution of possible parameters:
    $P(\boldsymbol{\theta} \mid \mathbf{D}) = \frac{P(\mathbf{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{P(\mathbf{D})}$
  • $p(\boldsymbol{\theta})$ is known as the prior (it means we make some assumption about the parameters, or we somehow know some fact, such as the coin having a 0.5 chance of landing heads; but the assumption may obviously be wrong?)
  • KEEP IN MIND: $P(\mathbf{D})$ is a constant
  • we need to integrate over $\theta$ (e.g., to get $P(\mathbf{D}) = \int P(\mathbf{D}\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$), which has a high computational cost – see the sketch below
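
A minimal sketch of Bayesian estimation on the coin example from the MAP section (my own illustration; the Beta(2, 2) prior and the variable names are assumptions, not from the notes). With a Beta prior and a Binomial likelihood the posterior is again a Beta distribution in closed form (conjugacy), so the integral over $\theta$ that makes $P(\mathbf{D})$ expensive in general is available analytically here:

```python
# Bayesian estimation for the coin: Beta prior + Binomial likelihood
# gives a Beta posterior in closed form, so no numerical integration is needed.
from scipy import stats

a, b = 2.0, 2.0          # Beta(2, 2) prior: mild belief that theta is near 0.5
heads, tosses = 2, 2     # "two tosses, both heads", as in the MLE coin example

posterior = stats.beta(a + heads, b + tosses - heads)    # Beta(4, 2)

theta_mle = heads / tosses                               # 1.0  -- "always heads"
theta_map = (a + heads - 1) / (a + b + tosses - 2)       # 0.75 -- posterior mode, pulled toward 0.5
theta_mean = posterior.mean()                            # 2/3  -- posterior mean

print("MLE           :", theta_mle)
print("MAP           :", theta_map)
print("posterior mean:", theta_mean)
```

Unlike the point estimates, the Bayesian answer is a whole distribution over $\theta$, of which MAP is just the mode; with only two tosses the prior keeps the estimate away from the silly MLE answer of 1.0.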

probability or statistic

In probability, we’re given a model, and asked what kind of data we’re likely to see.
In statistics, we’re given data, and asked what kind of model is likely to have generated it.

least square

  • The method of least squares is a standard approach in regression analysis to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns.
  • if all residuals are linear in the unknowns, then it is a linear least-squares problem:
  • The linear least-squares problem occurs in statistical regression analysis;
  • it has a closed-form solution (sketched below).
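
A short sketch of that closed form on a made-up overdetermined system (100 equations, 3 unknowns; everything here is illustrative): the normal equations $X^\top X\,\theta = X^\top y$ give the least-squares solution directly, and `numpy.linalg.lstsq` computes the same solution by a more numerically stable route.

```python
# Overdetermined linear system (more equations than unknowns): the least-squares
# solution has the closed form theta = (X^T X)^{-1} X^T y (normal equations).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                          # 100 equations, 3 unknowns
y = X @ np.array([1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=100)

theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)    # closed form
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # same answer, more stable numerically

print(theta_normal_eq)
print(theta_lstsq)
```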

fisher information

  • Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information.
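
As a concrete worked instance of this definition (my own example, not from the source): for a single observation $X \sim \mathrm{Bernoulli}(\theta)$, the log-density is $\log f(X;\theta) = X\log\theta + (1-X)\log(1-\theta)$, and

$$I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{\!2}\right] = -\,\mathbb{E}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right] = \frac{1}{\theta(1-\theta)},$$

so a single toss carries the most information about $\theta$ when the coin is heavily biased ($\theta$ near 0 or 1).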

The likelihood

  • defined as the joint density or probability of the outcomes, with the roles of the values of the outcomes $y$ and the values of the parameters $\theta$ interchanged: $p(y|x,\theta)$ read as a function of $\theta$, i.e., $L(\theta;y)$
  • score function = derivative of the log-likelihood:
    • to find the maximum likelihood estimator, set the score function equal to zero (worked example below)
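
A quick worked example of setting the score to zero (my own, in the simplest Gaussian setting where the mean $\mu$ is the only unknown and $\sigma^2$ is assumed known): for $y_1,\dots,y_n$ i.i.d. $\mathcal{N}(\mu,\sigma^2)$,

$$l(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2, \qquad s(\mu) = \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu),$$

$$s(\hat\mu) = 0 \;\Rightarrow\; \hat\mu_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y.$$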

maximum likelihood estimator

  • The maximum likelihood estimator of θ for the model given by the joint densities or probabilities $f(y;\theta)$, with $\theta \in \Theta$, is defined as the value of θ at which the corresponding likelihood $L(\theta;y)$ attains its maximum:
  • $\hat\theta_{ML} = \arg\max_{\theta} L(\theta;y)$
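
Applied to the coin example used earlier (my own worked case): with $k$ heads in $n$ independent tosses,

$$L(\theta;y) = \theta^{k}(1-\theta)^{n-k}, \qquad \hat\theta_{ML} = \arg\max_{\theta} L(\theta;y) = \frac{k}{n},$$

which is why two heads in two tosses gives the extreme estimate $\hat\theta_{ML}=1$.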

efficiency of the maximum likelihood estimators

  • relates to asymptotic settings
  • A ubiquitous caveat associated with all the results is that the model has to be valid;
  • it has to contain the distribution according to which the outcomes are generated.
  • Consistency
    • a property of an estimator
    • that it would recover the value of the target if it were based on many observations.
    • i.e., we refer to the sequence of (univariate) estimators $\hat\theta_n$ based on the $n$th set of observations $y_n$ as a single estimator. Consistency of such an estimator $\hat\theta$ of a target $\theta$ is defined as convergence of $\hat\theta_n$ to the target $\theta$ as $n\to+\infty$.
    • for MLE:
      • An important result about maximum likelihood estimators is that under some regularity conditions they are consistent. The regularity conditions include smoothness of the likelihood, its distinctness for each vector of model parameters and finite dimensionality of the parameter space, independent of the sample size.
    • Asymptotic Efficiency and Normality
      • The qualifier asymptotic refers to properties in the limit as the sample size increases above all bounds.
      • For a set of many conditionally independent outcomes (large sample size $n$), given covariates and a finite-dimensional set of parameters $\theta$, the maximum likelihood estimator is approximately unbiased, and its distribution is well approximated by the normal distribution with sampling variance matrix equal to the inverse of the expected information matrix. This result is referred to as asymptotic normality. Further, the maximum likelihood estimator is asymptotically efficient and, asymptotically, the sampling variance of the estimator is equal to the corresponding diagonal element of the inverse of the expected information matrix. That is, for large $n$, there are no estimators substantially more efficient than the maximum likelihood estimator.
        http://www.52nlp.cn/wp-content/uploads/2015/01/prml3-10.png
      • The Cramér–Rao inequality is a powerful result that relates to all unbiased estimators. It gives a lower bound for the variance of an unbiased estimator (stated in symbols after this list).
      • Asymptotic normality and efficiency of the maximum likelihood estimator confer the central role on the normal distribution in statistics.
  • Instead of $L(\theta; y,X)$ it is more convenient to work with its logarithm, called the log-likelihood – the product converts into a summation
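
In symbols (my own compact summary of the two results above, not a quote from the source): writing $I_n(\theta)$ for the expected (Fisher) information matrix of the sample, asymptotic normality says

$$\hat\theta_{ML} \;\overset{a}{\sim}\; \mathcal{N}\!\big(\theta,\; I_n(\theta)^{-1}\big) \quad \text{as } n\to\infty,$$

and the Cramér–Rao inequality says that any unbiased estimator $\tilde\theta$ satisfies

$$\operatorname{Var}(\tilde\theta) \;\succeq\; I_n(\theta)^{-1},$$

where $\succeq$ means the difference is positive semi-definite; asymptotically the MLE attains this lower bound.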

information matrix/ fisher information

  • Fisher information is used, first, to determine the sample size with which we design an experiment; second, in the Bayesian paradigm, Fisher information is used to define a default parameter prior (written out below); finally, in the minimum description length paradigm, Fisher information is used to measure model complexity
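
The "default parameter prior" mentioned here is the Jeffreys prior (the naming is my addition; the bullet itself only says "default"):

$$p(\theta) \;\propto\; \sqrt{\det I(\theta)},$$

which is invariant under reparameterization of $\theta$.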

Machine learning constantly borrows interpretations and concepts from probability and statistics, and things that look identical can end up with many different explanations. If you google these topics, I strongly recommend Quora – it is really helpful; on StackExchange it often happens that you work your way through the top answer only to find a reply below saying it's totally wrong, which is enough to shake your whole worldview.