EM in GMM and in general

When we try to use MLE to maximize the marginal distribution of x, it is difficult to solve directly, because x sits in a joint distribution with r (from the graphical model: summing the joint distribution p(x, r) over r gives the marginal distribution of x).
So we turn to working with the joint distribution, and we try to marginalize the latent variable r out ("kick it out") to make our life easier.
To solve the problem we use EM as the solver.
my implementation of the EM algorithm:
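(What follows is a minimal NumPy/SciPy sketch of EM for a GMM rather than the original code; `em_gmm` and its argument names are illustrative.)

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture; returns (pi, mu, Sigma, responsibilities)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                        # uniform mixing weights
    mu = X[rng.choice(N, K, replace=False)]         # random data points as initial means
    Sigma = np.stack([np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: p_nk = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)            # shape (N, K)
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for pi_k, mu_k, Sigma_k
        Nk = p.sum(axis=0)                                      # effective counts
        mu = (p.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (p[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, p
```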


k means - online learning

Define cost: ‘distortion measure’:

for each $x_n$, calculate its squared distance to its assigned centre and sum over all points to get $J$
($r_{nk}=1$ indicates the group $k$ that $x_n$ is assigned to)

$$J = \sum_n \sum_k r_{nk} \|x_n - \mu_k\|^2$$

goal: given $K$ and $\{x_n\}$, minimize $J$ w.r.t. the parameters $\mu_k$ and $r_{nk}$
since they depend on each other there is no closed form, so we optimize in two alternating steps
algo:

1. randomly initialize $\{\mu_k\}$; assign $r_{nk}$ for each point by minimizing $\|x_n - \mu_k\|^2$ (pick the nearest centre)
2. having $r_{nk}$ for all points, minimize $J$ by taking the derivative w.r.t. $\mu_k$ and updating $\mu_k$ (each centre becomes the mean of its assigned points)
3. repeat until the assignments stop changing
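A minimal NumPy sketch of this two-step loop (the function and argument names are illustrative):

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Alternating minimization of J = sum_n sum_k r_nk ||x_n - mu_k||^2."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # random initial centres
    for _ in range(n_iter):
        # assignment step: r_n picks the nearest centre (mu fixed)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        r = d2.argmin(axis=1)
        # update step: each centre becomes the mean of its assigned points (r fixed)
        mu = np.stack([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    return mu, r
```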

GMM:

START FROM THE GRAPHICAL MODEL

the binary indicator vector $r$ (entries $r_k$, 1-of-$K$ coding) is generated from a multinomial distribution $Mult(r|\pi)$
$\pi$ and $r \in \mathbb{R}^K$ are vectors here, and we assume $p(x|r_k=1)$ is a Gaussian distribution

From the graphical model, summing the joint distribution of $(x, r)$ over all possible states of the latent variable $r$ gives the marginal distribution of $x$. To maximize the marginal distribution of $x$, we can instead maximize the sum over $r$ of the product of the conditional distribution of $x$ given $r$ and the marginal distribution of $r$, i.e. $\sum_r p(x|r)\,p(r)$, which defines our cost function (in this case, we say that we marginalize the latent variable out).

As usual, we optimize this cost function. In this case we have three sets of parameters ($\pi_k$, $\mu_k$, $\Sigma_k$) and they depend on each other, so no closed form can be derived; we use the EM algorithm to find a solution.

The general idea: first (randomly) initialize some of the parameters, then calculate the remaining one with the others fixed, then calculate another one while fixing the rest again, and repeat.

Notice that EM does not only apply to GMMs. Also, in this setting the problem is not identifiable, i.e. the 'almost best' solution is not unique.

relation to k means

We can see that fitting a GMM with EM is very similar to the k-means algorithm, and in fact it can be shown that when every component has covariance matrix $\epsilon I$ with $\epsilon \to 0$, EM for the GMM reduces to k-means.
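A small numerical illustration of this limit (the means and the test point below are made up; equal mixing weights are assumed so that $\pi_k$ cancels): as the shared covariance $\epsilon I$ shrinks, the responsibilities collapse to a one-hot assignment to the nearest mean.

```python
import numpy as np

mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # toy means
x = np.array([1.0, 0.2])
for eps in [1.0, 0.1, 0.01]:
    # log of pi_k N(x|mu_k, eps*I), up to a constant shared by all k
    logits = -np.sum((x - mu) ** 2, axis=1) / (2 * eps)
    logits -= logits.max()                    # stabilize before exponentiating
    resp = np.exp(logits) / np.exp(logits).sum()
    print(eps, np.round(resp, 3))             # tends to a one-hot vector as eps -> 0
```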

Attention: don't mix up the complete-data likelihood and the incomplete-data (x only) likelihood.

the complete-data likelihood comes from the joint distribution

$$p(X,r) = \prod_n \prod_k \pi_k^{r_{nk}}\, N(x_n|\mu_k,\Sigma_k)^{r_{nk}}$$

and the likelihood which we actually define as the cost function is

$$p(X) = \prod_n \sum_k \pi_k N(x_n|\mu_k,\Sigma_k)$$
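A small sketch contrasting the two quantities (assuming one-hot assignments `R` are known for the complete-data case; the helper name is made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihoods(X, R, pi, mu, Sigma):
    """Complete-data log p(X, r) (uses the one-hot matrix R) vs incomplete-data log p(X)."""
    dens = np.stack([multivariate_normal.pdf(X, mu[k], Sigma[k])
                     for k in range(len(pi))], axis=1)      # (N, K) component densities
    complete = np.sum(R * (np.log(pi) + np.log(dens)))      # sum_n sum_k r_nk (ln pi_k + ln N)
    incomplete = np.sum(np.log(dens @ pi))                  # sum_n ln sum_k pi_k N
    return complete, incomplete
```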

EM

We mentioned that EM does not only work for GMMs; it applies generally to other MLE problems formed from a joint distribution (i.e. in the general case $p(x|r_k)$ may not be a Gaussian distribution).

consider a cost that depends on $x$ (given), $z$, and $\theta$

1. initialize $\theta$ (randomly)
2. E (expectation): since $z$ is unknown, compute the posterior $p(z|x,\theta)$ and take the expectation of the complete-data objective under it (like the assignment step in k-means)
3. M (maximization): with that expectation fixed, optimize the cost function w.r.t. $\theta$ and update $\theta$ (like recomputing the cluster means in k-means)
4. repeat until convergence
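In code, the general scheme is just an alternating loop; the GMM case plugs in responsibilities for the E-step and the closed-form updates for the M-step (this skeleton and its function names are illustrative):

```python
def em(x, theta0, e_step, m_step, n_iter=100):
    """Generic EM: e_step returns the posterior over z given (x, theta);
    m_step returns the theta that optimizes the expected complete-data objective."""
    theta = theta0
    for _ in range(n_iter):
        q_z = e_step(x, theta)     # E: compute p(z | x, theta)
        theta = m_step(x, q_z)     # M: update theta with that expectation held fixed
    return theta
```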


below are some knowledge fragments

  • probability density function $p(X|\mu, \sigma)$
    integrating the pdf gives the cdf

multivariate Gaussian distribution

pdf:
$X \sim N(\mu, \Sigma)$

$$p(x|\mu, \Sigma) = \frac{1}{(2 \pi)^{n/2}\, \det(\Sigma)^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where $n$ is the dimension of $x$

$$\Sigma = E[(X-\mu)(X-\mu)^T]$$
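A direct NumPy translation of this density (the function name is made up; it should agree with `scipy.stats.multivariate_normal.pdf`):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, with n = dimension of x."""
    n = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```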

  • given X,
  • assume $X$ comes from $N(\mu, \Sigma)$; estimate $(\mu, \Sigma)$ by MLE:
  • take log, get the log-likelihood function
    $$L = \ln p(X|\mu, \Sigma) = \ln\Big(\prod_n p(\mathbf{x_n}|\mu, \Sigma)\Big) = -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_n (x_n-\mu)^T \Sigma^{-1} (x_n-\mu) + const$$

  • take derivative wrt $\Sigma$ and $\mu$
    $$\partial L / \partial \Sigma^{-1} = \frac{N}{2} \Sigma^{T} - \frac{1}{2} \sum_n (x_n-\mu)(x_n-\mu)^T = 0$$
    get
    for $\Sigma$
    $$\Sigma = \frac{1}{N} \sum_n (x_n-\mu)(x_n-\mu)^T$$
    for $\mu$
    $$\partial L / \partial \mu = … = 0$$
    $$\mu = 1/N \sum_n \mathbf{x_n} $$
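These closed forms are just the sample mean and the (1/N, biased) sample covariance; a quick sketch:

```python
import numpy as np

def gaussian_mle(X):
    """MLE of a Gaussian from rows of X: mu = mean, Sigma = 1/N sum (x_n-mu)(x_n-mu)^T."""
    N = len(X)
    mu = X.mean(axis=0)
    d = X - mu
    Sigma = d.T @ d / N        # note 1/N (the MLE), not the unbiased 1/(N-1)
    return mu, Sigma
```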

multi-gaussian models

x can only belong to one of the k models - hard clustering

$$p(x|\mu,\Sigma,r) = N(x|\mu_j,\Sigma_j) \quad \text{if } x \text{ is in group } j\ (r_j=1)$$

  • compact form

$$p(x|\mu,\Sigma,r) = \prod_k [N(x|\mu_k,\Sigma_k)]^{r_k}$$
notice that the product together with the exponents $r_k$ means only one $N$ is actually chosen

Gaussian mixture distribution

x has a probability of belonging to several groups; the marginal distribution of the variable x is obtained by summing the joint distribution over all possible states of the hidden variable r

$$p(\mathbf{x}) = \sum_k \pi_k N(x|\mu_k,\Sigma_k) = \sum_k p(x|r_k=1)\, p(r_k=1)$$
notice that here we sum up all the Gaussians weighted by $\pi_k$

consider $\{x_1, x_2, \dots\}$

$$p(\mathbf{X}) = \prod_n p(\mathbf{x_n}) = \prod_n \sum_k \pi_k N(x_n|\mu_k,\Sigma_k)$$

  • take log
  • (1.1) $$L = \ln p(\mathbf{X}) = \ln \prod_n p(\mathbf{x_n}) = \sum_n \ln\Big(\sum_k \pi_k N(x_n|\mu_k,\Sigma_k)\Big)$$

  • take derivative wrt $\Sigma$ and $\mu$

for each $\mu_k$

$$\partial L / \partial \mu_k = \sum_n \frac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j N(x_n|\mu_j,\Sigma_j)} \cdot \frac{\partial \exp(\cdot)}{\partial \mu_k} = 0$$

$$\frac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum_j \pi_j N(x_n|\mu_j,\Sigma_j)} = p_{nk}$$

$$\frac{\partial \exp(\cdot)}{\partial \mu_k} = \Sigma_k^{-1}(x_n - \mu_k)$$

TARGET1:——->>

$$\mu_k = \frac{\sum_n p_{nk}\, x_n}{\sum_n p_{nk}}$$

def:
$N_k = \sum_n p_{nk}$ :
effective # points assigned to cluster k

TARGET2:——->>

$$\pi_k = N_k/N = \frac{\sum_n p_{nk}}{N}$$

TARGET3:——->>

for each $\Sigma_k$

$$\Sigma_k = \frac{\sum_n p_{nk}\,(x_n-\mu_k)(x_n-\mu_k)^T}{\sum_n p_{nk}}$$
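The three targets together form the M-step; given the responsibilities $p_{nk}$ they are a few lines of NumPy (names illustrative):

```python
import numpy as np

def m_step(X, p):
    """Closed-form updates TARGET1-3 from responsibilities p of shape (N, K)."""
    N, D = X.shape
    Nk = p.sum(axis=0)                                  # N_k = sum_n p_nk
    mu = (p.T @ X) / Nk[:, None]                        # TARGET1: weighted means
    pi = Nk / N                                         # TARGET2: mixing weights
    Sigma = np.stack([((p[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / Nk[k]
                      for k in range(len(Nk))])         # TARGET3: weighted covariances
    return pi, mu, Sigma
```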

Bayesian viewpoint


(prior) marginal distribution $p(\mathbf{r})$

$r \in \mathbb{R}^K$
the $k$-th entry $= 1$ means $x$ belongs to the $k$-th group:

prior probability

$$p(r) = \pi_k, \quad \text{if } r_k = 1$$

compactly, notice that only one $r_j = 1$

$$p(\mathbf{r}) = \prod_k p(r_k = 1)^{r_k} = \prod_k \pi_k^{r_k}$$


conditional distribution $p(x|r)$ [defined by the arrow in the graph]

knowing that x belongs to the k-th group, the distribution of x is
$$p(x|r_k=1) = N(x|\mu_k,\Sigma_k)$$
compactly:
$$p(x|r) = \prod_k N(x|\mu_k,\Sigma_k)^{r_k}$$


joint distribution $p(x,r)$ [product of all conditional probabilities in the graph]

$$p(x, r_k=1) = p(x|r_k=1)\,p(r_k=1) = \pi_k N(x|\mu_k,\Sigma_k)$$
compactly
$$p(x,r) = p(x|r)p(\mathbf{r}) = \prod_k \pi_k^{r_k} N(x|\mu_k,\Sigma_k)^{r_k} $$
notice that in a Bayesian network, if a node does not have parents then its conditional probability equals its marginal probability


marginal distribution of x $p(x)$

= sum up the joint distribution over all possible states of r (r has k states)
$$p(x) = \sum_r p(x,r) = \sum_k p(x, r_k=1) = \sum_k p(x|r_k=1)\,p(r_k=1)$$

$$= \sum_k p(x|r_k=1)\,\pi_k^{r_k} = \sum_k N(x|\mu_k,\Sigma_k)\,\pi_k^{r_k}$$
since $r_k = 1$ is specified, $\pi_k^{r_k} = \pi_k$:
$$= \sum_k N(x|\mu_k,\Sigma_k)\,\pi_k$$

consider N points:
$$p(X) = \prod_n p(x_n) = \prod_n \sum_k N(x_n|\mu_k,\Sigma_k)\,\pi_k$$
which is the same as (1.1)
take log to get the log-likelihood function, then take derivatives

  • It follows that for every observed data point $x_n$ there is a corresponding latent vector $r_n$ , i.e., its sub-class

posterior probability $p(r_k=1|x) = p_{nk} = \gamma(r_k)$

responsibility
$$p(r_k=1|x) = \frac{p(r_k=1)\,p(x|r_k=1)}{\sum_j p(r_j=1)\,p(x|r_j=1)} = \frac{\pi_k N(x|\mu_k,\Sigma_k)}{\sum_j \pi_j N(x|\mu_j,\Sigma_j)} = p_{nk} = \gamma(r_k)$$
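In practice the responsibilities are best computed in log space (far-away points underflow the raw densities); a sketch using `scipy.special.logsumexp`:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pi, mu, Sigma):
    """gamma(r_nk) = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    log_w = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                      for k in range(len(pi))], axis=1)       # (N, K): log pi_k + log N
    return np.exp(log_w - logsumexp(log_w, axis=1, keepdims=True))
```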


generating the MLE problem for GMM models

We are given N i.i.d. samples $\{x_n\}$ with unknown latent assignments $\{r_n\}$ (which group each point belongs to), governed by the mixing parameters $\pi_k$.
Each component has its own parameters: mean $\mu_k$ and covariance $\Sigma_k$.

The goal is to estimate the three sets of parameters:
the mixing weights $\{\pi_k\}$
the latent assignments $\{r_n\}$ (via their responsibilities)
the means $\mu_k$ and covariances $\Sigma_k$


notes on how to calculate the gradient and find the minimum:

preliminaries:

  1. $$\log\Big(\sum_k a_k b_k\Big) \ge \sum_k a_k \log(b_k) \quad \text{for } a_k \ge 0,\ \sum_k a_k = 1 \text{ (Jensen's inequality)}$$
  2. $\Sigma$ is symmetric
  3. $$\frac{\partial\, a^T B a}{\partial a} = (B + B^T)a$$ (checked numerically after this list)
  4. $$\frac{\partial \det(AXB)}{\partial X} = \det(AXB)\,(X^{-1})^T$$
  5. $$\frac{\partial\, a^T X^{-1} b}{\partial X} = -X^{-T} a b^T X^{-T}$$
  6. $$\frac{\partial \sum_k \pi_k N_k}{\partial \mu_k} = \frac{\partial\, \pi_k N_k}{\partial \mu_k}$$
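A quick finite-difference check of identity 3 (the random `B` and `a` here are made up for the test):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))                  # general (non-symmetric) matrix
a = rng.normal(size=4)
analytic = (B + B.T) @ a                     # identity 3
eps = 1e-6
numeric = np.array([((a + eps * e) @ B @ (a + eps * e) - a @ B @ a) / eps
                    for e in np.eye(4)])     # forward differences of a^T B a
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```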

problems of GMM

  1. If one component's mean sits on (or very near) a data point and its variance becomes very small, then for that point the exponent goes to 0 so $\exp(\cdot) \to 1$, while the $1/\sigma$ normalization goes to infinity, so the likelihood we want to maximize goes to infinity (a singularity; see the numerical illustration after this list).
  2. Not identifiable:
    relabeling the groups, e.g. permuting the assignment of the labels {1, 2, 3} to three groups, gives an equally good clustering, so with k groups there are actually k! equivalent solutions.

  3. Takes many more iterations than k-means.

  4. EM is not guaranteed to find the global maximum of the log-likelihood function.
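A numerical illustration of problem 1 (toy data; one component's mean is pinned to the first data point while its variance shrinks): the incomplete-data log-likelihood grows without bound.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
pi = np.array([0.5, 0.5])
mu = np.array([X[0], X.mean(axis=0)])            # component 0 collapsed onto x_0
for s2 in [1.0, 1e-2, 1e-4, 1e-6]:
    covs = [s2 * np.eye(2), np.eye(2)]           # shrink only component 0's covariance
    dens = np.stack([multivariate_normal.pdf(X, mu[k], covs[k]) for k in range(2)], axis=1)
    print(s2, np.sum(np.log(dens @ pi)))         # keeps increasing as s2 -> 0
```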