classification

A brief introduction to the simplest classification method: logistic regression

logistic regression

  • hypothesis $h_{\theta}(x) = \sigma(\theta^Tx)$
  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the logistic (sigmoid) function
  • we can assume: 
    • $P(y = 1 \mid x; \theta) = h_{\theta}(x)$
    • $P(y = 0 \mid x; \theta) = 1 - h_{\theta}(x)$
    • the two cases can be written compactly as a single expression (spelled out after this list)
  • Assuming that the m training examples were generated independently
    • we can write the likelihood $L(\theta) = p(\mathbf{y} \mid X;\theta)$ as the product $\prod_{i} p(y^{(i)} \mid x^{(i)};\theta)$, then take the log to get the log likelihood of $\theta$
      • maximize it (or minimize its negative) by one of the methods below
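
Spelled out, the compact form of the assumed distribution and the resulting log likelihood are the standard ones:

$$p(y \mid x; \theta) = \left(h_{\theta}(x)\right)^{y}\left(1 - h_{\theta}(x)\right)^{1-y}$$

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)}\log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_{\theta}(x^{(i)})\right)$$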

by gradient ascent (or gradient descent on the negative log likelihood)

  • one thing to mention: the stochastic gradient ascent update rule here is $$\theta_j := \theta_j + \alpha \left(y^{(i)} - h_{\theta}(x^{(i)})\right)x_j^{(i)}$$ it looks almost the same as the rule we saw in linear regression, but $h_{\theta}(x^{(i)})$ here is a non-linear function of $\theta^Tx^{(i)}$ (see the sketch below)
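
A minimal NumPy sketch of training with that stochastic update (the function names, hyperparameters, and synthetic data are illustrative, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, epochs=200):
    """Stochastic gradient ascent on the log likelihood:
    theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij
    X is (m, n) and should already include an intercept column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):
            h = sigmoid(X[i] @ theta)            # h_theta(x^(i)), a scalar
            theta += alpha * (y[i] - h) * X[i]   # updates every theta_j at once
    return theta

# tiny synthetic example: label is 1 when x1 + x2 > 1
rng = np.random.default_rng(0)
X_raw = rng.uniform(0.0, 1.0, size=(200, 2))
y = (X_raw.sum(axis=1) > 1.0).astype(float)
X = np.hstack([np.ones((200, 1)), X_raw])        # prepend intercept term
theta = logistic_sgd(X, y)
print("training accuracy:", ((sigmoid(X @ theta) > 0.5) == y).mean())
```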

by newton's method

  • find a zero of the first derivative of $\ell(\theta)$
  • the update is $\theta := \theta - H^{-1}\nabla_{\theta}\ell(\theta)$, where $H$ is the Hessian (see the sketch after this list)
  • Newton’s method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton’s can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian.
  • When Newton’s method is applied to maximize the logistic regression log likelihood function $\ell(\theta)$, the resulting method is also called Fisher scoring
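
A sketch of the Newton / Fisher-scoring update for the logistic log likelihood, using the standard gradient $X^T(y - h)$ and Hessian $-X^TSX$ with $S = \mathrm{diag}(h(1-h))$; the function name and iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Newton's method for maximizing l(theta): theta := theta - H^{-1} grad."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)               # gradient of l(theta)
        S = np.diag(h * (1.0 - h))
        H = -X.T @ S @ X                   # n-by-n Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)  # avoids forming H^{-1} explicitly
    return theta
```

A handful of iterations is usually enough to get very close to the maximizer, which matches the "fewer but costlier iterations" trade-off noted above.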

perceptron learning algorithm

Consider modifying the logistic regression method to “force” it to output values that are exactly 0 or 1. Instead of the sigmoid, we now use the threshold function $g(z) = 1$ if $z \geq 0$, and $g(z) = 0$ if $z < 0$.

  • notice that it is a very different type of algorithm from logistic regression (a minimal sketch follows below)
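
A minimal sketch of the perceptron learning rule under the threshold function above (the same additive update as before, with $g$ in place of the sigmoid; names are illustrative):

```python
import numpy as np

def threshold(z):
    """g(z) = 1 if z >= 0, else 0 (works on scalars and arrays)."""
    return np.where(z >= 0, 1.0, 0.0)

def perceptron(X, y, alpha=0.1, epochs=50):
    """Perceptron rule: theta := theta + alpha * (y_i - g(theta^T x_i)) * x_i"""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            pred = threshold(X[i] @ theta)        # hard 0/1 prediction
            theta += alpha * (y[i] - pred) * X[i]
    return theta
```

Unlike logistic regression, the outputs here are hard 0/1 decisions rather than probabilities, which is one reason it is a very different type of algorithm despite the similar-looking update.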