solving linear regression

Linear regression is one of the most basic models in machine learning.
To fit a linear regression model, we usually write down a cost function and try to minimize it. Gradient descent (following the gradient) is a general approach, and there are two solvers for the optimization problem: least squares via gradient descent and the normal equation.

| least squares using gradient descent | least squares using normal equation |
| --- | --- |
| minimizes the cost function iteratively (a search algorithm) | solves the problem directly |
| convergence order = 1 (linear convergence) | closed form, no iteration |
| compute the gradient and update step by step; the update rule is called the 'least mean squares (LMS) update rule' | set the first derivative to 0 to get the normal equation -> closed-form solution: $\theta = (X^TX)^{-1}X^T \mathbf{y}$ |
| may or may not need the whole data set | needs the whole data set |
| | an ill-conditioned matrix can occur |

Within gradient descent, the batch and stochastic variants compare as follows:

| batch | stochastic |
| --- | --- |
| direction calculated from all of the data | direction calculated from one randomly chosen data pair |
| for a convex cost, guaranteed to reach the global minimum | gets close to the global minimum but may not reach it exactly |
| high computation cost, which can be reduced by 'rocking' | |
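For concreteness, here is a minimal NumPy sketch of the normal-equation solver on synthetic data; the data shapes and values are illustrative assumptions, and `np.linalg.lstsq` is shown only as a more numerically robust alternative to forming $(X^TX)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]   # design matrix with a bias column
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Normal equation: theta = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, more robust when X^T X is ill-conditioned: least squares via SVD.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal_eq, theta_lstsq)                   # both should be close to theta_true
```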

Gradient descent

  • a common solver in machine learning:
  • offline learning: given a batch of data, optimize the objective (the average loss, called risk in frequentist decision theory) based on all the data – batch gradient descent
  • online learning: given a stream of data, update the estimate as new data arrives

## BATCH GRADIENT DESCENT (offline learning)

  • one parameter (single-variable function optimization): $\beta^{(k+1)} \gets \beta^{(k)} - \alpha \frac{\partial \mathcal{L}(\beta^{(k)})}{\partial \beta}$ (the step size $\alpha$ is hard to determine manually)
  • multiple parameters (multivariable function optimization): $\mathbf{\beta}^{(k+1)} \gets \mathbf{\beta}^{(k)} - \alpha \frac{\partial \mathcal{L}(\mathbf{\beta}^{(k)})}{\partial \mathbf{\beta}} = \mathbf{\beta}^{(k)} - \alpha \nabla \mathcal{L}(\mathbf{\beta}^{(k)})$
  • disadvantage: requires a large amount of memory
    • a trick to speed it up: 'rocking', passing over the data forward and then backward
  • advantage: can reach the global minimum for a convex cost (a code sketch follows this list)
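A minimal sketch of batch gradient descent for the squared-error cost; the synthetic data, step size, and iteration count below are assumptions for illustration, not values from the notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize the average squared error L(beta) = (1/2m)||X beta - y||^2."""
    m, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / m     # gradient computed from the *whole* batch
        beta = beta - alpha * grad          # beta^{(k+1)} = beta^{(k)} - alpha * grad
    return beta

rng = np.random.default_rng(1)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=200)
print(batch_gradient_descent(X, y))         # approaches [0.5, 2.0, -1.0]
```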

stochastic gradient descent

  • choose a random pair from the training set and take a step based on that single example (a code sketch follows)
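A matching sketch of stochastic gradient descent, updating from one randomly chosen pair at a time; the data and hyperparameters are again illustrative assumptions.

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=20, seed=0):
    """Stochastic gradient descent: update beta from one (x_i, y_i) pair at a time."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                 # visit the pairs in random order
            grad_i = (X[i] @ beta - y[i]) * X[i]     # gradient of the single-example loss
            beta = beta - alpha * grad_i             # LMS update from one data pair
    return beta

rng = np.random.default_rng(2)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
y = X @ np.array([0.5, 2.0, -1.0]) + 0.1 * rng.normal(size=200)
print(sgd(X, y))    # hovers near, but not exactly at, the minimizer
```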

implementation issues

direction selection

  • besides choosing the step size, we can also choose the direction. Originally we use the steepest-descent method, which follows the (negative) gradient direction, but it usually has a low convergence order and can fail to converge numerically; other directions can be used – more general descent methods
  • for a direction $p^k$ and gradient $\nabla f(\mathbf{\beta})$, the requirement on $p^k$ is $\nabla f(\mathbf{\beta})^T p^k < 0$ (it points in a roughly opposite direction to the gradient); a small check is sketched after this list
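A small sketch of the descent-direction condition $\nabla f(\mathbf{\beta})^T p^k < 0$; the preconditioning matrix `D` below is an illustrative assumption of a non-steepest descent direction, not something from the notes.

```python
import numpy as np

# p^k is a valid descent direction iff grad f(beta)^T p^k < 0.
def is_descent_direction(grad, p):
    return float(grad @ p) < 0.0

grad = np.array([3.0, -1.0])
D = np.array([[2.0, 0.0], [0.0, 0.5]])      # any symmetric positive definite matrix
p_steepest = -grad                          # steepest descent: exactly opposite the gradient
p_precond = -D @ grad                       # only "nearly opposite", still a descent direction
print(is_descent_direction(grad, p_steepest), is_descent_direction(grad, p_precond))  # True True
```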

convergence of the method - analysis of gradient descent

  • when is the method guaranteed to converge? when $\mathcal{L}(\beta^{(k)})$ is continuously differentiable on the bounded level set {$\beta | \mathcal{L}(\beta) < \mathcal{L}(\beta_0)$} – refer to the [convergent theorem][1]
    • a convex function has only global minima (all local minima are also global minima); a strictly convex function has a unique global minimum
    • under these conditions, the sequence {$\nabla \mathcal{L}(\beta^{(k)})$} converges to zero
  • convergence order, linear convergence, convergence constant [see here][2]
    • meaning of the convergence constant in gradient descent: if it is 0.1, each iteration shrinks the remaining optimality gap to about 0.1 of its previous value (one extra digit of accuracy per iteration); if it is 0.9, it takes about 22 iterations to gain the same amount, since $0.9^{22} \approx 0.1$
      • a smaller convergence constant is better; even when both cases are linearly convergent, different constants can mean a huge difference in performance
      • it can be calculated as $cc = \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2$ (see here)
    • the convergence rate can be increased by feature scaling / data normalization / preprocessing; methods include rescaling, standardization, or scaling to unit length
      • standardization: x = (x - mean(x)) / std(x)
  • if the problem is a quadratic optimization problem (QP), the model can be written as: minimize $f(x) := \frac{1}{2} x^T Q x + c^T x$, whose level sets are ellipses when $Q$ is a symmetric positive definite matrix; the eigenvalues of $Q$ determine the axis lengths of those ellipses, and the convergence constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix $H(x)$ at the optimal solution $x^*$ – we want the ratio to be small, that is, nearly 1 (see the sketch after this list)
  • ratio of largest to smallest eigenvalue = condition number of the matrix
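A sketch of the convergence constant computed from the eigenvalues of $Q$, using two assumed diagonal matrices to contrast a small and a large eigenvalue ratio.

```python
import numpy as np

# Convergence constant of steepest descent on a quadratic f(x) = 1/2 x^T Q x + c^T x,
# computed from the eigenvalues of Q: cc = ((l_max - l_min) / (l_max + l_min))^2.
def convergence_constant(Q):
    eig = np.linalg.eigvalsh(Q)              # Q is symmetric, so use eigvalsh
    l_min, l_max = eig[0], eig[-1]
    return ((l_max - l_min) / (l_max + l_min)) ** 2

Q_well = np.diag([1.0, 2.0])                 # eigenvalue ratio 2   -> fast convergence
Q_ill = np.diag([1.0, 100.0])                # eigenvalue ratio 100 -> very slow convergence
print(convergence_constant(Q_well))          # ~0.11
print(convergence_constant(Q_ill))           # ~0.96
```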

Stopping criteria - optimality conditions

  1. the first derivative (gradient) equals 0
  2. the second derivative (Hessian) is positive definite – this means the function curves upward in every direction at the stationary point, so it is a strict local minimum (a numerical check is sketched below)
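A numerical sketch of these two optimality conditions for the least-squares cost; the data and the tolerance are assumptions for illustration.

```python
import numpy as np

# Check the two optimality conditions at a candidate point:
# 1) the gradient is (approximately) zero, 2) the Hessian is positive definite.
def is_local_minimum(grad, hessian, tol=1e-8):
    gradient_zero = np.linalg.norm(grad) < tol
    hessian_pd = np.all(np.linalg.eigvalsh(hessian) > 0)   # all eigenvalues strictly positive
    return gradient_zero and hessian_pd

# For least squares L(beta) = (1/2m)||X beta - y||^2 the Hessian is X^T X / m.
rng = np.random.default_rng(3)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = X @ np.array([1.0, -2.0, 0.5])
beta_star = np.linalg.solve(X.T @ X, X.T @ y)
grad = X.T @ (X @ beta_star - y) / len(y)
print(is_local_minimum(grad, X.T @ X / len(y)))            # True
```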

step-size selection: $\alpha$

  • requirement: convergence to a local minimum is guaranteed only when $\alpha < \alpha_{\min}$, where $\alpha_{\min}$ is a fixed constant that depends on the problem
  • $\alpha$ can be different at each iteration (decreasing) or fixed
    • we may hope $\alpha$ decreases as we approach the minimum; however, it cannot simply be decreased at every iteration, otherwise the method may not converge to the minimum, so the decision to decrease or keep it should be based on the values observed at each iteration – automatic selection
      methods to find the locally optimal step size:
      1. exact line search: $\alpha^k$ is picked to minimize a function of $\alpha$ given $x^k$ and $p^k$: $\alpha^k = \arg\min_{\alpha} h(\alpha)$, where $h(\alpha) = f(x^k + \alpha p^k)$ => usable, but not cost effective
  2. inexact line search:
  • bisection algorithm for solving the equation $h'(\alpha) = 0$:
    • step 1: $k=0$, $\alpha_l = 0$, $\alpha_u = \hat{\alpha}$; step 2: $\alpha = (\alpha_l + \alpha_u)/2$; step 3: if $h'(\alpha) > 0$, set $\alpha_u = \alpha$, else set $\alpha_l = \alpha$; increment $k$ and repeat from step 2; stop when $h'(\alpha)$ is (close enough to) 0
  • line-search methods used to set the step size automatically: backtracking
    • [MIT nonlinear programming][3]
    • [line-search notes][4]
    • backtracking (Armijo) line search (a code sketch follows this list):
      • in a word: shrink the step size until the value of $f$ is sufficiently smaller than before
      • 1) $\alpha^{(0)} = \alpha_{init}$ is given, $l = 0$; 2) if $f(x^k + \alpha^{(l)} p^k)$ “$<$” $f^k$ is not yet satisfied, i) set $\alpha^{(l+1)} = t\alpha^{(l)}$ with $t \in (0,1)$ fixed, ii) increment $l$ by 1, and repeat step 2; 3) otherwise set $\alpha^k = \alpha^{(l)}$ – this prevents the step size from getting too small before approaching the solution, but it does not by itself prevent the step from being too long – improved by tightening the requirement “$<$” (a sufficient-decrease condition); see the notes for the modification and termination criteria
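A runnable sketch of the backtracking (Armijo) line search described above; the sufficient-decrease constant `c`, shrink factor `t`, and test quadratic are assumed defaults for illustration.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha_init=1.0, t=0.5, c=1e-4):
    """Backtracking (Armijo) line search.

    Shrink alpha by the factor t until the sufficient-decrease condition
    f(x + alpha p) <= f(x) + c * alpha * grad_f(x)^T p holds; the constant c
    tightens the plain "f is smaller than before" test so steps are not too long.
    """
    alpha = alpha_init
    fx, slope = f(x), grad_f(x) @ p
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= t
    return alpha

# Usage on a simple quadratic (illustrative, not from the notes):
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x
x = np.array([1.0, 1.0])
p = -grad_f(x)                                   # steepest-descent direction
alpha = backtracking_line_search(f, grad_f, x, p)
print(alpha, f(x + alpha * p) < f(x))            # the chosen step decreases f
```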

comment on gradient descent:

As long as the cost function is convex and its first derivative is continuous, gradient descent is guaranteed to reach the global minimum (i.e. it is convergent). In practice, however, gradient descent has trouble with valley-shaped objectives, where the ratio of the maximal to minimal eigenvalue of $X^TX$ is large and the number of iterations needed to reduce the optimality gap can be very large. http://ocw.mit.edu/courses/sloan-school-of-management/15-084j-nonlinear-programming-spring-2004/lecture-notes/lec5_steep_desce.pdf shows that as the ratio varies from 1 to 100, the number of iterations varies from 10 to 2310 (and the convergence constant from 0.0023 to 0.99; for what this means, refer above).
Conclusion: the steepest-descent convergence rate is very sensitive to the eigenvalue ratio. A large ratio is a sign that the features are not normalized well, and the speed of convergence of gradient descent depends on it, so perform feature scaling (a sketch follows).
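A small sketch showing how standardization shrinks the eigenvalue ratio of $X^TX$; the feature scales below are assumed for illustration.

```python
import numpy as np

# Standardization x -> (x - mean) / std shrinks the eigenvalue ratio of X^T X,
# which is exactly the quantity steepest descent is sensitive to.
rng = np.random.default_rng(4)
x1 = rng.normal(0.0, 1.0, size=500)           # feature on a unit scale
x2 = rng.normal(50.0, 20.0, size=500)         # feature on a much larger scale
X_raw = np.c_[x1, x2]
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def eig_ratio(X):
    eig = np.linalg.eigvalsh(X.T @ X)
    return eig[-1] / eig[0]

print(eig_ratio(X_raw))    # large ratio -> slow, zig-zagging descent
print(eig_ratio(X_std))    # close to 1  -> fast convergence
```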


  • Lipschitz function:
    • a function whose rate of change is bounded, i.e. $|f(x) - f(y)| \le L \parallel x - y \parallel$ for some constant $L$; for a differentiable function this corresponds to a bounded first derivative

condition number

  • the condition number measures how easy it is to invert a matrix
    • condition number of $\mathbf{A}$: $\kappa(\mathbf{A}) = \parallel \mathbf{A^{-1}} \parallel \times \parallel \mathbf{A} \parallel = \frac{\sigma_{\max} (\mathbf{A})}{\sigma_{\min} (\mathbf{A})}$ [see][1]
    • matrix norms: a way of sizing up a matrix; when $\mathbf{A}$ is ill-conditioned, a small change to the matrix or the data can change the solution by a large amount
    • large condition number of $\mathbf{A}$
      = ill-conditioned/ill-posed problem
      = inaccurate result
      = hard to invert
    • view the condition number as an error multiplier for the solution of the linear system $\mathbf{A}x = b$
  • one way to make a system of equations better conditioned is to rescale it so that the largest elements in all rows are about the same size – (normalization)
  • for an invertible matrix:
    $\Delta b$ is the error in the collected data and $\mathbf{x}^*$ is the solution of the unperturbed system $\mathbf{A}x = b$; solving the perturbed system $\mathbf{A}x = b + \Delta b$ gives
    $\mathbf{x} = \mathbf{A^{-1}} (\mathbf{b} + \Delta \mathbf{b}) = \mathbf{A^{-1}} \mathbf{b} + \mathbf{A^{-1}} \Delta \mathbf{b} = \mathbf{x}^* + \mathbf{A^{-1}} \Delta \mathbf{b}$
    intuitively, if $\mathbf{A}$ is 'small' (nearly singular), then $\mathbf{A^{-1}}$ is large, so the error term $\mathbf{A^{-1}} \Delta \mathbf{b}$ gets amplified
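A sketch of the condition number as an error multiplier, comparing a well-conditioned and a nearly singular matrix; both matrices and the perturbation are assumed for illustration.

```python
import numpy as np

# kappa(A) = sigma_max / sigma_min, and its role as an error multiplier
# when solving A x = b with a perturbed right-hand side b + db.
A_good = np.array([[2.0, 0.0], [0.0, 1.0]])
A_bad = np.array([[1.0, 1.0], [1.0, 1.0001]])        # nearly singular -> ill-conditioned

for A in (A_good, A_bad):
    b = np.array([1.0, 1.0])
    db = 1e-4 * np.array([1.0, -1.0])                # small error in the collected data
    x_star = np.linalg.solve(A, b)
    x_pert = np.linalg.solve(A, b + db)
    rel_err = np.linalg.norm(x_pert - x_star) / np.linalg.norm(x_star)
    print(np.linalg.cond(A), rel_err)                # large kappa -> large relative error
```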