what Bias-Variance Decomposition (BVD) brings to us

During a project I finally gained more insight into what BVD means.
In the following I will briefly introduce what BVD is; the comments in parentheses are my point of view.

summary of BVD for least squares:

  • reduce the variance of a least squares fit
    • try to get more training examples
    • apply ridge regression
      • penalizes the parameters: higher bias but lower variance
    • try a smaller set of features
      • adds bias
  • reduce the model bias of LS
    • increase the number of features, perhaps using transformations
      • increases complexity: more fitting, the mean of the hypothesis gets closer to the truth
  • reduce the estimation bias of LS
    • try to get more training examples - ?
  • more fitting, less bias
  • adding more basis functions to a linear regression model
    • decreases model bias
  • as we increase the number of splits K, the variance of the estimate of the test error will be smaller

First of all, there are several questions we need to pay attention to:

what is the measure of model complexity?

  • we may simply think that in linear regression with a polynomial basis function expansion, the model complexity increases as the degree increases. However, if regularization is added, things change! The degree is not the only measure: when the regularization term is large, the weights tend to zero (except beta0), so the model is in fact simple even if the degree is high.
    • in a word, when measuring the complexity of a model it is better to keep one variable (i.e., the degree or the regularization) fixed. More specifically, for linear regression the complexity simply depends on the degree; for ridge regression, we usually fix the degree and vary lambda. A minimal sketch of this point is given below.
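
As a sketch of this point (the toy data, degree, and lambda values are my own choices, not from the original notes), the following Python snippet fits a degree-9 polynomial with and without an L2 penalty; with a large lambda the non-intercept weights shrink toward zero, so the model is effectively simple despite the high degree:

```python
import numpy as np

# Toy data: the "true model" and noise level are assumptions for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.shape)

degree = 9
X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ..., x^9

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; the intercept (first column) is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for lam in [0.0, 1e-3, 1e2]:
    w = ridge_fit(X, y, lam)
    # With a large lambda, the norm of the non-intercept weights shrinks toward 0,
    # so the degree-9 model behaves like a much simpler one.
    print(f"lambda = {lam}: ||w[1:]|| = {np.linalg.norm(w[1:]):.4f}")
```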

what is the definition of bias?

  • intuitively, if we manually add our assumptions (which may come from experience or from truths we know; they can be correct or wrong, and can benefit or harm the model) to the model, the model will tend to depend less on the data we collect, and then we say the bias is high. The extreme case is when we choose a large regularization term and all weights (except beta0) go to zero: the model does not 'listen' to the data (the truth) and it has high bias.
  • specifically, the definition of bias is
    $\text{bias}^2 = (\mu(h_{\text{hypothesis}}(x^*)) - f_{\text{true model}}(x^*))^2$
    • it describes the average error of the hypothesis

what is the definition of variance?

  • def: $\text{variance} = E[(h_{\text{hypothesis}}(x^*) - \mu(h_{\text{hypothesis}}(x^*)))^2]$
    • it is the stability/variability of our model/hypothesis: if our model does not 'listen' to the data, then no matter which training set it uses it does not vary much and the variance is low. On the other hand, if the model is overfitting (high complexity), then the variance is high.
    • in other words, it describes how much $h(x^*)$ differs across training sets. A small simulation of both definitions is sketched below.
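
To make the two definitions above concrete, here is a small Monte Carlo sketch (the true model, noise level, and polynomial degree are assumptions of mine): it repeatedly draws training sets from a known $f$, fits a hypothesis on each, and estimates $\text{bias}^2$ and variance of $h(x^*)$ at a single test point directly from the definitions:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # assumed "true model"
sigma = 0.3                           # noise level
x_star = 0.25                         # the test point x*
degree, n_train, n_sets = 5, 20, 500

preds = []
for _ in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)   # y = f(x) + e
    w = np.polyfit(x, y, degree)               # one hypothesis h_i
    preds.append(np.polyval(w, x_star))        # h_i(x*)

preds = np.array(preds)
mu_h = preds.mean()                            # mu(h(x*))
bias2 = (mu_h - f(x_star)) ** 2                # (mu(h(x*)) - f(x*))^2
variance = ((preds - mu_h) ** 2).mean()        # E[(h(x*) - mu(h(x*)))^2]
print(f"bias^2 = {bias2:.5f}, variance = {variance:.5f}")
```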

under what assumptions do we perform BVD?

  • we assume that there exists a true model $f$, even though we can never generate or recover it.
  • we also assume that there is some noise in the data we collect,
  • i.e., $y^* = f(x^*) + e$ (e: zero-mean noise)
    • notice that under this assumption,
    • $E(y^*) = f(x^*) = \mu(y^*)$

expected prediction error

  • prediction error:
    • $(\text{true value} - \text{predicted value})^2$
    • $[y^* - h(x^*)]^2$

why is BVD important for model selection?

  • the measure of whether a model is good is its test error. Since we do not know the true values of the data we need to predict, we cannot compute the test error directly, and we need to estimate it by testing on the sample data.
    • the expected prediction error can be decomposed into three terms: $\text{bias}^2 + \text{variance} + \text{noise}^2$; a sketch of the derivation is given after this list
    • therefore the trade-off between bias and variance is important
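
For reference, here is a sketch of that decomposition under the assumptions above ($y^* = f(x^*) + e$ with $E[e] = 0$, $\operatorname{Var}(e) = \sigma^2$, and $e$ independent of $h$):

$$
\begin{aligned}
E\bigl[(y^* - h(x^*))^2\bigr]
&= E\bigl[(f(x^*) + e - h(x^*))^2\bigr] \\
&= \underbrace{\bigl(f(x^*) - \mu(h(x^*))\bigr)^2}_{\text{bias}^2}
 + \underbrace{E\bigl[(h(x^*) - \mu(h(x^*)))^2\bigr]}_{\text{variance}}
 + \underbrace{\sigma^2}_{\text{noise}^2}
\end{aligned}
$$

The cross terms vanish because $E[e] = 0$ (and $e$ is independent of $h$) and because $E[h(x^*) - \mu(h(x^*))] = 0$.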

how does the trade-off happen?

  • when the complexity of the model increases, in general the training error will decrease, since the model tends to fit the training set very closely (overfitting), while the test error will increase; a small simulation after this list illustrates the effect.
    • Reason: when the input lies away from the exact curve (i.e., there is some noise in the input), the output becomes very different from the true value (i.e., the noise has a huge impact on our prediction; also, if there is noise in the training set, our model will vary a lot and can be very different from the true model).
      In this case the bias is low (refer to the definition of bias above); more specifically, the training error is 'always' low.
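
A minimal simulation of this effect (the toy data and the list of degrees are my own assumptions): as the degree grows, the training error keeps dropping while the test error eventually rises:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
x_tr, x_te = rng.uniform(0, 1, 20), rng.uniform(0, 1, 200)
y_tr = f(x_tr) + rng.normal(0, 0.3, x_tr.shape)
y_te = f(x_te) + rng.normal(0, 0.3, x_te.shape)

for degree in [1, 3, 5, 9, 12]:
    w = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)  # training error
    err_te = np.mean((np.polyval(w, x_te) - y_te) ** 2)  # test error
    print(f"degree {degree:2d}: train MSE = {err_tr:.4f}, test MSE = {err_te:.4f}")
```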

So how do we measure bias and variance in practice? [not finished]

to get $E(h(x^*))$ we need more than one training set, but we only have one. So we need to simulate multiple training sets by bootstrap replicates, i.e., randomly split the sample data into training and test sets many times with different seeds (in MATLAB).

  • algorithm:
  1. given the training sample S, set s seeds to split it
  2. generate multiple hypotheses h1, h2, h3, ..., hs
  3. (optional) determine corresponding weights w1, w2, ..., ws; here we treat each h equally, i.e., no weights for different hypotheses
  4. classify new points by $y_{\text{pred}} = h_i(x_{\text{new}})$
    $E(h(x^*)) = \mu(h(x^*)) = \frac{1}{s}\sum_{i=1}^{s} h_i(x^*)$

since
$\text{bias}^2 = (\mu(h(x^*)) - f_{\text{true model}}(x^*))^2$
$\text{bias} = \mu(h(x^*)) - f_{\text{true model}}(x^*)$

$$\text{variance} = E[(h(x^*) - \mu(h(x^*)))^2] = \operatorname{Var}(h(x^*)) \approx \frac{1}{s}\sum_{i=1}^{s} \bigl(h_i(x^*) - \mu(h(x^*))\bigr)^2$$
(over a whole test set $x_1^*, x_2^*, \dots$ we sum or average the pointwise variances $\operatorname{Var}(h(x_1^*)), \operatorname{Var}(h(x_2^*)), \dots$)
when shown in a figure, this is the variance of the results (test errors) obtained with different seeds.

for s = 1,2,3,...
    S = (Tr(s), Te(s));
    train on Tr(s)
    compute errorTr(s)
    // for training the parameters
    test on Te(s)
    compute errorTe(s)
    // estimate of the expected prediction error E[(h(x^*) - y^*)^2], using all the sample data we have
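
A runnable sketch of the loop above in Python (the notes mention MATLAB; the split ratio, model, and toy data here are my own hypothetical choices):

```python
import numpy as np

rng_data = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x = rng_data.uniform(0, 1, 60)
y = f(x) + rng_data.normal(0, 0.3, x.shape)     # the one sample S we actually have

degree, n_seeds, n_tr = 5, 50, 40
err_tr, err_te = [], []
for seed in range(n_seeds):                     # for s = 1, 2, 3, ...
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    tr, te = idx[:n_tr], idx[n_tr:]             # S = (Tr(s), Te(s))
    w = np.polyfit(x[tr], y[tr], degree)        # train on Tr(s)
    err_tr.append(np.mean((np.polyval(w, x[tr]) - y[tr]) ** 2))  # errorTr(s)
    err_te.append(np.mean((np.polyval(w, x[te]) - y[te]) ** 2))  # errorTe(s)

print(f"mean train error = {np.mean(err_tr):.4f}")
print(f"mean test error  = {np.mean(err_te):.4f}  (estimate of E[(h(x*) - y*)^2])")
print(f"variance of the test error across seeds = {np.var(err_te):.5f}")
```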

how to reduce variance:

bagging and other resampling techniques; one typical example is the random forest.
it means we repeat the procedure many times, choosing a different training set each time, and take the average prediction as the result.


insight

  1. bagging of flexible (overfitting) models may reduce the variance while still benefiting from their low bias; see the sketch below
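
As a sketch of this insight (the tree model, ensemble size, and toy data are my own assumptions, not the author's setup): averaging fully grown decision trees over bootstrap resamples gives a noticeably lower variance at a test point than a single tree, while the individual trees remain flexible and low-bias:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
x_star = np.array([[0.5]])                 # test point
n, n_repeats, n_bags = 50, 100, 25

single_preds, bagged_preds = [], []
for _ in range(n_repeats):                 # repeatedly draw a fresh training set
    x = rng.uniform(0, 1, (n, 1))
    y = f(x).ravel() + rng.normal(0, 0.3, n)

    # one fully grown (flexible, low-bias, high-variance) tree
    single = DecisionTreeRegressor().fit(x, y)
    single_preds.append(single.predict(x_star)[0])

    # bagging: average the same kind of tree over bootstrap resamples
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, n)
        tree = DecisionTreeRegressor().fit(x[idx], y[idx])
        preds.append(tree.predict(x_star)[0])
    bagged_preds.append(np.mean(preds))

print(f"variance of a single tree at x*: {np.var(single_preds):.5f}")
print(f"variance of bagged trees at x*:  {np.var(bagged_preds):.5f}")
```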