layers in caffe

To get familiar with the Caffe framework, especially the layer structure, and learn how to implement a new layer.

from neural network to convolutional neural network:
NN: takes vectors as input
CNN: neurons are arranged in three dimensions (and therefore form volumes): height, width, depth. A CNN is made up of layers; every layer takes a 3D volume as input and produces a 3D volume as output. A visual example:

  • CIFAR-10: the input images form an input volume of activations of size 32x32x3 (width, height, depth); the output layer is 1x1x10 (a single vector of class scores along the depth dimension)

    • input image: size 32x32, with the RGB channels as the third dimension
    • CONV: input [32,32,3]; each neuron computes a dot product between its weights and the region of the input it is connected to. Output: [32,32,12]

conv layer hyperparameters (K, F, S, P)

these hyperparameters control the size of the output volume:

  • depth $K$ (the number of filters, all operating on the same spatial area),
  • stride $S$ (how densely depth columns are allocated around the spatial dimensions, i.e. the step with which the filter slides),
  • zero-padding $P$ (used to preserve the spatial size of the input volume),
  • receptive field size $F$ (= the size of the kernel, the spatial extent of the connectivity)

    number of neurons along each spatial dimension = $(W-F+2P)/S+1$;

  • where $W$ is the input volume size; in general, choosing $P = (F-1)/2$ when $S = 1$ preserves the spatial size (worked example below)
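
For example, with assumed values $W=32$, $F=5$, $S=1$, $P=2$ (one plausible setting for the CONV layer above; the text does not fix $F$):

$(32-5+2\cdot2)/1+1 = 32$

so the spatial size is preserved, matching the [32,32,12] CONV output quoted earlier.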

  • RELU: elementwise activation function, such as max(0,x) thresholding at zero. Size unchanged: [32,32,12]

  • POOL: downsampling operation along the spatial dimensions (width, height), reducing the size to [16,16,12]
  • FC: computes the class scores; output [1,1,10] (there are 10 classes)
  • RELU and POOL implement fixed functions, while FC and CONV carry learnable parameters; FC acts much like a logistic classifier over the flattened volume. A shape sketch follows below.
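
A minimal sketch tracking only the volume shapes through the CIFAR-10 pipeline above; the CONV hyperparameters (F=5, P=2) and POOL hyperparameters (F=2, S=2) are assumptions for illustration, not values fixed by the text:

```python
# Shape walkthrough for the CIFAR-10 pipeline above.
# Assumed hyperparameters: CONV K=12, F=5, S=1, P=2; POOL F=2, S=2.
W1, D_in, n_classes = 32, 3, 10

K, F, S, P = 12, 5, 1, 2
W_conv = (W1 - F + 2 * P) // S + 1        # 32: padding preserves size
conv_shape = (W_conv, W_conv, K)          # [32, 32, 12]

relu_shape = conv_shape                   # elementwise, shape unchanged

Fp, Sp = 2, 2
W_pool = (W_conv - Fp) // Sp + 1          # 16: spatial downsampling
pool_shape = (W_pool, W_pool, K)          # [16, 16, 12]

fc_shape = (1, 1, n_classes)              # [1, 1, 10] class scores

print(conv_shape, relu_shape, pool_shape, fc_shape)
# (32, 32, 12) (32, 32, 12) (16, 16, 12) (1, 1, 10)
```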

  • below is the [information](http://cs231n.github.io/convolutional-networks/) kept for reference:

Conv Layer:

Accepts a volume of size $W_1 \times H_1 \times D_1$
Requires four hyperparameters:
the number of filters $K$,
their spatial extent $F$ (the size of the kernel),
the stride $S$,
the amount of zero padding $P$.
Produces a volume of size $W_2 \times H_2 \times D_2$ where:
$W_2=(W_1-F+2P)/S+1$
$H_2=(H_1-F+2P)/S+1$ (i.e. width and height are computed equally by symmetry)
$D_2=K$
With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.
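
These formulas transcribe directly into a small helper; `conv_output` is a hypothetical name, and the sample call reuses the CIFAR-10 CONV layer with assumed $F=5$, $S=1$, $P=2$:

```python
def conv_output(W1, H1, D1, K, F, S, P):
    """Output volume size and parameter count of a conv layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    n_weights = (F * F * D1) * K          # F*F*D1 weights per filter
    n_biases = K
    return (W2, H2, D2), n_weights, n_biases

# [32,32,3] input, K=12 filters of assumed size F=5 with S=1, P=2:
print(conv_output(32, 32, 3, K=12, F=5, S=1, P=2))
# ((32, 32, 12), 900, 12)  -> 5*5*3 = 75 weights per filter, 75*12 = 900
```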

matrix implementation of convolution

```
input:   I of size (W1, H1, D1)
filters: (K, F, S, P)
-> output: (W2, H2, D2), with D2 = K

// s1: im2col -- unroll the input into columns
for i = 1 : W2*H2
    take the i-th receptive field of I (F*F*D1 pixels)
    and store it as column i of X_col
end
// X_col has size [F*F*D1, W2*H2]

// s2: each filter becomes a row-times-matrix product
for j = 1 : D2
    stretch filter_j into a row of size [1, F*F*D1]
    multiply it with X_col [F*F*D1, W2*H2] -> row j of the output
end
aggregate all rows;

// s2 is equivalent to one matrix multiply:
// stack the stretched filters into a weight matrix [D2, F*F*D1]
// output = weights * X_col, of size [D2, W2*H2]
// then reshape [D2, W2*H2] back into the (W2, H2, D2) volume
```
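
The same two steps in runnable form: a minimal, unoptimized numpy sketch (plain loops, whereas a real implementation such as Caffe's im2col is heavily optimized); the function names are illustrative:

```python
import numpy as np

def im2col(I, F, S, P):
    """s1: unroll each F*F*D1 receptive field of I into a column of X_col."""
    W1, H1, D1 = I.shape
    Ip = np.pad(I, ((P, P), (P, P), (0, 0)))          # zero-padding
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    X_col = np.empty((F * F * D1, W2 * H2))
    for i in range(W2):
        for j in range(H2):
            patch = Ip[i * S:i * S + F, j * S:j * S + F, :]
            X_col[:, i * H2 + j] = patch.ravel()
    return X_col, (W2, H2)

def conv_as_matmul(I, filters, S=1, P=0):
    """s2: convolution as one matrix multiply; weights are [D2, F*F*D1]."""
    K, F = filters.shape[0], filters.shape[1]
    X_col, (W2, H2) = im2col(I, F, S, P)
    W_row = filters.reshape(K, -1)                    # [D2, F*F*D1]
    out = W_row @ X_col                               # [D2, W2*H2]
    return out.reshape(K, W2, H2)                     # back to a volume

I = np.random.randn(32, 32, 3)                        # input (W1, H1, D1)
filters = np.random.randn(12, 5, 5, 3)                # K=12 filters of F=5
print(conv_as_matmul(I, filters, S=1, P=2).shape)     # (12, 32, 32)
```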

  • comment: the convolution layer filters a region with the same $W$ and $H$ but spanning the full depth $D$ at the same time; there are $K$ filters for the same region, so the output has depth $K$. The depth of the output is decided by the number of filters, and each depth slice of the output can be viewed as a different filtered version of the input.

pooling layer

inserted between two conv layers:

  • reduces the spatial size of the representation in order to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting

The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation

input [W1, H1, D1]

hyperparameters: spatial extent F, stride S (depth is unchanged)

output [W2, H2, D2]

W2 = (W1 - F)/S + 1
H2 = (H1 - F)/S + 1
D2 = D1
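
A minimal numpy sketch of this operation with the non-overlapping setting F=2, S=2 (see the common settings below); `max_pool` is an illustrative name:

```python
import numpy as np

def max_pool(x, F=2, S=2):
    """Max-pool each depth slice of x [W1, H1, D1] -> [W2, H2, D1]."""
    W1, H1, D1 = x.shape
    W2, H2 = (W1 - F) // S + 1, (H1 - F) // S + 1
    out = np.empty((W2, H2, D1))
    for i in range(W2):
        for j in range(H2):
            window = x[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))   # MAX over each F*F window
    return out

x = np.random.randn(32, 32, 12)
print(max_pool(x).shape)   # (16, 16, 12): spatial size halved, depth kept
```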

layer implementation in the C++ API

COMMON SETTINGS

CONV: F = 3 or 5; S = 1; P depends on F (typically P = (F-1)/2), resulting in W2 = W1, H2 = H1
POOLING: F = 2, S = 2; or F = 3, S = 2 (overlapping pooling)
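
Since these notes are about Caffe, the same common settings expressed through the pycaffe NetSpec interface might look like this (a sketch: the layer names, num_output=12, and the Input data shape are illustrative choices, not prescribed above):

```python
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 3, 32, 32]))   # one 32x32x3 image
# CONV common setting: F = 3, S = 1, P = 1 so that W2 = W1, H2 = H1
n.conv1 = L.Convolution(n.data, num_output=12,
                        kernel_size=3, stride=1, pad=1)
n.relu1 = L.ReLU(n.conv1, in_place=True)
# POOLING common setting: F = 2, S = 2 (non-overlapping)
n.pool1 = L.Pooling(n.relu1, pool=P.Pooling.MAX, kernel_size=2, stride=2)
print(n.to_proto())   # prints the generated prototxt layer definitions
```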