To get familiar with the Caffe framework, especially the layer structure, and learn how to implement a new layer.
From neural networks to convolutional neural networks:
NN: takes vectors as input
CNN: neurons are arranged in three dimensions (and therefore form volumes): height, width, depth. The network is made up of layers; every layer takes a 3D volume as input and produces a 3D volume as output.
CIFAR-10: the input images form an input volume of activations of size 32x32x3 (w, h, d); the output layer is 1x1x10 (a single vector of class scores along the depth dimension)
- input image: size 32x32, with the RGB channels as the third dimension
- CONV: input [32,32,3]; each neuron computes a dot product between its weights and the local region of the input it is connected to; output [32,32,12] (using 12 filters)
conv layer hyperparameters: (K, F, S, P)
These hyperparameters control the size of the output volume:
- depth K (the number of filters, all operating on the same spatial area of the input),
- stride S (how the filter is slid; it determines how densely depth columns are allocated across the spatial dimensions),
- zero-padding P (used to preserve the spatial size of the input volume),
- receptive field size F (= size of the kernel): the spatial extent of the connectivity
number of neurons along each spatial dimension = $(W-F+2P)/S+1$,
where $W$ is the input volume size; in general choose $P = (F-1)/2$ when $S = 1$ so the output keeps the same spatial size as the input (see the shape check after this layer list).
- RELU: elementwise activation function, such as max(0, x) thresholding at zero; size unchanged: [32,32,12]
- POOL: downsampling operation along the spatial dimensions (width, height), reducing the size to [16x16x12]
- FC: computes the class scores, output [1x1x10] (there are 10 classes)
RELU and POOL implement fixed functions, while FC and CONV have learnable parameters; FC acts much like a logistic regression classifier on the final features.
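As a quick sanity check of the output-size formula and the CIFAR-10 shapes above, here is a minimal Python sketch; the conv choices F=5 and P=2 are my own illustrative values, not fixed by these notes:

```python
def out_size(W, F, S, P):
    """Number of neurons along one spatial dimension: (W - F + 2P)/S + 1."""
    return (W - F + 2 * P) // S + 1

# CONV: 32x32x3 input, 12 filters, F=5, S=1, P=(F-1)/2=2 -> spatial size preserved
w = out_size(32, F=5, S=1, P=2)
print((w, w, 12))        # (32, 32, 12)
# RELU: elementwise max(0, x), shape unchanged -> (32, 32, 12)
# POOL: F=2, S=2, P=0 -> spatial size halved
w = out_size(32, F=2, S=2, P=0)
print((w, w, 12))        # (16, 16, 12)
# FC: maps the pooled volume to (1, 1, 10) class scores
```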
Below is the [information](http://cs231n.github.io/convolutional-networks/) kept for reference:
Conv Layer:
Accepts a volume of size $W_1×H_1×D_1$
Requires four hyperparameters:
Number of filters $K$,
their spatial extent $F$ (the size of the kernel),
the stride $S$,
the amount of zero padding $P$.
Produces a volume of size $W_2×H_2×D_2$ where:
$W_2=(W_1−F+2P)/S+1$
$H_2=(H_1−F+2P)/S+1$ (i.e. width and height are computed equally by symmetry)
$D_2=K$
With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
In the output volume, the $d$-th depth slice (of size $W_2×H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.
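A worked example with illustrative numbers of my own choosing: for a $32×32×3$ input with $K=12$, $F=5$, $S=1$, $P=2$, the output size is $W_2 = H_2 = (32−5+2 \cdot 2)/1+1 = 32$ and $D_2 = 12$; each filter carries $5 \cdot 5 \cdot 3 = 75$ weights, so the layer has $75 \cdot 12 = 900$ weights plus $12$ biases in total.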
matrix implementation of convolution:
input: I (W1, H1, D1)
filters: (K, F, S, P)
-> output: (W2, H2, D2), with D2 = K
output laid out as a matrix of shape [D2, W2*H2]
// step 1: im2col
for i = 1 : W2*H2
    take the i-th patch of F*F*D1 pixels of I and stretch it into column i of X_col
end
// X_col has shape [F*F*D1, W2*H2]
// step 2: convolution as dot products
for j = 1 : D2
    stretch filter_j into a row of shape [1, F*F*D1]
    multiply this row with X_col [F*F*D1, W2*H2] to get row j of the output
end
aggregate all rows;
// step 2 is equivalent to a single matrix multiplication:
// the weight matrix [D2, F*F*D1] (one stretched filter per row) times X_col [F*F*D1, W2*H2]
// gives the output [D2, W2*H2], which is reshaped back to (W2, H2, D2)
- comment: the convolution layer filters a region with the same $W$ and $H$ but covering the full depth $D$ at the same time; there are $K$ filters applied to the same region, which therefore generate an output of depth $K$. The depth of the output is decided by the number of filters, and each depth slice of the output can be viewed as the response of a different filter to the input.
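A minimal numpy sketch of the im2col idea above; the function name conv_as_matmul, the naive loops, and the toy shapes are my own illustration, not Caffe's actual implementation:

```python
import numpy as np

def conv_as_matmul(I, W, S=1, P=0):
    """im2col convolution sketch.
    I: input volume (H1, W1, D1); W: K filters of shape (F, F, D1)."""
    H1, W1, D1 = I.shape
    K, F, _, _ = W.shape
    Ip = np.pad(I, ((P, P), (P, P), (0, 0)))   # zero-pad spatial dims only
    H2 = (H1 - F + 2 * P) // S + 1
    W2 = (W1 - F + 2 * P) // S + 1
    # step 1 (im2col): stretch each F*F*D1 receptive field into a column of X_col
    X_col = np.zeros((F * F * D1, H2 * W2))
    col = 0
    for y in range(H2):
        for x in range(W2):
            patch = Ip[y * S:y * S + F, x * S:x * S + F, :]
            X_col[:, col] = patch.ravel()
            col += 1
    # step 2: every filter stretched into a row -> one big matrix multiplication
    W_row = W.reshape(K, F * F * D1)           # [D2, F*F*D1]
    out = W_row @ X_col                        # [D2, W2*H2]
    return out.reshape(K, H2, W2)              # reshape back to a volume (depth first)

# toy usage: 32x32x3 input, 12 filters of size 5, stride 1, pad 2
I = np.random.randn(32, 32, 3)
W = np.random.randn(12, 5, 5, 3)
print(conv_as_matmul(I, W, S=1, P=2).shape)    # (12, 32, 32)
```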
Pooling Layer:
inserted between successive conv layers:
- reduces the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence also to control overfitting.
The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
input [W1, H1, D1]
requires two hyperparameters: spatial extent F and stride S (the depth is unchanged, D2 == D1)
output [W2, H2, D2]
W2 = (W1 - F)/S + 1
H2 = (H1 - F)/S + 1
D2 = D1
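A minimal numpy sketch of max pooling under these definitions; the function name and toy shapes are my own illustration:

```python
import numpy as np

def max_pool(X, F=2, S=2):
    """Max pooling applied to every depth slice independently. X: (H1, W1, D1)."""
    H1, W1, D1 = X.shape
    H2 = (H1 - F) // S + 1
    W2 = (W1 - F) // S + 1
    out = np.empty((H2, W2, D1))
    for y in range(H2):
        for x in range(W2):
            window = X[y * S:y * S + F, x * S:x * S + F, :]
            out[y, x, :] = window.max(axis=(0, 1))   # MAX over the F x F window
    return out

X = np.random.randn(32, 32, 12)
print(max_pool(X, F=2, S=2).shape)                   # (16, 16, 12)
```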
COMMON SETTINGS
CONV: F = 3 or 5; S = 1; P chosen so that W2 = W1 and H2 = H1 (i.e. P = (F-1)/2)
POOLING: F = 2, S = 2; or F = 3, S = 2 (overlapping pooling)
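A quick check of these common settings using the output-size formula; this is my own verification, assuming floor division when sizes do not divide evenly:

```python
def out_size(W, F, S, P):
    # same formula as above: (W - F + 2P)/S + 1, with floor division
    return (W - F + 2 * P) // S + 1

assert out_size(32, F=3, S=1, P=1) == 32   # CONV F=3, P=(F-1)/2=1 keeps W2 = W1
assert out_size(32, F=5, S=1, P=2) == 32   # CONV F=5, P=2 keeps W2 = W1
assert out_size(32, F=2, S=2, P=0) == 16   # POOL F=2, S=2 halves the spatial size
assert out_size(32, F=3, S=2, P=0) == 15   # overlapping pooling F=3, S=2 (rounds down)
```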