Computer Vision MT25, Convolutional neural networks
Flashcards
What was the “neocognitron”?
An early (1980, Fukushima) example of an image-processing neural network, trained layer-wise by an unsupervised "self-organisation" procedure.
What idea motivated CNNs?
They should be invariant to shifts, scale and small distortions.
What three properties of a CNN help make them learn to be invariant to shifts, scale and small distortions?
- Local weighted connections
- Shared weights across spatial locations
- Spatial subsampling
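As a rough illustration of the first two properties (and of subsampling), here is a minimal PyTorch sketch comparing the parameter count of a small convolutional layer against a dense layer producing the same output; all sizes are my own illustrative choices:

```python
import torch.nn as nn

# Illustrative sizes: a 32x32x3 input, 16 filters with a 5x5 local receptive field.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)
# A dense layer producing the same 16x28x28 output from the same input,
# with no locality and no weight sharing, for comparison.
dense = nn.Linear(32 * 32 * 3, 16 * 28 * 28)

conv_params = sum(p.numel() for p in conv.parameters())    # 16*(5*5*3) + 16 = 1216
dense_params = sum(p.numel() for p in dense.parameters())  # ~38.5 million
print(conv_params, dense_params)

# The third property, spatial subsampling: a 2x2 max-pool halves height and width.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
```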
How are CNN architectures generally composed?
- Filters arranged with a width, height and depth
- Alternating convolutional and subsampling layers
- Fully connected layers at the end
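A minimal PyTorch sketch of this composition, loosely LeNet-shaped; the input size ($32 \times 32$ grayscale) and channel counts are my own illustrative choices:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # convolutional layer: 6 filters of size 5x5x1
    nn.ReLU(),
    nn.MaxPool2d(2),                   # subsampling layer: 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # 16 filters of size 5x5x6
    nn.ReLU(),
    nn.MaxPool2d(2),                   # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 10),         # fully connected layer at the end
)
```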
What was maybe unnecessary about the filters used in LeNet?
They were unnecessarily large; many later architectures (e.g. VGG) used much smaller filters.
Suppose you have a convolutional filter of dimensions $h \times w \times d$. How much should you zero-pad the input?
Add $(h-1)/2$ cells to the top and bottom, and $(w-1)/2$ cells to the left and right (assuming $h$ and $w$ are odd), so that the output has the same spatial size as the input at stride $1$.
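A small plain-Python sketch of this rule (the helper name `same_padding` is mine):

```python
def same_padding(h: int, w: int) -> tuple[int, int]:
    """Cells to add (top/bottom, left/right) for an h x w filter."""
    assert h % 2 == 1 and w % 2 == 1, "the rule assumes odd filter dimensions"
    return (h - 1) // 2, (w - 1) // 2

print(same_padding(3, 3))  # (1, 1)
print(same_padding(5, 7))  # (2, 3)
```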
Suppose you have a $10 \times 10 \times 3$ input volume, which you process by applying a $3 \times 3 \times 3$ convolutional filter. What is the size of the resulting output volume?
$8 \times 8 \times 1$ (assuming stride $1$ and no zero-padding): $h _ \text{out} = w _ \text{out} = 10 - 3 + 1 = 8$, and a single filter produces a single output channel.
@example~
Suppose:
- You have an input volume of size $h _ \text{in} \times w _ \text{in} \times d$
- You have a convolutional filter of size $h \times w \times d$ (although $d$ being the same here is not necessary)
- The stride of this filter is $s$
- The zero-padding is $p$
@State the dimensions of the output.
The output volume is $h _ \text{out} \times w _ \text{out} \times 1$ (per filter), where
\[\begin{aligned} h _ \text{out} &= \frac{h _ \text{in} - h + 2p}{s} + 1 \\ w _ \text{out} &= \frac{w _ \text{in} - w + 2p}{s} + 1 \end{aligned}\]
What are the hyperparameters of a convolution layer in a CNN?
- The number of filters
- The width, height and depth of the filters
- The stride of each of the filters
- How padding is calculated
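These cards can be sanity-checked numerically; a minimal plain-Python sketch of the output-size formula above (the function name is mine):

```python
def conv_out_hw(h_in, w_in, h, w, s=1, p=0):
    """Output height and width for an h x w filter, stride s, padding p."""
    return (h_in - h + 2 * p) // s + 1, (w_in - w + 2 * p) // s + 1

# The concrete card above: 10x10x3 input, one 3x3x3 filter, stride 1, no padding.
print(conv_out_hw(10, 10, 3, 3))  # (8, 8), i.e. an 8 x 8 x 1 output volume
# The number of filters sets the output depth: 16 such filters give 8 x 8 x 16.
```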
How is bias added in a convolutional layer of a CNN?
One scalar bias is learned per filter (i.e. per output channel): e.g. if applying your filter gives a $3 \times 3$ output, the same scalar is added to each of the $3 \times 3$ entries.
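A quick PyTorch check (layer sizes are illustrative): the bias tensor holds exactly one scalar per output channel, broadcast over every spatial position.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(conv.bias.shape)  # torch.Size([4]): one scalar bias per filter

y = conv(torch.randn(1, 3, 5, 5))
print(y.shape)          # torch.Size([1, 4, 3, 3]); bias[c] was added at
                        # every spatial position of output channel c
```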
Suppose:
- You have an input volume of size $h _ \text{in} \times w _ \text{in} \times d$
- You have a pooling layer of width $w$ and height $h$
- The stride is $s$
When does overlapping of the sliding window occur?
If $s < w$ or $s < h$.
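For instance, a $3 \times 3$ window with stride $2$ (the overlapping-pooling combination popularised by AlexNet) overlaps; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2)  # s = 2 < w = h = 3, so windows overlap
y = pool(torch.randn(1, 1, 10, 10))
print(y.shape)  # torch.Size([1, 1, 4, 4]): (10 - 3) // 2 + 1 = 4
```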
Suppose:
- You have an input volume of size $h _ \text{in} \times w _ \text{in} \times d$
- You have a pooling layer of width $w$ and height $h$
- The stride is $s$
@State the dimensions of the output.
The output volume is $h _ \text{out} \times w _ \text{out} \times d$ (pooling is applied to each channel independently), where
\[\begin{aligned} h _ \text{out} &= \frac{h _ \text{in} - h}{s} + 1 \\ w _ \text{out} &= \frac{w _ \text{in} - w}{s} + 1 \end{aligned}\]
How can you interpret a fully connected layer as a convolutional layer?
Each filter has the same size as the input volume, so each filter computes an arbitrary linear combination of all activations in the previous layer.
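A minimal PyTorch sketch of this equivalence, with my own illustrative sizes: copying a fully connected layer's weights into an input-sized convolution yields identical outputs.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)              # illustrative previous-layer activations

fc = nn.Linear(16 * 5 * 5, 10)
conv = nn.Conv2d(16, 10, kernel_size=5)   # each 5x5x16 filter spans the whole input
conv.weight.data = fc.weight.data.view(10, 16, 5, 5)
conv.bias.data = fc.bias.data

y_fc = fc(x.flatten(1))
y_conv = conv(x).flatten(1)               # output is 1x10x1x1, flattened to 1x10
print(torch.allclose(y_fc, y_conv, atol=1e-6))  # True
```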
What changes did VGG make compared to LeNet?
- Smaller ($3 \times 3$) convolutional filters
- A much deeper network
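As a sketch of the standard argument for smaller filters (not stated in the card itself): two stacked $3 \times 3$ convolutions have the same $5 \times 5$ receptive field as one $5 \times 5$ convolution, but use fewer parameters.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases for a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out + c_out

c = 64                           # illustrative channel count
print(conv_params(5, c, c))      # one 5x5 conv:     102464
print(conv_params(3, c, c) * 2)  # two stacked 3x3s:  73856
```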
