Convolutional Neural Networks

Module 7 - Layers, Pooling, Activations, Feature Maps

Why CNNs for Vision?

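A fully connected (FC) network on a 224×224 RGB image would need 224 × 224 × 3 = 150,528 input neurons, and every neuron in the first hidden layer would need its own 150,528 weights. That's enormous - too many parameters, too much compute, and no use of the image's spatial structure.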
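
To make the scale concrete, here is a minimal sketch of the parameter arithmetic. The hidden-layer width (1,000 units) and the conv layer (64 filters of size 3×3) are illustrative assumptions, not values taken from this module.

```python
# Parameter count: one fully connected layer vs. one convolutional layer.
# Assumed for illustration: 1,000 hidden units, and 64 filters of size 3x3.
height, width, channels = 224, 224, 3

inputs = height * width * channels                  # 150,528 input values
hidden_units = 1000                                 # hypothetical FC layer width
fc_params = inputs * hidden_units + hidden_units    # weights + biases

num_filters, k = 64, 3                              # hypothetical conv layer
conv_params = num_filters * (k * k * channels) + num_filters  # weights + biases

print(f"inputs:      {inputs:,}")
print(f"FC params:   {fc_params:,}")
print(f"conv params: {conv_params:,}")
```

Under these assumptions the FC layer needs roughly 150 million parameters, while the conv layer needs fewer than two thousand, because each filter is reused at every spatial position.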

CNNs solve this with three key ideas:

Convolutional Layer

A conv layer applies N learned filters (kernels) to the input. Each filter spans all C input channels and produces one feature map, so if the input is H×W×C and we apply N filters of size k×k×C, the output is H'×W'×N.
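
The shape bookkeeping can be sketched with a deliberately naive, loop-based convolution. This is an illustrative sketch only: it assumes stride 1, no padding, and no bias, uses nested Python lists instead of a tensor library, and (like most deep learning frameworks) actually computes cross-correlation. The names `conv2d` and `shape` are hypothetical helpers.

```python
def conv2d(image, filters):
    """Naive convolution (cross-correlation), stride 1, no padding, no bias.

    image:   H x W x C nested lists
    filters: N filters, each k x k x C nested lists
    returns: H' x W' x N nested lists, with H' = H - k + 1 and W' = W - k + 1
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    out_h, out_w = H - k + 1, W - k + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for n, f in enumerate(filters):          # one feature map per filter
        for i in range(out_h):
            for j in range(out_w):
                total = 0.0
                for di in range(k):
                    for dj in range(k):
                        for c in range(C):   # each filter spans all channels
                            total += image[i + di][j + dj][c] * f[di][dj][c]
                output[i][j][n] = total
    return output


def shape(t):
    """Return (height, width, depth) of a 3-level nested list."""
    return (len(t), len(t[0]), len(t[0][0]))


# 5x5x3 input of ones, and 2 filters of size 3x3x3 filled with ones.
image = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]

feature_maps = conv2d(image, filters)
print(shape(feature_maps))   # (3, 3, 2)
```

Two filters give two feature maps, so the last dimension of the output equals N regardless of the number of input channels.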

Output size: H' = ⌊(H - k + 2P) / S⌋ + 1 (the division is rounded down when it is not exact)
Symbol guide:
H': output height (or width - same formula applies to both) after the conv layer
H: input height in pixels before this layer
k: kernel size - the height/width of the square filter, e.g. k=3 for a 3×3 kernel
P: padding - rows/columns of zeros added to each side of the border; with S=1 and odd k, P=(k-1)/2 keeps H' = H
2P: total padding added (P rows on each side, so double)
S: stride - how many pixels the kernel jumps each step; S=1 is dense, S=2 roughly halves the output size
+1: accounts for the first kernel position at the top-left corner

In short: a stride greater than 1 downsamples, and with stride 1 and an odd kernel size, padding P = (k-1)/2 preserves the spatial size.
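
The output-size formula is easy to check numerically. A minimal sketch follows; the function name `conv_output_size` is hypothetical, and integer (floor) division is assumed for cases where the stride does not divide evenly, which is the usual convention.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output height (or width): floor((size - kernel + 2*padding) / stride) + 1."""
    return (size - kernel + 2 * padding) // stride + 1


print(conv_output_size(28, 5))                        # 24: no padding shrinks the map
print(conv_output_size(28, 3, padding=1))             # 28: P = (k-1)/2 preserves size
print(conv_output_size(224, 3, padding=1, stride=2))  # 112: stride 2 halves the size
```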

Receptive field: After several conv layers, each output neuron "sees" a larger area of the input. With stride 1, every k×k layer widens the receptive field by k - 1 pixels, so with 5 layers of 3×3 convolutions the receptive field is 1 + 5 × 2 = 11, i.e. 11×11.
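
The receptive-field arithmetic generalizes to mixed kernel sizes and strides. The sketch below uses the standard recurrence (each layer adds (k - 1) times the product of all earlier strides); the function name `receptive_field` and the (kernel, stride) tuple format are assumptions made for illustration.

```python
def receptive_field(layers):
    """Receptive field of one output unit, for a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1)] * 5))   # 11
```

Strided layers make later layers grow the receptive field faster, which is one reason networks interleave downsampling with convolutions.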

Pooling Layers

Pooling reduces spatial dimensions, which cuts compute in later layers and adds a degree of invariance to small translations of the input.

Max Pooling (2×2, stride 2)

Keeps the maximum value in each 2×2 block, halving the height and width.

Average Pooling (2×2, stride 2)

Keeps the average value of each 2×2 block, halving the height and width.
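
Both pooling variants can be sketched in a few lines. This is a minimal illustration on a single-channel map stored as nested lists, assuming even height and width; the helper name `pool2x2` is hypothetical.

```python
def pool2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2 on a single-channel H x W map (H and W even)."""
    if mode == "max":
        combine = max
    else:
        combine = lambda block: sum(block) / len(block)
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            block = [feature_map[i][j], feature_map[i][j + 1],
                     feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(combine(block))
        pooled.append(row)
    return pooled


fmap = [[1, 3, 2, 0],
        [5, 7, 4, 6],
        [9, 1, 0, 2],
        [3, 3, 8, 6]]

print(pool2x2(fmap, "max"))      # [[7, 6], [9, 8]]
print(pool2x2(fmap, "average"))  # [[4.0, 3.0], [4.0, 4.0]]
```

The 4×4 map becomes 2×2 in both cases; max pooling keeps the strongest response in each block, while average pooling smooths it.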

Activation Functions

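Without nonlinear activations, stacked linear layers collapse to a single linear transformation, no matter how many layers are stacked. Key activations include ReLU, defined as ReLU(x) = max(0, x), and the sigmoid.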
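
The collapse of stacked linear layers can be checked directly. In this sketch the weight matrices `W1`, `W2` and the input `x` are arbitrary illustrative values, biases are omitted for brevity, and the helpers `matmul`, `matvec`, and `relu` are hypothetical names.

```python
def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v):
    """Elementwise ReLU(x) = max(0, x)."""
    return [max(0.0, x) for x in v]


W1 = [[1.0, -2.0], [0.5, 1.0]]   # first "layer" (illustrative weights)
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # second "layer"
x = [1.0, 2.0]

two_linear = matvec(W2, matvec(W1, x))        # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)         # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU between the layers

print(two_linear)   # [-3.5, 10.5]
print(collapsed)    # [-3.5, 10.5]
print(with_relu)    # [2.5, 7.5]
```

The first two results match because W2(W1 x) = (W2 W1) x; inserting ReLU between the layers breaks that equivalence, which is what lets depth add expressive power.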

Feature Maps - What Does Each Layer See?

Early layers in a CNN learn to detect simple patterns - edges, blobs, color gradients - because each unit sees only a small patch of the input, and such local patterns are useful for almost any image. Deep layers, whose units cover much larger regions, combine these into complex, task-specific features.
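
To illustrate what an edge-detecting filter responds to, the sketch below applies a hand-written Sobel-style kernel to a tiny synthetic image. Note the assumption: this kernel is fixed by hand for illustration, whereas a CNN would learn its filters from data; `correlate2d` is a hypothetical helper.

```python
def correlate2d(image, kernel):
    """Slide a k x k kernel over a single-channel image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(out_w)] for i in range(out_h)]


# 5x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Sobel-style kernel that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

response = correlate2d(image, vertical_edge)
for row in response:
    print(row)   # [0, 4, 4, 0] for every row
```

The response is zero over the flat regions and large only where the window straddles the edge, which is the "bright = strong activation" pattern a feature map visualizes.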

[Figure: Simulated feature map responses. Each panel simulates what a filter "responds to". Bright = strong activation.]

Famous CNN Architectures

Model | Year | Key Innovation | ImageNet Top-5 Accuracy
AlexNet | 2012 | GPU training, ReLU, Dropout | 84.7%
VGG-16 | 2014 | Depth with 3×3 convs only | 92.7%
GoogLeNet | 2014 | Inception modules, 1×1 convs | 93.3%
ResNet-50 | 2015 | Residual (skip) connections | 96.4%
EfficientNet | 2019 | Compound scaling (depth/width/res) | 97.1%
ViT | 2020 | Vision Transformer, patch tokens | 97.9%

Quiz

Check your understanding

1. A conv layer has input 32×32×3, uses 16 filters of size 3×3, stride 1, padding 1. What is the output shape?

2. Why are residual (skip) connections in ResNet important?

3. ReLU(x) = max(0,x). What is its main advantage over sigmoid?