Convolutional Neural Networks
Module 7 - Layers, Pooling, Activations, Feature Maps
Why CNNs for Vision?
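A fully connected (FC) network on a 224×224 RGB image needs 224 × 224 × 3 = 150,528 input neurons. Every unit in the first hidden layer then needs 150,528 weights of its own, so the parameter count and compute grow very quickly, and the network ignores the spatial structure of the image.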
CNNs solve this with three key ideas:
- Local connectivity: Each neuron only sees a small patch (receptive field)
- Parameter sharing: The same filter slides across the whole image - one filter learns one feature everywhere
- Spatial hierarchy: Early layers detect edges → middle layers find shapes → deep layers recognize objects
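To make the effect of parameter sharing concrete, here is a small sketch that counts weights for one dense layer and one conv layer on the same 224×224 RGB input. The layer widths (1000 hidden units, 64 filters of size 3×3) and the helper names `fc_params` and `conv_params` are illustrative assumptions, not values from this module.

```python
# Compare weight counts for one fully connected layer vs. one conv layer
# on a 224x224 RGB input. The widths below (1000 units, 64 filters, 3x3
# kernels) are assumed for illustration only.

def fc_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases for a dense layer."""
    return n_inputs * n_units + n_units


def conv_params(k: int, c_in: int, n_filters: int) -> int:
    """Weights plus biases for a conv layer; independent of image size."""
    return k * k * c_in * n_filters + n_filters


n_inputs = 224 * 224 * 3            # 150,528 input values
fc = fc_params(n_inputs, 1000)      # dense layer with 1000 hidden units
conv = conv_params(3, 3, 64)        # 64 filters of size 3x3x3

print(f"inputs:      {n_inputs:,}")   # 150,528
print(f"FC params:   {fc:,}")         # 150,529,000
print(f"conv params: {conv:,}")       # 1,792
```

The conv layer's count does not depend on the image size at all, because the same small filter is reused at every position.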
Convolutional Layer
A conv layer applies N learned filters (kernels) to the input. Each filter produces one feature map. If the input is H×W×C and we apply N filters of size k×k×C, the output is H'×W'×N, where H' and W' depend on the kernel size, padding, and stride.
The output height (the same formula applies to the width) is:

$$H' = \left\lfloor \frac{H - k + 2P}{S} \right\rfloor + 1$$

| Symbol | Meaning |
|---|---|
| H' | output height (or width - same formula applies to both) after the conv layer |
| H | input height in pixels before this layer |
| k | kernel size - the height/width of the square filter, e.g. k=3 for a 3×3 kernel |
| P | padding - rows/columns of zeros added to each border; with S=1 and odd k, P=(k-1)/2 keeps H' = H |
| 2P | total padding added (P rows on each side, so double) |
| S | stride - how many pixels the kernel jumps each step; S=1 is dense, S=2 roughly halves the output size |
| +1 | accounts for the first kernel position at the top-left corner |

The division rounds down when H - k + 2P is not a multiple of S. Stride > 1 downsamples. With S=1 and odd k, padding P = (k-1)/2 preserves spatial size.
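The output-size rule for a conv layer can be checked with a few lines of arithmetic. The sketch below assumes the standard formula floor((H - k + 2P) / S) + 1; the function name `conv_output_size` and the 224-pixel example input are illustrative choices.

```python
def conv_output_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output height (or width) of a conv layer: floor((h - k + 2p) / s) + 1."""
    if h + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (h - k + 2 * p) // s + 1


same = conv_output_size(224, k=3, p=1, s=1)      # P = (k-1)/2 keeps the size
strided = conv_output_size(224, k=3, p=1, s=2)   # stride 2 roughly halves it
valid = conv_output_size(224, k=5, p=0, s=1)     # no padding shrinks by k-1

print(same, strided, valid)  # 224 112 220
```

The same function is applied once for height and once for width; the number of output channels equals the number of filters N.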
Pooling Layers
Pooling reduces spatial dimensions, which cuts compute in later layers and adds a degree of invariance to small translations.
- Max pooling: keeps the maximum value in each 2×2 block
- Average pooling: keeps the average value in each 2×2 block
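As a rough illustration of the two pooling rules, the sketch below applies a 2×2 window with stride 2 to a small feature map stored as nested lists. The helper names `pool2x2` and `mean` and the example values are made up for this demonstration; real frameworks operate on tensors instead.

```python
def pool2x2(fmap, reduce_fn):
    """Apply a 2x2, stride-2 pooling window to a 2-D list of numbers.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            block = [fmap[r][c], fmap[r][c + 1],
                     fmap[r + 1][c], fmap[r + 1][c + 1]]
            out_row.append(reduce_fn(block))
        out.append(out_row)
    return out


def mean(values):
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]

max_pooled = pool2x2(fmap, max)    # [[7, 2], [2, 8]]
avg_pooled = pool2x2(fmap, mean)   # [[4.0, 1.0], [1.0, 5.0]]
print(max_pooled)
print(avg_pooled)
```

Both variants turn the 4×4 map into a 2×2 map; max pooling keeps the strongest response in each block, while average pooling smooths it.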
Activation Functions
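Without nonlinear activations, stacked linear layers collapse to a single linear transformation, so extra depth adds no expressive power. Key activations include ReLU, defined as ReLU(x) = max(0, x), and sigmoid, which squashes its input into the range (0, 1).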
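The collapse of stacked linear layers can be seen with scalar "layers". This is a minimal sketch: the weights `w1`, `b1`, `w2`, `b2` are arbitrary example values, and `two_linear`, `collapsed`, and `with_relu` are hypothetical helper names.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Two stacked 1-D linear layers with arbitrary example weights.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear(x: float) -> float:
    return w2 * (w1 * x + b1) + b2


def collapsed(x: float) -> float:
    # The same function written as ONE linear layer.
    return (w2 * w1) * x + (w2 * b1 + b2)


def with_relu(x: float) -> float:
    # A ReLU between the layers breaks the equivalence for some inputs.
    return w2 * relu(w1 * x + b1) + b2


xs = [-2.0, -0.5, 0.0, 1.0]
print(all(math.isclose(two_linear(x), collapsed(x)) for x in xs))  # True
print(two_linear(-2.0), with_relu(-2.0))                           # 9.5 0.5
print(sigmoid(0.0))                                                # 0.5
```

With the ReLU in place, the network is no longer a single straight line, which is what lets deeper stacks represent more complex functions.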
Feature Maps - What Does Each Layer See?
Early layers in a CNN learn to detect simple patterns - edges, blobs, color gradients - because each neuron sees only a small patch of the image and these low-level patterns are useful for almost any visual task. Deep layers combine these into complex, task-specific features.
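To get a feel for what an edge-detecting feature map looks like, the sketch below slides a hand-set 3×3 kernel over a tiny synthetic image. The Sobel-style kernel stands in for a learned first-layer filter, which is an assumption for illustration; in a trained CNN the filter values come from training. The function name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a 2-D image (stride 1, no padding).

    Like most deep learning frameworks, this computes cross-correlation:
    the kernel is not flipped.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-set Sobel-style kernel that responds where brightness
# increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)  # each row: [0, 4, 4, 0]
```

The feature map is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is the kind of response early layers tend to produce.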
Famous CNN Architectures
| Model | Year | Key Innovation | ImageNet Top-5 Accuracy |
|---|---|---|---|
| AlexNet | 2012 | GPU training, ReLU, Dropout | 84.7% |
| VGG-16 | 2014 | Depth with 3×3 convs only | 92.7% |
| GoogLeNet | 2014 | Inception modules, 1×1 convs | 93.3% |
| ResNet (ensemble) | 2015 | Residual (skip) connections | 96.4% |
| EfficientNet | 2019 | Compound scaling (depth/width/res) | 97.1% |
| ViT | 2020 | Vision Transformer, patch tokens | 97.9% |
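The residual (skip) connection listed for ResNet in the table can be sketched as "output = F(x) + x". The code below is a toy version on plain lists, not ResNet's actual block (which uses convolutions and batch normalization); `residual_block`, `zero_branch`, and `half_branch` are hypothetical names chosen for this example.

```python
def residual_block(x, f):
    """Return f(x) + x elementwise for a list of floats."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]


def zero_branch(x):
    # A branch that has learned nothing yet: contributes all zeros.
    return [0.0 for _ in x]


def half_branch(x):
    # An arbitrary example transformation standing in for conv layers.
    return [0.5 * v for v in x]


x = [1.0, 2.0, -4.0]

# Stack 50 blocks whose branches output zero: the signal passes unchanged.
y = x
for _ in range(50):
    y = residual_block(y, zero_branch)

print(y)                               # [1.0, 2.0, -4.0]
print(residual_block(x, half_branch))  # [1.5, 3.0, -6.0]
```

Because the identity path always carries the input forward, a block only has to learn a correction on top of it, which is one commonly cited reason very deep networks become easier to train.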
Quiz
Check your understanding
1. A conv layer has input 32×32×3, uses 16 filters of size 3×3, stride 1, padding 1. What is the output shape?
2. Why are residual (skip) connections in ResNet important?
3. ReLU(x) = max(0,x). What is its main advantage over sigmoid?