Computer Vision with Marco

Convolutional Neural Networks

Why CNNs for Vision?

A fully connected (FC) network on a 224×224 RGB image would need 224 × 224 × 3 = 150,528 input neurons. That's enormous: every neuron in the first hidden layer needs its own 150,528 weights, so parameters and compute explode, and the 2D spatial structure of the image is ignored.
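To make the size of the problem concrete, here is a rough parameter count for the first layer only. The hidden width of 1000 and the 64 filters of size 3×3 are hypothetical values picked for illustration; they do not come from any specific model.

```python
# Parameter counts for the first layer only (weights + biases).
H, W, C = 224, 224, 3
inputs = H * W * C  # one input value per pixel per channel

hidden_units = 1000  # hypothetical width of the first FC layer
fc_params = inputs * hidden_units + hidden_units

k, n_filters = 3, 64  # hypothetical conv layer: 64 filters of size 3x3x3
conv_params = (k * k * C) * n_filters + n_filters

print(inputs)       # 150528
print(fc_params)    # 150529000
print(conv_params)  # 1792
```

The conv layer is so much smaller because each filter reuses the same k×k×C weights at every image position.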

Convolutional Layer

A conv layer applies N learned filters (kernels) to the input. Each filter produces one feature map. If input is H×W×C and we apply N filters of size k×k×C, the output is H'×W'×N.
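The shape bookkeeping above can be sketched with a naive pure-Python convolution. This is a minimal illustration, assuming no padding and stride 1 (so H' = H - k + 1), with random values standing in for learned filters. Like most deep learning libraries, it computes cross-correlation (the kernel is not flipped). The function name `conv2d` is just a label for this sketch.

```python
import random

def conv2d(image, filters):
    """Naive convolution with no padding and stride 1.

    image:   H x W x C nested lists
    filters: N filters, each k x k x C nested lists
    returns: H' x W' x N nested lists, with H' = H - k + 1
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    out_h, out_w = H - k + 1, W - k + 1
    out = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for n, f in enumerate(filters):  # one feature map per filter
        for i in range(out_h):
            for j in range(out_w):
                out[i][j][n] = sum(
                    image[i + di][j + dj][c] * f[di][dj][c]
                    for di in range(k) for dj in range(k) for c in range(C)
                )
    return out

random.seed(0)
H, W, C, N, k = 8, 8, 3, 4, 3
image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
filters = [
    [[[random.gauss(0, 1) for _ in range(C)] for _ in range(k)] for _ in range(k)]
    for _ in range(N)
]

fmap = conv2d(image, filters)
print(len(fmap), len(fmap[0]), len(fmap[0][0]))  # 6 6 4
```

An 8×8×3 input with four 3×3×3 filters gives a 6×6×4 output: one 6×6 feature map per filter.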

Output size: H' = (H - k + 2P) / S + 1 (rounded down when the division is not exact)

Symbol guide

H'	output height (or width - same formula applies to both) after the conv layer
H	input height in pixels before this layer
k	kernel size - the height/width of the square filter, e.g. k=3 for a 3×3 kernel
P	padding - rows/columns of zeros added to the border; with S=1 and odd k, P=(k-1)/2 keeps H' = H
2P	total padding added (P rows on each side, so double)
S	stride - how many pixels the kernel jumps each step; S=1 is dense, S=2 roughly halves the output size
+1	accounts for the first kernel position at the top-left corner

P = padding, S = stride. Stride > 1 downsamples. With stride 1 and an odd kernel size, padding = (k-1)/2 preserves spatial size.
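The symbols above can be tried out in a small helper. This sketch assumes the standard output-size relation H' = floor((H - k + 2P) / S) + 1, which is what the symbol guide describes. The function name `conv_output_size` is made up for this example.

```python
def conv_output_size(h, k, p=0, s=1):
    """Output height (or width) of a conv layer.

    h: input size, k: kernel size, p: padding per side, s: stride.
    Integer division rounds down when the stride does not divide evenly.
    """
    return (h - k + 2 * p) // s + 1

print(conv_output_size(224, 3, p=1, s=1))  # 224  (P = (k-1)/2 preserves size)
print(conv_output_size(224, 3, p=1, s=2))  # 112  (stride 2 halves it)
print(conv_output_size(224, 7, p=3, s=2))  # 112
print(conv_output_size(224, 3, p=0, s=1))  # 222  (no padding shrinks by k-1)
```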

Receptive field: After several conv layers, each output neuron "sees" a larger area of the input. Each 3×3 convolution with stride 1 widens the field by k-1 = 2 pixels, so with 5 such layers the receptive field is 1 + 5×2 = 11, i.e. 11×11.
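The 11×11 figure can be checked with the usual layer-by-layer recurrence: each layer grows the receptive field by (k - 1) times the product of all earlier strides. The helper below is a minimal sketch of that recurrence; the name `receptive_field` and the (kernel, stride) tuple format are choices made for this example.

```python
def receptive_field(layers):
    """Receptive field of one output neuron, measured in input pixels.

    layers: list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent neurons
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 5))               # 11
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8  (conv, 2x2 pool, conv)
```

The second call shows why downsampling matters: after a stride-2 layer, every later 3×3 kernel covers twice as many input pixels per step.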

Pooling Layers

Pooling reduces spatial dimensions, which cuts compute in later layers and adds a small amount of translation invariance.

Max Pooling (2×2, stride 2)

Keeps the maximum value in each 2×2 block.

Average Pooling (2×2, stride 2)

Keeps the average value in each 2×2 block.
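Both pooling variants can be sketched on a single-channel 4×4 input. This is an illustrative pure-Python version; the function name `pool2d` and its `mode` argument are assumptions of this sketch, not a library API.

```python
def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2D list with a size x size window; mode is "max" or "average"."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for i in range(0, len(x) - size + 1, stride):
        row = []
        for j in range(0, len(x[0]) - size + 1, stride):
            window = [x[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

x = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 3, 9],
]

print(pool2d(x, mode="max"))      # [[6, 2], [2, 9]]
print(pool2d(x, mode="average"))  # [[3.5, 1.0], [1.0, 6.0]]
```

The 4×4 input becomes 2×2 in both cases. Max pooling keeps only the strongest response in each block, while average pooling blends all four values.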

Activation Functions

Without nonlinear activations, stacked linear layers collapse to a single linear transformation, so extra depth would add no expressive power.
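The collapse can be seen on a toy example. The weights below are made up for illustration, and ReLU is used as one common choice of nonlinearity. Two linear layers give the same output as the single matrix W2·W1, while putting a ReLU between them produces a result that no single linear layer could.

```python
def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def relu(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, -1.0], [-1.0, 1.0]]  # hypothetical first-layer weights
W2 = [[1.0, 1.0]]                # hypothetical second-layer weights
x = [2.0, 0.5]

two_linear = matvec(W2, matvec(W1, x))        # layer by layer
collapsed = matvec(matmul(W2, W1), x)         # one merged matrix
with_relu = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between

print(two_linear)  # [0.0]
print(collapsed)   # [0.0]
print(with_relu)   # [1.5]
```

With the ReLU in place this tiny network computes |x1 - x2|, which is not a linear function of the input.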

Feature Maps - What Does Each Layer See?

Early layers in a CNN learn to detect simple patterns - edges, blobs, color gradients - because these low-level patterns occur in nearly every image and are useful for almost any task. Deeper layers combine them into complex, task-specific features.

Simulated Feature Map Responses

Each panel simulates what a filter "responds to". Bright = strong activation.
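A filter response of this kind can be sketched with a hand-written vertical-edge kernel. The Sobel kernel here is only a stand-in for a learned first-layer filter, and the tiny image is synthetic. The function name `correlate2d` is a label for this sketch.

```python
def correlate2d(image, kernel):
    """Slide a k x k kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Synthetic 5x6 grayscale image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-designed Sobel kernel that responds to left-to-right brightness increases.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = correlate2d(image, sobel_x)
for row in response:
    print(row)  # each row: [0, 4, 4, 0]
```

The response is zero over the flat regions and large where the window straddles the dark-to-bright boundary, which is the "bright = strong activation" pattern described above.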

Famous CNN Architectures

Model	Year	Key Innovation	ImageNet Top-5 Accuracy
AlexNet	2012	GPU training, ReLU, Dropout	84.7%
VGG-16	2014	Depth with 3×3 convs only	92.7%
GoogLeNet	2014	Inception modules, 1×1 convs	93.3%
ResNet (ILSVRC 2015 ensemble)	2015	Residual (skip) connections	96.4%
EfficientNet	2019	Compound scaling (depth/width/res)	97.1%
ViT	2020	Vision Transformer, patch tokens	97.9%
