Computer Vision with Marco

Object Detection

What is Object Detection?

Classification answers "what is in the image?". Detection answers "what is where?" - for each object instance, output a bounding box (x, y, w, h) and a class label with a confidence score.

Detection is harder: images may contain multiple objects of different sizes; objects may overlap; the number of objects is variable.

Sliding Window Approach

The naive approach: slide a window across the image at multiple scales and aspect ratios, run a classifier on each window. If confidence is high, declare a detection.

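Problems: a very large number of windows (the count is the product of positions, scales, and aspect ratios, roughly 10⁴–10⁵ per image), very slow inference because the classifier runs once per window, and no end-to-end training.

Sliding windows were superseded by region proposals + CNN (the R-CNN family) and later by one-stage detectors (YOLO, SSD).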
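To make the cost of the sliding-window approach concrete, here is a minimal sketch that only enumerates the windows, without running any classifier. The function name `sliding_windows`, the image size, the window sizes, the aspect ratios, and the stride are all illustrative assumptions, not values from a specific detector.

```python
def sliding_windows(img_w, img_h, sizes, aspect_ratios, stride):
    """Yield (x, y, w, h) windows for every position, size, and aspect ratio."""
    for size in sizes:
        for ratio in aspect_ratios:
            # ratio = width / height; keep the window area close to size * size
            w = round(size * ratio ** 0.5)
            h = round(size / ratio ** 0.5)
            if w > img_w or h > img_h:
                continue  # this window shape does not fit in the image
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    yield (x, y, w, h)


# Hypothetical settings: a 640x480 image, three sizes, three aspect ratios.
windows = list(sliding_windows(640, 480,
                               sizes=(64, 128, 256),
                               aspect_ratios=(0.5, 1.0, 2.0),
                               stride=16))
print(len(windows))  # every one of these would need its own classifier pass
```

Halving the stride roughly quadruples the number of windows, which is why exhaustive search becomes impractical so quickly.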

Intersection over Union (IoU)

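How do we measure whether a predicted box matches a ground-truth box? IoU is the ratio of the overlap area to the total area covered by the two boxes:

IoU = Area(Predicted ∩ Ground Truth) / Area(Predicted ∪ Ground Truth)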

Symbol guide

IoU	Intersection over Union - a score from 0 (no overlap) to 1.0 (perfect match)
Predicted	the bounding box your model output - defined by (x, y, width, height)
Ground Truth	the manually labeled correct box for the object
∩ (intersection)	the overlapping region between the two boxes - the area both boxes share
∪ (union)	the total area covered by either box - intersection plus the non-overlapping parts
Area(…)	pixel area of the region; for a single rectangle this is width × height, and the union area is Area(Predicted) + Area(Ground Truth) − Area(intersection)
IoU ≥ 0.5	common threshold for counting a detection as a "true positive" in evaluation metrics
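The definitions in the symbol guide translate into a few lines of code. This is a minimal sketch that assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner. Some detectors use the box center instead, in which case the boxes would need converting first. The function name `iou` and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    # Union = both areas minus the part counted twice
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


predicted = (50, 50, 100, 100)
ground_truth = (100, 100, 100, 100)
score = iou(predicted, ground_truth)
print(score)         # 2500 / 17500, about 0.143
print(score >= 0.5)  # False: this prediction would not count as a true positive
```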

Anchor Boxes

Instead of sliding arbitrary windows, modern detectors pre-define a set of anchor boxes at each grid cell - boxes with different aspect ratios and scales. The network predicts offsets from these anchors rather than absolute coordinates. This drastically reduces the search space.

Why anchors work: Natural objects (cars, people, faces) have characteristic aspect ratios. Anchors initialized to these ratios give the network a good starting point - it only needs to predict small adjustments.
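As a rough illustration of anchors and offsets, the sketch below places anchors at the center of every grid cell and then decodes one set of predicted offsets. The decoding rule (center shift scaled by anchor size, exponential width and height scaling) is the parameterization used in the Faster R-CNN family and is shown here as one common choice, not the only one. The names `make_anchors` and `decode`, and all numeric settings, are illustrative assumptions.

```python
import math


def make_anchors(grid_size, cell_px, scales, aspect_ratios):
    """Return anchors as (cx, cy, w, h), centered on every grid cell."""
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) * cell_px
            cy = (row + 0.5) * cell_px
            for scale in scales:
                for ratio in aspect_ratios:  # ratio = width / height
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors


def decode(anchor, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to an anchor (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


anchors = make_anchors(grid_size=7, cell_px=64,
                       scales=(64, 128), aspect_ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 7 * 7 * 2 * 3 = 294 candidate boxes

# Zero offsets return the anchor unchanged; small offsets nudge it.
print(decode(anchors[0], (0.0, 0.0, 0.0, 0.0)) == anchors[0])  # True
print(decode(anchors[0], (0.1, -0.05, 0.2, 0.0)))
```

The point of the design is that the network output stays small and centered near zero: a good anchor needs only a small correction.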

Non-Maximum Suppression (NMS)

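A detection network typically produces hundreds of boxes, with many overlapping boxes around each object. NMS keeps only the best box per object: sort the boxes by confidence, keep the highest-scoring one, discard every remaining box whose IoU with it exceeds a threshold, and repeat with the boxes that are left.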
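A minimal sketch of greedy NMS follows. It assumes detections are tuples (x, y, w, h, score) with (x, y) the top-left corner, and it ignores class labels; detectors usually apply NMS separately per class. The function names and the example boxes are illustrative.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes, with (x, y) the top-left corner."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over (x, y, w, h, score) tuples; returns kept boxes, best first."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still in play
        kept.append(best)
        # Drop every box that overlaps the kept box too much
        remaining = [d for d in remaining
                     if iou(best[:4], d[:4]) <= iou_threshold]
    return kept


detections = [
    (100, 100, 80, 80, 0.90),   # three boxes around the same object
    (105, 102, 80, 80, 0.75),
    (98, 95, 82, 84, 0.60),
    (300, 200, 60, 120, 0.85),  # a separate object
]
for det in nms(detections):
    print(det)
# (100, 100, 80, 80, 0.9)
# (300, 200, 60, 120, 0.85)
```

A lower IoU threshold suppresses more aggressively, which helps with duplicates but can remove valid detections of objects that genuinely overlap.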

YOLO Intuition

YOLO (You Only Look Once, 2016) made real-time detection practical. Its core idea: divide the image into an S×S grid. Each cell predicts B bounding boxes and C class probabilities - in a single forward pass.

Symbol guide

S × S	spatial grid - the image is divided into S rows and S columns of cells; in YOLO v1, S=7
B	number of bounding boxes each grid cell predicts; YOLO v1 uses B=2
5	values per box: x offset, y offset, width, height, and objectness confidence score
B × 5	all bounding box predictions for one cell, e.g. 2 × 5 = 10 values
C	number of object classes the model can recognize, e.g. C=20 for VOC dataset (car, dog, person...)
B × 5 + C	total values per grid cell - box predictions plus class probabilities, e.g. 10 + 20 = 30
S × S × (B×5+C)	full output volume shape, e.g. 7 × 7 × 30 - one 3-D tensor from a single forward pass

Each box: (x, y, w, h, confidence). For YOLO v1: S=7, B=2, C=20 → output 7×7×30.
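The arithmetic in the symbol guide can be checked directly, and one cell's 30 values can be unpacked into boxes and class probabilities. In this sketch the layout of the cell vector (all box values first, then class probabilities) is an assumption that follows the order of the B × 5 + C formula; real implementations may order the values differently. The helper names `split_cell` and `best_detection` and the example numbers are made up for illustration. Multiplying box confidence by class probability to get a class-specific score follows the YOLO v1 formulation.

```python
S, B, C = 7, 2, 20  # YOLO v1 settings from the symbol guide

values_per_cell = B * 5 + C
output_shape = (S, S, values_per_cell)
print(output_shape)             # (7, 7, 30)
print(S * S * values_per_cell)  # 1470 numbers per image in total


def split_cell(cell):
    """Split one cell's flat vector into B boxes and C class probabilities.

    Assumed layout: [box_1 (5 values), ..., box_B (5 values), class probs].
    """
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


def best_detection(cell):
    """Return (box, class_id, score) for the most confident box in a cell."""
    boxes, class_probs = split_cell(cell)
    x, y, w, h, confidence = max(boxes, key=lambda b: b[4])
    class_id = max(range(C), key=lambda c: class_probs[c])
    # Class-specific score = box confidence * class probability
    return (x, y, w, h), class_id, confidence * class_probs[class_id]


# A made-up cell: one confident box, one weak box, class 6 most likely.
class_probs = [0.0] * C
class_probs[6] = 0.9
cell = [0.5, 0.5, 0.2, 0.3, 0.8,
        0.4, 0.6, 0.1, 0.1, 0.1] + class_probs
print(best_detection(cell))
```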

YOLO trades slightly lower accuracy for enormous speed (45 fps on GPU at launch). Subsequent versions (v3–v10) closed the accuracy gap.

Detector Evolution

Model	Year	Type	Key idea	Speed
R-CNN	2014	2-stage	Region proposals + CNN per region	~47s/img
Fast R-CNN	2015	2-stage	Shared conv features, RoI pooling	~0.3s/img
Faster R-CNN	2015	2-stage	RPN replaces selective search	5 fps
YOLO v1	2016	1-stage	Single pass, grid prediction	45 fps
SSD	2016	1-stage	Multi-scale feature maps + anchors	59 fps
DETR	2020	Transformer	End-to-end, no anchors/NMS	Medium
YOLOv8	2023	1-stage	Anchor-free, C2f blocks	Real-time
