Object Detection

Module 8 - Sliding Windows, Anchors, IoU, NMS, YOLO

What is Object Detection?

Classification answers "what is in the image?". Detection answers "what is where?" - for each object instance, output a bounding box (x, y, w, h) and a class label with a confidence score.

Detection is harder: images may contain multiple objects of different sizes; objects may overlap; the number of objects is variable.

Sliding Window Approach

The naive approach: slide a window across the image at multiple scales and aspect ratios, and run a classifier on each window. If the classifier's confidence is high, declare a detection.

Problems: the search is exhaustive, so the number of windows is very large (~10⁴–10⁵ per image, each needing its own classifier pass); this makes detection very slow; and the pipeline cannot be trained end-to-end.
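To get a feel for the window count, here is a minimal sketch that counts window positions for a hypothetical setup. The image size, scales, aspect ratios, and stride are illustrative assumptions, not values from a particular detector.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count window positions for each (width, height) window size."""
    total = 0
    for win_w, win_h in win_sizes:
        if win_w > img_w or win_h > img_h:
            continue  # this window does not fit inside the image
        cols = (img_w - win_w) // stride + 1
        rows = (img_h - win_h) // stride + 1
        total += cols * rows
    return total

# Assumed setup: 640x480 image, three scales x three aspect ratios, stride 8.
scales = [64, 128, 256]
ratios = [(1, 1), (1, 2), (2, 1)]  # (width multiplier, height multiplier)
win_sizes = [(s * rw, s * rh) for s in scales for rw, rh in ratios]

num_windows = count_windows(640, 480, win_sizes, stride=8)
print(num_windows)  # 19528 classifier calls for a single image
```

Even this coarse grid needs roughly 2 × 10⁴ classifier passes per image; a smaller stride or more scales pushes the count up quickly.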

This was replaced by region proposal + CNN (R-CNN family) and then one-stage detectors (YOLO, SSD).

Intersection over Union (IoU)

How do we measure whether a predicted box matches a ground truth box? IoU is the ratio of the area the two boxes share to the total area they cover together:

IoU = Area(Predicted ∩ Ground Truth) / Area(Predicted ∪ Ground Truth)
Symbol guide
IoU: Intersection over Union - a score from 0 (no overlap) to 1.0 (perfect match)
Predicted: the bounding box your model output - defined by (x, y, width, height)
Ground Truth: the manually labeled correct box for the object
∩ (intersection): the overlapping region between the two boxes - the area both boxes share
∪ (union): the total area covered by either box - intersection plus the non-overlapping parts
Area(…): pixel area of the region, computed as width × height of the rectangle
IoU ≥ 0.5: common threshold for counting a detection as a "true positive" in evaluation metrics
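
The formula can be sketched in a few lines of Python. This is a minimal illustration that assumes axis-aligned boxes in (x, y, w, h) form with (x, y) as the top-left corner; the function name `iou` and the example boxes are made up for the example.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, w, h), (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    # Union = both areas minus the part counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 10, 10)
print(iou(predicted, ground_truth))  # 50 / 150, about 0.333
```

Here the boxes share a 5 × 10 strip, so the intersection is 50 and the union is 100 + 100 - 50 = 150. With the common 0.5 threshold this prediction would not count as a true positive.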

Anchor Boxes

Instead of sliding arbitrary windows, modern detectors pre-define a set of anchor boxes at each grid cell - boxes with different aspect ratios and scales. The network predicts offsets from these anchors rather than absolute coordinates. This drastically reduces the search space.

Why anchors work: Natural objects (cars, people, faces) have characteristic aspect ratios. Anchors initialized to these ratios give the network a good starting point - it only needs to predict small adjustments.
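
A minimal sketch of both ideas, generating anchors for one grid cell and applying predicted offsets to them. The scales and ratios are illustrative, and the offset parameterization (centre shift scaled by anchor size, log-space width and height) is the one commonly used in Faster R-CNN-style detectors; treat it as an assumption, since other detectors encode offsets differently.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Anchors centred at (cx, cy) as (cx, cy, w, h); ratio = width / height."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while changing the aspect ratio.
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors

def decode(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to one anchor."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

# Hypothetical cell centre and anchor settings.
anchors = make_anchors(cx=100.0, cy=100.0, scales=[64, 128], ratios=[0.5, 1.0, 2.0])
print(len(anchors))                               # 6 anchors for this cell
print(decode(anchors[1], (0.0, 0.0, 0.0, 0.0)))   # zero offsets return the anchor itself
```

With all offsets at zero the prediction is exactly the anchor, which is why a well-chosen anchor leaves the network only small adjustments to learn.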

Non-Maximum Suppression (NMS)

A detection network typically produces many overlapping bounding boxes for the same object. NMS keeps only the highest-scoring box in each group of overlapping boxes:

  1. Sort all boxes by confidence score (highest first)
  2. Keep the top box
  3. Remove all remaining boxes with IoU ≥ threshold (e.g. 0.5) vs. the kept box
  4. Repeat from step 2 with the highest-scoring box that survives, until no boxes are left
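
The steps above can be sketched as a greedy loop. This is a minimal pure-Python illustration, assuming (x, y, w, h) boxes with a top-left origin and a single class; real detectors usually run NMS per class and use vectorized library routines instead. The names `nms` and `iou` are chosen for the example.

```python
def iou(a, b):
    """IoU of two (x, y, w, h) boxes with (x, y) as the top-left corner."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 1: sort box indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the best remaining box.
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
        # Step 4: the loop repeats with the next surviving box.
    return keep

# Two near-duplicate boxes on one object plus one box elsewhere (made-up data).
boxes = [(10, 10, 50, 50), (12, 12, 50, 50), (200, 200, 40, 40)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box has an IoU of about 0.85 with the first, so it is suppressed; the third box does not overlap either and survives.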

YOLO Intuition

YOLO (You Only Look Once, 2016) made real-time detection practical. Its core idea: divide the image into an S×S grid. Each cell predicts B bounding boxes and C class probabilities - in a single forward pass.

Output tensor: S × S × (B × 5 + C)
Symbol guide
S × S: spatial grid - the image is divided into S rows and S columns of cells; in YOLO v1, S=7
B: number of bounding boxes each grid cell predicts; YOLO v1 uses B=2
5: values per box: x offset, y offset, width, height, and objectness confidence score
B × 5: all bounding box predictions for one cell, e.g. 2 × 5 = 10 values
C: number of object classes the model can recognize, e.g. C=20 for VOC dataset (car, dog, person...)
B × 5 + C: total values per grid cell - box predictions plus class probabilities, e.g. 10 + 20 = 30
S × S × (B×5+C): full output volume shape, e.g. 7 × 7 × 30 - one 3-D tensor from a single forward pass

Each box: (x, y, w, h, confidence). For YOLO v1: S=7, B=2, C=20 → output 7×7×30.
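
The shape arithmetic can be checked with a short sketch. The per-cell layout used here (B boxes first, then C class probabilities) simply follows the order of the formula and is an assumption; actual implementations may order the values differently. The helper `split_cell` is hypothetical.

```python
S, B, C = 7, 2, 20  # YOLO v1 settings from the text

values_per_cell = B * 5 + C          # 2 * 5 + 20 = 30
output_shape = (S, S, values_per_cell)
total_values = S * S * values_per_cell

def split_cell(cell):
    """Split one cell's vector into B boxes and C class probabilities.

    Assumed layout: [box 0 (5 values), box 1 (5 values), ..., class probs (C values)].
    """
    boxes = [tuple(cell[b * 5:(b + 1) * 5]) for b in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs

# Dummy cell vector 0.0, 1.0, ..., 29.0 so the slices are easy to inspect.
cell = [float(i) for i in range(values_per_cell)]
cell_boxes, class_probs = split_cell(cell)

print(output_shape, total_values)          # (7, 7, 30) 1470
print(len(cell_boxes), len(class_probs))   # 2 20
```

Note that the C class probabilities are shared by all B boxes in a cell, which is why C is added once rather than multiplied by B.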

YOLO v1 gave up a little accuracy in exchange for a large gain in speed (45 fps on a GPU at launch). Subsequent versions (v3–v10) closed the accuracy gap.

Detector Evolution

Model        | Year | Type        | Key idea                           | Speed
R-CNN        | 2014 | 2-stage     | Region proposals + CNN per region  | ~47 s/img
Fast R-CNN   | 2015 | 2-stage     | Shared conv features, RoI pooling  | ~0.3 s/img
Faster R-CNN | 2015 | 2-stage     | RPN replaces selective search      | 5 fps
YOLO v1      | 2016 | 1-stage     | Single pass, grid prediction       | 45 fps
SSD          | 2016 | 1-stage     | Multi-scale feature maps + anchors | 59 fps
YOLOv8       | 2023 | 1-stage     | Anchor-free, C2f blocks            | Real-time
DETR         | 2020 | Transformer | End-to-end, no anchors/NMS         | Medium

Quiz

Check your understanding

1. Two boxes have area 100 each and their intersection area is 40. What is their IoU?

2. What does NMS (Non-Maximum Suppression) solve?

3. YOLO v1 uses a 7×7 grid, 2 boxes per cell, 20 classes. What is the output tensor shape?