Object Detection
Module 8 - Sliding Windows, Anchors, IoU, NMS, YOLO
What is Object Detection?
Classification answers "what is in the image?". Detection answers "what is where?" - for each object instance, output a bounding box (x, y, w, h) and a class label with a confidence score.
Detection is harder: images may contain multiple objects of different sizes; objects may overlap; the number of objects is variable.
Sliding Window Approach
The naive approach: slide a window across the image at multiple scales and aspect ratios, run a classifier on each window. If confidence is high, declare a detection.
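Problems: the search is exhaustive, so the number of windows multiplies with every added position, scale, and aspect ratio (~10⁴–10⁵ windows per image); running a classifier on each one is very slow; and the pipeline cannot be trained end to end.
This approach was superseded first by region proposals + CNN (the R-CNN family) and then by one-stage detectors (YOLO, SSD).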
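To get a feel for the window count, here is a minimal sketch that counts how many crops a sliding-window detector would have to classify. The image size, stride, scales, and aspect ratios are hypothetical values chosen for illustration, not settings from any particular detector.

```python
def count_windows(img_w, img_h, window_sizes, stride):
    """Count the window positions a sliding-window detector must classify."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > img_w or win_h > img_h:
            continue  # this window shape does not fit inside the image
        cols = (img_w - win_w) // stride + 1  # horizontal positions
        rows = (img_h - win_h) // stride + 1  # vertical positions
        total += cols * rows
    return total


# Hypothetical setup: three scales x three aspect ratios (r = width / height).
sizes = [(int(s * r), int(s / r)) for s in (64, 128, 256) for r in (0.5, 1.0, 2.0)]
n_windows = count_windows(640, 480, sizes, stride=8)
print(n_windows)
```

With these assumed settings the count lands on the order of 10⁴ for a single 640×480 image, and each of those windows needs its own classifier call.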
Intersection over Union (IoU)
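How do we measure whether a predicted box matches a ground-truth box? IoU is the ratio of the overlap area to the total area covered by the two boxes:
IoU = Area(Predicted ∩ Ground Truth) / Area(Predicted ∪ Ground Truth)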
| Term | Meaning |
|---|---|
| IoU | Intersection over Union - a score from 0 (no overlap) to 1.0 (perfect match) |
| Predicted | the bounding box your model output - defined by (x, y, width, height) |
| Ground Truth | the manually labeled correct box for the object |
| ∩ (intersection) | the overlapping region between the two boxes - the area both boxes share |
| ∪ (union) | the total area covered by either box - intersection plus the non-overlapping parts |
| Area(…) | pixel area of the region; for a single box this is width × height, and the union area is Area(Predicted) + Area(Ground Truth) − Area(intersection) |
| IoU ≥ 0.5 | common threshold for counting a detection as a "true positive" in evaluation metrics |
- IoU = 1.0: Perfect match
- IoU ≥ 0.5: Commonly used threshold for "correct" detection
- IoU = 0: No overlap
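The IoU definition can be sketched in a few lines. This version assumes boxes are (x, y, w, h) tuples with (x, y) at the top-left corner; the source does not say whether (x, y) is a corner or the center, so treat that convention as an assumption.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, w, h), (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlap rectangle, clamped to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes: 1.0
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))    # half overlap: 50 / 150, one third
print(iou((0, 0, 10, 10), (20, 20, 10, 10)))  # disjoint boxes: 0.0
```

Note that a 50% overlap of each box yields an IoU of only one third, which is why a 0.5 threshold is stricter than it first sounds.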
Anchor Boxes
Instead of sliding arbitrary windows, modern detectors pre-define a set of anchor boxes at each grid cell - boxes with different aspect ratios and scales. The network predicts offsets from these anchors rather than absolute coordinates. This drastically reduces the search space.
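As a rough sketch of the idea, the code below builds anchors for every grid cell and applies predicted offsets to one of them. The function names, the (cx, cy, w, h) center format, and the offset parameterization (shift scaled by anchor size, exponential for width and height, in the style of Faster R-CNN) are assumptions for illustration; individual detectors differ in these details.

```python
import math


def make_anchors(grid_w, grid_h, stride, scales, ratios):
    """Return anchors as (cx, cy, w, h), one per (cell, scale, ratio) combination."""
    anchors = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Anchor centers sit in the middle of each grid cell, in pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = width / height; area stays close to s * s
                    anchors.append((cx, cy, s * math.sqrt(r), s / math.sqrt(r)))
    return anchors


def decode(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (assumed parameterization)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))


anchors = make_anchors(grid_w=4, grid_h=3, stride=16, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
print(len(anchors))                           # 4 * 3 cells * 6 anchors each = 72
print(decode(anchors[1], (0.1, 0.0, 0.0, 0.0)))  # square anchor shifted right by 10% of its width
```

The network only has to choose among and refine a fixed, small set of boxes per cell, rather than score every possible window.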
Non-Maximum Suppression (NMS)
A detection network typically produces hundreds of candidate boxes, many of which overlap and point to the same object. NMS keeps only the highest-scoring box for each object:
- Sort all boxes by confidence score (highest first)
- Keep the top box
- Remove all remaining boxes whose IoU with the kept box is ≥ the threshold (e.g. 0.5)
- Repeat with the next-highest-scoring surviving box until none remain
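The steps above can be sketched as a greedy loop. This is a minimal, unoptimized version that assumes (x, y, w, h) boxes with a top-left (x, y) and ignores class labels; production detectors usually run NMS per class and use vectorized implementations.

```python
def box_iou(a, b):
    """IoU of two (x, y, w, h) boxes, (x, y) = top-left corner (assumed convention)."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of the kept boxes, highest score first."""
    # Step 1: sort box indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the top remaining box.
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop every remaining box that overlaps it too much (IoU >= threshold).
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_threshold]
        # Step 4: the loop repeats with the next surviving box.
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 (IoU about 0.68) and is suppressed
```

Lowering `iou_threshold` suppresses more aggressively, which helps with duplicates but can delete true detections of objects that genuinely overlap.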
YOLO Intuition
YOLO (You Only Look Once, 2016) made real-time detection practical. Its core idea: divide the image into an S×S grid. Each cell predicts B bounding boxes and C class probabilities - in a single forward pass.
| Symbol | Meaning |
|---|---|
| S × S | spatial grid - the image is divided into S rows and S columns of cells; in YOLO v1, S=7 |
| B | number of bounding boxes each grid cell predicts; YOLO v1 uses B=2 |
| 5 | values per box: x offset, y offset, width, height, and objectness confidence score |
| B × 5 | all bounding box predictions for one cell, e.g. 2 × 5 = 10 values |
| C | number of object classes the model can recognize, e.g. C=20 for VOC dataset (car, dog, person...) |
| B × 5 + C | total values per grid cell - box predictions plus class probabilities, e.g. 10 + 20 = 30 |
| S × S × (B×5+C) | full output volume shape, e.g. 7 × 7 × 30 - one 3-D tensor from a single forward pass |
Each box: (x, y, w, h, confidence). For YOLO v1: S=7, B=2, C=20 → output 7×7×30.
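The shape arithmetic, and how one cell's box prediction maps back to pixels, can be sketched as follows. The helper names are hypothetical. The decoding assumes the YOLO v1 convention that x and y are offsets within the cell (0 to 1) and that w and h are fractions of the whole image; the 448×448 input size is used only as an example.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Shape of the output tensor: S x S cells, each with B boxes of 5 values plus C class scores."""
    return (S, S, B * 5 + C)


def cell_box_to_image(row, col, box, S, img_w, img_h):
    """Convert one predicted box in grid cell (row, col) to pixel-space (cx, cy, w, h, conf)."""
    x, y, w, h, conf = box
    cx = (col + x) / S * img_w  # cell index plus in-cell offset, scaled to pixels
    cy = (row + y) / S * img_h
    return (cx, cy, w * img_w, h * img_h, conf)


shape = yolo_v1_output_shape()
print(shape)                            # (7, 7, 30)
print(shape[0] * shape[1] * shape[2])   # 1470 numbers per image

# A box centered in the middle cell of a 7x7 grid, covering the full image.
print(cell_box_to_image(3, 3, (0.5, 0.5, 1.0, 1.0, 0.9), S=7, img_w=448, img_h=448))
```

Because every cell emits its boxes in the same forward pass, the whole image is processed at once; the resulting S × S × B candidate boxes are then filtered by confidence and NMS.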
YOLO v1 gives up some accuracy relative to two-stage detectors in exchange for a large speed gain (45 fps on a GPU at launch). Subsequent versions (v3–v10) closed the accuracy gap.
Detector Evolution
| Model | Year | Type | Key idea | Speed |
|---|---|---|---|---|
| R-CNN | 2014 | 2-stage | Region proposals + CNN per region | ~47s/img |
| Fast R-CNN | 2015 | 2-stage | Shared conv features, RoI pooling | ~0.3s/img |
| Faster R-CNN | 2015 | 2-stage | RPN replaces selective search | 5 fps |
| YOLO v1 | 2016 | 1-stage | Single pass, grid prediction | 45 fps |
| SSD | 2016 | 1-stage | Multi-scale feature maps + anchors | 59 fps |
| DETR | 2020 | Transformer | End-to-end, no anchors/NMS | Medium |
| YOLOv8 | 2023 | 1-stage | Anchor-free, C2f blocks | Real-time |
Quiz
Check your understanding
1. Two boxes have area 100 each and their intersection area is 40. What is their IoU?
2. What does NMS (Non-Maximum Suppression) solve?
3. YOLO v1 uses a 7×7 grid, 2 boxes per cell, 20 classes. What is the output tensor shape?