COMPUTER VISION · IN PROGRESS · INTERMEDIATE · 2024-12

Computer Vision Experiments

Computer vision experiments including object detection, image classification, and real-time video processing using YOLO and custom PyTorch models.

OpenCV · PyTorch · Python · YOLO

Overview

Computer vision is where machine learning becomes tangible: the model is looking at the same thing you are, and it either gets it right or obviously doesn't. These experiments range from fine-tuning YOLO for custom detection tasks to building CNNs from scratch in PyTorch to understand what the layers are actually learning.

Experiments

Custom Object Detection with YOLOv8

Task: Detect and classify Sri Lankan road signs in dashcam footage.

COCO-trained models don't cover Sri Lankan road signs, so I built a custom dataset of 800 annotated images across 12 sign classes, using Roboflow for annotation and augmentation.

Fine-tuned YOLOv8-nano for 100 epochs on a Colab A100 and reached 87% mAP@0.5 on the held-out test set. The model runs at 34 fps on CPU, fast enough for real-time dashcam processing.
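
For readers unfamiliar with the "@0.5" in the metric above: a predicted box only counts as a correct detection when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows just that threshold check, not the full mAP computation (which also matches each ground-truth box at most once and averages precision over recall levels and classes). The function names and the "stop" label are illustrative assumptions, not taken from the project code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_cls, gt_box, gt_cls, threshold=0.5):
    """Hypothetical helper: a prediction counts at mAP@0.5 when the class
    matches and the IoU with the ground-truth box reaches the threshold."""
    return pred_cls == gt_cls and iou(pred_box, gt_box) >= threshold


# Example: a prediction shifted 2 px to the right of a 10x10 ground-truth sign.
print(is_true_positive((2, 0, 12, 10), "stop", (0, 0, 10, 10), "stop"))
```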

Key learning: Data quality beats model complexity. Spent 60% of the time on annotation quality and augmentation strategy.

Image Classification from Scratch in PyTorch

Built a CNN from scratch to understand what convolutions actually learn. Starting from a simple 3-layer network on CIFAR-10, progressively added:

Batch normalisation (huge stability improvement)
Dropout (solved overfitting at epoch 15)
Residual connections (raised accuracy from 78% to 89%)

Visualised learned filters and activation maps using hooks. The early layers learn Gabor-like edge detectors regardless of the dataset — this is now obvious to me but wasn't before building it manually.
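
A minimal sketch of how the three additions and the hook-based visualisation fit together in PyTorch. This is not the network from the experiment: the class name `ResidualBlock`, the channel count, the dropout rate, and the `activations` dictionary are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Illustrative block: conv -> batch norm -> ReLU -> dropout -> conv -> batch norm, plus a skip."""

    def __init__(self, channels, p_drop=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.drop = nn.Dropout2d(p_drop)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.drop(out)
        out = self.bn2(self.conv2(out))
        # Residual connection: add the input back before the final activation.
        return self.relu(out + x)


activations = {}  # hypothetical store for captured feature maps


def save_activation(name):
    """Build a forward hook that stores the layer's output under `name`."""
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook


block = ResidualBlock(channels=16)
handle = block.conv1.register_forward_hook(save_activation("conv1"))

block.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    y = block(torch.randn(2, 16, 32, 32))  # CIFAR-10-sized feature maps

handle.remove()  # detach the hook once the activations are captured
print(tuple(y.shape), tuple(activations["conv1"].shape))
```

The learned filters themselves are the `weight` tensor of each convolution (here `block.conv1.weight`), while the hook captures what those filters produce for a given input.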

Real-Time Video Processing Pipeline

Built an OpenCV pipeline that:

Captures from webcam or RTSP stream
Runs YOLOv8 inference per frame
Applies ByteTrack for multi-object tracking (keeps track IDs stable across occlusions)
Overlays bounding boxes and track trajectories
Logs detection events to a SQLite database

The pipeline runs at 28fps on an RTX 3090 processing 1080p video.
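
The logging step of the pipeline can be sketched with Python's built-in `sqlite3` module. The table name, column layout, and the shape of the detection dictionaries below are assumptions for illustration; the real pipeline's schema isn't shown here.

```python
import sqlite3
import time


def open_event_log(path=":memory:"):
    """Open (or create) the database and make sure the events table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS detections (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               ts REAL NOT NULL,
               frame_idx INTEGER NOT NULL,
               track_id INTEGER NOT NULL,
               label TEXT NOT NULL,
               confidence REAL NOT NULL,
               x1 REAL, y1 REAL, x2 REAL, y2 REAL
           )"""
    )
    return conn


def log_detections(conn, frame_idx, detections, ts=None):
    """Insert one row per tracked detection; returns the number of rows written."""
    ts = time.time() if ts is None else ts
    rows = [
        (ts, frame_idx, d["track_id"], d["label"], d["confidence"], *d["box"])
        for d in detections
    ]
    # One transaction per frame: `with conn` commits on success, rolls back on error.
    with conn:
        conn.executemany(
            "INSERT INTO detections "
            "(ts, frame_idx, track_id, label, confidence, x1, y1, x2, y2) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            rows,
        )
    return len(rows)


conn = open_event_log()  # in-memory database for the example
written = log_detections(
    conn,
    frame_idx=0,
    detections=[{"track_id": 1, "label": "person", "confidence": 0.91, "box": (10, 20, 110, 220)}],
)
```

Batching all of a frame's detections into a single transaction keeps the per-frame write cost low, which matters when the loop is trying to hold a real-time frame rate.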

What I'm Learning

Computer vision forced me to understand:

Convolutions: Not magic — just learned sliding filters
Receptive field: Why deep networks can see globally but shallow ones can't
Augmentation: Geometric augmentation matters more than colour jitter for detection
Transfer learning vs fine-tuning: When to freeze backbone layers vs train end-to-end
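
Two of these points are easy to make concrete in plain Python. The sketch below slides a fixed filter over a tiny image (in a CNN the filter values would be learned instead), then computes how the receptive field grows as layers are stacked. The function names are illustrative, and the "convolution" is the unflipped cross-correlation that deep learning frameworks use.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# A horizontal-difference filter responds only where the image changes (the edge).
image = [[0, 0, 1],
         [0, 0, 1],
         [0, 0, 1]]
print(conv2d_valid(image, [[-1, 1]]))

# Stacking 3x3 layers widens the view; striding widens it much faster.
print(receptive_field([(3, 1)] * 3), receptive_field([(3, 2)] * 3))
```

Three stride-1 3x3 layers see only a 7x7 patch of the input, which is why a shallow network can't reason about the whole image without downsampling or more depth.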

What's Next

Semantic segmentation experiments with SAM (Segment Anything Model)
Depth estimation from monocular video
Deploying a vision model to edge hardware (Raspberry Pi + Hailo accelerator)