Computer Vision Experiments
Computer vision experiments including object detection, image classification, and real-time video processing using YOLO and custom PyTorch models.
Overview
Computer vision is where machine learning becomes tangible: the model is looking at the same thing you are, and it either gets it right or obviously doesn't. These experiments range from fine-tuning YOLO for custom detection tasks to building CNNs from scratch in PyTorch to understand what the layers are actually learning.
Experiments
Custom Object Detection with YOLOv8
Task: Detect and classify Sri Lankan road signs in dashcam footage.
Sri Lankan road signs aren't among the classes that standard COCO-trained models can detect, so I built a custom dataset of 800 annotated images across 12 sign classes, using Roboflow for annotation and augmentation.
Fine-tuned YOLOv8-nano for 100 epochs on a Colab A100 and achieved 87% mAP@0.5 on the held-out test set. The model runs at 34 FPS on CPU, fast enough for real-time dashcam processing.
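A fine-tuning run like this could look roughly as follows, assuming the Ultralytics Python package was used. The dataset path `road_signs/data.yaml`, the helper names `train_args` and `fine_tune`, and the 640-pixel image size are assumptions for illustration, not details from the actual run.

```python
from pathlib import Path

# Hypothetical path to a dataset YAML in Ultralytics format (e.g. a Roboflow export).
DATA_YAML = Path("road_signs/data.yaml")


def train_args(data_yaml=DATA_YAML, epochs=100, imgsz=640):
    """Collect training settings; imgsz=640 is an assumed value."""
    return {"data": str(data_yaml), "epochs": epochs, "imgsz": imgsz}


def fine_tune(data_yaml=DATA_YAML):
    """Fine-tune the nano checkpoint and return mAP@0.5 on the test split."""
    # Imported lazily so the settings can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")            # COCO-pretrained YOLOv8-nano weights
    model.train(**train_args(data_yaml))  # fine-tune on the custom sign classes
    # Requires a `test:` entry in the dataset YAML.
    metrics = model.val(split="test")
    return metrics.box.map50              # mAP at IoU threshold 0.5
```

Calling `fine_tune()` would download the pretrained weights on first use, train, and return the test-set mAP@0.5 as a float between 0 and 1.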
Key learning: Data quality beats model complexity. Spent 60% of the time on annotation quality and augmentation strategy.
Image Classification from Scratch in PyTorch
Built a CNN from scratch to understand what convolutions actually learn. Starting from a simple 3-layer network on CIFAR-10, progressively added:
- Batch normalisation (huge stability improvement)
- Dropout (fixed the overfitting that appeared around epoch 15)
- Residual connections (raised accuracy from 78% to 89%)
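The three additions in the list can sit together in one building block. This is a minimal sketch, not the network from the experiment: the channel count, dropout rate, and layer ordering are assumptions.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm, dropout, and an identity skip."""

    def __init__(self, channels, p_drop=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.drop = nn.Dropout2d(p_drop)  # drops whole feature maps during training
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.drop(out)
        out = self.bn2(self.conv2(out))
        # Skip connection: add the input back before the final activation.
        return self.relu(out + x)


block = ResidualBlock(channels=16)
x = torch.randn(4, 16, 32, 32)  # batch of CIFAR-10-sized feature maps
y = block(x)
print(y.shape)
```

Because padding keeps the spatial size and the channel count is unchanged, the input can be added to the output directly. A block that changes resolution or width would need a projection on the skip path.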
Visualised the learned filters and, using forward hooks, the activation maps. The early layers learn Gabor-like edge detectors largely regardless of the dataset, which seems obvious to me now but wasn't before building it manually.
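A sketch of how hooks capture activations, using a small stand-in model whose layer sizes are assumptions. The weights here are untrained, so the filters are random; on a trained network the same tensors are what would be plotted.

```python
import torch
from torch import nn

# Stand-in model for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

activations = {}


def save_activation(name):
    """Return a forward hook that stores the layer's output under `name`."""
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook


# Attach the hook to the first conv layer, run one forward pass, then detach it.
handle = model[0].register_forward_hook(save_activation("conv1"))
with torch.no_grad():
    model(torch.randn(1, 3, 32, 32))
handle.remove()

filters = model[0].weight.detach()   # first-layer filters, shape (8, 3, 3, 3)
feature_maps = activations["conv1"]  # activation maps, shape (1, 8, 32, 32)
```

The filters come straight from the layer's weights, while the hook is only needed for the intermediate activations, which a plain forward pass does not return.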
Real-Time Video Processing Pipeline
Built an OpenCV pipeline that:
- Captures from webcam or RTSP stream
- Runs YOLOv8 inference per frame
- Applies ByteTrack for multi-object tracking (keeps track IDs stable across occlusions)
- Overlays bounding boxes and track trajectories
- Logs detection events to a SQLite database
The pipeline runs at 28 FPS on an RTX 3090 when processing 1080p video.
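The steps above can be sketched as a single loop, assuming OpenCV and the Ultralytics package with its built-in ByteTrack configuration. The table schema, the function names, and the default weights file are assumptions, and trajectory drawing is left out for brevity.

```python
import sqlite3
import time


def open_log(path=":memory:"):
    """Create the detections table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detections ("
        "ts REAL, track_id INTEGER, class_id INTEGER, conf REAL, "
        "x1 REAL, y1 REAL, x2 REAL, y2 REAL)"
    )
    return conn


def log_detection(conn, track_id, class_id, conf, box):
    """Insert one detection; box is (x1, y1, x2, y2) in pixels."""
    conn.execute(
        "INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (time.time(), track_id, class_id, conf, *box),
    )


def run(source=0, weights="yolov8n.pt", db_path="detections.db"):
    """source is a webcam index or an RTSP URL. Press q to quit."""
    # Imported here so the logging helpers work without these packages installed.
    import cv2
    from ultralytics import YOLO

    model = YOLO(weights)
    conn = open_log(db_path)
    cap = cv2.VideoCapture(source)
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # persist=True keeps tracker state between frames.
            result = model.track(
                frame, persist=True, tracker="bytetrack.yaml", verbose=False
            )[0]
            boxes = result.boxes
            if boxes.id is not None:  # None when no tracks are active
                for tid, cls, conf, xyxy in zip(
                    boxes.id, boxes.cls, boxes.conf, boxes.xyxy
                ):
                    log_detection(conn, int(tid), int(cls), float(conf), xyxy.tolist())
                conn.commit()
            cv2.imshow("tracks", result.plot())  # draws boxes and track IDs
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
        conn.close()
```

Writing one row per detection per frame is the simplest logging scheme. Logging only when a track first appears or disappears would keep the database smaller on long streams.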
What I'm Learning
Computer vision forced me to understand:
- Convolutions: Not magic — just learned sliding filters
- Receptive field: Why units in deep networks can see most of the image while units in shallow ones see only a small patch
- Augmentation: Geometric augmentation matters more than colour jitter for detection
- Transfer learning vs fine-tuning: When to freeze backbone layers vs train end-to-end
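The receptive field point can be made concrete with a little arithmetic. This sketch uses the standard recurrence for stacked convolutions and ignores dilation and border effects; the two example architectures are made up for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output.

    Returns the side length, in input pixels, seen by one output unit.
    """
    rf, jump = 1, 1  # jump: input-pixel distance between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


shallow = [(3, 1)] * 3        # three 3x3 convs, stride 1
deep = [(3, 1), (3, 2)] * 4   # eight 3x3 convs, stride 2 on every second layer

print(receptive_field(shallow))  # 7
print(receptive_field(deep))     # 61
```

On a 32x32 CIFAR-10 image, a unit at the top of the shallow stack sees a 7x7 patch, while a unit at the top of the deeper strided stack has a receptive field wider than the whole image.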
What's Next
- Semantic segmentation experiments with SAM (Segment Anything Model)
- Depth estimation from monocular video
- Deploying a vision model to edge hardware (Raspberry Pi + Hailo accelerator)