LLM Fine-tuning Experiments
A collection of fine-tuning experiments on open source LLMs. Exploring LoRA, QLoRA, and instruction tuning on domain-specific datasets.
Overview
Parameter-efficient fine-tuning has made it feasible to adapt large language models on consumer hardware. This is a living collection of experiments, each targeting a specific hypothesis about what fine-tuning does to a model's behaviour, not just its outputs.
The experiments run on a mix of a local GPU (RTX 3090) and Google Colab A100 instances.
Experiments
Experiment 1: Instruction Tuning on Domain Corpus
Hypothesis: A general 7B model instruction-tuned on Sri Lankan legal documents will outperform GPT-3.5 on Sri Lankan legal QA with zero prompt engineering.
Method: Collected ~2,000 passages from publicly available court judgements and gazette notifications. Generated instruction-response pairs using GPT-4 as a teacher. Fine-tuned Mistral-7B using LoRA (rank=16, alpha=32) with PEFT.
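For a sense of scale, the adapter size implied by rank=16, alpha=32 can be worked out by hand. The sketch below is illustrative only: the `LoraShape` helper is hypothetical, and the 4096 x 4096 projection shape is an assumption, not a detail recorded for this experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraShape:
    """One adapted weight matrix W (d_out x d_in) plus its LoRA hyperparameters."""
    d_in: int
    d_out: int
    rank: int
    alpha: int

    @property
    def scaling(self) -> float:
        # LoRA adds (alpha / rank) * B @ A to the frozen weight W.
        return self.alpha / self.rank

    @property
    def adapter_params(self) -> int:
        # A is (rank x d_in) and B is (d_out x rank); only these are trained.
        return self.rank * (self.d_in + self.d_out)

    @property
    def frozen_params(self) -> int:
        return self.d_in * self.d_out


# Assumed shape: a square 4096 x 4096 attention projection.
proj = LoraShape(d_in=4096, d_out=4096, rank=16, alpha=32)
print(proj.scaling)                              # 2.0
print(proj.adapter_params)                       # 131072
print(proj.adapter_params / proj.frozen_params)  # 0.0078125
```

Under these assumptions each adapted matrix trains under 1% of the parameters it wraps, which is what makes a 7B model tractable on a single consumer GPU.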
Result: On a held-out test set of 50 legal questions, the fine-tuned model scored 73% vs GPT-3.5's 61%. More importantly, it cited the correct legal doctrine 80% of the time vs 52% for GPT-3.5.
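The two figures in this result are separate rates over the same test set: how often the answer was judged correct, and how often the right doctrine was cited. A minimal sketch of that bookkeeping, assuming each graded question is reduced to two booleans. The `GradedAnswer` record and its field names are hypothetical, and the grading step itself is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GradedAnswer:
    answer_correct: bool    # hypothetical field: was the answer judged correct?
    doctrine_correct: bool  # hypothetical field: was the right doctrine cited?


def score(graded: Iterable[GradedAnswer]) -> Dict[str, float]:
    """Return both rates as fractions of the total number of questions."""
    items = list(graded)
    if not items:
        raise ValueError("no graded answers to score")
    n = len(items)
    return {
        "accuracy": sum(g.answer_correct for g in items) / n,
        "doctrine_rate": sum(g.doctrine_correct for g in items) / n,
    }


# Toy data for illustration, not results from the experiment.
toy = [
    GradedAnswer(True, True),
    GradedAnswer(True, False),
    GradedAnswer(False, True),
    GradedAnswer(False, False),
]
print(score(toy))  # {'accuracy': 0.5, 'doctrine_rate': 0.5}
```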
Experiment 2: QLoRA on 4-bit Quantised Base Models
Hypothesis: QLoRA (4-bit base + LoRA adapters) can match the quality of LoRA on an unquantised base ("full LoRA" below) at 1/4 the VRAM cost.
Method: Used bitsandbytes for 4-bit NF4 quantisation. Compared Llama-2-13B with full LoRA vs QLoRA on the same instruction dataset.
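The 1/4 figure in the hypothesis follows from weight storage alone: 4 bits per parameter against 16. A back-of-envelope sketch, assuming a round 13 billion parameters and a 16-bit baseline, and ignoring quantisation constants, LoRA adapters, optimizer state, and activations, so these are lower bounds rather than measured usage:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 13e9  # assumed round figure for a 13B model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantised base weights

print(fp16_gb)           # 26.0
print(nf4_gb)            # 6.5
print(nf4_gb / fp16_gb)  # 0.25
```

By this estimate the 16-bit weights alone exceed 24 GB, while the 4-bit weights leave headroom for adapters and activations on the same card.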
Result: QLoRA came within 2% of full LoRA's perplexity and fit on a single 24GB GPU. Training took 40% longer due to dequantisation overhead, which was worth the trade-off.
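Perplexity is the exponential of the mean per-token negative log-likelihood, so the relative gap between two runs can be computed directly from their evaluation losses. A small sketch, assuming the losses are mean NLL in nats. The loss values are placeholders, not numbers from these runs.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def relative_gap(candidate: float, baseline: float) -> float:
    """Fractional difference of candidate relative to baseline."""
    return (candidate - baseline) / baseline


# Placeholder evaluation losses, chosen only to show the arithmetic.
lora_ppl = perplexity(1.50)
qlora_ppl = perplexity(1.51)

gap = relative_gap(qlora_ppl, lora_ppl)
print(f"{gap:.2%}")  # roughly 1%
```

Because of the exponential, a small absolute difference in loss maps to a similar-sized relative difference in perplexity.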
Experiment 3: Catastrophic Forgetting Under Domain Tuning
Hypothesis: Narrow domain fine-tuning causes significant degradation on general benchmarks.
Method: Measured MMLU score before and after fine-tuning on the legal corpus. Varied LoRA rank (4, 8, 16, 32).
Result: Rank 4 preserved 96% of baseline MMLU performance, while rank 32 preserved only 81%. There is a clear trade-off between specialisation and generality: lower rank keeps the model more general.
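The retention figures are read here as the post-tuning MMLU score divided by the pre-tuning score. A sketch of that calculation across a rank sweep, under that reading. The scores below are made-up placeholders chosen only to show the arithmetic, not measurements from this experiment.

```python
def retention(score_after: float, score_before: float) -> float:
    """Fraction of the baseline benchmark score kept after fine-tuning."""
    if score_before <= 0:
        raise ValueError("baseline score must be positive")
    return score_after / score_before


baseline = 0.50  # placeholder pre-tuning score
after_by_rank = {4: 0.49, 8: 0.47, 16: 0.44, 32: 0.40}  # placeholder post-tuning scores

retained = {rank: retention(s, baseline) for rank, s in after_by_rank.items()}
for rank in sorted(retained):
    print(rank, f"{retained[rank]:.0%}")
```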
Infrastructure
- Training: PEFT + Transformers + Accelerate for multi-GPU when available
- Tracking: Weights & Biases for loss curves, grad norms, eval metrics per checkpoint
- Datasets: Stored as HuggingFace Dataset objects for reproducibility
- Evaluation: Custom eval harness on held-out splits + standard benchmarks (MMLU, HellaSwag)
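One way to keep held-out splits stable across runs is to assign each example by hashing a stable ID rather than by shuffling. This is a sketch of that idea, not necessarily how the splits here were made. The `assign_split` helper, the `passage-N` IDs, and the 10% test fraction are all assumptions.

```python
import hashlib


def assign_split(example_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map an example ID to 'train' or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


# Hypothetical IDs standing in for the passages in a corpus.
ids = [f"passage-{i}" for i in range(2000)]
splits = {example_id: assign_split(example_id) for example_id in ids}

n_test = sum(1 for s in splits.values() if s == "test")
print(n_test, len(ids) - n_test)
```

Because the assignment depends only on the ID, adding new examples later never moves an existing example between train and test.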
What I'm Learning
Fine-tuning is less about the final model and more about understanding why the model changes. The experiments have taught me that:
- Learning rate matters more than rank for stability
- The quality of your instruction pairs dominates everything else
- Gradient checkpointing is non-negotiable on anything larger than 7B