FINE-TUNINGIN PROGRESSADVANCED · 2025-02

LLM Fine-tuning Experiments

A collection of fine-tuning experiments on open source LLMs. Exploring LoRA, QLoRA, and instruction tuning on domain-specific datasets.

PyTorchHuggingFacePythonLoRA

Overview

Parameter-efficient fine-tuning has made it feasible to adapt large language models on consumer hardware. This is a living collection of experiments — each one targeting a specific hypothesis about what fine-tuning actually does to a model's behaviour, not just its outputs.

The experiments run on a mix of local GPU (RTX 3090) and Google Colab A100 instances.

Experiments

Experiment 1: Instruction Tuning on Domain Corpus

Hypothesis: A general 7B model instruction-tuned on Sri Lankan legal documents will outperform GPT-3.5 on Sri Lankan legal QA with zero prompt engineering.

Method: Collected ~2,000 passages from publicly available court judgements and gazette notifications. Generated instruction-response pairs using GPT-4 as a teacher. Fine-tuned Mistral-7B using LoRA (rank=16, alpha=32) with PEFT.

Result: On a held-out test set of 50 legal questions, the fine-tuned model scored 73% vs GPT-3.5's 61%. More importantly, it cited the correct legal doctrine 80% of the time vs 52% for GPT-3.5.

Experiment 2: QLoRA on 4-bit Quantised Base Models

Hypothesis: QLoRA (4-bit base + LoRA adapters) can match full LoRA quality at 1/4 the VRAM cost.

Method: Used bitsandbytes for 4-bit NF4 quantisation. Compared Llama-2-13B with full LoRA vs QLoRA on the same instruction dataset.

Result: Within 2% perplexity of full LoRA. Fit on a single 24GB GPU. Training time was 40% longer due to dequantisation overhead — worth the trade-off.

Experiment 3: Catastrophic Forgetting Under Domain Tuning

Hypothesis: Narrow domain fine-tuning causes significant degradation on general benchmarks.

Method: Measured MMLU score before and after fine-tuning on the legal corpus. Varied LoRA rank (4, 8, 16, 32).

Result: Rank 4 preserved 96% of MMLU performance. Rank 32 dropped to 81%. There's a clear trade-off between specialisation and generality — lower rank keeps the model more general.

Infrastructure

Training: PEFT + Transformers + Accelerate for multi-GPU when available
Tracking: Weights & Biases for loss curves, grad norms, eval metrics per checkpoint
Datasets: Stored as HuggingFace Dataset objects for reproducibility
Evaluation: Custom eval harness on held-out splits + standard benchmarks (MMLU, HellaSwag)

What I'm Learning

Fine-tuning is less about the final model and more about understanding why the model changes. The experiments have taught me that:

Learning rate matters more than rank for stability
The quality of your instruction pairs dominates everything else
Gradient checkpointing is non-negotiable on anything larger than 7B