itsjorigoBLOG

Building a RAG Pipeline for a Low-Resource Language

How I built an ontology-guided retrieval-augmented generation system to produce curriculum-aligned MCQs in Sinhala — a language with almost no NLP tooling.

When I started my final-year project, I quickly discovered that building AI systems for a low-resource language like Sinhala is a fundamentally different problem from building them for English. There are no off-the-shelf embeddings. There is no rich pre-trained tokenizer. And the curriculum content I needed to work with (secondary-school History textbooks) existed only as scanned PDFs.

Why RAG Over Fine-Tuning

Fine-tuning a language model on Sinhala educational content was never a realistic option. The compute budget wasn't there, and more importantly, the labelled data wasn't there. RAG let me sidestep both problems: I could use an existing multilingual LLM (in this case, a quantised LLaMA variant) and ground its outputs in retrieved chunks from the actual textbook — so the questions it generated were always tied to the source material.
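To make the grounding step concrete, here is a minimal sketch of how retrieved textbook chunks might be assembled into a prompt. The `Chunk` structure, the `build_grounded_prompt` function, and the prompt wording are all illustrative assumptions, not the prompts used in the project.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved textbook passage (hypothetical structure)."""
    source: str  # label such as a chapter or page reference
    text: str


def build_grounded_prompt(topic: str, chunks: list[Chunk]) -> str:
    """Assemble a prompt that restricts the model to the retrieved passages."""
    if not chunks:
        # Without retrieved text there is nothing to ground the question in.
        raise ValueError("no retrieved chunks; refusing to build an ungrounded prompt")
    context = "\n\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Use only the numbered passages below. "
        "If they do not contain enough information, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Write one multiple-choice question in Sinhala about: {topic}\n"
        "State which passage number the correct answer comes from."
    )
```

Numbering the passages and asking the model to name the one it used makes it easier to check each generated question against the textbook afterwards.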

The Ontology Layer

The key insight that made the system work was adding a knowledge graph on top of the retrieval layer. I modelled the entire curriculum in Neo4j — chapters, topics, sub-topics, key concepts, relationships between events — and used this ontology to guide which chunks got retrieved and how the prompt was structured. Instead of naive semantic similarity, the retriever first consulted the graph to understand what concepts a given topic depended on, then pulled chunks covering those concepts specifically.

This made a measurable difference in question quality. Without the ontology layer, the LLM would frequently generate questions about tangential details. With it, the questions consistently targeted the curriculum-critical concepts.
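The graph-first retrieval idea can be sketched roughly as follows. To keep the example self-contained, a plain dictionary stands in for the Neo4j graph, and chunks are assumed to be tagged with the concepts they cover. The names `prerequisite_concepts`, `retrieve_for_topic`, and the `max_depth` parameter are hypothetical, and the real system's schema and queries may differ.

```python
from collections import deque


def prerequisite_concepts(graph, topic, max_depth=2):
    """Breadth-first walk over dependency edges; returns concepts in visit order."""
    seen = {topic}
    order = []
    queue = deque([(topic, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append((dep, depth + 1))
    return order


def retrieve_for_topic(graph, chunks, topic, max_depth=2):
    """Return chunks covering the topic or any concept it depends on."""
    wanted = {topic, *prerequisite_concepts(graph, topic, max_depth)}
    return [c for c in chunks if c["concepts"] & wanted]


# Toy data (assumed): topic -> concepts it depends on.
graph = {
    "topic_a": ["concept_1", "concept_2"],
    "concept_2": ["concept_3"],
    "concept_3": ["concept_4"],
}
chunks = [
    {"id": "c1", "concepts": {"concept_1"}},
    {"id": "c2", "concepts": {"concept_3"}},
    {"id": "c3", "concepts": {"concept_4"}},
    {"id": "c4", "concepts": {"unrelated"}},
]
selected = retrieve_for_topic(graph, chunks, "topic_a")
```

With the depth limit of 2, `concept_4` is three hops away and is left out, so only `c1` and `c2` are selected. A similarity ranking could still be applied within the filtered set. The point of the sketch is only that the graph decides which concepts are in scope before any chunk is scored.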

Difficulty Control

Controlling difficulty was a prompt-engineering problem more than a model problem. I defined three levels (recall, comprehension, and application), with explicit Bloom's Taxonomy descriptors baked into the system prompt for each level. Recall questions ask for facts directly stated in the text. Comprehension questions ask the student to explain a relationship or cause. Application questions require connecting two or more concepts from different parts of the curriculum.
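One way to wire the three levels into a system prompt is sketched below. The descriptor wording paraphrases the level definitions above. The actual prompts are not shown in this post, so the `Difficulty` enum, the `DESCRIPTORS` table, and `system_prompt` are illustrative assumptions.

```python
from enum import Enum


class Difficulty(Enum):
    RECALL = "recall"
    COMPREHENSION = "comprehension"
    APPLICATION = "application"


# Assumed wording, paraphrasing the three level definitions.
DESCRIPTORS = {
    Difficulty.RECALL: "Ask for a fact that is stated directly in the passages.",
    Difficulty.COMPREHENSION: (
        "Ask the student to explain a relationship or cause described in the passages."
    ),
    Difficulty.APPLICATION: (
        "Require the student to connect two or more concepts "
        "from different parts of the curriculum."
    ),
}


def system_prompt(level: Difficulty) -> str:
    """Build a system prompt carrying the descriptor for one difficulty level."""
    return (
        "You write multiple-choice questions in Sinhala for secondary-school History.\n"
        f"Difficulty level: {level.value}.\n"
        f"Instruction: {DESCRIPTORS[level]}"
    )
```

Keeping the descriptors in a single table means the level definitions can be revised without touching the code that builds the prompt.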

Distractor Generation

Writing plausible wrong answers is arguably harder than writing the questions themselves. I used a two-step approach: first, extract candidate distractors from neighbouring nodes in the knowledge graph (related but incorrect facts); then rank them by semantic similarity to the correct answer using a multilingual sentence-transformer. The goal was distractors that would fool a student who half-knew the material: neither obvious nonsense nor the correct answer in disguise.
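The ranking step might look something like the sketch below. The `embed` argument stands in for whatever multilingual sentence-transformer produces the vectors. The toy lookup table, the `k` parameter, and the `max_sim` cutoff for rejecting near-duplicates of the answer are assumptions added for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_distractors(answer, candidates, embed, k=3, max_sim=0.95):
    """Pick the k candidates most similar to the answer, skipping near-duplicates."""
    answer_vec = embed(answer)
    scored = []
    for cand in candidates:
        if cand == answer:
            continue
        sim = cosine(answer_vec, embed(cand))
        if sim < max_sim:  # anything closer reads as the correct answer in disguise
            scored.append((sim, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:k]]


# Toy embeddings (assumed) in place of a real sentence-transformer.
toy_vectors = {
    "answer": [1.0, 0.0],
    "near_duplicate": [1.0, 0.01],
    "close": [1.0, 0.5],
    "medium": [1.0, 1.0],
    "far": [0.0, 1.0],
}
picked = rank_distractors(
    "answer",
    ["far", "medium", "near_duplicate", "close"],
    embed=toy_vectors.__getitem__,
    k=2,
)
```

Here `near_duplicate` is rejected by the cutoff and `far` loses out on rank, leaving `close` and `medium`. In practice the candidate list would come from neighbouring graph nodes, and the cutoff would need tuning against real questions.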

What I Would Do Differently

The biggest bottleneck was building the ontology itself. I did it manually for the chapters in scope, which was time-consuming. In a production system, I would invest in a semi-automated pipeline to extract the graph structure from the source documents — possibly using a smaller extraction-focused LLM to bootstrap the node and relationship definitions, with a human review step.