RAG · IN PROGRESS · ADVANCED · 2025-01

Sinhala MCQ Generator

A RAG pipeline that generates Sinhala multiple-choice questions from Sri Lankan History textbooks. It uses Neo4j as a knowledge graph, LangChain for orchestration, and FastAPI for the API layer.

LangChain · Neo4j · FastAPI · Python · HuggingFace

Overview

Most Sinhala educational content lives in physical textbooks with no machine-readable version. This project tackles that by building a full RAG pipeline that ingests Sri Lankan History textbooks (Grades 6–11), structures the knowledge into a graph database, and generates curriculum-aligned multiple-choice questions on demand.

The goal is not just question generation but traceable question generation. Every MCQ produced can be traced back to the exact passage in the source material, which makes it auditable for teachers and educators.

Architecture

The pipeline runs in three stages:

1. Ingestion

Raw textbook PDFs are parsed and chunked using a custom Sinhala-aware tokenizer (standard tokenizers butcher Sinhala Unicode). Each chunk is embedded using a multilingual model from HuggingFace and stored as a node in Neo4j alongside its metadata: chapter, grade, topic.
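As a rough illustration of the chunking step, the sketch below normalises the text, splits it on sentence boundaries, and packs whole sentences into chunks that carry the chapter, grade, and topic metadata. It is not the project's tokenizer: `Chunk`, `split_sentences`, `chunk_text`, and the `max_chars` limit are hypothetical names, and the embedding and Neo4j write steps are left out.

```python
import re
import unicodedata
from dataclasses import dataclass

# Sentence ends: '.', '?', '!' or the Sinhala kunddaliya (U+0DF4), then whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!\u0df4])\s+")


@dataclass
class Chunk:
    text: str
    grade: int
    chapter: str
    topic: str


def split_sentences(text: str) -> list[str]:
    # NFC gives one consistent code point sequence for composed vowel signs.
    # It leaves the zero-width joiner (U+200D) used in conjuncts untouched.
    normalised = unicodedata.normalize("NFC", text.strip())
    return [s for s in SENTENCE_END.split(normalised) if s]


def chunk_text(text: str, max_chars: int, grade: int, chapter: str, topic: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    current = ""
    for sentence in split_sentences(text):
        # Start a new chunk rather than cutting a sentence (or a conjunct) in half.
        # A single sentence longer than max_chars is still kept whole.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(Chunk(current, grade, chapter, topic))
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(Chunk(current, grade, chapter, topic))
    return chunks
```

Splitting only at sentence boundaries is the simplest way to guarantee that no chunk boundary lands inside a conjunct or between a consonant and its vowel sign. A real tokenizer would also need rules for compound words.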

2. Knowledge Graph

Neo4j stores explicit relationships between concepts, rather than relying on flat vector similarity alone. A question about the "Kandyan Kingdom" can pull context from related nodes like "Colonial period," "Portuguese arrival," and "Sinhala kings" to generate distractors that are plausible but wrong.
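One way to picture the distractor lookup is a one-hop traversal from the concept being asked about. The sketch below assumes a schema that the write-up does not confirm: the `Concept` label, the `RELATED_TO` relationship type, the `name` property, and the `related_concepts` helper are all hypothetical. The helper accepts any transaction-like object with a `run` method, which is how the official Neo4j Python driver exposes queries.

```python
# Hypothetical schema: (:Concept {name})-[:RELATED_TO]-(:Concept {name}).
RELATED_QUERY = """
MATCH (c:Concept {name: $name})-[:RELATED_TO]-(n:Concept)
WHERE n.name <> $name
RETURN DISTINCT n.name AS name
LIMIT $limit
"""


def related_concepts(tx, name: str, limit: int = 10) -> list[str]:
    # Parameters are passed separately so the concept name is never
    # spliced into the Cypher string.
    result = tx.run(RELATED_QUERY, {"name": name, "limit": limit})
    return [record["name"] for record in result]
```

With version 5 of the driver, a function shaped like this would typically be passed to `session.execute_read(related_concepts, "Kandyan Kingdom")`. The returned names are only candidates; something still has to confirm that each one is actually a wrong answer to the question.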

3. Generation

LangChain orchestrates the retrieval-augmented generation step. Given a topic, the pipeline:

Queries Neo4j for the most relevant passages via graph traversal + vector search
Feeds retrieved context to the LLM with a structured prompt
Returns a 4-option MCQ with a correct answer key and source reference
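The three steps above can be sketched in plain Python. This is a simplified stand-in rather than the project's LangChain code: `Passage`, `MCQ`, `build_prompt`, `parse_mcq`, and `generate_mcq` are hypothetical names, the JSON reply format is an assumption, and the retriever and LLM are passed in as ordinary callables where LangChain components would normally sit.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Passage:
    node_id: str  # id of the source node the text came from
    text: str


@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_index: int
    source_id: str


def build_prompt(topic: str, passages: list[Passage]) -> str:
    context = "\n".join(f"[{p.node_id}] {p.text}" for p in passages)
    return (
        "Using only the passages below, write one Sinhala multiple-choice "
        f"question about: {topic}\n\n{context}\n\n"
        "Reply with a JSON object with the keys question, options (exactly "
        "4 strings), answer_index (0-3) and source_id (a passage id)."
    )


def parse_mcq(raw: str, passages: list[Passage]) -> MCQ:
    data = json.loads(raw)
    options = data["options"]
    if len(options) != 4 or len(set(options)) != 4:
        raise ValueError("expected four distinct options")
    if data["answer_index"] not in range(4):
        raise ValueError("answer_index must be 0-3")
    # The source reference must point at a passage that was actually retrieved.
    if data["source_id"] not in {p.node_id for p in passages}:
        raise ValueError("source_id does not match a retrieved passage")
    return MCQ(data["question"], options, data["answer_index"], data["source_id"])


def generate_mcq(
    topic: str,
    retrieve: Callable[[str], list[Passage]],
    llm: Callable[[str], str],
) -> MCQ:
    passages = retrieve(topic)                 # step 1: graph + vector retrieval
    raw = llm(build_prompt(topic, passages))   # step 2: structured prompt
    return parse_mcq(raw, passages)            # step 3: validated 4-option MCQ
```

Validating the reply before returning it means a malformed or unreferenced question fails loudly instead of reaching a teacher.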

FastAPI exposes this as a REST endpoint for integration with the frontend.

Key Challenges

Sinhala tokenization: Standard NLP tools don't handle Sinhala Unicode well. I had to write a custom pre-processing step to handle compound words and diacritics.
Distractor quality: Generating good wrong answers is harder than generating correct ones. The graph structure helps — related concepts become naturally plausible distractors.
Hallucination grounding: Every LLM output is cross-checked against the retrieved source nodes. If the answer isn't grounded in the retrieved context, the generation is rejected and retried.
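The reject-and-retry loop for grounding can be sketched as follows. The write-up does not say how grounding is measured, so the token-overlap check here is a stand-in assumption, and `is_grounded`, `generate_grounded`, the `threshold` value, and `max_attempts` are hypothetical.

```python
from typing import Callable, Optional


def is_grounded(answer: str, passages: list[str], threshold: float = 0.8) -> bool:
    # Stand-in check: the share of answer tokens that appear in the retrieved text.
    # Whitespace splitting is naive and ignores punctuation and inflection.
    answer_tokens = answer.split()
    if not answer_tokens:
        return False
    context_tokens = set(" ".join(passages).split())
    hits = sum(token in context_tokens for token in answer_tokens)
    return hits / len(answer_tokens) >= threshold


def generate_grounded(
    generate: Callable[[], str],
    passages: list[str],
    max_attempts: int = 3,
) -> Optional[str]:
    for _ in range(max_attempts):
        answer = generate()
        if is_grounded(answer, passages):
            return answer
    # Every attempt was rejected; the caller decides what to do next.
    return None
```

Capping the number of attempts keeps a topic with thin source coverage from looping forever. Returning `None` lets the caller report that no grounded question could be produced.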

What's Next

Fine-tune a Sinhala-English bilingual model on educational QA pairs
Add support for Grade 12–13 Advanced Level content
Build a teacher dashboard for reviewing and publishing generated questions