Sinhala MCQ Generator
A retrieval-augmented generation (RAG) pipeline that generates Sinhala multiple-choice questions (MCQs) from Sri Lankan History textbooks. It uses Neo4j as a knowledge graph, LangChain for orchestration, and FastAPI for the API layer.
Overview
Most Sinhala educational content lives in physical textbooks with no machine-readable version. This project tackles that by building a full RAG pipeline that ingests Sri Lankan History textbooks (Grades 6–11), structures the knowledge into a graph database, and generates curriculum-aligned multiple-choice questions on demand.
The goal is not just question generation but traceable question generation. Every MCQ can be traced back to the exact passage in the source material, so teachers and educators can audit it.
Architecture
The pipeline runs in three stages:
1. Ingestion
Raw textbook PDFs are parsed and chunked using a custom Sinhala-aware tokenizer (standard tokenizers butcher Sinhala Unicode). Each chunk is embedded using a multilingual model from HuggingFace and stored as a node in Neo4j alongside its metadata: chapter, grade, topic.
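As a rough illustration of what "Sinhala-aware" chunking can mean, the sketch below keeps vowel signs and zero-width-joiner conjuncts attached to their base letter, then packs whole sentences into size-limited chunks. It is not the project's tokenizer: the function names `sinhala_clusters` and `chunk_text`, the `max_clusters` parameter, and the sentence-splitting rule are all assumptions made for this example.

```python
import re
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used inside Sinhala conjuncts


def sinhala_clusters(text: str) -> list[str]:
    """Group code points so combining marks and ZWJ conjuncts stay with their base.

    Hypothetical helper: a simplified stand-in for full grapheme segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (is_mark or ch == ZWJ or joins_previous):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def chunk_text(text: str, max_clusters: int = 200) -> list[str]:
    """Pack whole sentences into chunks of at most max_clusters clusters.

    A single sentence longer than the limit is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in sentences:
        n = len(sinhala_clusters(sentence))
        if current and size + n > max_clusters:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Measuring chunk size in clusters rather than raw code points means a length limit reflects what a reader sees as characters, and any later character-level split has safe boundaries to cut on.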
2. Knowledge Graph
Neo4j stores explicit relationships between concepts instead of relying on flat vector similarity alone. A question about the "Kandyan Kingdom" can pull context from related nodes like "Colonial period," "Portuguese arrival," and "Sinhala kings" to generate distractors that are plausible but wrong.
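The idea of pulling distractor candidates from nearby concepts can be sketched with a small in-memory graph. This is only an illustration: the edges in `GRAPH` are made up for the example, `related_concepts` and `max_hops` are hypothetical names, and in the actual system this traversal would run against Neo4j.

```python
from collections import deque

# Hypothetical concept graph. The edges are illustrative, not taken from the real data.
GRAPH = {
    "Kandyan Kingdom": ["Colonial period", "Sinhala kings"],
    "Colonial period": ["Kandyan Kingdom", "Portuguese arrival"],
    "Portuguese arrival": ["Colonial period"],
    "Sinhala kings": ["Kandyan Kingdom"],
}


def related_concepts(graph: dict[str, list[str]], start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first search returning concepts within max_hops of start, nearest first."""
    seen = {start}
    queue = deque([(start, 0)])
    related: list[str] = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                related.append(neighbour)
                queue.append((neighbour, hops + 1))
    return related


if __name__ == "__main__":
    print(related_concepts(GRAPH, "Kandyan Kingdom"))
```

Limiting the hop count keeps candidates topically close to the question, which is what makes them plausible as wrong answers.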
3. Generation
LangChain orchestrates the retrieval-augmented generation step. Given a topic, the pipeline:
- Queries Neo4j for the most relevant passages using graph traversal combined with vector search
- Feeds retrieved context to the LLM with a structured prompt
- Returns a 4-option MCQ with a correct answer key and source reference
FastAPI exposes this as a REST endpoint for integration with the frontend.
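The generation steps above can be sketched as two small data structures and a prompt builder. Everything here is an assumption for illustration: the class names `Passage` and `MCQ`, their fields, and the prompt wording are not taken from the project's code, and the LLM call itself is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    """A retrieved chunk plus the metadata stored with it (hypothetical shape)."""
    node_id: str
    text: str
    grade: int
    chapter: str


@dataclass(frozen=True)
class MCQ:
    """A four-option question with an answer key and a source reference."""
    question: str
    options: tuple[str, str, str, str]
    answer_index: int
    source_node_id: str

    def __post_init__(self) -> None:
        if len(self.options) != 4:
            raise ValueError("an MCQ needs exactly four options")
        if len(set(self.options)) != 4:
            raise ValueError("options must be distinct")
        if not 0 <= self.answer_index < 4:
            raise ValueError("answer_index must be between 0 and 3")


def build_prompt(topic: str, passages: list[Passage]) -> str:
    """Assemble a structured prompt that tags each passage with its node id."""
    context = "\n\n".join(f"[{p.node_id}] {p.text}" for p in passages)
    return (
        f"Topic: {topic}\n\n"
        f"Context:\n{context}\n\n"
        "Write one multiple-choice question in Sinhala with four options. "
        "Use only the context above, and cite the id of the passage "
        "that supports the correct answer."
    )
```

Tagging each passage with its node id in the prompt is one way to let the model's output carry a source reference back to the graph.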
Key Challenges
- Sinhala tokenization: Standard NLP tools don't handle Sinhala Unicode well. This required a custom pre-processing step to handle compound words and diacritics.
- Distractor quality: Generating good wrong answers is harder than generating correct ones. The graph structure helps, because related concepts make naturally plausible distractors.
- Hallucination grounding: Every LLM output is cross-checked against the retrieved source nodes. If the answer isn't grounded in the retrieved context, the generation is rejected and retried.
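The reject-and-retry loop described under hallucination grounding might look like the sketch below. The grounding test used here, a normalized substring match, is a deliberately simple stand-in, since the actual cross-check is not specified; `is_grounded`, `generate_grounded`, and `max_attempts` are hypothetical names.

```python
import unicodedata
from typing import Callable, Optional, Sequence


def _normalize(text: str) -> str:
    """NFC-normalize so visually identical Sinhala strings compare equal."""
    return unicodedata.normalize("NFC", text).strip().casefold()


def is_grounded(answer: str, passages: Sequence[str]) -> bool:
    """Return True if the answer text appears in at least one retrieved passage."""
    needle = _normalize(answer)
    if not needle:
        return False
    return any(needle in _normalize(p) for p in passages)


def generate_grounded(
    generate: Callable[[], str],
    passages: Sequence[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Call generate() until its answer is grounded, or give up and return None."""
    for _ in range(max_attempts):
        answer = generate()
        if is_grounded(answer, passages):
            return answer
    return None


if __name__ == "__main__":
    # Stand-in for an LLM: yields an ungrounded answer first, then a grounded one.
    answers = iter(["Colombo", "Kandy"])
    print(generate_grounded(lambda: next(answers), ["A passage that mentions Kandy."]))
```

Capping the number of attempts keeps a persistently ungrounded topic from looping forever; returning `None` lets the caller decide whether to surface an error or try a different set of passages.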
What's Next
- Fine-tune a Sinhala-English bilingual model on educational QA pairs
- Add support for Grade 12–13 Advanced Level content
- Build a teacher dashboard for reviewing and publishing generated questions