RAG · IN PROGRESS · ADVANCED · 2025-01

Sinhala MCQ Generator

A retrieval-augmented generation (RAG) pipeline that generates Sinhala multiple-choice questions from Sri Lankan History textbooks. It uses Neo4j as a knowledge graph, LangChain for orchestration, and FastAPI for the API layer.

LangChain · Neo4j · FastAPI · Python · HuggingFace

Overview

Most Sinhala educational content lives in physical textbooks with no machine-readable format. This project tackles that by building a full RAG pipeline that ingests Sri Lankan History textbooks (Grades 6–11), structures the knowledge in a graph database, and generates curriculum-aligned multiple-choice questions on demand.

The goal is not just question generation but traceable question generation. Every MCQ produced can be traced back to the exact passage in the source material, which makes it auditable by teachers.

Architecture

The pipeline runs in three stages:

1. Ingestion

Raw textbook PDFs are parsed and chunked using a custom Sinhala-aware tokenizer (standard tokenizers butcher Sinhala Unicode). Each chunk is embedded using a multilingual model from HuggingFace and stored as a node in Neo4j alongside its metadata: chapter, grade, topic.
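The project's tokenizer itself is not shown here, so the following is only a minimal sketch of what "Sinhala-aware" chunking could look like. It assumes the main problem is cutting a chunk between a consonant and its vowel sign or inside a ZWJ conjunct. The function names `sinhala_clusters` and `chunk_words` and the `max_clusters` budget are hypothetical, and the cluster rule is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, used in Sinhala conjuncts such as "ශ්‍රී"


def sinhala_clusters(text):
    """Group code points into approximate grapheme clusters.

    A code point joins the previous cluster if it is a combining mark
    (vowel signs, al-lakuna, anusvaraya), a ZWJ, or follows a ZWJ.
    """
    clusters = []
    for ch in text:
        joins_previous = bool(clusters) and (
            unicodedata.category(ch).startswith("M")
            or ch == ZWJ
            or clusters[-1].endswith(ZWJ)
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def chunk_words(text, max_clusters=200):
    """Pack whole words into chunks, measuring size in clusters, not code points."""
    chunks, current, size = [], [], 0
    for word in text.split():
        n = len(sinhala_clusters(word))
        if current and size + n > max_clusters:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks


word = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"  # "ශ්‍රී": 5 code points, one visual unit
print(len(word), len(sinhala_clusters(word)))  # 5 1
```

Because chunks break only at whitespace, no chunk boundary can fall inside a cluster. Measuring size in clusters keeps the budget closer to what a reader sees than a raw code-point count would.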

2. Knowledge Graph

Neo4j stores explicit relationships between concepts rather than relying on flat vector similarity alone. A question about the "Kandyan Kingdom" can pull context from related nodes like "Colonial period," "Portuguese arrival," and "Sinhala kings" to generate distractors that are plausible but wrong.
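As an illustration of how related nodes could supply distractor context, here is a small sketch that uses an in-memory adjacency dict as a stand-in for the Neo4j graph. The edges below are invented for the example and are not the project's actual schema. `related_concepts` and `max_hops` are hypothetical names.

```python
from collections import deque


def related_concepts(graph, start, max_hops=2):
    """Breadth-first walk returning concepts within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, depth + 1))
    return found


# Illustrative edges only (assumed, not taken from the real graph).
graph = {
    "Kandyan Kingdom": ["Colonial period", "Sinhala kings"],
    "Colonial period": ["Portuguese arrival"],
}

print(related_concepts(graph, "Kandyan Kingdom"))
# ['Colonial period', 'Sinhala kings', 'Portuguese arrival']
```

Nearby concepts come back first, so a generator could prefer one-hop neighbours for the most plausible distractors and fall back to two-hop neighbours when it needs more.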

3. Generation

LangChain orchestrates the retrieval-augmented generation step. Given a topic, the pipeline:

FastAPI exposes this as a REST endpoint for integration with the frontend.
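The function an endpoint like this wraps could be sketched as follows. The project's own steps are not listed here, so this follows the generic RAG pattern (retrieve, prompt, generate, validate) as an assumption. The retriever and LLM are passed in as plain callables standing in for the Neo4j lookup and the LangChain chain. The JSON keys, the `chunk_id` field, and the name `generate_mcq` are all hypothetical.

```python
import json
from typing import Callable


def generate_mcq(topic: str,
                 retrieve: Callable[[str], list],
                 llm: Callable[[str], str]) -> dict:
    """Generate one MCQ for a topic and check it cites a retrieved passage."""
    passages = retrieve(topic)
    if not passages:
        raise ValueError(f"no passages found for topic: {topic}")

    # Tag each passage with its id so the model can cite its source.
    context = "\n".join(f"[{p['chunk_id']}] {p['text']}" for p in passages)
    prompt = (
        "Using only the passages below, write one Sinhala multiple-choice "
        "question about the topic. Reply as JSON with keys "
        "question, options, answer, source_chunk_id.\n"
        f"Topic: {topic}\nPassages:\n{context}"
    )
    mcq = json.loads(llm(prompt))

    # Traceability check: the cited passage must be one we actually retrieved.
    known_ids = {p["chunk_id"] for p in passages}
    if mcq["source_chunk_id"] not in known_ids:
        raise ValueError("model cited a passage that was not retrieved")
    if mcq["answer"] not in mcq["options"]:
        raise ValueError("correct answer missing from options")
    return mcq


# Stub retriever and model, for illustration only.
def fake_retrieve(topic):
    return [{"chunk_id": "g10-ch4-007", "text": "(passage text)"}]


def fake_llm(prompt):
    return json.dumps({
        "question": "(question text)",
        "options": ["A", "B", "C", "D"],
        "answer": "B",
        "source_chunk_id": "g10-ch4-007",
    })


mcq = generate_mcq("Kandyan Kingdom", fake_retrieve, fake_llm)
print(mcq["source_chunk_id"])  # g10-ch4-007
```

Keeping the retriever and model as injected callables means the same function can be exercised with stubs in tests and called unchanged from a route handler.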

Key Challenges

What's Next