Financial Document RAG Q&A System
A RAG (Retrieval-Augmented Generation) Q&A system built for Chinese financial regulatory documents (PBOC / SAFE regulations), enabling natural-language compliance queries with precise answers and source citations.
Overview
Key Features
Hybrid retrieval: BM25 sparse search plus dense vector retrieval (dual-path recall), so complex regulatory queries can match on both exact terms and semantic similarity
Semantic-aware chunking: splits documents at clause boundaries to preserve legal meaning and avoid cross-clause truncation
Grounded LLM synthesis: retrieved passages fed to an LLM to generate coherent answers with per-claim source attribution
Citation tracing: every answer includes regulation name, article number, and an original-text excerpt for fast verification
Multi-document retrieval: simultaneous search across multiple regulations with automatic cross-reference handling
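As a rough illustration of clause-boundary chunking, the sketch below splits a regulation at article markers and only falls back to fixed-size windows when a single article is too long. It is not the project's actual splitter. The function name `split_by_article`, the regex for markers of the form "第…条" at the start of a line, and the choice to measure size in characters rather than tokens are all assumptions made for this example.

```python
import re

# Assumption: each article starts on its own line with a marker such as "第十二条".
# Anchoring at line start avoids splitting on in-text references like "依照第一条".
ARTICLE_START = re.compile(r"(?m)^(?=\s*第[一二三四五六七八九十百千零〇0-9]+条)")


def split_by_article(text, chunk_size=512, overlap=64):
    """Split text at article boundaries; window only oversized articles.

    chunk_size and overlap are counted in characters here (an assumption).
    """
    chunks = []
    for article in ARTICLE_START.split(text):
        article = article.strip()
        if not article:
            continue
        if len(article) <= chunk_size:
            # The whole article fits, so it stays intact as one chunk.
            chunks.append(article)
            continue
        # Oversized article: slide a window so neighbouring chunks share
        # `overlap` characters of context.
        step = chunk_size - overlap
        for start in range(0, len(article), step):
            chunks.append(article[start:start + chunk_size])
            if start + chunk_size >= len(article):
                break
    return chunks


if __name__ == "__main__":
    sample = "第一条 为了规范管理,制定本办法。\n第二条 依照第一条执行。\n"
    for chunk in split_by_article(sample):
        print(repr(chunk))
```

Keeping each article whole whenever it fits means a retrieved chunk maps directly to one article number, which is what makes article-level citations straightforward.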
Methodology
Built with LangChain + ChromaDB. Documents are chunked at regulatory clause boundaries (chunk_size=512, overlap=64). Retrieval blends BM25 and text-embedding-3-small vector scores with weights of 0.3 and 0.7, respectively. Generation uses GPT-4o with structured JSON output (answer, sources, confidence), and the frontend renders inline citation cards from this schema.
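The weighted blending of sparse and dense scores can be sketched as follows. The source does not say how the two score scales are made comparable, so this example assumes min-max normalization per retriever before applying the weights. The helper names `min_max` and `fuse`, the document ids, and the sample scores are hypothetical, and reading 0.3 as the BM25 weight follows the order in which the two retrievers are listed.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1] (assumed normalization step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treat every hit as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25_scores, dense_scores, w_bm25=0.3, w_dense=0.7, top_k=5):
    """Blend normalized BM25 and dense scores with a weighted sum."""
    sparse = min_max(bm25_scores)
    dense = min_max(dense_scores)
    fused = {
        # A document missing from one retriever contributes 0 on that side.
        doc_id: w_bm25 * sparse.get(doc_id, 0.0) + w_dense * dense.get(doc_id, 0.0)
        for doc_id in set(sparse) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical raw scores: BM25 is unbounded, cosine similarity is not.
    bm25 = {"a": 10.0, "b": 5.0, "c": 0.0}
    dense = {"b": 0.9, "c": 0.5, "d": 0.1}
    for doc_id, score in fuse(bm25, dense):
        print(doc_id, round(score, 3))
```

Normalizing first matters because raw BM25 scores and embedding similarities live on different scales. Without it, the 0.3:0.7 weights would not reflect the intended balance between the two retrievers.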
Tech Stack
Project Info
RAG Q&A Live Demo
Try the full hybrid retrieval pipeline live: it searches 11 real regulatory documents and generates cited answers.
Financial Regulation Q&A
BM25 + dense vector hybrid · 11 regulatory docs · real-time retrieval + citations