A production-ready corrective RAG pipeline with tunable relevance thresholds, multi-level refinement, web search fallback, and forced citations — designed to drastically reduce hallucinations and keep every answer grounded in verified information.
- Overview
- Architecture
- Features
- Getting Started
- How It Works
- Configuration
- Hallucination Prevention
- Tech Stack
- License

## Overview
CRAG implements a sophisticated Corrective RAG pipeline that uses tunable relevance thresholds to decide whether local documents are sufficient, need refinement, or require a web search fallback.
By scoring both entire chunks and individual sentences, and by forcing the generator to cite its sources, the system drastically reduces hallucinations and ensures answers are grounded in verified information.

## Architecture
The pipeline operates in five main phases:
| Phase | Description |
|---|---|
| Document Ingestion | Uploaded files (PDF, DOCX, TXT) are parsed, split into overlapping chunks, embedded, and stored in a vector database |
| Query Processing | A user question triggers retrieval of the most relevant chunks from the vector store |
| Evaluation & Classification | Each retrieved chunk is scored 0–10 by an LLM (llama-3.1-8b-instant). Based on configurable thresholds (UT, LT), results are classified as Correct, Ambiguous, or Incorrect |
| Refinement & Fallback | Chunks are refined at the sentence level; web search is triggered when local knowledge is insufficient |
| Generation | The final context is fed to qwen/qwen3-32b, which produces an answer with citations |
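The five phases amount to a retrieve → score → classify → refine/fallback → generate loop. The sketch below is a hypothetical condensation of that flow; all helper callables are placeholders, not the project's real modules, and the threshold defaults mirror those documented in the Configuration section:

```python
# Hypothetical condensation of the five-phase CRAG flow.
# The helper callables (retrieve, evaluate, ...) are placeholders.

def answer(question, retrieve, evaluate, refine, web_search, generate,
           upper=8.0, lower=3.0):
    chunks = retrieve(question)                       # Query Processing
    scores = [evaluate(question, c) for c in chunks]  # Evaluation
    if max(scores, default=0.0) >= upper:             # Correct
        context = refine([c for c, s in zip(chunks, scores) if s >= upper])
    elif all(s <= lower for s in scores):             # Incorrect
        context = refine(web_search(question))        # web fallback only
    else:                                             # Ambiguous
        context = refine(chunks) + refine(web_search(question))
    return generate(question, context)                # Generation
```

The key design point is that the generator only ever sees `context`, never the raw retrieval results.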
## Features

- Threshold-based evaluation — tunable `UPPER_THRESHOLD` and `LOWER_THRESHOLD` to control strictness
- Two-stage verification — chunk-level scoring followed by sentence-level filtering
- Web search fallback — Tavily API triggered automatically when local documents are insufficient
- Query rewriting — LLM rewrites ambiguous/failed queries to be more search-engine-friendly
- Forced citations — generator must cite exact supporting sentences, combating hallucinations
- Streaming output — answers streamed token-by-token for real-time feedback
- Full transparency — Streamlit UI exposes pipeline trace, scores, sources, and model reasoning
- Persistent vector store — ChromaDB saves indexed chunks; re-indexing only needed on document changes
## Getting Started

```bash
# 1. Clone the repository
git clone https://github.com/your-username/CRAG.git
cd CRAG

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Copy the example env file and fill in your keys
cp .env.example .env
```

Create a `.env` file in the root directory (see `.env.example`):
```env
# Groq API
GROQ_API_KEY=your_groq_api_key_here

# Tavily API (for web search)
TAVILY_API_KEY=your_tavily_api_key_here

# Thresholds (tune these as needed)
UPPER_THRESHOLD=8.0    # UT – Correct if score ≥ UT
LOWER_THRESHOLD=3.0    # LT – Incorrect if score ≤ LT
STRIP_THRESHOLD=5.0    # Minimum relevance to keep a sentence strip

# Retrieval settings
TOP_K_DOCUMENTS=5      # Number of documents to retrieve from vector store
TOP_K_WEB_RESULTS=3    # Number of web results to fetch

# Model names
GENERATOR_MODEL=qwen/qwen3-32b
EVALUATOR_MODEL=llama-3.1-8b-instant
```

Run the app:

```bash
streamlit run app.py
```

Then open http://localhost:8501 in your browser.
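Since the tunables are plain environment variables, they can be read at startup with documented defaults. This is a sketch, not the project's actual config module (which may use `python-dotenv`); the `load_settings` helper is hypothetical:

```python
import os

def load_settings(env=os.environ) -> dict:
    """Read tunable CRAG settings, falling back to the documented defaults."""
    return {
        "UPPER_THRESHOLD": float(env.get("UPPER_THRESHOLD", "8.0")),
        "LOWER_THRESHOLD": float(env.get("LOWER_THRESHOLD", "3.0")),
        "STRIP_THRESHOLD": float(env.get("STRIP_THRESHOLD", "5.0")),
        "TOP_K_DOCUMENTS": int(env.get("TOP_K_DOCUMENTS", "5")),
        "TOP_K_WEB_RESULTS": int(env.get("TOP_K_WEB_RESULTS", "3")),
    }
```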
## How It Works

### Document Ingestion

Upload PDF, DOCX, or TXT files via the sidebar. The pipeline:
- Parses raw text using `pypdf`, `python-docx`, or built-in I/O
- Splits text into ~500-character overlapping chunks (50-char overlap) using NLTK sentence boundaries to avoid mid-sentence breaks
- Embeds each chunk using `sentence-transformers/all-MiniLM-L6-v2`
- Stores chunks in a persistent ChromaDB collection with metadata
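Sentence-aware chunking with overlap can be sketched as below. The project uses NLTK's `sent_tokenize`; a regex split stands in here to keep the sketch dependency-free, and `chunk_text` is a hypothetical name:

```python
import re

def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Pack whole sentences into ~max_chars chunks, carrying a short
    character overlap into the next chunk for context continuity."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # trailing context for next chunk
        current = (current + " " + sent).strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries first is what avoids the mid-sentence breaks mentioned above.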
### Query Processing

A user question is embedded and used to query ChromaDB. The top-k most similar chunks (default: 5) are retrieved using cosine similarity.
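In the project this ranking is delegated to ChromaDB, but the underlying operation is cosine similarity over embedding vectors, as in this stdlib-only sketch (`top_k` is a hypothetical helper):

```python
import math

def top_k(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most cosine-similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    ranked = sorted(enumerate(chunk_vecs),
                    key=lambda iv: cos(query_vec, iv[1]), reverse=True)
    return [i for i, _ in ranked[:k]]
```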
### Evaluation & Classification

Each retrieved chunk is scored 0–10 by llama-3.1-8b-instant using a detailed relevance rubric. The maximum score across all chunks determines classification:
| Condition | Classification |
|---|---|
| `max_score ≥ UPPER_THRESHOLD` | Correct — local docs are sufficient |
| Any score between LT and UT | Ambiguous — local docs partially relevant |
| All scores `< LOWER_THRESHOLD` | Incorrect — local docs insufficient |
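The three-way decision can be written as a small pure function mirroring the table above (a sketch; the project's real implementation may differ):

```python
UPPER_THRESHOLD = 8.0  # UT
LOWER_THRESHOLD = 3.0  # LT

def classify(scores: list[float]) -> str:
    """Map per-chunk relevance scores (0-10) to a CRAG classification."""
    if max(scores, default=0.0) >= UPPER_THRESHOLD:
        return "Correct"      # local docs sufficient
    if all(s < LOWER_THRESHOLD for s in scores):
        return "Incorrect"    # local docs insufficient -> web search only
    return "Ambiguous"        # partially relevant -> refine + web search
```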
### Refinement & Fallback

| Classification | Action |
|---|---|
| Correct | Refine only the top-scoring chunk(s) |
| Ambiguous | Refine all moderately relevant chunks + trigger web search |
| Incorrect | Skip local docs entirely, use web search only |
Refinement process for each chunk:
- Split into individual sentences (`stripper.py`)
- Score each sentence against the query (`filter.py`, LLM-based)
- Keep sentences with score ≥ `STRIP_THRESHOLD`
- Merge kept sentences into a clean context block (`merger.py`)
If a chunk is highly relevant but no single sentence passes the threshold, the full chunk is kept as a fallback to prevent information loss.
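The refinement steps, including the full-chunk fallback, can be sketched as follows. Here `score_sentence` stands in for the LLM-based scorer in `filter.py`, and `refine_chunk` is a hypothetical name:

```python
STRIP_THRESHOLD = 5.0

def refine_chunk(chunk_sentences, query, score_sentence):
    """Keep only sentences scoring >= STRIP_THRESHOLD against the query.

    If no sentence survives, return the whole chunk unchanged: the chunk
    was already judged relevant, so dropping everything would lose
    information.
    """
    kept = [s for s in chunk_sentences
            if score_sentence(query, s) >= STRIP_THRESHOLD]
    return kept if kept else list(chunk_sentences)
```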
Web search uses Tavily with an LLM-rewritten query (optimised for search engines). Web results undergo the same refinement process as local chunks.
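The rewrite-then-search step might look like the sketch below. The `client` is assumed to expose a Tavily-style `search(query, max_results=...)` returning `{"results": [{"content": ...}, ...]}`; the response shape and the `web_fallback` helper are assumptions, not the project's confirmed API:

```python
def web_fallback(question, rewrite, client, max_results=3):
    """Rewrite the query for search engines, then fetch web snippets.

    `rewrite` is the LLM-based query rewriter; `client` is assumed to
    behave like tavily-python's TavilyClient.
    """
    search_query = rewrite(question)  # make it search-engine-friendly
    response = client.search(search_query, max_results=max_results)
    return [r["content"] for r in response["results"]]
```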
### Generation

The final merged context (local + web) is passed to qwen/qwen3-32b with a prompt that:
- Restricts the model to answer only from the provided context
- Forces citation of exact supporting sentences
- Falls back to "I don't have enough information" if context is insufficient
The response is streamed live in the Streamlit UI.
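A grounded prompt of the kind described can be assembled like this. It is a sketch of the pattern, not the project's actual prompt wording, and `build_grounded_prompt` is a hypothetical name:

```python
def build_grounded_prompt(question: str, context_sentences: list[str]) -> str:
    """Build a prompt restricting the model to the supplied context,
    with numbered sentences so citations are forced and checkable."""
    numbered = "\n".join(f"[{i + 1}] {s}"
                         for i, s in enumerate(context_sentences))
    return (
        "Answer the question using ONLY the numbered context below.\n"
        "After each claim, cite the supporting sentence number, e.g. [2].\n"
        "If the context is insufficient, reply exactly: "
        "\"I don't have enough information\".\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
```

Numbering the sentences is what makes every citation mechanically traceable back to a specific piece of context.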
## Configuration

| Variable | Default | Description |
|---|---|---|
| `UPPER_THRESHOLD` | `8.0` | Minimum score to classify a chunk as Correct |
| `LOWER_THRESHOLD` | `3.0` | Maximum score to classify a chunk as Incorrect |
| `STRIP_THRESHOLD` | `5.0` | Minimum sentence score to keep during refinement |
| `TOP_K_DOCUMENTS` | `5` | Number of chunks to retrieve from ChromaDB |
| `TOP_K_WEB_RESULTS` | `3` | Number of Tavily web results to fetch |
| `GENERATOR_MODEL` | `qwen/qwen3-32b` | Model for final answer generation |
| `EVALUATOR_MODEL` | `llama-3.1-8b-instant` | Model for scoring and query rewriting |
> **Tip:** Raise `UPPER_THRESHOLD` to make the system more conservative before accepting local documents. Lower `STRIP_THRESHOLD` to keep more sentences during refinement.
## Hallucination Prevention

CRAG employs five complementary strategies to minimise hallucinations:
- Two-stage verification — Chunks are scored by an LLM; only those above thresholds are used. Then individual sentences are scored again. This double-checking filters out irrelevant content before it reaches the generator.
- Context restriction — The generator receives only the merged, filtered context. It is explicitly instructed to answer based solely on that context — not its internal training knowledge.
- Forced citations — By requiring the model to cite exact supporting sentences, every claim must be traceable to a source. If no supporting sentence exists, the model responds with "I don't have enough information."
- Web fallback — When local knowledge is insufficient, real-world data is fetched instead of relying on the generator's potentially outdated or incorrect internal knowledge.
- Full transparency — The Streamlit UI exposes the complete pipeline trace: all scores, classification decisions, refined chunks, web results, and context preview — so users can verify and debug every response.
## Tech Stack

| Component | Technology |
|---|---|
| UI | Streamlit |
| LLM Inference | Groq (llama-3.1-8b-instant, qwen/qwen3-32b) |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Vector Store | ChromaDB (persistent) |
| Web Search | Tavily |
| PDF Parsing | pypdf |
| DOCX Parsing | python-docx |
| Tokenisation | NLTK (sent_tokenize) |
## License

This project is licensed under the terms of the LICENSE file included in this repository.
- Research Paper - Corrective Retrieval Augmented Generation
Built with ❤️ using Streamlit
