Live App: Click here to try RAGbot on Streamlit
Medium Article: The Complete Guide to RAG, Part I: Operational Mechanics
Medium Article: The Complete Guide to RAG, Part II: Setup, Design and Application
This project accompanies a long-form Medium article that explains Retrieval-Augmented Generation (RAG) in depth and walks through its implementation. The final solution is packaged in a user-friendly Streamlit App, allowing anyone to experiment with building a simple RAGbot.
The goal of this project is to provide an interactive Retrieval-Augmented Generation (RAG) chatbot that allows users to explore the novel Crime and Punishment by Fyodor Dostoevsky in a conversational manner.
By combining document retrieval with large language model generation, RAGbot delivers contextually accurate, memory-aware responses to literary and philosophical questions about the text.
This app follows a retrieval + generation architecture using LlamaIndex, HuggingFace embeddings, and Groq's LLaMA 3.3 model.
- Document Loading — The full text of Crime and Punishment (plain-text file) is ingested using `SimpleDirectoryReader`.
- Embedding & Indexing
  - Uses `sentence-transformers/all-MiniLM-L6-v2` for text embeddings.
  - Indexed into a vector store for fast semantic search.
- Context Retrieval
  - Retrieves the top-k most relevant passages for each query.
  - `top_k` is configurable in the UI.
- Generation with Context
  - Groq's LLaMA 3.3 70B Versatile model is used for answer generation.
  - Responses are grounded in retrieved context to reduce hallucination.
- Memory-Aware Conversations
  - Maintains a buffer of conversation history so the chatbot can respond coherently over multiple turns.
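The indexing half of this pipeline can be sketched in a few lines. This is a minimal illustration, assuming the current `llama-index` package layout (`llama-index-core` plus the HuggingFace embeddings integration); the app's actual code may be organized differently:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use the same sentence-transformers model the app relies on for embeddings
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Ingest the plain-text novel and build a vector index over its chunks
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Persist the index so later runs can skip re-embedding
index.storage_context.persist(persist_dir="storage/vector_index")
```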
| Component | Purpose | Benefit |
|---|---|---|
| HuggingFace Embeddings | Encode text into vector space | Enables accurate semantic search |
| VectorStoreIndex | Store embeddings for fast retrieval | Low-latency, scalable context retrieval |
| Groq LLaMA 3.3 70B | Generate answers from context | High-quality, human-like responses |
| ChatMemoryBuffer | Store chat history | Provides conversational continuity |
| Streamlit UI | Easy web interface | Quick deployment & interaction |
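Wiring these components together might look like the sketch below, assuming the `llama-index-llms-groq` integration; the token limit and example prompt are illustrative, not the app's actual values:

```python
import os

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.groq import Groq

# Answer generation goes through Groq's hosted LLaMA 3.3 70B Versatile model
Settings.llm = Groq(model="llama-3.3-70b-versatile", api_key=os.environ["GROQ_API_KEY"])

# Reload the vector index persisted during ingestion
storage = StorageContext.from_defaults(persist_dir="storage/vector_index")
index = load_index_from_storage(storage)

# ChatMemoryBuffer keeps recent turns so follow-up questions stay coherent
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)

response = chat_engine.chat("How does Dostoevsky portray guilt in the novel?")
print(response)
```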
- Languages: Python
- Frameworks: Streamlit, LlamaIndex
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- LLM Provider: Groq (LLaMA 3.3 70B Versatile)
- Others: python-dotenv for secrets, pathlib for file handling
- Semantic Search — Retrieves the most relevant text excerpts from Crime and Punishment.
- Memory-Aware Chat — Keeps track of past exchanges for contextually coherent conversations.
- Adjustable Context Depth — `top_k` slider to control how many passages to retrieve.
- Streamlit UI — Simple, elegant web app interface.
- Configurable API Keys — Supports `.env` or `.streamlit/secrets.toml` (see the sketch below).
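A hedged sketch of how the key lookup and the `top_k` slider could be handled; widget labels and defaults here are illustrative:

```python
import os

import streamlit as st
from dotenv import load_dotenv

# Local development: read GROQ_API_KEY from a .env file if one exists
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

# Deployment: fall back to .streamlit/secrets.toml via st.secrets
if not groq_api_key:
    try:
        groq_api_key = st.secrets["GROQ_API_KEY"]
    except (KeyError, FileNotFoundError):
        groq_api_key = None

# Let the user choose how many passages to retrieve per query
top_k = st.sidebar.slider("Passages to retrieve (top_k)", 1, 10, value=3)
```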
Observability makes internal behavior visible so we can diagnose why a request was fast/slow or correct/incorrect. It helps pinpoint areas for optimization or performance improvements.
- Trace: The end-to-end record of a single request.
- Span: A timed sub-operation within a trace.
Each user query generates three spans:
- `retrieve.topk`: Time taken to perform the vector search for relevant chunks.
- `engine.chat`: Time taken by the LLM to generate the answer.
- `rag.e2e`: End-to-end time, from the user's prompt to the final answer.
These spans are stored in `local_traces.json`, and the DIY Observability tab computes recent averages and displays per-request performance.
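One way to record such spans is a small context manager that appends timing records to `local_traces.json`. The helper below is a hypothetical sketch; the repo's `app/tracing.py` may be structured differently:

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path

TRACE_FILE = Path("local_traces.json")

@contextmanager
def span(name: str, request_id: str, **attrs):
    """Time one sub-operation and append the record to the trace log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record = {
            "span": name,            # e.g. retrieve.topk / engine.chat / rag.e2e
            "request_id": request_id,
            "duration_s": time.perf_counter() - start,
            "ts": time.time(),
            **attrs,                 # extra fields such as k or hit count
        }
        traces = json.loads(TRACE_FILE.read_text()) if TRACE_FILE.exists() else []
        traces.append(record)
        TRACE_FILE.write_text(json.dumps(traces, indent=2))
```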
Traces reveal where the time is spent. For example, if retrieval is consistently fast, but generation times are long, this suggests focusing on model/runtime settings rather than the index.
- The user asks a question, and a new `request_id` and `session_id` are assigned.
- The retriever logs `k`, hit count, and best similarity score for `retrieve.topk`.
- The LLM call logs `engine.chat` for the generation process.
- The app logs `rag.e2e` for total roundtrip time.
- The observability dashboard displays average times for retrieval, generation, and roundtrip, along with a chart for quick visual comparison.
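Using the `span()` helper sketched above, the per-request flow could be wired roughly as follows. `index`, `chat_engine`, and `top_k` refer to objects from the earlier sketches; everything here is illustrative:

```python
import uuid

import streamlit as st

# One session_id per browser session, a fresh request_id per question
session_id = st.session_state.setdefault("session_id", str(uuid.uuid4()))
request_id = str(uuid.uuid4())

prompt = "Summarize the conversation between Raskolnikov and Sonia."
retriever = index.as_retriever(similarity_top_k=top_k)

with span("rag.e2e", request_id, session_id=session_id):
    with span("retrieve.topk", request_id, k=top_k):
        nodes = retriever.retrieve(prompt)  # nodes feed hit count / similarity logging
    with span("engine.chat", request_id):
        response = chat_engine.chat(prompt)
```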
In simple terms, the Observability tab helps you understand where the time goes by showing the details behind each request.
Monitoring helps track a few known signals over time, allowing you to spot issues like drift or outages quickly.
Core metrics:
- Availability / success rate: 1 − (errors/requests)
- Throughput: Requests per minute.
- Latency percentiles (p95/p99): The response times under which 95% and 99% of requests complete.
- Health checks: Verify that the API key, corpus file, and index are present.
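These metrics fall out of the recorded durations directly; a minimal, self-contained sketch using nearest-rank percentiles (field names are illustrative):

```python
def summarize(durations_s: list[float], errors: int, window_minutes: float) -> dict:
    """Compute success rate, throughput, and latency percentiles."""
    # Assumes durations are recorded for successful requests only
    requests = len(durations_s) + errors
    ordered = sorted(durations_s)

    def pct(p: float) -> float:
        # Nearest-rank percentile; adequate for small local samples
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

    return {
        "success_rate": 1 - errors / max(requests, 1),
        "throughput_rpm": requests / max(window_minutes, 1e-9),
        "p95_s": pct(95) if ordered else 0.0,
        "p99_s": pct(99) if ordered else 0.0,
    }

print(summarize([0.8, 1.1, 0.9, 2.4], errors=0, window_minutes=5))
```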
In the app:
- Health: Shows system status (green/red) with inline reasons for issues (e.g., missing API key, absent corpus file).
- Performance: Success rate, throughput, and latency (p95/p99) are calculated from recorded request durations.
- Monitoring Dashboard: Provides a high-level view of system health and performance, with compact metrics for quick troubleshooting.
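The health checks themselves can be as simple as a function returning a status flag plus human-readable reasons; a hypothetical sketch:

```python
import os
from pathlib import Path

def health_check() -> tuple[bool, list[str]]:
    """Return (healthy?, reasons) for the monitoring tab's green/red status."""
    reasons = []
    if not os.getenv("GROQ_API_KEY"):
        reasons.append("missing GROQ_API_KEY")
    if not Path("data/crime_and_punishment.txt").exists():
        reasons.append("corpus file not found")
    if not Path("storage/vector_index").exists():
        reasons.append("vector index not built")
    return not reasons, reasons
```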
Intuition: Monitoring acts as the “smoke alarm” for the system. If something goes wrong, it provides enough information to trigger further investigation in the Observability tab.
This is the primary interface where users interact with the RAGbot.
You can ask questions about Crime and Punishment and receive context-grounded answers generated by Groq's LLaMA 3.3 model.
A slider allows you to adjust how many relevant text passages (`top_k`) are retrieved per query.
The observability dashboard provides detailed timing metrics for each request:
- Find passages — time to retrieve relevant chunks.
- Write answer — time for the LLM to generate the response.
- Total roundtrip — end-to-end time from question to answer.
This view helps identify bottlenecks and monitor efficiency.
This chart visualizes request timings over multiple queries.
It highlights spikes in latency (e.g., long generation times) and makes it easy to compare retrieval, generation, and roundtrip performance across sessions.
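Rendering such a chart from the trace log is straightforward; a sketch assuming the JSON schema used by the `span()` helper above:

```python
import json
from pathlib import Path

import pandas as pd
import streamlit as st

traces = json.loads(Path("local_traces.json").read_text())
df = pd.DataFrame(traces)

# One column per span, indexed by timestamp, so latency spikes line up visually
wide = df.pivot_table(index="ts", columns="span", values="duration_s")
st.line_chart(wide)
```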
The monitoring dashboard gives a high-level system health overview:
- Status — overall health of the system.
- Performance Metrics — success rate, throughput, and latency (p95 / p99).
This gives a quick check that the chatbot is running reliably and responding within expected latency bounds.
Architecture Steps:
- User Query → Enters prompt in Streamlit chat UI.
- Retriever → Queries vector store for top-k relevant passages.
- LLM → Groq LLaMA 3.3 70B processes query + retrieved context.
- Response → Sent back to Streamlit UI and added to memory buffer.
- Conversation History → Maintains context for multi-turn dialogue.
```
ragbot_crime_and_punishment/
│
├── data/
│   └── crime_and_punishment.txt   # The full text of Crime and Punishment
│
├── storage/
│   └── vector_index/              # Persistent vector index data
│
├── .streamlit/
│   └── secrets.toml               # Optional API keys for deployment
│
├── app/
│   ├── streamlit_app.py           # Main Streamlit application that runs the RAGbot
│   ├── config.py                  # Configuration settings (API keys, paths, etc.)
│   ├── metrics.py                 # Metrics for monitoring and observability
│   ├── tracing.py                 # Trace recording for performance and observability
│   ├── feedback.py                # (Optional) Feedback collection module
│
├── requirements.txt               # Python dependencies
├── .env                           # Local development secrets
├── README.md                      # Project description (this file)
```
```
git clone https://github.com/hsjoi1402/ragbot-crime-and-punishment.git
cd ragbot-crime-and-punishment
pip install -r requirements.txt
```

Set your Groq API key in `.env`:

```
GROQ_API_KEY=your_api_key_here
```

Then launch the app:

```
streamlit run app.py
```

The app will open in your browser at http://localhost:8xxx.
- Load Text: Read the novel from `/data/crime_and_punishment.txt`.
- Embed & Index: Create a vector index using HuggingFace embeddings.
- Persist Index: Store it in `/storage/vector_index` for reuse.
- Retrieve Context: On user queries, fetch top-k relevant passages.
- Generate Answer: Send the context to Groq's LLaMA 3.3 model.
- Display & Store: Show the answer in the chat UI and add it to the conversation history.
- What is Raskolnikov’s moral struggle?
- Summarize the conversation between Raskolnikov and Sonia.
- How does Dostoevsky portray guilt in the novel?
- Observability Dashboard → Request timings (retrieval, generation, roundtrip).
- Performance Graphs → Latency breakdowns across recent queries.
- Monitoring Dashboard → Success rate, throughput, and system health.
These dashboards make it easy to debug latency spikes, track throughput, and ensure reliability.
The app is Streamlit-ready and can be deployed:
- Locally (via `streamlit run`)
- On Streamlit Cloud with `.streamlit/secrets.toml`
- In a Docker container for production
Pull requests are welcome! Future improvements:
- Add multi-document support and routing.
- Enhance UI with richer formatting.
- Integrate summarization features.
- Retrieve links and sources with answers.
Prakash
- Fyodor Dostoevsky — For writing Crime and Punishment.
- ChatGPT (OpenAI) — For providing boilerplate code, improving scripts, and assisting with comments, docstrings, and documentation.