Live App: Click here to try RAGbot on Streamlit
Medium Article: The Complete Guide to RAG, Part I: Operational Mechanics
Medium Article: The Complete Guide to RAG, Part II: Setup, Design and Application
This project accompanies a long-form Medium article that explains Retrieval-Augmented Generation (RAG) in depth and walks through its implementation. The final solution is packaged in a user-friendly Streamlit App, allowing anyone to experiment with building a simple RAGbot.
The goal of this project is to provide an interactive Retrieval-Augmented Generation (RAG) chatbot that allows users to explore the novel Crime and Punishment by Fyodor Dostoevsky in a conversational manner.
By combining document retrieval with large language model generation, RAGbot delivers contextually accurate, memory-aware responses to literary and philosophical questions about the text.
This app follows a retrieval + generation architecture using LlamaIndex, HuggingFace embeddings, and Groq's LLaMA 3.3 model.
- Document Loading — The full text of Crime and Punishment (plain-text file) is ingested using `SimpleDirectoryReader`.
- Embedding & Indexing
  - Uses `sentence-transformers/all-MiniLM-L6-v2` for text embeddings.
  - Indexed into a vector store for fast semantic search.
- Context Retrieval
  - Retrieves the top-k most relevant passages for each query.
  - `top_k` is configurable in the UI.
- Generation with Context
  - Groq's LLaMA 3.3 70B Versatile model is used for answer generation.
  - Responses are grounded in retrieved context to reduce hallucination.
- Memory-Aware Conversations
  - Maintains a buffer of conversation history so the chatbot can respond coherently over multiple turns.
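The indexing half of this pipeline can be sketched in a few lines. This is a minimal illustration, assuming the current `llama-index` package layout (`llama-index-core` plus the HuggingFace embeddings integration); the app's actual code may be organized differently:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Use the same sentence-transformers model the app relies on for embeddings
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Ingest the plain-text novel and build a vector index over its chunks
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Persist the index so later runs can skip re-embedding
index.storage_context.persist(persist_dir="storage/vector_index")
```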
| Component | Purpose | Benefit |
|---|---|---|
| HuggingFace Embeddings | Encode text into vector space | Enables accurate semantic search |
| VectorStoreIndex | Store embeddings for fast retrieval | Low-latency, scalable context retrieval |
| Groq LLaMA 3.3 70B | Generate answers from context | High-quality, human-like responses |
| ChatMemoryBuffer | Store chat history | Provides conversational continuity |
| Streamlit UI | Easy web interface | Quick deployment & interaction |
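Wiring these components together might look like the sketch below, assuming the `llama-index-llms-groq` integration; the token limit and example prompt are illustrative, not the app's actual values:

```python
import os

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.groq import Groq

# Answer generation goes through Groq's hosted LLaMA 3.3 70B Versatile model
Settings.llm = Groq(model="llama-3.3-70b-versatile", api_key=os.environ["GROQ_API_KEY"])

# Reload the vector index persisted during ingestion
storage = StorageContext.from_defaults(persist_dir="storage/vector_index")
index = load_index_from_storage(storage)

# ChatMemoryBuffer keeps recent turns so follow-up questions stay coherent
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)

response = chat_engine.chat("How does Dostoevsky portray guilt in the novel?")
print(response)
```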
- Languages: Python
- Frameworks: Streamlit, LlamaIndex
- Embeddings: sentence-transformers/all-MiniLM-L6-v2
- LLM Provider: Groq (LLaMA 3.3 70B Versatile)
- Others: python-dotenv for secrets, pathlib for file handling
- Semantic Search — Retrieves the most relevant text excerpts from Crime and Punishment.
- Memory-Aware Chat — Keeps track of past exchanges for contextually coherent conversations.
- Adjustable Context Depth — `top_k` slider to control how many passages to retrieve.
- Streamlit UI — Simple, elegant web app interface.
- Configurable API Keys — Supports `.env` or `.streamlit/secrets.toml` (see the sketch below).
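A hedged sketch of how the key lookup and the `top_k` slider could be handled; widget labels and defaults here are illustrative:

```python
import os

import streamlit as st
from dotenv import load_dotenv

# Local development: read GROQ_API_KEY from a .env file if one exists
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

# Deployment: fall back to .streamlit/secrets.toml via st.secrets
if not groq_api_key:
    try:
        groq_api_key = st.secrets["GROQ_API_KEY"]
    except (KeyError, FileNotFoundError):
        groq_api_key = None

# Let the user choose how many passages to retrieve per query
top_k = st.sidebar.slider("Passages to retrieve (top_k)", 1, 10, value=3)
```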
Observability makes internal behavior visible so we can diagnose why a request was fast/slow or correct/incorrect. It helps pinpoint areas for optimization or performance improvements.
- Trace: The end-to-end record of a single request.
- Span: A timed sub-operation within a trace.
Each user query generates three spans:
- `retrieve.topk`: Time taken to perform the vector search for relevant chunks.
- `engine.chat`: Time taken by the LLM to generate the answer.
- `rag.e2e`: End-to-end time, from the user's prompt to the final answer.
These spans are stored in `local_traces.json`, and the DIY Observability tab computes recent averages and displays per-request performance.
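One way to record such spans is a small context manager that appends timing records to `local_traces.json`. The helper below is a hypothetical sketch; the repo's `app/tracing.py` may be structured differently:

```python
import json
import time
from contextlib import contextmanager
from pathlib import Path

TRACE_FILE = Path("local_traces.json")

@contextmanager
def span(name: str, request_id: str, **attrs):
    """Time one sub-operation and append the record to the trace log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record = {
            "span": name,            # e.g. retrieve.topk / engine.chat / rag.e2e
            "request_id": request_id,
            "duration_s": time.perf_counter() - start,
            "ts": time.time(),
            **attrs,                 # extra fields such as k or hit count
        }
        traces = json.loads(TRACE_FILE.read_text()) if TRACE_FILE.exists() else []
        traces.append(record)
        TRACE_FILE.write_text(json.dumps(traces, indent=2))
```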
Traces reveal where the time is spent. For example, if retrieval is consistently fast, but generation times are long, this suggests focusing on model/runtime settings rather than the index.
- The user asks a question, and a new `request_id` and `session_id` are assigned.
- The retriever logs `k`, hit count, and best similarity score for `retrieve.topk`.
- The LLM call logs `engine.chat` for the generation process.
- The app logs `rag.e2e` for total roundtrip time.
- The observability dashboard displays average times for retrieval, generation, and roundtrip, along with a chart for quick visual comparison.
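Using the `span()` helper sketched above, the per-request flow could be wired roughly as follows. `index`, `chat_engine`, and `top_k` refer to objects from the earlier sketches; everything here is illustrative:

```python
import uuid

import streamlit as st

# One session_id per browser session, a fresh request_id per question
session_id = st.session_state.setdefault("session_id", str(uuid.uuid4()))
request_id = str(uuid.uuid4())

prompt = "Summarize the conversation between Raskolnikov and Sonia."
retriever = index.as_retriever(similarity_top_k=top_k)

with span("rag.e2e", request_id, session_id=session_id):
    with span("retrieve.topk", request_id, k=top_k):
        nodes = retriever.retrieve(prompt)  # nodes feed hit count / similarity logging
    with span("engine.chat", request_id):
        response = chat_engine.chat(prompt)
```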
In simple terms, the Observability tab helps you understand where the time goes by showing the details behind each request.
Monitoring helps track a few known signals over time, allowing you to spot issues like drift or outages quickly.
Core metrics:
- Availability / success rate: 1 − (errors/requests)
- Throughput: Requests per minute.
- Latency percentiles (p95/p99): The response times under which 95% and 99% of requests complete.
- Health checks: Verify that the API key, corpus file, and index are present.
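These metrics fall out of the recorded durations directly; a minimal, self-contained sketch using nearest-rank percentiles (field names are illustrative):

```python
def summarize(durations_s: list[float], errors: int, window_minutes: float) -> dict:
    """Compute success rate, throughput, and latency percentiles."""
    # Assumes durations are recorded for successful requests only
    requests = len(durations_s) + errors
    ordered = sorted(durations_s)

    def pct(p: float) -> float:
        # Nearest-rank percentile; adequate for small local samples
        return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

    return {
        "success_rate": 1 - errors / max(requests, 1),
        "throughput_rpm": requests / max(window_minutes, 1e-9),
        "p95_s": pct(95) if ordered else 0.0,
        "p99_s": pct(99) if ordered else 0.0,
    }

print(summarize([0.8, 1.1, 0.9, 2.4], errors=0, window_minutes=5))
```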
In the app:
- Health: Shows system status (green/red) with inline reasons for issues (e.g., missing API key, absent corpus file).
- Performance: Success rate, throughput, and latency (p95/p99) are calculated from recorded request durations.
- Monitoring Dashboard: Provides a high-level view of system health and performance, with compact metrics for quick troubleshooting.
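The health checks themselves can be as simple as a function returning a status flag plus human-readable reasons; a hypothetical sketch:

```python
import os
from pathlib import Path

def health_check() -> tuple[bool, list[str]]:
    """Return (healthy?, reasons) for the monitoring tab's green/red status."""
    reasons = []
    if not os.getenv("GROQ_API_KEY"):
        reasons.append("missing GROQ_API_KEY")
    if not Path("data/crime_and_punishment.txt").exists():
        reasons.append("corpus file not found")
    if not Path("storage/vector_index").exists():
        reasons.append("vector index not built")
    return not reasons, reasons
```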
Intuition: Monitoring acts as the “smoke alarm” for the system. If something goes wrong, it provides enough information to trigger further investigation in the Observability tab.
This is the primary interface where users interact with the RAGbot.
You can ask questions about Crime and Punishment and receive context-grounded answers generated by Groq's LLaMA 3.3 model.
A slider allows you to adjust how many relevant text passages (`top_k`) are retrieved per query.
The observability dashboard provides detailed timing metrics for each request:
- Find passages — time to retrieve relevant chunks.
- Write answer — time for the LLM to generate the response.
- Total roundtrip — end-to-end time from question to answer.
This view helps identify bottlenecks and monitor efficiency.
This chart visualizes request timings over multiple queries.
It highlights spikes in latency (e.g., long generation times) and makes it easy to compare retrieval, generation, and roundtrip performance across sessions.
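Rendering such a chart from the trace log is straightforward; a sketch assuming the JSON schema used by the `span()` helper above:

```python
import json
from pathlib import Path

import pandas as pd
import streamlit as st

traces = json.loads(Path("local_traces.json").read_text())
df = pd.DataFrame(traces)

# One column per span, indexed by timestamp, so latency spikes line up visually
wide = df.pivot_table(index="ts", columns="span", values="duration_s")
st.line_chart(wide)
```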
The monitoring dashboard gives a high-level system health overview:
- Status — overall health of the system.
- Performance Metrics — success rate, throughput, and latency (p95 / p99).
This gives a quick check that the chatbot is running reliably and responding within expected latency bounds.
Architecture Steps:
- User Query → Enters prompt in Streamlit chat UI.
- Retriever → Queries vector store for top-k relevant passages.
- LLM → Groq LLaMA 3.3 70B processes query + retrieved context.
- Response → Sent back to Streamlit UI and added to memory buffer.
- Conversation History → Maintains context for multi-turn dialogue.
```
ragbot_crime_and_punishment/
│
├── data/
│   └── crime_and_punishment.txt   # The full text of Crime and Punishment
│
├── storage/
│   └── vector_index/              # Persistent vector index data
│
├── .streamlit/
│   └── secrets.toml               # Optional API keys for deployment
│
├── app/
│   ├── streamlit_app.py           # Main Streamlit application that runs the RAGbot
│   ├── config.py                  # Configuration settings (API keys, paths, etc.)
│   ├── metrics.py                 # Metrics for monitoring and observability
│   ├── tracing.py                 # Trace recording for performance and observability
│   ├── feedback.py                # (Optional) Feedback collection module
│
├── requirements.txt               # Python dependencies
├── .env                           # Local development secrets
├── README.md                      # Project description (this file)
```
```
git clone https://github.com/hsjoi1402/ragbot-crime-and-punishment.git
cd ragbot-crime-and-punishment
pip install -r requirements.txt
```

Set your Groq API key in `.env`:

```
GROQ_API_KEY=your_api_key_here
```

Then launch the app:

```
streamlit run app.py
```

The app will open in your browser at http://localhost:8xxx.
- Load Text: Read the novel from `/data/crime_and_punishment.txt`.
- Embed & Index: Create a vector index using HuggingFace embeddings.
- Persist Index: Store it in `/storage/vector_index` for reuse.
- Retrieve Context: On user queries, fetch top-k relevant passages.
- Generate Answer: Send the context to Groq's LLaMA 3.3 model.
- Display & Store: Show the answer in the chat UI and add it to the conversation history.
- What is Raskolnikov’s moral struggle?
- Summarize the conversation between Raskolnikov and Sonia.
- How does Dostoevsky portray guilt in the novel?
- Observability Dashboard → Request timings (retrieval, generation, roundtrip).
- Performance Graphs → Latency breakdowns across recent queries.
- Monitoring Dashboard → Success rate, throughput, and system health.
These dashboards make it easy to debug latency spikes, track throughput, and ensure reliability.
The app is Streamlit-ready and can be deployed:
- Locally (via `streamlit run`)
- On Streamlit Cloud with `.streamlit/secrets.toml`
- In a Docker container for production
Pull requests are welcome! Future improvements:
- Add multi-document support and routing.
- Enhance UI with richer formatting.
- Integrate summarization features.
- Retrieve links and sources with answers.
Prakash
- Fyodor Dostoevsky — For writing Crime and Punishment.
- ChatGPT (OpenAI) — For providing boilerplate code, improving scripts, and assisting with comments, docstrings, and documentation.