A FastAPI-based service for processing and analyzing documents (PDF, Excel, CSV, Word) with vector embeddings and summarization capabilities.
- Document processing for multiple file types (PDF, Excel, CSV, Word)
- Text extraction and summarization
- Vector embeddings generation
- Document similarity search
- RabbitMQ integration for asynchronous processing
- Clerk authentication for secure API access
- Vector database storage with pgvector

Prerequisites:

- Python 3.12
- PostgreSQL with pgvector extension
- RabbitMQ
- Clerk account for authentication
- virtualenvwrapper

Installation:

- Clone the repository:

      git clone <repository-url>
      cd echo_doc_proc

- Set up the virtual environment using virtualenvwrapper:

      # Create and activate virtual environment
      mkvirtualenv -p /home/keith/Envs/echoRag/bin/python echo_doc_proc
      workon echo_doc_proc
      # Install dependencies
      pip install -r requirements.txt

- Create necessary directories:

      mkdir -p models

- Set up environment variables in .env:

      DATABASE_URL=postgresql://user:password@localhost:5432/dbname
      RABBITMQ_URL=amqp://guest:guest@localhost:5672/
      CLERK_SECRET_KEY=your_clerk_secret_key
      CLERK_SERVICE_USER_ID=your_service_user_id
      API_URL=your_api_url

- Create a PostgreSQL database with pgvector extension:

      CREATE DATABASE your_database;
      \c your_database
      CREATE EXTENSION vector;

- Create the required tables:

      CREATE TABLE documents_proc (
          id TEXT PRIMARY KEY,
          document_id TEXT,
          content TEXT,
          summary TEXT,
          metadata JSONB,
          embedding vector(384)
      );

To run the service under the VS Code debugger:

- Open the project in VS Code
- Set breakpoints in your code
- Press F5 or use the Run and Debug menu
- Select "Python Debugger: FastAPI" configuration
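
Before starting the server, it can help to confirm that the variables from the .env step are actually set. The following is a minimal sketch and not part of the project: the helper name `missing_settings` is hypothetical, and it assumes the values have already been loaded into the process environment (how the application itself reads .env is not shown in this README).

```python
import os
from collections.abc import Mapping

# Variable names copied from the .env example in this README.
REQUIRED_SETTINGS = (
    "DATABASE_URL",
    "RABBITMQ_URL",
    "CLERK_SECRET_KEY",
    "CLERK_SERVICE_USER_ID",
    "API_URL",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required names that are unset or blank in `env` (hypothetical helper)."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Report against the current process environment without aborting.
missing = missing_settings(os.environ)
print("Missing settings:", ", ".join(missing) if missing else "none")
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary instead of the real environment.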

To run the server directly with uvicorn:

    uvicorn app.main:app --host 0.0.0.0 --port 8000

Process a document by ID:

    curl -X POST "http://localhost:8000/api/process-document?document_id=your-document-id"

Upload a document:

    curl -X POST "http://localhost:8000/api/upload" \
      -H "Content-Type: multipart/form-data" \
      -F "file=@/path/to/your/document.pdf"

Search documents:

    curl "http://localhost:8000/api/search?query=your-search-query&limit=5"

Health check:

    curl "http://localhost:8000/health"

Project structure:

    echo_doc_proc/
    ├── app/
    │   ├── __init__.py
    │   ├── main.py                 # FastAPI application
    │   ├── document_processor.py   # Document processing logic
    │   └── database.py             # Database operations
    ├── models/                     # Model cache directory
    ├── .env                        # Environment variables
    ├── requirements.txt            # Python dependencies
    └── main.py                     # Entry point for running the application

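
The curl calls for the process-document, search, and health endpoints can also be issued from Python. This is a minimal sketch using only the standard library. The function names are hypothetical, the base URL assumes the local uvicorn command from this README, and because the response bodies are not documented here, the request helper returns only the HTTP status code.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the service runs locally on the port used by the uvicorn command.
BASE_URL = "http://localhost:8000"


def process_document_url(document_id: str) -> str:
    """URL for POST /api/process-document with an encoded document_id."""
    query = urlencode({"document_id": document_id})
    return f"{BASE_URL}/api/process-document?{query}"


def search_url(query_text: str, limit: int = 5) -> str:
    """URL for GET /api/search with an encoded query and result limit."""
    query = urlencode({"query": query_text, "limit": limit})
    return f"{BASE_URL}/api/search?{query}"


def request_status(url: str, method: str = "GET", timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code.

    Needs a running server; urlopen raises urllib.error.HTTPError on 4xx/5xx.
    """
    with urlopen(Request(url, method=method), timeout=timeout) as response:
        return response.status


# Building URLs needs no server; spaces and special characters are encoded.
print(search_url("quarterly report", limit=3))
print(process_document_url("your-document-id"))
```

With the server running, `request_status(f"{BASE_URL}/health")` would mirror the health-check curl call, and `request_status(process_document_url("your-document-id"), method="POST")` the process-document one.
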
The project is configured with VS Code debugging support. To use the debugger:
- Ensure you're using the correct Python interpreter:
  - Path: /home/keith/Envs/echoRag/bin/python
  - Can be set in VS Code's Python interpreter settings
- Set breakpoints in your code
- Press F5 or use the Run and Debug menu
- Select "Python Debugger: FastAPI" configuration

You can add breakpoints in two ways:
- Click to the left of the line number in VS Code
- Add the following line in your code:

      import pdb; pdb.set_trace()

License: [Your License]