A comprehensive document analysis pipeline for extracting figures, tables, and text from scientific PDFs using state-of-the-art computer vision and OCR technologies.
This project uses DocLayout-YOLO for document layout detection and Tesseract OCR for text extraction to process neuroscience research papers. The pipeline automatically extracts figures with captions, tables with structure, and plain text with intelligent context matching.
First make sure you have Python 3.10 (any version of 3.10 can work, but 3.10.11 is the easiest to install).
Install Python 3.10.11 here: https://www.python.org/downloads/release/python-31011/
You can use conda for this step as well, but venv is recommended for this project. In this step we will set up a virtual environment and then activate it.
Make sure to run these commands in the Neurosci directory in the terminal
Windows:
py -3.10 -m venv venv
.\venv\Scripts\Activate
Linux:
python3.10 -m venv venv
source venv/bin/activate
You only need to run the venv creation command once. For the rest of the project, run the activate command to enter the venv you created.
This project uses Tesseract OCR (tested at 100% accuracy in 30 runs). You must install the Tesseract executable separately from Python packages.
Windows:
- Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
- Or use chocolatey:
choco install tesseract
- Or use scoop:
scoop install tesseract
Linux:
sudo apt-get install tesseract-ocr
After installation, verify it works:
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
See Extraction/Information/OCR_SETUP.md for detailed OCR setup instructions and alternative engines.
Before you install, make sure your version of pip is up to date:
python.exe -m pip install --upgrade pip
Next, make sure you are in the root of the project and that the virtual environment (venv) is activated. We will now install the required packages and libraries for this project:
pip install -r requirements.txt
The extraction pipeline processes PDF documents and extracts structured data including figures, tables, and text with intelligent context matching.
- Make sure you followed the setup steps above and your venv is activated
- Navigate to the extraction directory:
cd Extraction
- Place your PDF files in the input_pdfs folder (sample papers are already included for testing)
- Run the extraction:
python main.py
Note: YoloExtraction.py is the legacy file. The current pipeline uses main.py.
The pipeline follows these steps for each PDF:
- Uses DocLayout-YOLO (v10) to detect document elements on each page (a detection sketch follows this list)
- Detects: figures, tables, plain text, figure captions, table captions, table footnotes
- Applies confidence thresholds and filters to remove noise
- Lower confidence threshold (0.15) for captions to catch wide/short caption boxes
- Extracts figure images from detected regions
- Matches figures with their captions using proximity-based algorithm
- Saves figure images and captions separately
- Extracts reference tokens (e.g., "Figure 1", "Fig. 2A") from captions
- Can aggregate all figures into a single master folder if configured
- Extracts table images from detected regions
- Matches tables with captions and footnotes
- Uses LLM Whisperer API to extract structured table content
- Caches table extractions to avoid redundant API calls
- Saves table images, captions, footnotes, and extracted text
- Extracts plain text regions using Tesseract OCR
- Detects multi-column layouts automatically
- Segments text into paragraphs with intelligent merging
- Preserves document structure and reading order
- Saves each paragraph as a separate file
- Matches paragraphs that reference figures/tables with their corresponding items
- Recognizes diverse reference patterns: "Figure 1", "Fig. 2A-C", "Table I", "see Fig 3" (a pattern-matching sketch follows this list)
- Handles panel references (A, B, C), roman numerals (I, II, III), and multi-figure citations
- Updates context continuously as each page is processed
- Final comprehensive update after all pages are done
- Unified OCR interface supporting multiple engines
- Primary engine: Tesseract (best accuracy for scientific papers)
- Alternative engines available: PaddleOCR, EasyOCR
- Handles orientation correction for scanned documents
- Configurable in main.py by changing the OCR_ENGINE variable
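For orientation, here is a minimal detection sketch using the doclayout_yolo package; the page-image path and the 0.25 confidence value are placeholders, and detection.py applies its own thresholds and filtering:

```python
from doclayout_yolo import YOLOv10

# Assumes this runs from the Extraction/ directory where docYolo.pt lives;
# the page-image path below is a placeholder.
model = YOLOv10("docYolo.pt")
results = model.predict("output_results/sample/pages/page_1.png", imgsz=1024, conf=0.25)

for box in results[0].boxes:
    label = results[0].names[int(box.cls)]   # e.g. "figure", "table", "plain text"
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # region in pixel coordinates
    print(f"{label}: conf={float(box.conf):.2f}, bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```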
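The reference matching can be approximated with a couple of regular expressions. This is a simplified sketch only; context_extractor.py presumably covers more variants than shown here:

```python
import re

# Simplified patterns - the real context_extractor.py likely handles more cases.
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.?)\s*(\d+[A-Z]?(?:\s*-\s*[A-Z\d]+)?)", re.IGNORECASE)
TABLE_REF = re.compile(r"\bTable\s+([IVXLC]+|\d+)\b", re.IGNORECASE)

paragraph = "Spike rates increased (see Fig. 2A-C), consistent with Table II."
print(FIGURE_REF.findall(paragraph))  # ['2A-C']
print(TABLE_REF.findall(paragraph))   # ['II']
```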
After extraction, results are organized in the output_results folder (ignored by git):
output_results/
└── [paper_name]/
├── pages/ # Rendered page images
├── detections/ # Detection visualizations
├── figures/ # Extracted figures
│ └── page_X_figure_Y/
│ ├── figure.png
│ ├── caption.txt
│ └── context.json # Paragraphs referencing this figure
├── tables/ # Extracted tables
│ └── page_X_table_Y/
│ ├── table.png
│ ├── caption.txt
│ ├── footnote.txt
│ ├── table_text.txt # Structured table content
│ └── context.json # Paragraphs referencing this table
└── text/ # Extracted text
└── page_X/
├── paragraph_1.txt
├── paragraph_2.txt
└── ...
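If you want to consume these results from your own scripts, a simple walk over the figure folders could look like the sketch below; the exact schema of context.json is whatever the pipeline wrote, so it is only loaded and counted here:

```python
import json
from pathlib import Path

# Assumes you are in the Extraction/ directory after main.py has run.
results_root = Path("output_results")

for fig_dir in sorted(results_root.glob("*/figures/page_*_figure_*")):
    caption_file = fig_dir / "caption.txt"
    context_file = fig_dir / "context.json"

    caption = caption_file.read_text(encoding="utf-8").strip() if caption_file.exists() else "(no caption)"
    print(f"{fig_dir.parent.parent.name} / {fig_dir.name}: {caption[:80]}")

    if context_file.exists():
        # context.json holds the paragraphs referencing this figure; no schema is assumed here.
        context = json.loads(context_file.read_text(encoding="utf-8"))
        print(f"  context entries: {len(context)}")
```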
Edit main.py to customize:
# Input/Output
INPUT_FOLDER = "input_pdfs"
OUTPUT_FOLDER = "output_results"
# Performance
BATCH_SIZE = 10 # Pages per batch to manage memory
# Features
AGGREGATE_FIGURES_INTO_ONE_FOLDER = True # Collect all figures in one place
# OCR Engine
OCR_ENGINE = "tesseract" # Options: "tesseract", "paddle", "easyocr"
- DocLayout-YOLO: State-of-the-art document layout detection model
- Tesseract OCR: Industry-standard OCR engine (100% accuracy on test set)
- PyMuPDF (fitz): PDF rendering and manipulation (see the rendering sketch after this list)
- LLM Whisperer: Table structure extraction API
- OpenCV: Image processing and manipulation
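For reference, page rendering with PyMuPDF (the library behind the pages/ output) comes down to a few calls; the input path and DPI below are illustrative, and main.py handles this per input file:

```python
import fitz  # PyMuPDF

# Render every page of one PDF to a PNG (path and DPI are illustrative).
doc = fitz.open("input_pdfs/sample_paper.pdf")
for page_number, page in enumerate(doc, start=1):
    pixmap = page.get_pixmap(dpi=200)   # rasterize the page
    pixmap.save(f"page_{page_number}.png")
doc.close()
```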
After extraction, the project uses a vector database for semantic search across figures and text content.
The project uses Qdrant as the vector database. Install and run Qdrant:
Docker (Recommended):
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
Alternative: Download binary from https://qdrant.tech/
Qdrant will be available at http://localhost:6333
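To confirm the database is reachable from Python before running the embedding pipelines, a quick check with the qdrant-client package (assuming it is installed via requirements.txt) is:

```python
from qdrant_client import QdrantClient

# Quick health check against the local instance started above.
client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # no collections yet on a fresh install
```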
Located in the VectorDB/ folder, the embedding system generates semantic embeddings using NVIDIA NIM API (Llama 3.2 NeMoRetriever).
- CreateCollection.py - Initializes Qdrant collections (a collection-creation sketch follows this list)
  - Creates vector collections with 2048 dimensions
  - Uses cosine similarity for search
- FigPipeline.py - Figure embedding pipeline
  - Processes extracted figures from output_results/
  - Generates multimodal embeddings (image + caption)
  - Stores in Qdrant with figure metadata
  - Rate-limited to 40 requests/minute
- TextPipeline.py - Text embedding pipeline
  - Processes extracted paragraphs
  - Generates text embeddings
  - Stores with full context metadata
- EmbedFigures.py / EmbedText.py - Core embedding functions
  - Interface with NVIDIA NIM API
  - Automatic retry logic and rate limiting
  - Error handling for API failures
- FindSimilarContent.py / SimSearch.py - Semantic search
  - Query vector database with text or images
  - Returns most similar content with relevance scores
  - Supports both figure and text search
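As a sketch of what the collection setup involves, creating a 2048-dimensional cosine-similarity collection with qdrant-client looks roughly like this; the collection name is a placeholder, and CreateCollection.py defines the real ones:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# "figures" is a placeholder - CreateCollection.py defines the actual collection names.
client.create_collection(
    collection_name="figures",
    vectors_config=VectorParams(size=2048, distance=Distance.COSINE),
)
```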
After extraction is complete:
cd VectorDB
# Create/recreate collection and embed figures
python FigPipeline.py
# Embed text content
python TextPipeline.py
# Search for similar content
python FindSimilarContent.py
The project uses NVIDIA's NIM API for multimodal embeddings:
- Model: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
- Supports both text and image inputs
- 2048-dimensional embeddings
- Rate limit: 40 requests/minute
Note: API key is configured in the embedding scripts. For production, use environment variables.
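The embedding scripts pair this rate limit with retry logic. A generic, hedged sketch of that pattern follows; embed_one and the NVIDIA_API_KEY variable name are stand-ins, not the project's actual function or configuration:

```python
import os
import time

API_KEY = os.environ.get("NVIDIA_API_KEY", "")  # assumed env var name; prefer env vars over hard-coded keys
MIN_INTERVAL = 60.0 / 40                        # 40 requests per minute
_last_call = 0.0

def rate_limited_embed(item, embed_one, max_retries=3):
    """Call embed_one(item, api_key=...) under the rate limit, retrying on failure.

    embed_one is a placeholder for the project's real NIM API call.
    """
    global _last_call
    for attempt in range(1, max_retries + 1):
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        try:
            return embed_one(item, api_key=API_KEY)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
```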
The interface/ folder contains the web interface components:
- backend/ - API server (to be implemented)
- frontend/ - Web UI (to be implemented)
This will provide a user-friendly interface for:
- Uploading PDFs
- Viewing extraction results
- Performing semantic searches
- Browsing figures and tables
Neurosci/
├── readme.md # This file - project documentation
├── requirements.txt # Python dependencies
│
├── Extraction/ # PDF extraction pipeline
│ ├── main.py # Main entry point (use this)
│ ├── YoloExtraction.py # Legacy file (deprecated)
│ ├── detection.py # YOLO detection logic
│ ├── figure_processor.py # Figure extraction
│ ├── table_processor.py # Table extraction
│ ├── text_processor.py # Text/paragraph extraction
│ ├── context_extractor.py # Reference matching
│ ├── ocr_reader.py # OCR interface
│ ├── TableExtraction.py # LLM Whisperer API
│ ├── docYolo.pt # Trained YOLO model
│ ├── input_pdfs/ # Input folder for PDFs
│ ├── output_results/ # Extraction output (gitignored)
│ └── Information/ # Documentation
│ ├── OCR_SETUP.md # OCR setup guide
│ └── DocLayoutReadme.md # YOLO reference docs
│
├── VectorDB/ # Embedding & search system
│ ├── FigPipeline.py # Figure embedding pipeline
│ ├── TextPipeline.py # Text embedding pipeline
│ ├── CreateCollection.py # Qdrant collection setup
│ ├── EmbedFigures.py # Figure embedding logic
│ ├── EmbedText.py # Text embedding logic
│ ├── FindSimilarContent.py # Similarity search
│ └── SimSearch.py # Search utilities
│
├── interface/ # Web interface (in development)
│ ├── backend/ # API server
│ └── frontend/ # Web UI
│
└── tests/ # Test files
Here's the complete workflow from PDF to searchable database:
- Setup Environment
  - Install Python 3.10, create venv
  - Install Tesseract OCR
  - Install dependencies:
  pip install -r requirements.txt
- Extract Content from PDFs
  cd Extraction
  python main.py
  - Processes PDFs in input_pdfs/
  - Outputs to output_results/
- Setup Vector Database
  docker run -p 6333:6333 qdrant/qdrant
- Generate Embeddings
  cd VectorDB
  python FigPipeline.py   # Embed figures
  python TextPipeline.py  # Embed text
- Search & Query
  - Use FindSimilarContent.py for semantic search
  - Or use the web interface (when available)
- Document Analysis: DocLayout-YOLO v10
- OCR: Tesseract (primary), PaddleOCR/EasyOCR (alternatives)
- PDF Processing: PyMuPDF (fitz)
- Table Extraction: LLM Whisperer API
- Image Processing: OpenCV, Pillow
- Embeddings: NVIDIA NIM API (Llama 3.2 NeMoRetriever)
- Vector Database: Qdrant
- Deep Learning: PyTorch, torchvision
- The pipeline processes PDFs in batches to prevent memory issues with large documents
- PDF names are sanitized to remove invalid characters and spaces
- All extraction results are cached where possible to speed up re-processing
- Embeddings include automatic rate limiting and retry logic
- Do not push changes to input_pdfs or output_results unless given permission