A comprehensive document analysis pipeline for extracting figures, tables, and text from scientific PDFs using state-of-the-art computer vision and OCR technologies.
This project uses DocLayout-YOLO for document layout detection and Tesseract OCR for text extraction to process neuroscience research papers. The pipeline automatically extracts figures with captions, tables with structure, and plain text with intelligent context matching.
First make sure you have Python 3.10 (any version of 3.10 can work, but 3.10.11 is the easiest to install).
Install Python 3.10.11 here: https://www.python.org/downloads/release/python-31011/
You can use conda for this step as well, but venv is recommended for this project. In this step we will set up a virtual environment and then activate it.
Make sure to run these commands in the Neurosci directory in the terminal
Windows:
py -3.10 -m venv venv
.\venv\Scripts\Activate
Linux:
python3.10 -m venv venv
source venv/bin/activate
You only need to run the venv creation command once. For the rest of the project, run the activate command to enter the venv you created.
This project uses Tesseract OCR (tested at 100% accuracy in 30 runs). You must install the Tesseract executable separately from Python packages.
Windows:
- Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
- Or use chocolatey:
choco install tesseract
- Or use scoop:
scoop install tesseract
Linux:
sudo apt-get install tesseract-ocr
After installation, verify it works:
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
See Extraction/Information/OCR_SETUP.md for detailed OCR setup instructions and alternative engines.
Before you install, make sure your version of pip is up to date:
python.exe -m pip install --upgrade pip
Next, make sure you are in the root of the project and that the virtual environment (venv) is activated. We will now install the required packages and libraries for this project:
pip install -r requirements.txt
The extraction pipeline processes PDF documents and extracts structured data including figures, tables, and text with intelligent context matching.
- Make sure you followed the setup steps above and your venv is activated
- Navigate to the extraction directory:
cd Extraction
- Place your PDF files in the input_pdfs folder (sample papers are already included for testing)
- Run the extraction:
python main.py
Note: YoloExtraction.py is the legacy file. The current pipeline uses main.py.
The pipeline follows these steps for each PDF:
- Uses DocLayout-YOLO (v10) to detect document elements on each page (a detection sketch follows this list)
- Detects: figures, tables, plain text, figure captions, table captions, table footnotes
- Applies confidence thresholds and filters to remove noise
- Lower confidence threshold (0.15) for captions to catch wide/short caption boxes
- Extracts figure images from detected regions
- Matches figures with their captions using proximity-based algorithm
- Saves figure images and captions separately
- Extracts reference tokens (e.g., "Figure 1", "Fig. 2A") from captions
- Can aggregate all figures into a single master folder if configured
- Extracts table images from detected regions
- Matches tables with captions and footnotes
- Uses LLM Whisperer API to extract structured table content
- Caches table extractions to avoid redundant API calls
- Saves table images, captions, footnotes, and extracted text
- Extracts plain text regions using Tesseract OCR
- Detects multi-column layouts automatically
- Segments text into paragraphs with intelligent merging
- Preserves document structure and reading order
- Saves each paragraph as a separate file
- Matches paragraphs that reference figures/tables with their corresponding items
- Recognizes diverse reference patterns: "Figure 1", "Fig. 2A-C", "Table I", "see Fig 3" (a pattern-matching sketch follows this list)
- Handles panel references (A, B, C), roman numerals (I, II, III), and multi-figure citations
- Updates context continuously as each page is processed
- Final comprehensive update after all pages are done
- Unified OCR interface supporting multiple engines
- Primary engine: Tesseract (best accuracy for scientific papers)
- Alternative engines available: PaddleOCR, EasyOCR
- Handles orientation correction for scanned documents
- Configurable in main.py by changing the OCR_ENGINE variable
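For orientation, here is a minimal detection sketch using the doclayout_yolo package; the page-image path and the 0.25 confidence value are placeholders, and detection.py applies its own thresholds and filtering:

```python
from doclayout_yolo import YOLOv10

# Assumes this runs from the Extraction/ directory where docYolo.pt lives;
# the page-image path below is a placeholder.
model = YOLOv10("docYolo.pt")
results = model.predict("output_results/sample/pages/page_1.png", imgsz=1024, conf=0.25)

for box in results[0].boxes:
    label = results[0].names[int(box.cls)]   # e.g. "figure", "table", "plain text"
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # region in pixel coordinates
    print(f"{label}: conf={float(box.conf):.2f}, bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```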
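The reference matching can be approximated with a couple of regular expressions. This is a simplified sketch only; context_extractor.py presumably covers more variants than shown here:

```python
import re

# Simplified patterns - the real context_extractor.py likely handles more cases.
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.?)\s*(\d+[A-Z]?(?:\s*-\s*[A-Z\d]+)?)", re.IGNORECASE)
TABLE_REF = re.compile(r"\bTable\s+([IVXLC]+|\d+)\b", re.IGNORECASE)

paragraph = "Spike rates increased (see Fig. 2A-C), consistent with Table II."
print(FIGURE_REF.findall(paragraph))  # ['2A-C']
print(TABLE_REF.findall(paragraph))   # ['II']
```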
After extraction, results are organized in the output_results folder (ignored by git):
output_results/
└── [paper_name]/
├── pages/ # Rendered page images
├── detections/ # Detection visualizations
├── figures/ # Extracted figures
│ └── page_X_figure_Y/
│ ├── figure.png
│ ├── caption.txt
│ └── context.json # Paragraphs referencing this figure
├── tables/ # Extracted tables
│ └── page_X_table_Y/
│ ├── table.png
│ ├── caption.txt
│ ├── footnote.txt
│ ├── table_text.txt # Structured table content
│ └── context.json # Paragraphs referencing this table
└── text/ # Extracted text
└── page_X/
├── paragraph_1.txt
├── paragraph_2.txt
└── ...
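If you want to consume these results from your own scripts, a simple walk over the figure folders could look like the sketch below; the exact schema of context.json is whatever the pipeline wrote, so it is only loaded and counted here:

```python
import json
from pathlib import Path

# Assumes you are in the Extraction/ directory after main.py has run.
results_root = Path("output_results")

for fig_dir in sorted(results_root.glob("*/figures/page_*_figure_*")):
    caption_file = fig_dir / "caption.txt"
    context_file = fig_dir / "context.json"

    caption = caption_file.read_text(encoding="utf-8").strip() if caption_file.exists() else "(no caption)"
    print(f"{fig_dir.parent.parent.name} / {fig_dir.name}: {caption[:80]}")

    if context_file.exists():
        # context.json holds the paragraphs referencing this figure; no schema is assumed here.
        context = json.loads(context_file.read_text(encoding="utf-8"))
        print(f"  context entries: {len(context)}")
```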
Edit main.py to customize:
# Input/Output
INPUT_FOLDER = "input_pdfs"
OUTPUT_FOLDER = "output_results"
# Performance
BATCH_SIZE = 10 # Pages per batch to manage memory
# Features
AGGREGATE_FIGURES_INTO_ONE_FOLDER = True # Collect all figures in one place
# OCR Engine
OCR_ENGINE = "tesseract" # Options: "tesseract", "paddle", "easyocr"
- DocLayout-YOLO: State-of-the-art document layout detection model
- Tesseract OCR: Industry-standard OCR engine (100% accuracy on test set)
- PyMuPDF (fitz): PDF rendering and manipulation (see the rendering sketch after this list)
- LLM Whisperer: Table structure extraction API
- OpenCV: Image processing and manipulation
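For reference, page rendering with PyMuPDF (the library behind the pages/ output) comes down to a few calls; the input path and DPI below are illustrative, and main.py handles this per input file:

```python
import fitz  # PyMuPDF

# Render every page of one PDF to a PNG (path and DPI are illustrative).
doc = fitz.open("input_pdfs/sample_paper.pdf")
for page_number, page in enumerate(doc, start=1):
    pixmap = page.get_pixmap(dpi=200)   # rasterize the page
    pixmap.save(f"page_{page_number}.png")
doc.close()
```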
After extraction, the project uses a vector database for semantic search across figures and text content.
The project uses Qdrant as the vector database. Install and run Qdrant:
Docker (Recommended):
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
Alternative: Download binary from https://qdrant.tech/
Qdrant will be available at http://localhost:6333
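To confirm the database is reachable from Python before running the embedding pipelines, a quick check with the qdrant-client package (assuming it is installed via requirements.txt) is:

```python
from qdrant_client import QdrantClient

# Quick health check against the local instance started above.
client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # no collections yet on a fresh install
```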
Located in the VectorDB/ folder, the embedding system generates semantic embeddings using NVIDIA NIM API (Llama 3.2 NeMoRetriever).
- CreateCollection.py - Initializes Qdrant collections (a collection-creation sketch follows this list)
  - Creates vector collections with 2048 dimensions
  - Uses cosine similarity for search
- FigPipeline.py - Figure embedding pipeline
  - Processes extracted figures from output_results/
  - Generates multimodal embeddings (image + caption)
  - Stores in Qdrant with figure metadata
  - Rate-limited to 40 requests/minute
- TextPipeline.py - Text embedding pipeline
  - Processes extracted paragraphs
  - Generates text embeddings
  - Stores with full context metadata
- EmbedFigures.py / EmbedText.py - Core embedding functions
  - Interface with NVIDIA NIM API
  - Automatic retry logic and rate limiting
  - Error handling for API failures
- FindSimilarContent.py / SimSearch.py - Semantic search
  - Query vector database with text or images
  - Returns most similar content with relevance scores
  - Supports both figure and text search
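As a sketch of what the collection setup involves, creating a 2048-dimensional cosine-similarity collection with qdrant-client looks roughly like this; the collection name is a placeholder, and CreateCollection.py defines the real ones:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# "figures" is a placeholder - CreateCollection.py defines the actual collection names.
client.create_collection(
    collection_name="figures",
    vectors_config=VectorParams(size=2048, distance=Distance.COSINE),
)
```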
After extraction is complete:
cd VectorDB
# Create/recreate collection and embed figures
python FigPipeline.py
# Embed text content
python TextPipeline.py
# Search for similar content
python FindSimilarContent.py
The project uses NVIDIA's NIM API for multimodal embeddings:
- Model: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
- Supports both text and image inputs
- 2048-dimensional embeddings
- Rate limit: 40 requests/minute
Note: API key is configured in the embedding scripts. For production, use environment variables.
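The embedding scripts pair this rate limit with retry logic. A generic, hedged sketch of that pattern follows; embed_one and the NVIDIA_API_KEY variable name are stand-ins, not the project's actual function or configuration:

```python
import os
import time

API_KEY = os.environ.get("NVIDIA_API_KEY", "")  # assumed env var name; prefer env vars over hard-coded keys
MIN_INTERVAL = 60.0 / 40                        # 40 requests per minute
_last_call = 0.0

def rate_limited_embed(item, embed_one, max_retries=3):
    """Call embed_one(item, api_key=...) under the rate limit, retrying on failure.

    embed_one is a placeholder for the project's real NIM API call.
    """
    global _last_call
    for attempt in range(1, max_retries + 1):
        wait = MIN_INTERVAL - (time.monotonic() - _last_call)
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
        try:
            return embed_one(item, api_key=API_KEY)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
```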
The interface/ folder contains the web interface components:
- backend/ - API server (to be implemented)
- frontend/ - Web UI (to be implemented)
This will provide a user-friendly interface for:
- Uploading PDFs
- Viewing extraction results
- Performing semantic searches
- Browsing figures and tables
Neurosci/
├── readme.md # This file - project documentation
├── requirements.txt # Python dependencies
│
├── Extraction/ # PDF extraction pipeline
│ ├── main.py # Main entry point (use this)
│ ├── YoloExtraction.py # Legacy file (deprecated)
│ ├── detection.py # YOLO detection logic
│ ├── figure_processor.py # Figure extraction
│ ├── table_processor.py # Table extraction
│ ├── text_processor.py # Text/paragraph extraction
│ ├── context_extractor.py # Reference matching
│ ├── ocr_reader.py # OCR interface
│ ├── TableExtraction.py # LLM Whisperer API
│ ├── docYolo.pt # Trained YOLO model
│ ├── input_pdfs/ # Input folder for PDFs
│ ├── output_results/ # Extraction output (gitignored)
│ └── Information/ # Documentation
│ ├── OCR_SETUP.md # OCR setup guide
│ └── DocLayoutReadme.md # YOLO reference docs
│
├── VectorDB/ # Embedding & search system
│ ├── FigPipeline.py # Figure embedding pipeline
│ ├── TextPipeline.py # Text embedding pipeline
│ ├── CreateCollection.py # Qdrant collection setup
│ ├── EmbedFigures.py # Figure embedding logic
│ ├── EmbedText.py # Text embedding logic
│ ├── FindSimilarContent.py # Similarity search
│ └── SimSearch.py # Search utilities
│
├── interface/ # Web interface (in development)
│ ├── backend/ # API server
│ └── frontend/ # Web UI
│
└── tests/ # Test files
Here's the complete workflow from PDF to searchable database:
- Setup Environment
  - Install Python 3.10, create venv
  - Install Tesseract OCR
  - Install dependencies:
  pip install -r requirements.txt
- Extract Content from PDFs
  cd Extraction
  python main.py
  - Processes PDFs in input_pdfs/
  - Outputs to output_results/
- Setup Vector Database
  docker run -p 6333:6333 qdrant/qdrant
- Generate Embeddings
  cd VectorDB
  python FigPipeline.py   # Embed figures
  python TextPipeline.py  # Embed text
- Search & Query
  - Use FindSimilarContent.py for semantic search
  - Or use the web interface (when available)
- Document Analysis: DocLayout-YOLO v10
- OCR: Tesseract (primary), PaddleOCR/EasyOCR (alternatives)
- PDF Processing: PyMuPDF (fitz)
- Table Extraction: LLM Whisperer API
- Image Processing: OpenCV, Pillow
- Embeddings: NVIDIA NIM API (Llama 3.2 NeMoRetriever)
- Vector Database: Qdrant
- Deep Learning: PyTorch, torchvision
- The pipeline processes PDFs in batches to prevent memory issues with large documents
- PDF names are sanitized to remove invalid characters and spaces
- All extraction results are cached where possible to speed up re-processing
- Embeddings include automatic rate limiting and retry logic
- Do not push changes to input_pdfs or output_results unless given permission