Neuroscience TBI Project

A comprehensive document analysis pipeline for extracting figures, tables, and text from scientific PDFs using state-of-the-art computer vision and OCR technologies.

Overview

This project uses DocLayout-YOLO for document layout detection and Tesseract OCR for text extraction to process neuroscience research papers. The pipeline automatically extracts figures with captions, tables with structure, and plain text with intelligent context matching.

Configuration and Setup

Install Python 3.10

First, make sure you have Python 3.10 installed (any 3.10.x release works; 3.10.11 is the easiest to install).

Download Python 3.10.11 here: https://www.python.org/downloads/release/python-31011/

Setup Virtual Environment

You can use conda for this step as well, but venv is recommended for this project. In this step we will create the virtual environment and then activate it.

Make sure to run these commands from the Neurosci directory in your terminal.

Windows:

py -3.10 -m venv venv
.\venv\Scripts\Activate

Linux:

python3.10 -m venv venv
source venv/bin/activate

You only need to run the venv creation command once. For the rest of the project, run the activate command to enter the venv you created.

Install Tesseract OCR

This project uses Tesseract OCR (it scored 100% accuracy across 30 test runs). The Tesseract executable must be installed separately from the Python packages.

Windows:

Download and run the Tesseract installer (for example, the UB Mannheim build: https://github.com/UB-Mannheim/tesseract/wiki), then make sure tesseract.exe is on your PATH.

Linux:

sudo apt-get install tesseract-ocr

After installation, verify it works:

python -c "import pytesseract; print(pytesseract.get_tesseract_version())"

See Extraction/Information/OCR_SETUP.md for detailed OCR setup instructions and alternative engines.

Installing Requirements

Before you install, make sure your version of pip is up to date:

python -m pip install --upgrade pip

Next, make sure you are in the project root with the virtual environment activated, then install the required packages and libraries:

pip install -r requirements.txt

Extraction Pipeline

The extraction pipeline processes PDF documents and extracts structured data including figures, tables, and text with intelligent context matching.

Quick Start

  1. Make sure you followed the setup steps above and your venv is activated
  2. Navigate to the extraction directory:

cd Extraction

  3. Place your PDF files in the input_pdfs folder (sample papers are already included for testing)
  4. Run the extraction:

python main.py

Note: YoloExtraction.py is the legacy file. The current pipeline uses main.py.

How the Extraction Works

The pipeline follows these steps for each PDF:

1. Document Layout Detection (detection.py)

  • Uses DocLayout-YOLO (v10) to detect document elements on each page
  • Detects: figures, tables, plain text, figure captions, table captions, table footnotes
  • Applies confidence thresholds and filters to remove noise
  • Lower confidence threshold (0.15) for captions to catch wide/short caption boxes
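
A minimal sketch of this detection step, assuming the doclayout_yolo package's YOLOv10 interface (class names and the threshold handling are illustrative; the actual logic lives in detection.py):

from doclayout_yolo import YOLOv10

model = YOLOv10("docYolo.pt")  # trained model shipped with the project

def detect_layout(page_image, conf=0.25, caption_conf=0.15):
    # Run at the lower threshold, then re-filter per class below.
    results = model.predict(page_image, imgsz=1024, conf=caption_conf)
    detections = []
    for box in results[0].boxes:
        cls_name = results[0].names[int(box.cls)]
        score = float(box.conf)
        # Captions keep the lower threshold; everything else uses the stricter one.
        threshold = caption_conf if "caption" in cls_name else conf
        if score >= threshold:
            detections.append({"class": cls_name, "score": score,
                               "xyxy": box.xyxy[0].tolist()})
    return detections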

2. Figure Processing (figure_processor.py)

  • Extracts figure images from detected regions
  • Matches figures with their captions using a proximity-based algorithm (see the sketch below)
  • Saves figure images and captions separately
  • Extracts reference tokens (e.g., "Figure 1", "Fig. 2A") from captions
  • Can aggregate all figures into a single master folder if configured
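
Proximity matching can be as simple as picking the nearest caption box with horizontal overlap; a hypothetical sketch (the real matching lives in figure_processor.py):

def match_caption(figure_box, caption_boxes):
    # Boxes are (x0, y0, x1, y1). Captions usually sit just below a figure,
    # so rank candidates by vertical gap and require horizontal overlap.
    def distance(fig, cap):
        fx0, fy0, fx1, fy1 = fig
        cx0, cy0, cx1, cy1 = cap
        gap = cy0 - fy1 if cy0 >= fy1 else fy0 - cy1  # below vs. above the figure
        overlap = min(fx1, cx1) - max(fx0, cx0)
        return gap if overlap > 0 else float("inf")   # reject non-overlapping boxes
    candidates = [c for c in caption_boxes if distance(figure_box, c) != float("inf")]
    return min(candidates, key=lambda c: distance(figure_box, c)) if candidates else None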

3. Table Processing (table_processor.py)

  • Extracts table images from detected regions
  • Matches tables with captions and footnotes
  • Uses LLM Whisperer API to extract structured table content
  • Caches table extractions to avoid redundant API calls
  • Saves table images, captions, footnotes, and extracted text

4. Text Processing (text_processor.py)

  • Extracts plain text regions using Tesseract OCR
  • Detects multi-column layouts automatically
  • Segments text into paragraphs with intelligent merging
  • Preserves document structure and reading order
  • Saves each paragraph as a separate file
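
The per-region OCR boils down to cropping each detected text box and handing it to Tesseract; a minimal sketch using pytesseract (the function name is illustrative):

import pytesseract
from PIL import Image

def ocr_region(page_image: Image.Image, xyxy):
    # Crop the detected plain-text box from the rendered page image.
    region = page_image.crop(tuple(int(v) for v in xyxy))
    # --psm 6 tells Tesseract to treat the crop as a single uniform block of text.
    return pytesseract.image_to_string(region, config="--psm 6").strip()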

5. Context Extraction (context_extractor.py)

  • Matches paragraphs that reference figures/tables with their corresponding items
  • Recognizes diverse reference patterns: "Figure 1", "Fig. 2A-C", "Table I", "see Fig 3"
  • Handles panel references (A, B, C), roman numerals (I, II, III), and multi-figure citations
  • Updates context continuously as each page is processed
  • Final comprehensive update after all pages are done
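
A sketch of the kind of pattern matching involved (the actual patterns in context_extractor.py may be broader):

import re

# Matches "Figure 1", "Fig. 2A-C", "Table I", "see Fig 3", panel letters, roman numerals.
REF_PATTERN = re.compile(
    r"\b(Fig(?:ure)?s?\.?|Table)\s+"           # "Figure", "Fig.", "Figs", "Table"
    r"([0-9]+[A-Z]?(?:-[A-Z])?|[IVX]+)",       # "1", "2A", "2A-C", or roman "I", "IV"
    re.IGNORECASE,
)

def find_references(paragraph):
    # Normalize matches like ("Fig.", "2a-c") into tokens like "Fig 2A-C".
    return [f"{kind.rstrip('.').capitalize()} {num.upper()}"
            for kind, num in REF_PATTERN.findall(paragraph)]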

6. OCR Processing (ocr_reader.py)

  • Unified OCR interface supporting multiple engines
  • Primary engine: Tesseract (best accuracy for scientific papers)
  • Alternative engines available: PaddleOCR, EasyOCR
  • Handles orientation correction for scanned documents
  • Configurable in main.py by changing OCR_ENGINE variable
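
A rough illustration of how the engine switch can work (a sketch, not the actual ocr_reader.py code; the PaddleOCR and EasyOCR APIs vary by version):

def make_reader(engine="tesseract"):
    if engine == "tesseract":
        import pytesseract
        return lambda img: pytesseract.image_to_string(img)
    if engine == "easyocr":
        import easyocr
        reader = easyocr.Reader(["en"])
        # readtext returns (bbox, text, confidence) tuples
        return lambda img: "\n".join(text for _, text, _ in reader.readtext(img))
    if engine == "paddle":
        from paddleocr import PaddleOCR
        ocr = PaddleOCR(lang="en")
        return lambda img: "\n".join(line[1][0] for line in ocr.ocr(img)[0])
    raise ValueError(f"Unknown OCR engine: {engine}")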

Output Structure

After extraction, results are organized in the output_results folder (ignored by git):

output_results/
└── [paper_name]/
    ├── pages/               # Rendered page images
    ├── detections/          # Detection visualizations
    ├── figures/            # Extracted figures
    │   └── page_X_figure_Y/
    │       ├── figure.png
    │       ├── caption.txt
    │       └── context.json  # Paragraphs referencing this figure
    ├── tables/             # Extracted tables
    │   └── page_X_table_Y/
    │       ├── table.png
    │       ├── caption.txt
    │       ├── footnote.txt
    │       ├── table_text.txt  # Structured table content
    │       └── context.json    # Paragraphs referencing this table
    └── text/               # Extracted text
        └── page_X/
            ├── paragraph_1.txt
            ├── paragraph_2.txt
            └── ...

Configuration Options

Edit main.py to customize:

# Input/Output
INPUT_FOLDER = "input_pdfs"
OUTPUT_FOLDER = "output_results"

# Performance
BATCH_SIZE = 10  # Pages per batch to manage memory

# Features
AGGREGATE_FIGURES_INTO_ONE_FOLDER = True  # Collect all figures in one place

# OCR Engine
OCR_ENGINE = "tesseract"  # Options: "tesseract", "paddle", "easyocr"

Notes

  • The pipeline processes PDFs in batches to prevent memory issues with large documents
  • PDF names are sanitized to remove invalid characters and spaces
  • All extraction results are cached where possible to speed up re-processing
  • Do not push changes to input_pdfs or output_results unless given permission

Vector Database & Semantic Search

After extraction, the project uses a vector database for semantic search across figures and text content.

Setup Qdrant

The project uses Qdrant as the vector database. Install and run Qdrant:

Docker (Recommended):

docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant

Alternative: Download binary from https://qdrant.tech/

Qdrant will be available at http://localhost:6333
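
A quick way to confirm Qdrant is reachable from Python, assuming the qdrant-client package from requirements.txt:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # lists existing collections (empty on a fresh instance)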

Embedding Pipeline

Located in the VectorDB/ folder, the embedding system generates semantic embeddings using the NVIDIA NIM API (Llama 3.2 NeMoRetriever).

Components:

  1. CreateCollection.py - Initializes Qdrant collections (see the sketch after this list)

    • Creates vector collections with 2048 dimensions
    • Uses cosine similarity for search
  2. FigPipeline.py - Figure embedding pipeline

    • Processes extracted figures from output_results/
    • Generates multimodal embeddings (image + caption)
    • Stores in Qdrant with figure metadata
    • Rate-limited to 40 requests/minute
  3. TextPipeline.py - Text embedding pipeline

    • Processes extracted paragraphs
    • Generates text embeddings
    • Stores with full context metadata
  4. EmbedFigures.py / EmbedText.py - Core embedding functions

    • Interface with NVIDIA NIM API
    • Automatic retry logic and rate limiting
    • Error handling for API failures
  5. FindSimilarContent.py / SimSearch.py - Semantic search

    • Query vector database with text or images
    • Returns most similar content with relevance scores
    • Supports both figure and text search
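
Roughly what the collection setup in CreateCollection.py amounts to, assuming qdrant-client (the collection name here is illustrative):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="figures",  # assumed name; match whatever the pipelines expect
    vectors_config=VectorParams(size=2048, distance=Distance.COSINE),
)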

Running the Embedding Pipeline

After extraction is complete:

cd VectorDB

# Create/recreate collection and embed figures
python FigPipeline.py

# Embed text content
python TextPipeline.py

# Search for similar content
python FindSimilarContent.py

NVIDIA NIM API

The project uses NVIDIA's NIM API for multimodal embeddings:

  • Model: nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
  • Supports both text and image inputs
  • 2048-dimensional embeddings
  • Rate limit: 40 requests/minute

Note: the API key is currently hard-coded in the embedding scripts. For production, load it from an environment variable instead (see the sketch below).
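
A minimal pattern for that (NVIDIA_API_KEY is an assumed variable name; match whatever the scripts expect):

import os

api_key = os.environ["NVIDIA_API_KEY"]  # raises KeyError if unset, failing fast
headers = {"Authorization": f"Bearer {api_key}"}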

Interface (In Development)

The interface/ folder contains the web interface components:

  • backend/ - API server (to be implemented)
  • frontend/ - Web UI (to be implemented)

This will provide a user-friendly interface for:

  • Uploading PDFs
  • Viewing extraction results
  • Performing semantic searches
  • Browsing figures and tables

Project Structure Summary

Neurosci/
├── readme.md                    # This file - project documentation
├── requirements.txt             # Python dependencies
│
├── Extraction/                  # PDF extraction pipeline
│   ├── main.py                 # Main entry point (use this)
│   ├── YoloExtraction.py       # Legacy file (deprecated)
│   ├── detection.py            # YOLO detection logic
│   ├── figure_processor.py     # Figure extraction
│   ├── table_processor.py      # Table extraction
│   ├── text_processor.py       # Text/paragraph extraction
│   ├── context_extractor.py    # Reference matching
│   ├── ocr_reader.py           # OCR interface
│   ├── TableExtraction.py      # LLM Whisperer API
│   ├── docYolo.pt              # Trained YOLO model
│   ├── input_pdfs/             # Input folder for PDFs
│   ├── output_results/         # Extraction output (gitignored)
│   └── Information/            # Documentation
│       ├── OCR_SETUP.md        # OCR setup guide
│       └── DocLayoutReadme.md  # YOLO reference docs
│
├── VectorDB/                    # Embedding & search system
│   ├── FigPipeline.py          # Figure embedding pipeline
│   ├── TextPipeline.py         # Text embedding pipeline
│   ├── CreateCollection.py     # Qdrant collection setup
│   ├── EmbedFigures.py         # Figure embedding logic
│   ├── EmbedText.py            # Text embedding logic
│   ├── FindSimilarContent.py   # Similarity search
│   └── SimSearch.py            # Search utilities
│
├── interface/                   # Web interface (in development)
│   ├── backend/                # API server
│   └── frontend/               # Web UI
│
└── tests/                       # Test files

Workflow: Complete Pipeline

Here's the complete workflow from PDF to searchable database:

  1. Setup Environment

    • Install Python 3.10, create venv
    • Install Tesseract OCR
    • Install dependencies: pip install -r requirements.txt
  2. Extract Content from PDFs

    cd Extraction
    python main.py
    • Processes PDFs in input_pdfs/
    • Outputs to output_results/
  3. Setup Vector Database

    docker run -p 6333:6333 qdrant/qdrant
  4. Generate Embeddings

    cd VectorDB
    python FigPipeline.py  # Embed figures
    python TextPipeline.py # Embed text
  5. Search & Query

    • Use FindSimilarContent.py for semantic search
    • Or use the web interface (when available)

Key Technologies Used

  • Document Analysis: DocLayout-YOLO v10
  • OCR: Tesseract (primary), PaddleOCR/EasyOCR (alternatives)
  • PDF Processing: PyMuPDF (fitz)
  • Table Extraction: LLM Whisperer API
  • Image Processing: OpenCV, Pillow
  • Embeddings: NVIDIA NIM API (Llama 3.2 NeMoRetriever)
  • Vector Database: Qdrant
  • Deep Learning: PyTorch, torchvision

