Free bugs included
A Python-based tool for extracting, validating, and processing text from PDF documents with advanced NLP capabilities and Markdown conversion.
PDF Text Extractor is a comprehensive solution for transforming PDF documents into clean, structured text. It handles various PDF types including scanned documents and employs NLP techniques to correct common extraction issues like broken words, hyphenation problems, and OCR errors. The tool also provides conversion to Markdown format to make documentation more accessible.
- Versatile PDF Text Extraction: Support for both digital and scanned PDFs
- OCR Integration: Extract text from images and scanned documents
- Advanced Text Processing:
  - Chapter and section detection
  - Table recognition and formatting
  - Document structure analysis
- NLP-powered Text Correction:
  - Fix broken words and spaced text (like "D u n g e o n s")
  - Correct hyphenation issues
  - Repair OCR errors
  - Identify and normalize document structure
- Markdown Conversion: Convert extracted content to well-formatted Markdown
- Multi-format Output: Export to plain text, JSON, YAML, or Markdown
- Interactive CLI: User-friendly command-line interface for file selection and processing
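The spaced-text fix mentioned above can be approximated with a small regex heuristic. This is a sketch of the idea only, not the tool's actual implementation:

```python
import re

def collapse_spaced_words(text: str) -> str:
    """Collapse runs of single letters separated by spaces,
    e.g. "D u n g e o n s" -> "Dungeons"."""
    # Match three or more single letters separated by single spaces
    pattern = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)
```

The real NLP pipeline additionally consults word lists to avoid joining letters that are genuinely separate words.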
Using uv:

```bash
git clone https://github.com/traagel/pdf-extractor.git
cd pdf-text-extractor
uv venv --python 3.11.11
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

# With development dependencies
uv pip install -e ".[dev]"
```
Using pip:

```bash
git clone https://github.com/traagel/pdf-extractor.git
cd pdf-text-extractor
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -e .

# With development dependencies
pip install -e ".[dev]"
```
- Python 3.8+ (Python 3.11 recommended)
- Dependencies:
  - PyMuPDF (fitz)
  - PyPDF
  - pytesseract
  - spaCy (with en_core_web_sm model)
  - PyYAML
  - tqdm
  - inquirer
- Tesseract OCR engine (for OCR functionality)
- Ubuntu/Debian: `sudo apt-get install tesseract-ocr`
- macOS: `brew install tesseract`
- Windows: Download from UB-Mannheim Tesseract
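OCR features only work when the `tesseract` binary is discoverable on your PATH, since pytesseract shells out to it. A quick sanity check (a small sketch; `tesseract_available` is a hypothetical helper, not part of the project):

```python
import shutil

def tesseract_available() -> bool:
    # pytesseract invokes the `tesseract` executable as a subprocess,
    # so OCR extraction requires the binary to be on PATH
    return shutil.which("tesseract") is not None
```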
After installation, you can use the PDF Text Extractor in several ways:
As a console script for extraction:

```bash
# Interactive mode (will scan input directory and prompt for choices)
pdf-extractor

# Specify input file and format
pdf-extractor --file path/to/document.pdf --format yaml --type lines_chapters
```

For Markdown conversion:

```bash
# Convert extraction output to Markdown
pdf-extractor convert-md path/to/extracted.yaml -o output.md

# With additional options
pdf-extractor convert-md path/to/extracted.json --no-toc --clean-text advanced
```

As a Python module:

```bash
python -m src
```
Or from Python code:

```python
from src.extraction.pdf_extractor import PDFExtractor
from src.processing.text_processor import TextProcessor
from src.converters.markdown_converter import convert_to_markdown

# Extract text from a PDF file
extractor = PDFExtractor()
text = extractor.extract("data/input/document.pdf")

# Process text into structured format
processor = TextProcessor()
structured_content = processor.process(text)

# Convert to Markdown
markdown = convert_to_markdown(structured_content, "output.md", {
    'toc': True,
    'text_cleaning': 'light'
})
```
```
pdf-text-extractor/
│
├── src/               # Source code
│   ├── extraction/    # PDF text extraction modules
│   ├── processing/    # Text processing components
│   ├── nlp/           # Natural Language Processing components
│   ├── converters/    # Format conversion tools
│   ├── validation/    # Text validation tools
│   └── utils/         # Utility functions and helpers
│
├── tests/             # Unit and integration tests
├── data/              # Directory for input/output data and models
│   ├── input/         # Place PDF files here for processing
│   ├── output/        # Extracted and processed texts are saved here
│   └── resources/     # NLP resources and word lists
│
├── config/            # Configuration files
└── docs/              # Documentation
```
- Document Loading - Load PDF from file
- Extraction - Extract text using PyMuPDF, PyPDF, or OCR
- Line Processing - Split text into clean lines
- Chapter Processing - Identify chapters and sections
- NLP Processing - Clean text and fix common issues
- Validation - Check extraction quality
- Format Conversion - Export to desired format (YAML, JSON, Markdown)
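The chapter-processing step can be sketched with a simple heading heuristic. The regex and function below are illustrative assumptions, not the actual ChapterProcessor logic:

```python
import re

# Hypothetical heuristic: a line starts a chapter if it reads
# "Chapter <number> ..." or is a short ALL-CAPS title
HEADING_RE = re.compile(r"^chapter\s+\d+\b.*$", re.IGNORECASE)

def detect_chapters(lines):
    """Return (line_index, heading_text) pairs for likely chapter starts."""
    chapters = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            chapters.append((i, stripped))
        elif stripped.isupper() and 0 < len(stripped.split()) <= 6:
            chapters.append((i, stripped))
    return chapters
```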
- PDFExtractor: Core PDF text extraction with multiple methods
- ImageTextExtractor: OCR-based extraction for images and scanned PDFs
- TextProcessor: Structures PDF content into organized sections
- ChapterProcessor: Identifies and extracts chapters
- TableProcessor: Recognizes and formats tables in text
- LineProcessor: Handles line-based text processing
- TextStructureAnalyzer: Document structure analysis
- TextCleaner: Fixes common extraction artifacts
- TextValidator: Validates text quality
- MarkdownConverter: Converts structured content to Markdown format
- TextValidator: Checks extraction quality
- SchemaValidator: Validates output against schemas
- Logger: Configurable logging
- FileHandler: File I/O utilities
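As an illustration, a check in the spirit of TextValidator might score extraction quality by the share of readable characters. This is a hypothetical heuristic, not the project's actual validation:

```python
def extraction_quality(text: str) -> float:
    """Rough quality score in [0, 1]: fraction of characters that are
    alphanumeric or common punctuation/whitespace. Control characters
    and encoding debris drag the score down."""
    if not text:
        return 0.0
    ok = sum(1 for c in text if c.isalnum() or c in " \n\t.,;:!?-'\"()")
    return ok / len(text)
```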
The PDF Text Extractor includes a powerful Markdown conversion feature:
```bash
# Convert to Markdown with table of contents
python -m src convert-md data/output/processed/document.yaml

# Convert without table of contents
python -m src convert-md data/output/processed/document.json --no-toc

# Apply advanced text cleaning
python -m src convert-md data/output/processed/document.yaml --clean-text advanced

# Process all files in a directory recursively
python -m src convert-md data/output/processed/ --recursive -o docs/
```
Markdown conversion features:
- Table of contents generation
- Clean formatting of chapters and sections
- Table support
- Text cleaning to fix common OCR artifacts
- Front matter handling
- Custom styling options
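Table-of-contents generation can be sketched as building Markdown anchor links from chapter titles. The anchor rules below are a simplified assumption (GitHub-style slugs), not necessarily what the converter emits:

```python
import re

def make_toc(titles):
    """Build Markdown TOC entries with GitHub-style anchors:
    lowercase the title, drop punctuation, replace spaces with hyphens."""
    lines = []
    for title in titles:
        anchor = re.sub(r"[^a-z0-9 -]", "", title.lower()).replace(" ", "-")
        lines.append(f"- [{title}](#{anchor})")
    return "\n".join(lines)
```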
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or support, please open an issue on the GitHub repository or contact the maintainers.
The PDF Text Extractor organizes output files into different directories based on the level of processing:
- Raw Extraction (`data/output/raw/`): Text extracted directly from PDFs
- Lines (`data/output/lines/`): Text split into lines with basic cleaning
- Lines & Chapters (`data/output/lines_chapters/`): Text organized into chapters and sections
- Processed (`data/output/processed/`): Text after all NLP corrections and enhancements
You can specify the desired output type using the `--type` command-line option or through the interactive prompt:
```bash
# Using the CLI option
pdf-extractor --file document.pdf --format yaml --type lines_chapters

# The interactive mode will prompt you to select the output type
pdf-extractor
```