Skip to content

Convert DOCX to Markdown, Text, and More - Extract charts, tables, images, footnotes, comments, and formatting from Word documents. Pure Python, no dependencies, fast and reliable DOCX converter.

License

Notifications You must be signed in to change notification settings

sudipnext/docx2everything

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

docx2everything

Convert DOCX files to plain text or markdown format with preserved structure.

Installation

pip install docx2everything

Or install from source:

# Modern way (recommended)
pip install .

# Or using setup.py (deprecated but still works)
python setup.py install

Testing Without Installation

The CLI script works directly without installation - no PYTHONPATH needed!

Using CLI (no installation required):

# Extract text
python3 bin/docx2everything demo.docx

# Convert to markdown
python3 bin/docx2everything --markdown demo.docx > output.md

# With images
python3 bin/docx2everything --markdown -i images/ demo.docx > output.md

Using Python:

# Set PYTHONPATH to current directory
PYTHONPATH=. python3 -c "import docx2everything; print(docx2everything.process('demo.docx')[:100])"

In Python script:

import sys
sys.path.insert(0, '/path/to/python-docx2txt')

import docx2everything
text = docx2everything.process('document.docx')

Usage

Command Line

Extract plain text:

docx2everything document.docx

Convert to markdown:

docx2everything --markdown document.docx > output.md

Extract images:

docx2everything -i images/ document.docx

Markdown with images:

docx2everything --markdown -i images/ document.docx > output.md

Python API

import docx2everything

# Extract plain text
text = docx2everything.process("document.docx")

# Convert to markdown
markdown = docx2everything.process_to_markdown("document.docx")

# Extract images
text = docx2everything.process("document.docx", img_dir="images/")

# Markdown with images
markdown = docx2everything.process_to_markdown("document.docx", img_dir="images/")

Features

  • ✅ Plain text extraction
  • ✅ Markdown conversion with preserved structure:
    • Tables → Markdown tables (with merged cells support, alignment hints)
    • Lists → Bulleted/numbered lists (with proper sequence tracking)
    • Headings → Markdown headings (#, ##, ###, etc.) with custom style detection
    • Formatting → Bold, italic, strikethrough
    • Links → Markdown links
    • Images → Markdown image references
    • Footnotes → Markdown footnote references [^1]
    • Endnotes → Markdown endnote references [^1]
    • Comments → Inline HTML comments with author info
    • Charts → Chart placeholders with type and metadata *[Chart: Title (Chart Type)]*
    • Page breaks → HTML comments <!-- Page Break -->
    • Section breaks → HTML comments <!-- Section Break -->
  • ✅ Image extraction
  • ✅ Header and footer support
  • ✅ Custom style detection (parses styles.xml for better heading detection)
  • ✅ Table formatting (column alignment detection and hints)
  • ✅ Robust error handling for malformed DOCX files

Requirements

Python 3.6+

License

MIT License - see LICENSE.txt

About

Convert DOCX to Markdown, Text, and More - Extract charts, tables, images, footnotes, comments, and formatting from Word documents. Pure Python, no dependencies, fast and reliable DOCX converter.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages