
Scrape academic papers by DOI and convert to Markdown with images. Supports ACM, IEEE, and institutional access via cookies.


ahnafnafee/doi-paper-scraper


📄 DOI Paper Scraper — Extract Academic Papers to Markdown 🚀

An automated research paper extraction tool designed for academics, researchers, and developers.

Scrape academic papers from ACM Digital Library, IEEE Xplore, and other publishers using just a DOI. Convert complex academic layouts into structured, clean Markdown with full-text content, LaTeX equations, tables, and high-quality figures.



🌟 Key Features

  • 🎯 Intelligent DOI Resolution: Accepts plain DOIs, doi.org URLs, publisher direct links, or any string containing a Digital Object Identifier.
  • 🛡️ Cloudflare & Anti-Bot Bypass: Leverages pydoll for advanced browser automation to bypass WAFs and access protected content.
  • 📚 Multi-Publisher Support: Built-in specialized scrapers for ACM Digital Library and IEEE Xplore. Easily extensible for Springer, Elsevier, Wiley, and more.
  • 📐 Rich Content Extraction:
    • Preserves full paper hierarchy (Headings, Sub-headings).
    • Automatically converts MathJax/LaTeX equations into Markdown $math$ blocks.
    • Extracts Figures and Tables with original captions and placement.
  • 🔗 Institutional Access Support: Seamlessly navigate paywalls using Institutional Proxy redirection and Browser Cookie injection (supports GMU's EZProxy and others).
  • 📋 Structured Output: Generates clean, text-searchable Markdown files—perfect for research archival, NLP analysis, and building personal knowledge bases.
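The DOI resolution feature above accepts any string that contains a DOI. As a rough illustration of how that can work, here is a minimal sketch in Python. The `extract_doi` function and the `DOI_PATTERN` regex are hypothetical and not taken from this project's source. The pattern follows the commonly used "10.<registrant>/<suffix>" shape and will not cover every DOI ever issued.

```python
import re
from typing import Optional

# Hypothetical pattern (not from the project): "10.", a 4-9 digit registrant
# code, a slash, then a suffix made of characters commonly seen in DOIs.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like substring in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation is usually sentence residue rather than part of the DOI.
    return match.group(0).rstrip(".,;:")
```

With this sketch, `extract_doi("https://doi.org/10.1109/CSCloud-EdgeCom58631.2023.00053")` returns `10.1109/CSCloud-EdgeCom58631.2023.00053`, and a plain DOI such as `10.1145/3746059.3747603` is returned unchanged.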

🛠️ Installation

This project uses the high-performance uv package manager.

# 1. Clone the repository
git clone https://github.com/ahnafnafee/doi-paper-scraper.git
cd doi-paper-scraper

# 2. Install dependencies (creates a virtualenv automatically)
uv sync

🚀 Quick Usage

Extract any paper into Markdown with one command:

# Extract by plain DOI
uv run paper-scrape 10.1145/3746059.3747603

# Extract by DOI URL
uv run paper-scrape "https://doi.org/10.1109/CSCloud-EdgeCom58631.2023.00053"

# Save to a specific directory
uv run paper-scrape [DOI] --output-dir ./my_research

🏫 Accessing Paywalled Content (Institutional Login)

If you have access through a university library (e.g., George Mason University):

  1. Log in to the publisher (IEEE/ACM) through your university's proxy.
  2. Export your session cookies as a JSON file using a browser extension (like Cookie-Editor).
  3. Run the scraper with the cookies and proxy flag:
uv run paper-scrape [DOI] --cookies ieee_cookies.json --proxy "https://mutex.gmu.edu/login?qurl=%u"
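To make the `%u` placeholder concrete, the sketch below shows one plausible way a proxy template could be expanded. It assumes the target URL is percent-encoded before substitution, which is what EZProxy's `qurl` parameter expects. The `build_proxy_url` helper is hypothetical and may differ from what the tool does internally.

```python
from urllib.parse import quote


def build_proxy_url(template: str, target_url: str) -> str:
    """Replace `%u` in a proxy template with the percent-encoded target URL."""
    if "%u" not in template:
        raise ValueError("proxy template must contain the %u placeholder")
    # safe="" also encodes ":" and "/", so the target survives as one query value.
    return template.replace("%u", quote(target_url, safe=""))


proxied = build_proxy_url(
    "https://mutex.gmu.edu/login?qurl=%u",
    "https://doi.org/10.1145/3746059.3747603",
)
print(proxied)
# https://mutex.gmu.edu/login?qurl=https%3A%2F%2Fdoi.org%2F10.1145%2F3746059.3747603
```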

💻 CLI Reference

| Option | Shorthand | Description | Default |
| --- | --- | --- | --- |
| --output-dir | -o | Directory where papers and images will be saved. | output/ |
| --cookies | -c | Path to a JSON cookie file for institutional authentication. | None |
| --proxy | -p | Proxy URL template (use %u for target URL). | GMU EZProxy |
| --no-proxy | | Disable the default proxy even if on a supported domain. | False |
| --verbose | -v | Enable detailed logging for debugging. | False |
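For readers who want to wrap or mirror this interface in their own scripts, the options above map naturally onto an `argparse` parser. The sketch below is an illustration of that mapping only, not the project's actual argument handling. The positional `doi` argument name and the `None` default for `--proxy` are assumptions, since the table does not spell out the exact default template string.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented paper-scrape options."""
    parser = argparse.ArgumentParser(prog="paper-scrape")
    parser.add_argument("doi", help="Plain DOI, doi.org URL, or publisher link")
    parser.add_argument("-o", "--output-dir", default="output/",
                        help="Directory where papers and images are saved")
    parser.add_argument("-c", "--cookies", default=None,
                        help="Path to a JSON cookie file")
    # Assumption: the real tool defaults to a GMU EZProxy template; None is
    # used here because the exact default string is not documented above.
    parser.add_argument("-p", "--proxy", default=None,
                        help="Proxy URL template; %%u marks the target URL")
    parser.add_argument("--no-proxy", action="store_true",
                        help="Disable the default proxy")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable detailed logging")
    return parser


args = build_parser().parse_args(
    ["10.1145/3746059.3747603", "-o", "./my_research", "--no-proxy"]
)
```

Here `args.output_dir` is `./my_research`, `args.no_proxy` is `True`, and the unset options keep their defaults.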

📂 Output Structure

The tool organizes extracted data into a clean, portable structure:

output/
├── Quarks_A_Secure_Messaging_Network.md   # Paper text + Markdown formatting
└── images/                                # Extracted figures, diagrams, and tables
    ├── fig_a1b2.png
    └── table_c3d4.gif
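Because the Markdown file and the images folder sit side by side, a quick consistency check is easy to script. The sketch below assumes the generated Markdown references figures with standard `![alt](images/...)` links relative to the `.md` file. That link style is an assumption, and `missing_images` is a hypothetical helper rather than part of the tool.

```python
import re
from pathlib import Path

# Captures the link target of a standard Markdown image: ![alt](target)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def missing_images(markdown_file: Path) -> list[str]:
    """List relative image targets in a Markdown file that do not exist on disk."""
    text = markdown_file.read_text(encoding="utf-8")
    missing = []
    for target in IMAGE_LINK.findall(text):
        if target.startswith(("http://", "https://")):
            continue  # remote images cannot be checked on disk
        if not (markdown_file.parent / target).is_file():
            missing.append(target)
    return missing
```

Calling `missing_images(Path("output/Quarks_A_Secure_Messaging_Network.md"))` would return an empty list when every referenced figure was saved.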

🧬 Why Choose DOI Paper Scraper?

  • Research Portability: Markdown is plain text, so it is much easier to search and edit than a static PDF.
  • Knowledge Graphs: Perfect for importing papers into tools like Obsidian, Logseq, or Notion.
  • NLP Research: Clean text extraction without the "noise" of PDF parsing (extra line breaks, headers/footers).
  • Automation: Designed to be integrated into CI/CD pipelines or batch processing scripts.
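As an example of the batch-processing use case, the following sketch calls the documented CLI once per DOI from Python. It relies only on the `uv run paper-scrape` command and the `--output-dir` flag shown earlier. The helper names are hypothetical, and treating a non-zero exit code as a failure is an assumption about how the tool reports errors.

```python
import subprocess


def build_command(doi: str, output_dir: str = "output/") -> list[str]:
    """Build the argument list for one documented paper-scrape invocation."""
    return ["uv", "run", "paper-scrape", doi, "--output-dir", output_dir]


def scrape_all(dois: list[str], output_dir: str = "output/") -> dict[str, int]:
    """Run the scraper for each DOI in turn and collect the exit codes."""
    results: dict[str, int] = {}
    for doi in dois:
        # check=False lets one failing paper pass without aborting the batch.
        completed = subprocess.run(build_command(doi, output_dir), check=False)
        results[doi] = completed.returncode
    return results
```

For instance, `scrape_all(["10.1145/3746059.3747603"], "./my_research")` runs the scraper from the repository directory and returns a mapping from each DOI to its exit code.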

📜 License

Distributed under the MIT License. Free for academic, personal, and commercial use. 🎓


Developed with ❤️ by Ahnaf Nafee
