DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management & Sharing Plans (DMPs) using a Large Language Model (LLM), such as Llama 3.3.
It supports two modes entirely in Python:
- RAG: Retrieves related guidance from an indexed document collection and uses it to ground the draft. In this mode, the pipeline can ingest documents, build and search an index, and draft a DMP.
- No-RAG: Generates the draft only from the user’s project inputs (no retrieval).
This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.
👉 Learn more: DMP-Chef.
The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in `requirements.txt`. We also retain the DMP template as part of the prompt template used by the DMP generation workflow.
- `dmpchef/api.py` — Public, importable API: `generate()`/`draft()` to produce DMP outputs (Markdown, DOCX, DMPTool JSON, optional PDF); `prepare_nih_corpus()` to prepare NIH reference PDFs for RAG (one-time). A minimal import sketch follows this list.
- `src/core_pipeline.py` — Core generation logic (RAG vs. No-RAG; retrieval → prompt → generate).
- `src/NIH_data_ingestion.py` — NIH/DMPTool ingestion to collect reference PDFs for RAG.
- `main.py` — Command-line entry point for running the pipeline end-to-end.
- `demo_import.ipynb` — Jupyter demo showing how to import the package and run `generate()`.
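The sketch below assumes the names exported from `dmpchef/__init__.py` (`generate`, `draft`, `prepare_nih_corpus`); the keyword arguments are illustrative assumptions modeled on the request-JSON fields documented later, so check `dmpchef/api.py` for the actual signature.

```python
# Sketch only: the exported names come from dmpchef/__init__.py; the keyword
# arguments are assumptions modeled on the request-JSON fields, not the
# confirmed signature (see dmpchef/api.py).
from dmpchef import generate, prepare_nih_corpus

prepare_nih_corpus()  # one-time: collect NIH reference PDFs for RAG

result = generate(
    title="Example Imaging Study",
    funding_agency="NIH",
    use_rag=True,  # False drafts from project inputs only (No-RAG mode)
    inputs={
        "research_context": "Longitudinal MRI study of healthy adults",
        "data_types": "DICOM images and tabular phenotype data",
    },
)
```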
```text
dmpchef/
│── main.py # CLI entry point (run pipeline end-to-end)
│── README.md # Project overview + usage
│── requirements.txt # Python dependencies
│── setup.py # Packaging (editable installs via pip install -e .)
│── pyproject.toml # Build system config (wheel builds)
│── MANIFEST.in # Include non-code files in distributions
│── demo_import.ipynb # Notebook demo: import + run generate()
│── LICENSE
│── .gitignore
│── .env # Local env vars (do not commit)
│
├── dmpchef/ # ✅ Installable Python package (public API)
│ ├── __init__.py # Exports: generate, draft, prepare_nih_corpus
│ └── api.py # Importable API used by notebooks/backends
│
├── config/ # Configuration
│ ├── config.yaml # Main settings (models, paths, retriever params)
│ └── config_schema.py # Validation/schema helpers (optional)
│
├── data/ # Local workspace data + artifacts (not guaranteed in wheel)
│ ├── inputs/ # Templates + examples
│ │ ├── nih-dms-plan-template.docx # NIH blank Word template
│ │ └── input.json # Example request file
│ ├── web_links.json # Seed links for NIH/DMPTool ingestion (used by src/NIH_data_ingestion.py)
│ ├── NIH_95/ # Reference PDFs collected for NIH RAG (optional)
│ ├── index/ # Vector index artifacts (e.g., FAISS)
│ ├── outputs/ # Generated artifacts
│ │ ├── markdown/ # Generated Markdown DMPs
│ │ ├── docx/ # Generated DOCX DMPs (template-preserving)
│ │ ├── json/ # DMPTool-compatible JSON outputs
│ │ ├── pdf/ # Optional PDFs converted from DOCX
│ │ └── debug/ # Optional retrieval debug outputs (retrieved context, logs, etc.)
│ └── data_ingestion/ # Session folders + manifests from crawling
│
├── src/ # Core implementation
│ ├── __init__.py
│ ├── core_pipeline.py # Pipeline logic (RAG/no-RAG)
│ └── NIH_data_ingestion.py # NIH/DMPTool crawl → export PDFs to data/NIH_95
│
├── prompt/ # Prompt templates/utilities
│ └── prompt_library.py
│
├── utils/ # Shared helpers
│ ├── config_loader.py
│ ├── model_loader.py
│ ├── dmptool_json.py
│ └── nih_docx_writer.py
│
├── logger/ # Logging utilities
│ ├── __init__.py
│ └── custom_logger.py
│
├── exception/ # Custom exceptions
│ ├── __init__.py
│ └── custom_exception.py
│
├── notebook_DMP_RAG/ # Notebooks/experiments (non-production)
└── venv/ # Local virtualenv (ignore in git)
```
Clone the repository and open it in your editor:

```bash
git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .
```

Create and activate a virtual environment.

Windows (cmd):

```bash
python -m venv venv
venv\Scripts\activate.bat
```

macOS/Linux:

```bash
python -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
# or (recommended for local dev)
pip install -e .
```

Then run the pipeline:
- Notebook: use `demo_import.ipynb`.
- CLI: use `main.py` (a minimal invocation sketch follows).
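It is not pinned down in this section whether `main.py` takes command-line flags or reads everything from `config/config.yaml` and the request JSON, so treat this invocation as an assumption:

```bash
# Assumption: main.py picks up config/config.yaml and data/inputs/input.json
# on its own; adjust if the CLI exposes explicit flags.
python main.py
```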
- Reference documents: guidance PDFs (and other funder instructions) placed in your configured `paths.data_pdfs` folder. These are used only when `use_rag=true` to retrieve funder-aligned language and examples.
- Request JSON: a single "job request" file (e.g., `data/inputs/input.json`) that tells the pipeline what to generate (an example sketch follows this list). Top-level fields:
  - `title`: Project title (also used for output filenames).
  - `funding_agency`: Funder key (e.g., `NIH`; future-ready for others like `NSF`).
  - `use_rag`: `true`/`false` (optional). If omitted, the pipeline uses the YAML default `rag.enabled`.
  - `inputs`: A dictionary of user/project fields used to draft the plan (free-form keys are allowed). Common examples include `research_context`, `data_types`, `data_source`, `human_subjects`, `consent_status`, `data_volume`, etc.
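A minimal request sketch using the top-level fields above; all values are illustrative:

```json
{
  "title": "Example Imaging Study",
  "funding_agency": "NIH",
  "use_rag": true,
  "inputs": {
    "research_context": "Longitudinal MRI study of healthy adults",
    "data_types": "DICOM images and tabular phenotype data",
    "data_source": "Prospective collection at a single site",
    "human_subjects": "yes",
    "consent_status": "Broad consent for data sharing",
    "data_volume": "~2 TB"
  }
}
```

Given a request like this, the pipeline writes the artifacts listed below.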
- Markdown: the generated funder-aligned DMP narrative (currently NIH structure).
- DOCX: generated using the funder template (NIH template today) to preserve official formatting.
- PDF: created by converting the DOCX (platform-dependent; typically works on Windows/macOS with Word).
- JSON: a DMPTool-compatible JSON file (`*.dmptool.json`).
Notes
- Output filenames include a run suffix to prevent overwriting (sketch below):
  - `__rag__k{top_k}__{llm}` (RAG runs)
  - `__norag__{llm}` (No-RAG runs)
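For illustration, a sketch of how the suffix composes into an output filename (the variable names here are assumptions):

```python
# Sketch of the run-suffix naming scheme; variable names are assumptions.
top_k, llm, use_rag = 5, "llama-3.3", True
suffix = f"__rag__k{top_k}__{llm}" if use_rag else f"__norag__{llm}"
print(f"my-project{suffix}.md")  # -> my-project__rag__k5__llama-3.3.md
```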
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.
If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.