DMP Chef

DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management and Sharing Plans (DMPs) using a Large Language Model (LLM), such as Llama 3.3.

It supports two modes entirely in Python:

RAG: Retrieves related guidance from an indexed document collection and uses it to ground the draft. In this mode, the pipeline can ingest documents, build and search an index, and draft a DMP.
No-RAG: Generates the draft only from the user’s project inputs (no retrieval).
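
The difference between the two modes can be sketched in a few lines. This is only an illustration, not the code in src/core_pipeline.py: the function names (`tokenize`, `retrieve`, `build_prompt`), the word-overlap scoring, and the prompt wording are hypothetical stand-ins for the real index search and prompt template, and the LLM call itself is left out.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query.

    Hypothetical stand-in for a real vector-index search.
    """
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    # Keep only documents that share at least one word with the query.
    matches = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(inputs, documents=None, use_rag=False, top_k=2):
    """Assemble the text that would be sent to the LLM."""
    project = "\n".join(f"{key}: {value}" for key, value in inputs.items())
    parts = ["Draft a data management and sharing plan.", "Project inputs:", project]
    if use_rag:
        # RAG: ground the draft in retrieved guidance passages.
        context = retrieve(project, documents or [], top_k=top_k)
        parts += ["Relevant guidance:"] + context
    # No-RAG: the prompt contains the project inputs only.
    return "\n".join(parts)


# Invented guidance snippets and project inputs, for illustration only.
guidance = [
    "Genomic data should be shared through a controlled-access repository.",
    "Budget justifications belong in a separate section.",
]
inputs = {"data_types": "genomic data from human participants"}

rag_prompt = build_prompt(inputs, guidance, use_rag=True)
norag_prompt = build_prompt(inputs, use_rag=False)
```

In the RAG prompt only the first guidance snippet is included, because it is the only one that shares words with the project inputs; the No-RAG prompt carries the project inputs alone.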

This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.

👉 Learn more: DMP-Chef.

Standards followed

The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt. We also retain dmp-template inside the prompt template used by the DMP generation workflow.

Main files

dmpchef/api.py — Public, importable API:
- generate() / draft() to produce DMP outputs (Markdown, DOCX, DMPTool JSON, optional PDF)
- prepare_nih_corpus() to prepare NIH reference PDFs for RAG (one-time)
src/core_pipeline.py — Core generation logic (RAG vs No-RAG; retrieval → prompt → generate).
src/NIH_data_ingestion.py — NIH/DMPTool ingestion to collect reference PDFs for RAG.
main.py — Command-line entry point for running the pipeline end-to-end.
demo_import.ipynb — Jupyter demo showing how to import the package and run generate().

Repository Structure

dmpchef/
│── main.py                 # CLI entry point (run pipeline end-to-end)
│── README.md               # Project overview + usage
│── requirements.txt        # Python dependencies
│── setup.py                # Packaging (editable installs via pip install -e .)
│── pyproject.toml          # Build system config (wheel builds)
│── MANIFEST.in             # Include non-code files in distributions
│── demo_import.ipynb       # Notebook demo: import + run generate()
│── LICENSE
│── .gitignore
│── .env                    # Local env vars (do not commit)
│
├── dmpchef/                # ✅ Installable Python package (public API)
│   ├── __init__.py         # Exports: generate, draft, prepare_nih_corpus
│   └── api.py              # Importable API used by notebooks/backends
│
├── config/                 # Configuration
│   ├── config.yaml         # Main settings (models, paths, retriever params)
│   └── config_schema.py    # Validation/schema helpers (optional)
│
├── data/                   # Local workspace data + artifacts (not guaranteed in wheel)
│   ├── inputs/             # Templates + examples
│   │   ├── nih-dms-plan-template.docx  # NIH blank Word template
│   │   └── input.json                  # Example request file
│   ├── web_links.json      # Seed links for NIH/DMPTool ingestion (used by src/NIH_data_ingestion.py)
│   ├── NIH_95/             # Reference PDFs collected for NIH RAG (optional)
│   ├── index/              # Vector index artifacts (e.g., FAISS)
│   ├── outputs/            # Generated artifacts
│   │   ├── markdown/       # Generated Markdown DMPs
│   │   ├── docx/           # Generated DOCX DMPs (template-preserving)
│   │   ├── json/           # DMPTool-compatible JSON outputs
│   │   ├── pdf/            # Optional PDFs converted from DOCX
│   │   └── debug/          # Optional retrieval debug outputs (retrieved context, logs, etc.)
│   └── data_ingestion/     # Session folders + manifests from crawling
│
├── src/                    # Core implementation
│   ├── __init__.py
│   ├── core_pipeline.py    # Pipeline logic (RAG/no-RAG)
│   └── NIH_data_ingestion.py # NIH/DMPTool crawl → export PDFs to data/NIH_95
│
├── prompt/                 # Prompt templates/utilities
│   └── prompt_library.py
│
├── utils/                  # Shared helpers
│   ├── config_loader.py
│   ├── model_loader.py
│   ├── dmptool_json.py
│   └── nih_docx_writer.py
│
├── logger/                 # Logging utilities
│   ├── __init__.py
│   └── custom_logger.py
│
├── exception/              # Custom exceptions
│   ├── __init__.py
│   └── custom_exception.py
│
├── notebook_DMP_RAG/       # Notebooks/experiments (non-production)
└── venv/                   # Local virtualenv (ignore in git)

Setup (Local Development)

Step 1 — Clone the repository

git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .

Step 2 — Create and activate a virtual environment

Windows (cmd):

python -m venv venv
venv\Scripts\activate.bat

macOS/Linux:

python -m venv venv
source venv/bin/activate

Step 3 — Install dependencies

pip install -r requirements.txt
# or (recommended for local dev)
pip install -e .

Run DMP Chef

Option A — Jupyter demo

Use demo_import.ipynb.

Option B — CLI: Command-line entry point for running the pipeline end-to-end

Use main.py.

Inputs

Reference documents: guidance PDFs (and other funder instructions) placed in your configured paths.data_pdfs folder.
These are used only when use_rag=true to retrieve funder-aligned language and examples.
Request JSON: a single “job request” file (e.g., data/inputs/input.json) that tells the pipeline what to generate.

Top-level fields
- title: Project title (also used for output filenames).
- funding_agency: Funder key (e.g., NIH; designed so that other funders such as NSF can be added later).
- use_rag: true / false (optional). If omitted, the pipeline uses the YAML default rag.enabled.
- inputs: A dictionary of user/project fields used to draft the plan (free-form keys are allowed). Common examples include:
  - research_context, data_types, data_source, human_subjects, consent_status, data_volume, etc.
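
As a rough illustration of the fields above, the sketch below builds a request in Python and shows how the optional use_rag flag could fall back to the YAML default rag.enabled. The field values are invented, the `config` dictionary stands in for a parsed config.yaml, and `resolve_use_rag` is a hypothetical helper, not a function from the DMP Chef codebase.

```python
import json

# Example request using the top-level fields listed above.
# All values are invented for illustration.
request = {
    "title": "Example Genomics Study",
    "funding_agency": "NIH",
    # "use_rag" is optional and deliberately omitted here.
    "inputs": {
        "research_context": "Pilot study of gene expression in mice",
        "data_types": "RNA-seq reads and derived count tables",
        "human_subjects": "no",
    },
}


def resolve_use_rag(request, config):
    """Return the request's use_rag flag, or the default rag.enabled if absent."""
    if "use_rag" in request:
        return bool(request["use_rag"])
    return bool(config["rag"]["enabled"])


# Stand-in for the parsed config.yaml; only the rag.enabled key is assumed.
config = {"rag": {"enabled": True}}

# Serialized form, as it could be saved to a file such as data/inputs/input.json.
request_json = json.dumps(request, indent=2)
use_rag = resolve_use_rag(request, config)
```

Because the request omits use_rag, the resolved value here comes from the config default; adding `"use_rag": false` to the request would override it.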

Outputs

Markdown: the generated funder-aligned DMP narrative (currently NIH structure).
DOCX: generated using the funder template (NIH template today) to preserve official formatting.
PDF: created by converting the DOCX (platform-dependent; typically works on Windows/macOS with Word).
JSON: a DMPTool-compatible JSON file (*.dmptool.json).
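
To check which artifacts a run produced, a small helper along these lines may be handy. It assumes the data/outputs layout shown under Repository Structure; the `.md`, `.docx`, and `.pdf` extensions are assumptions, and `list_outputs` is a hypothetical helper that is not part of the package.

```python
from pathlib import Path

# Output subfolders and file patterns. The subfolder names follow the
# repository structure; extensions other than *.dmptool.json are assumed.
OUTPUT_PATTERNS = {
    "markdown": "*.md",
    "docx": "*.docx",
    "json": "*.dmptool.json",
    "pdf": "*.pdf",
}


def list_outputs(outputs_dir="data/outputs"):
    """Map each output format to the sorted file names found in its subfolder."""
    root = Path(outputs_dir)
    found = {}
    for fmt, pattern in OUTPUT_PATTERNS.items():
        folder = root / fmt
        # A missing subfolder simply means no artifacts of that format yet.
        names = [p.name for p in folder.glob(pattern)] if folder.is_dir() else []
        found[fmt] = sorted(names)
    return found
```

Calling `list_outputs()` from the repository root returns a dictionary keyed by format, with an empty list for any format that has not been generated (for example PDF on a machine without Word).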

Notes

Output filenames include a run suffix to prevent overwriting:
- __rag__k{top_k}__{llm} (RAG runs)
- __norag__{llm} (No-RAG runs)
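
The suffix patterns above can be reproduced with a short function, which may help when locating the files from a particular run. This is a sketch based only on the two patterns listed; `run_suffix` is a hypothetical name, and how the project title and model name are normalized before being placed in the filename is not covered here.

```python
def run_suffix(llm, use_rag, top_k=None):
    """Build the run suffix that is appended to output filenames."""
    if use_rag:
        if top_k is None:
            raise ValueError("top_k is required for RAG runs")
        return f"__rag__k{top_k}__{llm}"
    return f"__norag__{llm}"


# The model label "llama3.3" is an illustrative value, not a confirmed identifier.
rag_suffix = run_suffix("llama3.3", use_rag=True, top_k=5)
norag_suffix = run_suffix("llama3.3", use_rag=False)
```

With these inputs, `rag_suffix` is `__rag__k5__llama3.3` and `norag_suffix` is `__norag__llama3.3`, so a RAG run and a No-RAG run of the same project never share a filename.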

License

This work is licensed under the MIT License. See LICENSE for more information.

Feedback and contribution

Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.

How to cite

If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.

