DMP Chef

DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management & Sharing Plans (DMPs) using a Large Language Model (LLM), such as Llama 3.3.

It supports two modes entirely in Python:

  • RAG: Retrieves related guidance from an indexed document collection and uses it to ground the draft. In this mode, the pipeline can ingest documents, build and search an index, and draft a DMP.
  • No-RAG: Generates the draft only from the user’s project inputs (no retrieval).
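
To make the difference between the two modes concrete, here is a minimal, self-contained sketch. It is not the code in src/core_pipeline.py: the function names `retrieve` and `build_prompt`, the word-overlap scoring (a toy stand-in for a real vector index), the prompt wording, and the sample guidance passages are all assumptions made for illustration. It only assembles the prompt text and does not call an LLM.

```python
from typing import Dict, List


def retrieve(query: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by shared words with the query (toy stand-in for a vector index)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.lower().split())), p) for p in corpus]
    # Drop passages with no overlap, then sort best first (ties keep corpus order).
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [passage for _, passage in ranked[:top_k]]


def build_prompt(inputs: Dict[str, str], corpus: List[str], use_rag: bool, top_k: int = 2) -> str:
    """Assemble the text that would be sent to the LLM (hypothetical prompt layout)."""
    project = "\n".join(f"{key}: {value}" for key, value in inputs.items())
    parts = ["Draft a Data Management and Sharing Plan for this project.", project]
    if use_rag:
        # RAG mode: ground the draft in retrieved guidance passages.
        context = retrieve(" ".join(inputs.values()), corpus, top_k)
        if context:
            parts.append("Relevant guidance:\n" + "\n".join(f"- {c}" for c in context))
    # No-RAG mode: the prompt contains only the user's project inputs.
    return "\n\n".join(parts)


# Made-up placeholder passages, not real funder guidance.
corpus = [
    "Genomic data should be shared through a controlled-access repository.",
    "Budget justifications are reviewed separately.",
]
inputs = {"data_types": "genomic data from human participants"}

rag_prompt = build_prompt(inputs, corpus, use_rag=True)
norag_prompt = build_prompt(inputs, corpus, use_rag=False)
print(rag_prompt)
```

In this sketch the only difference between the two prompts is the "Relevant guidance" section, which appears when `use_rag` is true and at least one passage overlaps with the project inputs.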

This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.

👉 Learn more: DMP-Chef.


Standards followed

The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt. We also retain dmp-template as is inside the prompt template used by the DMP generation workflow.

Main files

  • dmpchef/api.py — Public, importable API:
    • generate() / draft() to produce DMP outputs (Markdown, DOCX, DMPTool JSON, optional PDF)
    • prepare_nih_corpus() to prepare NIH reference PDFs for RAG (one-time)
  • src/core_pipeline.py — Core generation logic (RAG vs No-RAG; retrieval → prompt → generate).
  • src/NIH_data_ingestion.py — NIH/DMPTool ingestion to collect reference PDFs for RAG.
  • main.py — Command-line entry point for running the pipeline end-to-end.
  • demo_import.ipynb — Jupyter demo showing how to import the package and run generate().

Repository Structure

dmpchef/
│── main.py                 # CLI entry point (run pipeline end-to-end)
│── README.md               # Project overview + usage
│── requirements.txt        # Python dependencies
│── setup.py                # Packaging (editable installs via pip install -e .)
│── pyproject.toml          # Build system config (wheel builds)
│── MANIFEST.in             # Include non-code files in distributions
│── demo_import.ipynb       # Notebook demo: import + run generate()
│── LICENSE
│── .gitignore
│── .env                    # Local env vars (do not commit)
│
├── dmpchef/                # ✅ Installable Python package (public API)
│   ├── __init__.py         # Exports: generate, draft, prepare_nih_corpus
│   └── api.py              # Importable API used by notebooks/backends
│
├── config/                 # Configuration
│   ├── config.yaml         # Main settings (models, paths, retriever params)
│   └── config_schema.py    # Validation/schema helpers (optional)
│
├── data/                   # Local workspace data + artifacts (not guaranteed in wheel)
│   ├── inputs/             # Templates + examples
│   │   ├── nih-dms-plan-template.docx  # NIH blank Word template
│   │   └── input.json                  # Example request file
│   ├── web_links.json      # Seed links for NIH/DMPTool ingestion (used by src/NIH_data_ingestion.py)
│   ├── NIH_95/             # Reference PDFs collected for NIH RAG (optional)
│   ├── index/              # Vector index artifacts (e.g., FAISS)
│   ├── outputs/            # Generated artifacts
│   │   ├── markdown/       # Generated Markdown DMPs
│   │   ├── docx/           # Generated DOCX DMPs (template-preserving)
│   │   ├── json/           # DMPTool-compatible JSON outputs
│   │   ├── pdf/            # Optional PDFs converted from DOCX
│   │   └── debug/          # Optional retrieval debug outputs (retrieved context, logs, etc.)
│   └── data_ingestion/     # Session folders + manifests from crawling
│
├── src/                    # Core implementation
│   ├── __init__.py
│   ├── core_pipeline.py    # Pipeline logic (RAG/no-RAG)
│   └── NIH_data_ingestion.py # NIH/DMPTool crawl → export PDFs to data/NIH_95
│
├── prompt/                 # Prompt templates/utilities
│   └── prompt_library.py
│
├── utils/                  # Shared helpers
│   ├── config_loader.py
│   ├── model_loader.py
│   ├── dmptool_json.py
│   └── nih_docx_writer.py
│
├── logger/                 # Logging utilities
│   ├── __init__.py
│   └── custom_logger.py
│
├── exception/              # Custom exceptions
│   ├── __init__.py
│   └── custom_exception.py
│
├── notebook_DMP_RAG/       # Notebooks/experiments (non-production)
└── venv/                   # Local virtualenv (ignore in git)



Setup (Local Development)

Step 1 — Clone the repository

git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .

Step 2 — Create and activate a virtual environment

Windows (cmd):

python -m venv venv
venv\Scripts\activate.bat

macOS/Linux:

python -m venv venv
source venv/bin/activate

Step 3 — Install dependencies

pip install -r requirements.txt
# or (recommended for local dev)
pip install -e .

Run DMP Chef

Option A — Jupyter demo

Open demo_import.ipynb in Jupyter and run its cells.

Option B — CLI

Use main.py, the command-line entry point for running the pipeline end-to-end.


Inputs

  • Reference documents: guidance PDFs (and other funder instructions) placed in your configured paths.data_pdfs folder.
    These are used only when use_rag=true to retrieve funder-aligned language and examples.

  • Request JSON: a single “job request” file (e.g., data/inputs/input.json) that tells the pipeline what to generate.

    Top-level fields

    • title: Project title (also used for output filenames).
    • funding_agency: Funder key (e.g., NIH). The field is designed to accommodate other funders, such as NSF, in the future.
    • use_rag: true / false (optional). If omitted, the pipeline uses the YAML default rag.enabled.
    • inputs: A dictionary of user/project fields used to draft the plan (free-form keys are allowed). Common examples include:
      • research_context, data_types, data_source, human_subjects, consent_status, data_volume, etc.
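
As an illustration of the fields above, the following sketch builds a request in Python, checks it, and serializes it to JSON. The helper names `check_request` and `resolve_use_rag` are hypothetical and not part of the dmpchef package, all field values are placeholders, and treating title and funding_agency as required is an assumption. Consult data/inputs/input.json for the authoritative example.

```python
import json
from typing import Any, Dict


def check_request(request: Dict[str, Any]) -> None:
    """Raise ValueError if the request does not match the field descriptions above."""
    for field in ("title", "funding_agency"):
        value = request.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field}' must be a non-empty string")
    if not isinstance(request.get("inputs"), dict):
        raise ValueError("'inputs' must be a dictionary")
    if "use_rag" in request and not isinstance(request["use_rag"], bool):
        raise ValueError("'use_rag' must be true or false when present")


def resolve_use_rag(request: Dict[str, Any], yaml_default: bool) -> bool:
    """Use the request's use_rag if given, otherwise the YAML default (rag.enabled)."""
    return request.get("use_rag", yaml_default)


# Placeholder values; free-form keys are allowed under "inputs".
request = {
    "title": "Example Project",
    "funding_agency": "NIH",
    "use_rag": True,
    "inputs": {
        "research_context": "Placeholder description of the study.",
        "data_types": "Placeholder list of data types.",
        "human_subjects": "yes",
    },
}

check_request(request)
request_json = json.dumps(request, indent=2)
print(request_json)
```

Saving `request_json` to a file would give a request in the shape described above. When `use_rag` is omitted, `resolve_use_rag` mirrors the documented fallback to the YAML setting.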

Outputs

  • Markdown: the generated funder-aligned DMP narrative (currently NIH structure).
  • DOCX: generated using the funder template (NIH template today) to preserve official formatting.
  • PDF: created by converting the DOCX (platform-dependent; typically works on Windows/macOS with Word).
  • JSON: a DMPTool-compatible JSON file (*.dmptool.json).

Notes

  • Output filenames include a run suffix to prevent overwriting:
    • __rag__k{top_k}__{llm} (RAG runs)
    • __norag__{llm} (No-RAG runs)
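
The suffix patterns above can be reproduced with a few lines of Python. This is a sketch of the naming scheme only: the function name `run_suffix` is hypothetical, and the model string and `top_k` value are placeholders, since how the pipeline formats the LLM name and the rest of the filename is not specified here.

```python
def run_suffix(use_rag: bool, llm: str, top_k: int = 0) -> str:
    """Build the run suffix following the two patterns listed above."""
    if use_rag:
        return f"__rag__k{top_k}__{llm}"
    return f"__norag__{llm}"


print(run_suffix(True, "llama3.3", top_k=6))   # __rag__k6__llama3.3
print(run_suffix(False, "llama3.3"))           # __norag__llama3.3
```

Because the suffix encodes the mode, the retrieval depth, and the model, runs with different settings write to different files.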

License

This work is licensed under the MIT License. See LICENSE for more information.


Feedback and contribution

Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.


How to cite

If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.
