-
Notifications
You must be signed in to change notification settings - Fork 57
Add multi-step tool-calling SDG tutorial for workplace assistant #327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add multi-step tool-calling SDG tutorial for workplace assistant #327
Conversation
Add notebook, tool definitions, and utility modules for generating synthetic multi-step tool-calling training data using Data Designer. Includes dual-level LLM judge filtering and NeMo Gym export. Signed-off-by: Shashank Verma <shashankv@nvidia.com>
|
Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text: I have read the DCO document and I hereby sign the DCO. You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot. |
Greptile OverviewGreptile SummaryAdds comprehensive multi-step tool-calling synthetic data generation tutorial using Data Designer. Implements a complete pipeline for generating realistic workplace assistant queries with simulated agent trajectories, dual-level LLM judge filtering for quality control, and NeMo Gym export format compatibility. Key Components:
Style Issue:
|
| Filename | Overview |
|---|---|
| docs/colab_notebooks/5-multistep-toolcalling/multistep-toolcalling.ipynb | Comprehensive tutorial notebook for multi-step tool-calling SDG with clear examples and dual-level quality filtering |
| docs/colab_notebooks/5-multistep-toolcalling/utils/init.py | Package initialization - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py | NeMo Gym format converter with proper type hints - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py | Quality filtering utilities with dual-level validation - missing NVIDIA license headers required per AGENTS.md |
| docs/colab_notebooks/5-multistep-toolcalling/tools/environment.json | Environment configuration with 27 multi-step patterns covering all tool combinations |
Sequence Diagram
sequenceDiagram
participant User
participant DataDesigner
participant LLM
participant QualityFilter
participant NeMoGym
User->>DataDesigner: Load tool schemas & seed data
DataDesigner->>LLM: Generate user query from pattern
LLM-->>DataDesigner: Return user query
DataDesigner->>LLM: Judge user query (feasibility, schema compliance)
LLM-->>DataDesigner: Return query scores
DataDesigner->>LLM: Generate trajectory (tool calls)
LLM-->>DataDesigner: Return agent trajectory
DataDesigner->>LLM: Judge trajectory (tool validity, completeness)
LLM-->>DataDesigner: Return trajectory scores
DataDesigner->>QualityFilter: Filter by dual-level scores
QualityFilter->>QualityFilter: Stage 1: Validate query
QualityFilter->>QualityFilter: Stage 2: Validate trajectory
QualityFilter-->>User: Return filtered dataset
User->>NeMoGym: Convert to NeMo Gym JSONL format
NeMoGym-->>User: Training data ready for RL
Last reviewed commit: eb2b52b
…olcalling.ipynb Signed-off-by: Shashank Verma <shashankv@nvidia.com>
- quality_filtering.py: remove FilterThresholds dataclass, quickstart print, and verbose ASCII output; de-duplicate show_rejection_reasons (285 → 85 lines) - convert_to_nemo_gym_format.py: remove factory pattern and quickstart print (108 → 76 lines) - __init__.py: export only 4 functions (24 → 10 lines) - Notebook: fix imports to use data_designer.config/interface instead of removed essentials module; use functools.partial for converter Signed-off-by: Shashank Verma <shashankv@nvidia.com>
70ab957 to
eb2b52b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
11 files reviewed, 3 comments
| @@ -0,0 +1,9 @@ | |||
| from .quality_filtering import filter_high_quality, show_rejection_reasons | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| from .quality_filtering import filter_high_quality, show_rejection_reasons | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| from .quality_filtering import filter_high_quality, show_rejection_reasons |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/__init__.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from .quality_filtering import filter_high_quality, show_rejection_reasons
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,75 @@ | |||
| """Utilities for converting generated records to NeMo Gym JSONL format.""" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| """Utilities for converting generated records to NeMo Gym JSONL format.""" | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| """Utilities for converting generated records to NeMo Gym JSONL format.""" |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for converting generated records to NeMo Gym JSONL format."""
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,86 @@ | |||
| """Utilities for dual-level quality filtering of generated datasets.""" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
| """Utilities for dual-level quality filtering of generated datasets.""" | |
| # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| """Utilities for dual-level quality filtering of generated datasets.""" |
Context Used: Context from dashboard - AGENTS.md (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py
Line: 1:1
Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:
```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for dual-level quality filtering of generated datasets."""
```
**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))
How can I resolve this? If you propose a fix, please make it concise.
Add notebook, tool definitions, and utility modules for generating synthetic multi-step tool-calling training data using Data Designer. Includes dual-level LLM judge filtering and NeMo Gym export.