Skip to content

Conversation

@shashank3959
Copy link

Add notebook, tool definitions, and utility modules for generating synthetic multi-step tool-calling training data using Data Designer. Includes dual-level LLM judge filtering and NeMo Gym export.

Add notebook, tool definitions, and utility modules for generating
synthetic multi-step tool-calling training data using Data Designer.
Includes dual-level LLM judge filtering and NeMo Gym export.

Signed-off-by: Shashank Verma <shashankv@nvidia.com>
@shashank3959 shashank3959 requested a review from a team as a code owner February 12, 2026 22:56
@github-actions
Copy link
Contributor

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

@greptile-apps
Copy link

greptile-apps bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

Adds comprehensive multi-step tool-calling synthetic data generation tutorial using Data Designer. Implements a complete pipeline for generating realistic workplace assistant queries with simulated agent trajectories, dual-level LLM judge filtering for quality control, and NeMo Gym export format compatibility.

Key Components:

  • Tutorial notebook with 27 workplace assistant tools across 6 databases (email, calendar, CRM, analytics, project management, company directory)
  • Dual-level quality filtering utilities to validate both user queries and generated trajectories
  • NeMo Gym format conversion for RL training compatibility
  • 27 multi-step patterns for diverse task generation (lookup-then-send, search-then-update, etc.)

Style Issue:

  • All 3 Python utility files are missing required NVIDIA SPDX license headers as specified in AGENTS.md

Confidence Score: 4/5

  • Safe to merge after adding license headers to Python files
  • Well-structured tutorial with proper type annotations and good code organization. The only issue is missing NVIDIA license headers on 3 Python utility files, which is a style requirement that should be fixed before merging
  • The 3 Python utility files in docs/colab_notebooks/5-multistep-toolcalling/utils/ need NVIDIA SPDX license headers added

Important Files Changed

Filename Overview
docs/colab_notebooks/5-multistep-toolcalling/multistep-toolcalling.ipynb Comprehensive tutorial notebook for multi-step tool-calling SDG with clear examples and dual-level quality filtering
docs/colab_notebooks/5-multistep-toolcalling/utils/init.py Package initialization - missing NVIDIA license headers required per AGENTS.md
docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py NeMo Gym format converter with proper type hints - missing NVIDIA license headers required per AGENTS.md
docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py Quality filtering utilities with dual-level validation - missing NVIDIA license headers required per AGENTS.md
docs/colab_notebooks/5-multistep-toolcalling/tools/environment.json Environment configuration with 27 multi-step patterns covering all tool combinations

Sequence Diagram

sequenceDiagram
    participant User
    participant DataDesigner
    participant LLM
    participant QualityFilter
    participant NeMoGym

    User->>DataDesigner: Load tool schemas & seed data
    DataDesigner->>LLM: Generate user query from pattern
    LLM-->>DataDesigner: Return user query
    DataDesigner->>LLM: Judge user query (feasibility, schema compliance)
    LLM-->>DataDesigner: Return query scores
    DataDesigner->>LLM: Generate trajectory (tool calls)
    LLM-->>DataDesigner: Return agent trajectory
    DataDesigner->>LLM: Judge trajectory (tool validity, completeness)
    LLM-->>DataDesigner: Return trajectory scores
    DataDesigner->>QualityFilter: Filter by dual-level scores
    QualityFilter->>QualityFilter: Stage 1: Validate query
    QualityFilter->>QualityFilter: Stage 2: Validate trajectory
    QualityFilter-->>User: Return filtered dataset
    User->>NeMoGym: Convert to NeMo Gym JSONL format
    NeMoGym-->>User: Training data ready for RL
Loading

Last reviewed commit: eb2b52b

…olcalling.ipynb

Signed-off-by: Shashank Verma <shashankv@nvidia.com>
- quality_filtering.py: remove FilterThresholds dataclass, quickstart
  print, and verbose ASCII output; de-duplicate show_rejection_reasons
  (285 → 85 lines)
- convert_to_nemo_gym_format.py: remove factory pattern and quickstart
  print (108 → 76 lines)
- __init__.py: export only 4 functions (24 → 10 lines)
- Notebook: fix imports to use data_designer.config/interface instead
  of removed essentials module; use functools.partial for converter

Signed-off-by: Shashank Verma <shashankv@nvidia.com>
@shashank3959 shashank3959 force-pushed the dev/multistep-toolcalling-sdg branch from 70ab957 to eb2b52b Compare February 12, 2026 23:51
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@@ -0,0 +1,9 @@
from .quality_filtering import filter_high_quality, show_rejection_reasons
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

Suggested change
from .quality_filtering import filter_high_quality, show_rejection_reasons
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from .quality_filtering import filter_high_quality, show_rejection_reasons

Context Used: Context from dashboard - AGENTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/__init__.py
Line: 1:1

Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from .quality_filtering import filter_high_quality, show_rejection_reasons
```

**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))

How can I resolve this? If you propose a fix, please make it concise.

@@ -0,0 +1,75 @@
"""Utilities for converting generated records to NeMo Gym JSONL format."""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

Suggested change
"""Utilities for converting generated records to NeMo Gym JSONL format."""
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for converting generated records to NeMo Gym JSONL format."""

Context Used: Context from dashboard - AGENTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/convert_to_nemo_gym_format.py
Line: 1:1

Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Utilities for converting generated records to NeMo Gym JSONL format."""
```

**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))

How can I resolve this? If you propose a fix, please make it concise.

@@ -0,0 +1,86 @@
"""Utilities for dual-level quality filtering of generated datasets."""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

Suggested change
"""Utilities for dual-level quality filtering of generated datasets."""
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Utilities for dual-level quality filtering of generated datasets."""

Context Used: Context from dashboard - AGENTS.md (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/colab_notebooks/5-multistep-toolcalling/utils/quality_filtering.py
Line: 1:1

Comment:
Missing NVIDIA license headers. Per AGENTS.md, all Python files must include:

```suggestion
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Utilities for dual-level quality filtering of generated datasets."""
```

**Context Used:** Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=95f8243f-5118-40bc-a3ab-69210b72e57e))

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant