feat: add HuggingFace Hub integration for dataset publishing #275
Implement HuggingFace Hub integration to upload DataDesigner datasets:
- Add HuggingFaceHubClient with upload_dataset method
- Upload main parquet files to data/ subset
- Upload processor outputs to data/{processor_name}/ subsets
- Generate dataset card from metadata.json with column statistics
- Include sdg.json and metadata.json configuration files
- Comprehensive validation and error handling
- Add push_to_hub() method to DatasetCreationResults
…nitions
- Add progress logging with emojis following codebase style
- Add repository exists check before creation
- Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/)
- Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links
- Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs
- Set data config as default configuration
- Add description parameter to push_to_hub() for custom dataset card content
  - Description appears after NeMo Data Designer intro section
  - Update dataset card template to conditionally render custom description
  - Add tests for with/without custom description scenarios
- Make description parameter required in push_to_hub()
  - Improve dataset card layout with flexbox header (title + right-aligned tagline)
  - Add horizontal dividers between sections for visual separation
  - Add emoji icons to section headers for better readability
  - Move About NeMo Data Designer section after Citation
  - Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About
  - Update all tests to provide required description parameter
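The explicit-configs change in the list above can be illustrated with a small sketch. It assumes the card's YAML frontmatter uses the Hub's `configs` / `config_name` / `data_files` / `default` keys, that the main dataset lives under `data/`, and that each processor output lives under `{name}/`. The helper name `build_configs_frontmatter` and the processor name `judge_scores` are hypothetical and not taken from the PR.

```python
def build_configs_frontmatter(processor_names):
    """Return YAML frontmatter text declaring one config per uploaded subset.

    Hypothetical helper for illustration; the PR renders its card from a
    Jinja2 template, which is not reproduced here.
    """
    lines = ["---", "configs:"]
    # Main dataset: parquet files under data/, marked as the default config.
    lines += [
        "- config_name: data",
        '  data_files: "data/*.parquet"',
        "  default: true",
    ]
    # One extra config per processor output, so its schema is loaded separately
    # from the main dataset's schema.
    for name in processor_names:
        lines += [
            f"- config_name: {name}",
            f'  data_files: "{name}/*.parquet"',
        ]
    lines.append("---")
    return "\n".join(lines)


# "judge_scores" is a made-up processor name.
frontmatter = build_configs_frontmatter(["judge_scores"])
```

Declaring each subset as its own config is what keeps the Hub from trying to merge parquet files with different columns into a single schema.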
Greptile Overview
Greptile Summary
This PR adds comprehensive HuggingFace Hub integration to DataDesigner, enabling users to publish datasets directly to the HuggingFace Hub with a single method call. The implementation includes automated dataset card generation with rich metadata, robust error handling with specific exception types, and flexible upload options for both main datasets and processor outputs.
Key Changes:
Implementation Highlights:
The PR is well-structured, thoroughly tested, and ready for production use.
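As a rough illustration of the "specific exception types" mentioned in the summary, the sketch below checks the `username/dataset-name` format of a repository ID and raises a dedicated error. The exception name `InvalidRepoIdError`, the function name `check_repo_id`, and the exact character rules are assumptions made for this example. The PR's client reportedly also runs the Hugging Face validator, which this sketch does not attempt to reproduce.

```python
import re


class InvalidRepoIdError(ValueError):
    """Hypothetical exception type; the PR's actual class names are not shown."""


# Assumed format: exactly one "/" separating two non-empty name parts made of
# letters, digits, ".", "_" and "-". The real Hub rules may be stricter.
_REPO_ID_PATTERN = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*$"
)


def check_repo_id(repo_id):
    """Return repo_id unchanged if it looks like 'username/dataset-name'."""
    if not isinstance(repo_id, str) or not _REPO_ID_PATTERN.match(repo_id):
        raise InvalidRepoIdError(
            f"Invalid repo_id {repo_id!r}: expected 'username/dataset-name'"
        )
    return repo_id
```

Raising a dedicated subclass of `ValueError` lets callers catch the upload-specific failure without swallowing unrelated errors.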
| Filename | Overview |
|---|---|
| packages/data-designer/src/data_designer/integrations/huggingface/client.py | comprehensive HuggingFace client with robust validation, error handling, and upload functionality |
| packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py | clean dataset card generator with proper metadata extraction and size categorization |
| packages/data-designer/src/data_designer/interface/results.py | added push_to_hub() method with clear API and comprehensive documentation |
| packages/data-designer/tests/integrations/huggingface/test_client.py | extensive test coverage (559 lines) with fixtures, mocks, and edge case validation |
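The "size categorization" noted for dataset_card.py can be sketched as a mapping from a record count to the Hub's `size_categories` labels. This is an illustrative version only: the function name `size_category` is hypothetical, and treating each lower bound as inclusive (for example, exactly 1,000 records falls into `1K<n<10K`) is an assumption, since the PR's boundary handling is not shown.

```python
def size_category(num_records):
    """Map a record count to a Hugging Face size_categories label (sketch)."""
    if num_records < 0:
        raise ValueError("num_records must be non-negative")
    # (exclusive upper bound, label) pairs in increasing order.
    thresholds = [
        (1_000, "n<1K"),
        (10_000, "1K<n<10K"),
        (100_000, "10K<n<100K"),
        (1_000_000, "100K<n<1M"),
        (10_000_000, "1M<n<10M"),
        (100_000_000, "10M<n<100M"),
        (1_000_000_000, "100M<n<1B"),
        (10_000_000_000, "1B<n<10B"),
        (100_000_000_000, "10B<n<100B"),
        (1_000_000_000_000, "100B<n<1T"),
    ]
    for upper_bound, label in thresholds:
        if num_records < upper_bound:
            return label
    return "n>1T"
```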
Sequence Diagram
sequenceDiagram
participant User
participant Results as DatasetCreationResults
participant Client as HuggingFaceHubClient
participant API as HfApi
participant Card as DataDesignerDatasetCard
participant Storage as ArtifactStorage
User->>Results: push_to_hub(repo_id, description, token, private, tags)
Results->>Client: __init__(token)
Client->>API: HfApi(token)
Results->>Client: upload_dataset(repo_id, base_dataset_path, description, private, tags)
Client->>Client: _validate_repo_id(repo_id)
Note over Client: Check format: username/dataset-name<br/>Validate with HF validator
Client->>Client: _validate_dataset_path(base_dataset_path)
Note over Client: Verify metadata.json exists<br/>Check parquet-files/ directory<br/>Validate JSON structure
Client->>API: repo_exists(repo_id)
API-->>Client: True/False
Client->>API: create_repo(repo_id, exist_ok=True, private)
API-->>Client: Repo created/exists
Client->>Client: _upload_dataset_card(...)
Client->>Storage: Read metadata.json
Storage-->>Client: metadata dict
Client->>Storage: Read builder_config.json
Storage-->>Client: builder_config dict
Client->>Card: from_metadata(metadata, builder_config, repo_id, description, tags)
Card->>Card: Extract stats, compute size category
Card->>Card: Render Jinja2 template
Card-->>Client: DatasetCard instance
Client->>Card: push_to_hub(repo_id)
Card->>API: Upload README.md
API-->>Card: Success
Client->>Client: _upload_main_dataset_files(...)
Client->>API: upload_folder(parquet_folder → data/)
API-->>Client: Success
Client->>Client: _upload_processor_files(...)
loop For each processor
Client->>API: upload_folder(processor_dir → processor_name/)
API-->>Client: Success
end
Client->>Client: _upload_config_files(...)
Client->>API: upload_file(builder_config.json)
API-->>Client: Success
Client->>Client: _update_metadata_paths(metadata_path)
Note over Client: Transform paths:<br/>parquet-files/ → data/<br/>processors-files/X/ → X/
Client->>API: upload_file(metadata.json)
API-->>Client: Success
Client-->>Results: HuggingFace dataset URL
Results-->>User: https://huggingface.co/datasets/username/dataset-name
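The path transformation noted in the diagram (parquet-files/ → data/, processors-files/X/ → X/) can be sketched as a recursive rewrite over the loaded metadata. The structure of metadata.json is not shown in this PR, so the keys in the example (`file_paths`, `main`, `processors`, `num_records`) are invented for illustration, as are the function names.

```python
def remap_path(path):
    """Rewrite one local artifact path to its location in the Hub repo."""
    if path.startswith("parquet-files/"):
        return "data/" + path[len("parquet-files/"):]
    if path.startswith("processors-files/"):
        # processors-files/<name>/... becomes <name>/...
        return path[len("processors-files/"):]
    return path


def remap_metadata_paths(value):
    """Apply remap_path to every string inside a JSON-like structure."""
    if isinstance(value, str):
        return remap_path(value)
    if isinstance(value, list):
        return [remap_metadata_paths(item) for item in value]
    if isinstance(value, dict):
        return {key: remap_metadata_paths(item) for key, item in value.items()}
    return value  # numbers, booleans, None stay untouched


# Hypothetical metadata layout, used only to exercise the helper.
example_metadata = {
    "num_records": 100,
    "file_paths": {
        "main": ["parquet-files/batch_00000.parquet"],
        "processors": {"judge": ["processors-files/judge/batch_00000.parquet"]},
    },
}
remapped = remap_metadata_paths(example_metadata)
```

The sketch returns a new structure instead of mutating the input, so the local metadata.json can stay valid for the on-disk layout while the uploaded copy reflects the Hub layout.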
4 files reviewed, no comments
4 files reviewed, no comments
4 files reviewed, 5 comments
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py
packages/data-designer/src/data_designer/integrations/huggingface/__init__.py
packages/data-designer/tests/integrations/huggingface/test_dataset_card.py
4 files reviewed, no comments
4 files reviewed, no comments
4 files reviewed, 2 comments
packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
…ace/dataset_card.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
4 files reviewed, no comments
5 files reviewed, no comments
packages/data-designer-engine/src/data_designer/engine/dataset_builders/artifact_storage.py
packages/data-designer/src/data_designer/integrations/huggingface/__init__.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/dataset_card_template.md
4 files reviewed, no comments
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
4 files reviewed, no comments
4 files reviewed, no comments
4 files reviewed, 1 comment
packages/data-designer/src/data_designer/integrations/huggingface/client.py
4 files reviewed, 1 comment
packages/data-designer/src/data_designer/integrations/huggingface/client.py
davanstrien left a comment:
Left a few small suggestions
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer-engine/src/data_designer/engine/dataset_builders/artifact_storage.py
…ace/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
…ace/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>
3 files reviewed, 3 comments
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
packages/data-designer/src/data_designer/integrations/huggingface/client.py
Nice, thanks @nabinchha! Thank you @davanstrien and @Wauplin for your feedback – super helpful!
📋 Summary
This PR adds comprehensive HuggingFace Hub integration, enabling users to publish DataDesigner datasets directly to the HuggingFace Hub with automated dataset card generation, flexible upload options, and robust error handling.
🔄 Changes
✨ Added
- HuggingFaceHubClient - Complete client for uploading datasets to HuggingFace Hub with support for:
- DataDesignerDatasetCard - Rich dataset card generator with:
- push_to_hub() method on DatasetCreationResults - Simple API for publishing results
🔧 Changed
- DatasetCreationResults class to support HuggingFace publishing workflow
🔍 Attention Areas
- client.py (349 lines) - Core upload logic with path mapping and error handling
- dataset_card.py (139 lines) - Dataset card template rendering and metadata extraction
- test_client.py (569 lines) - Extensive test coverage for upload scenarios
See this dataset card as an example published directly from the create results object: https://huggingface.co/datasets/nabinnvidia/multi-lingual-greetings
Closes #7
The implementation draws inspiration from conversations in this PR: #127
🤖 Generated with AI