-
Notifications
You must be signed in to change notification settings - Fork 57
feat: add allow_resize for 1:N and N:1 generation patterns #286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3c9fa49 to
8ba264c
Compare
0cedd56 to
8c9a33e
Compare
| default=None, | ||
| description="Optional typed configuration object passed as second argument to generator function", | ||
| ) | ||
| allow_resize: bool = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should elevate this to a property on the base column config (default is False), which you can override in custom columns and plugins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that was the initial solution, then I ended up doing a mixin instead.
Thought it was a bit opaque for plugins specifically, that they developer to find out about a specific attribute/property 🤔 But it makes things simpler I suppose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a pattern we already use for custom emojis. Also the required_columns and side_effect_columns (these ones have to be set, though).
Adds support for generators that produce a different number of records than the input (expansion or retraction). This addresses GitHub issue #265. Changes: - Add `allow_resize` parameter to `update_records()` in DatasetBatchManager - Add `allow_resize` field to CustomColumnConfig - Add validation requiring FULL_COLUMN strategy when allow_resize=True - Track and report actual_num_records in metadata (may differ from target) - Add logging when batch size changes - Add example_allow_resize.py demonstrating the feature - Add comprehensive tests
- Merge update_records and replace_buffer into a single replace_buffer method with allow_resize parameter on DatasetBatchManager - Move allow_resize field from CustomColumnConfig to SingleColumnConfig so plugins inherit it without needing a mixin - Align example and logging with final CustomColumn API - Parametrize resize tests and extract shared stub in test_columns
- Add expand->retract->expand chaining test (single batch) - Add multi-batch resize test verifying combined parquet output - Update example to chain expand/retract/expand with preview+build - Use 💥/✂️ emojis for resize logging (expand/retract)
d4a6100 to
d84d799
Compare
Summary
Adds
allow_resizeparameter toCustomColumnConfigenabling custom column generators to produce a different number of records than the input. This supports 1:N expansion (e.g., generating multiple variations per input) and N:1 retraction (e.g., filtering or aggregating records) patterns. Addresses #265.Changes
Added
allow_resizefield onCustomColumnConfigwith validation requiringfull_columnstrategyallow_resizeparameter toupdate_records()inDatasetBatchManageractual_num_recordstracking in dataset metadata (may differ fromtarget_num_recordswhen resizing)example_allow_resize.pydemonstrating expansion (1:N) and retraction (N:1) patternsChanged
column_wise_builder.py- logs resize operations, passesallow_resizeto batch managerCustomColumnGenerator.log_pre_generation()- logsallow_resizewhen enabledAttention Areas
dataset_batch_manager.py- Core change to buffer handling with newallow_resizeparametercolumn_configs.py- New field and validation logic onCustomColumnConfigcolumn_wise_builder.py- Engine-level handling for resize operationsDescription updated with AI