Bug: cannot query metadata with unsupported annotations#653
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis change introduces support for handling unsupported metadata annotations by adding a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan Review Summary by QodoFix unsupported annotations breaking queries and double-loading
WalkthroughsDescription• Handle unsupported metadata annotations gracefully without breaking queries • Create UnsupportedMetadataAnnotations subclass to prevent double-parsing errors • Add defensive checks to prevent re-processing already-converted annotation fields • Improve error messaging for unsupported or invalid annotations Diagramflowchart LR
A["Unsupported Annotation"] --> B["UnsupportedMetadataAnnotations"]
B --> C["Prevents Double-Parsing"]
C --> D["Query Succeeds"]
E["Defensive Checks"] --> F["Skip MetadataAnnotations"]
F --> D
File Changes1. dagshub/data_engine/annotation/metadata.py
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/data_engine/annotation_import/res/audio_annotation.json (1)
44-71: Consider using fake PII in test fixtures.The test data contains what appears to be a real email address and name. Even in test fixtures, it's better practice to use obviously fake data (e.g.,
test@example.com,John,Doe) to avoid any compliance concerns around PII in source control.dagshub/data_engine/util/not_implemented.py (1)
9-12: Edge case:dir(base)includes inherited attributes fromobjectand all ancestors.
dir(base)returns attributes from the entire MRO, not justbaseitself. IfMetadataAnnotationsinherits public methods from other classes in the future, those will also be replaced withNotImplementedErrorin the subclass. This is likely the desired behavior, but worth being aware of.Also,
staticmethodandclassmethoddescriptors from the base are not detected separately — they'll be seen ascallableand wrapped as instance methods, which changes their calling convention. Currently not a problem sincefrom_ls_task(a@staticmethod) is always called on the base class directly, but could be surprising if the pattern is reused.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
dagshub/data_engine/annotation/metadata.pydagshub/data_engine/model/query_result.pydagshub/data_engine/util/__init__.pydagshub/data_engine/util/not_implemented.pytests/data_engine/annotation_import/res/audio_annotation.jsontests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (2)
dagshub/data_engine/annotation/metadata.py (3)
dagshub/common/helpers.py (1)
log_message(100-107)dagshub/data_engine/util/not_implemented.py (1)
NotImplementedMeta(1-48)dagshub/data_engine/model/datapoint.py (1)
Datapoint(29-263)
tests/data_engine/annotation_import/test_annotation_parsing.py (2)
dagshub/data_engine/annotation/metadata.py (7)
MetadataAnnotations(36-319)UnsupportedMetadataAnnotations(322-339)add_image_bbox(146-183)value(104-114)value(332-333)to_ls_task(88-101)to_ls_task(335-336)dagshub/data_engine/model/query_result.py (1)
get_blob_fields(334-437)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py
[warning] 64-64: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Agent
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build (3.9)
- GitHub Check: build (3.13)
- GitHub Check: build (3.12)
🔇 Additional comments (5)
dagshub/data_engine/annotation/metadata.py (1)
322-339: LGTM!The
UnsupportedMetadataAnnotationsclass correctly wraps unsupported annotations while preservingvalueandto_ls_taskbehavior. The metaclass will replace all non-overridden public methods (likeadd_image_bbox,add_image_segmentation, etc.) withNotImplementedError, which is the intended behavior.dagshub/data_engine/model/query_result.py (2)
429-435: Good defensive guard against re-processing parsed annotations as documents.This correctly prevents the
TypeError: expected str, bytes or os.PathLike object, not MetadataAnnotationsby skipping fields that were already parsed as annotations. Solid fix for Bug 2.
449-463: LGTM — annotation conversion guards and unsupported annotation fallback.The
isinstanceguard at line 451 prevents double-loading, and theValidationErrorcatch at line 460 gracefully wraps invalid annotations inUnsupportedMetadataAnnotationsinstead of crashing.tests/data_engine/annotation_import/test_annotation_parsing.py (2)
99-115: LGTM — good coverage of basic annotation parsing and double-load safety.
test_double_loading_annotation_worksat line 104 is a valuable regression test ensuring that callingget_blob_fieldsa second time doesn't crash or corrupt the already-parsed annotations.
118-134: Good test coverage for unsupported annotations.Verifies the
UnsupportedMetadataAnnotationssubclass behavior: isinstance checks,NotImplementedErroron unsupported methods, and content preservation viavalue/to_ls_task.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Pull request overview
Fixes Data Engine metadata blob handling so unsupported/invalid Label Studio annotations no longer break Datasource.all() / QueryResult.get_annotations(), and adds regression tests for double-loading behavior.
Changes:
- Add
UnsupportedMetadataAnnotationsto represent annotations that fail parsing while keeping consistentMetadataAnnotationstyping. - Harden
QueryResult.get_blob_fields()conversions to avoid re-processing already-parsed annotation fields and improve handling of path-vs-bytes blobs. - Extend annotation parsing tests and add an “unsupported” annotation fixture.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
dagshub/data_engine/model/query_result.py |
Adjusts blob/document/annotation conversion logic and introduces unsupported-annotation wrapper usage. |
dagshub/data_engine/annotation/metadata.py |
Adds UnsupportedMetadataAnnotations type for unsupported/invalid LS tasks. |
dagshub/data_engine/util/not_implemented.py |
Introduces a metaclass to block inherited methods/properties on unsupported annotations. |
tests/data_engine/annotation_import/test_annotation_parsing.py |
Adds regression tests for double-loading and unsupported annotations. |
tests/data_engine/annotation_import/res/audio_annotation.json |
Adds a fixture representing an unsupported (audio) LS task. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Defensive check to not mangle annotation fields by accident | ||
| if isinstance(dp.metadata[fld], MetadataAnnotations): | ||
| continue | ||
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): | ||
| dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes() | ||
| dp.metadata[fld] = dp.metadata[fld].decode("utf-8") |
There was a problem hiding this comment.
When _get_blob fails it returns a string like "Error while downloading binary blob: ..." (see Datapoint._get_blob). In that case this conversion path will call Path(<error string>).read_bytes() and raise FileNotFoundError, turning a download warning into a hard failure. Add a guard before Path(...).read_bytes() to detect error strings / non-path values (or make _get_blob raise and handle it), and skip decoding/conversion for that field when download failed.
There was a problem hiding this comment.
Yes, it's correct and I have a reproduction test that actually causes this problem to appear.
Will fix in a separate PR, so as to not overload this one
Code Review by Qodo
✅ 1.
|
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): |
There was a problem hiding this comment.
I don't get the condition and the comment.
The comment says to ignore load_into_memory - then why do you check it? I assume it's because when load_into_memory == True, then in some earlier part of the code the field was already read as bytes. Then why not just assert that? I guess maybe because then the field content would be string or bytes and you can't know for sure whether that's the blob content or a path without relying on the assumed earlier side effects of the field (would be solved by the type safety we discussed, but out of scope in this PR).
Assuming everything I've guessed is correct - I think a comment explaining your assumptions would really help to make this code understandable
| # Defensive check to not mangle annotation fields by accident | ||
| if isinstance(dp.metadata[fld], MetadataAnnotations): | ||
| continue | ||
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): | ||
| dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes() | ||
| dp.metadata[fld] = dp.metadata[fld].decode("utf-8") |
Co-authored-by: Guy Smoilovsky <611655+guysmoilov@users.noreply.github.com>
… listing from taking up too much space if used with big datasources
This PR addresses two bugs:
Bug 1: unsupported metadata annotations break
ds.all(), if they were also tagged as document.Reproduction:
DOCUMENTtag on it (will happen automatically right now on DagsHub)ds.all()Expected:
The warning shows up, but otherwise the program continues execution.
Actual:
Following stack trace, that happens due to trying to read the annotation content a second time:
Solution:
Created a new subclass of
MetadataAnnotationsnamedUnsupportedMetadataAnnotations, that prevents interacting with the annotations, but still allows to get the value via.valueand.to_ls_task(), making it soqr.get_annotations()acts consistently.No matter if an annotation is valid or not, all metadata values of this field will be of
MetadataAnnotationsThis makes it so that trying to load the annotation a second time will fall through due to this line:
client/dagshub/data_engine/model/query_result.py
Line 451 in 511801c
This bug has a reproduction test added, that fails on current main, and is fixed in this PR
Bug 2: Possible to load an annotation as a document a second time, despite the annotation already having been parsed
I couldn't get an actual reproduction for it, aside from the stack trace that was throwing in a similar way, but in a function
get_annotations, leading toget_blob_fields:client/dagshub/data_engine/model/query_result.py
Line 430 in 4e254d4
Solution:
Added a safety guard against it at this line, and added a test that loads the annotations a second time.