Bug: parsing annotations that failed to download crashes#655
Conversation
…ng the get_blob_fields functions
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe changes introduce specialized error handling for blob downloads and annotation metadata. A new Changes
Sequence DiagramsequenceDiagram
participant Client
participant QueryResult
participant Datapoint
participant BlobStore
Client->>QueryResult: _convert_annotation_fields(fields)
QueryResult->>Datapoint: get_blob(field)
Datapoint->>Datapoint: Check if value is BlobHashMetadata
alt Blob Download Success
Datapoint->>BlobStore: Download blob using hash
BlobStore-->>Datapoint: Return blob content
Datapoint-->>QueryResult: Return blob content
else Blob Download Failure
Datapoint->>Datapoint: Raise BlobDownloadError
QueryResult->>QueryResult: Catch BlobDownloadError
QueryResult->>QueryResult: Store ErrorMetadataAnnotations
end
QueryResult-->>Client: Complete annotation conversion
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dagshub/data_engine/model/query_result.py (1)
458-471:⚠️ Potential issue | 🔴 CriticalBug:
original_value=metadata_valuepasses wrong type toUnsupportedMetadataAnnotations.At line 469,
metadata_value(captured at line 449) will be aPathorBlobHashMetadataafter the blob download phase—notbytes.UnsupportedMetadataAnnotations.__init__expectsoriginal_value: bytes. The actual bytes content is inannotation_contentfrom line 460.Proposed fix
except ValidationError: dp.metadata[fld] = UnsupportedMetadataAnnotations( - datapoint=dp, field=fld, original_value=metadata_value + datapoint=dp, field=fld, original_value=annotation_content )
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
dagshub/data_engine/annotation/metadata.pydagshub/data_engine/model/datapoint.pydagshub/data_engine/model/query_result.pytests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (3)
dagshub/data_engine/annotation/metadata.py (1)
dagshub/data_engine/util/not_implemented.py (1)
NotImplementedMeta(1-48)
dagshub/data_engine/model/datapoint.py (5)
dagshub/common/api/repo.py (1)
download(393-468)dagshub/common/helpers.py (1)
http_request(40-60)dagshub/data_engine/annotation/metadata.py (7)
MetadataAnnotations(36-319)value(104-114)value(332-333)value(353-354)to_ls_task(88-101)to_ls_task(335-336)to_ls_task(356-357)dagshub/data_engine/client/models.py (2)
DatapointHistoryResult(124-126)MetadataSelectFieldSchema(82-101)dagshub/data_engine/dtypes.py (1)
MetadataFieldType(20-36)
tests/data_engine/annotation_import/test_annotation_parsing.py (5)
dagshub/data_engine/annotation/metadata.py (10)
ErrorMetadataAnnotations(342-360)UnsupportedMetadataAnnotations(322-339)MetadataAnnotations(36-319)add_image_bbox(146-183)value(104-114)value(332-333)value(353-354)to_ls_task(88-101)to_ls_task(335-336)to_ls_task(356-357)dagshub/data_engine/model/datapoint.py (2)
BlobDownloadError(39-42)BlobHashMetadata(29-36)dagshub/data_engine/model/datasource_state.py (1)
blob_path(120-124)dagshub/data_engine/model/datasource.py (1)
all(314-337)dagshub/data_engine/model/query_result.py (1)
get_annotations(684-696)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py
[warning] 67-67: Abstract raise to an inner function
(TRY301)
[warning] 67-67: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Agent
- GitHub Check: build (3.11)
- GitHub Check: build (3.9)
- GitHub Check: build (3.10)
- GitHub Check: build (3.12)
- GitHub Check: build (3.13)
🔇 Additional comments (11)
dagshub/data_engine/model/datapoint.py (3)
28-42: LGTM! Clean introduction ofBlobHashMetadataas a frozen dataclass andBlobDownloadErroras a dedicated exception. The__str__returning the hash preserves backward compatibility where string conversion was previously used.
148-161: LGTM! Wrapping blob field values inBlobHashMetadataat parse time is a sound approach—it prevents accidental string interpretation of blob hashes downstream.
322-343: LGTM! RaisingBlobDownloadErrorinstead of returning an error string is the right fix for the reported crash. The retry logic correctly only retriesRuntimeError(server errors > 400), while 404 and other codes fail immediately.dagshub/data_engine/annotation/metadata.py (1)
342-360: LGTM!ErrorMetadataAnnotationsfollows the same pattern asUnsupportedMetadataAnnotations, usingNotImplementedMetato block mutation operations while providing clear error messages viavalueandto_ls_task. Good subclass design to preserveisinstance(x, MetadataAnnotations)checks.dagshub/data_engine/model/query_result.py (3)
41-47: LGTM! Import additions are clean and correctly bring in the new types needed for the refactored blob/annotation handling.
396-402: LGTM! Checking forBlobHashMetadatainstead of raw strings is the right approach and consistent with the wrapping done infrom_gql_edge.
422-434: LGTM! The simplified_convert_annotation_fieldscall and document field conversion viadp.get_blob(fld)are clean. Theget_blobmethod properly handles bothPath(cached) andBlobHashMetadata(needs download) cases.tests/data_engine/annotation_import/test_annotation_parsing.py (4)
59-73: LGTM! The mock now correctly wraps errors inBlobDownloadError, matching the real_get_blobbehavior. This ensures tests exercise the same error-handling code paths as production.
88-89: LGTM! Patching bothquery_result._get_blobanddatapoint._get_blobis necessary sinceDatapoint.get_blobnow calls_get_blobdirectly from its own module.
142-165: LGTM! Thorough test for the nonexistent annotation path—validates theErrorMetadataAnnotationstype, subclass relationship,NotImplementedErroron mutation, andValueErroronvalue/to_ls_taskaccess with the expected error message.
168-170: LGTM! Good regression test verifying that blob values are wrapped inBlobHashMetadataimmediately after parsing the GQL response, before any autoloading occurs.
There was a problem hiding this comment.
Pull request overview
This PR fixes a crash that occurred when parsing annotations that failed to download. The solution involves:
Changes:
- Introduced
BlobHashMetadatawrapper class to distinguish blob hashes from other string values - Introduced
BlobDownloadErrorexception for explicit error handling of blob download failures - Introduced
ErrorMetadataAnnotationsclass to represent annotations that failed to download - Simplified blob loading logic by using
dp.get_blob()consistently
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| dagshub/data_engine/model/datapoint.py | Added BlobHashMetadata and BlobDownloadError classes; updated from_gql_edge to wrap blob hashes; updated get_blob to handle BlobHashMetadata; fixed comment typo; updated _get_blob to raise BlobDownloadError on failure |
| dagshub/data_engine/model/query_result.py | Updated get_blob_fields to check for BlobHashMetadata type; simplified document field processing to use dp.get_blob(); updated _convert_annotation_fields to catch BlobDownloadError and create ErrorMetadataAnnotations; fixed comment typo |
| dagshub/data_engine/annotation/metadata.py | Added ErrorMetadataAnnotations class that raises ValueError when accessing value or converting to Label Studio task |
| tests/data_engine/annotation_import/test_annotation_parsing.py | Added test for nonexistent annotations; added test to verify blob metadata wrapping; updated mock function to raise BlobDownloadError; added monkeypatch for datapoint._get_blob |
Comments suppressed due to low confidence (1)
dagshub/data_engine/model/query_result.py:470
- When a ValidationError occurs during annotation parsing, the code passes
metadata_valuetoUnsupportedMetadataAnnotations, which expectsoriginal_value: bytes. However,metadata_valuecould be aBlobHashMetadataobject,Path, orbytes. This should useannotation_contentinstead, which is guaranteed to be bytes from the successfuldp.get_blob()call on line 460.
dp.metadata[fld] = UnsupportedMetadataAnnotations(
datapoint=dp, field=fld, original_value=metadata_value
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
guysmoilov
left a comment
There was a problem hiding this comment.
Am I right in my understanding that this solves the issue of blobs having ambiguous types, while avoiding compatibility issues for customers by not making the internal type visible to callers? Is there a situation where it is exposed to callers? e.g. if iterating over datapoints in a query result and accessing the field value directly?
Also, I kind of don't understand the PR description. If the blob downloads fail, why wouldn't we expect ds.all() to raise an error?
| annotation_content = dp.get_blob(fld) | ||
| dp.metadata[fld] = MetadataAnnotations.from_ls_task( | ||
| datapoint=dp, field=fld, ls_task=metadata_value | ||
| datapoint=dp, field=fld, ls_task=annotation_content |
There was a problem hiding this comment.
Isn't it weird for the dp.get_blob to convert to ls_task (bytes), then reconvert it to MetadataAnnotations here? Or is this for the scenario where dp.get_blob returns bytes directly without going through MetadataAnnotations (loading from disk IIUC)? Feels clumsy, like maybe converting to MetadataAnnotations should happen only here or in dp.get_blob but why in both? In which scenario will elif isinstance(current_value, MetadataAnnotations): in line 213 be true?
There was a problem hiding this comment.
The intentional data flow here is:
- get blob hashes from backend and make them
BlobHashMetadata qr.get_blob_fieldsis called via autoloading/explicitly by user_get_blobfills in the field's value with, most probably, the rawbytes, but could be other things (e.g. Path if load_into_memory is False)qr._convert_annotation_fieldsgoes over all annotation metadata and tries to convert them- During that, in order to get the bytes,
dp.get_blob()is called for the field. This ensures that the value in the field is guaranteed to be bytes, and not Path, str, or anything else.
The reason dp.get_blob() converts the annotation to bytes is because that is a publically exposed function, and it makes sense that calling it on metadata would return the raw LS task's bytes. It actually doesn't really have any intention for the main purpose of this PR, and I more just added it as a defensive/interface consistency check.
This part will not be hit if you, for example, try to load and convert an annotation field a second time, because at that point, the conversion would have already happened, and then this type guard prevents doing a double conversion:
client/dagshub/data_engine/model/query_result.py
Lines 458 to 459 in 7c8fa55
| except ValidationError: | ||
| dp.metadata[fld] = UnsupportedMetadataAnnotations( | ||
| datapoint=dp, field=fld, original_value=metadata_value | ||
| datapoint=dp, field=fld, original_value=annotation_content |
There was a problem hiding this comment.
Isn't it dangerous to use annotation_content when it's undefined because an error was thrown from dp.get_blob? Or you think it can only be thrown from from_ls_task? Maybe that means the catches should be separate? Just asking questions
There was a problem hiding this comment.
ValidationError is only thrown by the Pydantic code, so on from_ls_task, and that means that get_blob() has already succeeded or thrown a different uncaught error (unlikely file not found, for example).
Bug:
Reproduction:
ds.all()Expected:
Actual:
Parsing annotations will fail, because it will try to parse the return error's string returned from the
_get_blob()function.There is a test case
test_nonexistent_annotationthat reproduces this issue.Solution:
BlobHashMetadataobjects right after they get returned from the server, instead of strings. This allows for easier handling when loading blobs. Previously it required careful handling with sequence of operations, so as to not break whenever document fields are involved, since they were automatically converted into strings. Now loading blobs checks if the metadata is of typeBlobHashMetadataand loads that_get_blob()throw a BlobDownloadError, that can be explicitlyexcepted and handled. Autoloading annotations now relies on that error and creates aErrorMetadataAnnotationson receiving this error.dp.get_blob()wherever possible, since that function already handles possible cases of blob metadata.Technical Implementation
Core Data Structures
Error Handling for Annotations
valueorto_ls_task()are accessed, preventing silent failures from being treated as successfully parsed annotationsBlob Loading Architecture Changes
BlobHashMetadataat deserialization time, ensuring blob hashes are properly typed objects rather than ambiguous stringsBlobHashMetadata: Downloads blob and optionally caches to diskPath: Reads existing cached blob from diskMetadataAnnotations: Converts to Label Studio task viato_ls_task()BlobDownloadErroron any download failure instead of returning error strings; includes newpath_formatparameter to control path representation (str vs Path object)Annotation Conversion Workflow
load_into_memoryparameter; now uses unifieddp.get_blob()approach for all annotation content retrievalBlobDownloadErrorto createErrorMetadataAnnotationsandValidationErrorto createUnsupportedMetadataAnnotations, with warning logs listing problematic datapointsTesting Coverage
ds_with_nonexistent_annotationsimulates blob download failurestest_nonexistent_annotationverifiesErrorMetadataAnnotationsis created and raises descriptive errors when accessedtest_blob_metadata_is_wrapped_from_backendconfirms blob hash wrapping at deserialization