
Bug: parsing annotations that failed to download crashes#655

Merged
kbolashev merged 5 commits into main from
bug/parsing-errored-annotation
Feb 19, 2026

Conversation

@kbolashev
Member

@kbolashev kbolashev commented Feb 16, 2026

Bug:

Reproduction:

  • Create datasource with annotations
  • Make the download fail in some way (e.g., turn off the internet)
  • Try to run ds.all()

Expected:

  • No errors thrown

Actual:
Parsing annotations fails, because the parser tries to parse the error string returned by the _get_blob() function.
There is a test case test_nonexistent_annotation that reproduces this issue.

Solution:

  • Overhauled how blob hashes are stored: they are now BlobHashMetadata objects from the moment they are returned from the server, instead of strings. This makes blob loading easier to handle. Previously, a careful sequence of operations was needed to avoid breakage whenever document fields were involved, since those were automatically converted into strings. Now blob loading checks whether the metadata is a BlobHashMetadata instance and loads it accordingly.
  • Made _get_blob() raise a BlobDownloadError, which can be explicitly caught and handled. Annotation autoloading now relies on that error and creates an ErrorMetadataAnnotations object when it is raised.
  • Additionally, simplified the logic in the autoloading functions, removing obscure checks and using dp.get_blob() wherever possible, since that function already handles all possible blob metadata cases.

Technical Implementation

Core Data Structures

  • BlobHashMetadata: New dataclass that wraps blob hash strings with proper representation methods, replacing raw string handling for blob field values from the server
  • BlobDownloadError: New exception class for blob download failures, providing clearer error semantics than previous string-based error returns
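As a rough sketch of the two new types (field names and exact signatures here are assumptions, not the actual implementation in dagshub/data_engine/model/datapoint.py):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobHashMetadata:
    """Wraps a blob hash returned from the server, so downstream code can
    distinguish an undownloaded hash from an already-loaded string value."""

    hash: str

    def __str__(self) -> str:
        # Keeps str() conversion working where code previously held raw strings
        return self.hash


class BlobDownloadError(Exception):
    """Raised when a blob download fails, replacing the old behavior of
    returning the error message as a string."""
```

An `isinstance(value, BlobHashMetadata)` check then unambiguously identifies a not-yet-downloaded blob, which a plain string never could.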

Error Handling for Annotations

  • ErrorMetadataAnnotations: New subclass of MetadataAnnotations that represents failed annotation downloads. Stores an error message and raises it when value or to_ls_task() are accessed, preventing silent failures from being treated as successfully parsed annotations
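A minimal sketch of that error-state class; the stand-in base class below replaces the real MetadataAnnotations (which additionally uses NotImplementedMeta to block mutation), and the constructor signature is an assumption:

```python
class MetadataAnnotations:
    """Stand-in for the real base class in dagshub/data_engine/annotation/metadata.py."""


class ErrorMetadataAnnotations(MetadataAnnotations):
    """Represents an annotation whose blob failed to download. Accessing the
    parsed value re-raises the stored error instead of silently behaving like
    a successfully parsed annotation."""

    def __init__(self, error: str):
        self.error = error

    def __repr__(self) -> str:
        return f"Annotation download error: {self.error}"

    @property
    def value(self):
        raise ValueError(self.error)

    def to_ls_task(self):
        raise ValueError(self.error)
```

Because it still subclasses MetadataAnnotations, existing `isinstance(x, MetadataAnnotations)` checks keep working while the failure surfaces loudly on access.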

Blob Loading Architecture Changes

  • from_gql_edge(): Now wraps blob field values (BLOB type) in BlobHashMetadata at deserialization time, ensuring blob hashes are properly typed objects rather than ambiguous strings
  • get_blob(): Extended to handle three value types with distinct pathways:
    • BlobHashMetadata: Downloads blob and optionally caches to disk
    • Path: Reads existing cached blob from disk
    • MetadataAnnotations: Converts to Label Studio task via to_ls_task()
  • _get_blob(): Now raises BlobDownloadError on any download failure instead of returning error strings; includes new path_format parameter to control path representation (str vs Path object)
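The three-way dispatch described above can be sketched as follows; the stand-in classes and the `download_fn` parameter are illustrative assumptions, not the actual Datapoint API (which also handles caching and the path_format option):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class BlobHashMetadata:
    hash: str


class MetadataAnnotations:
    def to_ls_task(self) -> bytes:
        return b'{"annotations": []}'


def get_blob(value, download_fn):
    """Sketch of get_blob's dispatch over the three possible value types."""
    if isinstance(value, BlobHashMetadata):
        return download_fn(value.hash)   # not yet downloaded: fetch it
    if isinstance(value, Path):
        return value.read_bytes()        # previously cached blob on disk
    if isinstance(value, MetadataAnnotations):
        return value.to_ls_task()        # already parsed: serialize back to bytes
    raise TypeError(f"Unexpected blob field value: {type(value).__name__}")
```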

Annotation Conversion Workflow

  • _convert_annotation_fields(): Simplified by removing load_into_memory parameter; now uses unified dp.get_blob() approach for all annotation content retrieval
  • Error handling: Catches BlobDownloadError to create ErrorMetadataAnnotations and ValidationError to create UnsupportedMetadataAnnotations, with warning logs listing problematic datapoints
  • Failed annotation downloads now result in informative error states instead of parsing crashes
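The control flow above can be sketched like this; the classes are stand-ins (ValidationError stands in for pydantic's), and the function signature is an assumption rather than the real _convert_annotation_fields:

```python
class BlobDownloadError(Exception):
    pass


class ValidationError(Exception):
    pass  # stand-in for pydantic.ValidationError


class ErrorMetadataAnnotations:
    def __init__(self, error):
        self.error = error


class UnsupportedMetadataAnnotations:
    def __init__(self, original_value):
        self.original_value = original_value


def convert_annotation_field(metadata, fld, get_blob, parse):
    """Download failures and parse failures each map to a distinct
    error-state annotation object instead of crashing the whole query."""
    try:
        content = get_blob(fld)          # may raise BlobDownloadError
        metadata[fld] = parse(content)   # may raise ValidationError
    except BlobDownloadError as e:
        metadata[fld] = ErrorMetadataAnnotations(str(e))
    except ValidationError:
        # content is defined here: ValidationError only comes from parse(),
        # so get_blob() must already have succeeded
        metadata[fld] = UnsupportedMetadataAnnotations(content)
```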

Testing Coverage

  • New fixture ds_with_nonexistent_annotation simulates blob download failures
  • Test test_nonexistent_annotation verifies ErrorMetadataAnnotations is created and raises descriptive errors when accessed
  • Assertion test_blob_metadata_is_wrapped_from_backend confirms blob hash wrapping at deserialization
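The fixture's failure simulation might look roughly like this hypothetical version of mock_get_blob (the URL shape and message text are assumptions, not the actual test code):

```python
class BlobDownloadError(Exception):
    pass


def mock_get_blob(url: str, *args, **kwargs) -> bytes:
    """Mirrors the real _get_blob's new contract: failures raise
    BlobDownloadError instead of returning an error string."""
    if "nonexistent" in url:
        raise BlobDownloadError(f"Failed to download blob from {url}")
    return b'{"annotations": []}'
```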

@kbolashev kbolashev self-assigned this Feb 16, 2026
@kbolashev kbolashev added the "bug: Something isn't working" label Feb 16, 2026
@coderabbitai

coderabbitai bot commented Feb 16, 2026

📝 Walkthrough


The changes introduce specialized error handling for blob downloads and annotation metadata. A new ErrorMetadataAnnotations class is added to represent annotation download failures, BlobHashMetadata wraps blob hash references, and BlobDownloadError is raised on download failures. The annotation and document conversion workflows are refactored to fetch blob content via the blob download mechanism with proper error handling.

Changes

  • Error Metadata Annotations (dagshub/data_engine/annotation/metadata.py): Added ErrorMetadataAnnotations class to encapsulate error messages encountered during annotation processing, with a custom __repr__ reporting the annotation download error.
  • Blob Metadata & Download Error Handling (dagshub/data_engine/model/datapoint.py): Introduced BlobHashMetadata dataclass for blob hash references and BlobDownloadError exception. Extended from_gql_edge to wrap blob values in BlobHashMetadata, updated get_blob to handle BlobHashMetadata and MetadataAnnotations, and added a path_format parameter to _get_blob to control path representation.
  • Annotation & Document Processing (dagshub/data_engine/model/query_result.py): Refactored blob field handling to use BlobHashMetadata type checks and updated download URL computation. Reworked annotation/document conversion to fetch blob content via dp.get_blob with proper error handling; removed the load_into_memory parameter from _convert_annotation_fields. On blob or validation errors, stores ErrorMetadataAnnotations or UnsupportedMetadataAnnotations respectively.
  • Test Coverage & Fixtures (tests/data_engine/annotation_import/test_annotation_parsing.py): Added imports for the new error and metadata classes. Enhanced mock_get_blob to wrap failures in BlobDownloadError. Introduced ds_with_unsupported_annotation and ds_with_nonexistent_annotation fixtures. Added test_nonexistent_annotation to verify error annotation behavior and test_blob_metadata_is_wrapped_from_backend for metadata wrapping validation.

Sequence Diagram

sequenceDiagram
    participant Client
    participant QueryResult
    participant Datapoint
    participant BlobStore

    Client->>QueryResult: _convert_annotation_fields(fields)
    QueryResult->>Datapoint: get_blob(field)
    Datapoint->>Datapoint: Check if value is BlobHashMetadata
    alt Blob Download Success
        Datapoint->>BlobStore: Download blob using hash
        BlobStore-->>Datapoint: Return blob content
        Datapoint-->>QueryResult: Return blob content
    else Blob Download Failure
        Datapoint->>Datapoint: Raise BlobDownloadError
        QueryResult->>QueryResult: Catch BlobDownloadError
        QueryResult->>QueryResult: Store ErrorMetadataAnnotations
    end
    QueryResult-->>Client: Complete annotation conversion

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 1 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Merge Conflict Detection: ⚠️ Merge conflicts detected (4 files):
    • ⚔️ dagshub/data_engine/annotation/metadata.py (content)
    • ⚔️ dagshub/data_engine/model/datapoint.py (content)
    • ⚔️ dagshub/data_engine/model/query_result.py (content)
    • ⚔️ tests/data_engine/annotation_import/test_annotation_parsing.py (content)
    These conflicts must be resolved before merging into main. Resolve conflicts locally and push changes to this branch.

✅ Passed checks (1 passed)

  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled).



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dagshub/data_engine/model/query_result.py (1)

458-471: ⚠️ Potential issue | 🔴 Critical

Bug: original_value=metadata_value passes wrong type to UnsupportedMetadataAnnotations.

At line 469, metadata_value (captured at line 449) will be a Path or BlobHashMetadata after the blob download phase—not bytes. UnsupportedMetadataAnnotations.__init__ expects original_value: bytes. The actual bytes content is in annotation_content from line 460.

Proposed fix
                     except ValidationError:
                         dp.metadata[fld] = UnsupportedMetadataAnnotations(
-                            datapoint=dp, field=fld, original_value=metadata_value
+                            datapoint=dp, field=fld, original_value=annotation_content
                         )
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38dfb2b and 342612b.

📒 Files selected for processing (4)
  • dagshub/data_engine/annotation/metadata.py
  • dagshub/data_engine/model/datapoint.py
  • dagshub/data_engine/model/query_result.py
  • tests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (3)
dagshub/data_engine/annotation/metadata.py (1)
dagshub/data_engine/util/not_implemented.py (1)
  • NotImplementedMeta (1-48)
dagshub/data_engine/model/datapoint.py (5)
dagshub/common/api/repo.py (1)
  • download (393-468)
dagshub/common/helpers.py (1)
  • http_request (40-60)
dagshub/data_engine/annotation/metadata.py (7)
  • MetadataAnnotations (36-319)
  • value (104-114)
  • value (332-333)
  • value (353-354)
  • to_ls_task (88-101)
  • to_ls_task (335-336)
  • to_ls_task (356-357)
dagshub/data_engine/client/models.py (2)
  • DatapointHistoryResult (124-126)
  • MetadataSelectFieldSchema (82-101)
dagshub/data_engine/dtypes.py (1)
  • MetadataFieldType (20-36)
tests/data_engine/annotation_import/test_annotation_parsing.py (5)
dagshub/data_engine/annotation/metadata.py (10)
  • ErrorMetadataAnnotations (342-360)
  • UnsupportedMetadataAnnotations (322-339)
  • MetadataAnnotations (36-319)
  • add_image_bbox (146-183)
  • value (104-114)
  • value (332-333)
  • value (353-354)
  • to_ls_task (88-101)
  • to_ls_task (335-336)
  • to_ls_task (356-357)
dagshub/data_engine/model/datapoint.py (2)
  • BlobDownloadError (39-42)
  • BlobHashMetadata (29-36)
dagshub/data_engine/model/datasource_state.py (1)
  • blob_path (120-124)
dagshub/data_engine/model/datasource.py (1)
  • all (314-337)
dagshub/data_engine/model/query_result.py (1)
  • get_annotations (684-696)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py

[warning] 67-67: Abstract raise to an inner function

(TRY301)


[warning] 67-67: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Agent
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.13)
🔇 Additional comments (11)
dagshub/data_engine/model/datapoint.py (3)

28-42: LGTM! Clean introduction of BlobHashMetadata as a frozen dataclass and BlobDownloadError as a dedicated exception. The __str__ returning the hash preserves backward compatibility where string conversion was previously used.


148-161: LGTM! Wrapping blob field values in BlobHashMetadata at parse time is a sound approach—it prevents accidental string interpretation of blob hashes downstream.


322-343: LGTM! Raising BlobDownloadError instead of returning an error string is the right fix for the reported crash. The retry logic correctly only retries RuntimeError (server errors > 400), while 404 and other codes fail immediately.

dagshub/data_engine/annotation/metadata.py (1)

342-360: LGTM! ErrorMetadataAnnotations follows the same pattern as UnsupportedMetadataAnnotations, using NotImplementedMeta to block mutation operations while providing clear error messages via value and to_ls_task. Good subclass design to preserve isinstance(x, MetadataAnnotations) checks.

dagshub/data_engine/model/query_result.py (3)

41-47: LGTM! Import additions are clean and correctly bring in the new types needed for the refactored blob/annotation handling.


396-402: LGTM! Checking for BlobHashMetadata instead of raw strings is the right approach and consistent with the wrapping done in from_gql_edge.


422-434: LGTM! The simplified _convert_annotation_fields call and document field conversion via dp.get_blob(fld) are clean. The get_blob method properly handles both Path (cached) and BlobHashMetadata (needs download) cases.

tests/data_engine/annotation_import/test_annotation_parsing.py (4)

59-73: LGTM! The mock now correctly wraps errors in BlobDownloadError, matching the real _get_blob behavior. This ensures tests exercise the same error-handling code paths as production.


88-89: LGTM! Patching both query_result._get_blob and datapoint._get_blob is necessary since Datapoint.get_blob now calls _get_blob directly from its own module.


142-165: LGTM! Thorough test for the nonexistent annotation path—validates the ErrorMetadataAnnotations type, subclass relationship, NotImplementedError on mutation, and ValueError on value/to_ls_task access with the expected error message.


168-170: LGTM! Good regression test verifying that blob values are wrapped in BlobHashMetadata immediately after parsing the GQL response, before any autoloading occurs.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a crash that occurred when parsing annotations that failed to download. The solution involves:

Changes:

  • Introduced BlobHashMetadata wrapper class to distinguish blob hashes from other string values
  • Introduced BlobDownloadError exception for explicit error handling of blob download failures
  • Introduced ErrorMetadataAnnotations class to represent annotations that failed to download
  • Simplified blob loading logic by using dp.get_blob() consistently

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Files changed:

  • dagshub/data_engine/model/datapoint.py: Added BlobHashMetadata and BlobDownloadError classes; updated from_gql_edge to wrap blob hashes; updated get_blob to handle BlobHashMetadata; fixed a comment typo; updated _get_blob to raise BlobDownloadError on failure
  • dagshub/data_engine/model/query_result.py: Updated get_blob_fields to check for the BlobHashMetadata type; simplified document field processing to use dp.get_blob(); updated _convert_annotation_fields to catch BlobDownloadError and create ErrorMetadataAnnotations; fixed a comment typo
  • dagshub/data_engine/annotation/metadata.py: Added ErrorMetadataAnnotations class that raises ValueError when accessing value or converting to a Label Studio task
  • tests/data_engine/annotation_import/test_annotation_parsing.py: Added a test for nonexistent annotations; added a test to verify blob metadata wrapping; updated the mock function to raise BlobDownloadError; added a monkeypatch for datapoint._get_blob
Comments suppressed due to low confidence (1)

dagshub/data_engine/model/query_result.py:470

  • When a ValidationError occurs during annotation parsing, the code passes metadata_value to UnsupportedMetadataAnnotations, which expects original_value: bytes. However, metadata_value could be a BlobHashMetadata object, Path, or bytes. This should use annotation_content instead, which is guaranteed to be bytes from the successful dp.get_blob() call on line 460.
                        dp.metadata[fld] = UnsupportedMetadataAnnotations(
                            datapoint=dp, field=fld, original_value=metadata_value
                        )


Member

@guysmoilov guysmoilov left a comment


Am I right in my understanding that this solves the issue of blobs having ambiguous types, while avoiding compatibility issues for customers by not making the internal type visible to callers? Is there a situation where it is exposed to callers? e.g. if iterating over datapoints in a query result and accessing the field value directly?

Also, I kind of don't understand the PR description. If the blob downloads fail, why wouldn't we expect ds.all() to raise an error?

Comment on lines +463 to +465

    annotation_content = dp.get_blob(fld)
    dp.metadata[fld] = MetadataAnnotations.from_ls_task(
-       datapoint=dp, field=fld, ls_task=metadata_value
+       datapoint=dp, field=fld, ls_task=annotation_content
Member


Isn't it weird for the dp.get_blob to convert to ls_task (bytes), then reconvert it to MetadataAnnotations here? Or is this for the scenario where dp.get_blob returns bytes directly without going through MetadataAnnotations (loading from disk IIUC)? Feels clumsy, like maybe converting to MetadataAnnotations should happen only here or in dp.get_blob but why in both? In which scenario will elif isinstance(current_value, MetadataAnnotations): in line 213 be true?

Member Author

@kbolashev kbolashev Feb 19, 2026


The intentional data flow here is:

  • get blob hashes from backend and make them BlobHashMetadata
  • qr.get_blob_fields is called via autoloading/explicitly by user
  • _get_blob fills in the field's value with, most probably, the raw bytes, but could be other things (e.g. Path if load_into_memory is False)
  • qr._convert_annotation_fields goes over all annotation metadata and tries to convert them
  • During that, in order to get the bytes, dp.get_blob() is called for the field. This ensures that the value in the field is guaranteed to be bytes, and not Path, str, or anything else.

The reason dp.get_blob() converts the annotation to bytes is that it is a publicly exposed function, and it makes sense that calling it on annotation metadata returns the raw LS task's bytes. It isn't strictly necessary for the main purpose of this PR; I added it more as a defensive/interface-consistency check.

This part will not be hit if you, for example, try to load and convert an annotation field a second time, because at that point the conversion has already happened, and this type guard prevents a double conversion:

elif isinstance(metadata_value, MetadataAnnotations):
continue

    except ValidationError:
        dp.metadata[fld] = UnsupportedMetadataAnnotations(
-           datapoint=dp, field=fld, original_value=metadata_value
+           datapoint=dp, field=fld, original_value=annotation_content
Member


Isn't it dangerous to use annotation_content when it's undefined because an error was thrown from dp.get_blob? Or do you think it can only be thrown from from_ls_task? Maybe that means the catches should be separate? Just asking questions

Member Author

@kbolashev kbolashev Feb 19, 2026


ValidationError is only thrown by the Pydantic code, i.e. in from_ls_task, which means that get_blob() has already succeeded or thrown a different uncaught error (an unlikely file-not-found, for example).

@kbolashev kbolashev requested a review from guysmoilov February 19, 2026 08:53
@kbolashev kbolashev merged commit 18bffc9 into main Feb 19, 2026
9 checks passed
@kbolashev kbolashev deleted the bug/parsing-errored-annotation branch February 19, 2026 09:07
