
Bug: parsing annotations that failed to download crashes#655

Merged
kbolashev merged 5 commits into main from
bug/parsing-errored-annotation
Feb 19, 2026

Conversation

@kbolashev
Member

@kbolashev kbolashev commented Feb 16, 2026

Bug:

Reproduction:

  • Create datasource with annotations
  • Make the download fail in some way (e.g., turn off the internet)
  • Try to run ds.all()

Expected:

  • No errors thrown

Actual:
Parsing annotations fails, because the parser tries to parse the error string returned by the _get_blob() function.
There is a test case test_nonexistent_annotation that reproduces this issue.

Solution:

  • Overhauled how blob hashes are stored: they are now BlobHashMetadata objects from the moment they are returned from the server, instead of strings. This makes blob loading easier to handle. Previously, a careful sequence of operations was needed to avoid breakage whenever document fields were involved, since those were automatically converted into strings. Now blob loading checks whether the metadata is a BlobHashMetadata instance and loads it accordingly.
  • Made _get_blob() raise a BlobDownloadError, which can be explicitly caught and handled. Annotation autoloading now relies on that error and creates an ErrorMetadataAnnotations object when it is raised.
  • Additionally, simplified the logic in the autoloading functions, removing obscure checks and using dp.get_blob() wherever possible, since that function already handles all possible blob metadata cases.

Technical Implementation

Core Data Structures

  • BlobHashMetadata: New dataclass that wraps blob hash strings with proper representation methods, replacing raw string handling for blob field values from the server
  • BlobDownloadError: New exception class for blob download failures, providing clearer error semantics than previous string-based error returns
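As a rough sketch of the two new types (field names and exact signatures here are assumptions, not the actual implementation in dagshub/data_engine/model/datapoint.py):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobHashMetadata:
    """Wraps a blob hash returned from the server, so downstream code can
    distinguish an undownloaded hash from an already-loaded string value."""

    hash: str

    def __str__(self) -> str:
        # Keeps str() conversion working where code previously held raw strings
        return self.hash


class BlobDownloadError(Exception):
    """Raised when a blob download fails, replacing the old behavior of
    returning the error message as a string."""
```

An `isinstance(value, BlobHashMetadata)` check then unambiguously identifies a not-yet-downloaded blob, which a plain string never could.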

Error Handling for Annotations

  • ErrorMetadataAnnotations: New subclass of MetadataAnnotations that represents failed annotation downloads. Stores an error message and raises it when value or to_ls_task() are accessed, preventing silent failures from being treated as successfully parsed annotations
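A minimal sketch of that error-state class; the stand-in base class below replaces the real MetadataAnnotations (which additionally uses NotImplementedMeta to block mutation), and the constructor signature is an assumption:

```python
class MetadataAnnotations:
    """Stand-in for the real base class in dagshub/data_engine/annotation/metadata.py."""


class ErrorMetadataAnnotations(MetadataAnnotations):
    """Represents an annotation whose blob failed to download. Accessing the
    parsed value re-raises the stored error instead of silently behaving like
    a successfully parsed annotation."""

    def __init__(self, error: str):
        self.error = error

    def __repr__(self) -> str:
        return f"Annotation download error: {self.error}"

    @property
    def value(self):
        raise ValueError(self.error)

    def to_ls_task(self):
        raise ValueError(self.error)
```

Because it still subclasses MetadataAnnotations, existing `isinstance(x, MetadataAnnotations)` checks keep working while the failure surfaces loudly on access.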

Blob Loading Architecture Changes

  • from_gql_edge(): Now wraps blob field values (BLOB type) in BlobHashMetadata at deserialization time, ensuring blob hashes are properly typed objects rather than ambiguous strings
  • get_blob(): Extended to handle three value types with distinct pathways:
    • BlobHashMetadata: Downloads blob and optionally caches to disk
    • Path: Reads existing cached blob from disk
    • MetadataAnnotations: Converts to Label Studio task via to_ls_task()
  • _get_blob(): Now raises BlobDownloadError on any download failure instead of returning error strings; includes new path_format parameter to control path representation (str vs Path object)
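The three-way dispatch described above can be sketched as follows; the stand-in classes and the `download_fn` parameter are illustrative assumptions, not the actual Datapoint API (which also handles caching and the path_format option):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class BlobHashMetadata:
    hash: str


class MetadataAnnotations:
    def to_ls_task(self) -> bytes:
        return b'{"annotations": []}'


def get_blob(value, download_fn):
    """Sketch of get_blob's dispatch over the three possible value types."""
    if isinstance(value, BlobHashMetadata):
        return download_fn(value.hash)   # not yet downloaded: fetch it
    if isinstance(value, Path):
        return value.read_bytes()        # previously cached blob on disk
    if isinstance(value, MetadataAnnotations):
        return value.to_ls_task()        # already parsed: serialize back to bytes
    raise TypeError(f"Unexpected blob field value: {type(value).__name__}")
```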

Annotation Conversion Workflow

  • _convert_annotation_fields(): Simplified by removing load_into_memory parameter; now uses unified dp.get_blob() approach for all annotation content retrieval
  • Error handling: Catches BlobDownloadError to create ErrorMetadataAnnotations and ValidationError to create UnsupportedMetadataAnnotations, with warning logs listing problematic datapoints
  • Failed annotation downloads now result in informative error states instead of parsing crashes
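The control flow above can be sketched like this; the classes are stand-ins (ValidationError stands in for pydantic's), and the function signature is an assumption rather than the real _convert_annotation_fields:

```python
class BlobDownloadError(Exception):
    pass


class ValidationError(Exception):
    pass  # stand-in for pydantic.ValidationError


class ErrorMetadataAnnotations:
    def __init__(self, error):
        self.error = error


class UnsupportedMetadataAnnotations:
    def __init__(self, original_value):
        self.original_value = original_value


def convert_annotation_field(metadata, fld, get_blob, parse):
    """Download failures and parse failures each map to a distinct
    error-state annotation object instead of crashing the whole query."""
    try:
        content = get_blob(fld)          # may raise BlobDownloadError
        metadata[fld] = parse(content)   # may raise ValidationError
    except BlobDownloadError as e:
        metadata[fld] = ErrorMetadataAnnotations(str(e))
    except ValidationError:
        # content is defined here: ValidationError only comes from parse(),
        # so get_blob() must already have succeeded
        metadata[fld] = UnsupportedMetadataAnnotations(content)
```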

Testing Coverage

  • New fixture ds_with_nonexistent_annotation simulates blob download failures
  • Test test_nonexistent_annotation verifies ErrorMetadataAnnotations is created and raises descriptive errors when accessed
  • Assertion test_blob_metadata_is_wrapped_from_backend confirms blob hash wrapping at deserialization
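The fixture's failure simulation might look roughly like this hypothetical version of mock_get_blob (the URL shape and message text are assumptions, not the actual test code):

```python
class BlobDownloadError(Exception):
    pass


def mock_get_blob(url: str, *args, **kwargs) -> bytes:
    """Mirrors the real _get_blob's new contract: failures raise
    BlobDownloadError instead of returning an error string."""
    if "nonexistent" in url:
        raise BlobDownloadError(f"Failed to download blob from {url}")
    return b'{"annotations": []}'
```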

@kbolashev kbolashev self-assigned this Feb 16, 2026
@kbolashev kbolashev added the "bug: Something isn't working" label Feb 16, 2026
@coderabbitai

coderabbitai bot commented Feb 16, 2026

📝 Walkthrough


The changes introduce specialized error handling for blob downloads and annotation metadata. A new ErrorMetadataAnnotations class is added to represent annotation download failures, BlobHashMetadata wraps blob hash references, and BlobDownloadError is raised on download failures. The annotation and document conversion workflows are refactored to fetch blob content via the blob download mechanism with proper error handling.

Changes

  • Error Metadata Annotations (dagshub/data_engine/annotation/metadata.py): Added ErrorMetadataAnnotations class to encapsulate error messages encountered during annotation processing, with a custom __repr__ reporting the annotation download error.
  • Blob Metadata & Download Error Handling (dagshub/data_engine/model/datapoint.py): Introduced BlobHashMetadata dataclass for blob hash references and BlobDownloadError exception. Extended from_gql_edge to wrap blob values in BlobHashMetadata, updated get_blob to handle BlobHashMetadata and MetadataAnnotations, and added a path_format parameter to _get_blob to control path representation.
  • Annotation & Document Processing (dagshub/data_engine/model/query_result.py): Refactored blob field handling to use BlobHashMetadata type checks and updated download URL computation. Reworked annotation/document conversion to fetch blob content via dp.get_blob with proper error handling; removed the load_into_memory parameter from _convert_annotation_fields. On blob or validation errors, stores ErrorMetadataAnnotations or UnsupportedMetadataAnnotations respectively.
  • Test Coverage & Fixtures (tests/data_engine/annotation_import/test_annotation_parsing.py): Added imports for the new error and metadata classes. Enhanced mock_get_blob to wrap failures in BlobDownloadError. Introduced ds_with_unsupported_annotation and ds_with_nonexistent_annotation fixtures. Added test_nonexistent_annotation to verify error annotation behavior and test_blob_metadata_is_wrapped_from_backend for metadata wrapping validation.

Sequence Diagram

sequenceDiagram
    participant Client
    participant QueryResult
    participant Datapoint
    participant BlobStore

    Client->>QueryResult: _convert_annotation_fields(fields)
    QueryResult->>Datapoint: get_blob(field)
    Datapoint->>Datapoint: Check if value is BlobHashMetadata
    alt Blob Download Success
        Datapoint->>BlobStore: Download blob using hash
        BlobStore-->>Datapoint: Return blob content
        Datapoint-->>QueryResult: Return blob content
    else Blob Download Failure
        Datapoint->>Datapoint: Raise BlobDownloadError
        QueryResult->>QueryResult: Catch BlobDownloadError
        QueryResult->>QueryResult: Store ErrorMetadataAnnotations
    end
    QueryResult-->>Client: Complete annotation conversion

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 1 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Merge Conflict Detection: ⚠️ Merge conflicts detected (4 files):
    • ⚔️ dagshub/data_engine/annotation/metadata.py (content)
    • ⚔️ dagshub/data_engine/model/datapoint.py (content)
    • ⚔️ dagshub/data_engine/model/query_result.py (content)
    • ⚔️ tests/data_engine/annotation_import/test_annotation_parsing.py (content)
    These conflicts must be resolved before merging into main. Resolve conflicts locally and push changes to this branch.

✅ Passed checks (1 passed)

  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled).



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dagshub/data_engine/model/query_result.py (1)

458-471: ⚠️ Potential issue | 🔴 Critical

Bug: original_value=metadata_value passes wrong type to UnsupportedMetadataAnnotations.

At line 469, metadata_value (captured at line 449) will be a Path or BlobHashMetadata after the blob download phase—not bytes. UnsupportedMetadataAnnotations.__init__ expects original_value: bytes. The actual bytes content is in annotation_content from line 460.

Proposed fix
                     except ValidationError:
                         dp.metadata[fld] = UnsupportedMetadataAnnotations(
-                            datapoint=dp, field=fld, original_value=metadata_value
+                            datapoint=dp, field=fld, original_value=annotation_content
                         )
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38dfb2b and 342612b.

📒 Files selected for processing (4)
  • dagshub/data_engine/annotation/metadata.py
  • dagshub/data_engine/model/datapoint.py
  • dagshub/data_engine/model/query_result.py
  • tests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (3)
dagshub/data_engine/annotation/metadata.py (1)
dagshub/data_engine/util/not_implemented.py (1)
  • NotImplementedMeta (1-48)
dagshub/data_engine/model/datapoint.py (5)
dagshub/common/api/repo.py (1)
  • download (393-468)
dagshub/common/helpers.py (1)
  • http_request (40-60)
dagshub/data_engine/annotation/metadata.py (7)
  • MetadataAnnotations (36-319)
  • value (104-114)
  • value (332-333)
  • value (353-354)
  • to_ls_task (88-101)
  • to_ls_task (335-336)
  • to_ls_task (356-357)
dagshub/data_engine/client/models.py (2)
  • DatapointHistoryResult (124-126)
  • MetadataSelectFieldSchema (82-101)
dagshub/data_engine/dtypes.py (1)
  • MetadataFieldType (20-36)
tests/data_engine/annotation_import/test_annotation_parsing.py (5)
dagshub/data_engine/annotation/metadata.py (10)
  • ErrorMetadataAnnotations (342-360)
  • UnsupportedMetadataAnnotations (322-339)
  • MetadataAnnotations (36-319)
  • add_image_bbox (146-183)
  • value (104-114)
  • value (332-333)
  • value (353-354)
  • to_ls_task (88-101)
  • to_ls_task (335-336)
  • to_ls_task (356-357)
dagshub/data_engine/model/datapoint.py (2)
  • BlobDownloadError (39-42)
  • BlobHashMetadata (29-36)
dagshub/data_engine/model/datasource_state.py (1)
  • blob_path (120-124)
dagshub/data_engine/model/datasource.py (1)
  • all (314-337)
dagshub/data_engine/model/query_result.py (1)
  • get_annotations (684-696)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py

[warning] 67-67: Abstract raise to an inner function

(TRY301)


[warning] 67-67: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Agent
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.13)
🔇 Additional comments (11)
dagshub/data_engine/model/datapoint.py (3)

28-42: LGTM! Clean introduction of BlobHashMetadata as a frozen dataclass and BlobDownloadError as a dedicated exception. The __str__ returning the hash preserves backward compatibility where string conversion was previously used.


148-161: LGTM! Wrapping blob field values in BlobHashMetadata at parse time is a sound approach—it prevents accidental string interpretation of blob hashes downstream.


322-343: LGTM! Raising BlobDownloadError instead of returning an error string is the right fix for the reported crash. The retry logic correctly only retries RuntimeError (server errors > 400), while 404 and other codes fail immediately.

dagshub/data_engine/annotation/metadata.py (1)

342-360: LGTM! ErrorMetadataAnnotations follows the same pattern as UnsupportedMetadataAnnotations, using NotImplementedMeta to block mutation operations while providing clear error messages via value and to_ls_task. Good subclass design to preserve isinstance(x, MetadataAnnotations) checks.

dagshub/data_engine/model/query_result.py (3)

41-47: LGTM! Import additions are clean and correctly bring in the new types needed for the refactored blob/annotation handling.


396-402: LGTM! Checking for BlobHashMetadata instead of raw strings is the right approach and consistent with the wrapping done in from_gql_edge.


422-434: LGTM! The simplified _convert_annotation_fields call and document field conversion via dp.get_blob(fld) are clean. The get_blob method properly handles both Path (cached) and BlobHashMetadata (needs download) cases.

tests/data_engine/annotation_import/test_annotation_parsing.py (4)

59-73: LGTM! The mock now correctly wraps errors in BlobDownloadError, matching the real _get_blob behavior. This ensures tests exercise the same error-handling code paths as production.


88-89: LGTM! Patching both query_result._get_blob and datapoint._get_blob is necessary since Datapoint.get_blob now calls _get_blob directly from its own module.


142-165: LGTM! Thorough test for the nonexistent annotation path—validates the ErrorMetadataAnnotations type, subclass relationship, NotImplementedError on mutation, and ValueError on value/to_ls_task access with the expected error message.


168-170: LGTM! Good regression test verifying that blob values are wrapped in BlobHashMetadata immediately after parsing the GQL response, before any autoloading occurs.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a crash that occurred when parsing annotations that failed to download. The solution involves:

Changes:

  • Introduced BlobHashMetadata wrapper class to distinguish blob hashes from other string values
  • Introduced BlobDownloadError exception for explicit error handling of blob download failures
  • Introduced ErrorMetadataAnnotations class to represent annotations that failed to download
  • Simplified blob loading logic by using dp.get_blob() consistently

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Files changed:

  • dagshub/data_engine/model/datapoint.py: Added BlobHashMetadata and BlobDownloadError classes; updated from_gql_edge to wrap blob hashes; updated get_blob to handle BlobHashMetadata; fixed a comment typo; updated _get_blob to raise BlobDownloadError on failure
  • dagshub/data_engine/model/query_result.py: Updated get_blob_fields to check for the BlobHashMetadata type; simplified document field processing to use dp.get_blob(); updated _convert_annotation_fields to catch BlobDownloadError and create ErrorMetadataAnnotations; fixed a comment typo
  • dagshub/data_engine/annotation/metadata.py: Added ErrorMetadataAnnotations class that raises ValueError when accessing value or converting to a Label Studio task
  • tests/data_engine/annotation_import/test_annotation_parsing.py: Added a test for nonexistent annotations; added a test to verify blob metadata wrapping; updated the mock function to raise BlobDownloadError; added a monkeypatch for datapoint._get_blob
Comments suppressed due to low confidence (1)

dagshub/data_engine/model/query_result.py:470

  • When a ValidationError occurs during annotation parsing, the code passes metadata_value to UnsupportedMetadataAnnotations, which expects original_value: bytes. However, metadata_value could be a BlobHashMetadata object, Path, or bytes. This should use annotation_content instead, which is guaranteed to be bytes from the successful dp.get_blob() call on line 460.
                        dp.metadata[fld] = UnsupportedMetadataAnnotations(
                            datapoint=dp, field=fld, original_value=metadata_value
                        )


Member

@guysmoilov guysmoilov left a comment


Am I right in my understanding that this solves the issue of blobs having ambiguous types, while avoiding compatibility issues for customers by not making the internal type visible to callers? Is there a situation where it is exposed to callers? e.g. if iterating over datapoints in a query result and accessing the field value directly?

Also, I kind of don't understand the PR description. If the blob downloads fail, why wouldn't we expect ds.all() to raise an error?

Comment on lines +463 to +465

    annotation_content = dp.get_blob(fld)
    dp.metadata[fld] = MetadataAnnotations.from_ls_task(
-       datapoint=dp, field=fld, ls_task=metadata_value
+       datapoint=dp, field=fld, ls_task=annotation_content
Member


Isn't it weird for the dp.get_blob to convert to ls_task (bytes), then reconvert it to MetadataAnnotations here? Or is this for the scenario where dp.get_blob returns bytes directly without going through MetadataAnnotations (loading from disk IIUC)? Feels clumsy, like maybe converting to MetadataAnnotations should happen only here or in dp.get_blob but why in both? In which scenario will elif isinstance(current_value, MetadataAnnotations): in line 213 be true?

Member Author

@kbolashev kbolashev Feb 19, 2026


The intentional data flow here is:

  • get blob hashes from backend and make them BlobHashMetadata
  • qr.get_blob_fields is called via autoloading/explicitly by user
  • _get_blob fills in the field's value with, most probably, the raw bytes, but could be other things (e.g. Path if load_into_memory is False)
  • qr._convert_annotation_fields goes over all annotation metadata and tries to convert them
  • During that, in order to get the bytes, dp.get_blob() is called for the field. This ensures that the value in the field is guaranteed to be bytes, and not Path, str, or anything else.

The reason dp.get_blob() converts the annotation to bytes is that it is a publicly exposed function, and it makes sense that calling it on annotation metadata returns the raw LS task's bytes. It isn't strictly necessary for the main purpose of this PR; I added it more as a defensive/interface-consistency check.

This part will not be hit if you, for example, try to load and convert an annotation field a second time, because at that point the conversion has already happened, and this type guard prevents a double conversion:

elif isinstance(metadata_value, MetadataAnnotations):
continue

    except ValidationError:
        dp.metadata[fld] = UnsupportedMetadataAnnotations(
-           datapoint=dp, field=fld, original_value=metadata_value
+           datapoint=dp, field=fld, original_value=annotation_content
Member


Isn't it dangerous to use annotation_content when it's undefined because an error was thrown from dp.get_blob? Or do you think it can only be thrown from from_ls_task? Maybe that means the catches should be separate? Just asking questions

Member Author

@kbolashev kbolashev Feb 19, 2026


ValidationError is only thrown by the Pydantic code, i.e. in from_ls_task, which means that get_blob() has already succeeded or thrown a different uncaught error (an unlikely file-not-found, for example).

@kbolashev kbolashev requested a review from guysmoilov February 19, 2026 08:53
@kbolashev kbolashev merged commit 18bffc9 into main Feb 19, 2026
9 checks passed
@kbolashev kbolashev deleted the bug/parsing-errored-annotation branch February 19, 2026 09:07
