@weiqingy
Collaborator

@weiqingy weiqingy commented Jan 29, 2026

Linked issue: #508

Purpose of change

Fix a crash that occurs when restoring a Python async action from checkpoint.

Root cause: Python coroutines cannot be serialized. When a checkpoint captures state while an async action is in progress (e.g., waiting for an LLM response), the coroutine object is lost. On restore, the awaitable reference is None, which causes:
AttributeError: 'NoneType' object has no attribute 'send'
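
The root cause can be reproduced outside the project with the standard library alone. The snippet below is a minimal standalone illustration, not code from this PR: `call_llm` is a hypothetical stand-in for an async action, and `pickle` stands in for whatever serializer the checkpoint uses.

```python
import pickle


async def call_llm():
    # Hypothetical stand-in for an async action awaiting an LLM response.
    return "response"


coro = call_llm()
try:
    pickle.dumps(coro)
    pickled = True
except (TypeError, pickle.PicklingError):
    # Coroutine objects carry live frame state and cannot be pickled.
    pickled = False
finally:
    coro.close()  # avoid the "coroutine was never awaited" warning

# After a restore, the reference that used to hold the coroutine is None.
restored_awaitable = None
try:
    restored_awaitable.send(None)
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)
```

Driving the restored reference with `send` fails in the same way as the crash reported above, because there is no coroutine left to resume.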

Fix:

  • Detect a None awaitable in PythonActionExecutor.callPythonAwaitable()
  • Throw AwaitableLostException to signal the awaitable was lost
  • PythonGeneratorActionTask catches this and re-executes the action from the beginning
  • Durable execution cache ensures already-completed calls are skipped
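
The recovery flow in the list above can be sketched in plain Python. This is a simplified model under stated assumptions, not the project's implementation: the real logic lives in the Java classes named above and runs Python through Pemja, and every name here (`AwaitableLost`, `step_awaitable`, `durable_call`, `run_task`, the dict used as a cache) is hypothetical.

```python
class AwaitableLost(Exception):
    """Hypothetical analogue of AwaitableLostException."""


def step_awaitable(awaitable):
    """Advance the awaitable one step; return (done, result)."""
    if awaitable is None:
        # The in-flight awaitable did not survive the checkpoint restore.
        raise AwaitableLost("awaitable lost during checkpoint restore")
    try:
        awaitable.send(None)
        return False, None
    except StopIteration as stop:
        return True, stop.value


def durable_call(cache, key, fn):
    """Run fn once per key; later executions reuse the recorded result."""
    if key not in cache:
        cache[key] = fn()
    return cache[key]


calls = []  # records which expensive calls really ran


def expensive(name):
    calls.append(name)
    return name.upper()


def action(cache):
    # A generator models the async action; `yield` is the suspension point.
    a = durable_call(cache, "step-1", lambda: expensive("a"))
    yield
    b = durable_call(cache, "step-2", lambda: expensive("b"))
    return a + b


def run_task(cache, awaitable):
    try:
        done, value = step_awaitable(awaitable)
    except AwaitableLost:
        # Re-execute the action from the beginning.
        awaitable = action(cache)
        done, value = step_awaitable(awaitable)
    while not done:
        done, value = step_awaitable(awaitable)
    return value


# First attempt: run up to the suspension point, then "crash".
cache = {}
step_awaitable(action(cache))

# Restore: the cache survives, the awaitable reference is None.
result = run_task(cache, None)
```

In this model the first step is served from the cache on re-execution, so `expensive("a")` runs only once even though the action restarts from the top.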

Tests

  • Manual testing: Run react_agent_example.py, trigger LLM timeout, verify job recovers instead of crashing
  • No unit test added - these classes depend on Pemja (Python interpreter) which is difficult to mock; the fix is better validated via e2e testing

API

No public API changes.

Documentation

  • [ ] doc-needed
  • [x] doc-not-needed
  • [ ] doc-included

When a Python async action fails mid-execution (e.g., LLM timeout), the job
restarts from checkpoint. During restore, the Python awaitable (coroutine)
reference is None because Python coroutines cannot be serialized.

Previously this caused: AttributeError: 'NoneType' object has no attribute 'send'

This fix:
- Detects when the awaitable is None in PythonActionExecutor.callPythonAwaitable()
- Throws AwaitableLostException to signal the awaitable was lost
- PythonGeneratorActionTask catches this and re-executes the action from the beginning
- Durable execution cache ensures already-completed calls are skipped
@github-actions github-actions bot added the priority/major (Default priority of the PR or issue), fixVersion/0.2.0 (The feature or bug should be implemented/fixed in the 0.2.0 version), and doc-not-needed (Your PR changes do not impact docs) labels Jan 29, 2026
@weiqingy
Collaborator Author

Hey @Sxnan, wondering if there’s a way to re-run the failed CI? It seems I don’t have access to trigger it. This looks like a pre-existing flaky test, not related to the PR.

@Sxnan
Contributor

Sxnan commented Jan 30, 2026

@weiqingy, thanks for identifying this bug and submitting the fix! The issue you found is real and important.
After reviewing the PR, I made some adjustments to align the Python async handling with how we handle Java continuations. The current approach uses a transient map in the operator (similar to continuationContexts for Java), which makes the checkpoint restore detection more explicit and consistent across both languages.
I've pushed a fixup commit with these changes. Could you please take a look and let me know what you think?
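
For readers unfamiliar with the pattern, the transient-map idea can be modeled roughly as follows. This is a Python sketch of the concept only, assuming a map that is deliberately left out of the snapshot; the actual operator is Java, and `Operator`, `snapshot`, `restore`, and `needs_reexecution` are hypothetical names, not the fixup commit's API.

```python
class Operator:
    """Toy operator with durable state plus a transient awaitable map."""

    def __init__(self, durable_state=None):
        # Survives checkpoints (assumed to hold serializable values only).
        self.durable_state = dict(durable_state or {})
        # Transient: never written to a checkpoint, so it is empty on restore.
        self.awaitables = {}

    def snapshot(self):
        # Only durable state goes into the checkpoint.
        return dict(self.durable_state)

    @classmethod
    def restore(cls, checkpoint):
        return cls(durable_state=checkpoint)

    def needs_reexecution(self, task_key):
        # A task recorded as in progress with no live awaitable must have
        # been restored from a checkpoint.
        in_progress = self.durable_state.get(task_key) == "IN_PROGRESS"
        return in_progress and task_key not in self.awaitables


def pending_action():
    yield  # suspension point of a hypothetical async action


op = Operator()
op.durable_state["task-1"] = "IN_PROGRESS"
op.awaitables["task-1"] = pending_action()
before_restore = op.needs_reexecution("task-1")

restored = Operator.restore(op.snapshot())
after_restore = restored.needs_reexecution("task-1")
```

With this shape, a missing map entry is the explicit signal that the awaitable was lost, rather than a None reference surfacing deep inside the executor.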

@weiqingy
Collaborator Author

Thanks for the quick review and changes, @Sxnan! It LGTM.

@Sxnan
Contributor

Sxnan commented Jan 30, 2026

@xintongsong Could you take a look at this PR?

Contributor

@xintongsong xintongsong left a comment

LGTM

@Sxnan Sxnan merged commit 36e0731 into apache:main Jan 30, 2026
20 checks passed
Sxnan added a commit that referenced this pull request Jan 30, 2026
[runtime] Fix Python awaitable lost during checkpoint restore

Co-authored-by: sxnan <suxuannan95@gmail.com>
@Sxnan
Contributor

Sxnan commented Jan 30, 2026

Ported to release-0.2 in 5d9fb77
