diff --git a/contributing/samples/long_running_task/README.md b/contributing/samples/long_running_task/README.md
new file mode 100644
index 0000000000..b649a5e941
--- /dev/null
+++ b/contributing/samples/long_running_task/README.md
@@ -0,0 +1,182 @@
+# Durable Session Demo
+
+This demo showcases the durable session persistence feature in ADK, which
+enables checkpoint-based durability for long-running agent invocations.
+
+## Overview
+
+Durable sessions provide:
+- **Checkpoint persistence**: Agent state is saved to BigQuery + GCS
+- **Failure recovery**: Resume from the last checkpoint after crashes
+- **Host migration**: Move sessions between hosts seamlessly
+- **Lease management**: Prevent concurrent modifications
+
+## Prerequisites
+
+1. **Google Cloud Project** with billing enabled
+2. **APIs enabled**:
+ - BigQuery API
+ - Cloud Storage API
+ - Vertex AI API (for Gemini models)
+3. **IAM permissions**:
+ - `roles/bigquery.dataEditor`
+ - `roles/storage.objectAdmin`
+ - `roles/aiplatform.user`
+
+## Setup
+
+### 1. Configure your environment
+
+```bash
+# Set your project
+export PROJECT_ID="test-project-0728-467323"
+gcloud config set project $PROJECT_ID
+
+# Set your Google Cloud API key (required for Gemini 3)
+export GOOGLE_CLOUD_API_KEY="your-api-key-here"
+
+# Authenticate
+gcloud auth application-default login
+```
+
+### 2. Create BigQuery and GCS resources
+
+```bash
+# Run the setup script
+python contributing/samples/long_running_task/setup.py
+
+# To verify setup
+python contributing/samples/long_running_task/setup.py --verify
+
+# To clean up resources
+python contributing/samples/long_running_task/setup.py --cleanup
+```
+
+### 3. Run the demo
+
+```bash
+adk web contributing/samples/long_running_task
+```
+
+## Demo Scenarios
+
+### Scenario 1: Long-running table scan
+
+```
+User: Scan the bigquery-public-data.samples.shakespeare table
+
+Agent: [Calls simulate_long_running_scan]
+ [Checkpoint written at async boundary]
+ [Scan completes after ~5-10 seconds]
+ The scan found 164,656 rows with the following findings:
+ - Found 5 instances of 'to be or not to be'
+ - Most common word: 'the' (27,801 occurrences)
+ - Unique words: 29,066
+```
+
+### Scenario 2: Multi-stage pipeline
+
+```
+User: Run a pipeline from source_table to dest_table with transformations:
+ filter, aggregate, join
+
+Agent: [Calls run_data_pipeline]
+ [Checkpoint written at each stage boundary]
+ Pipeline completed successfully:
+ - Stage 1 (filter): 45,000 rows processed
+ - Stage 2 (aggregate): 32,000 rows processed
+ - Stage 3 (join): 28,000 rows processed
+```
+
+### Scenario 3: Failure recovery
+
+1. Start a long-running scan
+2. Kill the process mid-execution
+3. Restart and resume with the invocation_id
+4. Agent continues from the last checkpoint
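+
+As a rough sketch of step 3, the latest checkpoint for a session can be read
+straight from the demo's metadata tables before resuming (the
+`adk_metadata.checkpoints` table and its `session_id`/`created_at` columns
+follow this sample's assumed schema; see the Monitoring section below):
+
+```python
+from google.cloud import bigquery
+
+PROJECT_ID = "test-project-0728-467323"
+
+
+def latest_checkpoint(session_id: str):
+  """Return the most recent checkpoint row for a session, or None."""
+  client = bigquery.Client(project=PROJECT_ID)
+  query = f"""
+      SELECT *
+      FROM `{PROJECT_ID}.adk_metadata.checkpoints`
+      WHERE session_id = @session_id
+      ORDER BY created_at DESC
+      LIMIT 1
+  """
+  job_config = bigquery.QueryJobConfig(
+      query_parameters=[
+          bigquery.ScalarQueryParameter("session_id", "STRING", session_id)
+      ]
+  )
+  rows = client.query(query, job_config=job_config).result()
+  return next(iter(rows), None)
+```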
+
+## Architecture
+
+```
+ +-----------------+
+ | Agent |
+ | (LlmAgent) |
+ +--------+--------+
+ |
+ v
+ +-----------------+
+ | Runner |
+ | (with durability)|
+ +--------+--------+
+ |
+ +----------------+----------------+
+ | |
+ v v
+ +--------------+ +----------------+
+ | BigQuery | | GCS |
+ | (metadata) | | (state blobs) |
+ +--------------+ +----------------+
+ | - sessions | | - checkpoints/ |
+ | - checkpoints| | {session_id}/|
+ +--------------+ +----------------+
+```
+
+## Configuration
+
+The agent is configured in `agent.py`:
+
+```python
+app = App(
+    name="long_running_task",
+ root_agent=root_agent,
+ resumability_config=ResumabilityConfig(is_resumable=True),
+ durable_session_config=DurableSessionConfig(
+ is_durable=True,
+ checkpoint_policy="async_boundary",
+ checkpoint_store=BigQueryCheckpointStore(
+ project=PROJECT_ID,
+ dataset=DATASET,
+ gcs_bucket=GCS_BUCKET,
+ ),
+ lease_timeout_seconds=300,
+ ),
+)
+```
+
+### Checkpoint Policies
+
+- `async_boundary`: Checkpoint when hitting async/long-running operations
+- `every_turn`: Checkpoint after every agent turn
+- `manual`: Only checkpoint when explicitly requested
+
+## Monitoring
+
+### View sessions
+
+```sql
+SELECT * FROM `test-project-0728-467323.adk_metadata.sessions`
+ORDER BY updated_at DESC
+LIMIT 10;
+```
+
+### View checkpoints
+
+```sql
+SELECT * FROM `test-project-0728-467323.adk_metadata.checkpoints`
+ORDER BY created_at DESC
+LIMIT 10;
+```
+
+### List checkpoint blobs
+
+```bash
+gsutil ls -l gs://test-project-0728-467323-adk-checkpoints/checkpoints/
+```
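+
+The same listing can be done from Python with the `google-cloud-storage`
+client; a minimal sketch, assuming the bucket name created by `setup.py`
+(`<project>-adk-checkpoints`):
+
+```python
+from google.cloud import storage
+
+PROJECT_ID = "test-project-0728-467323"
+BUCKET = f"{PROJECT_ID}-adk-checkpoints"
+
+client = storage.Client(project=PROJECT_ID)
+# Print each checkpoint blob with its size and last update time.
+for blob in client.list_blobs(BUCKET, prefix="checkpoints/"):
+  print(blob.name, blob.size, blob.updated)
+```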
+
+## Cleanup
+
+To remove all resources created by this demo:
+
+```bash
+python contributing/samples/long_running_task/setup.py --cleanup
+```
diff --git a/contributing/samples/long_running_task/REVIEW_FEEDBACK.md b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md
new file mode 100644
index 0000000000..c8e6387f69
--- /dev/null
+++ b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md
@@ -0,0 +1,239 @@
+# Design Document Review: Durable Session Persistence for Long-Horizon ADK Agents
+
+**Reviewer:** Claude Code
+**Date:** 2026-02-01
+**Document:** `long_running_task_design.md`
+
+---
+
+## Executive Summary
+
+The design document is **well-structured and comprehensive**, covering a real problem with a thorough technical approach. However, there are **critical accuracy issues** regarding ADK's current capabilities that must be addressed before the document can be considered ready for review.
+
+**Overall Assessment:** Good foundation, requires significant revisions to accurately reflect ADK's existing resumability features.
+
+---
+
+## 1. Reference Validation
+
+### External URLs (7 total) - ALL VALID
+
+| # | URL | Status | Notes |
+|---|-----|--------|-------|
+| 1 | LangGraph durable-execution | VALID | Content matches claims |
+| 2 | LangGraph persistence | VALID | Checkpointing docs |
+| 3 | LangGraph overview | VALID | Framework intro |
+| 4 | LangGraph checkpoints reference | VALID | API docs |
+| 5 | Deep Agents overview | VALID | LangChain library |
+| 6 | Deep Agents long-term memory | VALID | Memory patterns |
+| 7 | Anthropic harnesses article | VALID | Published 2025-11-26 |
+
+---
+
+## 2. CRITICAL ISSUE: ADK Already Has Resumability
+
+### Problem Statement Inaccuracy
+
+The document states (Section 2):
+> "Current ADK sessions are optimized for synchronous 'serving' patterns... state is ephemeral... background execution is not a first-class runtime mode"
+
+**This is inaccurate.** ADK already has an experimental resumability feature:
+
+```python
+# src/google/adk/apps/app.py lines 42-58
+@experimental
+class ResumabilityConfig(BaseModel):
+ """The "resumability" in ADK refers to the ability to:
+ 1. pause an invocation upon a long-running function call.
+ 2. resume an invocation from the last event, if it's paused or failed midway
+ through.
+ """
+ is_resumable: bool = False
+```
+
+### Existing ADK Capabilities Not Mentioned
+
+| Capability | Location | Status |
+|------------|----------|--------|
+| `ResumabilityConfig` | `src/google/adk/apps/app.py:42-58` | Experimental |
+| `should_pause_invocation()` | `src/google/adk/agents/invocation_context.py:355-389` | Implemented |
+| `long_running_tool_ids` | `src/google/adk/events/event.py` | Implemented |
+| Resume from last event | `src/google/adk/runners.py:1294` | Implemented |
+
+### Required Fix
+
+**The document must:**
+1. Acknowledge existing `ResumabilityConfig` and pause/resume capability
+2. Clearly articulate how this proposal **extends** existing features vs. replacing them
+3. Update Section 2 (Problem Statement) to reflect actual gaps (e.g., durable cross-process persistence, BigQuery-based audit, external event triggers)
+
+---
+
+## 3. Technical Review
+
+### 3.1 SQL Schema (Appendix B) - VALID WITH MINOR ISSUES
+
+**Strengths:**
+- Proper partitioning strategy (`PARTITION BY DATE`)
+- Sensible clustering choices
+- JSON columns for flexibility
+
+**Issues:**
+
+1. **Missing primary key constraint on checkpoints:**
+ ```sql
+ -- Should add:
+ PRIMARY KEY (session_id, checkpoint_seq)
+ ```
+
+2. **events table lacks PRIMARY KEY:**
+ ```sql
+ -- Consider adding:
+ PRIMARY KEY (event_id) -- or composite key
+ ```
+
+3. **View `v_latest_checkpoint` uses ARRAY_AGG with OFFSET(0):**
+ - This is valid but will error if no checkpoints exist
+ - Consider `SAFE_OFFSET(0)` or handle NULL case
+
+### 3.2 Python Code Snippets - MOSTLY VALID
+
+**Section 7.1 `write_checkpoint()`:**
+- Logic is sound (two-phase commit pattern)
+- Consider adding error handling for partial failures
+
+**Section 7.2 `reconcile_on_resume()`:**
+- Good idempotency pattern
+- Missing: what happens if `bq.get_job()` fails?
+
+### 3.3 Leasing Approach (Section 7.3) - REASONABLE
+
+The BQ-based optimistic lease is correctly noted as best-effort. The suggestion to use Firestore/Spanner for stronger guarantees is appropriate.
+
+**Suggestion:** Add a concrete example of when to use each backend (BQ vs Firestore).
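+
+For illustration, a best-effort BigQuery lease can be expressed as a conditional
+DML update; the sketch below assumes a sessions table with `lease_owner` and
+`lease_expires_at` columns, which are placeholders chosen here rather than names
+taken from the design:
+
+```python
+from google.cloud import bigquery
+
+
+def try_acquire_lease(
+    client: bigquery.Client, table: str, session_id: str, owner: str, ttl_s: int
+) -> bool:
+  """Best-effort lease: succeeds only if the row is unleased or expired."""
+  query = f"""
+      UPDATE `{table}`
+      SET lease_owner = @owner,
+          lease_expires_at = TIMESTAMP_ADD(
+              CURRENT_TIMESTAMP(), INTERVAL @ttl SECOND)
+      WHERE session_id = @session_id
+        AND (lease_owner IS NULL OR lease_expires_at < CURRENT_TIMESTAMP())
+  """
+  job = client.query(
+      query,
+      job_config=bigquery.QueryJobConfig(
+          query_parameters=[
+              bigquery.ScalarQueryParameter("owner", "STRING", owner),
+              bigquery.ScalarQueryParameter("ttl", "INT64", ttl_s),
+              bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+          ]
+      ),
+  )
+  job.result()  # BQ DML is not a strict lock: this remains best-effort
+  return (job.num_dml_affected_rows or 0) == 1
+```
+
+Firestore or Spanner, as the document suggests, would be the right backends when
+stronger mutual-exclusion guarantees are needed.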
+
+---
+
+## 4. Architecture Feedback
+
+### 4.1 Strengths
+
+1. **Clear separation of control plane (BQ) vs data plane (GCS)** - follows Google best practices
+2. **Logical checkpointing over heap snapshots** - pragmatic and maintainable
+3. **Two-phase commit pattern** - ensures atomic visibility
+4. **Authoritative reconciliation** - critical for BigQuery job scenarios
+5. **Good competitive analysis** (Section 14)
+
+### 4.2 Gaps / Missing Considerations
+
+| Gap | Impact | Suggested Action |
+|-----|--------|------------------|
+| No mention of existing `ResumabilityConfig` | Misleading problem statement | Add section on existing capability |
+| No cost estimates for BQ storage/queries | Budget planning | Add rough estimates |
+| No mention of BQ quota limits | Operational risk | Document relevant quotas |
+| Checkpoint versioning migration strategy | Future maintenance | Expand Section 16.2 |
+| No monitoring/alerting design | Operability | Add observability section |
+| No rollback strategy | Safety | Document how to rollback |
+
+### 4.3 API Contract Review
+
+The proposed `CheckpointableAgentState` interface is clean:
+
+```python
+class CheckpointableAgentState:
+ def export_state(self) -> dict: ...
+ def import_state(self, state: dict) -> None: ...
+```
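+
+For concreteness, a sketch of how an agent-side state object might satisfy this
+interface for the table-scan use case; `TableScanState` and its fields are
+illustrative, not an existing ADK type:
+
+```python
+class TableScanState(CheckpointableAgentState):
+  """Illustrative state holder for a multi-table scan."""
+
+  def __init__(self) -> None:
+    self.job_ledger: dict[str, dict] = {}  # job_id -> {"table": ..., "status": ...}
+    self.pending_tables: list[str] = []
+    self.findings: list[dict] = []
+
+  def export_state(self) -> dict:
+    return {
+        "job_ledger": self.job_ledger,
+        "pending_tables": self.pending_tables,
+        "findings": self.findings,
+    }
+
+  def import_state(self, state: dict) -> None:
+    self.job_ledger = state.get("job_ledger", {})
+    self.pending_tables = state.get("pending_tables", [])
+    self.findings = state.get("findings", [])
+```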
+
+**Suggestion:** Consider alignment with existing ADK patterns:
+- Existing `BaseAgentState` in `src/google/adk/agents/base_agent.py`
+- Existing state patterns in `src/google/adk/sessions/state.py`
+
+---
+
+## 5. Specific Line-by-Line Feedback
+
+### Section 0 (Executive Summary)
+- Line 14: "12-minute barrier" - should cite source or clarify this is environment-specific
+- Line 28: Cost estimate "< $0.01/session-day paused" - show calculation
+
+### Section 2 (Problem Statement)
+- **Major revision needed** - must acknowledge existing resumability
+
+### Section 4.1 (States)
+- Consider: should PAUSED be a first-class `Session.status` field or remain at `InvocationContext` level?
+
+### Section 8 (API Extensions)
+- `checkpoint_policy` options are good, but:
+ - What triggers `superstep`?
+ - How does `manual` interact with `long_running_tool_ids`?
+
+### Section 13 (Moltbot Alignment)
+- Moltbot reference is useful context
+- Consider adding link/citation if public
+
+### Section 18 (Open Questions)
+- Good list, but add: "How does this integrate with existing `ResumabilityConfig`?"
+
+---
+
+## 6. Recommended Document Changes
+
+### High Priority (Must Fix)
+
+1. **Add Section 1.3: "Existing ADK Resumability"**
+ - Document current `ResumabilityConfig` capability
+ - Explain limitations this design addresses
+ - Position proposal as extension, not replacement
+
+2. **Revise Section 2 (Problem Statement)**
+ - Remove/qualify claims about ADK lacking pause/resume
+ - Focus on actual gaps: cross-process durability, external event triggers, enterprise audit
+
+3. **Add explicit integration plan**
+ - How does `CheckpointableAgentState` relate to `BaseAgentState`?
+ - Migration path from current resumability to new design
+
+### Medium Priority
+
+4. Add cost estimation section
+5. Add monitoring/observability design
+6. Add rollback/recovery procedures
+7. Fix SQL schema issues (PKs)
+
+### Low Priority
+
+8. Add Moltbot citation if available
+9. Add BQ quota documentation links
+10. Consider adding architecture diagram (beyond Mermaid sequence)
+
+---
+
+## 7. Summary Table
+
+| Category | Status | Details |
+|----------|--------|---------|
+| External URLs | VALID | All 7 references work |
+| SQL Syntax | VALID with issues | Missing PKs, edge cases |
+| Python Code | VALID | Sound patterns |
+| Problem Statement | INACCURATE | Ignores existing resumability |
+| Architecture | SOUND | Good Google-scale patterns |
+| Completeness | GAPS | Missing cost, monitoring, rollback |
+
+---
+
+## 8. Conclusion
+
+This is a **solid technical design** for extending ADK's capabilities for long-running BigQuery workloads. The core architecture (BQ control plane, GCS data plane, two-phase commit, authoritative reconciliation) is well-reasoned.
+
+**However, the document cannot be approved in its current form** because it misrepresents ADK's existing capabilities. Once the existing `ResumabilityConfig` is acknowledged and the document is repositioned as an extension rather than a new capability, it will be ready for technical review.
+
+**Recommended Next Steps:**
+1. Revise document to acknowledge existing resumability
+2. Add cost/monitoring sections
+3. Fix SQL schema issues
+4. Re-submit for review
+
+---
+
+*Review generated by Claude Code on 2026-02-01*
diff --git a/contributing/samples/long_running_task/__init__.py b/contributing/samples/long_running_task/__init__.py
new file mode 100644
index 0000000000..4015e47d6e
--- /dev/null
+++ b/contributing/samples/long_running_task/__init__.py
@@ -0,0 +1,15 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from . import agent
diff --git a/contributing/samples/long_running_task/agent.py b/contributing/samples/long_running_task/agent.py
new file mode 100644
index 0000000000..10e95f663a
--- /dev/null
+++ b/contributing/samples/long_running_task/agent.py
@@ -0,0 +1,142 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Durable session demo agent with long-running BigQuery operations.
+
+This agent demonstrates the durable session persistence feature, which enables
+checkpointing of agent state to BigQuery + GCS for recovery from failures.
+
+To run this demo:
+ 1. Set up the BigQuery tables and GCS bucket (see setup.py)
+ 2. Set GOOGLE_CLOUD_API_KEY environment variable
+ 3. Run: adk web contributing/samples/long_running_task
+
+Example prompts:
+ - "Scan the bigquery-public-data.samples.shakespeare table"
+ - "Get the schema of bigquery-public-data.samples.github_nested"
+ - "Run a pipeline from source_table to dest_table with filter, aggregate"
+"""
+
+import os
+from functools import cached_property
+
+from google.adk.agents import LlmAgent
+from google.adk.apps import App
+from google.adk.apps import ResumabilityConfig
+from google.adk.durable import BigQueryCheckpointStore
+from google.adk.durable import DurableSessionConfig
+from google.adk.models.google_llm import Gemini
+from google.adk.tools import LongRunningFunctionTool
+from google.genai import Client
+from google.genai import types
+
+from .tools import get_table_schema
+from .tools import run_batch_etl_job
+from .tools import run_data_pipeline
+from .tools import run_demo_analysis
+from .tools import run_extended_analysis
+from .tools import run_ml_training_job
+from .tools import simulate_long_running_scan
+
+# Configuration
+PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "test-project-0728-467323")
+DATASET = "adk_metadata"
+GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints"
+
+# API Key for Vertex AI (must be set via environment variable)
+GOOGLE_CLOUD_API_KEY = os.environ.get("GOOGLE_CLOUD_API_KEY", "")
+
+
+class VertexAIGemini(Gemini):
+ """Custom Gemini model configured for Vertex AI with API key."""
+
+ model: str = "gemini-3-flash-preview"
+
+ @cached_property
+ def api_client(self) -> Client:
+ """Provides the api client configured for Vertex AI."""
+ return Client(
+ vertexai=True,
+ api_key=GOOGLE_CLOUD_API_KEY,
+ http_options=types.HttpOptions(
+ headers=self._tracking_headers(),
+ retry_options=self.retry_options,
+ ),
+ )
+
+
+# Create the checkpoint store
+checkpoint_store = BigQueryCheckpointStore(
+ project=PROJECT_ID,
+ dataset=DATASET,
+ gcs_bucket=GCS_BUCKET,
+)
+
+# Create the root agent with long-running tools using custom Vertex AI model
+root_agent = LlmAgent(
+ model=VertexAIGemini(model="gemini-3-flash-preview"),
+ name="durable_bq_scanner",
+ description="Long-running BigQuery scanner with durable checkpoints",
+ instruction="""You are a data analyst assistant that can run various data processing jobs.
+
+Your capabilities:
+1. Get table schemas - Use get_table_schema for quick schema lookups
+2. Scan tables - Use simulate_long_running_scan for table analysis (~5-10 seconds)
+3. Run data pipelines - Use run_data_pipeline for multi-stage transformations
+4. Demo analysis - Use run_demo_analysis for a 1-minute demo (perfect for presentations!)
+5. Extended analysis - Use run_extended_analysis for jobs that run 1-60 minutes
+6. ML training - Use run_ml_training_job for model training (2-30 minutes based on size)
+7. Batch ETL - Use run_batch_etl_job for large ETL jobs (1-60 minutes)
+
+For quick demos (~1 minute):
+- run_demo_analysis: Specify analysis_type (e.g., "sentiment", "anomaly", "trend", "clustering")
+
+For long-running jobs (10+ minutes):
+- run_extended_analysis: Specify duration_minutes (e.g., 10, 15, 30)
+- run_ml_training_job: Use dataset_size "large" (10 min), "xlarge" (15 min), or "enterprise" (30 min)
+- run_batch_etl_job: Specify processing_minutes (e.g., 10, 15, 30)
+
+The system will automatically checkpoint your progress during long-running
+operations, so you can resume if interrupted.
+
+Important: When using long-running tools, wait for them to complete before
+taking further action. Do not call the same tool again if it returned a
+pending status.
+""",
+ tools=[
+ get_table_schema,
+ LongRunningFunctionTool(func=simulate_long_running_scan),
+ LongRunningFunctionTool(func=run_data_pipeline),
+ LongRunningFunctionTool(func=run_demo_analysis),
+ LongRunningFunctionTool(func=run_extended_analysis),
+ LongRunningFunctionTool(func=run_ml_training_job),
+ LongRunningFunctionTool(func=run_batch_etl_job),
+ ],
+ generate_content_config=types.GenerateContentConfig(
+ temperature=1.0, # Required for Gemini 3
+ ),
+)
+
+# Create the app with durable session configuration
+app = App(
+ name="long_running_task",
+ root_agent=root_agent,
+ resumability_config=ResumabilityConfig(is_resumable=True),
+ durable_session_config=DurableSessionConfig(
+ is_durable=True,
+ checkpoint_policy="async_boundary",
+ checkpoint_store=checkpoint_store,
+ lease_timeout_seconds=300,
+ ),
+)
diff --git a/contributing/samples/long_running_task/comment.md b/contributing/samples/long_running_task/comment.md
new file mode 100644
index 0000000000..356cd67548
--- /dev/null
+++ b/contributing/samples/long_running_task/comment.md
@@ -0,0 +1,1094 @@
+# Design Review Comments and Responses
+
+## Comment 1: Session Service as Durable Persistence
+
+**From:** ADK Team
+**Date:** 2026-02-02
+
+**Comment:**
+> "Session service is the durable session persistence. For local, user starts with InMemoryService, but they can opt-in storage-based session service: SQLite, DatabaseSessionService, BigQuerySessionService, etc."
+
+---
+
+### Response
+
+Thank you for the feedback. You're correct that ADK already has a robust session service hierarchy. This comment raises an important architectural question: **Why introduce a separate CheckpointStore when SessionService already provides persistence?**
+
+#### Key Distinction: Session State vs. Checkpoint State
+
+| Aspect | Session Service | Checkpoint Store (Proposed) |
+|--------|-----------------|----------------------------|
+| **What it stores** | Conversation history (events, messages, tool calls) | Agent execution state (job ledgers, progress cursors, partial results) |
+| **Granularity** | Per-message/event append | Per-checkpoint snapshot at logical boundaries |
+| **Data model** | Event stream (append-only) | Point-in-time snapshots (two-phase commit) |
+| **Primary use case** | Replay conversation context to LLM | Resume long-running task from failure point |
+| **Recovery question** | "What did the agent say?" | "Where was the agent in a 6-hour BigQuery scan?" |
+| **External job tracking** | Tool call events (but not reconciliation-ready) | Authoritative job ledger with status sync |
+
+#### Why Session Service Alone May Be Insufficient
+
+1. **Job Ledger with Authoritative Reconciliation**
+ - Session events record that a tool was called, but don't maintain a ledger that can be reconciled against external job states (DONE/FAILED/RUNNING)
+ - On resume, we need to query BigQuery: "Is job X still running?" and update our ledger accordingly
+ - This reconciliation pattern doesn't fit the append-only event model
+
+2. **Partial Results Persistence**
+ - A 50-table PII scan may complete 30 tables before failure
+ - Checkpoint stores: which tables done, their findings, which remain
+ - Session stores: the conversation about starting the scan
+
+3. **Two-Phase Commit Semantics**
+ - Checkpoints require atomic visibility: GCS blob uploaded AND metadata pointer updated
+ - Session services typically use simpler append semantics
+ - Partial checkpoint writes must not be visible
+
+4. **Workspace Snapshots**
+ - Long-running coding agents may need `/workspace` file persistence
+ - This is binary blob data, not conversation events
+ - Doesn't fit session event model
+
+5. **Different Query Patterns**
+ - Session: "Give me all events for session X in order"
+ - Checkpoint: "Give me the latest checkpoint for session X" (single row)
+ - Fleet ops: "Show me all paused sessions with checkpoints > 1 hour old"
+
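+As a concrete illustration of point 1 above, a minimal reconciliation sketch
+using the public `google-cloud-bigquery` client; the ledger structure is this
+proposal's assumption, not an existing ADK type:
+
+```python
+from google.cloud import bigquery
+
+
+def reconcile_job_ledger(ledger: dict[str, dict], project: str) -> None:
+  """Update a checkpointed job ledger against authoritative BigQuery job states."""
+  client = bigquery.Client(project=project)
+  for job_id, entry in ledger.items():
+    if entry.get("status") != "running":
+      continue
+    job = client.get_job(job_id)  # authoritative state: PENDING / RUNNING / DONE
+    if job.state == "DONE":
+      entry["status"] = "failed" if job.error_result else "complete"
+```
+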
+---
+
+### Potential Approaches
+
+#### Option A: Separate CheckpointStore (Current Design)
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ ADK Application │
+├─────────────────────────────────────────────────────────────┤
+│ SessionService (existing) │ CheckpointStore (new) │
+│ - Conversation history │ - Execution state │
+│ - Event replay for LLM │ - Job ledgers │
+│ - Append-only events │ - Two-phase commit │
+│ - SQLite/DB/BigQuery │ - BigQuery + GCS │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Pros:**
+- Clear separation of concerns
+- Different consistency models for different needs
+- No changes to existing SessionService implementations
+- Checkpoint-specific optimizations (compression, GCS blob storage)
+
+**Cons:**
+- Two services to configure for durable agents
+- Potential confusion about which stores what
+- Additional infrastructure (though can share BigQuery dataset)
+
+#### Option B: Extend SessionService with Checkpoint Capability
+
+```python
+class SessionService(ABC):
+ # Existing methods...
+
+ # New checkpoint methods
+ async def write_checkpoint(
+ self, session_id: str, checkpoint_seq: int, state: bytes, ...
+ ) -> None: ...
+
+ async def read_latest_checkpoint(
+ self, session_id: str
+ ) -> tuple[int, bytes] | None: ...
+```
+
+**Pros:**
+- Single service to configure
+- Unified persistence layer
+- Familiar pattern for ADK users
+
+**Cons:**
+- Mixes conversation semantics with execution semantics
+- May require significant changes to existing implementations
+- Two-phase commit harder to add to existing append-only services
+- Risk of breaking changes
+
+#### Option C: Checkpoint as Special Event Type
+
+```python
+# Store checkpoint as a special event in the session
+event = Event(
+ author="system",
+ type=EventType.CHECKPOINT,
+ checkpoint_data=CheckpointData(
+ seq=5,
+ state_gcs_uri="gs://...",
+ job_ledger={...},
+ )
+)
+session_service.append_event(session_id, event)
+```
+
+**Pros:**
+- Uses existing SessionService infrastructure
+- Single storage location
+- Events remain the universal abstraction
+
+**Cons:**
+- Checkpoint retrieval requires scanning events (inefficient)
+- Two-phase commit semantics still needed for GCS blob
+- Mixing large blobs with conversation events
+- Query patterns still don't match (latest vs. stream)
+
+---
+
+### Recommendation
+
+**Option A (Separate CheckpointStore)** is recommended for v1 because:
+
+1. **Clean separation**: Conversation history and execution state serve different purposes
+2. **No breaking changes**: Existing SessionService implementations unchanged
+3. **Optimized for use case**: Checkpoint-specific features (GCS blobs, two-phase commit, lease management)
+4. **Incremental adoption**: Users can add checkpointing without changing session config
+
+However, we should:
+- Document the relationship clearly
+- Consider Option B for v2 if the pattern proves successful
+- Ensure both can share the same BigQuery dataset for operational simplicity
+
+---
+
+## Suggested Updates to Design Doc
+
+Based on this feedback, the following sections should be added/updated in `long_running_task_design.md`:
+
+### 1. Add New Section: "Relationship to Existing Session Service"
+
+**Location:** After Section 5 (Architecture Overview)
+
+```markdown
+## 5.4 Relationship to Existing Session Service
+
+ADK provides a `SessionService` abstraction for conversation persistence:
+
+| Implementation | Storage | Use Case |
+|----------------|---------|----------|
+| `InMemorySessionService` | RAM | Development/testing |
+| `SQLiteSessionService` | Local SQLite | Single-machine persistence |
+| `DatabaseSessionService` | PostgreSQL/MySQL | Production multi-instance |
+| `BigQuerySessionService` | BigQuery | Enterprise scale |
+
+**Why a separate CheckpointStore?**
+
+The `SessionService` and `CheckpointStore` serve complementary purposes:
+
+| SessionService | CheckpointStore |
+|----------------|-----------------|
+| Conversation history | Execution state snapshots |
+| Append-only events | Point-in-time checkpoints |
+| LLM context replay | Task resume from failure |
+| Per-event granularity | Per-checkpoint granularity |
+
+A durable long-horizon agent typically uses both:
+- `SessionService` for conversation continuity
+- `CheckpointStore` for execution state durability
+
+**Shared Infrastructure**
+
+Both services can share the same BigQuery dataset:
+- `adk_metadata.sessions` (SessionService)
+- `adk_metadata.events` (SessionService)
+- `adk_metadata.durable_sessions` (CheckpointStore)
+- `adk_metadata.checkpoints` (CheckpointStore)
+```
+
+### 2. Update Section 8.2 (Configuration)
+
+Add clarity about the relationship:
+
+````markdown
+### 8.2 Configuration
+
+```python
+# A durable agent uses BOTH session service and checkpoint store
+app = App(
+ name="durable_scanner",
+ root_agent=agent,
+
+ # Session service for conversation history (existing)
+ session_service=BigQuerySessionService(
+ project="my-project",
+ dataset="adk_metadata",
+ ),
+
+ # Checkpoint store for execution state (new)
+ durable_session_config=DurableSessionConfig(
+ is_durable=True,
+ checkpoint_store=BigQueryCheckpointStore(
+ project="my-project",
+ dataset="adk_metadata", # Can share dataset
+ gcs_bucket="my-checkpoints",
+ ),
+ ),
+)
+```
+
+**Note:** Both services can share the same BigQuery dataset. The checkpoint tables use a `durable_` prefix to avoid conflicts.
+````
+
+### 3. Add to Section 15 (Alternatives Considered)
+
+```markdown
+| Alternative | Why not (v1) |
+|-------------|--------------|
+| Extend SessionService with checkpoint methods | Different consistency models; risk of breaking changes to existing implementations |
+| Checkpoint as special Event type | Inefficient retrieval (scan vs. point lookup); mixes blob storage with events |
+```
+
+### 4. Add FAQ Entry
+
+```markdown
+## Appendix F: FAQ
+
+### Why not just use SessionService for checkpoints?
+
+SessionService is optimized for conversation history (append-only event streams).
+Checkpoints require:
+- Point-in-time snapshots (not event streams)
+- Two-phase commit (GCS blob + metadata atomicity)
+- Different query patterns (latest-per-session, not full history)
+- Large blob storage (workspace snapshots)
+
+The separation ensures each service is optimized for its use case.
+
+### Can I use CheckpointStore without SessionService?
+
+Yes, but not recommended. SessionService provides conversation context for
+the LLM on resume. Without it, the agent loses conversation history.
+
+### Do they share the same BigQuery dataset?
+
+Yes, recommended. Use the same dataset with different table prefixes:
+- SessionService: `sessions`, `events`
+- CheckpointStore: `durable_sessions`, `checkpoints`
+```
+
+---
+
+## Action Items
+
+- [ ] Add Section 5.4 to design doc
+- [ ] Update Section 8.2 with dual-service example
+- [ ] Add alternatives to Section 15
+- [ ] Add FAQ appendix
+- [ ] Consider renaming tables to avoid confusion (`durable_sessions` vs `sessions`)
+- [ ] Document shared dataset configuration in README
+
+---
+
+## Open Questions for ADK Team
+
+1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset?
+2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both?
+3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail?
+4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage?
+
+---
+
+## Comment 2: GcsArtifactService for Large Blobs
+
+**From:** ADK Team
+**Date:** 2026-02-02
+
+**Comment:**
+> "In ADK, ArtifactService is designed for large blobs. Have you checked that? We have a GcsArtifactService in the core library."
+
+---
+
+### Response
+
+Thank you for pointing this out. Yes, I've reviewed `GcsArtifactService` (`src/google/adk/artifacts/gcs_artifact_service.py`) and the `BaseArtifactService` interface. This is a valid consideration.
+
+#### Current ArtifactService Capabilities
+
+| Feature | GcsArtifactService |
+|---------|-------------------|
+| Storage backend | GCS bucket |
+| Key structure | `{app_name}/{user_id}/{session_id}/{filename}/{version}` |
+| Versioning | Monotonic integer versions (0, 1, 2, ...) |
+| Data type | `types.Part` (inline_data, text, file_data) |
+| Metadata | Custom metadata dict on blob |
+| Operations | save, load, list, delete, list_versions |
+
+#### Checkpoint Blob Requirements
+
+| Requirement | ArtifactService Support | Gap |
+|-------------|------------------------|-----|
+| Store bytes/JSON blobs | Yes (`types.Part.from_bytes`) | None |
+| Session-scoped storage | Yes | None |
+| Version tracking | Yes (monotonic) | Checkpoint uses `checkpoint_seq` |
+| Custom metadata | Yes | Need SHA-256, trigger, size_bytes |
+| Two-phase commit | **No** | Critical gap |
+| Atomic visibility with BQ | **No** | Critical gap |
+| Workspace tar.gz bundles | Partially (as bytes) | None |
+| Integrity verification | **No** | Need SHA-256 on read |
+
+#### Key Gaps
+
+**1. Two-Phase Commit Semantics**
+
+The checkpoint pattern requires:
+```
+Phase 1: Upload blob to GCS (may fail, invisible)
+Phase 2: Insert metadata to BigQuery (makes checkpoint visible)
+```
+
+`GcsArtifactService.save_artifact()` uploads and returns immediately. There is no coordination with an external metadata store: once a blob is uploaded it is immediately "visible" via `load_artifact()`, even if the corresponding metadata write later fails.
+
+**2. Atomic Visibility with BigQuery Metadata**
+
+Checkpoints must be invisible until both:
+- GCS blob exists AND
+- BigQuery metadata row exists
+
+`GcsArtifactService` doesn't have this concept - artifacts are visible as soon as they're uploaded.
+
+**3. SHA-256 Integrity Verification**
+
+Checkpoints require integrity verification on read:
+```python
+# On read: verify the downloaded blob against the checksum stored in metadata
+blob = gcs.download(uri)
+if hashlib.sha256(blob).hexdigest() != metadata.sha256:
+  raise CheckpointCorruptionError()
+```
+
+`GcsArtifactService` doesn't compute or verify checksums.
+
+**4. Key Structure Mismatch**
+
+| Service | Key Pattern |
+|---------|-------------|
+| ArtifactService | `{app}/{user}/{session}/{filename}/{version}` |
+| CheckpointStore | `{session_id}/{checkpoint_seq}/state.json` |
+
+Checkpoints don't have `app_name`, `user_id`, or `filename` - they're keyed purely by `session_id` + `checkpoint_seq`.
+
+---
+
+### Potential Approaches
+
+#### Option A: Use GcsArtifactService as Underlying Storage (Adapt)
+
+```python
+class BigQueryCheckpointStore(DurableSessionStore):
+ def __init__(self, artifact_service: GcsArtifactService, ...):
+ self._artifact_service = artifact_service
+
+ async def write_checkpoint(self, session_id, seq, state_blob, ...):
+ # Phase 1: Use artifact service for GCS upload
+ version = await self._artifact_service.save_artifact(
+ app_name="checkpoints",
+ user_id="system",
+ session_id=session_id,
+ filename=f"checkpoint_{seq}",
+ artifact=types.Part.from_bytes(state_blob, mime_type="application/json"),
+ custom_metadata={"sha256": sha256(state_blob)},
+ )
+
+ # Phase 2: Insert BQ metadata (makes checkpoint visible)
+ await self._insert_bq_metadata(session_id, seq, ...)
+```
+
+**Pros:**
+- Reuses existing GCS infrastructure
+- Consistent with ADK patterns
+- Less code duplication
+
+**Cons:**
+- Awkward key mapping (`app_name="checkpoints"`, `user_id="system"`)
+- Still need custom two-phase commit logic
+- Still need SHA-256 verification layer
+- Version semantics don't match (artifact version vs checkpoint_seq)
+
+#### Option B: Direct GCS Client (Current Design)
+
+```python
+class BigQueryCheckpointStore(DurableSessionStore):
+ def __init__(self, gcs_bucket: str, ...):
+ self._gcs_client = storage.Client()
+ self._bucket = self._gcs_client.bucket(gcs_bucket)
+
+ async def write_checkpoint(self, session_id, seq, state_blob, ...):
+ # Phase 1: Direct GCS upload with preconditions
+ blob = self._bucket.blob(f"{session_id}/{seq}/state.json")
+ blob.upload_from_string(
+ state_blob,
+ if_generation_match=0, # Fail if exists (idempotency)
+ )
+
+ # Phase 2: Insert BQ metadata
+ await self._insert_bq_metadata(session_id, seq, ...)
+```
+
+**Pros:**
+- Full control over GCS operations
+- Clean key structure
+- Native support for preconditions (`if_generation_match`)
+- Simpler code path
+
+**Cons:**
+- Doesn't leverage existing ArtifactService
+- Separate GCS client initialization
+
+#### Option C: Extend ArtifactService Interface
+
+Add checkpoint-specific methods to `BaseArtifactService`:
+
+```python
+class BaseArtifactService(ABC):
+ # Existing methods...
+
+ # New: Checkpoint-specific operations
+ async def save_checkpoint_blob(
+ self,
+ *,
+ session_id: str,
+ checkpoint_seq: int,
+ blob: bytes,
+ sha256: str,
+ ) -> str:
+ """Save a checkpoint blob and return GCS URI."""
+ ...
+
+ async def load_checkpoint_blob(
+ self,
+ *,
+ session_id: str,
+ checkpoint_seq: int,
+ expected_sha256: str,
+ ) -> bytes:
+ """Load and verify checkpoint blob."""
+ ...
+```
+
+**Pros:**
+- Unified artifact/checkpoint interface
+- Extensible for future blob types
+
+**Cons:**
+- Modifies core ADK interface
+- Checkpoint semantics may not fit all artifact backends
+- Two-phase commit still external
+
+---
+
+### Recommendation
+
+**Option B (Direct GCS Client)** is recommended for v1 because:
+
+1. **Simpler implementation**: No adapter layer or key mapping
+2. **Full control**: Native GCS preconditions for idempotency
+3. **Clean semantics**: Checkpoint keys match checkpoint concepts
+4. **No interface changes**: Doesn't require modifying BaseArtifactService
+
+However, we should:
+- Document the relationship with ArtifactService
+- Consider Option A or C for v2 if there's desire for unification
+- Ensure both can share the same GCS bucket if needed
+
+---
+
+### Suggested Design Doc Updates
+
+Add to Section 15 (Alternatives Considered):
+
+```markdown
+| Alternative | Why not (v1) |
+|-------------|--------------|
+| Use GcsArtifactService for checkpoint blobs | Key structure mismatch; no two-phase commit support; no SHA-256 verification; would require adapter layer |
+```
+
+Add to Section 5.3 (Integration with Existing ADK Services):
+
+```markdown
+### Relationship to ArtifactService
+
+ADK's `ArtifactService` (`GcsArtifactService`, `FileArtifactService`, etc.) is designed for
+user/session-scoped file artifacts with versioning.
+
+Checkpoints have different requirements:
+- Two-phase commit with BigQuery metadata
+- SHA-256 integrity verification
+- Different key structure (session_id/checkpoint_seq)
+
+For v1, `CheckpointStore` uses direct GCS client access. Future versions may consider
+unifying with `ArtifactService` if the interface can be extended to support checkpoint
+semantics.
+```
+
+---
+
+## Comment 3: Leasing as General Requirement
+
+**From:** ADK Team
+**Date:** 2026-02-02
+
+**Reference:** Section 7.3 - "We must ensure only one runner resumes a session at a time"
+
+**Comment:**
+> "This is not only applicable to resume. `Runner.run_async` also requires this. Leasing is a general requirement for app developers."
+
+---
+
+### Response
+
+This is an important clarification. You're correct that session-level concurrency control is a **general requirement**, not specific to durable session resume.
+
+#### Expanded Scope of Leasing
+
+| Scenario | Concurrency Risk | Current ADK Handling |
+|----------|------------------|---------------------|
+| Multiple `run_async()` on same session | Race conditions, duplicate tool calls | App developer responsibility |
+| Resume after pause | Duplicate resume attempts | App developer responsibility |
+| Pub/Sub event redelivery | Multiple runners wake on same event | App developer responsibility |
+| Horizontal scaling | Multiple instances claim same session | App developer responsibility |
+
+The design doc incorrectly scoped leasing as a "durable session" concern. In reality:
+
+```
+Leasing requirement = ANY scenario where multiple runners might access the same session
+```
+
+#### Current State in ADK
+
+Looking at `Runner.run_async()` in `src/google/adk/runners.py`:
+
+```python
+async def run_async(
+ self,
+ *,
+ user_id: str,
+ session_id: str,
+ new_message: types.Content,
+ ...
+) -> AsyncGenerator[Event, None]:
+ # No built-in lease acquisition
+ # App developer must ensure single-runner-per-session
+```
+
+There's no built-in lease mechanism. App developers must implement their own concurrency control.
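+
+For example, a minimal in-process guard an app developer might write today; it
+is illustrative only, serializes calls within a single process, and does not
+help with multiple instances, which is the gap discussed below:
+
+```python
+import asyncio
+from collections import defaultdict
+
+# One lock per session_id; only meaningful within a single process.
+_session_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
+
+
+async def run_serialized(runner, *, user_id, session_id, new_message):
+  """Ensure only one run_async is active per session in this process."""
+  async with _session_locks[session_id]:
+    async for event in runner.run_async(
+        user_id=user_id, session_id=session_id, new_message=new_message
+    ):
+      yield event
+```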
+
+#### Implications for Design
+
+**Option A: Leasing in Durable Layer Only (Current Design)**
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ ADK Application │
+├─────────────────────────────────────────────────────────────┤
+│ Runner.run_async() │ CheckpointStore │
+│ - No built-in leasing │ - Has lease management │
+│ - App manages concurrency │ - Protects resume only │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Pros:** Non-breaking, durable sessions get protection
+**Cons:** Inconsistent; regular sessions still unprotected
+
+**Option B: Leasing in Runner (Framework-Level)**
+
+```python
+class Runner:
+ def __init__(self, ..., lease_manager: Optional[LeaseManager] = None):
+ self._lease_manager = lease_manager
+
+ async def run_async(self, ..., session_id: str, ...):
+ if self._lease_manager:
+ lease = await self._lease_manager.acquire(session_id)
+ if not lease:
+ raise SessionLeaseDeniedError(session_id)
+ try:
+ # ... execute agent logic
+ finally:
+ if self._lease_manager:
+ await self._lease_manager.release(session_id)
+```
+
+**Pros:** Consistent protection for all sessions
+**Cons:** Breaking change; requires lease manager configuration
+
+**Option C: Leasing in SessionService (Storage-Level)**
+
+```python
+class BaseSessionService(ABC):
+ @abstractmethod
+ async def acquire_session_lease(
+ self, session_id: str, lease_id: str, ttl_seconds: int
+ ) -> bool: ...
+
+ @abstractmethod
+ async def release_session_lease(
+ self, session_id: str, lease_id: str
+ ) -> None: ...
+```
+
+**Pros:** Unified with session storage; natural fit
+**Cons:** Requires changes to all SessionService implementations
+
+---
+
+### Recommendation
+
+**Short-term (v1):** Keep leasing in `CheckpointStore` for durable sessions, but:
+- Update design doc to acknowledge this is a subset of a broader need
+- Document that app developers need their own concurrency control for non-durable sessions
+
+**Medium-term (v2):** Consider adding leasing to `SessionService` interface:
+- `BigQuerySessionService` already has infrastructure for this
+- `DatabaseSessionService` can use row-level locks
+- `InMemorySessionService` can use asyncio locks
+
+**Long-term:** Consider Runner-level lease integration as opt-in feature.
+
+---
+
+### Suggested Design Doc Updates
+
+**Update Section 7.3 Title:**
+
+From:
+> "7.3 Leasing & optimistic concurrency"
+
+To:
+> "7.3 Leasing & optimistic concurrency (session-level)"
+
+**Add Clarification Paragraph:**
+
+```markdown
+### 7.3 Leasing & Optimistic Concurrency
+
+**Note:** Session-level concurrency control is a general ADK requirement, not
+specific to durable sessions. Any scenario where multiple runners might access
+the same session requires leasing:
+
+- Multiple `run_async()` calls on the same session
+- Resume after pause (durable or in-process)
+- Event-driven wake-up with potential redelivery
+- Horizontal scaling with shared session storage
+
+Currently, ADK leaves session leasing to app developers. The durable session
+layer provides lease management for checkpoint-protected sessions, but this
+does not cover all concurrency scenarios.
+
+**Future consideration:** Add optional `LeaseManager` to `Runner` or lease
+methods to `SessionService` interface for framework-level protection.
+```
+
+**Add to Section 18 (Open Questions):**
+
+```markdown
+| Question | Risk Level | Notes |
+|----------|------------|-------|
+| Framework-level leasing | Medium | Should Runner have built-in lease support? Would require LeaseManager abstraction |
+| SessionService lease methods | Medium | Natural fit but requires interface changes |
+```
+
+---
+
+## Comment 4: Cross-Process Durability Clarification
+
+**From:** ADK Team
+**Date:** 2026-02-02
+
+**Reference:** Section 1.2 - "Cross-process durability: state lost if the process dies"
+
+**Comment:**
+> "Could you elaborate on this? I think agent state is persisted in the event and the event will be persisted in the selected session service."
+
+---
+
+### Response
+
+You're correct that session events are persisted in the SessionService. Let me clarify what "state lost" means in the context of long-running tasks.
+
+#### What IS Preserved (SessionService Events)
+
+| Data | Preserved? | Location |
+|------|------------|----------|
+| User messages | Yes | Session events |
+| Agent responses | Yes | Session events |
+| Tool call records | Yes | Session events (tool name, args, result) |
+| LLM conversation context | Yes | Replayable from events |
+
+#### What May NOT Be Preserved (or Not Usable)
+
+| Data | Preserved? | Issue |
+|------|------------|-------|
+| In-flight tool execution | **No** | Process dies mid-tool-call |
+| External job handles | **Partial** | Job ID in event, but no reconciliation structure |
+| Multi-step operation progress | **No** | "I'm on step 3 of 7" not tracked |
+| Agent's execution plan | **No** | Task graph, priorities, dependencies |
+| Partial aggregated results | **No** | "Scanned 30 of 50 tables, found X so far" |
+| Workspace files in progress | **No** | Draft reports, intermediate artifacts |
+
+#### Concrete Example: 50-Table PII Scan
+
+**Scenario:** Agent is scanning 50 BigQuery tables for PII. Process dies after completing 30 tables.
+
+**With SessionService only:**
+
+```
+Events stored:
+ - User: "Scan all tables for PII"
+ - Agent: "I'll scan these 50 tables..."
+ - ToolCall: scan_table("table_1") → {findings: [...]}
+ - ToolCall: scan_table("table_2") → {findings: [...]}
+ ...
+ - ToolCall: scan_table("table_30") → {findings: [...]}
+ - [PROCESS DIES HERE]
+```
+
+On restart:
+- Events replay to LLM ✓
+- LLM sees 30 tool calls completed ✓
+- But: **LLM must re-deduce** which tables remain
+- But: **No structured job ledger** for reconciliation
+- But: **Aggregated findings** must be re-computed from events
+- Risk: **LLM may miscount** or re-scan tables
+
+**With Checkpoint + SessionService:**
+
+```
+Checkpoint stored:
+ {
+ "job_ledger": {
+ "table_1": {"status": "complete", "findings": 3},
+ "table_2": {"status": "complete", "findings": 0},
+ ...
+ "table_30": {"status": "complete", "findings": 5},
+ "table_31": {"status": "pending"},
+ ...
+ "table_50": {"status": "pending"}
+ },
+ "aggregated_findings": {
+ "total_tables_scanned": 30,
+ "total_findings": 47,
+ "findings_by_type": {"email": 20, "ssn": 15, "phone": 12}
+ },
+ "execution_plan": {
+ "current_phase": "scanning",
+ "next_table_index": 31
+ }
+ }
+```
+
+On restart:
+- Load checkpoint ✓
+- Know exactly which tables remain ✓
+- Reconcile with BigQuery job states ✓
+- Continue with aggregated state intact ✓
+- No LLM re-deduction needed ✓
+
+#### The Key Distinction
+
+| Aspect | Session Events | Checkpoint State |
+|--------|----------------|------------------|
+| Purpose | LLM conversation context | Execution state recovery |
+| Structure | Append-only event stream | Point-in-time snapshot |
+| Recovery mode | Replay events to LLM | Load structured state |
+| External jobs | Tool call records | Reconcilable job ledger |
+| Aggregations | Must re-compute from events | Pre-computed, ready to use |
+| Reliability | LLM must re-deduce state | Deterministic restoration |
+
+#### When Session Events Are Sufficient
+
+Session events alone work well for:
+- Short conversations (< 5 min)
+- Simple tool calls (no external async jobs)
+- Stateless operations (each tool call independent)
+- Human-in-the-loop flows (human provides continuity)
+
+#### When Checkpoints Add Value
+
+Checkpoints are valuable for:
+- Long-running operations (hours/days)
+- External async jobs (BigQuery, Cloud Build, ML training)
+- Multi-step plans with dependencies
+- Aggregated/computed state (partial results)
+- Deterministic recovery (no LLM re-deduction)
+
+---
+
+### End-to-End Concrete Example: Enterprise PII Compliance Audit
+
+Let me walk through a complete scenario showing what the checkpoint approach enables that event logging alone cannot.
+
+#### Scenario Setup
+
+**Task:** Scan 100 BigQuery tables across 5 datasets for PII (emails, SSNs, phone numbers) to generate a compliance report.
+
+**Environment:**
+- Cloud Run with 60-minute timeout
+- Each table scan takes 2-10 minutes (BigQuery job)
+- Total expected runtime: ~8 hours
+- Multiple Cloud Run instances may be involved
+
+**User Request:**
+```
+"Scan all tables in the customer_data, transactions, analytics,
+logs, and marketing datasets for PII. Generate a compliance report
+with findings by table and recommendations."
+```
+
+---
+
+#### Timeline: What Happens
+
+```
+Hour 0:00 - Agent starts
+ - Discovers 100 tables across 5 datasets
+ - Creates execution plan: scan tables, aggregate findings, generate report
+ - Begins scanning tables
+
+Hour 2:30 - Progress checkpoint
+ - 35 tables scanned
+ - 127 PII findings so far
+ - 15 BigQuery jobs completed, 2 running, 83 pending
+
+Hour 3:15 - PROCESS DIES (Cloud Run timeout/crash)
+ - 2 BigQuery jobs still running in the cloud
+ - Agent process terminated
+```
+
+---
+
+#### Path A: Event Logging Only (Current ADK)
+
+**Events stored in SessionService:**
+```json
+[
+ {"type": "user_message", "content": "Scan all tables..."},
+ {"type": "agent_message", "content": "I'll scan 100 tables..."},
+ {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.users"}, "result": {"job_id": "job_001", "status": "submitted"}},
+ {"type": "tool_call", "tool": "get_job_result", "args": {"job_id": "job_001"}, "result": {"findings": [{"type": "email", "column": "contact_email", "count": 15000}]}},
+ {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.orders"}, "result": {"job_id": "job_002", "status": "submitted"}},
+ // ... 70 more tool call events ...
+ {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "analytics.events"}, "result": {"job_id": "job_037", "status": "submitted"}},
+ // PROCESS DIES - no more events
+]
+```
+
+**On Restart (New Cloud Run Instance):**
+
+1. **Events replay to LLM** - LLM sees conversation history ✓
+
+2. **LLM must re-deduce state:**
+ ```
+ LLM thinking: "Looking at these events... I see job_001 through job_037
+ were submitted. Some have results, some don't. Let me figure out what's done..."
+ ```
+
+3. **Problems:**
+
+ | Problem | Impact |
+ |---------|--------|
+ | **Job status unknown** | job_036, job_037 may have completed while process was dead - LLM doesn't know |
+ | **No structured ledger** | LLM must parse 70+ events to determine table status |
+ | **Aggregation lost** | "127 findings so far" must be re-counted from events |
+ | **May re-submit jobs** | LLM might re-scan tables it already scanned |
+ | **May miss completed jobs** | Jobs that finished during downtime have results waiting |
+ | **Non-deterministic** | Different LLM calls may reach different conclusions |
+
+4. **Likely LLM Response:**
+ ```
+ "I see we were scanning tables for PII. Let me check what's been done...
+ [Spends tokens re-parsing events]
+ I think tables 1-35 are done. Let me continue with table 36...
+
+ Actually, I'm not sure if job_036 completed. Let me re-submit it to be safe."
+ ```
+
+5. **Result:**
+ - Duplicate BigQuery jobs (wasted cost)
+ - Inconsistent findings count
+ - Report may have duplicates or gaps
+ - ~30 minutes spent "figuring out" state
+
+---
+
+#### Path B: Checkpoint + Event Logging (Proposed)
+
+**Checkpoint stored (in addition to events):**
+```json
+{
+ "checkpoint_seq": 15,
+ "created_at": "2026-02-02T05:30:00Z",
+
+ "execution_plan": {
+ "phase": "scanning",
+ "total_tables": 100,
+ "tables_completed": 35,
+ "tables_in_progress": 2,
+ "tables_pending": 63
+ },
+
+ "job_ledger": {
+ "job_001": {"table": "customer_data.users", "status": "complete", "findings": 3},
+ "job_002": {"table": "customer_data.orders", "status": "complete", "findings": 0},
+ // ... jobs 3-35: complete ...
+ "job_036": {"table": "analytics.sessions", "status": "running", "submitted_at": "2026-02-02T05:28:00Z"},
+ "job_037": {"table": "analytics.events", "status": "running", "submitted_at": "2026-02-02T05:29:00Z"}
+ },
+
+ "aggregated_findings": {
+ "total_findings": 127,
+ "by_type": {"email": 45, "ssn": 32, "phone": 28, "address": 22},
+ "by_dataset": {"customer_data": 67, "transactions": 35, "analytics": 25},
+ "tables_with_pii": ["customer_data.users", "customer_data.profiles", "..."]
+ },
+
+ "pending_tables": [
+ "analytics.pageviews",
+ "logs.access_logs",
+ // ... 63 more tables ...
+ ]
+}
+```
+
+**On Restart (New Cloud Run Instance):**
+
+1. **Load checkpoint** - Deterministic state restoration ✓
+
+2. **Reconcile with BigQuery:**
+ ```python
+ # Automatic reconciliation
+ for job_id, job_meta in checkpoint["job_ledger"].items():
+ if job_meta["status"] == "running":
+ actual_status = bq_client.get_job(job_id).state
+ if actual_status == "DONE":
+ # Job completed while we were dead - fetch results
+ results = fetch_results(job_id)
+ update_findings(results)
+ job_meta["status"] = "complete"
+ ```
+
+3. **Result of reconciliation:**
+ ```
+ Checkpoint loaded: 35 tables complete, 2 in-progress
+ Reconciliation: job_036 DONE (found 5 PII), job_037 DONE (found 2 PII)
+ Updated state: 37 tables complete, 134 total findings
+ Remaining: 63 tables
+
+ Resuming scan from table 38...
+ ```
+
+4. **Agent continues seamlessly:**
+ - No duplicate jobs
+ - No re-parsing events
+ - Findings aggregation intact
+ - Deterministic, reliable
+ - Resume took ~5 seconds
+
+---
+
+#### Side-by-Side Comparison
+
+| Aspect | Events Only | Checkpoint + Events |
+|--------|-------------|---------------------|
+| **Recovery time** | ~30 min (LLM re-parsing) | ~5 sec (load + reconcile) |
+| **Duplicate jobs** | Likely (LLM uncertainty) | None (ledger prevents) |
+| **Missed job results** | Possible | None (reconciliation catches) |
+| **Findings accuracy** | May have errors | Exact (pre-aggregated) |
+| **Token cost** | High (re-process events) | Low (structured state) |
+| **Determinism** | No (LLM-dependent) | Yes (explicit state) |
+| **Total runtime** | ~10 hours (retries, confusion) | ~8 hours (clean resume) |
+
+---
+
+#### What Checkpoint Enables That Events Cannot
+
+1. **Authoritative Job Reconciliation**
+ ```
+ Events: "job_036 was submitted" (but is it done now?)
+ Checkpoint: "job_036 status=running" → reconcile → "actually DONE, here are results"
+ ```
+
+2. **Pre-Aggregated State**
+ ```
+ Events: Count findings from 70 tool_call results
+ Checkpoint: {"total_findings": 127, "by_type": {...}}
+ ```
+
+3. **Explicit Execution Plan**
+ ```
+ Events: LLM must re-deduce "what was I doing?"
+ Checkpoint: {"phase": "scanning", "tables_completed": 35, "tables_pending": 63}
+ ```
+
+4. **Idempotent Resume**
+ ```
+ Events: May or may not re-submit jobs (LLM decides)
+ Checkpoint: Never re-submits (ledger tracks all jobs)
+ ```
+
+5. **Multi-Instance Coordination**
+ ```
+ Events: Two instances might both try to continue
+ Checkpoint: Lease ensures only one instance resumes
+ ```
+
+---
+
+#### Cost Impact Example
+
+| Metric | Events Only | Checkpoint |
+|--------|-------------|------------|
+| BigQuery jobs submitted | 115 (15 duplicates) | 100 (exact) |
+| BQ job cost @ $5/job | $575 | $500 |
+| Cloud Run time | 10 hours | 8 hours |
+| Cloud Run cost @ $0.10/hr | $1.00 | $0.80 |
+| LLM tokens for recovery | ~50,000 | ~1,000 |
+| LLM cost @ $0.01/1K | $0.50 | $0.01 |
+| **Total extra cost** | **$75.69** | **$0** |
+
+For enterprise workloads running daily, this adds up significantly.
+
+---
+
+### Suggested Design Doc Update
+
+Revise Section 1.2 limitation description:
+
+**From:**
+> "Cross-process durability: state lost if the process dies"
+
+**To:**
+> "Cross-process durability: While session events persist conversation history, structured execution state (job ledgers, aggregated results, execution plans) is not captured in a form that enables deterministic recovery. On restart, the LLM must re-deduce state from event history, which may be unreliable for complex multi-step operations."
+
+Add clarification table to Section 1.2:
+
+```markdown
+**Clarification: Session Events vs. Checkpoint State**
+
+| Recovery Need | Session Events | Checkpoint |
+|---------------|----------------|------------|
+| Conversation context | ✓ Sufficient | ✓ |
+| External job reconciliation | ✗ Manual | ✓ Structured ledger |
+| Multi-step progress tracking | ✗ LLM re-deduces | ✓ Explicit state |
+| Aggregated partial results | ✗ Re-compute | ✓ Pre-computed |
+| Deterministic recovery | ✗ LLM-dependent | ✓ Guaranteed |
+```
+
+---
+
+## Updated Open Questions for ADK Team
+
+1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset?
+2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both SessionService and CheckpointStore?
+3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail?
+4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage?
+5. **ArtifactService unification**: Should we extend `BaseArtifactService` with checkpoint-specific methods in v2?
+6. **Shared bucket**: Can checkpoints share a GCS bucket with artifacts, or should they be separate?
+7. **Framework-level leasing**: Should `Runner` have optional built-in lease management? Or should `SessionService` have lease methods?
+8. **Lease backend standardization**: If leasing becomes a framework feature, what backends should be supported (BQ, Firestore, Redis, DB row locks)?
+9. **Event-based recovery**: Is there interest in adding structured "execution state" events to SessionService as an alternative to separate checkpoints?
diff --git a/contributing/samples/long_running_task/demo_server.py b/contributing/samples/long_running_task/demo_server.py
new file mode 100644
index 0000000000..1715c1f11b
--- /dev/null
+++ b/contributing/samples/long_running_task/demo_server.py
@@ -0,0 +1,435 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Custom demo server with checkpoint visualization UI."""
+
+import asyncio
+import json
+import os
+import uuid
+from datetime import datetime
+from pathlib import Path
+from typing import Any, Optional
+
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import HTMLResponse, JSONResponse
+from fastapi.staticfiles import StaticFiles
+from pydantic import BaseModel
+import uvicorn
+
+from google.adk.durable import BigQueryCheckpointStore
+
+# Configuration
+PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "test-project-0728-467323")
+DATASET = "adk_metadata"
+GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints"
+
+# Initialize checkpoint store
+checkpoint_store = BigQueryCheckpointStore(
+ project=PROJECT_ID,
+ dataset=DATASET,
+ gcs_bucket=GCS_BUCKET,
+)
+
+# In-memory task state for demo
+active_tasks: dict[str, dict] = {}
+
+app = FastAPI(title="ADK Durable Session Demo")
+
+# CORS
+app.add_middleware(
+ CORSMiddleware,
+ allow_origins=["*"],
+ allow_credentials=True,
+ allow_methods=["*"],
+ allow_headers=["*"],
+)
+
+
+class TaskRequest(BaseModel):
+    task_type: str  # "sentiment", "anomaly", "trend", "clustering"
+ duration_seconds: int = 60
+
+
+class ResumeRequest(BaseModel):
+ session_id: str
+
+
+@app.get("/", response_class=HTMLResponse)
+async def root():
+ """Serve the demo UI."""
+ html_path = Path(__file__).parent / "demo_ui.html"
+ if html_path.exists():
+ return HTMLResponse(content=html_path.read_text())
+    return HTMLResponse(content="<h1>Demo UI not found</h1>")
+
+
+@app.get("/api/sessions")
+async def list_sessions():
+ """List all sessions from BigQuery."""
+ try:
+ client = checkpoint_store._get_bq_client()
+ query = f"""
+ SELECT session_id, status, agent_name, current_checkpoint_seq,
+ created_at, updated_at
+ FROM `{checkpoint_store._sessions_table_id}`
+ ORDER BY updated_at DESC
+ LIMIT 20
+ """
+ results = client.query(query).result()
+ sessions = []
+ for row in results:
+ sessions.append({
+ "session_id": row.session_id,
+ "status": row.status,
+ "agent_name": row.agent_name,
+ "checkpoint_seq": row.current_checkpoint_seq,
+ "created_at": row.created_at.isoformat() if row.created_at else None,
+ "updated_at": row.updated_at.isoformat() if row.updated_at else None,
+ })
+ return {"sessions": sessions}
+ except Exception as e:
+ return {"sessions": [], "error": str(e)}
+
+
+@app.get("/api/checkpoints/{session_id}")
+async def list_checkpoints(session_id: str):
+ """List checkpoints for a session."""
+ try:
+ checkpoints = await checkpoint_store.list_checkpoints(
+ session_id=session_id, limit=20
+ )
+ return {
+ "checkpoints": [
+ {
+ "checkpoint_seq": cp.checkpoint_seq,
+ "created_at": cp.created_at.isoformat() if cp.created_at else None,
+ "trigger": cp.trigger,
+ "size_bytes": cp.size_bytes,
+ "gcs_uri": cp.gcs_state_uri,
+ "agent_state": cp.agent_state,
+ }
+ for cp in checkpoints
+ ]
+ }
+ except Exception as e:
+ return {"checkpoints": [], "error": str(e)}
+
+
+@app.post("/api/task/start")
+async def start_task(request: TaskRequest):
+ """Start a new long-running task with checkpointing."""
+ session_id = f"demo-{uuid.uuid4().hex[:8]}"
+
+ # Create session in BigQuery
+ try:
+ session = await checkpoint_store.create_session(
+ session_id=session_id,
+ agent_name="demo_agent",
+ metadata={"task_type": request.task_type}
+ )
+ except Exception as e:
+ raise HTTPException(status_code=500, detail=f"Failed to create session: {e}")
+
+ # Initialize task state
+ active_tasks[session_id] = {
+ "task_type": request.task_type,
+ "status": "running",
+ "progress": 0,
+ "total_duration": request.duration_seconds,
+ "records_processed": 0,
+ "insights_found": 0,
+ "checkpoints": [],
+ "start_time": datetime.now().isoformat(),
+ "should_fail": False,
+ "failed_at": None,
+ "final_output": None,
+ }
+
+ # Start background task
+ asyncio.create_task(run_task_with_checkpoints(session_id, request.duration_seconds))
+
+ return {
+ "session_id": session_id,
+ "status": "started",
+ "message": f"Started {request.task_type} analysis task"
+ }
+
+
+@app.post("/api/task/fail/{session_id}")
+async def simulate_failure(session_id: str):
+ """Simulate a task failure."""
+ if session_id not in active_tasks:
+ raise HTTPException(status_code=404, detail="Task not found")
+
+ active_tasks[session_id]["should_fail"] = True
+ return {"status": "failure_triggered", "session_id": session_id}
+
+
+@app.post("/api/task/resume")
+async def resume_task(request: ResumeRequest):
+ """Resume a task from checkpoint."""
+ session_id = request.session_id
+
+ # Read the latest checkpoint
+ result = await checkpoint_store.read_latest_checkpoint(session_id=session_id)
+ if not result:
+ raise HTTPException(status_code=404, detail="No checkpoint found")
+
+ checkpoint, state_blob = result
+ state = json.loads(state_blob.decode('utf-8'))
+
+ # Get session info
+ session = await checkpoint_store.get_session(session_id=session_id)
+ if not session:
+ raise HTTPException(status_code=404, detail="Session not found")
+
+ # Restore task state
+ active_tasks[session_id] = {
+ "task_type": state.get("task_type", "unknown"),
+ "status": "running",
+ "progress": state.get("progress", 0),
+ "total_duration": state.get("total_duration", 60),
+ "records_processed": state.get("records_processed", 0),
+ "insights_found": state.get("insights_found", 0),
+ "checkpoints": state.get("checkpoints", []),
+ "start_time": state.get("start_time"),
+ "resumed_from": checkpoint.checkpoint_seq,
+ "should_fail": False,
+ "failed_at": None,
+ }
+
+ # Calculate remaining duration
+ remaining = active_tasks[session_id]["total_duration"] * (1 - active_tasks[session_id]["progress"] / 100)
+
+ # Resume background task
+ asyncio.create_task(run_task_with_checkpoints(session_id, int(remaining), resume=True))
+
+ return {
+ "session_id": session_id,
+ "status": "resumed",
+ "resumed_from_checkpoint": checkpoint.checkpoint_seq,
+ "progress": active_tasks[session_id]["progress"],
+ "message": f"Resumed from checkpoint #{checkpoint.checkpoint_seq}"
+ }
+
+
+@app.get("/api/task/status/{session_id}")
+async def get_task_status(session_id: str):
+ """Get current task status."""
+ if session_id not in active_tasks:
+ # Try to get from BigQuery
+ session = await checkpoint_store.get_session(session_id=session_id)
+ if session:
+ return {
+ "session_id": session_id,
+ "status": session.status,
+ "checkpoint_seq": session.current_checkpoint_seq,
+ "from_db": True
+ }
+ raise HTTPException(status_code=404, detail="Task not found")
+
+ return {
+ "session_id": session_id,
+ **active_tasks[session_id]
+ }
+
+
+async def run_task_with_checkpoints(session_id: str, duration: int, resume: bool = False):
+ """Run a long-running task with periodic checkpoints."""
+ import random
+
+ task = active_tasks.get(session_id)
+ if not task:
+ return
+
+ checkpoint_interval = 10 # Checkpoint every 10 seconds
+ start_progress = task["progress"] if resume else 0
+
+ for elapsed in range(0, duration, checkpoint_interval):
+ # Check if we should fail
+ if task.get("should_fail"):
+ task["status"] = "failed"
+ task["failed_at"] = datetime.now().isoformat()
+ await checkpoint_store.update_session_status(
+ session_id=session_id, status="failed"
+ )
+ return
+
+ # Simulate work
+ await asyncio.sleep(min(checkpoint_interval, duration - elapsed))
+
+ # Update progress
+ progress = start_progress + ((elapsed + checkpoint_interval) / duration) * (100 - start_progress)
+ task["progress"] = min(progress, 100)
+ task["records_processed"] += random.randint(50000, 150000)
+ task["insights_found"] += random.randint(1, 3)
+
+ # Get current checkpoint seq
+ session = await checkpoint_store.get_session(session_id=session_id)
+ next_seq = (session.current_checkpoint_seq if session else 0) + 1
+
+ # Create checkpoint
+ state_data = {
+ "task_type": task["task_type"],
+ "progress": task["progress"],
+ "total_duration": task["total_duration"],
+ "records_processed": task["records_processed"],
+ "insights_found": task["insights_found"],
+ "checkpoints": task["checkpoints"],
+ "start_time": task["start_time"],
+ }
+
+ try:
+ checkpoint = await checkpoint_store.write_checkpoint(
+ session_id=session_id,
+ checkpoint_seq=next_seq,
+ state_blob=json.dumps(state_data).encode('utf-8'),
+ agent_state={"progress": task["progress"], "step": f"checkpoint_{next_seq}"},
+ trigger="periodic",
+ )
+
+ task["checkpoints"].append({
+ "seq": checkpoint.checkpoint_seq,
+ "time": datetime.now().isoformat(),
+ "progress": task["progress"],
+ })
+ except Exception as e:
+ print(f"Checkpoint failed: {e}")
+
+ # Task completed - Generate final output based on task type
+ task["status"] = "completed"
+ task["progress"] = 100
+
+ # Generate realistic final output
+ task_type = task.get("task_type", "analysis")
+ records = task["records_processed"]
+ insights = task["insights_found"]
+
+ if task_type == "sentiment":
+ task["final_output"] = {
+ "title": "Sentiment Analysis Report",
+ "summary": f"Analyzed {records:,} text records across multiple data sources.",
+ "results": {
+ "overall_sentiment": "72% Positive",
+ "positive_records": int(records * 0.72),
+ "neutral_records": int(records * 0.18),
+ "negative_records": int(records * 0.10),
+ "confidence_score": 0.94,
+ },
+ "key_findings": [
+ "Strong positive sentiment around product quality",
+ "Minor concerns about delivery times (8% of negative)",
+ "Customer service mentions trending upward (+15%)",
+ f"Identified {insights} actionable insights for improvement",
+ ],
+ "top_themes": ["quality", "value", "service", "speed", "reliability"],
+ "recommendation": "Focus on delivery optimization to improve overall sentiment score by estimated 5-8%.",
+ }
+ elif task_type == "anomaly":
+ task["final_output"] = {
+ "title": "Anomaly Detection Report",
+ "summary": f"Scanned {records:,} data points for unusual patterns.",
+ "results": {
+ "total_anomalies": insights,
+ "critical_anomalies": max(1, insights // 4),
+ "warning_anomalies": insights // 2,
+ "info_anomalies": insights - insights // 4 - insights // 2,
+ "false_positive_rate": "2.3%",
+ },
+ "key_findings": [
+ f"Detected {insights} anomalies requiring attention",
+ "3 critical anomalies in transaction processing",
+ "Seasonal pattern identified in Q3 data",
+ "Root cause: 67% related to system load spikes",
+ ],
+ "anomaly_clusters": [
+ {"type": "Transaction Volume Spike", "count": 5, "severity": "high"},
+ {"type": "Response Time Degradation", "count": 8, "severity": "medium"},
+ {"type": "Error Rate Increase", "count": 3, "severity": "high"},
+ ],
+ "recommendation": "Investigate transaction processing during peak hours. Consider auto-scaling policies.",
+ }
+ elif task_type == "trend":
+ task["final_output"] = {
+ "title": "Trend Analysis Report",
+ "summary": f"Analyzed {records:,} historical data points for patterns.",
+ "results": {
+ "trend_direction": "Upward",
+ "growth_rate": "15.3% MoM",
+ "seasonality_detected": True,
+ "forecast_confidence": 0.89,
+ },
+ "key_findings": [
+ "Strong upward trend detected over past 6 months",
+ "15.3% month-over-month growth rate",
+ "Seasonal peaks in Q4 (holiday season)",
+ f"Identified {insights} significant trend changes",
+ ],
+ "forecast": {
+ "next_month": "+12% projected",
+ "next_quarter": "+38% projected",
+ "confidence_interval": "±8%",
+ },
+ "recommendation": "Prepare for Q4 surge. Current trajectory suggests 2x capacity needed by year end.",
+ }
+ elif task_type == "clustering":
+ task["final_output"] = {
+ "title": "Data Clustering Report",
+ "summary": f"Clustered {records:,} data points into meaningful segments.",
+ "results": {
+ "clusters_identified": 5,
+ "silhouette_score": 0.78,
+ "largest_cluster_size": "45%",
+ "smallest_cluster_size": "8%",
+ },
+ "key_findings": [
+ "Identified 5 distinct customer segments",
+ "Largest segment (45%): 'Value Seekers'",
+ "High-value segment (12%): 'Premium Customers'",
+ f"Found {insights} key differentiating factors",
+ ],
+ "clusters": [
+ {"name": "Value Seekers", "size": "45%", "description": "Price-sensitive, bulk buyers"},
+ {"name": "Premium Customers", "size": "12%", "description": "High-spend, quality-focused"},
+ {"name": "Occasional Shoppers", "size": "23%", "description": "Infrequent, event-driven"},
+ {"name": "New Users", "size": "12%", "description": "Recent signups, exploring"},
+ {"name": "Churning Risk", "size": "8%", "description": "Declining engagement"},
+ ],
+ "recommendation": "Target 'Churning Risk' segment with retention campaign. Estimated 15% recovery rate.",
+ }
+ else:
+ task["final_output"] = {
+ "title": "Analysis Complete",
+ "summary": f"Processed {records:,} records successfully.",
+ "results": {"records_processed": records, "insights_found": insights},
+ "key_findings": [f"Found {insights} notable patterns in the data"],
+ }
+
+ task["final_output"]["metadata"] = {
+ "session_id": session_id,
+ "task_type": task_type,
+ "duration_seconds": task["total_duration"],
+ "checkpoints_created": len(task["checkpoints"]),
+ "completed_at": datetime.now().isoformat(),
+ }
+
+ await checkpoint_store.update_session_status(
+ session_id=session_id, status="completed"
+ )
+
+
+if __name__ == "__main__":
+ uvicorn.run(app, host="0.0.0.0", port=8080)
diff --git a/contributing/samples/long_running_task/demo_ui.html b/contributing/samples/long_running_task/demo_ui.html
new file mode 100644
index 0000000000..60ac43ac35
--- /dev/null
+++ b/contributing/samples/long_running_task/demo_ui.html
@@ -0,0 +1,832 @@
+ADK Durable Session Demo - Real Checkpoint Visualization
+Single-page demo UI served by demo_server.py; page copy:
+
+Header: "ADK Durable Session Demo" - Real Checkpoint-Based Persistence for Long-Running
+Agent Tasks. Banner: "All tasks are REAL - Writing to BigQuery & GCS".
+
+Infrastructure cards:
+- BigQuery: test-project-0728-467323.adk_metadata
+- Cloud Storage: gs://test-project-0728-467323-adk-checkpoints
+- SHA-256 Verified: checkpoint integrity guaranteed
+
+"Choose a Real Task" - each task simulates a real long-running data processing job with
+actual checkpoints saved to GCP:
+- Sentiment Analysis: analyzes text data to determine emotional tone (customer reviews,
+  social media posts, feedback). Output: positive/negative ratios, key themes, trend analysis.
+- Anomaly Detection: scans datasets for unusual patterns and outliers (fraud, system errors,
+  data quality issues). Output: anomaly count, severity levels, root cause hints.
+- Trend Analysis: identifies patterns and trends over time; forecasts future values.
+  Output: growth rates, seasonal patterns, forecasts.
+- Data Clustering: groups similar data points (customer segments, content categories).
+  Output: cluster count, segment profiles, separation metrics.
+Duration selector (default 60s); checkpoints saved every 10 seconds.
+
+"Live Task Monitor": session ID, status, checkpoint count, processing progress,
+records processed, and the latest checkpoint saved (checkpoint #, progress saved,
+saved at, storage: BigQuery + GCS).
+
+Failure / recovery controls:
+- "Simulate Crash": simulate a server crash to test checkpoint recovery; the task state is safely stored.
+- "Recovery Available": task crashed but a checkpoint was saved; resume from the last checkpoint.
+- Status banners: "Task Crashed!", "Successfully Resumed!", "Task Completed!" with progress,
+  checkpoint number, records processed, and insights found.
+
+"Analysis Results" panel: key findings, recommendation, and session / checkpoint / completion metadata.
+
+"Checkpoint Timeline": real checkpoints being written to BigQuery & GCS
+(empty state: "No checkpoints yet - start a task to see real checkpoints appear").
+
+"How Durable Checkpointing Works":
+1. Task Runs - long-running analysis processes data in chunks
+2. Checkpoint Created - every 10s, state is serialized and compressed
+3. Two-Phase Commit - blob to GCS, then metadata to BigQuery
+4. Recovery Ready - if a crash occurs, resume from the last checkpoint
+
+"Verify in BigQuery":
+SELECT session_id, checkpoint_seq, created_at, trigger, size_bytes
+FROM `test-project-0728-467323.adk_metadata.checkpoints`
+ORDER BY created_at DESC LIMIT 10;
+
+"Real Sessions in BigQuery": table of Session ID / Status / Checkpoints / Last Updated / Actions,
+loaded from BigQuery; click "Select" to resume any failed session.
diff --git a/contributing/samples/long_running_task/long_running_task_design.md b/contributing/samples/long_running_task/long_running_task_design.md
new file mode 100644
index 0000000000..38877fae7e
--- /dev/null
+++ b/contributing/samples/long_running_task/long_running_task_design.md
@@ -0,0 +1,1448 @@
+# Durable Session Persistence for Long-Horizon ADK Agents (BigQuery-first, Generalizable Framework Capability)
+
+**Author:** Haiyuan Cao
+**Status:** Implemented (v1 core functionality)
+**Target audience:** ADK engineering leads, BigQuery Agent Analytics stakeholders, SRE/Security reviewers
+**Last updated:** 2026-02-02
+**Revision:** 3.0 (implementation complete, demo deployed)
+
+---
+
+## Implementation Status
+
+| Component | Status | Location |
+|-----------|--------|----------|
+| `DurableSessionConfig` | Implemented | `src/google/adk/durable/config.py` |
+| `CheckpointableAgentState` | Implemented | `src/google/adk/durable/checkpointable_state.py` |
+| `DurableSessionStore` (ABC) | Implemented | `src/google/adk/durable/stores/base_checkpoint_store.py` |
+| `BigQueryCheckpointStore` | Implemented | `src/google/adk/durable/stores/bigquery_checkpoint_store.py` |
+| `WorkspaceSnapshotter` | Implemented | `src/google/adk/durable/workspace_snapshotter.py` |
+| App integration | Implemented | `src/google/adk/apps/app.py` |
+| Demo agent | Implemented | `contributing/samples/long_running_task/` |
+| Demo UI (Cloud Run) | Deployed | `https://durable-demo-201486563047.us-central1.run.app` |
+
+### Live Demo
+
+A fully functional demo is deployed on Cloud Run showcasing:
+- Real-time checkpoint visualization
+- Task failure simulation
+- Checkpoint-based recovery
+- BigQuery metadata queries
+- Final task output display
+
+**URL:** https://durable-demo-201486563047.us-central1.run.app
+
+**Infrastructure:**
+- BigQuery Dataset: `test-project-0728-467323.adk_metadata`
+- GCS Bucket: `gs://test-project-0728-467323-adk-checkpoints`
+- SHA-256 checkpoint integrity verification
+
+---
+
+## 0. Executive One-Pager (for PM/Director skim)
+
+### Problem
+
+ADK agents struggle with BigQuery's **async, long-running workloads**. While ADK has experimental in-process resumability (`ResumabilityConfig`), it lacks:
+- **Cross-process durability**: state lost if the process dies
+- **External event triggers**: no Pub/Sub integration for job completion
+- **Enterprise auditability**: no SQL-queryable checkpoint history
+- **Cloud job reconciliation**: no authoritative state sync with BigQuery jobs
+
+Sandboxes time out (the "12-minute barrier" in typical cloud deployments), causing repeated cold starts, redundant metadata scans, and risk of duplicate job submissions.
+
+### Solution
+
+**Extend** ADK's existing resumability with a **Durable Session Persistence Layer**:
+
+* Extend lifecycle with durable **PAUSED** state (cross-process, not just in-memory)
+* Persist **logical checkpoints** (plan + job ledger + tool ledger) and optionally workspace artifacts
+* Store control-plane metadata + audit trail in **BigQuery**
+* Store large blobs (checkpoint/workspace) in **GCS**
+* Resume on external events (BigQuery job completion → Pub/Sub) with **authoritative reconciliation**
+
+### Key benefits
+
+* **Reliability:** deterministic "warm start"; prevents duplicate job fleets
+* **Cost:** no idle compute while waiting; typical storage **< $0.01/session-day paused** (see [Section 21: Cost Estimation](#21-cost-estimation))
+* **Enterprise:** SQL auditability (inspect what the agent did at hour 4 of 12)
+* **Strategic:** differentiates ADK by enabling **cloud job execution continuity + enterprise audit**, not just "reasoning continuity"
+
+### Ask / decisions
+
+1. Review `CheckpointableAgentState` + integration with existing `ResumabilityConfig`
+2. Confirm reference infra (BQ + GCS) and leasing approach
+3. Select pilot (recommended: PII scanner)
+ **Decision:** Durable PAUSED as extension to existing resumability vs separate plugin
+
+### Proposed timeline (8 weeks to pilot)
+
+* Weeks 1–2: API + storage/lease decisions, integration design with existing resumability
+* Weeks 3–4: reference store + resume skeleton
+* Weeks 5–8: pilot + metrics
+* Week 9+: iterate and choose rollout path
+
+---
+
+## 1. Background & Motivation
+
+### 1.1 The "12-minute barrier" in cloud data workflows
+
+BigQuery workloads are inherently asynchronous and may run from minutes to hours. In typical cloud sandbox deployments (Cloud Run, Cloud Functions, GKE with autoscaling), agents face timeout constraints:
+
+* **Cloud Run:** default 5-minute timeout, max 60 minutes
+* **Cloud Functions:** default 1-minute timeout, max 9 minutes (1st gen) or 60 minutes (2nd gen)
+* **Vertex AI Agent Builder:** session timeouts vary by deployment mode
+
+When these timeouts occur during long-running BigQuery jobs, agents:
+
+* lose job IDs and progress state (unless using existing resumability)
+* repeat metadata scans and tool calls
+* risk re-submitting already-running jobs
+
+### 1.2 Existing ADK Resumability (Current State)
+
+ADK already has an **experimental resumability feature** (`src/google/adk/apps/app.py`):
+
+```python
+@experimental
+class ResumabilityConfig(BaseModel):
+ """The "resumability" in ADK refers to the ability to:
+ 1. pause an invocation upon a long-running function call.
+ 2. resume an invocation from the last event, if it's paused or failed midway
+ through.
+
+ Note: ADK resumes the invocation in a best-effort manner:
+ 1. Tool call to resume needs to be idempotent because we only guarantee
+ an at-least-once behavior once resumed.
+ 2. Any temporary / in-memory state will be lost upon resumption.
+ """
+ is_resumable: bool = False
+```
+
+**Current capabilities:**
+| Feature | Status | Location |
+|---------|--------|----------|
+| `ResumabilityConfig.is_resumable` | Experimental | `src/google/adk/apps/app.py:42-58` |
+| `InvocationContext.should_pause_invocation()` | Implemented | `src/google/adk/agents/invocation_context.py:355-389` |
+| `long_running_tool_ids` tracking | Implemented | `src/google/adk/events/event.py` |
+| Resume from last event | Implemented | `src/google/adk/runners.py:1294+` |
+
+**Current limitations (gaps this design addresses):**
+| Limitation | Impact |
+|------------|--------|
+| In-memory only | State lost on process death/restart |
+| No external event triggers | Cannot wake on Pub/Sub, webhooks |
+| No cross-process persistence | Cannot resume in different runner instance |
+| No enterprise audit trail | No SQL-queryable checkpoint history |
+| No cloud job reconciliation | No authoritative sync with BQ job states |
+
+### 1.3 Dogfooding BigQuery Agent Analytics
+
+Using BigQuery as a durable control plane is strategically aligned with the BigQuery Agent Analytics direction:
+
+* **Dogfooding:** demonstrates BQ-based agent observability capabilities
+* **Auditability:** admins can query checkpoints directly ("what was the agent doing at hour 4?")
+* **SQL robustness:** BigQuery idioms (e.g., ARRAY_AGG latest-per-session) make operational queries easy and efficient
+
+---
+
+## 2. Problem Statement
+
+**This design extends ADK's existing resumability** to address gaps in cross-process durability and enterprise scenarios.
+
+Current ADK resumability is optimized for **in-process pause/resume**:
+* Works within a single runner process lifecycle
+* State persisted to session service (SQLite, Postgres, etc.)
+* No external event-driven wake-up mechanism
+* No BigQuery-native audit trail
+
+**Gaps this design addresses:**
+
+| Gap | Current State | Proposed Solution |
+|-----|---------------|-------------------|
+| Cross-process durability | State in session DB, but no checkpoint snapshots | BQ metadata + GCS blobs |
+| External event triggers | Manual resume via API call | Pub/Sub → Resumer service |
+| Cloud job reconciliation | App must track job IDs manually | Authoritative ledger reconciliation |
+| Enterprise audit | Logs only | SQL-queryable BQ tables |
+| Fleet observability | Per-session queries | Cross-agent BQ analytics |
+
+**Net effect:** ADK's existing resumability handles the "pause on long tool call" case well, but is not sufficient for BigQuery job fleets, multi-hour compliance scans, or any agentic workflow that needs **durable, cross-process, event-driven** "pause/wake/resume" loops.
+
+---
+
+## 3. Goals & Non-Goals
+
+### 3.1 Goals
+
+1. **Extend** existing `ResumabilityConfig` to support durable, cross-process checkpoints
+2. Support **hours-to-days** workflows via durable lifecycle state **PAUSED**
+3. Enable **event-driven resume** (Pub/Sub/job events) with safe retries
+4. Persist a deterministic **logical checkpoint**, not runtime heap snapshots
+5. Provide **enterprise-grade auditability**, retention, and security posture
+6. Ensure correctness via **two-phase commit**, **authoritative reconciliation**, and **lease-based resuming**
+7. **Backward compatible** with existing ADK session services
+
+### 3.2 Non-Goals (v1)
+
+* Interpreter heap snapshot/restore (pickle/dill) — brittle across deployments and library changes
+* Full microVM/container checkpointing — future work
+* Replacing existing `ResumabilityConfig` — this design extends it
+* Modifying existing session service implementations — new service alongside existing
+
+---
+
+## 4. Proposed Lifecycle Model
+
+### 4.1 States
+
+Building on ADK's existing pause concept, we formalize durable states:
+
+* **RUNNING:** executing agent logic + tool calls
+* **PAUSED:** no active compute; durable checkpoint exists in BQ+GCS; resumable via event or API
+* **KILLED:** finalized; resources released; retention applies
+ (Optional operational outcomes: `FAILED`, `EXPIRED`.)
+
+### 4.2 Integration with Existing Resumability
+
+```
+Existing ADK Resumability Durable Session Extension
+───────────────────────────── ──────────────────────────────
+InvocationContext.is_resumable → DurableSessionConfig.is_durable
+should_pause_invocation() → triggers checkpoint write
+long_running_tool_ids → included in checkpoint ledger
+Session events → replayed on resume
+ + BQ audit trail
+ + GCS checkpoint blobs
+ + Pub/Sub event triggers
+```
+
+### 4.3 "Serving → Rollout" framing
+
+This design shifts ADK from a request/response mindset to an **agentic rollout** model:
+
+* do work
+* wait for environment events
+* resume deterministically
+* avoid compute idling
+
+---
+
+## 5. Architecture Overview
+
+### 5.1 Layered checkpointing: logical → workspace → execution (future)
+
+**v1** explicitly adopts **Logical Checkpointing**:
+
+1. **Logical checkpoint (required):** plan/task graph state, job ledger, tool ledger, progress cursors
+2. **Workspace snapshot (optional):** `/workspace` bundle (draft reports, code, small caches)
+3. **Execution snapshot (future):** microVM/container restore
+
+**Rationale:** heap snapshots are notoriously fragile under code/library/version drift. Logical checkpoints remain deterministic across restarts and upgrades.
+
+### 5.2 Control plane vs data plane (Google-scale reliability pattern)
+
+* **Control plane: BigQuery**
+
+ * sessions/checkpoints/events as structured tables
+ * queryable summaries for auditing and fleet observability
+* **Data plane: GCS**
+
+ * checkpoint state blobs
+ * workspace bundles
+ * large artifacts (reports, samples, exports)
+
+### 5.3 Integration with Existing ADK Services
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ ADK Application │
+├─────────────────────────────────────────────────────────────────┤
+│ App( │
+│ resumability_config=ResumabilityConfig(is_resumable=True), │
+│ durable_session_config=DurableSessionConfig( # NEW │
+│ is_durable=True, │
+│ checkpoint_store=BigQueryCheckpointStore(...), │
+│ event_source=PubSubEventSource(...), │
+│ ), │
+│ ) │
+├─────────────────────────────────────────────────────────────────┤
+│ Existing ADK Services │
+│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
+│ │SessionService│ │ArtifactService│ │MemoryService │ │
+│ │(SQLite/PG/...)│ │(GCS/local) │ │(in-memory/vertex) │ │
+│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
+├─────────────────────────────────────────────────────────────────┤
+│ NEW: Durable Session Layer │
+│ ┌──────────────────┐ ┌─────────────────┐ ┌───────────────┐ │
+│ │DurableSessionStore│ │CheckpointStore │ │ResumeService │ │
+│ │(orchestration) │ │(BQ meta+GCS blob)│ │(Pub/Sub listen)│ │
+│ └──────────────────┘ └─────────────────┘ └───────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 6. Why BigQuery as the Control Plane
+
+Using BigQuery as the metadata store is strategic:
+
+* **Auditability:** SQL query of checkpoints at any time without parsing logs
+* **Fleet visibility:** query state of thousands of agents concurrently
+* **Robust ops patterns:** a latest-per-session view via idiomatic BigQuery (`ARRAY_AGG`) is simple and performant (see the sketch below)
+* **Dogfooding:** demonstrates BigQuery Agent Analytics and cross-agent observability
+* **Existing infrastructure:** many ADK users already have BQ datasets for analytics
+
+---
+
+## 7. Correctness & Failure Safety
+
+### 7.1 Two-phase checkpoint commit (atomic visibility)
+
+A checkpoint is "live" only once the **BigQuery metadata row** exists.
+
+```python
+def write_checkpoint(
+ session_id: str,
+ seq: int,
+ state_json: bytes,
+ workspace_path: str | None
+) -> None:
+ """Two-phase checkpoint commit with error handling."""
+ try:
+ # Phase 1: blobs to GCS (retry-safe, idempotent)
+ state_uri = gcs.upload(
+ f"checkpoints/{session_id}/{seq}/state.json",
+ state_json,
+ if_generation_match=0, # Fail if already exists
+ )
+ workspace_uri = None
+ if workspace_path:
+ workspace_uri = gcs.upload(
+ f"checkpoints/{session_id}/{seq}/workspace.tar.gz",
+ compress_tar_gz(workspace_path),
+ if_generation_match=0,
+ )
+
+ # Phase 2: commit metadata in BigQuery (checkpoint becomes visible here)
+ bq.insert("checkpoints", {
+ "session_id": session_id,
+ "checkpoint_seq": seq,
+ "gcs_state_uri": state_uri,
+ "gcs_workspace_uri": workspace_uri,
+ "sha256": sha256(state_json),
+ "size_bytes": len(state_json),
+ "created_at": now(),
+ "trigger": "async_boundary",
+ "agent_state_json": extract_small_summary(state_json),
+ "checkpoint_fingerprint": fingerprint_checkpoint(state_json),
+ })
+
+ # Update pointer only after checkpoint metadata exists
+ bq.update("sessions", session_id, {
+ "current_checkpoint_seq": seq,
+ "updated_at": now(),
+ })
+
+ except GCSUploadError as e:
+ # Phase 1 failed - no cleanup needed, checkpoint not visible
+ logger.error(f"Checkpoint {seq} GCS upload failed: {e}")
+ raise CheckpointWriteError(f"GCS upload failed: {e}") from e
+
+ except BigQueryInsertError as e:
+ # Phase 2 failed - orphan GCS blobs will be cleaned by GC
+ logger.error(f"Checkpoint {seq} BQ insert failed: {e}")
+ raise CheckpointWriteError(f"BQ insert failed: {e}") from e
+```
+
+**Garbage collection:** orphan GCS objects without a corresponding BQ metadata row are deleted after a grace window (default: 24 hours).
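+
+A minimal GC sweep under this rule might look like the following (the helper name, the
+path parsing, and the unbatched full-table scan are simplifications of what a production
+job would do):
+
+```python
+import datetime
+from google.cloud import bigquery, storage
+
+def gc_orphan_blobs(project: str, bucket_name: str, checkpoints_table: str,
+                    grace_hours: int = 24) -> int:
+    """Delete GCS checkpoint blobs that never received a BQ metadata row."""
+    bq = bigquery.Client(project=project)
+    gcs = storage.Client(project=project)
+
+    committed = {
+        (row.session_id, row.checkpoint_seq)
+        for row in bq.query(
+            f"SELECT session_id, checkpoint_seq FROM `{checkpoints_table}`"
+        ).result()
+    }
+
+    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=grace_hours)
+    deleted = 0
+    for blob in gcs.list_blobs(bucket_name, prefix="checkpoints/"):
+        # Object layout from 7.1: checkpoints/{session_id}/{seq}/state.json
+        parts = blob.name.split("/")
+        if len(parts) < 4 or not parts[2].isdigit():
+            continue
+        if (parts[1], int(parts[2])) not in committed and blob.time_created < cutoff:
+            blob.delete()
+            deleted += 1
+    return deleted
+```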
+
+---
+
+### 7.2 Authoritative reconciliation (the core idempotency mechanism)
+
+On resume, do not trust events alone. Reconcile the ledger against authoritative cloud state.
+
+```python
+def reconcile_on_resume(state: dict) -> dict:
+ """Reconcile job ledger against authoritative BigQuery state.
+
+ This is the core idempotency mechanism - ensures we never
+ re-submit completed jobs or miss failed ones.
+ """
+ ledger = state["job_ledger"]
+ reconciliation_results = {
+ "jobs_completed": 0,
+ "jobs_failed": 0,
+ "jobs_cancelled": 0,
+ "jobs_still_running": 0,
+ }
+
+ for job_id, meta in ledger.items():
+ try:
+ job = bq.get_job(job_id)
+ except NotFoundError:
+ # Job was deleted or never existed
+ logger.warning(f"Job {job_id} not found, marking as lost")
+ meta["status"] = "LOST"
+ meta["reconciled_at"] = now()
+ continue
+
+ if job.state == "DONE" and not meta.get("consumed"):
+ state["results"][job_id] = fetch_results(job, meta)
+ meta["consumed"] = True
+ meta["reconciled_at"] = now()
+ reconciliation_results["jobs_completed"] += 1
+
+ elif job.state == "FAILED":
+ handle_failed_job(job_id, job.error_result, meta, state)
+ reconciliation_results["jobs_failed"] += 1
+
+ elif job.state == "CANCELLED":
+ handle_cancelled_job(job_id, meta, state)
+ reconciliation_results["jobs_cancelled"] += 1
+
+ elif job.state in ("RUNNING", "PENDING"):
+ register_completion_callback(job_id)
+ reconciliation_results["jobs_still_running"] += 1
+
+ state["_reconciliation_results"] = reconciliation_results
+ return state
+```
+
+This is the enterprise-grade version of "remember where you left off":
+
+* prevents re-submitting 2-hour scans
+* handles partial failures/cancellations deterministically
+* turns resume into a repeatable state machine
+* provides audit trail of reconciliation results
+
+---
+
+### 7.3 Leasing & optimistic concurrency
+
+We must ensure only one runner resumes a session at a time.
+
+**BigQuery constraint:** BigQuery lacks true row-level locking, so BQ-based leasing is **optimistic lease acquisition** (best-effort, without an external lock). If high-burst concurrency demands stronger guarantees, the pluggable lease manager can be backed by Firestore/Spanner or external single-delivery orchestration (e.g., Cloud Tasks).
+
+**When to use each backend:**
+
+| Backend | Use Case | Guarantees |
+|---------|----------|------------|
+| BigQuery (default) | Low-medium concurrency, cost-sensitive | Best-effort, ~100ms latency |
+| Firestore | High concurrency, strong consistency needed | Strong, ~10ms latency |
+| Cloud Tasks | Exactly-once delivery required | Exactly-once with dedup window |
+| Spanner | Global distribution, strong consistency | Strong, multi-region |
+
+BQ lease acquire template:
+
+```sql
+UPDATE `your_project.adk_metadata.sessions`
+SET active_lease_id = @lease_id,
+ lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND),
+ updated_at = CURRENT_TIMESTAMP()
+WHERE session_id = @session_id
+ AND status = 'PAUSED'
+ AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP());
+```
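+
+For concreteness, this template can be driven from Python roughly as follows (the wrapper
+name and table path are illustrative; this is not the shipped lease manager):
+
+```python
+import uuid
+from typing import Optional
+
+from google.cloud import bigquery
+
+def try_acquire_lease(client: bigquery.Client, sessions_table: str,
+                      session_id: str, ttl_seconds: int = 300) -> Optional[str]:
+    """Run the optimistic lease UPDATE; return a lease_id only if we won the row."""
+    lease_id = uuid.uuid4().hex
+    job = client.query(
+        f"""
+        UPDATE `{sessions_table}`
+        SET active_lease_id = @lease_id,
+            lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND),
+            updated_at = CURRENT_TIMESTAMP()
+        WHERE session_id = @session_id
+          AND status = 'PAUSED'
+          AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP())
+        """,
+        job_config=bigquery.QueryJobConfig(
+            query_parameters=[
+                bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id),
+                bigquery.ScalarQueryParameter("ttl_seconds", "INT64", ttl_seconds),
+                bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+            ]
+        ),
+    )
+    job.result()  # wait for the DML statement to finish
+    return lease_id if job.num_dml_affected_rows == 1 else None
+```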
+
+**Note:** BigQuery time travel (`FOR SYSTEM_TIME AS OF`) is useful for debugging historical state, but does not replace strong mutual exclusion. The "pluggable SessionLeaseManager" is the safety valve.
+
+---
+
+## 8. ADK API Extensions (v1 contract)
+
+### 8.1 Core Interfaces
+
+```python
+from abc import ABC, abstractmethod
+from typing import Optional
+from pydantic import BaseModel
+
+class CheckpointableAgentState(ABC):
+ """Interface for agents that support durable checkpointing.
+
+ Extends the existing BaseAgentState pattern from
+ src/google/adk/agents/base_agent.py
+ """
+
+ @abstractmethod
+ def export_state(self) -> dict:
+ """Export agent state to a serializable dictionary.
+
+ Returns:
+ Dictionary containing all state needed to resume.
+ Must be JSON-serializable.
+ """
+ ...
+
+ @abstractmethod
+ def import_state(self, state: dict) -> None:
+ """Import agent state from a previously exported dictionary.
+
+ Args:
+ state: Dictionary from a previous export_state() call.
+ """
+ ...
+
+ def get_state_schema_version(self) -> int:
+ """Return the schema version for this state format.
+
+ Override to implement versioned state migrations.
+ Default: 1
+ """
+ return 1
+
+
+class WorkspaceSnapshotter:
+ """Handles workspace directory snapshots to/from GCS."""
+
+ def snapshot_to_gcs(
+ self,
+ session_id: str,
+ checkpoint_seq: int,
+ workspace_path: str = "/workspace",
+ max_size_bytes: int = 1 * 1024 * 1024 * 1024, # 1GB default
+ ) -> str:
+ """Snapshot workspace to GCS.
+
+ Returns:
+ GCS URI of the uploaded snapshot.
+
+ Raises:
+ WorkspaceTooLargeError: If workspace exceeds max_size_bytes.
+ """
+ ...
+
+ def restore_from_gcs(self, gcs_uri: str, workspace_path: str = "/workspace") -> None:
+ """Restore workspace from GCS snapshot."""
+ ...
+
+
+class DurableSessionStore(ABC):
+ """Abstract interface for durable checkpoint storage."""
+
+ @abstractmethod
+ def write_checkpoint(
+ self,
+ session_id: str,
+ checkpoint_seq: int,
+ state: dict,
+ workspace_gcs_uri: Optional[str] = None,
+ trigger: str = "async_boundary",
+ ) -> None:
+ """Write a checkpoint with two-phase commit."""
+ ...
+
+ @abstractmethod
+ def read_latest_checkpoint(
+ self,
+ session_id: str,
+ ) -> tuple[int, dict, Optional[str]]:
+ """Read the latest checkpoint for a session.
+
+ Returns:
+ Tuple of (checkpoint_seq, state_dict, workspace_gcs_uri).
+
+ Raises:
+ CheckpointNotFoundError: If no checkpoint exists.
+ """
+ ...
+
+ @abstractmethod
+ def list_checkpoints(
+ self,
+ session_id: str,
+ limit: int = 100,
+ ) -> list[dict]:
+ """List checkpoint metadata for a session."""
+ ...
+```
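+
+To make the intended call pattern concrete, a sketch of how a runner might drive this
+contract (the helper names are illustrative and not part of the proposed API):
+
+```python
+def checkpoint_at_async_boundary(store: DurableSessionStore,
+                                 agent: CheckpointableAgentState,
+                                 session_id: str, next_seq: int) -> None:
+    """Write path: invoked when the runner pauses on a long-running tool call."""
+    store.write_checkpoint(
+        session_id=session_id,
+        checkpoint_seq=next_seq,
+        state=agent.export_state(),
+        trigger="async_boundary",
+    )
+
+
+def restore_on_resume(store: DurableSessionStore,
+                      agent: CheckpointableAgentState,
+                      session_id: str) -> int:
+    """Read path: rehydrate the agent from the latest checkpoint, return its seq."""
+    seq, state, workspace_uri = store.read_latest_checkpoint(session_id)
+    if workspace_uri:
+        WorkspaceSnapshotter().restore_from_gcs(workspace_uri)
+    agent.import_state(state)
+    return seq
+```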
+
+### 8.2 Configuration
+
+```python
+from pydantic import BaseModel, Field
+from typing import Literal, Optional
+
+class DurableSessionConfig(BaseModel):
+ """Configuration for durable session persistence.
+
+ Works alongside existing ResumabilityConfig.
+ """
+
+ is_durable: bool = False
+ """Enable durable cross-process checkpointing."""
+
+ checkpoint_policy: Literal[
+ "async_boundary", # Checkpoint when pausing for async tool (default)
+ "tool_call_boundary", # Checkpoint after every tool call
+ "superstep", # Checkpoint at agent-defined superstep boundaries
+ "manual", # Only checkpoint when explicitly requested
+ ] = "async_boundary"
+ """When to create checkpoints."""
+
+ workspace_snapshot_enabled: bool = False
+ """Whether to include workspace directory in checkpoints."""
+
+ workspace_max_size_bytes: int = Field(
+ default=100 * 1024 * 1024, # 100MB
+ description="Maximum workspace snapshot size",
+ )
+
+ checkpoint_store: Optional[DurableSessionStore] = None
+ """The checkpoint store implementation. If None, uses BigQueryCheckpointStore."""
+
+ lease_backend: Literal["bigquery", "firestore", "cloud_tasks"] = "bigquery"
+ """Backend for lease management."""
+
+ lease_ttl_seconds: int = Field(
+ default=300, # 5 minutes
+ description="Lease TTL before auto-release",
+ )
+
+ retry_policy: Optional[dict] = None
+ """Per-tool-type retry policies for failed jobs."""
+```
+
+### 8.3 Checkpoint Policy Details
+
+| Policy | Trigger | Use Case |
+|--------|---------|----------|
+| `async_boundary` | `should_pause_invocation()` returns True | BigQuery jobs, external APIs (default) |
+| `tool_call_boundary` | After every tool call completes | Maximum durability, higher cost |
+| `superstep` | Agent calls `checkpoint_now()` | Agent controls checkpoint granularity |
+| `manual` | Only via explicit API call | Testing, debugging |
+
+---
+
+## 9. Current vs Proposed Capability Comparison
+
+| Feature | Current ADK (ResumabilityConfig) | Durable Session Extension |
+|---------|----------------------------------|---------------------------|
+| Pause on long tool call | Yes (experimental) | Yes |
+| Resume from last event | Yes (in-process) | Yes (cross-process) |
+| State persistence | Session service (SQLite/PG) | Session service + BQ/GCS checkpoints |
+| Cross-process resume | No | Yes |
+| External event triggers | No | Yes (Pub/Sub, webhooks) |
+| Max job duration | Process lifetime | Practically unlimited (days/weeks) |
+| Compute cost while waiting | Idle if process alive | Zero compute while PAUSED |
+| Job knowledge (IDs, state) | In-memory or session state | Persisted in ledger + BQ tables |
+| Recovery | Resume API call | Automatic via event + idempotent resume |
+| Auditability | Logs, session events | SQL-queryable BQ control plane |
+| Fleet visibility | Per-session queries | Cross-agent BQ analytics |
+
+---
+
+## 10. Demo Scenario: Multi-Day PII Audit
+
+Assume discovery finds ~50 tables; agent submits **1 BigQuery job per table**.
+
+1. **RUNNING:** enumerate schema, prioritize, build ledger
+2. **RUNNING → PAUSED:** submit job fleet, checkpoint (two-phase), mark PAUSED, release compute
+3. **PAUSED (hours/days):** jobs run in BigQuery; agent consumes zero compute
+4. **Resume:** Pub/Sub event → resumer acquires lease → reads checkpoint → reconciles ledger
+5. **RUNNING:** process completed jobs, handle failures, submit retries if needed
+6. **KILLED:** compile compliance report, write final audit rows, cleanup
+
+---
+
+## 11. "Plumbing vs Logic": Why Framework-Level Support Matters
+
+### 11.1 Framework-level ADK support > agent-specific hacks
+
+This capability should live at the ADK level, not be reinvented per agent team:
+
+| Dimension | Specific Agent Approach | ADK Framework Approach |
+|-----------|-------------------------|------------------------|
+| Engineering effort | each team reimplements persistence/resume | toggled via config; solved once |
+| Security/compliance | inconsistent VPC-SC/CMEK/IAM | governance baked into store/resumer |
+| Observability | fragmented logs | unified BQ schema across agents |
+| Skill portability | skills tied to bespoke persistence | state-aware skills via standard interface |
+
+### 11.2 The "plumbing" components (solve once)
+
+* two-phase commit
+* workspace snapshotting
+* durable store + GC
+* resume service + idempotent event handling
+* leasing/concurrency strategy
+* observability/audit tables
+
+### 11.3 The "logic" components (agent-owned)
+
+* what to persist in checkpoint (`job_ledger`, `audit_cursor`, partial findings)
+* retry policy decisions by job/tool type
+* domain-specific analysis and reporting logic
+
+---
+
+## 12. Generalization Beyond BigQuery (Universal Long-Horizon Primitive)
+
+Although the motivating example is BigQuery, the primitives are general:
+
+* **Ledger-based reconciliation:** any external handle can be tracked (job ID, build ID, ticket ID)
+* **Workspace snapshots:** preserve files for coding/refactoring/report assembly tasks
+* **Event-driven resume:** Pub/Sub triggers can represent almost any service completion webhook
+
+### 12.1 Non-BigQuery long-horizon scenarios
+
+| Task Type | Resume trigger | Ledger contents |
+|-----------|----------------|-----------------|
+| Cloud infra provisioning | resource-ready events | resource manifests + status |
+| Software refactoring | CI completion | build IDs, test results, patch plan |
+| Deep research | scheduled polling/new index event | search caches + draft outline |
+| Human-in-the-loop | Slack/Chat message | approval flags + pending actions |
+| ML training | training job completion | model artifacts, metrics, hyperparams |
+
+---
+
+## 13. Alignment with Moltbot (formerly ClawBot) Architecture
+
+This proposal aligns strongly with the long-running daemon style popularized by Moltbot/ClawBot, especially in lifecycle/state management:
+
+| Feature | Moltbot/ClawBot Design | Durable ADK Design | Alignment |
+|---------|------------------------|--------------------| ----------|
+| Orchestration | Gateway/Coordinator routes persistent sessions | ADK Agent Runner + Resumer | High |
+| Persistence | Local FS "diary files" | BQ (metadata) + GCS (blobs) | High (enterprise-grade) |
+| Lifecycle | Running / Paused / Killed | RUNNING / PAUSED / KILLED | Identical |
+| Execution model | "Rollout" async loops | Background agent hibernates + resumes | High |
+
+**Enterprise advantage vs local-first bots**
+
+* BQ control plane enables fleet-scale SQL audit ("1,000 agents state now")
+* VPC-SC, CMEK, IAM boundaries can be standardized at framework level
+
+---
+
+## 14. Competitive Landscape (LangGraph + Claude)
+
+### 14.1 TL;DR
+
+LangGraph offers durable workflow checkpointing; Claude SDK offers session continuity/harness patterns. Neither makes **cloud job reconciliation** plus **SQL-audit control plane** a first-class target.
+
+### 14.2 Feature comparison
+
+| Feature | ADK (current) | ADK (proposed) | LangGraph | Claude SDK |
+|---------|---------------|----------------|-----------|------------|
+| In-process pause/resume | Yes (experimental) | Yes | Yes | Yes |
+| Cross-process durability | No | Yes (BQ+GCS) | Yes (checkpointers) | Via harness |
+| External event triggers | No | Yes (Pub/Sub) | Via external code | Via harness |
+| Cloud job reconciliation | No | Yes (authoritative) | No | No |
+| SQL audit trail | No | Yes (BQ) | No (requires custom) | No |
+| Fleet observability | No | Yes (BQ analytics) | Via LangSmith | No |
+
+### 14.3 Why not "just use LangGraph checkpointers with BigQuery storage"
+
+LangGraph checkpointers serialize and restore workflow state at step boundaries, but BigQuery long-horizon requires:
+
+* authoritative job status reconciliation (DONE/FAILED/CANCELLED/RUNNING)
+* result retrieval from destination tables
+* partial failure handling and enterprise audit semantics
+
+This is not a drop-in "graph replay" problem; it's **cloud job continuity**.
+
+### 14.4 Borrow vs differentiate (prioritized)
+
+**v1 essential**
+
+1. checkpoint policy ergonomics (inspired by LangGraph)
+2. coordinator/worker harness pattern (inspired by Anthropic article)
+
+**v2**
+3. hybrid filesystem backends
+4. skills/plugins packaging for BigQuery playbooks
+
+---
+
+## 15. Alternatives Considered
+
+| Alternative | Why not (v1) |
+|-------------|--------------|
+| Extend existing SessionService | Different consistency model; BQ provides SQL audit |
+| Firestore metadata | less SQL-auditable for analytics; can be lease backend later |
+| Spanner leasing | heavy for v1; keep pluggable |
+| Redis/Memorystore | ephemeral-first; lacks audit/query semantics |
+| VM checkpointing | complex; brittle with environment drift |
+| Cloud Workflows | static DAGs; agents need dynamic replanning |
+
+---
+
+## 16. Size Limits, Spill Strategy, Compatibility
+
+### 16.1 Size limits
+
+* Keep `agent_state_json` summary small (< 1MB) and queryable
+* Store full checkpoint in GCS (recommended < 100MB, hard limit 5GB)
+* Workspace snapshot recommended ≤ 1 GB; large artifacts should be explicit GCS objects, not tarballed
+
+### 16.2 Compatibility & schema evolution
+
+* `agent_version`: code version (e.g., "1.2.3" or git SHA)
+* `state_schema_version`: **monotonic INT64** (1,2,3…)
+* optional `state_schema_version_label`: semver string for readability
+
+**v1 stance:** version mismatches hard-fail (safe). This prevents subtle bugs from incompatible state.
+
+**Migration strategy (v2):**
+
+```python
+class CheckpointableAgentState(ABC):
+ def get_state_schema_version(self) -> int:
+ return 1
+
+ def migrate_state(self, old_state: dict, old_version: int) -> dict:
+ """Override to implement state migrations.
+
+ Called when loading a checkpoint with older schema version.
+ Default: raise error (v1 behavior).
+ """
+ raise StateSchemaMismatchError(
+ f"Cannot migrate from version {old_version} to {self.get_state_schema_version()}"
+ )
+```
+
+### 16.3 checkpoint_fingerprint definition
+
+`checkpoint_fingerprint` = SHA256 of canonical checkpoint state excluding timestamps and non-deterministic fields. Useful for dedupe/debugging.
+
+```python
+def fingerprint_checkpoint(state: dict) -> str:
+ """Compute deterministic fingerprint for checkpoint state."""
+ # Remove non-deterministic fields
+ canonical = {k: v for k, v in state.items()
+ if k not in ("_timestamp", "_reconciliation_results")}
+ # Sort keys for determinism
+ canonical_json = json.dumps(canonical, sort_keys=True, separators=(',', ':'))
+ return hashlib.sha256(canonical_json.encode()).hexdigest()
+```
+
+---
+
+## 17. Security, Governance, Enterprise Readiness
+
+### 17.1 Data sensitivity
+
+* **Sensitive by default:** checkpoints may include PII findings, credentials, business data
+* **Classification:** treat checkpoint data with same sensitivity as source data
+
+### 17.2 Encryption
+
+| Layer | Mechanism |
+|-------|-----------|
+| GCS blobs | CMEK (Customer-Managed Encryption Keys) |
+| BQ tables | BQ encryption policies (default or CMEK) |
+| In-transit | TLS 1.3 |
+
+### 17.3 Access control
+
+* **IAM:** least privilege, separate identities for runner vs store
+* **Runner identity:** needs BQ read/write, GCS read/write
+* **Resumer identity:** needs BQ read/write, GCS read, Pub/Sub subscribe
+* **Audit identity:** needs BQ read only
+
+### 17.4 Retention & compliance
+
+* **TTL:** configurable per session/agent type
+* **GC:** automatic cleanup of expired sessions and orphan blobs
+* **Legal hold:** support for compliance holds if needed
+* **Audit log:** all checkpoint operations logged to Cloud Audit Logs
+
+### 17.5 VPC-SC
+
+* **Day-1 requirement** for many enterprise customers
+* Ensure checkpoint bucket is in same VPC-SC perimeter
+* Use restricted.googleapis.com endpoints
+* Document perimeter configuration in deployment guide
+
+---
+
+## 18. Open Questions & Risks (Senior review)
+
+| Question | Risk Level | Notes |
+|----------|------------|-------|
+| Lease contention & latency under high event bursts | Medium | May need Firestore/Tasks for >100 concurrent resumes |
+| Workspace growth management | Low | Differential sync/manifest snapshots for v2 |
+| Checkpoint frequency tuning | Low | Define "smart boundaries" to balance cost and safety |
+| VPC-SC compliance validation | High | Day-1 requirement; needs security review |
+| Multi-region/DR support | Medium | Cross-region resume: supported or out of scope? |
+| Integration with existing ResumabilityConfig | Low | Design is additive, not replacing |
+| State migration complexity | Medium | Hard-fail v1 is safe but limits upgrades |
+
+---
+
+## 19. Milestones / Rollout Plan
+
+| Week | Milestone | Deliverables |
+|------|-----------|--------------|
+| 1–2 | API design & integration planning | `DurableSessionConfig` API, integration with `ResumabilityConfig`, storage/lease strategy doc |
+| 3–4 | Core implementation | `BigQueryCheckpointStore`, `WorkspaceSnapshotter`, two-phase commit |
+| 5–6 | Resume service | `ResumeService`, Pub/Sub integration, lease management |
+| 7–8 | Pilot integration | PII scanner pilot, metrics collection |
+| 9+ | Iterate & decide | Performance tuning, decide first-class vs plugin path |
+
+---
+
+## 20. Immediate Ask / Decisions
+
+1. **Review** `CheckpointableAgentState` contract and integration with existing `ResumabilityConfig`
+2. **Confirm** BQ+GCS as reference infra and lease backend strategy
+3. **Select** pilot use case (PII scanner recommended)
+4. **Decide:** Durable PAUSED as extension to existing resumability vs separate plugin/extension
+
+---
+
+## 21. Cost Estimation
+
+### 21.1 Storage costs
+
+| Component | Typical Size | Monthly Cost (US) |
+|-----------|--------------|-------------------|
+| BQ session row | ~2 KB | ~$0.00004/row |
+| BQ checkpoint row | ~5 KB | ~$0.0001/row |
+| GCS checkpoint blob | ~100 KB | ~$0.026/GB-month ≈ $0.0000026 |
+| GCS workspace snapshot | ~50 MB | ~$0.026/GB-month ≈ $0.0013 |
+
+**Example: 1,000 sessions, 10 checkpoints each, 24-hour retention**
+
+| Item | Quantity | Cost |
+|------|----------|------|
+| BQ session rows | 1,000 | $0.04 |
+| BQ checkpoint rows | 10,000 | $1.00 |
+| GCS checkpoint blobs | 10,000 × 100KB = 1GB | $0.026 |
+| GCS workspace snapshots | 1,000 × 50MB = 50GB | $1.30 |
+| **Total daily** | | **~$2.37** |
+
+**Cost per session-day paused:** ~$0.002 (well under $0.01 estimate)
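+
+The daily total can be reproduced directly from the unit prices above:
+
+```python
+bq_session_rows = 1_000 * 0.00004        # $0.04
+bq_checkpoint_rows = 10_000 * 0.0001     # $1.00
+gcs_checkpoint_blobs = 1 * 0.026         # 1 GB at ~$0.026/GB  -> $0.026
+gcs_workspace_snapshots = 50 * 0.026     # 50 GB               -> $1.30
+
+total_daily = (bq_session_rows + bq_checkpoint_rows
+               + gcs_checkpoint_blobs + gcs_workspace_snapshots)
+print(f"~${total_daily:.2f}/day, ~${total_daily / 1_000:.4f} per session-day")
+# -> ~$2.37/day, ~$0.0024 per session-day
+```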
+
+### 21.2 Compute costs
+
+| Component | Cost |
+|-----------|------|
+| PAUSED session | $0 (no compute) |
+| Resume service (Cloud Run) | ~$0.001 per resume |
+| Pub/Sub events | ~$0.04 per million messages |
+
+### 21.3 BigQuery query costs
+
+| Query Type | Estimated Data Scanned | Cost |
+|------------|------------------------|------|
+| Get latest checkpoint | ~10 KB | ~$0.00000005 |
+| List session checkpoints | ~100 KB | ~$0.0000005 |
+| Fleet analytics query | ~10 MB | ~$0.00005 |
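+
+These estimates can be sanity-checked against a real dataset with a BigQuery dry run
+(sketch; the example query and dataset path are placeholders):
+
+```python
+from google.cloud import bigquery
+
+def estimate_scan_bytes(client: bigquery.Client, sql: str) -> int:
+    """Dry-run a control-plane query and report how many bytes it would scan."""
+    job = client.query(
+        sql,
+        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
+    )
+    return job.total_bytes_processed
+
+# e.g. estimate_scan_bytes(
+#     bigquery.Client(),
+#     "SELECT * FROM `my-project.adk_metadata.checkpoints` WHERE session_id = 'demo-1'")
+```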
+
+---
+
+## 22. Monitoring & Observability
+
+### 22.1 Key metrics
+
+| Metric | Description | Alert Threshold |
+|--------|-------------|-----------------|
+| `checkpoint_write_latency_ms` | Time to write checkpoint (P50, P99) | P99 > 5000ms |
+| `checkpoint_write_errors` | Failed checkpoint writes | > 1% error rate |
+| `resume_latency_ms` | Time from event to resumed | P99 > 10000ms |
+| `lease_contention_rate` | Failed lease acquisitions | > 5% |
+| `orphan_blob_count` | GCS blobs without BQ metadata | > 1000 |
+| `paused_session_count` | Currently paused sessions | Informational |
+| `sessions_near_ttl` | Sessions expiring within 24h | > 100 |
+
+### 22.2 Dashboards
+
+**Operational dashboard:**
+- Active sessions by state (RUNNING/PAUSED/KILLED)
+- Checkpoint write success rate
+- Resume latency distribution
+- Lease acquisition success rate
+
+**Cost dashboard:**
+- Storage usage (BQ + GCS)
+- Query costs by type
+- Compute costs (resume service)
+
+### 22.3 Alerting
+
+| Alert | Condition | Severity |
+|-------|-----------|----------|
+| High checkpoint failure rate | > 1% errors in 5 min | P1 |
+| Resume service unhealthy | > 50% error rate | P1 |
+| Lease contention spike | > 10% contention in 5 min | P2 |
+| Orphan blob accumulation | > 10,000 orphans | P3 |
+| Sessions nearing TTL | > 100 sessions within 1h of TTL | P3 |
+
+### 22.4 Logging
+
+All operations emit structured logs with:
+- `session_id`, `checkpoint_seq`, `operation`
+- `latency_ms`, `success`, `error_code`
+- Correlation IDs for tracing
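+
+A minimal sketch of one such record (emitting JSON to a standard logger is an assumption
+here; Cloud Logging structured payloads would work equally well):
+
+```python
+import json
+import logging
+import time
+from typing import Optional
+
+logger = logging.getLogger("adk.durable")
+
+def log_checkpoint_op(session_id: str, checkpoint_seq: int, operation: str,
+                      latency_ms: float, success: bool,
+                      error_code: Optional[str] = None,
+                      correlation_id: Optional[str] = None) -> None:
+    """Emit one structured record with the fields listed above."""
+    logger.info(json.dumps({
+        "session_id": session_id,
+        "checkpoint_seq": checkpoint_seq,
+        "operation": operation,
+        "latency_ms": latency_ms,
+        "success": success,
+        "error_code": error_code,
+        "correlation_id": correlation_id,
+        "ts": time.time(),
+    }))
+```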
+
+---
+
+## 23. Rollback & Recovery Procedures
+
+### 23.1 Checkpoint rollback
+
+```python
+def rollback_to_checkpoint(session_id: str, target_seq: int) -> None:
+    """Rollback session to a previous checkpoint.
+
+    Use cases:
+    - Agent made incorrect decisions
+    - Corrupted state detected
+    - Testing/debugging
+    """
+    # 1. Verify target checkpoint exists (raises if missing)
+    checkpoint = store.read_checkpoint(session_id, target_seq)
+
+    # 2. Capture the current pointer before moving it (needed for the audit row)
+    current_seq = store.get_session(session_id).current_checkpoint_seq
+
+    # 3. Update session to point to target checkpoint
+    bq.update("sessions", session_id, {
+        "current_checkpoint_seq": target_seq,
+        "updated_at": now(),
+    })
+
+    # 4. Log rollback for audit
+    bq.insert("events", {
+        "session_id": session_id,
+        "event_type": "ROLLBACK",
+        "event_payload": {"from_seq": current_seq, "to_seq": target_seq},
+        "event_time": now(),
+    })
+```
+
+### 23.2 Session recovery
+
+| Scenario | Recovery Procedure |
+|----------|-------------------|
+| Resume service crash | Automatic retry via Pub/Sub redelivery |
+| Checkpoint corruption | Rollback to previous checkpoint |
+| BQ metadata loss | Rebuild from GCS blob inventory |
+| GCS blob loss | Mark checkpoint invalid, resume from earlier |
+| Lease stuck | Auto-expire after TTL, manual release available |
+
+### 23.3 Disaster recovery
+
+**Same-region:**
+- BQ point-in-time recovery (7 days default)
+- GCS object versioning
+
+**Cross-region (v2):**
+- BQ dataset replication
+- GCS dual-region or multi-region buckets
+
+---
+
+## 24. Implementation Details (v1)
+
+### 24.1 Module Structure
+
+```
+src/google/adk/durable/
+├── __init__.py # Public exports
+├── config.py # DurableSessionConfig
+├── checkpointable_state.py # CheckpointableAgentState ABC
+├── workspace_snapshotter.py # GCS workspace snapshot handling
+└── stores/
+ ├── __init__.py # Store exports
+ ├── base_checkpoint_store.py # DurableSessionStore ABC
+ └── bigquery_checkpoint_store.py # BQ + GCS implementation
+```
+
+### 24.2 Key Implementation Decisions
+
+| Decision | Rationale |
+|----------|-----------|
+| DML INSERT over streaming inserts | BigQuery streaming buffer limitations prevent immediate UPDATE after streaming insert |
+| JSON column type checking | BigQuery returns JSON columns as Python dicts rather than strings, so the store detects the type at runtime |
+| SHA-256 verification | Checkpoint integrity verification on read |
+| Async-first API | All store methods are async for non-blocking I/O |
+| Experimental decorators | All public classes marked `@experimental` for API stability signals |
+
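+The SHA-256 decision above reduces to a small guard on the read path; a sketch
+(the exception type is illustrative):
+
+```python
+import hashlib
+
+
+class ChecksumMismatchError(RuntimeError):
+  """Raised when a checkpoint blob does not match its recorded digest."""
+
+
+def verify_checkpoint_blob(state_blob: bytes, expected_sha256: str) -> bytes:
+  """Return the blob if its SHA-256 matches the metadata row, else raise."""
+  actual = hashlib.sha256(state_blob).hexdigest()
+  if actual != expected_sha256:
+    raise ChecksumMismatchError(
+        f"Checkpoint blob corrupted: expected {expected_sha256}, got {actual}"
+    )
+  return state_blob
+```
+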
+### 24.3 BigQuery Table Schema (Simplified for v1)
+
+```sql
+-- Sessions table
+CREATE TABLE `project.adk_metadata.sessions` (
+ session_id STRING NOT NULL,
+ status STRING NOT NULL,
+ agent_name STRING NOT NULL,
+ created_at TIMESTAMP NOT NULL,
+ updated_at TIMESTAMP NOT NULL,
+ current_checkpoint_seq INT64 NOT NULL,
+ active_lease_id STRING,
+ lease_expiry TIMESTAMP,
+ ttl_expiry TIMESTAMP,
+ metadata JSON,
+ PRIMARY KEY (session_id) NOT ENFORCED
+);
+
+-- Checkpoints table
+CREATE TABLE `project.adk_metadata.checkpoints` (
+ session_id STRING NOT NULL,
+ checkpoint_seq INT64 NOT NULL,
+ created_at TIMESTAMP NOT NULL,
+ gcs_state_uri STRING NOT NULL,
+ sha256 STRING NOT NULL,
+ size_bytes INT64 NOT NULL,
+ agent_state JSON,
+ trigger STRING NOT NULL,
+ PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED
+);
+```
+
+### 24.4 Demo Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ Cloud Run: durable-demo │
+│ ┌───────────────────────────────────────────────────────────┐ │
+│ │ FastAPI Server │ │
+│ │ - demo_server.py: Task management + checkpoint APIs │ │
+│ │ - demo_ui.html: Real-time visualization UI │ │
+│ └───────────────────────────────────────────────────────────┘ │
+│ │ │
+│ ▼ │
+│ ┌───────────────────────────────────────────────────────────┐ │
+│ │ BigQueryCheckpointStore │ │
+│ │ - Two-phase commit (GCS blob → BQ metadata) │ │
+│ │ - Lease management for concurrency │ │
+│ │ - SHA-256 integrity verification │ │
+│ └───────────────────────────────────────────────────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+ │ │
+ ▼ ▼
+ ┌──────────────────┐ ┌──────────────────┐
+ │ BigQuery │ │ GCS │
+ │ adk_metadata │ │ checkpoints/ │
+ │ - sessions │ │ {session_id}/ │
+ │ - checkpoints │ │ {seq}/state.json│
+ └──────────────────┘ └──────────────────┘
+```
+
+### 24.5 Demo Features
+
+| Feature | Implementation |
+|---------|----------------|
+| Task types | Sentiment, Anomaly, Trend, Clustering analysis |
+| Checkpoint interval | Every 10 seconds |
+| Failure simulation | Manual trigger via UI |
+| Resume from checkpoint | Automatic state restoration |
+| Final output | Task-specific analysis reports |
+| Real-time UI | Polling-based status updates |
+| Checkpoint timeline | Visual checkpoint history |
+
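+The 10-second checkpoint interval can be implemented as a background task that
+snapshots the demo task's progress while the main work runs. A simplified
+sketch; `work` and `save_checkpoint` are stand-ins for the demo's actual task
+and store calls:
+
+```python
+import asyncio
+
+
+async def run_with_periodic_checkpoints(work, save_checkpoint, interval_s=10):
+  """Run `work(progress)` while checkpointing `progress` every interval_s."""
+  progress = {"step": 0}
+  seq = 0
+
+  async def checkpointer():
+    nonlocal seq
+    while True:
+      await asyncio.sleep(interval_s)
+      seq += 1
+      await save_checkpoint(seq, dict(progress))  # snapshot current progress
+
+  checkpoint_task = asyncio.create_task(checkpointer())
+  try:
+    return await work(progress)
+  finally:
+    checkpoint_task.cancel()  # stop checkpointing once work finishes or fails
+```
+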
+---
+
+# Appendix A: Feature-to-Requirement Mapping (Demo Coverage)
+
+| Feature | Functional Purpose | Long-horizon benefit |
+|---------|--------------------|-----------------------|
+| Two-phase checkpoint commit | Atomic visibility of saved state | Prevents resuming from half-written checkpoints |
+| BigQuery job ledger | Track async job IDs and states | Hibernate during hours-long jobs |
+| Workspace snapshotting | Preserve files and drafts | Warm start for coding/report tasks |
+| Lease-based resuming | Prevent concurrent resumes | Avoids corruption in parallel runs |
+| Durable lifecycle model | Add a persistent PAUSED state | Releases compute; supports an indefinite horizon |
+| Authoritative reconciliation | Sync with authoritative cloud job state | Prevents duplicate job submissions |
+| Integration with ResumabilityConfig | Backward compatibility | Incremental adoption |
+
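+For reference, the two-phase commit in the first row reduces to "blob first,
+metadata second": the checkpoint only becomes visible once the BigQuery row
+exists, and a blob whose metadata insert failed is left as an orphan for GC
+(Appendix D). A condensed sketch against the simplified v1 schema in §24.3,
+using a DML INSERT as per §24.2; error handling and retries are omitted:
+
+```python
+import hashlib
+import json
+
+from google.cloud import bigquery
+from google.cloud import storage
+
+
+def commit_checkpoint(project, bucket_name, session_id, seq, state: dict):
+  blob_bytes = json.dumps(state).encode("utf-8")
+  object_name = f"checkpoints/{session_id}/{seq}.json"
+
+  # Phase 1: write the state blob to GCS.
+  storage.Client(project=project).bucket(bucket_name).blob(
+      object_name
+  ).upload_from_string(blob_bytes, content_type="application/json")
+
+  # Phase 2: record checkpoint metadata in BigQuery.
+  bigquery.Client(project=project).query(
+      f"""
+      INSERT INTO `{project}.adk_metadata.checkpoints`
+        (session_id, checkpoint_seq, created_at, gcs_state_uri, sha256,
+         size_bytes, trigger)
+      VALUES (@sid, @seq, CURRENT_TIMESTAMP(), @uri, @sha, @size, 'manual')
+      """,
+      job_config=bigquery.QueryJobConfig(query_parameters=[
+          bigquery.ScalarQueryParameter("sid", "STRING", session_id),
+          bigquery.ScalarQueryParameter("seq", "INT64", seq),
+          bigquery.ScalarQueryParameter(
+              "uri", "STRING", f"gs://{bucket_name}/{object_name}"),
+          bigquery.ScalarQueryParameter(
+              "sha", "STRING", hashlib.sha256(blob_bytes).hexdigest()),
+          bigquery.ScalarQueryParameter("size", "INT64", len(blob_bytes)),
+      ]),
+  ).result()
+```
+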
+---
+
+# Appendix B: BigQuery SQL (Copy/Paste)
+
+## B0) Dataset
+
+```sql
+CREATE SCHEMA IF NOT EXISTS `your_project.adk_metadata`
+OPTIONS (
+ location = "US",
+ description = "ADK Durable Session control-plane metadata (sessions, checkpoints, events)."
+);
+```
+
+## B1) sessions
+
+```sql
+CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.sessions` (
+ session_id STRING NOT NULL,
+ parent_session_id STRING,
+ owner_principal STRING NOT NULL,
+
+ status STRING NOT NULL,
+ agent_name STRING NOT NULL,
+ agent_version STRING NOT NULL,
+ persistence_mode STRING NOT NULL,
+
+ created_at TIMESTAMP NOT NULL,
+ updated_at TIMESTAMP NOT NULL,
+
+ current_checkpoint_seq INT64 NOT NULL,
+ active_lease_id STRING,
+ lease_expiry TIMESTAMP,
+
+ ttl_expiry TIMESTAMP NOT NULL,
+
+ labels JSON,
+ metadata JSON,
+
+ state_schema_version INT64 NOT NULL,
+ state_schema_version_label STRING,
+
+ -- Primary key constraint (BigQuery syntax)
+ PRIMARY KEY (session_id) NOT ENFORCED
+)
+PARTITION BY DATE(updated_at)
+CLUSTER BY status, owner_principal
+OPTIONS (description = "Durable agent session control-plane table.");
+```
+
+## B2) checkpoints
+
+```sql
+CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.checkpoints` (
+ session_id STRING NOT NULL,
+ checkpoint_seq INT64 NOT NULL,
+
+ agent_version STRING NOT NULL,
+ state_schema_version INT64 NOT NULL,
+ state_schema_version_label STRING,
+
+ created_at TIMESTAMP NOT NULL,
+
+ gcs_state_uri STRING NOT NULL,
+ gcs_workspace_uri STRING,
+
+ sha256 STRING NOT NULL,
+ size_bytes INT64 NOT NULL,
+
+ agent_state_json JSON,
+ trigger STRING NOT NULL,
+
+ num_jobs INT64,
+ num_tables_scanned INT64,
+ num_findings INT64,
+
+ checkpoint_fingerprint STRING,
+
+ -- Composite primary key
+ PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED
+)
+PARTITION BY DATE(created_at)
+CLUSTER BY session_id
+OPTIONS (description = "Checkpoint metadata; full blobs stored in GCS.");
+```
+
+## B3) events
+
+```sql
+CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.events` (
+ event_id STRING NOT NULL,
+ session_id STRING NOT NULL,
+
+ event_time TIMESTAMP NOT NULL,
+ event_type STRING NOT NULL,
+ event_payload JSON,
+
+ processed BOOL NOT NULL,
+ processed_at TIMESTAMP,
+ processing_lease_id STRING,
+
+ source STRING,
+ severity STRING,
+
+ -- Primary key
+ PRIMARY KEY (event_id) NOT ENFORCED
+)
+PARTITION BY DATE(event_time)
+CLUSTER BY session_id, processed
+OPTIONS (description = "Resume trigger events and processing audit trail.");
+```
+
+## B4) Views and lease template
+
+Latest checkpoint per session (with NULL handling):
+
+```sql
+CREATE OR REPLACE VIEW `your_project.adk_metadata.v_latest_checkpoint` AS
+SELECT
+ session_id,
+ ARRAY_AGG(c ORDER BY checkpoint_seq DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_checkpoint
+FROM `your_project.adk_metadata.checkpoints` c
+GROUP BY session_id;
+```
+
+Paused sessions nearing TTL:
+
+```sql
+CREATE OR REPLACE VIEW `your_project.adk_metadata.v_paused_near_ttl` AS
+SELECT
+ session_id, owner_principal, agent_name, agent_version,
+ ttl_expiry, updated_at, current_checkpoint_seq,
+ TIMESTAMP_DIFF(ttl_expiry, CURRENT_TIMESTAMP(), HOUR) AS hours_until_expiry
+FROM `your_project.adk_metadata.sessions`
+WHERE status = 'PAUSED'
+ AND ttl_expiry < TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
+```
+
+Fleet status summary:
+
+```sql
+CREATE OR REPLACE VIEW `your_project.adk_metadata.v_fleet_status` AS
+SELECT
+ agent_name,
+ status,
+ COUNT(*) AS session_count,
+ AVG(current_checkpoint_seq) AS avg_checkpoints,
+ MIN(created_at) AS oldest_session,
+ MAX(updated_at) AS most_recent_activity
+FROM `your_project.adk_metadata.sessions`
+WHERE ttl_expiry > CURRENT_TIMESTAMP()
+GROUP BY agent_name, status;
+```
+
+Lease acquire template:
+
+```sql
+UPDATE `your_project.adk_metadata.sessions`
+SET active_lease_id = @lease_id,
+ lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND),
+ updated_at = CURRENT_TIMESTAMP()
+WHERE session_id = @session_id
+ AND status = 'PAUSED'
+ AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP());
+```
+
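+Whether the lease was actually acquired can be read off the DML statistics: a
+zero `num_dml_affected_rows` means another holder still has it. A sketch of
+executing the template above with the BigQuery Python client:
+
+```python
+import uuid
+from typing import Optional
+
+from google.cloud import bigquery
+
+
+def try_acquire_lease(client: bigquery.Client, session_id: str,
+                      ttl_seconds: int = 300) -> Optional[str]:
+  """Return a new lease_id if the lease was acquired, otherwise None."""
+  lease_id = str(uuid.uuid4())
+  job = client.query(
+      """
+      UPDATE `your_project.adk_metadata.sessions`
+      SET active_lease_id = @lease_id,
+          lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(),
+                                       INTERVAL @ttl_seconds SECOND),
+          updated_at = CURRENT_TIMESTAMP()
+      WHERE session_id = @session_id
+        AND status = 'PAUSED'
+        AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP())
+      """,
+      job_config=bigquery.QueryJobConfig(query_parameters=[
+          bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id),
+          bigquery.ScalarQueryParameter("ttl_seconds", "INT64", ttl_seconds),
+          bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+      ]),
+  )
+  job.result()  # wait for the UPDATE to finish
+  return lease_id if job.num_dml_affected_rows else None
+```
+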
+---
+
+# Appendix C: Sequence Diagram (Mermaid)
+
+```mermaid
+sequenceDiagram
+ participant App as ADK Application
+ participant Runner as ADK Agent Runner
+ participant ResConfig as ResumabilityConfig
+ participant DurConfig as DurableSessionConfig
+ participant Store as Durable Store (BQ+GCS)
+ participant BQ as BigQuery
+ participant PS as Pub/Sub
+ participant Resumer as Resume Service
+
+ Note over App,Resumer: Initialization
+ App->>Runner: Create with ResumabilityConfig + DurableSessionConfig
+ Runner->>ResConfig: is_resumable = True
+ Runner->>DurConfig: is_durable = True
+
+ Note over App,Resumer: Execution & Pause
+ Runner->>BQ: Submit async jobs (N)
+ Runner->>ResConfig: should_pause_invocation() = True
+ Runner->>Store: Phase1: Write state blob to GCS
+ Runner->>Store: Phase2: Insert checkpoint metadata (BQ)
+ Runner->>Store: Update session status = PAUSED
+ Runner-->>App: Yield control (zero compute)
+
+ Note over App,Resumer: External Events
+ BQ-->>PS: Job completion event(s)
+ PS-->>Resumer: Deliver event (may be duplicated)
+
+ Note over App,Resumer: Resume
+ Resumer->>Store: Acquire lease(session_id)
+
+ alt Lease already held
+ Store-->>Resumer: Lease denied
+ Resumer->>Resumer: Back off and retry / skip event
+ else Lease granted
+ Store-->>Resumer: Lease granted
+ Resumer->>Store: Read latest checkpoint
+ Resumer->>BQ: Reconcile job ledger (authoritative)
+ Resumer->>Runner: Resume session with checkpoint
+ Runner->>Store: Periodic checkpoint updates
+ Runner->>Store: Finalize session status = KILLED
+ Resumer->>Store: Release lease(session_id)
+ end
+```
+
+---
+
+# Appendix D: Failure Modes (Operational)
+
+| Failure Mode | Detection | Recovery |
+|--------------|-----------|----------|
+| Duplicate Pub/Sub event | Lease acquisition fails | Skip, idempotent |
+| Partial checkpoint write (Phase 1) | GCS upload error | Retry, no cleanup needed |
+| Partial checkpoint write (Phase 2) | BQ insert error | Orphan blob GC |
+| Resume crash mid-execution | Lease expires, no heartbeat | Re-acquire lease, resume from checkpoint |
+| Jobs still running on resume | Reconciliation detects RUNNING | Re-register completion callback |
+| Jobs failed/cancelled | Reconciliation detects state | Agent retry policy, audit decision |
+| Permission revoked | API error | Fail with explicit error + audit row |
+| TTL expiry | Scheduled job | GC + mark expired |
+| Checkpoint corruption | SHA256 mismatch | Rollback to previous checkpoint |
+| State schema mismatch | Version check on load | Hard-fail (v1), migrate (v2) |
+
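+A sketch of the orphan-blob GC referenced above: list checkpoint blobs in GCS,
+subtract the URIs BigQuery knows about, and delete what remains once it is
+older than a grace period (the grace period avoids racing an in-flight
+Phase-2 metadata insert). Names and the one-hour grace are illustrative:
+
+```python
+from datetime import datetime, timedelta, timezone
+
+from google.cloud import bigquery
+from google.cloud import storage
+
+
+def gc_orphan_blobs(project, bucket_name, grace=timedelta(hours=1)) -> int:
+  """Delete GCS checkpoint blobs that have no BigQuery metadata row."""
+  bq = bigquery.Client(project=project)
+  known_uris = {
+      row.gcs_state_uri
+      for row in bq.query(
+          f"SELECT gcs_state_uri FROM `{project}.adk_metadata.checkpoints`"
+      ).result()
+  }
+
+  bucket = storage.Client(project=project).bucket(bucket_name)
+  cutoff = datetime.now(timezone.utc) - grace
+  deleted = 0
+  for blob in bucket.list_blobs(prefix="checkpoints/"):
+    uri = f"gs://{bucket_name}/{blob.name}"
+    if uri not in known_uris and blob.time_created < cutoff:
+      blob.delete()
+      deleted += 1
+  return deleted
+```
+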
+---
+
+# Appendix E: Integration Example
+
+```python
+from google.adk.apps import App, ResumabilityConfig
+from google.adk.agents import LlmAgent
+from google.adk.durable import (
+    DurableSessionConfig,
+    BigQueryCheckpointStore,
+    PubSubEventSource,
+)
+from google.adk.runners import Runner
+from google.adk.sessions import DatabaseSessionService
+from google.genai.types import Content, Part
+
+# Create durable-enabled application
+app = App(
+ name="pii_scanner",
+ root_agent=LlmAgent(
+ name="scanner",
+ model="gemini-2.0-flash",
+        instruction="Scan BigQuery tables for PII...",
+ tools=[bq_query_tool, bq_job_tool],
+ ),
+ # Existing resumability (in-process)
+ resumability_config=ResumabilityConfig(
+ is_resumable=True,
+ ),
+ # NEW: Durable cross-process persistence
+ durable_session_config=DurableSessionConfig(
+ is_durable=True,
+ checkpoint_policy="async_boundary",
+ workspace_snapshot_enabled=False,
+ checkpoint_store=BigQueryCheckpointStore(
+ project="my-project",
+ dataset="adk_metadata",
+ gcs_bucket="my-checkpoints-bucket",
+ ),
+ lease_backend="bigquery",
+ lease_ttl_seconds=300,
+ ),
+)
+
+# Run with runner (checkpoint happens automatically on pause)
+runner = Runner(
+ app=app,
+ session_service=DatabaseSessionService(...),
+)
+
+# Events from Pub/Sub automatically trigger resume
+async for event in runner.run_async(
+ user_id="user-123",
+ session_id="session-456",
+ new_message=Content(parts=[Part(text="Scan all tables for PII")]),
+):
+ print(event)
+```
+
+---
+
+# References (URLs)
+
+1. LangGraph durable execution: [https://docs.langchain.com/oss/python/langgraph/durable-execution/](https://docs.langchain.com/oss/python/langgraph/durable-execution/)
+2. LangGraph persistence/checkpointers: [https://docs.langchain.com/oss/python/langgraph/persistence/](https://docs.langchain.com/oss/python/langgraph/persistence/)
+3. LangGraph overview: [https://docs.langchain.com/oss/python/langgraph/](https://docs.langchain.com/oss/python/langgraph/)
+4. LangGraph checkpoints reference: [https://reference.langchain.com/python/langgraph/checkpoints/](https://reference.langchain.com/python/langgraph/checkpoints/)
+5. Deep Agents overview: [https://docs.langchain.com/oss/python/deepagents/overview/](https://docs.langchain.com/oss/python/deepagents/overview/)
+6. Deep Agents long-term memory: [https://docs.langchain.com/oss/python/deepagents/long-term-memory/](https://docs.langchain.com/oss/python/deepagents/long-term-memory/)
+7. Anthropic long-running harnesses: [https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents)
+8. ADK ResumabilityConfig: `src/google/adk/apps/app.py:42-58`
+9. ADK InvocationContext pause: `src/google/adk/agents/invocation_context.py:355-389`
diff --git a/contributing/samples/long_running_task/setup.py b/contributing/samples/long_running_task/setup.py
new file mode 100644
index 0000000000..c97ecad3e9
--- /dev/null
+++ b/contributing/samples/long_running_task/setup.py
@@ -0,0 +1,246 @@
+#!/usr/bin/env python
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Setup script for the durable session demo.
+
+This script creates the required BigQuery dataset, tables, and GCS bucket
+for the durable session persistence demo.
+
+Usage:
+  python setup.py            # create the dataset, tables, and bucket
+  python setup.py --verify   # verify that the resources exist
+  python setup.py --cleanup  # delete the resources
+
+Prerequisites:
+ - Google Cloud SDK installed and configured
+ - BigQuery API enabled
+ - Cloud Storage API enabled
+ - Appropriate IAM permissions:
+ - roles/bigquery.dataEditor
+ - roles/storage.objectAdmin
+"""
+
+import argparse
+import subprocess
+import sys
+
+# Configuration
+PROJECT_ID = "test-project-0728-467323"
+DATASET = "adk_metadata"
+GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints"
+LOCATION = "US"
+
+
+def run_command(
+ cmd: list[str], check: bool = True
+) -> subprocess.CompletedProcess:
+ """Run a shell command and return the result."""
+ print(f"Running: {' '.join(cmd)}")
+ result = subprocess.run(cmd, capture_output=True, text=True)
+ if check and result.returncode != 0:
+ print(f"Error: {result.stderr}")
+ if not result.stderr.strip().endswith("already exists"):
+ sys.exit(1)
+ return result
+
+
+def create_gcs_bucket():
+ """Create the GCS bucket for checkpoint blobs."""
+ print("\n=== Creating GCS Bucket ===")
+ run_command(
+ ["gsutil", "mb", "-l", LOCATION, f"gs://{GCS_BUCKET}"], check=False
+ )
+
+ # Set lifecycle policy to delete old checkpoints after 30 days
+ lifecycle_config = """
+{
+ "lifecycle": {
+ "rule": [
+ {
+ "action": {"type": "Delete"},
+ "condition": {"age": 30}
+ }
+ ]
+ }
+}
+"""
+ with open("/tmp/lifecycle.json", "w") as f:
+ f.write(lifecycle_config)
+
+ run_command(
+ [
+ "gsutil",
+ "lifecycle",
+ "set",
+ "/tmp/lifecycle.json",
+ f"gs://{GCS_BUCKET}",
+ ],
+ check=False,
+ )
+
+ print(f"GCS bucket created: gs://{GCS_BUCKET}")
+
+
+def create_bigquery_dataset():
+ """Create the BigQuery dataset."""
+ print("\n=== Creating BigQuery Dataset ===")
+ run_command(
+ [
+ "bq",
+ "mk",
+ "--dataset",
+ "--location",
+ LOCATION,
+ f"{PROJECT_ID}:{DATASET}",
+ ],
+ check=False,
+ )
+ print(f"BigQuery dataset created: {PROJECT_ID}.{DATASET}")
+
+
+def create_sessions_table():
+ """Create the sessions metadata table."""
+ print("\n=== Creating Sessions Table ===")
+
+ schema = """
+session_id:STRING,
+status:STRING,
+agent_name:STRING,
+created_at:TIMESTAMP,
+updated_at:TIMESTAMP,
+current_checkpoint_seq:INT64,
+active_lease_id:STRING,
+lease_expiry:TIMESTAMP,
+ttl_expiry:TIMESTAMP,
+metadata:JSON
+"""
+
+ run_command(
+ [
+ "bq",
+ "mk",
+ "--table",
+ f"{PROJECT_ID}:{DATASET}.sessions",
+ schema.replace("\n", "").strip(),
+ ],
+ check=False,
+ )
+
+ print(f"Sessions table created: {PROJECT_ID}.{DATASET}.sessions")
+
+
+def create_checkpoints_table():
+ """Create the checkpoints table."""
+ print("\n=== Creating Checkpoints Table ===")
+
+ schema = """
+session_id:STRING,
+checkpoint_seq:INT64,
+created_at:TIMESTAMP,
+gcs_state_uri:STRING,
+sha256:STRING,
+size_bytes:INT64,
+agent_state_json:JSON,
+trigger:STRING
+"""
+
+ run_command(
+ [
+ "bq",
+ "mk",
+ "--table",
+ f"{PROJECT_ID}:{DATASET}.checkpoints",
+ schema.replace("\n", "").strip(),
+ ],
+ check=False,
+ )
+
+ print(f"Checkpoints table created: {PROJECT_ID}.{DATASET}.checkpoints")
+
+
+def verify_setup():
+ """Verify that all resources were created successfully."""
+ print("\n=== Verifying Setup ===")
+
+ # Check GCS bucket
+ result = run_command(["gsutil", "ls", f"gs://{GCS_BUCKET}"], check=False)
+ if result.returncode == 0:
+ print(f"[OK] GCS bucket exists: gs://{GCS_BUCKET}")
+ else:
+ print(f"[FAIL] GCS bucket not found: gs://{GCS_BUCKET}")
+
+ # Check BigQuery tables
+ for table in ["sessions", "checkpoints"]:
+ result = run_command(
+ ["bq", "show", f"{PROJECT_ID}:{DATASET}.{table}"], check=False
+ )
+ if result.returncode == 0:
+ print(f"[OK] BigQuery table exists: {PROJECT_ID}.{DATASET}.{table}")
+ else:
+ print(f"[FAIL] BigQuery table not found: {PROJECT_ID}.{DATASET}.{table}")
+
+
+def cleanup():
+ """Delete all resources created by this script."""
+ print("\n=== Cleaning Up Resources ===")
+
+ # Delete BigQuery tables
+ for table in ["sessions", "checkpoints"]:
+ run_command(
+ ["bq", "rm", "-f", f"{PROJECT_ID}:{DATASET}.{table}"], check=False
+ )
+
+ # Delete BigQuery dataset
+ run_command(["bq", "rm", "-f", "-d", f"{PROJECT_ID}:{DATASET}"], check=False)
+
+ # Delete GCS bucket
+ run_command(["gsutil", "rm", "-r", f"gs://{GCS_BUCKET}"], check=False)
+
+ print("Cleanup complete.")
+
+
+def main():
+ parser = argparse.ArgumentParser(
+ description="Setup resources for the durable session demo"
+ )
+ parser.add_argument(
+ "--cleanup",
+ action="store_true",
+ help="Delete all resources instead of creating them",
+ )
+ parser.add_argument(
+ "--verify", action="store_true", help="Only verify that resources exist"
+ )
+ args = parser.parse_args()
+
+ print(f"Project: {PROJECT_ID}")
+ print(f"Dataset: {DATASET}")
+ print(f"GCS Bucket: {GCS_BUCKET}")
+ print(f"Location: {LOCATION}")
+
+ if args.cleanup:
+ cleanup()
+ elif args.verify:
+ verify_setup()
+ else:
+ create_gcs_bucket()
+ create_bigquery_dataset()
+ create_sessions_table()
+ create_checkpoints_table()
+ verify_setup()
+
+ print("\nDone!")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/contributing/samples/long_running_task/tools.py b/contributing/samples/long_running_task/tools.py
new file mode 100644
index 0000000000..4dbbf4455c
--- /dev/null
+++ b/contributing/samples/long_running_task/tools.py
@@ -0,0 +1,489 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Long-running tools for the durable session demo."""
+
+import asyncio
+import random
+from datetime import datetime
+from typing import Any
+
+from google.adk.tools.tool_context import ToolContext
+
+
+async def simulate_long_running_scan(
+ table_name: str,
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Simulate a long-running BigQuery table scan.
+
+ This tool demonstrates durable checkpointing by simulating a scan that
+ takes several seconds. In a real scenario, this would be a BigQuery job
+ that processes large amounts of data.
+
+ Args:
+ table_name: The fully-qualified BigQuery table name to scan.
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ A dictionary with scan results including status, row count, and findings.
+ """
+ # Simulate processing time (5-10 seconds)
+ processing_time = random.uniform(5.0, 10.0)
+ await asyncio.sleep(processing_time)
+
+ # Simulate scan results
+ rows_scanned = random.randint(100000, 10000000)
+ findings = []
+
+ # Generate some sample findings based on table name
+ if "shakespeare" in table_name.lower():
+ findings = [
+ "Found 5 instances of 'to be or not to be'",
+ "Most common word: 'the' (27,801 occurrences)",
+ "Unique words: 29,066",
+ ]
+ elif "github" in table_name.lower():
+ findings = [
+ "Most active repository: kubernetes/kubernetes",
+ "Peak commit hour: 14:00 UTC",
+ "Average commits per day: 45,000",
+ ]
+ else:
+ findings = [
+ f"Scanned {rows_scanned:,} rows",
+ "No anomalies detected",
+ "Data quality: 99.8%",
+ ]
+
+ return {
+ "status": "complete",
+ "table": table_name,
+ "rows_scanned": rows_scanned,
+ "processing_time_seconds": round(processing_time, 2),
+ "findings": findings,
+ }
+
+
+async def run_data_pipeline(
+ source_table: str,
+ destination_table: str,
+ transformations: list[str],
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Run a data transformation pipeline.
+
+ This simulates a multi-stage data pipeline that would typically be
+ checkpointed at each stage for durability.
+
+ Args:
+ source_table: The source BigQuery table.
+ destination_table: The destination BigQuery table.
+ transformations: List of transformation operations to apply.
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ Pipeline execution results.
+ """
+ stages_completed = []
+ total_rows_processed = 0
+
+ # Simulate each transformation stage
+ for i, transformation in enumerate(transformations):
+ # Simulate stage processing time
+ stage_time = random.uniform(2.0, 5.0)
+ await asyncio.sleep(stage_time)
+
+ rows_processed = random.randint(10000, 100000)
+ total_rows_processed += rows_processed
+
+ stages_completed.append({
+ "stage": i + 1,
+ "transformation": transformation,
+ "rows_processed": rows_processed,
+ "duration_seconds": round(stage_time, 2),
+ })
+
+ return {
+ "status": "complete",
+ "source_table": source_table,
+ "destination_table": destination_table,
+ "stages_completed": stages_completed,
+ "total_rows_processed": total_rows_processed,
+ "total_stages": len(transformations),
+ }
+
+
+async def run_extended_analysis(
+ job_name: str,
+ duration_minutes: int,
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Run an extended analysis job for a specified duration.
+
+ This tool simulates a long-running analysis job that can run for 10+ minutes.
+ Use this to test durable checkpointing with extended job durations.
+
+ Args:
+ job_name: A descriptive name for the analysis job.
+ duration_minutes: How many minutes the job should run (1-60 minutes).
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ Analysis job results with timing and metrics.
+ """
+ start_time = datetime.now()
+ duration_seconds = min(max(duration_minutes, 1), 60) * 60
+
+ # Process in chunks, reporting progress
+ chunk_size = 30 # Report every 30 seconds
+ chunks_completed = 0
+ total_chunks = duration_seconds // chunk_size
+
+ metrics = {
+ "records_processed": 0,
+ "anomalies_detected": 0,
+ "patterns_found": 0,
+ }
+
+ for i in range(0, duration_seconds, chunk_size):
+ remaining = min(chunk_size, duration_seconds - i)
+ await asyncio.sleep(remaining)
+
+ chunks_completed += 1
+ metrics["records_processed"] += random.randint(100000, 500000)
+ metrics["anomalies_detected"] += random.randint(0, 10)
+ metrics["patterns_found"] += random.randint(1, 5)
+
+ end_time = datetime.now()
+ actual_duration = (end_time - start_time).total_seconds()
+
+ return {
+ "status": "complete",
+ "job_name": job_name,
+ "requested_duration_minutes": duration_minutes,
+ "actual_duration_seconds": round(actual_duration, 2),
+ "actual_duration_minutes": round(actual_duration / 60, 2),
+ "start_time": start_time.isoformat(),
+ "end_time": end_time.isoformat(),
+ "metrics": metrics,
+ "summary": (
+ f"Processed {metrics['records_processed']:,} records, "
+ f"found {metrics['anomalies_detected']} anomalies and "
+ f"{metrics['patterns_found']} patterns"
+ ),
+ }
+
+
+async def run_ml_training_job(
+ model_name: str,
+ dataset_size: str,
+ epochs: int,
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Run a simulated ML model training job.
+
+ This tool simulates training a machine learning model, which can take
+ 10+ minutes depending on the dataset size and epochs.
+
+ Dataset sizes and approximate training times:
+ - "small": ~2 minutes
+ - "medium": ~5 minutes
+ - "large": ~10 minutes
+ - "xlarge": ~15 minutes
+ - "enterprise": ~30 minutes
+
+ Args:
+ model_name: Name for the model being trained.
+ dataset_size: Size of dataset - "small", "medium", "large", "xlarge", or "enterprise".
+ epochs: Number of training epochs (1-100).
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ Training results with metrics and model performance.
+ """
+ start_time = datetime.now()
+
+ # Map dataset size to base training time (in seconds)
+ size_to_time = {
+ "small": 120, # 2 minutes
+ "medium": 300, # 5 minutes
+ "large": 600, # 10 minutes
+ "xlarge": 900, # 15 minutes
+ "enterprise": 1800, # 30 minutes
+ }
+
+ base_time = size_to_time.get(dataset_size.lower(), 300)
+ epochs = min(max(epochs, 1), 100)
+
+ # Total time scales with epochs (but not linearly)
+ total_time = base_time * (1 + (epochs - 1) * 0.1)
+ total_time = min(total_time, 3600) # Cap at 1 hour
+
+ # Simulate training epochs
+ epoch_results = []
+ time_per_epoch = total_time / epochs
+
+ for epoch in range(1, epochs + 1):
+ await asyncio.sleep(time_per_epoch)
+
+ # Simulate improving metrics over epochs
+ base_loss = 2.5 - (epoch / epochs) * 2.0
+ loss = base_loss + random.uniform(-0.1, 0.1)
+ accuracy = min(0.5 + (epoch / epochs) * 0.45 + random.uniform(-0.02, 0.02), 0.99)
+
+ epoch_results.append({
+ "epoch": epoch,
+ "loss": round(loss, 4),
+ "accuracy": round(accuracy, 4),
+ "learning_rate": round(0.001 * (0.95 ** (epoch - 1)), 6),
+ })
+
+ end_time = datetime.now()
+ actual_duration = (end_time - start_time).total_seconds()
+
+ final_metrics = epoch_results[-1] if epoch_results else {}
+
+ return {
+ "status": "complete",
+ "model_name": model_name,
+ "dataset_size": dataset_size,
+ "epochs_completed": epochs,
+ "start_time": start_time.isoformat(),
+ "end_time": end_time.isoformat(),
+ "actual_duration_seconds": round(actual_duration, 2),
+ "actual_duration_minutes": round(actual_duration / 60, 2),
+ "final_loss": final_metrics.get("loss"),
+ "final_accuracy": final_metrics.get("accuracy"),
+ "training_history": epoch_results[-5:], # Last 5 epochs
+ "model_artifact": f"gs://models/{model_name}/v1/model.pkl",
+ }
+
+
+async def run_batch_etl_job(
+ job_id: str,
+ source_tables: list[str],
+ target_table: str,
+ processing_minutes: int,
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Run a batch ETL (Extract, Transform, Load) job.
+
+ This tool simulates a large-scale ETL job that processes multiple source
+ tables and loads data into a target table. Can run for 10+ minutes.
+
+ Args:
+ job_id: Unique identifier for this ETL job.
+ source_tables: List of source table names to process.
+ target_table: Destination table for processed data.
+ processing_minutes: Estimated processing time in minutes (1-60).
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ ETL job results with detailed metrics.
+ """
+ start_time = datetime.now()
+ duration_seconds = min(max(processing_minutes, 1), 60) * 60
+
+ # Process each source table
+ table_results = []
+ time_per_table = duration_seconds / max(len(source_tables), 1)
+
+ total_rows_extracted = 0
+ total_rows_transformed = 0
+ total_rows_loaded = 0
+
+ for table in source_tables:
+ await asyncio.sleep(time_per_table)
+
+ rows_extracted = random.randint(1000000, 10000000)
+ rows_transformed = int(rows_extracted * random.uniform(0.85, 0.99))
+ rows_loaded = int(rows_transformed * random.uniform(0.98, 1.0))
+
+ total_rows_extracted += rows_extracted
+ total_rows_transformed += rows_transformed
+ total_rows_loaded += rows_loaded
+
+ table_results.append({
+ "source_table": table,
+ "rows_extracted": rows_extracted,
+ "rows_transformed": rows_transformed,
+ "rows_loaded": rows_loaded,
+ "transform_ratio": round(rows_transformed / rows_extracted, 4),
+ })
+
+ end_time = datetime.now()
+ actual_duration = (end_time - start_time).total_seconds()
+
+ return {
+ "status": "complete",
+ "job_id": job_id,
+ "source_tables_processed": len(source_tables),
+ "target_table": target_table,
+ "start_time": start_time.isoformat(),
+ "end_time": end_time.isoformat(),
+ "actual_duration_seconds": round(actual_duration, 2),
+ "actual_duration_minutes": round(actual_duration / 60, 2),
+ "total_rows_extracted": total_rows_extracted,
+ "total_rows_transformed": total_rows_transformed,
+ "total_rows_loaded": total_rows_loaded,
+ "overall_success_rate": round(total_rows_loaded / total_rows_extracted, 4),
+ "table_details": table_results,
+ }
+
+
+async def run_demo_analysis(
+ analysis_type: str,
+ tool_context: ToolContext,
+) -> dict[str, Any]:
+ """Run a 1-minute demo analysis job to showcase durable checkpointing.
+
+ This tool is perfect for demos - it runs for exactly 1 minute with
+ progress updates every 10 seconds, showing how the system handles
+ long-running operations with checkpointing.
+
+ Args:
+ analysis_type: Type of analysis to run (e.g., "sentiment", "anomaly",
+ "trend", "clustering").
+ tool_context: The tool context for accessing state and artifacts.
+
+ Returns:
+ Analysis results with timing and metrics.
+ """
+ start_time = datetime.now()
+ total_duration = 60 # 1 minute
+ update_interval = 10 # Progress every 10 seconds
+
+ progress_updates = []
+ metrics = {
+ "records_analyzed": 0,
+ "insights_found": 0,
+ "confidence_score": 0.0,
+ }
+
+ for i in range(0, total_duration, update_interval):
+ await asyncio.sleep(update_interval)
+
+ progress_pct = ((i + update_interval) / total_duration) * 100
+ records_batch = random.randint(50000, 150000)
+ metrics["records_analyzed"] += records_batch
+ metrics["insights_found"] += random.randint(1, 5)
+ metrics["confidence_score"] = min(
+ 0.6 + (progress_pct / 100) * 0.35 + random.uniform(-0.02, 0.02),
+ 0.99
+ )
+
+ progress_updates.append({
+ "timestamp": datetime.now().isoformat(),
+ "progress_percent": round(progress_pct, 1),
+ "records_batch": records_batch,
+ "cumulative_records": metrics["records_analyzed"],
+ })
+
+ end_time = datetime.now()
+ actual_duration = (end_time - start_time).total_seconds()
+
+ # Generate analysis-specific insights
+ insights = {
+ "sentiment": [
+ "Overall sentiment: 72% positive",
+ "Key themes: innovation, growth, sustainability",
+ "Sentiment trend: improving over time",
+ ],
+ "anomaly": [
+ "Detected 3 significant anomalies",
+ "Anomaly cluster in Q3 data",
+ "Root cause: seasonal variation",
+ ],
+ "trend": [
+ "Strong upward trend detected",
+ "Growth rate: 15% month-over-month",
+ "Forecast: continued growth expected",
+ ],
+ "clustering": [
+ "Identified 5 distinct clusters",
+ "Largest cluster: 45% of data",
+ "Cluster separation: excellent",
+ ],
+ }.get(analysis_type.lower(), [
+ f"Completed {analysis_type} analysis",
+ "Results within expected parameters",
+ "No critical issues detected",
+ ])
+
+ return {
+ "status": "complete",
+ "analysis_type": analysis_type,
+ "start_time": start_time.isoformat(),
+ "end_time": end_time.isoformat(),
+ "duration_seconds": round(actual_duration, 2),
+ "metrics": metrics,
+ "insights": insights,
+ "progress_history": progress_updates,
+ "summary": (
+ f"Completed {analysis_type} analysis on "
+ f"{metrics['records_analyzed']:,} records. "
+ f"Found {metrics['insights_found']} insights with "
+ f"{metrics['confidence_score']:.1%} confidence."
+ ),
+ }
+
+
+def get_table_schema(table_name: str) -> dict[str, Any]:
+ """Get the schema of a BigQuery table.
+
+ This is a quick synchronous operation that doesn't require checkpointing.
+
+ Args:
+ table_name: The fully-qualified BigQuery table name.
+
+ Returns:
+ The table schema information.
+ """
+ # Simulate some common schemas
+ if "shakespeare" in table_name.lower():
+ return {
+ "table": table_name,
+ "fields": [
+ {"name": "word", "type": "STRING"},
+ {"name": "word_count", "type": "INTEGER"},
+ {"name": "corpus", "type": "STRING"},
+ {"name": "corpus_date", "type": "INTEGER"},
+ ],
+ "num_rows": 164656,
+ "size_bytes": 6432064,
+ }
+ elif "github" in table_name.lower():
+ return {
+ "table": table_name,
+ "fields": [
+ {"name": "repo_name", "type": "STRING"},
+ {"name": "path", "type": "STRING"},
+ {"name": "content", "type": "STRING"},
+ {"name": "size", "type": "INTEGER"},
+ ],
+ "num_rows": 2800000000,
+ "size_bytes": 2500000000000,
+ }
+ else:
+ return {
+ "table": table_name,
+ "fields": [
+ {"name": "id", "type": "INTEGER"},
+ {"name": "name", "type": "STRING"},
+ {"name": "created_at", "type": "TIMESTAMP"},
+ ],
+ "num_rows": 1000000,
+ "size_bytes": 100000000,
+ }
diff --git a/src/google/adk/apps/__init__.py b/src/google/adk/apps/__init__.py
index 3a5d0b0643..88d3474f3a 100644
--- a/src/google/adk/apps/__init__.py
+++ b/src/google/adk/apps/__init__.py
@@ -15,7 +15,18 @@
from .app import App
from .app import ResumabilityConfig
+
+# Lazy import for DurableSessionConfig to avoid circular imports
+def __getattr__(name: str):
+ if name == 'DurableSessionConfig':
+ from ..durable.config import DurableSessionConfig
+
+ return DurableSessionConfig
+ raise AttributeError(f'module {__name__!r} has no attribute {name!r}')
+
+
__all__ = [
'App',
'ResumabilityConfig',
+ 'DurableSessionConfig',
]
diff --git a/src/google/adk/apps/app.py b/src/google/adk/apps/app.py
index 71ea5ce5aa..8779ad67dc 100644
--- a/src/google/adk/apps/app.py
+++ b/src/google/adk/apps/app.py
@@ -14,6 +14,7 @@
from __future__ import annotations
from typing import Optional
+from typing import TYPE_CHECKING
from pydantic import BaseModel
from pydantic import ConfigDict
@@ -26,6 +27,9 @@
from ..plugins.base_plugin import BasePlugin
from ..utils.feature_decorator import experimental
+if TYPE_CHECKING:
+ from ..durable.config import DurableSessionConfig
+
def validate_app_name(name: str) -> None:
"""Ensures the provided application name is safe and intuitive."""
@@ -118,6 +122,13 @@ class App(BaseModel):
If configured, will be applied to all agents in the app.
"""
+ durable_session_config: Optional["DurableSessionConfig"] = None
+ """
+ The config for durable session persistence.
+ If configured, sessions will be checkpointed to external storage (BigQuery +
+ GCS), enabling recovery from failures and migration across hosts.
+ """
+
@model_validator(mode="after")
def _validate_name(self) -> App:
validate_app_name(self.name)
diff --git a/src/google/adk/durable/__init__.py b/src/google/adk/durable/__init__.py
new file mode 100644
index 0000000000..bdc9082a0d
--- /dev/null
+++ b/src/google/adk/durable/__init__.py
@@ -0,0 +1,33 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Durable session persistence module for ADK.
+
+This module provides checkpoint-based durability for long-running agent
+invocations, enabling recovery from failures and migration across hosts.
+"""
+
+from .checkpointable_state import CheckpointableAgentState
+from .config import DurableSessionConfig
+from .stores import BigQueryCheckpointStore
+from .stores import DurableSessionStore
+from .workspace_snapshotter import WorkspaceSnapshotter
+
+__all__ = [
+ "CheckpointableAgentState",
+ "DurableSessionConfig",
+ "DurableSessionStore",
+ "BigQueryCheckpointStore",
+ "WorkspaceSnapshotter",
+]
diff --git a/src/google/adk/durable/checkpointable_state.py b/src/google/adk/durable/checkpointable_state.py
new file mode 100644
index 0000000000..e9a372855e
--- /dev/null
+++ b/src/google/adk/durable/checkpointable_state.py
@@ -0,0 +1,114 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Abstract base class for checkpointable agent state."""
+
+from __future__ import annotations
+
+import abc
+from typing import Any
+from typing import Dict
+
+from pydantic import BaseModel
+from pydantic import ConfigDict
+
+
+class CheckpointableAgentState(BaseModel, abc.ABC):
+ """Abstract base class for agent state that can be checkpointed.
+
+ Agents that need to preserve custom state across checkpoints should inherit
+ from this class and implement the serialization methods.
+
+ Example:
+ ```python
+ class MyAgentState(CheckpointableAgentState):
+ counter: int = 0
+ processed_items: list[str] = []
+
+ def to_checkpoint_dict(self) -> dict[str, Any]:
+ return {
+ "counter": self.counter,
+ "processed_items": self.processed_items,
+ }
+
+ @classmethod
+ def from_checkpoint_dict(cls, data: dict[str, Any]) -> "MyAgentState":
+ return cls(
+ counter=data.get("counter", 0),
+ processed_items=data.get("processed_items", []),
+ )
+ ```
+ """
+
+ model_config = ConfigDict(
+ extra="allow",
+ )
+
+ @abc.abstractmethod
+ def to_checkpoint_dict(self) -> Dict[str, Any]:
+ """Serialize the state to a dictionary for checkpointing.
+
+ Returns:
+ A dictionary containing all state that should be persisted.
+ The dictionary must be JSON-serializable.
+ """
+
+ @classmethod
+ @abc.abstractmethod
+ def from_checkpoint_dict(
+ cls, data: Dict[str, Any]
+ ) -> "CheckpointableAgentState":
+ """Deserialize the state from a checkpoint dictionary.
+
+ Args:
+ data: The dictionary previously returned by to_checkpoint_dict().
+
+ Returns:
+ A new instance of the state class with restored values.
+ """
+
+
+class SimpleCheckpointableState(CheckpointableAgentState):
+ """A simple implementation of CheckpointableAgentState using a dict.
+
+ This class provides a basic implementation that stores arbitrary key-value
+ pairs. Use this when you don't need custom serialization logic.
+
+ Example:
+ ```python
+ state = SimpleCheckpointableState()
+ state.data["counter"] = 5
+ state.data["results"] = ["a", "b", "c"]
+
+ # Checkpoint
+ checkpoint = state.to_checkpoint_dict()
+
+ # Restore
+ restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint)
+ assert restored.data["counter"] == 5
+ ```
+ """
+
+ data: Dict[str, Any] = {}
+
+ def to_checkpoint_dict(self) -> Dict[str, Any]:
+ """Serialize the state to a dictionary."""
+ return {"data": self.data.copy()}
+
+ @classmethod
+ def from_checkpoint_dict(
+ cls, data: Dict[str, Any]
+ ) -> "SimpleCheckpointableState":
+ """Deserialize the state from a checkpoint dictionary."""
+ return cls(data=data.get("data", {}))
diff --git a/src/google/adk/durable/config.py b/src/google/adk/durable/config.py
new file mode 100644
index 0000000000..d4b7d91e7a
--- /dev/null
+++ b/src/google/adk/durable/config.py
@@ -0,0 +1,70 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Configuration for durable session persistence."""
+
+from __future__ import annotations
+
+from typing import Any
+from typing import Literal
+from typing import Optional
+
+from pydantic import BaseModel
+from pydantic import ConfigDict
+from pydantic import Field
+
+from ..utils.feature_decorator import experimental
+
+
+@experimental
+class DurableSessionConfig(BaseModel):
+ """Configuration for durable session persistence.
+
+ Durable sessions provide checkpoint-based persistence that survives process
+ restarts, enabling recovery from failures and migration across hosts. This
+ goes beyond the basic resumability feature by persisting session state to
+ external storage (BigQuery + GCS).
+
+ Attributes:
+ is_durable: Whether to enable durable checkpointing.
+ checkpoint_policy: When to create checkpoints:
+ - "async_boundary": Checkpoint when hitting async/long-running operations
+ - "every_turn": Checkpoint after every agent turn
+ - "manual": Only checkpoint when explicitly requested
+ checkpoint_store: The store to use for persisting checkpoints.
+ lease_timeout_seconds: How long a lease is valid before expiring.
+ max_checkpoint_size_bytes: Maximum size for checkpoint state blobs.
+ """
+
+ model_config = ConfigDict(
+ arbitrary_types_allowed=True,
+ extra="forbid",
+ )
+
+ is_durable: bool = False
+ """Whether to enable durable checkpointing."""
+
+ checkpoint_policy: Literal["async_boundary", "every_turn", "manual"] = (
+ "async_boundary"
+ )
+ """When to create checkpoints during execution."""
+
+ checkpoint_store: Optional[Any] = Field(default=None)
+ """The store to use for persisting checkpoints (DurableSessionStore)."""
+
+ lease_timeout_seconds: int = Field(default=300, ge=60, le=3600)
+ """How long a lease is valid before expiring (60-3600 seconds)."""
+
+ max_checkpoint_size_bytes: int = Field(default=10 * 1024 * 1024, ge=1024)
+ """Maximum size for checkpoint state blobs (default 10MB)."""
diff --git a/src/google/adk/durable/stores/__init__.py b/src/google/adk/durable/stores/__init__.py
new file mode 100644
index 0000000000..cb04e432b6
--- /dev/null
+++ b/src/google/adk/durable/stores/__init__.py
@@ -0,0 +1,23 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Checkpoint store implementations for durable sessions."""
+
+from .base_checkpoint_store import DurableSessionStore
+from .bigquery_checkpoint_store import BigQueryCheckpointStore
+
+__all__ = [
+ "DurableSessionStore",
+ "BigQueryCheckpointStore",
+]
diff --git a/src/google/adk/durable/stores/base_checkpoint_store.py b/src/google/adk/durable/stores/base_checkpoint_store.py
new file mode 100644
index 0000000000..e6d553cd57
--- /dev/null
+++ b/src/google/adk/durable/stores/base_checkpoint_store.py
@@ -0,0 +1,258 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Abstract base class for durable session checkpoint stores."""
+
+from __future__ import annotations
+
+import abc
+from dataclasses import dataclass
+from datetime import datetime
+from typing import Any
+from typing import Dict
+from typing import Optional
+
+
+@dataclass
+class Checkpoint:
+ """Represents a checkpoint for a durable session.
+
+ Attributes:
+ session_id: The ID of the session this checkpoint belongs to.
+ checkpoint_seq: The sequence number of this checkpoint (monotonically
+ increasing).
+ created_at: When this checkpoint was created.
+ gcs_state_uri: The GCS URI where the full state blob is stored.
+ sha256: SHA-256 hash of the state blob for integrity verification.
+ size_bytes: Size of the state blob in bytes.
+ agent_state: Small agent state stored inline in BigQuery (optional).
+ trigger: What triggered this checkpoint (e.g., "async_boundary", "manual").
+ """
+
+ session_id: str
+ checkpoint_seq: int
+ created_at: datetime
+ gcs_state_uri: str
+ sha256: str
+ size_bytes: int
+ agent_state: Optional[Dict[str, Any]] = None
+ trigger: str = "async_boundary"
+
+
+@dataclass
+class SessionMetadata:
+ """Metadata about a durable session.
+
+ Attributes:
+ session_id: The unique session identifier.
+ status: Current status ("active", "paused", "completed", "failed").
+ agent_name: Name of the root agent for this session.
+ created_at: When the session was created.
+ updated_at: When the session was last updated.
+ current_checkpoint_seq: The latest checkpoint sequence number.
+ active_lease_id: ID of the current lease holder (if any).
+ lease_expiry: When the current lease expires.
+ ttl_expiry: When this session should be garbage collected.
+ metadata: Additional custom metadata.
+ """
+
+ session_id: str
+ status: str
+ agent_name: str
+ created_at: datetime
+ updated_at: datetime
+ current_checkpoint_seq: int
+ active_lease_id: Optional[str] = None
+ lease_expiry: Optional[datetime] = None
+ ttl_expiry: Optional[datetime] = None
+ metadata: Optional[Dict[str, Any]] = None
+
+
+class DurableSessionStore(abc.ABC):
+ """Abstract base class for checkpoint stores.
+
+ A checkpoint store provides persistent storage for session checkpoints,
+ enabling recovery from failures and migration across hosts.
+
+ Implementations must provide:
+ - Checkpoint write/read operations with two-phase commit
+ - Lease management to prevent concurrent modifications
+ - Session metadata management
+ """
+
+ @abc.abstractmethod
+ async def create_session(
+ self,
+ *,
+ session_id: str,
+ agent_name: str,
+ metadata: Optional[Dict[str, Any]] = None,
+ ) -> SessionMetadata:
+ """Create a new durable session.
+
+ Args:
+ session_id: Unique identifier for the session.
+ agent_name: Name of the root agent.
+ metadata: Optional custom metadata.
+
+ Returns:
+ The created session metadata.
+
+ Raises:
+ ValueError: If a session with this ID already exists.
+ """
+
+ @abc.abstractmethod
+ async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]:
+ """Get session metadata.
+
+ Args:
+ session_id: The session to retrieve.
+
+ Returns:
+ The session metadata, or None if not found.
+ """
+
+ @abc.abstractmethod
+ async def update_session_status(
+ self, *, session_id: str, status: str
+ ) -> None:
+ """Update the status of a session.
+
+ Args:
+ session_id: The session to update.
+ status: The new status.
+ """
+
+ @abc.abstractmethod
+ async def write_checkpoint(
+ self,
+ *,
+ session_id: str,
+ checkpoint_seq: int,
+ state_blob: bytes,
+ agent_state: Optional[Dict[str, Any]] = None,
+ trigger: str = "async_boundary",
+ ) -> Checkpoint:
+ """Write a checkpoint with two-phase commit.
+
+ This operation should:
+ 1. Upload the state blob to GCS
+ 2. Record the checkpoint metadata in BigQuery
+ 3. Update the session's current_checkpoint_seq
+
+ Args:
+ session_id: The session to checkpoint.
+ checkpoint_seq: The sequence number for this checkpoint.
+ state_blob: The serialized state to persist.
+ agent_state: Small agent state to store inline (optional).
+ trigger: What triggered this checkpoint.
+
+ Returns:
+ The created checkpoint.
+
+ Raises:
+ ValueError: If the checkpoint_seq is not greater than the current.
+ """
+
+ @abc.abstractmethod
+ async def read_latest_checkpoint(
+ self, *, session_id: str
+ ) -> Optional[tuple[Checkpoint, bytes]]:
+ """Read the latest checkpoint for a session.
+
+ Args:
+ session_id: The session to read.
+
+ Returns:
+ A tuple of (checkpoint, state_blob), or None if no checkpoints exist.
+ """
+
+ @abc.abstractmethod
+ async def read_checkpoint(
+ self, *, session_id: str, checkpoint_seq: int
+ ) -> Optional[tuple[Checkpoint, bytes]]:
+ """Read a specific checkpoint.
+
+ Args:
+ session_id: The session to read.
+ checkpoint_seq: The checkpoint sequence number.
+
+ Returns:
+ A tuple of (checkpoint, state_blob), or None if not found.
+ """
+
+ @abc.abstractmethod
+ async def acquire_lease(
+ self, *, session_id: str, lease_id: str, timeout_seconds: int
+ ) -> bool:
+ """Attempt to acquire a lease on a session.
+
+ Leases prevent concurrent modifications to a session. Only the lease
+ holder can write checkpoints or update session status.
+
+ Args:
+ session_id: The session to lease.
+ lease_id: A unique identifier for this lease attempt.
+ timeout_seconds: How long the lease should be valid.
+
+ Returns:
+ True if the lease was acquired, False if another lease is active.
+ """
+
+ @abc.abstractmethod
+ async def release_lease(self, *, session_id: str, lease_id: str) -> None:
+ """Release a lease on a session.
+
+ Args:
+ session_id: The session to release.
+ lease_id: The lease ID to release (must match the active lease).
+ """
+
+ @abc.abstractmethod
+ async def renew_lease(
+ self, *, session_id: str, lease_id: str, timeout_seconds: int
+ ) -> bool:
+ """Renew an existing lease.
+
+ Args:
+ session_id: The session to renew.
+ lease_id: The lease ID to renew (must match the active lease).
+ timeout_seconds: New timeout for the lease.
+
+ Returns:
+ True if the lease was renewed, False if the lease is not active.
+ """
+
+ @abc.abstractmethod
+ async def list_checkpoints(
+ self, *, session_id: str, limit: int = 10
+ ) -> list[Checkpoint]:
+ """List checkpoints for a session.
+
+ Args:
+ session_id: The session to list checkpoints for.
+ limit: Maximum number of checkpoints to return.
+
+ Returns:
+ List of checkpoints, ordered by checkpoint_seq descending.
+ """
+
+ @abc.abstractmethod
+ async def delete_session(self, *, session_id: str) -> None:
+ """Delete a session and all its checkpoints.
+
+ Args:
+ session_id: The session to delete.
+ """
diff --git a/src/google/adk/durable/stores/bigquery_checkpoint_store.py b/src/google/adk/durable/stores/bigquery_checkpoint_store.py
new file mode 100644
index 0000000000..3d53995ecc
--- /dev/null
+++ b/src/google/adk/durable/stores/bigquery_checkpoint_store.py
@@ -0,0 +1,693 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""BigQuery + GCS implementation of durable session checkpoint store."""
+
+from __future__ import annotations
+
+from datetime import datetime
+from datetime import timedelta
+from datetime import timezone
+import hashlib
+import json
+import logging
+from typing import Any
+from typing import Dict
+from typing import Optional
+import uuid
+
+from ...utils.feature_decorator import experimental
+from .base_checkpoint_store import Checkpoint
+from .base_checkpoint_store import DurableSessionStore
+from .base_checkpoint_store import SessionMetadata
+
+logger = logging.getLogger("google_adk." + __name__)
+
+
+@experimental
+class BigQueryCheckpointStore(DurableSessionStore):
+ """Checkpoint store using BigQuery for metadata and GCS for state blobs.
+
+ This implementation stores:
+ - Session metadata and checkpoint records in BigQuery tables
+ - Large state blobs in Google Cloud Storage
+
+ Prerequisites:
+ - BigQuery dataset with sessions and checkpoints tables
+ - GCS bucket for state blobs
+ - Appropriate IAM permissions
+
+ Example:
+ ```python
+ store = BigQueryCheckpointStore(
+ project="my-project",
+ dataset="adk_metadata",
+ gcs_bucket="my-project-adk-checkpoints",
+ )
+
+ # Create a session
+ await store.create_session(
+ session_id="sess-123",
+ agent_name="my_agent",
+ )
+
+ # Write a checkpoint
+ await store.write_checkpoint(
+ session_id="sess-123",
+ checkpoint_seq=1,
+ state_blob=b"...",
+ )
+
+ # Read it back
+ checkpoint, blob = await store.read_latest_checkpoint(session_id="sess-123")
+ ```
+ """
+
+ def __init__(
+ self,
+ *,
+ project: str,
+ dataset: str,
+ gcs_bucket: str,
+ sessions_table: str = "sessions",
+ checkpoints_table: str = "checkpoints",
+ location: str = "US",
+ ):
+ """Initialize the BigQuery checkpoint store.
+
+ Args:
+ project: GCP project ID.
+ dataset: BigQuery dataset name.
+ gcs_bucket: GCS bucket name for state blobs.
+ sessions_table: Name of the sessions table.
+ checkpoints_table: Name of the checkpoints table.
+ location: BigQuery dataset location.
+ """
+ self._project = project
+ self._dataset = dataset
+ self._gcs_bucket = gcs_bucket
+ self._sessions_table = sessions_table
+ self._checkpoints_table = checkpoints_table
+ self._location = location
+
+ # Lazy-loaded clients
+ self._bq_client = None
+ self._storage_client = None
+
+ @property
+ def _sessions_table_id(self) -> str:
+ return f"{self._project}.{self._dataset}.{self._sessions_table}"
+
+ @property
+ def _checkpoints_table_id(self) -> str:
+ return f"{self._project}.{self._dataset}.{self._checkpoints_table}"
+
+ def _get_bq_client(self):
+ """Lazy-load BigQuery client."""
+ if self._bq_client is None:
+ from google.cloud import bigquery
+
+ self._bq_client = bigquery.Client(
+ project=self._project, location=self._location
+ )
+ return self._bq_client
+
+ def _get_storage_client(self):
+ """Lazy-load Cloud Storage client."""
+ if self._storage_client is None:
+ from google.cloud import storage
+
+ self._storage_client = storage.Client(project=self._project)
+ return self._storage_client
+
+ def _get_gcs_uri(self, session_id: str, checkpoint_seq: int) -> str:
+ """Generate a GCS URI for a checkpoint blob."""
+ return f"gs://{self._gcs_bucket}/checkpoints/{session_id}/{checkpoint_seq}.json.gz"
+
+ async def create_session(
+ self,
+ *,
+ session_id: str,
+ agent_name: str,
+ metadata: Optional[Dict[str, Any]] = None,
+ ) -> SessionMetadata:
+ """Create a new durable session."""
+ now = datetime.now(timezone.utc)
+
+ # Check if session already exists
+ existing = await self.get_session(session_id=session_id)
+ if existing:
+ raise ValueError(f"Session {session_id} already exists")
+
+ # Insert session record using DML (not streaming) for immediate updatability
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ insert_query = f"""
+ INSERT INTO `{self._sessions_table_id}`
+ (session_id, status, agent_name, created_at, updated_at,
+ current_checkpoint_seq, active_lease_id, lease_expiry, ttl_expiry, metadata)
+ VALUES
+ (@session_id, @status, @agent_name, @created_at, @updated_at,
+ @current_checkpoint_seq, @active_lease_id, @lease_expiry, @ttl_expiry,
+ PARSE_JSON(@metadata))
+ """
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("status", "STRING", "active"),
+ bigquery.ScalarQueryParameter("agent_name", "STRING", agent_name),
+ bigquery.ScalarQueryParameter(
+ "created_at", "TIMESTAMP", now.isoformat()
+ ),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ bigquery.ScalarQueryParameter("current_checkpoint_seq", "INT64", 0),
+ bigquery.ScalarQueryParameter("active_lease_id", "STRING", None),
+ bigquery.ScalarQueryParameter("lease_expiry", "TIMESTAMP", None),
+ bigquery.ScalarQueryParameter("ttl_expiry", "TIMESTAMP", None),
+ bigquery.ScalarQueryParameter(
+ "metadata", "STRING", json.dumps(metadata) if metadata else None
+ ),
+ ]
+ )
+ client.query(insert_query, job_config=job_config).result()
+
+ logger.info("Created durable session: %s", session_id)
+
+ return SessionMetadata(
+ session_id=session_id,
+ status="active",
+ agent_name=agent_name,
+ created_at=now,
+ updated_at=now,
+ current_checkpoint_seq=0,
+ metadata=metadata,
+ )
+
+ async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]:
+ """Get session metadata."""
+ query = f"""
+ SELECT
+ session_id,
+ status,
+ agent_name,
+ created_at,
+ updated_at,
+ current_checkpoint_seq,
+ active_lease_id,
+ lease_expiry,
+ ttl_expiry,
+ metadata
+ FROM `{self._sessions_table_id}`
+ WHERE session_id = @session_id
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ ]
+ )
+ results = client.query(query, job_config=job_config).result()
+
+ for row in results:
+ return SessionMetadata(
+ session_id=row.session_id,
+ status=row.status,
+ agent_name=row.agent_name,
+ created_at=row.created_at,
+ updated_at=row.updated_at,
+ current_checkpoint_seq=row.current_checkpoint_seq,
+ active_lease_id=row.active_lease_id,
+ lease_expiry=row.lease_expiry,
+ ttl_expiry=row.ttl_expiry,
+          metadata=(
+              row.metadata if isinstance(row.metadata, dict)
+              else (json.loads(row.metadata) if row.metadata else None)
+          ),
+ )
+
+ return None
+
+ async def update_session_status(
+ self, *, session_id: str, status: str
+ ) -> None:
+ """Update the status of a session."""
+ now = datetime.now(timezone.utc)
+
+ query = f"""
+ UPDATE `{self._sessions_table_id}`
+ SET status = @status, updated_at = @updated_at
+ WHERE session_id = @session_id
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("status", "STRING", status),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ ]
+ )
+ client.query(query, job_config=job_config).result()
+ logger.debug("Updated session %s status to %s", session_id, status)
+
+ async def write_checkpoint(
+ self,
+ *,
+ session_id: str,
+ checkpoint_seq: int,
+ state_blob: bytes,
+ agent_state: Optional[Dict[str, Any]] = None,
+ trigger: str = "async_boundary",
+ ) -> Checkpoint:
+ """Write a checkpoint with two-phase commit."""
+ import gzip
+
+ now = datetime.now(timezone.utc)
+
+ # Verify session exists and checkpoint_seq is valid
+ session = await self.get_session(session_id=session_id)
+ if not session:
+ raise ValueError(f"Session {session_id} not found")
+
+ if checkpoint_seq <= session.current_checkpoint_seq:
+ raise ValueError(
+ f"checkpoint_seq {checkpoint_seq} must be greater than current"
+ f" {session.current_checkpoint_seq}"
+ )
+
+ # Compute hash of the state blob
+ sha256 = hashlib.sha256(state_blob).hexdigest()
+ size_bytes = len(state_blob)
+
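+    # Write ordering (a best-effort two-phase scheme, not a distributed 2PC):
+    #   1. Upload the state blob to GCS.
+    #   2. Insert the checkpoint row in BigQuery, deleting the blob again if the
+    #      insert fails, so metadata never points at a missing blob.
+    #   3. Advance the session's current_checkpoint_seq pointer last, so readers
+    #      only ever see fully committed checkpoints.
+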
+ # Phase 1: Upload to GCS
+ gcs_uri = self._get_gcs_uri(session_id, checkpoint_seq)
+ blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "")
+
+ storage_client = self._get_storage_client()
+ bucket = storage_client.bucket(self._gcs_bucket)
+ blob = bucket.blob(blob_path)
+
+ compressed = gzip.compress(state_blob)
+ blob.upload_from_string(compressed, content_type="application/gzip")
+ logger.debug(
+ "Uploaded checkpoint blob to %s (%d bytes compressed)",
+ gcs_uri,
+ len(compressed),
+ )
+
+ # Phase 2: Insert checkpoint record using DML
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ insert_query = f"""
+ INSERT INTO `{self._checkpoints_table_id}`
+ (session_id, checkpoint_seq, created_at, gcs_state_uri, sha256,
+ size_bytes, agent_state_json, trigger)
+ VALUES
+ (@session_id, @checkpoint_seq, @created_at, @gcs_state_uri, @sha256,
+ @size_bytes, PARSE_JSON(@agent_state_json), @trigger)
+ """
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter(
+ "checkpoint_seq", "INT64", checkpoint_seq
+ ),
+ bigquery.ScalarQueryParameter(
+ "created_at", "TIMESTAMP", now.isoformat()
+ ),
+ bigquery.ScalarQueryParameter("gcs_state_uri", "STRING", gcs_uri),
+ bigquery.ScalarQueryParameter("sha256", "STRING", sha256),
+ bigquery.ScalarQueryParameter("size_bytes", "INT64", size_bytes),
+ bigquery.ScalarQueryParameter(
+ "agent_state_json", "STRING",
+ json.dumps(agent_state) if agent_state else None
+ ),
+ bigquery.ScalarQueryParameter("trigger", "STRING", trigger),
+ ]
+ )
+ try:
+ client.query(insert_query, job_config=job_config).result()
+ except Exception as e:
+ # Rollback: delete the GCS blob
+ blob.delete()
+      raise RuntimeError(f"Failed to insert checkpoint record: {e}") from e
+
+ # Phase 3: Update session's current_checkpoint_seq
+ update_query = f"""
+ UPDATE `{self._sessions_table_id}`
+ SET current_checkpoint_seq = @checkpoint_seq, updated_at = @updated_at
+ WHERE session_id = @session_id
+ """
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter(
+ "checkpoint_seq", "INT64", checkpoint_seq
+ ),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ ]
+ )
+ client.query(update_query, job_config=job_config).result()
+
+ logger.info(
+ "Wrote checkpoint %d for session %s (%d bytes, sha256=%s)",
+ checkpoint_seq,
+ session_id,
+ size_bytes,
+ sha256[:16],
+ )
+
+ return Checkpoint(
+ session_id=session_id,
+ checkpoint_seq=checkpoint_seq,
+ created_at=now,
+ gcs_state_uri=gcs_uri,
+ sha256=sha256,
+ size_bytes=size_bytes,
+ agent_state=agent_state,
+ trigger=trigger,
+ )
+
+ async def read_latest_checkpoint(
+ self, *, session_id: str
+ ) -> Optional[tuple[Checkpoint, bytes]]:
+ """Read the latest checkpoint for a session."""
+ session = await self.get_session(session_id=session_id)
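+    # A current_checkpoint_seq of 0 means no checkpoint has been written yet.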
+ if not session or session.current_checkpoint_seq == 0:
+ return None
+
+ return await self.read_checkpoint(
+ session_id=session_id, checkpoint_seq=session.current_checkpoint_seq
+ )
+
+ async def read_checkpoint(
+ self, *, session_id: str, checkpoint_seq: int
+ ) -> Optional[tuple[Checkpoint, bytes]]:
+ """Read a specific checkpoint."""
+ import gzip
+
+ query = f"""
+ SELECT
+ session_id,
+ checkpoint_seq,
+ created_at,
+ gcs_state_uri,
+ sha256,
+ size_bytes,
+ agent_state_json,
+ trigger
+ FROM `{self._checkpoints_table_id}`
+ WHERE session_id = @session_id AND checkpoint_seq = @checkpoint_seq
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter(
+ "checkpoint_seq", "INT64", checkpoint_seq
+ ),
+ ]
+ )
+ results = client.query(query, job_config=job_config).result()
+
+ checkpoint_row = None
+ for row in results:
+ checkpoint_row = row
+ break
+
+ if not checkpoint_row:
+ return None
+
+ # Download blob from GCS
+ gcs_uri = checkpoint_row.gcs_state_uri
+ blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "")
+
+ storage_client = self._get_storage_client()
+ bucket = storage_client.bucket(self._gcs_bucket)
+ blob = bucket.blob(blob_path)
+
+ compressed = blob.download_as_bytes()
+ state_blob = gzip.decompress(compressed)
+
+ # Verify integrity
+ actual_sha256 = hashlib.sha256(state_blob).hexdigest()
+ if actual_sha256 != checkpoint_row.sha256:
+ raise RuntimeError(
+ "Checkpoint integrity check failed: expected"
+ f" {checkpoint_row.sha256}, got {actual_sha256}"
+ )
+
+ checkpoint = Checkpoint(
+ session_id=checkpoint_row.session_id,
+ checkpoint_seq=checkpoint_row.checkpoint_seq,
+ created_at=checkpoint_row.created_at,
+ gcs_state_uri=checkpoint_row.gcs_state_uri,
+ sha256=checkpoint_row.sha256,
+ size_bytes=checkpoint_row.size_bytes,
+ agent_state=(
+ checkpoint_row.agent_state_json if isinstance(checkpoint_row.agent_state_json, dict)
+ else (json.loads(checkpoint_row.agent_state_json) if checkpoint_row.agent_state_json else None)
+ ),
+ trigger=checkpoint_row.trigger,
+ )
+
+ logger.debug(
+ "Read checkpoint %d for session %s (%d bytes)",
+ checkpoint_seq,
+ session_id,
+ len(state_blob),
+ )
+
+ return checkpoint, state_blob
+
+ async def acquire_lease(
+ self, *, session_id: str, lease_id: str, timeout_seconds: int
+ ) -> bool:
+ """Attempt to acquire a lease on a session."""
+ now = datetime.now(timezone.utc)
+ expiry = now + timedelta(seconds=timeout_seconds)
+
+ # Atomic update: only succeed if no active lease or lease expired
+ query = f"""
+ UPDATE `{self._sessions_table_id}`
+ SET
+ active_lease_id = @lease_id,
+ lease_expiry = @lease_expiry,
+ updated_at = @updated_at
+ WHERE
+ session_id = @session_id
+ AND (active_lease_id IS NULL OR lease_expiry < @now)
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id),
+ bigquery.ScalarQueryParameter(
+ "lease_expiry", "TIMESTAMP", expiry.isoformat()
+ ),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ bigquery.ScalarQueryParameter("now", "TIMESTAMP", now.isoformat()),
+ ]
+ )
+ result = client.query(query, job_config=job_config).result()
+
+ # Check if the update affected any rows
+ if result.num_dml_affected_rows and result.num_dml_affected_rows > 0:
+ logger.info("Acquired lease %s on session %s", lease_id, session_id)
+ return True
+ else:
+ logger.debug("Failed to acquire lease on session %s", session_id)
+ return False
+
+ async def release_lease(self, *, session_id: str, lease_id: str) -> None:
+ """Release a lease on a session."""
+ now = datetime.now(timezone.utc)
+
+ query = f"""
+ UPDATE `{self._sessions_table_id}`
+ SET
+ active_lease_id = NULL,
+ lease_expiry = NULL,
+ updated_at = @updated_at
+ WHERE session_id = @session_id AND active_lease_id = @lease_id
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ ]
+ )
+ client.query(query, job_config=job_config).result()
+ logger.info("Released lease %s on session %s", lease_id, session_id)
+
+ async def renew_lease(
+ self, *, session_id: str, lease_id: str, timeout_seconds: int
+ ) -> bool:
+ """Renew an existing lease."""
+ now = datetime.now(timezone.utc)
+ expiry = now + timedelta(seconds=timeout_seconds)
+
+ query = f"""
+ UPDATE `{self._sessions_table_id}`
+ SET lease_expiry = @lease_expiry, updated_at = @updated_at
+ WHERE session_id = @session_id AND active_lease_id = @lease_id
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id),
+ bigquery.ScalarQueryParameter(
+ "lease_expiry", "TIMESTAMP", expiry.isoformat()
+ ),
+ bigquery.ScalarQueryParameter(
+ "updated_at", "TIMESTAMP", now.isoformat()
+ ),
+ ]
+ )
+ result = client.query(query, job_config=job_config).result()
+
+ if result.num_dml_affected_rows and result.num_dml_affected_rows > 0:
+ logger.debug("Renewed lease %s on session %s", lease_id, session_id)
+ return True
+ return False
+
+ async def list_checkpoints(
+ self, *, session_id: str, limit: int = 10
+ ) -> list[Checkpoint]:
+ """List checkpoints for a session."""
+ query = f"""
+ SELECT
+ session_id,
+ checkpoint_seq,
+ created_at,
+ gcs_state_uri,
+ sha256,
+ size_bytes,
+ agent_state_json,
+ trigger
+ FROM `{self._checkpoints_table_id}`
+ WHERE session_id = @session_id
+ ORDER BY checkpoint_seq DESC
+ LIMIT @limit
+ """
+
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ bigquery.ScalarQueryParameter("limit", "INT64", limit),
+ ]
+ )
+ results = client.query(query, job_config=job_config).result()
+
+ checkpoints = []
+ for row in results:
+ checkpoints.append(
+ Checkpoint(
+ session_id=row.session_id,
+ checkpoint_seq=row.checkpoint_seq,
+ created_at=row.created_at,
+ gcs_state_uri=row.gcs_state_uri,
+ sha256=row.sha256,
+ size_bytes=row.size_bytes,
+ agent_state=(
+ row.agent_state_json if isinstance(row.agent_state_json, dict)
+ else (json.loads(row.agent_state_json) if row.agent_state_json else None)
+ ),
+ trigger=row.trigger,
+ )
+ )
+
+ return checkpoints
+
+ async def delete_session(self, *, session_id: str) -> None:
+ """Delete a session and all its checkpoints."""
+ # Delete checkpoints from GCS
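+    # Note: listing is capped at 1000 checkpoints; sessions with more would
+    # leave their older blobs behind in GCS.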
+ checkpoints = await self.list_checkpoints(session_id=session_id, limit=1000)
+ storage_client = self._get_storage_client()
+ bucket = storage_client.bucket(self._gcs_bucket)
+
+ for checkpoint in checkpoints:
+ blob_path = checkpoint.gcs_state_uri.replace(
+ f"gs://{self._gcs_bucket}/", ""
+ )
+ blob = bucket.blob(blob_path)
+ try:
+ blob.delete()
+ except Exception as e:
+ logger.warning("Failed to delete blob %s: %s", blob_path, e)
+
+ # Delete checkpoint records
+ client = self._get_bq_client()
+ from google.cloud import bigquery
+
+ delete_checkpoints = f"""
+ DELETE FROM `{self._checkpoints_table_id}`
+ WHERE session_id = @session_id
+ """
+
+ job_config = bigquery.QueryJobConfig(
+ query_parameters=[
+ bigquery.ScalarQueryParameter("session_id", "STRING", session_id),
+ ]
+ )
+ client.query(delete_checkpoints, job_config=job_config).result()
+
+ # Delete session record
+ delete_session = f"""
+ DELETE FROM `{self._sessions_table_id}`
+ WHERE session_id = @session_id
+ """
+ client.query(delete_session, job_config=job_config).result()
+
+ logger.info(
+ "Deleted session %s and %d checkpoints", session_id, len(checkpoints)
+ )
diff --git a/src/google/adk/durable/workspace_snapshotter.py b/src/google/adk/durable/workspace_snapshotter.py
new file mode 100644
index 0000000000..1462b883d7
--- /dev/null
+++ b/src/google/adk/durable/workspace_snapshotter.py
@@ -0,0 +1,187 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Workspace snapshot handling for durable sessions."""
+
+from __future__ import annotations
+
+import hashlib
+import io
+import json
+import logging
+from pathlib import Path
+import tarfile
+from typing import Any
+from typing import Dict
+from typing import Optional
+
+logger = logging.getLogger("google_adk." + __name__)
+
+
+class WorkspaceSnapshotter:
+ """Handles workspace file snapshots for durable checkpoints.
+
+ This class provides utilities for creating and restoring snapshots of
+ workspace directories, enabling agents to persist and restore file-based
+ state across checkpoint boundaries.
+
+ Example:
+ ```python
+ snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/agent_workspace")
+
+ # Create a snapshot
+ blob, sha256, size = snapshotter.create_snapshot()
+
+ # Later, restore from snapshot
+ snapshotter.restore_snapshot(blob)
+ ```
+ """
+
+ def __init__(
+ self,
+ workspace_dir: Optional[str] = None,
+ exclude_patterns: Optional[list[str]] = None,
+ ):
+ """Initialize the workspace snapshotter.
+
+ Args:
+ workspace_dir: Path to the workspace directory to snapshot.
+ exclude_patterns: List of glob patterns to exclude from snapshots.
+ """
+ self._workspace_dir = Path(workspace_dir) if workspace_dir else None
+ self._exclude_patterns = exclude_patterns or [
+ "__pycache__",
+ "*.pyc",
+ ".git",
+ ".env",
+ "node_modules",
+ "*.log",
+ ]
+
+ @property
+ def workspace_dir(self) -> Optional[Path]:
+ """The workspace directory being snapshotted."""
+ return self._workspace_dir
+
+ def create_snapshot(self) -> tuple[bytes, str, int]:
+ """Create a tarball snapshot of the workspace directory.
+
+ Returns:
+ A tuple of (blob_bytes, sha256_hash, size_bytes).
+
+ Raises:
+ ValueError: If no workspace directory is configured.
+ FileNotFoundError: If the workspace directory doesn't exist.
+ """
+ if not self._workspace_dir:
+ raise ValueError("No workspace directory configured")
+
+ if not self._workspace_dir.exists():
+ raise FileNotFoundError(
+ f"Workspace directory not found: {self._workspace_dir}"
+ )
+
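+    # The gzip-compressed tar archive is assembled entirely in memory, so very
+    # large workspaces translate directly into peak memory usage here.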
+ buffer = io.BytesIO()
+ with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
+ for path in self._workspace_dir.rglob("*"):
+ if path.is_file() and not self._should_exclude(path):
+ arcname = path.relative_to(self._workspace_dir)
+ tar.add(path, arcname=str(arcname))
+
+ blob = buffer.getvalue()
+ sha256 = hashlib.sha256(blob).hexdigest()
+
+ logger.debug(
+ "Created workspace snapshot: %d bytes, sha256=%s", len(blob), sha256
+ )
+
+ return blob, sha256, len(blob)
+
+ def restore_snapshot(self, blob: bytes) -> None:
+ """Restore a workspace from a tarball snapshot.
+
+ Args:
+ blob: The snapshot blob previously created by create_snapshot().
+
+ Raises:
+ ValueError: If no workspace directory is configured.
+ """
+ if not self._workspace_dir:
+ raise ValueError("No workspace directory configured")
+
+ self._workspace_dir.mkdir(parents=True, exist_ok=True)
+
+ buffer = io.BytesIO(blob)
+ with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
+ # Filter to prevent path traversal attacks
+ safe_members = [
+ m for m in tar.getmembers() if not m.name.startswith(("/", ".."))
+ ]
+ tar.extractall(path=self._workspace_dir, members=safe_members)
+
+ logger.debug(
+ "Restored workspace snapshot: %d bytes to %s",
+ len(blob),
+ self._workspace_dir,
+ )
+
+ def _should_exclude(self, path: Path) -> bool:
+ """Check if a path should be excluded from snapshots."""
+ path_str = str(path)
+ for pattern in self._exclude_patterns:
+ if pattern.startswith("*"):
+ # Suffix match (e.g., *.pyc)
+ if path_str.endswith(pattern[1:]):
+ return True
+ elif pattern in path_str:
+ # Contains match (e.g., __pycache__)
+ return True
+ return False
+
+
+def serialize_state_to_json(state: Dict[str, Any]) -> bytes:
+ """Serialize state dictionary to JSON bytes.
+
+ Args:
+ state: The state dictionary to serialize.
+
+ Returns:
+ JSON-encoded bytes.
+ """
+ return json.dumps(state, sort_keys=True, default=str).encode("utf-8")
+
+
+def deserialize_state_from_json(blob: bytes) -> Dict[str, Any]:
+ """Deserialize state from JSON bytes.
+
+ Args:
+ blob: JSON-encoded bytes.
+
+ Returns:
+ The deserialized state dictionary.
+ """
+ return json.loads(blob.decode("utf-8"))
+
+
+def compute_state_hash(state: Dict[str, Any]) -> str:
+ """Compute a SHA-256 hash of the state dictionary.
+
+ Args:
+ state: The state dictionary to hash.
+
+ Returns:
+ The hex-encoded SHA-256 hash.
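+
+  Example:
+    ```python
+    # Key order does not affect the hash: serialization sorts keys first.
+    h1 = compute_state_hash({"a": 1, "z": 2})
+    h2 = compute_state_hash({"z": 2, "a": 1})
+    assert h1 == h2
+    ```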
+ """
+ blob = serialize_state_to_json(state)
+ return hashlib.sha256(blob).hexdigest()
diff --git a/tests/unittests/durable/__init__.py b/tests/unittests/durable/__init__.py
new file mode 100644
index 0000000000..58d482ea38
--- /dev/null
+++ b/tests/unittests/durable/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/tests/unittests/durable/test_bigquery_checkpoint_store.py b/tests/unittests/durable/test_bigquery_checkpoint_store.py
new file mode 100644
index 0000000000..f792c6b4b0
--- /dev/null
+++ b/tests/unittests/durable/test_bigquery_checkpoint_store.py
@@ -0,0 +1,273 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for BigQueryCheckpointStore."""
+
+from datetime import datetime
+from datetime import timezone
+from unittest.mock import AsyncMock
+from unittest.mock import MagicMock
+from unittest.mock import patch
+
+from google.adk.durable.stores.bigquery_checkpoint_store import BigQueryCheckpointStore
+import pytest
+
+
+class TestBigQueryCheckpointStore:
+ """Tests for BigQueryCheckpointStore."""
+
+ @pytest.fixture
+ def store(self):
+ """Create a store instance for testing."""
+ return BigQueryCheckpointStore(
+ project="test-project",
+ dataset="test_dataset",
+ gcs_bucket="test-bucket",
+ )
+
+ def test_init(self, store):
+ """Test store initialization."""
+ assert store._project == "test-project"
+ assert store._dataset == "test_dataset"
+ assert store._gcs_bucket == "test-bucket"
+ assert store._sessions_table == "sessions"
+ assert store._checkpoints_table == "checkpoints"
+ assert store._location == "US"
+
+ def test_table_ids(self, store):
+ """Test table ID generation."""
+ assert store._sessions_table_id == "test-project.test_dataset.sessions"
+ assert (
+ store._checkpoints_table_id == "test-project.test_dataset.checkpoints"
+ )
+
+ def test_gcs_uri_generation(self, store):
+ """Test GCS URI generation."""
+ uri = store._get_gcs_uri("session-123", 5)
+ assert uri == "gs://test-bucket/checkpoints/session-123/5.json.gz"
+
+ @pytest.mark.asyncio
+ async def test_create_session(self, store):
+ """Test session creation."""
+ mock_client = MagicMock()
+    mock_client.query.return_value.result.return_value = None
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ with patch.object(
+ store, "get_session", new_callable=AsyncMock
+ ) as mock_get:
+ mock_get.return_value = None
+
+ session = await store.create_session(
+ session_id="test-session",
+ agent_name="test_agent",
+ metadata={"key": "value"},
+ )
+
+ assert session.session_id == "test-session"
+ assert session.agent_name == "test_agent"
+ assert session.status == "active"
+ assert session.current_checkpoint_seq == 0
+ assert session.metadata == {"key": "value"}
+
+        mock_client.query.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_create_session_already_exists(self, store):
+ """Test session creation when session already exists."""
+ with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get:
+ from google.adk.durable.stores.base_checkpoint_store import SessionMetadata
+
+ mock_get.return_value = SessionMetadata(
+ session_id="test-session",
+ status="active",
+ agent_name="test_agent",
+ created_at=datetime.now(timezone.utc),
+ updated_at=datetime.now(timezone.utc),
+ current_checkpoint_seq=0,
+ )
+
+ with pytest.raises(ValueError, match="already exists"):
+ await store.create_session(
+ session_id="test-session",
+ agent_name="test_agent",
+ )
+
+ @pytest.mark.asyncio
+ async def test_write_checkpoint(self, store):
+ """Test checkpoint writing."""
+ mock_bq_client = MagicMock()
+ mock_bq_client.query.return_value.result.return_value = None
+
+ mock_storage_client = MagicMock()
+ mock_bucket = MagicMock()
+ mock_blob = MagicMock()
+ mock_storage_client.bucket.return_value = mock_bucket
+ mock_bucket.blob.return_value = mock_blob
+
+ with patch.object(store, "_get_bq_client", return_value=mock_bq_client):
+ with patch.object(
+ store, "_get_storage_client", return_value=mock_storage_client
+ ):
+ with patch.object(
+ store, "get_session", new_callable=AsyncMock
+ ) as mock_get:
+ from google.adk.durable.stores.base_checkpoint_store import SessionMetadata
+
+ mock_get.return_value = SessionMetadata(
+ session_id="test-session",
+ status="active",
+ agent_name="test_agent",
+ created_at=datetime.now(timezone.utc),
+ updated_at=datetime.now(timezone.utc),
+ current_checkpoint_seq=0,
+ )
+
+ checkpoint = await store.write_checkpoint(
+ session_id="test-session",
+ checkpoint_seq=1,
+ state_blob=b'{"state": "data"}',
+ agent_state={"key": "value"},
+ trigger="async_boundary",
+ )
+
+ assert checkpoint.session_id == "test-session"
+ assert checkpoint.checkpoint_seq == 1
+ assert checkpoint.trigger == "async_boundary"
+ assert checkpoint.agent_state == {"key": "value"}
+
+ # Verify GCS upload was called
+ mock_blob.upload_from_string.assert_called_once()
+
+          # Verify both DML statements ran (checkpoint insert + session update)
+          assert mock_bq_client.query.call_count == 2
+
+ @pytest.mark.asyncio
+ async def test_write_checkpoint_invalid_seq(self, store):
+ """Test checkpoint writing with invalid sequence number."""
+ with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get:
+ from google.adk.durable.stores.base_checkpoint_store import SessionMetadata
+
+ mock_get.return_value = SessionMetadata(
+ session_id="test-session",
+ status="active",
+ agent_name="test_agent",
+ created_at=datetime.now(timezone.utc),
+ updated_at=datetime.now(timezone.utc),
+ current_checkpoint_seq=5,
+ )
+
+ with pytest.raises(ValueError, match="must be greater"):
+ await store.write_checkpoint(
+ session_id="test-session",
+ checkpoint_seq=3, # Less than current (5)
+ state_blob=b"data",
+ )
+
+ @pytest.mark.asyncio
+ async def test_write_checkpoint_session_not_found(self, store):
+ """Test checkpoint writing when session doesn't exist."""
+ with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get:
+ mock_get.return_value = None
+
+ with pytest.raises(ValueError, match="not found"):
+ await store.write_checkpoint(
+ session_id="nonexistent",
+ checkpoint_seq=1,
+ state_blob=b"data",
+ )
+
+ @pytest.mark.asyncio
+ async def test_acquire_lease_success(self, store):
+ """Test successful lease acquisition."""
+ mock_client = MagicMock()
+ mock_result = MagicMock()
+ mock_result.num_dml_affected_rows = 1
+ mock_client.query.return_value.result.return_value = mock_result
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ result = await store.acquire_lease(
+ session_id="test-session",
+ lease_id="lease-123",
+ timeout_seconds=300,
+ )
+
+ assert result is True
+ mock_client.query.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_acquire_lease_failure(self, store):
+ """Test failed lease acquisition (another lease active)."""
+ mock_client = MagicMock()
+ mock_result = MagicMock()
+ mock_result.num_dml_affected_rows = 0
+ mock_client.query.return_value.result.return_value = mock_result
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ result = await store.acquire_lease(
+ session_id="test-session",
+ lease_id="lease-123",
+ timeout_seconds=300,
+ )
+
+ assert result is False
+
+ @pytest.mark.asyncio
+ async def test_release_lease(self, store):
+ """Test lease release."""
+ mock_client = MagicMock()
+ mock_client.query.return_value.result.return_value = None
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ await store.release_lease(
+ session_id="test-session",
+ lease_id="lease-123",
+ )
+
+ mock_client.query.assert_called_once()
+
+ @pytest.mark.asyncio
+ async def test_renew_lease_success(self, store):
+ """Test successful lease renewal."""
+ mock_client = MagicMock()
+ mock_result = MagicMock()
+ mock_result.num_dml_affected_rows = 1
+ mock_client.query.return_value.result.return_value = mock_result
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ result = await store.renew_lease(
+ session_id="test-session",
+ lease_id="lease-123",
+ timeout_seconds=600,
+ )
+
+ assert result is True
+
+ @pytest.mark.asyncio
+ async def test_renew_lease_failure(self, store):
+ """Test failed lease renewal (lease not held)."""
+ mock_client = MagicMock()
+ mock_result = MagicMock()
+ mock_result.num_dml_affected_rows = 0
+ mock_client.query.return_value.result.return_value = mock_result
+
+ with patch.object(store, "_get_bq_client", return_value=mock_client):
+ result = await store.renew_lease(
+ session_id="test-session",
+ lease_id="lease-123",
+ timeout_seconds=600,
+ )
+
+ assert result is False
diff --git a/tests/unittests/durable/test_checkpointable_state.py b/tests/unittests/durable/test_checkpointable_state.py
new file mode 100644
index 0000000000..9c9ab0753c
--- /dev/null
+++ b/tests/unittests/durable/test_checkpointable_state.py
@@ -0,0 +1,172 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for CheckpointableAgentState."""
+
+from typing import Any
+from typing import Dict
+
+from google.adk.durable.checkpointable_state import CheckpointableAgentState
+from google.adk.durable.checkpointable_state import SimpleCheckpointableState
+import pytest
+
+
+class TestSimpleCheckpointableState:
+ """Tests for SimpleCheckpointableState."""
+
+ def test_default_state(self):
+ """Test default state initialization."""
+ state = SimpleCheckpointableState()
+ assert state.data == {}
+
+ def test_state_with_data(self):
+ """Test state with initial data."""
+ state = SimpleCheckpointableState(data={"key": "value", "count": 5})
+ assert state.data["key"] == "value"
+ assert state.data["count"] == 5
+
+ def test_to_checkpoint_dict(self):
+ """Test serialization to checkpoint dict."""
+ state = SimpleCheckpointableState(data={"items": [1, 2, 3], "name": "test"})
+ checkpoint = state.to_checkpoint_dict()
+
+ assert checkpoint == {"data": {"items": [1, 2, 3], "name": "test"}}
+
+ def test_from_checkpoint_dict(self):
+ """Test deserialization from checkpoint dict."""
+ checkpoint = {"data": {"counter": 10, "results": ["a", "b"]}}
+ state = SimpleCheckpointableState.from_checkpoint_dict(checkpoint)
+
+ assert state.data["counter"] == 10
+ assert state.data["results"] == ["a", "b"]
+
+ def test_roundtrip(self):
+ """Test roundtrip serialization/deserialization."""
+ original = SimpleCheckpointableState(
+ data={
+ "nested": {"deep": {"value": 42}},
+ "list": [1, 2, 3],
+ "string": "hello",
+ }
+ )
+
+ checkpoint = original.to_checkpoint_dict()
+ restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint)
+
+ assert restored.data == original.data
+
+ def test_empty_checkpoint_dict(self):
+ """Test deserialization from empty checkpoint dict."""
+ state = SimpleCheckpointableState.from_checkpoint_dict({})
+ assert state.data == {}
+
+
+class CustomState(CheckpointableAgentState):
+ """Custom state implementation for testing."""
+
+ counter: int = 0
+ items: list[str] = []
+ metadata: dict[str, Any] = {}
+
+ def __init__(self, **data):
+ super().__init__(**data)
+ if "items" not in data:
+ self.items = []
+ if "metadata" not in data:
+ self.metadata = {}
+
+ def to_checkpoint_dict(self) -> Dict[str, Any]:
+ return {
+ "counter": self.counter,
+ "items": self.items.copy(),
+ "metadata": self.metadata.copy(),
+ }
+
+ @classmethod
+ def from_checkpoint_dict(cls, data: Dict[str, Any]) -> "CustomState":
+ return cls(
+ counter=data.get("counter", 0),
+ items=data.get("items", []),
+ metadata=data.get("metadata", {}),
+ )
+
+
+class TestCustomCheckpointableState:
+ """Tests for custom CheckpointableAgentState implementations."""
+
+ def test_custom_state_default(self):
+ """Test custom state with default values."""
+ state = CustomState()
+ assert state.counter == 0
+ assert state.items == []
+ assert state.metadata == {}
+
+ def test_custom_state_with_values(self):
+ """Test custom state with initial values."""
+ state = CustomState(
+ counter=5,
+ items=["a", "b"],
+ metadata={"key": "value"},
+ )
+ assert state.counter == 5
+ assert state.items == ["a", "b"]
+ assert state.metadata == {"key": "value"}
+
+ def test_custom_state_to_checkpoint(self):
+ """Test custom state serialization."""
+ state = CustomState(counter=10, items=["x", "y", "z"])
+ checkpoint = state.to_checkpoint_dict()
+
+ assert checkpoint["counter"] == 10
+ assert checkpoint["items"] == ["x", "y", "z"]
+ assert checkpoint["metadata"] == {}
+
+ def test_custom_state_from_checkpoint(self):
+ """Test custom state deserialization."""
+ checkpoint = {
+ "counter": 42,
+ "items": ["item1", "item2"],
+ "metadata": {"created_by": "test"},
+ }
+ state = CustomState.from_checkpoint_dict(checkpoint)
+
+ assert state.counter == 42
+ assert state.items == ["item1", "item2"]
+ assert state.metadata == {"created_by": "test"}
+
+ def test_custom_state_roundtrip(self):
+ """Test custom state roundtrip."""
+ original = CustomState(
+ counter=100,
+ items=["first", "second", "third"],
+ metadata={"version": 1, "tags": ["test", "demo"]},
+ )
+
+ checkpoint = original.to_checkpoint_dict()
+ restored = CustomState.from_checkpoint_dict(checkpoint)
+
+ assert restored.counter == original.counter
+ assert restored.items == original.items
+ assert restored.metadata == original.metadata
+
+ def test_custom_state_isolation(self):
+ """Test that checkpoint data is isolated from original."""
+ state = CustomState(items=["a", "b"])
+ checkpoint = state.to_checkpoint_dict()
+
+ # Modify checkpoint
+ checkpoint["items"].append("c")
+
+ # Original should be unchanged
+ assert state.items == ["a", "b"]
diff --git a/tests/unittests/durable/test_config.py b/tests/unittests/durable/test_config.py
new file mode 100644
index 0000000000..cf47e2107e
--- /dev/null
+++ b/tests/unittests/durable/test_config.py
@@ -0,0 +1,104 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for DurableSessionConfig."""
+
+from google.adk.durable.config import DurableSessionConfig
+from pydantic import ValidationError
+import pytest
+
+
+class TestDurableSessionConfig:
+ """Tests for DurableSessionConfig model."""
+
+ def test_default_config(self):
+ """Test default configuration values."""
+ config = DurableSessionConfig()
+
+ assert config.is_durable is False
+ assert config.checkpoint_policy == "async_boundary"
+ assert config.checkpoint_store is None
+ assert config.lease_timeout_seconds == 300
+ assert config.max_checkpoint_size_bytes == 10 * 1024 * 1024
+
+ def test_enabled_config(self):
+ """Test enabled configuration."""
+ config = DurableSessionConfig(
+ is_durable=True,
+ checkpoint_policy="every_turn",
+ lease_timeout_seconds=600,
+ )
+
+ assert config.is_durable is True
+ assert config.checkpoint_policy == "every_turn"
+ assert config.lease_timeout_seconds == 600
+
+ def test_checkpoint_policies(self):
+ """Test valid checkpoint policies."""
+ for policy in ["async_boundary", "every_turn", "manual"]:
+ config = DurableSessionConfig(checkpoint_policy=policy)
+ assert config.checkpoint_policy == policy
+
+ def test_invalid_checkpoint_policy(self):
+ """Test that invalid checkpoint policies raise validation error."""
+ with pytest.raises(ValidationError):
+ DurableSessionConfig(checkpoint_policy="invalid_policy")
+
+ def test_lease_timeout_bounds(self):
+ """Test lease timeout validation bounds."""
+ # Valid minimum
+ config = DurableSessionConfig(lease_timeout_seconds=60)
+ assert config.lease_timeout_seconds == 60
+
+ # Valid maximum
+ config = DurableSessionConfig(lease_timeout_seconds=3600)
+ assert config.lease_timeout_seconds == 3600
+
+ # Below minimum
+ with pytest.raises(ValidationError):
+ DurableSessionConfig(lease_timeout_seconds=59)
+
+ # Above maximum
+ with pytest.raises(ValidationError):
+ DurableSessionConfig(lease_timeout_seconds=3601)
+
+ def test_max_checkpoint_size_bounds(self):
+ """Test max checkpoint size validation."""
+ # Valid minimum
+ config = DurableSessionConfig(max_checkpoint_size_bytes=1024)
+ assert config.max_checkpoint_size_bytes == 1024
+
+ # Below minimum
+ with pytest.raises(ValidationError):
+ DurableSessionConfig(max_checkpoint_size_bytes=1023)
+
+ def test_extra_fields_forbidden(self):
+ """Test that extra fields are not allowed."""
+ with pytest.raises(ValidationError):
+ DurableSessionConfig(unknown_field="value")
+
+ def test_config_serialization(self):
+ """Test config can be serialized to dict."""
+ config = DurableSessionConfig(
+ is_durable=True,
+ checkpoint_policy="every_turn",
+ lease_timeout_seconds=120,
+ )
+
+ data = config.model_dump()
+
+ assert data["is_durable"] is True
+ assert data["checkpoint_policy"] == "every_turn"
+ assert data["lease_timeout_seconds"] == 120
+ assert data["checkpoint_store"] is None
diff --git a/tests/unittests/durable/test_workspace_snapshotter.py b/tests/unittests/durable/test_workspace_snapshotter.py
new file mode 100644
index 0000000000..4e381f1022
--- /dev/null
+++ b/tests/unittests/durable/test_workspace_snapshotter.py
@@ -0,0 +1,246 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for WorkspaceSnapshotter and utilities."""
+
+import os
+import tempfile
+
+from google.adk.durable.workspace_snapshotter import compute_state_hash
+from google.adk.durable.workspace_snapshotter import deserialize_state_from_json
+from google.adk.durable.workspace_snapshotter import serialize_state_to_json
+from google.adk.durable.workspace_snapshotter import WorkspaceSnapshotter
+import pytest
+
+
+class TestSerializationUtilities:
+ """Tests for serialization utility functions."""
+
+ def test_serialize_simple_dict(self):
+ """Test serialization of simple dictionary."""
+ state = {"key": "value", "number": 42}
+ blob = serialize_state_to_json(state)
+
+ assert isinstance(blob, bytes)
+ assert b"key" in blob
+ assert b"value" in blob
+ assert b"42" in blob
+
+ def test_deserialize_simple_dict(self):
+ """Test deserialization of simple dictionary."""
+ blob = b'{"key": "value", "number": 42}'
+ state = deserialize_state_from_json(blob)
+
+ assert state == {"key": "value", "number": 42}
+
+ def test_roundtrip_serialization(self):
+ """Test roundtrip serialization/deserialization."""
+ original = {
+ "string": "hello",
+ "number": 123,
+ "float": 3.14,
+ "bool": True,
+ "null": None,
+ "list": [1, 2, 3],
+ "nested": {"a": {"b": "c"}},
+ }
+
+ blob = serialize_state_to_json(original)
+ restored = deserialize_state_from_json(blob)
+
+ assert restored == original
+
+ def test_serialize_deterministic(self):
+ """Test that serialization is deterministic (sorted keys)."""
+ state1 = {"z": 1, "a": 2, "m": 3}
+ state2 = {"a": 2, "m": 3, "z": 1}
+
+ blob1 = serialize_state_to_json(state1)
+ blob2 = serialize_state_to_json(state2)
+
+ assert blob1 == blob2
+
+ def test_compute_state_hash(self):
+ """Test state hash computation."""
+ state = {"key": "value"}
+ hash1 = compute_state_hash(state)
+
+ assert isinstance(hash1, str)
+ assert len(hash1) == 64 # SHA-256 produces 64 hex characters
+
+ def test_hash_deterministic(self):
+ """Test that hash is deterministic."""
+ state1 = {"z": 1, "a": 2}
+ state2 = {"a": 2, "z": 1}
+
+ assert compute_state_hash(state1) == compute_state_hash(state2)
+
+ def test_hash_changes_with_content(self):
+ """Test that hash changes with content."""
+ hash1 = compute_state_hash({"key": "value1"})
+ hash2 = compute_state_hash({"key": "value2"})
+
+ assert hash1 != hash2
+
+
+class TestWorkspaceSnapshotter:
+ """Tests for WorkspaceSnapshotter."""
+
+ def test_init_default(self):
+ """Test default initialization."""
+ snapshotter = WorkspaceSnapshotter()
+
+ assert snapshotter.workspace_dir is None
+ assert "__pycache__" in snapshotter._exclude_patterns
+
+ def test_init_with_workspace(self):
+ """Test initialization with workspace directory."""
+ snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/workspace")
+
+ assert str(snapshotter.workspace_dir) == "/tmp/workspace"
+
+ def test_init_with_custom_excludes(self):
+ """Test initialization with custom exclude patterns."""
+ snapshotter = WorkspaceSnapshotter(
+ workspace_dir="/tmp/workspace",
+ exclude_patterns=["*.log", "temp/"],
+ )
+
+ assert snapshotter._exclude_patterns == ["*.log", "temp/"]
+
+ def test_should_exclude_pycache(self):
+ """Test exclusion of __pycache__ directories."""
+ snapshotter = WorkspaceSnapshotter()
+ from pathlib import Path
+
+ assert snapshotter._should_exclude(
+ Path("/some/path/__pycache__/module.pyc")
+ )
+
+ def test_should_exclude_pyc_files(self):
+ """Test exclusion of .pyc files."""
+ snapshotter = WorkspaceSnapshotter()
+ from pathlib import Path
+
+ assert snapshotter._should_exclude(Path("/some/path/module.pyc"))
+
+ def test_should_not_exclude_py_files(self):
+ """Test that .py files are not excluded."""
+ snapshotter = WorkspaceSnapshotter()
+ from pathlib import Path
+
+ assert not snapshotter._should_exclude(Path("/some/path/module.py"))
+
+ def test_should_exclude_git(self):
+ """Test exclusion of .git directories."""
+ snapshotter = WorkspaceSnapshotter()
+ from pathlib import Path
+
+ assert snapshotter._should_exclude(Path("/some/path/.git/config"))
+
+ def test_should_exclude_env(self):
+ """Test exclusion of .env files."""
+ snapshotter = WorkspaceSnapshotter()
+ from pathlib import Path
+
+ assert snapshotter._should_exclude(Path("/some/path/.env"))
+
+ def test_create_snapshot_no_workspace(self):
+ """Test that create_snapshot fails without workspace."""
+ snapshotter = WorkspaceSnapshotter()
+
+ with pytest.raises(ValueError, match="No workspace directory"):
+ snapshotter.create_snapshot()
+
+ def test_create_snapshot_missing_directory(self):
+ """Test that create_snapshot fails with missing directory."""
+ snapshotter = WorkspaceSnapshotter(workspace_dir="/nonexistent/path")
+
+ with pytest.raises(FileNotFoundError):
+ snapshotter.create_snapshot()
+
+ def test_restore_snapshot_no_workspace(self):
+ """Test that restore_snapshot fails without workspace."""
+ snapshotter = WorkspaceSnapshotter()
+
+ with pytest.raises(ValueError, match="No workspace directory"):
+ snapshotter.restore_snapshot(b"data")
+
+ def test_create_and_restore_snapshot(self):
+ """Test creating and restoring a workspace snapshot."""
+ with tempfile.TemporaryDirectory() as tmpdir:
+ # Create source workspace with files
+ source_dir = os.path.join(tmpdir, "source")
+ os.makedirs(source_dir)
+
+ # Create test files
+ with open(os.path.join(source_dir, "file1.txt"), "w") as f:
+ f.write("content1")
+ with open(os.path.join(source_dir, "file2.py"), "w") as f:
+ f.write("print('hello')")
+
+ # Create subdirectory
+ subdir = os.path.join(source_dir, "subdir")
+ os.makedirs(subdir)
+ with open(os.path.join(subdir, "nested.txt"), "w") as f:
+ f.write("nested content")
+
+ # Create snapshot
+ snapshotter = WorkspaceSnapshotter(workspace_dir=source_dir)
+ blob, sha256, size = snapshotter.create_snapshot()
+
+ assert isinstance(blob, bytes)
+ assert len(sha256) == 64
+ assert size > 0
+
+ # Restore to different location
+ dest_dir = os.path.join(tmpdir, "dest")
+ restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest_dir)
+ restore_snapshotter.restore_snapshot(blob)
+
+ # Verify files were restored
+ assert os.path.exists(os.path.join(dest_dir, "file1.txt"))
+ assert os.path.exists(os.path.join(dest_dir, "file2.py"))
+ assert os.path.exists(os.path.join(dest_dir, "subdir", "nested.txt"))
+
+ # Verify content
+ with open(os.path.join(dest_dir, "file1.txt")) as f:
+ assert f.read() == "content1"
+
+ def test_snapshot_excludes_pycache(self):
+ """Test that snapshots exclude __pycache__ directories."""
+ with tempfile.TemporaryDirectory() as tmpdir:
+ # Create workspace with __pycache__
+ workspace = os.path.join(tmpdir, "workspace")
+ os.makedirs(workspace)
+
+ with open(os.path.join(workspace, "main.py"), "w") as f:
+ f.write("print('main')")
+
+ pycache = os.path.join(workspace, "__pycache__")
+ os.makedirs(pycache)
+ with open(os.path.join(pycache, "main.cpython-311.pyc"), "wb") as f:
+ f.write(b"\x00\x00\x00\x00")
+
+ # Create snapshot
+ snapshotter = WorkspaceSnapshotter(workspace_dir=workspace)
+ blob, _, _ = snapshotter.create_snapshot()
+
+ # Restore and verify __pycache__ was excluded
+ dest = os.path.join(tmpdir, "dest")
+ restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest)
+ restore_snapshotter.restore_snapshot(blob)
+
+ assert os.path.exists(os.path.join(dest, "main.py"))
+ assert not os.path.exists(os.path.join(dest, "__pycache__"))