diff --git a/contributing/samples/long_running_task/README.md b/contributing/samples/long_running_task/README.md new file mode 100644 index 0000000000..b649a5e941 --- /dev/null +++ b/contributing/samples/long_running_task/README.md @@ -0,0 +1,182 @@ +# Durable Session Demo + +This demo showcases the durable session persistence feature in ADK, which +enables checkpoint-based durability for long-running agent invocations. + +## Overview + +Durable sessions provide: +- **Checkpoint persistence**: Agent state is saved to BigQuery + GCS +- **Failure recovery**: Resume from the last checkpoint after crashes +- **Host migration**: Move sessions between hosts seamlessly +- **Lease management**: Prevent concurrent modifications + +## Prerequisites + +1. **Google Cloud Project** with billing enabled +2. **APIs enabled**: + - BigQuery API + - Cloud Storage API + - Vertex AI API (for Gemini models) +3. **IAM permissions**: + - `roles/bigquery.dataEditor` + - `roles/storage.objectAdmin` + - `roles/aiplatform.user` + +## Setup + +### 1. Configure your environment + +```bash +# Set your project +export PROJECT_ID="test-project-0728-467323" +gcloud config set project $PROJECT_ID + +# Set your Google Cloud API key (required for Gemini 3) +export GOOGLE_CLOUD_API_KEY="your-api-key-here" + +# Authenticate +gcloud auth application-default login +``` + +### 2. Create BigQuery and GCS resources + +```bash +# Run the setup script +python contributing/samples/long_running_task/setup.py + +# To verify setup +python contributing/samples/long_running_task/setup.py --verify + +# To clean up resources +python contributing/samples/long_running_task/setup.py --cleanup +``` + +### 3. Run the demo + +```bash +adk web contributing/samples/long_running_task +``` + +## Demo Scenarios + +### Scenario 1: Long-running table scan + +``` +User: Scan the bigquery-public-data.samples.shakespeare table + +Agent: [Calls simulate_long_running_scan] + [Checkpoint written at async boundary] + [Scan completes after ~5-10 seconds] + The scan found 164,656 rows with the following findings: + - Found 5 instances of 'to be or not to be' + - Most common word: 'the' (27,801 occurrences) + - Unique words: 29,066 +``` + +### Scenario 2: Multi-stage pipeline + +``` +User: Run a pipeline from source_table to dest_table with transformations: + filter, aggregate, join + +Agent: [Calls run_data_pipeline] + [Checkpoint written at each stage boundary] + Pipeline completed successfully: + - Stage 1 (filter): 45,000 rows processed + - Stage 2 (aggregate): 32,000 rows processed + - Stage 3 (join): 28,000 rows processed +``` + +### Scenario 3: Failure recovery + +1. Start a long-running scan +2. Kill the process mid-execution +3. Restart and resume with the invocation_id +4. 
Agent continues from the last checkpoint + +## Architecture + +``` + +-----------------+ + | Agent | + | (LlmAgent) | + +--------+--------+ + | + v + +-----------------+ + | Runner | + | (with durability)| + +--------+--------+ + | + +----------------+----------------+ + | | + v v + +--------------+ +----------------+ + | BigQuery | | GCS | + | (metadata) | | (state blobs) | + +--------------+ +----------------+ + | - sessions | | - checkpoints/ | + | - checkpoints| | {session_id}/| + +--------------+ +----------------+ +``` + +## Configuration + +The agent is configured in `agent.py`: + +```python +app = App( + name="durable_session_demo", + root_agent=root_agent, + resumability_config=ResumabilityConfig(is_resumable=True), + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + checkpoint_store=BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, + ), + lease_timeout_seconds=300, + ), +) +``` + +### Checkpoint Policies + +- `async_boundary`: Checkpoint when hitting async/long-running operations +- `every_turn`: Checkpoint after every agent turn +- `manual`: Only checkpoint when explicitly requested + +## Monitoring + +### View sessions + +```sql +SELECT * FROM `test-project-0728-467323.adk_metadata.sessions` +ORDER BY updated_at DESC +LIMIT 10; +``` + +### View checkpoints + +```sql +SELECT * FROM `test-project-0728-467323.adk_metadata.checkpoints` +ORDER BY created_at DESC +LIMIT 10; +``` + +### List checkpoint blobs + +```bash +gsutil ls -l gs://test-project-0728-467323-adk-checkpoints/checkpoints/ +``` + +## Cleanup + +To remove all resources created by this demo: + +```bash +python contributing/samples/long_running_task/setup.py --cleanup +``` diff --git a/contributing/samples/long_running_task/REVIEW_FEEDBACK.md b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md new file mode 100644 index 0000000000..c8e6387f69 --- /dev/null +++ b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md @@ -0,0 +1,239 @@ +# Design Document Review: Durable Session Persistence for Long-Horizon ADK Agents + +**Reviewer:** Claude Code +**Date:** 2026-02-01 +**Document:** `long_running_task_design.md` + +--- + +## Executive Summary + +The design document is **well-structured and comprehensive**, covering a real problem with a thorough technical approach. However, there are **critical accuracy issues** regarding ADK's current capabilities that must be addressed before the document can be considered accurate for review. + +**Overall Assessment:** Good foundation, requires significant revisions to accurately reflect ADK's existing resumability features. + +--- + +## 1. Reference Validation + +### External URLs (7 total) - ALL VALID + +| # | URL | Status | Notes | +|---|-----|--------|-------| +| 1 | LangGraph durable-execution | VALID | Content matches claims | +| 2 | LangGraph persistence | VALID | Checkpointing docs | +| 3 | LangGraph overview | VALID | Framework intro | +| 4 | LangGraph checkpoints reference | VALID | API docs | +| 5 | Deep Agents overview | VALID | LangChain library | +| 6 | Deep Agents long-term memory | VALID | Memory patterns | +| 7 | Anthropic harnesses article | VALID | Published 2025-11-26 | + +--- + +## 2. CRITICAL ISSUE: ADK Already Has Resumability + +### Problem Statement Inaccuracy + +The document states (Section 2): +> "Current ADK sessions are optimized for synchronous 'serving' patterns... state is ephemeral... 
background execution is not a first-class runtime mode" + +**This is inaccurate.** ADK already has an experimental resumability feature: + +```python +# src/google/adk/apps/app.py lines 42-58 +@experimental +class ResumabilityConfig(BaseModel): + """The "resumability" in ADK refers to the ability to: + 1. pause an invocation upon a long-running function call. + 2. resume an invocation from the last event, if it's paused or failed midway + through. + """ + is_resumable: bool = False +``` + +### Existing ADK Capabilities Not Mentioned + +| Capability | Location | Status | +|------------|----------|--------| +| `ResumabilityConfig` | `src/google/adk/apps/app.py:42-58` | Experimental | +| `should_pause_invocation()` | `src/google/adk/agents/invocation_context.py:355-389` | Implemented | +| `long_running_tool_ids` | `src/google/adk/events/event.py` | Implemented | +| Resume from last event | `src/google/adk/runners.py:1294` | Implemented | + +### Required Fix + +**The document must:** +1. Acknowledge existing `ResumabilityConfig` and pause/resume capability +2. Clearly articulate how this proposal **extends** existing features vs. replacing them +3. Update Section 2 (Problem Statement) to reflect actual gaps (e.g., durable cross-process persistence, BigQuery-based audit, external event triggers) + +--- + +## 3. Technical Review + +### 3.1 SQL Schema (Appendix B) - VALID WITH MINOR ISSUES + +**Strengths:** +- Proper partitioning strategy (`PARTITION BY DATE`) +- Sensible clustering choices +- JSON columns for flexibility + +**Issues:** + +1. **Missing primary key constraint on checkpoints:** + ```sql + -- Should add: + PRIMARY KEY (session_id, checkpoint_seq) + ``` + +2. **events table lacks PRIMARY KEY:** + ```sql + -- Consider adding: + PRIMARY KEY (event_id) -- or composite key + ``` + +3. **View `v_latest_checkpoint` uses ARRAY_AGG with OFFSET(0):** + - This is valid but will error if no checkpoints exist + - Consider `SAFE_OFFSET(0)` or handle NULL case + +### 3.2 Python Code Snippets - MOSTLY VALID + +**Section 7.1 `write_checkpoint()`:** +- Logic is sound (two-phase commit pattern) +- Consider adding error handling for partial failures + +**Section 7.2 `reconcile_on_resume()`:** +- Good idempotency pattern +- Missing: what happens if `bq.get_job()` fails? + +### 3.3 Leasing Approach (Section 7.3) - REASONABLE + +The BQ-based optimistic lease is correctly noted as best-effort. The suggestion to use Firestore/Spanner for stronger guarantees is appropriate. + +**Suggestion:** Add a concrete example of when to use each backend (BQ vs Firestore). + +--- + +## 4. Architecture Feedback + +### 4.1 Strengths + +1. **Clear separation of control plane (BQ) vs data plane (GCS)** - follows Google best practices +2. **Logical checkpointing over heap snapshots** - pragmatic and maintainable +3. **Two-phase commit pattern** - ensures atomic visibility +4. **Authoritative reconciliation** - critical for BigQuery job scenarios +5. 
**Good competitive analysis** (Section 14) + +### 4.2 Gaps / Missing Considerations + +| Gap | Impact | Suggested Action | +|-----|--------|------------------| +| No mention of existing `ResumabilityConfig` | Misleading problem statement | Add section on existing capability | +| No cost estimates for BQ storage/queries | Budget planning | Add rough estimates | +| No mention of BQ quota limits | Operational risk | Document relevant quotas | +| Checkpoint versioning migration strategy | Future maintenance | Expand Section 16.2 | +| No monitoring/alerting design | Operability | Add observability section | +| No rollback strategy | Safety | Document how to rollback | + +### 4.3 API Contract Review + +The proposed `CheckpointableAgentState` interface is clean: + +```python +class CheckpointableAgentState: + def export_state(self) -> dict: ... + def import_state(self, state: dict) -> None: ... +``` + +**Suggestion:** Consider alignment with existing ADK patterns: +- Existing `BaseAgentState` in `src/google/adk/agents/base_agent.py` +- Existing state patterns in `src/google/adk/sessions/state.py` + +--- + +## 5. Specific Line-by-Line Feedback + +### Section 0 (Executive Summary) +- Line 14: "12-minute barrier" - should cite source or clarify this is environment-specific +- Line 28: Cost estimate "< $0.01/session-day paused" - show calculation + +### Section 2 (Problem Statement) +- **Major revision needed** - must acknowledge existing resumability + +### Section 4.1 (States) +- Consider: should PAUSED be a first-class `Session.status` field or remain at `InvocationContext` level? + +### Section 8 (API Extensions) +- `checkpoint_policy` options are good, but: + - What triggers `superstep`? + - How does `manual` interact with `long_running_tool_ids`? + +### Section 13 (Moltbot Alignment) +- Moltbot reference is useful context +- Consider adding link/citation if public + +### Section 18 (Open Questions) +- Good list, but add: "How does this integrate with existing `ResumabilityConfig`?" + +--- + +## 6. Recommended Document Changes + +### High Priority (Must Fix) + +1. **Add Section 1.3: "Existing ADK Resumability"** + - Document current `ResumabilityConfig` capability + - Explain limitations this design addresses + - Position proposal as extension, not replacement + +2. **Revise Section 2 (Problem Statement)** + - Remove/qualify claims about ADK lacking pause/resume + - Focus on actual gaps: cross-process durability, external event triggers, enterprise audit + +3. **Add explicit integration plan** + - How does `CheckpointableAgentState` relate to `BaseAgentState`? + - Migration path from current resumability to new design + +### Medium Priority + +4. Add cost estimation section +5. Add monitoring/observability design +6. Add rollback/recovery procedures +7. Fix SQL schema issues (PKs) + +### Low Priority + +8. Add Moltbot citation if available +9. Add BQ quota documentation links +10. Consider adding architecture diagram (beyond Mermaid sequence) + +--- + +## 7. Summary Table + +| Category | Status | Details | +|----------|--------|---------| +| External URLs | VALID | All 7 references work | +| SQL Syntax | VALID with issues | Missing PKs, edge cases | +| Python Code | VALID | Sound patterns | +| Problem Statement | INACCURATE | Ignores existing resumability | +| Architecture | SOUND | Good Google-scale patterns | +| Completeness | GAPS | Missing cost, monitoring, rollback | + +--- + +## 8. 
Conclusion + +This is a **solid technical design** for extending ADK's capabilities for long-running BigQuery workloads. The core architecture (BQ control plane, GCS data plane, two-phase commit, authoritative reconciliation) is well-reasoned. + +**However, the document cannot be approved in its current form** because it misrepresents ADK's existing capabilities. Once the existing `ResumabilityConfig` is acknowledged and the document is repositioned as an extension rather than a new capability, it will be ready for technical review. + +**Recommended Next Steps:** +1. Revise document to acknowledge existing resumability +2. Add cost/monitoring sections +3. Fix SQL schema issues +4. Re-submit for review + +--- + +*Review generated by Claude Code on 2026-02-01* diff --git a/contributing/samples/long_running_task/__init__.py b/contributing/samples/long_running_task/__init__.py new file mode 100644 index 0000000000..4015e47d6e --- /dev/null +++ b/contributing/samples/long_running_task/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import agent diff --git a/contributing/samples/long_running_task/agent.py b/contributing/samples/long_running_task/agent.py new file mode 100644 index 0000000000..10e95f663a --- /dev/null +++ b/contributing/samples/long_running_task/agent.py @@ -0,0 +1,142 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Durable session demo agent with long-running BigQuery operations. + +This agent demonstrates the durable session persistence feature, which enables +checkpointing of agent state to BigQuery + GCS for recovery from failures. + +To run this demo: + 1. Set up the BigQuery tables and GCS bucket (see setup.py) + 2. Set GOOGLE_CLOUD_API_KEY environment variable + 3. 
Run: adk web contributing/samples/long_running_task + +Example prompts: + - "Scan the bigquery-public-data.samples.shakespeare table" + - "Get the schema of bigquery-public-data.samples.github_nested" + - "Run a pipeline from source_table to dest_table with filter, aggregate" +""" + +import os +from functools import cached_property + +from google.adk.agents import LlmAgent +from google.adk.apps import App +from google.adk.apps import ResumabilityConfig +from google.adk.durable import BigQueryCheckpointStore +from google.adk.durable import DurableSessionConfig +from google.adk.models.google_llm import Gemini +from google.adk.tools import LongRunningFunctionTool +from google.genai import Client +from google.genai import types + +from .tools import get_table_schema +from .tools import run_batch_etl_job +from .tools import run_data_pipeline +from .tools import run_demo_analysis +from .tools import run_extended_analysis +from .tools import run_ml_training_job +from .tools import simulate_long_running_scan + +# Configuration +PROJECT_ID = "test-project-0728-467323" +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" + +# API Key for Vertex AI (must be set via environment variable) +GOOGLE_CLOUD_API_KEY = os.environ.get("GOOGLE_CLOUD_API_KEY", "") + + +class VertexAIGemini(Gemini): + """Custom Gemini model configured for Vertex AI with API key.""" + + model: str = "gemini-3-flash-preview" + + @cached_property + def api_client(self) -> Client: + """Provides the api client configured for Vertex AI.""" + return Client( + vertexai=True, + api_key=GOOGLE_CLOUD_API_KEY, + http_options=types.HttpOptions( + headers=self._tracking_headers(), + retry_options=self.retry_options, + ), + ) + + +# Create the checkpoint store +checkpoint_store = BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, +) + +# Create the root agent with long-running tools using custom Vertex AI model +root_agent = LlmAgent( + model=VertexAIGemini(model="gemini-3-flash-preview"), + name="durable_bq_scanner", + description="Long-running BigQuery scanner with durable checkpoints", + instruction="""You are a data analyst assistant that can run various data processing jobs. + +Your capabilities: +1. Get table schemas - Use get_table_schema for quick schema lookups +2. Scan tables - Use simulate_long_running_scan for table analysis (~5-10 seconds) +3. Run data pipelines - Use run_data_pipeline for multi-stage transformations +4. Demo analysis - Use run_demo_analysis for a 1-minute demo (perfect for presentations!) +5. Extended analysis - Use run_extended_analysis for jobs that run 1-60 minutes +6. ML training - Use run_ml_training_job for model training (2-30 minutes based on size) +7. Batch ETL - Use run_batch_etl_job for large ETL jobs (1-60 minutes) + +For quick demos (~1 minute): +- run_demo_analysis: Specify analysis_type (e.g., "sentiment", "anomaly", "trend", "clustering") + +For long-running jobs (10+ minutes): +- run_extended_analysis: Specify duration_minutes (e.g., 10, 15, 30) +- run_ml_training_job: Use dataset_size "large" (10 min), "xlarge" (15 min), or "enterprise" (30 min) +- run_batch_etl_job: Specify processing_minutes (e.g., 10, 15, 30) + +The system will automatically checkpoint your progress during long-running +operations, so you can resume if interrupted. + +Important: When using long-running tools, wait for them to complete before +taking further action. Do not call the same tool again if it returned a +pending status. 
+""", + tools=[ + get_table_schema, + LongRunningFunctionTool(func=simulate_long_running_scan), + LongRunningFunctionTool(func=run_data_pipeline), + LongRunningFunctionTool(func=run_demo_analysis), + LongRunningFunctionTool(func=run_extended_analysis), + LongRunningFunctionTool(func=run_ml_training_job), + LongRunningFunctionTool(func=run_batch_etl_job), + ], + generate_content_config=types.GenerateContentConfig( + temperature=1.0, # Required for Gemini 3 + ), +) + +# Create the app with durable session configuration +app = App( + name="long_running_task", + root_agent=root_agent, + resumability_config=ResumabilityConfig(is_resumable=True), + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + checkpoint_store=checkpoint_store, + lease_timeout_seconds=300, + ), +) diff --git a/contributing/samples/long_running_task/comment.md b/contributing/samples/long_running_task/comment.md new file mode 100644 index 0000000000..356cd67548 --- /dev/null +++ b/contributing/samples/long_running_task/comment.md @@ -0,0 +1,1094 @@ +# Design Review Comments and Responses + +## Comment 1: Session Service as Durable Persistence + +**From:** ADK Team +**Date:** 2026-02-02 + +**Comment:** +> "Session service is the durable session persistence. For local, user starts with InMemoryService, but they can opt-in storage-based session service: SQLite, DatabaseSessionService, BigQuerySessionService, etc." + +--- + +### Response + +Thank you for the feedback. You're correct that ADK already has a robust session service hierarchy. This comment raises an important architectural question: **Why introduce a separate CheckpointStore when SessionService already provides persistence?** + +#### Key Distinction: Session State vs. Checkpoint State + +| Aspect | Session Service | Checkpoint Store (Proposed) | +|--------|-----------------|----------------------------| +| **What it stores** | Conversation history (events, messages, tool calls) | Agent execution state (job ledgers, progress cursors, partial results) | +| **Granularity** | Per-message/event append | Per-checkpoint snapshot at logical boundaries | +| **Data model** | Event stream (append-only) | Point-in-time snapshots (two-phase commit) | +| **Primary use case** | Replay conversation context to LLM | Resume long-running task from failure point | +| **Recovery question** | "What did the agent say?" | "Where was the agent in a 6-hour BigQuery scan?" | +| **External job tracking** | Tool call events (but not reconciliation-ready) | Authoritative job ledger with status sync | + +#### Why Session Service Alone May Be Insufficient + +1. **Job Ledger with Authoritative Reconciliation** + - Session events record that a tool was called, but don't maintain a ledger that can be reconciled against external job states (DONE/FAILED/RUNNING) + - On resume, we need to query BigQuery: "Is job X still running?" and update our ledger accordingly + - This reconciliation pattern doesn't fit the append-only event model + +2. **Partial Results Persistence** + - A 50-table PII scan may complete 30 tables before failure + - Checkpoint stores: which tables done, their findings, which remain + - Session stores: the conversation about starting the scan + +3. **Two-Phase Commit Semantics** + - Checkpoints require atomic visibility: GCS blob uploaded AND metadata pointer updated + - Session services typically use simpler append semantics + - Partial checkpoint writes must not be visible + +4. 
**Workspace Snapshots** + - Long-running coding agents may need `/workspace` file persistence + - This is binary blob data, not conversation events + - Doesn't fit session event model + +5. **Different Query Patterns** + - Session: "Give me all events for session X in order" + - Checkpoint: "Give me the latest checkpoint for session X" (single row) + - Fleet ops: "Show me all paused sessions with checkpoints > 1 hour old" + +--- + +### Potential Approaches + +#### Option A: Separate CheckpointStore (Current Design) + +``` +┌─────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────┤ +│ SessionService (existing) │ CheckpointStore (new) │ +│ - Conversation history │ - Execution state │ +│ - Event replay for LLM │ - Job ledgers │ +│ - Append-only events │ - Two-phase commit │ +│ - SQLite/DB/BigQuery │ - BigQuery + GCS │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Pros:** +- Clear separation of concerns +- Different consistency models for different needs +- No changes to existing SessionService implementations +- Checkpoint-specific optimizations (compression, GCS blob storage) + +**Cons:** +- Two services to configure for durable agents +- Potential confusion about which stores what +- Additional infrastructure (though can share BigQuery dataset) + +#### Option B: Extend SessionService with Checkpoint Capability + +```python +class SessionService(ABC): + # Existing methods... + + # New checkpoint methods + async def write_checkpoint( + self, session_id: str, checkpoint_seq: int, state: bytes, ... + ) -> None: ... + + async def read_latest_checkpoint( + self, session_id: str + ) -> tuple[int, bytes] | None: ... +``` + +**Pros:** +- Single service to configure +- Unified persistence layer +- Familiar pattern for ADK users + +**Cons:** +- Mixes conversation semantics with execution semantics +- May require significant changes to existing implementations +- Two-phase commit harder to add to existing append-only services +- Risk of breaking changes + +#### Option C: Checkpoint as Special Event Type + +```python +# Store checkpoint as a special event in the session +event = Event( + author="system", + type=EventType.CHECKPOINT, + checkpoint_data=CheckpointData( + seq=5, + state_gcs_uri="gs://...", + job_ledger={...}, + ) +) +session_service.append_event(session_id, event) +``` + +**Pros:** +- Uses existing SessionService infrastructure +- Single storage location +- Events remain the universal abstraction + +**Cons:** +- Checkpoint retrieval requires scanning events (inefficient) +- Two-phase commit semantics still needed for GCS blob +- Mixing large blobs with conversation events +- Query patterns still don't match (latest vs. stream) + +--- + +### Recommendation + +**Option A (Separate CheckpointStore)** is recommended for v1 because: + +1. **Clean separation**: Conversation history and execution state serve different purposes +2. **No breaking changes**: Existing SessionService implementations unchanged +3. **Optimized for use case**: Checkpoint-specific features (GCS blobs, two-phase commit, lease management) +4. 
**Incremental adoption**: Users can add checkpointing without changing session config + +However, we should: +- Document the relationship clearly +- Consider Option B for v2 if the pattern proves successful +- Ensure both can share the same BigQuery dataset for operational simplicity + +--- + +## Suggested Updates to Design Doc + +Based on this feedback, the following sections should be added/updated in `long_running_task_design.md`: + +### 1. Add New Section: "Relationship to Existing Session Service" + +**Location:** After Section 5 (Architecture Overview) + +```markdown +## 5.4 Relationship to Existing Session Service + +ADK provides a `SessionService` abstraction for conversation persistence: + +| Implementation | Storage | Use Case | +|----------------|---------|----------| +| `InMemorySessionService` | RAM | Development/testing | +| `SQLiteSessionService` | Local SQLite | Single-machine persistence | +| `DatabaseSessionService` | PostgreSQL/MySQL | Production multi-instance | +| `BigQuerySessionService` | BigQuery | Enterprise scale | + +**Why a separate CheckpointStore?** + +The `SessionService` and `CheckpointStore` serve complementary purposes: + +| SessionService | CheckpointStore | +|----------------|-----------------| +| Conversation history | Execution state snapshots | +| Append-only events | Point-in-time checkpoints | +| LLM context replay | Task resume from failure | +| Per-event granularity | Per-checkpoint granularity | + +A durable long-horizon agent typically uses both: +- `SessionService` for conversation continuity +- `CheckpointStore` for execution state durability + +**Shared Infrastructure** + +Both services can share the same BigQuery dataset: +- `adk_metadata.sessions` (SessionService) +- `adk_metadata.events` (SessionService) +- `adk_metadata.durable_sessions` (CheckpointStore) +- `adk_metadata.checkpoints` (CheckpointStore) +``` + +### 2. Update Section 8.2 (Configuration) + +Add clarity about the relationship: + +```markdown +### 8.2 Configuration + +```python +# A durable agent uses BOTH session service and checkpoint store +app = App( + name="durable_scanner", + root_agent=agent, + + # Session service for conversation history (existing) + session_service=BigQuerySessionService( + project="my-project", + dataset="adk_metadata", + ), + + # Checkpoint store for execution state (new) + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_store=BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", # Can share dataset + gcs_bucket="my-checkpoints", + ), + ), +) +``` + +**Note:** Both services can share the same BigQuery dataset. The checkpoint tables use a `durable_` prefix to avoid conflicts. +``` + +### 3. Add to Section 15 (Alternatives Considered) + +```markdown +| Alternative | Why not (v1) | +|-------------|--------------| +| Extend SessionService with checkpoint methods | Different consistency models; risk of breaking changes to existing implementations | +| Checkpoint as special Event type | Inefficient retrieval (scan vs. point lookup); mixes blob storage with events | +``` + +### 4. Add FAQ Entry + +```markdown +## Appendix F: FAQ + +### Why not just use SessionService for checkpoints? + +SessionService is optimized for conversation history (append-only event streams). 
+Checkpoints require: +- Point-in-time snapshots (not event streams) +- Two-phase commit (GCS blob + metadata atomicity) +- Different query patterns (latest-per-session, not full history) +- Large blob storage (workspace snapshots) + +The separation ensures each service is optimized for its use case. + +### Can I use CheckpointStore without SessionService? + +Yes, but not recommended. SessionService provides conversation context for +the LLM on resume. Without it, the agent loses conversation history. + +### Do they share the same BigQuery dataset? + +Yes, recommended. Use the same dataset with different table prefixes: +- SessionService: `sessions`, `events` +- CheckpointStore: `durable_sessions`, `checkpoints` +``` + +--- + +## Action Items + +- [ ] Add Section 5.4 to design doc +- [ ] Update Section 8.2 with dual-service example +- [ ] Add alternatives to Section 15 +- [ ] Add FAQ appendix +- [ ] Consider renaming tables to avoid confusion (`durable_sessions` vs `sessions`) +- [ ] Document shared dataset configuration in README + +--- + +## Open Questions for ADK Team + +1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset? +2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both? +3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail? +4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage? + +--- + +## Comment 2: GcsArtifactService for Large Blobs + +**From:** ADK Team +**Date:** 2026-02-02 + +**Comment:** +> "In ADK, ArtifactService is designed for large blobs. Have you checked that? We have a GcsArtifactService in the core library." + +--- + +### Response + +Thank you for pointing this out. Yes, I've reviewed `GcsArtifactService` (`src/google/adk/artifacts/gcs_artifact_service.py`) and the `BaseArtifactService` interface. This is a valid consideration. + +#### Current ArtifactService Capabilities + +| Feature | GcsArtifactService | +|---------|-------------------| +| Storage backend | GCS bucket | +| Key structure | `{app_name}/{user_id}/{session_id}/{filename}/{version}` | +| Versioning | Monotonic integer versions (0, 1, 2, ...) | +| Data type | `types.Part` (inline_data, text, file_data) | +| Metadata | Custom metadata dict on blob | +| Operations | save, load, list, delete, list_versions | + +#### Checkpoint Blob Requirements + +| Requirement | ArtifactService Support | Gap | +|-------------|------------------------|-----| +| Store bytes/JSON blobs | Yes (`types.Part.from_bytes`) | None | +| Session-scoped storage | Yes | None | +| Version tracking | Yes (monotonic) | Checkpoint uses `checkpoint_seq` | +| Custom metadata | Yes | Need SHA-256, trigger, size_bytes | +| Two-phase commit | **No** | Critical gap | +| Atomic visibility with BQ | **No** | Critical gap | +| Workspace tar.gz bundles | Partially (as bytes) | None | +| Integrity verification | **No** | Need SHA-256 on read | + +#### Key Gaps + +**1. Two-Phase Commit Semantics** + +The checkpoint pattern requires: +``` +Phase 1: Upload blob to GCS (may fail, invisible) +Phase 2: Insert metadata to BigQuery (makes checkpoint visible) +``` + +`GcsArtifactService.save_artifact()` uploads and returns immediately. There's no coordination with an external metadata store. A partial upload becomes immediately "visible" via `load_artifact()`. + +**2. 
Atomic Visibility with BigQuery Metadata** + +Checkpoints must be invisible until both: +- GCS blob exists AND +- BigQuery metadata row exists + +`GcsArtifactService` doesn't have this concept - artifacts are visible as soon as they're uploaded. + +**3. SHA-256 Integrity Verification** + +Checkpoints require integrity verification on read: +```python +# On read +blob = gcs.download(uri) +if sha256(blob) != metadata.sha256: + raise CheckpointCorruptionError() +``` + +`GcsArtifactService` doesn't compute or verify checksums. + +**4. Key Structure Mismatch** + +| Service | Key Pattern | +|---------|-------------| +| ArtifactService | `{app}/{user}/{session}/{filename}/{version}` | +| CheckpointStore | `{session_id}/{checkpoint_seq}/state.json` | + +Checkpoints don't have `app_name`, `user_id`, or `filename` - they're keyed purely by `session_id` + `checkpoint_seq`. + +--- + +### Potential Approaches + +#### Option A: Use GcsArtifactService as Underlying Storage (Adapt) + +```python +class BigQueryCheckpointStore(DurableSessionStore): + def __init__(self, artifact_service: GcsArtifactService, ...): + self._artifact_service = artifact_service + + async def write_checkpoint(self, session_id, seq, state_blob, ...): + # Phase 1: Use artifact service for GCS upload + version = await self._artifact_service.save_artifact( + app_name="checkpoints", + user_id="system", + session_id=session_id, + filename=f"checkpoint_{seq}", + artifact=types.Part.from_bytes(state_blob, mime_type="application/json"), + custom_metadata={"sha256": sha256(state_blob)}, + ) + + # Phase 2: Insert BQ metadata (makes checkpoint visible) + await self._insert_bq_metadata(session_id, seq, ...) +``` + +**Pros:** +- Reuses existing GCS infrastructure +- Consistent with ADK patterns +- Less code duplication + +**Cons:** +- Awkward key mapping (`app_name="checkpoints"`, `user_id="system"`) +- Still need custom two-phase commit logic +- Still need SHA-256 verification layer +- Version semantics don't match (artifact version vs checkpoint_seq) + +#### Option B: Direct GCS Client (Current Design) + +```python +class BigQueryCheckpointStore(DurableSessionStore): + def __init__(self, gcs_bucket: str, ...): + self._gcs_client = storage.Client() + self._bucket = self._gcs_client.bucket(gcs_bucket) + + async def write_checkpoint(self, session_id, seq, state_blob, ...): + # Phase 1: Direct GCS upload with preconditions + blob = self._bucket.blob(f"{session_id}/{seq}/state.json") + blob.upload_from_string( + state_blob, + if_generation_match=0, # Fail if exists (idempotency) + ) + + # Phase 2: Insert BQ metadata + await self._insert_bq_metadata(session_id, seq, ...) +``` + +**Pros:** +- Full control over GCS operations +- Clean key structure +- Native support for preconditions (`if_generation_match`) +- Simpler code path + +**Cons:** +- Doesn't leverage existing ArtifactService +- Separate GCS client initialization + +#### Option C: Extend ArtifactService Interface + +Add checkpoint-specific methods to `BaseArtifactService`: + +```python +class BaseArtifactService(ABC): + # Existing methods... + + # New: Checkpoint-specific operations + async def save_checkpoint_blob( + self, + *, + session_id: str, + checkpoint_seq: int, + blob: bytes, + sha256: str, + ) -> str: + """Save a checkpoint blob and return GCS URI.""" + ... + + async def load_checkpoint_blob( + self, + *, + session_id: str, + checkpoint_seq: int, + expected_sha256: str, + ) -> bytes: + """Load and verify checkpoint blob.""" + ... 
+``` + +**Pros:** +- Unified artifact/checkpoint interface +- Extensible for future blob types + +**Cons:** +- Modifies core ADK interface +- Checkpoint semantics may not fit all artifact backends +- Two-phase commit still external + +--- + +### Recommendation + +**Option B (Direct GCS Client)** is recommended for v1 because: + +1. **Simpler implementation**: No adapter layer or key mapping +2. **Full control**: Native GCS preconditions for idempotency +3. **Clean semantics**: Checkpoint keys match checkpoint concepts +4. **No interface changes**: Doesn't require modifying BaseArtifactService + +However, we should: +- Document the relationship with ArtifactService +- Consider Option A or C for v2 if there's desire for unification +- Ensure both can share the same GCS bucket if needed + +--- + +### Suggested Design Doc Updates + +Add to Section 15 (Alternatives Considered): + +```markdown +| Alternative | Why not (v1) | +|-------------|--------------| +| Use GcsArtifactService for checkpoint blobs | Key structure mismatch; no two-phase commit support; no SHA-256 verification; would require adapter layer | +``` + +Add to Section 5.3 (Integration with Existing ADK Services): + +```markdown +### Relationship to ArtifactService + +ADK's `ArtifactService` (`GcsArtifactService`, `FileArtifactService`, etc.) is designed for +user/session-scoped file artifacts with versioning. + +Checkpoints have different requirements: +- Two-phase commit with BigQuery metadata +- SHA-256 integrity verification +- Different key structure (session_id/checkpoint_seq) + +For v1, `CheckpointStore` uses direct GCS client access. Future versions may consider +unifying with `ArtifactService` if the interface can be extended to support checkpoint +semantics. +``` + +--- + +--- + +## Comment 3: Leasing as General Requirement + +**From:** ADK Team +**Date:** 2026-02-02 + +**Reference:** Section 7.3 - "We must ensure only one runner resumes a session at a time" + +**Comment:** +> "This is not only applicable to resume. `Runner.run_async` also requires this. Leasing is a general requirement for app developers." + +--- + +### Response + +This is an important clarification. You're correct that session-level concurrency control is a **general requirement**, not specific to durable session resume. + +#### Expanded Scope of Leasing + +| Scenario | Concurrency Risk | Current ADK Handling | +|----------|------------------|---------------------| +| Multiple `run_async()` on same session | Race conditions, duplicate tool calls | App developer responsibility | +| Resume after pause | Duplicate resume attempts | App developer responsibility | +| Pub/Sub event redelivery | Multiple runners wake on same event | App developer responsibility | +| Horizontal scaling | Multiple instances claim same session | App developer responsibility | + +The design doc incorrectly scoped leasing as a "durable session" concern. In reality: + +``` +Leasing requirement = ANY scenario where multiple runners might access the same session +``` + +#### Current State in ADK + +Looking at `Runner.run_async()` in `src/google/adk/runners.py`: + +```python +async def run_async( + self, + *, + user_id: str, + session_id: str, + new_message: types.Content, + ... +) -> AsyncGenerator[Event, None]: + # No built-in lease acquisition + # App developer must ensure single-runner-per-session +``` + +There's no built-in lease mechanism. App developers must implement their own concurrency control. 
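To make the current situation concrete, below is a minimal sketch of the kind of guard an app developer might write today for a single-process deployment. The names (`SessionLocks`, `run_guarded`) are illustrative, not ADK APIs; an in-process lock only serializes runners inside one process and does nothing for multi-instance deployments, which is exactly the gap the options below address.

```python
import asyncio
from collections import defaultdict


class SessionLocks:
  """Illustrative in-process guard: one asyncio.Lock per session_id.

  Only serializes run_async() calls within a single process; multi-instance
  deployments still need an external lease (e.g. a conditional write with a
  TTL), which is the gap discussed in this section.
  """

  def __init__(self) -> None:
    self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

  def lock(self, session_id: str) -> asyncio.Lock:
    return self._locks[session_id]


session_locks = SessionLocks()


async def run_guarded(runner, *, user_id: str, session_id: str, new_message):
  """Serializes run_async() calls for one session within this process."""
  async with session_locks.lock(session_id):
    async for event in runner.run_async(
        user_id=user_id, session_id=session_id, new_message=new_message
    ):
      yield event
```

Once the app scales horizontally, the same pattern has to move into shared storage (BigQuery conditional updates, Firestore transactions, DB row locks), which is what the framework-level options below explore.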
+ +#### Implications for Design + +**Option A: Leasing in Durable Layer Only (Current Design)** + +``` +┌─────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────┤ +│ Runner.run_async() │ CheckpointStore │ +│ - No built-in leasing │ - Has lease management │ +│ - App manages concurrency │ - Protects resume only │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Pros:** Non-breaking, durable sessions get protection +**Cons:** Inconsistent; regular sessions still unprotected + +**Option B: Leasing in Runner (Framework-Level)** + +```python +class Runner: + def __init__(self, ..., lease_manager: Optional[LeaseManager] = None): + self._lease_manager = lease_manager + + async def run_async(self, ..., session_id: str, ...): + if self._lease_manager: + lease = await self._lease_manager.acquire(session_id) + if not lease: + raise SessionLeaseDeniedError(session_id) + try: + # ... execute agent logic + finally: + if self._lease_manager: + await self._lease_manager.release(session_id) +``` + +**Pros:** Consistent protection for all sessions +**Cons:** Breaking change; requires lease manager configuration + +**Option C: Leasing in SessionService (Storage-Level)** + +```python +class BaseSessionService(ABC): + @abstractmethod + async def acquire_session_lease( + self, session_id: str, lease_id: str, ttl_seconds: int + ) -> bool: ... + + @abstractmethod + async def release_session_lease( + self, session_id: str, lease_id: str + ) -> None: ... +``` + +**Pros:** Unified with session storage; natural fit +**Cons:** Requires changes to all SessionService implementations + +--- + +### Recommendation + +**Short-term (v1):** Keep leasing in `CheckpointStore` for durable sessions, but: +- Update design doc to acknowledge this is a subset of a broader need +- Document that app developers need their own concurrency control for non-durable sessions + +**Medium-term (v2):** Consider adding leasing to `SessionService` interface: +- `BigQuerySessionService` already has infrastructure for this +- `DatabaseSessionService` can use row-level locks +- `InMemorySessionService` can use asyncio locks + +**Long-term:** Consider Runner-level lease integration as opt-in feature. + +--- + +### Suggested Design Doc Updates + +**Update Section 7.3 Title:** + +From: +> "7.3 Leasing & optimistic concurrency" + +To: +> "7.3 Leasing & optimistic concurrency (session-level)" + +**Add Clarification Paragraph:** + +```markdown +### 7.3 Leasing & Optimistic Concurrency + +**Note:** Session-level concurrency control is a general ADK requirement, not +specific to durable sessions. Any scenario where multiple runners might access +the same session requires leasing: + +- Multiple `run_async()` calls on the same session +- Resume after pause (durable or in-process) +- Event-driven wake-up with potential redelivery +- Horizontal scaling with shared session storage + +Currently, ADK leaves session leasing to app developers. The durable session +layer provides lease management for checkpoint-protected sessions, but this +does not cover all concurrency scenarios. + +**Future consideration:** Add optional `LeaseManager` to `Runner` or lease +methods to `SessionService` interface for framework-level protection. +``` + +**Add to Section 18 (Open Questions):** + +```markdown +| Question | Risk Level | Notes | +|----------|------------|-------| +| Framework-level leasing | Medium | Should Runner have built-in lease support? 
Would require LeaseManager abstraction | +| SessionService lease methods | Medium | Natural fit but requires interface changes | +``` + +--- + +--- + +## Comment 4: Cross-Process Durability Clarification + +**From:** ADK Team +**Date:** 2026-02-02 + +**Reference:** Section 1.2 - "Cross-process durability: state lost if the process dies" + +**Comment:** +> "Could you elaborate on this? I think agent state is persisted in the event and the event will be persisted in the selected session service." + +--- + +### Response + +You're correct that session events are persisted in the SessionService. Let me clarify what "state lost" means in the context of long-running tasks. + +#### What IS Preserved (SessionService Events) + +| Data | Preserved? | Location | +|------|------------|----------| +| User messages | Yes | Session events | +| Agent responses | Yes | Session events | +| Tool call records | Yes | Session events (tool name, args, result) | +| LLM conversation context | Yes | Replayable from events | + +#### What May NOT Be Preserved (or Not Usable) + +| Data | Preserved? | Issue | +|------|------------|-------| +| In-flight tool execution | **No** | Process dies mid-tool-call | +| External job handles | **Partial** | Job ID in event, but no reconciliation structure | +| Multi-step operation progress | **No** | "I'm on step 3 of 7" not tracked | +| Agent's execution plan | **No** | Task graph, priorities, dependencies | +| Partial aggregated results | **No** | "Scanned 30 of 50 tables, found X so far" | +| Workspace files in progress | **No** | Draft reports, intermediate artifacts | + +#### Concrete Example: 50-Table PII Scan + +**Scenario:** Agent is scanning 50 BigQuery tables for PII. Process dies after completing 30 tables. + +**With SessionService only:** + +``` +Events stored: + - User: "Scan all tables for PII" + - Agent: "I'll scan these 50 tables..." + - ToolCall: scan_table("table_1") → {findings: [...]} + - ToolCall: scan_table("table_2") → {findings: [...]} + ... + - ToolCall: scan_table("table_30") → {findings: [...]} + - [PROCESS DIES HERE] +``` + +On restart: +- Events replay to LLM ✓ +- LLM sees 30 tool calls completed ✓ +- But: **LLM must re-deduce** which tables remain +- But: **No structured job ledger** for reconciliation +- But: **Aggregated findings** must be re-computed from events +- Risk: **LLM may miscount** or re-scan tables + +**With Checkpoint + SessionService:** + +``` +Checkpoint stored: + { + "job_ledger": { + "table_1": {"status": "complete", "findings": 3}, + "table_2": {"status": "complete", "findings": 0}, + ... + "table_30": {"status": "complete", "findings": 5}, + "table_31": {"status": "pending"}, + ... 
+ "table_50": {"status": "pending"} + }, + "aggregated_findings": { + "total_tables_scanned": 30, + "total_findings": 47, + "findings_by_type": {"email": 20, "ssn": 15, "phone": 12} + }, + "execution_plan": { + "current_phase": "scanning", + "next_table_index": 31 + } + } +``` + +On restart: +- Load checkpoint ✓ +- Know exactly which tables remain ✓ +- Reconcile with BigQuery job states ✓ +- Continue with aggregated state intact ✓ +- No LLM re-deduction needed ✓ + +#### The Key Distinction + +| Aspect | Session Events | Checkpoint State | +|--------|----------------|------------------| +| Purpose | LLM conversation context | Execution state recovery | +| Structure | Append-only event stream | Point-in-time snapshot | +| Recovery mode | Replay events to LLM | Load structured state | +| External jobs | Tool call records | Reconcilable job ledger | +| Aggregations | Must re-compute from events | Pre-computed, ready to use | +| Reliability | LLM must re-deduce state | Deterministic restoration | + +#### When Session Events Are Sufficient + +Session events alone work well for: +- Short conversations (< 5 min) +- Simple tool calls (no external async jobs) +- Stateless operations (each tool call independent) +- Human-in-the-loop flows (human provides continuity) + +#### When Checkpoints Add Value + +Checkpoints are valuable for: +- Long-running operations (hours/days) +- External async jobs (BigQuery, Cloud Build, ML training) +- Multi-step plans with dependencies +- Aggregated/computed state (partial results) +- Deterministic recovery (no LLM re-deduction) + +--- + +### End-to-End Concrete Example: Enterprise PII Compliance Audit + +Let me walk through a complete scenario showing what the checkpoint approach enables that event logging alone cannot. + +#### Scenario Setup + +**Task:** Scan 100 BigQuery tables across 5 datasets for PII (emails, SSNs, phone numbers) to generate a compliance report. + +**Environment:** +- Cloud Run with 60-minute timeout +- Each table scan takes 2-10 minutes (BigQuery job) +- Total expected runtime: ~8 hours +- Multiple Cloud Run instances may be involved + +**User Request:** +``` +"Scan all tables in the customer_data, transactions, analytics, +logs, and marketing datasets for PII. Generate a compliance report +with findings by table and recommendations." +``` + +--- + +#### Timeline: What Happens + +``` +Hour 0:00 - Agent starts + - Discovers 100 tables across 5 datasets + - Creates execution plan: scan tables, aggregate findings, generate report + - Begins scanning tables + +Hour 2:30 - Progress checkpoint + - 35 tables scanned + - 127 PII findings so far + - 15 BigQuery jobs completed, 2 running, 83 pending + +Hour 3:15 - PROCESS DIES (Cloud Run timeout/crash) + - 2 BigQuery jobs still running in the cloud + - Agent process terminated +``` + +--- + +#### Path A: Event Logging Only (Current ADK) + +**Events stored in SessionService:** +```json +[ + {"type": "user_message", "content": "Scan all tables..."}, + {"type": "agent_message", "content": "I'll scan 100 tables..."}, + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.users"}, "result": {"job_id": "job_001", "status": "submitted"}}, + {"type": "tool_call", "tool": "get_job_result", "args": {"job_id": "job_001"}, "result": {"findings": [{"type": "email", "column": "contact_email", "count": 15000}]}}, + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.orders"}, "result": {"job_id": "job_002", "status": "submitted"}}, + // ... 
70 more tool call events ... + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "analytics.events"}, "result": {"job_id": "job_037", "status": "submitted"}}, + // PROCESS DIES - no more events +] +``` + +**On Restart (New Cloud Run Instance):** + +1. **Events replay to LLM** - LLM sees conversation history ✓ + +2. **LLM must re-deduce state:** + ``` + LLM thinking: "Looking at these events... I see job_001 through job_037 + were submitted. Some have results, some don't. Let me figure out what's done..." + ``` + +3. **Problems:** + + | Problem | Impact | + |---------|--------| + | **Job status unknown** | job_036, job_037 may have completed while process was dead - LLM doesn't know | + | **No structured ledger** | LLM must parse 70+ events to determine table status | + | **Aggregation lost** | "127 findings so far" must be re-counted from events | + | **May re-submit jobs** | LLM might re-scan tables it already scanned | + | **May miss completed jobs** | Jobs that finished during downtime have results waiting | + | **Non-deterministic** | Different LLM calls may reach different conclusions | + +4. **Likely LLM Response:** + ``` + "I see we were scanning tables for PII. Let me check what's been done... + [Spends tokens re-parsing events] + I think tables 1-35 are done. Let me continue with table 36... + + Actually, I'm not sure if job_036 completed. Let me re-submit it to be safe." + ``` + +5. **Result:** + - Duplicate BigQuery jobs (wasted cost) + - Inconsistent findings count + - Report may have duplicates or gaps + - ~30 minutes spent "figuring out" state + +--- + +#### Path B: Checkpoint + Event Logging (Proposed) + +**Checkpoint stored (in addition to events):** +```json +{ + "checkpoint_seq": 15, + "created_at": "2026-02-02T05:30:00Z", + + "execution_plan": { + "phase": "scanning", + "total_tables": 100, + "tables_completed": 35, + "tables_in_progress": 2, + "tables_pending": 63 + }, + + "job_ledger": { + "job_001": {"table": "customer_data.users", "status": "complete", "findings": 3}, + "job_002": {"table": "customer_data.orders", "status": "complete", "findings": 0}, + // ... jobs 3-35: complete ... + "job_036": {"table": "analytics.sessions", "status": "running", "submitted_at": "2026-02-02T05:28:00Z"}, + "job_037": {"table": "analytics.events", "status": "running", "submitted_at": "2026-02-02T05:29:00Z"} + }, + + "aggregated_findings": { + "total_findings": 127, + "by_type": {"email": 45, "ssn": 32, "phone": 28, "address": 22}, + "by_dataset": {"customer_data": 67, "transactions": 35, "analytics": 25}, + "tables_with_pii": ["customer_data.users", "customer_data.profiles", "..."] + }, + + "pending_tables": [ + "analytics.pageviews", + "logs.access_logs", + // ... 63 more tables ... + ] +} +``` + +**On Restart (New Cloud Run Instance):** + +1. **Load checkpoint** - Deterministic state restoration ✓ + +2. **Reconcile with BigQuery:** + ```python + # Automatic reconciliation + for job_id, job_meta in checkpoint["job_ledger"].items(): + if job_meta["status"] == "running": + actual_status = bq_client.get_job(job_id).state + if actual_status == "DONE": + # Job completed while we were dead - fetch results + results = fetch_results(job_id) + update_findings(results) + job_meta["status"] = "complete" + ``` + +3. 
**Result of reconciliation:** + ``` + Checkpoint loaded: 35 tables complete, 2 in-progress + Reconciliation: job_036 DONE (found 5 PII), job_037 DONE (found 2 PII) + Updated state: 37 tables complete, 134 total findings + Remaining: 63 tables + + Resuming scan from table 38... + ``` + +4. **Agent continues seamlessly:** + - No duplicate jobs + - No re-parsing events + - Findings aggregation intact + - Deterministic, reliable + - Resume took ~5 seconds + +--- + +#### Side-by-Side Comparison + +| Aspect | Events Only | Checkpoint + Events | +|--------|-------------|---------------------| +| **Recovery time** | ~30 min (LLM re-parsing) | ~5 sec (load + reconcile) | +| **Duplicate jobs** | Likely (LLM uncertainty) | None (ledger prevents) | +| **Missed job results** | Possible | None (reconciliation catches) | +| **Findings accuracy** | May have errors | Exact (pre-aggregated) | +| **Token cost** | High (re-process events) | Low (structured state) | +| **Determinism** | No (LLM-dependent) | Yes (explicit state) | +| **Total runtime** | ~10 hours (retries, confusion) | ~8 hours (clean resume) | + +--- + +#### What Checkpoint Enables That Events Cannot + +1. **Authoritative Job Reconciliation** + ``` + Events: "job_036 was submitted" (but is it done now?) + Checkpoint: "job_036 status=running" → reconcile → "actually DONE, here are results" + ``` + +2. **Pre-Aggregated State** + ``` + Events: Count findings from 70 tool_call results + Checkpoint: {"total_findings": 127, "by_type": {...}} + ``` + +3. **Explicit Execution Plan** + ``` + Events: LLM must re-deduce "what was I doing?" + Checkpoint: {"phase": "scanning", "tables_completed": 35, "tables_pending": 63} + ``` + +4. **Idempotent Resume** + ``` + Events: May or may not re-submit jobs (LLM decides) + Checkpoint: Never re-submits (ledger tracks all jobs) + ``` + +5. **Multi-Instance Coordination** + ``` + Events: Two instances might both try to continue + Checkpoint: Lease ensures only one instance resumes + ``` + +--- + +#### Cost Impact Example + +| Metric | Events Only | Checkpoint | +|--------|-------------|------------| +| BigQuery jobs submitted | 115 (15 duplicates) | 100 (exact) | +| BQ job cost @ $5/job | $575 | $500 | +| Cloud Run time | 10 hours | 8 hours | +| Cloud Run cost @ $0.10/hr | $1.00 | $0.80 | +| LLM tokens for recovery | ~50,000 | ~1,000 | +| LLM cost @ $0.01/1K | $0.50 | $0.01 | +| **Total extra cost** | **$75.50** | **$0** | + +For enterprise workloads running daily, this adds up significantly. + +--- + +### Suggested Design Doc Update + +Revise Section 1.2 limitation description: + +**From:** +> "Cross-process durability: state lost if the process dies" + +**To:** +> "Cross-process durability: While session events persist conversation history, structured execution state (job ledgers, aggregated results, execution plans) is not captured in a form that enables deterministic recovery. On restart, the LLM must re-deduce state from event history, which may be unreliable for complex multi-step operations." + +Add clarification table to Section 1.2: + +```markdown +**Clarification: Session Events vs. 
Checkpoint State** + +| Recovery Need | Session Events | Checkpoint | +|---------------|----------------|------------| +| Conversation context | ✓ Sufficient | ✓ | +| External job reconciliation | ✗ Manual | ✓ Structured ledger | +| Multi-step progress tracking | ✗ LLM re-deduces | ✓ Explicit state | +| Aggregated partial results | ✗ Re-compute | ✓ Pre-computed | +| Deterministic recovery | ✗ LLM-dependent | ✓ Guaranteed | +``` + +--- + +## Updated Open Questions for ADK Team + +1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset? +2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both SessionService and CheckpointStore? +3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail? +4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage? +5. **ArtifactService unification**: Should we extend `BaseArtifactService` with checkpoint-specific methods in v2? +6. **Shared bucket**: Can checkpoints share a GCS bucket with artifacts, or should they be separate? +7. **Framework-level leasing**: Should `Runner` have optional built-in lease management? Or should `SessionService` have lease methods? +8. **Lease backend standardization**: If leasing becomes a framework feature, what backends should be supported (BQ, Firestore, Redis, DB row locks)? +9. **Event-based recovery**: Is there interest in adding structured "execution state" events to SessionService as an alternative to separate checkpoints? diff --git a/contributing/samples/long_running_task/demo_server.py b/contributing/samples/long_running_task/demo_server.py new file mode 100644 index 0000000000..1715c1f11b --- /dev/null +++ b/contributing/samples/long_running_task/demo_server.py @@ -0,0 +1,435 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Custom demo server with checkpoint visualization UI.""" + +import asyncio +import json +import os +import uuid +from datetime import datetime +from pathlib import Path +from typing import Any, Optional + +from fastapi import FastAPI, HTTPException, Request +from fastapi.middleware.cors import CORSMiddleware +from fastapi.responses import HTMLResponse, JSONResponse +from fastapi.staticfiles import StaticFiles +from pydantic import BaseModel +import uvicorn + +from google.adk.durable import BigQueryCheckpointStore + +# Configuration +PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "test-project-0728-467323") +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" + +# Initialize checkpoint store +checkpoint_store = BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, +) + +# In-memory task state for demo +active_tasks: dict[str, dict] = {} + +app = FastAPI(title="ADK Durable Session Demo") + +# CORS +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], +) + + +class TaskRequest(BaseModel): + task_type: str # "sentiment", "anomaly", "trend", "scan" + duration_seconds: int = 60 + + +class ResumeRequest(BaseModel): + session_id: str + + +@app.get("/", response_class=HTMLResponse) +async def root(): + """Serve the demo UI.""" + html_path = Path(__file__).parent / "demo_ui.html" + if html_path.exists(): + return HTMLResponse(content=html_path.read_text()) + return HTMLResponse(content="
Demo UI not found
") + + +@app.get("/api/sessions") +async def list_sessions(): + """List all sessions from BigQuery.""" + try: + client = checkpoint_store._get_bq_client() + query = f""" + SELECT session_id, status, agent_name, current_checkpoint_seq, + created_at, updated_at + FROM `{checkpoint_store._sessions_table_id}` + ORDER BY updated_at DESC + LIMIT 20 + """ + results = client.query(query).result() + sessions = [] + for row in results: + sessions.append({ + "session_id": row.session_id, + "status": row.status, + "agent_name": row.agent_name, + "checkpoint_seq": row.current_checkpoint_seq, + "created_at": row.created_at.isoformat() if row.created_at else None, + "updated_at": row.updated_at.isoformat() if row.updated_at else None, + }) + return {"sessions": sessions} + except Exception as e: + return {"sessions": [], "error": str(e)} + + +@app.get("/api/checkpoints/{session_id}") +async def list_checkpoints(session_id: str): + """List checkpoints for a session.""" + try: + checkpoints = await checkpoint_store.list_checkpoints( + session_id=session_id, limit=20 + ) + return { + "checkpoints": [ + { + "checkpoint_seq": cp.checkpoint_seq, + "created_at": cp.created_at.isoformat() if cp.created_at else None, + "trigger": cp.trigger, + "size_bytes": cp.size_bytes, + "gcs_uri": cp.gcs_state_uri, + "agent_state": cp.agent_state, + } + for cp in checkpoints + ] + } + except Exception as e: + return {"checkpoints": [], "error": str(e)} + + +@app.post("/api/task/start") +async def start_task(request: TaskRequest): + """Start a new long-running task with checkpointing.""" + session_id = f"demo-{uuid.uuid4().hex[:8]}" + + # Create session in BigQuery + try: + session = await checkpoint_store.create_session( + session_id=session_id, + agent_name="demo_agent", + metadata={"task_type": request.task_type} + ) + except Exception as e: + raise HTTPException(status_code=500, detail=f"Failed to create session: {e}") + + # Initialize task state + active_tasks[session_id] = { + "task_type": request.task_type, + "status": "running", + "progress": 0, + "total_duration": request.duration_seconds, + "records_processed": 0, + "insights_found": 0, + "checkpoints": [], + "start_time": datetime.now().isoformat(), + "should_fail": False, + "failed_at": None, + "final_output": None, + } + + # Start background task + asyncio.create_task(run_task_with_checkpoints(session_id, request.duration_seconds)) + + return { + "session_id": session_id, + "status": "started", + "message": f"Started {request.task_type} analysis task" + } + + +@app.post("/api/task/fail/{session_id}") +async def simulate_failure(session_id: str): + """Simulate a task failure.""" + if session_id not in active_tasks: + raise HTTPException(status_code=404, detail="Task not found") + + active_tasks[session_id]["should_fail"] = True + return {"status": "failure_triggered", "session_id": session_id} + + +@app.post("/api/task/resume") +async def resume_task(request: ResumeRequest): + """Resume a task from checkpoint.""" + session_id = request.session_id + + # Read the latest checkpoint + result = await checkpoint_store.read_latest_checkpoint(session_id=session_id) + if not result: + raise HTTPException(status_code=404, detail="No checkpoint found") + + checkpoint, state_blob = result + state = json.loads(state_blob.decode('utf-8')) + + # Get session info + session = await checkpoint_store.get_session(session_id=session_id) + if not session: + raise HTTPException(status_code=404, detail="Session not found") + + # Restore task state + active_tasks[session_id] = { + 
"task_type": state.get("task_type", "unknown"), + "status": "running", + "progress": state.get("progress", 0), + "total_duration": state.get("total_duration", 60), + "records_processed": state.get("records_processed", 0), + "insights_found": state.get("insights_found", 0), + "checkpoints": state.get("checkpoints", []), + "start_time": state.get("start_time"), + "resumed_from": checkpoint.checkpoint_seq, + "should_fail": False, + "failed_at": None, + } + + # Calculate remaining duration + remaining = active_tasks[session_id]["total_duration"] * (1 - active_tasks[session_id]["progress"] / 100) + + # Resume background task + asyncio.create_task(run_task_with_checkpoints(session_id, int(remaining), resume=True)) + + return { + "session_id": session_id, + "status": "resumed", + "resumed_from_checkpoint": checkpoint.checkpoint_seq, + "progress": active_tasks[session_id]["progress"], + "message": f"Resumed from checkpoint #{checkpoint.checkpoint_seq}" + } + + +@app.get("/api/task/status/{session_id}") +async def get_task_status(session_id: str): + """Get current task status.""" + if session_id not in active_tasks: + # Try to get from BigQuery + session = await checkpoint_store.get_session(session_id=session_id) + if session: + return { + "session_id": session_id, + "status": session.status, + "checkpoint_seq": session.current_checkpoint_seq, + "from_db": True + } + raise HTTPException(status_code=404, detail="Task not found") + + return { + "session_id": session_id, + **active_tasks[session_id] + } + + +async def run_task_with_checkpoints(session_id: str, duration: int, resume: bool = False): + """Run a long-running task with periodic checkpoints.""" + import random + + task = active_tasks.get(session_id) + if not task: + return + + checkpoint_interval = 10 # Checkpoint every 10 seconds + start_progress = task["progress"] if resume else 0 + + for elapsed in range(0, duration, checkpoint_interval): + # Check if we should fail + if task.get("should_fail"): + task["status"] = "failed" + task["failed_at"] = datetime.now().isoformat() + await checkpoint_store.update_session_status( + session_id=session_id, status="failed" + ) + return + + # Simulate work + await asyncio.sleep(min(checkpoint_interval, duration - elapsed)) + + # Update progress + progress = start_progress + ((elapsed + checkpoint_interval) / duration) * (100 - start_progress) + task["progress"] = min(progress, 100) + task["records_processed"] += random.randint(50000, 150000) + task["insights_found"] += random.randint(1, 3) + + # Get current checkpoint seq + session = await checkpoint_store.get_session(session_id=session_id) + next_seq = (session.current_checkpoint_seq if session else 0) + 1 + + # Create checkpoint + state_data = { + "task_type": task["task_type"], + "progress": task["progress"], + "total_duration": task["total_duration"], + "records_processed": task["records_processed"], + "insights_found": task["insights_found"], + "checkpoints": task["checkpoints"], + "start_time": task["start_time"], + } + + try: + checkpoint = await checkpoint_store.write_checkpoint( + session_id=session_id, + checkpoint_seq=next_seq, + state_blob=json.dumps(state_data).encode('utf-8'), + agent_state={"progress": task["progress"], "step": f"checkpoint_{next_seq}"}, + trigger="periodic", + ) + + task["checkpoints"].append({ + "seq": checkpoint.checkpoint_seq, + "time": datetime.now().isoformat(), + "progress": task["progress"], + }) + except Exception as e: + print(f"Checkpoint failed: {e}") + + # Task completed - Generate final output based on task 
type + task["status"] = "completed" + task["progress"] = 100 + + # Generate realistic final output + task_type = task.get("task_type", "analysis") + records = task["records_processed"] + insights = task["insights_found"] + + if task_type == "sentiment": + task["final_output"] = { + "title": "Sentiment Analysis Report", + "summary": f"Analyzed {records:,} text records across multiple data sources.", + "results": { + "overall_sentiment": "72% Positive", + "positive_records": int(records * 0.72), + "neutral_records": int(records * 0.18), + "negative_records": int(records * 0.10), + "confidence_score": 0.94, + }, + "key_findings": [ + "Strong positive sentiment around product quality", + "Minor concerns about delivery times (8% of negative)", + "Customer service mentions trending upward (+15%)", + f"Identified {insights} actionable insights for improvement", + ], + "top_themes": ["quality", "value", "service", "speed", "reliability"], + "recommendation": "Focus on delivery optimization to improve overall sentiment score by estimated 5-8%.", + } + elif task_type == "anomaly": + task["final_output"] = { + "title": "Anomaly Detection Report", + "summary": f"Scanned {records:,} data points for unusual patterns.", + "results": { + "total_anomalies": insights, + "critical_anomalies": max(1, insights // 4), + "warning_anomalies": insights // 2, + "info_anomalies": insights - insights // 4 - insights // 2, + "false_positive_rate": "2.3%", + }, + "key_findings": [ + f"Detected {insights} anomalies requiring attention", + "3 critical anomalies in transaction processing", + "Seasonal pattern identified in Q3 data", + "Root cause: 67% related to system load spikes", + ], + "anomaly_clusters": [ + {"type": "Transaction Volume Spike", "count": 5, "severity": "high"}, + {"type": "Response Time Degradation", "count": 8, "severity": "medium"}, + {"type": "Error Rate Increase", "count": 3, "severity": "high"}, + ], + "recommendation": "Investigate transaction processing during peak hours. Consider auto-scaling policies.", + } + elif task_type == "trend": + task["final_output"] = { + "title": "Trend Analysis Report", + "summary": f"Analyzed {records:,} historical data points for patterns.", + "results": { + "trend_direction": "Upward", + "growth_rate": "15.3% MoM", + "seasonality_detected": True, + "forecast_confidence": 0.89, + }, + "key_findings": [ + "Strong upward trend detected over past 6 months", + "15.3% month-over-month growth rate", + "Seasonal peaks in Q4 (holiday season)", + f"Identified {insights} significant trend changes", + ], + "forecast": { + "next_month": "+12% projected", + "next_quarter": "+38% projected", + "confidence_interval": "±8%", + }, + "recommendation": "Prepare for Q4 surge. 
Current trajectory suggests 2x capacity needed by year end.", + } + elif task_type == "clustering": + task["final_output"] = { + "title": "Data Clustering Report", + "summary": f"Clustered {records:,} data points into meaningful segments.", + "results": { + "clusters_identified": 5, + "silhouette_score": 0.78, + "largest_cluster_size": "45%", + "smallest_cluster_size": "8%", + }, + "key_findings": [ + "Identified 5 distinct customer segments", + "Largest segment (45%): 'Value Seekers'", + "High-value segment (12%): 'Premium Customers'", + f"Found {insights} key differentiating factors", + ], + "clusters": [ + {"name": "Value Seekers", "size": "45%", "description": "Price-sensitive, bulk buyers"}, + {"name": "Premium Customers", "size": "12%", "description": "High-spend, quality-focused"}, + {"name": "Occasional Shoppers", "size": "23%", "description": "Infrequent, event-driven"}, + {"name": "New Users", "size": "12%", "description": "Recent signups, exploring"}, + {"name": "Churning Risk", "size": "8%", "description": "Declining engagement"}, + ], + "recommendation": "Target 'Churning Risk' segment with retention campaign. Estimated 15% recovery rate.", + } + else: + task["final_output"] = { + "title": "Analysis Complete", + "summary": f"Processed {records:,} records successfully.", + "results": {"records_processed": records, "insights_found": insights}, + "key_findings": [f"Found {insights} notable patterns in the data"], + } + + task["final_output"]["metadata"] = { + "session_id": session_id, + "task_type": task_type, + "duration_seconds": task["total_duration"], + "checkpoints_created": len(task["checkpoints"]), + "completed_at": datetime.now().isoformat(), + } + + await checkpoint_store.update_session_status( + session_id=session_id, status="completed" + ) + + +if __name__ == "__main__": + uvicorn.run(app, host="0.0.0.0", port=8080) diff --git a/contributing/samples/long_running_task/demo_ui.html b/contributing/samples/long_running_task/demo_ui.html new file mode 100644 index 0000000000..60ac43ac35 --- /dev/null +++ b/contributing/samples/long_running_task/demo_ui.html @@ -0,0 +1,832 @@ + + + + + + ADK Durable Session Demo - Real Checkpoint Visualization + + + + +
(demo_ui.html page body; the original HTML/CSS/JS markup is not preserved in this extract. Recoverable page content:)

- Header: "ADK Durable Session Demo" with the tagline "Real Checkpoint-Based Persistence for Long-Running Agent Tasks" and the banner "All tasks are REAL - Writing to BigQuery & GCS".
- Infrastructure cards: 🗄️ BigQuery (`test-project-0728-467323.adk_metadata`), ☁️ Cloud Storage (`gs://test-project-0728-467323-adk-checkpoints`), and 🔐 SHA-256 Verified ("Checkpoint integrity guaranteed").
- 🎯 "Choose a Real Task" panel ("Each task simulates a real long-running data processing job with actual checkpoints saved to GCP"):
  - 😊 Sentiment Analysis: analyzes text data to determine emotional tone (customer reviews, social media posts, feedback). Output: positive/negative ratios, key themes, trend analysis.
  - 🔍 Anomaly Detection: scans datasets for unusual patterns and outliers (potential fraud, system errors, data quality issues). Output: anomaly count, severity levels, root cause hints.
  - 📈 Trend Analysis: identifies patterns and trends over time and forecasts future values from historical data. Output: growth rates, seasonal patterns, forecasts.
  - 🎨 Data Clustering: groups similar data points (customer segments, content categories). Output: cluster count, segment profiles, separation metrics.
  - Duration slider (default 60s), with notes "Checkpoints saved every 10 seconds" and "Task will create real checkpoints in BigQuery & GCS".
- 📊 Live Task Monitor panel (idle state: "No task running" / "Select a task and click Start").
- 💥 "Simulate Crash" control: "Simulate a server crash to test checkpoint recovery. The task state is safely stored!"
- 📍 Checkpoint Timeline panel ("Real checkpoints being written to BigQuery & GCS"; empty state: "No checkpoints yet" / "Start a task to see real checkpoints appear").
- 🔧 "How Durable Checkpointing Works" steps: 1️⃣ Task Runs (long-running analysis processes data in chunks), 2️⃣ Checkpoint Created (every 10s, state is serialized and compressed), 3️⃣ Two-Phase Commit (blob to GCS, then metadata to BigQuery), 4️⃣ Recovery Ready (if a crash occurs, resume from the last checkpoint).
- 📊 "Verify in BigQuery" sample query:
      SELECT session_id, checkpoint_seq, created_at, trigger, size_bytes
      FROM `test-project-0728-467323.adk_metadata.checkpoints`
      ORDER BY created_at DESC LIMIT 10;
- 📋 "Real Sessions in BigQuery" table (columns: Session ID, Status, Checkpoints, Last Updated, Actions) with a "Select" action to resume any failed session; initial state "Loading from BigQuery...".
+ + + + diff --git a/contributing/samples/long_running_task/long_running_task_design.md b/contributing/samples/long_running_task/long_running_task_design.md new file mode 100644 index 0000000000..38877fae7e --- /dev/null +++ b/contributing/samples/long_running_task/long_running_task_design.md @@ -0,0 +1,1448 @@ +# Durable Session Persistence for Long-Horizon ADK Agents (BigQuery-first, Generalizable Framework Capability) + +**Author:** Haiyuan Cao +**Status:** Implemented (v1 core functionality) +**Target audience:** ADK engineering leads, BigQuery Agent Analytics stakeholders, SRE/Security reviewers +**Last updated:** 2026-02-02 +**Revision:** 3.0 (implementation complete, demo deployed) + +--- + +## Implementation Status + +| Component | Status | Location | +|-----------|--------|----------| +| `DurableSessionConfig` | Implemented | `src/google/adk/durable/config.py` | +| `CheckpointableAgentState` | Implemented | `src/google/adk/durable/checkpointable_state.py` | +| `DurableSessionStore` (ABC) | Implemented | `src/google/adk/durable/stores/base_checkpoint_store.py` | +| `BigQueryCheckpointStore` | Implemented | `src/google/adk/durable/stores/bigquery_checkpoint_store.py` | +| `WorkspaceSnapshotter` | Implemented | `src/google/adk/durable/workspace_snapshotter.py` | +| App integration | Implemented | `src/google/adk/apps/app.py` | +| Demo agent | Implemented | `contributing/samples/long_running_task/` | +| Demo UI (Cloud Run) | Deployed | `https://durable-demo-201486563047.us-central1.run.app` | + +### Live Demo + +A fully functional demo is deployed on Cloud Run showcasing: +- Real-time checkpoint visualization +- Task failure simulation +- Checkpoint-based recovery +- BigQuery metadata queries +- Final task output display + +**URL:** https://durable-demo-201486563047.us-central1.run.app + +**Infrastructure:** +- BigQuery Dataset: `test-project-0728-467323.adk_metadata` +- GCS Bucket: `gs://test-project-0728-467323-adk-checkpoints` +- SHA-256 checkpoint integrity verification + +--- + +## 0. Executive One-Pager (for PM/Director skim) + +### Problem + +ADK agents struggle with BigQuery's **async, long-running workloads**. While ADK has experimental in-process resumability (`ResumabilityConfig`), it lacks: +- **Cross-process durability**: state lost if the process dies +- **External event triggers**: no Pub/Sub integration for job completion +- **Enterprise auditability**: no SQL-queryable checkpoint history +- **Cloud job reconciliation**: no authoritative state sync with BigQuery jobs + +Sandboxes time out (the "12-minute barrier" in typical cloud deployments), causing repeated cold starts, redundant metadata scans, and risk of duplicate job submissions. 
+ +### Solution + +**Extend** ADK's existing resumability with a **Durable Session Persistence Layer**: + +* Extend lifecycle with durable **PAUSED** state (cross-process, not just in-memory) +* Persist **logical checkpoints** (plan + job ledger + tool ledger) and optionally workspace artifacts +* Store control-plane metadata + audit trail in **BigQuery** +* Store large blobs (checkpoint/workspace) in **GCS** +* Resume on external events (BigQuery job completion → Pub/Sub) with **authoritative reconciliation** + +### Key benefits + +* **Reliability:** deterministic "warm start"; prevents duplicate job fleets +* **Cost:** no idle compute while waiting; typical storage **< $0.01/session-day paused** (see [Section 21: Cost Estimation](#21-cost-estimation)) +* **Enterprise:** SQL auditability (inspect what the agent did at hour 4 of 12) +* **Strategic:** differentiates ADK by enabling **cloud job execution continuity + enterprise audit**, not just "reasoning continuity" + +### Ask / decisions + +1. Review `CheckpointableAgentState` + integration with existing `ResumabilityConfig` +2. Confirm reference infra (BQ + GCS) and leasing approach +3. Select pilot (recommended: PII scanner) + **Decision:** Durable PAUSED as extension to existing resumability vs separate plugin + +### Proposed timeline (8 weeks to pilot) + +* Weeks 1–2: API + storage/lease decisions, integration design with existing resumability +* Weeks 3–4: reference store + resume skeleton +* Weeks 5–8: pilot + metrics +* Week 9+: iterate and choose rollout path + +--- + +## 1. Background & Motivation + +### 1.1 The "12-minute barrier" in cloud data workflows + +BigQuery workloads are inherently asynchronous and may run from minutes to hours. In typical cloud sandbox deployments (Cloud Run, Cloud Functions, GKE with autoscaling), agents face timeout constraints: + +* **Cloud Run:** default 5-minute timeout, max 60 minutes +* **Cloud Functions:** default 1-minute timeout, max 9 minutes (1st gen) or 60 minutes (2nd gen) +* **Vertex AI Agent Builder:** session timeouts vary by deployment mode + +When these timeouts occur during long-running BigQuery jobs, agents: + +* lose job IDs and progress state (unless using existing resumability) +* repeat metadata scans and tool calls +* risk re-submitting already-running jobs + +### 1.2 Existing ADK Resumability (Current State) + +ADK already has an **experimental resumability feature** (`src/google/adk/apps/app.py`): + +```python +@experimental +class ResumabilityConfig(BaseModel): + """The "resumability" in ADK refers to the ability to: + 1. pause an invocation upon a long-running function call. + 2. resume an invocation from the last event, if it's paused or failed midway + through. + + Note: ADK resumes the invocation in a best-effort manner: + 1. Tool call to resume needs to be idempotent because we only guarantee + an at-least-once behavior once resumed. + 2. Any temporary / in-memory state will be lost upon resumption. 
+ """ + is_resumable: bool = False +``` + +**Current capabilities:** +| Feature | Status | Location | +|---------|--------|----------| +| `ResumabilityConfig.is_resumable` | Experimental | `src/google/adk/apps/app.py:42-58` | +| `InvocationContext.should_pause_invocation()` | Implemented | `src/google/adk/agents/invocation_context.py:355-389` | +| `long_running_tool_ids` tracking | Implemented | `src/google/adk/events/event.py` | +| Resume from last event | Implemented | `src/google/adk/runners.py:1294+` | + +**Current limitations (gaps this design addresses):** +| Limitation | Impact | +|------------|--------| +| In-memory only | State lost on process death/restart | +| No external event triggers | Cannot wake on Pub/Sub, webhooks | +| No cross-process persistence | Cannot resume in different runner instance | +| No enterprise audit trail | No SQL-queryable checkpoint history | +| No cloud job reconciliation | No authoritative sync with BQ job states | + +### 1.3 Dogfooding BigQuery Agent Analytics + +Using BigQuery as a durable control plane is strategically aligned with the BigQuery Agent Analytics direction: + +* **Dogfooding:** demonstrates BQ-based agent observability capabilities +* **Auditability:** admins can query checkpoints directly ("what was the agent doing at hour 4?") +* **SQL robustness:** BigQuery idioms (e.g., ARRAY_AGG latest-per-session) make operational queries easy and efficient + +--- + +## 2. Problem Statement + +**This design extends ADK's existing resumability** to address gaps in cross-process durability and enterprise scenarios. + +Current ADK resumability is optimized for **in-process pause/resume**: +* Works within a single runner process lifecycle +* State persisted to session service (SQLite, Postgres, etc.) +* No external event-driven wake-up mechanism +* No BigQuery-native audit trail + +**Gaps this design addresses:** + +| Gap | Current State | Proposed Solution | +|-----|---------------|-------------------| +| Cross-process durability | State in session DB, but no checkpoint snapshots | BQ metadata + GCS blobs | +| External event triggers | Manual resume via API call | Pub/Sub → Resumer service | +| Cloud job reconciliation | App must track job IDs manually | Authoritative ledger reconciliation | +| Enterprise audit | Logs only | SQL-queryable BQ tables | +| Fleet observability | Per-session queries | Cross-agent BQ analytics | + +**Net effect:** ADK's existing resumability handles the "pause on long tool call" case well, but is not sufficient for BigQuery job fleets, multi-hour compliance scans, or any agentic workflow that needs **durable, cross-process, event-driven** "pause/wake/resume" loops. + +--- + +## 3. Goals & Non-Goals + +### 3.1 Goals + +1. **Extend** existing `ResumabilityConfig` to support durable, cross-process checkpoints +2. Support **hours-to-days** workflows via durable lifecycle state **PAUSED** +3. Enable **event-driven resume** (Pub/Sub/job events) with safe retries +4. Persist a deterministic **logical checkpoint**, not runtime heap snapshots +5. Provide **enterprise-grade auditability**, retention, and security posture +6. Ensure correctness via **two-phase commit**, **authoritative reconciliation**, and **lease-based resuming** +7. 
**Backward compatible** with existing ADK session services + +### 3.2 Non-Goals (v1) + +* Interpreter heap snapshot/restore (pickle/dill) — brittle across deployments and library changes +* Full microVM/container checkpointing — future work +* Replacing existing `ResumabilityConfig` — this design extends it +* Modifying existing session service implementations — new service alongside existing + +--- + +## 4. Proposed Lifecycle Model + +### 4.1 States + +Building on ADK's existing pause concept, we formalize durable states: + +* **RUNNING:** executing agent logic + tool calls +* **PAUSED:** no active compute; durable checkpoint exists in BQ+GCS; resumable via event or API +* **KILLED:** finalized; resources released; retention applies + (Optional operational outcomes: `FAILED`, `EXPIRED`.) + +### 4.2 Integration with Existing Resumability + +``` +Existing ADK Resumability Durable Session Extension +───────────────────────────── ────────────────────────────── +InvocationContext.is_resumable → DurableSessionConfig.is_durable +should_pause_invocation() → triggers checkpoint write +long_running_tool_ids → included in checkpoint ledger +Session events → replayed on resume + + BQ audit trail + + GCS checkpoint blobs + + Pub/Sub event triggers +``` + +### 4.3 "Serving → Rollout" framing + +This design shifts ADK from a request/response mindset to an **agentic rollout** model: + +* do work +* wait for environment events +* resume deterministically +* avoid compute idling + +--- + +## 5. Architecture Overview + +### 5.1 Layered checkpointing: logical → workspace → execution (future) + +**v1** explicitly adopts **Logical Checkpointing**: + +1. **Logical checkpoint (required):** plan/task graph state, job ledger, tool ledger, progress cursors +2. **Workspace snapshot (optional):** `/workspace` bundle (draft reports, code, small caches) +3. **Execution snapshot (future):** microVM/container restore + +**Rationale:** heap snapshots are notoriously fragile under code/library/version drift. Logical checkpoints remain deterministic across restarts and upgrades. 
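To make the logical layer concrete, here is a minimal sketch of what a v1 logical checkpoint payload could look like, covering the elements listed above (plan state, job ledger, tool ledger, progress cursors). The `build_logical_checkpoint` helper and its field names are illustrative assumptions, not part of the implemented API.

```python
import json
import time


def build_logical_checkpoint(plan, job_ledger, tool_ledger, cursors) -> bytes:
  """Assemble an illustrative logical checkpoint as a JSON blob.

  Illustrative sketch only; field names are assumptions, not the ADK API.
  Only deterministic, structured state is captured: no heap objects,
  no open connections, no interpreter internals.
  """
  state = {
      "state_schema_version": 1,
      "plan": plan,  # task graph / phase information
      "job_ledger": job_ledger,  # external job IDs with last known status
      "tool_ledger": tool_ledger,  # completed tool calls, for idempotent replay
      "cursors": cursors,  # progress markers, e.g. tables already scanned
      "checkpointed_at": time.time(),
  }
  return json.dumps(state, sort_keys=True).encode("utf-8")


# Example: a PII scan that has completed 35 of 100 tables.
blob = build_logical_checkpoint(
    plan={"phase": "scanning", "tables_total": 100},
    job_ledger={"job_036": {"status": "RUNNING", "consumed": False}},
    tool_ledger=[{"tool": "list_tables", "call_id": "t-001", "done": True}],
    cursors={"tables_completed": 35},
)
```

Because the payload is plain JSON, it survives process restarts, library upgrades, and host migration, which is exactly the property heap snapshots lack.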
+ +### 5.2 Control plane vs data plane (Google-scale reliability pattern) + +* **Control plane: BigQuery** + + * sessions/checkpoints/events as structured tables + * queryable summaries for auditing and fleet observability +* **Data plane: GCS** + + * checkpoint state blobs + * workspace bundles + * large artifacts (reports, samples, exports) + +### 5.3 Integration with Existing ADK Services + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────────┤ +│ App( │ +│ resumability_config=ResumabilityConfig(is_resumable=True), │ +│ durable_session_config=DurableSessionConfig( # NEW │ +│ is_durable=True, │ +│ checkpoint_store=BigQueryCheckpointStore(...), │ +│ event_source=PubSubEventSource(...), │ +│ ), │ +│ ) │ +├─────────────────────────────────────────────────────────────────┤ +│ Existing ADK Services │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ +│ │SessionService│ │ArtifactService│ │MemoryService │ │ +│ │(SQLite/PG/...)│ │(GCS/local) │ │(in-memory/vertex) │ │ +│ └──────────────┘ └──────────────┘ └──────────────────────┘ │ +├─────────────────────────────────────────────────────────────────┤ +│ NEW: Durable Session Layer │ +│ ┌──────────────────┐ ┌─────────────────┐ ┌───────────────┐ │ +│ │DurableSessionStore│ │CheckpointStore │ │ResumeService │ │ +│ │(orchestration) │ │(BQ meta+GCS blob)│ │(Pub/Sub listen)│ │ +│ └──────────────────┘ └─────────────────┘ └───────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 6. Why BigQuery as the Control Plane + +Using BigQuery as the metadata store is strategic: + +* **Auditability:** SQL query of checkpoints at any time without parsing logs +* **Fleet visibility:** query state of thousands of agents concurrently +* **Robust ops patterns:** latest-per-session via idiomatic BigQuery view is simple and performant +* **Dogfooding:** demonstrates BigQuery Agent Analytics and cross-agent observability +* **Existing infrastructure:** many ADK users already have BQ datasets for analytics + +--- + +## 7. Correctness & Failure Safety + +### 7.1 Two-phase checkpoint commit (atomic visibility) + +A checkpoint is "live" only once the **BigQuery metadata row** exists. 
+ +```python +def write_checkpoint( + session_id: str, + seq: int, + state_json: bytes, + workspace_path: str | None +) -> None: + """Two-phase checkpoint commit with error handling.""" + try: + # Phase 1: blobs to GCS (retry-safe, idempotent) + state_uri = gcs.upload( + f"checkpoints/{session_id}/{seq}/state.json", + state_json, + if_generation_match=0, # Fail if already exists + ) + workspace_uri = None + if workspace_path: + workspace_uri = gcs.upload( + f"checkpoints/{session_id}/{seq}/workspace.tar.gz", + compress_tar_gz(workspace_path), + if_generation_match=0, + ) + + # Phase 2: commit metadata in BigQuery (checkpoint becomes visible here) + bq.insert("checkpoints", { + "session_id": session_id, + "checkpoint_seq": seq, + "gcs_state_uri": state_uri, + "gcs_workspace_uri": workspace_uri, + "sha256": sha256(state_json), + "size_bytes": len(state_json), + "created_at": now(), + "trigger": "async_boundary", + "agent_state_json": extract_small_summary(state_json), + "checkpoint_fingerprint": fingerprint_checkpoint(state_json), + }) + + # Update pointer only after checkpoint metadata exists + bq.update("sessions", session_id, { + "current_checkpoint_seq": seq, + "updated_at": now(), + }) + + except GCSUploadError as e: + # Phase 1 failed - no cleanup needed, checkpoint not visible + logger.error(f"Checkpoint {seq} GCS upload failed: {e}") + raise CheckpointWriteError(f"GCS upload failed: {e}") from e + + except BigQueryInsertError as e: + # Phase 2 failed - orphan GCS blobs will be cleaned by GC + logger.error(f"Checkpoint {seq} BQ insert failed: {e}") + raise CheckpointWriteError(f"BQ insert failed: {e}") from e +``` + +**Garbage collection:** orphan GCS objects without a corresponding BQ metadata row are deleted after a grace window (default: 24 hours). + +--- + +### 7.2 Authoritative reconciliation (the core idempotency mechanism) + +On resume, do not trust events alone. Reconcile the ledger against authoritative cloud state. + +```python +def reconcile_on_resume(state: dict) -> dict: + """Reconcile job ledger against authoritative BigQuery state. + + This is the core idempotency mechanism - ensures we never + re-submit completed jobs or miss failed ones. 
+ """ + ledger = state["job_ledger"] + reconciliation_results = { + "jobs_completed": 0, + "jobs_failed": 0, + "jobs_cancelled": 0, + "jobs_still_running": 0, + } + + for job_id, meta in ledger.items(): + try: + job = bq.get_job(job_id) + except NotFoundError: + # Job was deleted or never existed + logger.warning(f"Job {job_id} not found, marking as lost") + meta["status"] = "LOST" + meta["reconciled_at"] = now() + continue + + if job.state == "DONE" and not meta.get("consumed"): + state["results"][job_id] = fetch_results(job, meta) + meta["consumed"] = True + meta["reconciled_at"] = now() + reconciliation_results["jobs_completed"] += 1 + + elif job.state == "FAILED": + handle_failed_job(job_id, job.error_result, meta, state) + reconciliation_results["jobs_failed"] += 1 + + elif job.state == "CANCELLED": + handle_cancelled_job(job_id, meta, state) + reconciliation_results["jobs_cancelled"] += 1 + + elif job.state in ("RUNNING", "PENDING"): + register_completion_callback(job_id) + reconciliation_results["jobs_still_running"] += 1 + + state["_reconciliation_results"] = reconciliation_results + return state +``` + +This is the enterprise-grade version of "remember where you left off": + +* prevents re-submitting 2-hour scans +* handles partial failures/cancellations deterministically +* turns resume into a repeatable state machine +* provides audit trail of reconciliation results + +--- + +### 7.3 Leasing & optimistic concurrency + +We must ensure only one runner resumes a session at a time. + +**BigQuery constraint:** lacks true row-level locking. BQ-based leasing is **optimistic lease acquisition (best-effort without external lock)**. If high-burst concurrency demands stronger guarantees, the pluggable lease manager can be backed by Firestore/Spanner or external single-delivery orchestration (e.g., Cloud Tasks). + +**When to use each backend:** + +| Backend | Use Case | Guarantees | +|---------|----------|------------| +| BigQuery (default) | Low-medium concurrency, cost-sensitive | Best-effort, ~100ms latency | +| Firestore | High concurrency, strong consistency needed | Strong, ~10ms latency | +| Cloud Tasks | Exactly-once delivery required | Exactly-once with dedup window | +| Spanner | Global distribution, strong consistency | Strong, multi-region | + +BQ lease acquire template: + +```sql +UPDATE `your_project.adk_metadata.sessions` +SET active_lease_id = @lease_id, + lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND), + updated_at = CURRENT_TIMESTAMP() +WHERE session_id = @session_id + AND status = 'PAUSED' + AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP()); +``` + +**Note:** BigQuery time travel (`FOR SYSTEM_TIME AS OF`) is useful for debugging historical state, but does not replace strong mutual exclusion. The "pluggable SessionLeaseManager" is the safety valve. + +--- + +## 8. ADK API Extensions (v1 contract) + +### 8.1 Core Interfaces + +```python +from abc import ABC, abstractmethod +from typing import Optional +from pydantic import BaseModel + +class CheckpointableAgentState(ABC): + """Interface for agents that support durable checkpointing. + + Extends the existing BaseAgentState pattern from + src/google/adk/agents/base_agent.py + """ + + @abstractmethod + def export_state(self) -> dict: + """Export agent state to a serializable dictionary. + + Returns: + Dictionary containing all state needed to resume. + Must be JSON-serializable. + """ + ... 
+ + @abstractmethod + def import_state(self, state: dict) -> None: + """Import agent state from a previously exported dictionary. + + Args: + state: Dictionary from a previous export_state() call. + """ + ... + + def get_state_schema_version(self) -> int: + """Return the schema version for this state format. + + Override to implement versioned state migrations. + Default: 1 + """ + return 1 + + +class WorkspaceSnapshotter: + """Handles workspace directory snapshots to/from GCS.""" + + def snapshot_to_gcs( + self, + session_id: str, + checkpoint_seq: int, + workspace_path: str = "/workspace", + max_size_bytes: int = 1 * 1024 * 1024 * 1024, # 1GB default + ) -> str: + """Snapshot workspace to GCS. + + Returns: + GCS URI of the uploaded snapshot. + + Raises: + WorkspaceTooLargeError: If workspace exceeds max_size_bytes. + """ + ... + + def restore_from_gcs(self, gcs_uri: str, workspace_path: str = "/workspace") -> None: + """Restore workspace from GCS snapshot.""" + ... + + +class DurableSessionStore(ABC): + """Abstract interface for durable checkpoint storage.""" + + @abstractmethod + def write_checkpoint( + self, + session_id: str, + checkpoint_seq: int, + state: dict, + workspace_gcs_uri: Optional[str] = None, + trigger: str = "async_boundary", + ) -> None: + """Write a checkpoint with two-phase commit.""" + ... + + @abstractmethod + def read_latest_checkpoint( + self, + session_id: str, + ) -> tuple[int, dict, Optional[str]]: + """Read the latest checkpoint for a session. + + Returns: + Tuple of (checkpoint_seq, state_dict, workspace_gcs_uri). + + Raises: + CheckpointNotFoundError: If no checkpoint exists. + """ + ... + + @abstractmethod + def list_checkpoints( + self, + session_id: str, + limit: int = 100, + ) -> list[dict]: + """List checkpoint metadata for a session.""" + ... +``` + +### 8.2 Configuration + +```python +from pydantic import BaseModel, Field +from typing import Literal, Optional + +class DurableSessionConfig(BaseModel): + """Configuration for durable session persistence. + + Works alongside existing ResumabilityConfig. + """ + + is_durable: bool = False + """Enable durable cross-process checkpointing.""" + + checkpoint_policy: Literal[ + "async_boundary", # Checkpoint when pausing for async tool (default) + "tool_call_boundary", # Checkpoint after every tool call + "superstep", # Checkpoint at agent-defined superstep boundaries + "manual", # Only checkpoint when explicitly requested + ] = "async_boundary" + """When to create checkpoints.""" + + workspace_snapshot_enabled: bool = False + """Whether to include workspace directory in checkpoints.""" + + workspace_max_size_bytes: int = Field( + default=100 * 1024 * 1024, # 100MB + description="Maximum workspace snapshot size", + ) + + checkpoint_store: Optional[DurableSessionStore] = None + """The checkpoint store implementation. 
If None, uses BigQueryCheckpointStore.""" + + lease_backend: Literal["bigquery", "firestore", "cloud_tasks"] = "bigquery" + """Backend for lease management.""" + + lease_ttl_seconds: int = Field( + default=300, # 5 minutes + description="Lease TTL before auto-release", + ) + + retry_policy: Optional[dict] = None + """Per-tool-type retry policies for failed jobs.""" +``` + +### 8.3 Checkpoint Policy Details + +| Policy | Trigger | Use Case | +|--------|---------|----------| +| `async_boundary` | `should_pause_invocation()` returns True | BigQuery jobs, external APIs (default) | +| `tool_call_boundary` | After every tool call completes | Maximum durability, higher cost | +| `superstep` | Agent calls `checkpoint_now()` | Agent controls checkpoint granularity | +| `manual` | Only via explicit API call | Testing, debugging | + +--- + +## 9. Current vs Proposed Capability Comparison + +| Feature | Current ADK (ResumabilityConfig) | Durable Session Extension | +|---------|----------------------------------|---------------------------| +| Pause on long tool call | Yes (experimental) | Yes | +| Resume from last event | Yes (in-process) | Yes (cross-process) | +| State persistence | Session service (SQLite/PG) | Session service + BQ/GCS checkpoints | +| Cross-process resume | No | Yes | +| External event triggers | No | Yes (Pub/Sub, webhooks) | +| Max job duration | Process lifetime | Practically unlimited (days/weeks) | +| Compute cost while waiting | Idle if process alive | Zero compute while PAUSED | +| Job knowledge (IDs, state) | In-memory or session state | Persisted in ledger + BQ tables | +| Recovery | Resume API call | Automatic via event + idempotent resume | +| Auditability | Logs, session events | SQL-queryable BQ control plane | +| Fleet visibility | Per-session queries | Cross-agent BQ analytics | + +--- + +## 10. Demo Scenario: Multi-Day PII Audit + +Assume discovery finds ~50 tables; agent submits **1 BigQuery job per table**. + +1. **RUNNING:** enumerate schema, prioritize, build ledger +2. **RUNNING → PAUSED:** submit job fleet, checkpoint (two-phase), mark PAUSED, release compute +3. **PAUSED (hours/days):** jobs run in BigQuery; agent consumes zero compute +4. **Resume:** Pub/Sub event → resumer acquires lease → reads checkpoint → reconciles ledger +5. **RUNNING:** process completed jobs, handle failures, submit retries if needed +6. **KILLED:** compile compliance report, write final audit rows, cleanup + +--- + +## 11. 
"Plumbing vs Logic": Why Framework-Level Support Matters + +### 11.1 Framework-level ADK support > agent-specific hacks + +This capability should live at the ADK level, not be reinvented per agent team: + +| Dimension | Specific Agent Approach | ADK Framework Approach | +|-----------|-------------------------|------------------------| +| Engineering effort | each team reimplements persistence/resume | toggled via config; solved once | +| Security/compliance | inconsistent VPC-SC/CMEK/IAM | governance baked into store/resumer | +| Observability | fragmented logs | unified BQ schema across agents | +| Skill portability | skills tied to bespoke persistence | state-aware skills via standard interface | + +### 11.2 The "plumbing" components (solve once) + +* two-phase commit +* workspace snapshotting +* durable store + GC +* resume service + idempotent event handling +* leasing/concurrency strategy +* observability/audit tables + +### 11.3 The "logic" components (agent-owned) + +* what to persist in checkpoint (`job_ledger`, `audit_cursor`, partial findings) +* retry policy decisions by job/tool type +* domain-specific analysis and reporting logic + +--- + +## 12. Generalization Beyond BigQuery (Universal Long-Horizon Primitive) + +Although the motivating example is BigQuery, the primitives are general: + +* **Ledger-based reconciliation:** any external handle can be tracked (job ID, build ID, ticket ID) +* **Workspace snapshots:** preserve files for coding/refactoring/report assembly tasks +* **Event-driven resume:** Pub/Sub triggers can represent almost any service completion webhook + +### 12.1 Non-BigQuery long-horizon scenarios + +| Task Type | Resume trigger | Ledger contents | +|-----------|----------------|-----------------| +| Cloud infra provisioning | resource-ready events | resource manifests + status | +| Software refactoring | CI completion | build IDs, test results, patch plan | +| Deep research | scheduled polling/new index event | search caches + draft outline | +| Human-in-the-loop | Slack/Chat message | approval flags + pending actions | +| ML training | training job completion | model artifacts, metrics, hyperparams | + +--- + +## 13. Alignment with Moltbot (formerly ClawBot) Architecture + +This proposal aligns strongly with the long-running daemon style popularized by Moltbot/ClawBot, especially in lifecycle/state management: + +| Feature | Moltbot/ClawBot Design | Durable ADK Design | Alignment | +|---------|------------------------|--------------------| ----------| +| Orchestration | Gateway/Coordinator routes persistent sessions | ADK Agent Runner + Resumer | High | +| Persistence | Local FS "diary files" | BQ (metadata) + GCS (blobs) | High (enterprise-grade) | +| Lifecycle | Running / Paused / Killed | RUNNING / PAUSED / KILLED | Identical | +| Execution model | "Rollout" async loops | Background agent hibernates + resumes | High | + +**Enterprise advantage vs local-first bots** + +* BQ control plane enables fleet-scale SQL audit ("1,000 agents state now") +* VPC-SC, CMEK, IAM boundaries can be standardized at framework level + +--- + +## 14. Competitive Landscape (LangGraph + Claude) + +### 14.1 TL;DR + +LangGraph offers durable workflow checkpointing; Claude SDK offers session continuity/harness patterns. Neither makes **cloud job reconciliation** plus **SQL-audit control plane** a first-class target. 
+ +### 14.2 Feature comparison + +| Feature | ADK (current) | ADK (proposed) | LangGraph | Claude SDK | +|---------|---------------|----------------|-----------|------------| +| In-process pause/resume | Yes (experimental) | Yes | Yes | Yes | +| Cross-process durability | No | Yes (BQ+GCS) | Yes (checkpointers) | Via harness | +| External event triggers | No | Yes (Pub/Sub) | Via external code | Via harness | +| Cloud job reconciliation | No | Yes (authoritative) | No | No | +| SQL audit trail | No | Yes (BQ) | No (requires custom) | No | +| Fleet observability | No | Yes (BQ analytics) | Via LangSmith | No | + +### 14.3 Why not "just use LangGraph checkpointers with BigQuery storage" + +LangGraph checkpointers serialize and restore workflow state at step boundaries, but BigQuery long-horizon requires: + +* authoritative job status reconciliation (DONE/FAILED/CANCELLED/RUNNING) +* result retrieval from destination tables +* partial failure handling and enterprise audit semantics + +This is not a drop-in "graph replay" problem; it's **cloud job continuity**. + +### 14.4 Borrow vs differentiate (prioritized) + +**v1 essential** + +1. checkpoint policy ergonomics (inspired by LangGraph) +2. coordinator/worker harness pattern (inspired by Anthropic article) + +**v2** +3. hybrid filesystem backends +4. skills/plugins packaging for BigQuery playbooks + +--- + +## 15. Alternatives Considered + +| Alternative | Why not (v1) | +|-------------|--------------| +| Extend existing SessionService | Different consistency model; BQ provides SQL audit | +| Firestore metadata | less SQL-auditable for analytics; can be lease backend later | +| Spanner leasing | heavy for v1; keep pluggable | +| Redis/Memorystore | ephemeral-first; lacks audit/query semantics | +| VM checkpointing | complex; brittle with environment drift | +| Cloud Workflows | static DAGs; agents need dynamic replanning | + +--- + +## 16. Size Limits, Spill Strategy, Compatibility + +### 16.1 Size limits + +* Keep `agent_state_json` summary small (< 1MB) and queryable +* Store full checkpoint in GCS (recommended < 100MB, hard limit 5GB) +* Workspace snapshot recommended ≤ 1 GB; large artifacts should be explicit GCS objects, not tarballed + +### 16.2 Compatibility & schema evolution + +* `agent_version`: code version (e.g., "1.2.3" or git SHA) +* `state_schema_version`: **monotonic INT64** (1,2,3…) +* optional `state_schema_version_label`: semver string for readability + +**v1 stance:** version mismatches hard-fail (safe). This prevents subtle bugs from incompatible state. + +**Migration strategy (v2):** + +```python +class CheckpointableAgentState(ABC): + def get_state_schema_version(self) -> int: + return 1 + + def migrate_state(self, old_state: dict, old_version: int) -> dict: + """Override to implement state migrations. + + Called when loading a checkpoint with older schema version. + Default: raise error (v1 behavior). + """ + raise StateSchemaMismatchError( + f"Cannot migrate from version {old_version} to {self.get_state_schema_version()}" + ) +``` + +### 16.3 checkpoint_fingerprint definition + +`checkpoint_fingerprint` = SHA256 of canonical checkpoint state excluding timestamps and non-deterministic fields. Useful for dedupe/debugging. 
+ +```python +def fingerprint_checkpoint(state: dict) -> str: + """Compute deterministic fingerprint for checkpoint state.""" + # Remove non-deterministic fields + canonical = {k: v for k, v in state.items() + if k not in ("_timestamp", "_reconciliation_results")} + # Sort keys for determinism + canonical_json = json.dumps(canonical, sort_keys=True, separators=(',', ':')) + return hashlib.sha256(canonical_json.encode()).hexdigest() +``` + +--- + +## 17. Security, Governance, Enterprise Readiness + +### 17.1 Data sensitivity + +* **Sensitive by default:** checkpoints may include PII findings, credentials, business data +* **Classification:** treat checkpoint data with same sensitivity as source data + +### 17.2 Encryption + +| Layer | Mechanism | +|-------|-----------| +| GCS blobs | CMEK (Customer-Managed Encryption Keys) | +| BQ tables | BQ encryption policies (default or CMEK) | +| In-transit | TLS 1.3 | + +### 17.3 Access control + +* **IAM:** least privilege, separate identities for runner vs store +* **Runner identity:** needs BQ read/write, GCS read/write +* **Resumer identity:** needs BQ read/write, GCS read, Pub/Sub subscribe +* **Audit identity:** needs BQ read only + +### 17.4 Retention & compliance + +* **TTL:** configurable per session/agent type +* **GC:** automatic cleanup of expired sessions and orphan blobs +* **Legal hold:** support for compliance holds if needed +* **Audit log:** all checkpoint operations logged to Cloud Audit Logs + +### 17.5 VPC-SC + +* **Day-1 requirement** for many enterprise customers +* Ensure checkpoint bucket is in same VPC-SC perimeter +* Use restricted.googleapis.com endpoints +* Document perimeter configuration in deployment guide + +--- + +## 18. Open Questions & Risks (Senior review) + +| Question | Risk Level | Notes | +|----------|------------|-------| +| Lease contention & latency under high event bursts | Medium | May need Firestore/Tasks for >100 concurrent resumes | +| Workspace growth management | Low | Differential sync/manifest snapshots for v2 | +| Checkpoint frequency tuning | Low | Define "smart boundaries" to balance cost and safety | +| VPC-SC compliance validation | High | Day-1 requirement; needs security review | +| Multi-region/DR support | Medium | Cross-region resume: supported or out of scope? | +| Integration with existing ResumabilityConfig | Low | Design is additive, not replacing | +| State migration complexity | Medium | Hard-fail v1 is safe but limits upgrades | + +--- + +## 19. Milestones / Rollout Plan + +| Week | Milestone | Deliverables | +|------|-----------|--------------| +| 1–2 | API design & integration planning | `DurableSessionConfig` API, integration with `ResumabilityConfig`, storage/lease strategy doc | +| 3–4 | Core implementation | `BigQueryCheckpointStore`, `WorkspaceSnapshotter`, two-phase commit | +| 5–6 | Resume service | `ResumeService`, Pub/Sub integration, lease management | +| 7–8 | Pilot integration | PII scanner pilot, metrics collection | +| 9+ | Iterate & decide | Performance tuning, decide first-class vs plugin path | + +--- + +## 20. Immediate Ask / Decisions + +1. **Review** `CheckpointableAgentState` contract and integration with existing `ResumabilityConfig` +2. **Confirm** BQ+GCS as reference infra and lease backend strategy +3. **Select** pilot use case (PII scanner recommended) +4. **Decide:** Durable PAUSED as extension to existing resumability vs separate plugin/extension + +--- + +## 21. 
Cost Estimation + +### 21.1 Storage costs + +| Component | Typical Size | Monthly Cost (US) | +|-----------|--------------|-------------------| +| BQ session row | ~2 KB | ~$0.00004/row | +| BQ checkpoint row | ~5 KB | ~$0.0001/row | +| GCS checkpoint blob | ~100 KB | ~$0.0026/GB = ~$0.00000026 | +| GCS workspace snapshot | ~50 MB | ~$0.0026/GB = ~$0.00013 | + +**Example: 1,000 sessions, 10 checkpoints each, 24-hour retention** + +| Item | Quantity | Cost | +|------|----------|------| +| BQ session rows | 1,000 | $0.04 | +| BQ checkpoint rows | 10,000 | $1.00 | +| GCS checkpoint blobs | 10,000 × 100KB = 1GB | $0.026 | +| GCS workspace snapshots | 1,000 × 50MB = 50GB | $1.30 | +| **Total daily** | | **~$2.37** | + +**Cost per session-day paused:** ~$0.002 (well under $0.01 estimate) + +### 21.2 Compute costs + +| Component | Cost | +|-----------|------| +| PAUSED session | $0 (no compute) | +| Resume service (Cloud Run) | ~$0.001 per resume | +| Pub/Sub events | ~$0.04 per million messages | + +### 21.3 BigQuery query costs + +| Query Type | Estimated Data Scanned | Cost | +|------------|------------------------|------| +| Get latest checkpoint | ~10 KB | ~$0.00000005 | +| List session checkpoints | ~100 KB | ~$0.0000005 | +| Fleet analytics query | ~10 MB | ~$0.00005 | + +--- + +## 22. Monitoring & Observability + +### 22.1 Key metrics + +| Metric | Description | Alert Threshold | +|--------|-------------|-----------------| +| `checkpoint_write_latency_ms` | Time to write checkpoint (P50, P99) | P99 > 5000ms | +| `checkpoint_write_errors` | Failed checkpoint writes | > 1% error rate | +| `resume_latency_ms` | Time from event to resumed | P99 > 10000ms | +| `lease_contention_rate` | Failed lease acquisitions | > 5% | +| `orphan_blob_count` | GCS blobs without BQ metadata | > 1000 | +| `paused_session_count` | Currently paused sessions | Informational | +| `sessions_near_ttl` | Sessions expiring within 24h | > 100 | + +### 22.2 Dashboards + +**Operational dashboard:** +- Active sessions by state (RUNNING/PAUSED/KILLED) +- Checkpoint write success rate +- Resume latency distribution +- Lease acquisition success rate + +**Cost dashboard:** +- Storage usage (BQ + GCS) +- Query costs by type +- Compute costs (resume service) + +### 22.3 Alerting + +| Alert | Condition | Severity | +|-------|-----------|----------| +| High checkpoint failure rate | > 1% errors in 5 min | P1 | +| Resume service unhealthy | > 50% error rate | P1 | +| Lease contention spike | > 10% contention in 5 min | P2 | +| Orphan blob accumulation | > 10,000 orphans | P3 | +| Sessions nearing TTL | > 100 sessions within 1h of TTL | P3 | + +### 22.4 Logging + +All operations emit structured logs with: +- `session_id`, `checkpoint_seq`, `operation` +- `latency_ms`, `success`, `error_code` +- Correlation IDs for tracing + +--- + +## 23. Rollback & Recovery Procedures + +### 23.1 Checkpoint rollback + +```python +def rollback_to_checkpoint(session_id: str, target_seq: int) -> None: + """Rollback session to a previous checkpoint. + + Use cases: + - Agent made incorrect decisions + - Corrupted state detected + - Testing/debugging + """ + # 1. Verify target checkpoint exists + checkpoint = store.read_checkpoint(session_id, target_seq) + + # 2. Update session to point to target checkpoint + bq.update("sessions", session_id, { + "current_checkpoint_seq": target_seq, + "updated_at": now(), + }) + + # 3. 
Log rollback for audit + bq.insert("events", { + "session_id": session_id, + "event_type": "ROLLBACK", + "event_payload": {"from_seq": current_seq, "to_seq": target_seq}, + "event_time": now(), + }) +``` + +### 23.2 Session recovery + +| Scenario | Recovery Procedure | +|----------|-------------------| +| Resume service crash | Automatic retry via Pub/Sub redelivery | +| Checkpoint corruption | Rollback to previous checkpoint | +| BQ metadata loss | Rebuild from GCS blob inventory | +| GCS blob loss | Mark checkpoint invalid, resume from earlier | +| Lease stuck | Auto-expire after TTL, manual release available | + +### 23.3 Disaster recovery + +**Same-region:** +- BQ point-in-time recovery (7 days default) +- GCS object versioning + +**Cross-region (v2):** +- BQ dataset replication +- GCS dual-region or multi-region buckets + +--- + +## 24. Implementation Details (v1) + +### 24.1 Module Structure + +``` +src/google/adk/durable/ +├── __init__.py # Public exports +├── config.py # DurableSessionConfig +├── checkpointable_state.py # CheckpointableAgentState ABC +├── workspace_snapshotter.py # GCS workspace snapshot handling +└── stores/ + ├── __init__.py # Store exports + ├── base_checkpoint_store.py # DurableSessionStore ABC + └── bigquery_checkpoint_store.py # BQ + GCS implementation +``` + +### 24.2 Key Implementation Decisions + +| Decision | Rationale | +|----------|-----------| +| DML INSERT over streaming inserts | BigQuery streaming buffer limitations prevent immediate UPDATE after streaming insert | +| JSON column type checking | BigQuery returns JSON columns as dicts, not strings - added runtime type detection | +| SHA-256 verification | Checkpoint integrity verification on read | +| Async-first API | All store methods are async for non-blocking I/O | +| Experimental decorators | All public classes marked `@experimental` for API stability signals | + +### 24.3 BigQuery Table Schema (Simplified for v1) + +```sql +-- Sessions table +CREATE TABLE `project.adk_metadata.sessions` ( + session_id STRING NOT NULL, + status STRING NOT NULL, + agent_name STRING NOT NULL, + created_at TIMESTAMP NOT NULL, + updated_at TIMESTAMP NOT NULL, + current_checkpoint_seq INT64 NOT NULL, + active_lease_id STRING, + lease_expiry TIMESTAMP, + ttl_expiry TIMESTAMP, + metadata JSON, + PRIMARY KEY (session_id) NOT ENFORCED +); + +-- Checkpoints table +CREATE TABLE `project.adk_metadata.checkpoints` ( + session_id STRING NOT NULL, + checkpoint_seq INT64 NOT NULL, + created_at TIMESTAMP NOT NULL, + gcs_state_uri STRING NOT NULL, + sha256 STRING NOT NULL, + size_bytes INT64 NOT NULL, + agent_state JSON, + trigger STRING NOT NULL, + PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED +); +``` + +### 24.4 Demo Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Cloud Run: durable-demo │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ FastAPI Server │ │ +│ │ - demo_server.py: Task management + checkpoint APIs │ │ +│ │ - demo_ui.html: Real-time visualization UI │ │ +│ └───────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ BigQueryCheckpointStore │ │ +│ │ - Two-phase commit (GCS blob → BQ metadata) │ │ +│ │ - Lease management for concurrency │ │ +│ │ - SHA-256 integrity verification │ │ +│ └───────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + │ │ + ▼ ▼ + ┌──────────────────┐ 
┌──────────────────┐ + │ BigQuery │ │ GCS │ + │ adk_metadata │ │ checkpoints/ │ + │ - sessions │ │ {session_id}/ │ + │ - checkpoints │ │ {seq}/state.json│ + └──────────────────┘ └──────────────────┘ +``` + +### 24.5 Demo Features + +| Feature | Implementation | +|---------|----------------| +| Task types | Sentiment, Anomaly, Trend, Clustering analysis | +| Checkpoint interval | Every 10 seconds | +| Failure simulation | Manual trigger via UI | +| Resume from checkpoint | Automatic state restoration | +| Final output | Task-specific analysis reports | +| Real-time UI | Polling-based status updates | +| Checkpoint timeline | Visual checkpoint history | + +--- + +# Appendix A: Feature-to-Requirement Mapping (Demo Coverage) + +| Feature | Functional Purpose | Long-horizon benefit | +|---------|--------------------|-----------------------| +| Two-phase checkpoint commit | atomic visibility of state | prevents half-saved resumes | +| BigQuery job ledger | track async job IDs & states | hibernate during hours-long jobs | +| Workspace snapshotting | preserve files and drafts | warm start for coding/report tasks | +| Lease-based resuming | prevent concurrent resume | avoids corruption in parallel runs | +| Durable lifecycle model | add persistent PAUSED | releases compute, supports indefinite horizon | +| Authoritative reconciliation | sync with cloud job state | prevents duplicate submissions | +| Integration with ResumabilityConfig | backward compatibility | incremental adoption | + +--- + +# Appendix B: BigQuery SQL (Copy/Paste) + +## B0) Dataset + +```sql +CREATE SCHEMA IF NOT EXISTS `your_project.adk_metadata` +OPTIONS ( + location = "US", + description = "ADK Durable Session control-plane metadata (sessions, checkpoints, events)." +); +``` + +## B1) sessions + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.sessions` ( + session_id STRING NOT NULL, + parent_session_id STRING, + owner_principal STRING NOT NULL, + + status STRING NOT NULL, + agent_name STRING NOT NULL, + agent_version STRING NOT NULL, + persistence_mode STRING NOT NULL, + + created_at TIMESTAMP NOT NULL, + updated_at TIMESTAMP NOT NULL, + + current_checkpoint_seq INT64 NOT NULL, + active_lease_id STRING, + lease_expiry TIMESTAMP, + + ttl_expiry TIMESTAMP NOT NULL, + + labels JSON, + metadata JSON, + + state_schema_version INT64 NOT NULL, + state_schema_version_label STRING, + + -- Primary key constraint (BigQuery syntax) + PRIMARY KEY (session_id) NOT ENFORCED +) +PARTITION BY DATE(updated_at) +CLUSTER BY status, owner_principal +OPTIONS (description = "Durable agent session control-plane table."); +``` + +## B2) checkpoints + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.checkpoints` ( + session_id STRING NOT NULL, + checkpoint_seq INT64 NOT NULL, + + agent_version STRING NOT NULL, + state_schema_version INT64 NOT NULL, + state_schema_version_label STRING, + + created_at TIMESTAMP NOT NULL, + + gcs_state_uri STRING NOT NULL, + gcs_workspace_uri STRING, + + sha256 STRING NOT NULL, + size_bytes INT64 NOT NULL, + + agent_state_json JSON, + trigger STRING NOT NULL, + + num_jobs INT64, + num_tables_scanned INT64, + num_findings INT64, + + checkpoint_fingerprint STRING, + + -- Composite primary key + PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED +) +PARTITION BY DATE(created_at) +CLUSTER BY session_id +OPTIONS (description = "Checkpoint metadata; full blobs stored in GCS."); +``` + +## B3) events + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.events` ( + event_id STRING NOT 
NULL, + session_id STRING NOT NULL, + + event_time TIMESTAMP NOT NULL, + event_type STRING NOT NULL, + event_payload JSON, + + processed BOOL NOT NULL, + processed_at TIMESTAMP, + processing_lease_id STRING, + + source STRING, + severity STRING, + + -- Primary key + PRIMARY KEY (event_id) NOT ENFORCED +) +PARTITION BY DATE(event_time) +CLUSTER BY session_id, processed +OPTIONS (description = "Resume trigger events and processing audit trail."); +``` + +## B4) Views + +Latest checkpoint per session (with NULL handling): + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_latest_checkpoint` AS +SELECT + session_id, + ARRAY_AGG(c ORDER BY checkpoint_seq DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_checkpoint +FROM `your_project.adk_metadata.checkpoints` c +GROUP BY session_id; +``` + +Paused sessions nearing TTL: + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_paused_near_ttl` AS +SELECT + session_id, owner_principal, agent_name, agent_version, + ttl_expiry, updated_at, current_checkpoint_seq, + TIMESTAMP_DIFF(ttl_expiry, CURRENT_TIMESTAMP(), HOUR) AS hours_until_expiry +FROM `your_project.adk_metadata.sessions` +WHERE status = 'PAUSED' + AND ttl_expiry < TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR); +``` + +Fleet status summary: + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_fleet_status` AS +SELECT + agent_name, + status, + COUNT(*) AS session_count, + AVG(current_checkpoint_seq) AS avg_checkpoints, + MIN(created_at) AS oldest_session, + MAX(updated_at) AS most_recent_activity +FROM `your_project.adk_metadata.sessions` +WHERE ttl_expiry > CURRENT_TIMESTAMP() +GROUP BY agent_name, status; +``` + +Lease acquire template: + +```sql +UPDATE `your_project.adk_metadata.sessions` +SET active_lease_id = @lease_id, + lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND), + updated_at = CURRENT_TIMESTAMP() +WHERE session_id = @session_id + AND status = 'PAUSED' + AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP()); +``` + +--- + +# Appendix C: Sequence Diagram (Mermaid) + +```mermaid +sequenceDiagram + participant App as ADK Application + participant Runner as ADK Agent Runner + participant ResConfig as ResumabilityConfig + participant DurConfig as DurableSessionConfig + participant Store as Durable Store (BQ+GCS) + participant BQ as BigQuery + participant PS as Pub/Sub + participant Resumer as Resume Service + + Note over App,Resumer: Initialization + App->>Runner: Create with ResumabilityConfig + DurableSessionConfig + Runner->>ResConfig: is_resumable = True + Runner->>DurConfig: is_durable = True + + Note over App,Resumer: Execution & Pause + Runner->>BQ: Submit async jobs (N) + Runner->>ResConfig: should_pause_invocation() = True + Runner->>Store: Phase1: Write state blob to GCS + Runner->>Store: Phase2: Insert checkpoint metadata (BQ) + Runner->>Store: Update session status = PAUSED + Runner-->>App: Yield control (zero compute) + + Note over App,Resumer: External Events + BQ-->>PS: Job completion event(s) + PS-->>Resumer: Deliver event (may be duplicated) + + Note over App,Resumer: Resume + Resumer->>Store: Acquire lease(session_id) + + alt Lease already held + Store-->>Resumer: Lease denied + Resumer->>Resumer: Back off and retry / skip event + else Lease granted + Store-->>Resumer: Lease granted + Resumer->>Store: Read latest checkpoint + Resumer->>BQ: Reconcile job ledger (authoritative) + Resumer->>Runner: Resume session with checkpoint + Runner->>Store: Periodic checkpoint updates + Runner->>Store: 
Finalize session status = KILLED + Resumer->>Store: Release lease(session_id) + end +``` + +--- + +# Appendix D: Failure Modes (Operational) + +| Failure Mode | Detection | Recovery | +|--------------|-----------|----------| +| Duplicate Pub/Sub event | Lease acquisition fails | Skip, idempotent | +| Partial checkpoint write (Phase 1) | GCS upload error | Retry, no cleanup needed | +| Partial checkpoint write (Phase 2) | BQ insert error | Orphan blob GC | +| Resume crash mid-execution | Lease expires, no heartbeat | Re-acquire lease, resume from checkpoint | +| Jobs still running on resume | Reconciliation detects RUNNING | Re-register completion callback | +| Jobs failed/cancelled | Reconciliation detects state | Agent retry policy, audit decision | +| Permission revoked | API error | Fail with explicit error + audit row | +| TTL expiry | Scheduled job | GC + mark expired | +| Checkpoint corruption | SHA256 mismatch | Rollback to previous checkpoint | +| State schema mismatch | Version check on load | Hard-fail (v1), migrate (v2) | + +--- + +# Appendix E: Integration Example + +```python +from google.adk.apps import App, ResumabilityConfig +from google.adk.agents import LlmAgent +from google.adk.durable import ( + DurableSessionConfig, + BigQueryCheckpointStore, + PubSubEventSource, +) + +# Create durable-enabled application +app = App( + name="pii_scanner", + root_agent=LlmAgent( + name="scanner", + model="gemini-2.0-flash", + instructions="Scan BigQuery tables for PII...", + tools=[bq_query_tool, bq_job_tool], + ), + # Existing resumability (in-process) + resumability_config=ResumabilityConfig( + is_resumable=True, + ), + # NEW: Durable cross-process persistence + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + workspace_snapshot_enabled=False, + checkpoint_store=BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", + gcs_bucket="my-checkpoints-bucket", + ), + lease_backend="bigquery", + lease_ttl_seconds=300, + ), +) + +# Run with runner (checkpoint happens automatically on pause) +runner = Runner( + app=app, + session_service=DatabaseSessionService(...), +) + +# Events from Pub/Sub automatically trigger resume +async for event in runner.run_async( + user_id="user-123", + session_id="session-456", + new_message=Content(parts=[Part(text="Scan all tables for PII")]), +): + print(event) +``` + +--- + +# References (URLs) + +1. LangGraph durable execution: [https://docs.langchain.com/oss/python/langgraph/durable-execution/](https://docs.langchain.com/oss/python/langgraph/durable-execution/) +2. LangGraph persistence/checkpointers: [https://docs.langchain.com/oss/python/langgraph/persistence/](https://docs.langchain.com/oss/python/langgraph/persistence/) +3. LangGraph overview: [https://docs.langchain.com/oss/python/langgraph/](https://docs.langchain.com/oss/python/langgraph/) +4. LangGraph checkpoints reference: [https://reference.langchain.com/python/langgraph/checkpoints/](https://reference.langchain.com/python/langgraph/checkpoints/) +5. Deep Agents overview: [https://docs.langchain.com/oss/python/deepagents/overview/](https://docs.langchain.com/oss/python/deepagents/overview/) +6. Deep Agents long-term memory: [https://docs.langchain.com/oss/python/deepagents/long-term-memory/](https://docs.langchain.com/oss/python/deepagents/long-term-memory/) +7. 
Anthropic long-running harnesses: [https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) +8. ADK ResumabilityConfig: `src/google/adk/apps/app.py:42-58` +9. ADK InvocationContext pause: `src/google/adk/agents/invocation_context.py:355-389` diff --git a/contributing/samples/long_running_task/setup.py b/contributing/samples/long_running_task/setup.py new file mode 100644 index 0000000000..c97ecad3e9 --- /dev/null +++ b/contributing/samples/long_running_task/setup.py @@ -0,0 +1,246 @@ +#!/usr/bin/env python +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Setup script for the durable session demo. + +This script creates the required BigQuery dataset, tables, and GCS bucket +for the durable session persistence demo. + +Usage: + python setup.py + +Prerequisites: + - Google Cloud SDK installed and configured + - BigQuery API enabled + - Cloud Storage API enabled + - Appropriate IAM permissions: + - roles/bigquery.dataEditor + - roles/storage.objectAdmin +""" + +import argparse +import subprocess +import sys + +# Configuration +PROJECT_ID = "test-project-0728-467323" +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" +LOCATION = "US" + + +def run_command( + cmd: list[str], check: bool = True +) -> subprocess.CompletedProcess: + """Run a shell command and return the result.""" + print(f"Running: {' '.join(cmd)}") + result = subprocess.run(cmd, capture_output=True, text=True) + if check and result.returncode != 0: + print(f"Error: {result.stderr}") + if not result.stderr.strip().endswith("already exists"): + sys.exit(1) + return result + + +def create_gcs_bucket(): + """Create the GCS bucket for checkpoint blobs.""" + print("\n=== Creating GCS Bucket ===") + run_command( + ["gsutil", "mb", "-l", LOCATION, f"gs://{GCS_BUCKET}"], check=False + ) + + # Set lifecycle policy to delete old checkpoints after 30 days + lifecycle_config = """ +{ + "lifecycle": { + "rule": [ + { + "action": {"type": "Delete"}, + "condition": {"age": 30} + } + ] + } +} +""" + with open("/tmp/lifecycle.json", "w") as f: + f.write(lifecycle_config) + + run_command( + [ + "gsutil", + "lifecycle", + "set", + "/tmp/lifecycle.json", + f"gs://{GCS_BUCKET}", + ], + check=False, + ) + + print(f"GCS bucket created: gs://{GCS_BUCKET}") + + +def create_bigquery_dataset(): + """Create the BigQuery dataset.""" + print("\n=== Creating BigQuery Dataset ===") + run_command( + [ + "bq", + "mk", + "--dataset", + "--location", + LOCATION, + f"{PROJECT_ID}:{DATASET}", + ], + check=False, + ) + print(f"BigQuery dataset created: {PROJECT_ID}.{DATASET}") + + +def create_sessions_table(): + """Create the sessions metadata table.""" + print("\n=== Creating Sessions Table ===") + + schema = """ +session_id:STRING, +status:STRING, +agent_name:STRING, +created_at:TIMESTAMP, +updated_at:TIMESTAMP, +current_checkpoint_seq:INT64, +active_lease_id:STRING, +lease_expiry:TIMESTAMP, 
+ttl_expiry:TIMESTAMP, +metadata:JSON +""" + + run_command( + [ + "bq", + "mk", + "--table", + f"{PROJECT_ID}:{DATASET}.sessions", + schema.replace("\n", "").strip(), + ], + check=False, + ) + + print(f"Sessions table created: {PROJECT_ID}.{DATASET}.sessions") + + +def create_checkpoints_table(): + """Create the checkpoints table.""" + print("\n=== Creating Checkpoints Table ===") + + schema = """ +session_id:STRING, +checkpoint_seq:INT64, +created_at:TIMESTAMP, +gcs_state_uri:STRING, +sha256:STRING, +size_bytes:INT64, +agent_state_json:JSON, +trigger:STRING +""" + + run_command( + [ + "bq", + "mk", + "--table", + f"{PROJECT_ID}:{DATASET}.checkpoints", + schema.replace("\n", "").strip(), + ], + check=False, + ) + + print(f"Checkpoints table created: {PROJECT_ID}.{DATASET}.checkpoints") + + +def verify_setup(): + """Verify that all resources were created successfully.""" + print("\n=== Verifying Setup ===") + + # Check GCS bucket + result = run_command(["gsutil", "ls", f"gs://{GCS_BUCKET}"], check=False) + if result.returncode == 0: + print(f"[OK] GCS bucket exists: gs://{GCS_BUCKET}") + else: + print(f"[FAIL] GCS bucket not found: gs://{GCS_BUCKET}") + + # Check BigQuery tables + for table in ["sessions", "checkpoints"]: + result = run_command( + ["bq", "show", f"{PROJECT_ID}:{DATASET}.{table}"], check=False + ) + if result.returncode == 0: + print(f"[OK] BigQuery table exists: {PROJECT_ID}.{DATASET}.{table}") + else: + print(f"[FAIL] BigQuery table not found: {PROJECT_ID}.{DATASET}.{table}") + + +def cleanup(): + """Delete all resources created by this script.""" + print("\n=== Cleaning Up Resources ===") + + # Delete BigQuery tables + for table in ["sessions", "checkpoints"]: + run_command( + ["bq", "rm", "-f", f"{PROJECT_ID}:{DATASET}.{table}"], check=False + ) + + # Delete BigQuery dataset + run_command(["bq", "rm", "-f", "-d", f"{PROJECT_ID}:{DATASET}"], check=False) + + # Delete GCS bucket + run_command(["gsutil", "rm", "-r", f"gs://{GCS_BUCKET}"], check=False) + + print("Cleanup complete.") + + +def main(): + parser = argparse.ArgumentParser( + description="Setup resources for the durable session demo" + ) + parser.add_argument( + "--cleanup", + action="store_true", + help="Delete all resources instead of creating them", + ) + parser.add_argument( + "--verify", action="store_true", help="Only verify that resources exist" + ) + args = parser.parse_args() + + print(f"Project: {PROJECT_ID}") + print(f"Dataset: {DATASET}") + print(f"GCS Bucket: {GCS_BUCKET}") + print(f"Location: {LOCATION}") + + if args.cleanup: + cleanup() + elif args.verify: + verify_setup() + else: + create_gcs_bucket() + create_bigquery_dataset() + create_sessions_table() + create_checkpoints_table() + verify_setup() + + print("\nDone!") + + +if __name__ == "__main__": + main() diff --git a/contributing/samples/long_running_task/tools.py b/contributing/samples/long_running_task/tools.py new file mode 100644 index 0000000000..4dbbf4455c --- /dev/null +++ b/contributing/samples/long_running_task/tools.py @@ -0,0 +1,489 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Long-running tools for the durable session demo.""" + +import asyncio +import random +from datetime import datetime +from typing import Any + +from google.adk.tools.tool_context import ToolContext + + +async def simulate_long_running_scan( + table_name: str, + tool_context: ToolContext, +) -> dict[str, Any]: + """Simulate a long-running BigQuery table scan. + + This tool demonstrates durable checkpointing by simulating a scan that + takes several seconds. In a real scenario, this would be a BigQuery job + that processes large amounts of data. + + Args: + table_name: The fully-qualified BigQuery table name to scan. + tool_context: The tool context for accessing state and artifacts. + + Returns: + A dictionary with scan results including status, row count, and findings. + """ + # Simulate processing time (5-10 seconds) + processing_time = random.uniform(5.0, 10.0) + await asyncio.sleep(processing_time) + + # Simulate scan results + rows_scanned = random.randint(100000, 10000000) + findings = [] + + # Generate some sample findings based on table name + if "shakespeare" in table_name.lower(): + findings = [ + "Found 5 instances of 'to be or not to be'", + "Most common word: 'the' (27,801 occurrences)", + "Unique words: 29,066", + ] + elif "github" in table_name.lower(): + findings = [ + "Most active repository: kubernetes/kubernetes", + "Peak commit hour: 14:00 UTC", + "Average commits per day: 45,000", + ] + else: + findings = [ + f"Scanned {rows_scanned:,} rows", + "No anomalies detected", + "Data quality: 99.8%", + ] + + return { + "status": "complete", + "table": table_name, + "rows_scanned": rows_scanned, + "processing_time_seconds": round(processing_time, 2), + "findings": findings, + } + + +async def run_data_pipeline( + source_table: str, + destination_table: str, + transformations: list[str], + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a data transformation pipeline. + + This simulates a multi-stage data pipeline that would typically be + checkpointed at each stage for durability. + + Args: + source_table: The source BigQuery table. + destination_table: The destination BigQuery table. + transformations: List of transformation operations to apply. + tool_context: The tool context for accessing state and artifacts. + + Returns: + Pipeline execution results. + """ + stages_completed = [] + total_rows_processed = 0 + + # Simulate each transformation stage + for i, transformation in enumerate(transformations): + # Simulate stage processing time + stage_time = random.uniform(2.0, 5.0) + await asyncio.sleep(stage_time) + + rows_processed = random.randint(10000, 100000) + total_rows_processed += rows_processed + + stages_completed.append({ + "stage": i + 1, + "transformation": transformation, + "rows_processed": rows_processed, + "duration_seconds": round(stage_time, 2), + }) + + return { + "status": "complete", + "source_table": source_table, + "destination_table": destination_table, + "stages_completed": stages_completed, + "total_rows_processed": total_rows_processed, + "total_stages": len(transformations), + } + + +async def run_extended_analysis( + job_name: str, + duration_minutes: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run an extended analysis job for a specified duration. + + This tool simulates a long-running analysis job that can run for 10+ minutes. + Use this to test durable checkpointing with extended job durations. 
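+  The workload is simulated with asyncio.sleep in 30-second chunks, and the
+  reported metrics are randomly generated rather than computed from real data.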
+ + Args: + job_name: A descriptive name for the analysis job. + duration_minutes: How many minutes the job should run (1-60 minutes). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Analysis job results with timing and metrics. + """ + start_time = datetime.now() + duration_seconds = min(max(duration_minutes, 1), 60) * 60 + + # Process in chunks, reporting progress + chunk_size = 30 # Report every 30 seconds + chunks_completed = 0 + total_chunks = duration_seconds // chunk_size + + metrics = { + "records_processed": 0, + "anomalies_detected": 0, + "patterns_found": 0, + } + + for i in range(0, duration_seconds, chunk_size): + remaining = min(chunk_size, duration_seconds - i) + await asyncio.sleep(remaining) + + chunks_completed += 1 + metrics["records_processed"] += random.randint(100000, 500000) + metrics["anomalies_detected"] += random.randint(0, 10) + metrics["patterns_found"] += random.randint(1, 5) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + return { + "status": "complete", + "job_name": job_name, + "requested_duration_minutes": duration_minutes, + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "metrics": metrics, + "summary": ( + f"Processed {metrics['records_processed']:,} records, " + f"found {metrics['anomalies_detected']} anomalies and " + f"{metrics['patterns_found']} patterns" + ), + } + + +async def run_ml_training_job( + model_name: str, + dataset_size: str, + epochs: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a simulated ML model training job. + + This tool simulates training a machine learning model, which can take + 10+ minutes depending on the dataset size and epochs. + + Dataset sizes and approximate training times: + - "small": ~2 minutes + - "medium": ~5 minutes + - "large": ~10 minutes + - "xlarge": ~15 minutes + - "enterprise": ~30 minutes + + Args: + model_name: Name for the model being trained. + dataset_size: Size of dataset - "small", "medium", "large", "xlarge", or "enterprise". + epochs: Number of training epochs (1-100). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Training results with metrics and model performance. 
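+
+  Note: epoch metrics are randomly simulated. Total runtime grows only mildly
+  with the number of epochs (10% of the base time per extra epoch) and is
+  capped at one hour.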
+ """ + start_time = datetime.now() + + # Map dataset size to base training time (in seconds) + size_to_time = { + "small": 120, # 2 minutes + "medium": 300, # 5 minutes + "large": 600, # 10 minutes + "xlarge": 900, # 15 minutes + "enterprise": 1800, # 30 minutes + } + + base_time = size_to_time.get(dataset_size.lower(), 300) + epochs = min(max(epochs, 1), 100) + + # Total time scales with epochs (but not linearly) + total_time = base_time * (1 + (epochs - 1) * 0.1) + total_time = min(total_time, 3600) # Cap at 1 hour + + # Simulate training epochs + epoch_results = [] + time_per_epoch = total_time / epochs + + for epoch in range(1, epochs + 1): + await asyncio.sleep(time_per_epoch) + + # Simulate improving metrics over epochs + base_loss = 2.5 - (epoch / epochs) * 2.0 + loss = base_loss + random.uniform(-0.1, 0.1) + accuracy = min(0.5 + (epoch / epochs) * 0.45 + random.uniform(-0.02, 0.02), 0.99) + + epoch_results.append({ + "epoch": epoch, + "loss": round(loss, 4), + "accuracy": round(accuracy, 4), + "learning_rate": round(0.001 * (0.95 ** (epoch - 1)), 6), + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + final_metrics = epoch_results[-1] if epoch_results else {} + + return { + "status": "complete", + "model_name": model_name, + "dataset_size": dataset_size, + "epochs_completed": epochs, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "final_loss": final_metrics.get("loss"), + "final_accuracy": final_metrics.get("accuracy"), + "training_history": epoch_results[-5:], # Last 5 epochs + "model_artifact": f"gs://models/{model_name}/v1/model.pkl", + } + + +async def run_batch_etl_job( + job_id: str, + source_tables: list[str], + target_table: str, + processing_minutes: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a batch ETL (Extract, Transform, Load) job. + + This tool simulates a large-scale ETL job that processes multiple source + tables and loads data into a target table. Can run for 10+ minutes. + + Args: + job_id: Unique identifier for this ETL job. + source_tables: List of source table names to process. + target_table: Destination table for processed data. + processing_minutes: Estimated processing time in minutes (1-60). + tool_context: The tool context for accessing state and artifacts. + + Returns: + ETL job results with detailed metrics. 
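+
+  Note: processing time is split evenly across the source tables, and the
+  extract/transform/load row counts are randomly simulated.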
+ """ + start_time = datetime.now() + duration_seconds = min(max(processing_minutes, 1), 60) * 60 + + # Process each source table + table_results = [] + time_per_table = duration_seconds / max(len(source_tables), 1) + + total_rows_extracted = 0 + total_rows_transformed = 0 + total_rows_loaded = 0 + + for table in source_tables: + await asyncio.sleep(time_per_table) + + rows_extracted = random.randint(1000000, 10000000) + rows_transformed = int(rows_extracted * random.uniform(0.85, 0.99)) + rows_loaded = int(rows_transformed * random.uniform(0.98, 1.0)) + + total_rows_extracted += rows_extracted + total_rows_transformed += rows_transformed + total_rows_loaded += rows_loaded + + table_results.append({ + "source_table": table, + "rows_extracted": rows_extracted, + "rows_transformed": rows_transformed, + "rows_loaded": rows_loaded, + "transform_ratio": round(rows_transformed / rows_extracted, 4), + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + return { + "status": "complete", + "job_id": job_id, + "source_tables_processed": len(source_tables), + "target_table": target_table, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "total_rows_extracted": total_rows_extracted, + "total_rows_transformed": total_rows_transformed, + "total_rows_loaded": total_rows_loaded, + "overall_success_rate": round(total_rows_loaded / total_rows_extracted, 4), + "table_details": table_results, + } + + +async def run_demo_analysis( + analysis_type: str, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a 1-minute demo analysis job to showcase durable checkpointing. + + This tool is perfect for demos - it runs for exactly 1 minute with + progress updates every 10 seconds, showing how the system handles + long-running operations with checkpointing. + + Args: + analysis_type: Type of analysis to run (e.g., "sentiment", "anomaly", + "trend", "clustering"). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Analysis results with timing and metrics. 
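+
+  Note: the returned insights are canned per analysis type (sentiment,
+  anomaly, trend, clustering); any other type yields a generic summary.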
+ """ + start_time = datetime.now() + total_duration = 60 # 1 minute + update_interval = 10 # Progress every 10 seconds + + progress_updates = [] + metrics = { + "records_analyzed": 0, + "insights_found": 0, + "confidence_score": 0.0, + } + + for i in range(0, total_duration, update_interval): + await asyncio.sleep(update_interval) + + progress_pct = ((i + update_interval) / total_duration) * 100 + records_batch = random.randint(50000, 150000) + metrics["records_analyzed"] += records_batch + metrics["insights_found"] += random.randint(1, 5) + metrics["confidence_score"] = min( + 0.6 + (progress_pct / 100) * 0.35 + random.uniform(-0.02, 0.02), + 0.99 + ) + + progress_updates.append({ + "timestamp": datetime.now().isoformat(), + "progress_percent": round(progress_pct, 1), + "records_batch": records_batch, + "cumulative_records": metrics["records_analyzed"], + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + # Generate analysis-specific insights + insights = { + "sentiment": [ + "Overall sentiment: 72% positive", + "Key themes: innovation, growth, sustainability", + "Sentiment trend: improving over time", + ], + "anomaly": [ + "Detected 3 significant anomalies", + "Anomaly cluster in Q3 data", + "Root cause: seasonal variation", + ], + "trend": [ + "Strong upward trend detected", + "Growth rate: 15% month-over-month", + "Forecast: continued growth expected", + ], + "clustering": [ + "Identified 5 distinct clusters", + "Largest cluster: 45% of data", + "Cluster separation: excellent", + ], + }.get(analysis_type.lower(), [ + f"Completed {analysis_type} analysis", + "Results within expected parameters", + "No critical issues detected", + ]) + + return { + "status": "complete", + "analysis_type": analysis_type, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "duration_seconds": round(actual_duration, 2), + "metrics": metrics, + "insights": insights, + "progress_history": progress_updates, + "summary": ( + f"Completed {analysis_type} analysis on " + f"{metrics['records_analyzed']:,} records. " + f"Found {metrics['insights_found']} insights with " + f"{metrics['confidence_score']:.1%} confidence." + ), + } + + +def get_table_schema(table_name: str) -> dict[str, Any]: + """Get the schema of a BigQuery table. + + This is a quick synchronous operation that doesn't require checkpointing. + + Args: + table_name: The fully-qualified BigQuery table name. + + Returns: + The table schema information. 
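+
+  Note: schemas are canned. Table names containing 'shakespeare' or 'github'
+  return realistic shapes; any other name returns a generic three-column
+  schema.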
+ """ + # Simulate some common schemas + if "shakespeare" in table_name.lower(): + return { + "table": table_name, + "fields": [ + {"name": "word", "type": "STRING"}, + {"name": "word_count", "type": "INTEGER"}, + {"name": "corpus", "type": "STRING"}, + {"name": "corpus_date", "type": "INTEGER"}, + ], + "num_rows": 164656, + "size_bytes": 6432064, + } + elif "github" in table_name.lower(): + return { + "table": table_name, + "fields": [ + {"name": "repo_name", "type": "STRING"}, + {"name": "path", "type": "STRING"}, + {"name": "content", "type": "STRING"}, + {"name": "size", "type": "INTEGER"}, + ], + "num_rows": 2800000000, + "size_bytes": 2500000000000, + } + else: + return { + "table": table_name, + "fields": [ + {"name": "id", "type": "INTEGER"}, + {"name": "name", "type": "STRING"}, + {"name": "created_at", "type": "TIMESTAMP"}, + ], + "num_rows": 1000000, + "size_bytes": 100000000, + } diff --git a/src/google/adk/apps/__init__.py b/src/google/adk/apps/__init__.py index 3a5d0b0643..88d3474f3a 100644 --- a/src/google/adk/apps/__init__.py +++ b/src/google/adk/apps/__init__.py @@ -15,7 +15,18 @@ from .app import App from .app import ResumabilityConfig + +# Lazy import for DurableSessionConfig to avoid circular imports +def __getattr__(name: str): + if name == 'DurableSessionConfig': + from ..durable.config import DurableSessionConfig + + return DurableSessionConfig + raise AttributeError(f'module {__name__!r} has no attribute {name!r}') + + __all__ = [ 'App', 'ResumabilityConfig', + 'DurableSessionConfig', ] diff --git a/src/google/adk/apps/app.py b/src/google/adk/apps/app.py index 71ea5ce5aa..8779ad67dc 100644 --- a/src/google/adk/apps/app.py +++ b/src/google/adk/apps/app.py @@ -14,6 +14,7 @@ from __future__ import annotations from typing import Optional +from typing import TYPE_CHECKING from pydantic import BaseModel from pydantic import ConfigDict @@ -26,6 +27,9 @@ from ..plugins.base_plugin import BasePlugin from ..utils.feature_decorator import experimental +if TYPE_CHECKING: + from ..durable.config import DurableSessionConfig + def validate_app_name(name: str) -> None: """Ensures the provided application name is safe and intuitive.""" @@ -118,6 +122,13 @@ class App(BaseModel): If configured, will be applied to all agents in the app. """ + durable_session_config: Optional["DurableSessionConfig"] = None + """ + The config for durable session persistence. + If configured, sessions will be checkpointed to external storage (BigQuery + + GCS), enabling recovery from failures and migration across hosts. + """ + @model_validator(mode="after") def _validate_name(self) -> App: validate_app_name(self.name) diff --git a/src/google/adk/durable/__init__.py b/src/google/adk/durable/__init__.py new file mode 100644 index 0000000000..bdc9082a0d --- /dev/null +++ b/src/google/adk/durable/__init__.py @@ -0,0 +1,33 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Durable session persistence module for ADK. 
+ +This module provides checkpoint-based durability for long-running agent +invocations, enabling recovery from failures and migration across hosts. +""" + +from .checkpointable_state import CheckpointableAgentState +from .config import DurableSessionConfig +from .stores import BigQueryCheckpointStore +from .stores import DurableSessionStore +from .workspace_snapshotter import WorkspaceSnapshotter + +__all__ = [ + "CheckpointableAgentState", + "DurableSessionConfig", + "DurableSessionStore", + "BigQueryCheckpointStore", + "WorkspaceSnapshotter", +] diff --git a/src/google/adk/durable/checkpointable_state.py b/src/google/adk/durable/checkpointable_state.py new file mode 100644 index 0000000000..e9a372855e --- /dev/null +++ b/src/google/adk/durable/checkpointable_state.py @@ -0,0 +1,114 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Abstract base class for checkpointable agent state.""" + +from __future__ import annotations + +import abc +from typing import Any +from typing import Dict + +from pydantic import BaseModel +from pydantic import ConfigDict + + +class CheckpointableAgentState(BaseModel, abc.ABC): + """Abstract base class for agent state that can be checkpointed. + + Agents that need to preserve custom state across checkpoints should inherit + from this class and implement the serialization methods. + + Example: + ```python + class MyAgentState(CheckpointableAgentState): + counter: int = 0 + processed_items: list[str] = [] + + def to_checkpoint_dict(self) -> dict[str, Any]: + return { + "counter": self.counter, + "processed_items": self.processed_items, + } + + @classmethod + def from_checkpoint_dict(cls, data: dict[str, Any]) -> "MyAgentState": + return cls( + counter=data.get("counter", 0), + processed_items=data.get("processed_items", []), + ) + ``` + """ + + model_config = ConfigDict( + extra="allow", + ) + + @abc.abstractmethod + def to_checkpoint_dict(self) -> Dict[str, Any]: + """Serialize the state to a dictionary for checkpointing. + + Returns: + A dictionary containing all state that should be persisted. + The dictionary must be JSON-serializable. + """ + + @classmethod + @abc.abstractmethod + def from_checkpoint_dict( + cls, data: Dict[str, Any] + ) -> "CheckpointableAgentState": + """Deserialize the state from a checkpoint dictionary. + + Args: + data: The dictionary previously returned by to_checkpoint_dict(). + + Returns: + A new instance of the state class with restored values. + """ + + +class SimpleCheckpointableState(CheckpointableAgentState): + """A simple implementation of CheckpointableAgentState using a dict. + + This class provides a basic implementation that stores arbitrary key-value + pairs. Use this when you don't need custom serialization logic. 
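+
+  Note: to_checkpoint_dict() returns a shallow copy of `data`, so nested
+  mutable values are still shared with the live state.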
+ + Example: + ```python + state = SimpleCheckpointableState() + state.data["counter"] = 5 + state.data["results"] = ["a", "b", "c"] + + # Checkpoint + checkpoint = state.to_checkpoint_dict() + + # Restore + restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + assert restored.data["counter"] == 5 + ``` + """ + + data: Dict[str, Any] = {} + + def to_checkpoint_dict(self) -> Dict[str, Any]: + """Serialize the state to a dictionary.""" + return {"data": self.data.copy()} + + @classmethod + def from_checkpoint_dict( + cls, data: Dict[str, Any] + ) -> "SimpleCheckpointableState": + """Deserialize the state from a checkpoint dictionary.""" + return cls(data=data.get("data", {})) diff --git a/src/google/adk/durable/config.py b/src/google/adk/durable/config.py new file mode 100644 index 0000000000..d4b7d91e7a --- /dev/null +++ b/src/google/adk/durable/config.py @@ -0,0 +1,70 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration for durable session persistence.""" + +from __future__ import annotations + +from typing import Any +from typing import Literal +from typing import Optional + +from pydantic import BaseModel +from pydantic import ConfigDict +from pydantic import Field + +from ..utils.feature_decorator import experimental + + +@experimental +class DurableSessionConfig(BaseModel): + """Configuration for durable session persistence. + + Durable sessions provide checkpoint-based persistence that survives process + restarts, enabling recovery from failures and migration across hosts. This + goes beyond the basic resumability feature by persisting session state to + external storage (BigQuery + GCS). + + Attributes: + is_durable: Whether to enable durable checkpointing. + checkpoint_policy: When to create checkpoints: + - "async_boundary": Checkpoint when hitting async/long-running operations + - "every_turn": Checkpoint after every agent turn + - "manual": Only checkpoint when explicitly requested + checkpoint_store: The store to use for persisting checkpoints. + lease_timeout_seconds: How long a lease is valid before expiring. + max_checkpoint_size_bytes: Maximum size for checkpoint state blobs. 
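+
+  Example (illustrative; assumes the BigQuery dataset and GCS bucket already
+  exist, with placeholder resource names):
+
+    ```python
+    from google.adk.durable import BigQueryCheckpointStore
+    from google.adk.durable import DurableSessionConfig
+
+    config = DurableSessionConfig(
+        is_durable=True,
+        checkpoint_policy="async_boundary",
+        checkpoint_store=BigQueryCheckpointStore(
+            project="my-project",
+            dataset="adk_metadata",
+            gcs_bucket="my-project-adk-checkpoints",
+        ),
+        lease_timeout_seconds=300,
+    )
+    ```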
+ """ + + model_config = ConfigDict( + arbitrary_types_allowed=True, + extra="forbid", + ) + + is_durable: bool = False + """Whether to enable durable checkpointing.""" + + checkpoint_policy: Literal["async_boundary", "every_turn", "manual"] = ( + "async_boundary" + ) + """When to create checkpoints during execution.""" + + checkpoint_store: Optional[Any] = Field(default=None) + """The store to use for persisting checkpoints (DurableSessionStore).""" + + lease_timeout_seconds: int = Field(default=300, ge=60, le=3600) + """How long a lease is valid before expiring (60-3600 seconds).""" + + max_checkpoint_size_bytes: int = Field(default=10 * 1024 * 1024, ge=1024) + """Maximum size for checkpoint state blobs (default 10MB).""" diff --git a/src/google/adk/durable/stores/__init__.py b/src/google/adk/durable/stores/__init__.py new file mode 100644 index 0000000000..cb04e432b6 --- /dev/null +++ b/src/google/adk/durable/stores/__init__.py @@ -0,0 +1,23 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Checkpoint store implementations for durable sessions.""" + +from .base_checkpoint_store import DurableSessionStore +from .bigquery_checkpoint_store import BigQueryCheckpointStore + +__all__ = [ + "DurableSessionStore", + "BigQueryCheckpointStore", +] diff --git a/src/google/adk/durable/stores/base_checkpoint_store.py b/src/google/adk/durable/stores/base_checkpoint_store.py new file mode 100644 index 0000000000..e6d553cd57 --- /dev/null +++ b/src/google/adk/durable/stores/base_checkpoint_store.py @@ -0,0 +1,258 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Abstract base class for durable session checkpoint stores.""" + +from __future__ import annotations + +import abc +from dataclasses import dataclass +from datetime import datetime +from typing import Any +from typing import Dict +from typing import Optional + + +@dataclass +class Checkpoint: + """Represents a checkpoint for a durable session. + + Attributes: + session_id: The ID of the session this checkpoint belongs to. + checkpoint_seq: The sequence number of this checkpoint (monotonically + increasing). + created_at: When this checkpoint was created. + gcs_state_uri: The GCS URI where the full state blob is stored. + sha256: SHA-256 hash of the state blob for integrity verification. + size_bytes: Size of the state blob in bytes. + agent_state: Small agent state stored inline in BigQuery (optional). + trigger: What triggered this checkpoint (e.g., "async_boundary", "manual"). 
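+
+  The full serialized state lives in GCS at `gcs_state_uri`; the blob is
+  re-hashed on read and compared against `sha256` to detect corruption.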
+ """ + + session_id: str + checkpoint_seq: int + created_at: datetime + gcs_state_uri: str + sha256: str + size_bytes: int + agent_state: Optional[Dict[str, Any]] = None + trigger: str = "async_boundary" + + +@dataclass +class SessionMetadata: + """Metadata about a durable session. + + Attributes: + session_id: The unique session identifier. + status: Current status ("active", "paused", "completed", "failed"). + agent_name: Name of the root agent for this session. + created_at: When the session was created. + updated_at: When the session was last updated. + current_checkpoint_seq: The latest checkpoint sequence number. + active_lease_id: ID of the current lease holder (if any). + lease_expiry: When the current lease expires. + ttl_expiry: When this session should be garbage collected. + metadata: Additional custom metadata. + """ + + session_id: str + status: str + agent_name: str + created_at: datetime + updated_at: datetime + current_checkpoint_seq: int + active_lease_id: Optional[str] = None + lease_expiry: Optional[datetime] = None + ttl_expiry: Optional[datetime] = None + metadata: Optional[Dict[str, Any]] = None + + +class DurableSessionStore(abc.ABC): + """Abstract base class for checkpoint stores. + + A checkpoint store provides persistent storage for session checkpoints, + enabling recovery from failures and migration across hosts. + + Implementations must provide: + - Checkpoint write/read operations with two-phase commit + - Lease management to prevent concurrent modifications + - Session metadata management + """ + + @abc.abstractmethod + async def create_session( + self, + *, + session_id: str, + agent_name: str, + metadata: Optional[Dict[str, Any]] = None, + ) -> SessionMetadata: + """Create a new durable session. + + Args: + session_id: Unique identifier for the session. + agent_name: Name of the root agent. + metadata: Optional custom metadata. + + Returns: + The created session metadata. + + Raises: + ValueError: If a session with this ID already exists. + """ + + @abc.abstractmethod + async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]: + """Get session metadata. + + Args: + session_id: The session to retrieve. + + Returns: + The session metadata, or None if not found. + """ + + @abc.abstractmethod + async def update_session_status( + self, *, session_id: str, status: str + ) -> None: + """Update the status of a session. + + Args: + session_id: The session to update. + status: The new status. + """ + + @abc.abstractmethod + async def write_checkpoint( + self, + *, + session_id: str, + checkpoint_seq: int, + state_blob: bytes, + agent_state: Optional[Dict[str, Any]] = None, + trigger: str = "async_boundary", + ) -> Checkpoint: + """Write a checkpoint with two-phase commit. + + This operation should: + 1. Upload the state blob to GCS + 2. Record the checkpoint metadata in BigQuery + 3. Update the session's current_checkpoint_seq + + Args: + session_id: The session to checkpoint. + checkpoint_seq: The sequence number for this checkpoint. + state_blob: The serialized state to persist. + agent_state: Small agent state to store inline (optional). + trigger: What triggered this checkpoint. + + Returns: + The created checkpoint. + + Raises: + ValueError: If the checkpoint_seq is not greater than the current. + """ + + @abc.abstractmethod + async def read_latest_checkpoint( + self, *, session_id: str + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read the latest checkpoint for a session. + + Args: + session_id: The session to read. 
+ + Returns: + A tuple of (checkpoint, state_blob), or None if no checkpoints exist. + """ + + @abc.abstractmethod + async def read_checkpoint( + self, *, session_id: str, checkpoint_seq: int + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read a specific checkpoint. + + Args: + session_id: The session to read. + checkpoint_seq: The checkpoint sequence number. + + Returns: + A tuple of (checkpoint, state_blob), or None if not found. + """ + + @abc.abstractmethod + async def acquire_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Attempt to acquire a lease on a session. + + Leases prevent concurrent modifications to a session. Only the lease + holder can write checkpoints or update session status. + + Args: + session_id: The session to lease. + lease_id: A unique identifier for this lease attempt. + timeout_seconds: How long the lease should be valid. + + Returns: + True if the lease was acquired, False if another lease is active. + """ + + @abc.abstractmethod + async def release_lease(self, *, session_id: str, lease_id: str) -> None: + """Release a lease on a session. + + Args: + session_id: The session to release. + lease_id: The lease ID to release (must match the active lease). + """ + + @abc.abstractmethod + async def renew_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Renew an existing lease. + + Args: + session_id: The session to renew. + lease_id: The lease ID to renew (must match the active lease). + timeout_seconds: New timeout for the lease. + + Returns: + True if the lease was renewed, False if the lease is not active. + """ + + @abc.abstractmethod + async def list_checkpoints( + self, *, session_id: str, limit: int = 10 + ) -> list[Checkpoint]: + """List checkpoints for a session. + + Args: + session_id: The session to list checkpoints for. + limit: Maximum number of checkpoints to return. + + Returns: + List of checkpoints, ordered by checkpoint_seq descending. + """ + + @abc.abstractmethod + async def delete_session(self, *, session_id: str) -> None: + """Delete a session and all its checkpoints. + + Args: + session_id: The session to delete. + """ diff --git a/src/google/adk/durable/stores/bigquery_checkpoint_store.py b/src/google/adk/durable/stores/bigquery_checkpoint_store.py new file mode 100644 index 0000000000..3d53995ecc --- /dev/null +++ b/src/google/adk/durable/stores/bigquery_checkpoint_store.py @@ -0,0 +1,693 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
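+
+# NOTE: This store depends on the optional `google-cloud-bigquery` and
+# `google-cloud-storage` client libraries; they are imported lazily inside the
+# store methods, so importing this module does not require them.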
+ +"""BigQuery + GCS implementation of durable session checkpoint store.""" + +from __future__ import annotations + +from datetime import datetime +from datetime import timedelta +from datetime import timezone +import hashlib +import json +import logging +from typing import Any +from typing import Dict +from typing import Optional +import uuid + +from ...utils.feature_decorator import experimental +from .base_checkpoint_store import Checkpoint +from .base_checkpoint_store import DurableSessionStore +from .base_checkpoint_store import SessionMetadata + +logger = logging.getLogger("google_adk." + __name__) + + +@experimental +class BigQueryCheckpointStore(DurableSessionStore): + """Checkpoint store using BigQuery for metadata and GCS for state blobs. + + This implementation stores: + - Session metadata and checkpoint records in BigQuery tables + - Large state blobs in Google Cloud Storage + + Prerequisites: + - BigQuery dataset with sessions and checkpoints tables + - GCS bucket for state blobs + - Appropriate IAM permissions + + Example: + ```python + store = BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", + gcs_bucket="my-project-adk-checkpoints", + ) + + # Create a session + await store.create_session( + session_id="sess-123", + agent_name="my_agent", + ) + + # Write a checkpoint + await store.write_checkpoint( + session_id="sess-123", + checkpoint_seq=1, + state_blob=b"...", + ) + + # Read it back + checkpoint, blob = await store.read_latest_checkpoint(session_id="sess-123") + ``` + """ + + def __init__( + self, + *, + project: str, + dataset: str, + gcs_bucket: str, + sessions_table: str = "sessions", + checkpoints_table: str = "checkpoints", + location: str = "US", + ): + """Initialize the BigQuery checkpoint store. + + Args: + project: GCP project ID. + dataset: BigQuery dataset name. + gcs_bucket: GCS bucket name for state blobs. + sessions_table: Name of the sessions table. + checkpoints_table: Name of the checkpoints table. + location: BigQuery dataset location. 
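+
+    Note: the BigQuery and GCS clients are created lazily on first use, so
+    constructing the store performs no network calls.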
+ """ + self._project = project + self._dataset = dataset + self._gcs_bucket = gcs_bucket + self._sessions_table = sessions_table + self._checkpoints_table = checkpoints_table + self._location = location + + # Lazy-loaded clients + self._bq_client = None + self._storage_client = None + + @property + def _sessions_table_id(self) -> str: + return f"{self._project}.{self._dataset}.{self._sessions_table}" + + @property + def _checkpoints_table_id(self) -> str: + return f"{self._project}.{self._dataset}.{self._checkpoints_table}" + + def _get_bq_client(self): + """Lazy-load BigQuery client.""" + if self._bq_client is None: + from google.cloud import bigquery + + self._bq_client = bigquery.Client( + project=self._project, location=self._location + ) + return self._bq_client + + def _get_storage_client(self): + """Lazy-load Cloud Storage client.""" + if self._storage_client is None: + from google.cloud import storage + + self._storage_client = storage.Client(project=self._project) + return self._storage_client + + def _get_gcs_uri(self, session_id: str, checkpoint_seq: int) -> str: + """Generate a GCS URI for a checkpoint blob.""" + return f"gs://{self._gcs_bucket}/checkpoints/{session_id}/{checkpoint_seq}.json.gz" + + async def create_session( + self, + *, + session_id: str, + agent_name: str, + metadata: Optional[Dict[str, Any]] = None, + ) -> SessionMetadata: + """Create a new durable session.""" + now = datetime.now(timezone.utc) + + # Check if session already exists + existing = await self.get_session(session_id=session_id) + if existing: + raise ValueError(f"Session {session_id} already exists") + + # Insert session record using DML (not streaming) for immediate updatability + client = self._get_bq_client() + from google.cloud import bigquery + + insert_query = f""" + INSERT INTO `{self._sessions_table_id}` + (session_id, status, agent_name, created_at, updated_at, + current_checkpoint_seq, active_lease_id, lease_expiry, ttl_expiry, metadata) + VALUES + (@session_id, @status, @agent_name, @created_at, @updated_at, + @current_checkpoint_seq, @active_lease_id, @lease_expiry, @ttl_expiry, + PARSE_JSON(@metadata)) + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("status", "STRING", "active"), + bigquery.ScalarQueryParameter("agent_name", "STRING", agent_name), + bigquery.ScalarQueryParameter( + "created_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("current_checkpoint_seq", "INT64", 0), + bigquery.ScalarQueryParameter("active_lease_id", "STRING", None), + bigquery.ScalarQueryParameter("lease_expiry", "TIMESTAMP", None), + bigquery.ScalarQueryParameter("ttl_expiry", "TIMESTAMP", None), + bigquery.ScalarQueryParameter( + "metadata", "STRING", json.dumps(metadata) if metadata else None + ), + ] + ) + client.query(insert_query, job_config=job_config).result() + + logger.info("Created durable session: %s", session_id) + + return SessionMetadata( + session_id=session_id, + status="active", + agent_name=agent_name, + created_at=now, + updated_at=now, + current_checkpoint_seq=0, + metadata=metadata, + ) + + async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]: + """Get session metadata.""" + query = f""" + SELECT + session_id, + status, + agent_name, + created_at, + updated_at, + current_checkpoint_seq, + active_lease_id, + lease_expiry, + ttl_expiry, + 
metadata + FROM `{self._sessions_table_id}` + WHERE session_id = @session_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + ] + ) + results = client.query(query, job_config=job_config).result() + + for row in results: + return SessionMetadata( + session_id=row.session_id, + status=row.status, + agent_name=row.agent_name, + created_at=row.created_at, + updated_at=row.updated_at, + current_checkpoint_seq=row.current_checkpoint_seq, + active_lease_id=row.active_lease_id, + lease_expiry=row.lease_expiry, + ttl_expiry=row.ttl_expiry, + metadata=row.metadata if isinstance(row.metadata, dict) else (json.loads(row.metadata) if row.metadata else None), + ) + + return None + + async def update_session_status( + self, *, session_id: str, status: str + ) -> None: + """Update the status of a session.""" + now = datetime.now(timezone.utc) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET status = @status, updated_at = @updated_at + WHERE session_id = @session_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("status", "STRING", status), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(query, job_config=job_config).result() + logger.debug("Updated session %s status to %s", session_id, status) + + async def write_checkpoint( + self, + *, + session_id: str, + checkpoint_seq: int, + state_blob: bytes, + agent_state: Optional[Dict[str, Any]] = None, + trigger: str = "async_boundary", + ) -> Checkpoint: + """Write a checkpoint with two-phase commit.""" + import gzip + + now = datetime.now(timezone.utc) + + # Verify session exists and checkpoint_seq is valid + session = await self.get_session(session_id=session_id) + if not session: + raise ValueError(f"Session {session_id} not found") + + if checkpoint_seq <= session.current_checkpoint_seq: + raise ValueError( + f"checkpoint_seq {checkpoint_seq} must be greater than current" + f" {session.current_checkpoint_seq}" + ) + + # Compute hash of the state blob + sha256 = hashlib.sha256(state_blob).hexdigest() + size_bytes = len(state_blob) + + # Phase 1: Upload to GCS + gcs_uri = self._get_gcs_uri(session_id, checkpoint_seq) + blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "") + + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + blob = bucket.blob(blob_path) + + compressed = gzip.compress(state_blob) + blob.upload_from_string(compressed, content_type="application/gzip") + logger.debug( + "Uploaded checkpoint blob to %s (%d bytes compressed)", + gcs_uri, + len(compressed), + ) + + # Phase 2: Insert checkpoint record using DML + client = self._get_bq_client() + from google.cloud import bigquery + + insert_query = f""" + INSERT INTO `{self._checkpoints_table_id}` + (session_id, checkpoint_seq, created_at, gcs_state_uri, sha256, + size_bytes, agent_state_json, trigger) + VALUES + (@session_id, @checkpoint_seq, @created_at, @gcs_state_uri, @sha256, + @size_bytes, PARSE_JSON(@agent_state_json), @trigger) + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", 
"INT64", checkpoint_seq + ), + bigquery.ScalarQueryParameter( + "created_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("gcs_state_uri", "STRING", gcs_uri), + bigquery.ScalarQueryParameter("sha256", "STRING", sha256), + bigquery.ScalarQueryParameter("size_bytes", "INT64", size_bytes), + bigquery.ScalarQueryParameter( + "agent_state_json", "STRING", + json.dumps(agent_state) if agent_state else None + ), + bigquery.ScalarQueryParameter("trigger", "STRING", trigger), + ] + ) + try: + client.query(insert_query, job_config=job_config).result() + except Exception as e: + # Rollback: delete the GCS blob + blob.delete() + raise RuntimeError(f"Failed to insert checkpoint record: {e}") + + # Phase 3: Update session's current_checkpoint_seq + from google.cloud import bigquery + + update_query = f""" + UPDATE `{self._sessions_table_id}` + SET current_checkpoint_seq = @checkpoint_seq, updated_at = @updated_at + WHERE session_id = @session_id + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", "INT64", checkpoint_seq + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(update_query, job_config=job_config).result() + + logger.info( + "Wrote checkpoint %d for session %s (%d bytes, sha256=%s)", + checkpoint_seq, + session_id, + size_bytes, + sha256[:16], + ) + + return Checkpoint( + session_id=session_id, + checkpoint_seq=checkpoint_seq, + created_at=now, + gcs_state_uri=gcs_uri, + sha256=sha256, + size_bytes=size_bytes, + agent_state=agent_state, + trigger=trigger, + ) + + async def read_latest_checkpoint( + self, *, session_id: str + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read the latest checkpoint for a session.""" + session = await self.get_session(session_id=session_id) + if not session or session.current_checkpoint_seq == 0: + return None + + return await self.read_checkpoint( + session_id=session_id, checkpoint_seq=session.current_checkpoint_seq + ) + + async def read_checkpoint( + self, *, session_id: str, checkpoint_seq: int + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read a specific checkpoint.""" + import gzip + + query = f""" + SELECT + session_id, + checkpoint_seq, + created_at, + gcs_state_uri, + sha256, + size_bytes, + agent_state_json, + trigger + FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id AND checkpoint_seq = @checkpoint_seq + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", "INT64", checkpoint_seq + ), + ] + ) + results = client.query(query, job_config=job_config).result() + + checkpoint_row = None + for row in results: + checkpoint_row = row + break + + if not checkpoint_row: + return None + + # Download blob from GCS + gcs_uri = checkpoint_row.gcs_state_uri + blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "") + + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + blob = bucket.blob(blob_path) + + compressed = blob.download_as_bytes() + state_blob = gzip.decompress(compressed) + + # Verify integrity + actual_sha256 = hashlib.sha256(state_blob).hexdigest() + if actual_sha256 != checkpoint_row.sha256: + raise RuntimeError( + "Checkpoint integrity check failed: expected" + 
f" {checkpoint_row.sha256}, got {actual_sha256}" + ) + + checkpoint = Checkpoint( + session_id=checkpoint_row.session_id, + checkpoint_seq=checkpoint_row.checkpoint_seq, + created_at=checkpoint_row.created_at, + gcs_state_uri=checkpoint_row.gcs_state_uri, + sha256=checkpoint_row.sha256, + size_bytes=checkpoint_row.size_bytes, + agent_state=( + checkpoint_row.agent_state_json if isinstance(checkpoint_row.agent_state_json, dict) + else (json.loads(checkpoint_row.agent_state_json) if checkpoint_row.agent_state_json else None) + ), + trigger=checkpoint_row.trigger, + ) + + logger.debug( + "Read checkpoint %d for session %s (%d bytes)", + checkpoint_seq, + session_id, + len(state_blob), + ) + + return checkpoint, state_blob + + async def acquire_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Attempt to acquire a lease on a session.""" + now = datetime.now(timezone.utc) + expiry = now + timedelta(seconds=timeout_seconds) + + # Atomic update: only succeed if no active lease or lease expired + query = f""" + UPDATE `{self._sessions_table_id}` + SET + active_lease_id = @lease_id, + lease_expiry = @lease_expiry, + updated_at = @updated_at + WHERE + session_id = @session_id + AND (active_lease_id IS NULL OR lease_expiry < @now) + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "lease_expiry", "TIMESTAMP", expiry.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("now", "TIMESTAMP", now.isoformat()), + ] + ) + result = client.query(query, job_config=job_config).result() + + # Check if the update affected any rows + if result.num_dml_affected_rows and result.num_dml_affected_rows > 0: + logger.info("Acquired lease %s on session %s", lease_id, session_id) + return True + else: + logger.debug("Failed to acquire lease on session %s", session_id) + return False + + async def release_lease(self, *, session_id: str, lease_id: str) -> None: + """Release a lease on a session.""" + now = datetime.now(timezone.utc) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET + active_lease_id = NULL, + lease_expiry = NULL, + updated_at = @updated_at + WHERE session_id = @session_id AND active_lease_id = @lease_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(query, job_config=job_config).result() + logger.info("Released lease %s on session %s", lease_id, session_id) + + async def renew_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Renew an existing lease.""" + now = datetime.now(timezone.utc) + expiry = now + timedelta(seconds=timeout_seconds) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET lease_expiry = @lease_expiry, updated_at = @updated_at + WHERE session_id = @session_id AND active_lease_id = @lease_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + 
bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "lease_expiry", "TIMESTAMP", expiry.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + result = client.query(query, job_config=job_config).result() + + if result.num_dml_affected_rows and result.num_dml_affected_rows > 0: + logger.debug("Renewed lease %s on session %s", lease_id, session_id) + return True + return False + + async def list_checkpoints( + self, *, session_id: str, limit: int = 10 + ) -> list[Checkpoint]: + """List checkpoints for a session.""" + query = f""" + SELECT + session_id, + checkpoint_seq, + created_at, + gcs_state_uri, + sha256, + size_bytes, + agent_state_json, + trigger + FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id + ORDER BY checkpoint_seq DESC + LIMIT @limit + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("limit", "INT64", limit), + ] + ) + results = client.query(query, job_config=job_config).result() + + checkpoints = [] + for row in results: + checkpoints.append( + Checkpoint( + session_id=row.session_id, + checkpoint_seq=row.checkpoint_seq, + created_at=row.created_at, + gcs_state_uri=row.gcs_state_uri, + sha256=row.sha256, + size_bytes=row.size_bytes, + agent_state=( + row.agent_state_json if isinstance(row.agent_state_json, dict) + else (json.loads(row.agent_state_json) if row.agent_state_json else None) + ), + trigger=row.trigger, + ) + ) + + return checkpoints + + async def delete_session(self, *, session_id: str) -> None: + """Delete a session and all its checkpoints.""" + # Delete checkpoints from GCS + checkpoints = await self.list_checkpoints(session_id=session_id, limit=1000) + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + + for checkpoint in checkpoints: + blob_path = checkpoint.gcs_state_uri.replace( + f"gs://{self._gcs_bucket}/", "" + ) + blob = bucket.blob(blob_path) + try: + blob.delete() + except Exception as e: + logger.warning("Failed to delete blob %s: %s", blob_path, e) + + # Delete checkpoint records + client = self._get_bq_client() + from google.cloud import bigquery + + delete_checkpoints = f""" + DELETE FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + ] + ) + client.query(delete_checkpoints, job_config=job_config).result() + + # Delete session record + delete_session = f""" + DELETE FROM `{self._sessions_table_id}` + WHERE session_id = @session_id + """ + client.query(delete_session, job_config=job_config).result() + + logger.info( + "Deleted session %s and %d checkpoints", session_id, len(checkpoints) + ) diff --git a/src/google/adk/durable/workspace_snapshotter.py b/src/google/adk/durable/workspace_snapshotter.py new file mode 100644 index 0000000000..1462b883d7 --- /dev/null +++ b/src/google/adk/durable/workspace_snapshotter.py @@ -0,0 +1,187 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Workspace snapshot handling for durable sessions.""" + +from __future__ import annotations + +import hashlib +import io +import json +import logging +from pathlib import Path +import tarfile +from typing import Any +from typing import Dict +from typing import Optional + +logger = logging.getLogger("google_adk." + __name__) + + +class WorkspaceSnapshotter: + """Handles workspace file snapshots for durable checkpoints. + + This class provides utilities for creating and restoring snapshots of + workspace directories, enabling agents to persist and restore file-based + state across checkpoint boundaries. + + Example: + ```python + snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/agent_workspace") + + # Create a snapshot + blob, sha256, size = snapshotter.create_snapshot() + + # Later, restore from snapshot + snapshotter.restore_snapshot(blob) + ``` + """ + + def __init__( + self, + workspace_dir: Optional[str] = None, + exclude_patterns: Optional[list[str]] = None, + ): + """Initialize the workspace snapshotter. + + Args: + workspace_dir: Path to the workspace directory to snapshot. + exclude_patterns: List of glob patterns to exclude from snapshots. + """ + self._workspace_dir = Path(workspace_dir) if workspace_dir else None + self._exclude_patterns = exclude_patterns or [ + "__pycache__", + "*.pyc", + ".git", + ".env", + "node_modules", + "*.log", + ] + + @property + def workspace_dir(self) -> Optional[Path]: + """The workspace directory being snapshotted.""" + return self._workspace_dir + + def create_snapshot(self) -> tuple[bytes, str, int]: + """Create a tarball snapshot of the workspace directory. + + Returns: + A tuple of (blob_bytes, sha256_hash, size_bytes). + + Raises: + ValueError: If no workspace directory is configured. + FileNotFoundError: If the workspace directory doesn't exist. + """ + if not self._workspace_dir: + raise ValueError("No workspace directory configured") + + if not self._workspace_dir.exists(): + raise FileNotFoundError( + f"Workspace directory not found: {self._workspace_dir}" + ) + + buffer = io.BytesIO() + with tarfile.open(fileobj=buffer, mode="w:gz") as tar: + for path in self._workspace_dir.rglob("*"): + if path.is_file() and not self._should_exclude(path): + arcname = path.relative_to(self._workspace_dir) + tar.add(path, arcname=str(arcname)) + + blob = buffer.getvalue() + sha256 = hashlib.sha256(blob).hexdigest() + + logger.debug( + "Created workspace snapshot: %d bytes, sha256=%s", len(blob), sha256 + ) + + return blob, sha256, len(blob) + + def restore_snapshot(self, blob: bytes) -> None: + """Restore a workspace from a tarball snapshot. + + Args: + blob: The snapshot blob previously created by create_snapshot(). + + Raises: + ValueError: If no workspace directory is configured. 
+ """ + if not self._workspace_dir: + raise ValueError("No workspace directory configured") + + self._workspace_dir.mkdir(parents=True, exist_ok=True) + + buffer = io.BytesIO(blob) + with tarfile.open(fileobj=buffer, mode="r:gz") as tar: + # Filter to prevent path traversal attacks + safe_members = [ + m for m in tar.getmembers() if not m.name.startswith(("/", "..")) + ] + tar.extractall(path=self._workspace_dir, members=safe_members) + + logger.debug( + "Restored workspace snapshot: %d bytes to %s", + len(blob), + self._workspace_dir, + ) + + def _should_exclude(self, path: Path) -> bool: + """Check if a path should be excluded from snapshots.""" + path_str = str(path) + for pattern in self._exclude_patterns: + if pattern.startswith("*"): + # Suffix match (e.g., *.pyc) + if path_str.endswith(pattern[1:]): + return True + elif pattern in path_str: + # Contains match (e.g., __pycache__) + return True + return False + + +def serialize_state_to_json(state: Dict[str, Any]) -> bytes: + """Serialize state dictionary to JSON bytes. + + Args: + state: The state dictionary to serialize. + + Returns: + JSON-encoded bytes. + """ + return json.dumps(state, sort_keys=True, default=str).encode("utf-8") + + +def deserialize_state_from_json(blob: bytes) -> Dict[str, Any]: + """Deserialize state from JSON bytes. + + Args: + blob: JSON-encoded bytes. + + Returns: + The deserialized state dictionary. + """ + return json.loads(blob.decode("utf-8")) + + +def compute_state_hash(state: Dict[str, Any]) -> str: + """Compute a SHA-256 hash of the state dictionary. + + Args: + state: The state dictionary to hash. + + Returns: + The hex-encoded SHA-256 hash. + """ + blob = serialize_state_to_json(state) + return hashlib.sha256(blob).hexdigest() diff --git a/tests/unittests/durable/__init__.py b/tests/unittests/durable/__init__.py new file mode 100644 index 0000000000..58d482ea38 --- /dev/null +++ b/tests/unittests/durable/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/unittests/durable/test_bigquery_checkpoint_store.py b/tests/unittests/durable/test_bigquery_checkpoint_store.py new file mode 100644 index 0000000000..f792c6b4b0 --- /dev/null +++ b/tests/unittests/durable/test_bigquery_checkpoint_store.py @@ -0,0 +1,273 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Tests for BigQueryCheckpointStore.""" + +from datetime import datetime +from datetime import timezone +from unittest.mock import AsyncMock +from unittest.mock import MagicMock +from unittest.mock import patch + +from google.adk.durable.stores.bigquery_checkpoint_store import BigQueryCheckpointStore +import pytest + + +class TestBigQueryCheckpointStore: + """Tests for BigQueryCheckpointStore.""" + + @pytest.fixture + def store(self): + """Create a store instance for testing.""" + return BigQueryCheckpointStore( + project="test-project", + dataset="test_dataset", + gcs_bucket="test-bucket", + ) + + def test_init(self, store): + """Test store initialization.""" + assert store._project == "test-project" + assert store._dataset == "test_dataset" + assert store._gcs_bucket == "test-bucket" + assert store._sessions_table == "sessions" + assert store._checkpoints_table == "checkpoints" + assert store._location == "US" + + def test_table_ids(self, store): + """Test table ID generation.""" + assert store._sessions_table_id == "test-project.test_dataset.sessions" + assert ( + store._checkpoints_table_id == "test-project.test_dataset.checkpoints" + ) + + def test_gcs_uri_generation(self, store): + """Test GCS URI generation.""" + uri = store._get_gcs_uri("session-123", 5) + assert uri == "gs://test-bucket/checkpoints/session-123/5.json.gz" + + @pytest.mark.asyncio + async def test_create_session(self, store): + """Test session creation.""" + mock_client = MagicMock() + mock_client.insert_rows_json.return_value = [] + + with patch.object(store, "_get_bq_client", return_value=mock_client): + with patch.object( + store, "get_session", new_callable=AsyncMock + ) as mock_get: + mock_get.return_value = None + + session = await store.create_session( + session_id="test-session", + agent_name="test_agent", + metadata={"key": "value"}, + ) + + assert session.session_id == "test-session" + assert session.agent_name == "test_agent" + assert session.status == "active" + assert session.current_checkpoint_seq == 0 + assert session.metadata == {"key": "value"} + + mock_client.insert_rows_json.assert_called_once() + + @pytest.mark.asyncio + async def test_create_session_already_exists(self, store): + """Test session creation when session already exists.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + from google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=0, + ) + + with pytest.raises(ValueError, match="already exists"): + await store.create_session( + session_id="test-session", + agent_name="test_agent", + ) + + @pytest.mark.asyncio + async def test_write_checkpoint(self, store): + """Test checkpoint writing.""" + mock_bq_client = MagicMock() + mock_bq_client.insert_rows_json.return_value = [] + mock_bq_client.query.return_value.result.return_value = None + + mock_storage_client = MagicMock() + mock_bucket = MagicMock() + mock_blob = MagicMock() + mock_storage_client.bucket.return_value = mock_bucket + mock_bucket.blob.return_value = mock_blob + + with patch.object(store, "_get_bq_client", return_value=mock_bq_client): + with patch.object( + store, "_get_storage_client", return_value=mock_storage_client + ): + with patch.object( + store, "get_session", new_callable=AsyncMock + ) as mock_get: + from 
google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=0, + ) + + checkpoint = await store.write_checkpoint( + session_id="test-session", + checkpoint_seq=1, + state_blob=b'{"state": "data"}', + agent_state={"key": "value"}, + trigger="async_boundary", + ) + + assert checkpoint.session_id == "test-session" + assert checkpoint.checkpoint_seq == 1 + assert checkpoint.trigger == "async_boundary" + assert checkpoint.agent_state == {"key": "value"} + + # Verify GCS upload was called + mock_blob.upload_from_string.assert_called_once() + + # Verify BQ insert was called + mock_bq_client.insert_rows_json.assert_called_once() + + @pytest.mark.asyncio + async def test_write_checkpoint_invalid_seq(self, store): + """Test checkpoint writing with invalid sequence number.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + from google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=5, + ) + + with pytest.raises(ValueError, match="must be greater"): + await store.write_checkpoint( + session_id="test-session", + checkpoint_seq=3, # Less than current (5) + state_blob=b"data", + ) + + @pytest.mark.asyncio + async def test_write_checkpoint_session_not_found(self, store): + """Test checkpoint writing when session doesn't exist.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + mock_get.return_value = None + + with pytest.raises(ValueError, match="not found"): + await store.write_checkpoint( + session_id="nonexistent", + checkpoint_seq=1, + state_blob=b"data", + ) + + @pytest.mark.asyncio + async def test_acquire_lease_success(self, store): + """Test successful lease acquisition.""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 1 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.acquire_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=300, + ) + + assert result is True + mock_client.query.assert_called_once() + + @pytest.mark.asyncio + async def test_acquire_lease_failure(self, store): + """Test failed lease acquisition (another lease active).""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 0 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.acquire_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=300, + ) + + assert result is False + + @pytest.mark.asyncio + async def test_release_lease(self, store): + """Test lease release.""" + mock_client = MagicMock() + mock_client.query.return_value.result.return_value = None + + with patch.object(store, "_get_bq_client", return_value=mock_client): + await store.release_lease( + session_id="test-session", + lease_id="lease-123", + ) + + mock_client.query.assert_called_once() + + @pytest.mark.asyncio + async def test_renew_lease_success(self, store): + 
"""Test successful lease renewal.""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 1 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.renew_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=600, + ) + + assert result is True + + @pytest.mark.asyncio + async def test_renew_lease_failure(self, store): + """Test failed lease renewal (lease not held).""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 0 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.renew_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=600, + ) + + assert result is False diff --git a/tests/unittests/durable/test_checkpointable_state.py b/tests/unittests/durable/test_checkpointable_state.py new file mode 100644 index 0000000000..9c9ab0753c --- /dev/null +++ b/tests/unittests/durable/test_checkpointable_state.py @@ -0,0 +1,172 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for CheckpointableAgentState.""" + +from typing import Any +from typing import Dict + +from google.adk.durable.checkpointable_state import CheckpointableAgentState +from google.adk.durable.checkpointable_state import SimpleCheckpointableState +import pytest + + +class TestSimpleCheckpointableState: + """Tests for SimpleCheckpointableState.""" + + def test_default_state(self): + """Test default state initialization.""" + state = SimpleCheckpointableState() + assert state.data == {} + + def test_state_with_data(self): + """Test state with initial data.""" + state = SimpleCheckpointableState(data={"key": "value", "count": 5}) + assert state.data["key"] == "value" + assert state.data["count"] == 5 + + def test_to_checkpoint_dict(self): + """Test serialization to checkpoint dict.""" + state = SimpleCheckpointableState(data={"items": [1, 2, 3], "name": "test"}) + checkpoint = state.to_checkpoint_dict() + + assert checkpoint == {"data": {"items": [1, 2, 3], "name": "test"}} + + def test_from_checkpoint_dict(self): + """Test deserialization from checkpoint dict.""" + checkpoint = {"data": {"counter": 10, "results": ["a", "b"]}} + state = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + + assert state.data["counter"] == 10 + assert state.data["results"] == ["a", "b"] + + def test_roundtrip(self): + """Test roundtrip serialization/deserialization.""" + original = SimpleCheckpointableState( + data={ + "nested": {"deep": {"value": 42}}, + "list": [1, 2, 3], + "string": "hello", + } + ) + + checkpoint = original.to_checkpoint_dict() + restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + + assert restored.data == original.data + + def test_empty_checkpoint_dict(self): + """Test deserialization from empty checkpoint dict.""" + 
state = SimpleCheckpointableState.from_checkpoint_dict({}) + assert state.data == {} + + +class CustomState(CheckpointableAgentState): + """Custom state implementation for testing.""" + + counter: int = 0 + items: list[str] = [] + metadata: dict[str, Any] = {} + + def __init__(self, **data): + super().__init__(**data) + if "items" not in data: + self.items = [] + if "metadata" not in data: + self.metadata = {} + + def to_checkpoint_dict(self) -> Dict[str, Any]: + return { + "counter": self.counter, + "items": self.items.copy(), + "metadata": self.metadata.copy(), + } + + @classmethod + def from_checkpoint_dict(cls, data: Dict[str, Any]) -> "CustomState": + return cls( + counter=data.get("counter", 0), + items=data.get("items", []), + metadata=data.get("metadata", {}), + ) + + +class TestCustomCheckpointableState: + """Tests for custom CheckpointableAgentState implementations.""" + + def test_custom_state_default(self): + """Test custom state with default values.""" + state = CustomState() + assert state.counter == 0 + assert state.items == [] + assert state.metadata == {} + + def test_custom_state_with_values(self): + """Test custom state with initial values.""" + state = CustomState( + counter=5, + items=["a", "b"], + metadata={"key": "value"}, + ) + assert state.counter == 5 + assert state.items == ["a", "b"] + assert state.metadata == {"key": "value"} + + def test_custom_state_to_checkpoint(self): + """Test custom state serialization.""" + state = CustomState(counter=10, items=["x", "y", "z"]) + checkpoint = state.to_checkpoint_dict() + + assert checkpoint["counter"] == 10 + assert checkpoint["items"] == ["x", "y", "z"] + assert checkpoint["metadata"] == {} + + def test_custom_state_from_checkpoint(self): + """Test custom state deserialization.""" + checkpoint = { + "counter": 42, + "items": ["item1", "item2"], + "metadata": {"created_by": "test"}, + } + state = CustomState.from_checkpoint_dict(checkpoint) + + assert state.counter == 42 + assert state.items == ["item1", "item2"] + assert state.metadata == {"created_by": "test"} + + def test_custom_state_roundtrip(self): + """Test custom state roundtrip.""" + original = CustomState( + counter=100, + items=["first", "second", "third"], + metadata={"version": 1, "tags": ["test", "demo"]}, + ) + + checkpoint = original.to_checkpoint_dict() + restored = CustomState.from_checkpoint_dict(checkpoint) + + assert restored.counter == original.counter + assert restored.items == original.items + assert restored.metadata == original.metadata + + def test_custom_state_isolation(self): + """Test that checkpoint data is isolated from original.""" + state = CustomState(items=["a", "b"]) + checkpoint = state.to_checkpoint_dict() + + # Modify checkpoint + checkpoint["items"].append("c") + + # Original should be unchanged + assert state.items == ["a", "b"] diff --git a/tests/unittests/durable/test_config.py b/tests/unittests/durable/test_config.py new file mode 100644 index 0000000000..cf47e2107e --- /dev/null +++ b/tests/unittests/durable/test_config.py @@ -0,0 +1,104 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for DurableSessionConfig.""" + +from google.adk.durable.config import DurableSessionConfig +from pydantic import ValidationError +import pytest + + +class TestDurableSessionConfig: + """Tests for DurableSessionConfig model.""" + + def test_default_config(self): + """Test default configuration values.""" + config = DurableSessionConfig() + + assert config.is_durable is False + assert config.checkpoint_policy == "async_boundary" + assert config.checkpoint_store is None + assert config.lease_timeout_seconds == 300 + assert config.max_checkpoint_size_bytes == 10 * 1024 * 1024 + + def test_enabled_config(self): + """Test enabled configuration.""" + config = DurableSessionConfig( + is_durable=True, + checkpoint_policy="every_turn", + lease_timeout_seconds=600, + ) + + assert config.is_durable is True + assert config.checkpoint_policy == "every_turn" + assert config.lease_timeout_seconds == 600 + + def test_checkpoint_policies(self): + """Test valid checkpoint policies.""" + for policy in ["async_boundary", "every_turn", "manual"]: + config = DurableSessionConfig(checkpoint_policy=policy) + assert config.checkpoint_policy == policy + + def test_invalid_checkpoint_policy(self): + """Test that invalid checkpoint policies raise validation error.""" + with pytest.raises(ValidationError): + DurableSessionConfig(checkpoint_policy="invalid_policy") + + def test_lease_timeout_bounds(self): + """Test lease timeout validation bounds.""" + # Valid minimum + config = DurableSessionConfig(lease_timeout_seconds=60) + assert config.lease_timeout_seconds == 60 + + # Valid maximum + config = DurableSessionConfig(lease_timeout_seconds=3600) + assert config.lease_timeout_seconds == 3600 + + # Below minimum + with pytest.raises(ValidationError): + DurableSessionConfig(lease_timeout_seconds=59) + + # Above maximum + with pytest.raises(ValidationError): + DurableSessionConfig(lease_timeout_seconds=3601) + + def test_max_checkpoint_size_bounds(self): + """Test max checkpoint size validation.""" + # Valid minimum + config = DurableSessionConfig(max_checkpoint_size_bytes=1024) + assert config.max_checkpoint_size_bytes == 1024 + + # Below minimum + with pytest.raises(ValidationError): + DurableSessionConfig(max_checkpoint_size_bytes=1023) + + def test_extra_fields_forbidden(self): + """Test that extra fields are not allowed.""" + with pytest.raises(ValidationError): + DurableSessionConfig(unknown_field="value") + + def test_config_serialization(self): + """Test config can be serialized to dict.""" + config = DurableSessionConfig( + is_durable=True, + checkpoint_policy="every_turn", + lease_timeout_seconds=120, + ) + + data = config.model_dump() + + assert data["is_durable"] is True + assert data["checkpoint_policy"] == "every_turn" + assert data["lease_timeout_seconds"] == 120 + assert data["checkpoint_store"] is None diff --git a/tests/unittests/durable/test_workspace_snapshotter.py b/tests/unittests/durable/test_workspace_snapshotter.py new file mode 100644 index 0000000000..4e381f1022 --- /dev/null +++ b/tests/unittests/durable/test_workspace_snapshotter.py @@ -0,0 +1,246 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for WorkspaceSnapshotter and utilities.""" + +import os +import tempfile + +from google.adk.durable.workspace_snapshotter import compute_state_hash +from google.adk.durable.workspace_snapshotter import deserialize_state_from_json +from google.adk.durable.workspace_snapshotter import serialize_state_to_json +from google.adk.durable.workspace_snapshotter import WorkspaceSnapshotter +import pytest + + +class TestSerializationUtilities: + """Tests for serialization utility functions.""" + + def test_serialize_simple_dict(self): + """Test serialization of simple dictionary.""" + state = {"key": "value", "number": 42} + blob = serialize_state_to_json(state) + + assert isinstance(blob, bytes) + assert b"key" in blob + assert b"value" in blob + assert b"42" in blob + + def test_deserialize_simple_dict(self): + """Test deserialization of simple dictionary.""" + blob = b'{"key": "value", "number": 42}' + state = deserialize_state_from_json(blob) + + assert state == {"key": "value", "number": 42} + + def test_roundtrip_serialization(self): + """Test roundtrip serialization/deserialization.""" + original = { + "string": "hello", + "number": 123, + "float": 3.14, + "bool": True, + "null": None, + "list": [1, 2, 3], + "nested": {"a": {"b": "c"}}, + } + + blob = serialize_state_to_json(original) + restored = deserialize_state_from_json(blob) + + assert restored == original + + def test_serialize_deterministic(self): + """Test that serialization is deterministic (sorted keys).""" + state1 = {"z": 1, "a": 2, "m": 3} + state2 = {"a": 2, "m": 3, "z": 1} + + blob1 = serialize_state_to_json(state1) + blob2 = serialize_state_to_json(state2) + + assert blob1 == blob2 + + def test_compute_state_hash(self): + """Test state hash computation.""" + state = {"key": "value"} + hash1 = compute_state_hash(state) + + assert isinstance(hash1, str) + assert len(hash1) == 64 # SHA-256 produces 64 hex characters + + def test_hash_deterministic(self): + """Test that hash is deterministic.""" + state1 = {"z": 1, "a": 2} + state2 = {"a": 2, "z": 1} + + assert compute_state_hash(state1) == compute_state_hash(state2) + + def test_hash_changes_with_content(self): + """Test that hash changes with content.""" + hash1 = compute_state_hash({"key": "value1"}) + hash2 = compute_state_hash({"key": "value2"}) + + assert hash1 != hash2 + + +class TestWorkspaceSnapshotter: + """Tests for WorkspaceSnapshotter.""" + + def test_init_default(self): + """Test default initialization.""" + snapshotter = WorkspaceSnapshotter() + + assert snapshotter.workspace_dir is None + assert "__pycache__" in snapshotter._exclude_patterns + + def test_init_with_workspace(self): + """Test initialization with workspace directory.""" + snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/workspace") + + assert str(snapshotter.workspace_dir) == "/tmp/workspace" + + def test_init_with_custom_excludes(self): + """Test initialization with custom exclude patterns.""" + snapshotter = WorkspaceSnapshotter( + workspace_dir="/tmp/workspace", + exclude_patterns=["*.log", "temp/"], + ) + + assert snapshotter._exclude_patterns == ["*.log", "temp/"] + + 
def test_should_exclude_pycache(self): + """Test exclusion of __pycache__ directories.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude( + Path("/some/path/__pycache__/module.pyc") + ) + + def test_should_exclude_pyc_files(self): + """Test exclusion of .pyc files.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/module.pyc")) + + def test_should_not_exclude_py_files(self): + """Test that .py files are not excluded.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert not snapshotter._should_exclude(Path("/some/path/module.py")) + + def test_should_exclude_git(self): + """Test exclusion of .git directories.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/.git/config")) + + def test_should_exclude_env(self): + """Test exclusion of .env files.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/.env")) + + def test_create_snapshot_no_workspace(self): + """Test that create_snapshot fails without workspace.""" + snapshotter = WorkspaceSnapshotter() + + with pytest.raises(ValueError, match="No workspace directory"): + snapshotter.create_snapshot() + + def test_create_snapshot_missing_directory(self): + """Test that create_snapshot fails with missing directory.""" + snapshotter = WorkspaceSnapshotter(workspace_dir="/nonexistent/path") + + with pytest.raises(FileNotFoundError): + snapshotter.create_snapshot() + + def test_restore_snapshot_no_workspace(self): + """Test that restore_snapshot fails without workspace.""" + snapshotter = WorkspaceSnapshotter() + + with pytest.raises(ValueError, match="No workspace directory"): + snapshotter.restore_snapshot(b"data") + + def test_create_and_restore_snapshot(self): + """Test creating and restoring a workspace snapshot.""" + with tempfile.TemporaryDirectory() as tmpdir: + # Create source workspace with files + source_dir = os.path.join(tmpdir, "source") + os.makedirs(source_dir) + + # Create test files + with open(os.path.join(source_dir, "file1.txt"), "w") as f: + f.write("content1") + with open(os.path.join(source_dir, "file2.py"), "w") as f: + f.write("print('hello')") + + # Create subdirectory + subdir = os.path.join(source_dir, "subdir") + os.makedirs(subdir) + with open(os.path.join(subdir, "nested.txt"), "w") as f: + f.write("nested content") + + # Create snapshot + snapshotter = WorkspaceSnapshotter(workspace_dir=source_dir) + blob, sha256, size = snapshotter.create_snapshot() + + assert isinstance(blob, bytes) + assert len(sha256) == 64 + assert size > 0 + + # Restore to different location + dest_dir = os.path.join(tmpdir, "dest") + restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest_dir) + restore_snapshotter.restore_snapshot(blob) + + # Verify files were restored + assert os.path.exists(os.path.join(dest_dir, "file1.txt")) + assert os.path.exists(os.path.join(dest_dir, "file2.py")) + assert os.path.exists(os.path.join(dest_dir, "subdir", "nested.txt")) + + # Verify content + with open(os.path.join(dest_dir, "file1.txt")) as f: + assert f.read() == "content1" + + def test_snapshot_excludes_pycache(self): + """Test that snapshots exclude __pycache__ directories.""" + with tempfile.TemporaryDirectory() as tmpdir: + # Create workspace with __pycache__ + workspace = os.path.join(tmpdir, "workspace") + os.makedirs(workspace) + + with 
open(os.path.join(workspace, "main.py"), "w") as f: + f.write("print('main')") + + pycache = os.path.join(workspace, "__pycache__") + os.makedirs(pycache) + with open(os.path.join(pycache, "main.cpython-311.pyc"), "wb") as f: + f.write(b"\x00\x00\x00\x00") + + # Create snapshot + snapshotter = WorkspaceSnapshotter(workspace_dir=workspace) + blob, _, _ = snapshotter.create_snapshot() + + # Restore and verify __pycache__ was excluded + dest = os.path.join(tmpdir, "dest") + restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest) + restore_snapshotter.restore_snapshot(blob) + + assert os.path.exists(os.path.join(dest, "main.py")) + assert not os.path.exists(os.path.join(dest, "__pycache__"))
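+
+
+# Illustrative sketch: a minimal end-to-end check that combines
+# WorkspaceSnapshotter with the JSON state helpers above. It relies only on
+# the public names already imported at the top of this module; the file name
+# and state keys used here are arbitrary example values.
+def test_state_hash_stable_across_snapshot_roundtrip():
+  """Snapshot a workspace, restore it, and hash a state dict referencing it."""
+  with tempfile.TemporaryDirectory() as tmpdir:
+    source = os.path.join(tmpdir, "source")
+    os.makedirs(source)
+    with open(os.path.join(source, "notes.txt"), "w") as f:
+      f.write("hello")
+
+    # Snapshot the source workspace, then restore it into a fresh directory.
+    blob, sha256, size = WorkspaceSnapshotter(
+        workspace_dir=source
+    ).create_snapshot()
+    dest = os.path.join(tmpdir, "dest")
+    WorkspaceSnapshotter(workspace_dir=dest).restore_snapshot(blob)
+    assert os.path.exists(os.path.join(dest, "notes.txt"))
+    assert size == len(blob)
+
+    # A state dict referencing the snapshot hashes deterministically and
+    # survives a JSON serialize/deserialize round trip unchanged.
+    state = {"workspace_sha256": sha256, "files": ["notes.txt"]}
+    restored = deserialize_state_from_json(serialize_state_to_json(state))
+    assert restored == state
+    assert compute_state_hash(restored) == compute_state_hash(state)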