diff --git a/contributing/samples/long_running_task/README.md b/contributing/samples/long_running_task/README.md new file mode 100644 index 0000000000..b649a5e941 --- /dev/null +++ b/contributing/samples/long_running_task/README.md @@ -0,0 +1,182 @@ +# Durable Session Demo + +This demo showcases the durable session persistence feature in ADK, which +enables checkpoint-based durability for long-running agent invocations. + +## Overview + +Durable sessions provide: +- **Checkpoint persistence**: Agent state is saved to BigQuery + GCS +- **Failure recovery**: Resume from the last checkpoint after crashes +- **Host migration**: Move sessions between hosts seamlessly +- **Lease management**: Prevent concurrent modifications + +## Prerequisites + +1. **Google Cloud Project** with billing enabled +2. **APIs enabled**: + - BigQuery API + - Cloud Storage API + - Vertex AI API (for Gemini models) +3. **IAM permissions**: + - `roles/bigquery.dataEditor` + - `roles/storage.objectAdmin` + - `roles/aiplatform.user` + +## Setup + +### 1. Configure your environment + +```bash +# Set your project +export PROJECT_ID="test-project-0728-467323" +gcloud config set project $PROJECT_ID + +# Set your Google Cloud API key (required for Gemini 3) +export GOOGLE_CLOUD_API_KEY="your-api-key-here" + +# Authenticate +gcloud auth application-default login +``` + +### 2. Create BigQuery and GCS resources + +```bash +# Run the setup script +python contributing/samples/long_running_task/setup.py + +# To verify setup +python contributing/samples/long_running_task/setup.py --verify + +# To clean up resources +python contributing/samples/long_running_task/setup.py --cleanup +``` + +### 3. Run the demo + +```bash +adk web contributing/samples/long_running_task +``` + +## Demo Scenarios + +### Scenario 1: Long-running table scan + +``` +User: Scan the bigquery-public-data.samples.shakespeare table + +Agent: [Calls simulate_long_running_scan] + [Checkpoint written at async boundary] + [Scan completes after ~5-10 seconds] + The scan found 164,656 rows with the following findings: + - Found 5 instances of 'to be or not to be' + - Most common word: 'the' (27,801 occurrences) + - Unique words: 29,066 +``` + +### Scenario 2: Multi-stage pipeline + +``` +User: Run a pipeline from source_table to dest_table with transformations: + filter, aggregate, join + +Agent: [Calls run_data_pipeline] + [Checkpoint written at each stage boundary] + Pipeline completed successfully: + - Stage 1 (filter): 45,000 rows processed + - Stage 2 (aggregate): 32,000 rows processed + - Stage 3 (join): 28,000 rows processed +``` + +### Scenario 3: Failure recovery + +1. Start a long-running scan +2. Kill the process mid-execution +3. Restart and resume with the invocation_id +4. 
Agent continues from the last checkpoint + +## Architecture + +``` + +-----------------+ + | Agent | + | (LlmAgent) | + +--------+--------+ + | + v + +-----------------+ + | Runner | + | (with durability)| + +--------+--------+ + | + +----------------+----------------+ + | | + v v + +--------------+ +----------------+ + | BigQuery | | GCS | + | (metadata) | | (state blobs) | + +--------------+ +----------------+ + | - sessions | | - checkpoints/ | + | - checkpoints| | {session_id}/| + +--------------+ +----------------+ +``` + +## Configuration + +The agent is configured in `agent.py`: + +```python +app = App( + name="durable_session_demo", + root_agent=root_agent, + resumability_config=ResumabilityConfig(is_resumable=True), + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + checkpoint_store=BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, + ), + lease_timeout_seconds=300, + ), +) +``` + +### Checkpoint Policies + +- `async_boundary`: Checkpoint when hitting async/long-running operations +- `every_turn`: Checkpoint after every agent turn +- `manual`: Only checkpoint when explicitly requested + +## Monitoring + +### View sessions + +```sql +SELECT * FROM `test-project-0728-467323.adk_metadata.sessions` +ORDER BY updated_at DESC +LIMIT 10; +``` + +### View checkpoints + +```sql +SELECT * FROM `test-project-0728-467323.adk_metadata.checkpoints` +ORDER BY created_at DESC +LIMIT 10; +``` + +### List checkpoint blobs + +```bash +gsutil ls -l gs://test-project-0728-467323-adk-checkpoints/checkpoints/ +``` + +## Cleanup + +To remove all resources created by this demo: + +```bash +python contributing/samples/long_running_task/setup.py --cleanup +``` diff --git a/contributing/samples/long_running_task/REVIEW_FEEDBACK.md b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md new file mode 100644 index 0000000000..c8e6387f69 --- /dev/null +++ b/contributing/samples/long_running_task/REVIEW_FEEDBACK.md @@ -0,0 +1,239 @@ +# Design Document Review: Durable Session Persistence for Long-Horizon ADK Agents + +**Reviewer:** Claude Code +**Date:** 2026-02-01 +**Document:** `long_running_task_design.md` + +--- + +## Executive Summary + +The design document is **well-structured and comprehensive**, covering a real problem with a thorough technical approach. However, there are **critical accuracy issues** regarding ADK's current capabilities that must be addressed before the document can be considered accurate for review. + +**Overall Assessment:** Good foundation, requires significant revisions to accurately reflect ADK's existing resumability features. + +--- + +## 1. Reference Validation + +### External URLs (7 total) - ALL VALID + +| # | URL | Status | Notes | +|---|-----|--------|-------| +| 1 | LangGraph durable-execution | VALID | Content matches claims | +| 2 | LangGraph persistence | VALID | Checkpointing docs | +| 3 | LangGraph overview | VALID | Framework intro | +| 4 | LangGraph checkpoints reference | VALID | API docs | +| 5 | Deep Agents overview | VALID | LangChain library | +| 6 | Deep Agents long-term memory | VALID | Memory patterns | +| 7 | Anthropic harnesses article | VALID | Published 2025-11-26 | + +--- + +## 2. CRITICAL ISSUE: ADK Already Has Resumability + +### Problem Statement Inaccuracy + +The document states (Section 2): +> "Current ADK sessions are optimized for synchronous 'serving' patterns... state is ephemeral... 
background execution is not a first-class runtime mode" + +**This is inaccurate.** ADK already has an experimental resumability feature: + +```python +# src/google/adk/apps/app.py lines 42-58 +@experimental +class ResumabilityConfig(BaseModel): + """The "resumability" in ADK refers to the ability to: + 1. pause an invocation upon a long-running function call. + 2. resume an invocation from the last event, if it's paused or failed midway + through. + """ + is_resumable: bool = False +``` + +### Existing ADK Capabilities Not Mentioned + +| Capability | Location | Status | +|------------|----------|--------| +| `ResumabilityConfig` | `src/google/adk/apps/app.py:42-58` | Experimental | +| `should_pause_invocation()` | `src/google/adk/agents/invocation_context.py:355-389` | Implemented | +| `long_running_tool_ids` | `src/google/adk/events/event.py` | Implemented | +| Resume from last event | `src/google/adk/runners.py:1294` | Implemented | + +### Required Fix + +**The document must:** +1. Acknowledge existing `ResumabilityConfig` and pause/resume capability +2. Clearly articulate how this proposal **extends** existing features vs. replacing them +3. Update Section 2 (Problem Statement) to reflect actual gaps (e.g., durable cross-process persistence, BigQuery-based audit, external event triggers) + +--- + +## 3. Technical Review + +### 3.1 SQL Schema (Appendix B) - VALID WITH MINOR ISSUES + +**Strengths:** +- Proper partitioning strategy (`PARTITION BY DATE`) +- Sensible clustering choices +- JSON columns for flexibility + +**Issues:** + +1. **Missing primary key constraint on checkpoints:** + ```sql + -- Should add: + PRIMARY KEY (session_id, checkpoint_seq) + ``` + +2. **events table lacks PRIMARY KEY:** + ```sql + -- Consider adding: + PRIMARY KEY (event_id) -- or composite key + ``` + +3. **View `v_latest_checkpoint` uses ARRAY_AGG with OFFSET(0):** + - This is valid but will error if no checkpoints exist + - Consider `SAFE_OFFSET(0)` or handle NULL case + +### 3.2 Python Code Snippets - MOSTLY VALID + +**Section 7.1 `write_checkpoint()`:** +- Logic is sound (two-phase commit pattern) +- Consider adding error handling for partial failures + +**Section 7.2 `reconcile_on_resume()`:** +- Good idempotency pattern +- Missing: what happens if `bq.get_job()` fails? + +### 3.3 Leasing Approach (Section 7.3) - REASONABLE + +The BQ-based optimistic lease is correctly noted as best-effort. The suggestion to use Firestore/Spanner for stronger guarantees is appropriate. + +**Suggestion:** Add a concrete example of when to use each backend (BQ vs Firestore). + +--- + +## 4. Architecture Feedback + +### 4.1 Strengths + +1. **Clear separation of control plane (BQ) vs data plane (GCS)** - follows Google best practices +2. **Logical checkpointing over heap snapshots** - pragmatic and maintainable +3. **Two-phase commit pattern** - ensures atomic visibility +4. **Authoritative reconciliation** - critical for BigQuery job scenarios +5. 
**Good competitive analysis** (Section 14) + +### 4.2 Gaps / Missing Considerations + +| Gap | Impact | Suggested Action | +|-----|--------|------------------| +| No mention of existing `ResumabilityConfig` | Misleading problem statement | Add section on existing capability | +| No cost estimates for BQ storage/queries | Budget planning | Add rough estimates | +| No mention of BQ quota limits | Operational risk | Document relevant quotas | +| Checkpoint versioning migration strategy | Future maintenance | Expand Section 16.2 | +| No monitoring/alerting design | Operability | Add observability section | +| No rollback strategy | Safety | Document how to rollback | + +### 4.3 API Contract Review + +The proposed `CheckpointableAgentState` interface is clean: + +```python +class CheckpointableAgentState: + def export_state(self) -> dict: ... + def import_state(self, state: dict) -> None: ... +``` + +**Suggestion:** Consider alignment with existing ADK patterns: +- Existing `BaseAgentState` in `src/google/adk/agents/base_agent.py` +- Existing state patterns in `src/google/adk/sessions/state.py` + +--- + +## 5. Specific Line-by-Line Feedback + +### Section 0 (Executive Summary) +- Line 14: "12-minute barrier" - should cite source or clarify this is environment-specific +- Line 28: Cost estimate "< $0.01/session-day paused" - show calculation + +### Section 2 (Problem Statement) +- **Major revision needed** - must acknowledge existing resumability + +### Section 4.1 (States) +- Consider: should PAUSED be a first-class `Session.status` field or remain at `InvocationContext` level? + +### Section 8 (API Extensions) +- `checkpoint_policy` options are good, but: + - What triggers `superstep`? + - How does `manual` interact with `long_running_tool_ids`? + +### Section 13 (Moltbot Alignment) +- Moltbot reference is useful context +- Consider adding link/citation if public + +### Section 18 (Open Questions) +- Good list, but add: "How does this integrate with existing `ResumabilityConfig`?" + +--- + +## 6. Recommended Document Changes + +### High Priority (Must Fix) + +1. **Add Section 1.3: "Existing ADK Resumability"** + - Document current `ResumabilityConfig` capability + - Explain limitations this design addresses + - Position proposal as extension, not replacement + +2. **Revise Section 2 (Problem Statement)** + - Remove/qualify claims about ADK lacking pause/resume + - Focus on actual gaps: cross-process durability, external event triggers, enterprise audit + +3. **Add explicit integration plan** + - How does `CheckpointableAgentState` relate to `BaseAgentState`? + - Migration path from current resumability to new design + +### Medium Priority + +4. Add cost estimation section +5. Add monitoring/observability design +6. Add rollback/recovery procedures +7. Fix SQL schema issues (PKs) + +### Low Priority + +8. Add Moltbot citation if available +9. Add BQ quota documentation links +10. Consider adding architecture diagram (beyond Mermaid sequence) + +--- + +## 7. Summary Table + +| Category | Status | Details | +|----------|--------|---------| +| External URLs | VALID | All 7 references work | +| SQL Syntax | VALID with issues | Missing PKs, edge cases | +| Python Code | VALID | Sound patterns | +| Problem Statement | INACCURATE | Ignores existing resumability | +| Architecture | SOUND | Good Google-scale patterns | +| Completeness | GAPS | Missing cost, monitoring, rollback | + +--- + +## 8. 
Conclusion + +This is a **solid technical design** for extending ADK's capabilities for long-running BigQuery workloads. The core architecture (BQ control plane, GCS data plane, two-phase commit, authoritative reconciliation) is well-reasoned. + +**However, the document cannot be approved in its current form** because it misrepresents ADK's existing capabilities. Once the existing `ResumabilityConfig` is acknowledged and the document is repositioned as an extension rather than a new capability, it will be ready for technical review. + +**Recommended Next Steps:** +1. Revise document to acknowledge existing resumability +2. Add cost/monitoring sections +3. Fix SQL schema issues +4. Re-submit for review + +--- + +*Review generated by Claude Code on 2026-02-01* diff --git a/contributing/samples/long_running_task/__init__.py b/contributing/samples/long_running_task/__init__.py new file mode 100644 index 0000000000..4015e47d6e --- /dev/null +++ b/contributing/samples/long_running_task/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from . import agent diff --git a/contributing/samples/long_running_task/agent.py b/contributing/samples/long_running_task/agent.py new file mode 100644 index 0000000000..10e95f663a --- /dev/null +++ b/contributing/samples/long_running_task/agent.py @@ -0,0 +1,142 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Durable session demo agent with long-running BigQuery operations. + +This agent demonstrates the durable session persistence feature, which enables +checkpointing of agent state to BigQuery + GCS for recovery from failures. + +To run this demo: + 1. Set up the BigQuery tables and GCS bucket (see setup.py) + 2. Set GOOGLE_CLOUD_API_KEY environment variable + 3. 
Run: adk web contributing/samples/long_running_task + +Example prompts: + - "Scan the bigquery-public-data.samples.shakespeare table" + - "Get the schema of bigquery-public-data.samples.github_nested" + - "Run a pipeline from source_table to dest_table with filter, aggregate" +""" + +import os +from functools import cached_property + +from google.adk.agents import LlmAgent +from google.adk.apps import App +from google.adk.apps import ResumabilityConfig +from google.adk.durable import BigQueryCheckpointStore +from google.adk.durable import DurableSessionConfig +from google.adk.models.google_llm import Gemini +from google.adk.tools import LongRunningFunctionTool +from google.genai import Client +from google.genai import types + +from .tools import get_table_schema +from .tools import run_batch_etl_job +from .tools import run_data_pipeline +from .tools import run_demo_analysis +from .tools import run_extended_analysis +from .tools import run_ml_training_job +from .tools import simulate_long_running_scan + +# Configuration +PROJECT_ID = "test-project-0728-467323" +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" + +# API Key for Vertex AI (must be set via environment variable) +GOOGLE_CLOUD_API_KEY = os.environ.get("GOOGLE_CLOUD_API_KEY", "") + + +class VertexAIGemini(Gemini): + """Custom Gemini model configured for Vertex AI with API key.""" + + model: str = "gemini-3-flash-preview" + + @cached_property + def api_client(self) -> Client: + """Provides the api client configured for Vertex AI.""" + return Client( + vertexai=True, + api_key=GOOGLE_CLOUD_API_KEY, + http_options=types.HttpOptions( + headers=self._tracking_headers(), + retry_options=self.retry_options, + ), + ) + + +# Create the checkpoint store +checkpoint_store = BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, +) + +# Create the root agent with long-running tools using custom Vertex AI model +root_agent = LlmAgent( + model=VertexAIGemini(model="gemini-3-flash-preview"), + name="durable_bq_scanner", + description="Long-running BigQuery scanner with durable checkpoints", + instruction="""You are a data analyst assistant that can run various data processing jobs. + +Your capabilities: +1. Get table schemas - Use get_table_schema for quick schema lookups +2. Scan tables - Use simulate_long_running_scan for table analysis (~5-10 seconds) +3. Run data pipelines - Use run_data_pipeline for multi-stage transformations +4. Demo analysis - Use run_demo_analysis for a 1-minute demo (perfect for presentations!) +5. Extended analysis - Use run_extended_analysis for jobs that run 1-60 minutes +6. ML training - Use run_ml_training_job for model training (2-30 minutes based on size) +7. Batch ETL - Use run_batch_etl_job for large ETL jobs (1-60 minutes) + +For quick demos (~1 minute): +- run_demo_analysis: Specify analysis_type (e.g., "sentiment", "anomaly", "trend", "clustering") + +For long-running jobs (10+ minutes): +- run_extended_analysis: Specify duration_minutes (e.g., 10, 15, 30) +- run_ml_training_job: Use dataset_size "large" (10 min), "xlarge" (15 min), or "enterprise" (30 min) +- run_batch_etl_job: Specify processing_minutes (e.g., 10, 15, 30) + +The system will automatically checkpoint your progress during long-running +operations, so you can resume if interrupted. + +Important: When using long-running tools, wait for them to complete before +taking further action. Do not call the same tool again if it returned a +pending status. 
+""", + tools=[ + get_table_schema, + LongRunningFunctionTool(func=simulate_long_running_scan), + LongRunningFunctionTool(func=run_data_pipeline), + LongRunningFunctionTool(func=run_demo_analysis), + LongRunningFunctionTool(func=run_extended_analysis), + LongRunningFunctionTool(func=run_ml_training_job), + LongRunningFunctionTool(func=run_batch_etl_job), + ], + generate_content_config=types.GenerateContentConfig( + temperature=1.0, # Required for Gemini 3 + ), +) + +# Create the app with durable session configuration +app = App( + name="long_running_task", + root_agent=root_agent, + resumability_config=ResumabilityConfig(is_resumable=True), + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + checkpoint_store=checkpoint_store, + lease_timeout_seconds=300, + ), +) diff --git a/contributing/samples/long_running_task/comment.md b/contributing/samples/long_running_task/comment.md new file mode 100644 index 0000000000..356cd67548 --- /dev/null +++ b/contributing/samples/long_running_task/comment.md @@ -0,0 +1,1094 @@ +# Design Review Comments and Responses + +## Comment 1: Session Service as Durable Persistence + +**From:** ADK Team +**Date:** 2026-02-02 + +**Comment:** +> "Session service is the durable session persistence. For local, user starts with InMemoryService, but they can opt-in storage-based session service: SQLite, DatabaseSessionService, BigQuerySessionService, etc." + +--- + +### Response + +Thank you for the feedback. You're correct that ADK already has a robust session service hierarchy. This comment raises an important architectural question: **Why introduce a separate CheckpointStore when SessionService already provides persistence?** + +#### Key Distinction: Session State vs. Checkpoint State + +| Aspect | Session Service | Checkpoint Store (Proposed) | +|--------|-----------------|----------------------------| +| **What it stores** | Conversation history (events, messages, tool calls) | Agent execution state (job ledgers, progress cursors, partial results) | +| **Granularity** | Per-message/event append | Per-checkpoint snapshot at logical boundaries | +| **Data model** | Event stream (append-only) | Point-in-time snapshots (two-phase commit) | +| **Primary use case** | Replay conversation context to LLM | Resume long-running task from failure point | +| **Recovery question** | "What did the agent say?" | "Where was the agent in a 6-hour BigQuery scan?" | +| **External job tracking** | Tool call events (but not reconciliation-ready) | Authoritative job ledger with status sync | + +#### Why Session Service Alone May Be Insufficient + +1. **Job Ledger with Authoritative Reconciliation** + - Session events record that a tool was called, but don't maintain a ledger that can be reconciled against external job states (DONE/FAILED/RUNNING) + - On resume, we need to query BigQuery: "Is job X still running?" and update our ledger accordingly + - This reconciliation pattern doesn't fit the append-only event model + +2. **Partial Results Persistence** + - A 50-table PII scan may complete 30 tables before failure + - Checkpoint stores: which tables done, their findings, which remain + - Session stores: the conversation about starting the scan + +3. **Two-Phase Commit Semantics** + - Checkpoints require atomic visibility: GCS blob uploaded AND metadata pointer updated + - Session services typically use simpler append semantics + - Partial checkpoint writes must not be visible + +4. 
**Workspace Snapshots** + - Long-running coding agents may need `/workspace` file persistence + - This is binary blob data, not conversation events + - Doesn't fit session event model + +5. **Different Query Patterns** + - Session: "Give me all events for session X in order" + - Checkpoint: "Give me the latest checkpoint for session X" (single row) + - Fleet ops: "Show me all paused sessions with checkpoints > 1 hour old" + +--- + +### Potential Approaches + +#### Option A: Separate CheckpointStore (Current Design) + +``` +┌─────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────┤ +│ SessionService (existing) │ CheckpointStore (new) │ +│ - Conversation history │ - Execution state │ +│ - Event replay for LLM │ - Job ledgers │ +│ - Append-only events │ - Two-phase commit │ +│ - SQLite/DB/BigQuery │ - BigQuery + GCS │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Pros:** +- Clear separation of concerns +- Different consistency models for different needs +- No changes to existing SessionService implementations +- Checkpoint-specific optimizations (compression, GCS blob storage) + +**Cons:** +- Two services to configure for durable agents +- Potential confusion about which stores what +- Additional infrastructure (though can share BigQuery dataset) + +#### Option B: Extend SessionService with Checkpoint Capability + +```python +class SessionService(ABC): + # Existing methods... + + # New checkpoint methods + async def write_checkpoint( + self, session_id: str, checkpoint_seq: int, state: bytes, ... + ) -> None: ... + + async def read_latest_checkpoint( + self, session_id: str + ) -> tuple[int, bytes] | None: ... +``` + +**Pros:** +- Single service to configure +- Unified persistence layer +- Familiar pattern for ADK users + +**Cons:** +- Mixes conversation semantics with execution semantics +- May require significant changes to existing implementations +- Two-phase commit harder to add to existing append-only services +- Risk of breaking changes + +#### Option C: Checkpoint as Special Event Type + +```python +# Store checkpoint as a special event in the session +event = Event( + author="system", + type=EventType.CHECKPOINT, + checkpoint_data=CheckpointData( + seq=5, + state_gcs_uri="gs://...", + job_ledger={...}, + ) +) +session_service.append_event(session_id, event) +``` + +**Pros:** +- Uses existing SessionService infrastructure +- Single storage location +- Events remain the universal abstraction + +**Cons:** +- Checkpoint retrieval requires scanning events (inefficient) +- Two-phase commit semantics still needed for GCS blob +- Mixing large blobs with conversation events +- Query patterns still don't match (latest vs. stream) + +--- + +### Recommendation + +**Option A (Separate CheckpointStore)** is recommended for v1 because: + +1. **Clean separation**: Conversation history and execution state serve different purposes +2. **No breaking changes**: Existing SessionService implementations unchanged +3. **Optimized for use case**: Checkpoint-specific features (GCS blobs, two-phase commit, lease management) +4. 
**Incremental adoption**: Users can add checkpointing without changing session config + +However, we should: +- Document the relationship clearly +- Consider Option B for v2 if the pattern proves successful +- Ensure both can share the same BigQuery dataset for operational simplicity + +--- + +## Suggested Updates to Design Doc + +Based on this feedback, the following sections should be added/updated in `long_running_task_design.md`: + +### 1. Add New Section: "Relationship to Existing Session Service" + +**Location:** After Section 5 (Architecture Overview) + +```markdown +## 5.4 Relationship to Existing Session Service + +ADK provides a `SessionService` abstraction for conversation persistence: + +| Implementation | Storage | Use Case | +|----------------|---------|----------| +| `InMemorySessionService` | RAM | Development/testing | +| `SQLiteSessionService` | Local SQLite | Single-machine persistence | +| `DatabaseSessionService` | PostgreSQL/MySQL | Production multi-instance | +| `BigQuerySessionService` | BigQuery | Enterprise scale | + +**Why a separate CheckpointStore?** + +The `SessionService` and `CheckpointStore` serve complementary purposes: + +| SessionService | CheckpointStore | +|----------------|-----------------| +| Conversation history | Execution state snapshots | +| Append-only events | Point-in-time checkpoints | +| LLM context replay | Task resume from failure | +| Per-event granularity | Per-checkpoint granularity | + +A durable long-horizon agent typically uses both: +- `SessionService` for conversation continuity +- `CheckpointStore` for execution state durability + +**Shared Infrastructure** + +Both services can share the same BigQuery dataset: +- `adk_metadata.sessions` (SessionService) +- `adk_metadata.events` (SessionService) +- `adk_metadata.durable_sessions` (CheckpointStore) +- `adk_metadata.checkpoints` (CheckpointStore) +``` + +### 2. Update Section 8.2 (Configuration) + +Add clarity about the relationship: + +```markdown +### 8.2 Configuration + +```python +# A durable agent uses BOTH session service and checkpoint store +app = App( + name="durable_scanner", + root_agent=agent, + + # Session service for conversation history (existing) + session_service=BigQuerySessionService( + project="my-project", + dataset="adk_metadata", + ), + + # Checkpoint store for execution state (new) + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_store=BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", # Can share dataset + gcs_bucket="my-checkpoints", + ), + ), +) +``` + +**Note:** Both services can share the same BigQuery dataset. The checkpoint tables use a `durable_` prefix to avoid conflicts. +``` + +### 3. Add to Section 15 (Alternatives Considered) + +```markdown +| Alternative | Why not (v1) | +|-------------|--------------| +| Extend SessionService with checkpoint methods | Different consistency models; risk of breaking changes to existing implementations | +| Checkpoint as special Event type | Inefficient retrieval (scan vs. point lookup); mixes blob storage with events | +``` + +### 4. Add FAQ Entry + +```markdown +## Appendix F: FAQ + +### Why not just use SessionService for checkpoints? + +SessionService is optimized for conversation history (append-only event streams). 
+Checkpoints require: +- Point-in-time snapshots (not event streams) +- Two-phase commit (GCS blob + metadata atomicity) +- Different query patterns (latest-per-session, not full history) +- Large blob storage (workspace snapshots) + +The separation ensures each service is optimized for its use case. + +### Can I use CheckpointStore without SessionService? + +Yes, but not recommended. SessionService provides conversation context for +the LLM on resume. Without it, the agent loses conversation history. + +### Do they share the same BigQuery dataset? + +Yes, recommended. Use the same dataset with different table prefixes: +- SessionService: `sessions`, `events` +- CheckpointStore: `durable_sessions`, `checkpoints` +``` + +--- + +## Action Items + +- [ ] Add Section 5.4 to design doc +- [ ] Update Section 8.2 with dual-service example +- [ ] Add alternatives to Section 15 +- [ ] Add FAQ appendix +- [ ] Consider renaming tables to avoid confusion (`durable_sessions` vs `sessions`) +- [ ] Document shared dataset configuration in README + +--- + +## Open Questions for ADK Team + +1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset? +2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both? +3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail? +4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage? + +--- + +## Comment 2: GcsArtifactService for Large Blobs + +**From:** ADK Team +**Date:** 2026-02-02 + +**Comment:** +> "In ADK, ArtifactService is designed for large blobs. Have you checked that? We have a GcsArtifactService in the core library." + +--- + +### Response + +Thank you for pointing this out. Yes, I've reviewed `GcsArtifactService` (`src/google/adk/artifacts/gcs_artifact_service.py`) and the `BaseArtifactService` interface. This is a valid consideration. + +#### Current ArtifactService Capabilities + +| Feature | GcsArtifactService | +|---------|-------------------| +| Storage backend | GCS bucket | +| Key structure | `{app_name}/{user_id}/{session_id}/{filename}/{version}` | +| Versioning | Monotonic integer versions (0, 1, 2, ...) | +| Data type | `types.Part` (inline_data, text, file_data) | +| Metadata | Custom metadata dict on blob | +| Operations | save, load, list, delete, list_versions | + +#### Checkpoint Blob Requirements + +| Requirement | ArtifactService Support | Gap | +|-------------|------------------------|-----| +| Store bytes/JSON blobs | Yes (`types.Part.from_bytes`) | None | +| Session-scoped storage | Yes | None | +| Version tracking | Yes (monotonic) | Checkpoint uses `checkpoint_seq` | +| Custom metadata | Yes | Need SHA-256, trigger, size_bytes | +| Two-phase commit | **No** | Critical gap | +| Atomic visibility with BQ | **No** | Critical gap | +| Workspace tar.gz bundles | Partially (as bytes) | None | +| Integrity verification | **No** | Need SHA-256 on read | + +#### Key Gaps + +**1. Two-Phase Commit Semantics** + +The checkpoint pattern requires: +``` +Phase 1: Upload blob to GCS (may fail, invisible) +Phase 2: Insert metadata to BigQuery (makes checkpoint visible) +``` + +`GcsArtifactService.save_artifact()` uploads and returns immediately. There's no coordination with an external metadata store. A partial upload becomes immediately "visible" via `load_artifact()`. + +**2. 
Atomic Visibility with BigQuery Metadata** + +Checkpoints must be invisible until both: +- GCS blob exists AND +- BigQuery metadata row exists + +`GcsArtifactService` doesn't have this concept - artifacts are visible as soon as they're uploaded. + +**3. SHA-256 Integrity Verification** + +Checkpoints require integrity verification on read: +```python +# On read +blob = gcs.download(uri) +if sha256(blob) != metadata.sha256: + raise CheckpointCorruptionError() +``` + +`GcsArtifactService` doesn't compute or verify checksums. + +**4. Key Structure Mismatch** + +| Service | Key Pattern | +|---------|-------------| +| ArtifactService | `{app}/{user}/{session}/{filename}/{version}` | +| CheckpointStore | `{session_id}/{checkpoint_seq}/state.json` | + +Checkpoints don't have `app_name`, `user_id`, or `filename` - they're keyed purely by `session_id` + `checkpoint_seq`. + +--- + +### Potential Approaches + +#### Option A: Use GcsArtifactService as Underlying Storage (Adapt) + +```python +class BigQueryCheckpointStore(DurableSessionStore): + def __init__(self, artifact_service: GcsArtifactService, ...): + self._artifact_service = artifact_service + + async def write_checkpoint(self, session_id, seq, state_blob, ...): + # Phase 1: Use artifact service for GCS upload + version = await self._artifact_service.save_artifact( + app_name="checkpoints", + user_id="system", + session_id=session_id, + filename=f"checkpoint_{seq}", + artifact=types.Part.from_bytes(state_blob, mime_type="application/json"), + custom_metadata={"sha256": sha256(state_blob)}, + ) + + # Phase 2: Insert BQ metadata (makes checkpoint visible) + await self._insert_bq_metadata(session_id, seq, ...) +``` + +**Pros:** +- Reuses existing GCS infrastructure +- Consistent with ADK patterns +- Less code duplication + +**Cons:** +- Awkward key mapping (`app_name="checkpoints"`, `user_id="system"`) +- Still need custom two-phase commit logic +- Still need SHA-256 verification layer +- Version semantics don't match (artifact version vs checkpoint_seq) + +#### Option B: Direct GCS Client (Current Design) + +```python +class BigQueryCheckpointStore(DurableSessionStore): + def __init__(self, gcs_bucket: str, ...): + self._gcs_client = storage.Client() + self._bucket = self._gcs_client.bucket(gcs_bucket) + + async def write_checkpoint(self, session_id, seq, state_blob, ...): + # Phase 1: Direct GCS upload with preconditions + blob = self._bucket.blob(f"{session_id}/{seq}/state.json") + blob.upload_from_string( + state_blob, + if_generation_match=0, # Fail if exists (idempotency) + ) + + # Phase 2: Insert BQ metadata + await self._insert_bq_metadata(session_id, seq, ...) +``` + +**Pros:** +- Full control over GCS operations +- Clean key structure +- Native support for preconditions (`if_generation_match`) +- Simpler code path + +**Cons:** +- Doesn't leverage existing ArtifactService +- Separate GCS client initialization + +#### Option C: Extend ArtifactService Interface + +Add checkpoint-specific methods to `BaseArtifactService`: + +```python +class BaseArtifactService(ABC): + # Existing methods... + + # New: Checkpoint-specific operations + async def save_checkpoint_blob( + self, + *, + session_id: str, + checkpoint_seq: int, + blob: bytes, + sha256: str, + ) -> str: + """Save a checkpoint blob and return GCS URI.""" + ... + + async def load_checkpoint_blob( + self, + *, + session_id: str, + checkpoint_seq: int, + expected_sha256: str, + ) -> bytes: + """Load and verify checkpoint blob.""" + ... 
+``` + +**Pros:** +- Unified artifact/checkpoint interface +- Extensible for future blob types + +**Cons:** +- Modifies core ADK interface +- Checkpoint semantics may not fit all artifact backends +- Two-phase commit still external + +--- + +### Recommendation + +**Option B (Direct GCS Client)** is recommended for v1 because: + +1. **Simpler implementation**: No adapter layer or key mapping +2. **Full control**: Native GCS preconditions for idempotency +3. **Clean semantics**: Checkpoint keys match checkpoint concepts +4. **No interface changes**: Doesn't require modifying BaseArtifactService + +However, we should: +- Document the relationship with ArtifactService +- Consider Option A or C for v2 if there's desire for unification +- Ensure both can share the same GCS bucket if needed + +--- + +### Suggested Design Doc Updates + +Add to Section 15 (Alternatives Considered): + +```markdown +| Alternative | Why not (v1) | +|-------------|--------------| +| Use GcsArtifactService for checkpoint blobs | Key structure mismatch; no two-phase commit support; no SHA-256 verification; would require adapter layer | +``` + +Add to Section 5.3 (Integration with Existing ADK Services): + +```markdown +### Relationship to ArtifactService + +ADK's `ArtifactService` (`GcsArtifactService`, `FileArtifactService`, etc.) is designed for +user/session-scoped file artifacts with versioning. + +Checkpoints have different requirements: +- Two-phase commit with BigQuery metadata +- SHA-256 integrity verification +- Different key structure (session_id/checkpoint_seq) + +For v1, `CheckpointStore` uses direct GCS client access. Future versions may consider +unifying with `ArtifactService` if the interface can be extended to support checkpoint +semantics. +``` + +--- + +--- + +## Comment 3: Leasing as General Requirement + +**From:** ADK Team +**Date:** 2026-02-02 + +**Reference:** Section 7.3 - "We must ensure only one runner resumes a session at a time" + +**Comment:** +> "This is not only applicable to resume. `Runner.run_async` also requires this. Leasing is a general requirement for app developers." + +--- + +### Response + +This is an important clarification. You're correct that session-level concurrency control is a **general requirement**, not specific to durable session resume. + +#### Expanded Scope of Leasing + +| Scenario | Concurrency Risk | Current ADK Handling | +|----------|------------------|---------------------| +| Multiple `run_async()` on same session | Race conditions, duplicate tool calls | App developer responsibility | +| Resume after pause | Duplicate resume attempts | App developer responsibility | +| Pub/Sub event redelivery | Multiple runners wake on same event | App developer responsibility | +| Horizontal scaling | Multiple instances claim same session | App developer responsibility | + +The design doc incorrectly scoped leasing as a "durable session" concern. In reality: + +``` +Leasing requirement = ANY scenario where multiple runners might access the same session +``` + +#### Current State in ADK + +Looking at `Runner.run_async()` in `src/google/adk/runners.py`: + +```python +async def run_async( + self, + *, + user_id: str, + session_id: str, + new_message: types.Content, + ... +) -> AsyncGenerator[Event, None]: + # No built-in lease acquisition + # App developer must ensure single-runner-per-session +``` + +There's no built-in lease mechanism. App developers must implement their own concurrency control. 
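To make the current situation concrete, below is a minimal sketch of the kind of guard an app developer might write today for a single-process deployment. The names (`SessionLocks`, `run_guarded`) are illustrative, not ADK APIs; an in-process lock only serializes runners inside one process and does nothing for multi-instance deployments, which is exactly the gap the options below address.

```python
import asyncio
from collections import defaultdict


class SessionLocks:
  """Illustrative in-process guard: one asyncio.Lock per session_id.

  Only serializes run_async() calls within a single process; multi-instance
  deployments still need an external lease (e.g. a conditional write with a
  TTL), which is the gap discussed in this section.
  """

  def __init__(self) -> None:
    self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

  def lock(self, session_id: str) -> asyncio.Lock:
    return self._locks[session_id]


session_locks = SessionLocks()


async def run_guarded(runner, *, user_id: str, session_id: str, new_message):
  """Serializes run_async() calls for one session within this process."""
  async with session_locks.lock(session_id):
    async for event in runner.run_async(
        user_id=user_id, session_id=session_id, new_message=new_message
    ):
      yield event
```

Once the app scales horizontally, the same pattern has to move into shared storage (BigQuery conditional updates, Firestore transactions, DB row locks), which is what the framework-level options below explore.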
+ +#### Implications for Design + +**Option A: Leasing in Durable Layer Only (Current Design)** + +``` +┌─────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────┤ +│ Runner.run_async() │ CheckpointStore │ +│ - No built-in leasing │ - Has lease management │ +│ - App manages concurrency │ - Protects resume only │ +└─────────────────────────────────────────────────────────────┘ +``` + +**Pros:** Non-breaking, durable sessions get protection +**Cons:** Inconsistent; regular sessions still unprotected + +**Option B: Leasing in Runner (Framework-Level)** + +```python +class Runner: + def __init__(self, ..., lease_manager: Optional[LeaseManager] = None): + self._lease_manager = lease_manager + + async def run_async(self, ..., session_id: str, ...): + if self._lease_manager: + lease = await self._lease_manager.acquire(session_id) + if not lease: + raise SessionLeaseDeniedError(session_id) + try: + # ... execute agent logic + finally: + if self._lease_manager: + await self._lease_manager.release(session_id) +``` + +**Pros:** Consistent protection for all sessions +**Cons:** Breaking change; requires lease manager configuration + +**Option C: Leasing in SessionService (Storage-Level)** + +```python +class BaseSessionService(ABC): + @abstractmethod + async def acquire_session_lease( + self, session_id: str, lease_id: str, ttl_seconds: int + ) -> bool: ... + + @abstractmethod + async def release_session_lease( + self, session_id: str, lease_id: str + ) -> None: ... +``` + +**Pros:** Unified with session storage; natural fit +**Cons:** Requires changes to all SessionService implementations + +--- + +### Recommendation + +**Short-term (v1):** Keep leasing in `CheckpointStore` for durable sessions, but: +- Update design doc to acknowledge this is a subset of a broader need +- Document that app developers need their own concurrency control for non-durable sessions + +**Medium-term (v2):** Consider adding leasing to `SessionService` interface: +- `BigQuerySessionService` already has infrastructure for this +- `DatabaseSessionService` can use row-level locks +- `InMemorySessionService` can use asyncio locks + +**Long-term:** Consider Runner-level lease integration as opt-in feature. + +--- + +### Suggested Design Doc Updates + +**Update Section 7.3 Title:** + +From: +> "7.3 Leasing & optimistic concurrency" + +To: +> "7.3 Leasing & optimistic concurrency (session-level)" + +**Add Clarification Paragraph:** + +```markdown +### 7.3 Leasing & Optimistic Concurrency + +**Note:** Session-level concurrency control is a general ADK requirement, not +specific to durable sessions. Any scenario where multiple runners might access +the same session requires leasing: + +- Multiple `run_async()` calls on the same session +- Resume after pause (durable or in-process) +- Event-driven wake-up with potential redelivery +- Horizontal scaling with shared session storage + +Currently, ADK leaves session leasing to app developers. The durable session +layer provides lease management for checkpoint-protected sessions, but this +does not cover all concurrency scenarios. + +**Future consideration:** Add optional `LeaseManager` to `Runner` or lease +methods to `SessionService` interface for framework-level protection. +``` + +**Add to Section 18 (Open Questions):** + +```markdown +| Question | Risk Level | Notes | +|----------|------------|-------| +| Framework-level leasing | Medium | Should Runner have built-in lease support? 
Would require LeaseManager abstraction | +| SessionService lease methods | Medium | Natural fit but requires interface changes | +``` + +--- + +--- + +## Comment 4: Cross-Process Durability Clarification + +**From:** ADK Team +**Date:** 2026-02-02 + +**Reference:** Section 1.2 - "Cross-process durability: state lost if the process dies" + +**Comment:** +> "Could you elaborate on this? I think agent state is persisted in the event and the event will be persisted in the selected session service." + +--- + +### Response + +You're correct that session events are persisted in the SessionService. Let me clarify what "state lost" means in the context of long-running tasks. + +#### What IS Preserved (SessionService Events) + +| Data | Preserved? | Location | +|------|------------|----------| +| User messages | Yes | Session events | +| Agent responses | Yes | Session events | +| Tool call records | Yes | Session events (tool name, args, result) | +| LLM conversation context | Yes | Replayable from events | + +#### What May NOT Be Preserved (or Not Usable) + +| Data | Preserved? | Issue | +|------|------------|-------| +| In-flight tool execution | **No** | Process dies mid-tool-call | +| External job handles | **Partial** | Job ID in event, but no reconciliation structure | +| Multi-step operation progress | **No** | "I'm on step 3 of 7" not tracked | +| Agent's execution plan | **No** | Task graph, priorities, dependencies | +| Partial aggregated results | **No** | "Scanned 30 of 50 tables, found X so far" | +| Workspace files in progress | **No** | Draft reports, intermediate artifacts | + +#### Concrete Example: 50-Table PII Scan + +**Scenario:** Agent is scanning 50 BigQuery tables for PII. Process dies after completing 30 tables. + +**With SessionService only:** + +``` +Events stored: + - User: "Scan all tables for PII" + - Agent: "I'll scan these 50 tables..." + - ToolCall: scan_table("table_1") → {findings: [...]} + - ToolCall: scan_table("table_2") → {findings: [...]} + ... + - ToolCall: scan_table("table_30") → {findings: [...]} + - [PROCESS DIES HERE] +``` + +On restart: +- Events replay to LLM ✓ +- LLM sees 30 tool calls completed ✓ +- But: **LLM must re-deduce** which tables remain +- But: **No structured job ledger** for reconciliation +- But: **Aggregated findings** must be re-computed from events +- Risk: **LLM may miscount** or re-scan tables + +**With Checkpoint + SessionService:** + +``` +Checkpoint stored: + { + "job_ledger": { + "table_1": {"status": "complete", "findings": 3}, + "table_2": {"status": "complete", "findings": 0}, + ... + "table_30": {"status": "complete", "findings": 5}, + "table_31": {"status": "pending"}, + ... 
+ "table_50": {"status": "pending"} + }, + "aggregated_findings": { + "total_tables_scanned": 30, + "total_findings": 47, + "findings_by_type": {"email": 20, "ssn": 15, "phone": 12} + }, + "execution_plan": { + "current_phase": "scanning", + "next_table_index": 31 + } + } +``` + +On restart: +- Load checkpoint ✓ +- Know exactly which tables remain ✓ +- Reconcile with BigQuery job states ✓ +- Continue with aggregated state intact ✓ +- No LLM re-deduction needed ✓ + +#### The Key Distinction + +| Aspect | Session Events | Checkpoint State | +|--------|----------------|------------------| +| Purpose | LLM conversation context | Execution state recovery | +| Structure | Append-only event stream | Point-in-time snapshot | +| Recovery mode | Replay events to LLM | Load structured state | +| External jobs | Tool call records | Reconcilable job ledger | +| Aggregations | Must re-compute from events | Pre-computed, ready to use | +| Reliability | LLM must re-deduce state | Deterministic restoration | + +#### When Session Events Are Sufficient + +Session events alone work well for: +- Short conversations (< 5 min) +- Simple tool calls (no external async jobs) +- Stateless operations (each tool call independent) +- Human-in-the-loop flows (human provides continuity) + +#### When Checkpoints Add Value + +Checkpoints are valuable for: +- Long-running operations (hours/days) +- External async jobs (BigQuery, Cloud Build, ML training) +- Multi-step plans with dependencies +- Aggregated/computed state (partial results) +- Deterministic recovery (no LLM re-deduction) + +--- + +### End-to-End Concrete Example: Enterprise PII Compliance Audit + +Let me walk through a complete scenario showing what the checkpoint approach enables that event logging alone cannot. + +#### Scenario Setup + +**Task:** Scan 100 BigQuery tables across 5 datasets for PII (emails, SSNs, phone numbers) to generate a compliance report. + +**Environment:** +- Cloud Run with 60-minute timeout +- Each table scan takes 2-10 minutes (BigQuery job) +- Total expected runtime: ~8 hours +- Multiple Cloud Run instances may be involved + +**User Request:** +``` +"Scan all tables in the customer_data, transactions, analytics, +logs, and marketing datasets for PII. Generate a compliance report +with findings by table and recommendations." +``` + +--- + +#### Timeline: What Happens + +``` +Hour 0:00 - Agent starts + - Discovers 100 tables across 5 datasets + - Creates execution plan: scan tables, aggregate findings, generate report + - Begins scanning tables + +Hour 2:30 - Progress checkpoint + - 35 tables scanned + - 127 PII findings so far + - 15 BigQuery jobs completed, 2 running, 83 pending + +Hour 3:15 - PROCESS DIES (Cloud Run timeout/crash) + - 2 BigQuery jobs still running in the cloud + - Agent process terminated +``` + +--- + +#### Path A: Event Logging Only (Current ADK) + +**Events stored in SessionService:** +```json +[ + {"type": "user_message", "content": "Scan all tables..."}, + {"type": "agent_message", "content": "I'll scan 100 tables..."}, + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.users"}, "result": {"job_id": "job_001", "status": "submitted"}}, + {"type": "tool_call", "tool": "get_job_result", "args": {"job_id": "job_001"}, "result": {"findings": [{"type": "email", "column": "contact_email", "count": 15000}]}}, + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "customer_data.orders"}, "result": {"job_id": "job_002", "status": "submitted"}}, + // ... 
70 more tool call events ... + {"type": "tool_call", "tool": "submit_bq_scan", "args": {"table": "analytics.events"}, "result": {"job_id": "job_037", "status": "submitted"}}, + // PROCESS DIES - no more events +] +``` + +**On Restart (New Cloud Run Instance):** + +1. **Events replay to LLM** - LLM sees conversation history ✓ + +2. **LLM must re-deduce state:** + ``` + LLM thinking: "Looking at these events... I see job_001 through job_037 + were submitted. Some have results, some don't. Let me figure out what's done..." + ``` + +3. **Problems:** + + | Problem | Impact | + |---------|--------| + | **Job status unknown** | job_036, job_037 may have completed while process was dead - LLM doesn't know | + | **No structured ledger** | LLM must parse 70+ events to determine table status | + | **Aggregation lost** | "127 findings so far" must be re-counted from events | + | **May re-submit jobs** | LLM might re-scan tables it already scanned | + | **May miss completed jobs** | Jobs that finished during downtime have results waiting | + | **Non-deterministic** | Different LLM calls may reach different conclusions | + +4. **Likely LLM Response:** + ``` + "I see we were scanning tables for PII. Let me check what's been done... + [Spends tokens re-parsing events] + I think tables 1-35 are done. Let me continue with table 36... + + Actually, I'm not sure if job_036 completed. Let me re-submit it to be safe." + ``` + +5. **Result:** + - Duplicate BigQuery jobs (wasted cost) + - Inconsistent findings count + - Report may have duplicates or gaps + - ~30 minutes spent "figuring out" state + +--- + +#### Path B: Checkpoint + Event Logging (Proposed) + +**Checkpoint stored (in addition to events):** +```json +{ + "checkpoint_seq": 15, + "created_at": "2026-02-02T05:30:00Z", + + "execution_plan": { + "phase": "scanning", + "total_tables": 100, + "tables_completed": 35, + "tables_in_progress": 2, + "tables_pending": 63 + }, + + "job_ledger": { + "job_001": {"table": "customer_data.users", "status": "complete", "findings": 3}, + "job_002": {"table": "customer_data.orders", "status": "complete", "findings": 0}, + // ... jobs 3-35: complete ... + "job_036": {"table": "analytics.sessions", "status": "running", "submitted_at": "2026-02-02T05:28:00Z"}, + "job_037": {"table": "analytics.events", "status": "running", "submitted_at": "2026-02-02T05:29:00Z"} + }, + + "aggregated_findings": { + "total_findings": 127, + "by_type": {"email": 45, "ssn": 32, "phone": 28, "address": 22}, + "by_dataset": {"customer_data": 67, "transactions": 35, "analytics": 25}, + "tables_with_pii": ["customer_data.users", "customer_data.profiles", "..."] + }, + + "pending_tables": [ + "analytics.pageviews", + "logs.access_logs", + // ... 63 more tables ... + ] +} +``` + +**On Restart (New Cloud Run Instance):** + +1. **Load checkpoint** - Deterministic state restoration ✓ + +2. **Reconcile with BigQuery:** + ```python + # Automatic reconciliation + for job_id, job_meta in checkpoint["job_ledger"].items(): + if job_meta["status"] == "running": + actual_status = bq_client.get_job(job_id).state + if actual_status == "DONE": + # Job completed while we were dead - fetch results + results = fetch_results(job_id) + update_findings(results) + job_meta["status"] = "complete" + ``` + +3. 
**Result of reconciliation:** + ``` + Checkpoint loaded: 35 tables complete, 2 in-progress + Reconciliation: job_036 DONE (found 5 PII), job_037 DONE (found 2 PII) + Updated state: 37 tables complete, 134 total findings + Remaining: 63 tables + + Resuming scan from table 38... + ``` + +4. **Agent continues seamlessly:** + - No duplicate jobs + - No re-parsing events + - Findings aggregation intact + - Deterministic, reliable + - Resume took ~5 seconds + +--- + +#### Side-by-Side Comparison + +| Aspect | Events Only | Checkpoint + Events | +|--------|-------------|---------------------| +| **Recovery time** | ~30 min (LLM re-parsing) | ~5 sec (load + reconcile) | +| **Duplicate jobs** | Likely (LLM uncertainty) | None (ledger prevents) | +| **Missed job results** | Possible | None (reconciliation catches) | +| **Findings accuracy** | May have errors | Exact (pre-aggregated) | +| **Token cost** | High (re-process events) | Low (structured state) | +| **Determinism** | No (LLM-dependent) | Yes (explicit state) | +| **Total runtime** | ~10 hours (retries, confusion) | ~8 hours (clean resume) | + +--- + +#### What Checkpoint Enables That Events Cannot + +1. **Authoritative Job Reconciliation** + ``` + Events: "job_036 was submitted" (but is it done now?) + Checkpoint: "job_036 status=running" → reconcile → "actually DONE, here are results" + ``` + +2. **Pre-Aggregated State** + ``` + Events: Count findings from 70 tool_call results + Checkpoint: {"total_findings": 127, "by_type": {...}} + ``` + +3. **Explicit Execution Plan** + ``` + Events: LLM must re-deduce "what was I doing?" + Checkpoint: {"phase": "scanning", "tables_completed": 35, "tables_pending": 63} + ``` + +4. **Idempotent Resume** + ``` + Events: May or may not re-submit jobs (LLM decides) + Checkpoint: Never re-submits (ledger tracks all jobs) + ``` + +5. **Multi-Instance Coordination** + ``` + Events: Two instances might both try to continue + Checkpoint: Lease ensures only one instance resumes + ``` + +--- + +#### Cost Impact Example + +| Metric | Events Only | Checkpoint | +|--------|-------------|------------| +| BigQuery jobs submitted | 115 (15 duplicates) | 100 (exact) | +| BQ job cost @ $5/job | $575 | $500 | +| Cloud Run time | 10 hours | 8 hours | +| Cloud Run cost @ $0.10/hr | $1.00 | $0.80 | +| LLM tokens for recovery | ~50,000 | ~1,000 | +| LLM cost @ $0.01/1K | $0.50 | $0.01 | +| **Total extra cost** | **$75.50** | **$0** | + +For enterprise workloads running daily, this adds up significantly. + +--- + +### Suggested Design Doc Update + +Revise Section 1.2 limitation description: + +**From:** +> "Cross-process durability: state lost if the process dies" + +**To:** +> "Cross-process durability: While session events persist conversation history, structured execution state (job ledgers, aggregated results, execution plans) is not captured in a form that enables deterministic recovery. On restart, the LLM must re-deduce state from event history, which may be unreliable for complex multi-step operations." + +Add clarification table to Section 1.2: + +```markdown +**Clarification: Session Events vs. 
Checkpoint State** + +| Recovery Need | Session Events | Checkpoint | +|---------------|----------------|------------| +| Conversation context | ✓ Sufficient | ✓ | +| External job reconciliation | ✗ Manual | ✓ Structured ledger | +| Multi-step progress tracking | ✗ LLM re-deduces | ✓ Explicit state | +| Aggregated partial results | ✗ Re-compute | ✓ Pre-computed | +| Deterministic recovery | ✗ LLM-dependent | ✓ Guaranteed | +``` + +--- + +## Updated Open Questions for ADK Team + +1. **Table naming**: Should checkpoint tables use a prefix (`durable_sessions`) or separate dataset? +2. **Unified service**: Is there interest in a `DurableSessionService` wrapper that manages both SessionService and CheckpointStore? +3. **Event integration**: Should checkpoint events be mirrored to SessionService for audit trail? +4. **BigQuerySessionService**: Does it already have any checkpoint-like capabilities we should leverage? +5. **ArtifactService unification**: Should we extend `BaseArtifactService` with checkpoint-specific methods in v2? +6. **Shared bucket**: Can checkpoints share a GCS bucket with artifacts, or should they be separate? +7. **Framework-level leasing**: Should `Runner` have optional built-in lease management? Or should `SessionService` have lease methods? +8. **Lease backend standardization**: If leasing becomes a framework feature, what backends should be supported (BQ, Firestore, Redis, DB row locks)? +9. **Event-based recovery**: Is there interest in adding structured "execution state" events to SessionService as an alternative to separate checkpoints? diff --git a/contributing/samples/long_running_task/demo_server.py b/contributing/samples/long_running_task/demo_server.py new file mode 100644 index 0000000000..1715c1f11b --- /dev/null +++ b/contributing/samples/long_running_task/demo_server.py @@ -0,0 +1,435 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Custom demo server with checkpoint visualization UI.""" + +import asyncio +import json +import os +import uuid +from datetime import datetime +from pathlib import Path +from typing import Any, Optional + +from fastapi import FastAPI, HTTPException, Request +from fastapi.middleware.cors import CORSMiddleware +from fastapi.responses import HTMLResponse, JSONResponse +from fastapi.staticfiles import StaticFiles +from pydantic import BaseModel +import uvicorn + +from google.adk.durable import BigQueryCheckpointStore + +# Configuration +PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "test-project-0728-467323") +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" + +# Initialize checkpoint store +checkpoint_store = BigQueryCheckpointStore( + project=PROJECT_ID, + dataset=DATASET, + gcs_bucket=GCS_BUCKET, +) + +# In-memory task state for demo +active_tasks: dict[str, dict] = {} + +app = FastAPI(title="ADK Durable Session Demo") + +# CORS +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], +) + + +class TaskRequest(BaseModel): + task_type: str # "sentiment", "anomaly", "trend", "scan" + duration_seconds: int = 60 + + +class ResumeRequest(BaseModel): + session_id: str + + +@app.get("/", response_class=HTMLResponse) +async def root(): + """Serve the demo UI.""" + html_path = Path(__file__).parent / "demo_ui.html" + if html_path.exists(): + return HTMLResponse(content=html_path.read_text()) + return HTMLResponse(content="
Demo UI not found
") + + +@app.get("/api/sessions") +async def list_sessions(): + """List all sessions from BigQuery.""" + try: + client = checkpoint_store._get_bq_client() + query = f""" + SELECT session_id, status, agent_name, current_checkpoint_seq, + created_at, updated_at + FROM `{checkpoint_store._sessions_table_id}` + ORDER BY updated_at DESC + LIMIT 20 + """ + results = client.query(query).result() + sessions = [] + for row in results: + sessions.append({ + "session_id": row.session_id, + "status": row.status, + "agent_name": row.agent_name, + "checkpoint_seq": row.current_checkpoint_seq, + "created_at": row.created_at.isoformat() if row.created_at else None, + "updated_at": row.updated_at.isoformat() if row.updated_at else None, + }) + return {"sessions": sessions} + except Exception as e: + return {"sessions": [], "error": str(e)} + + +@app.get("/api/checkpoints/{session_id}") +async def list_checkpoints(session_id: str): + """List checkpoints for a session.""" + try: + checkpoints = await checkpoint_store.list_checkpoints( + session_id=session_id, limit=20 + ) + return { + "checkpoints": [ + { + "checkpoint_seq": cp.checkpoint_seq, + "created_at": cp.created_at.isoformat() if cp.created_at else None, + "trigger": cp.trigger, + "size_bytes": cp.size_bytes, + "gcs_uri": cp.gcs_state_uri, + "agent_state": cp.agent_state, + } + for cp in checkpoints + ] + } + except Exception as e: + return {"checkpoints": [], "error": str(e)} + + +@app.post("/api/task/start") +async def start_task(request: TaskRequest): + """Start a new long-running task with checkpointing.""" + session_id = f"demo-{uuid.uuid4().hex[:8]}" + + # Create session in BigQuery + try: + session = await checkpoint_store.create_session( + session_id=session_id, + agent_name="demo_agent", + metadata={"task_type": request.task_type} + ) + except Exception as e: + raise HTTPException(status_code=500, detail=f"Failed to create session: {e}") + + # Initialize task state + active_tasks[session_id] = { + "task_type": request.task_type, + "status": "running", + "progress": 0, + "total_duration": request.duration_seconds, + "records_processed": 0, + "insights_found": 0, + "checkpoints": [], + "start_time": datetime.now().isoformat(), + "should_fail": False, + "failed_at": None, + "final_output": None, + } + + # Start background task + asyncio.create_task(run_task_with_checkpoints(session_id, request.duration_seconds)) + + return { + "session_id": session_id, + "status": "started", + "message": f"Started {request.task_type} analysis task" + } + + +@app.post("/api/task/fail/{session_id}") +async def simulate_failure(session_id: str): + """Simulate a task failure.""" + if session_id not in active_tasks: + raise HTTPException(status_code=404, detail="Task not found") + + active_tasks[session_id]["should_fail"] = True + return {"status": "failure_triggered", "session_id": session_id} + + +@app.post("/api/task/resume") +async def resume_task(request: ResumeRequest): + """Resume a task from checkpoint.""" + session_id = request.session_id + + # Read the latest checkpoint + result = await checkpoint_store.read_latest_checkpoint(session_id=session_id) + if not result: + raise HTTPException(status_code=404, detail="No checkpoint found") + + checkpoint, state_blob = result + state = json.loads(state_blob.decode('utf-8')) + + # Get session info + session = await checkpoint_store.get_session(session_id=session_id) + if not session: + raise HTTPException(status_code=404, detail="Session not found") + + # Restore task state + active_tasks[session_id] = { + 
"task_type": state.get("task_type", "unknown"), + "status": "running", + "progress": state.get("progress", 0), + "total_duration": state.get("total_duration", 60), + "records_processed": state.get("records_processed", 0), + "insights_found": state.get("insights_found", 0), + "checkpoints": state.get("checkpoints", []), + "start_time": state.get("start_time"), + "resumed_from": checkpoint.checkpoint_seq, + "should_fail": False, + "failed_at": None, + } + + # Calculate remaining duration + remaining = active_tasks[session_id]["total_duration"] * (1 - active_tasks[session_id]["progress"] / 100) + + # Resume background task + asyncio.create_task(run_task_with_checkpoints(session_id, int(remaining), resume=True)) + + return { + "session_id": session_id, + "status": "resumed", + "resumed_from_checkpoint": checkpoint.checkpoint_seq, + "progress": active_tasks[session_id]["progress"], + "message": f"Resumed from checkpoint #{checkpoint.checkpoint_seq}" + } + + +@app.get("/api/task/status/{session_id}") +async def get_task_status(session_id: str): + """Get current task status.""" + if session_id not in active_tasks: + # Try to get from BigQuery + session = await checkpoint_store.get_session(session_id=session_id) + if session: + return { + "session_id": session_id, + "status": session.status, + "checkpoint_seq": session.current_checkpoint_seq, + "from_db": True + } + raise HTTPException(status_code=404, detail="Task not found") + + return { + "session_id": session_id, + **active_tasks[session_id] + } + + +async def run_task_with_checkpoints(session_id: str, duration: int, resume: bool = False): + """Run a long-running task with periodic checkpoints.""" + import random + + task = active_tasks.get(session_id) + if not task: + return + + checkpoint_interval = 10 # Checkpoint every 10 seconds + start_progress = task["progress"] if resume else 0 + + for elapsed in range(0, duration, checkpoint_interval): + # Check if we should fail + if task.get("should_fail"): + task["status"] = "failed" + task["failed_at"] = datetime.now().isoformat() + await checkpoint_store.update_session_status( + session_id=session_id, status="failed" + ) + return + + # Simulate work + await asyncio.sleep(min(checkpoint_interval, duration - elapsed)) + + # Update progress + progress = start_progress + ((elapsed + checkpoint_interval) / duration) * (100 - start_progress) + task["progress"] = min(progress, 100) + task["records_processed"] += random.randint(50000, 150000) + task["insights_found"] += random.randint(1, 3) + + # Get current checkpoint seq + session = await checkpoint_store.get_session(session_id=session_id) + next_seq = (session.current_checkpoint_seq if session else 0) + 1 + + # Create checkpoint + state_data = { + "task_type": task["task_type"], + "progress": task["progress"], + "total_duration": task["total_duration"], + "records_processed": task["records_processed"], + "insights_found": task["insights_found"], + "checkpoints": task["checkpoints"], + "start_time": task["start_time"], + } + + try: + checkpoint = await checkpoint_store.write_checkpoint( + session_id=session_id, + checkpoint_seq=next_seq, + state_blob=json.dumps(state_data).encode('utf-8'), + agent_state={"progress": task["progress"], "step": f"checkpoint_{next_seq}"}, + trigger="periodic", + ) + + task["checkpoints"].append({ + "seq": checkpoint.checkpoint_seq, + "time": datetime.now().isoformat(), + "progress": task["progress"], + }) + except Exception as e: + print(f"Checkpoint failed: {e}") + + # Task completed - Generate final output based on task 
type + task["status"] = "completed" + task["progress"] = 100 + + # Generate realistic final output + task_type = task.get("task_type", "analysis") + records = task["records_processed"] + insights = task["insights_found"] + + if task_type == "sentiment": + task["final_output"] = { + "title": "Sentiment Analysis Report", + "summary": f"Analyzed {records:,} text records across multiple data sources.", + "results": { + "overall_sentiment": "72% Positive", + "positive_records": int(records * 0.72), + "neutral_records": int(records * 0.18), + "negative_records": int(records * 0.10), + "confidence_score": 0.94, + }, + "key_findings": [ + "Strong positive sentiment around product quality", + "Minor concerns about delivery times (8% of negative)", + "Customer service mentions trending upward (+15%)", + f"Identified {insights} actionable insights for improvement", + ], + "top_themes": ["quality", "value", "service", "speed", "reliability"], + "recommendation": "Focus on delivery optimization to improve overall sentiment score by estimated 5-8%.", + } + elif task_type == "anomaly": + task["final_output"] = { + "title": "Anomaly Detection Report", + "summary": f"Scanned {records:,} data points for unusual patterns.", + "results": { + "total_anomalies": insights, + "critical_anomalies": max(1, insights // 4), + "warning_anomalies": insights // 2, + "info_anomalies": insights - insights // 4 - insights // 2, + "false_positive_rate": "2.3%", + }, + "key_findings": [ + f"Detected {insights} anomalies requiring attention", + "3 critical anomalies in transaction processing", + "Seasonal pattern identified in Q3 data", + "Root cause: 67% related to system load spikes", + ], + "anomaly_clusters": [ + {"type": "Transaction Volume Spike", "count": 5, "severity": "high"}, + {"type": "Response Time Degradation", "count": 8, "severity": "medium"}, + {"type": "Error Rate Increase", "count": 3, "severity": "high"}, + ], + "recommendation": "Investigate transaction processing during peak hours. Consider auto-scaling policies.", + } + elif task_type == "trend": + task["final_output"] = { + "title": "Trend Analysis Report", + "summary": f"Analyzed {records:,} historical data points for patterns.", + "results": { + "trend_direction": "Upward", + "growth_rate": "15.3% MoM", + "seasonality_detected": True, + "forecast_confidence": 0.89, + }, + "key_findings": [ + "Strong upward trend detected over past 6 months", + "15.3% month-over-month growth rate", + "Seasonal peaks in Q4 (holiday season)", + f"Identified {insights} significant trend changes", + ], + "forecast": { + "next_month": "+12% projected", + "next_quarter": "+38% projected", + "confidence_interval": "±8%", + }, + "recommendation": "Prepare for Q4 surge. 
Current trajectory suggests 2x capacity needed by year end.", + } + elif task_type == "clustering": + task["final_output"] = { + "title": "Data Clustering Report", + "summary": f"Clustered {records:,} data points into meaningful segments.", + "results": { + "clusters_identified": 5, + "silhouette_score": 0.78, + "largest_cluster_size": "45%", + "smallest_cluster_size": "8%", + }, + "key_findings": [ + "Identified 5 distinct customer segments", + "Largest segment (45%): 'Value Seekers'", + "High-value segment (12%): 'Premium Customers'", + f"Found {insights} key differentiating factors", + ], + "clusters": [ + {"name": "Value Seekers", "size": "45%", "description": "Price-sensitive, bulk buyers"}, + {"name": "Premium Customers", "size": "12%", "description": "High-spend, quality-focused"}, + {"name": "Occasional Shoppers", "size": "23%", "description": "Infrequent, event-driven"}, + {"name": "New Users", "size": "12%", "description": "Recent signups, exploring"}, + {"name": "Churning Risk", "size": "8%", "description": "Declining engagement"}, + ], + "recommendation": "Target 'Churning Risk' segment with retention campaign. Estimated 15% recovery rate.", + } + else: + task["final_output"] = { + "title": "Analysis Complete", + "summary": f"Processed {records:,} records successfully.", + "results": {"records_processed": records, "insights_found": insights}, + "key_findings": [f"Found {insights} notable patterns in the data"], + } + + task["final_output"]["metadata"] = { + "session_id": session_id, + "task_type": task_type, + "duration_seconds": task["total_duration"], + "checkpoints_created": len(task["checkpoints"]), + "completed_at": datetime.now().isoformat(), + } + + await checkpoint_store.update_session_status( + session_id=session_id, status="completed" + ) + + +if __name__ == "__main__": + uvicorn.run(app, host="0.0.0.0", port=8080) diff --git a/contributing/samples/long_running_task/demo_ui.html b/contributing/samples/long_running_task/demo_ui.html new file mode 100644 index 0000000000..60ac43ac35 --- /dev/null +++ b/contributing/samples/long_running_task/demo_ui.html @@ -0,0 +1,832 @@ + + + + + + ADK Durable Session Demo - Real Checkpoint Visualization + + + + +
(demo_ui.html page body; the original HTML/CSS/JS markup is not preserved in this extract. Recoverable page content:)

- Header: "ADK Durable Session Demo" with the tagline "Real Checkpoint-Based Persistence for Long-Running Agent Tasks" and the banner "All tasks are REAL - Writing to BigQuery & GCS".
- Infrastructure cards: 🗄️ BigQuery (`test-project-0728-467323.adk_metadata`), ☁️ Cloud Storage (`gs://test-project-0728-467323-adk-checkpoints`), and 🔐 SHA-256 Verified ("Checkpoint integrity guaranteed").
- 🎯 "Choose a Real Task" panel ("Each task simulates a real long-running data processing job with actual checkpoints saved to GCP"):
  - 😊 Sentiment Analysis: analyzes text data to determine emotional tone (customer reviews, social media posts, feedback). Output: positive/negative ratios, key themes, trend analysis.
  - 🔍 Anomaly Detection: scans datasets for unusual patterns and outliers (potential fraud, system errors, data quality issues). Output: anomaly count, severity levels, root cause hints.
  - 📈 Trend Analysis: identifies patterns and trends over time and forecasts future values from historical data. Output: growth rates, seasonal patterns, forecasts.
  - 🎨 Data Clustering: groups similar data points (customer segments, content categories). Output: cluster count, segment profiles, separation metrics.
  - Duration slider (default 60s), with notes "Checkpoints saved every 10 seconds" and "Task will create real checkpoints in BigQuery & GCS".
- 📊 Live Task Monitor panel (idle state: "No task running" / "Select a task and click Start").
- 💥 "Simulate Crash" control: "Simulate a server crash to test checkpoint recovery. The task state is safely stored!"
- 📍 Checkpoint Timeline panel ("Real checkpoints being written to BigQuery & GCS"; empty state: "No checkpoints yet" / "Start a task to see real checkpoints appear").
- 🔧 "How Durable Checkpointing Works" steps: 1️⃣ Task Runs (long-running analysis processes data in chunks), 2️⃣ Checkpoint Created (every 10s, state is serialized and compressed), 3️⃣ Two-Phase Commit (blob to GCS, then metadata to BigQuery), 4️⃣ Recovery Ready (if a crash occurs, resume from the last checkpoint).
- 📊 "Verify in BigQuery" sample query:
      SELECT session_id, checkpoint_seq, created_at, trigger, size_bytes
      FROM `test-project-0728-467323.adk_metadata.checkpoints`
      ORDER BY created_at DESC LIMIT 10;
- 📋 "Real Sessions in BigQuery" table (columns: Session ID, Status, Checkpoints, Last Updated, Actions) with a "Select" action to resume any failed session; initial state "Loading from BigQuery...".
+ + + + diff --git a/contributing/samples/long_running_task/long_running_task_design.md b/contributing/samples/long_running_task/long_running_task_design.md new file mode 100644 index 0000000000..38877fae7e --- /dev/null +++ b/contributing/samples/long_running_task/long_running_task_design.md @@ -0,0 +1,1448 @@ +# Durable Session Persistence for Long-Horizon ADK Agents (BigQuery-first, Generalizable Framework Capability) + +**Author:** Haiyuan Cao +**Status:** Implemented (v1 core functionality) +**Target audience:** ADK engineering leads, BigQuery Agent Analytics stakeholders, SRE/Security reviewers +**Last updated:** 2026-02-02 +**Revision:** 3.0 (implementation complete, demo deployed) + +--- + +## Implementation Status + +| Component | Status | Location | +|-----------|--------|----------| +| `DurableSessionConfig` | Implemented | `src/google/adk/durable/config.py` | +| `CheckpointableAgentState` | Implemented | `src/google/adk/durable/checkpointable_state.py` | +| `DurableSessionStore` (ABC) | Implemented | `src/google/adk/durable/stores/base_checkpoint_store.py` | +| `BigQueryCheckpointStore` | Implemented | `src/google/adk/durable/stores/bigquery_checkpoint_store.py` | +| `WorkspaceSnapshotter` | Implemented | `src/google/adk/durable/workspace_snapshotter.py` | +| App integration | Implemented | `src/google/adk/apps/app.py` | +| Demo agent | Implemented | `contributing/samples/long_running_task/` | +| Demo UI (Cloud Run) | Deployed | `https://durable-demo-201486563047.us-central1.run.app` | + +### Live Demo + +A fully functional demo is deployed on Cloud Run showcasing: +- Real-time checkpoint visualization +- Task failure simulation +- Checkpoint-based recovery +- BigQuery metadata queries +- Final task output display + +**URL:** https://durable-demo-201486563047.us-central1.run.app + +**Infrastructure:** +- BigQuery Dataset: `test-project-0728-467323.adk_metadata` +- GCS Bucket: `gs://test-project-0728-467323-adk-checkpoints` +- SHA-256 checkpoint integrity verification + +--- + +## 0. Executive One-Pager (for PM/Director skim) + +### Problem + +ADK agents struggle with BigQuery's **async, long-running workloads**. While ADK has experimental in-process resumability (`ResumabilityConfig`), it lacks: +- **Cross-process durability**: state lost if the process dies +- **External event triggers**: no Pub/Sub integration for job completion +- **Enterprise auditability**: no SQL-queryable checkpoint history +- **Cloud job reconciliation**: no authoritative state sync with BigQuery jobs + +Sandboxes time out (the "12-minute barrier" in typical cloud deployments), causing repeated cold starts, redundant metadata scans, and risk of duplicate job submissions. 
+ +### Solution + +**Extend** ADK's existing resumability with a **Durable Session Persistence Layer**: + +* Extend lifecycle with durable **PAUSED** state (cross-process, not just in-memory) +* Persist **logical checkpoints** (plan + job ledger + tool ledger) and optionally workspace artifacts +* Store control-plane metadata + audit trail in **BigQuery** +* Store large blobs (checkpoint/workspace) in **GCS** +* Resume on external events (BigQuery job completion → Pub/Sub) with **authoritative reconciliation** + +### Key benefits + +* **Reliability:** deterministic "warm start"; prevents duplicate job fleets +* **Cost:** no idle compute while waiting; typical storage **< $0.01/session-day paused** (see [Section 21: Cost Estimation](#21-cost-estimation)) +* **Enterprise:** SQL auditability (inspect what the agent did at hour 4 of 12) +* **Strategic:** differentiates ADK by enabling **cloud job execution continuity + enterprise audit**, not just "reasoning continuity" + +### Ask / decisions + +1. Review `CheckpointableAgentState` + integration with existing `ResumabilityConfig` +2. Confirm reference infra (BQ + GCS) and leasing approach +3. Select pilot (recommended: PII scanner) + **Decision:** Durable PAUSED as extension to existing resumability vs separate plugin + +### Proposed timeline (8 weeks to pilot) + +* Weeks 1–2: API + storage/lease decisions, integration design with existing resumability +* Weeks 3–4: reference store + resume skeleton +* Weeks 5–8: pilot + metrics +* Week 9+: iterate and choose rollout path + +--- + +## 1. Background & Motivation + +### 1.1 The "12-minute barrier" in cloud data workflows + +BigQuery workloads are inherently asynchronous and may run from minutes to hours. In typical cloud sandbox deployments (Cloud Run, Cloud Functions, GKE with autoscaling), agents face timeout constraints: + +* **Cloud Run:** default 5-minute timeout, max 60 minutes +* **Cloud Functions:** default 1-minute timeout, max 9 minutes (1st gen) or 60 minutes (2nd gen) +* **Vertex AI Agent Builder:** session timeouts vary by deployment mode + +When these timeouts occur during long-running BigQuery jobs, agents: + +* lose job IDs and progress state (unless using existing resumability) +* repeat metadata scans and tool calls +* risk re-submitting already-running jobs + +### 1.2 Existing ADK Resumability (Current State) + +ADK already has an **experimental resumability feature** (`src/google/adk/apps/app.py`): + +```python +@experimental +class ResumabilityConfig(BaseModel): + """The "resumability" in ADK refers to the ability to: + 1. pause an invocation upon a long-running function call. + 2. resume an invocation from the last event, if it's paused or failed midway + through. + + Note: ADK resumes the invocation in a best-effort manner: + 1. Tool call to resume needs to be idempotent because we only guarantee + an at-least-once behavior once resumed. + 2. Any temporary / in-memory state will be lost upon resumption. 
+ """ + is_resumable: bool = False +``` + +**Current capabilities:** +| Feature | Status | Location | +|---------|--------|----------| +| `ResumabilityConfig.is_resumable` | Experimental | `src/google/adk/apps/app.py:42-58` | +| `InvocationContext.should_pause_invocation()` | Implemented | `src/google/adk/agents/invocation_context.py:355-389` | +| `long_running_tool_ids` tracking | Implemented | `src/google/adk/events/event.py` | +| Resume from last event | Implemented | `src/google/adk/runners.py:1294+` | + +**Current limitations (gaps this design addresses):** +| Limitation | Impact | +|------------|--------| +| In-memory only | State lost on process death/restart | +| No external event triggers | Cannot wake on Pub/Sub, webhooks | +| No cross-process persistence | Cannot resume in different runner instance | +| No enterprise audit trail | No SQL-queryable checkpoint history | +| No cloud job reconciliation | No authoritative sync with BQ job states | + +### 1.3 Dogfooding BigQuery Agent Analytics + +Using BigQuery as a durable control plane is strategically aligned with the BigQuery Agent Analytics direction: + +* **Dogfooding:** demonstrates BQ-based agent observability capabilities +* **Auditability:** admins can query checkpoints directly ("what was the agent doing at hour 4?") +* **SQL robustness:** BigQuery idioms (e.g., ARRAY_AGG latest-per-session) make operational queries easy and efficient + +--- + +## 2. Problem Statement + +**This design extends ADK's existing resumability** to address gaps in cross-process durability and enterprise scenarios. + +Current ADK resumability is optimized for **in-process pause/resume**: +* Works within a single runner process lifecycle +* State persisted to session service (SQLite, Postgres, etc.) +* No external event-driven wake-up mechanism +* No BigQuery-native audit trail + +**Gaps this design addresses:** + +| Gap | Current State | Proposed Solution | +|-----|---------------|-------------------| +| Cross-process durability | State in session DB, but no checkpoint snapshots | BQ metadata + GCS blobs | +| External event triggers | Manual resume via API call | Pub/Sub → Resumer service | +| Cloud job reconciliation | App must track job IDs manually | Authoritative ledger reconciliation | +| Enterprise audit | Logs only | SQL-queryable BQ tables | +| Fleet observability | Per-session queries | Cross-agent BQ analytics | + +**Net effect:** ADK's existing resumability handles the "pause on long tool call" case well, but is not sufficient for BigQuery job fleets, multi-hour compliance scans, or any agentic workflow that needs **durable, cross-process, event-driven** "pause/wake/resume" loops. + +--- + +## 3. Goals & Non-Goals + +### 3.1 Goals + +1. **Extend** existing `ResumabilityConfig` to support durable, cross-process checkpoints +2. Support **hours-to-days** workflows via durable lifecycle state **PAUSED** +3. Enable **event-driven resume** (Pub/Sub/job events) with safe retries +4. Persist a deterministic **logical checkpoint**, not runtime heap snapshots +5. Provide **enterprise-grade auditability**, retention, and security posture +6. Ensure correctness via **two-phase commit**, **authoritative reconciliation**, and **lease-based resuming** +7. 
**Backward compatible** with existing ADK session services + +### 3.2 Non-Goals (v1) + +* Interpreter heap snapshot/restore (pickle/dill) — brittle across deployments and library changes +* Full microVM/container checkpointing — future work +* Replacing existing `ResumabilityConfig` — this design extends it +* Modifying existing session service implementations — new service alongside existing + +--- + +## 4. Proposed Lifecycle Model + +### 4.1 States + +Building on ADK's existing pause concept, we formalize durable states: + +* **RUNNING:** executing agent logic + tool calls +* **PAUSED:** no active compute; durable checkpoint exists in BQ+GCS; resumable via event or API +* **KILLED:** finalized; resources released; retention applies + (Optional operational outcomes: `FAILED`, `EXPIRED`.) + +### 4.2 Integration with Existing Resumability + +``` +Existing ADK Resumability Durable Session Extension +───────────────────────────── ────────────────────────────── +InvocationContext.is_resumable → DurableSessionConfig.is_durable +should_pause_invocation() → triggers checkpoint write +long_running_tool_ids → included in checkpoint ledger +Session events → replayed on resume + + BQ audit trail + + GCS checkpoint blobs + + Pub/Sub event triggers +``` + +### 4.3 "Serving → Rollout" framing + +This design shifts ADK from a request/response mindset to an **agentic rollout** model: + +* do work +* wait for environment events +* resume deterministically +* avoid compute idling + +--- + +## 5. Architecture Overview + +### 5.1 Layered checkpointing: logical → workspace → execution (future) + +**v1** explicitly adopts **Logical Checkpointing**: + +1. **Logical checkpoint (required):** plan/task graph state, job ledger, tool ledger, progress cursors +2. **Workspace snapshot (optional):** `/workspace` bundle (draft reports, code, small caches) +3. **Execution snapshot (future):** microVM/container restore + +**Rationale:** heap snapshots are notoriously fragile under code/library/version drift. Logical checkpoints remain deterministic across restarts and upgrades. 
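To make the logical layer concrete, here is a minimal sketch of what a v1 logical checkpoint payload could look like, covering the elements listed above (plan state, job ledger, tool ledger, progress cursors). The `build_logical_checkpoint` helper and its field names are illustrative assumptions, not part of the implemented API.

```python
import json
import time


def build_logical_checkpoint(plan, job_ledger, tool_ledger, cursors) -> bytes:
  """Assemble an illustrative logical checkpoint as a JSON blob.

  Illustrative sketch only; field names are assumptions, not the ADK API.
  Only deterministic, structured state is captured: no heap objects,
  no open connections, no interpreter internals.
  """
  state = {
      "state_schema_version": 1,
      "plan": plan,  # task graph / phase information
      "job_ledger": job_ledger,  # external job IDs with last known status
      "tool_ledger": tool_ledger,  # completed tool calls, for idempotent replay
      "cursors": cursors,  # progress markers, e.g. tables already scanned
      "checkpointed_at": time.time(),
  }
  return json.dumps(state, sort_keys=True).encode("utf-8")


# Example: a PII scan that has completed 35 of 100 tables.
blob = build_logical_checkpoint(
    plan={"phase": "scanning", "tables_total": 100},
    job_ledger={"job_036": {"status": "RUNNING", "consumed": False}},
    tool_ledger=[{"tool": "list_tables", "call_id": "t-001", "done": True}],
    cursors={"tables_completed": 35},
)
```

Because the payload is plain JSON, it survives process restarts, library upgrades, and host migration, which is exactly the property heap snapshots lack.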
+ +### 5.2 Control plane vs data plane (Google-scale reliability pattern) + +* **Control plane: BigQuery** + + * sessions/checkpoints/events as structured tables + * queryable summaries for auditing and fleet observability +* **Data plane: GCS** + + * checkpoint state blobs + * workspace bundles + * large artifacts (reports, samples, exports) + +### 5.3 Integration with Existing ADK Services + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ ADK Application │ +├─────────────────────────────────────────────────────────────────┤ +│ App( │ +│ resumability_config=ResumabilityConfig(is_resumable=True), │ +│ durable_session_config=DurableSessionConfig( # NEW │ +│ is_durable=True, │ +│ checkpoint_store=BigQueryCheckpointStore(...), │ +│ event_source=PubSubEventSource(...), │ +│ ), │ +│ ) │ +├─────────────────────────────────────────────────────────────────┤ +│ Existing ADK Services │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │ +│ │SessionService│ │ArtifactService│ │MemoryService │ │ +│ │(SQLite/PG/...)│ │(GCS/local) │ │(in-memory/vertex) │ │ +│ └──────────────┘ └──────────────┘ └──────────────────────┘ │ +├─────────────────────────────────────────────────────────────────┤ +│ NEW: Durable Session Layer │ +│ ┌──────────────────┐ ┌─────────────────┐ ┌───────────────┐ │ +│ │DurableSessionStore│ │CheckpointStore │ │ResumeService │ │ +│ │(orchestration) │ │(BQ meta+GCS blob)│ │(Pub/Sub listen)│ │ +│ └──────────────────┘ └─────────────────┘ └───────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 6. Why BigQuery as the Control Plane + +Using BigQuery as the metadata store is strategic: + +* **Auditability:** SQL query of checkpoints at any time without parsing logs +* **Fleet visibility:** query state of thousands of agents concurrently +* **Robust ops patterns:** latest-per-session via idiomatic BigQuery view is simple and performant +* **Dogfooding:** demonstrates BigQuery Agent Analytics and cross-agent observability +* **Existing infrastructure:** many ADK users already have BQ datasets for analytics + +--- + +## 7. Correctness & Failure Safety + +### 7.1 Two-phase checkpoint commit (atomic visibility) + +A checkpoint is "live" only once the **BigQuery metadata row** exists. 
+ +```python +def write_checkpoint( + session_id: str, + seq: int, + state_json: bytes, + workspace_path: str | None +) -> None: + """Two-phase checkpoint commit with error handling.""" + try: + # Phase 1: blobs to GCS (retry-safe, idempotent) + state_uri = gcs.upload( + f"checkpoints/{session_id}/{seq}/state.json", + state_json, + if_generation_match=0, # Fail if already exists + ) + workspace_uri = None + if workspace_path: + workspace_uri = gcs.upload( + f"checkpoints/{session_id}/{seq}/workspace.tar.gz", + compress_tar_gz(workspace_path), + if_generation_match=0, + ) + + # Phase 2: commit metadata in BigQuery (checkpoint becomes visible here) + bq.insert("checkpoints", { + "session_id": session_id, + "checkpoint_seq": seq, + "gcs_state_uri": state_uri, + "gcs_workspace_uri": workspace_uri, + "sha256": sha256(state_json), + "size_bytes": len(state_json), + "created_at": now(), + "trigger": "async_boundary", + "agent_state_json": extract_small_summary(state_json), + "checkpoint_fingerprint": fingerprint_checkpoint(state_json), + }) + + # Update pointer only after checkpoint metadata exists + bq.update("sessions", session_id, { + "current_checkpoint_seq": seq, + "updated_at": now(), + }) + + except GCSUploadError as e: + # Phase 1 failed - no cleanup needed, checkpoint not visible + logger.error(f"Checkpoint {seq} GCS upload failed: {e}") + raise CheckpointWriteError(f"GCS upload failed: {e}") from e + + except BigQueryInsertError as e: + # Phase 2 failed - orphan GCS blobs will be cleaned by GC + logger.error(f"Checkpoint {seq} BQ insert failed: {e}") + raise CheckpointWriteError(f"BQ insert failed: {e}") from e +``` + +**Garbage collection:** orphan GCS objects without a corresponding BQ metadata row are deleted after a grace window (default: 24 hours). + +--- + +### 7.2 Authoritative reconciliation (the core idempotency mechanism) + +On resume, do not trust events alone. Reconcile the ledger against authoritative cloud state. + +```python +def reconcile_on_resume(state: dict) -> dict: + """Reconcile job ledger against authoritative BigQuery state. + + This is the core idempotency mechanism - ensures we never + re-submit completed jobs or miss failed ones. 
+ """ + ledger = state["job_ledger"] + reconciliation_results = { + "jobs_completed": 0, + "jobs_failed": 0, + "jobs_cancelled": 0, + "jobs_still_running": 0, + } + + for job_id, meta in ledger.items(): + try: + job = bq.get_job(job_id) + except NotFoundError: + # Job was deleted or never existed + logger.warning(f"Job {job_id} not found, marking as lost") + meta["status"] = "LOST" + meta["reconciled_at"] = now() + continue + + if job.state == "DONE" and not meta.get("consumed"): + state["results"][job_id] = fetch_results(job, meta) + meta["consumed"] = True + meta["reconciled_at"] = now() + reconciliation_results["jobs_completed"] += 1 + + elif job.state == "FAILED": + handle_failed_job(job_id, job.error_result, meta, state) + reconciliation_results["jobs_failed"] += 1 + + elif job.state == "CANCELLED": + handle_cancelled_job(job_id, meta, state) + reconciliation_results["jobs_cancelled"] += 1 + + elif job.state in ("RUNNING", "PENDING"): + register_completion_callback(job_id) + reconciliation_results["jobs_still_running"] += 1 + + state["_reconciliation_results"] = reconciliation_results + return state +``` + +This is the enterprise-grade version of "remember where you left off": + +* prevents re-submitting 2-hour scans +* handles partial failures/cancellations deterministically +* turns resume into a repeatable state machine +* provides audit trail of reconciliation results + +--- + +### 7.3 Leasing & optimistic concurrency + +We must ensure only one runner resumes a session at a time. + +**BigQuery constraint:** lacks true row-level locking. BQ-based leasing is **optimistic lease acquisition (best-effort without external lock)**. If high-burst concurrency demands stronger guarantees, the pluggable lease manager can be backed by Firestore/Spanner or external single-delivery orchestration (e.g., Cloud Tasks). + +**When to use each backend:** + +| Backend | Use Case | Guarantees | +|---------|----------|------------| +| BigQuery (default) | Low-medium concurrency, cost-sensitive | Best-effort, ~100ms latency | +| Firestore | High concurrency, strong consistency needed | Strong, ~10ms latency | +| Cloud Tasks | Exactly-once delivery required | Exactly-once with dedup window | +| Spanner | Global distribution, strong consistency | Strong, multi-region | + +BQ lease acquire template: + +```sql +UPDATE `your_project.adk_metadata.sessions` +SET active_lease_id = @lease_id, + lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND), + updated_at = CURRENT_TIMESTAMP() +WHERE session_id = @session_id + AND status = 'PAUSED' + AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP()); +``` + +**Note:** BigQuery time travel (`FOR SYSTEM_TIME AS OF`) is useful for debugging historical state, but does not replace strong mutual exclusion. The "pluggable SessionLeaseManager" is the safety valve. + +--- + +## 8. ADK API Extensions (v1 contract) + +### 8.1 Core Interfaces + +```python +from abc import ABC, abstractmethod +from typing import Optional +from pydantic import BaseModel + +class CheckpointableAgentState(ABC): + """Interface for agents that support durable checkpointing. + + Extends the existing BaseAgentState pattern from + src/google/adk/agents/base_agent.py + """ + + @abstractmethod + def export_state(self) -> dict: + """Export agent state to a serializable dictionary. + + Returns: + Dictionary containing all state needed to resume. + Must be JSON-serializable. + """ + ... 
+ + @abstractmethod + def import_state(self, state: dict) -> None: + """Import agent state from a previously exported dictionary. + + Args: + state: Dictionary from a previous export_state() call. + """ + ... + + def get_state_schema_version(self) -> int: + """Return the schema version for this state format. + + Override to implement versioned state migrations. + Default: 1 + """ + return 1 + + +class WorkspaceSnapshotter: + """Handles workspace directory snapshots to/from GCS.""" + + def snapshot_to_gcs( + self, + session_id: str, + checkpoint_seq: int, + workspace_path: str = "/workspace", + max_size_bytes: int = 1 * 1024 * 1024 * 1024, # 1GB default + ) -> str: + """Snapshot workspace to GCS. + + Returns: + GCS URI of the uploaded snapshot. + + Raises: + WorkspaceTooLargeError: If workspace exceeds max_size_bytes. + """ + ... + + def restore_from_gcs(self, gcs_uri: str, workspace_path: str = "/workspace") -> None: + """Restore workspace from GCS snapshot.""" + ... + + +class DurableSessionStore(ABC): + """Abstract interface for durable checkpoint storage.""" + + @abstractmethod + def write_checkpoint( + self, + session_id: str, + checkpoint_seq: int, + state: dict, + workspace_gcs_uri: Optional[str] = None, + trigger: str = "async_boundary", + ) -> None: + """Write a checkpoint with two-phase commit.""" + ... + + @abstractmethod + def read_latest_checkpoint( + self, + session_id: str, + ) -> tuple[int, dict, Optional[str]]: + """Read the latest checkpoint for a session. + + Returns: + Tuple of (checkpoint_seq, state_dict, workspace_gcs_uri). + + Raises: + CheckpointNotFoundError: If no checkpoint exists. + """ + ... + + @abstractmethod + def list_checkpoints( + self, + session_id: str, + limit: int = 100, + ) -> list[dict]: + """List checkpoint metadata for a session.""" + ... +``` + +### 8.2 Configuration + +```python +from pydantic import BaseModel, Field +from typing import Literal, Optional + +class DurableSessionConfig(BaseModel): + """Configuration for durable session persistence. + + Works alongside existing ResumabilityConfig. + """ + + is_durable: bool = False + """Enable durable cross-process checkpointing.""" + + checkpoint_policy: Literal[ + "async_boundary", # Checkpoint when pausing for async tool (default) + "tool_call_boundary", # Checkpoint after every tool call + "superstep", # Checkpoint at agent-defined superstep boundaries + "manual", # Only checkpoint when explicitly requested + ] = "async_boundary" + """When to create checkpoints.""" + + workspace_snapshot_enabled: bool = False + """Whether to include workspace directory in checkpoints.""" + + workspace_max_size_bytes: int = Field( + default=100 * 1024 * 1024, # 100MB + description="Maximum workspace snapshot size", + ) + + checkpoint_store: Optional[DurableSessionStore] = None + """The checkpoint store implementation. 
If None, uses BigQueryCheckpointStore.""" + + lease_backend: Literal["bigquery", "firestore", "cloud_tasks"] = "bigquery" + """Backend for lease management.""" + + lease_ttl_seconds: int = Field( + default=300, # 5 minutes + description="Lease TTL before auto-release", + ) + + retry_policy: Optional[dict] = None + """Per-tool-type retry policies for failed jobs.""" +``` + +### 8.3 Checkpoint Policy Details + +| Policy | Trigger | Use Case | +|--------|---------|----------| +| `async_boundary` | `should_pause_invocation()` returns True | BigQuery jobs, external APIs (default) | +| `tool_call_boundary` | After every tool call completes | Maximum durability, higher cost | +| `superstep` | Agent calls `checkpoint_now()` | Agent controls checkpoint granularity | +| `manual` | Only via explicit API call | Testing, debugging | + +--- + +## 9. Current vs Proposed Capability Comparison + +| Feature | Current ADK (ResumabilityConfig) | Durable Session Extension | +|---------|----------------------------------|---------------------------| +| Pause on long tool call | Yes (experimental) | Yes | +| Resume from last event | Yes (in-process) | Yes (cross-process) | +| State persistence | Session service (SQLite/PG) | Session service + BQ/GCS checkpoints | +| Cross-process resume | No | Yes | +| External event triggers | No | Yes (Pub/Sub, webhooks) | +| Max job duration | Process lifetime | Practically unlimited (days/weeks) | +| Compute cost while waiting | Idle if process alive | Zero compute while PAUSED | +| Job knowledge (IDs, state) | In-memory or session state | Persisted in ledger + BQ tables | +| Recovery | Resume API call | Automatic via event + idempotent resume | +| Auditability | Logs, session events | SQL-queryable BQ control plane | +| Fleet visibility | Per-session queries | Cross-agent BQ analytics | + +--- + +## 10. Demo Scenario: Multi-Day PII Audit + +Assume discovery finds ~50 tables; agent submits **1 BigQuery job per table**. + +1. **RUNNING:** enumerate schema, prioritize, build ledger +2. **RUNNING → PAUSED:** submit job fleet, checkpoint (two-phase), mark PAUSED, release compute +3. **PAUSED (hours/days):** jobs run in BigQuery; agent consumes zero compute +4. **Resume:** Pub/Sub event → resumer acquires lease → reads checkpoint → reconciles ledger +5. **RUNNING:** process completed jobs, handle failures, submit retries if needed +6. **KILLED:** compile compliance report, write final audit rows, cleanup + +--- + +## 11. 
"Plumbing vs Logic": Why Framework-Level Support Matters + +### 11.1 Framework-level ADK support > agent-specific hacks + +This capability should live at the ADK level, not be reinvented per agent team: + +| Dimension | Specific Agent Approach | ADK Framework Approach | +|-----------|-------------------------|------------------------| +| Engineering effort | each team reimplements persistence/resume | toggled via config; solved once | +| Security/compliance | inconsistent VPC-SC/CMEK/IAM | governance baked into store/resumer | +| Observability | fragmented logs | unified BQ schema across agents | +| Skill portability | skills tied to bespoke persistence | state-aware skills via standard interface | + +### 11.2 The "plumbing" components (solve once) + +* two-phase commit +* workspace snapshotting +* durable store + GC +* resume service + idempotent event handling +* leasing/concurrency strategy +* observability/audit tables + +### 11.3 The "logic" components (agent-owned) + +* what to persist in checkpoint (`job_ledger`, `audit_cursor`, partial findings) +* retry policy decisions by job/tool type +* domain-specific analysis and reporting logic + +--- + +## 12. Generalization Beyond BigQuery (Universal Long-Horizon Primitive) + +Although the motivating example is BigQuery, the primitives are general: + +* **Ledger-based reconciliation:** any external handle can be tracked (job ID, build ID, ticket ID) +* **Workspace snapshots:** preserve files for coding/refactoring/report assembly tasks +* **Event-driven resume:** Pub/Sub triggers can represent almost any service completion webhook + +### 12.1 Non-BigQuery long-horizon scenarios + +| Task Type | Resume trigger | Ledger contents | +|-----------|----------------|-----------------| +| Cloud infra provisioning | resource-ready events | resource manifests + status | +| Software refactoring | CI completion | build IDs, test results, patch plan | +| Deep research | scheduled polling/new index event | search caches + draft outline | +| Human-in-the-loop | Slack/Chat message | approval flags + pending actions | +| ML training | training job completion | model artifacts, metrics, hyperparams | + +--- + +## 13. Alignment with Moltbot (formerly ClawBot) Architecture + +This proposal aligns strongly with the long-running daemon style popularized by Moltbot/ClawBot, especially in lifecycle/state management: + +| Feature | Moltbot/ClawBot Design | Durable ADK Design | Alignment | +|---------|------------------------|--------------------| ----------| +| Orchestration | Gateway/Coordinator routes persistent sessions | ADK Agent Runner + Resumer | High | +| Persistence | Local FS "diary files" | BQ (metadata) + GCS (blobs) | High (enterprise-grade) | +| Lifecycle | Running / Paused / Killed | RUNNING / PAUSED / KILLED | Identical | +| Execution model | "Rollout" async loops | Background agent hibernates + resumes | High | + +**Enterprise advantage vs local-first bots** + +* BQ control plane enables fleet-scale SQL audit ("1,000 agents state now") +* VPC-SC, CMEK, IAM boundaries can be standardized at framework level + +--- + +## 14. Competitive Landscape (LangGraph + Claude) + +### 14.1 TL;DR + +LangGraph offers durable workflow checkpointing; Claude SDK offers session continuity/harness patterns. Neither makes **cloud job reconciliation** plus **SQL-audit control plane** a first-class target. 
+ +### 14.2 Feature comparison + +| Feature | ADK (current) | ADK (proposed) | LangGraph | Claude SDK | +|---------|---------------|----------------|-----------|------------| +| In-process pause/resume | Yes (experimental) | Yes | Yes | Yes | +| Cross-process durability | No | Yes (BQ+GCS) | Yes (checkpointers) | Via harness | +| External event triggers | No | Yes (Pub/Sub) | Via external code | Via harness | +| Cloud job reconciliation | No | Yes (authoritative) | No | No | +| SQL audit trail | No | Yes (BQ) | No (requires custom) | No | +| Fleet observability | No | Yes (BQ analytics) | Via LangSmith | No | + +### 14.3 Why not "just use LangGraph checkpointers with BigQuery storage" + +LangGraph checkpointers serialize and restore workflow state at step boundaries, but BigQuery long-horizon requires: + +* authoritative job status reconciliation (DONE/FAILED/CANCELLED/RUNNING) +* result retrieval from destination tables +* partial failure handling and enterprise audit semantics + +This is not a drop-in "graph replay" problem; it's **cloud job continuity**. + +### 14.4 Borrow vs differentiate (prioritized) + +**v1 essential** + +1. checkpoint policy ergonomics (inspired by LangGraph) +2. coordinator/worker harness pattern (inspired by Anthropic article) + +**v2** +3. hybrid filesystem backends +4. skills/plugins packaging for BigQuery playbooks + +--- + +## 15. Alternatives Considered + +| Alternative | Why not (v1) | +|-------------|--------------| +| Extend existing SessionService | Different consistency model; BQ provides SQL audit | +| Firestore metadata | less SQL-auditable for analytics; can be lease backend later | +| Spanner leasing | heavy for v1; keep pluggable | +| Redis/Memorystore | ephemeral-first; lacks audit/query semantics | +| VM checkpointing | complex; brittle with environment drift | +| Cloud Workflows | static DAGs; agents need dynamic replanning | + +--- + +## 16. Size Limits, Spill Strategy, Compatibility + +### 16.1 Size limits + +* Keep `agent_state_json` summary small (< 1MB) and queryable +* Store full checkpoint in GCS (recommended < 100MB, hard limit 5GB) +* Workspace snapshot recommended ≤ 1 GB; large artifacts should be explicit GCS objects, not tarballed + +### 16.2 Compatibility & schema evolution + +* `agent_version`: code version (e.g., "1.2.3" or git SHA) +* `state_schema_version`: **monotonic INT64** (1,2,3…) +* optional `state_schema_version_label`: semver string for readability + +**v1 stance:** version mismatches hard-fail (safe). This prevents subtle bugs from incompatible state. + +**Migration strategy (v2):** + +```python +class CheckpointableAgentState(ABC): + def get_state_schema_version(self) -> int: + return 1 + + def migrate_state(self, old_state: dict, old_version: int) -> dict: + """Override to implement state migrations. + + Called when loading a checkpoint with older schema version. + Default: raise error (v1 behavior). + """ + raise StateSchemaMismatchError( + f"Cannot migrate from version {old_version} to {self.get_state_schema_version()}" + ) +``` + +### 16.3 checkpoint_fingerprint definition + +`checkpoint_fingerprint` = SHA256 of canonical checkpoint state excluding timestamps and non-deterministic fields. Useful for dedupe/debugging. 
+ +```python +def fingerprint_checkpoint(state: dict) -> str: + """Compute deterministic fingerprint for checkpoint state.""" + # Remove non-deterministic fields + canonical = {k: v for k, v in state.items() + if k not in ("_timestamp", "_reconciliation_results")} + # Sort keys for determinism + canonical_json = json.dumps(canonical, sort_keys=True, separators=(',', ':')) + return hashlib.sha256(canonical_json.encode()).hexdigest() +``` + +--- + +## 17. Security, Governance, Enterprise Readiness + +### 17.1 Data sensitivity + +* **Sensitive by default:** checkpoints may include PII findings, credentials, business data +* **Classification:** treat checkpoint data with same sensitivity as source data + +### 17.2 Encryption + +| Layer | Mechanism | +|-------|-----------| +| GCS blobs | CMEK (Customer-Managed Encryption Keys) | +| BQ tables | BQ encryption policies (default or CMEK) | +| In-transit | TLS 1.3 | + +### 17.3 Access control + +* **IAM:** least privilege, separate identities for runner vs store +* **Runner identity:** needs BQ read/write, GCS read/write +* **Resumer identity:** needs BQ read/write, GCS read, Pub/Sub subscribe +* **Audit identity:** needs BQ read only + +### 17.4 Retention & compliance + +* **TTL:** configurable per session/agent type +* **GC:** automatic cleanup of expired sessions and orphan blobs +* **Legal hold:** support for compliance holds if needed +* **Audit log:** all checkpoint operations logged to Cloud Audit Logs + +### 17.5 VPC-SC + +* **Day-1 requirement** for many enterprise customers +* Ensure checkpoint bucket is in same VPC-SC perimeter +* Use restricted.googleapis.com endpoints +* Document perimeter configuration in deployment guide + +--- + +## 18. Open Questions & Risks (Senior review) + +| Question | Risk Level | Notes | +|----------|------------|-------| +| Lease contention & latency under high event bursts | Medium | May need Firestore/Tasks for >100 concurrent resumes | +| Workspace growth management | Low | Differential sync/manifest snapshots for v2 | +| Checkpoint frequency tuning | Low | Define "smart boundaries" to balance cost and safety | +| VPC-SC compliance validation | High | Day-1 requirement; needs security review | +| Multi-region/DR support | Medium | Cross-region resume: supported or out of scope? | +| Integration with existing ResumabilityConfig | Low | Design is additive, not replacing | +| State migration complexity | Medium | Hard-fail v1 is safe but limits upgrades | + +--- + +## 19. Milestones / Rollout Plan + +| Week | Milestone | Deliverables | +|------|-----------|--------------| +| 1–2 | API design & integration planning | `DurableSessionConfig` API, integration with `ResumabilityConfig`, storage/lease strategy doc | +| 3–4 | Core implementation | `BigQueryCheckpointStore`, `WorkspaceSnapshotter`, two-phase commit | +| 5–6 | Resume service | `ResumeService`, Pub/Sub integration, lease management | +| 7–8 | Pilot integration | PII scanner pilot, metrics collection | +| 9+ | Iterate & decide | Performance tuning, decide first-class vs plugin path | + +--- + +## 20. Immediate Ask / Decisions + +1. **Review** `CheckpointableAgentState` contract and integration with existing `ResumabilityConfig` +2. **Confirm** BQ+GCS as reference infra and lease backend strategy +3. **Select** pilot use case (PII scanner recommended) +4. **Decide:** Durable PAUSED as extension to existing resumability vs separate plugin/extension + +--- + +## 21. 
Cost Estimation + +### 21.1 Storage costs + +| Component | Typical Size | Monthly Cost (US) | +|-----------|--------------|-------------------| +| BQ session row | ~2 KB | ~$0.00004/row | +| BQ checkpoint row | ~5 KB | ~$0.0001/row | +| GCS checkpoint blob | ~100 KB | ~$0.0026/GB = ~$0.00000026 | +| GCS workspace snapshot | ~50 MB | ~$0.0026/GB = ~$0.00013 | + +**Example: 1,000 sessions, 10 checkpoints each, 24-hour retention** + +| Item | Quantity | Cost | +|------|----------|------| +| BQ session rows | 1,000 | $0.04 | +| BQ checkpoint rows | 10,000 | $1.00 | +| GCS checkpoint blobs | 10,000 × 100KB = 1GB | $0.026 | +| GCS workspace snapshots | 1,000 × 50MB = 50GB | $1.30 | +| **Total daily** | | **~$2.37** | + +**Cost per session-day paused:** ~$0.002 (well under $0.01 estimate) + +### 21.2 Compute costs + +| Component | Cost | +|-----------|------| +| PAUSED session | $0 (no compute) | +| Resume service (Cloud Run) | ~$0.001 per resume | +| Pub/Sub events | ~$0.04 per million messages | + +### 21.3 BigQuery query costs + +| Query Type | Estimated Data Scanned | Cost | +|------------|------------------------|------| +| Get latest checkpoint | ~10 KB | ~$0.00000005 | +| List session checkpoints | ~100 KB | ~$0.0000005 | +| Fleet analytics query | ~10 MB | ~$0.00005 | + +--- + +## 22. Monitoring & Observability + +### 22.1 Key metrics + +| Metric | Description | Alert Threshold | +|--------|-------------|-----------------| +| `checkpoint_write_latency_ms` | Time to write checkpoint (P50, P99) | P99 > 5000ms | +| `checkpoint_write_errors` | Failed checkpoint writes | > 1% error rate | +| `resume_latency_ms` | Time from event to resumed | P99 > 10000ms | +| `lease_contention_rate` | Failed lease acquisitions | > 5% | +| `orphan_blob_count` | GCS blobs without BQ metadata | > 1000 | +| `paused_session_count` | Currently paused sessions | Informational | +| `sessions_near_ttl` | Sessions expiring within 24h | > 100 | + +### 22.2 Dashboards + +**Operational dashboard:** +- Active sessions by state (RUNNING/PAUSED/KILLED) +- Checkpoint write success rate +- Resume latency distribution +- Lease acquisition success rate + +**Cost dashboard:** +- Storage usage (BQ + GCS) +- Query costs by type +- Compute costs (resume service) + +### 22.3 Alerting + +| Alert | Condition | Severity | +|-------|-----------|----------| +| High checkpoint failure rate | > 1% errors in 5 min | P1 | +| Resume service unhealthy | > 50% error rate | P1 | +| Lease contention spike | > 10% contention in 5 min | P2 | +| Orphan blob accumulation | > 10,000 orphans | P3 | +| Sessions nearing TTL | > 100 sessions within 1h of TTL | P3 | + +### 22.4 Logging + +All operations emit structured logs with: +- `session_id`, `checkpoint_seq`, `operation` +- `latency_ms`, `success`, `error_code` +- Correlation IDs for tracing + +--- + +## 23. Rollback & Recovery Procedures + +### 23.1 Checkpoint rollback + +```python +def rollback_to_checkpoint(session_id: str, target_seq: int) -> None: + """Rollback session to a previous checkpoint. + + Use cases: + - Agent made incorrect decisions + - Corrupted state detected + - Testing/debugging + """ + # 1. Verify target checkpoint exists + checkpoint = store.read_checkpoint(session_id, target_seq) + + # 2. Update session to point to target checkpoint + bq.update("sessions", session_id, { + "current_checkpoint_seq": target_seq, + "updated_at": now(), + }) + + # 3. 
Log rollback for audit + bq.insert("events", { + "session_id": session_id, + "event_type": "ROLLBACK", + "event_payload": {"from_seq": current_seq, "to_seq": target_seq}, + "event_time": now(), + }) +``` + +### 23.2 Session recovery + +| Scenario | Recovery Procedure | +|----------|-------------------| +| Resume service crash | Automatic retry via Pub/Sub redelivery | +| Checkpoint corruption | Rollback to previous checkpoint | +| BQ metadata loss | Rebuild from GCS blob inventory | +| GCS blob loss | Mark checkpoint invalid, resume from earlier | +| Lease stuck | Auto-expire after TTL, manual release available | + +### 23.3 Disaster recovery + +**Same-region:** +- BQ point-in-time recovery (7 days default) +- GCS object versioning + +**Cross-region (v2):** +- BQ dataset replication +- GCS dual-region or multi-region buckets + +--- + +## 24. Implementation Details (v1) + +### 24.1 Module Structure + +``` +src/google/adk/durable/ +├── __init__.py # Public exports +├── config.py # DurableSessionConfig +├── checkpointable_state.py # CheckpointableAgentState ABC +├── workspace_snapshotter.py # GCS workspace snapshot handling +└── stores/ + ├── __init__.py # Store exports + ├── base_checkpoint_store.py # DurableSessionStore ABC + └── bigquery_checkpoint_store.py # BQ + GCS implementation +``` + +### 24.2 Key Implementation Decisions + +| Decision | Rationale | +|----------|-----------| +| DML INSERT over streaming inserts | BigQuery streaming buffer limitations prevent immediate UPDATE after streaming insert | +| JSON column type checking | BigQuery returns JSON columns as dicts, not strings - added runtime type detection | +| SHA-256 verification | Checkpoint integrity verification on read | +| Async-first API | All store methods are async for non-blocking I/O | +| Experimental decorators | All public classes marked `@experimental` for API stability signals | + +### 24.3 BigQuery Table Schema (Simplified for v1) + +```sql +-- Sessions table +CREATE TABLE `project.adk_metadata.sessions` ( + session_id STRING NOT NULL, + status STRING NOT NULL, + agent_name STRING NOT NULL, + created_at TIMESTAMP NOT NULL, + updated_at TIMESTAMP NOT NULL, + current_checkpoint_seq INT64 NOT NULL, + active_lease_id STRING, + lease_expiry TIMESTAMP, + ttl_expiry TIMESTAMP, + metadata JSON, + PRIMARY KEY (session_id) NOT ENFORCED +); + +-- Checkpoints table +CREATE TABLE `project.adk_metadata.checkpoints` ( + session_id STRING NOT NULL, + checkpoint_seq INT64 NOT NULL, + created_at TIMESTAMP NOT NULL, + gcs_state_uri STRING NOT NULL, + sha256 STRING NOT NULL, + size_bytes INT64 NOT NULL, + agent_state JSON, + trigger STRING NOT NULL, + PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED +); +``` + +### 24.4 Demo Architecture + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Cloud Run: durable-demo │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ FastAPI Server │ │ +│ │ - demo_server.py: Task management + checkpoint APIs │ │ +│ │ - demo_ui.html: Real-time visualization UI │ │ +│ └───────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ BigQueryCheckpointStore │ │ +│ │ - Two-phase commit (GCS blob → BQ metadata) │ │ +│ │ - Lease management for concurrency │ │ +│ │ - SHA-256 integrity verification │ │ +│ └───────────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + │ │ + ▼ ▼ + ┌──────────────────┐ 
┌──────────────────┐ + │ BigQuery │ │ GCS │ + │ adk_metadata │ │ checkpoints/ │ + │ - sessions │ │ {session_id}/ │ + │ - checkpoints │ │ {seq}/state.json│ + └──────────────────┘ └──────────────────┘ +``` + +### 24.5 Demo Features + +| Feature | Implementation | +|---------|----------------| +| Task types | Sentiment, Anomaly, Trend, Clustering analysis | +| Checkpoint interval | Every 10 seconds | +| Failure simulation | Manual trigger via UI | +| Resume from checkpoint | Automatic state restoration | +| Final output | Task-specific analysis reports | +| Real-time UI | Polling-based status updates | +| Checkpoint timeline | Visual checkpoint history | + +--- + +# Appendix A: Feature-to-Requirement Mapping (Demo Coverage) + +| Feature | Functional Purpose | Long-horizon benefit | +|---------|--------------------|-----------------------| +| Two-phase checkpoint commit | atomic visibility of state | prevents half-saved resumes | +| BigQuery job ledger | track async job IDs & states | hibernate during hours-long jobs | +| Workspace snapshotting | preserve files and drafts | warm start for coding/report tasks | +| Lease-based resuming | prevent concurrent resume | avoids corruption in parallel runs | +| Durable lifecycle model | add persistent PAUSED | releases compute, supports indefinite horizon | +| Authoritative reconciliation | sync with cloud job state | prevents duplicate submissions | +| Integration with ResumabilityConfig | backward compatibility | incremental adoption | + +--- + +# Appendix B: BigQuery SQL (Copy/Paste) + +## B0) Dataset + +```sql +CREATE SCHEMA IF NOT EXISTS `your_project.adk_metadata` +OPTIONS ( + location = "US", + description = "ADK Durable Session control-plane metadata (sessions, checkpoints, events)." +); +``` + +## B1) sessions + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.sessions` ( + session_id STRING NOT NULL, + parent_session_id STRING, + owner_principal STRING NOT NULL, + + status STRING NOT NULL, + agent_name STRING NOT NULL, + agent_version STRING NOT NULL, + persistence_mode STRING NOT NULL, + + created_at TIMESTAMP NOT NULL, + updated_at TIMESTAMP NOT NULL, + + current_checkpoint_seq INT64 NOT NULL, + active_lease_id STRING, + lease_expiry TIMESTAMP, + + ttl_expiry TIMESTAMP NOT NULL, + + labels JSON, + metadata JSON, + + state_schema_version INT64 NOT NULL, + state_schema_version_label STRING, + + -- Primary key constraint (BigQuery syntax) + PRIMARY KEY (session_id) NOT ENFORCED +) +PARTITION BY DATE(updated_at) +CLUSTER BY status, owner_principal +OPTIONS (description = "Durable agent session control-plane table."); +``` + +## B2) checkpoints + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.checkpoints` ( + session_id STRING NOT NULL, + checkpoint_seq INT64 NOT NULL, + + agent_version STRING NOT NULL, + state_schema_version INT64 NOT NULL, + state_schema_version_label STRING, + + created_at TIMESTAMP NOT NULL, + + gcs_state_uri STRING NOT NULL, + gcs_workspace_uri STRING, + + sha256 STRING NOT NULL, + size_bytes INT64 NOT NULL, + + agent_state_json JSON, + trigger STRING NOT NULL, + + num_jobs INT64, + num_tables_scanned INT64, + num_findings INT64, + + checkpoint_fingerprint STRING, + + -- Composite primary key + PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED +) +PARTITION BY DATE(created_at) +CLUSTER BY session_id +OPTIONS (description = "Checkpoint metadata; full blobs stored in GCS."); +``` + +## B3) events + +```sql +CREATE TABLE IF NOT EXISTS `your_project.adk_metadata.events` ( + event_id STRING NOT 
NULL, + session_id STRING NOT NULL, + + event_time TIMESTAMP NOT NULL, + event_type STRING NOT NULL, + event_payload JSON, + + processed BOOL NOT NULL, + processed_at TIMESTAMP, + processing_lease_id STRING, + + source STRING, + severity STRING, + + -- Primary key + PRIMARY KEY (event_id) NOT ENFORCED +) +PARTITION BY DATE(event_time) +CLUSTER BY session_id, processed +OPTIONS (description = "Resume trigger events and processing audit trail."); +``` + +## B4) Views + +Latest checkpoint per session (with NULL handling): + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_latest_checkpoint` AS +SELECT + session_id, + ARRAY_AGG(c ORDER BY checkpoint_seq DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_checkpoint +FROM `your_project.adk_metadata.checkpoints` c +GROUP BY session_id; +``` + +Paused sessions nearing TTL: + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_paused_near_ttl` AS +SELECT + session_id, owner_principal, agent_name, agent_version, + ttl_expiry, updated_at, current_checkpoint_seq, + TIMESTAMP_DIFF(ttl_expiry, CURRENT_TIMESTAMP(), HOUR) AS hours_until_expiry +FROM `your_project.adk_metadata.sessions` +WHERE status = 'PAUSED' + AND ttl_expiry < TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR); +``` + +Fleet status summary: + +```sql +CREATE OR REPLACE VIEW `your_project.adk_metadata.v_fleet_status` AS +SELECT + agent_name, + status, + COUNT(*) AS session_count, + AVG(current_checkpoint_seq) AS avg_checkpoints, + MIN(created_at) AS oldest_session, + MAX(updated_at) AS most_recent_activity +FROM `your_project.adk_metadata.sessions` +WHERE ttl_expiry > CURRENT_TIMESTAMP() +GROUP BY agent_name, status; +``` + +Lease acquire template: + +```sql +UPDATE `your_project.adk_metadata.sessions` +SET active_lease_id = @lease_id, + lease_expiry = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL @ttl_seconds SECOND), + updated_at = CURRENT_TIMESTAMP() +WHERE session_id = @session_id + AND status = 'PAUSED' + AND (active_lease_id IS NULL OR lease_expiry < CURRENT_TIMESTAMP()); +``` + +--- + +# Appendix C: Sequence Diagram (Mermaid) + +```mermaid +sequenceDiagram + participant App as ADK Application + participant Runner as ADK Agent Runner + participant ResConfig as ResumabilityConfig + participant DurConfig as DurableSessionConfig + participant Store as Durable Store (BQ+GCS) + participant BQ as BigQuery + participant PS as Pub/Sub + participant Resumer as Resume Service + + Note over App,Resumer: Initialization + App->>Runner: Create with ResumabilityConfig + DurableSessionConfig + Runner->>ResConfig: is_resumable = True + Runner->>DurConfig: is_durable = True + + Note over App,Resumer: Execution & Pause + Runner->>BQ: Submit async jobs (N) + Runner->>ResConfig: should_pause_invocation() = True + Runner->>Store: Phase1: Write state blob to GCS + Runner->>Store: Phase2: Insert checkpoint metadata (BQ) + Runner->>Store: Update session status = PAUSED + Runner-->>App: Yield control (zero compute) + + Note over App,Resumer: External Events + BQ-->>PS: Job completion event(s) + PS-->>Resumer: Deliver event (may be duplicated) + + Note over App,Resumer: Resume + Resumer->>Store: Acquire lease(session_id) + + alt Lease already held + Store-->>Resumer: Lease denied + Resumer->>Resumer: Back off and retry / skip event + else Lease granted + Store-->>Resumer: Lease granted + Resumer->>Store: Read latest checkpoint + Resumer->>BQ: Reconcile job ledger (authoritative) + Resumer->>Runner: Resume session with checkpoint + Runner->>Store: Periodic checkpoint updates + Runner->>Store: 
Finalize session status = KILLED + Resumer->>Store: Release lease(session_id) + end +``` + +--- + +# Appendix D: Failure Modes (Operational) + +| Failure Mode | Detection | Recovery | +|--------------|-----------|----------| +| Duplicate Pub/Sub event | Lease acquisition fails | Skip, idempotent | +| Partial checkpoint write (Phase 1) | GCS upload error | Retry, no cleanup needed | +| Partial checkpoint write (Phase 2) | BQ insert error | Orphan blob GC | +| Resume crash mid-execution | Lease expires, no heartbeat | Re-acquire lease, resume from checkpoint | +| Jobs still running on resume | Reconciliation detects RUNNING | Re-register completion callback | +| Jobs failed/cancelled | Reconciliation detects state | Agent retry policy, audit decision | +| Permission revoked | API error | Fail with explicit error + audit row | +| TTL expiry | Scheduled job | GC + mark expired | +| Checkpoint corruption | SHA256 mismatch | Rollback to previous checkpoint | +| State schema mismatch | Version check on load | Hard-fail (v1), migrate (v2) | + +--- + +# Appendix E: Integration Example + +```python +from google.adk.apps import App, ResumabilityConfig +from google.adk.agents import LlmAgent +from google.adk.durable import ( + DurableSessionConfig, + BigQueryCheckpointStore, + PubSubEventSource, +) + +# Create durable-enabled application +app = App( + name="pii_scanner", + root_agent=LlmAgent( + name="scanner", + model="gemini-2.0-flash", + instructions="Scan BigQuery tables for PII...", + tools=[bq_query_tool, bq_job_tool], + ), + # Existing resumability (in-process) + resumability_config=ResumabilityConfig( + is_resumable=True, + ), + # NEW: Durable cross-process persistence + durable_session_config=DurableSessionConfig( + is_durable=True, + checkpoint_policy="async_boundary", + workspace_snapshot_enabled=False, + checkpoint_store=BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", + gcs_bucket="my-checkpoints-bucket", + ), + lease_backend="bigquery", + lease_ttl_seconds=300, + ), +) + +# Run with runner (checkpoint happens automatically on pause) +runner = Runner( + app=app, + session_service=DatabaseSessionService(...), +) + +# Events from Pub/Sub automatically trigger resume +async for event in runner.run_async( + user_id="user-123", + session_id="session-456", + new_message=Content(parts=[Part(text="Scan all tables for PII")]), +): + print(event) +``` + +--- + +# References (URLs) + +1. LangGraph durable execution: [https://docs.langchain.com/oss/python/langgraph/durable-execution/](https://docs.langchain.com/oss/python/langgraph/durable-execution/) +2. LangGraph persistence/checkpointers: [https://docs.langchain.com/oss/python/langgraph/persistence/](https://docs.langchain.com/oss/python/langgraph/persistence/) +3. LangGraph overview: [https://docs.langchain.com/oss/python/langgraph/](https://docs.langchain.com/oss/python/langgraph/) +4. LangGraph checkpoints reference: [https://reference.langchain.com/python/langgraph/checkpoints/](https://reference.langchain.com/python/langgraph/checkpoints/) +5. Deep Agents overview: [https://docs.langchain.com/oss/python/deepagents/overview/](https://docs.langchain.com/oss/python/deepagents/overview/) +6. Deep Agents long-term memory: [https://docs.langchain.com/oss/python/deepagents/long-term-memory/](https://docs.langchain.com/oss/python/deepagents/long-term-memory/) +7. 
Anthropic long-running harnesses: [https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents](https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents) +8. ADK ResumabilityConfig: `src/google/adk/apps/app.py:42-58` +9. ADK InvocationContext pause: `src/google/adk/agents/invocation_context.py:355-389` diff --git a/contributing/samples/long_running_task/setup.py b/contributing/samples/long_running_task/setup.py new file mode 100644 index 0000000000..c97ecad3e9 --- /dev/null +++ b/contributing/samples/long_running_task/setup.py @@ -0,0 +1,246 @@ +#!/usr/bin/env python +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Setup script for the durable session demo. + +This script creates the required BigQuery dataset, tables, and GCS bucket +for the durable session persistence demo. + +Usage: + python setup.py + +Prerequisites: + - Google Cloud SDK installed and configured + - BigQuery API enabled + - Cloud Storage API enabled + - Appropriate IAM permissions: + - roles/bigquery.dataEditor + - roles/storage.objectAdmin +""" + +import argparse +import subprocess +import sys + +# Configuration +PROJECT_ID = "test-project-0728-467323" +DATASET = "adk_metadata" +GCS_BUCKET = f"{PROJECT_ID}-adk-checkpoints" +LOCATION = "US" + + +def run_command( + cmd: list[str], check: bool = True +) -> subprocess.CompletedProcess: + """Run a shell command and return the result.""" + print(f"Running: {' '.join(cmd)}") + result = subprocess.run(cmd, capture_output=True, text=True) + if check and result.returncode != 0: + print(f"Error: {result.stderr}") + if not result.stderr.strip().endswith("already exists"): + sys.exit(1) + return result + + +def create_gcs_bucket(): + """Create the GCS bucket for checkpoint blobs.""" + print("\n=== Creating GCS Bucket ===") + run_command( + ["gsutil", "mb", "-l", LOCATION, f"gs://{GCS_BUCKET}"], check=False + ) + + # Set lifecycle policy to delete old checkpoints after 30 days + lifecycle_config = """ +{ + "lifecycle": { + "rule": [ + { + "action": {"type": "Delete"}, + "condition": {"age": 30} + } + ] + } +} +""" + with open("/tmp/lifecycle.json", "w") as f: + f.write(lifecycle_config) + + run_command( + [ + "gsutil", + "lifecycle", + "set", + "/tmp/lifecycle.json", + f"gs://{GCS_BUCKET}", + ], + check=False, + ) + + print(f"GCS bucket created: gs://{GCS_BUCKET}") + + +def create_bigquery_dataset(): + """Create the BigQuery dataset.""" + print("\n=== Creating BigQuery Dataset ===") + run_command( + [ + "bq", + "mk", + "--dataset", + "--location", + LOCATION, + f"{PROJECT_ID}:{DATASET}", + ], + check=False, + ) + print(f"BigQuery dataset created: {PROJECT_ID}.{DATASET}") + + +def create_sessions_table(): + """Create the sessions metadata table.""" + print("\n=== Creating Sessions Table ===") + + schema = """ +session_id:STRING, +status:STRING, +agent_name:STRING, +created_at:TIMESTAMP, +updated_at:TIMESTAMP, +current_checkpoint_seq:INT64, +active_lease_id:STRING, +lease_expiry:TIMESTAMP, 
+ttl_expiry:TIMESTAMP, +metadata:JSON +""" + + run_command( + [ + "bq", + "mk", + "--table", + f"{PROJECT_ID}:{DATASET}.sessions", + schema.replace("\n", "").strip(), + ], + check=False, + ) + + print(f"Sessions table created: {PROJECT_ID}.{DATASET}.sessions") + + +def create_checkpoints_table(): + """Create the checkpoints table.""" + print("\n=== Creating Checkpoints Table ===") + + schema = """ +session_id:STRING, +checkpoint_seq:INT64, +created_at:TIMESTAMP, +gcs_state_uri:STRING, +sha256:STRING, +size_bytes:INT64, +agent_state_json:JSON, +trigger:STRING +""" + + run_command( + [ + "bq", + "mk", + "--table", + f"{PROJECT_ID}:{DATASET}.checkpoints", + schema.replace("\n", "").strip(), + ], + check=False, + ) + + print(f"Checkpoints table created: {PROJECT_ID}.{DATASET}.checkpoints") + + +def verify_setup(): + """Verify that all resources were created successfully.""" + print("\n=== Verifying Setup ===") + + # Check GCS bucket + result = run_command(["gsutil", "ls", f"gs://{GCS_BUCKET}"], check=False) + if result.returncode == 0: + print(f"[OK] GCS bucket exists: gs://{GCS_BUCKET}") + else: + print(f"[FAIL] GCS bucket not found: gs://{GCS_BUCKET}") + + # Check BigQuery tables + for table in ["sessions", "checkpoints"]: + result = run_command( + ["bq", "show", f"{PROJECT_ID}:{DATASET}.{table}"], check=False + ) + if result.returncode == 0: + print(f"[OK] BigQuery table exists: {PROJECT_ID}.{DATASET}.{table}") + else: + print(f"[FAIL] BigQuery table not found: {PROJECT_ID}.{DATASET}.{table}") + + +def cleanup(): + """Delete all resources created by this script.""" + print("\n=== Cleaning Up Resources ===") + + # Delete BigQuery tables + for table in ["sessions", "checkpoints"]: + run_command( + ["bq", "rm", "-f", f"{PROJECT_ID}:{DATASET}.{table}"], check=False + ) + + # Delete BigQuery dataset + run_command(["bq", "rm", "-f", "-d", f"{PROJECT_ID}:{DATASET}"], check=False) + + # Delete GCS bucket + run_command(["gsutil", "rm", "-r", f"gs://{GCS_BUCKET}"], check=False) + + print("Cleanup complete.") + + +def main(): + parser = argparse.ArgumentParser( + description="Setup resources for the durable session demo" + ) + parser.add_argument( + "--cleanup", + action="store_true", + help="Delete all resources instead of creating them", + ) + parser.add_argument( + "--verify", action="store_true", help="Only verify that resources exist" + ) + args = parser.parse_args() + + print(f"Project: {PROJECT_ID}") + print(f"Dataset: {DATASET}") + print(f"GCS Bucket: {GCS_BUCKET}") + print(f"Location: {LOCATION}") + + if args.cleanup: + cleanup() + elif args.verify: + verify_setup() + else: + create_gcs_bucket() + create_bigquery_dataset() + create_sessions_table() + create_checkpoints_table() + verify_setup() + + print("\nDone!") + + +if __name__ == "__main__": + main() diff --git a/contributing/samples/long_running_task/tools.py b/contributing/samples/long_running_task/tools.py new file mode 100644 index 0000000000..4dbbf4455c --- /dev/null +++ b/contributing/samples/long_running_task/tools.py @@ -0,0 +1,489 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Long-running tools for the durable session demo.""" + +import asyncio +import random +from datetime import datetime +from typing import Any + +from google.adk.tools.tool_context import ToolContext + + +async def simulate_long_running_scan( + table_name: str, + tool_context: ToolContext, +) -> dict[str, Any]: + """Simulate a long-running BigQuery table scan. + + This tool demonstrates durable checkpointing by simulating a scan that + takes several seconds. In a real scenario, this would be a BigQuery job + that processes large amounts of data. + + Args: + table_name: The fully-qualified BigQuery table name to scan. + tool_context: The tool context for accessing state and artifacts. + + Returns: + A dictionary with scan results including status, row count, and findings. + """ + # Simulate processing time (5-10 seconds) + processing_time = random.uniform(5.0, 10.0) + await asyncio.sleep(processing_time) + + # Simulate scan results + rows_scanned = random.randint(100000, 10000000) + findings = [] + + # Generate some sample findings based on table name + if "shakespeare" in table_name.lower(): + findings = [ + "Found 5 instances of 'to be or not to be'", + "Most common word: 'the' (27,801 occurrences)", + "Unique words: 29,066", + ] + elif "github" in table_name.lower(): + findings = [ + "Most active repository: kubernetes/kubernetes", + "Peak commit hour: 14:00 UTC", + "Average commits per day: 45,000", + ] + else: + findings = [ + f"Scanned {rows_scanned:,} rows", + "No anomalies detected", + "Data quality: 99.8%", + ] + + return { + "status": "complete", + "table": table_name, + "rows_scanned": rows_scanned, + "processing_time_seconds": round(processing_time, 2), + "findings": findings, + } + + +async def run_data_pipeline( + source_table: str, + destination_table: str, + transformations: list[str], + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a data transformation pipeline. + + This simulates a multi-stage data pipeline that would typically be + checkpointed at each stage for durability. + + Args: + source_table: The source BigQuery table. + destination_table: The destination BigQuery table. + transformations: List of transformation operations to apply. + tool_context: The tool context for accessing state and artifacts. + + Returns: + Pipeline execution results. + """ + stages_completed = [] + total_rows_processed = 0 + + # Simulate each transformation stage + for i, transformation in enumerate(transformations): + # Simulate stage processing time + stage_time = random.uniform(2.0, 5.0) + await asyncio.sleep(stage_time) + + rows_processed = random.randint(10000, 100000) + total_rows_processed += rows_processed + + stages_completed.append({ + "stage": i + 1, + "transformation": transformation, + "rows_processed": rows_processed, + "duration_seconds": round(stage_time, 2), + }) + + return { + "status": "complete", + "source_table": source_table, + "destination_table": destination_table, + "stages_completed": stages_completed, + "total_rows_processed": total_rows_processed, + "total_stages": len(transformations), + } + + +async def run_extended_analysis( + job_name: str, + duration_minutes: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run an extended analysis job for a specified duration. + + This tool simulates a long-running analysis job that can run for 10+ minutes. + Use this to test durable checkpointing with extended job durations. 
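+  The workload is simulated with asyncio.sleep in 30-second chunks, and the
+  reported metrics are randomly generated rather than computed from real data.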
+ + Args: + job_name: A descriptive name for the analysis job. + duration_minutes: How many minutes the job should run (1-60 minutes). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Analysis job results with timing and metrics. + """ + start_time = datetime.now() + duration_seconds = min(max(duration_minutes, 1), 60) * 60 + + # Process in chunks, reporting progress + chunk_size = 30 # Report every 30 seconds + chunks_completed = 0 + total_chunks = duration_seconds // chunk_size + + metrics = { + "records_processed": 0, + "anomalies_detected": 0, + "patterns_found": 0, + } + + for i in range(0, duration_seconds, chunk_size): + remaining = min(chunk_size, duration_seconds - i) + await asyncio.sleep(remaining) + + chunks_completed += 1 + metrics["records_processed"] += random.randint(100000, 500000) + metrics["anomalies_detected"] += random.randint(0, 10) + metrics["patterns_found"] += random.randint(1, 5) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + return { + "status": "complete", + "job_name": job_name, + "requested_duration_minutes": duration_minutes, + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "metrics": metrics, + "summary": ( + f"Processed {metrics['records_processed']:,} records, " + f"found {metrics['anomalies_detected']} anomalies and " + f"{metrics['patterns_found']} patterns" + ), + } + + +async def run_ml_training_job( + model_name: str, + dataset_size: str, + epochs: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a simulated ML model training job. + + This tool simulates training a machine learning model, which can take + 10+ minutes depending on the dataset size and epochs. + + Dataset sizes and approximate training times: + - "small": ~2 minutes + - "medium": ~5 minutes + - "large": ~10 minutes + - "xlarge": ~15 minutes + - "enterprise": ~30 minutes + + Args: + model_name: Name for the model being trained. + dataset_size: Size of dataset - "small", "medium", "large", "xlarge", or "enterprise". + epochs: Number of training epochs (1-100). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Training results with metrics and model performance. 
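+
+  Note: epoch metrics are randomly simulated. Total runtime grows only mildly
+  with the number of epochs (10% of the base time per extra epoch) and is
+  capped at one hour.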
+ """ + start_time = datetime.now() + + # Map dataset size to base training time (in seconds) + size_to_time = { + "small": 120, # 2 minutes + "medium": 300, # 5 minutes + "large": 600, # 10 minutes + "xlarge": 900, # 15 minutes + "enterprise": 1800, # 30 minutes + } + + base_time = size_to_time.get(dataset_size.lower(), 300) + epochs = min(max(epochs, 1), 100) + + # Total time scales with epochs (but not linearly) + total_time = base_time * (1 + (epochs - 1) * 0.1) + total_time = min(total_time, 3600) # Cap at 1 hour + + # Simulate training epochs + epoch_results = [] + time_per_epoch = total_time / epochs + + for epoch in range(1, epochs + 1): + await asyncio.sleep(time_per_epoch) + + # Simulate improving metrics over epochs + base_loss = 2.5 - (epoch / epochs) * 2.0 + loss = base_loss + random.uniform(-0.1, 0.1) + accuracy = min(0.5 + (epoch / epochs) * 0.45 + random.uniform(-0.02, 0.02), 0.99) + + epoch_results.append({ + "epoch": epoch, + "loss": round(loss, 4), + "accuracy": round(accuracy, 4), + "learning_rate": round(0.001 * (0.95 ** (epoch - 1)), 6), + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + final_metrics = epoch_results[-1] if epoch_results else {} + + return { + "status": "complete", + "model_name": model_name, + "dataset_size": dataset_size, + "epochs_completed": epochs, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "final_loss": final_metrics.get("loss"), + "final_accuracy": final_metrics.get("accuracy"), + "training_history": epoch_results[-5:], # Last 5 epochs + "model_artifact": f"gs://models/{model_name}/v1/model.pkl", + } + + +async def run_batch_etl_job( + job_id: str, + source_tables: list[str], + target_table: str, + processing_minutes: int, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a batch ETL (Extract, Transform, Load) job. + + This tool simulates a large-scale ETL job that processes multiple source + tables and loads data into a target table. Can run for 10+ minutes. + + Args: + job_id: Unique identifier for this ETL job. + source_tables: List of source table names to process. + target_table: Destination table for processed data. + processing_minutes: Estimated processing time in minutes (1-60). + tool_context: The tool context for accessing state and artifacts. + + Returns: + ETL job results with detailed metrics. 
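+
+  Note: processing time is split evenly across the source tables, and the
+  extract/transform/load row counts are randomly simulated.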
+ """ + start_time = datetime.now() + duration_seconds = min(max(processing_minutes, 1), 60) * 60 + + # Process each source table + table_results = [] + time_per_table = duration_seconds / max(len(source_tables), 1) + + total_rows_extracted = 0 + total_rows_transformed = 0 + total_rows_loaded = 0 + + for table in source_tables: + await asyncio.sleep(time_per_table) + + rows_extracted = random.randint(1000000, 10000000) + rows_transformed = int(rows_extracted * random.uniform(0.85, 0.99)) + rows_loaded = int(rows_transformed * random.uniform(0.98, 1.0)) + + total_rows_extracted += rows_extracted + total_rows_transformed += rows_transformed + total_rows_loaded += rows_loaded + + table_results.append({ + "source_table": table, + "rows_extracted": rows_extracted, + "rows_transformed": rows_transformed, + "rows_loaded": rows_loaded, + "transform_ratio": round(rows_transformed / rows_extracted, 4), + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + return { + "status": "complete", + "job_id": job_id, + "source_tables_processed": len(source_tables), + "target_table": target_table, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "actual_duration_seconds": round(actual_duration, 2), + "actual_duration_minutes": round(actual_duration / 60, 2), + "total_rows_extracted": total_rows_extracted, + "total_rows_transformed": total_rows_transformed, + "total_rows_loaded": total_rows_loaded, + "overall_success_rate": round(total_rows_loaded / total_rows_extracted, 4), + "table_details": table_results, + } + + +async def run_demo_analysis( + analysis_type: str, + tool_context: ToolContext, +) -> dict[str, Any]: + """Run a 1-minute demo analysis job to showcase durable checkpointing. + + This tool is perfect for demos - it runs for exactly 1 minute with + progress updates every 10 seconds, showing how the system handles + long-running operations with checkpointing. + + Args: + analysis_type: Type of analysis to run (e.g., "sentiment", "anomaly", + "trend", "clustering"). + tool_context: The tool context for accessing state and artifacts. + + Returns: + Analysis results with timing and metrics. 
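+
+  Note: the returned insights are canned per analysis type (sentiment,
+  anomaly, trend, clustering); any other type yields a generic summary.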
+ """ + start_time = datetime.now() + total_duration = 60 # 1 minute + update_interval = 10 # Progress every 10 seconds + + progress_updates = [] + metrics = { + "records_analyzed": 0, + "insights_found": 0, + "confidence_score": 0.0, + } + + for i in range(0, total_duration, update_interval): + await asyncio.sleep(update_interval) + + progress_pct = ((i + update_interval) / total_duration) * 100 + records_batch = random.randint(50000, 150000) + metrics["records_analyzed"] += records_batch + metrics["insights_found"] += random.randint(1, 5) + metrics["confidence_score"] = min( + 0.6 + (progress_pct / 100) * 0.35 + random.uniform(-0.02, 0.02), + 0.99 + ) + + progress_updates.append({ + "timestamp": datetime.now().isoformat(), + "progress_percent": round(progress_pct, 1), + "records_batch": records_batch, + "cumulative_records": metrics["records_analyzed"], + }) + + end_time = datetime.now() + actual_duration = (end_time - start_time).total_seconds() + + # Generate analysis-specific insights + insights = { + "sentiment": [ + "Overall sentiment: 72% positive", + "Key themes: innovation, growth, sustainability", + "Sentiment trend: improving over time", + ], + "anomaly": [ + "Detected 3 significant anomalies", + "Anomaly cluster in Q3 data", + "Root cause: seasonal variation", + ], + "trend": [ + "Strong upward trend detected", + "Growth rate: 15% month-over-month", + "Forecast: continued growth expected", + ], + "clustering": [ + "Identified 5 distinct clusters", + "Largest cluster: 45% of data", + "Cluster separation: excellent", + ], + }.get(analysis_type.lower(), [ + f"Completed {analysis_type} analysis", + "Results within expected parameters", + "No critical issues detected", + ]) + + return { + "status": "complete", + "analysis_type": analysis_type, + "start_time": start_time.isoformat(), + "end_time": end_time.isoformat(), + "duration_seconds": round(actual_duration, 2), + "metrics": metrics, + "insights": insights, + "progress_history": progress_updates, + "summary": ( + f"Completed {analysis_type} analysis on " + f"{metrics['records_analyzed']:,} records. " + f"Found {metrics['insights_found']} insights with " + f"{metrics['confidence_score']:.1%} confidence." + ), + } + + +def get_table_schema(table_name: str) -> dict[str, Any]: + """Get the schema of a BigQuery table. + + This is a quick synchronous operation that doesn't require checkpointing. + + Args: + table_name: The fully-qualified BigQuery table name. + + Returns: + The table schema information. 
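+
+  Note: schemas are canned. Table names containing 'shakespeare' or 'github'
+  return realistic shapes; any other name returns a generic three-column
+  schema.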
+ """ + # Simulate some common schemas + if "shakespeare" in table_name.lower(): + return { + "table": table_name, + "fields": [ + {"name": "word", "type": "STRING"}, + {"name": "word_count", "type": "INTEGER"}, + {"name": "corpus", "type": "STRING"}, + {"name": "corpus_date", "type": "INTEGER"}, + ], + "num_rows": 164656, + "size_bytes": 6432064, + } + elif "github" in table_name.lower(): + return { + "table": table_name, + "fields": [ + {"name": "repo_name", "type": "STRING"}, + {"name": "path", "type": "STRING"}, + {"name": "content", "type": "STRING"}, + {"name": "size", "type": "INTEGER"}, + ], + "num_rows": 2800000000, + "size_bytes": 2500000000000, + } + else: + return { + "table": table_name, + "fields": [ + {"name": "id", "type": "INTEGER"}, + {"name": "name", "type": "STRING"}, + {"name": "created_at", "type": "TIMESTAMP"}, + ], + "num_rows": 1000000, + "size_bytes": 100000000, + } diff --git a/src/google/adk/apps/__init__.py b/src/google/adk/apps/__init__.py index 3a5d0b0643..88d3474f3a 100644 --- a/src/google/adk/apps/__init__.py +++ b/src/google/adk/apps/__init__.py @@ -15,7 +15,18 @@ from .app import App from .app import ResumabilityConfig + +# Lazy import for DurableSessionConfig to avoid circular imports +def __getattr__(name: str): + if name == 'DurableSessionConfig': + from ..durable.config import DurableSessionConfig + + return DurableSessionConfig + raise AttributeError(f'module {__name__!r} has no attribute {name!r}') + + __all__ = [ 'App', 'ResumabilityConfig', + 'DurableSessionConfig', ] diff --git a/src/google/adk/apps/app.py b/src/google/adk/apps/app.py index 71ea5ce5aa..8779ad67dc 100644 --- a/src/google/adk/apps/app.py +++ b/src/google/adk/apps/app.py @@ -14,6 +14,7 @@ from __future__ import annotations from typing import Optional +from typing import TYPE_CHECKING from pydantic import BaseModel from pydantic import ConfigDict @@ -26,6 +27,9 @@ from ..plugins.base_plugin import BasePlugin from ..utils.feature_decorator import experimental +if TYPE_CHECKING: + from ..durable.config import DurableSessionConfig + def validate_app_name(name: str) -> None: """Ensures the provided application name is safe and intuitive.""" @@ -118,6 +122,13 @@ class App(BaseModel): If configured, will be applied to all agents in the app. """ + durable_session_config: Optional["DurableSessionConfig"] = None + """ + The config for durable session persistence. + If configured, sessions will be checkpointed to external storage (BigQuery + + GCS), enabling recovery from failures and migration across hosts. + """ + @model_validator(mode="after") def _validate_name(self) -> App: validate_app_name(self.name) diff --git a/src/google/adk/durable/__init__.py b/src/google/adk/durable/__init__.py new file mode 100644 index 0000000000..bdc9082a0d --- /dev/null +++ b/src/google/adk/durable/__init__.py @@ -0,0 +1,33 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Durable session persistence module for ADK. 
+ +This module provides checkpoint-based durability for long-running agent +invocations, enabling recovery from failures and migration across hosts. +""" + +from .checkpointable_state import CheckpointableAgentState +from .config import DurableSessionConfig +from .stores import BigQueryCheckpointStore +from .stores import DurableSessionStore +from .workspace_snapshotter import WorkspaceSnapshotter + +__all__ = [ + "CheckpointableAgentState", + "DurableSessionConfig", + "DurableSessionStore", + "BigQueryCheckpointStore", + "WorkspaceSnapshotter", +] diff --git a/src/google/adk/durable/checkpointable_state.py b/src/google/adk/durable/checkpointable_state.py new file mode 100644 index 0000000000..e9a372855e --- /dev/null +++ b/src/google/adk/durable/checkpointable_state.py @@ -0,0 +1,114 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Abstract base class for checkpointable agent state.""" + +from __future__ import annotations + +import abc +from typing import Any +from typing import Dict + +from pydantic import BaseModel +from pydantic import ConfigDict + + +class CheckpointableAgentState(BaseModel, abc.ABC): + """Abstract base class for agent state that can be checkpointed. + + Agents that need to preserve custom state across checkpoints should inherit + from this class and implement the serialization methods. + + Example: + ```python + class MyAgentState(CheckpointableAgentState): + counter: int = 0 + processed_items: list[str] = [] + + def to_checkpoint_dict(self) -> dict[str, Any]: + return { + "counter": self.counter, + "processed_items": self.processed_items, + } + + @classmethod + def from_checkpoint_dict(cls, data: dict[str, Any]) -> "MyAgentState": + return cls( + counter=data.get("counter", 0), + processed_items=data.get("processed_items", []), + ) + ``` + """ + + model_config = ConfigDict( + extra="allow", + ) + + @abc.abstractmethod + def to_checkpoint_dict(self) -> Dict[str, Any]: + """Serialize the state to a dictionary for checkpointing. + + Returns: + A dictionary containing all state that should be persisted. + The dictionary must be JSON-serializable. + """ + + @classmethod + @abc.abstractmethod + def from_checkpoint_dict( + cls, data: Dict[str, Any] + ) -> "CheckpointableAgentState": + """Deserialize the state from a checkpoint dictionary. + + Args: + data: The dictionary previously returned by to_checkpoint_dict(). + + Returns: + A new instance of the state class with restored values. + """ + + +class SimpleCheckpointableState(CheckpointableAgentState): + """A simple implementation of CheckpointableAgentState using a dict. + + This class provides a basic implementation that stores arbitrary key-value + pairs. Use this when you don't need custom serialization logic. 
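+
+  Note: to_checkpoint_dict() returns a shallow copy of `data`, so nested
+  mutable values are still shared with the live state.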
+ + Example: + ```python + state = SimpleCheckpointableState() + state.data["counter"] = 5 + state.data["results"] = ["a", "b", "c"] + + # Checkpoint + checkpoint = state.to_checkpoint_dict() + + # Restore + restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + assert restored.data["counter"] == 5 + ``` + """ + + data: Dict[str, Any] = {} + + def to_checkpoint_dict(self) -> Dict[str, Any]: + """Serialize the state to a dictionary.""" + return {"data": self.data.copy()} + + @classmethod + def from_checkpoint_dict( + cls, data: Dict[str, Any] + ) -> "SimpleCheckpointableState": + """Deserialize the state from a checkpoint dictionary.""" + return cls(data=data.get("data", {})) diff --git a/src/google/adk/durable/config.py b/src/google/adk/durable/config.py new file mode 100644 index 0000000000..d4b7d91e7a --- /dev/null +++ b/src/google/adk/durable/config.py @@ -0,0 +1,70 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Configuration for durable session persistence.""" + +from __future__ import annotations + +from typing import Any +from typing import Literal +from typing import Optional + +from pydantic import BaseModel +from pydantic import ConfigDict +from pydantic import Field + +from ..utils.feature_decorator import experimental + + +@experimental +class DurableSessionConfig(BaseModel): + """Configuration for durable session persistence. + + Durable sessions provide checkpoint-based persistence that survives process + restarts, enabling recovery from failures and migration across hosts. This + goes beyond the basic resumability feature by persisting session state to + external storage (BigQuery + GCS). + + Attributes: + is_durable: Whether to enable durable checkpointing. + checkpoint_policy: When to create checkpoints: + - "async_boundary": Checkpoint when hitting async/long-running operations + - "every_turn": Checkpoint after every agent turn + - "manual": Only checkpoint when explicitly requested + checkpoint_store: The store to use for persisting checkpoints. + lease_timeout_seconds: How long a lease is valid before expiring. + max_checkpoint_size_bytes: Maximum size for checkpoint state blobs. 
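+
+  Example (illustrative; assumes the BigQuery dataset and GCS bucket already
+  exist, with placeholder resource names):
+
+    ```python
+    from google.adk.durable import BigQueryCheckpointStore
+    from google.adk.durable import DurableSessionConfig
+
+    config = DurableSessionConfig(
+        is_durable=True,
+        checkpoint_policy="async_boundary",
+        checkpoint_store=BigQueryCheckpointStore(
+            project="my-project",
+            dataset="adk_metadata",
+            gcs_bucket="my-project-adk-checkpoints",
+        ),
+        lease_timeout_seconds=300,
+    )
+    ```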
+ """ + + model_config = ConfigDict( + arbitrary_types_allowed=True, + extra="forbid", + ) + + is_durable: bool = False + """Whether to enable durable checkpointing.""" + + checkpoint_policy: Literal["async_boundary", "every_turn", "manual"] = ( + "async_boundary" + ) + """When to create checkpoints during execution.""" + + checkpoint_store: Optional[Any] = Field(default=None) + """The store to use for persisting checkpoints (DurableSessionStore).""" + + lease_timeout_seconds: int = Field(default=300, ge=60, le=3600) + """How long a lease is valid before expiring (60-3600 seconds).""" + + max_checkpoint_size_bytes: int = Field(default=10 * 1024 * 1024, ge=1024) + """Maximum size for checkpoint state blobs (default 10MB).""" diff --git a/src/google/adk/durable/stores/__init__.py b/src/google/adk/durable/stores/__init__.py new file mode 100644 index 0000000000..cb04e432b6 --- /dev/null +++ b/src/google/adk/durable/stores/__init__.py @@ -0,0 +1,23 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Checkpoint store implementations for durable sessions.""" + +from .base_checkpoint_store import DurableSessionStore +from .bigquery_checkpoint_store import BigQueryCheckpointStore + +__all__ = [ + "DurableSessionStore", + "BigQueryCheckpointStore", +] diff --git a/src/google/adk/durable/stores/base_checkpoint_store.py b/src/google/adk/durable/stores/base_checkpoint_store.py new file mode 100644 index 0000000000..e6d553cd57 --- /dev/null +++ b/src/google/adk/durable/stores/base_checkpoint_store.py @@ -0,0 +1,258 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Abstract base class for durable session checkpoint stores.""" + +from __future__ import annotations + +import abc +from dataclasses import dataclass +from datetime import datetime +from typing import Any +from typing import Dict +from typing import Optional + + +@dataclass +class Checkpoint: + """Represents a checkpoint for a durable session. + + Attributes: + session_id: The ID of the session this checkpoint belongs to. + checkpoint_seq: The sequence number of this checkpoint (monotonically + increasing). + created_at: When this checkpoint was created. + gcs_state_uri: The GCS URI where the full state blob is stored. + sha256: SHA-256 hash of the state blob for integrity verification. + size_bytes: Size of the state blob in bytes. + agent_state: Small agent state stored inline in BigQuery (optional). + trigger: What triggered this checkpoint (e.g., "async_boundary", "manual"). 
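+
+  The full serialized state lives in GCS at `gcs_state_uri`; the blob is
+  re-hashed on read and compared against `sha256` to detect corruption.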
+ """ + + session_id: str + checkpoint_seq: int + created_at: datetime + gcs_state_uri: str + sha256: str + size_bytes: int + agent_state: Optional[Dict[str, Any]] = None + trigger: str = "async_boundary" + + +@dataclass +class SessionMetadata: + """Metadata about a durable session. + + Attributes: + session_id: The unique session identifier. + status: Current status ("active", "paused", "completed", "failed"). + agent_name: Name of the root agent for this session. + created_at: When the session was created. + updated_at: When the session was last updated. + current_checkpoint_seq: The latest checkpoint sequence number. + active_lease_id: ID of the current lease holder (if any). + lease_expiry: When the current lease expires. + ttl_expiry: When this session should be garbage collected. + metadata: Additional custom metadata. + """ + + session_id: str + status: str + agent_name: str + created_at: datetime + updated_at: datetime + current_checkpoint_seq: int + active_lease_id: Optional[str] = None + lease_expiry: Optional[datetime] = None + ttl_expiry: Optional[datetime] = None + metadata: Optional[Dict[str, Any]] = None + + +class DurableSessionStore(abc.ABC): + """Abstract base class for checkpoint stores. + + A checkpoint store provides persistent storage for session checkpoints, + enabling recovery from failures and migration across hosts. + + Implementations must provide: + - Checkpoint write/read operations with two-phase commit + - Lease management to prevent concurrent modifications + - Session metadata management + """ + + @abc.abstractmethod + async def create_session( + self, + *, + session_id: str, + agent_name: str, + metadata: Optional[Dict[str, Any]] = None, + ) -> SessionMetadata: + """Create a new durable session. + + Args: + session_id: Unique identifier for the session. + agent_name: Name of the root agent. + metadata: Optional custom metadata. + + Returns: + The created session metadata. + + Raises: + ValueError: If a session with this ID already exists. + """ + + @abc.abstractmethod + async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]: + """Get session metadata. + + Args: + session_id: The session to retrieve. + + Returns: + The session metadata, or None if not found. + """ + + @abc.abstractmethod + async def update_session_status( + self, *, session_id: str, status: str + ) -> None: + """Update the status of a session. + + Args: + session_id: The session to update. + status: The new status. + """ + + @abc.abstractmethod + async def write_checkpoint( + self, + *, + session_id: str, + checkpoint_seq: int, + state_blob: bytes, + agent_state: Optional[Dict[str, Any]] = None, + trigger: str = "async_boundary", + ) -> Checkpoint: + """Write a checkpoint with two-phase commit. + + This operation should: + 1. Upload the state blob to GCS + 2. Record the checkpoint metadata in BigQuery + 3. Update the session's current_checkpoint_seq + + Args: + session_id: The session to checkpoint. + checkpoint_seq: The sequence number for this checkpoint. + state_blob: The serialized state to persist. + agent_state: Small agent state to store inline (optional). + trigger: What triggered this checkpoint. + + Returns: + The created checkpoint. + + Raises: + ValueError: If the checkpoint_seq is not greater than the current. + """ + + @abc.abstractmethod + async def read_latest_checkpoint( + self, *, session_id: str + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read the latest checkpoint for a session. + + Args: + session_id: The session to read. 
+ + Returns: + A tuple of (checkpoint, state_blob), or None if no checkpoints exist. + """ + + @abc.abstractmethod + async def read_checkpoint( + self, *, session_id: str, checkpoint_seq: int + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read a specific checkpoint. + + Args: + session_id: The session to read. + checkpoint_seq: The checkpoint sequence number. + + Returns: + A tuple of (checkpoint, state_blob), or None if not found. + """ + + @abc.abstractmethod + async def acquire_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Attempt to acquire a lease on a session. + + Leases prevent concurrent modifications to a session. Only the lease + holder can write checkpoints or update session status. + + Args: + session_id: The session to lease. + lease_id: A unique identifier for this lease attempt. + timeout_seconds: How long the lease should be valid. + + Returns: + True if the lease was acquired, False if another lease is active. + """ + + @abc.abstractmethod + async def release_lease(self, *, session_id: str, lease_id: str) -> None: + """Release a lease on a session. + + Args: + session_id: The session to release. + lease_id: The lease ID to release (must match the active lease). + """ + + @abc.abstractmethod + async def renew_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Renew an existing lease. + + Args: + session_id: The session to renew. + lease_id: The lease ID to renew (must match the active lease). + timeout_seconds: New timeout for the lease. + + Returns: + True if the lease was renewed, False if the lease is not active. + """ + + @abc.abstractmethod + async def list_checkpoints( + self, *, session_id: str, limit: int = 10 + ) -> list[Checkpoint]: + """List checkpoints for a session. + + Args: + session_id: The session to list checkpoints for. + limit: Maximum number of checkpoints to return. + + Returns: + List of checkpoints, ordered by checkpoint_seq descending. + """ + + @abc.abstractmethod + async def delete_session(self, *, session_id: str) -> None: + """Delete a session and all its checkpoints. + + Args: + session_id: The session to delete. + """ diff --git a/src/google/adk/durable/stores/bigquery_checkpoint_store.py b/src/google/adk/durable/stores/bigquery_checkpoint_store.py new file mode 100644 index 0000000000..3d53995ecc --- /dev/null +++ b/src/google/adk/durable/stores/bigquery_checkpoint_store.py @@ -0,0 +1,693 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
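+
+# NOTE: This store depends on the optional `google-cloud-bigquery` and
+# `google-cloud-storage` client libraries; they are imported lazily inside the
+# store methods, so importing this module does not require them.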
+ +"""BigQuery + GCS implementation of durable session checkpoint store.""" + +from __future__ import annotations + +from datetime import datetime +from datetime import timedelta +from datetime import timezone +import hashlib +import json +import logging +from typing import Any +from typing import Dict +from typing import Optional +import uuid + +from ...utils.feature_decorator import experimental +from .base_checkpoint_store import Checkpoint +from .base_checkpoint_store import DurableSessionStore +from .base_checkpoint_store import SessionMetadata + +logger = logging.getLogger("google_adk." + __name__) + + +@experimental +class BigQueryCheckpointStore(DurableSessionStore): + """Checkpoint store using BigQuery for metadata and GCS for state blobs. + + This implementation stores: + - Session metadata and checkpoint records in BigQuery tables + - Large state blobs in Google Cloud Storage + + Prerequisites: + - BigQuery dataset with sessions and checkpoints tables + - GCS bucket for state blobs + - Appropriate IAM permissions + + Example: + ```python + store = BigQueryCheckpointStore( + project="my-project", + dataset="adk_metadata", + gcs_bucket="my-project-adk-checkpoints", + ) + + # Create a session + await store.create_session( + session_id="sess-123", + agent_name="my_agent", + ) + + # Write a checkpoint + await store.write_checkpoint( + session_id="sess-123", + checkpoint_seq=1, + state_blob=b"...", + ) + + # Read it back + checkpoint, blob = await store.read_latest_checkpoint(session_id="sess-123") + ``` + """ + + def __init__( + self, + *, + project: str, + dataset: str, + gcs_bucket: str, + sessions_table: str = "sessions", + checkpoints_table: str = "checkpoints", + location: str = "US", + ): + """Initialize the BigQuery checkpoint store. + + Args: + project: GCP project ID. + dataset: BigQuery dataset name. + gcs_bucket: GCS bucket name for state blobs. + sessions_table: Name of the sessions table. + checkpoints_table: Name of the checkpoints table. + location: BigQuery dataset location. 
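+
+    Note: the BigQuery and GCS clients are created lazily on first use, so
+    constructing the store performs no network calls.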
+ """ + self._project = project + self._dataset = dataset + self._gcs_bucket = gcs_bucket + self._sessions_table = sessions_table + self._checkpoints_table = checkpoints_table + self._location = location + + # Lazy-loaded clients + self._bq_client = None + self._storage_client = None + + @property + def _sessions_table_id(self) -> str: + return f"{self._project}.{self._dataset}.{self._sessions_table}" + + @property + def _checkpoints_table_id(self) -> str: + return f"{self._project}.{self._dataset}.{self._checkpoints_table}" + + def _get_bq_client(self): + """Lazy-load BigQuery client.""" + if self._bq_client is None: + from google.cloud import bigquery + + self._bq_client = bigquery.Client( + project=self._project, location=self._location + ) + return self._bq_client + + def _get_storage_client(self): + """Lazy-load Cloud Storage client.""" + if self._storage_client is None: + from google.cloud import storage + + self._storage_client = storage.Client(project=self._project) + return self._storage_client + + def _get_gcs_uri(self, session_id: str, checkpoint_seq: int) -> str: + """Generate a GCS URI for a checkpoint blob.""" + return f"gs://{self._gcs_bucket}/checkpoints/{session_id}/{checkpoint_seq}.json.gz" + + async def create_session( + self, + *, + session_id: str, + agent_name: str, + metadata: Optional[Dict[str, Any]] = None, + ) -> SessionMetadata: + """Create a new durable session.""" + now = datetime.now(timezone.utc) + + # Check if session already exists + existing = await self.get_session(session_id=session_id) + if existing: + raise ValueError(f"Session {session_id} already exists") + + # Insert session record using DML (not streaming) for immediate updatability + client = self._get_bq_client() + from google.cloud import bigquery + + insert_query = f""" + INSERT INTO `{self._sessions_table_id}` + (session_id, status, agent_name, created_at, updated_at, + current_checkpoint_seq, active_lease_id, lease_expiry, ttl_expiry, metadata) + VALUES + (@session_id, @status, @agent_name, @created_at, @updated_at, + @current_checkpoint_seq, @active_lease_id, @lease_expiry, @ttl_expiry, + PARSE_JSON(@metadata)) + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("status", "STRING", "active"), + bigquery.ScalarQueryParameter("agent_name", "STRING", agent_name), + bigquery.ScalarQueryParameter( + "created_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("current_checkpoint_seq", "INT64", 0), + bigquery.ScalarQueryParameter("active_lease_id", "STRING", None), + bigquery.ScalarQueryParameter("lease_expiry", "TIMESTAMP", None), + bigquery.ScalarQueryParameter("ttl_expiry", "TIMESTAMP", None), + bigquery.ScalarQueryParameter( + "metadata", "STRING", json.dumps(metadata) if metadata else None + ), + ] + ) + client.query(insert_query, job_config=job_config).result() + + logger.info("Created durable session: %s", session_id) + + return SessionMetadata( + session_id=session_id, + status="active", + agent_name=agent_name, + created_at=now, + updated_at=now, + current_checkpoint_seq=0, + metadata=metadata, + ) + + async def get_session(self, *, session_id: str) -> Optional[SessionMetadata]: + """Get session metadata.""" + query = f""" + SELECT + session_id, + status, + agent_name, + created_at, + updated_at, + current_checkpoint_seq, + active_lease_id, + lease_expiry, + ttl_expiry, + 
metadata + FROM `{self._sessions_table_id}` + WHERE session_id = @session_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + ] + ) + results = client.query(query, job_config=job_config).result() + + for row in results: + return SessionMetadata( + session_id=row.session_id, + status=row.status, + agent_name=row.agent_name, + created_at=row.created_at, + updated_at=row.updated_at, + current_checkpoint_seq=row.current_checkpoint_seq, + active_lease_id=row.active_lease_id, + lease_expiry=row.lease_expiry, + ttl_expiry=row.ttl_expiry, + metadata=row.metadata if isinstance(row.metadata, dict) else (json.loads(row.metadata) if row.metadata else None), + ) + + return None + + async def update_session_status( + self, *, session_id: str, status: str + ) -> None: + """Update the status of a session.""" + now = datetime.now(timezone.utc) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET status = @status, updated_at = @updated_at + WHERE session_id = @session_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("status", "STRING", status), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(query, job_config=job_config).result() + logger.debug("Updated session %s status to %s", session_id, status) + + async def write_checkpoint( + self, + *, + session_id: str, + checkpoint_seq: int, + state_blob: bytes, + agent_state: Optional[Dict[str, Any]] = None, + trigger: str = "async_boundary", + ) -> Checkpoint: + """Write a checkpoint with two-phase commit.""" + import gzip + + now = datetime.now(timezone.utc) + + # Verify session exists and checkpoint_seq is valid + session = await self.get_session(session_id=session_id) + if not session: + raise ValueError(f"Session {session_id} not found") + + if checkpoint_seq <= session.current_checkpoint_seq: + raise ValueError( + f"checkpoint_seq {checkpoint_seq} must be greater than current" + f" {session.current_checkpoint_seq}" + ) + + # Compute hash of the state blob + sha256 = hashlib.sha256(state_blob).hexdigest() + size_bytes = len(state_blob) + + # Phase 1: Upload to GCS + gcs_uri = self._get_gcs_uri(session_id, checkpoint_seq) + blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "") + + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + blob = bucket.blob(blob_path) + + compressed = gzip.compress(state_blob) + blob.upload_from_string(compressed, content_type="application/gzip") + logger.debug( + "Uploaded checkpoint blob to %s (%d bytes compressed)", + gcs_uri, + len(compressed), + ) + + # Phase 2: Insert checkpoint record using DML + client = self._get_bq_client() + from google.cloud import bigquery + + insert_query = f""" + INSERT INTO `{self._checkpoints_table_id}` + (session_id, checkpoint_seq, created_at, gcs_state_uri, sha256, + size_bytes, agent_state_json, trigger) + VALUES + (@session_id, @checkpoint_seq, @created_at, @gcs_state_uri, @sha256, + @size_bytes, PARSE_JSON(@agent_state_json), @trigger) + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", 
"INT64", checkpoint_seq + ), + bigquery.ScalarQueryParameter( + "created_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("gcs_state_uri", "STRING", gcs_uri), + bigquery.ScalarQueryParameter("sha256", "STRING", sha256), + bigquery.ScalarQueryParameter("size_bytes", "INT64", size_bytes), + bigquery.ScalarQueryParameter( + "agent_state_json", "STRING", + json.dumps(agent_state) if agent_state else None + ), + bigquery.ScalarQueryParameter("trigger", "STRING", trigger), + ] + ) + try: + client.query(insert_query, job_config=job_config).result() + except Exception as e: + # Rollback: delete the GCS blob + blob.delete() + raise RuntimeError(f"Failed to insert checkpoint record: {e}") + + # Phase 3: Update session's current_checkpoint_seq + from google.cloud import bigquery + + update_query = f""" + UPDATE `{self._sessions_table_id}` + SET current_checkpoint_seq = @checkpoint_seq, updated_at = @updated_at + WHERE session_id = @session_id + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", "INT64", checkpoint_seq + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(update_query, job_config=job_config).result() + + logger.info( + "Wrote checkpoint %d for session %s (%d bytes, sha256=%s)", + checkpoint_seq, + session_id, + size_bytes, + sha256[:16], + ) + + return Checkpoint( + session_id=session_id, + checkpoint_seq=checkpoint_seq, + created_at=now, + gcs_state_uri=gcs_uri, + sha256=sha256, + size_bytes=size_bytes, + agent_state=agent_state, + trigger=trigger, + ) + + async def read_latest_checkpoint( + self, *, session_id: str + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read the latest checkpoint for a session.""" + session = await self.get_session(session_id=session_id) + if not session or session.current_checkpoint_seq == 0: + return None + + return await self.read_checkpoint( + session_id=session_id, checkpoint_seq=session.current_checkpoint_seq + ) + + async def read_checkpoint( + self, *, session_id: str, checkpoint_seq: int + ) -> Optional[tuple[Checkpoint, bytes]]: + """Read a specific checkpoint.""" + import gzip + + query = f""" + SELECT + session_id, + checkpoint_seq, + created_at, + gcs_state_uri, + sha256, + size_bytes, + agent_state_json, + trigger + FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id AND checkpoint_seq = @checkpoint_seq + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter( + "checkpoint_seq", "INT64", checkpoint_seq + ), + ] + ) + results = client.query(query, job_config=job_config).result() + + checkpoint_row = None + for row in results: + checkpoint_row = row + break + + if not checkpoint_row: + return None + + # Download blob from GCS + gcs_uri = checkpoint_row.gcs_state_uri + blob_path = gcs_uri.replace(f"gs://{self._gcs_bucket}/", "") + + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + blob = bucket.blob(blob_path) + + compressed = blob.download_as_bytes() + state_blob = gzip.decompress(compressed) + + # Verify integrity + actual_sha256 = hashlib.sha256(state_blob).hexdigest() + if actual_sha256 != checkpoint_row.sha256: + raise RuntimeError( + "Checkpoint integrity check failed: expected" + 
f" {checkpoint_row.sha256}, got {actual_sha256}" + ) + + checkpoint = Checkpoint( + session_id=checkpoint_row.session_id, + checkpoint_seq=checkpoint_row.checkpoint_seq, + created_at=checkpoint_row.created_at, + gcs_state_uri=checkpoint_row.gcs_state_uri, + sha256=checkpoint_row.sha256, + size_bytes=checkpoint_row.size_bytes, + agent_state=( + checkpoint_row.agent_state_json if isinstance(checkpoint_row.agent_state_json, dict) + else (json.loads(checkpoint_row.agent_state_json) if checkpoint_row.agent_state_json else None) + ), + trigger=checkpoint_row.trigger, + ) + + logger.debug( + "Read checkpoint %d for session %s (%d bytes)", + checkpoint_seq, + session_id, + len(state_blob), + ) + + return checkpoint, state_blob + + async def acquire_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Attempt to acquire a lease on a session.""" + now = datetime.now(timezone.utc) + expiry = now + timedelta(seconds=timeout_seconds) + + # Atomic update: only succeed if no active lease or lease expired + query = f""" + UPDATE `{self._sessions_table_id}` + SET + active_lease_id = @lease_id, + lease_expiry = @lease_expiry, + updated_at = @updated_at + WHERE + session_id = @session_id + AND (active_lease_id IS NULL OR lease_expiry < @now) + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "lease_expiry", "TIMESTAMP", expiry.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + bigquery.ScalarQueryParameter("now", "TIMESTAMP", now.isoformat()), + ] + ) + result = client.query(query, job_config=job_config).result() + + # Check if the update affected any rows + if result.num_dml_affected_rows and result.num_dml_affected_rows > 0: + logger.info("Acquired lease %s on session %s", lease_id, session_id) + return True + else: + logger.debug("Failed to acquire lease on session %s", session_id) + return False + + async def release_lease(self, *, session_id: str, lease_id: str) -> None: + """Release a lease on a session.""" + now = datetime.now(timezone.utc) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET + active_lease_id = NULL, + lease_expiry = NULL, + updated_at = @updated_at + WHERE session_id = @session_id AND active_lease_id = @lease_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + client.query(query, job_config=job_config).result() + logger.info("Released lease %s on session %s", lease_id, session_id) + + async def renew_lease( + self, *, session_id: str, lease_id: str, timeout_seconds: int + ) -> bool: + """Renew an existing lease.""" + now = datetime.now(timezone.utc) + expiry = now + timedelta(seconds=timeout_seconds) + + query = f""" + UPDATE `{self._sessions_table_id}` + SET lease_expiry = @lease_expiry, updated_at = @updated_at + WHERE session_id = @session_id AND active_lease_id = @lease_id + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + 
bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("lease_id", "STRING", lease_id), + bigquery.ScalarQueryParameter( + "lease_expiry", "TIMESTAMP", expiry.isoformat() + ), + bigquery.ScalarQueryParameter( + "updated_at", "TIMESTAMP", now.isoformat() + ), + ] + ) + result = client.query(query, job_config=job_config).result() + + if result.num_dml_affected_rows and result.num_dml_affected_rows > 0: + logger.debug("Renewed lease %s on session %s", lease_id, session_id) + return True + return False + + async def list_checkpoints( + self, *, session_id: str, limit: int = 10 + ) -> list[Checkpoint]: + """List checkpoints for a session.""" + query = f""" + SELECT + session_id, + checkpoint_seq, + created_at, + gcs_state_uri, + sha256, + size_bytes, + agent_state_json, + trigger + FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id + ORDER BY checkpoint_seq DESC + LIMIT @limit + """ + + client = self._get_bq_client() + from google.cloud import bigquery + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + bigquery.ScalarQueryParameter("limit", "INT64", limit), + ] + ) + results = client.query(query, job_config=job_config).result() + + checkpoints = [] + for row in results: + checkpoints.append( + Checkpoint( + session_id=row.session_id, + checkpoint_seq=row.checkpoint_seq, + created_at=row.created_at, + gcs_state_uri=row.gcs_state_uri, + sha256=row.sha256, + size_bytes=row.size_bytes, + agent_state=( + row.agent_state_json if isinstance(row.agent_state_json, dict) + else (json.loads(row.agent_state_json) if row.agent_state_json else None) + ), + trigger=row.trigger, + ) + ) + + return checkpoints + + async def delete_session(self, *, session_id: str) -> None: + """Delete a session and all its checkpoints.""" + # Delete checkpoints from GCS + checkpoints = await self.list_checkpoints(session_id=session_id, limit=1000) + storage_client = self._get_storage_client() + bucket = storage_client.bucket(self._gcs_bucket) + + for checkpoint in checkpoints: + blob_path = checkpoint.gcs_state_uri.replace( + f"gs://{self._gcs_bucket}/", "" + ) + blob = bucket.blob(blob_path) + try: + blob.delete() + except Exception as e: + logger.warning("Failed to delete blob %s: %s", blob_path, e) + + # Delete checkpoint records + client = self._get_bq_client() + from google.cloud import bigquery + + delete_checkpoints = f""" + DELETE FROM `{self._checkpoints_table_id}` + WHERE session_id = @session_id + """ + + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter("session_id", "STRING", session_id), + ] + ) + client.query(delete_checkpoints, job_config=job_config).result() + + # Delete session record + delete_session = f""" + DELETE FROM `{self._sessions_table_id}` + WHERE session_id = @session_id + """ + client.query(delete_session, job_config=job_config).result() + + logger.info( + "Deleted session %s and %d checkpoints", session_id, len(checkpoints) + ) diff --git a/src/google/adk/durable/workspace_snapshotter.py b/src/google/adk/durable/workspace_snapshotter.py new file mode 100644 index 0000000000..1462b883d7 --- /dev/null +++ b/src/google/adk/durable/workspace_snapshotter.py @@ -0,0 +1,187 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Workspace snapshot handling for durable sessions.""" + +from __future__ import annotations + +import hashlib +import io +import json +import logging +from pathlib import Path +import tarfile +from typing import Any +from typing import Dict +from typing import Optional + +logger = logging.getLogger("google_adk." + __name__) + + +class WorkspaceSnapshotter: + """Handles workspace file snapshots for durable checkpoints. + + This class provides utilities for creating and restoring snapshots of + workspace directories, enabling agents to persist and restore file-based + state across checkpoint boundaries. + + Example: + ```python + snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/agent_workspace") + + # Create a snapshot + blob, sha256, size = snapshotter.create_snapshot() + + # Later, restore from snapshot + snapshotter.restore_snapshot(blob) + ``` + """ + + def __init__( + self, + workspace_dir: Optional[str] = None, + exclude_patterns: Optional[list[str]] = None, + ): + """Initialize the workspace snapshotter. + + Args: + workspace_dir: Path to the workspace directory to snapshot. + exclude_patterns: List of glob patterns to exclude from snapshots. + """ + self._workspace_dir = Path(workspace_dir) if workspace_dir else None + self._exclude_patterns = exclude_patterns or [ + "__pycache__", + "*.pyc", + ".git", + ".env", + "node_modules", + "*.log", + ] + + @property + def workspace_dir(self) -> Optional[Path]: + """The workspace directory being snapshotted.""" + return self._workspace_dir + + def create_snapshot(self) -> tuple[bytes, str, int]: + """Create a tarball snapshot of the workspace directory. + + Returns: + A tuple of (blob_bytes, sha256_hash, size_bytes). + + Raises: + ValueError: If no workspace directory is configured. + FileNotFoundError: If the workspace directory doesn't exist. + """ + if not self._workspace_dir: + raise ValueError("No workspace directory configured") + + if not self._workspace_dir.exists(): + raise FileNotFoundError( + f"Workspace directory not found: {self._workspace_dir}" + ) + + buffer = io.BytesIO() + with tarfile.open(fileobj=buffer, mode="w:gz") as tar: + for path in self._workspace_dir.rglob("*"): + if path.is_file() and not self._should_exclude(path): + arcname = path.relative_to(self._workspace_dir) + tar.add(path, arcname=str(arcname)) + + blob = buffer.getvalue() + sha256 = hashlib.sha256(blob).hexdigest() + + logger.debug( + "Created workspace snapshot: %d bytes, sha256=%s", len(blob), sha256 + ) + + return blob, sha256, len(blob) + + def restore_snapshot(self, blob: bytes) -> None: + """Restore a workspace from a tarball snapshot. + + Args: + blob: The snapshot blob previously created by create_snapshot(). + + Raises: + ValueError: If no workspace directory is configured. 
+ """ + if not self._workspace_dir: + raise ValueError("No workspace directory configured") + + self._workspace_dir.mkdir(parents=True, exist_ok=True) + + buffer = io.BytesIO(blob) + with tarfile.open(fileobj=buffer, mode="r:gz") as tar: + # Filter to prevent path traversal attacks + safe_members = [ + m for m in tar.getmembers() if not m.name.startswith(("/", "..")) + ] + tar.extractall(path=self._workspace_dir, members=safe_members) + + logger.debug( + "Restored workspace snapshot: %d bytes to %s", + len(blob), + self._workspace_dir, + ) + + def _should_exclude(self, path: Path) -> bool: + """Check if a path should be excluded from snapshots.""" + path_str = str(path) + for pattern in self._exclude_patterns: + if pattern.startswith("*"): + # Suffix match (e.g., *.pyc) + if path_str.endswith(pattern[1:]): + return True + elif pattern in path_str: + # Contains match (e.g., __pycache__) + return True + return False + + +def serialize_state_to_json(state: Dict[str, Any]) -> bytes: + """Serialize state dictionary to JSON bytes. + + Args: + state: The state dictionary to serialize. + + Returns: + JSON-encoded bytes. + """ + return json.dumps(state, sort_keys=True, default=str).encode("utf-8") + + +def deserialize_state_from_json(blob: bytes) -> Dict[str, Any]: + """Deserialize state from JSON bytes. + + Args: + blob: JSON-encoded bytes. + + Returns: + The deserialized state dictionary. + """ + return json.loads(blob.decode("utf-8")) + + +def compute_state_hash(state: Dict[str, Any]) -> str: + """Compute a SHA-256 hash of the state dictionary. + + Args: + state: The state dictionary to hash. + + Returns: + The hex-encoded SHA-256 hash. + """ + blob = serialize_state_to_json(state) + return hashlib.sha256(blob).hexdigest() diff --git a/tests/unittests/durable/__init__.py b/tests/unittests/durable/__init__.py new file mode 100644 index 0000000000..58d482ea38 --- /dev/null +++ b/tests/unittests/durable/__init__.py @@ -0,0 +1,13 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/tests/unittests/durable/test_bigquery_checkpoint_store.py b/tests/unittests/durable/test_bigquery_checkpoint_store.py new file mode 100644 index 0000000000..f792c6b4b0 --- /dev/null +++ b/tests/unittests/durable/test_bigquery_checkpoint_store.py @@ -0,0 +1,273 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Tests for BigQueryCheckpointStore.""" + +from datetime import datetime +from datetime import timezone +from unittest.mock import AsyncMock +from unittest.mock import MagicMock +from unittest.mock import patch + +from google.adk.durable.stores.bigquery_checkpoint_store import BigQueryCheckpointStore +import pytest + + +class TestBigQueryCheckpointStore: + """Tests for BigQueryCheckpointStore.""" + + @pytest.fixture + def store(self): + """Create a store instance for testing.""" + return BigQueryCheckpointStore( + project="test-project", + dataset="test_dataset", + gcs_bucket="test-bucket", + ) + + def test_init(self, store): + """Test store initialization.""" + assert store._project == "test-project" + assert store._dataset == "test_dataset" + assert store._gcs_bucket == "test-bucket" + assert store._sessions_table == "sessions" + assert store._checkpoints_table == "checkpoints" + assert store._location == "US" + + def test_table_ids(self, store): + """Test table ID generation.""" + assert store._sessions_table_id == "test-project.test_dataset.sessions" + assert ( + store._checkpoints_table_id == "test-project.test_dataset.checkpoints" + ) + + def test_gcs_uri_generation(self, store): + """Test GCS URI generation.""" + uri = store._get_gcs_uri("session-123", 5) + assert uri == "gs://test-bucket/checkpoints/session-123/5.json.gz" + + @pytest.mark.asyncio + async def test_create_session(self, store): + """Test session creation.""" + mock_client = MagicMock() + mock_client.insert_rows_json.return_value = [] + + with patch.object(store, "_get_bq_client", return_value=mock_client): + with patch.object( + store, "get_session", new_callable=AsyncMock + ) as mock_get: + mock_get.return_value = None + + session = await store.create_session( + session_id="test-session", + agent_name="test_agent", + metadata={"key": "value"}, + ) + + assert session.session_id == "test-session" + assert session.agent_name == "test_agent" + assert session.status == "active" + assert session.current_checkpoint_seq == 0 + assert session.metadata == {"key": "value"} + + mock_client.insert_rows_json.assert_called_once() + + @pytest.mark.asyncio + async def test_create_session_already_exists(self, store): + """Test session creation when session already exists.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + from google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=0, + ) + + with pytest.raises(ValueError, match="already exists"): + await store.create_session( + session_id="test-session", + agent_name="test_agent", + ) + + @pytest.mark.asyncio + async def test_write_checkpoint(self, store): + """Test checkpoint writing.""" + mock_bq_client = MagicMock() + mock_bq_client.insert_rows_json.return_value = [] + mock_bq_client.query.return_value.result.return_value = None + + mock_storage_client = MagicMock() + mock_bucket = MagicMock() + mock_blob = MagicMock() + mock_storage_client.bucket.return_value = mock_bucket + mock_bucket.blob.return_value = mock_blob + + with patch.object(store, "_get_bq_client", return_value=mock_bq_client): + with patch.object( + store, "_get_storage_client", return_value=mock_storage_client + ): + with patch.object( + store, "get_session", new_callable=AsyncMock + ) as mock_get: + from 
google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=0, + ) + + checkpoint = await store.write_checkpoint( + session_id="test-session", + checkpoint_seq=1, + state_blob=b'{"state": "data"}', + agent_state={"key": "value"}, + trigger="async_boundary", + ) + + assert checkpoint.session_id == "test-session" + assert checkpoint.checkpoint_seq == 1 + assert checkpoint.trigger == "async_boundary" + assert checkpoint.agent_state == {"key": "value"} + + # Verify GCS upload was called + mock_blob.upload_from_string.assert_called_once() + + # Verify BQ insert was called + mock_bq_client.insert_rows_json.assert_called_once() + + @pytest.mark.asyncio + async def test_write_checkpoint_invalid_seq(self, store): + """Test checkpoint writing with invalid sequence number.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + from google.adk.durable.stores.base_checkpoint_store import SessionMetadata + + mock_get.return_value = SessionMetadata( + session_id="test-session", + status="active", + agent_name="test_agent", + created_at=datetime.now(timezone.utc), + updated_at=datetime.now(timezone.utc), + current_checkpoint_seq=5, + ) + + with pytest.raises(ValueError, match="must be greater"): + await store.write_checkpoint( + session_id="test-session", + checkpoint_seq=3, # Less than current (5) + state_blob=b"data", + ) + + @pytest.mark.asyncio + async def test_write_checkpoint_session_not_found(self, store): + """Test checkpoint writing when session doesn't exist.""" + with patch.object(store, "get_session", new_callable=AsyncMock) as mock_get: + mock_get.return_value = None + + with pytest.raises(ValueError, match="not found"): + await store.write_checkpoint( + session_id="nonexistent", + checkpoint_seq=1, + state_blob=b"data", + ) + + @pytest.mark.asyncio + async def test_acquire_lease_success(self, store): + """Test successful lease acquisition.""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 1 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.acquire_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=300, + ) + + assert result is True + mock_client.query.assert_called_once() + + @pytest.mark.asyncio + async def test_acquire_lease_failure(self, store): + """Test failed lease acquisition (another lease active).""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 0 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.acquire_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=300, + ) + + assert result is False + + @pytest.mark.asyncio + async def test_release_lease(self, store): + """Test lease release.""" + mock_client = MagicMock() + mock_client.query.return_value.result.return_value = None + + with patch.object(store, "_get_bq_client", return_value=mock_client): + await store.release_lease( + session_id="test-session", + lease_id="lease-123", + ) + + mock_client.query.assert_called_once() + + @pytest.mark.asyncio + async def test_renew_lease_success(self, store): + 
"""Test successful lease renewal.""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 1 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.renew_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=600, + ) + + assert result is True + + @pytest.mark.asyncio + async def test_renew_lease_failure(self, store): + """Test failed lease renewal (lease not held).""" + mock_client = MagicMock() + mock_result = MagicMock() + mock_result.num_dml_affected_rows = 0 + mock_client.query.return_value.result.return_value = mock_result + + with patch.object(store, "_get_bq_client", return_value=mock_client): + result = await store.renew_lease( + session_id="test-session", + lease_id="lease-123", + timeout_seconds=600, + ) + + assert result is False diff --git a/tests/unittests/durable/test_checkpointable_state.py b/tests/unittests/durable/test_checkpointable_state.py new file mode 100644 index 0000000000..9c9ab0753c --- /dev/null +++ b/tests/unittests/durable/test_checkpointable_state.py @@ -0,0 +1,172 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for CheckpointableAgentState.""" + +from typing import Any +from typing import Dict + +from google.adk.durable.checkpointable_state import CheckpointableAgentState +from google.adk.durable.checkpointable_state import SimpleCheckpointableState +import pytest + + +class TestSimpleCheckpointableState: + """Tests for SimpleCheckpointableState.""" + + def test_default_state(self): + """Test default state initialization.""" + state = SimpleCheckpointableState() + assert state.data == {} + + def test_state_with_data(self): + """Test state with initial data.""" + state = SimpleCheckpointableState(data={"key": "value", "count": 5}) + assert state.data["key"] == "value" + assert state.data["count"] == 5 + + def test_to_checkpoint_dict(self): + """Test serialization to checkpoint dict.""" + state = SimpleCheckpointableState(data={"items": [1, 2, 3], "name": "test"}) + checkpoint = state.to_checkpoint_dict() + + assert checkpoint == {"data": {"items": [1, 2, 3], "name": "test"}} + + def test_from_checkpoint_dict(self): + """Test deserialization from checkpoint dict.""" + checkpoint = {"data": {"counter": 10, "results": ["a", "b"]}} + state = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + + assert state.data["counter"] == 10 + assert state.data["results"] == ["a", "b"] + + def test_roundtrip(self): + """Test roundtrip serialization/deserialization.""" + original = SimpleCheckpointableState( + data={ + "nested": {"deep": {"value": 42}}, + "list": [1, 2, 3], + "string": "hello", + } + ) + + checkpoint = original.to_checkpoint_dict() + restored = SimpleCheckpointableState.from_checkpoint_dict(checkpoint) + + assert restored.data == original.data + + def test_empty_checkpoint_dict(self): + """Test deserialization from empty checkpoint dict.""" + 
state = SimpleCheckpointableState.from_checkpoint_dict({}) + assert state.data == {} + + +class CustomState(CheckpointableAgentState): + """Custom state implementation for testing.""" + + counter: int = 0 + items: list[str] = [] + metadata: dict[str, Any] = {} + + def __init__(self, **data): + super().__init__(**data) + if "items" not in data: + self.items = [] + if "metadata" not in data: + self.metadata = {} + + def to_checkpoint_dict(self) -> Dict[str, Any]: + return { + "counter": self.counter, + "items": self.items.copy(), + "metadata": self.metadata.copy(), + } + + @classmethod + def from_checkpoint_dict(cls, data: Dict[str, Any]) -> "CustomState": + return cls( + counter=data.get("counter", 0), + items=data.get("items", []), + metadata=data.get("metadata", {}), + ) + + +class TestCustomCheckpointableState: + """Tests for custom CheckpointableAgentState implementations.""" + + def test_custom_state_default(self): + """Test custom state with default values.""" + state = CustomState() + assert state.counter == 0 + assert state.items == [] + assert state.metadata == {} + + def test_custom_state_with_values(self): + """Test custom state with initial values.""" + state = CustomState( + counter=5, + items=["a", "b"], + metadata={"key": "value"}, + ) + assert state.counter == 5 + assert state.items == ["a", "b"] + assert state.metadata == {"key": "value"} + + def test_custom_state_to_checkpoint(self): + """Test custom state serialization.""" + state = CustomState(counter=10, items=["x", "y", "z"]) + checkpoint = state.to_checkpoint_dict() + + assert checkpoint["counter"] == 10 + assert checkpoint["items"] == ["x", "y", "z"] + assert checkpoint["metadata"] == {} + + def test_custom_state_from_checkpoint(self): + """Test custom state deserialization.""" + checkpoint = { + "counter": 42, + "items": ["item1", "item2"], + "metadata": {"created_by": "test"}, + } + state = CustomState.from_checkpoint_dict(checkpoint) + + assert state.counter == 42 + assert state.items == ["item1", "item2"] + assert state.metadata == {"created_by": "test"} + + def test_custom_state_roundtrip(self): + """Test custom state roundtrip.""" + original = CustomState( + counter=100, + items=["first", "second", "third"], + metadata={"version": 1, "tags": ["test", "demo"]}, + ) + + checkpoint = original.to_checkpoint_dict() + restored = CustomState.from_checkpoint_dict(checkpoint) + + assert restored.counter == original.counter + assert restored.items == original.items + assert restored.metadata == original.metadata + + def test_custom_state_isolation(self): + """Test that checkpoint data is isolated from original.""" + state = CustomState(items=["a", "b"]) + checkpoint = state.to_checkpoint_dict() + + # Modify checkpoint + checkpoint["items"].append("c") + + # Original should be unchanged + assert state.items == ["a", "b"] diff --git a/tests/unittests/durable/test_config.py b/tests/unittests/durable/test_config.py new file mode 100644 index 0000000000..cf47e2107e --- /dev/null +++ b/tests/unittests/durable/test_config.py @@ -0,0 +1,104 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for DurableSessionConfig.""" + +from google.adk.durable.config import DurableSessionConfig +from pydantic import ValidationError +import pytest + + +class TestDurableSessionConfig: + """Tests for DurableSessionConfig model.""" + + def test_default_config(self): + """Test default configuration values.""" + config = DurableSessionConfig() + + assert config.is_durable is False + assert config.checkpoint_policy == "async_boundary" + assert config.checkpoint_store is None + assert config.lease_timeout_seconds == 300 + assert config.max_checkpoint_size_bytes == 10 * 1024 * 1024 + + def test_enabled_config(self): + """Test enabled configuration.""" + config = DurableSessionConfig( + is_durable=True, + checkpoint_policy="every_turn", + lease_timeout_seconds=600, + ) + + assert config.is_durable is True + assert config.checkpoint_policy == "every_turn" + assert config.lease_timeout_seconds == 600 + + def test_checkpoint_policies(self): + """Test valid checkpoint policies.""" + for policy in ["async_boundary", "every_turn", "manual"]: + config = DurableSessionConfig(checkpoint_policy=policy) + assert config.checkpoint_policy == policy + + def test_invalid_checkpoint_policy(self): + """Test that invalid checkpoint policies raise validation error.""" + with pytest.raises(ValidationError): + DurableSessionConfig(checkpoint_policy="invalid_policy") + + def test_lease_timeout_bounds(self): + """Test lease timeout validation bounds.""" + # Valid minimum + config = DurableSessionConfig(lease_timeout_seconds=60) + assert config.lease_timeout_seconds == 60 + + # Valid maximum + config = DurableSessionConfig(lease_timeout_seconds=3600) + assert config.lease_timeout_seconds == 3600 + + # Below minimum + with pytest.raises(ValidationError): + DurableSessionConfig(lease_timeout_seconds=59) + + # Above maximum + with pytest.raises(ValidationError): + DurableSessionConfig(lease_timeout_seconds=3601) + + def test_max_checkpoint_size_bounds(self): + """Test max checkpoint size validation.""" + # Valid minimum + config = DurableSessionConfig(max_checkpoint_size_bytes=1024) + assert config.max_checkpoint_size_bytes == 1024 + + # Below minimum + with pytest.raises(ValidationError): + DurableSessionConfig(max_checkpoint_size_bytes=1023) + + def test_extra_fields_forbidden(self): + """Test that extra fields are not allowed.""" + with pytest.raises(ValidationError): + DurableSessionConfig(unknown_field="value") + + def test_config_serialization(self): + """Test config can be serialized to dict.""" + config = DurableSessionConfig( + is_durable=True, + checkpoint_policy="every_turn", + lease_timeout_seconds=120, + ) + + data = config.model_dump() + + assert data["is_durable"] is True + assert data["checkpoint_policy"] == "every_turn" + assert data["lease_timeout_seconds"] == 120 + assert data["checkpoint_store"] is None diff --git a/tests/unittests/durable/test_workspace_snapshotter.py b/tests/unittests/durable/test_workspace_snapshotter.py new file mode 100644 index 0000000000..4e381f1022 --- /dev/null +++ b/tests/unittests/durable/test_workspace_snapshotter.py @@ -0,0 +1,246 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for WorkspaceSnapshotter and utilities.""" + +import os +import tempfile + +from google.adk.durable.workspace_snapshotter import compute_state_hash +from google.adk.durable.workspace_snapshotter import deserialize_state_from_json +from google.adk.durable.workspace_snapshotter import serialize_state_to_json +from google.adk.durable.workspace_snapshotter import WorkspaceSnapshotter +import pytest + + +class TestSerializationUtilities: + """Tests for serialization utility functions.""" + + def test_serialize_simple_dict(self): + """Test serialization of simple dictionary.""" + state = {"key": "value", "number": 42} + blob = serialize_state_to_json(state) + + assert isinstance(blob, bytes) + assert b"key" in blob + assert b"value" in blob + assert b"42" in blob + + def test_deserialize_simple_dict(self): + """Test deserialization of simple dictionary.""" + blob = b'{"key": "value", "number": 42}' + state = deserialize_state_from_json(blob) + + assert state == {"key": "value", "number": 42} + + def test_roundtrip_serialization(self): + """Test roundtrip serialization/deserialization.""" + original = { + "string": "hello", + "number": 123, + "float": 3.14, + "bool": True, + "null": None, + "list": [1, 2, 3], + "nested": {"a": {"b": "c"}}, + } + + blob = serialize_state_to_json(original) + restored = deserialize_state_from_json(blob) + + assert restored == original + + def test_serialize_deterministic(self): + """Test that serialization is deterministic (sorted keys).""" + state1 = {"z": 1, "a": 2, "m": 3} + state2 = {"a": 2, "m": 3, "z": 1} + + blob1 = serialize_state_to_json(state1) + blob2 = serialize_state_to_json(state2) + + assert blob1 == blob2 + + def test_compute_state_hash(self): + """Test state hash computation.""" + state = {"key": "value"} + hash1 = compute_state_hash(state) + + assert isinstance(hash1, str) + assert len(hash1) == 64 # SHA-256 produces 64 hex characters + + def test_hash_deterministic(self): + """Test that hash is deterministic.""" + state1 = {"z": 1, "a": 2} + state2 = {"a": 2, "z": 1} + + assert compute_state_hash(state1) == compute_state_hash(state2) + + def test_hash_changes_with_content(self): + """Test that hash changes with content.""" + hash1 = compute_state_hash({"key": "value1"}) + hash2 = compute_state_hash({"key": "value2"}) + + assert hash1 != hash2 + + +class TestWorkspaceSnapshotter: + """Tests for WorkspaceSnapshotter.""" + + def test_init_default(self): + """Test default initialization.""" + snapshotter = WorkspaceSnapshotter() + + assert snapshotter.workspace_dir is None + assert "__pycache__" in snapshotter._exclude_patterns + + def test_init_with_workspace(self): + """Test initialization with workspace directory.""" + snapshotter = WorkspaceSnapshotter(workspace_dir="/tmp/workspace") + + assert str(snapshotter.workspace_dir) == "/tmp/workspace" + + def test_init_with_custom_excludes(self): + """Test initialization with custom exclude patterns.""" + snapshotter = WorkspaceSnapshotter( + workspace_dir="/tmp/workspace", + exclude_patterns=["*.log", "temp/"], + ) + + assert snapshotter._exclude_patterns == ["*.log", "temp/"] + + 
def test_should_exclude_pycache(self): + """Test exclusion of __pycache__ directories.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude( + Path("/some/path/__pycache__/module.pyc") + ) + + def test_should_exclude_pyc_files(self): + """Test exclusion of .pyc files.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/module.pyc")) + + def test_should_not_exclude_py_files(self): + """Test that .py files are not excluded.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert not snapshotter._should_exclude(Path("/some/path/module.py")) + + def test_should_exclude_git(self): + """Test exclusion of .git directories.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/.git/config")) + + def test_should_exclude_env(self): + """Test exclusion of .env files.""" + snapshotter = WorkspaceSnapshotter() + from pathlib import Path + + assert snapshotter._should_exclude(Path("/some/path/.env")) + + def test_create_snapshot_no_workspace(self): + """Test that create_snapshot fails without workspace.""" + snapshotter = WorkspaceSnapshotter() + + with pytest.raises(ValueError, match="No workspace directory"): + snapshotter.create_snapshot() + + def test_create_snapshot_missing_directory(self): + """Test that create_snapshot fails with missing directory.""" + snapshotter = WorkspaceSnapshotter(workspace_dir="/nonexistent/path") + + with pytest.raises(FileNotFoundError): + snapshotter.create_snapshot() + + def test_restore_snapshot_no_workspace(self): + """Test that restore_snapshot fails without workspace.""" + snapshotter = WorkspaceSnapshotter() + + with pytest.raises(ValueError, match="No workspace directory"): + snapshotter.restore_snapshot(b"data") + + def test_create_and_restore_snapshot(self): + """Test creating and restoring a workspace snapshot.""" + with tempfile.TemporaryDirectory() as tmpdir: + # Create source workspace with files + source_dir = os.path.join(tmpdir, "source") + os.makedirs(source_dir) + + # Create test files + with open(os.path.join(source_dir, "file1.txt"), "w") as f: + f.write("content1") + with open(os.path.join(source_dir, "file2.py"), "w") as f: + f.write("print('hello')") + + # Create subdirectory + subdir = os.path.join(source_dir, "subdir") + os.makedirs(subdir) + with open(os.path.join(subdir, "nested.txt"), "w") as f: + f.write("nested content") + + # Create snapshot + snapshotter = WorkspaceSnapshotter(workspace_dir=source_dir) + blob, sha256, size = snapshotter.create_snapshot() + + assert isinstance(blob, bytes) + assert len(sha256) == 64 + assert size > 0 + + # Restore to different location + dest_dir = os.path.join(tmpdir, "dest") + restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest_dir) + restore_snapshotter.restore_snapshot(blob) + + # Verify files were restored + assert os.path.exists(os.path.join(dest_dir, "file1.txt")) + assert os.path.exists(os.path.join(dest_dir, "file2.py")) + assert os.path.exists(os.path.join(dest_dir, "subdir", "nested.txt")) + + # Verify content + with open(os.path.join(dest_dir, "file1.txt")) as f: + assert f.read() == "content1" + + def test_snapshot_excludes_pycache(self): + """Test that snapshots exclude __pycache__ directories.""" + with tempfile.TemporaryDirectory() as tmpdir: + # Create workspace with __pycache__ + workspace = os.path.join(tmpdir, "workspace") + os.makedirs(workspace) + + with 
open(os.path.join(workspace, "main.py"), "w") as f: + f.write("print('main')") + + pycache = os.path.join(workspace, "__pycache__") + os.makedirs(pycache) + with open(os.path.join(pycache, "main.cpython-311.pyc"), "wb") as f: + f.write(b"\x00\x00\x00\x00") + + # Create snapshot + snapshotter = WorkspaceSnapshotter(workspace_dir=workspace) + blob, _, _ = snapshotter.create_snapshot() + + # Restore and verify __pycache__ was excluded + dest = os.path.join(tmpdir, "dest") + restore_snapshotter = WorkspaceSnapshotter(workspace_dir=dest) + restore_snapshotter.restore_snapshot(blob) + + assert os.path.exists(os.path.join(dest, "main.py")) + assert not os.path.exists(os.path.join(dest, "__pycache__"))
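+
+
+# Illustrative sketch: a minimal end-to-end check that combines
+# WorkspaceSnapshotter with the JSON state helpers above. It relies only on
+# the public names already imported at the top of this module; the file name
+# and state keys used here are arbitrary example values.
+def test_state_hash_stable_across_snapshot_roundtrip():
+  """Snapshot a workspace, restore it, and hash a state dict referencing it."""
+  with tempfile.TemporaryDirectory() as tmpdir:
+    source = os.path.join(tmpdir, "source")
+    os.makedirs(source)
+    with open(os.path.join(source, "notes.txt"), "w") as f:
+      f.write("hello")
+
+    # Snapshot the source workspace, then restore it into a fresh directory.
+    blob, sha256, size = WorkspaceSnapshotter(
+        workspace_dir=source
+    ).create_snapshot()
+    dest = os.path.join(tmpdir, "dest")
+    WorkspaceSnapshotter(workspace_dir=dest).restore_snapshot(blob)
+    assert os.path.exists(os.path.join(dest, "notes.txt"))
+    assert size == len(blob)
+
+    # A state dict referencing the snapshot hashes deterministically and
+    # survives a JSON serialize/deserialize round trip unchanged.
+    state = {"workspace_sha256": sha256, "files": ["notes.txt"]}
+    restored = deserialize_state_from_json(serialize_state_to_json(state))
+    assert restored == state
+    assert compute_state_hash(restored) == compute_state_hash(state)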