Commit 99e7726

caohy1988 and claude committed
feat(durable): Add durable session persistence layer for long-horizon agents
This PR implements a durable session persistence layer for ADK, enabling cross-process checkpoint-based recovery for long-running agent tasks.

## Key Features

- **DurableSessionConfig**: Configuration for durable cross-process checkpointing
- **BigQueryCheckpointStore**: Two-phase commit checkpoint storage (BQ metadata + GCS blobs)
- **CheckpointableAgentState**: Abstract interface for agents supporting durability
- **WorkspaceSnapshotter**: GCS-based workspace directory snapshotting

## Implementation Details

- Two-phase commit: GCS blob upload → BigQuery metadata insert
- SHA-256 checkpoint integrity verification
- Lease-based concurrency control for safe resume
- Async-first API design for non-blocking I/O

## Demo

A fully functional demo is deployed on Cloud Run showcasing:

- Real-time checkpoint visualization
- Task failure simulation and recovery
- BigQuery metadata queries
- Final task output display

Demo URL: https://durable-demo-201486563047.us-central1.run.app

## Files Added

- src/google/adk/durable/ - Core durable module
- contributing/samples/long_running_task/ - Demo agent and UI
- tests/unittests/durable/ - Unit tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 2770012 commit 99e7726

23 files changed: +6239 -0 lines changed
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
# Durable Session Demo

This demo showcases the durable session persistence feature in ADK, which
enables checkpoint-based durability for long-running agent invocations.

## Overview

Durable sessions provide:
- **Checkpoint persistence**: Checkpoint metadata is saved to BigQuery and state blobs to GCS
- **Failure recovery**: Resume from the last checkpoint after a crash
- **Host migration**: Resume a session on a different host or process
- **Lease management**: Prevent concurrent modifications to the same session
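
The accompanying commit message describes the write path as a two-phase commit (GCS blob upload, then BigQuery metadata insert) with SHA-256 integrity verification. The sketch below illustrates only that ordering and the digest check. It uses in-memory dictionaries as stand-ins for GCS and BigQuery, and the names `write_checkpoint`, `read_checkpoint`, `blob_store`, and `metadata_rows` are hypothetical; they are not part of the ADK API.

```python
import hashlib
import json

# Hypothetical in-memory stand-ins for the GCS bucket and the BigQuery table.
blob_store: dict[str, bytes] = {}
metadata_rows: list[dict] = []


def write_checkpoint(session_id: str, seq: int, state: dict) -> dict:
    """Phase 1: store the blob. Phase 2: record the metadata row."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    blob_path = f"checkpoints/{session_id}/{seq}.json"
    blob_store[blob_path] = payload  # phase 1: blob first
    row = {
        "session_id": session_id,
        "seq": seq,
        "blob_path": blob_path,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    metadata_rows.append(row)  # phase 2: the row makes the checkpoint visible
    return row


def read_checkpoint(row: dict) -> dict:
    """Load a blob and reject it if its digest does not match the metadata."""
    payload = blob_store[row["blob_path"]]
    if hashlib.sha256(payload).hexdigest() != row["sha256"]:
        raise ValueError(f"checkpoint {row['blob_path']} failed integrity check")
    return json.loads(payload)
```

Because the metadata row is written last in this sketch, a crash between the two phases can leave an orphaned blob, but never a metadata row that points at a missing blob.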

## Prerequisites

1. **Google Cloud Project** with billing enabled
2. **APIs enabled**:
   - BigQuery API
   - Cloud Storage API
   - Vertex AI API (for Gemini models)
3. **IAM permissions**:
   - `roles/bigquery.dataEditor`
   - `roles/storage.objectAdmin`
   - `roles/aiplatform.user`

## Setup

### 1. Configure your environment

```bash
# Set your project
export PROJECT_ID="test-project-0728-467323"
gcloud config set project "$PROJECT_ID"

# Set your Google Cloud API key (required for Gemini 3)
export GOOGLE_CLOUD_API_KEY="your-api-key-here"

# Authenticate
gcloud auth application-default login
```

### 2. Create BigQuery and GCS resources

```bash
# Run the setup script
python contributing/samples/long_running_task/setup.py

# To verify setup
python contributing/samples/long_running_task/setup.py --verify

# To clean up resources
python contributing/samples/long_running_task/setup.py --cleanup
```

### 3. Run the demo

```bash
adk web contributing/samples/long_running_task
```

## Demo Scenarios

### Scenario 1: Long-running table scan

```
User: Scan the bigquery-public-data.samples.shakespeare table

Agent: [Calls simulate_long_running_scan]
       [Checkpoint written at async boundary]
       [Scan completes after ~5-10 seconds]
       The scan found 164,656 rows with the following findings:
       - Found 5 instances of 'to be or not to be'
       - Most common word: 'the' (27,801 occurrences)
       - Unique words: 29,066
```

### Scenario 2: Multi-stage pipeline

```
User: Run a pipeline from source_table to dest_table with transformations:
      filter, aggregate, join

Agent: [Calls run_data_pipeline]
       [Checkpoint written at each stage boundary]
       Pipeline completed successfully:
       - Stage 1 (filter): 45,000 rows processed
       - Stage 2 (aggregate): 32,000 rows processed
       - Stage 3 (join): 28,000 rows processed
```

### Scenario 3: Failure recovery

1. Start a long-running scan
2. Kill the process mid-execution
3. Restart and resume with the `invocation_id`
4. Agent continues from the last checkpoint
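
The recovery steps above amount to "find the newest checkpoint for this `invocation_id` and skip the work it already covers". A minimal sketch of that selection logic follows; the `Checkpoint` dataclass, its `seq` and `completed_stages` fields, and `remaining_stages` are hypothetical names chosen for illustration and are not taken from the ADK code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    invocation_id: str
    seq: int  # assumed to increase with each checkpoint of an invocation
    completed_stages: list[str]


def latest_checkpoint(checkpoints, invocation_id):
    """Return the highest-sequence checkpoint for the invocation, or None."""
    matching = [c for c in checkpoints if c.invocation_id == invocation_id]
    return max(matching, key=lambda c: c.seq, default=None)


def remaining_stages(checkpoints, invocation_id, all_stages):
    """Return the stages that still need to run after a restart."""
    cp = latest_checkpoint(checkpoints, invocation_id)
    done = set(cp.completed_stages) if cp else set()
    return [stage for stage in all_stages if stage not in done]


history = [
    Checkpoint("inv-1", 1, ["filter"]),
    Checkpoint("inv-1", 2, ["filter", "aggregate"]),
]
print(remaining_stages(history, "inv-1", ["filter", "aggregate", "join"]))
# prints ['join']
```

With no checkpoint on record for an invocation, the sketch falls back to running every stage.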

## Architecture

```
               +-----------------+
               |      Agent      |
               |   (LlmAgent)    |
               +--------+--------+
                        |
                        v
               +-----------------+
               |     Runner      |
               |(with durability)|
               +--------+--------+
                        |
       +----------------+----------------+
       |                                 |
       v                                 v
+--------------+                 +----------------+
|   BigQuery   |                 |      GCS       |
|  (metadata)  |                 | (state blobs)  |
+--------------+                 +----------------+
| - sessions   |                 | - checkpoints/ |
| - checkpoints|                 |   {session_id}/|
+--------------+                 +----------------+
```

## Configuration

The agent is configured in `agent.py`:

```python
app = App(
    name="durable_session_demo",
    root_agent=root_agent,
    resumability_config=ResumabilityConfig(is_resumable=True),
    durable_session_config=DurableSessionConfig(
        is_durable=True,
        checkpoint_policy="async_boundary",
        checkpoint_store=BigQueryCheckpointStore(
            project=PROJECT_ID,
            dataset=DATASET,
            gcs_bucket=GCS_BUCKET,
        ),
        lease_timeout_seconds=300,
    ),
)
```
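
The `lease_timeout_seconds=300` setting presumably bounds how long one process can hold a session before another process may take it over. The sketch below shows one way lease-based concurrency control could behave. `LeaseTable`, `try_acquire`, `release`, and the injectable clock are hypothetical and do not reflect the actual ADK implementation.

```python
import time


class LeaseTable:
    """In-memory lease registry: one holder per session until the lease expires."""

    def __init__(self, timeout_seconds: float = 300, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        # session_id -> (holder, expiry timestamp)
        self._leases: dict[str, tuple[str, float]] = {}

    def try_acquire(self, session_id: str, holder: str) -> bool:
        """Take or renew the lease; fail if another holder's lease is still live."""
        now = self._clock()
        current = self._leases.get(session_id)
        if current is not None:
            owner, expires_at = current
            if owner != holder and now < expires_at:
                return False
        self._leases[session_id] = (holder, now + self._timeout)
        return True

    def release(self, session_id: str, holder: str) -> None:
        """Drop the lease, but only if the caller is the current holder."""
        current = self._leases.get(session_id)
        if current is not None and current[0] == holder:
            del self._leases[session_id]
```

In a shared store the check and the write in `try_acquire` would have to be a single atomic operation; the in-memory version skips that concern.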

### Checkpoint Policies

- `async_boundary`: Checkpoint when hitting async/long-running operations
- `every_turn`: Checkpoint after every agent turn
- `manual`: Only checkpoint when explicitly requested
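
The three policies can be read as a small decision function. The sketch below is one interpretation, not the ADK implementation: `CheckpointPolicy` and `should_checkpoint` are hypothetical names, and treating an explicit request as valid under every policy is an assumption.

```python
from enum import Enum


class CheckpointPolicy(str, Enum):
    ASYNC_BOUNDARY = "async_boundary"
    EVERY_TURN = "every_turn"
    MANUAL = "manual"


def should_checkpoint(policy, *, turn_ended=False, at_async_boundary=False,
                      requested=False):
    """Decide whether to write a checkpoint for the current event."""
    policy = CheckpointPolicy(policy)  # accepts the string form used in config
    if requested:
        # Assumption: an explicit request is honored under every policy.
        return True
    if policy is CheckpointPolicy.EVERY_TURN:
        return turn_ended
    if policy is CheckpointPolicy.ASYNC_BOUNDARY:
        return at_async_boundary
    return False  # MANUAL: nothing happens without an explicit request
```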

## Monitoring

The commands below use the demo project ID `test-project-0728-467323`; substitute your own project ID.

### View sessions

```sql
SELECT * FROM `test-project-0728-467323.adk_metadata.sessions`
ORDER BY updated_at DESC
LIMIT 10;
```

### View checkpoints

```sql
SELECT * FROM `test-project-0728-467323.adk_metadata.checkpoints`
ORDER BY created_at DESC
LIMIT 10;
```

### List checkpoint blobs

```bash
gsutil ls -l gs://test-project-0728-467323-adk-checkpoints/checkpoints/
```

## Cleanup

To remove all resources created by this demo:

```bash
python contributing/samples/long_running_task/setup.py --cleanup
```
