Commit 6553c96

Merge pull request #68 from NREL/feat/recover
2 parents d491ee1 + 2926113

File tree

18 files changed: +2081 −685 lines

docs/src/explanation/automatic-recovery.md

Lines changed: 50 additions & 4 deletions
@@ -5,16 +5,52 @@ and when to use automatic vs manual recovery.
 ## Overview

-Torc provides **automatic failure recovery** through the `torc watch --recover` command. When jobs
-fail, the system:
+Torc provides **automatic failure recovery** through two commands:
+
+- **`torc recover`** - One-shot recovery for Slurm workflows
+- **`torc watch --recover`** - Continuous monitoring with automatic recovery
+
+When jobs fail, the system:

 1. Diagnoses the failure cause (OOM, timeout, or unknown)
 2. Applies heuristics to adjust resource requirements
 3. Resets failed jobs and submits new Slurm allocations
-4. Resumes monitoring until completion or max retries
+4. (watch only) Resumes monitoring until completion or max retries

 This deterministic approach handles the majority of HPC failures without human intervention.
+## The `torc recover` Command
+
+For one-shot recovery of a completed workflow with failures:
+
+```bash
+# Preview what would be done
+torc recover <workflow_id> --dry-run
+
+# Execute recovery
+torc recover <workflow_id>
+```
+
+This command:
+
+1. Checks preconditions (workflow complete, no active workers)
+2. Diagnoses failures using resource utilization data
+3. Applies recovery heuristics (increase memory/runtime)
+4. Runs optional recovery hook for custom logic
+5. Resets failed jobs and regenerates Slurm schedulers
+6. Submits new allocations
+
+### Recovery Options
+
+```bash
+# --memory-multiplier:  memory increase factor for OOM (default: 1.5)
+# --runtime-multiplier: runtime increase factor for timeout (default: 1.4)
+# --retry-unknown:      also retry jobs with unknown failure causes
+# --recovery-hook:      custom script for unknown failures
+# --dry-run:            preview without making changes
+torc recover <workflow_id> \
+  --memory-multiplier 1.5 \
+  --runtime-multiplier 1.4 \
+  --retry-unknown \
+  --recovery-hook "bash fix.sh" \
+  --dry-run
+```
+
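The multiplier heuristic behind steps 2–3 is simple arithmetic. The following Python sketch illustrates it under stated assumptions: the function name and job-dict fields are invented for illustration, and only the default multipliers (1.5 for memory, 1.4 for runtime) come from the documentation — this is not torc's implementation.

```python
import math

def adjust_resources(job, cause, memory_multiplier=1.5, runtime_multiplier=1.4):
    """Illustrative sketch of the recovery heuristic (not torc's code).

    job:   dict like {"memory_gb": 10, "runtime_min": 5}
    cause: "oom", "timeout", or "unknown"

    OOM failures get more memory, timeouts get more runtime, and unknown
    causes are left untouched (they are only retried with --retry-unknown).
    """
    adjusted = dict(job)
    if cause == "oom":
        # Round up so the value maps onto a whole-GB Slurm memory request.
        adjusted["memory_gb"] = math.ceil(job["memory_gb"] * memory_multiplier)
    elif cause == "timeout":
        adjusted["runtime_min"] = math.ceil(job["runtime_min"] * runtime_multiplier)
    return adjusted

print(adjust_resources({"memory_gb": 10, "runtime_min": 5}, "oom"))
# {'memory_gb': 15, 'runtime_min': 5}
```

With the defaults, a 10g OOM job becomes 15g and a 5-minute timeout becomes 7 minutes, matching the behavior the options above describe.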
 ## Design Principles

 ### Why Deterministic Recovery?
@@ -89,9 +125,19 @@ After adjusting resources, the system regenerates Slurm schedulers:
 This is handled by `torc slurm regenerate --submit`.

+## Choosing Between `recover` and `watch --recover`
+
+| Use Case                          | Command                  |
+| --------------------------------- | ------------------------ |
+| One-shot recovery after failure   | `torc recover`           |
+| Continuous monitoring             | `torc watch -r`          |
+| Preview what recovery would do    | `torc recover --dry-run` |
+| Production long-running workflows | `torc watch -r`          |
+| Manual investigation, then retry  | `torc recover`           |
+
 ## Configuration

-### Command-Line Options
+### Watch Command Options

 ```bash
 torc watch <workflow_id> \
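To make the division of labor in the table concrete, here is a minimal Python sketch of what a watch-with-recovery loop does conceptually. The `get_status` and `do_recover` callables are invented stand-ins for torc's server queries and recovery cycle; this is a sketch of the control flow, not torc's implementation.

```python
def watch_with_recovery(get_status, do_recover, max_recoveries=3):
    """Poll a workflow and trigger recovery on failure, up to a retry cap.

    get_status():  returns "running", "complete", or "failed"
    do_recover():  one recovery cycle (diagnose, adjust, reset, resubmit)
    """
    recoveries = 0
    while True:
        status = get_status()
        if status == "complete":
            return "complete"
        if status == "failed":
            if recoveries >= max_recoveries:
                return "failed"  # give up after max retries
            do_recover()
            recoveries += 1
        # A real watcher sleeps between polls instead of spinning.

# Simulated run: the workflow fails once, then completes after recovery.
statuses = iter(["running", "failed", "running", "complete"])
print(watch_with_recovery(lambda: next(statuses), lambda: None))  # complete
```

One-shot `torc recover` corresponds to a single `do_recover()` call; `torc watch -r` wraps it in this loop.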

docs/src/introduction.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ dependencies, mixed resource requirements, and multiple stages.
 - **Resource Management** — Track CPU, memory, and GPU usage across all jobs
 - **Automatic Failure Recovery** — Detect OOM/timeout failures and retry with adjusted resources
 - **Fault Tolerance** — Resume workflows after failures without losing progress
+- **AI-Assisted Configuration and Management** — Use AI tools like Claude Code and Copilot to
+  configure workflows and diagnose problems

 ### Job Parameterization
docs/src/tutorials/ai-assistant.md

Lines changed: 153 additions & 22 deletions
@@ -162,42 +162,173 @@ Suppose that jobs fail because they use too much CPU or memory.
 ---

-## Example: Debugging a Failure
+## Example: Recovering a failed workflow that timed out
+
+The workflow in `tests/workflows/timeout_auto_recovery_test` is used to test torc's auto-recovery of
+walltime timeouts. In this example, the workflow was left in a failed state.
+
+Here is a conversation in Claude Code after running the workflow:

 ```
-You: Why did a job in workflow 5 fail?
+> analyze the logs for workflow 6
+
+Workflow 6 Log Analysis Summary
+
+Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery
+
+Job Results
+
+| Job | Name       | Status        | Runtime | Return Code  |
+|-----|------------|---------------|---------|--------------|
+| 31  | (fast job) | ✅ Completed  | 1.0 min | 0            |
+| 32  | job_slow   | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |
+
+Terminated Job Details (job_slow)

-Claude:
-● The job failed due to an Out of Memory (OOM) kill:
+The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.

-Job Details:
-- Name: work_job
-- Command: python work.py
-- Return code: 137 (SIGKILL from OOM killer)
+stdout log shows the job ran through minute 6 before being killed:
+Slow job starting at Sat Dec 27 15:40:59 MST 2025
+This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
+It should be killed by Slurm and recovered by the watcher
+Minute 1 of 10...
+Minute 2 of 10...
+...
+Minute 6 of 10...

-What happened:
-1. The job was configured with a 200GB memory limit
-2. The script allocated 230GB before being killed
-3. The Linux OOM killer terminated the process
+stderr log: Empty (no errors written)

-Recommendation: Increase memory to 256GB and retry.
+Conclusion
+
+The timeout auto-recovery test is working as expected:
+- The slow job exceeded its 5-minute runtime limit
+- It was killed by the system (SIGKILL, return code -9) at ~6 minutes
+- Peak memory usage was ~4.1 MB (no memory issues)
 ```
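The diagnosis step turns exit codes like those in the transcript into a cause: 137 is 128 + 9 (the process was killed with SIGKILL), and a negative code such as -9 is how many runners report death by signal 9. Because SIGKILL alone is ambiguous between the OOM killer and Slurm's walltime enforcement, utilization data breaks the tie. The following rough classifier is illustrative only — the function and its checks are assumptions, not torc internals.

```python
def classify_failure(return_code, peak_mem_gb=None, mem_limit_gb=None,
                     runtime_min=None, runtime_limit_min=None):
    """Guess a failure cause from the exit code plus utilization data."""
    # 137 == 128 + SIGKILL(9); a negative code is the raw signal number.
    killed = return_code == 137 or return_code == -9
    if not killed:
        return "unknown"
    if peak_mem_gb is not None and mem_limit_gb is not None \
            and peak_mem_gb >= mem_limit_gb:
        return "oom"
    if runtime_min is not None and runtime_limit_min is not None \
            and runtime_min >= runtime_limit_min:
        return "timeout"
    return "unknown"

# job_slow above: killed with -9 after 5.9 min against a 5-minute limit.
print(classify_failure(-9, runtime_min=5.9, runtime_limit_min=5.0))  # timeout
```

Jobs that fall through to "unknown" are exactly the ones the `--retry-unknown` and `--recovery-hook` options exist for.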
 ---

-## Example: Fixing and Retrying
+## Example: Debugging a failed workflow with over-consumption of memory

-```
-You: Increase the memory to 256GB and restart the failed jobs.
+The workflow in `tests/workflows/oom_auto_recovery_test` is used to test torc's diagnosis of memory
+over-consumption errors.

-Claude:
-● Updated job 5 resource requirements:
-- Memory: 200g → 256g
+Here is a conversation in Claude Code after running the workflow:

-● Reset job status from "failed" to "ready"
+> Analyze the logs for workflow 16

-● The job is now queued for execution. Would you like me to submit it to Slurm?
-```
+● torc - analyze_workflow_logs (MCP)(workflow_id: 16) ⎿  { "error_count": 0, "errors_by_type": [], … +7 lines (ctrl+o to expand)
+
+● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿  { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
+
+● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns. Let me dig deeper:
+
+● torc - list_failed_jobs (MCP)(workflow_id: 16) ⎿  { "failed_job_count": 3, "failed_jobs": [ … +18 lines (ctrl+o to expand)
+
+● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true) ⎿  { "failed_jobs": [ { … +137 lines (ctrl+o to expand)
+
+● Workflow 16 Analysis Summary
+
+Workflow: oom_test - "Test workflow with an OOM" Status: 7 completed, 3 failed (10 total jobs)
+
+Root Cause: Out of Memory (OOM)
+
+All 3 failed jobs were killed due to OOM (return code 137):
+
+| Job ID | Name   | Peak Memory        | Configured | Over-Utilization |
+| ------ | ------ | ------------------ | ---------- | ---------------- |
+| 103    | work_1 | N/A (killed early) | 10g        | SIGKILL 137      |
+| 109    | work_7 | 29.91 GB           | 10g        | +199.1%          |
+| 110    | work_8 | 26.89 GB           | 10g        | +168.9%          |
+
+Critical Warning: More Failures Expected
+
+9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30 GB each:
+
+- work_2 through work_6, work_9, work_10 all peaked at ~30.01 GB (+200.1% over limit)
+- These completed due to available system memory but are also misconfigured
+
+Recommendation
+
+1. Increase memory allocation to 32g for all work jobs
+2. Use automated recovery - I can run a dry-run of the recovery tool to preview fixes
+
+Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?
+
+> yes
+
+● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true) ⎿  { "dry_run": true, "memory_multiplier": 1.5, … +9 lines (ctrl+o to expand)
+
+● Recovery Preview (Dry Run)
+
+Proposed Actions:
+
+- 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
+- Failed jobs would be reset to ready status
+- Slurm schedulers would be regenerated
+
+⚠️ Important Consideration
+
+The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually peaked at ~30 GB. This means:
+
+- 15g may still cause OOM failures
+- A 3.5x multiplier (10g → 35g) would be safer based on actual usage
+
+Would you like me to proceed with:
+
+1. Default recovery (1.5x = 15g) - may require another recovery cycle
+2. Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)
+
+> 2
+
+● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5) ⎿  { "dry_run": true, "memory_multiplier": 3.5, … +9 lines (ctrl+o to expand)
+
+● Recovery Preview (3.5x Memory Multiplier)
+
+Proposed Actions:
+
+- 3 failed jobs would have memory increased: 10g → 35g
+- Failed jobs would be reset to ready status
+- Slurm schedulers would be regenerated
+
+This should safely cover the ~30 GB peak memory usage observed.
+
+Would you like me to proceed with these recovery actions?
+
+> yes
+
+● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5) ⎿  { "dry_run": false, "memory_multiplier": 3.5, … +7 lines (ctrl+o to expand)
+
+● Recovery Complete ✓
+
+Actions taken:
+
+- 3 jobs had memory increased from 10g → 35g
+- Failed jobs reset to ready status
+- Slurm schedulers regenerated and submitted
+
+● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿  { "description": "Test workflow with an OOM", "job_counts_by_status": { … +8 lines (ctrl+o to expand)
+
+● Current Status:
+
+| Status    | Count |
+| --------- | ----- |
+| Completed | 7     |
+| Ready     | 3     |
+
+The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm allocations have been submitted and will execute these jobs automatically.
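The back-and-forth over 1.5x versus 3.5x in the conversation is just arithmetic on observed peak versus configured memory. A hypothetical helper (invented here, not part of torc or its MCP tools) that picks a multiplier with headroom might look like this:

```python
import math

def suggest_memory_multiplier(peak_gb, configured_gb, headroom=1.1, step=0.5):
    """Suggest a memory multiplier that covers observed peak usage.

    Scales the peak by a headroom factor, then rounds the multiplier up
    to the nearest `step` so the resulting request stays tidy.
    """
    raw = peak_gb * headroom / configured_gb
    return math.ceil(raw / step) * step

# Jobs configured at 10g peaked near 30 GB, as in the transcript above:
m = suggest_memory_multiplier(30.01, 10)
print(m, "->", round(10 * m), "GB")  # 3.5 -> 35 GB
```

This reproduces the reasoning in the transcript: the default 1.5x would yield 15g (still below the ~30 GB peak), while 3.5x yields 35g.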
 ---