Suppose that jobs fail because they use too much CPU or memory.

---

## Example: Recovering a failed workflow that timed out

The workflow in `tests/workflows/timeout_auto_recovery_test` is used to test torc's auto-recovery of
walltime timeouts. In this example, the workflow was left in a failed state.

Here is a conversation in Claude Code after running the workflow:

```
> analyze the logs for workflow 6

Workflow 6 Log Analysis Summary

Workflow: timeout_auto_recovery_test - Test workflow for automatic timeout recovery

Job Results

| Job | Name | Status | Runtime | Return Code |
|-----|------------|---------------|---------|--------------|
| 31 | (fast job) | ✅ Completed | 1.0 min | 0 |
| 32 | job_slow | ⛔ Terminated | 5.9 min | -9 (SIGKILL) |

Terminated Job Details (job_slow)

The job was killed by signal 9 (SIGKILL) after running for ~6 minutes, which matches the expected behavior for a timeout test.

stdout log shows the job ran through minute 6 before being killed:
Slow job starting at Sat Dec 27 15:40:59 MST 2025
This job will run for 10 minutes (exceeds 5 min runtime and 8 min walltime)
It should be killed by Slurm and recovered by the watcher
Minute 1 of 10...
Minute 2 of 10...
...
Minute 6 of 10...

stderr log: Empty (no errors written)

Conclusion

The timeout auto-recovery test is working as expected:
- The slow job exceeded its 5-minute runtime limit
- It was killed by the system (SIGKILL, return code -9) at ~6 minutes
- Peak memory usage was ~4.1 MB (no memory issues)
```
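The two return-code spellings in this transcript and the next one are the same event seen from two layers: Python-level tooling reports death by signal as a negative return code (`-9` for SIGKILL), while a POSIX shell folds the signal into exit status 128 + signum (137). A minimal sketch in plain Python (not torc code) showing the convention:

```python
import signal
import subprocess

# Start a long-running child, then deliver SIGKILL the way a scheduler
# enforcing a walltime limit would.
proc = subprocess.Popen(["sleep", "600"])
proc.send_signal(signal.SIGKILL)

# Python reports death-by-signal as a negative return code: -9 for SIGKILL.
print(proc.wait())  # → -9

# A shell reports the same death as exit status 128 + signum.
print(128 + signal.SIGKILL)  # → 137
```

This is why a terminated job can show up as `-9 (SIGKILL)` in one report and as return code 137 in another.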

---

## Example: Debugging a failed workflow that over-consumed memory

The workflow in `tests/workflows/oom_auto_recovery_test` is used to test torc's diagnosis of memory
over-consumption errors.

Here is a conversation in Claude Code after running the workflow:

> Analyze the logs for workflow 16

● torc - analyze_workflow_logs (MCP)(workflow_id: 16) ⎿ { "error_count": 0, "errors_by_type": [], …
+7 lines (ctrl+o to expand)

● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM",
"job_counts_by_status": { … +8 lines (ctrl+o to expand)

● The workflow has 3 failed jobs out of 10 total, but the log analyzer didn't find error patterns.
Let me dig deeper:

● torc - list_failed_jobs (MCP)(workflow_id: 16) ⎿ { "failed_job_count": 3, "failed_jobs": [ … +18
lines (ctrl+o to expand)

● torc - check_resource_utilization (MCP)(workflow_id: 16, include_failed: true) ⎿ { "failed_jobs":
[ { … +137 lines (ctrl+o to expand)

● Workflow 16 Analysis Summary

Workflow: oom_test - "Test workflow with an OOM"
Status: 7 completed, 3 failed (10 total jobs)

Root Cause: Out of Memory (OOM)

All 3 failed jobs were killed due to OOM (return code 137):

| Job ID | Name   | Peak Memory        | Configured | Over-Utilization |
| ------ | ------ | ------------------ | ---------- | ---------------- |
| 103    | work_1 | N/A (killed early) | 10g        | SIGKILL 137      |
| 109    | work_7 | 29.91 GB           | 10g        | +199.1%          |
| 110    | work_8 | 26.89 GB           | 10g        | +168.9%          |

Critical Warning: More Failures Expected

9 jobs exceeded their 10 GB memory limit - including 6 jobs that happened to complete but used ~30
GB each:

- work_2 through work_6, work_9, and work_10 all peaked at ~30.01 GB (+200.1% over limit)
- These completed due to available system memory but are also misconfigured

Recommendation

1. Increase memory allocation to 32g for all work jobs
2. Use automated recovery - I can run a dry-run of the recovery tool to preview fixes

Would you like me to run recover_workflow with dry_run=true to preview the recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true) ⎿ { "dry_run": true,
"memory_multiplier": 1.5, … +9 lines (ctrl+o to expand)

● Recovery Preview (Dry Run)

Proposed Actions:

- 3 failed jobs would have memory increased (1.5x multiplier: 10g → 15g)
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated

⚠️ Important Consideration

The default 1.5x memory multiplier would increase memory from 10g → 15g, but the jobs actually
peaked at ~30 GB. This means:

- 15g may still cause OOM failures
- A 3.5x multiplier (10g → 35g) would be safer based on actual usage

Would you like me to proceed with:

1. Default recovery (1.5x = 15g) - may require another recovery cycle
2. Custom recovery with a higher memory multiplier (e.g., 3.5x = 35g)

> 2

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: true, memory_multiplier: 3.5) ⎿ {
"dry_run": true, "memory_multiplier": 3.5, … +9 lines (ctrl+o to expand)

● Recovery Preview (3.5x Memory Multiplier)

Proposed Actions:

- 3 failed jobs would have memory increased: 10g → 35g
- Failed jobs would be reset to ready status
- Slurm schedulers would be regenerated

This should safely cover the ~30 GB peak memory usage observed.

Would you like me to proceed with these recovery actions?

> yes

● torc - recover_workflow (MCP)(workflow_id: 16, dry_run: false, memory_multiplier: 3.5) ⎿ {
"dry_run": false, "memory_multiplier": 3.5, … +7 lines (ctrl+o to expand)

● Recovery Complete ✓

Actions taken:

- 3 jobs had memory increased from 10g → 35g
- Failed jobs reset to ready status
- Slurm schedulers regenerated and submitted

● torc - get_workflow_status (MCP)(workflow_id: 16) ⎿ { "description": "Test workflow with an OOM",
"job_counts_by_status": { … +8 lines (ctrl+o to expand)

● Current Status:

| Status    | Count |
| --------- | ----- |
| Completed | 7     |
| Ready     | 3     |

The 3 previously failed jobs (work_1, work_7, work_8) are now ready to run with 35g memory. Slurm
allocations have been submitted and will execute these jobs automatically.
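The multiplier choice in this conversation is simple arithmetic: recovery resizes each failed job to configured memory × multiplier, so covering a ~30 GB peak from a 10g configuration needs more than 3.0x, plus some headroom. A sketch of that reasoning (the helper below is hypothetical, not part of torc; the 15% headroom is an assumed safety margin):

```python
import math

def safe_memory_multiplier(configured_gb: float, peak_gb: float,
                           headroom: float = 1.15) -> float:
    """Smallest half-step multiplier whose allocation covers the observed
    peak plus headroom. Hypothetical helper that reproduces the arithmetic
    from the transcript; it is not a torc API."""
    needed = peak_gb * headroom
    # Round up to the next 0.5x step so the multiplier stays easy to read.
    return max(1.0, math.ceil((needed / configured_gb) * 2) / 2)

# Jobs were configured with 10g but peaked near 30 GB: the default 1.5x
# (10g -> 15g) still undershoots, while 3.5x (10g -> 35g) covers the peak.
print(safe_memory_multiplier(10, 29.91))  # → 3.5
```

With the default 1.5x, the new 15g limit is still well below the 29.91 GB peak, which is why the assistant warned that another recovery cycle would likely be needed.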

---