[agentrx-optimizer] Daily Workflow Optimization - 2026-06-18 #40014

@github-actions

Executive Summary

AgentRx analyzed the 10 most recent gh-aw fleet runs (3.75M total tokens, 76 turns, 57 action-minutes, 0 errors / 0 warnings). With no hard failures present, the diagnosis targets the highest-cost bottleneck step. The fleet's token budget is extremely concentrated: the top 3 workflows account for 97.5% of all tokens, and a single workflow, Daily CLI Tools Exploratory Tester, consumes 44.4% (1.66M tokens) on its own while also taking the most turns (34) and the longest wall-clock time (13.3m). Fleet-wide, 91 of 95 mined log-event templates are tool_result events, which indicates that large, re-accumulated tool-output payloads dominate cost. Top finding: the bottleneck is token-heavy context payloads in the exploratory tester, where the existing advisory token rules are not binding.

AgentRx Evidence

  • Critical step: IR step #2, Daily CLI Tools Exploratory Tester (run 27741767107), the highest-cost step in the trajectory.
  • Failure category: Cost / efficiency bottleneck (token-heavy context payloads). No correctness failure — status=completed, 0 errors fleet-wide.
  • Frequency / impact: 1.66M tokens (44.4% of fleet) · 34 turns (most in fleet) · 13.3m (longest run) · ~48.9K tokens/turn. Top-3 token share = 97.5%. Fleet tool_result templates = 91/95.
  • Representative run IDs: 27741767107 (CLI Tools Exploratory, 1.66M), 27742372229 (Schema Consistency, 1.05M / 74.7K tok-per-turn), 27741769105 (PR Description Updater, 0.94M).
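
The per-turn and fleet-share figures above follow from simple division. The sketch below reproduces them from the reported numbers, taking the exact token_usage for run 27741767107 from the IR summary. The function names are illustrative, and because the fleet total is the rounded 3.75M, the share is approximate.

```python
# Figures reported for run 27741767107; the fleet total is the rounded 3.75M.
RUN_TOKENS = 1_663_930
RUN_TURNS = 34
FLEET_TOKENS = 3_750_000  # rounded in the report, so shares are approximate


def tokens_per_turn(tokens: int, turns: int) -> float:
    """Average tokens consumed per agent turn."""
    if turns <= 0:
        raise ValueError("turns must be positive")
    return tokens / turns


def fleet_share(tokens: int, fleet_tokens: int) -> float:
    """Fraction of the fleet token total attributed to one run."""
    if fleet_tokens <= 0:
        raise ValueError("fleet_tokens must be positive")
    return tokens / fleet_tokens


if __name__ == "__main__":
    print(f"{tokens_per_turn(RUN_TOKENS, RUN_TURNS) / 1000:.1f}K tokens/turn")
    print(f"{fleet_share(RUN_TOKENS, FLEET_TOKENS):.1%} of fleet tokens")
```

Run as a script, this prints 48.9K tokens/turn and 44.4% of fleet tokens, matching the figures in the evidence list.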

Labeled violations (derived from AgentRx IR + MCP fleet telemetry):

| violation | evidence | fix_type | rationale |
| --- | --- | --- | --- |
| Exploratory run burns 44% of fleet token budget | run 27741767107: 1.66M tok / 34 turns / 13.3m; tok/turn ≈ 48.9K | reducing token-heavy context payloads | Single largest cost lever; advisory token rules already present but not enforced |
| tool_result payloads dominate event mix | 91/95 mined templates are tool_result across 10 runs | reducing token-heavy context payloads | Large tool outputs re-accumulate in context every turn, compounding over 34 turns |
| High per-turn context re-send | Schema Consistency 27742372229: 74.7K tokens/turn | adding precondition checks before expensive tools | No hard cap on tool-call rounds; each turn re-sends prior large outputs |
| High-anomaly tool_result event | reliability/medium: 1 event score 0.65 ("new log template; rare cluster") | adding missing telemetry attributes for better triage | Rare cluster is hard to triage without structured tool-output size attributes |

Note: AgentRx's static/dynamic/check/judge/report stages require an LLM endpoint (copilot CLI / azure / trapi) that is not available in this sandbox (copilot binary absent; no azure/trapi env). Only the deterministic IR stage completed. The violation table above is therefore grounded in the IR trajectory + MCP fleet telemetry rather than LLM-generated invariant checks. No telemetry or AgentRx output was invented.

AgentRx Artifacts

IR summary (runs/gh-aw-daily/trajectory_ir.json)

  • trajectory_id: gh-aw-daily-fleet-2026-06-18; domain auto-detected flash.
  • Converter: markdown IR (deterministic, no LLM fallback); ir_from_markdown=true, ir_used_llm_fallback=false.
  • 1 trajectory, 12 steps, all valid: step 1 = fleet telemetry (user goal), steps 2–11 = the 10 runs ordered by token cost, step 12 = observability insights.
  • Step 2 (critical): run_id=27741767107 | Daily CLI Tools Exploratory Tester | token_usage=1663930 | turns=34 | duration=13.3m.

Invariant / checker highlights

  • Not generated — static/dynamic/check stages need an LLM endpoint unavailable in this sandbox (RuntimeError: 'copilot' CLI not found on PATH). check.json was not produced.

Judge classification

  • Not generated — judge/report stages need the same LLM endpoint. judge.json was not produced. Root-cause category was instead derived deterministically from the IR cost ordering (cost/efficiency bottleneck).

State (runs/gh-aw-daily/state.json)

  • completed_stages: ["ir"], endpoint: "copilot", domain: "flash".

Known limitations

  • Single completed AgentRx stage (IR). Downstream LLM stages blocked by sandbox endpoint/auth constraints — analysis proceeds from completed artifacts + MCP telemetry as designed.
  • token_usage is unavailable for in-progress runs and some engines (Codex / AI Moderator reported None), so token concentration is computed over the 4 runs that report it.
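
The second limitation matters for how the concentration figure is read. A minimal sketch of a top-k share that skips runs reporting None is shown below. The function name and the sample values in the usage note are hypothetical, not fleet data.

```python
from typing import Optional, Sequence


def top_k_share(usages: Sequence[Optional[int]], k: int = 3) -> float:
    """Share of reported tokens held by the k largest runs.

    Runs whose token_usage is None (in progress, or an engine that does not
    report usage) are excluded from both numerator and denominator.
    """
    reported = sorted((u for u in usages if u is not None), reverse=True)
    total = sum(reported)
    if total == 0:
        raise ValueError("no run reported a non-zero token_usage")
    return sum(reported[:k]) / total
```

For example, with made-up values, `top_k_share([500, None, 300, 150, None, 50])` evaluates to 0.95: the two None entries are ignored and the three largest reported runs hold 950 of 1000 tokens.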

Recommended Optimization

One specific change: In .github/workflows/daily-cli-tools-tester.md, convert the existing advisory "## Token Efficiency Rules" into an enforced exploration budget: (1) cap total tool-call rounds to a hard limit (e.g. ≤ 10 rounds, then write the report), and (2) require large audit/logs/status outputs to be written to /tmp/gh-aw/agent/ files and read back in slices, instead of letting full tool results accumulate in conversation context across turns. Recompile the .lock.yml.

Why highest impact: This one workflow accounts for 44.4% of the entire fleet's token spend (1.66M) and the fleet's longest run (13.3m), driven by 34 turns that each carry re-accumulated tool output (tool_result = 91/95 of all fleet event templates). The workflow already contains soft prose token rules (count:3, max_tokens:3000) yet still produced a 1.66M-token run. The smallest structural change, a hard round budget plus offloading of large outputs, therefore attacks the dominant cost lever directly, with no correctness risk (runs already complete cleanly).

Where to implement: the "## Token Efficiency Rules" section of .github/workflows/daily-cli-tools-tester.md; then recompile daily-cli-tools-tester.lock.yml.
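
In the workflow itself, these two rules would be written as prompt instructions. As an illustration of the intended behavior only, the sketch below models a hard round budget and file offloading with sliced read-back. `RoundBudget`, `offload`, and `read_slice` are hypothetical helpers, not part of gh-aw, and the demo writes to a temporary directory rather than /tmp/gh-aw/agent/.

```python
import itertools
import tempfile
from pathlib import Path
from typing import List


class RoundBudget:
    """Hard cap on tool-call rounds (hypothetical helper, not a gh-aw API)."""

    def __init__(self, max_rounds: int = 10) -> None:
        self.max_rounds = max_rounds
        self.used = 0

    def spend(self) -> bool:
        """Consume one round; return False once the budget is exhausted."""
        if self.used >= self.max_rounds:
            return False
        self.used += 1
        return True


def offload(output: str, directory: Path, name: str) -> Path:
    """Write a large tool output to disk instead of keeping it in context."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(output, encoding="utf-8")
    return path


def read_slice(path: Path, start: int, count: int) -> List[str]:
    """Return up to `count` lines starting at zero-based line `start`."""
    with path.open(encoding="utf-8") as handle:
        window = itertools.islice(handle, start, start + count)
        return [line.rstrip("\n") for line in window]


if __name__ == "__main__":
    budget = RoundBudget(max_rounds=10)
    with tempfile.TemporaryDirectory() as tmp:
        log = "\n".join(f"line {i}" for i in range(1000))
        path = offload(log, Path(tmp) / "agent", "audit.log")
        while budget.spend():
            # Each round sees only a 3-line slice, never the full log.
            chunk = read_slice(path, budget.used * 3, 3)
        print(budget.used, chunk)
```

The point of the design is that only the requested slice re-enters context on each round, so the cost of a large output no longer compounds across turns.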

Validation Plan

  • How to confirm: On the next scheduled run of Daily CLI Tools Exploratory Tester, pull the MCP logs and compare its token_usage, turns, and duration against this baseline.
  • Expected success metric changes:
    • token_usage: 1.66M → < 1.0M (target ≤ 0.7M).
    • turns: 34 → ≤ 15.
    • duration: 13.3m → < 8m.
    • status remains completed, errors stay at 0, and a [cli-tools-test] issue is still produced when defects are found (functional coverage preserved).
    • Fleet top-3 token share should fall below 97.5% as this workflow's share drops from 44.4%.
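
The per-run thresholds above can be checked mechanically once the next run's metrics are available. The sketch below assumes a run is represented as a plain dict. The field names (`duration_min`, `errors`, `status`) are hypothetical and do not reflect the MCP log schema. It uses the "< 1.0M" bound rather than the 0.7M stretch target.

```python
from typing import Dict

# Baseline from run 27741767107 and the success thresholds listed above.
BASELINE = {"token_usage": 1_663_930, "turns": 34, "duration_min": 13.3}
TARGETS = {"token_usage": 1_000_000, "turns": 15, "duration_min": 8.0}


def check_run(run: Dict[str, object]) -> Dict[str, bool]:
    """Return a pass/fail flag per success criterion for one run."""
    return {
        "token_usage": run["token_usage"] < TARGETS["token_usage"],
        "turns": run["turns"] <= TARGETS["turns"],
        "duration_min": run["duration_min"] < TARGETS["duration_min"],
        "status": run["status"] == "completed",
        "errors": run["errors"] == 0,
    }


if __name__ == "__main__":
    # The baseline run itself should fail all three cost criteria.
    baseline_run = dict(BASELINE, status="completed", errors=0)
    for criterion, passed in check_run(baseline_run).items():
        print(f"{criterion}: {'pass' if passed else 'fail'}")
```

Whether a [cli-tools-test] issue is still produced when defects exist is not covered here and would need a separate check.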

References

  • §27741767107 — Daily CLI Tools Exploratory Tester (critical / highest-cost step)
  • §27742372229 — Schema Consistency Checker (highest tokens/turn)
  • §27741769105 — PR Description Updater (3rd-highest token spend)

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • index.crates.io

To allow this domain, add it to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "index.crates.io"

See Network Configuration for more information.

Generated by ⚡ Daily AgentRx Trace Optimizer · expires on Jun 24, 2026, 11:17 PM UTC-08:00
