[agentrx-optimizer] Daily Workflow Optimization - 2026-06-18

### Executive Summary

AgentRx analyzed the most recent **10-run gh-aw fleet** (3.75M total tokens, 76 turns, 57 action-minutes, **0 errors / 0 warnings**). With no hard failures present, the diagnosis targets the **highest-cost bottleneck step**. The fleet's token budget is extremely concentrated: the top 3 workflows account for **97.5%** of all tokens, and a single workflow — **Daily CLI Tools Exploratory Tester** — consumes **44.4% (1.66M tokens) on its own** across the most turns (34) and the longest wall-clock (13.3m). Fleet-wide, **91 of 95** mined log-event templates are `tool_result` events, confirming that large, re-accumulated tool-output payloads dominate cost. Top finding: the bottleneck is **token-heavy context payloads in the exploratory tester**, where existing *advisory* token rules are not binding.

### AgentRx Evidence

- **Critical step:** IR step `#2` — `Daily CLI Tools Exploratory Tester` (run `27741767107`), the highest-cost step in the trajectory.
- **Failure category:** Cost / efficiency bottleneck (token-heavy context payloads). No correctness failure — `status=completed`, 0 errors fleet-wide.
- **Frequency / impact:** 1.66M tokens (44.4% of fleet) · 34 turns (most in fleet) · 13.3m (longest run) · ~48.9K tokens/turn. Top-3 token share = 97.5%. Fleet `tool_result` templates = 91/95.
- **Representative run IDs:** `27741767107` (CLI Tools Exploratory, 1.66M), `27742372229` (Schema Consistency, 1.05M / 74.7K tok-per-turn), `27741769105` (PR Description Updater, 0.94M).
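
The headline figures above can be re-derived with a few lines of arithmetic. The sketch below uses only numbers quoted in this report. The fleet total and the two smaller runs are the rounded values (3.75M, 1.05M, 0.94M), so the computed shares are approximate, and the function names are illustrative rather than part of AgentRx.

```python
# Figures copied from this report. Rounded inputs mean the shares are approximate.
FLEET_TOTAL_TOKENS = 3_750_000      # "3.75M total tokens" (rounded)
TESTER_TOKENS = 1_663_930           # Daily CLI Tools Exploratory Tester (exact, from the IR)
TESTER_TURNS = 34
SCHEMA_TOKENS = 1_050_000           # Schema Consistency (rounded)
PR_UPDATER_TOKENS = 940_000         # PR Description Updater (rounded)


def share_percent(tokens: int, total: int) -> float:
    """Return tokens as a percentage of total."""
    return 100.0 * tokens / total


def tokens_per_turn(tokens: int, turns: int) -> float:
    """Return the average number of tokens consumed per turn."""
    return tokens / turns


tester_share = share_percent(TESTER_TOKENS, FLEET_TOTAL_TOKENS)
top3_share = share_percent(
    TESTER_TOKENS + SCHEMA_TOKENS + PR_UPDATER_TOKENS, FLEET_TOTAL_TOKENS
)
tester_tpt = tokens_per_turn(TESTER_TOKENS, TESTER_TURNS)

print(f"tester share:    {tester_share:.1f}%")
print(f"top-3 share:     {top3_share:.1f}%")
print(f"tester tok/turn: {tester_tpt / 1000:.1f}K")
```

Because the top-3 share is computed from rounded inputs, it may differ from the reported 97.5% in the last digit.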

**Labeled violations (derived from AgentRx IR + MCP fleet telemetry):**

| violation | evidence | fix_type | rationale |
|---|---|---|---|
| Exploratory run burns 44% of fleet token budget | run `27741767107`: 1.66M tok / 34 turns / 13.3m; tok/turn ≈ 48.9K | reducing token-heavy context payloads | Single largest cost lever; advisory token rules already present but not enforced |
| `tool_result` payloads dominate event mix | 91/95 mined templates are `tool_result` across 10 runs | reducing token-heavy context payloads | Large tool outputs re-accumulate in context every turn, compounding over 34 turns |
| High per-turn context re-send | Schema Consistency `27742372229`: 74.7K tokens/turn | adding precondition checks before expensive tools | No hard cap on tool-call rounds; each turn re-sends prior large outputs |
| High-anomaly tool_result event | reliability/medium: 1 event score 0.65 ("new log template; rare cluster") | adding missing telemetry attributes for better triage | Rare cluster is hard to triage without structured tool-output size attributes |
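
The "re-accumulate in context every turn" rationale in the table can be illustrated with a deliberately simplified cost model. It assumes every turn re-sends a fixed base prompt plus all earlier tool outputs, that every tool output has the same size, and that no caching or compaction applies. None of these assumptions is confirmed by the telemetry, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(turns: int, base_tokens: int, output_tokens: int) -> int:
    """Total input tokens when each turn re-sends the base prompt and all prior tool outputs.

    Simplified model: turn 0 sees only the base prompt, and turn k additionally
    sees the k tool outputs produced by earlier turns.
    """
    total = 0
    for turn in range(turns):
        total += base_tokens + turn * output_tokens
    return total


# Isolate the re-sent tool-output term by using a unit-sized output and no base prompt.
resend_units_34 = cumulative_input_tokens(34, base_tokens=0, output_tokens=1)
resend_units_10 = cumulative_input_tokens(10, base_tokens=0, output_tokens=1)

print(resend_units_34)  # 561, i.e. 34 * 33 / 2
print(resend_units_10)  # 45, i.e. 10 * 9 / 2
print(f"{resend_units_34 / resend_units_10:.1f}x")
```

Under this model the re-sent term grows quadratically with the number of turns, so lowering the turn count shrinks it much faster than linearly. Real runs will deviate from the model wherever output sizes vary or the engine trims context.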

> Note: AgentRx's `static`/`dynamic`/`check`/`judge`/`report` stages require an LLM endpoint (`copilot` CLI / azure / trapi) that is **not available in this sandbox** (`copilot` binary absent; no azure/trapi env). Only the deterministic **IR** stage completed. The violation table above is therefore grounded in the IR trajectory + MCP fleet telemetry rather than LLM-generated invariant checks. No telemetry or AgentRx output was invented.

<details>
<summary>AgentRx Artifacts</summary>

**IR summary** (`runs/gh-aw-daily/trajectory_ir.json`)
- `trajectory_id`: `gh-aw-daily-fleet-2026-06-18`; domain auto-detected `flash`.
- Converter: **markdown IR (deterministic, no LLM fallback)** — `ir_from_markdown=true`, `ir_used_llm_fallback=false`.
- 1 trajectory, **12 steps**, all valid: step 1 = fleet telemetry (user goal), steps 2–11 = the 10 runs ordered by token cost, step 12 = observability insights.
- Step 2 (critical): `run_id=27741767107 | Daily CLI Tools Exploratory Tester | token_usage=1663930 | turns=34 | duration=13.3m`.

**Invariant / checker highlights**
- Not generated — `static`/`dynamic`/`check` stages need an LLM endpoint unavailable in this sandbox (`RuntimeError: 'copilot' CLI not found on PATH`). `check.json` was not produced.

**Judge classification**
- Not generated — `judge`/`report` stages need the same LLM endpoint. `judge.json` was not produced. Root-cause category was instead derived deterministically from the IR cost ordering (cost/efficiency bottleneck).

**State** (`runs/gh-aw-daily/state.json`)
- `completed_stages: ["ir"]`, `endpoint: "copilot"`, `domain: "flash"`.

**Known limitations**
- Single completed AgentRx stage (IR). Downstream LLM stages blocked by sandbox endpoint/auth constraints — analysis proceeds from completed artifacts + MCP telemetry as designed.
- `token_usage` is unavailable for in-progress runs and some engines (Codex / AI Moderator reported `None`), so token concentration is computed over the 4 runs that report it.

</details>

### Recommended Optimization

**One specific change:** In `.github/workflows/daily-cli-tools-tester.md`, convert the existing advisory *"## Token Efficiency Rules"* into an **enforced exploration budget**: (1) cap total tool-call rounds to a hard limit (e.g. **≤ 10 rounds**, then write the report), and (2) require large `audit`/`logs`/`status` outputs to be **written to `/tmp/gh-aw/agent/` files and read back in slices**, instead of letting full tool results accumulate in conversation context across turns. Recompile the `.lock.yml`.

**Why highest impact:** This one workflow accounts for **44.4% of the entire fleet's token spend** (1.66M) and the fleet's longest run (13.3m), driven by 34 turns that each carry re-accumulated tool output (`tool_result` = 91/95 of all fleet event templates). The workflow *already* contains soft prose token rules (`count:3, max_tokens:3000`) yet still produced a 1.66M-token run, so the smallest *structural* change (a hard round budget plus offloading large outputs) attacks the dominant cost lever directly, with no correctness risk (runs already complete cleanly).

**Where to implement:** `.github/workflows/daily-cli-tools-tester.md` → `## Token Efficiency Rules` section; then recompile `daily-cli-tools-tester.lock.yml`.
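
The two mechanisms in the recommendation, a hard round budget and offloading large outputs to files that are read back in slices, can be sketched as follows. This is a minimal illustration, not the gh-aw implementation: in the workflow itself these would be prompt-level rules, and `RoundBudget`, `BudgetExhausted`, `offload`, and `read_slice` are hypothetical names. The demo writes to a temporary directory so it runs anywhere. The report's suggested location is `/tmp/gh-aw/agent/`.

```python
import tempfile
from itertools import islice
from pathlib import Path


class BudgetExhausted(RuntimeError):
    """Raised when a tool call is attempted after the round budget is spent."""


class RoundBudget:
    """Hard cap on tool-call rounds (the report suggests at most 10)."""

    def __init__(self, max_rounds: int) -> None:
        self.max_rounds = max_rounds
        self.used = 0

    def spend(self) -> int:
        """Consume one round and return the number of rounds remaining."""
        if self.used >= self.max_rounds:
            raise BudgetExhausted(
                f"round budget of {self.max_rounds} exhausted; write the report"
            )
        self.used += 1
        return self.max_rounds - self.used


def offload(output: str, directory: Path, name: str) -> Path:
    """Write a large tool output to disk instead of keeping it in context."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / name
    path.write_text(output, encoding="utf-8")
    return path


def read_slice(path: Path, start: int, count: int) -> list[str]:
    """Return `count` lines starting at zero-based line `start`."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, start, start + count)]


with tempfile.TemporaryDirectory() as tmp:
    budget = RoundBudget(max_rounds=10)
    budget.spend()  # one round used for the (simulated) tool call below
    big_output = "\n".join(f"log line {i}" for i in range(1000))
    saved = offload(big_output, Path(tmp) / "agent", "logs.txt")
    window = read_slice(saved, start=100, count=3)
    print(window)  # ['log line 100', 'log line 101', 'log line 102']
```

Only the three-line window enters the conversation, while the other 997 lines stay on disk until a later slice is requested.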

### Validation Plan

- **How to confirm:** On the next scheduled run of `Daily CLI Tools Exploratory Tester`, pull MCP `logs` and compare its `token_usage`, `turns`, and `duration` against this baseline (1.66M tokens, 34 turns, 13.3m).
- **Expected success metric changes:**
  - `token_usage`: **1.66M → < 1.0M** (target ≤ 0.7M).
  - `turns`: **34 → ≤ 15**.
  - `duration`: **13.3m → < 8m**.
  - `status` remains `completed`, errors stay at **0**, and a `[cli-tools-test]` issue is still produced when defects are found (functional coverage preserved).
  - Fleet top-3 token share should fall below 97.5% as this workflow's share drops from 44.4%.
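
The per-run success criteria above can be checked mechanically once the next run's metrics are available. The sketch below assumes the metrics have already been extracted into a plain dictionary. The dictionary shape, the `duration_min` and `errors` keys, and the sample values are hypothetical, and only the thresholds come from this plan.

```python
# Thresholds taken from the validation plan above.
MAX_TOKENS = 1_000_000    # token_usage must be below 1.0M
MAX_TURNS = 15            # turns must be at most 15
MAX_DURATION_MIN = 8.0    # duration must be under 8 minutes


def evaluate(run: dict) -> dict[str, bool]:
    """Return a pass/fail flag for each success criterion."""
    return {
        "token_usage": run["token_usage"] < MAX_TOKENS,
        "turns": run["turns"] <= MAX_TURNS,
        "duration": run["duration_min"] < MAX_DURATION_MIN,
        "status": run["status"] == "completed",
        "errors": run["errors"] == 0,
    }


# Baseline from this report: expected to fail the three cost criteria.
baseline = {
    "token_usage": 1_663_930,
    "turns": 34,
    "duration_min": 13.3,
    "status": "completed",
    "errors": 0,
}

# Hypothetical post-change run, for illustration only.
candidate = {
    "token_usage": 680_000,
    "turns": 12,
    "duration_min": 6.5,
    "status": "completed",
    "errors": 0,
}

for label, run in (("baseline", baseline), ("candidate", candidate)):
    results = evaluate(run)
    verdict = "PASS" if all(results.values()) else "FAIL"
    print(label, verdict, results)
```

The functional-coverage criterion (a `[cli-tools-test]` issue is still filed when defects are found) is not covered here and would need a separate check against the issue tracker.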

### References

- [§27741767107](https://github.com/github/gh-aw/actions/runs/27741767107) — Daily CLI Tools Exploratory Tester (critical / highest-cost step)
- [§27742372229](https://github.com/github/gh-aw/actions/runs/27742372229) — Schema Consistency Checker (highest tokens/turn)
- [§27741769105](https://github.com/github/gh-aw/actions/runs/27741769105) — PR Description Updater (3rd-highest token spend)







> [!WARNING]
> <details>
> <summary>Firewall blocked 1 domain</summary>
>
> The following domain was blocked by the firewall during workflow execution:
>
> - `index.crates.io`
>
> To allow this domain, add it to the `network.allowed` list in your workflow frontmatter:
>
> ```yaml
> network:
>   allowed:
>     - defaults
>     - "index.crates.io"
> ```
>
> See [Network Configuration](https://github.github.com/gh-aw/reference/network/) for more information.
>
> </details>


> Generated by [⚡ Daily AgentRx Trace Optimizer](https://github.com/github/gh-aw/actions/runs/27742661969) · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-agentrx-trace-optimizer%22&type=issues)
> - [x] expires on Jun 24, 2026, 11:17 PM UTC-08:00
