Read this when you have 2+ evals and want to know empirically, not by guessing, which models belong in `retry_policy` and in what order.
You defined SummarizeArticle in the README with retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]. That list was a guess. optimize_retry_policy tells you which models your evals actually need, so you stop paying for the strong model when nano was enough — or stop shipping nano when the hardest eval proves it isn't.
- `SummarizeArticle` already has `retry_policy`. If your step has none, add one first (getting started).
- 2–3 evals per step. One eval optimizes for one scenario; with only `smoke`, you get a recommendation that passes smoke but may miss production edge cases. See eval-first.
- Rake tasks. The standard `RubyLLM::Contract::RakeTask` includes `ruby_llm_contract:optimize`. Non-Rails projects: set `EVAL_DIRS=...`.

Two orthogonal dimensions to a retry chain. A chain element is `{ model:, reasoning_effort: }`: model identity AND thinking budget. `optimize_retry_policy` explores both. You can also fix the thinking config at class level via `thinking effort: :low` (or the alias `reasoning_effort :low`) on the Step; it becomes the default for every chain element unless an override is passed. See the `thinking` DSL note at the bottom of this guide.
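
The rake examples in this guide write a chain element compactly as `model@effort` (for example `gpt-4.1-mini@low`). As a rough illustration of how such a string could map onto `{ model:, reasoning_effort: }` hashes, here is a minimal sketch. `parse_candidates` and its `default_effort:` parameter are hypothetical names invented for this example; the gem's own parsing may differ.

```ruby
# Hypothetical helper, not part of ruby_llm_contract: turns a CANDIDATES-style
# string into chain elements of the form { model:, reasoning_effort: }.
def parse_candidates(spec, default_effort: nil)
  spec.split(",").map do |entry|
    model, effort = entry.strip.split("@", 2)
    # An explicit "@effort" suffix wins; otherwise use the supplied default.
    { model: model, reasoning_effort: effort || default_effort }
  end
end

chain = parse_candidates("gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1")
chain.each { |element| puts element.inspect }
```

Passing `default_effort: "low"` mimics a class-level default: elements without an `@effort` suffix pick it up, while explicit suffixes are left alone.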
For this guide, assume SummarizeArticle has three evals:

    SummarizeArticle.define_eval("smoke") { ... }          # short news article
    SummarizeArticle.define_eval("dense_article") { ... }  # long form, 5 takeaways required
    SummarizeArticle.define_eval("critical_tone") { ... }  # negative review, tone must match

Run once offline to verify the wiring:

    rake ruby_llm_contract:optimize \
      STEP=SummarizeArticle \
      CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1

Offline uses each eval's `sample_response`, so it makes zero API calls. Every candidate gets the same score because they all receive the canned response. That's fine for a smoke test (verifying evals load, candidates parse, output renders) but it doesn't compare model quality. For real optimization, go live.

    LIVE=1 RUNS=3 rake ruby_llm_contract:optimize \
      STEP=SummarizeArticle \
      CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1

`LIVE=1` makes real API calls. `RUNS=3` averages each (candidate, eval) pair over three runs. This is necessary because OpenAI forces `temperature=1.0` on gpt-5 / o-series, so the same pair can score 0.00 on one run and 1.00 on the next.
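
To see why a single run is unreliable, here is a minimal sketch of the averaging. It assumes each run yields a score between 0 and 1 and that the reported value is a plain mean; `average_score` and the sample data are hypothetical, not the gem's API.

```ruby
# Hypothetical per-run scores for one (candidate, eval) pair: 1.0 = pass, 0.0 = fail.
def average_score(run_scores)
  (run_scores.sum / run_scores.size.to_f).round(2)
end

runs = [1.0, 0.0, 1.0]

puts average_score(runs)          # prints 0.67
puts average_score(runs.first(1)) # prints 1.0
```

With only the first run, this pair would look like a clean pass; three runs expose that it fails about a third of the time.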
Output (illustrative):

    SummarizeArticle — fallback list optimization

    eval            4.1-nano   4.1-mini@low   4.1-mini   4.1
    ---------------------------------------------------------
    smoke           1.00       1.00           1.00       1.00
    dense_article   0.67 ←     1.00           1.00       1.00
    critical_tone   0.50 ←     0.67 ←         1.00       1.00

    Hardest eval: critical_tone

    Suggested fallback list:
      gpt-4.1-nano — covers 1 eval(s)
      gpt-4.1-mini — passes all 3 evals

    DSL:
      retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]

Reading the table:
- `←` marks scores below threshold in the hardest eval. Not a selection hint, just "this candidate fails the row that matters most".
- Hardest eval = the one that forces the strong fallback. Here, `critical_tone` demands `gpt-4.1-mini`.
- Suggested fallback list = the shortest chain where each step covers more evals, built greedy-cheapest-first. Stops when all evals pass. Order matters: `gpt-4.1-nano` is tried first; on validation failure, the gem falls back to `gpt-4.1-mini`.

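The per-candidate coverage counts and the hardest eval can be re-derived by hand from the table. The sketch below does that with the illustrative scores. It assumes a pass threshold of 1.0 and treats "hardest" as "passed by the fewest candidates"; both are assumptions made for this example, and the code does not reproduce the gem's actual chain-selection logic.

```ruby
THRESHOLD = 1.0 # assumed pass threshold; the gem's actual value may differ

# Scores copied from the illustrative table, candidates listed cheapest first.
SCORES = {
  "gpt-4.1-nano"     => { "smoke" => 1.00, "dense_article" => 0.67, "critical_tone" => 0.50 },
  "gpt-4.1-mini@low" => { "smoke" => 1.00, "dense_article" => 1.00, "critical_tone" => 0.67 },
  "gpt-4.1-mini"     => { "smoke" => 1.00, "dense_article" => 1.00, "critical_tone" => 1.00 },
  "gpt-4.1"          => { "smoke" => 1.00, "dense_article" => 1.00, "critical_tone" => 1.00 }
}.freeze

# Evals each candidate passes on its own.
coverage = SCORES.transform_values do |by_eval|
  by_eval.select { |_eval, score| score >= THRESHOLD }.keys
end

evals = SCORES.values.first.keys

# The eval passed by the fewest candidates is the one that forces a stronger model.
hardest = evals.min_by { |name| coverage.count { |_model, passed| passed.include?(name) } }

# First (cheapest) candidate that passes every eval.
cheapest_full = coverage.find { |_model, passed| passed.size == evals.size }&.first

coverage.each { |model, passed| puts "#{model}: passes #{passed.size} of #{evals.size}" }
puts "hardest eval: #{hardest}"
puts "cheapest candidate passing all evals: #{cheapest_full}"
```

Under these assumptions the result agrees with the report: `gpt-4.1-nano` covers one eval, `critical_tone` is the hardest, and `gpt-4.1-mini` is the cheapest candidate that passes everything.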
Copy the DSL, paste into your step, verify with rake ruby_llm_contract:eval. You just dropped gpt-4.1 from the chain — most requests finish on nano, mini handles what nano misses, and the strong model was never needed.
optimize shows first-attempt cost. In production, a candidate whose validator rejects 20% of outputs actually costs first_try_cost + fallback_cost × 0.20 per successful output. The first-attempt number hides this.
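
As a worked instance of that formula, here is a minimal sketch with made-up prices. `effective_cost` is a hypothetical helper, not a gem method, and the dollar amounts are illustrative only.

```ruby
# Expected cost per successful output for a [candidate, fallback] chain,
# using the formula above: first_try_cost + fallback_cost * rejection_rate.
def effective_cost(first_try_cost:, fallback_cost:, rejection_rate:)
  first_try_cost + fallback_cost * rejection_rate
end

# Hypothetical prices in dollars per call.
cost = effective_cost(first_try_cost: 0.0010, fallback_cost: 0.0030, rejection_rate: 0.20)

puts format("$%.4f", cost) # prints $0.0016
```

The first-attempt number alone ($0.0010) understates the real cost by 60% in this example, which is the gap the production-mode comparison is meant to surface.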
production_mode: { fallback: "..." } runs each candidate with a runtime [candidate, fallback] chain and reports effective cost:

    SummarizeArticle.compare_models(
      "dense_article",
      candidates: [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini", reasoning_effort: "low" }],
      production_mode: { fallback: "gpt-4.1-mini" }
    ).print_summary

Output (live mode, illustrative):

    dense_article — model comparison

    Chain                                       first-attempt   fallback %   effective cost   latency   score
    -----------------------------------------------------------------------------------------------------
    gpt-4.1-nano → gpt-4.1-mini                 $0.0010         33%          $0.0018          164ms     1.00
    gpt-4.1-mini (effort: low) → gpt-4.1-mini   $0.0015         5%           $0.0016          210ms     1.00
    gpt-4.1-mini                                $0.0030         —            $0.0030          220ms     1.00

- first-attempt: cost of the first run alone.
- fallback %: fraction of cases where the validator rejected and the fallback ran.
- effective cost: total per successful output including retries.
- `—` in the fallback % column: candidate equals fallback, no chain to observe.

Run this before finalizing: a candidate that is 3× cheaper than the fallback on first attempt but falls back 60% of the time costs 1 + 0.60 × 3 = 2.8 units against the fallback's 3, a saving of only about 1.07× in production.
Scope. `production_mode` covers single-fallback (2-tier) chains only. For multi-tier chains, inspect `trace.attempts`. It is step-level: calling it on `Pipeline::Base` raises `ArgumentError`.

- "No viable chain" from a single live run. Re-run with `RUNS=3`. If scores jump, the first run was noise. Never trust single-run results with gpt-5 / o-series in the pool: `temperature=1.0` is server-enforced.
- Every candidate fails the same eval, including the strongest. The eval is rejecting correct answers. Run the step directly (`context: { retry_policy_override: nil, model: "gpt-4.1" }`), inspect the output, compare with the `verify` block. Loosen the eval if the output is correct but not one of the accepted values.
- Testing one specific hypothesis (e.g. "does `mini@medium` help on `critical_tone`?"). Use `SummarizeArticle.compare_models("critical_tone", candidates: [{ model: "gpt-4.1-mini", reasoning_effort: "medium" }], runs: 3)` directly: three calls instead of rerunning the whole optimize pass.

Metrics exposed on Report / AggregatedReport keep their original names: single_shot_cost, single_shot_latency_ms, escalation_rate. The optimize Result struct also exposes hardest_eval as an alias for constraining_eval.
Set the default reasoning effort once on the Step class — mirrors RubyLLM::Agent.thinking exactly:

    class SummarizeArticle < RubyLLM::Contract::Step::Base
      model "gpt-5-nano"
      thinking effort: :low     # canonical
      # or
      reasoning_effort :low     # alias for thinking(effort: :low)
    end

Forwarded to `Chat#with_thinking(**)` through the adapter, so it works provider-agnostically (OpenAI `reasoning_effort`, Anthropic extended-thinking budget). A per-call override via `context: { reasoning_effort: :high }` still wins over the class default.