
Add latency to trace #1340

Open
chiang-daniel wants to merge 6 commits into main from KIL-542/perf

Conversation

Contributor

@chiang-daniel chiang-daniel commented Apr 23, 2026

What does this PR do?

Add per-LLM-call latency tracking to Trace.

  • Add total_llm_latency_ms to the usage model to track overall LLM latency, alongside cost and tokens
  • Add mean_total_llm_latency_ms to the eval API for comparisons
  • Update the run view, compare table, radar chart, and metric correlation plot

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

Summary by CodeRabbit

  • New Features

    • Added LLM latency tracking across all model interactions
    • Latency metrics now available in evaluation results and API responses
    • Enhanced comparison views to display latency alongside cost metrics
    • Added latency visualization in comparison charts and radar charts
    • Run details now show total and subtask latency breakdowns
  • Tests

    • Added comprehensive test coverage for latency calculation and data aggregation

chiang-daniel and others added 3 commits April 23, 2026 14:08
Add per-call latency timing (latency_ms) to trace messages and
cumulative total_llm_latency_ms to the Usage model. Instrument both
LiteLlmAdapter and AdapterStream to measure wall-clock time of each
LLM API call using time.monotonic, accumulate across multi-turn tool
call loops, and propagate latency into trace messages. Update
Usage.__add__ to support the new field, regenerate the OpenAPI schema,
and add comprehensive tests for field validation, addition semantics,
backwards compatibility, and adapter-level latency tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
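The accumulation scheme this commit describes can be sketched as follows (a minimal illustration with a simplified stand-in Usage class and a hypothetical timed_call helper, not the actual adapter code):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Usage:
    # Simplified stand-in for the real Usage model
    total_llm_latency_ms: Optional[float] = None

    def __add__(self, other: "Usage") -> "Usage":
        # None means "no data": stay None only when neither side has a value
        if self.total_llm_latency_ms is None and other.total_llm_latency_ms is None:
            return Usage(total_llm_latency_ms=None)
        return Usage(
            total_llm_latency_ms=(self.total_llm_latency_ms or 0)
            + (other.total_llm_latency_ms or 0)
        )

def timed_call(fn: Callable[[], str], usage: Usage) -> str:
    # time.monotonic is immune to wall-clock adjustments, so deltas are safe
    start = time.monotonic()
    result = fn()
    elapsed_ms = (time.monotonic() - start) * 1000
    usage.total_llm_latency_ms = (usage.total_llm_latency_ms or 0) + elapsed_ms
    return result
```

Accumulating across a multi-turn tool-call loop is then just invoking the timing wrapper once per LLM round trip against the same Usage instance.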
Add mean_total_llm_latency_ms to MeanUsage model and wire up
accumulators in eval_api.py to compute average latency across eval
runs, respecting the existing 50% data-availability threshold.
Regenerate OpenAPI schema and add tests for both above- and
below-threshold latency scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
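The 50% data-availability rule mentioned above can be sketched with a hypothetical helper (the real accumulators live in eval_api.py; whether exactly 50% counts as sufficient should be checked against the code):

```python
from typing import Optional

def mean_with_threshold(
    values: list[Optional[float]], threshold: float = 0.5
) -> Optional[float]:
    """Mean of the non-None samples, but only when at least `threshold`
    of all samples carry data; otherwise None (not enough coverage)."""
    if not values:
        return None
    present = [v for v in values if v is not None]
    if len(present) / len(values) < threshold:
        return None
    return sum(present) / len(present)
```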
Add latency tracking to all frontend display surfaces:
- Run page: "Total Latency" and "Subtasks Latency" in usage properties
- Compare table: "Latency" row in the usage section
- Radar chart: "Speed" axis with lower-is-better scoring
- Scatter plot: latency available as selectable axis

Rename compare section from "Average Usage & Cost" to
"Average Usage, Cost & Latency". Refactor radar chart to support
multiple lower-is-better metrics with independent normalization.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

This PR adds end-to-end LLM latency tracking: measuring per-model-turn latency in adapters using monotonic timing, accumulating values in the Usage model, computing mean latency across eval runs (with a 50% data threshold), exposing latency fields through API schemas, and rendering formatted metrics in the web UI across multiple views (compare charts, radar charts, run details).

Changes

  • Core Data Model & Types
    libs/core/kiln_ai/datamodel/task_run.py, libs/core/kiln_ai/utils/open_ai_types.py, libs/core/kiln_ai/datamodel/test_example_models.py, libs/core/kiln_ai/utils/test_open_ai_types.py
    Adds a total_llm_latency_ms field to Usage with non-negative validation and aggregation logic; extends the ChatCompletionAssistantMessageParamWrapper TypedDict with an optional latency_ms field; adds validation and backward-compatibility tests.
  • Model Adapter Latency Instrumentation
    libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py, libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py, libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
    Instruments streaming and synchronous adapters to measure call duration using monotonic timing, accumulate it into usage.total_llm_latency_ms, and annotate assistant messages with a _latency_ms field; includes comprehensive mock-based test coverage for latency accumulation and trace propagation.
  • Eval API Mean Latency Computation
    app/desktop/studio_server/eval_api.py, app/desktop/studio_server/test_eval_api.py
    Extends the MeanUsage model with an optional mean_total_llm_latency_ms field; aggregates per-eval-run latency values across a run config, computing the mean only when ≥50% of samples have data; adds dual test cases covering threshold pass and fail scenarios.
  • API Schema Generation
    app/web_ui/src/lib/api_schema.d.ts
    Exposes the new latency fields in the generated TypeScript schema: ChatCompletionAssistantMessageParamWrapper gains a per-message latency_ms, MeanUsage gains mean_total_llm_latency_ms, and Usage gains total_llm_latency_ms.
  • Latency Formatting Utility
    app/web_ui/src/lib/utils/formatters.ts
    Adds a formatLatency function that converts millisecond values to human-readable strings (milliseconds for values under 1000 ms, seconds with one decimal at or above).
  • Compare Chart Components
    app/web_ui/src/lib/components/compare_chart.svelte, app/web_ui/src/lib/components/compare_radar_chart.svelte
    Extends chart tooltip/axis formatting to detect and apply latency-specific rendering; generalizes the radar chart from cost-only scoring to multiple "lower-is-better" metrics (cost and latency) by renaming costToScore to metricToScore and updating the normalization/scaling logic.
  • Run Details & Compare Page UI
    app/web_ui/src/routes/(app)/run/run.svelte, app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
    Extends subtask usage aggregation to recursively accumulate total_llm_latency_ms; adds latency state management and request-race logic; computes and displays "Total Latency" and "Subtasks Latency" with loading states in the run view; adds a latency metric to the compare page cost section with raw-value extraction and formatted table rendering.
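The formatLatency rule summarized above lives in TypeScript (app/web_ui/src/lib/utils/formatters.ts); as described, its behavior is roughly the following Python sketch (illustrative only, not the actual helper):

```python
def format_latency(ms: float) -> str:
    # Milliseconds below one second; seconds with one decimal at or above it
    if ms < 1000:
        return f"{round(ms)}ms"
    return f"{ms / 1000:.1f}s"
```

Note a review comment later in this thread flags a boundary case with this shape: a value like 999.6 rounds to "1000ms" because the unit is chosen before rounding.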

Sequence Diagram

sequenceDiagram
    participant Client as Client (UI)
    participant Server as Eval API Server
    participant DB as Database
    participant Adapter as Model Adapter
    participant LLM as LLM Provider

    Client->>Server: Request eval scores for run config
    Server->>DB: Query eval runs for run config
    DB-->>Server: Returns eval run records with usage data
    
    loop For each eval run
        Server->>Adapter: (historical) Measure model turn
        Adapter->>LLM: Stream/call LLM with monotonic timing
        LLM-->>Adapter: Response + elapsed time
        Adapter->>Adapter: Convert elapsed → call_latency_ms
        Adapter->>Adapter: Accumulate into usage.total_llm_latency_ms
        Adapter-->>Server: Returns usage with total_llm_latency_ms
    end
    
    Server->>Server: Aggregate latency across runs:<br/>- Sum all total_llm_latency_ms<br/>- Count samples with data<br/>- Compute mean if ≥50% threshold met
    Server->>Server: Build MeanUsage with<br/>mean_total_llm_latency_ms
    Server-->>Client: Returns MeanUsage object
    
    Client->>Client: Extract mean_total_llm_latency_ms
    Client->>Client: Format via formatLatency()
    Client->>Client: Render in chart/table views
    Client-->>Client: Display "X ms" or "X.X s"

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #509: Modifies litellm_adapter.py and open_ai_types.py to track latency at the adapter level, providing the foundation for this latency-tracking feature.
  • PR #861: Extends run.svelte subtask aggregation logic; this PR builds on that pattern by adding recursive total_llm_latency_ms accumulation alongside cost metrics.
  • PR #934: Introduces cost-based scoring in compare_radar_chart.svelte; this PR generalizes that logic to metricToScore and extends it to support latency as a lower-is-better metric.

Suggested reviewers

  • scosman
  • leonardmq
  • tawnymanticore

🐰 A hop through time, millisecond by millisecond,
From adapter's faithful tick to the UI's gentle call,
Latency now glows in charts and compare views,
Mean values dance when fifty percent approve—
Fast or slow, the data flows! 🚀✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 72.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Add latency to trace' is vague and overly generic, not conveying the specific scope or primary change. Resolution: use a more descriptive title like 'Add per-LLM-call latency tracking to Usage and Eval API' to clarify what aspect of the system is being modified and the key deliverable.

✅ Passed checks (3 passed)

  • Description check (✅ Passed): The PR description covers the main objectives and includes checked items confirming tests were run and added, but lacks a Related Issues link.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces LLM latency tracking across the application, including data model updates, backend measurement in model adapters, and frontend visualizations in charts and run summaries. A significant issue was identified in the streaming adapter where latency measurements would be inflated by the consumer's processing time; a suggestion was provided to measure only the time spent waiting for chunks. Additionally, it is recommended to rename the costToScore function to reflect its broader use for all lower-is-better metrics and to centralize the duplicated latency formatting logic into a shared utility.

Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py Outdated
Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte Outdated
Comment thread app/web_ui/src/routes/(app)/run/run.svelte Outdated

github-actions Bot commented Apr 23, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • app/desktop/studio_server/eval_api.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (100%)
  • libs/core/kiln_ai/datamodel/task_run.py (100%)
  • libs/core/kiln_ai/utils/open_ai_types.py (100%)

Summary

  • Total: 26 lines
  • Missing: 0 lines
  • Coverage: 100%

Fix streaming path to only measure time waiting on LLM chunks, not
time the consumer spends processing yielded chunks. Also fix misleading
test that claimed to test multi-call accumulation but only exercised a
single LLM call.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Function now handles both cost and latency metrics, so the name should
reflect its broader purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@chiang-daniel chiang-daniel marked this pull request as ready for review April 24, 2026 00:04
@chiang-daniel
Contributor Author

@coderabbitai review

Contributor

coderabbitai Bot commented Apr 24, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (3)
app/desktop/studio_server/test_eval_api.py (1)

2202-2359: Add an exact-50% threshold test to lock boundary semantics.

You currently test >50% and <50%, but not exactly 50%, which is where regressions usually slip in.

🧪 Suggested boundary test
+@pytest.mark.asyncio
+async def test_get_run_config_eval_scores_latency_at_threshold(
+    client, mock_task_from_id, mock_task, mock_eval, mock_eval_config, mock_run_config
+):
+    """At exactly 50% availability, latency inclusion should follow intended threshold semantics."""
+    mock_task_from_id.return_value = mock_task
+
+    # 2 runs total, exactly 1 has latency => 50%
+    task_run_1 = TaskRun(
+        input="test input 1",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 1"),
+        usage=Usage(input_tokens=100, output_tokens=50, total_tokens=150, cost=0.005, total_llm_latency_ms=500),
+        parent=mock_task,
+    )
+    task_run_1.save_to_file()
+
+    task_run_2 = TaskRun(
+        input="test input 2",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 2"),
+        usage=Usage(input_tokens=200, output_tokens=100, total_tokens=300, cost=0.010),
+        parent=mock_task,
+    )
+    task_run_2.save_to_file()
+
+    eval_run_1 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.0, "overall_rating": 4.0}, input="i1", output="o1", dataset_id=task_run_1.id, task_run_usage=task_run_1.usage, parent=mock_eval_config)
+    eval_run_1.save_to_file()
+    eval_run_2 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.5, "overall_rating": 4.5}, input="i2", output="o2", dataset_id=task_run_2.id, task_run_usage=task_run_2.usage, parent=mock_eval_config)
+    eval_run_2.save_to_file()
+
+    # ...same mock/patch pattern as neighboring tests...
+    # Assert explicit expected behavior at exactly 50%:
+    # assert mean_usage["mean_total_llm_latency_ms"] is <expected per threshold contract>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/desktop/studio_server/test_eval_api.py` around lines 2202 - 2359, Add a
boundary test to cover the exact 50% case: create a new test (e.g.
test_get_run_config_eval_scores_latency_at_50_percent) that mirrors
test_get_run_config_eval_scores_latency_below_threshold but creates 4
TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having
total_llm_latency_ms set (2/4 = 50%), patch
task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the
same way, call the same endpoint, and assert the behaviour of
mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected
assertion to match the current implementation's semantics—i.e., assert not None
if the code treats >=50% as sufficient, or assert None if it requires >50%).
app/web_ui/src/lib/components/compare_radar_chart.svelte (2)

165-189: Consolidate the COST_KEY / LATENCY_KEY tooltip branches.

The two branches are structurally identical aside from the label/format of the raw value. A small per-metric formatter map keeps future lower-is-better additions one-line changes and prevents subtle drift between the two branches.

♻️ Illustrative refactor
-      } else if (key === COST_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Cost: $${rawValue.toFixed(6)})</span></div>`
-      } else if (key === LATENCY_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        const formatted =
-          rawValue < 1000
-            ? `${Math.round(rawValue)}ms`
-            : `${(rawValue / 1000).toFixed(1)}s`
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Latency: ${formatted})</span></div>`
+      } else if (isLowerIsBetterMetric(key)) {
+        const displayValue = metricToScore(
+          rawValue,
+          lowerIsBetterValues[key] || [],
+        )
+        const rawLabel = key === COST_KEY ? "Mean Cost" : "Mean Latency"
+        const rawFormatted =
+          key === COST_KEY ? `$${rawValue.toFixed(6)}` : formatLatency(rawValue)
+        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(${rawLabel}: ${rawFormatted})</span></div>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 165 -
189, The two branches handling COST_KEY and LATENCY_KEY inside dataKeys.forEach
are duplicated; extract a per-metric raw-value formatter map and use a single
branch that calls metricToScore and then formats the raw value via the map
before composing html. Specifically, update the logic in dataKeys.forEach to
call getKeyLabel and getModelValueRaw as before, compute displayValue via
metricToScore using lowerIsBetterValues[key], and replace the
COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g.,
formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw
string (like currency with 6 decimals or ms/s formatting) and fall back to
rawValue.toFixed(3); then build the html with the unified template that inserts
label, displayValue.toFixed(1), and the formatted raw string.

46-78: Rename parameters to match the generalized helper.

After renaming costToScore to metricToScore, the parameters are still cost / costs, which reads misleadingly when the function is invoked for latency. Renaming them tightens readability without changing behavior.

♻️ Proposed refactor
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, The parameter names cost and costs in the exported function metricToScore
are misleading after renaming; rename them to generic names (e.g., value and
values or metric and metrics) across the function signature and all internal
references inside metricToScore to improve readability without changing
behavior, keeping the same default behavior and return logic in metricToScore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 181-185: Replace the inline latency formatting in
compare_radar_chart.svelte with the shared helper: import and call
formatLatency(ms: number) from app/web_ui/src/lib/utils/formatters.ts instead of
computing `formatted` manually (the block that sets `formatted` and uses it in
the HTML string); update the code that builds `html` to use
formatLatency(rawValue) where the previous `${formatted}` was inserted so all
latency display follows the single source-of-truth.

In `@app/web_ui/src/lib/utils/formatters.ts`:
- Around line 330-333: The formatLatency function can output "1000ms" for values
like 999.6 due to rounding before unit selection; change the logic so unit
selection uses the rounded ms value: compute roundedMs = Math.round(ms) and if
roundedMs < 1000 return `${roundedMs}ms`, otherwise return seconds using the
original ms divided by 1000 (e.g., `${(ms/1000).toFixed(1)}s`) so values that
round to 1000ms become "1.0s" instead of "1000ms".

In `@app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte`:
- Around line 496-501: The latency formatting is leaking into the diff
calculation: when category === "cost" and scoreKey like
"mean_total_llm_latency_ms" you should keep using formatLatency(value) only for
display, and use getModelValueRaw() (the raw millisecond value) when computing
the percent-difference/badge so comparisons remain numeric; update the
percent-difference calculation to call getModelValueRaw() for both sides when
scoreKey corresponds to latency (e.g., "mean_total_llm_latency_ms") and leave
formatLatency() exclusively in the cell rendering path.

In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 171-180: The stream latency currently sums time only for
iterations that yield a chunk and misses the final await after the last chunk;
change the loop in adapter_stream.py to measure each await of the stream instead
of only time inside the async for body: replace the "async for chunk in stream"
with an explicit loop that records time before awaiting stream.__anext__(), adds
the elapsed time to call_latency_seconds after the await (so the final tail wait
is included), yields the chunk, and on StopAsyncIteration records the final
await duration before breaking; keep existing variables call_latency_seconds and
chunk_wait_start semantics so consumer processing time remains excluded.

In `@specs/projects/performance-tracking/architecture.md`:
- Line 222: The documentation references a renamed helper; replace uses of
metricToScore() with the canonical costToScore() so the docs match
functional_spec.md and the implementation. Update the "Speed" radar axis note
(and any other occurrences in this section) to call costToScore() and ensure the
wording still notes "lower is better". Verify the symbol name is exactly
costToScore() wherever metricToScore() appears.

In `@specs/projects/performance-tracking/functional_spec.md`:
- Around line 154-156: The spec still references the old helper name
costToScore(); update the documentation to use the new helper name
metricToScore() and, while here, confirm the spec notes that isCostMetric() now
also matches latency (both are lower-is-better) and that the radar label is
"Speed" (higher = better on chart) to match compare_radar_chart.svelte's current
behavior.

In `@specs/projects/performance-tracking/phase_plans/phase_3.md`:
- Line 13: The spec text incorrectly implies formatLatency() was created in this
phase; update the Phase 3 description for run.svelte to say you used or reused
the existing formatLatency() helper (instead of "added formatLatency()") and
confirm that calculate_subtask_usage(), the new subtask_latency_ms state
variable, and the additions to get_usage_properties() are the actual new changes
in this phase.
- Line 17: The spec incorrectly references costToScore(); update the
documentation to use the actual function name metricToScore() and ensure
references to symbols in compare_radar_chart.svelte (e.g., LATENCY_KEY,
isLowerIsBetterMetric(), lowerIsBetterValues) remain accurate — replace any
mention of costToScore() with metricToScore() and confirm wording indicates
per-key normalization via metricToScore() for cost and latency.

In `@specs/projects/performance-tracking/project_overview.md`:
- Line 30: The doc incorrectly states latency is stored in seconds; update the
spec to say latency is stored in milliseconds (integer/float as applicable) and
adjust examples/units accordingly; reference the observed implementation using
the _ms fields (latency_ms, total_llm_latency_ms, mean_total_llm_latency_ms) and
the formatLatency(ms: number) helper which renders "Xms" for values < 1000 to
ensure the documentation aligns with the code representation and displayed
units.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 470f0af0-7876-452e-b65a-aecd47423ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 7f69c4c and 1a6582a.

📒 Files selected for processing (22)
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
  • libs/core/kiln_ai/datamodel/task_run.py
  • libs/core/kiln_ai/datamodel/test_example_models.py
  • libs/core/kiln_ai/utils/open_ai_types.py
  • libs/core/kiln_ai/utils/test_open_ai_types.py
  • specs/projects/performance-tracking/architecture.md
  • specs/projects/performance-tracking/functional_spec.md
  • specs/projects/performance-tracking/implementation_plan.md
  • specs/projects/performance-tracking/phase_plans/phase_1.md
  • specs/projects/performance-tracking/phase_plans/phase_2.md
  • specs/projects/performance-tracking/phase_plans/phase_3.md
  • specs/projects/performance-tracking/project_overview.md

Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte
Comment thread app/web_ui/src/lib/utils/formatters.ts
Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
Comment thread specs/projects/performance-tracking/architecture.md Outdated
Comment thread specs/projects/performance-tracking/functional_spec.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/project_overview.md Outdated
Extract duplicated latency formatting logic into shared formatLatency()
in formatters.ts. Rename costToScore to metricToScore since it now
handles both cost and latency metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
app/web_ui/src/lib/components/compare_radar_chart.svelte (1)

46-78: Rename cost/costs parameters now that the helper is generalized.

Since metricToScore now scores both cost and latency (and any future lower-is-better metric), the cost-specific parameter names are misleading. Consider value/values (or metric/metrics) for readability — purely cosmetic, no behavior change.

♻️ Proposed rename
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, Rename the cost-specific parameters in metricToScore to neutral names to
reflect general-purpose use: change the parameter names cost and costs to value
and values (or metric and metrics) in the metricToScore function signature and
update all internal references (lo, hi, t, raw, etc. that use those params)
accordingly; also update any call sites that pass arguments to metricToScore to
use the new parameter names if they reference them by name (or simply ensure
positional calls still work), and run the test/build to verify no unresolved
identifiers remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa3a98bb-f410-442e-8d7d-d4fc9c6b199f

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6582a and 286c0a1.

📒 Files selected for processing (6)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
✅ Files skipped from review due to trivial changes (1)
  • app/web_ui/src/lib/utils/formatters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • app/web_ui/src/routes/(app)/run/run.svelte

"Model returned an assistant message, but no content or tool calls. This is not supported."
)

response_choice.message._latency_ms = call_latency_ms # type: ignore[attr-defined]
Collaborator
This seems unsafe: `message` is not a type we control (it comes straight out of LiteLLM), and here we monkeypatch an arbitrary private property onto it.

If we only need this to pass the value on to our `_messages` list, we should instead create a new object of type `ChatCompletionMessageIncludingLiteLLM` (which we do control, so it can have a proper latency field) and copy the data into it.
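A minimal sketch of that wrapper approach, assuming a dataclass we own that copies the fields read from LiteLLM's message (the class name and field set here are hypothetical, not the repo's actual `ChatCompletionMessageIncludingLiteLLM` type):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ChatCompletionMessageWithLatency:
    """Hypothetical wrapper we control: copies the fields we need from
    LiteLLM's message object and carries latency as a first-class field,
    instead of monkeypatching a private attribute onto LiteLLM's type."""

    role: str
    content: Optional[str] = None
    tool_calls: Optional[list[Any]] = None
    latency_ms: Optional[float] = None

    @classmethod
    def from_litellm(
        cls, message: Any, latency_ms: float
    ) -> "ChatCompletionMessageWithLatency":
        # getattr with defaults so partial LiteLLM messages still convert
        return cls(
            role=getattr(message, "role", "assistant"),
            content=getattr(message, "content", None),
            tool_calls=getattr(message, "tool_calls", None),
            latency_ms=latency_ms,
        )
```

The call site then becomes `self._messages.append(ChatCompletionMessageWithLatency.from_litellm(response_choice.message, call_latency_ms))` with no `# type: ignore` needed.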

# Accumulate time spent waiting on the LLM for this chunk
call_latency_seconds += time.monotonic() - chunk_wait_start
yield chunk
# Reset timer after yield returns — excludes consumer processing time
Collaborator
I wonder if this makes sense. Excluding the consumer's processing time is reasonable in principle, but I suspect (and I could be wrong) that while the consumer is doing its work, the stream keeps buffering and accumulating chunks in the background. The next loop iteration would then not wait at all, making the chunk look instant, when really the consumer's work just served as unmeasured waiting time.

Not sure what to think of this though
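For concreteness, a minimal sketch of the reset-after-yield pattern under discussion (this is not the PR's actual `AdapterStream`; names and structure are assumptions for illustration):

```python
import asyncio
import time


class LatencyTimedStream:
    """Wrap an async iterator and accumulate only the time spent awaiting
    each chunk. The timer resets after each yield, so consumer processing
    is excluded. The buffering caveat applies: if the transport keeps
    buffering while the consumer works, the next await returns almost
    instantly and that real wait goes unmeasured."""

    def __init__(self, source):
        self.source = source
        self.total_wait_seconds = 0.0

    async def __aiter__(self):
        wait_start = time.monotonic()
        async for chunk in self.source:
            # Time spent waiting on the producer for this chunk
            self.total_wait_seconds += time.monotonic() - wait_start
            yield chunk
            # Reset after the consumer hands control back to us
            wait_start = time.monotonic()


async def _demo():
    async def slow_source():
        for i in range(3):
            await asyncio.sleep(0.01)  # stand-in for network latency
            yield i

    stream = LatencyTimedStream(slow_source())
    chunks = [c async for c in stream]
    return chunks, stream.total_wait_seconds


chunks, waited = asyncio.run(_demo())
```

In this toy setup the producer genuinely sleeps, so the measured wait tracks it; with a real HTTP stream, chunks buffered during consumer work would make the measured wait an underestimate, which is the concern raised above.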

)

# Add message to messages, so it can be used in the next turn
response_choice.message._latency_ms = call_latency_ms # type: ignore[attr-defined]
Collaborator
Same comment here about assigning to a monkeypatch or private property being dodgy

