
Add latency to trace #1340

Open
chiang-daniel wants to merge 6 commits into main from KIL-542/perf

Conversation

Contributor

@chiang-daniel chiang-daniel commented Apr 23, 2026

What does this PR do?

Add per-LLM-call latency tracking to Trace.

  • Add total_llm_latency_ms to the usage model to track overall LLM latency, alongside cost and tokens
  • Add mean_total_llm_latency_ms to the eval API for comparisons
  • Update the run view, compare table, radar chart, and metric correlation plot

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

Summary by CodeRabbit

  • New Features

    • Added LLM latency tracking across all model interactions
    • Latency metrics now available in evaluation results and API responses
    • Enhanced comparison views to display latency alongside cost metrics
    • Added latency visualization in comparison charts and radar charts
    • Run details now show total and subtask latency breakdowns
  • Tests

    • Added comprehensive test coverage for latency calculation and data aggregation

chiang-daniel and others added 3 commits April 23, 2026 14:08
Add per-call latency timing (latency_ms) to trace messages and
cumulative total_llm_latency_ms to the Usage model. Instrument both
LiteLlmAdapter and AdapterStream to measure wall-clock time of each
LLM API call using time.monotonic, accumulate across multi-turn tool
call loops, and propagate latency into trace messages. Update
Usage.__add__ to support the new field, regenerate the OpenAPI schema,
and add comprehensive tests for field validation, addition semantics,
backwards compatibility, and adapter-level latency tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
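The accumulation scheme this commit describes can be sketched as follows (a minimal illustration with a simplified stand-in Usage class and a hypothetical timed_call helper, not the actual adapter code):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Usage:
    # Simplified stand-in for the real Usage model
    total_llm_latency_ms: Optional[float] = None

    def __add__(self, other: "Usage") -> "Usage":
        # None means "no data": stay None only when neither side has a value
        if self.total_llm_latency_ms is None and other.total_llm_latency_ms is None:
            return Usage(total_llm_latency_ms=None)
        return Usage(
            total_llm_latency_ms=(self.total_llm_latency_ms or 0)
            + (other.total_llm_latency_ms or 0)
        )

def timed_call(fn: Callable[[], str], usage: Usage) -> str:
    # time.monotonic is immune to wall-clock adjustments, so deltas are safe
    start = time.monotonic()
    result = fn()
    elapsed_ms = (time.monotonic() - start) * 1000
    usage.total_llm_latency_ms = (usage.total_llm_latency_ms or 0) + elapsed_ms
    return result
```

Accumulating across a multi-turn tool-call loop is then just invoking the timing wrapper once per LLM round trip against the same Usage instance.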
Add mean_total_llm_latency_ms to MeanUsage model and wire up
accumulators in eval_api.py to compute average latency across eval
runs, respecting the existing 50% data-availability threshold.
Regenerate OpenAPI schema and add tests for both above- and
below-threshold latency scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
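The 50% data-availability rule mentioned above can be sketched with a hypothetical helper (the real accumulators live in eval_api.py; whether exactly 50% counts as sufficient should be checked against the code):

```python
from typing import Optional

def mean_with_threshold(
    values: list[Optional[float]], threshold: float = 0.5
) -> Optional[float]:
    """Mean of the non-None samples, but only when at least `threshold`
    of all samples carry data; otherwise None (not enough coverage)."""
    if not values:
        return None
    present = [v for v in values if v is not None]
    if len(present) / len(values) < threshold:
        return None
    return sum(present) / len(present)
```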
Add latency tracking to all frontend display surfaces:
- Run page: "Total Latency" and "Subtasks Latency" in usage properties
- Compare table: "Latency" row in the usage section
- Radar chart: "Speed" axis with lower-is-better scoring
- Scatter plot: latency available as selectable axis

Rename compare section from "Average Usage & Cost" to
"Average Usage, Cost & Latency". Refactor radar chart to support
multiple lower-is-better metrics with independent normalization.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

This PR adds end-to-end LLM latency tracking: measuring per-model-turn latency in adapters using monotonic timing, accumulating values in the Usage model, computing mean latency across eval runs (with a 50% data threshold), exposing latency fields through API schemas, and rendering formatted metrics in the web UI across multiple views (compare charts, radar charts, run details).

Changes

  • Core Data Model & Types
    libs/core/kiln_ai/datamodel/task_run.py, libs/core/kiln_ai/utils/open_ai_types.py, libs/core/kiln_ai/datamodel/test_example_models.py, libs/core/kiln_ai/utils/test_open_ai_types.py
    Adds a total_llm_latency_ms field to Usage with non-negative validation and aggregation logic; extends the ChatCompletionAssistantMessageParamWrapper TypedDict with an optional latency_ms field; adds validation and backward-compatibility tests.
  • Model Adapter Latency Instrumentation
    libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py, libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py, libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
    Instruments streaming and synchronous adapters to measure call duration using monotonic timing, accumulate it into usage.total_llm_latency_ms, and annotate assistant messages with a _latency_ms field; includes comprehensive mock-based test coverage for latency accumulation and trace propagation.
  • Eval API Mean Latency Computation
    app/desktop/studio_server/eval_api.py, app/desktop/studio_server/test_eval_api.py
    Extends the MeanUsage model with an optional mean_total_llm_latency_ms field; aggregates per-eval-run latency values across a run config, computing the mean only when ≥50% of samples have data; adds dual test cases covering threshold pass and fail scenarios.
  • API Schema Generation
    app/web_ui/src/lib/api_schema.d.ts
    Exposes the new latency fields in the generated TypeScript schema: ChatCompletionAssistantMessageParamWrapper gains a per-message latency_ms, MeanUsage gains mean_total_llm_latency_ms, and Usage gains total_llm_latency_ms.
  • Latency Formatting Utility
    app/web_ui/src/lib/utils/formatters.ts
    Adds a formatLatency function that converts millisecond values to human-readable strings (milliseconds for values under 1000 ms, seconds with one decimal at or above).
  • Compare Chart Components
    app/web_ui/src/lib/components/compare_chart.svelte, app/web_ui/src/lib/components/compare_radar_chart.svelte
    Extends chart tooltip/axis formatting to detect and apply latency-specific rendering; generalizes the radar chart from cost-only scoring to multiple "lower-is-better" metrics (cost and latency) by renaming costToScore to metricToScore and updating the normalization/scaling logic.
  • Run Details & Compare Page UI
    app/web_ui/src/routes/(app)/run/run.svelte, app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
    Extends subtask usage aggregation to recursively accumulate total_llm_latency_ms; adds latency state management and request-race logic; computes and displays "Total Latency" and "Subtasks Latency" with loading states in the run view; adds a latency metric to the compare page cost section with raw-value extraction and formatted table rendering.
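The formatLatency rule summarized above lives in TypeScript (app/web_ui/src/lib/utils/formatters.ts); as described, its behavior is roughly the following Python sketch (illustrative only, not the actual helper):

```python
def format_latency(ms: float) -> str:
    # Milliseconds below one second; seconds with one decimal at or above it
    if ms < 1000:
        return f"{round(ms)}ms"
    return f"{ms / 1000:.1f}s"
```

Note a review comment later in this thread flags a boundary case with this shape: a value like 999.6 rounds to "1000ms" because the unit is chosen before rounding.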

Sequence Diagram

sequenceDiagram
    participant Client as Client (UI)
    participant Server as Eval API Server
    participant DB as Database
    participant Adapter as Model Adapter
    participant LLM as LLM Provider

    Client->>Server: Request eval scores for run config
    Server->>DB: Query eval runs for run config
    DB-->>Server: Returns eval run records with usage data
    
    loop For each eval run
        Server->>Adapter: (historical) Measure model turn
        Adapter->>LLM: Stream/call LLM with monotonic timing
        LLM-->>Adapter: Response + elapsed time
        Adapter->>Adapter: Convert elapsed → call_latency_ms
        Adapter->>Adapter: Accumulate into usage.total_llm_latency_ms
        Adapter-->>Server: Returns usage with total_llm_latency_ms
    end
    
    Server->>Server: Aggregate latency across runs:<br/>- Sum all total_llm_latency_ms<br/>- Count samples with data<br/>- Compute mean if ≥50% threshold met
    Server->>Server: Build MeanUsage with<br/>mean_total_llm_latency_ms
    Server-->>Client: Returns MeanUsage object
    
    Client->>Client: Extract mean_total_llm_latency_ms
    Client->>Client: Format via formatLatency()
    Client->>Client: Render in chart/table views
    Client-->>Client: Display "X ms" or "X.X s"

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #509: Modifies litellm_adapter.py and open_ai_types.py to track latency at the adapter level, providing the foundation for this latency-tracking feature.
  • PR #861: Extends run.svelte subtask aggregation logic; this PR builds on that pattern by adding recursive total_llm_latency_ms accumulation alongside cost metrics.
  • PR #934: Introduces cost-based scoring in compare_radar_chart.svelte; this PR generalizes that logic to metricToScore and extends it to support latency as a lower-is-better metric.

Suggested reviewers

  • scosman
  • leonardmq
  • tawnymanticore

🐰 A hop through time, millisecond by millisecond,
From adapter's faithful tick to the UI's gentle call,
Latency now glows in charts and compare views,
Mean values dance when fifty percent approve—
Fast or slow, the data flows! 🚀✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 72.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Add latency to trace' is vague and overly generic, not conveying the specific scope or primary change. Resolution: use a more descriptive title like 'Add per-LLM-call latency tracking to Usage and Eval API' to clarify what aspect of the system is being modified and the key deliverable.

✅ Passed checks (3 passed)

  • Description check (✅ Passed): The PR description covers the main objectives and includes checked items confirming tests were run and added, but lacks a Related Issues link.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces LLM latency tracking across the application, including data model updates, backend measurement in model adapters, and frontend visualizations in charts and run summaries. A significant issue was identified in the streaming adapter where latency measurements would be inflated by the consumer's processing time; a suggestion was provided to measure only the time spent waiting for chunks. Additionally, it is recommended to rename the costToScore function to reflect its broader use for all lower-is-better metrics and to centralize the duplicated latency formatting logic into a shared utility.

Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py Outdated
Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte Outdated
Comment thread app/web_ui/src/routes/(app)/run/run.svelte Outdated

github-actions Bot commented Apr 23, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • app/desktop/studio_server/eval_api.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (100%)
  • libs/core/kiln_ai/datamodel/task_run.py (100%)
  • libs/core/kiln_ai/utils/open_ai_types.py (100%)

Summary

  • Total: 26 lines
  • Missing: 0 lines
  • Coverage: 100%

Fix streaming path to only measure time waiting on LLM chunks, not
time the consumer spends processing yielded chunks. Also fix misleading
test that claimed to test multi-call accumulation but only exercised a
single LLM call.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Function now handles both cost and latency metrics, so the name should
reflect its broader purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@chiang-daniel chiang-daniel marked this pull request as ready for review April 24, 2026 00:04
@chiang-daniel
Contributor Author

@coderabbitai review

Contributor

coderabbitai Bot commented Apr 24, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (3)
app/desktop/studio_server/test_eval_api.py (1)

2202-2359: Add an exact-50% threshold test to lock boundary semantics.

You currently test >50% and <50%, but not exactly 50%, which is where regressions usually slip in.

🧪 Suggested boundary test
+@pytest.mark.asyncio
+async def test_get_run_config_eval_scores_latency_at_threshold(
+    client, mock_task_from_id, mock_task, mock_eval, mock_eval_config, mock_run_config
+):
+    """At exactly 50% availability, latency inclusion should follow intended threshold semantics."""
+    mock_task_from_id.return_value = mock_task
+
+    # 2 runs total, exactly 1 has latency => 50%
+    task_run_1 = TaskRun(
+        input="test input 1",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 1"),
+        usage=Usage(input_tokens=100, output_tokens=50, total_tokens=150, cost=0.005, total_llm_latency_ms=500),
+        parent=mock_task,
+    )
+    task_run_1.save_to_file()
+
+    task_run_2 = TaskRun(
+        input="test input 2",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 2"),
+        usage=Usage(input_tokens=200, output_tokens=100, total_tokens=300, cost=0.010),
+        parent=mock_task,
+    )
+    task_run_2.save_to_file()
+
+    eval_run_1 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.0, "overall_rating": 4.0}, input="i1", output="o1", dataset_id=task_run_1.id, task_run_usage=task_run_1.usage, parent=mock_eval_config)
+    eval_run_1.save_to_file()
+    eval_run_2 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.5, "overall_rating": 4.5}, input="i2", output="o2", dataset_id=task_run_2.id, task_run_usage=task_run_2.usage, parent=mock_eval_config)
+    eval_run_2.save_to_file()
+
+    # ...same mock/patch pattern as neighboring tests...
+    # Assert explicit expected behavior at exactly 50%:
+    # assert mean_usage["mean_total_llm_latency_ms"] is <expected per threshold contract>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/desktop/studio_server/test_eval_api.py` around lines 2202 - 2359, Add a
boundary test to cover the exact 50% case: create a new test (e.g.
test_get_run_config_eval_scores_latency_at_50_percent) that mirrors
test_get_run_config_eval_scores_latency_below_threshold but creates 4
TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having
total_llm_latency_ms set (2/4 = 50%), patch
task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the
same way, call the same endpoint, and assert the behaviour of
mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected
assertion to match the current implementation's semantics—i.e., assert not None
if the code treats >=50% as sufficient, or assert None if it requires >50%).
app/web_ui/src/lib/components/compare_radar_chart.svelte (2)

165-189: Consolidate the COST_KEY / LATENCY_KEY tooltip branches.

The two branches are structurally identical aside from the label/format of the raw value. A small per-metric formatter map keeps future lower-is-better additions one-line changes and prevents subtle drift between the two branches.

♻️ Illustrative refactor
-      } else if (key === COST_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Cost: $${rawValue.toFixed(6)})</span></div>`
-      } else if (key === LATENCY_KEY) {
-        const displayValue = metricToScore(
-          rawValue,
-          lowerIsBetterValues[key] || [],
-        )
-        const formatted =
-          rawValue < 1000
-            ? `${Math.round(rawValue)}ms`
-            : `${(rawValue / 1000).toFixed(1)}s`
-        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Latency: ${formatted})</span></div>`
+      } else if (isLowerIsBetterMetric(key)) {
+        const displayValue = metricToScore(
+          rawValue,
+          lowerIsBetterValues[key] || [],
+        )
+        const rawLabel = key === COST_KEY ? "Mean Cost" : "Mean Latency"
+        const rawFormatted =
+          key === COST_KEY ? `$${rawValue.toFixed(6)}` : formatLatency(rawValue)
+        html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(${rawLabel}: ${rawFormatted})</span></div>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 165 -
189, The two branches handling COST_KEY and LATENCY_KEY inside dataKeys.forEach
are duplicated; extract a per-metric raw-value formatter map and use a single
branch that calls metricToScore and then formats the raw value via the map
before composing html. Specifically, update the logic in dataKeys.forEach to
call getKeyLabel and getModelValueRaw as before, compute displayValue via
metricToScore using lowerIsBetterValues[key], and replace the
COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g.,
formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw
string (like currency with 6 decimals or ms/s formatting) and fall back to
rawValue.toFixed(3); then build the html with the unified template that inserts
label, displayValue.toFixed(1), and the formatted raw string.

46-78: Rename parameters to match the generalized helper.

After renaming costToScore to metricToScore, the parameters are still cost / costs, which reads misleadingly when the function is invoked for latency. Renaming them tightens readability without changing behavior.

♻️ Proposed refactor
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, The parameter names cost and costs in the exported function metricToScore
are misleading after renaming; rename them to generic names (e.g., value and
values or metric and metrics) across the function signature and all internal
references inside metricToScore to improve readability without changing
behavior, keeping the same default behavior and return logic in metricToScore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 181-185: Replace the inline latency formatting in
compare_radar_chart.svelte with the shared helper: import and call
formatLatency(ms: number) from app/web_ui/src/lib/utils/formatters.ts instead of
computing `formatted` manually (the block that sets `formatted` and uses it in
the HTML string); update the code that builds `html` to use
formatLatency(rawValue) where the previous `${formatted}` was inserted so all
latency display follows the single source-of-truth.

In `@app/web_ui/src/lib/utils/formatters.ts`:
- Around line 330-333: The formatLatency function can output "1000ms" for values
like 999.6 due to rounding before unit selection; change the logic so unit
selection uses the rounded ms value: compute roundedMs = Math.round(ms) and if
roundedMs < 1000 return `${roundedMs}ms`, otherwise return seconds using the
original ms divided by 1000 (e.g., `${(ms/1000).toFixed(1)}s`) so values that
round to 1000ms become "1.0s" instead of "1000ms".

In `@app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte`:
- Around line 496-501: The latency formatting is leaking into the diff
calculation: when category === "cost" and scoreKey like
"mean_total_llm_latency_ms" you should keep using formatLatency(value) only for
display, and use getModelValueRaw() (the raw millisecond value) when computing
the percent-difference/badge so comparisons remain numeric; update the
percent-difference calculation to call getModelValueRaw() for both sides when
scoreKey corresponds to latency (e.g., "mean_total_llm_latency_ms") and leave
formatLatency() exclusively in the cell rendering path.

In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 171-180: The stream latency currently sums time only for
iterations that yield a chunk and misses the final await after the last chunk;
change the loop in adapter_stream.py to measure each await of the stream instead
of only time inside the async for body: replace the "async for chunk in stream"
with an explicit loop that records time before awaiting stream.__anext__(), adds
the elapsed time to call_latency_seconds after the await (so the final tail wait
is included), yields the chunk, and on StopAsyncIteration records the final
await duration before breaking; keep existing variables call_latency_seconds and
chunk_wait_start semantics so consumer processing time remains excluded.

In `@specs/projects/performance-tracking/architecture.md`:
- Line 222: The documentation references a renamed helper; replace uses of
metricToScore() with the canonical costToScore() so the docs match
functional_spec.md and the implementation. Update the "Speed" radar axis note
(and any other occurrences in this section) to call costToScore() and ensure the
wording still notes "lower is better". Verify the symbol name is exactly
costToScore() wherever metricToScore() appears.

In `@specs/projects/performance-tracking/functional_spec.md`:
- Around line 154-156: The spec still references the old helper name
costToScore(); update the documentation to use the new helper name
metricToScore() and, while here, confirm the spec notes that isCostMetric() now
also matches latency (both are lower-is-better) and that the radar label is
"Speed" (higher = better on chart) to match compare_radar_chart.svelte's current
behavior.

In `@specs/projects/performance-tracking/phase_plans/phase_3.md`:
- Line 13: The spec text incorrectly implies formatLatency() was created in this
phase; update the Phase 3 description for run.svelte to say you used or reused
the existing formatLatency() helper (instead of "added formatLatency()") and
confirm that calculate_subtask_usage(), the new subtask_latency_ms state
variable, and the additions to get_usage_properties() are the actual new changes
in this phase.
- Line 17: The spec incorrectly references costToScore(); update the
documentation to use the actual function name metricToScore() and ensure
references to symbols in compare_radar_chart.svelte (e.g., LATENCY_KEY,
isLowerIsBetterMetric(), lowerIsBetterValues) remain accurate — replace any
mention of costToScore() with metricToScore() and confirm wording indicates
per-key normalization via metricToScore() for cost and latency.

In `@specs/projects/performance-tracking/project_overview.md`:
- Line 30: The doc incorrectly states latency is stored in seconds; update the
spec to say latency is stored in milliseconds (integer/float as applicable) and
adjust examples/units accordingly; reference the observed implementation using
the _ms fields (latency_ms, total_llm_latency_ms, mean_total_llm_latency_ms) and
the formatLatency(ms: number) helper which renders "Xms" for values < 1000 to
ensure the documentation aligns with the code representation and displayed
units.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 470f0af0-7876-452e-b65a-aecd47423ea4

📥 Commits

Reviewing files that changed from the base of the PR and between 7f69c4c and 1a6582a.

📒 Files selected for processing (22)
  • app/desktop/studio_server/eval_api.py
  • app/desktop/studio_server/test_eval_api.py
  • app/web_ui/src/lib/api_schema.d.ts
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
  • libs/core/kiln_ai/datamodel/task_run.py
  • libs/core/kiln_ai/datamodel/test_example_models.py
  • libs/core/kiln_ai/utils/open_ai_types.py
  • libs/core/kiln_ai/utils/test_open_ai_types.py
  • specs/projects/performance-tracking/architecture.md
  • specs/projects/performance-tracking/functional_spec.md
  • specs/projects/performance-tracking/implementation_plan.md
  • specs/projects/performance-tracking/phase_plans/phase_1.md
  • specs/projects/performance-tracking/phase_plans/phase_2.md
  • specs/projects/performance-tracking/phase_plans/phase_3.md
  • specs/projects/performance-tracking/project_overview.md

Comment thread app/web_ui/src/lib/components/compare_radar_chart.svelte
Comment thread app/web_ui/src/lib/utils/formatters.ts
Comment thread libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
Comment thread specs/projects/performance-tracking/architecture.md Outdated
Comment thread specs/projects/performance-tracking/functional_spec.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/phase_plans/phase_3.md Outdated
Comment thread specs/projects/performance-tracking/project_overview.md Outdated
Extract duplicated latency formatting logic into shared formatLatency()
in formatters.ts. Rename costToScore to metricToScore since it now
handles both cost and latency metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
app/web_ui/src/lib/components/compare_radar_chart.svelte (1)

46-78: Rename cost/costs parameters now that the helper is generalized.

Since metricToScore now scores both cost and latency (and any future lower-is-better metric), the cost-specific parameter names are misleading. Consider value/values (or metric/metrics) for readability — purely cosmetic, no behavior change.

♻️ Proposed rename
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)

     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 -
78, Rename the cost-specific parameters in metricToScore to neutral names to
reflect general-purpose use: change the parameter names cost and costs to value
and values (or metric and metrics) in the metricToScore function signature and
update all internal references (lo, hi, t, raw, etc. that use those params)
accordingly; also update any call sites that pass arguments to metricToScore to
use the new parameter names if they reference them by name (or simply ensure
positional calls still work), and run the test/build to verify no unresolved
identifiers remain.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: aa3a98bb-f410-442e-8d7d-d4fc9c6b199f

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6582a and 286c0a1.

📒 Files selected for processing (6)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • app/web_ui/src/lib/components/compare_radar_chart.svelte
  • app/web_ui/src/lib/utils/formatters.ts
  • app/web_ui/src/routes/(app)/run/run.svelte
  • app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
✅ Files skipped from review due to trivial changes (1)
  • app/web_ui/src/lib/utils/formatters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • app/web_ui/src/lib/components/compare_chart.svelte
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • app/web_ui/src/routes/(app)/run/run.svelte

"Model returned an assistant message, but no content or tool calls. This is not supported."
)

response_choice.message._latency_ms = call_latency_ms # type: ignore[attr-defined]
Collaborator
This seems unsafe: `message` is not a type we control (it comes straight out of LiteLLM), and here we monkeypatch an arbitrary private property onto it.

If we only need this to pass the value on to our `_messages` list, we should instead create a new object of type `ChatCompletionMessageIncludingLiteLLM` (which we do control, so it can have a proper latency field) and copy the data into it.
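A minimal sketch of that wrapper approach, assuming a dataclass we own that copies the fields read from LiteLLM's message (the class name and field set here are hypothetical, not the repo's actual `ChatCompletionMessageIncludingLiteLLM` type):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ChatCompletionMessageWithLatency:
    """Hypothetical wrapper we control: copies the fields we need from
    LiteLLM's message object and carries latency as a first-class field,
    instead of monkeypatching a private attribute onto LiteLLM's type."""

    role: str
    content: Optional[str] = None
    tool_calls: Optional[list[Any]] = None
    latency_ms: Optional[float] = None

    @classmethod
    def from_litellm(
        cls, message: Any, latency_ms: float
    ) -> "ChatCompletionMessageWithLatency":
        # getattr with defaults so partial LiteLLM messages still convert
        return cls(
            role=getattr(message, "role", "assistant"),
            content=getattr(message, "content", None),
            tool_calls=getattr(message, "tool_calls", None),
            latency_ms=latency_ms,
        )
```

The call site then becomes `self._messages.append(ChatCompletionMessageWithLatency.from_litellm(response_choice.message, call_latency_ms))` with no `# type: ignore` needed.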

# Accumulate time spent waiting on the LLM for this chunk
call_latency_seconds += time.monotonic() - chunk_wait_start
yield chunk
# Reset timer after yield returns — excludes consumer processing time
Collaborator
I wonder if this makes sense. Excluding the consumer's processing time is reasonable in principle, but I suspect (and I could be wrong) that while the consumer is doing its work, the stream keeps buffering and accumulating chunks in the background. The next loop iteration would then not wait at all, making the chunk look instant, when really the consumer's work just served as unmeasured waiting time.

Not sure what to think of this though
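For concreteness, a minimal sketch of the reset-after-yield pattern under discussion (this is not the PR's actual `AdapterStream`; names and structure are assumptions for illustration):

```python
import asyncio
import time


class LatencyTimedStream:
    """Wrap an async iterator and accumulate only the time spent awaiting
    each chunk. The timer resets after each yield, so consumer processing
    is excluded. The buffering caveat applies: if the transport keeps
    buffering while the consumer works, the next await returns almost
    instantly and that real wait goes unmeasured."""

    def __init__(self, source):
        self.source = source
        self.total_wait_seconds = 0.0

    async def __aiter__(self):
        wait_start = time.monotonic()
        async for chunk in self.source:
            # Time spent waiting on the producer for this chunk
            self.total_wait_seconds += time.monotonic() - wait_start
            yield chunk
            # Reset after the consumer hands control back to us
            wait_start = time.monotonic()


async def _demo():
    async def slow_source():
        for i in range(3):
            await asyncio.sleep(0.01)  # stand-in for network latency
            yield i

    stream = LatencyTimedStream(slow_source())
    chunks = [c async for c in stream]
    return chunks, stream.total_wait_seconds


chunks, waited = asyncio.run(_demo())
```

In this toy setup the producer genuinely sleeps, so the measured wait tracks it; with a real HTTP stream, chunks buffered during consumer work would make the measured wait an underestimate, which is the concern raised above.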

)

# Add message to messages, so it can be used in the next turn
response_choice.message._latency_ms = call_latency_ms # type: ignore[attr-defined]
Collaborator
Same comment here about assigning to a monkeypatch or private property being dodgy

