Conversation
Add per-call latency timing (latency_ms) to trace messages and cumulative total_llm_latency_ms to the Usage model. Instrument both LiteLlmAdapter and AdapterStream to measure wall-clock time of each LLM API call using time.monotonic, accumulate across multi-turn tool call loops, and propagate latency into trace messages. Update Usage.__add__ to support the new field, regenerate the OpenAPI schema, and add comprehensive tests for field validation, addition semantics, backwards compatibility, and adapter-level latency tracking. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
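The core timing pattern this commit describes can be sketched as follows. This is a hedged, self-contained illustration: the real `Usage` model is a Pydantic model with more fields, and `timed_call` is a stand-in for the adapter's actual LLM invocation, not the project's API.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    # Hypothetical minimal stand-in for the real Usage model, which has more fields.
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_llm_latency_ms: Optional[float] = None

    def __add__(self, other: "Usage") -> "Usage":
        def add_opt(a, b):
            # Treat None as "no data": None + None stays None; otherwise sum what exists.
            if a is None and b is None:
                return None
            return (a or 0) + (b or 0)

        return Usage(
            input_tokens=add_opt(self.input_tokens, other.input_tokens),
            output_tokens=add_opt(self.output_tokens, other.output_tokens),
            total_llm_latency_ms=add_opt(
                self.total_llm_latency_ms, other.total_llm_latency_ms
            ),
        )


def timed_call(fn, *args, **kwargs):
    """Return (result, latency_ms) for one call, timed with time.monotonic.

    time.monotonic is immune to wall-clock adjustments, which makes it the
    right clock for measuring elapsed time of an API call.
    """
    start = time.monotonic()
    result = fn(*args, **kwargs)
    latency_ms = (time.monotonic() - start) * 1000
    return result, latency_ms
```

Accumulation across a multi-turn tool-call loop then falls out of `Usage.__add__`: sum a per-call `Usage(total_llm_latency_ms=...)` into a running total on each turn.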
Add mean_total_llm_latency_ms to MeanUsage model and wire up accumulators in eval_api.py to compute average latency across eval runs, respecting the existing 50% data-availability threshold. Regenerate OpenAPI schema and add tests for both above- and below-threshold latency scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add latency tracking to all frontend display surfaces:
- Run page: "Total Latency" and "Subtasks Latency" in usage properties
- Compare table: "Latency" row in the usage section
- Radar chart: "Speed" axis with lower-is-better scoring
- Scatter plot: latency available as selectable axis

Rename compare section from "Average Usage & Cost" to "Average Usage, Cost & Latency". Refactor radar chart to support multiple lower-is-better metrics with independent normalization. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
📝 Walkthrough

This PR adds end-to-end LLM latency tracking: measuring per-model-turn latency in adapters using monotonic timing, accumulating values in the Usage model, and surfacing the results through the eval API and frontend.
Sequence Diagram

sequenceDiagram
participant Client as Client (UI)
participant Server as Eval API Server
participant DB as Database
participant Adapter as Model Adapter
participant LLM as LLM Provider
Client->>Server: Request eval scores for run config
Server->>DB: Query eval runs for run config
DB-->>Server: Returns eval run records with usage data
loop For each eval run
Server->>Adapter: (historical) Measure model turn
Adapter->>LLM: Stream/call LLM with monotonic timing
LLM-->>Adapter: Response + elapsed time
Adapter->>Adapter: Convert elapsed → call_latency_ms
Adapter->>Adapter: Accumulate into usage.total_llm_latency_ms
Adapter-->>Server: Returns usage with total_llm_latency_ms
end
Server->>Server: Aggregate latency across runs:<br/>- Sum all total_llm_latency_ms<br/>- Count samples with data<br/>- Compute mean if ≥50% threshold met
Server->>Server: Build MeanUsage with<br/>mean_total_llm_latency_ms
Server-->>Client: Returns MeanUsage object
Client->>Client: Extract mean_total_llm_latency_ms
Client->>Client: Format via formatLatency()
Client->>Client: Render in chart/table views
Client-->>Client: Display "X ms" or "X.X s"
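The aggregation step at the bottom of the diagram can be sketched as a small helper. This is a simplified stand-in for the accumulator logic in eval_api.py, not the actual implementation; whether exactly 50% availability counts as sufficient is the boundary case flagged later in the review, and this sketch treats >=50% as enough.

```python
from typing import List, Optional


def mean_latency_ms(latencies: List[Optional[float]]) -> Optional[float]:
    """Mean of the available latency samples, or None when fewer than half
    of the runs carry latency data (the 50% data-availability threshold)."""
    if not latencies:
        return None
    available = [v for v in latencies if v is not None]
    # >=50% of runs must have latency data for the mean to be meaningful.
    if len(available) < len(latencies) / 2:
        return None
    return sum(available) / len(available)
```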
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Code Review
This pull request introduces LLM latency tracking across the application, including data model updates, backend measurement in model adapters, and frontend visualizations in charts and run summaries. A significant issue was identified in the streaming adapter where latency measurements would be inflated by the consumer's processing time; a suggestion was provided to measure only the time spent waiting for chunks. Additionally, it is recommended to rename the costToScore function to reflect its broader use for all lower-is-better metrics and to centralize the duplicated latency formatting logic into a shared utility.
📊 Coverage Report: Overall Coverage: 92% (diff: origin/main...HEAD)
Fix streaming path to only measure time waiting on LLM chunks, not time the consumer spends processing yielded chunks. Also fix misleading test that claimed to test multi-call accumulation but only exercised a single LLM call. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Force-pushed from `1d2a084` to `f6391f0` (Compare)
Function now handles both cost and latency metrics, so the name should reflect its broader purpose. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 9
🧹 Nitpick comments (3)
app/desktop/studio_server/test_eval_api.py (1)
2202-2359: Add an exact-50% threshold test to lock boundary semantics. You currently test `>50%` and `<50%`, but not exactly `50%`, which is where regressions usually slip in.

🧪 Suggested boundary test
```diff
+@pytest.mark.asyncio
+async def test_get_run_config_eval_scores_latency_at_threshold(
+    client, mock_task_from_id, mock_task, mock_eval, mock_eval_config, mock_run_config
+):
+    """At exactly 50% availability, latency inclusion should follow intended threshold semantics."""
+    mock_task_from_id.return_value = mock_task
+
+    # 2 runs total, exactly 1 has latency => 50%
+    task_run_1 = TaskRun(
+        input="test input 1",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 1"),
+        usage=Usage(input_tokens=100, output_tokens=50, total_tokens=150, cost=0.005, total_llm_latency_ms=500),
+        parent=mock_task,
+    )
+    task_run_1.save_to_file()
+
+    task_run_2 = TaskRun(
+        input="test input 2",
+        input_source=DataSource(
+            type=DataSourceType.synthetic,
+            properties={
+                "model_name": "gpt-4",
+                "model_provider": "openai",
+                "adapter_name": "langchain_adapter",
+            },
+        ),
+        output=TaskOutput(output="test output 2"),
+        usage=Usage(input_tokens=200, output_tokens=100, total_tokens=300, cost=0.010),
+        parent=mock_task,
+    )
+    task_run_2.save_to_file()
+
+    eval_run_1 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.0, "overall_rating": 4.0}, input="i1", output="o1", dataset_id=task_run_1.id, task_run_usage=task_run_1.usage, parent=mock_eval_config)
+    eval_run_1.save_to_file()
+    eval_run_2 = EvalRun(task_run_config_id=mock_run_config.id, scores={"score1": 4.5, "overall_rating": 4.5}, input="i2", output="o2", dataset_id=task_run_2.id, task_run_usage=task_run_2.usage, parent=mock_eval_config)
+    eval_run_2.save_to_file()
+
+    # ...same mock/patch pattern as neighboring tests...
+    # Assert explicit expected behavior at exactly 50%:
+    # assert mean_usage["mean_total_llm_latency_ms"] is <expected per threshold contract>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/desktop/studio_server/test_eval_api.py` around lines 2202 - 2359, Add a boundary test to cover the exact 50% case: create a new test (e.g. test_get_run_config_eval_scores_latency_at_50_percent) that mirrors test_get_run_config_eval_scores_latency_below_threshold but creates 4 TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having total_llm_latency_ms set (2/4 = 50%), patch task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the same way, call the same endpoint, and assert the behaviour of mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected assertion to match the current implementation's semantics—i.e., assert not None if the code treats >=50% as sufficient, or assert None if it requires >50%).

app/web_ui/src/lib/components/compare_radar_chart.svelte (2)
165-189: Consolidate the `COST_KEY`/`LATENCY_KEY` tooltip branches. The two branches are structurally identical aside from the label/format of the raw value. A small per-metric formatter map keeps future lower-is-better additions one-line changes and prevents subtle drift between the two branches.
♻️ Illustrative refactor
```diff
-    } else if (key === COST_KEY) {
-      const displayValue = metricToScore(
-        rawValue,
-        lowerIsBetterValues[key] || [],
-      )
-      html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Cost: $${rawValue.toFixed(6)})</span></div>`
-    } else if (key === LATENCY_KEY) {
-      const displayValue = metricToScore(
-        rawValue,
-        lowerIsBetterValues[key] || [],
-      )
-      const formatted =
-        rawValue < 1000
-          ? `${Math.round(rawValue)}ms`
-          : `${(rawValue / 1000).toFixed(1)}s`
-      html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(Mean Latency: ${formatted})</span></div>`
+    } else if (isLowerIsBetterMetric(key)) {
+      const displayValue = metricToScore(
+        rawValue,
+        lowerIsBetterValues[key] || [],
+      )
+      const rawLabel = key === COST_KEY ? "Mean Cost" : "Mean Latency"
+      const rawFormatted =
+        key === COST_KEY ? `$${rawValue.toFixed(6)}` : formatLatency(rawValue)
+      html += `<div>${label}: ${displayValue.toFixed(1)} <span style="color: #888;">(${rawLabel}: ${rawFormatted})</span></div>`
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 165 - 189, The two branches handling COST_KEY and LATENCY_KEY inside dataKeys.forEach are duplicated; extract a per-metric raw-value formatter map and use a single branch that calls metricToScore and then formats the raw value via the map before composing html. Specifically, update the logic in dataKeys.forEach to call getKeyLabel and getModelValueRaw as before, compute displayValue via metricToScore using lowerIsBetterValues[key], and replace the COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g., formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw string (like currency with 6 decimals or ms/s formatting) and fall back to rawValue.toFixed(3); then build the html with the unified template that inserts label, displayValue.toFixed(1), and the formatted raw string.
46-78: Rename parameters to match the generalized helper. After renaming `costToScore` → `metricToScore`, the parameters are still `cost`/`costs`, which reads misleadingly when the function is invoked for latency. This tightens readability without behavior change.

♻️ Proposed refactor
```diff
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)
     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 - 78, The parameter names cost and costs in the exported function metricToScore are misleading after renaming; rename them to generic names (e.g., value and values or metric and metrics) across the function signature and all internal references inside metricToScore to improve readability without changing behavior, keeping the same default behavior and return logic in metricToScore.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 181-185: Replace the inline latency formatting in
compare_radar_chart.svelte with the shared helper: import and call
formatLatency(ms: number) from app/web_ui/src/lib/utils/formatters.ts instead of
computing `formatted` manually (the block that sets `formatted` and uses it in
the HTML string); update the code that builds `html` to use
formatLatency(rawValue) where the previous `${formatted}` was inserted so all
latency display follows the single source-of-truth.
In `@app/web_ui/src/lib/utils/formatters.ts`:
- Around line 330-333: The formatLatency function can output "1000ms" for values
like 999.6 due to rounding before unit selection; change the logic so unit
selection uses the rounded ms value: compute roundedMs = Math.round(ms) and if
roundedMs < 1000 return `${roundedMs}ms`, otherwise return seconds using the
original ms divided by 1000 (e.g., `${(ms/1000).toFixed(1)}s`) so values that
round to 1000ms become "1.0s" instead of "1000ms".
In `@app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte`:
- Around line 496-501: The latency formatting is leaking into the diff
calculation: when category === "cost" and scoreKey like
"mean_total_llm_latency_ms" you should keep using formatLatency(value) only for
display, and use getModelValueRaw() (the raw millisecond value) when computing
the percent-difference/badge so comparisons remain numeric; update the
percent-difference calculation to call getModelValueRaw() for both sides when
scoreKey corresponds to latency (e.g., "mean_total_llm_latency_ms") and leave
formatLatency() exclusively in the cell rendering path.
In `@libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py`:
- Around line 171-180: The stream latency currently sums time only for
iterations that yield a chunk and misses the final await after the last chunk;
change the loop in adapter_stream.py to measure each await of the stream instead
of only time inside the async for body: replace the "async for chunk in stream"
with an explicit loop that records time before awaiting stream.__anext__(), adds
the elapsed time to call_latency_seconds after the await (so the final tail wait
is included), yields the chunk, and on StopAsyncIteration records the final
await duration before breaking; keep existing variables call_latency_seconds and
chunk_wait_start semantics so consumer processing time remains excluded.
In `@specs/projects/performance-tracking/architecture.md`:
- Line 222: The documentation references a renamed helper; replace uses of
metricToScore() with the canonical costToScore() so the docs match
functional_spec.md and the implementation. Update the "Speed" radar axis note
(and any other occurrences in this section) to call costToScore() and ensure the
wording still notes "lower is better". Verify the symbol name is exactly
costToScore() wherever metricToScore() appears.
In `@specs/projects/performance-tracking/functional_spec.md`:
- Around line 154-156: The spec still references the old helper name
costToScore(); update the documentation to use the new helper name
metricToScore() and, while here, confirm the spec notes that isCostMetric() now
also matches latency (both are lower-is-better) and that the radar label is
"Speed" (higher = better on chart) to match compare_radar_chart.svelte's current
behavior.
In `@specs/projects/performance-tracking/phase_plans/phase_3.md`:
- Line 13: The spec text incorrectly implies formatLatency() was created in this
phase; update the Phase 3 description for run.svelte to say you used or reused
the existing formatLatency() helper (instead of "added formatLatency()") and
confirm that calculate_subtask_usage(), the new subtask_latency_ms state
variable, and the additions to get_usage_properties() are the actual new changes
in this phase.
- Line 17: The spec incorrectly references costToScore(); update the
documentation to use the actual function name metricToScore() and ensure
references to symbols in compare_radar_chart.svelte (e.g., LATENCY_KEY,
isLowerIsBetterMetric(), lowerIsBetterValues) remain accurate — replace any
mention of costToScore() with metricToScore() and confirm wording indicates
per-key normalization via metricToScore() for cost and latency.
In `@specs/projects/performance-tracking/project_overview.md`:
- Line 30: The doc incorrectly states latency is stored in seconds; update the
spec to say latency is stored in milliseconds (integer/float as applicable) and
adjust examples/units accordingly; reference the observed implementation using
the _ms fields (latency_ms, total_llm_latency_ms, mean_total_llm_latency_ms) and
the formatLatency(ms: number) helper which renders "Xms" for values < 1000 to
ensure the documentation aligns with the code representation and displayed
units.
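The `formatLatency` rounding fix described in the formatters.ts comment above can be illustrated with a Python sketch of the same rule. The real helper is TypeScript; this only mirrors the suggested round-then-select-unit logic.

```python
def format_latency(ms: float) -> str:
    """Render milliseconds as "Xms" below one second, else "X.Xs".

    Round first, then pick the unit, so a value like 999.6 renders as
    "1.0s" rather than the misleading "1000ms".
    """
    rounded_ms = round(ms)
    if rounded_ms < 1000:
        return f"{rounded_ms}ms"
    return f"{ms / 1000:.1f}s"
```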
---
Nitpick comments:
In `@app/desktop/studio_server/test_eval_api.py`:
- Around line 2202-2359: Add a boundary test to cover the exact 50% case: create
a new test (e.g. test_get_run_config_eval_scores_latency_at_50_percent) that
mirrors test_get_run_config_eval_scores_latency_below_threshold but creates 4
TaskRuns/EvalRuns with exactly 2 of the TaskRun.usage entries having
total_llm_latency_ms set (2/4 = 50%), patch
task_from_id/eval_from_id/task_run_config_from_id and dataset_ids_in_filter the
same way, call the same endpoint, and assert the behaviour of
mean_usage["mean_total_llm_latency_ms"] for the exact-50% case (set the expected
assertion to match the current implementation's semantics—i.e., assert not None
if the code treats >=50% as sufficient, or assert None if it requires >50%).
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 165-189: The two branches handling COST_KEY and LATENCY_KEY inside
dataKeys.forEach are duplicated; extract a per-metric raw-value formatter map
and use a single branch that calls metricToScore and then formats the raw value
via the map before composing html. Specifically, update the logic in
dataKeys.forEach to call getKeyLabel and getModelValueRaw as before, compute
displayValue via metricToScore using lowerIsBetterValues[key], and replace the
COST_KEY/LATENCY_KEY conditionals with a formatter lookup (e.g.,
formatters[COST_KEY], formatters[LATENCY_KEY]) that returns the formatted raw
string (like currency with 6 decimals or ms/s formatting) and fall back to
rawValue.toFixed(3); then build the html with the unified template that inserts
label, displayValue.toFixed(1), and the formatted raw string.
- Around line 46-78: The parameter names cost and costs in the exported function
metricToScore are misleading after renaming; rename them to generic names (e.g.,
value and values or metric and metrics) across the function signature and all
internal references inside metricToScore to improve readability without changing
behavior, keeping the same default behavior and return logic in metricToScore.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 470f0af0-7876-452e-b65a-aecd47423ea4
📒 Files selected for processing (22)
- app/desktop/studio_server/eval_api.py
- app/desktop/studio_server/test_eval_api.py
- app/web_ui/src/lib/api_schema.d.ts
- app/web_ui/src/lib/components/compare_chart.svelte
- app/web_ui/src/lib/components/compare_radar_chart.svelte
- app/web_ui/src/lib/utils/formatters.ts
- app/web_ui/src/routes/(app)/run/run.svelte
- app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
- libs/core/kiln_ai/datamodel/task_run.py
- libs/core/kiln_ai/datamodel/test_example_models.py
- libs/core/kiln_ai/utils/open_ai_types.py
- libs/core/kiln_ai/utils/test_open_ai_types.py
- specs/projects/performance-tracking/architecture.md
- specs/projects/performance-tracking/functional_spec.md
- specs/projects/performance-tracking/implementation_plan.md
- specs/projects/performance-tracking/phase_plans/phase_1.md
- specs/projects/performance-tracking/phase_plans/phase_2.md
- specs/projects/performance-tracking/phase_plans/phase_3.md
- specs/projects/performance-tracking/project_overview.md
Extract duplicated latency formatting logic into shared formatLatency() in formatters.ts. Rename costToScore to metricToScore since it now handles both cost and latency metrics. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Force-pushed from `1a6582a` to `286c0a1` (Compare)
🧹 Nitpick comments (1)
app/web_ui/src/lib/components/compare_radar_chart.svelte (1)
46-78: Rename `cost`/`costs` parameters now that the helper is generalized. Since `metricToScore` now scores both cost and latency (and any future lower-is-better metric), the cost-specific parameter names are misleading. Consider `value`/`values` (or `metric`/`metrics`) for readability — purely cosmetic, no behavior change.

♻️ Proposed rename
```diff
-  export function metricToScore(
-    cost: number,
-    costs: number[],
+  export function metricToScore(
+    value: number,
+    values: number[],
     {
       padding = 10, // keep endpoints away from 0/100
       relFull = 0.7, // when (hi-lo)/|hi| reaches this, use full spread (k=1)
     }: {
       padding?: number
       relFull?: number
     } = {},
   ): number {
-    const lo = Math.min(...costs)
-    const hi = Math.max(...costs)
+    const lo = Math.min(...values)
+    const hi = Math.max(...values)
     const range = hi - lo
     if (range <= 0) return 50

     // 1) range-based normalized position
-    const t = (cost - lo) / range
+    const t = (value - lo) / range

-    // 2) raw padded linear score (lower cost = higher score)
+    // 2) raw padded linear score (lower value = higher score)
     const raw = padding + (1 - t) * (100 - 2 * padding)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/web_ui/src/lib/components/compare_radar_chart.svelte` around lines 46 - 78, Rename the cost-specific parameters in metricToScore to neutral names to reflect general-purpose use: change the parameter names cost and costs to value and values (or metric and metrics) in the metricToScore function signature and update all internal references (lo, hi, t, raw, etc. that use those params) accordingly; also update any call sites that pass arguments to metricToScore to use the new parameter names if they reference them by name (or simply ensure positional calls still work), and run the test/build to verify no unresolved identifiers remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@app/web_ui/src/lib/components/compare_radar_chart.svelte`:
- Around line 46-78: Rename the cost-specific parameters in metricToScore to
neutral names to reflect general-purpose use: change the parameter names cost
and costs to value and values (or metric and metrics) in the metricToScore
function signature and update all internal references (lo, hi, t, raw, etc. that
use those params) accordingly; also update any call sites that pass arguments to
metricToScore to use the new parameter names if they reference them by name (or
simply ensure positional calls still work), and run the test/build to verify no
unresolved identifiers remain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa3a98bb-f410-442e-8d7d-d4fc9c6b199f
📒 Files selected for processing (6)
- app/web_ui/src/lib/components/compare_chart.svelte
- app/web_ui/src/lib/components/compare_radar_chart.svelte
- app/web_ui/src/lib/utils/formatters.ts
- app/web_ui/src/routes/(app)/run/run.svelte
- app/web_ui/src/routes/(app)/specs/[project_id]/[task_id]/compare/+page.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
✅ Files skipped from review due to trivial changes (1)
- app/web_ui/src/lib/utils/formatters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- app/web_ui/src/lib/components/compare_chart.svelte
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- app/web_ui/src/routes/(app)/run/run.svelte
```python
    "Model returned an assistant message, but no content or tool calls. This is not supported."
)

response_choice.message._latency_ms = call_latency_ms  # type: ignore[attr-defined]
```
This seems unsafe because message here is not a type we have control over, it comes out of LiteLLM and here we monkeypatch a random property onto it.
If we only need to do this to pass it on to our _messages list, we should make a new object of type ChatCompletionMessageIncludingLiteLLM (we have control over that one so we can have it have a proper latency field) and copy the data into it
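One way to avoid the monkeypatch, sketched with a hypothetical wrapper type. The review names ChatCompletionMessageIncludingLiteLLM as the project-controlled type; the fields and class name below are illustrative stand-ins, not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MessageWithLatency:
    # Illustrative stand-in for a project-controlled message type: copy the
    # LiteLLM message's data into an object with a proper latency field,
    # instead of monkeypatching a private attribute onto a foreign type.
    role: str
    content: Optional[str]
    tool_calls: Optional[Any]
    latency_ms: Optional[float] = None

    @classmethod
    def from_litellm(cls, message: Any, latency_ms: float) -> "MessageWithLatency":
        return cls(
            role=message.role,
            content=getattr(message, "content", None),
            tool_calls=getattr(message, "tool_calls", None),
            latency_ms=latency_ms,
        )
```

The `type: ignore[attr-defined]` then disappears, and the latency field is visible to type checkers and serializers.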
```python
# Accumulate time spent waiting on the LLM for this chunk
call_latency_seconds += time.monotonic() - chunk_wait_start
yield chunk
# Reset timer after yield returns — excludes consumer processing time
```
I wonder if this makes sense. The consumer may be doing work that takes time so excluding that work can make sense, but I suspect (and I could be wrong) that when the consumer is doing its work, the stream may keep on buffering in the background and accumulating chunks, so the next iteration in the loop does not wait at all as if the next chunk were super fast / instant even though it would only be a side effect of the consumer work serving as unmeasured waiting time.
Not sure what to think of this though
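A self-contained sketch of the measure-each-await shape suggested by the review. The producer and consumer here are toy stand-ins; in the real adapter the accumulated total would feed `usage.total_llm_latency_ms`. Note the buffering caveat above still applies: if the transport keeps reading in the background while the consumer works, some producer wait can go unmeasured.

```python
import asyncio
import time


async def timed_stream(stream, latency_out: dict):
    """Yield chunks while accumulating only the time spent awaiting the producer.

    Calling __anext__ explicitly (instead of timing inside an `async for` body)
    also captures the final wait that ends in StopAsyncIteration, and restarting
    the timer after `yield` excludes the consumer's processing time.
    """
    total = 0.0
    iterator = stream.__aiter__()
    while True:
        wait_start = time.monotonic()
        try:
            chunk = await iterator.__anext__()
        except StopAsyncIteration:
            total += time.monotonic() - wait_start  # include the tail wait
            break
        total += time.monotonic() - wait_start
        yield chunk  # consumer work after this yield is not counted
    latency_out["call_latency_ms"] = total * 1000


async def demo():
    async def slow_producer():
        for i in range(3):
            await asyncio.sleep(0.05)  # simulated network wait per chunk
            yield i

    stats: dict = {}
    chunks = []
    async for chunk in timed_stream(slow_producer(), stats):
        chunks.append(chunk)
        await asyncio.sleep(0.05)  # consumer work; should NOT be counted
    return chunks, stats
```

Running `demo()` should report roughly the producer's ~150ms of sleeps, not the ~300ms of total wall time including consumer work.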
```python
)

# Add message to messages, so it can be used in the next turn
response_choice.message._latency_ms = call_latency_ms  # type: ignore[attr-defined]
```
Same comment here about monkeypatching a private property being dodgy.
What does this PR do?
- Add per-LLM-call latency tracking to Trace.
- Add `total_llm_latency_ms` to the Usage model to track overall LLM latency, like cost and tokens.
- Add `mean_total_llm_latency_ms` to the eval API for comparison.

Checklists
Summary by CodeRabbit