feat(eval): show mean score instead of pass/fail in report and viewer #534

shivammittal274 merged 1 commit into main
Conversation
Greptile Summary

This PR replaces binary pass/fail reporting with a continuous mean score (0–100%) across the eval weekly-report generator and the task viewer.
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Manifest task"] --> B{"graderResults present?"}
    B -- No --> C["Skip task"]
    B -- Yes --> D["Iterate PASS_FAIL_GRADER_ORDER"]
    D --> E{"Grader key found?"}
    E -- No --> C
    E -- Yes --> F{"score field present?"}
    subgraph wr ["weekly-report.ts"]
        F -- Yes --> G["scoreSum += score; scoredCount++"]
        F -- "No (bug)" --> H["scoreSum += 0; scoredCount++ (skews avg)"]
        G --> I["avgScore = scoreSum / scoredCount × 100"]
    end
    subgraph vh ["viewer.html resolveGrade"]
        J["keys[0] (arbitrary order)"] --> K{"typeof score === 'number'?"}
        K -- Yes --> L{"pct >= 75?"}
        L -- Yes --> M["pass (green)"]
        L -- No --> N["fail (red), missing neutral tier"]
        K -- No --> O["anyPass → PASS / FAIL"]
    end
    I --> P["RunSummary.avgScore used in chart, table, stats"]
    A --> J
```
Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/dashboard/viewer.html
Line: 1228
Comment:
**Missing "neutral" class for mid-range scores**
`weekly-report.ts` uses three tiers — `pass` (≥75%), `neutral` (≥40%), `fail` (<40%) — but the new `resolveGrade` path in the viewer only emits `pass` or `fail`. Any score between 40% and 74% will appear red (fail) in the task list even though the weekly report charts it as orange (neutral). The `.neutral` CSS class is already defined in the viewer's stylesheet.
```suggestion
return { label: pct + '%', cls: pct >= 75 ? 'pass' : pct >= 40 ? 'neutral' : 'fail' };
```
How can I resolve this? If you propose a fix, please make it concise.
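If it helps to keep the thresholds in one place, the suggested ternary can also be factored into a small helper. This is a sketch, not code from the repo; `gradeClass` is a hypothetical name, with the tiers taken from the comment above:

```typescript
// Hypothetical helper mirroring the three tiers used by weekly-report.ts:
// pass (>= 75%), neutral (>= 40%), fail (< 40%).
function gradeClass(pct: number): 'pass' | 'neutral' | 'fail' {
  if (pct >= 75) return 'pass';
  if (pct >= 40) return 'neutral';
  return 'fail';
}
```

The viewer's return statement would then read `{ label: pct + '%', cls: gradeClass(pct) }`, so any future threshold change only needs to happen in one spot.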
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/src/dashboard/viewer.html
Line: 1224-1225
Comment:
**Grader selection uses arbitrary key order instead of priority order**
`weekly-report.ts` iterates `PASS_FAIL_GRADER_ORDER` (`performance_grader` → `webvoyager_grader` → `fara_combined` → `fara_grader`) to pick the canonical grader score for each task. Here, `keys[0]` relies on `Object.keys()` insertion order, which may resolve to a different grader. When a task has results from multiple graders the viewer can display a score from a different grader than the one that contributes to the report's `avgScore`, making the per-task numbers inconsistent with the aggregate.
How can I resolve this? If you propose a fix, please make it concise.
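One way to resolve this is to mirror the report's priority list in the viewer. A minimal sketch, assuming the grader names from `PASS_FAIL_GRADER_ORDER` quoted above (`pickGraderKey` is a hypothetical helper, not existing code):

```typescript
// Priority order as described for weekly-report.ts.
const PASS_FAIL_GRADER_ORDER = [
  'performance_grader',
  'webvoyager_grader',
  'fara_combined',
  'fara_grader',
];

// Return the first grader from the priority list that produced a result,
// falling back to insertion order only when no known grader matches.
function pickGraderKey(graders: Record<string, unknown>): string | undefined {
  for (const name of PASS_FAIL_GRADER_ORDER) {
    if (name in graders) return name;
  }
  return Object.keys(graders)[0];
}
```

With this, a task graded by both `fara_grader` and `performance_grader` shows the `performance_grader` score in the task list, matching the grader that feeds the report's `avgScore`.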
---
This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval/scripts/weekly-report.ts
Line: 143-146
Comment:
**Historical tasks without `score` field skew the average to zero**
`task.graderResults[name].score ?? 0` treats a missing `score` (e.g. older manifest records that only carry `pass: boolean`) as a score of `0`, while still incrementing `scoredCount`. This silently dilutes `avgScore` for any run that contains historical tasks.
By contrast, `viewer.html` correctly handles this case with `if (typeof score === 'number')` and falls back to the `pass`/`fail` path — meaning the two codepaths diverge for the same data.
Consider mirroring the viewer's check:
```typescript
const scoreVal = task.graderResults[name].score;
if (typeof scoreVal === 'number') {
scoredCount++;
scoreSum += scoreVal;
}
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(eval): show mean score instead of p..."
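To make the skew concrete, here is a toy comparison (hypothetical data, not from the repo) of the `?? 0` accumulation versus the `typeof` guard:

```typescript
type GraderResult = { pass: boolean; score?: number };

// Average score as a percentage; `guard` toggles the typeof check
// described in the comment above.
function avgScore(results: GraderResult[], guard: boolean): number | null {
  let scoreSum = 0;
  let scoredCount = 0;
  for (const r of results) {
    if (guard) {
      if (typeof r.score === 'number') {
        scoredCount++;
        scoreSum += r.score;
      }
    } else {
      scoredCount++;
      scoreSum += r.score ?? 0; // historical pass-only records count as 0
    }
  }
  return scoredCount > 0 ? (scoreSum / scoredCount) * 100 : null;
}

// Two tasks scored 0.8 plus one historical record that has no score:
const results: GraderResult[] = [
  { pass: true, score: 0.8 },
  { pass: true, score: 0.8 },
  { pass: true },
];
// Without the guard, the historical record drags the average from 80% down to ~53%.
```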
Inline comment on `viewer.html` line 1228 — code context:

```js
const score = graders[firstKey].score;
if (typeof score === 'number') {
  const pct = Math.round(score * 100);
  return { label: pct + '%', cls: pct >= 75 ? 'pass' : 'fail' };
```
Inline comment on `viewer.html` lines 1224–1225 — code context:

```js
const firstKey = keys[0];
const score = graders[firstKey].score;
```
Inline comment on `weekly-report.ts` lines 143–146 — code context:

```typescript
if (task.graderResults[name]) {
  graded++
  if (task.graderResults[name].pass) passed++
  scoredCount++
  scoreSum += task.graderResults[name].score ?? 0
  break
```
No description provided.