feat(evaluation): add dependency-injected scoring engine core#140

Open
anilguleroglu wants to merge 10 commits into main from claude/zealous-fermi-d9H2d
@anilguleroglu


Introduce the Evaluation service engine as a self-contained core with no
database, queue, or model-runtime coupling, so it compiles and is fully
unit-testable on its own. Persistence, live target/judge adapters, the
HTTP API, and the dashboard layer on top of this engine separately.

  • types: dataset items, scorer configs (assertion | llm-judge), run and
    aggregate results, and injected target/judge invokers
  • assertion scorer: equals / contains / notContains / regex / minimal
    JSON-schema / JSON-path checks, dependency-free
  • llm-judge scorer: rubric-driven grading through an injected judge
    invoker, with 0..1 score normalisation and graceful failure handling
  • runner: bounded-concurrency orchestration with pass-rate / score /
    latency aggregation and a per-item progress hook
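
As a rough illustration of the assertion scorer bullet above, here is a minimal, dependency-free sketch. The `Assertion` union, the `scoreAssertion` name, and the dotted-path lookup are assumptions made for this example; the engine's real types and its JSON-path and JSON-schema handling are not shown on this page and may differ.

```typescript
// Hypothetical shapes for illustration only; not the PR's actual types.
type Assertion =
  | { kind: "equals"; expected: string }
  | { kind: "contains"; expected: string }
  | { kind: "notContains"; expected: string }
  | { kind: "regex"; pattern: string; flags?: string }
  | { kind: "jsonPath"; path: string; expected: unknown };

interface AssertionResult {
  passed: boolean;
  score: number; // 1 when the check passes, 0 otherwise
  reason?: string; // set when the check itself could not be evaluated
}

// Resolve a dotted path such as "a.b.1" against a parsed JSON value.
// This is a simplification of JSON-path, assumed for the sketch.
function getPath(value: unknown, path: string): unknown {
  let current: unknown = value;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function checkAssertion(output: string, a: Assertion): boolean {
  switch (a.kind) {
    case "equals":
      return output === a.expected;
    case "contains":
      return output.includes(a.expected);
    case "notContains":
      return !output.includes(a.expected);
    case "regex":
      return new RegExp(a.pattern, a.flags).test(output);
    case "jsonPath": {
      const parsed: unknown = JSON.parse(output); // throws on invalid JSON
      return JSON.stringify(getPath(parsed, a.path)) === JSON.stringify(a.expected);
    }
  }
}

// Errors (invalid JSON, invalid regex) become a failed result instead of throwing.
function scoreAssertion(output: string, a: Assertion): AssertionResult {
  try {
    const passed = checkAssertion(output, a);
    return { passed, score: passed ? 1 : 0 };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { passed: false, score: 0, reason };
  }
}
```

Turning evaluation errors into a failed result keeps one malformed output from aborting a whole run, which is one plausible reading of "graceful failure handling".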

Covered by 24 unit tests; tsc --noEmit and eslint are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

claude added 10 commits June 3, 2026 15:16
Introduce the Evaluation service engine as a self-contained core with no
database, queue, or model-runtime coupling, so it compiles and is fully
unit-testable on its own. Persistence, live target/judge adapters, the
HTTP API, and the dashboard layer on top of this engine separately.

- types: dataset items, scorer configs (assertion | llm-judge), run and
  aggregate results, and injected target/judge invokers
- assertion scorer: equals / contains / notContains / regex / minimal
  JSON-schema / JSON-path checks, dependency-free
- llm-judge scorer: rubric-driven grading through an injected judge
  invoker, with 0..1 score normalisation and graceful failure handling
- runner: bounded-concurrency orchestration with pass-rate / score /
  latency aggregation and a per-item progress hook
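
The runner bullet could be sketched roughly as follows: a fixed pool of workers pulls items from a shared index, so at most `concurrency` items are in flight at once. All names here (`runAll`, `ItemResult`, `Aggregate`, the `onProgress` signature) are hypothetical and chosen for the example; the real runner's API is not visible on this page.

```typescript
// Hypothetical result shapes, assumed for this sketch.
interface ItemResult {
  index: number;
  passed: boolean;
  score: number;
  latencyMs: number;
  error?: string;
}

interface Aggregate {
  total: number;
  passRate: number; // fraction 0..1
  avgScore: number;
  avgLatencyMs: number;
}

type ScoreItem<T> = (item: T) => Promise<{ passed: boolean; score: number }>;

async function runAll<T>(
  items: T[],
  scoreItem: ScoreItem<T>,
  concurrency: number,
  onProgress?: (done: number, total: number, result: ItemResult) => void,
): Promise<{ results: ItemResult[]; aggregate: Aggregate }> {
  const results: ItemResult[] = new Array(items.length);
  let next = 0; // next unclaimed item index, shared by all workers
  let done = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claim synchronously, before any await
      const started = Date.now();
      let result: ItemResult;
      try {
        const { passed, score } = await scoreItem(items[index]);
        result = { index, passed, score, latencyMs: Date.now() - started };
      } catch (err) {
        // A failing item is recorded as a per-item error, not a run failure.
        const error = err instanceof Error ? err.message : String(err);
        result = { index, passed: false, score: 0, latencyMs: Date.now() - started, error };
      }
      results[index] = result;
      done += 1;
      onProgress?.(done, items.length, result);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  const mean = (xs: number[]): number =>
    xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

  return {
    results,
    aggregate: {
      total: results.length,
      passRate: mean(results.map((r) => (r.passed ? 1 : 0))),
      avgScore: mean(results.map((r) => r.score)),
      avgLatencyMs: mean(results.map((r) => r.latencyMs)),
    },
  };
}
```

Because the index is claimed before the first `await`, two workers never pick the same item, and results stay in input order regardless of completion order.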

Covered by 24 unit tests; tsc --noEmit and eslint are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Build the Evaluation service end-to-end on top of the engine core, wiring
it through the platform's dual-provider DB layer and HTTP API.

Data model (tenant-scoped, both SQLite + MongoDB providers):
- evaluation_targets, evaluation_datasets (items embedded as JSON),
  evaluation_suites, evaluation_runs (result items + aggregate embedded)
- domain interfaces + DatabaseProvider contract methods + SQLite/Mongo
  mixins composed into both providers; types re-exported from @/lib/database

Service + adapters:
- tenant-scoped CRUD for targets/datasets/suites and run listing/retrieval
- runSuite(): loads a suite, builds live invokers, drives the engine runner,
  and persists the run + aggregate. Target/judge invokers are injectable so
  orchestration is testable without live model calls
- live model target + llm-judge invokers via handleChatCompletion; agent and
  external targets recognised but stubbed (recorded as per-item errors)
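
To illustrate why injectable judge invokers make orchestration testable without live model calls, here is a small sketch of a rubric-driven judge that takes the invoker as a parameter. The `JudgeInvoker` contract, the JSON reply format, and the `maxScore` / `threshold` options are assumptions for this example; the real adapter goes through `handleChatCompletion`, whose signature is not shown here.

```typescript
// Hypothetical invoker contract: takes a prompt, returns the judge model's raw text.
type JudgeInvoker = (prompt: string) => Promise<string>;

interface JudgeResult {
  score: number; // normalised to 0..1
  passed: boolean;
  error?: string;
}

// Map a raw score on a 0..maxScore scale into 0..1, clamping out-of-range values.
function normalise(raw: number, maxScore: number): number {
  if (!Number.isFinite(raw) || maxScore <= 0) return 0;
  return Math.min(1, Math.max(0, raw / maxScore));
}

async function judgeOutput(
  invoke: JudgeInvoker,
  rubric: string,
  output: string,
  maxScore = 1,
  threshold = 0.5,
): Promise<JudgeResult> {
  const prompt =
    `Rubric:\n${rubric}\n\nOutput to grade:\n${output}\n\n` +
    `Reply with JSON only: {"score": <number between 0 and ${maxScore}>}`;
  try {
    const reply = await invoke(prompt);
    const parsed: unknown = JSON.parse(reply);
    const raw = (parsed as { score?: unknown }).score;
    if (typeof raw !== "number") {
      return { score: 0, passed: false, error: "judge reply has no numeric score" };
    }
    const score = normalise(raw, maxScore);
    return { score, passed: score >= threshold };
  } catch (err) {
    // Invoker failures and unparseable replies become a failed result.
    const error = err instanceof Error ? err.message : String(err);
    return { score: 0, passed: false, error };
  }
}
```

In a test, `invoke` can be a fake such as `async () => '{"score": 8}'`, so the scoring path runs with no network or model dependency.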

REST API: /evaluation/{targets,datasets,suites,runs} CRUD plus
POST /evaluation/suites/:key/run, registered in the API plugin.

Integration test runs the full vertical against a real SQLite provider with
injected fakes. Full suite green (2082 passed); tsc and eslint clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Add the Evaluations dashboard so the service is usable from the app, not
just the API.

UI:
- /dashboard/evaluations — tabbed page (Targets / Datasets / Suites / Runs)
  with stat tiles, DataGrids, create modals, delete, and a one-click "Run"
  action on suites that navigates to the run detail
- create modals: target (model/agent/external), dataset (JSON items with
  validation), suite (target + dataset + assertion / llm-judge scorers)
- /dashboard/evaluations/runs/[id] — run detail with aggregate stats and a
  per-item table (pass/fail, score, per-scorer breakdown, output/error)

Wiring:
- platform-services.json: new "evaluations" service (operate category)
- rbac.ts: evaluations PermissionService + definition + /api/evaluation
  route-prefix mapping
- dashboardServices.ts: register IconChecklist
- i18n: navigation labels (en + tr)

tsc, eslint, and the full test suite (2082 passed) are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Introduce the Conversation Analysis service engine as a self-contained,
dependency-injected core with no database, queue, or model-runtime coupling.
Mirrors the evaluation engine's approach but is kept independent (its own
JSON helper rather than importing eval's).

Four composable analysis modes:
- extract  : field-set + prompt -> structured JSON, type-coerced (string /
  number / boolean / enum) with required-field validation
- store    : persistence intent (no engine effect)
- judge    : LLM conversation-quality scoring against a rubric (0..1)
- accuracy : reference-based per-field comparison of extracted values
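
The extract mode's type coercion and required-field validation might look something like the sketch below. `FieldSpec`, `coerceField`, and `extractFields` are hypothetical names, and the exact coercion rules (for example, which strings count as booleans) are assumptions rather than the engine's confirmed behaviour.

```typescript
// Hypothetical field-set shape, assumed for this sketch.
type FieldSpec =
  | { key: string; type: "string" | "number" | "boolean"; required?: boolean }
  | { key: string; type: "enum"; values: string[]; required?: boolean };

type FieldValue = string | number | boolean;

// Returns undefined when the raw value cannot be coerced to the declared type.
function coerceField(spec: FieldSpec, raw: unknown): FieldValue | undefined {
  if (raw === null || raw === undefined) return undefined;
  switch (spec.type) {
    case "string":
      return typeof raw === "object" ? undefined : String(raw);
    case "number": {
      const n =
        typeof raw === "number"
          ? raw
          : typeof raw === "string" && raw.trim() !== ""
            ? Number(raw)
            : NaN;
      return Number.isFinite(n) ? n : undefined;
    }
    case "boolean":
      if (typeof raw === "boolean") return raw;
      if (raw === "true") return true;
      if (raw === "false") return false;
      return undefined;
    case "enum":
      return typeof raw === "string" && spec.values.includes(raw) ? raw : undefined;
  }
}

// Coerce every declared field and report required fields that are absent or invalid.
function extractFields(
  specs: FieldSpec[],
  raw: Record<string, unknown>,
): { fields: Record<string, FieldValue>; missing: string[] } {
  const fields: Record<string, FieldValue> = {};
  const missing: string[] = [];
  for (const spec of specs) {
    const value = coerceField(spec, raw[spec.key]);
    if (value !== undefined) fields[spec.key] = value;
    else if (spec.required) missing.push(spec.key);
  }
  return { fields, missing };
}
```

Here a value that fails coercion is treated the same as a missing one, so a required enum field with an unlisted value shows up in `missing`.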

The runner orchestrates extraction + optional judge + optional accuracy over
a batch with bounded concurrency, aggregating pass-rate, average judge score,
and average extraction accuracy, with a per-item progress hook.
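
For the accuracy mode and the "average extraction accuracy" aggregate, a per-field comparison could be sketched as below. The function name, the case-insensitive string comparison, and returning `null` when there are no reference fields are all assumptions made for illustration.

```typescript
type FieldValue = string | number | boolean;

// Compare extracted values against reference values, field by field.
// Only keys present in the reference are scored.
function fieldAccuracy(
  extracted: Record<string, FieldValue>,
  reference: Record<string, FieldValue>,
): { perField: Record<string, boolean>; accuracy: number | null } {
  const keys = Object.keys(reference);
  const perField: Record<string, boolean> = {};
  let matches = 0;
  for (const key of keys) {
    const a = extracted[key];
    const b = reference[key];
    // Assumption: strings match ignoring case and surrounding whitespace.
    const same =
      typeof a === "string" && typeof b === "string"
        ? a.trim().toLowerCase() === b.trim().toLowerCase()
        : a === b;
    perField[key] = same;
    if (same) matches += 1;
  }
  // No reference fields means accuracy is undefined for this item.
  return { perField, accuracy: keys.length === 0 ? null : matches / keys.length };
}

// Average over items, skipping those without an accuracy value.
function averageAccuracy(values: Array<number | null>): number | null {
  const present = values.filter((v): v is number => v !== null);
  return present.length === 0 ? null : present.reduce((a, b) => a + b, 0) / present.length;
}
```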

Covered by 23 unit tests; tsc --noEmit and eslint are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Layer the Analysis service onto the engine core: dual-provider persistence,
a live model adapter, and a tenant-scoped REST API.

Persistence (MongoDB + SQLite parity):
- domain types + provider contract for analysis definitions, conversations,
  and runs
- analysis.mixin for both providers (SQLite uses JSON columns + row mappers;
  mappers are namespaced to avoid colliding with the evaluation mixin)
- SQLite schema (3 tables + indexes) and collection/table name maps
- curated type exports from the database index

Service + API:
- service.ts: CRUD for definitions/conversations, bulk conversation ingest,
  and runDefinition — loads a definition + conversations, drives the pure
  engine via injectable model invokers, persists the run + aggregate, and
  (store mode) writes extracted fields back onto conversations
- adapters.ts: extraction/judge invokers backed by handleChatCompletion
- /api/analysis/* Fastify plugin (definitions, conversations, ingest, run,
  runs) registered in the API plugin

Verified by a SQLite-backed e2e test (extraction + judge + accuracy +
store-mode write-back, plus a per-item error path). tsc, eslint, and the
full suite (2108 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Add the Conversation Analysis dashboard so the service is usable from the
app, completing the vertical (engine + persistence + API + UI).

UI:
- /dashboard/analysis — tabbed page (Definitions / Conversations / Runs)
  with stat tiles, DataGrids, delete, and a one-click "Run analysis" action
  on definitions that navigates to the run detail
- CreateDefinitionModal — dynamic field-set builder (key/type/required/enum),
  mode toggles (store / accuracy / judge + rubric), and extraction/judge
  model selection
- IngestConversationsModal — paste a JSON array of transcripts (with optional
  referenceFields for accuracy), validated client-side
- /dashboard/analysis/runs/[id] — run detail with aggregate stats (analyzed,
  avg judge score, avg accuracy, failed) and a per-conversation table
  (extracted fields, judge, accuracy)

Wiring:
- platform-services.json: new "analysis" service (operate category)
- rbac.ts: analysis PermissionService + definition + /api/analysis route map
- dashboardServices.ts: register IconReportAnalytics
- i18n: navigation labels (en + tr)

tsc, eslint, and the full test suite (2108 passed) are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
…lerts

Wire the analysis and evaluation services into the existing alert pipeline so
teams can be alerted when quality drops — no new alert logic needed, the
alertScheduler/alertEvaluator already evaluate rules against collectors.

- AlertModule gains 'analysis' and 'evaluation'; AlertMetric gains
  analysis_pass_rate / analysis_avg_judge_score / analysis_avg_accuracy and
  evaluation_pass_rate / evaluation_avg_score (0–100 percentages)
- alertService MODULE_METRICS + VALID_MODULES extended
- AnalysisCollector + EvaluationCollector average the persisted run aggregate
  over completed runs in the window (dual-provider: SQLite json_extract /
  Mongo $avg), via a shared runAggregateHelper; null metrics are excluded and
  the projectId scope is honoured

A rule like "analysis_pass_rate lt 80 over 24h" now fires through the normal
channels. Covered by a SQLite-backed collector test; tsc, eslint, and the full
suite (2113 passed) are green.
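
As an in-memory illustration of what the collectors compute (the real ones do this in SQLite via `json_extract` or in Mongo via `$avg`), the sketch below averages a run aggregate over a time window, skips null metrics, and checks a rule such as "analysis_pass_rate lt 80 over 24h". The `RunAggregate` shape, the assumption that the stored pass rate is a 0..1 fraction, and both function names are hypothetical.

```typescript
type Operator = "lt" | "lte" | "gt" | "gte";

// Hypothetical persisted shape; passRate is assumed to be stored as a 0..1 fraction.
interface RunAggregate {
  completedAt: Date;
  passRate: number | null;
}

// Average passRate over runs completed inside the window, as a 0..100 percentage.
// Returns null when no run in the window has a non-null metric.
function windowedPassRate(runs: RunAggregate[], now: Date, windowMs: number): number | null {
  const cutoff = now.getTime() - windowMs;
  const values = runs
    .filter((r) => r.completedAt.getTime() >= cutoff && r.completedAt.getTime() <= now.getTime())
    .map((r) => r.passRate)
    .filter((v): v is number => v !== null);
  if (values.length === 0) return null;
  return (values.reduce((a, b) => a + b, 0) / values.length) * 100;
}

// A rule never fires on a null metric (no data in the window).
function ruleFires(value: number | null, op: Operator, threshold: number): boolean {
  if (value === null) return false;
  switch (op) {
    case "lt":
      return value < threshold;
    case "lte":
      return value <= threshold;
    case "gt":
      return value > threshold;
    case "gte":
      return value >= threshold;
  }
}
```

Excluding nulls rather than counting them as zero avoids a false "quality dropped" alert when a run simply had no scored items.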

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Add optional cron scheduling to analysis definitions plus a background
scheduler that fires due runs — the "every night, analyze yesterday's calls"
half of the IVR automation. Combined with the alert collectors, a nightly run
that dips below a quality threshold now triggers an alert automatically.

- IAnalysisDefinition gains an optional `schedule: { cron, enabled }`
  (SQLite column + mixin handling; Mongo persists natively)
- pure schedulePlanner (validateCron / computeNextRun / isDue) — a slot fires
  at most once, decided against the most recent run's timestamp
- service.runScheduledAnalyses: finds due definitions for a tenant and runs
  them (createdBy 'system'), collecting per-definition errors
- analysisScheduler: 60s interval + distributed lock + per-tenant loop,
  mirroring the alert scheduler; started from server bootstrap
- API accepts and validates `schedule` on definition create/update
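
The "a slot fires at most once" rule can be sketched as follows. To stay short, this example only understands daily expressions of the form `M H * * *` in UTC, which is an assumption for illustration; the PR's schedulePlanner presumably handles general cron expressions, and its actual `validateCron` / `computeNextRun` / `isDue` signatures are not shown here. The helper names below are hypothetical.

```typescript
// Parse only the daily "M H * * *" form (e.g. "0 2 * * *"); anything else is rejected.
function parseDailyCron(cron: string): { minute: number; hour: number } | null {
  const parts = cron.trim().split(/\s+/);
  if (parts.length !== 5) return null;
  if (parts[2] !== "*" || parts[3] !== "*" || parts[4] !== "*") return null;
  if (!/^\d+$/.test(parts[0]) || !/^\d+$/.test(parts[1])) return null;
  const minute = Number(parts[0]);
  const hour = Number(parts[1]);
  if (minute > 59 || hour > 23) return null;
  return { minute, hour };
}

// The latest scheduled slot at or before `now`, in UTC.
function mostRecentSlot(cron: string, now: Date): Date | null {
  const parsed = parseDailyCron(cron);
  if (!parsed) return null;
  const slot = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), parsed.hour, parsed.minute),
  );
  if (slot.getTime() > now.getTime()) slot.setUTCDate(slot.getUTCDate() - 1);
  return slot;
}

// Due when the most recent slot has not yet been covered by a run.
// Comparing against the last run's timestamp means a slot fires at most once,
// even though the scheduler polls every 60 seconds.
function isDueSketch(
  schedule: { cron: string; enabled: boolean },
  lastRunAt: Date | null,
  now: Date,
): boolean {
  if (!schedule.enabled) return false;
  const slot = mostRecentSlot(schedule.cron, now);
  if (!slot) return false;
  return lastRunAt === null || lastRunAt.getTime() < slot.getTime();
}
```

One consequence of this particular sketch is that a definition with no previous run is due immediately for its most recent slot; whether the real planner behaves that way is not stated on this page.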

Covered by planner unit tests and an e2e case (schedule round-trips through
SQLite and fires via runScheduledAnalyses). tsc, eslint, and the full suite
(2121 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Document both quality services and their automation.

- docs/api/evaluation.md — targets / datasets / suites / runs endpoints,
  scorers, run shape, and the evaluation alert metrics
- docs/api/analysis.md — definitions / conversations / runs endpoints, the four
  modes, conversation ingest, cron scheduling, and the analysis alert metrics
- docs/guide/evaluation-and-analysis.md — the four-layer architecture (pure
  engine → service → API → UI), data models, end-to-end walkthroughs, and the
  automation story (scheduled runs + threshold alerts) for the IVR use case
- wire all three into the VitePress sidebar (API + Guide)

Verified with `vitepress build` (no dead links).

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Expose the nightly-run automation in the definition UI so a schedule can be set
without the API.

- CreateDefinitionModal: a "Schedule" section — enable a cron schedule and edit
  the expression (defaults to 0 2 * * *, UTC), posted as `schedule` on create
- definitions table: a "Schedule" column showing the cron when enabled
- AnalysisDefinitionView gains the `schedule` field

tsc, eslint, and the full suite (2121 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3