feat(evaluation): add dependency-injected scoring engine core by anilguleroglu · Pull Request #140 · Cognipeer/console

anilguleroglu · 2026-06-03T17:34:46Z

Introduce the Evaluation service engine as a self-contained core with no
database, queue, or model-runtime coupling, so it compiles and is fully
unit-testable on its own. Persistence, live target/judge adapters, the
HTTP API, and the dashboard layer on top of this engine separately.

- types: dataset items, scorer configs (assertion | llm-judge), run and
  aggregate results, and injected target/judge invokers
- assertion scorer: equals / contains / notContains / regex / minimal
  JSON-schema / JSON-path checks, dependency-free
- llm-judge scorer: rubric-driven grading through an injected judge
  invoker, with 0..1 score normalisation and graceful failure handling
- runner: bounded-concurrency orchestration with pass-rate / score /
  latency aggregation and a per-item progress hook
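
To make the runner bullet concrete, here is a minimal sketch of bounded-concurrency orchestration with an injected target invoker, a dependency-free `contains` scorer, and pass-rate / score / latency aggregation. All type and function names (`DatasetItem`, `TargetInvoker`, `runEvaluation`, `containsScorer`) are assumptions made for illustration and are not taken from the PR's code.

```typescript
// Hypothetical shapes; the PR's real type names may differ.
type DatasetItem = { id: string; input: string; expected?: string };
type TargetInvoker = (input: string) => Promise<string>;
type Scorer = (output: string, item: DatasetItem) => { passed: boolean; score: number };
type ItemResult = { id: string; passed: boolean; score: number; latencyMs: number; error?: string };

// Dependency-free "contains" assertion: passes when the output includes the expected text.
const containsScorer: Scorer = (output, item) => {
  const passed = item.expected !== undefined && output.includes(item.expected);
  return { passed, score: passed ? 1 : 0 };
};

async function runEvaluation(
  items: DatasetItem[],
  invokeTarget: TargetInvoker, // injected, so tests can pass a fake instead of a live model
  scorer: Scorer,
  opts: { concurrency: number; onProgress?: (done: number, total: number) => void },
) {
  const results: ItemResult[] = new Array(items.length);
  let next = 0;
  let done = 0;

  // Each worker pulls the next unclaimed index until the dataset is exhausted.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      const item = items[index];
      const started = Date.now();
      try {
        const output = await invokeTarget(item.input);
        results[index] = { id: item.id, ...scorer(output, item), latencyMs: Date.now() - started };
      } catch (err) {
        // A failing target is recorded as a per-item error instead of aborting the run.
        results[index] = { id: item.id, passed: false, score: 0, latencyMs: Date.now() - started, error: String(err) };
      }
      done++;
      opts.onProgress?.(done, items.length);
    }
  };

  // At most `concurrency` invocations are in flight at any time.
  const workerCount = Math.max(1, Math.min(opts.concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  const total = results.length;
  const aggregate = {
    total,
    passRate: total ? results.filter((r) => r.passed).length / total : 0,
    avgScore: total ? results.reduce((sum, r) => sum + r.score, 0) / total : 0,
    avgLatencyMs: total ? results.reduce((sum, r) => sum + r.latencyMs, 0) / total : 0,
  };
  return { results, aggregate };
}
```

Because the invoker is a plain function argument, a unit test can substitute a fake that returns canned text or throws, which is the property the description relies on for testing without a model runtime.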

Covered by 24 unit tests; tsc --noEmit and eslint are clean.
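
The llm-judge behaviour described above (0..1 normalisation and graceful failure handling) could look roughly like the sketch below. The reply format (`{"score": n}` on a 0..10 scale), the pass threshold, and the names `JudgeInvoker`, `normaliseScore`, and `judge` are assumptions for illustration only; the PR does not state them.

```typescript
// Hypothetical judge contract: the injected invoker returns the model's raw text reply.
type JudgeInvoker = (prompt: string) => Promise<string>;
type JudgeResult = { score: number; passed: boolean; error?: string };

// Map a raw score on a 0..scaleMax scale into 0..1, clamping out-of-range values.
function normaliseScore(raw: number, scaleMax: number): number {
  if (!Number.isFinite(raw) || scaleMax <= 0) return 0;
  return Math.min(1, Math.max(0, raw / scaleMax));
}

async function judge(
  output: string,
  rubric: string,
  invoke: JudgeInvoker,
  opts: { scaleMax: number; threshold: number } = { scaleMax: 10, threshold: 0.7 },
): Promise<JudgeResult> {
  const prompt =
    `Rubric:\n${rubric}\n\nAnswer:\n${output}\n\n` +
    `Reply as JSON: {"score": <0-${opts.scaleMax}>}`;
  try {
    const reply = await invoke(prompt);
    const parsed = JSON.parse(reply) as { score?: unknown };
    if (typeof parsed.score !== "number") throw new Error("judge reply has no numeric score");
    const score = normaliseScore(parsed.score, opts.scaleMax);
    return { score, passed: score >= opts.threshold };
  } catch (err) {
    // Graceful failure: a broken judge reply fails this item but never throws to the runner.
    return { score: 0, passed: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```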

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3


Build the Evaluation service end-to-end on top of the engine core, wiring it through the platform's dual-provider DB layer and HTTP API.

Data model (tenant-scoped, both SQLite + MongoDB providers):
- evaluation_targets, evaluation_datasets (items embedded as JSON), evaluation_suites, evaluation_runs (result items + aggregate embedded)
- domain interfaces + DatabaseProvider contract methods + SQLite/Mongo mixins composed into both providers; types re-exported from @/lib/database

Service + adapters:
- tenant-scoped CRUD for targets/datasets/suites and run listing/retrieval
- runSuite(): loads a suite, builds live invokers, drives the engine runner, and persists the run + aggregate. Target/judge invokers are injectable so orchestration is testable without live model calls
- live model target + llm-judge invokers via handleChatCompletion; agent and external targets recognised but stubbed (recorded as per-item errors)

REST API: /evaluation/{targets,datasets,suites,runs} CRUD plus POST /evaluation/suites/:key/run, registered in the API plugin.

Integration test runs the full vertical against a real SQLite provider with injected fakes. Full suite green (2082 passed); tsc and eslint clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
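
One way to read "agent and external targets recognised but stubbed (recorded as per-item errors)" is sketched below: the invoker factory accepts every target kind, but only the model kind reaches the chat function. The `EvaluationTarget` shape, `ChatFn` (a stand-in for the live chat-completion call), and `buildTargetInvoker` are hypothetical names, not the PR's actual API.

```typescript
// Hypothetical target shape; field names are illustrative only.
type EvaluationTarget =
  | { kind: "model"; modelKey: string }
  | { kind: "agent"; agentKey: string }
  | { kind: "external"; url: string };

type TargetInvoker = (input: string) => Promise<string>;

// Stand-in for the live chat-completion call; injected so tests can pass a fake.
type ChatFn = (modelKey: string, prompt: string) => Promise<string>;

function buildTargetInvoker(target: EvaluationTarget, chat: ChatFn): TargetInvoker {
  switch (target.kind) {
    case "model": {
      const modelKey = target.modelKey;
      return (input) => chat(modelKey, input);
    }
    case "agent":
    case "external": {
      // Recognised but not implemented: the rejection is expected to surface
      // as a per-item error in the run rather than failing suite creation.
      const kind = target.kind;
      return async () => {
        throw new Error(`target kind "${kind}" is not supported yet`);
      };
    }
  }
}

// Collect per-item outcomes without letting one rejection abort the batch.
async function invokeAll(inputs: string[], invoker: TargetInvoker) {
  return Promise.all(
    inputs.map(async (input) => {
      try {
        return { input, output: await invoker(input) };
      } catch (err) {
        return { input, error: err instanceof Error ? err.message : String(err) };
      }
    }),
  );
}
```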

Add the Evaluations dashboard so the service is usable from the app, not just the API.

UI:
- /dashboard/evaluations: tabbed page (Targets / Datasets / Suites / Runs) with stat tiles, DataGrids, create modals, delete, and a one-click "Run" action on suites that navigates to the run detail
- create modals: target (model/agent/external), dataset (JSON items with validation), suite (target + dataset + assertion / llm-judge scorers)
- /dashboard/evaluations/runs/[id]: run detail with aggregate stats and a per-item table (pass/fail, score, per-scorer breakdown, output/error)

Wiring:
- platform-services.json: new "evaluations" service (operate category)
- rbac.ts: evaluations PermissionService + definition + /api/evaluation route-prefix mapping
- dashboardServices.ts: register IconChecklist
- i18n: navigation labels (en + tr)

tsc, eslint, and the full test suite (2082 passed) are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

Introduce the Conversation Analysis service engine as a self-contained, dependency-injected core with no database, queue, or model-runtime coupling. Mirrors the evaluation engine's approach but is kept independent (its own JSON helper rather than importing eval's).

Four composable analysis modes:
- extract: field-set + prompt -> structured JSON, type-coerced (string / number / boolean / enum) with required-field validation
- store: persistence intent (no engine effect)
- judge: LLM conversation-quality scoring against a rubric (0..1)
- accuracy: reference-based per-field comparison of extracted values

The runner orchestrates extraction + optional judge + optional accuracy over a batch with bounded concurrency, aggregating pass-rate, average judge score, and average extraction accuracy, with a per-item progress hook.

Covered by 23 unit tests; tsc --noEmit and eslint are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
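
As an illustration of the extract mode's type coercion and required-field validation, the sketch below coerces a model's raw JSON reply against a field set. The `FieldSpec` shape, the accepted boolean spellings, case-insensitive enum matching, and the error strings are all assumptions; the commit message only names the four field types and the required check.

```typescript
type FieldType = "string" | "number" | "boolean" | "enum";
// Hypothetical field-set entry; the PR's definition shape may differ.
type FieldSpec = { key: string; type: FieldType; required?: boolean; enumValues?: string[] };
type FieldValue = string | number | boolean | null;

// Returns the coerced value, or undefined when the raw value cannot be coerced.
function coerceValue(spec: FieldSpec, raw: unknown): FieldValue | undefined {
  switch (spec.type) {
    case "string":
      if (typeof raw === "string") return raw.trim();
      if (typeof raw === "number" || typeof raw === "boolean") return String(raw);
      return undefined;
    case "number": {
      if (typeof raw === "number") return Number.isFinite(raw) ? raw : undefined;
      if (typeof raw === "string" && raw.trim() !== "") {
        const n = Number(raw);
        return Number.isFinite(n) ? n : undefined;
      }
      return undefined;
    }
    case "boolean": {
      if (typeof raw === "boolean") return raw;
      if (typeof raw !== "string") return undefined;
      const s = raw.trim().toLowerCase();
      if (s === "true" || s === "yes") return true;
      if (s === "false" || s === "no") return false;
      return undefined;
    }
    case "enum": {
      if (typeof raw !== "string") return undefined;
      const s = raw.trim().toLowerCase();
      // Return the canonical spelling from the definition, not the model's spelling.
      return (spec.enumValues ?? []).find((v) => v.toLowerCase() === s);
    }
  }
}

function coerceFields(specs: FieldSpec[], raw: Record<string, unknown>) {
  const fields: Record<string, FieldValue> = {};
  const errors: string[] = [];
  for (const spec of specs) {
    const value = raw[spec.key];
    if (value === undefined || value === null || value === "") {
      fields[spec.key] = null;
      if (spec.required) errors.push(`missing required field "${spec.key}"`);
      continue;
    }
    const coerced = coerceValue(spec, value);
    if (coerced === undefined) {
      fields[spec.key] = null;
      errors.push(`field "${spec.key}" is not a valid ${spec.type}`);
    } else {
      fields[spec.key] = coerced;
    }
  }
  return { fields, errors };
}
```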

Layer the Analysis service onto the engine core: dual-provider persistence, a live model adapter, and a tenant-scoped REST API.

Persistence (MongoDB + SQLite parity):
- domain types + provider contract for analysis definitions, conversations, and runs
- analysis.mixin for both providers (SQLite uses JSON columns + row mappers; mappers are namespaced to avoid colliding with the evaluation mixin)
- SQLite schema (3 tables + indexes) and collection/table name maps
- curated type exports from the database index

Service + API:
- service.ts: CRUD for definitions/conversations, bulk conversation ingest, and runDefinition, which loads a definition + conversations, drives the pure engine via injectable model invokers, persists the run + aggregate, and (store mode) writes extracted fields back onto conversations
- adapters.ts: extraction/judge invokers backed by handleChatCompletion
- /api/analysis/* Fastify plugin (definitions, conversations, ingest, run, runs) registered in the API plugin

Verified by a SQLite-backed e2e test (extraction + judge + accuracy + store-mode write-back, plus a per-item error path). tsc, eslint, and the full suite (2108 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

Add the Conversation Analysis dashboard so the service is usable from the app, completing the vertical (engine + persistence + API + UI).

UI:
- /dashboard/analysis: tabbed page (Definitions / Conversations / Runs) with stat tiles, DataGrids, delete, and a one-click "Run analysis" action on definitions that navigates to the run detail
- CreateDefinitionModal: dynamic field-set builder (key/type/required/enum), mode toggles (store / accuracy / judge + rubric), and extraction/judge model selection
- IngestConversationsModal: paste a JSON array of transcripts (with optional referenceFields for accuracy), validated client-side
- /dashboard/analysis/runs/[id]: run detail with aggregate stats (analyzed, avg judge score, avg accuracy, failed) and a per-conversation table (extracted fields, judge, accuracy)

Wiring:
- platform-services.json: new "analysis" service (operate category)
- rbac.ts: analysis PermissionService + definition + /api/analysis route map
- dashboardServices.ts: register IconReportAnalytics
- i18n: navigation labels (en + tr)

tsc, eslint, and the full test suite (2108 passed) are clean.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

…lerts

Wire the analysis and evaluation services into the existing alert pipeline so teams can be alerted when quality drops. No new alert logic is needed: the alertScheduler/alertEvaluator already evaluate rules against collectors.

- AlertModule gains 'analysis' and 'evaluation'; AlertMetric gains analysis_pass_rate / analysis_avg_judge_score / analysis_avg_accuracy and evaluation_pass_rate / evaluation_avg_score (0–100 percentages)
- alertService MODULE_METRICS + VALID_MODULES extended
- AnalysisCollector + EvaluationCollector average the persisted run aggregate over completed runs in the window (dual-provider: SQLite json_extract / Mongo $avg), via a shared runAggregateHelper; null metrics are excluded and the projectId scope is honoured

A rule like "analysis_pass_rate lt 80 over 24h" now fires through the normal channels.

Covered by a SQLite-backed collector test; tsc, eslint, and the full suite (2113 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
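
The collector behaviour described above can be sketched in memory as follows: average the persisted pass rate over completed runs in the window, skip null metrics, and evaluate a rule such as "analysis_pass_rate lt 80". This is an illustration only. The real collectors are described as pushing the average into SQLite `json_extract` / Mongo `$avg`, and the `PersistedRun` shape, the 0..1 storage of `passRate`, and the function names here are assumptions.

```typescript
type RunStatus = "completed" | "failed" | "running";
// Hypothetical persisted run; assumes the aggregate stores passRate as a 0..1 fraction.
type PersistedRun = { status: RunStatus; finishedAt: Date; aggregate: { passRate: number | null } };

// Average pass rate over completed runs inside [from, to], as a 0-100 percentage.
// Returns null when no run in the window carries the metric.
function averagePassRatePct(runs: PersistedRun[], from: Date, to: Date): number | null {
  const values = runs
    .filter(
      (r) =>
        r.status === "completed" &&
        r.finishedAt.getTime() >= from.getTime() &&
        r.finishedAt.getTime() <= to.getTime(),
    )
    .map((r) => r.aggregate.passRate)
    .filter((v): v is number => v !== null); // null metrics are excluded, not counted as 0
  if (values.length === 0) return null;
  return (values.reduce((sum, v) => sum + v, 0) / values.length) * 100;
}

type Rule = { metric: string; operator: "lt" | "gt"; threshold: number };

// A rule with no data does not fire; otherwise compare against the threshold.
function ruleFires(rule: Rule, value: number | null): boolean {
  if (value === null) return false;
  return rule.operator === "lt" ? value < rule.threshold : value > rule.threshold;
}
```

Excluding null metrics matters for the average: counting them as zero would drag the pass rate down and could fire a "lt" rule on runs that never produced the metric.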

Add optional cron scheduling to analysis definitions plus a background scheduler that fires due runs: the "every night, analyze yesterday's calls" half of the IVR automation. Combined with the alert collectors, a nightly run that dips below a quality threshold now triggers an alert automatically.

- IAnalysisDefinition gains an optional `schedule: { cron, enabled }` (SQLite column + mixin handling; Mongo persists natively)
- pure schedulePlanner (validateCron / computeNextRun / isDue); a slot fires at most once, decided against the most recent run's timestamp
- service.runScheduledAnalyses: finds due definitions for a tenant and runs them (createdBy 'system'), collecting per-definition errors
- analysisScheduler: 60s interval + distributed lock + per-tenant loop, mirroring the alert scheduler; started from server bootstrap
- API accepts and validates `schedule` on definition create/update

Covered by planner unit tests and an e2e case (schedule round-trips through SQLite and fires via runScheduledAnalyses). tsc, eslint, and the full suite (2121 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
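
The "a slot fires at most once" rule can be sketched as: find the most recent cron slot at or before now, and fire only if the last run predates that slot. The sketch below is not the PR's schedulePlanner. It assumes a deliberately tiny cron subset (five fields, each `*` or a single integer, evaluated in UTC, with all fields ANDed together), and `parseCron`, `previousSlot`, and this `isDue` signature are hypothetical.

```typescript
// Parse a five-field expression; null means "*". Returns null when invalid.
function parseCron(expr: string): (number | null)[] | null {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== 5) return null;
  // minute, hour, day-of-month, month, day-of-week
  const ranges: [number, number][] = [[0, 59], [0, 23], [1, 31], [1, 12], [0, 6]];
  const fields: (number | null)[] = [];
  for (let i = 0; i < 5; i++) {
    if (parts[i] === "*") {
      fields.push(null);
    } else if (/^\d+$/.test(parts[i])) {
      const n = Number(parts[i]);
      if (n < ranges[i][0] || n > ranges[i][1]) return null;
      fields.push(n);
    } else {
      return null; // lists, ranges, and steps are outside this sketch's subset
    }
  }
  return fields;
}

function matches(fields: (number | null)[], d: Date): boolean {
  const actual = [d.getUTCMinutes(), d.getUTCHours(), d.getUTCDate(), d.getUTCMonth() + 1, d.getUTCDay()];
  return fields.every((f, i) => f === null || f === actual[i]);
}

// Walk back minute by minute to the latest slot at or before `now` (bounded lookback).
function previousSlot(fields: (number | null)[], now: Date, maxMinutes = 60 * 24 * 366): Date | null {
  let t = Math.floor(now.getTime() / 60000) * 60000;
  for (let i = 0; i <= maxMinutes; i++, t -= 60000) {
    const candidate = new Date(t);
    if (matches(fields, candidate)) return candidate;
  }
  return null;
}

// Due when the latest slot has not been covered by a run yet.
// Assumption: a definition that has never run is due as soon as any slot has passed.
function isDue(expr: string, now: Date, lastRunAt: Date | null): boolean {
  const fields = parseCron(expr);
  if (!fields) return false;
  const slot = previousSlot(fields, now);
  if (!slot) return false;
  return lastRunAt === null || lastRunAt.getTime() < slot.getTime();
}
```

Comparing against the last run's timestamp rather than tracking "fired" flags keeps the planner pure: a 60s scheduler tick that lands several times inside the same slot sees a run newer than the slot and skips it.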

Document both quality services and their automation.

- docs/api/evaluation.md: targets / datasets / suites / runs endpoints, scorers, run shape, and the evaluation alert metrics
- docs/api/analysis.md: definitions / conversations / runs endpoints, the four modes, conversation ingest, cron scheduling, and the analysis alert metrics
- docs/guide/evaluation-and-analysis.md: the four-layer architecture (pure engine → service → API → UI), data models, end-to-end walkthroughs, and the automation story (scheduled runs + threshold alerts) for the IVR use case
- wire all three into the VitePress sidebar (API + Guide)

Verified with `vitepress build` (no dead links).

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

Expose the nightly-run automation in the definition UI so a schedule can be set without the API.

- CreateDefinitionModal: a "Schedule" section to enable a cron schedule and edit the expression (defaults to 0 2 * * *, UTC), posted as `schedule` on create
- definitions table: a "Schedule" column showing the cron when enabled
- AnalysisDefinitionView gains the `schedule` field

tsc, eslint, and the full suite (2121 passed) are green.

https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3

claude added 10 commits June 3, 2026 15:16

anilguleroglu wants to merge 10 commits into main from claude/zealous-fermi-d9H2d


2 participants