feat(evaluation): add dependency-injected scoring engine core #140
Open
anilguleroglu wants to merge 10 commits into
Introduce the Evaluation service engine as a self-contained core with no
database, queue, or model-runtime coupling, so it compiles and is fully
unit-testable on its own. Persistence, live target/judge adapters, the
HTTP API, and the dashboard layer on top of this engine separately.
- types: dataset items, scorer configs (assertion | llm-judge), run and
  aggregate results, and injected target/judge invokers
- assertion scorer: equals / contains / notContains / regex / minimal
  JSON-schema / JSON-path checks, dependency-free
- llm-judge scorer: rubric-driven grading through an injected judge
  invoker, with 0..1 score normalisation and graceful failure handling
- runner: bounded-concurrency orchestration with pass-rate / score /
  latency aggregation and a per-item progress hook
Covered by 24 unit tests; tsc --noEmit and eslint are clean.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
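For reviewers who want a feel for the dependency-free assertion checks listed above, here is a minimal sketch of the string-based ones (equals / contains / notContains / regex) with a pass-rate helper. The type and function names (`AssertionCheck`, `ScoreResult`, `scoreAssertion`, `passRate`) are assumptions made for illustration; the PR's actual types are not shown in this conversation, and the JSON-schema and JSON-path checks are left out.

```typescript
// Hypothetical shapes for illustration only; not the PR's real types.
type AssertionCheck =
  | { kind: 'equals'; expected: string }
  | { kind: 'contains'; expected: string }
  | { kind: 'notContains'; expected: string }
  | { kind: 'regex'; pattern: string; flags?: string };

interface ScoreResult {
  passed: boolean;
  score: number; // 1 on pass, 0 on fail
  reason?: string;
}

// Evaluate one check against the target's output. May throw on a bad regex.
function evaluateCheck(output: string, check: AssertionCheck): boolean {
  switch (check.kind) {
    case 'equals':
      return output === check.expected;
    case 'contains':
      return output.includes(check.expected);
    case 'notContains':
      return !output.includes(check.expected);
    case 'regex':
      return new RegExp(check.pattern, check.flags).test(output);
  }
}

// Wrap the check so a malformed pattern becomes a failed result, not a crash.
function scoreAssertion(output: string, check: AssertionCheck): ScoreResult {
  try {
    const passed = evaluateCheck(output, check);
    return { passed, score: passed ? 1 : 0 };
  } catch {
    return { passed: false, score: 0, reason: 'invalid regex' };
  }
}

// Fraction of results that passed; 0 for an empty batch (an assumption).
function passRate(results: ScoreResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}
```

Returning a failed result for an invalid pattern is one way to read "graceful failure handling"; the real scorer may report such errors differently.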
Build the Evaluation service end-to-end on top of the engine core, wiring
it through the platform's dual-provider DB layer and HTTP API.
Data model (tenant-scoped, both SQLite + MongoDB providers):
- evaluation_targets, evaluation_datasets (items embedded as JSON),
evaluation_suites, evaluation_runs (result items + aggregate embedded)
- domain interfaces + DatabaseProvider contract methods + SQLite/Mongo
mixins composed into both providers; types re-exported from @/lib/database
Service + adapters:
- tenant-scoped CRUD for targets/datasets/suites and run listing/retrieval
- runSuite(): loads a suite, builds live invokers, drives the engine runner,
and persists the run + aggregate. Target/judge invokers are injectable so
orchestration is testable without live model calls
- live model target + llm-judge invokers via handleChatCompletion; agent and
external targets recognised but stubbed (recorded as per-item errors)
REST API: /evaluation/{targets,datasets,suites,runs} CRUD plus
POST /evaluation/suites/:key/run, registered in the API plugin.
Integration test runs the full vertical against a real SQLite provider with
injected fakes. Full suite green (2082 passed); tsc and eslint clean.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Add the Evaluations dashboard so the service is usable from the app, not
just the API.
UI:
- /dashboard/evaluations: tabbed page (Targets / Datasets / Suites / Runs)
  with stat tiles, DataGrids, create modals, delete, and a one-click "Run"
  action on suites that navigates to the run detail
- create modals: target (model/agent/external), dataset (JSON items with
  validation), suite (target + dataset + assertion / llm-judge scorers)
- /dashboard/evaluations/runs/[id]: run detail with aggregate stats and a
  per-item table (pass/fail, score, per-scorer breakdown, output/error)
Wiring:
- platform-services.json: new "evaluations" service (operate category)
- rbac.ts: evaluations PermissionService + definition + /api/evaluation
  route-prefix mapping
- dashboardServices.ts: register IconChecklist
- i18n: navigation labels (en + tr)
tsc, eslint, and the full test suite (2082 passed) are clean.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Introduce the Conversation Analysis service engine as a self-contained,
dependency-injected core with no database, queue, or model-runtime coupling.
Mirrors the evaluation engine's approach but is kept independent (its own
JSON helper rather than importing eval's).
Four composable analysis modes:
- extract  : field-set + prompt -> structured JSON, type-coerced (string /
  number / boolean / enum) with required-field validation
- store    : persistence intent (no engine effect)
- judge    : LLM conversation-quality scoring against a rubric (0..1)
- accuracy : reference-based per-field comparison of extracted values
The runner orchestrates extraction + optional judge + optional accuracy over
a batch with bounded concurrency, aggregating pass-rate, average judge score,
and average extraction accuracy, with a per-item progress hook.
Covered by 23 unit tests; tsc --noEmit and eslint are clean.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
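To make the extract and accuracy modes above concrete, here is a rough sketch of type coercion with required-field validation, followed by a per-field comparison against reference values. All names (`FieldSpec`, `coerceField`, `coerceExtraction`, `fieldAccuracy`) and the specific coercion rules (for example, accepting "yes"/"no" as booleans, or using strict equality for accuracy) are assumptions; the engine's real behaviour is not visible in this conversation.

```typescript
// Hypothetical field-set shape; the PR's real definitions are not shown.
type FieldType = 'string' | 'number' | 'boolean' | 'enum';

interface FieldSpec {
  key: string;
  type: FieldType;
  required?: boolean;
  enumValues?: string[]; // only meaningful when type is 'enum'
}

type FieldValue = string | number | boolean | null;

// Coerce one raw model value to the declared type; null means "no usable value".
function coerceField(spec: FieldSpec, raw: unknown): FieldValue {
  if (raw === null || raw === undefined) return null;
  const text = String(raw).trim();
  if (text === '') return null;
  switch (spec.type) {
    case 'string':
      return text;
    case 'number': {
      const n = typeof raw === 'number' ? raw : Number(text);
      return Number.isFinite(n) ? n : null;
    }
    case 'boolean': {
      if (typeof raw === 'boolean') return raw;
      const lower = text.toLowerCase();
      if (lower === 'true' || lower === 'yes') return true;
      if (lower === 'false' || lower === 'no') return false;
      return null;
    }
    case 'enum':
      return (spec.enumValues ?? []).includes(text) ? text : null;
  }
}

// Coerce every declared field and list required fields that ended up null.
function coerceExtraction(
  specs: FieldSpec[],
  raw: Record<string, unknown>,
): { fields: Record<string, FieldValue>; missing: string[] } {
  const fields: Record<string, FieldValue> = {};
  const missing: string[] = [];
  for (const spec of specs) {
    const value = coerceField(spec, raw[spec.key]);
    fields[spec.key] = value;
    if (spec.required && value === null) missing.push(spec.key);
  }
  return { fields, missing };
}

// Accuracy mode: share of reference fields whose extracted value matches exactly.
function fieldAccuracy(
  extracted: Record<string, FieldValue>,
  reference: Record<string, FieldValue>,
): number | null {
  const keys = Object.keys(reference);
  if (keys.length === 0) return null; // nothing to compare against
  const hits = keys.filter((k) => extracted[k] === reference[k]).length;
  return hits / keys.length;
}
```

Returning `null` from `fieldAccuracy` when there is no reference lets an aggregate skip that item instead of counting it as 0.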
Layer the Analysis service onto the engine core: dual-provider persistence,
a live model adapter, and a tenant-scoped REST API.
Persistence (MongoDB + SQLite parity):
- domain types + provider contract for analysis definitions, conversations,
  and runs
- analysis.mixin for both providers (SQLite uses JSON columns + row mappers;
  mappers are namespaced to avoid colliding with the evaluation mixin)
- SQLite schema (3 tables + indexes) and collection/table name maps
- curated type exports from the database index
Service + API:
- service.ts: CRUD for definitions/conversations, bulk conversation ingest,
  and runDefinition, which loads a definition + conversations, drives the
  pure engine via injectable model invokers, persists the run + aggregate,
  and (store mode) writes extracted fields back onto conversations
- adapters.ts: extraction/judge invokers backed by handleChatCompletion
- /api/analysis/* Fastify plugin (definitions, conversations, ingest, run,
  runs) registered in the API plugin
Verified by a SQLite-backed e2e test (extraction + judge + accuracy +
store-mode write-back, plus a per-item error path). tsc, eslint, and the full
suite (2108 passed) are green.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Add the Conversation Analysis dashboard so the service is usable from the
app, completing the vertical (engine + persistence + API + UI).
UI:
- /dashboard/analysis: tabbed page (Definitions / Conversations / Runs) with
  stat tiles, DataGrids, delete, and a one-click "Run analysis" action on
  definitions that navigates to the run detail
- CreateDefinitionModal: dynamic field-set builder (key/type/required/enum),
  mode toggles (store / accuracy / judge + rubric), and extraction/judge
  model selection
- IngestConversationsModal: paste a JSON array of transcripts (with optional
  referenceFields for accuracy), validated client-side
- /dashboard/analysis/runs/[id]: run detail with aggregate stats (analyzed,
  avg judge score, avg accuracy, failed) and a per-conversation table
  (extracted fields, judge, accuracy)
Wiring:
- platform-services.json: new "analysis" service (operate category)
- rbac.ts: analysis PermissionService + definition + /api/analysis route map
- dashboardServices.ts: register IconReportAnalytics
- i18n: navigation labels (en + tr)
tsc, eslint, and the full test suite (2108 passed) are clean.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
…lerts
Wire the analysis and evaluation services into the existing alert pipeline
so teams can be alerted when quality drops. No new alert logic is needed:
the alertScheduler/alertEvaluator already evaluate rules against collectors.
- AlertModule gains 'analysis' and 'evaluation'; AlertMetric gains
  analysis_pass_rate / analysis_avg_judge_score / analysis_avg_accuracy and
  evaluation_pass_rate / evaluation_avg_score (0–100 percentages)
- alertService MODULE_METRICS + VALID_MODULES extended
- AnalysisCollector + EvaluationCollector average the persisted run
  aggregate over completed runs in the window (dual-provider: SQLite
  json_extract / Mongo $avg), via a shared runAggregateHelper; null metrics
  are excluded and the projectId scope is honoured
A rule like "analysis_pass_rate lt 80 over 24h" now fires through the normal
channels.
Covered by a SQLite-backed collector test; tsc, eslint, and the full suite
(2113 passed) are green.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
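The collector behaviour described above (average the run aggregate over completed runs in the window, skip null metrics, report a 0–100 percentage) can be sketched in memory as follows. This is not the PR's implementation, which pushes the averaging into SQLite `json_extract` / Mongo `$avg`. The `RunRecord` shape, the 0..1 storage scale for `passRate`, and the function names are all assumptions for illustration.

```typescript
// Hypothetical persisted-run shape; field names are illustrative only.
interface RunRecord {
  status: 'running' | 'completed' | 'failed';
  completedAt: Date | null;
  aggregate: { passRate: number | null } | null; // assumed stored as 0..1
}

// Average pass rate (as a 0-100 percentage) over completed runs in the window.
// Returns null when no run contributes a value, so a rule has nothing to evaluate.
function averagePassRatePercent(runs: RunRecord[], now: Date, windowMs: number): number | null {
  const end = now.getTime();
  const start = end - windowMs;
  const values: number[] = [];
  for (const run of runs) {
    if (run.status !== 'completed' || run.completedAt === null) continue;
    const t = run.completedAt.getTime();
    if (t < start || t > end) continue; // outside the alert window
    const rate = run.aggregate?.passRate;
    if (rate === null || rate === undefined) continue; // null metrics are excluded
    values.push(rate * 100);
  }
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Only 'lt' appears in the commit message; 'gt' is added here as an assumption.
type Operator = 'lt' | 'gt';

function ruleFires(value: number | null, operator: Operator, threshold: number): boolean {
  if (value === null) return false; // no data: do not fire
  return operator === 'lt' ? value < threshold : value > threshold;
}
```

With this sketch, the rule "analysis_pass_rate lt 80 over 24h" corresponds to `ruleFires(averagePassRatePercent(runs, now, 24 * 60 * 60 * 1000), 'lt', 80)`.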
Add optional cron scheduling to analysis definitions plus a background
scheduler that fires due runs — the "every night, analyze yesterday's calls"
half of the IVR automation. Combined with the alert collectors, a nightly run
that dips below a quality threshold now triggers an alert automatically.
- IAnalysisDefinition gains an optional `schedule: { cron, enabled }`
(SQLite column + mixin handling; Mongo persists natively)
- pure schedulePlanner (validateCron / computeNextRun / isDue) — a slot fires
at most once, decided against the most recent run's timestamp
- service.runScheduledAnalyses: finds due definitions for a tenant and runs
them (createdBy 'system'), collecting per-definition errors
- analysisScheduler: 60s interval + distributed lock + per-tenant loop,
mirroring the alert scheduler; started from server bootstrap
- API accepts and validates `schedule` on definition create/update
Covered by planner unit tests and an e2e case (schedule round-trips through
SQLite and fires via runScheduledAnalyses). tsc, eslint, and the full suite
(2121 passed) are green.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
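The "a slot fires at most once, decided against the most recent run's timestamp" rule above can be illustrated with a deliberately narrow sketch. It only understands daily expressions of the form `m h * * *`, evaluates them in UTC, and treats a definition with no previous run as due; all three are assumptions made to keep the example short. The real schedulePlanner (validateCron / computeNextRun / isDue) is not shown in this conversation and presumably handles general cron expressions, so the helper names `parseDailyCron` and `mostRecentSlot` are hypothetical.

```typescript
interface DailySlot {
  minute: number;
  hour: number;
}

// Accept only "m h * * *" with plain integers; anything else is unsupported here.
function parseDailyCron(cron: string): DailySlot | null {
  const parts = cron.trim().split(/\s+/);
  if (parts.length !== 5) return null;
  const [m, h, dom, mon, dow] = parts;
  if (dom !== '*' || mon !== '*' || dow !== '*') return null;
  if (!/^\d+$/.test(m) || !/^\d+$/.test(h)) return null;
  const minute = Number(m);
  const hour = Number(h);
  if (minute > 59 || hour > 23) return null;
  return { minute, hour };
}

// Latest scheduled instant at or before `now` (UTC is an assumption).
function mostRecentSlot(slot: DailySlot, now: Date): Date {
  const candidate = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), slot.hour, slot.minute),
  );
  if (candidate.getTime() > now.getTime()) {
    candidate.setUTCDate(candidate.getUTCDate() - 1); // today's slot has not arrived yet
  }
  return candidate;
}

// Due when the last run predates the most recent slot, so each slot fires at most once.
function isDue(cron: string, lastRunAt: Date | null, now: Date): boolean {
  const slot = parseDailyCron(cron);
  if (slot === null) return false;
  if (lastRunAt === null) return true; // never run before: treated as due (assumption)
  return lastRunAt.getTime() < mostRecentSlot(slot, now).getTime();
}
```

Because the decision depends only on the last run's timestamp, a 60s polling loop can call `isDue` repeatedly without firing the same slot twice once a run has been recorded.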
Document both quality services and their automation.
- docs/api/evaluation.md: targets / datasets / suites / runs endpoints,
  scorers, run shape, and the evaluation alert metrics
- docs/api/analysis.md: definitions / conversations / runs endpoints, the
  four modes, conversation ingest, cron scheduling, and the analysis alert
  metrics
- docs/guide/evaluation-and-analysis.md: the four-layer architecture (pure
  engine → service → API → UI), data models, end-to-end walkthroughs, and the
  automation story (scheduled runs + threshold alerts) for the IVR use case
- wire all three into the VitePress sidebar (API + Guide)
Verified with `vitepress build` (no dead links).
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3
Expose the nightly-run automation in the definition UI so a schedule can be
set without the API.
- CreateDefinitionModal: a "Schedule" section to enable a cron schedule and
  edit the expression (defaults to 0 2 * * *, UTC), posted as `schedule` on
  create
- definitions table: a "Schedule" column showing the cron when enabled
- AnalysisDefinitionView gains the `schedule` field
tsc, eslint, and the full suite (2121 passed) are green.
https://claude.ai/code/session_01UDGtTEyau4AoGC5eQuQKK3