
Feature/trust score metric 17823921494002512507 #2625

Open
Danish2op wants to merge 5 commits into confident-ai:main from Danish2op:feature/trust-score-metric-17823921494002512507

Conversation

@Danish2op

Description

This PR addresses issue #2586 by implementing a new TrustScoreMetric. This metric evaluates the trustworthiness of an LLM's response based on the sources used during RAG retrieval, categorized by customizable trust tiers.

DeepEval currently evaluates LLM outputs on dimensions like faithfulness, relevance, and hallucination. However, two responses can score identically on faithfulness but have completely different trust profiles depending on where they sourced their information (e.g., SEC filings vs. unverified blog posts). This new orthogonal dimension ensures users can accurately measure output trust based on their retrieval sources.

What was built

A new TrustScoreMetric class that:

  1. Accepts source_tiers and threshold: Takes a dictionary mapping source identifiers/keywords to tier numbers (T1=most trusted, T5=least trusted), and a success threshold float (default 0.7).
  2. Follows the Standard DeepEval Interface: Implements both measure and a_measure on an LLMTestCase, exactly like the other base metrics, and supports the standard BaseMetric properties.
  3. Implements Accurate Scoring Logic (a minimal sketch follows this list):
    • Inspects test_case.retrieval_context.
    • Uses case-insensitive substring matching to map context chunks to user-provided source keys.
    • Maps match tiers to scores: T1=1.0, T2=0.8, T3=0.6, T4=0.4, T5=0.2. Unmatched sources receive a default neutral score of 0.5.
    • Computes the average of all chunk scores as the final trust score.
    • Produces a detailed human-readable reason string explaining which sources were found and their tiers.
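
The scoring path can be summarized in a short, hedged sketch. Helper names like _score_chunk and _trust_score are illustrative rather than the PR's actual internals, and the fallback for an empty retrieval_context is an assumption (the PR only notes that the test suite covers that case):

TIER_SCORES = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4, 5: 0.2}
UNMATCHED_SCORE = 0.5  # neutral default for chunks matching no source key

def _score_chunk(chunk: str, source_tiers: dict) -> float:
    # Case-insensitive substring match against each user-provided source key.
    lowered = chunk.lower()
    for source, tier in source_tiers.items():
        if source.lower() in lowered:
            return TIER_SCORES[tier]
    return UNMATCHED_SCORE

def _trust_score(retrieval_context: list, source_tiers: dict) -> float:
    # The final trust score is the average of the per-chunk scores.
    scores = [_score_chunk(chunk, source_tiers) for chunk in retrieval_context]
    return sum(scores) / len(scores) if scores else 0.0  # empty-context behavior assumed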

Changes Made

  • Added the deepeval/metrics/trust_score directory with __init__.py and trust_score.py.
  • Exported the new metric from deepeval/metrics/__init__.py.
  • Added a full test suite, tests/test_trust_score_metric.py, validating various scenarios: high/low trust, mixed and unmatched sources, threshold pass/fail, and empty retrieval contexts (one illustrative case is sketched below).
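
To illustrate the kind of scenario the suite covers, here is a hypothetical test for the neutral-score rule; the actual assertions in tests/test_trust_score_metric.py may differ:

from deepeval.metrics import TrustScoreMetric
from deepeval.test_case import LLMTestCase

def test_unmatched_source_scores_neutral():
    # A chunk that matches no source key should fall back to the neutral 0.5.
    metric = TrustScoreMetric(source_tiers={"SEC Filings": 1}, threshold=0.7)
    test_case = LLMTestCase(
        input="What is Apple's revenue?",
        actual_output="Apple's revenue is 394 billion.",
        retrieval_context=["A random forum comment about Apple's revenue."],
    )
    metric.measure(test_case)
    assert metric.score == 0.5      # unmatched chunks default to 0.5
    assert metric.success is False  # 0.5 falls below the 0.7 threshold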

How to use

from deepeval.metrics import TrustScoreMetric
from deepeval.test_case import LLMTestCase

# Map sources to tiers (Tier 1 is most trusted, Tier 5 is least)
source_tiers = {
    "SEC Filings": 1,
    "Verified Blog": 2,
    "Unverified Post": 4
}

metric = TrustScoreMetric(source_tiers=source_tiers, threshold=0.7)

test_case = LLMTestCase(
    input="What is Apple's revenue?",
    actual_output="Apple's revenue is 394 billion.",
    retrieval_context=["According to SEC filings, Apple's revenue is 394 billion."]
)

metric.measure(test_case)
print(metric.score)   # 1.0
print(metric.reason)  # Explains the specific tier mapped for the chunk
print(metric.success) # True
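
Because the final score is an average, mixed contexts land between tiers. As a hypothetical extension of the example above (reusing the same source_tiers and metric), two chunks matching T1 and T4 average to (1.0 + 0.4) / 2 = 0.7; the PR does not say whether a score exactly at the threshold passes, so success is not asserted here:

mixed_case = LLMTestCase(
    input="What is Apple's revenue?",
    actual_output="Apple's revenue is 394 billion.",
    retrieval_context=[
        "According to SEC filings, Apple's revenue is 394 billion.",  # T1 -> 1.0
        "An unverified post claims revenue doubled this year.",       # T4 -> 0.4
    ],
)

metric.measure(mixed_case)
print(metric.score)  # (1.0 + 0.4) / 2 = 0.7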
Testing

poetry run ruff check and poetry run black both run cleanly on the changed files.
poetry run pytest tests/test_trust_score_metric.py runs with a 100% pass rate.

This commit introduces a new `TrustScoreMetric` which evaluates the trustworthiness
of an LLM response based on the sources used during RAG retrieval. The metric
takes a dictionary of source strings mapped to tier values (T1-T5), and scores
the sources appropriately. It exports the new metric in the `deepeval/metrics/__init__.py`
file and provides comprehensive test cases for varying trust tiers, thresholds,
and edge cases.

vercel Bot commented Apr 22, 2026

@Danish2op is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.
