title Validation
description Validate models, pipelines, and runtime assumptions before deployment with repeatable checks and guardrails.
sidebar_position 1

MATA Validation & Evaluation Guide

Version: 1.9.0
Updated: March 2026

mata.val() evaluates any supported model against a labeled dataset and returns structured metrics. It follows the YOLO-style validation pattern: one function call, per-class breakdowns, PR/F1 curves, confusion matrices, and formatted console tables.

Supported tasks: detection, segmentation, classification, depth estimation, OCR / text recognition.



Quick Start

import mata

# Detection: COCO mAP
metrics = mata.val(
    "detect",
    model="facebook/detr-resnet-50",
    data="examples/configs/coco.yaml",
    verbose=True,
    plots=True,
    save_dir="runs/val/detect",
)
print(f"mAP@50:    {metrics.box.map50:.3f}")   # e.g. 0.644
print(f"mAP@50-95: {metrics.box.map:.3f}")     # e.g. 0.456

# Segmentation: box + mask mAP
metrics = mata.val(
    "segment",
    model="shi-labs/oneformer_coco_swin_large",
    data="examples/configs/coco.yaml",
    verbose=True,
)
print(f"Box  mAP@50: {metrics.box.map50:.3f}")
print(f"Mask mAP@50: {metrics.seg.map50:.3f}")

# Classification: top-k accuracy
metrics = mata.val(
    "classify",
    model="microsoft/resnet-101",
    data="examples/configs/imagenet.yaml",
    verbose=True,
)
print(f"Top-1: {metrics.top1:.1%}")   # e.g. 81.9%
print(f"Top-5: {metrics.top5:.1%}")   # e.g. 95.7%

# Depth estimation
metrics = mata.val(
    "depth",
    model="depth-anything/Depth-Anything-V2-Small-hf",
    data="examples/configs/diode.yaml",
    verbose=True,
)
print(f"AbsRel:   {metrics.abs_rel:.4f}")    # e.g. 0.3930
print(f"δ < 1.25: {metrics.delta_1:.1%}")    # e.g. 66.9%

# OCR: recognition metrics
metrics = mata.val(
    "ocr",
    model="easyocr",
    data="examples/configs/coco_text.yaml",
    verbose=True,
)
print(f"CER:      {metrics.cer:.4f}")         # e.g. 0.1523
print(f"WER:      {metrics.wer:.4f}")         # e.g. 0.2347
print(f"Accuracy: {metrics.accuracy:.1%}")    # e.g. 62.4%

API Reference: mata.val()

def val(
    task: str,
    *,
    model: str | Any | None = None,
    data: str | dict | None = None,
    predictions: list | None = None,
    ground_truth: str | list | None = None,
    conf: float = 0.001,
    iou: float = 0.50,
    device: str | None = None,
    verbose: bool = True,
    plots: bool = False,
    save_dir: str = "",
    split: str = "val",
    **kwargs,
) -> DetMetrics | SegmentMetrics | ClassifyMetrics | DepthMetrics | OCRMetrics

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task | str | required | One of "detect", "segment", "classify", "depth", "ocr" |
| model | str \| adapter \| None | None | HuggingFace ID, local path, config alias, or a pre-loaded adapter. Required when predictions is not given. |
| data | str \| dict \| None | None | Path to a dataset YAML file, or dict with equivalent keys. Required when predictions is not given. |
| predictions | list \| None | None | Pre-computed VisionResult/ClassifyResult/DepthResult list (standalone mode). |
| ground_truth | str \| list \| None | None | COCO-format JSON path, or list of GroundTruth objects (standalone mode). |
| conf | float | 0.001 | Minimum confidence threshold for predictions. |
| iou | float | 0.50 | IoU threshold for true-positive / false-positive matching. |
| device | str \| None | None | Device for inference ("cpu", "cuda", "cuda:0"). Auto-detected if omitted. |
| verbose | bool | True | Print per-class metrics table to stdout. |
| plots | bool | False | Save PR curve, F1 curve, and confusion matrix plots. |
| save_dir | str | "" | Directory for plots and metrics.json. Empty string disables saving. |
| split | str | "val" | Dataset split key within the YAML ("val", "test", "train"). |

Usage Modes

Dataset-driven: provide model + data; the Validator loads the model, iterates over images, runs inference, and computes metrics automatically:

metrics = mata.val("detect", model="facebook/detr-resnet-50", data="coco.yaml")

Standalone: provide predictions + ground_truth; no inference is performed:

metrics = mata.val("detect", predictions=my_preds, ground_truth="annotations.json")

Return Type

Returns one of DetMetrics, SegmentMetrics, ClassifyMetrics, DepthMetrics, or OCRMetrics depending on task.


Dataset YAML Format

Create a YAML file pointing to your dataset:

# examples/configs/coco.yaml
path: /data/coco # dataset root (absolute or relative to CWD)
val: val2017 # sub-directory containing validation images
annotations: annotations/instances_val2017.json # COCO-format annotation JSON (relative to path)
names: # optional: class-index → name mapping
  0: person
  1: bicycle
  2: car
  # ... up to class 79

Key fields:

| Field | Required | Description |
| --- | --- | --- |
| path | Yes | Root directory of the dataset |
| val | Yes | Sub-directory containing images for the split being evaluated |
| annotations | Yes | Path to COCO-format JSON file (relative to path) |
| names | No | Class index → name mapping. Auto-extracted from COCO JSON if omitted. |
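A quick pre-flight check can catch a missing key before a long validation run starts. The sketch below is a hypothetical helper (check_dataset_config is not part of MATA) that verifies the required fields listed above and resolves the image directory and annotation file relative to path. It assumes the dict form of data uses the same keys as the YAML file.

```python
from pathlib import Path


def check_dataset_config(cfg: dict, split: str = "val") -> dict:
    """Hypothetical pre-flight check; not part of the MATA API."""
    required = ("path", split, "annotations")
    missing = [key for key in required if key not in cfg]
    if missing:
        raise KeyError(f"dataset config is missing required keys: {missing}")
    root = Path(cfg["path"])
    return {
        "images_dir": root / cfg[split],              # split sub-directory
        "annotations": root / cfg["annotations"],     # relative to path
        "names": cfg.get("names"),                    # optional mapping
    }


cfg = {
    "path": "/data/coco",
    "val": "val2017",
    "annotations": "annotations/instances_val2017.json",
}
resolved = check_dataset_config(cfg)
print(resolved["images_dir"])
print(resolved["annotations"])
```

The same dict could then be passed as data= in place of a YAML path, since mata.val() accepts either form.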

Classification YAML

# examples/configs/imagenet.yaml
path: /data/imagenet
val: val
annotations: imagenet_val_labels.json # {filename: class_index} mapping

Depth YAML

# examples/configs/diode.yaml
path: /data/diode
val: val/indoors # or val/outdoors
annotations: diode_val_annotations.json # {rgb_path: depth_npy_path} mapping

OCR (COCO-Text) YAML

# examples/configs/coco_text.yaml
path: /data/coco-text
val: val2014 # COCO 2014 validation images directory
annotations: cocotext_v2.json # COCO-Text annotations with "text" field per annotation
names:
  0: text

The annotation JSON must follow COCO-Text format: each annotation dict contains a "text" key with the transcription string:

{
  "images": [
    {
      "id": 1,
      "file_name": "COCO_val2014_000001.jpg",
      "width": 640,
      "height": 480
    }
  ],
  "categories": [{ "id": 1, "name": "text" }],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 200, 50, 20],
      "text": "STOP"
    }
  ]
}

Dataset Download & Setup

COCO 2017 (Detection & Segmentation)

The COCO 2017 validation split is ~1 GB images + ~240 MB annotations.

mkdir -p /data/coco && cd /data/coco

# Validation images (1 GB)
wget http://images.cocodataset.org/zips/val2017.zip
unzip val2017.zip

# Annotations (241 MB)
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip annotations_trainval2017.zip

Expected layout:

/data/coco/
├── val2017/                     # ~5,000 validation images
└── annotations/
    └── instances_val2017.json

ImageNet ILSVRC 2012 (Classification)

Requires a free account on image-net.org.

mkdir -p /data/imagenet && cd /data/imagenet

tar -xvf ILSVRC2012_img_val.tar           # → val/  (50,000 images)
tar -xvf ILSVRC2012_bbox_val_v3.tgz       # → bbox_val/  (50,000 XMLs)

# Generate MATA-compatible annotation file
cd /path/to/MATA
python scripts/generate_imagenet_val_labels.py

Expected layout:

/data/imagenet/
├── val/                          # 50,000 validation images
└── imagenet_val_labels.json      # generated {filename: class_index}

DIODE (Depth Estimation)

DIODE provides indoor and outdoor depth maps (~5 GB).

mkdir -p /data/diode && cd /data/diode
wget http://diode-dataset.s3.amazonaws.com/val.tar.gz
tar -xvf val.tar.gz

Expected layout:

/data/diode/
├── val/
│   ├── indoors/          # scene_*/scan_*/ with RGB .png + depth .npy pairs
│   └── outdoors/
└── diode_val_annotations.json

Metrics Reference by Task

DetMetrics (Detection)

DetMetrics holds bounding-box AP results computed over 10 COCO IoU thresholds (0.50–0.95).

from mata import DetMetrics  # or: from mata.eval import DetMetrics

Fields

| Field | Type | Description |
| --- | --- | --- |
| names | dict[int, str] | Class ID → name mapping |
| box | Metric | Inner AP accumulator (see Metric) |
| speed | dict[str, float] | {"preprocess": ms, "inference": ms, "postprocess": ms} per image |
| confusion_matrix | ConfusionMatrix \| None | Optional confusion matrix |
| save_dir | str | Output directory for plots |

Key Properties

| Property | Returns | Example |
| --- | --- | --- |
| metrics.box.map | float: mAP at IoU 0.50–0.95 | 0.456 |
| metrics.box.map50 | float: mAP at IoU 0.50 | 0.644 |
| metrics.box.map75 | float: mAP at IoU 0.75 | 0.487 |
| metrics.box.mp | float: mean precision | 0.726 |
| metrics.box.mr | float: mean recall | 0.586 |
| metrics.box.maps | ndarray: per-class mAP | shape (nc,) |
| metrics.speed | dict: timing per image | {"preprocess": 0.001, ...} |
| metrics.maps | ndarray: alias for box.maps | shape (nc,) |
| metrics.ap_class_index | ndarray: class IDs with AP data | shape (nc,) |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| mean_results() | tuple[float, ...] | (precision, recall, mAP50, mAP50-95) averaged over classes |
| class_result(i) | tuple[float, ...] | Same metrics for class index i |
| fitness() | float | Weighted fitness score: 0.1 * mAP50 + 0.9 * mAP50-95 |
| summary() | list[dict] | Per-class breakdown as list of dicts |
| to_dict() | dict | Full results as nested dict |
| to_json() | str | JSON string of to_dict() |
| to_csv() | str | CSV-formatted string (one row per class) |

Real-World Example Output

Evaluated with facebook/detr-resnet-50 on COCO val2017 (5,000 images, 80 classes):

metrics.box.map50   = 0.644     # mAP at IoU 0.50
metrics.box.map     = 0.456     # mAP at IoU 0.50-0.95
metrics.box.mp      = 0.726     # mean precision
metrics.box.mr      = 0.586     # mean recall
metrics.fitness()   = 0.474
Speed: preprocess 0.001ms, inference 75.0ms, postprocess 0.011ms per image

SegmentMetrics (Segmentation)

SegmentMetrics extends DetMetrics with a separate seg namespace for mask-level AP.

from mata import SegmentMetrics  # or: from mata.eval import SegmentMetrics

Additional Field

| Field | Type | Description |
| --- | --- | --- |
| seg | Metric | Mask-level AP accumulator |

Key Properties

Box metrics are accessed via metrics.box.* (same as DetMetrics). Mask metrics use metrics.seg.*:

| Property | Returns | Description |
| --- | --- | --- |
| metrics.seg.map50 | float | Mask mAP at IoU 0.50 |
| metrics.seg.map | float | Mask mAP at IoU 0.50-0.95 |
| metrics.seg.map75 | float | Mask mAP at IoU 0.75 |
| metrics.seg.mp | float | Mask mean precision |
| metrics.seg.mr | float | Mask mean recall |
| metrics.seg.maps | ndarray | Per-class mask AP |

Fitness & Aggregation

# Fitness combines box and mask equally:
metrics.fitness()  # 0.5 * box.fitness() + 0.5 * seg.fitness()

# Maps combines both namespaces:
metrics.maps  # mean of box.maps and seg.maps
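The two fitness formulas documented here are plain weighted sums, so they are easy to check by hand. The sketch below re-implements them as free functions under that assumption; ap_fitness and segment_fitness are illustrative names rather than MATA functions, and the input scores are made up to keep the arithmetic simple.

```python
def ap_fitness(map50: float, map50_95: float) -> float:
    """Weighted score described for DetMetrics: 0.1 * mAP50 + 0.9 * mAP50-95."""
    return 0.1 * map50 + 0.9 * map50_95


def segment_fitness(box: tuple, seg: tuple) -> float:
    """Equal-weight blend of box and mask fitness, as described above."""
    return 0.5 * ap_fitness(*box) + 0.5 * ap_fitness(*seg)


# Made-up inputs, chosen only so the arithmetic is easy to follow.
box_scores = (0.60, 0.40)   # (mAP50, mAP50-95) for boxes
mask_scores = (0.50, 0.30)  # (mAP50, mAP50-95) for masks

box_fit = ap_fitness(*box_scores)                       # 0.06 + 0.36
seg_fit = segment_fitness(box_scores, mask_scores)      # 0.5 * 0.42 + 0.5 * 0.32
print(round(box_fit, 3), round(seg_fit, 3))
```

Because mAP50-95 carries 90% of the weight, a model that improves only at the loose IoU 0.50 threshold moves the fitness score very little.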

Methods

Same interface as DetMetrics: mean_results(), class_result(i), summary(), to_dict(), to_json(), to_csv().

For SegmentMetrics, mean_results() returns an 8-tuple: (box_P, box_R, box_mAP50, box_mAP50-95, mask_P, mask_R, mask_mAP50, mask_mAP50-95).

Metrics Keys

SegmentMetrics exports 8 keys in results_dict:

metrics/precision(B), metrics/recall(B), metrics/mAP50(B), metrics/mAP50-95(B),
metrics/precision(M), metrics/recall(M), metrics/mAP50(M), metrics/mAP50-95(M)

ClassifyMetrics (Classification)

ClassifyMetrics accumulates top-1 and top-5 accuracy across batches.

from mata import ClassifyMetrics  # or: from mata.eval import ClassifyMetrics

Fields

| Field | Type | Description |
| --- | --- | --- |
| names | dict[int, str] | Class ID → name mapping |
| nc | int | Number of classes (auto-inferred from names) |
| top1 | float | Running top-1 accuracy (0.0–1.0) |
| top5 | float | Running top-5 accuracy (0.0–1.0) |
| speed | dict[str, float] | Timing breakdown per image |
| confusion_matrix | ConfusionMatrix \| None | Optional confusion matrix |

Key Method: process_predictions()

metrics.process_predictions(
    pred_labels,    # (N,) predicted class indices
    target_labels,  # (N,) ground-truth class indices
    pred_top5=None, # (N, 5) top-5 predicted indices (optional)
)

Updates top1 and top5 incrementally. Can be called multiple times for batched evaluation.
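To make the incremental behaviour concrete, here is a minimal stand-in for the running bookkeeping. It is a sketch, not the ClassifyMetrics source: the class name TopKAccumulator is hypothetical, and the fallback used when pred_top5 is omitted (treating the top-1 prediction as the only candidate) is an assumption.

```python
class TopKAccumulator:
    """Hypothetical stand-in for running top-1 / top-5 accuracy."""

    def __init__(self) -> None:
        self.n = 0          # samples seen so far
        self.correct1 = 0   # top-1 hits
        self.correct5 = 0   # top-5 hits

    def process_predictions(self, pred_labels, target_labels, pred_top5=None):
        for i, (pred, target) in enumerate(zip(pred_labels, target_labels)):
            self.n += 1
            self.correct1 += int(pred == target)
            # Assumption: without top-5 candidates, only the top-1 label counts.
            candidates = pred_top5[i] if pred_top5 is not None else [pred]
            self.correct5 += int(target in candidates)

    @property
    def top1(self) -> float:
        return self.correct1 / self.n if self.n else 0.0

    @property
    def top5(self) -> float:
        return self.correct5 / self.n if self.n else 0.0


acc = TopKAccumulator()
# Batch 1: first sample is a top-1 hit, second is only a top-5 hit.
acc.process_predictions([1, 2], [1, 3], pred_top5=[[1, 0, 2, 3, 4], [2, 3, 0, 1, 4]])
# Batch 2: the target (4) is absent from the candidates.
acc.process_predictions([0], [4], pred_top5=[[0, 1, 2, 3, 5]])
print(acc.top1, acc.top5)
```

Keeping raw hit counts rather than a running mean means batches of different sizes are weighted correctly.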

Properties

| Property | Returns | Example |
| --- | --- | --- |
| metrics.top1 | float | 0.8192 |
| metrics.top5 | float | 0.9569 |
| metrics.fitness | float | mean(top1, top5) = 0.888 |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| mean_results() | tuple[float, float] | (top1, top5) |
| class_result(i) | tuple | Per-class accuracy for class i |
| summary() | list[dict] | [{"top1_acc": ..., "top5_acc": ..., "n_samples": ...}] |
| to_dict() | dict | Full results dict |
| to_json() | str | JSON string |
| to_csv() | str | CSV string |

Real-World Example Output

Evaluated with microsoft/resnet-101 on ImageNet val (50,000 images, 1,000 classes):

metrics.top1    = 0.8192    # 81.92% top-1 accuracy
metrics.top5    = 0.9569    # 95.69% top-5 accuracy
metrics.fitness = 0.888
Speed: preprocess 0.002ms, inference 53.1ms, postprocess 0.006ms per image

DepthMetrics (Depth Estimation)

DepthMetrics implements standard depth estimation metrics from Eigen et al. (2014).

from mata import DepthMetrics  # or: from mata.eval import DepthMetrics

Configuration Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| align_scale | bool | True | Apply median scaling alignment before computing metrics. Compensates for scale ambiguity in monocular depth models. |
| align_affine | bool | False | Apply least-squares affine (scale + shift) alignment. Mutually exclusive with align_scale. |

Metric Fields

| Field | Type | Description |
| --- | --- | --- |
| abs_rel | float | Mean absolute relative error: $\frac{1}{n}\sum\frac{\lvert d - \hat{d} \rvert}{d}$ |
| sq_rel | float | Mean squared relative error: $\frac{1}{n}\sum\frac{(d - \hat{d})^2}{d}$ |
| rmse | float | Root mean squared error |
| log_rmse | float | RMSE in log space |
| delta_1 | float | Fraction of pixels where $\max(\frac{\hat{d}}{d}, \frac{d}{\hat{d}}) < 1.25$ |
| delta_2 | float | $\delta < 1.25^2$ |
| delta_3 | float | $\delta < 1.25^3$ |
| speed | dict[str, float] | Timing breakdown |
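The formulas above can be sketched in a few lines of standard-library Python. This is an illustration for a single image given as flat lists of depths, assuming median scaling as described for align_scale; the function name depth_metrics is hypothetical, and MATA's own per-image averaging and masking may differ.

```python
import math
from statistics import median


def depth_metrics(pred, gt, align_scale=True):
    """Eigen-style depth metrics over flat lists of depths (illustrative only)."""
    # Keep only pixels where both depths are positive (a simple validity mask).
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    if align_scale:
        # Median scaling: rescale predictions so their median matches the GT median.
        scale = median(g for _, g in pairs) / median(p for p, _ in pairs)
        pairs = [(p * scale, g) for p, g in pairs]
    n = len(pairs)
    ratios = [max(p / g, g / p) for p, g in pairs]
    return {
        "abs_rel": sum(abs(g - p) / g for p, g in pairs) / n,
        "sq_rel": sum((g - p) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((g - p) ** 2 for p, g in pairs) / n),
        "log_rmse": math.sqrt(sum((math.log(g) - math.log(p)) ** 2 for p, g in pairs) / n),
        "delta_1": sum(r < 1.25 for r in ratios) / n,
        "delta_2": sum(r < 1.25 ** 2 for r in ratios) / n,
        "delta_3": sum(r < 1.25 ** 3 for r in ratios) / n,
    }


# Toy data: three predictions are exactly half the true depth, one is correct.
gt_depth = [1.0, 2.0, 4.0, 8.0]
pred_depth = [0.5, 1.0, 2.0, 8.0]
scores = depth_metrics(pred_depth, gt_depth)
print(scores["abs_rel"], scores["rmse"], scores["delta_1"])
```

After median scaling (factor 2 here) the three half-scale pixels become exact, while the previously correct pixel is now off by a factor of two. This shows how alignment trades absolute scale for relative structure.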

Key Methods

# Per-image accumulation
metrics.process_batch(
    pred_depth,           # (H, W) float array
    gt_depth,             # (H, W) float array
    valid_mask=None,      # (H, W) bool array (optional)
)

# Finalize averages after all images
metrics.finalize()

# update() is an alias for process_batch()
metrics.update(pred_depth, gt_depth)

Properties

| Property | Returns | Description |
| --- | --- | --- |
| metrics.fitness | float | delta_1 - abs_rel (higher is better) |
| metrics.keys | list[str] | 7 metric key names |
| metrics.results_dict | dict | 8 entries (7 metrics + fitness) |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| mean_results() | tuple[float, ...] | All 7 metric values as a tuple |
| summary() | list[dict] | Single-row summary with all metrics |
| to_dict() | dict | Full results dict |
| to_json() | str | JSON string |
| to_csv() | str | CSV string |

Real-World Example Output

Evaluated with depth-anything/Depth-Anything-V2-Small-hf on DIODE indoor (771 images):

metrics.abs_rel  = 0.3930    # absolute relative error
metrics.sq_rel   = 4.4949    # squared relative error
metrics.rmse     = 4.3978    # root mean squared error
metrics.log_rmse = 1.9102    # log-space RMSE
metrics.delta_1  = 0.6694    # 66.94% pixels within 1.25× ratio
metrics.delta_2  = 0.8158    # 81.58% within 1.25²
metrics.delta_3  = 0.8834    # 88.34% within 1.25³
metrics.fitness  = 0.2765
Speed: preprocess 0.002ms, inference 145.4ms, postprocess 0.013ms per image

OCRMetrics (OCR / Text Recognition)

OCRMetrics implements recognition-only evaluation metrics: CER, WER, and exact-match accuracy. Evaluation is image-level: all predicted text regions are concatenated and compared against all ground-truth transcriptions (also concatenated).

from mata import OCRMetrics  # or: from mata.eval import OCRMetrics

Configuration Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| case_sensitive | bool | False | When False (default), both predicted and GT text are lowercased before comparison. Matches ICDAR convention. |

Metric Fields

| Field | Type | Description |
| --- | --- | --- |
| cer | float | Mean Character Error Rate: $\text{CER} = \frac{\text{Levenshtein}(pred, gt)}{\max(\lvert gt \rvert, 1)}$ |
| wer | float | Mean Word Error Rate: $\text{WER} = \frac{\text{Levenshtein}(pred_{words}, gt_{words})}{\max(\lvert gt_{words} \rvert, 1)}$ |
| accuracy | float | Exact-match accuracy: $\frac{\text{count}(pred = gt)}{N}$, the fraction of images with identical transcription |
| speed | dict[str, float] | Timing breakdown per image |
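For readers who want to see the CER and WER definitions in action, the following sketch computes both for one image with a textbook Levenshtein distance. The helper names levenshtein and ocr_scores are illustrative, not MATA functions, and whitespace tokenization for WER is an assumption.

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def ocr_scores(pred: str, gt: str, case_sensitive: bool = False):
    """Return (cer, wer, exact_match) for one image-level transcription."""
    if not case_sensitive:
        pred, gt = pred.lower(), gt.lower()
    cer = levenshtein(pred, gt) / max(len(gt), 1)
    pred_words, gt_words = pred.split(), gt.split()
    wer = levenshtein(pred_words, gt_words) / max(len(gt_words), 1)
    return cer, wer, pred == gt


cer, wer, exact = ocr_scores("STOP sigm", "stop sign")
strict_cer, strict_wer, _ = ocr_scores("STOP sigm", "stop sign", case_sensitive=True)
print(cer, wer, exact)
print(strict_cer, strict_wer)
```

With the default case-insensitive comparison only the final character differs, so one edit out of nine characters and one wrong word out of two are counted. With case_sensitive=True every uppercase letter also counts as an error.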

Key Methods

# Per-image accumulation
metrics.process_batch(
    pred_text,    # str: full predicted transcription for this image
    gt_text,      # str: full ground-truth transcription for this image
)

# Finalize averages after all images
metrics.finalize()

# update() is an alias for process_batch()
metrics.update(pred_text, gt_text)

Properties

| Property | Returns | Description |
| --- | --- | --- |
| metrics.fitness | float | accuracy (exact-match ratio) |
| metrics.keys | list[str] | 3 metric key names |
| metrics.results_dict | dict | 4 entries (3 metrics + fitness) |

Methods

| Method | Returns | Description |
| --- | --- | --- |
| mean_results() | list[float] | [cer, wer, accuracy] |
| summary() | list[dict] | Single-row summary with all metrics |
| to_dict() | dict | Full results dict |
| to_json() | str | JSON string |
| to_csv() | str | CSV string (header + one data row) |

case_sensitive Parameter

By default, text comparison is case-insensitive (case_sensitive=False), matching the ICDAR benchmark convention. Pass case_sensitive=True to mata.val() to enable strict casing:

metrics = mata.val(
    "ocr",
    model="easyocr",
    data="examples/configs/coco_text.yaml",
    case_sensitive=True,  # "Stop" ≠ "stop"
)

Real-World Example Output

Evaluated with EasyOCR on COCO-Text v2 val split (~9,000 images):

metrics.cer       = 0.1523    # mean character error rate
metrics.wer       = 0.2347    # mean word error rate
metrics.accuracy  = 0.6240    # 62.4% exact-match accuracy
metrics.fitness   = 0.6240
Speed: preprocess 0.001ms, inference 210.3ms, postprocess 0.002ms per image

Supporting Classes

Metric (Base AP Accumulator)

Metric is the internal dataclass that stores per-class AP results. It powers the .box and .seg fields of DetMetrics and SegmentMetrics.

from mata.eval.metrics import Metric

Properties

| Property | Returns | Description |
| --- | --- | --- |
| ap50 | ndarray (nc,) | Per-class AP at IoU 0.50 |
| ap | ndarray (nc, 10) | Per-class AP at each of 10 COCO thresholds |
| map50 | float | Mean AP at IoU 0.50 |
| map75 | float | Mean AP at IoU 0.75 |
| map | float | Mean AP at IoU 0.50–0.95 |
| maps | ndarray (nc,) | Per-class mAP (mean over 10 thresholds) |
| mp | float | Mean precision (at optimal F1 threshold) |
| mr | float | Mean recall (at optimal F1 threshold) |
| curves | tuple | (precision_curve, recall_curve, f1_curve, x_axis) |
| curves_results | tuple | 12-element tuple from ap_per_class() |

Methods

| Method | Signature | Description |
| --- | --- | --- |
| update | (results: tuple) → None | Populate from ap_per_class() output |
| class_result | (i: int) → tuple | (p, r, ap50, ap50-95) for class index i |
| mean_results | () → tuple | (mean_p, mean_r, mean_ap50, mean_ap50-95) |
| fitness | () → float | 0.1 * map50 + 0.9 * map |

ConfusionMatrix

ConfusionMatrix accumulates predictions batch by batch and supports both detection and classification tasks.

from mata.eval.confusion_matrix import ConfusionMatrix

Constructor

cm = ConfusionMatrix(
    nc,                    # Number of foreground classes
    names=None,            # {class_id: "name"} mapping
    task="detect",         # "detect" or "classify"
    conf_threshold=0.25,   # Min confidence for detection predictions
    iou_threshold=0.45,    # IoU threshold for positive match
)

Matrix dimensions:

  • Detection: (nc+1) × (nc+1), with an extra row/column for background (unmatched)
  • Classification: nc × nc

Accumulation Methods

# Detection: greedy highest-confidence-first matching
cm.process_batch(
    detections,   # (N, 6) array: [x1, y1, x2, y2, conf, class_id]
    labels,       # (M, 5) array: [class_id, x1, y1, x2, y2]
)

# Classification: simple pred→true tally
cm.process_cls_preds(
    preds,        # (N,) predicted class indices
    targets,      # (N,) ground-truth class indices
)
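The detection path is described as greedy, highest-confidence-first matching. A minimal pure-Python sketch of that idea follows, using the array layouts shown above. The row/column orientation (rows as predicted class, columns as ground-truth class), the >= comparisons, and the function names are assumptions made for illustration, not a description of the ConfusionMatrix internals.

```python
def iou_xyxy(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_confusion(detections, labels, nc, conf_threshold=0.25, iou_threshold=0.45):
    """Build an (nc+1) x (nc+1) matrix; index nc is the background slot."""
    bg = nc
    matrix = [[0] * (nc + 1) for _ in range(nc + 1)]
    # Drop low-confidence detections, then visit the rest from most to least confident.
    kept = sorted((d for d in detections if d[4] >= conf_threshold),
                  key=lambda d: d[4], reverse=True)
    used = set()  # ground-truth indices already claimed by a detection
    for det in kept:
        best_j, best_iou = -1, iou_threshold
        for j, lab in enumerate(labels):
            if j in used:
                continue
            iou = iou_xyxy(det[:4], lab[1:])
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0:
            used.add(best_j)
            matrix[int(det[5])][int(labels[best_j][0])] += 1
        else:
            matrix[int(det[5])][bg] += 1      # prediction with no matching GT
    for j, lab in enumerate(labels):
        if j not in used:
            matrix[bg][int(lab[0])] += 1      # GT object that was never matched
    return matrix


detections = [
    [0, 0, 10, 10, 0.9, 0],     # matches the first GT box exactly
    [20, 20, 30, 30, 0.8, 1],   # overlaps the second GT box (IoU about 0.68)
    [50, 50, 60, 60, 0.6, 1],   # no GT nearby: counted against background
    [0, 0, 10, 10, 0.1, 0],     # below conf_threshold: ignored
]
labels = [
    [0, 0, 0, 10, 10],
    [1, 21, 21, 31, 31],
    [1, 80, 80, 90, 90],        # never detected
]
cm = greedy_confusion(detections, labels, nc=2)
for row in cm:
    print(row)
```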

Properties & Methods

| Name | Returns | Description |
| --- | --- | --- |
| matrix | ndarray | Raw confusion matrix |
| tp_fp() | (tp, fp) | True/false positive counts per class, shape (nc,) |
| plot(...) | None | Save matplotlib heatmap as confusion_matrix.png |
| print() | None | Print raw matrix to stdout |
| summary(normalize) | list[dict] | Per-class TP/FP/FN breakdown |
| to_json() | str | JSON export |
| to_csv() | str | CSV export |

Plot Example

# After validation completes:
metrics = mata.val("detect", model=..., data=..., plots=True, save_dir="runs/val/")
# → saves runs/val/confusion_matrix.png (when plots=True)

# Or manually from ConfusionMatrix:
metrics.confusion_matrix.plot(
    normalize=True,         # Row-normalize (percentages)
    save_dir="my_output/",  # Output directory
    names=class_names,      # Class name dict
)

DatasetLoader & GroundTruth

DatasetLoader parses a dataset YAML config and provides an iterator over (image_path, GroundTruth) pairs.

from mata.eval import DatasetLoader, GroundTruth

GroundTruth Dataclass

from dataclasses import dataclass

import numpy as np


# Fields without defaults are listed first, as Python dataclasses require.
@dataclass
class GroundTruth:
    image_id: int | str                # Unique image identifier
    image_path: str                    # Absolute path to the image file
    boxes: np.ndarray                  # (N, 4) xyxy bounding boxes
    labels: np.ndarray                 # (N,) class indices
    image_size: tuple[int, int]        # (width, height)
    masks: list | None = None          # Optional list of masks (RLE, binary, polygon)
    depth: np.ndarray | None = None    # Optional (H, W) ground-truth depth map
    text: list[str] | None = None      # Optional N transcription strings (OCR datasets only)

DatasetLoader Construction

# From YAML config file
loader = DatasetLoader.from_yaml("examples/configs/coco.yaml")

# From COCO JSON directly
loader = DatasetLoader.from_coco_json(
    images_dir="/data/coco/val2017",
    json_path="/data/coco/annotations/instances_val2017.json",
)

Properties

| Property | Returns | Description |
| --- | --- | --- |
| names | dict[int, str] | Class ID → name mapping |
| cat_id_to_label | dict[int, int] | COCO category ID → contiguous label index |
| class_names | list[str] | Ordered class name list |

Iteration

loader = DatasetLoader.from_yaml("coco.yaml")
for image_path, gt in loader:
    print(image_path)         # "/data/coco/val2017/000001.jpg"
    print(gt.boxes.shape)     # (N, 4) xyxy
    print(gt.labels.shape)    # (N,)
    print(gt.image_size)      # (640, 480)

Plots & Visualization

Enable plots=True in mata.val() to generate visual outputs in save_dir:

metrics = mata.val(
    "detect",
    model="facebook/detr-resnet-50",
    data="coco.yaml",
    plots=True,
    save_dir="runs/val/detect",
)

Auto-Generated Plots

| File | Description |
| --- | --- |
| PR_curve.png | Precision-Recall curve (top-5 classes + bold mean) |
| F1_curve.png | F1 score vs confidence threshold |
| confusion_matrix.png | Class confusion heatmap (when ConfusionMatrix is available) |

Manual Plot Functions

The following functions are also available for custom plotting:

from mata.eval.plots import plot_pr_curve, plot_f1_curve, plot_p_curve, plot_r_curve

# All follow the same signature pattern:
plot_pr_curve(
    px,          # (1000,) x-axis (recall values)
    py,          # (nc, 1000) y-axis (precision per class)
    ap,          # (nc,) AP per class (for legend)
    save_dir,    # Output directory
    names=None,  # {class_id: "name"} or list
    save_path=None,  # Override filename
)

Visual style: Thin gray lines for individual classes (top-5 highlighted if nc > 10), bold blue line for the mean, legend with mAP@0.5 or max F1 value.

Note: The Validator auto-generates only PR_curve.png and F1_curve.png. The plot_p_curve() and plot_r_curve() functions are available for manual use but are not called automatically.


Standalone Mode

Standalone mode lets you score pre-computed predictions against a COCO-format ground-truth file without re-running inference. This is useful when inference runs on a separate machine (e.g., a GPU cluster).

import mata

# Step 1: Collect predictions (can be done separately)
predictions = [
    mata.run("detect", img, model="facebook/detr-resnet-50")
    for img in image_list
]

# Step 2: Score against ground truth
metrics = mata.val(
    "detect",
    predictions=predictions,
    ground_truth="annotations/instances_val2017.json",
    conf=0.001,
    iou=0.50,
    verbose=True,
)
print(f"mAP@50: {metrics.box.map50:.3f}")

Requirements:

  • predictions must be a list of result objects matching the task type
  • ground_truth is a path to a COCO-format JSON file (detection/segmentation) or task-specific JSON
  • Images are matched by filename between predictions and ground truth

See examples/validation.py for complete, runnable examples of all four tasks plus the standalone workflow.


Serialization & Export

All metrics classes support consistent serialization:

Methods Available on All Metric Classes

metrics.to_dict()   # → dict: full results as nested dictionary
metrics.to_json()   # → str: JSON string of to_dict()
metrics.to_csv()    # → str: CSV-formatted string (one row per class)
metrics.summary()   # → list[dict]: per-class breakdown

metrics.json Output Structure

When save_dir is set, mata.val() writes a metrics.json file. Detection example:

{
  "results": {
    "metrics/precision(B)": 0.726,
    "metrics/recall(B)": 0.586,
    "metrics/mAP50(B)": 0.644,
    "metrics/mAP50-95(B)": 0.456,
    "fitness": 0.474
  },
  "speed": {
    "preprocess": 0.001,
    "inference": 75.036,
    "postprocess": 0.011
  },
  "per_class": [
    {
      "class_id": 0,
      "class_name": "person",
      "precision": 0.83,
      "recall": 0.695,
      "ap50": 0.795,
      "ap50_95": 0.548
    }
  ]
}
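Because metrics.json has a stable structure, it can double as a deployment gate in CI. The sketch below parses an inline copy of the results block shown above and reports which metrics fall under chosen minimums. failed_checks and the threshold values are hypothetical. A real pipeline would read the file from save_dir instead of a string.

```python
import json

# Inline excerpt of the structure shown above; a real run would load
# the metrics.json file written to save_dir.
raw = """
{
  "results": {
    "metrics/precision(B)": 0.726,
    "metrics/recall(B)": 0.586,
    "metrics/mAP50(B)": 0.644,
    "metrics/mAP50-95(B)": 0.456,
    "fitness": 0.474
  }
}
"""


def failed_checks(report: dict, minimums: dict) -> list:
    """Return the result keys whose value is below its minimum (hypothetical helper)."""
    results = report["results"]
    # A key that is absent from the report is treated as a failure.
    return [key for key, floor in minimums.items()
            if results.get(key, float("-inf")) < floor]


report = json.loads(raw)
# Example thresholds only; pick values that suit your own model and dataset.
minimums = {"metrics/mAP50(B)": 0.60, "metrics/mAP50-95(B)": 0.50}
failures = failed_checks(report, minimums)
print(failures)
```

A CI job could exit non-zero whenever the returned list is non-empty.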

results_dict Keys by Task

| Task | Keys |
| --- | --- |
| detect | metrics/precision(B), metrics/recall(B), metrics/mAP50(B), metrics/mAP50-95(B), fitness |
| segment | All detect keys + metrics/precision(M), metrics/recall(M), metrics/mAP50(M), metrics/mAP50-95(M) |
| classify | metrics/accuracy_top1, metrics/accuracy_top5, fitness |
| depth | metrics/abs_rel, metrics/sq_rel, metrics/rmse, metrics/log_rmse, metrics/delta_1, metrics/delta_2, metrics/delta_3, fitness |
| ocr | metrics/cer, metrics/wer, metrics/accuracy, fitness |

Console Output

When verbose=True (default), mata.val() prints a table to stdout.

Detection / Segmentation Table

Class          Images  Instances  P       R       mAP50   mAP50-95
all              5000     36335  0.726   0.586   0.644    0.456
person           5000     10777  0.830   0.695   0.795    0.548
bicycle          5000       314  0.782   0.478   0.606    0.348
car              5000      1918  0.708   0.612   0.675    0.426
...

Speed: pre-process 0.0ms, inference 75.0ms, post-process 0.0ms per image

Classification Table

top1_acc     top5_acc
0.819        0.957

Depth Table

abs_rel  sq_rel   RMSE    log_RMSE  δ<1.25  δ<1.25²  δ<1.25³
0.3930   4.4949   4.3978  1.9102    0.6694  0.8158   0.8834

OCR Table

         CER         WER    Accuracy
      0.1523      0.2347      0.6240

Speed: pre-process 0.0ms, inference 210.3ms, post-process 0.0ms per image

Architecture (For Contributors)

Module Map

src/mata/eval/
├── __init__.py           # Re-exports: Validator, DatasetLoader, GroundTruth + all metric classes
├── validator.py          # Validator: end-to-end evaluation pipeline (855 lines)
├── dataset.py            # DatasetLoader + GroundTruth: YAML/COCO JSON ingestion (501 lines)
├── confusion_matrix.py   # ConfusionMatrix: detect + classify modes (364 lines)
├── plots.py              # PR/F1/P/R curve plotting (385 lines)
├── printer.py            # Console table output (306 lines)
└── metrics/
    ├── __init__.py       # Re-exports all metric classes + low-level functions
    ├── base.py           # Metric dataclass + ap_per_class(): core 101-point COCO AP (404 lines)
    ├── detect.py         # DetMetrics: bounding-box AP container (195 lines)
    ├── segment.py        # SegmentMetrics: extends DetMetrics with mask AP (218 lines)
    ├── classify.py       # ClassifyMetrics: top-1/top-5 accumulator (203 lines)
    ├── depth.py          # DepthMetrics: Eigen et al. depth metrics (384 lines)
    ├── ocr.py            # OCRMetrics: CER/WER/accuracy recognition metrics (297 lines)
    └── iou.py            # box_iou, mask_iou, COCO_IOU_THRESHOLDS (227 lines)

Validator Pipeline

Validator.run()
  │
  ├── _build_loader()           → DatasetLoader (from YAML or standalone GT)
  ├── _load_adapter()           → mata.load(task, model), lazy adapter loading
  │
  ├── _iterate_images()         → for each (image, gt):
  │   ├── adapter.predict()        run inference with timing
  │   ├── _match_detections()      greedy IoU matching (detect)
  │   ├── _match_segments()        box + mask IoU matching (segment)
  │   ├── ClassifyMetrics.process_predictions()  (classify)
  │   └── DepthMetrics.process_batch()           (depth)
  │
  ├── _compute_metrics()        → ap_per_class() → Metric.update()
  ├── _build_label_remap()      → align adapter labels with GT labels
  ├── _build_confusion_matrix() → optional ConfusionMatrix
  ├── _save_plots()             → PR_curve.png, F1_curve.png, metrics.json
  └── _print_table()            → Printer.print_results()

Label Remapping

The Validator automatically handles label mismatches between model predictions and ground-truth annotations. It builds a remapping table by matching class names between the adapter's label vocabulary and the dataset's names dict. This handles:

  • DETR-style models (1-indexed COCO category IDs)
  • OneFormer-style models (contiguous 0-indexed labels)
  • Custom label sets (matched by string name)
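The name-based remapping described above can be pictured as a dictionary join on class names. The following is a minimal sketch under stated assumptions: build_label_remap is a hypothetical free function (the Validator's private method may behave differently), names are compared case-insensitively, and model labels with no dataset counterpart are simply left out.

```python
def build_label_remap(model_labels: dict, dataset_names: dict) -> dict:
    """Map model label IDs to dataset label IDs by matching class names."""
    # Normalizing case and whitespace is an assumption, not documented behaviour.
    name_to_dataset_id = {name.strip().lower(): idx for idx, name in dataset_names.items()}
    remap = {}
    for model_id, name in model_labels.items():
        dataset_id = name_to_dataset_id.get(name.strip().lower())
        if dataset_id is not None:
            remap[model_id] = dataset_id
    return remap


# A DETR-style vocabulary (1-indexed COCO category IDs) against a
# contiguous 0-indexed dataset names dict.
model_labels = {1: "person", 2: "bicycle", 3: "Car", 12: "street sign"}
dataset_names = {0: "person", 1: "bicycle", 2: "car"}
remap = build_label_remap(model_labels, dataset_names)
print(remap)
```

Predictions whose label has no entry in the table (here ID 12) would need to be dropped or routed to background before matching.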

Mask IoU Fallback

When computing mask IoU, the system tries:

  1. pycocotools (C-accelerated): fast path for RLE masks
  2. Numpy matrix multiply: fallback for binary/polygon masks or when pycocotools is unavailable

Both paths produce identical results. The fallback is ~3× slower but requires no extra dependencies.
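The matrix-multiply fallback rests on one observation: for binary masks, the dot product of two flattened masks equals their intersection area. A dependency-free sketch of that idea is shown below. It uses plain lists instead of NumPy arrays, and mask_iou_matrix is an illustrative name, not the library function.

```python
def mask_iou_matrix(masks1, masks2):
    """Pairwise IoU for binary masks given as flat 0/1 lists of equal length."""
    areas1 = [sum(m) for m in masks1]
    areas2 = [sum(m) for m in masks2]
    result = []
    for a, area_a in zip(masks1, areas1):
        row = []
        for b, area_b in zip(masks2, areas2):
            inter = sum(x * y for x, y in zip(a, b))   # dot product = overlap area
            union = area_a + area_b - inter
            row.append(inter / union if union else 0.0)
        result.append(row)
    return result


# Three 2x2 masks, flattened row by row.
top_row = [1, 1, 0, 0]
top_left = [1, 0, 0, 0]
bottom_row = [0, 0, 1, 1]
ious = mask_iou_matrix([top_row], [top_left, bottom_row])
print(ious)
```

With NumPy the inner loops collapse into a single product of an (N, H*W) matrix with the transpose of an (M, H*W) matrix, which is presumably where the fallback gets its name.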

Extension Points

To add a new evaluation task:

  1. Create a new dataclass in src/mata/eval/metrics/ implementing mean_results(), class_result(i), fitness(), summary(), to_dict(), to_json(), to_csv()
  2. Add accumulation logic (like process_predictions() or process_batch())
  3. Register the task string in Validator._SUPPORTED_TASKS
  4. Add a matching branch in Validator._iterate_images() and Validator._compute_metrics()
  5. Add console header tuple in Printer

Low-Level Functions

These functions from mata.eval.metrics are used internally but available for custom evaluation workflows.

ap_per_class()

from mata.eval.metrics import COCO_IOU_THRESHOLDS, ap_per_class

result = ap_per_class(
    tp,               # (N, T) bool: true positive flags per IoU threshold
    conf,             # (N,) float: confidence scores
    pred_cls,         # (N,) int: predicted class IDs
    target_cls,       # (M,) int: ground-truth class IDs
    iou_thresholds=COCO_IOU_THRESHOLDS,  # 10 thresholds by default
    eps=1e-16,
)
# Returns 12-element tuple:
# (tp_at_f1, fp_at_f1, precision, recall, f1, all_ap,
#  unique_classes, p_curve, r_curve, f1_curve, x_axis, prec_values)

Computes 101-point interpolated average precision (matching pycocotools.COCOeval within 0.01 tolerance).
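To show what 101-point interpolation means in practice, here is a compact sketch for a single class at a single IoU threshold. It follows the pycocotools convention (precision envelope, then sampling at recall 0.00, 0.01, ..., 1.00) and is an independent illustration. average_precision_101 is a hypothetical name, and ap_per_class() may differ in detail.

```python
def average_precision_101(tp_flags, confidences, n_gt: int) -> float:
    """AP for one class at one IoU threshold, sampled at 101 recall points."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    tp_cum = fp_cum = 0
    recall, precision = [], []
    for i in order:                      # walk predictions by descending confidence
        if tp_flags[i]:
            tp_cum += 1
        else:
            fp_cum += 1
        recall.append(tp_cum / max(n_gt, 1))
        precision.append(tp_cum / (tp_cum + fp_cum))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precision) - 2, -1, -1):
        precision[k] = max(precision[k], precision[k + 1])
    total = 0.0
    for step in range(101):
        r = step / 100
        # Precision at the first operating point whose recall reaches r (0 if none).
        total += next((precision[k] for k in range(len(recall)) if recall[k] >= r), 0.0)
    return total / 101


# Three predictions for a class with two ground-truth objects.
ap = average_precision_101(
    tp_flags=[True, False, True],
    confidences=[0.9, 0.8, 0.7],
    n_gt=2,
)
print(round(ap, 4))
```

In this toy case the envelope holds precision at 1.0 up to recall 0.5 (51 sample points) and at 2/3 beyond it (50 points), so the result is (51 + 50 * 2/3) / 101.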

box_iou()

from mata.eval.metrics import box_iou

iou_matrix = box_iou(boxes1, boxes2)  # (N, M) pairwise IoU
# boxes1: (N, 4) xyxy, boxes2: (M, 4) xyxy

box_iou_batch()

from mata.eval.metrics import box_iou_batch

matches = box_iou_batch(pred_boxes, gt_boxes, iou_thresholds)
# Returns: (T, N, M) boolean match matrix

mask_iou()

from mata.eval.metrics import mask_iou

iou_matrix = mask_iou(masks1, masks2, image_shape=(H, W))
# Supports: RLE dicts, binary (H,W) arrays, polygon coordinate lists

COCO_IOU_THRESHOLDS

from mata.eval.metrics import COCO_IOU_THRESHOLDS
# [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

Known Limitations

  1. Mask IoU fallback: When pycocotools is not installed, mask IoU uses a numpy fallback that is functionally identical but ~3× slower.

  2. num_targets in JSON export: Per-class instance counts appear as -1 in the metrics.json per_class entries. The console table (verbose mode) shows correct instance counts, but they are not propagated to the serialized output.

  3. Auto-generated plots: The Validator only auto-generates PR_curve.png and F1_curve.png when plots=True. The plot_p_curve() and plot_r_curve() functions exist but must be called manually if needed.

  4. Segmentation mask matching: If the model does not produce masks (or mask format is incompatible), mask AP metrics will be zero while box metrics remain valid.

  5. Single-IoU confusion matrix: The ConfusionMatrix operates at a single IoU threshold (default 0.45) and confidence threshold (default 0.25), regardless of the iou parameter passed to mata.val().

  6. OCR recognition-only: OCRMetrics computes recognition metrics only (CER, WER, exact-match accuracy). Text detection metrics (H-mean precision/recall/F1 on bounding-box matching) and end-to-end evaluation (combined detection + recognition) are not yet supported. The mode="e2e" option is reserved for a future release.

  7. OCR image-level comparison: Ground-truth transcriptions for all text regions in an image are concatenated with spaces before CER/WER computation. This avoids the hard problem of pairing predicted regions to GT regions (which requires IoU matching), but means per-region error rates are not available.
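The image-level comparison in item 7 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, predictions are assumed to be concatenated the same way as the ground truth, and OCRMetrics may apply extra normalisation (case folding, whitespace handling) that is not shown.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters of a string or a list of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def image_level_cer_wer(gt_regions, pred_regions):
    """Join all regions of one image with spaces, then score the image as one string."""
    reference = " ".join(gt_regions)
    hypothesis = " ".join(pred_regions)
    cer = levenshtein(reference, hypothesis) / max(len(reference), 1)
    ref_words = reference.split()
    wer = levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
    return cer, wer


# One image with two text regions; the model misreads "ST" as "5T".
cer, wer = image_level_cer_wer(["STOP", "MAIN ST"], ["STOP", "MAIN 5T"])
```

Here one wrong character out of twelve gives a CER of 1/12, while one wrong word out of three gives a WER of 1/3. The scores cannot say which region contained the error.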


ReID Tracking Notes (v1.9.2)

Appearance-Based Re-Identification (ReID) in MATA enhances BotSort's track-recovery capability after occlusion or target re-entry. This section covers how to enable ReID, inspect its outputs, and reason about tracking quality.

What ReID Adds

Without ReID, BotSort associates detections to tracks using two cues:

  1. IoU: spatial overlap between predicted and detected bounding boxes
  2. GMC: global motion compensation (sparse optical flow) for camera motion

When you supply reid_model=..., BotSort ReID is auto-enabled and a third cue is added:

  3. Cosine appearance distance: L2-normalised embedding vectors extracted from detection crops are compared against cached track features (smooth_feat)

This allows BotSort to re-associate tracks even when the predicted position drifts significantly due to occlusion gaps.
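The appearance cue can be pictured as a cost matrix of cosine distances between cached track features and fresh detection embeddings. The sketch below is a generic NumPy illustration with hypothetical function names. How BotSort's get_dists() gates or fuses this cost with the IoU cost is not shown and may differ.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each row vector to unit length."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def appearance_distance(track_feats, det_feats):
    """Return a (num_tracks, num_detections) cosine-distance matrix with values in [0, 2]."""
    tracks = l2_normalise(track_feats)
    dets = l2_normalise(det_feats)
    # For unit vectors the dot product equals cosine similarity.
    cost = 1.0 - tracks @ dets.T
    return np.clip(cost, 0.0, 2.0)


track_feats = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for each track's smooth_feat
det_feats = [[2.0, 0.0], [1.0, 1.0]]     # raw (unnormalised) detection embeddings
cost = appearance_distance(track_feats, det_feats)
```

A low cost means the detection looks like the track, so a lost track can be re-associated even when its predicted box no longer overlaps the detection.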

Enabling ReID

import mata

# Single-camera tracking with ReID
results = mata.track(
    "video.mp4",
    model="facebook/detr-resnet-50",
    tracker="botsort",
    reid_model="openai/clip-vit-base-patch32",  # any HF image encoder; auto-enables ReID
    conf=0.3,
)

# Inspect per-instance embedding vectors
for frame_result in results:
    for inst in frame_result.instances:
        print(f"Track #{inst.track_id} embedding shape: "
              f"{inst.embedding.shape if inst.embedding is not None else 'N/A'}")

ONNX models are also supported:

results = mata.track(
    "video.mp4",
    model="facebook/detr-resnet-50",
    reid_model="osnet_x1_0.onnx",     # local ONNX ReID model
)

Inspecting Embedding Quality

Each tracked instance with an active ReID encoder will have Instance.embedding populated with an L2-normalised float32 vector of shape (D,):

import numpy as np

for result in results:
    for inst in result.instances:
        if inst.embedding is not None:
            emb = inst.embedding
            assert abs(np.linalg.norm(emb) - 1.0) < 1e-5, "Not unit norm"
            print(f"Track #{inst.track_id}: {emb.shape}, norm={np.linalg.norm(emb):.4f}")

Cross-Camera ReID with ReIDBridge

ReIDBridge publishes confirmed-track embeddings to a shared Valkey store so independent tracker instances can resolve the same physical identity across feeds.

from mata.trackers import ReIDBridge

# Camera A
bridge_a = ReIDBridge(
    "valkey://localhost:6379",
    camera_id="cam-a",
    ttl=300,                  # embeddings expire after 5 min
    similarity_thresh=0.25,   # cosine similarity cutoff
)

# mata.track() with reid_bridge: each confirmed track is published after update()
for result in mata.track(
    "rtsp://cam-a/stream",
    model="facebook/detr-resnet-50",
    reid_model="openai/clip-vit-base-patch32",
    reid_bridge=bridge_a,
    stream=True,
):
    active = [i for i in result.instances if i.track_id is not None]
    print(f"Active tracks: {len(active)}")

# Camera B: query nearest identity from cam-a
bridge_b = ReIDBridge("valkey://localhost:6379", camera_id="cam-b")
query_embedding = ...  # np.ndarray shape (D,), L2-normalised
matches = bridge_b.query(query_embedding, exclude_camera="cam-b", top_k=1)
if matches:
    print(f"Best cross-camera match: {matches[0]}")
    # {'track_id': 7, 'camera_id': 'cam-a', 'similarity': 0.83, ...}
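Conceptually, a cross-camera query is a nearest-neighbour search over stored embeddings. The in-memory sketch below illustrates that idea without a Valkey server. The `records` list and the `query_nearest` function are hypothetical stand-ins. It also assumes similarity_thresh is a minimum cosine similarity, which may not match how ReIDBridge applies it.

```python
import numpy as np


def query_nearest(query, records, exclude_camera=None, top_k=1, similarity_thresh=0.25):
    """Return up to top_k stored identities most similar to a unit-norm query embedding."""
    matches = []
    for rec in records:
        if rec["camera_id"] == exclude_camera:
            continue  # do not match a camera against its own tracks
        # Both vectors are L2-normalised, so the dot product is cosine similarity.
        similarity = float(np.dot(query, rec["embedding"]))
        if similarity >= similarity_thresh:
            matches.append({
                "track_id": rec["track_id"],
                "camera_id": rec["camera_id"],
                "similarity": similarity,
            })
    matches.sort(key=lambda m: m["similarity"], reverse=True)
    return matches[:top_k]


# Hypothetical published embeddings (already unit norm).
records = [
    {"camera_id": "cam-a", "track_id": 7, "embedding": np.array([1.0, 0.0])},
    {"camera_id": "cam-a", "track_id": 9, "embedding": np.array([0.0, 1.0])},
    {"camera_id": "cam-b", "track_id": 1, "embedding": np.array([1.0, 0.0])},
]
query = np.array([0.6, 0.8])
best = query_nearest(query, records, exclude_camera="cam-b", top_k=1)
ranked = query_nearest(query, records, exclude_camera="cam-b", top_k=5)
```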

ReID Validation Tips

| Scenario | Recommended Approach |
| --- | --- |
| Verify embeddings are populated | Check inst.embedding is not None after update() |
| Measure track-recovery rate | Count frames where a lost track recovers its original ID |
| Tune appearance threshold | Adjust appearance_thresh in tracker_config (BotSort default: 0.25) |
| Reduce false re-associations | Set reid_model to a more discriminative encoder (e.g., OSNet instead of CLIP) |
| GPU inference for ReID | Pass device="cuda" to mata.load("track", ..., device="cuda") |
| ONNX production deployment | Export your ReID model to ONNX and pass the .onnx path as reid_model |
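One simple proxy for the track-recovery measurement is to count how often a track ID reappears after being absent. The helper below is a hypothetical sketch that works on plain lists of IDs. It cannot tell whether a reappearing ID belongs to the same physical object, which requires ground-truth identities.

```python
def count_track_recoveries(frames):
    """Count events where a previously seen track ID reappears after a gap.

    `frames` is an iterable of per-frame collections of active track IDs.
    """
    seen = set()       # every ID observed so far
    previous = set()   # IDs active in the previous frame
    recoveries = 0
    for ids in frames:
        current = set(ids)
        # Recovered: seen earlier, missing in the previous frame, present now.
        recoveries += len((current & seen) - previous)
        seen |= current
        previous = current
    return recoveries


# Track 2 drops out for one frame; tracks 1 and 2 both drop out in the fourth frame.
frames = [[1, 2], [1], [1, 2], [3], [1, 2, 3]]
recovered = count_track_recoveries(frames)
```

With tracking results, the per-frame ID lists can be built from result.instances by collecting each inst.track_id that is not None. Comparing the count with and without reid_model gives a rough sense of how much the appearance cue helps.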

Known Limitations (ReID)

  1. BotSort only: ReID is integrated into BotSort's get_dists() method. ByteTrack does not support appearance-distance matching; reid_model is silently ignored when tracker="bytetrack".

  2. No detection-level alignment: ReID embeddings are computed for all detections that pass the confidence threshold, not only those that fail IoU association. For very dense scenes this may increase latency. Future work: skip ReID for IoU-matched detections (40–60% latency reduction).

  3. Cross-camera ID namespace: Each tracker process maintains an independent STrack._count, so cross-camera track IDs are not globally unique. ReIDBridge resolves this at the application layer by storing (camera_id, track_id) pairs.

  4. Embedding warm-up: BotSort's smooth_feat is a running average that stabilises after ~5 frames. Track re-association quality may be lower for newly initialised tracks.

  5. Valkey dependency for ReIDBridge: ReIDBridge requires pip install datamata[valkey] (or datamata[redis]). If the server is unreachable, publish() / query() log a warning and return gracefully; tracking continues unaffected.
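The embedding warm-up in item 4 is easier to reason about with a concrete update rule. The sketch below assumes an exponential moving average with alpha=0.9 followed by renormalisation. Both the rule and the alpha value are assumptions for illustration, so check the tracker source for the actual update.

```python
import numpy as np


def update_smooth_feat(smooth_feat, feat, alpha=0.9):
    """Blend a new embedding into the running track feature and renormalise."""
    feat = np.asarray(feat, dtype=np.float32)
    feat = feat / np.linalg.norm(feat)
    if smooth_feat is None:
        return feat  # first observation: nothing to smooth yet
    blended = alpha * smooth_feat + (1.0 - alpha) * feat
    return blended / np.linalg.norm(blended)


# A track initialised from one noisy crop, then fed a consistent appearance.
noisy_first = np.array([1.0, 0.0])
true_appearance = np.array([0.0, 1.0])
smooth = update_smooth_feat(None, noisy_first)
history = [smooth]
for _ in range(20):
    smooth = update_smooth_feat(smooth, true_appearance)
    history.append(smooth)
```

Under this rule the very first embedding dominates early frames, which is why a track initialised from a poor crop can match badly until several cleaner observations have been blended in.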