This directory contains tools for evaluating sentence transformer models created with on2vec ontology augmentation against the MTEB (Massive Text Embedding Benchmark) suite.
Install MTEB dependencies:

```bash
pip install mteb
# or with uv
uv add mteb
```

## Quick Start

```bash
# Run all MTEB tasks on your model
python benchmark_runner.py ./hf_models/my-model --model-name my-ontology-model

# Run specific task types
python benchmark_runner.py ./hf_models/my-model --task-types STS Classification

# Quick test run (subset of tasks)
python benchmark_runner.py ./hf_models/my-model --quick
```

## Output Structure

```
mteb_results/
├── my-model/
│   ├── benchmark_summary.json      # Complete results data
│   ├── benchmark_report.md         # Human-readable report
│   └── task_results/               # Individual task JSON files
│       ├── STS12.json
│       ├── Banking77Classification.json
│       └── ...
```
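The summary JSON can be post-processed directly. Below is a minimal sketch of computing per-category averages, assuming a per-task layout with `task_type` and `main_score` fields; the exact schema written by `benchmark_runner.py` may differ, and the numbers here are illustrative:

```python
# Hypothetical example of the summary layout -- check your own
# benchmark_summary.json for the real field names.
summary = {
    "model_name": "my-ontology-model",
    "results": {
        "STS12": {"task_type": "STS", "main_score": 0.72},
        "STS13": {"task_type": "STS", "main_score": 0.78},
        "Banking77Classification": {"task_type": "Classification", "main_score": 0.81},
    },
}

def category_averages(results):
    """Group main scores by task type and average each group."""
    buckets = {}
    for task in results.values():
        buckets.setdefault(task["task_type"], []).append(task["main_score"])
    return {task_type: sum(scores) / len(scores) for task_type, scores in buckets.items()}

print(category_averages(summary["results"]))
```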
## Usage

```
python benchmark_runner.py MODEL_PATH [options]
```

Options:

- `--output-dir DIR`: Output directory (default: `./mteb_results`)
- `--model-name NAME`: Model name for results organization
- `--tasks TASK1 TASK2`: Run specific tasks only
- `--task-types TYPE1 TYPE2`: Run specific task categories
- `--batch-size N`: Batch size for evaluation (default: 32)
- `--device DEVICE`: Device to use (`cuda`/`cpu`)
- `--quick`: Run quick subset for testing
- `--log-file FILE`: Save logs to file
## Task Types

- `Classification`: Text classification tasks
- `Clustering`: Text clustering tasks
- `PairClassification`: Sentence pair classification
- `Reranking`: Document reranking tasks
- `Retrieval`: Information retrieval tasks
- `STS`: Semantic textual similarity
- `Summarization`: Text summarization evaluation
## Examples

```bash
# Benchmark vanilla model
python benchmark_runner.py sentence-transformers/all-MiniLM-L6-v2 \
    --model-name vanilla-miniLM --quick

# Benchmark your ontology-augmented model
python benchmark_runner.py ./hf_models/edam-text-model \
    --model-name edam-augmented --quick

# Compare results in mteb_results/ directory
```

```bash
# Focus on semantic similarity tasks (good for ontology models)
python benchmark_runner.py ./hf_models/bio-model \
    --task-types STS \
    --model-name biomedical-ontology

# Classification-heavy evaluation
python benchmark_runner.py ./hf_models/legal-model \
    --task-types Classification PairClassification \
    --model-name legal-ontology
```

```bash
# Full benchmark with logging
python benchmark_runner.py ./hf_models/production-model \
    --model-name production-ontology-v1 \
    --log-file benchmark.log \
    --batch-size 64 \
    --device cuda
```

The generated `benchmark_report.md` includes:
- Category Averages: Mean scores across task types
- Individual Results: Detailed metrics per task
- Task Counts: Number of tasks evaluated per category
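To weigh an augmented model against its base, the two result sets can be diffed programmatically. A sketch, assuming per-task main scores have been extracted into plain dicts; the task names and scores below are illustrative, not real benchmark results:

```python
def score_deltas(baseline, augmented):
    """Per-task score change from the baseline model to the augmented one."""
    return {task: round(augmented[task] - baseline[task], 3)
            for task in baseline if task in augmented}

# Illustrative numbers only
baseline = {"STS12": 0.70, "STS13": 0.74, "Banking77Classification": 0.80}
augmented = {"STS12": 0.73, "STS13": 0.77, "Banking77Classification": 0.79}

for task, delta in score_deltas(baseline, augmented).items():
    print(f"{task:28s} {delta:+.3f}")
```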
What each task category measures:

- STS Tasks: Measure semantic similarity understanding
- Classification: Domain knowledge application
- Retrieval: Information finding capabilities
- Clustering: Concept grouping abilities
Ontology-augmented models typically show:
- Higher STS scores: Better semantic understanding
- Improved classification: Domain-specific knowledge helps categorization
- Better clustering: Ontological relationships improve concept grouping
- Mixed retrieval: May depend on domain alignment
## Typical Workflow

```bash
# 1. Create ontology-augmented model
python create_hf_model.py e2e biomedical.owl bio-model

# 2. Quick benchmark test
python mteb_benchmarks/benchmark_runner.py ./hf_models/bio-model --quick

# 3. Full benchmark if results look promising
python mteb_benchmarks/benchmark_runner.py ./hf_models/bio-model
```

## Batch Benchmarking

```bash
# Create multiple fusion variants
python batch_hf_models.py process owl_files/ ./output

# Benchmark all variants
for model in ./output/models/*/; do
    python mteb_benchmarks/benchmark_runner.py "$model" \
        --model-name "$(basename "$model")" --quick
done
```

## Best Practices

- Use domain-relevant tasks: Focus on task types that align with your ontology domain
- Compare against base model: Always benchmark the base text model for comparison
- Start with --quick: Test subset first before running full benchmark
- Monitor resource usage: MTEB can be memory and compute intensive
- Save logs: Use `--log-file` for debugging and progress tracking
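After batch benchmarking, the per-model summaries can be collected into a single view for side-by-side comparison. A sketch, assuming each model directory under the results root contains a `benchmark_summary.json` as shown in the output structure above:

```python
import json
from pathlib import Path

def collect_summaries(results_dir):
    """Gather each model's summary JSON from <results_dir>/<model>/."""
    summaries = {}
    for summary_file in Path(results_dir).glob("*/benchmark_summary.json"):
        # Use the containing directory name as the model key
        summaries[summary_file.parent.name] = json.loads(summary_file.read_text())
    return summaries
```

For example, `collect_summaries("mteb_results")` would return a dict keyed by model name, ready for the kind of delta comparison shown earlier.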
## Troubleshooting

```bash
# Reduce batch size
python benchmark_runner.py model --batch-size 8

# Use CPU if GPU memory limited
python benchmark_runner.py model --device cpu

# Run specific working tasks only
python benchmark_runner.py model --tasks STS12 STS13 Banking77Classification

# Quick subset for testing
python benchmark_runner.py model --quick

# Focus on one task type
python benchmark_runner.py model --task-types STS
```