This directory contains tools for evaluating sentence transformer models created with on2vec ontology augmentation against the MTEB (Massive Text Embedding Benchmark) suite.
Install MTEB dependencies:

```bash
pip install mteb
# or with uv
uv add mteb
```

## Quick Start

```bash
# Run all MTEB tasks on your model
python benchmark_runner.py ./hf_models/my-model --model-name my-ontology-model

# Run specific task types
python benchmark_runner.py ./hf_models/my-model --task-types STS Classification

# Quick test run (subset of tasks)
python benchmark_runner.py ./hf_models/my-model --quick
```

## Output Structure

```
mteb_results/
├── my-model/
│   ├── benchmark_summary.json      # Complete results data
│   ├── benchmark_report.md         # Human-readable report
│   └── task_results/               # Individual task JSON files
│       ├── STS12.json
│       ├── Banking77Classification.json
│       └── ...
```
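The summary JSON can be post-processed directly. Below is a minimal sketch of computing per-category averages, assuming a per-task layout with `task_type` and `main_score` fields; the exact schema written by `benchmark_runner.py` may differ, and the numbers here are illustrative:

```python
# Hypothetical example of the summary layout -- check your own
# benchmark_summary.json for the real field names.
summary = {
    "model_name": "my-ontology-model",
    "results": {
        "STS12": {"task_type": "STS", "main_score": 0.72},
        "STS13": {"task_type": "STS", "main_score": 0.78},
        "Banking77Classification": {"task_type": "Classification", "main_score": 0.81},
    },
}

def category_averages(results):
    """Group main scores by task type and average each group."""
    buckets = {}
    for task in results.values():
        buckets.setdefault(task["task_type"], []).append(task["main_score"])
    return {task_type: sum(scores) / len(scores) for task_type, scores in buckets.items()}

print(category_averages(summary["results"]))
```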
## Usage

```
python benchmark_runner.py MODEL_PATH [options]
```

Options:

- `--output-dir DIR`: Output directory (default: `./mteb_results`)
- `--model-name NAME`: Model name for results organization
- `--tasks TASK1 TASK2`: Run specific tasks only
- `--task-types TYPE1 TYPE2`: Run specific task categories
- `--batch-size N`: Batch size for evaluation (default: 32)
- `--device DEVICE`: Device to use (`cuda`/`cpu`)
- `--quick`: Run quick subset for testing
- `--log-file FILE`: Save logs to file
## Task Types

- `Classification`: Text classification tasks
- `Clustering`: Text clustering tasks
- `PairClassification`: Sentence pair classification
- `Reranking`: Document reranking tasks
- `Retrieval`: Information retrieval tasks
- `STS`: Semantic textual similarity
- `Summarization`: Text summarization evaluation
## Examples

```bash
# Benchmark vanilla model
python benchmark_runner.py sentence-transformers/all-MiniLM-L6-v2 \
    --model-name vanilla-miniLM --quick

# Benchmark your ontology-augmented model
python benchmark_runner.py ./hf_models/edam-text-model \
    --model-name edam-augmented --quick

# Compare results in mteb_results/ directory
```

```bash
# Focus on semantic similarity tasks (good for ontology models)
python benchmark_runner.py ./hf_models/bio-model \
    --task-types STS \
    --model-name biomedical-ontology

# Classification-heavy evaluation
python benchmark_runner.py ./hf_models/legal-model \
    --task-types Classification PairClassification \
    --model-name legal-ontology
```

```bash
# Full benchmark with logging
python benchmark_runner.py ./hf_models/production-model \
    --model-name production-ontology-v1 \
    --log-file benchmark.log \
    --batch-size 64 \
    --device cuda
```

The generated `benchmark_report.md` includes:
- Category Averages: Mean scores across task types
- Individual Results: Detailed metrics per task
- Task Counts: Number of tasks evaluated per category
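To weigh an augmented model against its base, the two result sets can be diffed programmatically. A sketch, assuming per-task main scores have been extracted into plain dicts; the task names and scores below are illustrative, not real benchmark results:

```python
def score_deltas(baseline, augmented):
    """Per-task score change from the baseline model to the augmented one."""
    return {task: round(augmented[task] - baseline[task], 3)
            for task in baseline if task in augmented}

# Illustrative numbers only
baseline = {"STS12": 0.70, "STS13": 0.74, "Banking77Classification": 0.80}
augmented = {"STS12": 0.73, "STS13": 0.77, "Banking77Classification": 0.79}

for task, delta in score_deltas(baseline, augmented).items():
    print(f"{task:28s} {delta:+.3f}")
```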
What each task category measures:

- STS Tasks: Measure semantic similarity understanding
- Classification: Domain knowledge application
- Retrieval: Information finding capabilities
- Clustering: Concept grouping abilities
Ontology-augmented models typically show:
- Higher STS scores: Better semantic understanding
- Improved classification: Domain-specific knowledge helps categorization
- Better clustering: Ontological relationships improve concept grouping
- Mixed retrieval: May depend on domain alignment
## Typical Workflow

```bash
# 1. Create ontology-augmented model
python create_hf_model.py e2e biomedical.owl bio-model

# 2. Quick benchmark test
python mteb_benchmarks/benchmark_runner.py ./hf_models/bio-model --quick

# 3. Full benchmark if results look promising
python mteb_benchmarks/benchmark_runner.py ./hf_models/bio-model
```

## Batch Benchmarking

```bash
# Create multiple fusion variants
python batch_hf_models.py process owl_files/ ./output

# Benchmark all variants
for model in ./output/models/*/; do
    python mteb_benchmarks/benchmark_runner.py "$model" \
        --model-name "$(basename "$model")" --quick
done
```

## Best Practices

- Use domain-relevant tasks: Focus on task types that align with your ontology domain
- Compare against base model: Always benchmark the base text model for comparison
- Start with --quick: Test subset first before running full benchmark
- Monitor resource usage: MTEB can be memory and compute intensive
- Save logs: Use `--log-file` for debugging and progress tracking
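After batch benchmarking, the per-model summaries can be collected into a single view for side-by-side comparison. A sketch, assuming each model directory under the results root contains a `benchmark_summary.json` as shown in the output structure above:

```python
import json
from pathlib import Path

def collect_summaries(results_dir):
    """Gather each model's summary JSON from <results_dir>/<model>/."""
    summaries = {}
    for summary_file in Path(results_dir).glob("*/benchmark_summary.json"):
        # Use the containing directory name as the model key
        summaries[summary_file.parent.name] = json.loads(summary_file.read_text())
    return summaries
```

For example, `collect_summaries("mteb_results")` would return a dict keyed by model name, ready for the kind of delta comparison shown earlier.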
## Troubleshooting

```bash
# Reduce batch size
python benchmark_runner.py model --batch-size 8

# Use CPU if GPU memory limited
python benchmark_runner.py model --device cpu

# Run specific working tasks only
python benchmark_runner.py model --tasks STS12 STS13 Banking77Classification

# Quick subset for testing
python benchmark_runner.py model --quick

# Focus on one task type
python benchmark_runner.py model --task-types STS
```