A comprehensive benchmarking framework for evaluating discrete audio tokenizer performance across different models and datasets.
This repository provides tools and scripts for systematically evaluating audio tokenizers on multilingual speech data. The project runs on Clariden (CSCS Alps) and supports multiple datasets (EuroSpeech, FLEURS, GTZAN, NatureLM) with automatic dataset detection and unified evaluation pipeline.
The benchmarking framework focuses on two main objectives:
- Statistical Evaluation: Compute comprehensive metrics (MSE, SNR, SDR, PESQ, STOI, ESTOI) on 100 samples per language to assess tokenizer performance statistically.
- Sample Generation: Generate 5 audio samples per tokenizer-dataset-language combination for listening evaluation and qualitative assessment.
- Access to Clariden (CSCS Alps) cluster
- Assignment to the `infra01` group with a properly configured `.edf` file (recommended)
- `uv` package manager (recommended, for creating virtual environments)
- PyTorch NGC 24.11 environment (recommended)
```bash
git clone <your-repo-url>
cd benchmark-audio-tokenizer
```

Important: Virtual environments should be created within the NGC 24.11 environment on Clariden.
The project uses uv for fast virtual environment management. We use a two-stage dependency compilation approach:
- Top-level dependencies (`requirements-*-topdeps.txt`): High-level packages specified by the user
- Sub-dependencies (`requirements-*-subdeps.txt`): All transitive dependencies compiled by `uv pip compile`
This approach lets us use the system-installed PyTorch from NGC (avoiding CUDA compatibility issues) and install dependencies in a controlled, reproducible manner.
```bash
# Make sure you're in NGC 24.11 environment
# Then create all venvs:
make venvs
```

This creates virtual environments for all tokenizers: `.venv-neucodec/`, `.venv-cosyvoice2/`, `.venv-xcodec2/`, `.venv-wavtokenizer/`
```bash
# Create a specific tokenizer environment
make neucodec      # CPU-only PyTorch
make cosyvoice2    # Uses system-site-packages for PyTorch
make xcodec2       # CPU-only PyTorch
make wavtokenizer  # Uses system-site-packages for PyTorch
```

Each Makefile target:
- Removes the old venv (if it exists)
- Creates a new venv with `uv`
- Compiles top-level dependencies to sub-dependencies (where applicable)
- Removes conflicting PyTorch entries from compiled dependencies
- Installs dependencies without overshadowing system PyTorch from NGC
- Verifies the installation
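The conflict-removal step can be sketched as a small filter over the compiled sub-dependency file (an illustrative sketch, not the Makefile's actual implementation, which may use `sed` or similar):

```python
import re

# Packages provided by the NGC system environment that must not be reinstalled.
SYSTEM_PACKAGES = {"torch", "torchvision", "torchaudio"}

def strip_system_packages(requirements_text: str) -> str:
    """Drop requirement lines that would overshadow system-installed PyTorch."""
    kept = []
    for line in requirements_text.splitlines():
        # Extract the package name before any version specifier, extras, or marker.
        name = re.split(r"[<>=!\[ ;]", line.strip(), maxsplit=1)[0].lower()
        if name in SYSTEM_PACKAGES:
            continue  # skip the conflicting entry
        kept.append(line)
    return "\n".join(kept)
```

Installing the filtered file into a venv created with system-site-packages leaves NGC's CUDA-enabled PyTorch in place.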
Before running evaluations, we recommend testing your setup with the example notebooks in the examples/ directory:
```bash
# Activate a tokenizer environment
source .venv-neucodec/bin/activate

# Start Jupyter
jupyter notebook examples/neucodec.ipynb
```

Available notebooks:
- `neucodec.ipynb`
- `cosyvoice2.ipynb`
- `xcodec2.ipynb`
- `wavtokenizer.ipynb`
These notebooks demonstrate basic tokenizer usage and help verify that your environment is correctly configured.
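If you prefer a script over a notebook, a minimal smoke test can follow this shape (a sketch only: `encode`/`decode` are hypothetical wrapper method names, not necessarily the actual API of the wrappers in `src/audio_tokenizers/`):

```python
import numpy as np

def roundtrip_check(tokenizer, audio: np.ndarray, sample_rate: int) -> float:
    """Encode and decode a waveform, returning the reconstruction MSE.

    `tokenizer` is any object exposing hypothetical encode()/decode() methods.
    """
    tokens = tokenizer.encode(audio, sample_rate)
    recon = tokenizer.decode(tokens, sample_rate)
    n = min(len(audio), len(recon))  # codecs may pad or trim the output
    return float(np.mean((audio[:n] - recon[:n]) ** 2))

class IdentityTokenizer:
    """Stand-in tokenizer for exercising the harness itself."""
    def encode(self, audio, sr):
        return audio
    def decode(self, tokens, sr):
        return tokens
```

Run inside the matching venv with a real wrapper; a plausibly low MSE on a short clip is a quick sign the environment is configured correctly.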
```
.
├── examples/                     # Example notebooks for testing tokenizers
├── logs/                         # Execution logs (.out and .err files per job)
├── metrics/                      # Evaluation results and metrics (JSON output)
├── samples/                      # Generated audio samples for listening evaluation
├── scripts/                      # All Python scripts and shell scripts
│   ├── tokenizer_evaluation.py   # Main evaluation script
│   ├── generate_samples.py       # Sample generation script
│   ├── submit_missing_jobs.py    # Automatic job submission
│   ├── analyze_tokenizers.py     # Analysis and visualization
│   └── ...
├── src/
│   ├── audio_tokenizers/         # Tokenizer implementations and wrappers
│   └── repos/                    # External repository dependencies
├── .venv-*/                      # Virtual environments for each tokenizer
├── requirements-*-topdeps.txt    # Top-level dependencies
├── requirements-*-subdeps.txt    # Compiled sub-dependencies
└── Makefile                      # Environment setup automation
```
EuroSpeech:
- 22 languages: bosnia-herzegovina, bulgaria, croatia, denmark, estonia, finland, france, germany, greece, iceland, italy, latvia, lithuania, malta, norway, portugal, serbia, slovakia, slovenia, sweden, uk, ukraine
FLEURS:
- 40 languages configured (102 available)
GTZAN:
- 10 music genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock
- Fewer than 100 samples each
NatureLM:
- 6 audio datasets: Xeno-canto, WavCaps, NatureLM, Watkins, iNaturalist, Animal Sound Archive
Total Coverage: 78+ languages/datasets across 4 dataset types
- ✅ NeuCodec
- ✅ XCodec2
- ✅ CosyVoice2
- ✅ WavTokenizer
The recommended approach is to use submit_missing_jobs.py to automatically detect and submit missing tokenizer-language combinations.
Always start with a dry run to see what would be submitted:
```bash
python scripts/submit_missing_jobs.py --dry-run
```

This shows:
- Which tokenizer-language combinations are missing
- How jobs would be grouped (by dataset or language)
- What commands would be executed
Before submitting all missing jobs, test with a single submission:
```bash
python scripts/submit_missing_jobs.py --submit-one
```

This submits only one job per task (metrics and samples) to verify everything works correctly.
Once verified, submit all missing combinations:
```bash
# Submit both metrics and samples (default)
python scripts/submit_missing_jobs.py

# Or submit only one task
python scripts/submit_missing_jobs.py --task metrics
python scripts/submit_missing_jobs.py --task samples
```

Validation (`--validate-metrics`):
- Validates that metrics JSON files are complete and have all required fields with values
- Invalid files are treated as missing and will be re-submitted
- Why needed: Sometimes jobs fail partially, creating incomplete JSON files. This ensures only complete results are considered.
```bash
python scripts/submit_missing_jobs.py --validate-metrics
```

Grouping (`--group-by`):
- `dataset` (default): Groups missing languages by dataset, creating one job per tokenizer-dataset combination
  - Fewer jobs, longer runtime per job
  - More efficient for cluster resource usage
- `language`: Creates one job per tokenizer-language combination
  - More jobs, shorter runtime per job
  - Better for fine-grained control and faster individual completions
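The two strategies amount to choosing a different job key for the missing (tokenizer, dataset, language) combinations. A small illustrative sketch (not the script's actual code):

```python
from collections import defaultdict

def group_jobs(missing, group_by="dataset"):
    """Group missing (tokenizer, dataset, language) triples into jobs.

    group_by="dataset"  -> one job per (tokenizer, dataset), many languages each
    group_by="language" -> one job per (tokenizer, dataset, language)
    """
    jobs = defaultdict(list)
    for tokenizer, dataset, language in missing:
        if group_by == "dataset":
            key = (tokenizer, dataset)
        else:
            key = (tokenizer, dataset, language)
        jobs[key].append(language)
    return dict(jobs)
```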
```bash
# Group by dataset (default, recommended)
python scripts/submit_missing_jobs.py --group-by dataset

# Group by language
python scripts/submit_missing_jobs.py --group-by language
```

Prerequisites for Job Submission:
- You must be assigned to the `infra01` group
- Your `.edf` file must be properly configured for SLURM
- The script automatically checks for running jobs to avoid duplicates
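The duplicate check can be sketched as reading the user's queued job names from `squeue` (a sketch; the job-name matching convention is an assumption, not the script's exact logic):

```python
import subprocess

def running_job_names(user: str) -> set:
    """Return the names of this user's jobs currently in the SLURM queue."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%j"],  # -h: no header, %j: job name
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_job_names(out)

def parse_job_names(squeue_output: str) -> set:
    """Parse one job name per line of squeue output."""
    return {line.strip() for line in squeue_output.splitlines() if line.strip()}
```

A combination whose job name is already in the returned set would be skipped instead of re-submitted.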
You can also run evaluations manually:
```bash
source .venv-neucodec/bin/activate

# Single language
python scripts/tokenizer_evaluation.py --tokenizer neucodec --language germany

# Multiple languages
python scripts/tokenizer_evaluation.py --tokenizer neucodec --languages germany en_us ja_jp

# Entire dataset
python scripts/tokenizer_evaluation.py --tokenizer neucodec --dataset eurospeech
```

Results are automatically organized:
```
metrics/
├── neucodec_eurospeech_germany_results.json  # Per-language results
├── neucodec_fleurs_en_us_results.json        # Per-language results
└── ...

samples/
├── neucodec/
│   ├── eurospeech/
│   │   └── germany/
│   │       ├── metadata.json
│   │       └── sample_*.wav
│   └── fleurs/
│       └── en_us/
│           ├── metadata.json
│           └── sample_*.wav
└── ...
```
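Because result filenames follow a `<tokenizer>_<dataset>_<language>_results.json` pattern, downstream tooling can recover the combination from the name alone. A sketch (assuming tokenizer and dataset names contain no underscores, while language codes such as `en_us` may):

```python
def parse_results_filename(filename: str) -> dict:
    """Split '<tokenizer>_<dataset>_<language>_results.json' into its parts."""
    stem = filename.removesuffix("_results.json")
    # First two underscore-separated fields are tokenizer and dataset;
    # everything after belongs to the language code.
    tokenizer, dataset, *language_parts = stem.split("_")
    return {
        "tokenizer": tokenizer,
        "dataset": dataset,
        "language": "_".join(language_parts),
    }
```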
Each metrics file includes:
- Language name and dataset origin
- All metrics (MSE, SNR, SDR, PESQ, STOI, ESTOI) with mean, std, min, max, median
- Tokenization statistics (tokens per second, compression ratio)
- Number of samples evaluated
- Note: a file might be incomplete for certain metrics (use `--validate-metrics` to detect this)
After collecting results, use analyze_tokenizers.py to generate comprehensive analysis and visualizations.
Create a simple virtual environment with standard packages for analysis:
```bash
uv venv .venv-analysis
source .venv-analysis/bin/activate
uv pip install pandas matplotlib seaborn numpy
```

Then run the analysis:

```bash
source .venv-analysis/bin/activate
python scripts/analyze_tokenizers.py
```

The script automatically:
- Detects all tokenizers from result files in `metrics/`
- Validates metrics files for completeness
- Generates comprehensive visualizations
- Creates summary statistics
All outputs are saved to the results/ directory:
Visualizations:
- `language_coverage.png` - Heatmap showing which languages each tokenizer has (with metrics completeness)
- `overall_comparison.png` - Performance comparison across all languages (may not be fair if tokenizers tested different languages)
- `common_languages_comparison.png` - Fair comparison using only languages all tokenizers have
- `metric_comparison_bars.png` - Bar charts comparing mean performance by metric
- `dataset_comparison.png` - Performance breakdown by dataset
- `compression_efficiency.png` - Compression ratio and tokens per second analysis
- `correlation_heatmap.png` - Correlation matrix between metrics
- `top_bottom_languages_*.png` - Top and bottom performing languages for key metrics
- `scatter_*_vs_*.png` - Scatter plots comparing metric relationships
Statistics:
- `analysis_summary.txt` - Comprehensive text summary including:
  - Aggregation methodology explanation
  - Language coverage analysis
  - Per-tokenizer statistics
  - Overall comparisons
  - Fair comparisons (common languages only)
Key Features:
- Automatically handles missing or incomplete metrics files
- Shows metrics completeness (0-6 valid metrics per language)
- Provides both overall and fair comparisons
- Explains aggregation methodology (language-weighted vs sample-weighted)
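The distinction between language-weighted and sample-weighted aggregation can be shown with a small sketch (illustrative only, not the script's actual code):

```python
def language_weighted_mean(per_language: dict) -> float:
    """Average of per-language means: every language counts equally."""
    means = [sum(v) / len(v) for v in per_language.values()]
    return sum(means) / len(means)

def sample_weighted_mean(per_language: dict) -> float:
    """Mean over all samples pooled: languages with more samples count more."""
    all_samples = [x for v in per_language.values() for x in v]
    return sum(all_samples) / len(all_samples)
```

The two disagree whenever languages contribute unequal sample counts, which is why the summary explains which weighting each number uses.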
The framework computes comprehensive reconstruction quality metrics:
- MSE (Mean Squared Error): Measures reconstruction error
- SNR (Signal-to-Noise Ratio): Overall signal quality in dB
- SDR (Signal-to-Distortion Ratio): Distortion measurement in dB
- PESQ: Perceptual Evaluation of Speech Quality (1.0-4.5 scale)
- STOI: Short-Time Objective Intelligibility (0-1 scale)
- ESTOI: Extended STOI for improved accuracy
- Tokens per second: Tokenization rate
- Compression ratio: Original size / Token size
All metrics include: mean, standard deviation, min, max, and median values computed across 100 samples per language.
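As an illustration, a waveform-level metric and the summary statistics can be computed along these lines (a simplified NumPy sketch; perceptual metrics like PESQ and STOI require dedicated libraries such as `pesq` and `pystoi`):

```python
import numpy as np

def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-noise ratio in dB between a reference and its reconstruction."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / np.sum(noise**2))

def summarize(values) -> dict:
    """Aggregate per-sample metric values the way the results files report them."""
    arr = np.asarray(values, dtype=np.float64)
    return {
        "mean": float(arr.mean()),
        "std": float(arr.std()),
        "min": float(arr.min()),
        "max": float(arr.max()),
        "median": float(np.median(arr)),
    }
```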
- "uv: command not found": Install
uvpackage manager - PyTorch CUDA issues: Ensure you're using NGC 24.11 environment
- Dependency conflicts: The Makefile handles PyTorch conflicts automatically by using system-site-packages where needed
- "Permission denied": Verify you're in the
infra01group and.edfis configured - Jobs not submitting: Check SLURM configuration and cluster status
- Duplicate jobs: The script automatically checks for running jobs, but verify with `squeue`
- "Language not found": Verify language code spelling and dataset download status
- "Dataset not found at PATH": Check cache directories
- PESQ errors: Audio is automatically resampled to 16 kHz; check audio quality
- Memory issues: Adjust memory requirements in `submit_missing_jobs.py` or use SLURM with more memory
This project is part of the Data Science Lab course at ETH Zurich, autumn semester 2025.
If you use this benchmarking framework, please cite the relevant datasets.