imread-benchmark is a reproducible benchmark framework for JPEG decoding in Python ML pipelines. It provides:
- an installable `imread-benchmark` CLI for local datasets,
- isolated per-library worker environments so conflicting stacks can be benchmarked in one run,
- PyTorch `DataLoader` throughput measurements in addition to single-thread decoder speed,
- Google Cloud runners for repeatable cloud CPU comparisons, and
- JSON outputs plus generated plots/tables for README, docs, and publication-ready analysis.
The default benchmark uses the ImageNet validation set and reports RGB uint8 decode throughput across common Python libraries and CPU families.
Preprint: Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders.
The plots and tables below are generated from `output/<platform>/*.json`. To refresh after a new run:

```bash
imread-benchmark plot --input output --output docs/assets/benchmarks
imread-benchmark render-readme
```

The figures below are claim-first summaries; full numeric matrices remain in the tables.
Pure decode speed with one thread, bytes pre-loaded to memory. Bold = best per platform.
| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
|---|---|---|---|---|---|
| simplejpeg | **690** | 857 | **735** | 456 | **662** |
| turbojpeg | 640 | 818 | 708 | 426 | 613 |
| jpeg4py | 636 | 760 | 699 | 423 | 611 |
| kornia-rs | 642 | 761 | 664 | 391 | 629 |
| opencv | 664 | 841 | 721 | 445 | 645 |
| imagecodecs | 677 | 775 | 723 | **457** | 661 |
| pyvips | 420 | 586 | 462 | 261 | 413 |
| pillow | 537 | 726 | 577 | 360 | 551 |
| skimage | 475 | 661 | 525 | 326 | 499 |
| imageio | 496 | 599 | 524 | 335 | 506 |
| torchvision | 621 | **864** | 712 | 440 | 643 |
| tensorflow | 596 | 836 | 689 | 268 | 391 |
Best images_per_second across num_workers ∈ {0, 2, 4, 8} for each library × platform, using a PyTorch DataLoader with batch_size=32. Cell format: img/s @ Nw. Bold = best per platform.
| Library | AMD EPYC 9B14 | AMD EPYC 9B45 | Intel Xeon Platinum 8581C | Neoverse-N1 | Neoverse-V2 |
|---|---|---|---|---|---|
| simplejpeg | 1,521 @ 4w | 2,739 @ 8w | **1,754 @ 8w** | **1,557 @ 8w** | 2,421 @ 8w |
| turbojpeg | 1,535 @ 4w | 2,800 @ 8w | 1,710 @ 8w | 1,347 @ 4w | 2,389 @ 8w |
| jpeg4py | 1,443 @ 4w | 2,453 @ 8w | 1,651 @ 8w | 1,411 @ 8w | 2,312 @ 8w |
| kornia-rs | 1,327 @ 8w | 2,394 @ 8w | 1,422 @ 8w | 1,260 @ 8w | 1,951 @ 8w |
| opencv | 1,457 @ 4w | 2,814 @ 8w | 1,707 @ 8w | 1,419 @ 8w | 2,414 @ 8w |
| imagecodecs | 1,543 @ 4w | 2,476 @ 8w | 1,677 @ 8w | 1,443 @ 8w | 2,242 @ 8w |
| pillow | 1,283 @ 4w | 2,465 @ 8w | 1,565 @ 8w | 1,387 @ 8w | 2,350 @ 8w |
| skimage | 1,238 @ 4w | 2,536 @ 8w | 1,615 @ 8w | 1,388 @ 8w | 2,315 @ 8w |
| imageio | 1,273 @ 4w | 2,324 @ 8w | 1,643 @ 8w | 1,466 @ 8w | **2,561 @ 8w** |
| torchvision | **1,596 @ 8w** | **2,920 @ 8w** | 1,612 @ 4w | 1,504 @ 8w | 2,557 @ 8w |
5 platforms · 50,000 images · 5 runs each · latest run 2026-04-22
Single-thread decoder speed is useful, but it is not enough to choose a decoder for a training pipeline. The peak DataLoader table is usually the better operational signal because it captures multiprocessing worker behavior, library fork-safety, and CPU-specific scaling.
Current headline patterns:
- `simplejpeg` is a strong single-thread baseline and wins peak `DataLoader` throughput on Intel Emerald Rapids and Neoverse N1.
- `torchvision` wins both AMD platforms at peak `DataLoader` throughput and is effectively tied for first on Neoverse V2.
- `imageio` is not a single-thread leader, but wins peak `DataLoader` throughput on Neoverse V2 in the current GCP runs.
- OpenCV is rarely the absolute winner, but is consistently close to the local winner and has successful `DataLoader` results on every platform.
- PyVips is reported for single-thread decode only; it is skipped in fork-based `DataLoader` benchmarks because of libvips threadpool deadlocks in this harness.
All decoders output (H, W, 3) uint8 RGB numpy arrays for a fair comparison. Libraries that default to other formats (OpenCV → BGR, torchvision → CHW tensor, TensorFlow → EagerTensor) include a conversion step. Note that in real ML pipelines the conversion is often unnecessary.
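The normalization step for BGR-order decoders can be sketched as follows (a minimal NumPy illustration, standing in for a real decode; `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` is the equivalent OpenCV call):

```python
import numpy as np

# Hypothetical stand-in for a decoder that returns BGR channel order,
# as cv2.imdecode does: a 2x2 image with only the blue channel set.
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[:, :, 0] = 255  # channel 0 is blue in BGR order

# The benchmark's normalization is a channel reversal to RGB.
rgb = bgr[:, :, ::-1]
```

The slicing form returns a view, so pipelines that never need RGB order can skip even this cost, which is why the note above says the conversion is often unnecessary in practice.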
Memory mode (default): images are pre-loaded as bytes before the timed loop. This measures pure decode throughput with no disk I/O.
Disk mode: each decode call reads the file from disk. Includes I/O latency.
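The two modes differ only in where the file read happens relative to the timed region. A minimal sketch of that loop shape (the `bench_memory`/`bench_disk` helpers here are illustrative, not the package's API):

```python
import time
from pathlib import Path

def bench_memory(decode, paths, runs=3):
    """Memory mode: pre-load all bytes, then time decode calls only."""
    payloads = [Path(p).read_bytes() for p in paths]  # outside the timed loop
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for buf in payloads:
            decode(buf)
        best = min(best, time.perf_counter() - start)
    return len(payloads) / best  # images per second, best of N runs

def bench_disk(decode, paths, runs=3):
    """Disk mode: the file read is inside the timed loop, so I/O counts."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        for p in paths:
            decode(Path(p).read_bytes())
        best = min(best, time.perf_counter() - start)
    return len(paths) / best
```

Because memory mode keeps `read_bytes()` outside the timer, differences between libraries in that mode reflect decode implementation alone.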
ImageNet validation set — 50,000 JPEG images, ~500×400px.
```bash
# Download
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
mkdir -p imagenet/val
tar -xf ILSVRC2012_img_val.tar -C imagenet/val
```

```bash
# macOS only: required by PyTurboJPEG (pure-python ctypes binding)
brew install jpeg-turbo
```

pyvips ships its own bundled libvips via the `pyvips-binary` PyPI wheel, so no `brew install vips` is needed. simplejpeg wheels bundle libjpeg-turbo. On Linux you'll still need `apt install libjpeg-turbo8-dev libturbojpeg0` (see `gcp/vm_startup.sh`), since jpeg4py is built from sdist.
```bash
# Install uv if needed
pip install uv

# Install the lightweight orchestrator (control plane)
uv venv && source .venv/bin/activate
uv pip install -e .
```

The benchmark CLI creates decoder worker environments automatically under `venvs/<group>/` when `imread-benchmark run` needs them. Today the groups are `mainstream` and `tensorflow`; they are separate because TensorFlow and PyTorch-oriented packages can have incompatible NumPy/protobuf constraints. The first full run pays the dependency installation cost, and later runs reuse the populated worker environments. Use `--skip-setup` only when those worker environments already exist and should not be modified.
Plotting is separate from benchmark execution. Install the plotting extra only if you want to regenerate figures:

```bash
uv pip install -e '.[plot]'
```

```bash
# What would run on this machine?
imread-benchmark list-libs

# Single + DataLoader for every supported decoder, default 50k images
imread-benchmark run --data-dir /path/to/imagenet/val

# Faster smoke run
imread-benchmark run --data-dir /path/to/imagenet/val \
    --num-images 2000 --num-runs 5 --dataloader-runs 2 \
    --workers 0,2

# Just one library, single-thread benchmark only
imread-benchmark run --data-dir /path/to/imagenet/val \
    --libs opencv --mode single

# Generate README plots from output/ JSONs
imread-benchmark plot --input output --output docs/assets/benchmarks
```

The CLI sets up `venvs/<group>/` for each dependency group it needs. Subsequent runs reuse those venvs, so only the first invocation pays the install cost.
Spin up a benchmark VM on GCP, run everything against ImageNet from a GCS bucket, and have it self-delete when done:
```bash
./gcp/run.sh \
    --imagenet-bucket gs://my-bucket/imagenet/val \
    --results-bucket gs://my-bucket/imread-results \
    --no-wait
```

Built venvs are cached in GCS (keyed by `sha256(uv.lock)`), so reruns on the same machine type skip the ~25-minute install. Use `--force-rebuild` to re-resolve against PyPI without editing `uv.lock`. Full details, machine-type matrix, cost, and cache semantics: `docs/gcp_benchmarks.md`.
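The lockfile-hash keying can be sketched in a few lines (a hypothetical `venv_cache_key` helper for illustration; the real keying lives in the GCP scripts):

```python
import hashlib
from pathlib import Path

def venv_cache_key(lockfile: Path) -> str:
    """Key a cached venv archive by the exact lockfile bytes: any edit
    to uv.lock changes the sha256 digest and forces a fresh build."""
    return hashlib.sha256(lockfile.read_bytes()).hexdigest()
```

Hashing the lockfile rather than `pyproject.toml` means the cache only invalidates when resolved versions actually change, not on unrelated metadata edits.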
```
output/
└── linux_AMD-EPYC-9B45/
    ├── opencv_1t_results.json
    ├── opencv_default_results.json
    ├── opencv_dataloader_results.json
    ├── run_summary.json
    └── ...
```
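A sketch of how this layout can be consumed downstream (a hypothetical `collect_results` helper; the JSON field names are whatever the run wrote, so no schema is assumed here):

```python
import json
from pathlib import Path

def collect_results(output_dir):
    """Map platform directory name -> {result file stem -> parsed JSON},
    following the output/<platform>/*_results.json layout."""
    results = {}
    for path in sorted(Path(output_dir).glob("*/*_results.json")):
        results.setdefault(path.parent.name, {})[path.stem] = json.loads(
            path.read_text()
        )
    return results
```

Keying by the platform directory name is what lets the plotting step compare the same library across CPU families.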
- simplejpeg — CFFI binding; zero-copy decode from bytes
- turbojpeg (PyTurboJPEG) — Python binding for libjpeg-turbo
- jpeg4py — direct libjpeg-turbo binding (Linux only)
- kornia-rs — Rust implementation using libjpeg-turbo
- OpenCV (opencv-python-headless)
- imagecodecs — uses libjpeg-turbo 3.x; prebuilt ARM64 wheels
- pyvips — libvips bindings (bundled in wheels). Single-thread only; the libvips threadpool deadlocks under fork-based PyTorch DataLoader, so dataloader benchmarks are skipped on every platform.
- Pillow
- scikit-image
- imageio
Note: Pillow-SIMD was previously included but dropped 2026-04 — upstream is abandoned (last release 2023-05), no Linux wheels, and its historical SIMD speedup is now matched by `jpeg4py`/`simplejpeg`/`kornia-rs`. Full rationale in `docs/gcp_benchmarks.md`.
- torchvision
- tensorflow
- All benchmarks run single-threaded unless using the DataLoader benchmark
- Memory mode is the recommended baseline — it isolates decode speed from storage
- Results based on ImageNet JPEG images (~500×400px)
- Use the DataLoader benchmark for final decoder and `num_workers` selection.
- Start with OpenCV when you need a robust default that runs everywhere.
- Try `torchvision` when your pipeline already wants tensors and you can benchmark the target CPU.
- Try `simplejpeg`/`turbojpeg`/`jpeg4py` when maximum libjpeg-turbo-backed speed matters and your dataset policy handles uncommon JPEG modes.
- Use the single-thread table to compare isolated decoder implementations.
- Re-run locally if your images differ substantially from ImageNet validation JPEGs.
- `opencv` remains the best choice when you need more than just JPEG decoding.
```bash
# Run tests
uv run pytest tests/ -v

# Run linters
uv run pre-commit run --all-files
```

See `CONTRIBUTING.md` for how to add a new decoder.
If you found this benchmark useful in your research or engineering work, please cite the preprint:
```bibtex
@misc{iglovikov2026singlethreadjpegdecoderbenchmarks,
  title={Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders},
  author={Vladimir Iglovikov},
  year={2026},
  eprint={2605.08731},
  archivePrefix={arXiv},
  primaryClass={cs.PF},
  url={https://arxiv.org/abs/2605.08731},
}
```


