
perf: parallelize VLM image preprocessing across threads#1951

Open
cdreetz wants to merge 1 commit into main from perf/parallel-vlm-image-preprocessing

Conversation


@cdreetz cdreetz commented Mar 4, 2026

Summary

  • Parallelize VLM image preprocessing chunks using ThreadPoolExecutor (up to 8 workers) instead of processing sequentially
  • PIL/numpy operations release the GIL, so threads give real concurrency for the CPU-bound image processor work
  • Add unique image count to VLM timing log for better observability
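The change described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `process_chunk` is a hypothetical stand-in for the Qwen3-VL image processor call, while the chunk size (32) and worker cap (8) mirror the values mentioned in this PR.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 32   # chunk_size from the PR description
MAX_WORKERS = 8   # worker cap from the PR description

def process_chunk(chunk):
    # Hypothetical stand-in for the real image-processor call
    # (e.g. a Qwen3-VL processor invocation). The point is that
    # PIL/numpy work inside releases the GIL, so multiple chunks
    # make real progress concurrently across threads.
    return [item * 2 for item in chunk]  # dummy CPU-bound work

def preprocess_images(images):
    # Split into fixed-size chunks, preserving order.
    chunks = [images[i:i + CHUNK_SIZE]
              for i in range(0, len(images), CHUNK_SIZE)]
    workers = min(MAX_WORKERS, max(1, len(chunks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so outputs line up
        # with the input image order after flattening.
        results = list(pool.map(process_chunk, chunks))
    return [out for chunk_out in results for out in chunk_out]
```

Because `pool.map` returns results in submission order, the flattened output matches the sequential version exactly; only wall-clock time changes.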

Context

With CUA browser environments (64 rollouts, 8 per example, 15 turns with screenshots), the orchestrator processes hundreds of unique images through the Qwen3-VL processor. At chunk_size=32, this created 30+ sequential chunks taking ~17s total. Threading should cut this significantly on multi-core machines.
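The "30+ sequential chunks" figure follows from the rollout arithmetic above. A quick sanity check, assuming one screenshot per turn per rollout (the exact per-turn screenshot count is an assumption; repeated screenshots would lower the unique-image total):

```python
import math

rollouts = 64        # 64 rollouts, per the PR description
turns = 15           # 15 turns with screenshots
chunk_size = 32      # chunk_size=32, per the PR description

screenshots = rollouts * turns            # 960 screenshots
chunks = math.ceil(screenshots / chunk_size)
print(chunks)  # 30 — consistent with the "30+ sequential chunks" figure
```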

Test plan

  • All existing trajectory tests pass (41/41)
  • Verify reduced preprocessing time in VLM training runs

🤖 Generated with Claude Code



The Qwen3-VL image processor was processing chunks sequentially, taking
~17s for hundreds of unique screenshots. Since PIL/numpy release the GIL,
threading gives real concurrency. Process chunks in parallel with up to
8 threads via ThreadPoolExecutor.

Also add unique image count to VLM timing log for better observability.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
