Add ray data option for video benchmarks by oyilmaz-nvidia · Pull Request #2002 · NVIDIA-NeMo/Curator

oyilmaz-nvidia · 2026-05-19T23:30:06Z

Add ray_data variants for video nightly benchmarks

Summary

Adds ray_data executor coverage for the four video pipelines that currently only run on Xenna in nightly benchmarks, so both backends are tracked side-by-side every night — matching the dual-executor pattern already used for audio_readspeech_*, image_curation_*, domain_classification_*, etc.

Config-only change to benchmarking/nightly-benchmark.yaml — no Python edits needed because video_pipeline_benchmark.py already accepts --executor={xenna,ray_data} and routes through setup_executor().
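The executor switch described above can be pictured roughly as follows. This is a minimal sketch, not the actual contents of video_pipeline_benchmark.py: only the `--executor` flag, its two values, and the name `setup_executor` come from the PR description. The function body, the returned labels, and the default value are placeholders chosen for illustration.

```python
import argparse


def setup_executor(name):
    """Hypothetical stand-in; the real helper presumably returns an executor object."""
    backends = {
        "xenna": "XennaExecutor (placeholder label)",
        "ray_data": "RayDataExecutor (placeholder label)",
    }
    return backends[name]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Video pipeline benchmark (sketch)")
    # Restricting choices makes a typo such as --executor=raydata fail at parse time.
    # The default of "xenna" is an assumption, not taken from the PR.
    parser.add_argument("--executor", choices=["xenna", "ray_data"], default="xenna")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--executor=ray_data"])
    print(setup_executor(args.executor))
```

Because the backend is chosen purely by a command-line value, a new nightly entry only needs a different argument string, which is why no Python edits are required.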

Changes

Renamed each existing video benchmark to add an explicit _xenna suffix and added a sibling _raydata entry (identical args except --executor=ray_data):

| Pipeline | Xenna entry | Ray Data entry | Timeout | `num_clips` | Min throughput (clips/s) |
| --- | --- | --- | --- | --- | --- |
| Embedding | `video_embedding_xenna` | `video_embedding_raydata` | 400s | 1400 | 4.0 |
| Transcoding | `video_transcoding_xenna` | `video_transcoding_raydata` | 400s | 1400 | 5.0 |
| Captioning | `video_captioning_xenna` | `video_captioning_raydata` | 1800s | 377 | 0.25 |
| TransNetV2 + filters | `video_transnetv2_motion_aesthetic_filter_embeddings_xenna` | `video_transnetv2_motion_aesthetic_filter_embeddings_raydata` | 800s | 113 | 0.25 |
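As a quick sanity check on the numbers in the table, the sketch below computes how long each run would take if it processed its clips at exactly the minimum throughput. It assumes throughput is clips generated divided by pipeline wall time, which the PR does not spell out. Under that assumption the four runs come to roughly 350 s, 280 s, 1508 s, and 452 s, each inside its timeout, although the timeout presumably also has to cover startup and model loading.

```python
# (timeout_s, num_clips, min_throughput_clips_per_sec), copied from the table above.
BENCHMARKS = {
    "video_embedding": (400, 1400, 4.0),
    "video_transcoding": (400, 1400, 5.0),
    "video_captioning": (1800, 377, 0.25),
    "video_transnetv2_motion_aesthetic_filter_embeddings": (800, 113, 0.25),
}


def seconds_at_min_throughput(num_clips, min_throughput):
    # Assumption: throughput = clips generated / pipeline wall time.
    return num_clips / min_throughput


for name, (timeout_s, num_clips, min_tp) in BENCHMARKS.items():
    floor_runtime = seconds_at_min_throughput(num_clips, min_tp)
    print(f"{name}: {floor_runtime:.0f}s at the throughput floor, timeout {timeout_s}s")
```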

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

copy-pr-bot · 2026-05-19T23:30:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-19T23:31:53Z

Greptile Summary

This PR adds ray_data executor variants for the four existing video pipeline benchmarks (embedding, transcoding, captioning, transnetv2_motion_aesthetic_filter_embeddings), matching the dual-executor pattern used for audio, image, and domain classification pipelines. It is a config-only change — no Python edits.

Each existing video_* benchmark is renamed with an _xenna suffix, and a sibling _raydata entry is added with --executor=ray_data as the only argument difference, keeping all other flags, timeouts, and requirements identical.
All four raydata entries correctly inherit the same exact_value clip counts, min_value throughput requirements, and sink_data Slack notifications as their xenna counterparts.

Confidence Score: 5/5

Config-only addition that faithfully mirrors existing xenna entries; no logic changes.

All four raydata entries are verified to be exact copies of their xenna counterparts with only --executor=ray_data substituted. GPU resource flags, argument sets, timeouts, clip count requirements, and throughput thresholds are all consistent across every pair.
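The "exact copy except for the executor" property could be checked mechanically along these lines. This is only a sketch: the argument lists are hypothetical (only `--executor=ray_data` appears in the PR), and it assumes each entry's arguments are available as a list of `--flag=value` strings.

```python
def differs_only_by_executor(xenna_args, raydata_args):
    """Return True if the lists match once any --executor=... flag is dropped
    and the Ray Data entry explicitly selects ray_data."""

    def without_executor(args):
        return [a for a in args if not a.startswith("--executor=")]

    return (
        "--executor=ray_data" in raydata_args
        and without_executor(xenna_args) == without_executor(raydata_args)
    )


# Hypothetical argument lists; --example-flag is a made-up name.
xenna = ["--executor=xenna", "--example-flag=1"]
raydata = ["--executor=ray_data", "--example-flag=1"]
print(differs_only_by_executor(xenna, raydata))
```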

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarking/nightly-benchmark.yaml | Adds four ray_data benchmark entries mirroring existing xenna entries; renames four existing entries with _xenna suffix. All args, timeouts, requirements, and GPU resource flags are consistent between pairs. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nightly-benchmark.yaml] --> B[video_pipeline_benchmark.py]
    B --> C{--executor}
    C -->|xenna| D[video_embedding_xenna]
    C -->|ray_data| E[video_embedding_raydata]
    C -->|xenna| F[video_transcoding_xenna]
    C -->|ray_data| G[video_transcoding_raydata]
    C -->|xenna| H[video_captioning_xenna]
    C -->|ray_data| I[video_captioning_raydata]
    C -->|xenna| J[video_transnetv2_..._xenna]
    C -->|ray_data| K[video_transnetv2_..._raydata]

Reviews (1): Last reviewed commit: "Add ray data option for video benchmarks"

ayushdg · 2026-05-20T19:12:05Z

+      - metric: num_clips_generated
+        exact_value: 1400
+      - metric: throughput_clips_per_sec
+        min_value: 4.0
+


The throughput might be different for Ray Data. Do we want to run and verify first and then add these thresholds, or add them first and fix them afterwards?
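To make the concern concrete, here is a minimal sketch of how requirements like the ones in the diff might be evaluated. The semantics of `exact_value` and `min_value` are inferred from their names, the checker itself is hypothetical rather than the repository's implementation, and the run metrics are made-up numbers, not measured results.

```python
def check_requirements(metrics, requirements):
    """Return a list of failure messages; an empty list means every requirement passed."""
    failures = []
    for req in requirements:
        name = req["metric"]
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
            continue
        if "exact_value" in req and value != req["exact_value"]:
            failures.append(f"{name}: expected exactly {req['exact_value']}, got {value}")
        if "min_value" in req and value < req["min_value"]:
            failures.append(f"{name}: expected at least {req['min_value']}, got {value}")
    return failures


requirements = [
    {"metric": "num_clips_generated", "exact_value": 1400},
    {"metric": "throughput_clips_per_sec", "min_value": 4.0},
]

# Illustrative values only: a run that produces the right clip count but is slower.
slow_run = {"num_clips_generated": 1400, "throughput_clips_per_sec": 3.2}
print(check_requirements(slow_run, requirements))
```

If Ray Data turns out to be slower than Xenna on the same pipeline, a threshold copied from the Xenna entry would fail in this way even though the run itself is correct.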

Add ray data option for video benchmarks

5f51a08

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

oyilmaz-nvidia marked this pull request as ready for review May 19, 2026 23:30

oyilmaz-nvidia requested a review from a team as a code owner May 19, 2026 23:30

oyilmaz-nvidia requested review from ayushdg and removed request for a team May 19, 2026 23:30

ayushdg reviewed May 20, 2026

oyilmaz-nvidia wants to merge 1 commit into main from onur/add-ray-data-for-video-benchmarks
