
[core][DO NOT MERGE] Profiling RDT stuff #62599

Draft
Sparks0219 wants to merge 1 commit into ray-project:master from Sparks0219:joshlee/profile-rdt

Conversation

@Sparks0219
Contributor

Profiling stuff

Signed-off-by: Joshua Lee <[email protected]>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces performance profiling instrumentation for Ray Direct Transport (RDT) using NIXL. It adds timing measurements across key paths including serialization, object putting/getting, and NIXL-specific transfer operations, alongside a new profiling script rdt_profile.py. The review feedback identifies a thread-safety issue with the global timing dictionary and points out that certain metrics should be accumulated rather than overwritten to correctly capture overhead in batch operations. Additionally, a correction was suggested for the profiling script to ensure count-based metrics are excluded from duration summations.
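The instrumentation pattern the summary describes (wrapping serialization, put/get, and transfer paths with `perf_counter` and recording elapsed times in a dict) can be sketched roughly as below. The names `timings` and `timed_stage` are illustrative, not taken from the actual PR diff.

```python
import time
from contextlib import contextmanager

# Module-level dict of stage label -> accumulated seconds.
timings = {}

@contextmanager
def timed_stage(key):
    """Time the enclosed block and accumulate the result under `key`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[key] = timings.get(key, 0.0) + (time.perf_counter() - start)

with timed_stage("serialization"):
    sum(range(10_000))  # stand-in for a serialize/put/get step
```

Accumulating with `get(key, 0.0) + ...` rather than plain assignment avoids the overwrite problem flagged later in this review.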

from ray.widgets import Template
from ray.widgets.util import repr_with_fallback

_rdt_profile_timings = {}


high

The global dictionary _rdt_profile_timings is not thread-safe. Ray workers often involve multiple threads (e.g., the main task execution thread and background IO or system threads). Concurrent modifications to this dictionary can lead to race conditions or RuntimeError if one thread iterates over the dictionary while another modifies it. Since this is used for profiling, it may also result in corrupted or interleaved timing data.
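One way to address this race, sketched below, is to guard the shared dict with a `threading.Lock` and copy it under the lock before iterating. This is an illustrative fix under assumed names, not the PR's actual code.

```python
import threading
from collections import defaultdict

_rdt_profile_timings = defaultdict(float)
_rdt_profile_lock = threading.Lock()

def add_timing(key, elapsed):
    """Accumulate an elapsed duration under `key`, safely across threads."""
    with _rdt_profile_lock:
        _rdt_profile_timings[key] += elapsed

def snapshot_timings():
    """Copy under the lock so readers never iterate while writers mutate."""
    with _rdt_profile_lock:
        return dict(_rdt_profile_timings)

# Several threads recording the same stage concurrently.
threads = [
    threading.Thread(target=add_timing, args=("nixl_transfer", 0.001))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Readers then report from `snapshot_timings()` rather than the live dict, which avoids the `RuntimeError: dictionary changed size during iteration` failure mode.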

Comment on lines +241 to +243
_rdt_profile_timings["B_object_ref_reducer"] = (
_time.perf_counter() - _reducer_start
)


medium

The key "B_object_ref_reducer" is overwritten for every ObjectRef being serialized. If a collection (such as a list) containing multiple ObjectRefs is passed to ray.put, this timing will only reflect the duration of the last individual reduction rather than the cumulative time for the entire collection. Using an additive approach ensures the total overhead is captured.

                _rdt_profile_timings["B_object_ref_reducer"] = (
                    _rdt_profile_timings.get("B_object_ref_reducer", 0)
                    + (_time.perf_counter() - _reducer_start)
                )

overall_end = time.perf_counter()

# Compute overhead totals
sender_overhead = sum(v for k, v in sender_timings.items() if "iterations" not in k)
Copy link
Copy Markdown
Contributor


medium

The overhead calculation incorrectly includes metrics that are counts rather than durations. While the print_timings function correctly identifies keys containing "count" as non-durations, this summation only filters for "iterations". This will skew the overhead percentage if count-based metrics are added to the timings dictionary.

Suggested change
sender_overhead = sum(v for k, v in sender_timings.items() if "iterations" not in k)
sender_overhead = sum(v for k, v in sender_timings.items() if "iterations" not in k and "count" not in k)
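The suggested filter can be factored into a small helper so the summation and `print_timings` agree on which keys are durations. The sample key names below are illustrative, not from the profiling script.

```python
# Durations are floats in seconds; "iterations" and "count" keys are
# integer tallies that must be excluded from any duration sum.
sender_timings = {
    "serialize": 0.5,
    "nixl_transfer": 1.2,
    "transfer_iterations": 3,
    "retry_count": 2,
}

def sum_durations(timings):
    """Sum only duration entries, skipping count-style metrics."""
    return sum(
        v for k, v in timings.items()
        if "iterations" not in k and "count" not in k
    )
```

With the sample values above, `sum_durations(sender_timings)` yields 1.7 seconds rather than 6.7, which would wildly skew the overhead percentage.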

