-
Notifications
You must be signed in to change notification settings - Fork 6.7k
Feature/group offload pinning #12747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature/group offload pinning #12747
Conversation
7e50d90 to
3b3813d
Compare
|
Thanks for your PR. However, it's being worked on in #12721. |
|
Could we resolve conflicts so that it's a bit easier to review? Seems like there's some overlap from #12692. |
6d96002 to
33d8b52
Compare
|
Done! Rebased on latest main and resolved conflicts with #12692. Should be much cleaner to review now. |
sayakpaul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments.
| should_synchronize = ( | ||
| not self.group.onload_self and self.group.stream is not None and not should_onload_next_group | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if non_blocking=True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with non_blocking=True, if a previous group onloaded this one on a side stream, we need a sync before the default stream uses the weights or we risk reading half-copied tensors. I’ve limited the sync to the record_stream=False case, when record_stream=True the tensors are tied to the consumer stream so we can safely skip the sync.
|
Thank you for the initial comment! We are working on the solutions right now |
6f5887e to
1194a83
Compare
|
@bot /style |
|
Style bot fixed some files and pushed the changes. |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
These were the error logs where two of them are I/O serialization error, a memory check error (I read the comments that this error usually passes on Ampere and Ada environment, which both are not my current environment), and a slight output difference in test_output_pretrained |
|
@sayakpaul Also with the current checks, it looks like there is coding style error. Can you help us run the automatic style correction? |
|
@bot /style |
|
Style fix is beginning .... View the workflow run here. |
|
The style bot cannot automatically do it. See: I would recommend the following:
|
|
Thanks for the pointer @sayakpaul |
|
@Aki-07 @bconstantine I ran those failing tests on my end with this branch and also on |
|
@sayakpaul thankyou for testing! Glad to hear no failures on your environment end. |
|
Hey @sayakpaul, the WanVACE LoRA failures came from the hook offloading immediately when it was attached. It saved the weights before LoRA was added, then put them back later, so the adapters never took effect. I removed that eager offload so the first offload happens after adapters are loaded. Would need your help to re run the pipelines |
| return send_to_device(kwargs, self.group.onload_device, non_blocking=self.group.non_blocking) | ||
|
|
||
| return args, kwargs | ||
| def _is_group_on_device(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have erased the duplicate method names for _is_group_on_device
|
|
||
| # Ensure the top-level module also has a group_offloading hook so hook presence checks pass, | ||
| # even when it holds no parameters/buffers itself. | ||
| if config.stream is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even when all real groups sit in child modules, the root needs a group_offloading hook so the model stays marked as offloaded. That keeps the guardrails working (_is_group_offload_enabled still blocks .to()/.cuda() and conflicting offloads, and reapply/remove logic finds the hook). Without it, a wrapper with no params would look un-offloaded and could be moved or re-offloaded into a bad state
|
Hi @DN6 @sayakpaul We’ve updated the fix according to the review. Could you take a quick look and share any feedback when you have a moment? Thank you in advance! |
|
Hey @DN6 @sayakpaul , As mentioned above, have fixed the comments. Could you help us guide on to the next steps? |
sayakpaul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this PR. There are a couple of things that feel quite confusing to me. So, I would appreciate some explanations.
|
|
||
|
|
||
| # Model with only standalone computational layers at top level | ||
| class DummyModelWithStandaloneLayers(ModelMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this being deleted?
Rest of the diffs in this testing script are a bit difficult to follow honestly. Could we keep this cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. The class was not intentionally deleted. From the git history, this shows up as removed due to branch history / rebase artifacts while integrating changes (rather than a deliberate change to the test itself), which makes the diff noisier than it should be. I’m cleaning this up now: I’ll restore that block and reorganize the commits so the test diffs are more focused/atomic and easier to review.
| pinned_dict = None | ||
|
|
||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream): | ||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to set the default of default_stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it optional because the non-stream path calls _process_tensors_from_modules without a stream, there is nothing to record in that case, and record_stream is gated. None is a safety net for the record call, and it saves passing a placeholder from those call sites. If you prefer the stricter signature, I can keep it required and pass None explicitly where we don’t use streams. please do correct me thru my understanding if this is required to change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick to the existing implementation in this case i.e., a stricter signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick to a stricter signature.
| _apply_group_offloading_hook(module, unmatched_group, config=config) | ||
| else: | ||
| _apply_lazy_group_offloading_hook(module, unmatched_group, config=config) | ||
| elif config.stream is None and config.offload_to_disk_path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary. Explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originally added the empty root hook to tag the top module as offloaded when everything else was matched, but it did not change behaviour, the child hooks already mark the model as group-offloaded and the guardrails rely on those. It just added an empty group and potential extra files, so have removed it to simplify. Functionally nothing depends on it.
| low_cpu_mem_usage=config.low_cpu_mem_usage, | ||
| onload_self=True, | ||
| group_id=name, | ||
| group_id=f"{config.module_prefix}{name}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the same thing as above, we prefix group_id with the parent name to avoid collisions (ids) when recursing into block_modules. Root stays empty to preserve existing ids, the prefix only appears when descending into children.
…feature/group-offload-pinning
8403860 to
2e8f538
Compare
|
@sayakpaul thank you for all ur comments, sorry for the delay in resolving. All have been answered, Please do let us know ur review |
| if isinstance(pin_groups, str) and pin_groups in VALID_PIN_GROUPS: | ||
| return pin_groups | ||
| raise ValueError( | ||
| f"`pin_groups` must be None, {', '.join(repr(v) for v in sorted(VALID_PIN_GROUPS))}, or a callable." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if isinstance(pin_groups, str) and pin_groups in VALID_PIN_GROUPS: | |
| return pin_groups | |
| raise ValueError( | |
| f"`pin_groups` must be None, {', '.join(repr(v) for v in sorted(VALID_PIN_GROUPS))}, or a callable." | |
| ) | |
| elif isinstance(pin_groups, str) and pin_groups not in VALID_PIN_GROUPS: | |
| raise ValueError( | |
| f"`pin_groups` must be None, {', '.join(repr(v) for v in sorted(VALID_PIN_GROUPS))}, or a callable." | |
| ) | |
| return pin_groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| pinned_dict = None | ||
|
|
||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream): | ||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick to the existing implementation in this case i.e., a stricter signature.
| pinned_dict = None | ||
|
|
||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream): | ||
| def _transfer_tensor_to_device(self, tensor, source_tensor, default_stream=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick to a stricter signature.
|
|
||
| def initialize_hook(self, module: torch.nn.Module) -> torch.nn.Module: | ||
| if self.group.offload_leader == module: | ||
| # For disk offload we materialize the safetensor files upfront so callers can inspect them immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify this scenario in the comments as well? And provide a small example that justifies this change?
| # If the current module is the onload_leader of the group, we onload the group if it is supposed | ||
| # to onload itself. In the case of using prefetching with streams, we onload the next group if | ||
| # it is not supposed to onload itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit): let's not get rid of the important comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| not self.group.onload_self | ||
| and self.group.stream is not None | ||
| and not should_onload_next_group | ||
| and not self.group.record_stream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could I get a clarification on why this condition needs to be modified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did not change the condition, those same four checks were already present in both branches. I consolidated them into one place to avoid duplication. We still only sync when the group did not onload itself, we are using a stream, there is no pending prefetch, and record_stream is not handling lifetime tracking.
| if self.group.offload_leader == module: | ||
| self.group.offload_() | ||
| return output | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the diff reads very confusing to me and hence, a bit hard to confidently review. It seems to me, post_forward() was just brought up, _send_kwargs_to_device() was added (and I am not sure why) amongst other things. Possible to have a cleaner diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the flag, I simplified the diff ( have removed _send_kwargs_to_device and the kwargs handling is back inline in pre_forward as before )
| Args: | ||
| pin_groups (`"first_last"` | `"all"` | `Callable`, *optional*): | ||
| Optionally keep selected groups on the onload device permanently. See | ||
| [`~hooks.group_offloading.apply_group_offloading`] for details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we just documenting pin_groups here? If so, we should remove that from here as apply_group_offloading() should already cover it:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed redundant docstring
| # keys toignore when AlignDeviceHook moves inputs/outputs between devices | ||
| # these are shared mutable state modified in-place | ||
| _skip_keys = ["feat_cache", "feat_idx"] | ||
| _group_offload_block_modules = ["quant_conv", "post_quant_conv", "encoder", "decoder"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add a comment on how these modules were chosen to be included here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, added comment
|
@seed93 would you like to test it? |
|
Thanks again @sayakpaul for the detailed review! Have addressed all the points |
What does this PR do?
Fixes #11966
Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul