Support manual resource elastic for allreduce #1714

Open
yifeng-x wants to merge 3 commits into intelligent-machine-learning:master from yifeng-x:allreduce_resource_elastic
@yifeng-x

What changes were proposed in this pull request?

Support manually adjusting a training job's resources (CPU, memory, and GPU count) dynamically at runtime.

Why are the changes needed?

When training fails because of insufficient resources, for example an OOM error, the system can automatically adjust the resources and restart training.
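
As a rough illustration of the adjust-and-restart idea described above, here is a minimal sketch. It is not the code in this PR: `NodeResource`, `adjust_for_failure`, the `"oom"` exit reason string, the doubling factor, and the memory cap are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeResource:
    """Hypothetical per-node resource request (not a class from this PR)."""

    cpu: float
    memory_mb: int
    gpu_num: int


def adjust_for_failure(resource, exit_reason, memory_factor=2.0, max_memory_mb=65536):
    """Return the resource request to use when the node is restarted.

    Assumption: only an "oom" exit reason triggers a change, and the change
    is to scale memory by `memory_factor`, capped at `max_memory_mb`.
    """
    if exit_reason != "oom":
        # Other failures keep the same request.
        return resource
    new_memory = min(int(resource.memory_mb * memory_factor), max_memory_mb)
    # replace() builds a new frozen instance with only memory_mb changed.
    return replace(resource, memory_mb=new_memory)


current = NodeResource(cpu=4, memory_mb=8192, gpu_num=1)
restarted = adjust_for_failure(current, "oom")
```

In this sketch the CPU and GPU counts are left untouched on OOM; a manual adjustment path would presumably let the user override any of the three fields before the restart.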

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.
