
Conversation

@pallavisharma6802

What does this PR do?

Fixes a crash when using device_map="auto" in environments where "cpu" is missing from the inferred max_memory dictionary.

Bug

When max_memory is not explicitly provided, accelerate.get_max_memory() may return a dictionary without a "cpu" key (depending on the backend/environment).
_get_device_map then unconditionally scales "cpu" memory, resulting in a KeyError.
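
For illustration, the failure mode looks roughly like this (a sketch only: the scaling line stands in for whatever _get_device_map does with the "cpu" budget, and the exact factor is a placeholder):

```python
from accelerate.utils import get_max_memory

# This normally returns per-device budgets plus a "cpu" entry, e.g.
# {0: <gpu_bytes>, "cpu": <cpu_bytes>}; in the affected environment the
# "cpu" key was absent.
inferred_max_memory = get_max_memory()

# _get_device_map then adjusts the CPU budget unconditionally, along these lines:
inferred_max_memory["cpu"] = int(inferred_max_memory["cpu"] * 0.9)
# -> KeyError: 'cpu' when the key is missing
```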

Fix

  • Guard access to "cpu" in inferred_max_memory before applying the scaling factor (see the sketch after this list).
  • Preserve existing behavior when "cpu" is present.
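
A minimal sketch of the guard, as referenced above (a fragment showing the shape of the change inside _get_device_map; inferred_max_memory is the dictionary returned by get_max_memory(), and the 0.9 factor is a placeholder, not the exact source):

```python
# Only adjust the CPU budget if Accelerate actually reported one.
if "cpu" in inferred_max_memory:
    # Behavior is unchanged when "cpu" is present; 0.9 is a placeholder factor.
    inferred_max_memory["cpu"] = int(inferred_max_memory["cpu"] * 0.9)
# When "cpu" is absent, leave the dictionary untouched instead of raising KeyError.
```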

Tests

  • Added tests/utils/test_device_map_cpu_guard.py to ensure device_map="auto" does not crash when "cpu" is absent from max_memory (a self-contained sketch of the asserted behavior follows).
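
The real test exercises device-map inference end to end; as a self-contained illustration of what it asserts, _scale_cpu_budget below is a hypothetical stand-in for the guarded logic, not the actual helper:

```python
def _scale_cpu_budget(max_memory, factor=0.9):
    # Hypothetical stand-in for the guarded scaling inside _get_device_map.
    if "cpu" in max_memory:
        max_memory["cpu"] = int(max_memory["cpu"] * factor)
    return max_memory

def test_missing_cpu_key_does_not_raise():
    # Previously this path raised KeyError: 'cpu'.
    assert _scale_cpu_budget({0: 8 * 1024**3}) == {0: 8 * 1024**3}

def test_cpu_budget_still_scaled_when_present():
    # Existing behavior is preserved when "cpu" is reported.
    assert _scale_cpu_budget({"cpu": 100})["cpu"] == 90
```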

Fixes #42994

Before submitting

  • Did you read the contributor guideline?
  • Did you write any new necessary tests?

Who can review?

@Cyrilvallez

@pallavisharma6802
Author

Thanks for the report! This PR adds a defensive fix for the device_map="auto" crash and includes a regression test.
Happy to make any adjustments if needed.

@ArthurZucker
Collaborator

When max_memory is not explicitly provided, accelerate.get_max_memory() may return a dictionary without a "cpu" key (depending on the backend/environment).
_get_device_map then unconditionally scales "cpu" memory, resulting in a KeyError.

Sounds like accelerate should fix this, no?

@pallavisharma6802
Author

Good point! This assumption does indeed originate from accelerate.get_max_memory().

That said, _get_device_map in Transformers currently treats the returned max_memory as an internal contract and applies additional logic on top of it, so guarding here avoids a hard crash regardless of how max_memory is produced (including by older or external Accelerate versions).

I agree that Accelerate could also be hardened, and I’m happy to open a follow-up issue/PR there if that’s preferred. In the meantime, this keeps Transformers robust against missing "cpu" entries without changing behavior when it is present.

@Cyrilvallez
Member

Hmm, could you provide an example of a situation where "cpu" is not present in inferred_max_memory? I don't think it can be missing 🤔

@pallavisharma6802
Author

Thanks for the question. That’s fair!

I don’t currently have a minimal standalone repro showing "cpu" missing directly from accelerate.get_max_memory() itself. What I observed was the KeyError happening in Transformers’ _get_device_map when using the inferred max_memory in my setup, which suggests "cpu" was absent at that stage. But I agree that if get_max_memory() guarantees a "cpu" entry by contract, then this guard may be too broad or misplaced.

If "cpu" is indeed invariant on the Accelerate side, I’m happy to adjust this PR to better reflect that contract (e.g. narrow the guard to the specific failing path or replace it with an assertion + update the test). I can also try to extract a minimal repro to confirm where "cpu" drops out (if it does). Let me know what you’d prefer.
