
amdgpu: dm_irq_work_func workqueue hogging CPU after EDID read failure #1140

@gregrahn

Disclaimer

  • I have read and understood the disclaimer.

Application version

0.5.2

System version

0.2.7

Device model

JetKVM

Extension model

None

Remote device Hardware

Minisforum UM773 Lite

Remote device OS

Ubuntu 24.04.3 LTS

Bug description

Summary

On systems with AMD GPUs, the kernel intermittently becomes unresponsive following display events. Kernel logs show repeated AMDGPU display manager workqueue stalls (dm_irq_work_func) associated with EDID read failures. In some cases this leads to a full system hang requiring a hard reboot.

Description

This issue has been observed on two separate machines of the same hardware model, each with the same KVM setup. On both systems, kernel logs show AMDGPU display-related errors around the time of each incident, including:

  • amdgpu: [drm] *ERROR* No EDID read
  • workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us
  • Peripheral reset messages such as sr ... Power-on or device reset occurred

In some occurrences, the dm_irq_work_func workqueue hog warning appears and the system recovers. In other cases, the system becomes completely unresponsive (no local input) and requires a power cycle. There is no clean shutdown recorded.

The issue appears to be triggered by display events such as monitor sleep/wake or display switching. Both systems are connected to a JetKVM, and EDID read failures are logged near the time of the stall.

This behavior suggests a deadlock or prolonged stall in the AMDGPU display manager workqueue when handling EDID or hotplug events.
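The timing relationship described above can be checked mechanically against a saved kernel log. The following is a minimal sketch, assuming log lines in the `Mon DD HH:MM:SS kernel: ...` form quoted in this report. The function names `parse` and `find_correlated`, the 5-second window, and the placeholder year are illustrative choices, not part of any existing tool.

```python
import re
from datetime import datetime, timedelta

# Substrings taken from the kernel messages quoted in this report.
EDID_MARK = "No EDID read"
HOG_MARK = "dm_irq_work_func [amdgpu] hogged CPU"

# Matches lines such as "Dec 29 04:30:10 kernel: <message>".
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) kernel: (.*)$")


def parse(line, year=2000):
    """Return (timestamp, message), or None if the line does not match.

    The log format carries no year, so a placeholder year is supplied
    (assumption: all lines being compared fall within one calendar year).
    """
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(2)


def find_correlated(lines, window_s=5):
    """Pair each EDID failure with every workqueue hog within window_s seconds."""
    events = [e for e in map(parse, lines) if e]
    edid = [ts for ts, msg in events if EDID_MARK in msg]
    hogs = [ts for ts, msg in events if HOG_MARK in msg]
    window = timedelta(seconds=window_s)
    return [(e, h) for e in edid for h in hogs if abs(h - e) <= window]
```

Feeding it the output of something like `journalctl -k` saved to a file, one line per list element, yields the (EDID failure, hog warning) timestamp pairs that fall inside the window. Hog warnings with no nearby EDID failure are simply not paired, which makes them easy to spot by comparing against the full list of hog timestamps.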

Expected behavior

Display EDID or hotplug failures should be handled gracefully without prolonged kernel workqueue stalls or system hangs.

Actual behavior

The AMDGPU display manager workqueue (dm_irq_work_func) repeatedly hogs the CPU, and in some cases the system becomes fully unresponsive and must be rebooted.

Reproducibility

Intermittent, but reproduced on two identical systems.

Environment

  • Distribution:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
  • Kernel version:
$ uname -r
6.14.0-37-generic
  • GPU model:
$ lspci -nnk | grep -A3 -E 'VGA|Display'
34:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681] (rev 0a)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681]
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

Logs

System 1

Dec 29 04:30:10 kernel: amdgpu 0000:34:00.0: [drm] *ERROR* No EDID read.
Dec 29 04:30:11 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Dec 29 04:30:12 kernel: sr 0:0:0:0: Power-on or device reset occurred

System 2

Jan 01 05:55:59 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times
Jan 01 05:56:00 kernel: sr 0:0:0:0: Power-on or device reset occurred

Jan 05 21:22:21 kernel: amdgpu 0000:34:00.0: [drm] *ERROR* No EDID read.
Jan 05 21:22:22 kernel: sr 0:0:0:0: Power-on or device reset occurred

Jan 06 16:44:30 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 11 times
Jan 06 16:44:30 kernel: sr 0:0:0:0: Power-on or device reset occurred
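To compare incidents across machines, the fields of the workqueue warning can be pulled out of each line. This is a rough sketch, assuming the message wording shown in the logs above; the regular expression and the helper name `parse_hog` are hypothetical and may need adjusting if the kernel message format differs on other versions.

```python
import re

# Pattern modeled on the warning text quoted in this report, e.g.
# "workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times".
HOG_RE = re.compile(
    r"workqueue: (?P<func>\S+) \[(?P<module>\w+)\] hogged CPU for "
    r">(?P<threshold_us>\d+)us (?P<count>\d+) times"
)


def parse_hog(line):
    """Return the warning's fields as a dict, or None for unrelated lines."""
    m = HOG_RE.search(line)
    if not m:
        return None
    return {
        "func": m.group("func"),
        "module": m.group("module"),
        "threshold_us": int(m.group("threshold_us")),
        "count": int(m.group("count")),
    }


if __name__ == "__main__":
    # Two of the System 2 lines from this report.
    sample = [
        "Jan 01 05:55:59 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times",
        "Jan 06 16:44:30 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 11 times",
    ]
    for line in sample:
        hit = parse_hog(line)
        if hit:
            print(hit["func"], hit["threshold_us"], hit["count"])
```

Keeping the count as an integer makes it straightforward to sort or plot warnings per machine alongside the dates of the hard hangs.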
