amdgpu: dm_irq_work_func workqueue hogging CPU after EDID read failure

### Disclaimer

- [x] I have read and understood the disclaimer.

### Application version

0.5.2

### System version

0.2.7

### Device model

JetKVM

### Extension model

None

### Remote device Hardware

Minisforum UM773 Lite

### Remote device OS

Ubuntu 24.04.3 LTS

### Bug description

## Summary

On systems with AMD GPUs, the kernel intermittently becomes unresponsive following display events. Kernel logs show repeated AMDGPU display manager workqueue stalls (`dm_irq_work_func`) associated with EDID read failures. In some cases this leads to a full system hang requiring a hard reboot.

## Description

This issue has been observed on two separate machines with identical hardware models and the same KVM setup. On both systems, kernel logs show AMDGPU display-related errors around the time of the incident, including:

- `amdgpu: [drm] *ERROR* No EDID read`
- `workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us`
- Peripheral reset messages such as `sr ... Power-on or device reset occurred`
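
For anyone wanting to check their own logs for the same signature, here is a minimal sketch that classifies kernel log lines into the three message types listed above. The regular expressions are derived only from the excerpts quoted in this report, and the names `PATTERNS`, `classify`, and `count_events` are hypothetical helpers, not part of any existing tool.

```python
import re

# Patterns taken from the messages quoted in this report (assumption:
# other kernel versions may word these messages differently).
PATTERNS = {
    "edid_failure": re.compile(r"\[drm\] \*ERROR\* No EDID read"),
    "workqueue_hog": re.compile(r"workqueue: dm_irq_work_func \[amdgpu\] hogged CPU"),
    "device_reset": re.compile(r"Power-on or device reset occurred"),
}


def classify(line):
    """Return the label of the first pattern found in the line, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def count_events(lines):
    """Count how many lines fall into each of the three categories."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts
```

The input could be, for example, the lines of a saved kernel log; lines that match none of the patterns are ignored.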

In some occurrences, the `dm_irq_work_func` workqueue hog warning appears and the system recovers. In other cases, the system becomes completely unresponsive (no local input) and requires a power cycle. There is no clean shutdown recorded.

The issue appears to be triggered by display events such as monitor sleep/wake or display switching. Both systems are connected to a JetKVM, and EDID read failures are logged near the time of the stall. 

This behavior suggests a deadlock or prolonged stall in the AMDGPU display manager workqueue when handling EDID or hotplug events.

## Expected behavior

Display EDID or hotplug failures should be handled gracefully without prolonged kernel workqueue stalls or system hangs.

## Actual behavior

The AMDGPU display manager workqueue (`dm_irq_work_func`) repeatedly hogs the CPU, and in some cases the system becomes fully unresponsive and must be rebooted.

## Reproducibility

Intermittent, but reproduced on two identical systems.

## Environment

- Distribution:  
```
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
```

- Kernel version:  
```
$ uname -r
6.14.0-37-generic
```

- GPU model:  
```
$ lspci -nnk | grep -A3 -E 'VGA|Display'
34:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681] (rev 0a)
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt [Radeon 680M] [1002:1681]
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu
```

## Logs

System 1:
```
Dec 29 04:30:10 kernel: amdgpu 0000:34:00.0: [drm] *ERROR* No EDID read.
Dec 29 04:30:11 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
Dec 29 04:30:12 kernel: sr 0:0:0:0: Power-on or device reset occurred
```

System 2:
```
Jan 01 05:55:59 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 4 times
Jan 01 05:56:00 kernel: sr 0:0:0:0: Power-on or device reset occurred

Jan 05 21:22:21 kernel: amdgpu 0000:34:00.0: [drm] *ERROR* No EDID read.
Jan 05 21:22:22 kernel: sr 0:0:0:0: Power-on or device reset occurred

Jan 06 16:44:30 kernel: workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 11 times
Jan 06 16:44:30 kernel: sr 0:0:0:0: Power-on or device reset occurred
```
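
To make the "messages appear close together in time" observation easier to verify, the following is a rough sketch that groups log lines into incidents when consecutive timestamps are within a few seconds of each other. It assumes the line format shown in the excerpts above (`Mon DD HH:MM:SS kernel: ...`, with no hostname), that lines are already in chronological order, and that month names are in English. Syslog timestamps carry no year, so `year` must be supplied by the caller. `parse_line`, `group_incidents`, and the 5-second default window are illustrative choices, not taken from any existing tool.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the excerpts in this report (assumed format).
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) kernel: (?P<msg>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, message) for a kernel log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = " ".join(match.group("stamp").split())  # normalise padding spaces
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, match.group("msg")


def group_incidents(lines, year, window_seconds=5):
    """Group chronologically ordered lines whose timestamps are close together."""
    window = timedelta(seconds=window_seconds)
    incidents = []
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue  # skip blank or unrecognised lines
        when, msg = parsed
        # Extend the current incident if this line follows its last entry closely.
        if incidents and when - incidents[-1][-1][0] <= window:
            incidents[-1].append((when, msg))
        else:
            incidents.append([(when, msg)])
    return incidents
```

Each returned incident is a list of `(timestamp, message)` pairs, so it is straightforward to see which incidents contain an EDID failure, a workqueue hog warning, or both.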
