MES Regression 6.10, worse in 6.10.10
Brief summary of the problem:
Regression since kernel 6.10: the W7900 GPU hangs frequently with MES timeout errors. Starting with 6.10.9, the machine hard crashes and dmesg repeats "MES ring buffer is full". Kernel 6.9 was not affected. These successive regressions make the machine unusable as a desktop.
Hardware description:
- CPU: Xeon Max 9480
- GPU: AMD Radeon Pro W7900
- System Memory: 512GB
- Display(s): BenQ, Asus ProArt
- Type of Display Connection: DisplayPort
System information:
- Distro name and Version: Arch
- Kernel version: 6.10.9
- Custom kernel: no
- AMD official driver version: no
How to reproduce the issue:
Run the W7900 under kernel 6.10, open Chromium, and browse sites with a lot of hardware-accelerated video (the Reddit front page triggers this every time). The GPU hangs and dmesg shows:
[16511.864387] amdgpu 0000:96:00.0: amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
[16511.864393] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
Update to 6.10.9, which contains the MES overflow patch. Instead of only the GPU hanging, the entire system locks up, dmesg repeats "MES ring buffer is full.", and the machine requires a hard reboot.
I could not reproduce the "MES ring buffer is full" lockup on 6.10.10, but I only tested briefly: I opened a video-heavy site and scrolled rapidly up and down in Chromium, without using the machine for an extended period.
Log files (for system lockups / game freezes / crashes)
dmesg errors are included inline above.