6950XT green screen crash with GDM on X11 when running a ROCm workload
Brief summary of the problem:
With GDM configured to use X11, I cannot run a ROCm workload on the GPU that has a monitor attached to it. When I try, I get a green screen.
The relevant kernel log messages appear to be:
Jan 15 00:27:00 lilly kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Jan 15 00:27:10 lilly kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=103, emitted seq=105
Jan 15 00:27:10 lilly kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2508 thread gnome-shel:cs0 pid 2515
Jan 15 00:27:10 lilly kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Jan 15 00:27:10 lilly kernel: amdgpu: Failed to suspend process 0x8006
Jan 15 00:27:15 lilly kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
Jan 15 00:27:15 lilly kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
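For anyone comparing this against their own logs, the ring timeout line can be picked out mechanically. The following is only a minimal sketch: the function name `find_ring_timeouts` and the `pending` field are illustrative, not part of any amdgpu tooling, and the regular expression assumes the message format shown in the excerpt above. The difference between the emitted and signaled sequence numbers is my reading of how many submissions were still outstanding when the timeout fired.

```python
import re

# Matches the amdgpu ring-timeout message as it appears in the log excerpt above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return one dict per ring-timeout line found in an iterable of log lines."""
    timeouts = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match is None:
            continue
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        timeouts.append(
            {
                "ring": match["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # Gap between the last emitted and the last signaled sequence number.
                "pending": emitted - signaled,
            }
        )
    return timeouts


# Two lines copied from the kernel log above; only the second is a ring timeout.
sample = [
    "Jan 15 00:27:00 lilly kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3",
    "Jan 15 00:27:10 lilly kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=103, emitted seq=105",
]
print(find_ring_timeouts(sample))
```

The same function can be fed the lines of a saved dmesg or journalctl dump instead of the inline sample.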
The problem does NOT occur in the following situations:
- GDM running on Wayland
- GDM running on X11 with the display DPMS'd off
- Logged into a GNOME session on Wayland
- Logged into a GNOME session on X11
- Logged in and running heavy GPU workloads such as FurMark
The problem appears to happen only on the GDM login screen.
I have verified this across multiple reboots and can reproduce it every time.
Hardware description:
- CPU: Threadripper PRO 5975WX
- GPU0: 03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6950 XT] [1002:73a5] (rev c0)
- GPU1: 43:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] [1002:73bf] (rev c0)
- System Memory: 256GB
- Display(s): 1x 4K display
- Type of Display Connection: HDMI
System information:
- Distro name and Version: Fedora Workstation 37
- Kernel version: Linux lilly 6.1.5-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jan 12 15:52:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Custom kernel: kernel from Fedora testing. The exact same problem also happens on 6.0.x.
- AMD official driver version: N/A
How to reproduce the issue:
- Configure GDM through /etc/gdm/custom.conf to run on X11
- Boot the system and do not touch it (the system sits on the GDM login screen)
- Log in to the machine over SSH and run Stable Diffusion in the rocm/pytorch:rocm5.4.1_ubuntu20.04_py3.7_pytorch_1.12.1 Docker container (through podman):
  podman run --rm -ti --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $(pwd)/output:/output -v $(pwd)/models:/models -v $(pwd)/cache:/root/.cache rocm "$@"
- Inside the container run
CUDA_VISIBLE_DEVICES=1 python scripts/txt2img.py --outdir /output --plms --seed $RANDOM$RANDOM --prompt "a unicorn riding a purple tricycle"
(On my system, for some reason, the 6950XT is ROCm GPU 1, not 0.)
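To check which index corresponds to which card on another machine, the devices can be listed from inside the container. This is a rough sketch, assuming the ROCm build of PyTorch in that image exposes the AMD GPUs through `torch.cuda` (which is what honouring `CUDA_VISIBLE_DEVICES` suggests). The helper names `describe_gpus` and `main` are made up for illustration.

```python
def describe_gpus(cuda):
    """Return (index, name) pairs for each device visible through `cuda`.

    `cuda` is expected to be the `torch.cuda` module; it is passed in as an
    argument so the helper can be exercised on a machine without PyTorch.
    """
    return [(i, cuda.get_device_name(i)) for i in range(cuda.device_count())]


def main():
    # Imported lazily: torch is assumed to be available only inside the
    # rocm/pytorch container.
    import torch

    for index, name in describe_gpus(torch.cuda):
        print(f"{index}: {name}")
```

Calling `main()` inside the container with `CUDA_VISIBLE_DEVICES` unset should print one line per GPU, which shows which index to pass in the command above.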
The problem does not happen when running the workload on the 6900XT in the same system, which does not have a monitor attached (and presumably doesn't have X running on it).
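For reproducing the first step, GDM is usually switched to X11 by setting `WaylandEnable=false` in the `[daemon]` section of /etc/gdm/custom.conf. The report does not say exactly which setting was used, so treat that key as an assumption. The sketch below checks a config file's text for it; `gdm_wayland_disabled` is a hypothetical helper name, and the `example` text is illustrative rather than the actual file from the affected machine.

```python
import configparser


def gdm_wayland_disabled(conf_text):
    """Return True if the [daemon] section sets WaylandEnable=false."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "WaylandEnable"
    parser.read_string(conf_text)
    # Fall back to "true" when the section or key is absent.
    value = parser.get("daemon", "WaylandEnable", fallback="true")
    return value.strip().lower() == "false"


# Illustrative config text, not the file from the affected machine.
example = """\
[daemon]
# Force the login screen to use Xorg
WaylandEnable=false

[security]
"""
print(gdm_wayland_disabled(example))
```

On a real system the text would come from reading /etc/gdm/custom.conf. A commented-out `#WaylandEnable=false` line is treated as unset.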
Attached files:
Screenshots/video files
N/A
Log files (for system lockups / game freezes / crashes)
- Dmesg log: dmesg-crash.txt
- Full journalctl of the boot in question (includes Xorg log): full-journalctl-of-boot.txt