Nouveau crashing on Nvidia TU106 [sway/wayland/maybe Xorg]
- OS: Arch
- Kernel: 5.12.5-arch1-1 (also tested on 5.8+ kernels)
- GPU: 01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2070] (rev a1)
- mesa: 21.1.0
- libdrm: 2.4.105
- wlroots: 0.13.0
- wayland: 1.19.0
- sway: 1.6
The nouveau kernel module crashes, seemingly under stress. I managed to reproduce it multiple times by playing high-resolution video in Firefox. It is not reliable though; sometimes it crashes right away when the video starts, sometimes it takes minutes. I couldn't reproduce it on Xorg, but I didn't experiment long enough; I am fairly convinced that this problem is independent of whatever is happening in userspace.
I experimented with different kernels and it did not seem to be reproducible on kernels older than 5.9 (so 5.8.14 seemed to be fine; 5.9.0 crashes). I can see a correlation between crashes and log output: in the newer, crashing kernels nouveau complains about pmu: firmware unavailable and there's this additional [drm] bit (example below).
UPDATE: After using 5.8.14 for two days I experienced this exact same crash (full dmesg, filtered journal); my theory was wrong.
2c2
< kernel: Linux version 5.8.14-arch1-1 (linux@archlinux) (gcc (GCC) 10.2.0, GNU ld (GNU Binutils) 2.35) #1 SMP PREEMPT Wed, 07 Oct 2020 23:59:46 +0000
---
> kernel: Linux version 5.11.16-arch1-1 (linux@archlinux) (gcc (GCC) 10.2.0, GNU ld (GNU Binutils) 2.36.1) #1 SMP PREEMPT Wed, 21 Apr 2021 17:22:13 +0000
6a7
> kernel: nouveau 0000:01:00.0: pmu: firmware unavailable
29c30
< kernel: nouveau 0000:01:00.0: DRM: allocated 2560x1440 fb: 0x200000, bo 00000000914f076e
---
> kernel: nouveau 0000:01:00.0: DRM: allocated 2560x1440 fb: 0x200000, bo 000000006f48f8f3
32c33
< kernel: nouveau 0000:01:00.0: fb0: nouveaudrmfb frame buffer device
---
> kernel: nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device
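The comparison above contains one pair of lines that differs only in the hashed bo pointer, which changes on every boot. A minimal Python sketch, assuming the two filtered journals are available as lists of lines, that masks that value before comparing so only the real differences remain. The helper names normalize and meaningful_diff are made up for illustration, and the "<" / ">" prefixes just mimic the diff output above.

```python
import difflib
import re

def normalize(line):
    # The hashed buffer-object pointer differs between boots; mask it.
    return re.sub(r"\bbo [0-9a-f]{16}\b", "bo <ptr>", line)

def meaningful_diff(old, new):
    """Return lines that differ between two logs after masking bo pointers."""
    a = [normalize(line) for line in old]
    b = [normalize(line) for line in new]
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        out += ["< " + line for line in a[i1:i2]]  # only in the old log
        out += ["> " + line for line in b[j1:j2]]  # only in the new log
    return out

# Lines taken from the 5.8.14 and 5.11.16 excerpts in this report.
old_log = [
    "kernel: nouveau 0000:01:00.0: DRM: allocated 2560x1440 fb: 0x200000, bo 00000000914f076e",
    "kernel: nouveau 0000:01:00.0: fb0: nouveaudrmfb frame buffer device",
]
new_log = [
    "kernel: nouveau 0000:01:00.0: pmu: firmware unavailable",
    "kernel: nouveau 0000:01:00.0: DRM: allocated 2560x1440 fb: 0x200000, bo 000000006f48f8f3",
    "kernel: nouveau 0000:01:00.0: [drm] fb0: nouveaudrmfb frame buffer device",
]

for line in meaningful_diff(old_log, new_log):
    print(line)
```

With the pointer masked, only the pmu line and the [drm] prefix change show up as differences between the two kernels.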
Dump of the journal filtered with '(nouveau|Linux version)' and the full dmesg from when the crash happens (two consecutive runs, with a reboot in between):
- 1-kernel-5.12.5-arch1-1.log
- 1-journalctl-5.12.5-arch1-1.log
- 2-kernel-5.12.5-arch1-1.log
- 2-journalctl-5.12.5-arch1-1.log
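For anyone who wants to apply the same filter to their own logs, here is a rough Python sketch of it. It assumes the log is already available as a list of text lines; the function name filter_lines is made up for illustration, and the second sample line is a placeholder, not real log output.

```python
import re

# Same pattern as the filter quoted above.
PATTERN = re.compile(r"(nouveau|Linux version)")

def filter_lines(lines):
    """Keep only the lines that mention nouveau or the kernel version banner."""
    return [line for line in lines if PATTERN.search(line)]

sample = [
    "kernel: nouveau 0000:01:00.0: pmu: firmware unavailable",
    "kernel: (placeholder for an unrelated message)",
    "kernel: nouveau 0000:01:00.0: fifo: channel 2: killed",
]

for line in filter_lines(sample):
    print(line)
```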
Sometimes the kernel and nouveau report more information, as here:
nouveau 0000:01:00.0: fifo: fault 00 [VIRT_READ] at 000000000a03e000 engine 40 [GR] client 03 [GPC1/T1_3] reason 00 [PDE] on channel 2 [01ff3ac000 systemd-logind[384]]
nouveau 0000:01:00.0: fifo: channel 2: killed
nouveau 0000:01:00.0: fifo: runlist 0: scheduled for recovery
nouveau 0000:01:00.0: timeout
WARNING: CPU: 2 PID: 344 at drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.c:447 gk104_fifo_recover_engn+0x25c/0x270 [nouveau]
Modules linked in: (...)
Workqueue: events nvkm_notify_work [nouveau]
RIP: 0010:gk104_fifo_recover_engn+0x25c/0x270 [nouveau]
gk104_fifo_recover_chan+0x1e1/0x2a0 [nouveau]
gk104_fifo_fault+0x11d/0x2c0 [nouveau]
gv100_fault_ntfy_nrpfb+0x222/0x270 [nouveau]
nvkm_notify_work+0x1d/0x80 [nouveau]
nouveau 0000:01:00.0: fifo: engine 0: scheduled for recovery
nouveau 0000:01:00.0: bus: MMIO write of 00000001 FAULT at 00259c [ TIMEOUT ]
nouveau 0000:01:00.0: systemd-logind[384]: channel 2 killed!
nouveau 0000:01:00.0: bus: MMIO write of 00000001 FAULT at 00262c [ TIMEOUT ]
Whole journal for this specific run.
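To compare faults across several runs it may help to pull the fields out of the fifo fault line. The sketch below is only a guess at the layout, inferred from the single fault line quoted above; the regular expression, the field names and the function parse_fault are assumptions, not something taken from the nouveau sources.

```python
import re

# Layout inferred from the one fault line in this report (an assumption).
FAULT_RE = re.compile(
    r"fifo: fault (?P<code>[0-9a-f]+) \[(?P<access>[^\]]*)\] "
    r"at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>-?\d+)"
)

def parse_fault(line):
    """Return a dict of fault fields, or None if the line is not a fifo fault."""
    match = FAULT_RE.search(line)
    return match.groupdict() if match else None

fault_line = (
    "nouveau 0000:01:00.0: fifo: fault 00 [VIRT_READ] at 000000000a03e000 "
    "engine 40 [GR] client 03 [GPC1/T1_3] reason 00 [PDE] "
    "on channel 2 [01ff3ac000 systemd-logind[384]]"
)

fields = parse_fault(fault_line)
print(fields["access"], fields["engine"], fields["client"], fields["reason"])
```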
If you need any additional information, let me know.