6.5.0-0.rc3.20230727git0a8db05b571a: radeon_gem_va_ioctl: NULL pointer dereference, address: 00000000000001e8
Brief summary of the problem:
Under Sway/Wayland, Firefox (in Wayland mode) suddenly got stuck, while everything else kept working fine.
While trying to kill the Firefox-related processes, I eventually found one of them stuck in D state.
It then occurred to me to check the logs/dmesg, and from that point I started to suspect that the kernel report below is the root cause:
Aug 04 16:58:11 kernel: BUG: kernel NULL pointer dereference, address: 00000000000001e8
Aug 04 16:58:11 kernel: #PF: supervisor read access in kernel mode
Aug 04 16:58:11 kernel: #PF: error_code(0x0000) - not-present page
Aug 04 16:58:11 kernel: PGD 8000000163332067 P4D 8000000163332067 PUD 0
Aug 04 16:58:11 kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Aug 04 16:58:11 kernel: CPU: 3 PID: 3902 Comm: Renderer Not tainted 6.5.0-0.rc3.20230727git0a8db05b571a.26.fc39.x86_64 #1
Aug 04 16:58:11 kernel: Hardware name: Gigabyte Technology Co., Ltd. Z97P-D3/Z97P-D3, BIOS F8 09/18/2015
Aug 04 16:58:11 kernel: RIP: 0010:radeon_gem_va_ioctl+0x40f/0x520 [radeon]
Aug 04 16:58:11 kernel: Code: 86 1b 7b c0 e8 b2 92 49 e7 c7 43 04 01 00 00 00 41 bd ea ff ff ff e9 ca fe ff ff 49 8b 41 70 4c 89 ce 4c 89 e7 4c 89 4c 24 08 <48> 8b 90 e8 01 00 00 e8 25 a6 0a 00 4c 8b 4c 24 08 41 89 c5 8d 80
Aug 04 16:58:11 kernel: RSP: 0018:ffffaa404118fc10 EFLAGS: 00010206
Aug 04 16:58:11 kernel: RAX: 0000000000000000 RBX: ffffaa404118fd48 RCX: 0000000000000000
Aug 04 16:58:11 kernel: RDX: 0000000000000006 RSI: ffff9413c8a98d00 RDI: ffff9413c8460000
Aug 04 16:58:11 kernel: RBP: ffffaa404118fcb0 R08: 0000000000000282 R09: ffff9413c8a98d00
Aug 04 16:58:11 kernel: R10: 0000000000000000 R11: ffffaa4041362000 R12: ffff9413c8460000
Aug 04 16:58:11 kernel: R13: 0000000000000000 R14: ffffaa404118fc28 R15: ffff941461a94c78
Aug 04 16:58:11 kernel: FS: 00007f0aadcd36c0(0000) GS:ffff9416cecc0000(0000) knlGS:0000000000000000
Aug 04 16:58:11 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 04 16:58:11 kernel: CR2: 00000000000001e8 CR3: 000000011df80005 CR4: 00000000001706e0
Aug 04 16:58:11 kernel: Call Trace:
Aug 04 16:58:11 kernel: <TASK>
Aug 04 16:58:11 kernel: ? __die+0x23/0x70
Aug 04 16:58:11 kernel: ? page_fault_oops+0x171/0x4e0
Aug 04 16:58:11 kernel: ? exc_page_fault+0x7f/0x180
Aug 04 16:58:11 kernel: ? asm_exc_page_fault+0x26/0x30
Aug 04 16:58:11 kernel: ? radeon_gem_va_ioctl+0x40f/0x520 [radeon]
Aug 04 16:58:11 kernel: ? __pfx_radeon_gem_va_ioctl+0x10/0x10 [radeon]
Aug 04 16:58:11 kernel: drm_ioctl_kernel+0xcd/0x170
Aug 04 16:58:11 kernel: drm_ioctl+0x26d/0x4b0
Aug 04 16:58:11 kernel: ? __pfx_radeon_gem_va_ioctl+0x10/0x10 [radeon]
Aug 04 16:58:11 kernel: radeon_drm_ioctl+0x4d/0x80 [radeon]
Aug 04 16:58:11 kernel: __x64_sys_ioctl+0x97/0xd0
Aug 04 16:58:11 kernel: do_syscall_64+0x60/0x90
Aug 04 16:58:11 kernel: ? do_futex+0x128/0x190
Aug 04 16:58:11 kernel: ? __x64_sys_futex+0x129/0x1e0
Aug 04 16:58:11 kernel: ? exit_to_user_mode_prepare+0x142/0x1f0
Aug 04 16:58:11 kernel: ? syscall_exit_to_user_mode+0x1b/0x40
Aug 04 16:58:11 kernel: ? do_syscall_64+0x6c/0x90
Aug 04 16:58:11 kernel: ? irqtime_account_irq+0x40/0xc0
Aug 04 16:58:11 kernel: ? __irq_exit_rcu+0x4b/0xc0
Aug 04 16:58:11 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Aug 04 16:58:11 kernel: RIP: 0033:0x7f0ac4b113ad
Aug 04 16:58:11 kernel: Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
Aug 04 16:58:11 kernel: RSP: 002b:00007f0aadcd0d90 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 04 16:58:11 kernel: RAX: ffffffffffffffda RBX: 00007f085ccac280 RCX: 00007f0ac4b113ad
Aug 04 16:58:11 kernel: RDX: 00007f0aadcd0e50 RSI: 00000000c018646b RDI: 000000000000002b
Aug 04 16:58:11 kernel: RBP: 00007f0aadcd0de0 R08: 0000000000000020 R09: 00007f0aad069a70
Aug 04 16:58:11 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f0aadcd0e50
Aug 04 16:58:11 kernel: R13: 00000000c018646b R14: 000000000000002b R15: 00007f0aad0699d8
Aug 04 16:58:11 kernel: </TASK>
Aug 04 16:58:11 kernel: Modules linked in: overlay snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_limit nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tun ip_set dummy nf_tables nfnetlink cfg80211 rfkill lm63 it87 hwmon_vid sunrpc snd_hda_codec_realtek snd_soc_rt5640 snd_hda_codec_hdmi snd_hda_codec_generic snd_soc_rl6231 ledtrig_audio snd_soc_core snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_compress snd_hda_core ac97_bus snd_hwdep snd_pcm_dmaengine snd_seq snd_seq_device snd_pcm snd_timer snd iTCO_wdt soundcore vfat intel_pmc_bxt fat iTCO_vendor_support intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel joydev mei_hdcp mei_pxp r8169 mei_me mei kvm at24 irqbypass i2c_i801 lpc_ich i2c_smbus ppdev rapl parport_pc parport intel_cstate
Aug 04 16:58:11 kernel: intel_uncore acpi_pad fuse loop zram xfs dm_crypt amdgpu amdxcp iommu_v2 drm_buddy gpu_sched uas usb_storage hid_maltron radeon drm_ttm_helper crct10dif_pclmul crc32_pclmul ttm crc32c_intel polyval_clmulni polyval_generic i2c_algo_bit drm_suballoc_helper drm_display_helper ghash_clmulni_intel sha512_ssse3 cec video wmi ip6_tables ip_tables i2c_dev
Aug 04 16:58:11 kernel: CR2: 00000000000001e8
Aug 04 16:58:11 kernel: ---[ end trace 0000000000000000 ]---
Note that process 3902 mentioned above was likewise associated with the
running Firefox; however, I was apparently able to kill it just fine,
i.e., it was not stuck in D state. That was the case with its sibling 3901:
$ head -n7 /proc/3901/status
Name: CanvasRenderer
Umask: 0022
State: D (disk sleep)
Tgid: 3768
Ngid: 0
Pid: 3901
PPid: 1
$ ls -anv /proc/3901/fd | grep /dev/dri/
lrwx------. 1 1000 1000 64 Aug 4 17:21 17 -> /dev/dri/renderD128
lrwx------. 1 1000 1000 64 Aug 4 17:21 41 -> /dev/dri/renderD128
lrwx------. 1 1000 1000 64 Aug 4 17:21 42 -> /dev/dri/renderD128
lrwx------. 1 1000 1000 64 Aug 4 17:21 43 -> /dev/dri/renderD128
lrwx------. 1 1000 1000 64 Aug 4 17:21 191 -> /dev/dri/renderD128
(What baffles me here is that Aug 4 17:21 > Aug 04 16:58:11.
So perhaps the Firefox hang was only noticed some 20 minutes after
the kernel logged the fault? Sadly, I first rushed to "unhang"
Firefox, even if it meant killing whatever processes were behind it,
and only once a remaining process turned out to be unkillable did I
set out to investigate, so the exact timing of events got lost.)
Kernel code-level look:
(gdb) l *radeon_gem_va_ioctl+0x40f
0x33f4f is in radeon_gem_va_ioctl (drivers/gpu/drm/radeon/radeon_gem.c:661).
646
647         list_for_each_entry(entry, &list, head) {
648                 domain = radeon_mem_type_to_domain(entry->bo->resource->mem_type);
649                 /* if anything is swapped out don't swap it in here,
650                    just abort and wait for the next CS */
651                 if (domain == RADEON_GEM_DOMAIN_CPU)
652                         goto error_unreserve;
653         }
654
655         mutex_lock(&bo_va->vm->mutex);
656         r = radeon_vm_clear_freed(rdev, bo_va->vm);
657         if (r)
658                 goto error_unlock;
659
660         if (bo_va->it.start)
!661                 r = radeon_vm_bo_update(rdev, bo_va, bo_va->bo->tbo.resource);
662
663 error_unlock:
664         mutex_unlock(&bo_va->vm->mutex);
665
666 error_unreserve:
667         ttm_eu_backoff_reservation(&ticket, &list);
668
669 error_free:
670         kvfree(vm_bos);
671
672         if (r && r != -ERESTARTSYS)
673                 DRM_ERROR("Couldn't update BO_VA (%d)\n", r);
674 }
675
Not sure, but it looks as if bo_va->bo was NULL on line 661 above
at the time of invocation.
Hardware description:
$ inxi -Gxx
Graphics:
Device-1: AMD Pitcairn XT GL [FirePro W7000] driver: radeon v: kernel
arch: GCN-1 pcie: speed: 8 GT/s lanes: 16 ports: active: DP-1,DP-2,DP-4
empty: DP-3 bus-ID: 01:00.0 chip-ID: 1002:6808 temp: 77.0 C
Display: wayland server: Xwayland v: 23.1.1 compositor: sway v: 1.8.1
driver: gpu: radeon d-rect: 5760x2400 display-ID: 1
Monitor-1: DP-1 pos: bottom-c model: BenQ BL2411 res: 1920x1200 dpi: 94
diag: 611mm (24.1")
Monitor-2: DP-2 pos: top-right model: BenQ BL2411 res: 1920x1200 dpi: 94
diag: 611mm (24.1")
Monitor-3: DP-4 pos: primary,top-left model: NEC E233WM res: 1920x1080
dpi: 96 diag: 584mm (23")
API: OpenGL v: 4.5 Mesa 23.1.4 renderer: PITCAIRN ( LLVM 16.0.6 DRM 2.50
6.5.0-0.rc3.20230727git0a8db05b571a.26.fc39.x86_64) direct-render: Yes
System information:
This is with Fedora, Rawhide/39:
$ uname -r
6.5.0-0.rc3.20230727git0a8db05b571a.26.fc39.x86_64
$ rpm -q firefox mesa-dri-drivers
firefox-115.0-2.fc39.x86_64
mesa-dri-drivers-23.1.4-1.fc39.x86_64