NULL pointer dereference in drm_sched_job_arm
Brief summary of the problem:
Every few days I get a kernel oops and have to reboot. This usually happens when starting Firefox (the window appears but with nothing inside it). The log shows a NULL pointer dereference in drm_sched_job_arm (entity->rq->sched).
Hardware description:
- CPU: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
- GPU: 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] [1002:67ff] (rev ff)
- System Memory: ?
- Display(s): Samsung Electric Company U32R59x H4ZM801677
- Type of Display Connection: DP
System information:
- Distro name and Version: NixOS 24.05
- Kernel version:
Linux bree 6.6.45 #1-NixOS SMP PREEMPT_DYNAMIC Sun Aug 11 10:47:28 UTC 2024 x86_64 GNU/Linux
- Custom kernel: N/A
- AMD official driver version: ?
How to reproduce the issue:
No easy reproduction. Just happens after a few days.
Logs:
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.2 is not ready, skipping
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.3 is not ready, skipping
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.4 is not ready, skipping
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.5 is not ready, skipping
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.6 is not ready, skipping
Aug 23 11:39:03 bree kernel: [drm] scheduler comp_1.0.7 is not ready, skipping
Aug 23 11:39:03 bree kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Aug 23 11:39:03 bree kernel: #PF: supervisor read access in kernel mode
Aug 23 11:39:03 bree kernel: #PF: error_code(0x0000) - not-present page
Aug 23 11:39:03 bree kernel: PGD 0 P4D 0
Aug 23 11:39:03 bree kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Aug 23 11:39:03 bree kernel: CPU: 1 PID: 98487 Comm: .firefox-w:cs0 Not tainted 6.6.44 #1-NixOS
Aug 23 11:39:03 bree kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO WIFI/Z390 AORUS PRO WIFI-CF,>
Aug 23 11:39:03 bree kernel: RIP: 0010:drm_sched_job_arm+0x23/0x80 [gpu_sched]
Aug 23 11:39:03 bree kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 5c 4>
Aug 23 11:39:03 bree kernel: RSP: 0018:ffff9b960920bb50 EFLAGS: 00010206
Aug 23 11:39:03 bree kernel: RAX: 0000000000000000 RBX: ffff8938835bf400 RCX: 0000000000000027
Aug 23 11:39:03 bree kernel: RDX: 0000000000000000 RSI: ffff893a27853c10 RDI: ffff893a27853c38
Aug 23 11:39:03 bree kernel: RBP: ffff893a27853c10 R08: 0000000000000000 R09: ffff9b960920b9b8
Aug 23 11:39:03 bree kernel: R10: 0000000000000003 R11: ffffffff9f93a4c8 R12: 0000000000000001
Aug 23 11:39:03 bree kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8936ae4e7000
Aug 23 11:39:03 bree kernel: FS: 00007f11372066c0(0000) GS:ffff893d9da80000(0000) knlGS:0000000000000000
Aug 23 11:39:03 bree kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 23 11:39:03 bree kernel: CR2: 0000000000000008 CR3: 0000000105e7a002 CR4: 00000000003726e0
Aug 23 11:39:03 bree kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 23 11:39:03 bree kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 23 11:39:03 bree kernel: Call Trace:
Aug 23 11:39:03 bree kernel: <TASK>
Aug 23 11:39:03 bree kernel: ? __die+0x23/0x70
Aug 23 11:39:03 bree kernel: ? page_fault_oops+0x171/0x4e0
Aug 23 11:39:03 bree kernel: ? prb_read_valid+0x1b/0x30
Aug 23 11:39:03 bree kernel: ? exc_page_fault+0x71/0x160
Aug 23 11:39:03 bree kernel: ? asm_exc_page_fault+0x26/0x30
Aug 23 11:39:03 bree kernel: ? drm_sched_job_arm+0x23/0x80 [gpu_sched]
Aug 23 11:39:03 bree kernel: amdgpu_cs_ioctl+0x14be/0x1a70 [amdgpu]
Aug 23 11:39:03 bree kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
Aug 23 11:39:03 bree kernel: drm_ioctl_kernel+0xd3/0x180
Aug 23 11:39:03 bree kernel: drm_ioctl+0x26d/0x4b0
Aug 23 11:39:03 bree kernel: ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
Aug 23 11:39:03 bree kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
Aug 23 11:39:03 bree kernel: __x64_sys_ioctl+0x94/0xd0
Aug 23 11:39:03 bree kernel: do_syscall_64+0x39/0x90
Aug 23 11:39:03 bree kernel: entry_SYSCALL_64_after_hwframe+0x78/0xe2
Aug 23 11:39:03 bree kernel: RIP: 0033:0x7f11604359cf
Aug 23 11:39:03 bree kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 4>
Aug 23 11:39:03 bree kernel: RSP: 002b:00007f11372059e0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 23 11:39:03 bree kernel: RAX: ffffffffffffffda RBX: 00007f1137205ba8 RCX: 00007f11604359cf
Aug 23 11:39:03 bree kernel: RDX: 00007f1137205aa0 RSI: 00000000c0186444 RDI: 0000000000000032
Aug 23 11:39:03 bree kernel: RBP: 00007f1137205aa0 R08: 00007f1137205c10 R09: 00007f1137205a80
Aug 23 11:39:03 bree kernel: R10: 0000000000000003 R11: 0000000000000246 R12: 00000000c0186444
Aug 23 11:39:03 bree kernel: R13: 0000000000000032 R14: 00007f1137205ba8 R15: 00007f1147014000
Aug 23 11:39:03 bree kernel: </TASK>
Aug 23 11:39:03 bree kernel: Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep sd_mod uas usb_stor>
Aug 23 11:39:03 bree kernel: btbcm ttm gf128mul btrfs intel_pmc_bxt cmdlinepart btmtk ghash_clmulni_intel cfg80211 m>
Aug 23 11:39:03 bree kernel: kvm_intel kvm irqbypass vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb>
Aug 23 11:39:03 bree kernel: CR2: 0000000000000008
Aug 23 11:39:03 bree kernel: ---[ end trace 0000000000000000 ]---
Aug 23 11:39:03 bree kernel: RIP: 0010:drm_sched_job_arm+0x23/0x80 [gpu_sched]
Aug 23 11:39:03 bree kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 5c 4>
Aug 23 11:39:03 bree kernel: RSP: 0018:ffff9b960920bb50 EFLAGS: 00010206
Aug 23 11:39:03 bree kernel: RAX: 0000000000000000 RBX: ffff8938835bf400 RCX: 0000000000000027
Aug 23 11:39:03 bree kernel: RDX: 0000000000000000 RSI: ffff893a27853c10 RDI: ffff893a27853c38
Aug 23 11:39:03 bree kernel: RBP: ffff893a27853c10 R08: 0000000000000000 R09: ffff9b960920b9b8
Aug 23 11:39:03 bree kernel: R10: 0000000000000003 R11: ffffffff9f93a4c8 R12: 0000000000000001
Aug 23 11:39:03 bree kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8936ae4e7000
Aug 23 11:39:03 bree kernel: FS: 00007f11372066c0(0000) GS:ffff893d9da80000(0000) knlGS:0000000000000000
Aug 23 11:39:03 bree kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 23 11:39:03 bree kernel: CR2: 0000000000000008 CR3: 0000000105e7a002 CR4: 00000000003726e0
Aug 23 11:39:03 bree kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 23 11:39:03 bree kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 23 11:39:03 bree kernel: note: .firefox-w:cs0[98487] exited with irqs disabled
I don't know anything about GPUs, but it looks like the error comes from here:
drm_sched_entity_select_rq(entity);
sched = entity->rq->sched;
I think entity->rq is NULL after drm_sched_entity_select_rq returns, which seems plausible, as that function does sometimes set it to NULL, e.g.
rq = sched ? &sched->sched_rq[entity->priority] : NULL;
if (rq != entity->rq) {
    drm_sched_rq_remove_entity(entity->rq, entity);
    entity->rq = rq;
}
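
To make the theory above concrete, here is a minimal userspace sketch in C (not kernel code). The structs and functions (mock_sched, mock_rq, mock_entity, mock_select_rq, mock_fault_offset, mock_job_arm) are made-up stand-ins. The layout assumes that the real struct drm_sched_rq starts with a 4-byte lock followed by the sched pointer, which would put sched at offset 8 on x86_64 and would match the faulting address 0000000000000008 in the log. The -ENOENT guard is only a hypothetical illustration of a NULL check, not an existing or proposed fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins, NOT the kernel definitions. */
struct mock_sched {
    int id;
};

struct mock_rq {
    int lock;                 /* assumed: stands in for a 4-byte spinlock_t */
    struct mock_sched *sched; /* lands at offset 8 on x86_64 (pointer alignment) */
};

struct mock_entity {
    struct mock_rq *rq;
};

/*
 * Mirrors the quoted assignment: if no scheduler was picked (sched == NULL),
 * the entity's run queue pointer is replaced with NULL.
 */
static void mock_select_rq(struct mock_entity *entity,
                           struct mock_sched *sched,
                           struct mock_rq *rq_for_sched)
{
    struct mock_rq *rq = sched ? rq_for_sched : NULL;

    assert(entity != NULL);
    if (rq != entity->rq)
        entity->rq = rq;
}

/* Address that a read of rq->sched would touch if rq were NULL. */
static size_t mock_fault_offset(void)
{
    return offsetof(struct mock_rq, sched);
}

/*
 * Hypothetical guarded version of the "sched = entity->rq->sched" step:
 * report an error instead of dereferencing a NULL run queue.
 */
static int mock_job_arm(struct mock_entity *entity, struct mock_sched **out)
{
    if (!entity->rq)
        return -ENOENT;
    *out = entity->rq->sched;
    return 0;
}
```

Calling mock_select_rq(&entity, NULL, &rq) leaves entity.rq set to NULL, after which mock_job_arm returns -ENOENT rather than dereferencing it. Without such a check, the entity->rq->sched read would access address mock_fault_offset(), which is 8 under the layout assumed here.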