[BUG] NULL pointer dereference: amdgpu_vm_bo_relocated
I noticed the following backtrace earlier today:
BUG: kernel NULL pointer dereference, address: 0000000000000248
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 115539067 P4D 115539067 PUD 14224e067 PMD 14b0b9067 PTE 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 25856 Comm: kworker/1:1H Not tainted 6.5.2 #1
Hardware name: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK MAX (MS-7C02), BIOS 3.H0 07/07/2023
Workqueue: ttm ttm_bo_delayed_delete [ttm]
RIP: 0010:amdgpu_vm_bo_relocated+0x14/0xb0 [amdgpu]
Code: 89 e7 e8 bf 53 37 e1 48 89 df 5b 5d 41 5c 41 5d e9 31 fe ff ff 90 55 53 48 8b 07 48 89 fb 48 8d 6b 18 48 8d 78 38 48 8b 43 08 <48> 83 b8 48 02 00 00 00 74 3d e8 0d 93 6a e1 48 8b 03 48 8b 53 20
RSP: 0018:ffffc90003fdfe00 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88810ab9d310 RCX: ffff88801da6f800
RDX: ffff88813fd8a800 RSI: ffff88813fd8a800 RDI: ffff888150d81038
RBP: ffff88810ab9d328 R08: ffff888108523a00 R09: ffff888106b24500
R10: 0000000000000004 R11: 0000000000000001 R12: ffff88810ae0eed0
R13: 0000000000000000 R14: ffffe8ffff641305 R15: ffff8881037336b0
FS: 0000000000000000(0000) GS:ffff88842ec40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000248 CR3: 0000000108ea1000 CR4: 0000000000350ee0
Call Trace:
<TASK>
? __die+0x1a/0x60
? page_fault_oops+0x158/0x440
? srso_return_thunk+0x5/0x10
? __schedule+0x283/0x1190
? srso_return_thunk+0x5/0x10
? scsi_queue_rq+0x34f/0xa50
? exc_page_fault+0x301/0x5c0
? asm_exc_page_fault+0x22/0x30
? amdgpu_vm_bo_relocated+0x14/0xb0 [amdgpu]
amdgpu_vm_bo_invalidate+0x144/0x170 [amdgpu]
amdgpu_bo_move_notify+0x53/0xb0 [amdgpu]
ttm_bo_cleanup_memtype_use+0x1d/0x70 [ttm]
ttm_bo_delayed_delete+0x3b/0x80 [ttm]
process_one_work+0x1fa/0x370
worker_thread+0x45/0x3b0
? process_one_work+0x370/0x370
kthread+0xee/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2b/0x40
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>
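Reading the `Code:` line by hand gives a hint about where the NULL came from. The bytes at the faulting RIP (marked `<48>`) are `48 83 b8 48 02 00 00 00`, which decode as `cmpq $0x0,0x248(%rax)`. With RAX = 0 this yields the reported address 0x248. The instruction just before it, `48 8b 43 08`, loads RAX from offset 8 of the structure pointed to by RBX. My guess, not checked against the source for this exact build, is that this is a NULL `bo` pointer inside the `vm_bo` handed to amdgpu_vm_bo_relocated. Below is a minimal Python sketch of that decoding. The helper name is made up for illustration and it handles only this one instruction form:

```python
import struct

def decode_cmp_disp32(code_line):
    """Decode the faulting instruction from an oops 'Code:' byte string.

    Handles only the form seen in this report:
    REX.W 83 /7 with a 32-bit displacement, i.e. cmpq $imm8, disp32(%reg).
    Returns (base_register_number, displacement, immediate).
    """
    tokens = code_line.split()
    # The oops output wraps the byte at the faulting RIP in <...>.
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    raw = bytes(int(t.strip("<>"), 16) for t in tokens[start:start + 8])
    if len(raw) < 8 or raw[0] != 0x48 or raw[1] != 0x83:
        raise ValueError("not a REX.W 0x83 instruction")
    modrm = raw[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod=2: disp32 follows; reg=7: CMP; rm=4 would mean a SIB byte (not handled).
    if mod != 2 or reg != 7 or rm == 4:
        raise ValueError("unsupported addressing form")
    disp = struct.unpack("<i", raw[3:7])[0]
    imm = struct.unpack("<b", raw[7:8])[0]
    return rm, disp, imm

# Tail of the Code: line from the first oops; register number 0 is RAX.
rm, disp, imm = decode_cmp_disp32("48 8b 43 08 <48> 83 b8 48 02 00 00 00 74 3d")
rax = 0x0  # RAX value from the register dump
print(hex(rax + disp))  # 0x248, matching CR2
```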
This looks like it happens under memory pressure. I didn't check memory usage at the time, but I do know that while running the game in question (Starfield, via Steam), a lot of swap space ends up in use after a while (the machine has 16GB of system RAM and 8GB of swap space).
Later there was a GPF which looks like a consequence of this and which froze a process (vkd3d_queue) related to the game; some soft-lockup backtraces followed. Here's the GPF:
general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#2] PREEMPT SMP
CPU: 0 PID: 11682 Comm: kworker/0:8H Tainted: G D 6.5.2 #1
Hardware name: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK MAX (MS-7C02), BIOS 3.H0 07/07/2023
Workqueue: ttm ttm_bo_delayed_delete [ttm]
RIP: 0010:amdgpu_vm_bo_relocated+0x2e/0xb0 [amdgpu]
Code: 07 48 89 fb 48 8d 6b 18 48 8d 78 38 48 8b 43 08 48 83 b8 48 02 00 00 00 74 3d e8 0d 93 6a e1 48 8b 03 48 8b 53 20 48 8b 4b 18 <48> 89 51 08 48 89 0a 48 8b 50 50 48 89 6a 08 48 89 53 18 48 8d 50
RSP: 0018:ffffc90006987e00 EFLAGS: 00010246
RAX: ffff8881d55b5000 RBX: ffff88813c9f8c30 RCX: dead000000000100
RDX: dead000000000122 RSI: ffff88820620d000 RDI: 0000000000000000
RBP: ffff88813c9f8c48 R08: ffff888150cd5700 R09: ffff88813139a080
R10: 0000000000000004 R11: 0000000000000001 R12: ffff88810ae0eed0
R13: 0000000000000000 R14: ffffe8ffff601305 R15: ffff8881037336b0
FS: 0000000000000000(0000) GS:ffff88842ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fec81230010 CR3: 000000010a632000 CR4: 0000000000350ef0
Call Trace:
<TASK>
? die_addr+0x2d/0x80
? exc_general_protection+0x192/0x380
? asm_exc_general_protection+0x22/0x30
? amdgpu_vm_bo_relocated+0x2e/0xb0 [amdgpu]
amdgpu_vm_bo_invalidate+0x144/0x170 [amdgpu]
amdgpu_bo_move_notify+0x53/0xb0 [amdgpu]
ttm_bo_cleanup_memtype_use+0x1d/0x70 [ttm]
ttm_bo_delayed_delete+0x3b/0x80 [ttm]
process_one_work+0x1fa/0x370
worker_thread+0x45/0x3b0
? process_one_work+0x370/0x370
kthread+0xee/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2b/0x40
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
</TASK>
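The register values in this second trace look like the kernel's list poison. On x86_64 with the default CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000, list_del() sets the removed entry's `next` to LIST_POISON1 (base + 0x100) and its `prev` to LIST_POISON2 (base + 0x122). RCX and RDX hold exactly those values, and the faulting address 0xdead000000000108 is LIST_POISON1 plus 8, the offset of `prev` in struct list_head on 64-bit. That would be consistent with a list operation on an entry that had already been removed from its list, though I have not verified this. A small Python sketch of the check follows. The function name is illustrative and the constants assume the x86_64 defaults:

```python
# Assumed: x86_64 default CONFIG_ILLEGAL_POINTER_VALUE.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = POISON_POINTER_DELTA + 0x100  # stored in ->next by list_del()
LIST_POISON2 = POISON_POINTER_DELTA + 0x122  # stored in ->prev by list_del()
PREV_OFFSET = 8  # offsetof(struct list_head, prev) with 8-byte pointers

def classify_list_poison(value):
    """Return a description if value matches a list-poison pattern, else None."""
    if value == LIST_POISON1:
        return "LIST_POISON1 (next of a deleted entry)"
    if value == LIST_POISON2:
        return "LIST_POISON2 (prev of a deleted entry)"
    if value == LIST_POISON1 + PREV_OFFSET:
        return "access to ->prev through LIST_POISON1"
    return None

# Values taken from the GPF above.
for name, value in [("RCX", 0xDEAD000000000100),
                    ("RDX", 0xDEAD000000000122),
                    ("fault address", 0xDEAD000000000108)]:
    print(name, hex(value), classify_list_poison(value))
```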
This may be related to #2293.