NULL pointer dereference in amdgpu_bo_vm_destroy on the GRUNT Chromebook
I was looking at expanding the coverage of RADV in Mesa CI and, after a lot of bisecting, found that a single test triggers an out-of-memory condition and leaves the kernel driver in a bad state.
If you run dEQP as in the job below, amdgpu_gem_va_ioctl fails with ENOMEM (-12) and shortly afterwards a NULL pointer is dereferenced in amdgpu_bo_vm_destroy:
https://gitlab.freedesktop.org/tomeu/mesa/-/jobs/30000261
Adding that single test to the skips list gets things working again. The relevant part of the kernel log:
2022-10-19 08:30:55.276202: [ 59.318043] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-12)
2022-10-19 08:30:55.276224: [ 59.329650] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-12)
2022-10-19 08:30:55.276238: [ 59.348811] BUG: kernel NULL pointer dereference, address: 0000000000000008
2022-10-19 08:30:55.276252: [ 59.355782] #PF: supervisor write access in kernel mode
2022-10-19 08:30:55.276267: [ 59.361006] #PF: error_code(0x0002) - not-present page
2022-10-19 08:30:55.276281: [ 59.366159] PGD 0 P4D 0
2022-10-19 08:30:55.276295: [ 59.368731] Oops: 0002 [#1] PREEMPT SMP NOPTI
2022-10-19 08:30:55.276331: [ 59.373092] CPU: 1 PID: 69 Comm: kworker/1:4 Not tainted 5.19.0-rc6linux-v5.17-for-mesa-ci-b78f7870d97b.tar.bz2 #1
2022-10-19 08:30:55.276385: [ 59.383436] Hardware name: Google Grunt/Grunt, BIOS 09/05/2019
2022-10-19 08:30:55.276425: [ 59.389356] Workqueue: events ttm_device_delayed_workqueue
2022-10-19 08:30:55.276458: [ 59.394857] RIP: 0010:amdgpu_bo_vm_destroy+0x41/0x70 [amdgpu]
2022-10-19 08:30:55.276501: [ 59.401889] Code: c3 74 41 48 8b 87 38 01 00 00 4c 8d a0 68 31 01 00 4c 89 e7 e8 a0 1c c3 f4 48 8b 95 38 02 00 00 48 8b 85 40 02 00 00 4c 89 e7 <48> 89 42 08 48 89 10 48 89 9d 38 02 00 00 48 89 9d 40 02 00 00 e8
2022-10-19 08:30:55.276538: [ 59.420636] RSP: 0018:ffffb709c0287e18 EFLAGS: 00010246
2022-10-19 08:30:55.276567: [ 59.425864] RAX: 0000000000000000 RBX: ffff9e174eedda90 RCX: 0000000000000000
2022-10-19 08:30:55.276599: [ 59.432993] RDX: 0000000000000000 RSI: ffff9e174eedd9b0 RDI: ffff9e1744fd8260
2022-10-19 08:30:55.276629: [ 59.440123] RBP: ffff9e174eedd858 R08: ffff9e174004c6b0 R09: ffff9e1741345434
2022-10-19 08:30:55.276656: [ 59.447253] R10: 0000000000000018 R11: 0000000000000018 R12: ffff9e1744fd8260
2022-10-19 08:30:55.276684: [ 59.454381] R13: ffff9e174eedd9a8 R14: ffff9e174eedd858 R15: ffff9e174eedd9d0
2022-10-19 08:30:55.276712: [ 59.461513] FS: 0000000000000000(0000) GS:ffff9e176ad00000(0000) knlGS:0000000000000000
2022-10-19 08:30:55.276741: [ 59.469597] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-10-19 08:30:55.276769: [ 59.475342] CR2: 0000000000000008 CR3: 0000000101556000 CR4: 00000000001506e0
2022-10-19 08:30:55.276872: [ 59.482471] Call Trace:
2022-10-19 08:30:55.276907: [ 59.484924] <TASK>
2022-10-19 08:30:55.276937: [ 59.487029] ttm_bo_delayed_delete+0x1bb/0x220
2022-10-19 08:30:55.276966: [ 59.491482] ttm_device_delayed_workqueue+0x13/0x40
2022-10-19 08:30:55.276994: [ 59.496364] process_one_work+0x1d3/0x3a0
2022-10-19 08:30:55.277022: [ 59.500381] worker_thread+0x48/0x3c0
2022-10-19 08:30:55.277102: [ 59.508056] kthread+0xe2/0x110
2022-10-19 08:30:55.277160: [ 59.515990] ret_from_fork+0x22/0x30
2022-10-19 08:30:55.277189: [ 59.519569] </TASK>
2022-10-19 08:30:55.277219: [ 59.521756] Modules linked in: amdgpu drm_ttm_helper gpu_sched
2022-10-19 08:30:55.277247: [ 59.527592] CR2: 0000000000000008
2022-10-19 08:30:55.277276: [ 59.530910] ---[ end trace 0000000000000000 ]---
2022-10-19 08:30:55.277303: [ 59.535530] RIP: 0010:amdgpu_bo_vm_destroy+0x41/0x70 [amdgpu]
2022-10-19 08:30:55.277332: [ 59.541828] Code: c3 74 41 48 8b 87 38 01 00 00 4c 8d a0 68 31 01 00 4c 89 e7 e8 a0 1c c3 f4 48 8b 95 38 02 00 00 48 8b 85 40 02 00 00 4c 89 e7 <48> 89 42 08 48 89 10 48 89 9d 38 02 00 00 48 89 9d 40 02 00 00 e8
2022-10-19 08:30:55.277360: [ 59.560576] RSP: 0018:ffffb709c0287e18 EFLAGS: 00010246
2022-10-19 08:30:55.277389: [ 59.565805] RAX: 0000000000000000 RBX: ffff9e174eedda90 RCX: 0000000000000000
2022-10-19 08:30:55.277417: [ 59.572938] RDX: 0000000000000000 RSI: ffff9e174eedd9b0 RDI: ffff9e1744fd8260
2022-10-19 08:30:55.277444: [ 59.580072] RBP: ffff9e174eedd858 R08: ffff9e174004c6b0 R09: ffff9e1741345434
2022-10-19 08:30:55.277473: [ 59.587194] R10: 0000000000000018 R11: 0000000000000018 R12: ffff9e1744fd8260
2022-10-19 08:30:55.277502: [ 59.594327] R13: ffff9e174eedd9a8 R14: ffff9e174eedd858 R15: ffff9e174eedd9d0
2022-10-19 08:30:55.277531: [ 59.601459] FS: 0000000000000000(0000) GS:ffff9e176ad00000(0000) knlGS:0000000000000000
2022-10-19 08:30:55.277558: [ 59.609548] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-10-19 08:30:55.277586: [ 59.615294] CR2: 0000000000000008 CR3: 0000000101556000 CR4: 00000000001506e0
2022-10-19 08:30:55.277615: [ 60.117173] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* amdgpu_vm_validate_pt_bos() failed.
2022-10-19 08:30:55.277646: [ 60.125623] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Not enough memory for command submission!
2022-10-19 08:30:55.277676: [ 60.134613] ------------[ cut here ]------------
2022-10-19 08:30:55.277711: [ 60.139329] refcount_t: underflow; use-after-free.
2022-10-19 08:30:55.277741: [ 60.144164] WARNING: CPU: 0 PID: 180 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
2022-10-19 08:30:55.277769: [ 60.152478] Modules linked in: amdgpu drm_ttm_helper gpu_sched
2022-10-19 08:30:55.277798: [ 60.158332] CPU: 0 PID: 180 Comm: deqp-vk Tainted: G D 5.19.0-rc6linux-v5.17-for-mesa-ci-b78f7870d97b.tar.bz2 #1
2022-10-19 08:30:55.277827: [ 60.169829] Hardware name: Google Grunt/Grunt, BIOS 09/05/2019
2022-10-19 08:30:55.277856: [ 60.175813] RIP: 0010:refcount_warn_saturate+0xa6/0xf0
2022-10-19 08:30:55.277883: [ 60.180973] Code: 05 a3 b5 59 01 01 e8 97 ef 79 00 0f 0b c3 80 3d 91 b5 59 01 00 75 95 48 c7 c7 b8 28 60 b5 c6 05 81 b5 59 01 01 e8 78 ef 79 00 <0f> 0b c3 80 3d 70 b5 59 01 00 0f 85 72 ff ff ff 48 c7 c7 10 29 60
2022-10-19 08:30:55.277911: [ 60.199741] RSP: 0018:ffffb709c02d3bc8 EFLAGS: 00010282
2022-10-19 08:30:55.277939: [ 60.205001] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 0000000000000000
2022-10-19 08:30:55.277967: [ 60.212159] RDX: 0000000000000001 RSI: 00000000ffffffea RDI: 00000000ffffffff
2022-10-19 08:30:55.278008: [ 60.219301] RBP: 00000000ffffffff R08: ffffffffb59477c8 R09: 0000000000009ffb
2022-10-19 08:30:55.278037: [ 60.226466] R10: 00000000000002b0 R11: ffffffffb59177e0 R12: 0000000000000018
2022-10-19 08:30:55.278066: [ 60.233626] R13: ffff9e1744fc5aa0 R14: ffff9e1744fc0000 R15: 00000000fffffff4
2022-10-19 08:30:55.278094: [ 60.240765] FS: 00007f886baa5bc0(0000) GS:ffff9e176ac00000(0000) knlGS:0000000000000000
2022-10-19 08:30:55.278123: [ 60.248853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-10-19 08:30:55.278151: [ 60.254599] CR2: 00007f8860fb3000 CR3: 0000000101556000 CR4: 00000000001506f0
2022-10-19 08:30:55.278181: [ 60.261739] Call Trace:
2022-10-19 08:30:55.278211: [ 60.264198] <TASK>
2022-10-19 08:30:55.278240: [ 60.266304] amdgpu_cs_ioctl+0x4de/0x1f00 [amdgpu]
2022-10-19 08:30:55.278297: [ 60.277359] drm_ioctl_kernel+0xab/0x140
2022-10-19 08:30:55.278328: [ 60.281296] drm_ioctl+0x1fc/0x3a0
2022-10-19 08:30:55.278507: [ 60.296135] amdgpu_drm_ioctl+0x44/0x80 [amdgpu]
2022-10-19 08:30:55.278577: [ 60.301057] __x64_sys_ioctl+0x7e/0xb0
2022-10-19 08:30:55.278617: [ 60.304809] do_syscall_64+0x3b/0x90
2022-10-19 08:30:55.278646: [ 60.308390] entry_SYSCALL_64_after_hwframe+0x46/0xb0
2022-10-19 08:30:55.278678: [ 60.313442] RIP: 0033:0x7f886b5f36b7
2022-10-19 08:30:55.278721: [ 60.317020] Code: 00 00 00 48 8b 05 d9 c7 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 c7 0d 00 f7 d8 64 89 01 48
2022-10-19 08:30:55.278767: [ 60.335786] RSP: 002b:00007ffced7b2a88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2022-10-19 08:30:55.278813: [ 60.343351] RAX: ffffffffffffffda RBX: 00007ffced7b2b00 RCX: 00007f886b5f36b7
2022-10-19 08:30:55.278863: [ 60.350480] RDX: 00007ffced7b2b00 RSI: 00000000c0186444 RDI: 0000000000000005
2022-10-19 08:30:55.278911: [ 60.357621] RBP: 00000000c0186444 R08: 000056191c2a77f0 R09: 00007ffced7b2cc8
2022-10-19 08:30:55.278960: [ 60.364749] R10: 00005619158f0ae0 R11: 0000000000000246 R12: 000056191c2a77a0
2022-10-19 08:30:55.279009: [ 60.371877] R13: 0000000000000005 R14: 0000000000000000 R15: 00007ffced7b2ca0
2022-10-19 08:30:55.279056: [ 60.379007] </TASK>
2022-10-19 08:30:55.279093: [ 60.381190] ---[ end trace 0000000000000000 ]---
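
For anyone reading the trace, here is a minimal sketch that decodes the two numeric codes in it: the page-fault error code `0x0002` from the `#PF` lines and the `-12` returned by `amdgpu_gem_va_ioctl`. The helper names `decode_pf_error_code` and `errno_name` are made up for illustration and are not part of any kernel or Mesa tooling. The bit meanings follow the standard x86 page-fault error code layout.

```python
import errno


def decode_pf_error_code(code: int) -> dict:
    """Split an x86 page-fault error code into its low flag bits."""
    return {
        "present": bool(code & 0x1),            # 0 means the page was not present
        "write": bool(code & 0x2),              # 1 means the access was a write
        "user": bool(code & 0x4),               # 0 means supervisor (kernel) mode
        "reserved_bit": bool(code & 0x8),       # reserved bit set in a page-table entry
        "instruction_fetch": bool(code & 0x10), # fault happened on an instruction fetch
    }


def errno_name(ret: int) -> str:
    """Map a negative kernel-style return value to its errno name."""
    return errno.errorcode.get(-ret, "unknown")


if __name__ == "__main__":
    print(decode_pf_error_code(0x0002))
    print(errno_name(-12))
```

The decoded flags agree with what the kernel already prints ("supervisor write access", "not-present page"), and `-12` maps to `ENOMEM`, which matches the "Not enough memory for command submission!" message later in the log.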