Unprivileged user mode program can cause GPU reset
Submitted by Michal
Assigned to Default DRI bug account
Link to original bug (#109978)
Description
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/72
Sample program which causes this (needs ROCm):
#include <hc.hpp>

int main()
{
    // Launch a single work-item whose only instruction is a raw trap.
    parallel_for_each(hc::extent<1>(1), [=](hc::index<1> i) [[hc]]
    {
        asm("s_trap 2");
    });
    return 0;
}
> hcc -hc main.cpp
> ./a.out
The process never exits, and pressing CTRL-C triggers a GPU reset that breaks every other process using ROCm on that GPU. The trap handler seems to expect the queue handle in s[0:1], which is set up when __builtin_trap() is used; with a bare s_trap those registers are not set, so the trap handler itself raises further exceptions.
System logs:
[ 247.428727] qcm fence wait loop timeout expired
[ 247.428730] The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 247.428736] amdgpu 0000:0b:00.0: GPU reset begin!
[ 247.619440] amdgpu 0000:0b:00.0: GPU reset
[ 248.152762] [drm] psp mode1 reset succeed
[ 248.279461] amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
[ 248.279584] [drm] PCIE GART of 512M enabled (table at 0x000000F400900000).
[ 248.279639] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[ 248.279769] [drm] PSP is resuming...
[ 248.428305] [drm] reserve 0x400000 from 0xf400d00000 for PSP TMR SIZE
[ 248.472774] WARNING: CPU: 23 PID: 21634 at /build/linux-uQJ2um/linux-4.15.0/kernel/kthread.c:498 kthread_park+0x67/0x80
[ 248.472775] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs msr nls_utf8 cifs ccm fscache cmac bnep binfmt_misc nls_iso8859_1 edac_mce_amd arc4 snd_hda_codec_realtek snd_hda_codec_generic kvm_amd snd_hda_codec_hdmi kvm snd_seq_midi irqbypass snd_hda_intel snd_seq_midi_event snd_hda_codec btusb snd_hda_core btrtl wmi_bmof snd_rawmidi iwlmvm snd_hwdep btbcm btintel snd_pcm snd_seq bluetooth mac80211 snd_seq_device ecdh_generic snd_timer iwlwifi ccp snd cfg80211 soundcore k10temp shpchp mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
[ 248.472823] multipath linear raid0 amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) mxm_wmi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 amdkcl(OE) crypto_simd glue_helper amd_iommu_v2 cryptd drm_kms_helper syscopyarea sysfillrect sysimgblt igb fb_sys_fops drm dca nvme i2c_algo_bit i2c_piix4 nvme_core ptp ahci atlantic libahci pps_core gpio_amdpt wmi gpio_generic
[ 248.472846] CPU: 23 PID: 21634 Comm: a.out Tainted: G OE 4.15.0-45-generic #48-Ubuntu
[ 248.472847] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[ 248.472849] RIP: 0010:kthread_park+0x67/0x80
[ 248.472850] RSP: 0018:ffffb44fc7e27ad0 EFLAGS: 00010202
[ 248.472852] RAX: 0000000000000004 RBX: ffff9ec63f49e480 RCX: 0000000000000000
[ 248.472853] RDX: ffff9ec63c717198 RSI: ffff9ec63ea0c0c0 RDI: ffff9ec63dd38000
[ 248.472854] RBP: ffffb44fc7e27ae0 R08: 0000000000000051 R09: 0000000000000000
[ 248.472855] R10: 0000000000000000 R11: 0000000000000056 R12: ffff9ec63ea0c0c0
[ 248.472855] R13: ffff9ec64f4f4200 R14: ffff9ec63c710000 R15: 0000000000000000
[ 248.472857] FS: 00007fd52a286c00(0000) GS:ffff9ec65cdc0000(0000) knlGS:0000000000000000
[ 248.472858] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 248.472859] CR2: 00007f0c07687a98 CR3: 000000081b5b6000 CR4: 00000000003406e0
[ 248.472860] Call Trace:
[ 248.472865] amddrm_sched_entity_fini+0x44/0x1b0 [amd_sched]
[ 248.472868] amddrm_sched_entity_destroy+0x1f/0x30 [amd_sched]
[ 248.472907] amdgpu_vm_fini+0xbb/0x4f0 [amdgpu]
[ 248.472942] amdgpu_driver_postclose_kms+0x15b/0x2b0 [amdgpu]
[ 248.472952] drm_release+0x26b/0x390 [drm]
[ 248.472955] __fput+0xea/0x220
[ 248.472957] ____fput+0xe/0x10
[ 248.472959] task_work_run+0x9d/0xc0
[ 248.472961] do_exit+0x2ec/0xb40
[ 248.472963] do_group_exit+0x43/0xb0
[ 248.472965] get_signal+0x27b/0x590
[ 248.472968] do_signal+0x37/0x730
[ 248.472971] ? __switch_to_asm+0x34/0x70
[ 248.472973] ? __switch_to_asm+0x40/0x70
[ 248.472976] ? do_vfs_ioctl+0xa8/0x630
[ 248.472978] ? __schedule+0x299/0x8a0
[ 248.472980] exit_to_usermode_loop+0x73/0xd0
[ 248.472982] do_syscall_64+0x115/0x130
[ 248.472984] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 248.472986] RIP: 0033:0x7fd528bdd5d7
[ 248.472987] RSP: 002b:00007ffe830d4778 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 248.472988] RAX: fffffffffffffffc RBX: 0000000000000001 RCX: 00007fd528bdd5d7
[ 248.472989] RDX: 00007ffe830d47d0 RSI: 00000000c0184b0c RDI: 0000000000000003
[ 248.472990] RBP: 00007ffe830d47d0 R08: 00007ffe830d4890 R09: 0000000000000001
[ 248.472990] R10: 0000000000c92010 R11: 0000000000000246 R12: 00000000c0184b0c
[ 248.472991] R13: 0000000000000003 R14: 0000000000000000 R15: 00000000fffffffe
[ 248.472992] Code: 0e e8 6e c0 00 00 48 8d 7b 18 e8 35 d2 8e 00 44 89 e0 5b 41 5c 5d c3 0f 0b 41 bc da ff ff ff 44 89 e0 5b 41 5c 5d c3 0f 0b eb af <0f> 0b 41 bc f0 ff ff ff eb da 0f 1f 44 00 00 66 2e 0f 1f 84 00
[ 248.473020] ---[ end trace 19649ddd4a6314f7 ]---
[ 248.648453] [drm] UVD and UVD ENC initialized successfully.
[ 248.748509] [drm] VCE initialized successfully.
[ 248.749616] [drm] recover vram bo from shadow start
[ 248.749666] [drm] recover vram bo from shadow done
[ 248.749680] amdgpu 0000:0b:00.0: GPU reset(1) succeeded!