Kernel 5.0.21 crashes when many GPU apps run concurrently; error messages: "amdgpu_vm_validate_pt_bos() failed." and "Not enough memory for command submission!"
Submitted by wormwang
Assigned to Default DRI bug account
Link to original bug (#110888)
Description
Env: kernel 5.0.21, mesa 18.2.8, firmware 1.179, drm 2.4.97, binder-dkms 1.3 + Android image kydroid cm-13.0-19.05.30-1-clouddisk, RAM 192G, AMD RX580 8GB
We ran a test with 77 GPU apps running concurrently; the kernel crashed and the machine rebooted automatically.
journalctl log #100 (closed) (comment)
crash dump
[ 3138.636753] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636831] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636915] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.636989] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.647377] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.657138] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.801062] Unable to handle kernel access to user memory outside uaccess routines at virtual address 00000000000000a8
[ 3138.801240] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.811638] Mem abort info:
[ 3138.811642] ESR = 0x96000004
[ 3138.811644] Exception class = DABT (current EL), IL = 32 bits
[ 3138.811647] SET = 0, FnV = 0
[ 3138.811649] EA = 0, S1PTW = 0
[ 3138.811651] Data abort info:
[ 3138.811653] ISV = 0, ISS = 0x00000004
[ 3138.811655] CM = 0, WnR = 0
[ 3138.811660] user pgtable: 4k pages, 48-bit VAs, pgdp = 000000000787c0fb
[ 3138.811663] [00000000000000a8] pgd=0000000000000000
[ 3138.811669] Internal error: Oops: 96000004 [#1] SMP
[ 3138.811673] Modules linked in: nfnetlink_log veth xt_CHECKSUM iptable_mangle nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo br_netfilter xt_nat ipt_MASQUERADE overlay xt_recent ipt_REJECT nf_reject_ipv4 xt_tcpudp devlink xt_mark xt_comment xt_conntrack bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter xt_addrtype iptable_nat nf_nat_ipv4 nf_nat bpfilter ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 input_leds joydev nls_iso8859_1 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi binder_dkms(OE) ip_tables x_tables autofs4 ses enclosure btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hibmc_drm hid_generic usbhid hid marvell aes_ce_blk
[ 3138.811754] aes_ce_cipher
[ 3138.822304] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.827351] amdgpu crct10dif_ce chash i2c_algo_bit ghash_ce gpu_sched ttm sha2_ce sha256_arm64 drm_kms_helper sha1_ce syscopyarea sysfillrect sysimgblt fb_sys_fops drm hns_enet_drv mpt3sas e1000e hisi_sas_v2_hw raid_class hisi_sas_main ehci_platform libsas hns_dsaf scsi_transport_sas hns_mdio hnae aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 3138.827381] Process BootAnimation (pid: 240132, stack limit = 0x00000000184b1ef3)
[ 3138.827386] CPU: 17 PID: 240132 Comm: BootAnimation Kdump: loaded Tainted: G OE 5.0.0-2106051013-generic #appstreamdebug
[ 3138.827388] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.56 09/20/2018
[ 3138.827391] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 3138.827499] pc : amdgpu_vm_init+0x1e4/0x490 [amdgpu]
[ 3138.827583] lr : amdgpu_vm_init+0x298/0x490 [amdgpu]
[ 3138.867149] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3138.868460] sp : ffff0003b1a5b900
[ 3138.868462] x29: ffff0003b1a5b900 x28: ffff8013f4f36000
[ 3138.868466] x27: ffff8013ae49e0c0 x26: ffff8013ae49e100
[ 3138.868469] x25: ffff0000097de000 x24: 0000000000008143
[ 3138.868472] x23: 0000000000000000 x22: ffff000011994000
[ 3138.868474] x21: 00000000fffffff4 x20: 0000000000000050
[ 3138.868477] x19: ffff8013ae49e000 x18: 0000000000000000
[ 3138.868480] x17: 0000000000000000 x16: 0000000000000101
[ 3138.868483] x15: 0000000000000000 x14: ffff0000110a6748
[ 3138.868485] x13: 0000000000000001 x12: 0000000000000000
[ 3138.873930] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3138.878709] x11: 0000000000000001 x10: 0000000000000000
[ 3138.878712] x9 : ffff000008f674f0 x8 : ffff000011994b48
[ 3138.878715] x7 : ffff000008f58e20 x6 : 0000000000000000
[ 3138.878718] x5 : 0000000000000000 x4 : ffff000011994b48
[ 3138.878720] x3 : 0000000000000001 x2 : 7d8b3ec762676c00
[ 3138.878723] x1 : 0000000000000000 x0 : 00000000fffffff4
[ 3138.878729] Call trace:
[ 3138.878823] amdgpu_vm_init+0x1e4/0x490 [amdgpu]
[ 3138.878912] amdgpu_driver_open_kms+0x9c/0x200 [amdgpu]
[ 3139.153799] drm_file_alloc+0x134/0x258 [drm]
[ 3139.158515] drm_open+0xac/0x210 [drm]
[ 3139.163037] drm_stub_open+0xec/0x118 [drm]
[ 3139.167537] chrdev_open+0xac/0x1c0
[ 3139.171858] do_dentry_open+0x1c4/0x370
[ 3139.175949] vfs_open+0x38/0x48
[ 3139.179895] do_last+0x32c/0x8b0
[ 3139.183680] path_openat+0x90/0x288
[ 3139.187217] do_filp_open+0x88/0x108
[ 3139.190768] do_sys_open+0x1b0/0x3b0
[ 3139.194222] __arm64_sys_openat+0x2c/0x38
[ 3139.197480] el0_svc_common+0x8c/0x190
[ 3139.200847] el0_svc_handler+0x38/0x78
[ 3139.202961] [drm:amdgpu_cs_parser_bos.isra.11 [amdgpu]] ERROR amdgpu_vm_validate_pt_bos() failed.
[ 3139.203982] el0_svc+0x8/0xc
[ 3139.211009] [drm:amdgpu_cs_ioctl [amdgpu]] ERROR Not enough memory for command submission!
[ 3139.214079] Code: 2a0003f5 34000540 f9406277 910142f4 (b9405a80)
[ 3139.214210] SMP: stopping secondary CPUs
[ 3139.226747] Starting crashdump kernel...
[ 3139.230360] Bye!
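
For anyone triaging this, the following is a minimal sketch (not part of the original report) that decodes three values from the oops above: the ESR, the faulting instruction word from the "Code:" line, and the value in x21. The helper names are hypothetical and were written for this note. The field layouts are the standard ARMv8 ESR_ELx layout and the A64 "LDR Wt, [Xn, #imm]" unsigned-offset encoding, and the errno value 12 for ENOMEM is the Linux one. The result is consistent with amdgpu_vm_init reading 88 bytes past a pointer held in x20 (0x50) shortly after something returned -ENOMEM, but it does not by itself prove that this is the root cause.

```python
# Values copied from the oops above.
ESR = 0x96000004         # "ESR = 0x96000004"
FAULT_INSN = 0xB9405A80  # the parenthesised word in the "Code:" line
X20 = 0x50               # "x20: 0000000000000050"
X21 = 0xFFFFFFF4         # "x21: 00000000fffffff4"
FAULT_ADDR = 0xA8        # "virtual address 00000000000000a8"
ENOMEM = 12              # Linux errno value (assumption: generic Linux numbering)


def decode_esr(esr):
    """Split an ARMv8 ESR_ELx value into the fields the oops prints."""
    iss = esr & 0x1FFFFFF          # bits 24:0
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class, bits 31:26
        "il": (esr >> 25) & 1,     # 1 = 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 1,     # 0 = read, 1 = write
        "dfsc": iss & 0x3F,        # data fault status code
    }


def decode_ldr_w_uoff(insn):
    """Decode A64 'LDR Wt, [Xn, #imm]' (unsigned offset).

    Returns (rt, rn, byte_offset); raises ValueError for other encodings.
    """
    if (insn & 0xFFC00000) != 0xB9400000:
        raise ValueError("not a 32-bit LDR with unsigned offset")
    imm12 = (insn >> 10) & 0xFFF
    rn = (insn >> 5) & 0x1F
    rt = insn & 0x1F
    return rt, rn, imm12 * 4       # imm12 is scaled by the 4-byte access size


def to_s32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    esr = decode_esr(ESR)
    # EC 0x25 = data abort from the current EL; DFSC 0x04 = level 0 translation fault
    print("EC=%#x IL=%d WnR=%d DFSC=%#x" % (esr["ec"], esr["il"], esr["wnr"], esr["dfsc"]))

    rt, rn, off = decode_ldr_w_uoff(FAULT_INSN)
    print("ldr w%d, [x%d, #%d]" % (rt, rn, off))
    print("x20 + offset = %#x (fault address %#x)" % (X20 + off, FAULT_ADDR))

    print("x21 as int32 = %d (-ENOMEM is %d)" % (to_s32(X21), -ENOMEM))
```

Running the sketch reproduces the fields the kernel already printed (DABT at the current EL, a read, level 0 translation fault) and shows that the base register of the faulting load is x20, so x20 + 88 equals the reported fault address 0xa8.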