Kernel NULL pointer dereference when running under heavy load (Ampere)
A 36-thread CTS run on my RTX 3060 turns out to be a pretty good stress test. The good news is that the kernel and firmware do a good job of keeping things going even though shaders are timing out left and right. Unfortunately, I'm also able to hit what appears to be a fairly obscure failure to handle ERR_PTR() somewhere:
[ 469.962211] BUG: kernel NULL pointer dereference, address: 0000000000000014
[ 469.962221] #PF: supervisor read access in kernel mode
[ 469.962226] #PF: error_code(0x0000) - not-present page
[ 469.962230] PGD 0 P4D 0
[ 469.962237] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 469.962243] CPU: 21 PID: 3308 Comm: kworker/u72:14 Tainted: G W 6.6.0-rc7-nvk-uapi+ #10
[ 469.962250] Hardware name: System manufacturer System Product Name/ROG STRIX X299-E GAMING II, BIOS 1301 09/24/2021
[ 469.962253] Workqueue: nouveau_sched_wq nouveau_uvmm_bind_job_free_work_fn [nouveau]
[ 469.962742] RIP: 0010:nvkm_vmm_unref_ptes+0x62/0x250 [nouveau]
[ 469.963170] Code: dd 48 8b 7f 10 89 14 24 83 e5 01 49 8b 34 ee 48 8b 40 40 ff d0 0f 1f 00 8b 14 24 84 c0 0f 85 38 01 00 00 0f b6 db 49 8d 2c 9e <8b> 45 10 44 29 e0 89 45 10 41 83 7d 00 02 74 22 ba 01 00 00 00 85
[ 469.963176] RSP: 0018:ffffc900037379d0 EFLAGS: 00010246
[ 469.963182] RAX: ffffffffc04fd740 RBX: 0000000000000001 RCX: 0000000000000200
[ 469.963186] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003737a98
[ 469.963190] RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000000000001
[ 469.963193] R10: 0000000000000000 R11: ffff88812d101c00 R12: 0000000000000200
[ 469.963196] R13: ffffffffc06d04c0 R14: 0000000000000000 R15: ffffc90003737a98
[ 469.963200] FS: 0000000000000000(0000) GS:ffff88905fd40000(0000) knlGS:0000000000000000
[ 469.963204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 469.963208] CR2: 0000000000000014 CR3: 000000043c222002 CR4: 00000000003706e0
[ 469.963211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 469.963214] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 469.963217] Call Trace:
[ 469.963222] <TASK>
[ 469.963227] ? __die+0x23/0x70
[ 469.963241] ? page_fault_oops+0x171/0x4e0
[ 469.963251] ? __module_address+0x33/0xb0
[ 469.963264] ? exc_page_fault+0x7f/0x180
[ 469.963275] ? asm_exc_page_fault+0x26/0x30
[ 469.963288] ? __pfx_nvkm_vmm_unref_ptes+0x10/0x10 [nouveau]
[ 469.963672] ? nvkm_vmm_unref_ptes+0x62/0x250 [nouveau]
[ 469.964047] nvkm_vmm_iter.isra.0+0x2a6/0x890 [nouveau]
[ 469.964469] ? nv04_timer_read+0x48/0x60 [nouveau]
[ 469.964886] ? __pfx_nvkm_vmm_unref_ptes+0x10/0x10 [nouveau]
[ 469.965263] nvkm_vmm_raw_put+0x57/0x70 [nouveau]
[ 469.965642] ? __pfx_nvkm_vmm_unref_ptes+0x10/0x10 [nouveau]
[ 469.966014] nvkm_uvmm_mthd+0x870/0x1070 [nouveau]
[ 469.966393] ? nvkm_vmm_ptes_sparse+0x18d/0x1e0 [nouveau]
[ 469.966776] ? mas_wr_node_store+0x20b/0x770
[ 469.966786] ? nvkm_ioctl+0x10b/0x250 [nouveau]
[ 469.967084] nvkm_ioctl+0x10b/0x250 [nouveau]
[ 469.967383] nvif_object_mthd+0xb4/0x200 [nouveau]
[ 469.967679] nvif_vmm_raw_put+0x57/0x80 [nouveau]
[ 469.967972] nouveau_uvmm_sm_cleanup.isra.0+0xee/0x110 [nouveau]
[ 469.968336] nouveau_uvmm_bind_job_free_work_fn+0x1e7/0x390 [nouveau]
[ 469.968705] process_one_work+0x171/0x340
[ 469.968713] worker_thread+0x27b/0x3a0
[ 469.968719] ? __pfx_worker_thread+0x10/0x10
[ 469.968724] kthread+0xe5/0x120
[ 469.968732] ? __pfx_kthread+0x10/0x10
[ 469.968739] ret_from_fork+0x31/0x50
[ 469.968748] ? __pfx_kthread+0x10/0x10
[ 469.968754] ret_from_fork_asm+0x1b/0x30
[ 469.968768] </TASK>
[ 469.968771] Modules linked in: snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common binfmt_misc isst_if_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp iwlmvm coretemp snd_soc_avs kvm_intel mac80211 snd_soc_hda_codec snd_hda_ext_core snd_soc_core vfat snd_hda_codec_realtek kvm fat libarc4 snd_hda_codec_generic snd_hda_codec_hdmi snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi irqbypass snd_hda_codec iwlwifi snd_hda_core btusb btrtl rapl snd_hwdep btintel snd_seq btbcm eeepc_wmi intel_cstate snd_seq_device btmtk snd_pcm asus_wmi iTCO_wdt cfg80211 ledtrig_audio intel_pmc_bxt bluetooth pktcdvd iTCO_vendor_support intel_uncore snd_timer sparse_keymap
[ 469.968882] mei_me platform_profile wmi_bmof snd pcspkr i2c_i801 mei soundcore i2c_smbus ioatdma rfkill idma64 dca joydev acpi_tad loop zram hid_logitech_hidpp uas usb_storage hid_logitech_dj nouveau drm_ttm_helper ttm video drm_exec crct10dif_pclmul drm_gpuvm nvme gpu_sched crc32_pclmul crc32c_intel i2c_algo_bit polyval_clmulni polyval_generic drm_display_helper e1000e nvme_core mxm_wmi r8169 cec ghash_clmulni_intel sha512_ssse3 nvme_common wmi pinctrl_sunrisepoint ip6_tables ip_tables fuse
[ 469.968949] CR2: 0000000000000014
[ 469.968955] ---[ end trace 0000000000000000 ]---
[ 469.968958] RIP: 0010:nvkm_vmm_unref_ptes+0x62/0x250 [nouveau]
[ 469.969335] Code: dd 48 8b 7f 10 89 14 24 83 e5 01 49 8b 34 ee 48 8b 40 40 ff d0 0f 1f 00 8b 14 24 84 c0 0f 85 38 01 00 00 0f b6 db 49 8d 2c 9e <8b> 45 10 44 29 e0 89 45 10 41 83 7d 00 02 74 22 ba 01 00 00 00 85
[ 469.969340] RSP: 0018:ffffc900037379d0 EFLAGS: 00010246
[ 469.969345] RAX: ffffffffc04fd740 RBX: 0000000000000001 RCX: 0000000000000200
[ 469.969348] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003737a98
[ 469.969351] RBP: 0000000000000004 R08: 0000000000000002 R09: 0000000000000001
[ 469.969354] R10: 0000000000000000 R11: ffff88812d101c00 R12: 0000000000000200
[ 469.969357] R13: ffffffffc06d04c0 R14: 0000000000000000 R15: ffffc90003737a98
[ 469.969360] FS: 0000000000000000(0000) GS:ffff88905fd40000(0000) knlGS:0000000000000000
[ 469.969364] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 469.969367] CR2: 0000000000000014 CR3: 000000043c222002 CR4: 00000000003706e0
[ 469.969371] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 469.969373] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
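
As a rough cross-check of the register dump above, the fault address can be reproduced from the Code: bytes. This is only a sketch and rests on a hand decode of the bytes around the `<8b>` marker (`49 8d 2c 9e` read as `lea rbp, [r14 + rbx*4]` and `8b 45 10` read as `mov eax, [rbp + 0x10]`), which is an assumption and not disassembler output. The helper `looks_like_err_ptr` is a hypothetical name that mirrors the range check of the kernel's IS_ERR_VALUE().

```python
# Register values copied from the oops above.
R14 = 0x0000000000000000
RBX = 0x0000000000000001
RBP = 0x0000000000000004
CR2 = 0x0000000000000014

# Assumed decode of the bytes just before and at the faulting instruction:
#   49 8d 2c 9e -> lea rbp, [r14 + rbx*4]
#   8b 45 10    -> mov eax, [rbp + 0x10]   (the <8b> marker, i.e. the fault)
computed_rbp = R14 + RBX * 4
fault_addr = computed_rbp + 0x10


def looks_like_err_ptr(value, max_errno=4095):
    """Hypothetical helper: same range check as the kernel's IS_ERR_VALUE(),
    true for the top MAX_ERRNO values of a 64-bit address space."""
    return value >= (1 << 64) - max_errno


print(f"computed RBP = {computed_rbp:#x}, logged RBP = {RBP:#x}")
print(f"computed fault address = {fault_addr:#x}, logged CR2 = {CR2:#x}")
print(f"R14 in ERR_PTR range: {looks_like_err_ptr(R14)}")
```

If the decode is right, the base pointer in R14 is plain NULL at the point of the fault, not an ERR_PTR() value. That does not rule out an unchecked ERR_PTR() earlier in the call chain; it only narrows down what nvkm_vmm_unref_ptes was handed.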