[Navi] yet another FP regression in DC (5.7.6+, 5.4.49+)
On ppc64le with a 5700 XT, I'm now getting an FP exception again, usually shortly after reaching the desktop, starting some OpenGL program, or similar:
[ 105.140035] Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
[ 105.140048] LE PAGE_SIZE=4K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 105.140070] Modules linked in: cfg80211 8021q garp mrp stp llc binfmt_misc mlx4_ib ib_uverbs mlx4_en ib_core sr_mod cdrom joydev vmx_crypto evdev input_leds mac_hid gf128mul snd_usb_audio snd_usbmidi_lib uas snd_hda_codec_hdmi snd_rawmidi usb_storage mc usbkbd snd_hda_intel snd_intel_dspcfg snd_hda_codec ofpart snd_hda_core cmdlinepart ipmi_powernv at24 crct10dif_vpmsum powernv_flash snd_hwdep ipmi_devintf mlx4_core ibmpowernv ipmi_msghandler mtd snd_pcm uio_pdrv_genirq uio opal_prd sg snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost tap vhost_iotlb uhid hci_vhci bluetooth ecdh_generic rfkill ecc vfio_iommu_spapr_tce vfio_spapr_eeh vfio uinput userio ppp_generic slhc tun loop btrfs blake2b_generic xor raid6_pq libcrc32c cuse fuse kvm_hv kvm ext4 crc32c_generic crc16 mbcache jbd2 usbmouse hid_generic usbhid hid sd_mod amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt xhci_pci fb_sys_fops cec
[ 105.140125] xhci_hcd rc_core ahci libahci drm libata usbcore crc32c_vpmsum scsi_mod drm_panel_orientation_quirks agpgart dm_mirror dm_region_hash dm_log dm_mod
[ 105.140388] CPU: 4 PID: 506 Comm: kworker/u64:2 Not tainted 5.7.7_2 #1
[ 105.140407] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 105.140439] NIP: c0080000013eceb0 LR: c0080000013ecdcc CTR: c0080000013ecd38
[ 105.140481] REGS: c0000000048e31f0 TRAP: 0800 Not tainted (5.7.7_2)
[ 105.140511] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 84842222 XER: 00000000
[ 105.140557] CFAR: c0080000013ecdf0 IRQMASK: 0
GPR00: c0080000013ecdcc c0000000048e3480 c008000001607a00 0000000000000001
GPR04: 0000000000000000 c000000791bb3000 0000000000000001 0000000000000036
GPR08: 0000000000000000 c000000004110abc c008000001607a00 0000000000000000
GPR12: c0080000013ecd38 c0000007ff7fd800 c00000078471b000 c00000074b7601b8
GPR16: c00000078471b000 00000000000008ae c0000000052dd800 0000000000000000
GPR20: 000000000000000c 0000000100000100 0000000000000000 0000000100000000
GPR24: 0000000000000001 0000000000000004 0000000000000000 0000000000000001
GPR28: c000000791bb3000 0000000000000000 0000000000000000 c00000078f310000
[ 105.140835] NIP [c0080000013eceb0] dcn20_populate_dml_pipes_from_context+0x178/0xc50 [amdgpu]
[ 105.140910] LR [c0080000013ecdcc] dcn20_populate_dml_pipes_from_context+0x94/0xc50 [amdgpu]
[ 105.140933] Call Trace:
[ 105.140951] [c0000000048e3480] [0000087000000f00] 0x87000000f00 (unreliable)
[ 105.141033] [c0000000048e3550] [c0080000013f05f8] dcn20_fast_validate_bw+0x300/0x6d0 [amdgpu]
[ 105.141106] [c0000000048e35f0] [c0080000013f1058] dcn20_validate_bandwidth_internal+0xc0/0xa30 [amdgpu]
[ 105.141179] [c0000000048e36e0] [c0080000013f1a4c] dcn20_validate_bandwidth_fp+0x84/0x110 [amdgpu]
[ 105.141252] [c0000000048e3720] [c0080000013f1b34] dcn20_validate_bandwidth+0x5c/0x1a0 [amdgpu]
[ 105.141315] [c0000000048e3770] [c0080000014ae644] dc_commit_updates_for_stream+0x84c/0x18e8 [amdgpu]
[ 105.141392] [c0000000048e3870] [c00800000138c9b8] amdgpu_dm_atomic_commit_tail+0xcb0/0x1c58 [amdgpu]
[ 105.141422] [c0000000048e3c30] [c0080000007b1dfc] commit_tail+0xf4/0x270 [drm_kms_helper]
[ 105.141460] [c0000000048e3c70] [c000000000165554] process_one_work+0x264/0x520
[ 105.141503] [c0000000048e3d10] [c0000000001658a8] worker_thread+0x98/0x5b0
[ 105.141545] [c0000000048e3db0] [c00000000016f3c8] kthread+0x148/0x1a0
[ 105.141577] [c0000000048e3e20] [c00000000000cca8] ret_from_kernel_thread+0x5c/0x74
[ 105.141600] Instruction dump:
[ 105.141609] 3ac00000 fa010050 fa210058 fa410060 ea4f0008 2c320000 41820644 392702bc
[ 105.141627] 3d420000 1fbe0210 82320048 80f20058 <7c004eee> c98a6080 7e1cea14 fc00069c
[ 105.141666] ---[ end trace 6acf27852be8c1ed ]---
Kernels 5.7.5 and 5.4.48 are not affected.
The only relevant commit seems to be https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/gpu/drm/amd/display/dc?h=linux-5.7.y&id=b5232e2ee8df85891514c73472cac09921e5d51d (drm/amd/display: Revalidate bandwidth before commiting DC updates), which matches the paths in the backtrace.
I wonder why this happens, though: dcn20_validate_bandwidth has its whole body wrapped in an FP guard. Is it possible that something is canceling it out? (@agd5f might know?)
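To make the "canceling out" question concrete, here is a minimal user-space sketch of one hypothetical way it could happen: a begin/end guard that is not reference counted, called once by an outer function and again by a helper it invokes. This is a toy model, not amdgpu code. All names (`toy_fp_start`, `toy_inner_helper`, `counted_fp_start`, and so on) are made up for illustration, and whether DC's guard actually nests like this on the path in the backtrace is an assumption the trace does not confirm.

```c
#include <stdbool.h>

/* Toy model: one flag standing in for "FP is usable in kernel context". */
static bool fp_enabled = false;

/* Hypothetical guard WITHOUT reference counting. */
static void toy_fp_start(void) { fp_enabled = true; }
static void toy_fp_end(void)   { fp_enabled = false; }

/* Hypothetical helper that carries its own guard. */
static void toy_inner_helper(void)
{
    toy_fp_start();
    /* ... FP math would go here ... */
    toy_fp_end();               /* also switches FP off for the caller */
}

/* Outer function: reports whether FP is still enabled after the helper
 * returns, i.e. at the point where it would do more FP work itself. */
static bool toy_outer_fp_still_enabled(void)
{
    bool ok;

    toy_fp_start();
    toy_inner_helper();
    ok = fp_enabled;            /* false here: an FP instruction would trap */
    toy_fp_end();
    return ok;
}

/* Same shape, but with a depth counter so only the outermost end
 * actually disables FP. */
static int fp_depth = 0;

static void counted_fp_start(void) { if (fp_depth++ == 0) fp_enabled = true; }
static void counted_fp_end(void)   { if (--fp_depth == 0) fp_enabled = false; }

static void counted_inner_helper(void)
{
    counted_fp_start();
    /* ... FP math would go here ... */
    counted_fp_end();           /* depth drops to 1, FP stays enabled */
}

static bool counted_outer_fp_still_enabled(void)
{
    bool ok;

    counted_fp_start();
    counted_inner_helper();
    ok = fp_enabled;            /* true here: the outer section is intact */
    counted_fp_end();
    return ok;
}
```

In this model `toy_outer_fp_still_enabled()` returns false and `counted_outer_fp_still_enabled()` returns true. If something similar is happening in the driver, the thing to look for would be a second guard pair somewhere between dcn20_validate_bandwidth and dcn20_populate_dml_pipes_from_context.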
Relevant people: @madscientist159, @meklort, @Skirmisher
Supersedes #1118 (closed)