[amdgpu/Navi] [powerplay] Failed to send message errors

Shmerl @shmerl said:

These errors also happen when using radeon-profile to control the fan speed:

[ 3099.422315] amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
[ 3099.422318] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 3145.423048] amdgpu: [powerplay] Failed to send message 0x12, response 0xfffffffb param 0x6
[ 3145.423051] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 3145.423076] amdgpu: [powerplay] Failed to send message 0x12, response 0xfffffffb, param 0x6
[ 3149.423073] amdgpu: [powerplay] Failed to send message 0x12, response 0xfffffffb param 0x6
[ 3149.423076] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 3200.422744] amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xa90000
[ 3200.422846] amdgpu: [powerplay] Failed to send message 0x12, response 0xfffffffb param 0x6
[ 3200.422850] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 3234.422189] amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xa90000

Shmerl @shmerl said:

Related: https://github.com/marazmista/radeon-profile/issues/157

aqxa1 @aqxa1 said:

Are you running a monitor at 75hz?

I can only trigger the bug when setting 74-76hz with amd-staging-drm-next, and although I haven't tested in a while, I suspect the same applies with 5.3-rcX (and drm-next-5.4).

Here's the output after setting 75hz, on amd-staging-drm-next:
[ 7937.682003] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18) param: 0x00000006 response 0xffffffc2
[ 7937.682004] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 7938.087356] amdgpu: [powerplay] failed send message: NumOfDisplays (64) param: 0x00000001 response 0xffffffc2
[ 7940.224391] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18) param: 0x00000006 response 0xffffffc2
[ 7940.224392] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 7942.362952] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7944.510060] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7944.510061] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 7945.269921] amdgpu: [powerplay] failed send message: NumOfDisplays (64) param: 0x00000001 response 0xffffffc2
[ 7946.652777] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7947.411808] amdgpu: [powerplay] failed send message: NumOfDisplays (64) param: 0x00000001 response 0xffffffc2
[ 7948.786413] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7948.786414] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 7950.918131] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7953.076247] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2
[ 7953.076250] amdgpu: [powerplay] Failed to export SMU metrics table!
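
The two log styles quoted so far appear to describe the same failures: message 0xe with param 0x80 in the older format lines up with SetDriverDramAddrHigh (14) param 0x00000080 in the newer one, and 0x12 with param 0x6 lines up with TransferTableSmu2Dram (18). A minimal sketch for normalizing both styles follows. The regular expressions and the helper names (`parse_line`, `to_signed32`) are illustrative and not part of any driver tooling. Treating the response as a signed 32-bit value is an assumption; if it is a negative errno, -5 would correspond to EIO and -62 to ETIME on Linux.

```python
import re

# Older kernels: "Failed to send message 0x12, response 0xfffffffb param 0x6"
# (the comma after the response value is sometimes present, sometimes not).
OLD = re.compile(
    r"Failed to send message (0x[0-9a-f]+), response (0x[0-9a-f]+),? param (0x[0-9a-f]+)"
)
# Newer kernels: "failed send message: NumOfDisplays (64) param: 0x00000001 response 0xffffffc2"
NEW = re.compile(
    r"failed send message:\s*(\w+) \((\d+)\)\s*param: (0x[0-9a-f]+) response (0x[0-9a-f]+)"
)


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one (assumption: errno-style)."""
    return value - (1 << 32) if value & 0x80000000 else value


def parse_line(line):
    """Return (message_id, name_or_None, param, signed_response), or None if no match."""
    m = NEW.search(line)
    if m:
        name, msg_id, param, resp = m.groups()
        return int(msg_id), name, int(param, 16), to_signed32(int(resp, 16))
    m = OLD.search(line)
    if m:
        msg_id, resp, param = m.groups()
        return int(msg_id, 16), None, int(param, 16), to_signed32(int(resp, 16))
    return None


if __name__ == "__main__":
    samples = [
        "[ 3099.422315] amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80",
        "[ 7942.362952] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) param: 0x00000080 response 0xffffffc2",
    ]
    for s in samples:
        print(parse_line(s))
```

Feeding a full dmesg capture through `parse_line` and counting by message ID would show which SMU messages fail most often on a given setup.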

Shmerl @shmerl said:

(In reply to Andrew Sheldon from comment 3)

> Are you running a monitor at 75hz?

No, 60 Hz which is my monitor's native refresh rate.

Michael de Lang said:

I can confirm this happens when I use a dual-monitor setup. I have two 1440p@144 Hz screens, and these messages appear when I boot with both screens connected, or when I check the GPU temperature with the sensors command while both screens are active.

With only one screen active, I cannot reproduce the bug.

Shmerl @shmerl said:

Just for reference, my connection is DisplayPort 1.2.

Tako Marks said:

I ran into this issue when messing around with my BIOS settings. Not sure if this is helpful, but when I had the option Decode Above 4G (64-bit addressing on the PCI bus?) enabled on my Gigabyte Aorus B450, I experienced the same issue. After turning that option back off, everything is working again.

Shmerl @shmerl said:

I don't get these errors anymore when using radeon-profile with kernel 5.4-rc6.

But with ksysguard, the "Failed to export SMU metrics table!" message still occurs, though it's not causing any stalls or hangs now.

I'm having the same issue right now with my Sapphire RX 5700 XT (DisplayPort, 75 Hz, FreeSync active on the monitor).

Nov 23 13:09:19 ansan kernel: amdgpu: [powerplay] Failed to send message 0x12, response 0xfffffffb, param 0x6
Nov 23 13:09:19 ansan kernel: amdgpu 0000:03:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)
Nov 23 13:09:19 ansan kernel: amdgpu 0000:03:00.0:   at page 0x00000006014c2000 from 18
Nov 23 13:09:19 ansan kernel: amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00041152
Nov 23 13:09:20 ansan kernel: amdgpu: [powerplay] Failed to send message 0x12, response 0xffffffc2 param 0x6

Important packages:

linux 5.3.12.1-1
linux-firmware 20191022.2b016af-3
mesa-git 1:19.3.0_devel.116728.f306d079320-1
vulkan-radeon-git 1:19.3.0_devel.116728.f306d079320-1
libdrm 2.4.100-1
lib32-mesa-git 1:19.3.0_devel.116728.f306d079320-1
lib32-vulkan-radeon-git 1:19.3.0_devel.116728.f306d079320-1
lib32-libdrm 2.4.100-1
sway 1:1.2-5

I will report back with Linux 5.4 when it's out. I think a lot of powerplay issues were fixed in that kernel.

@jattali 75 Hz monitors had an issue in the past, so you should try a different refresh rate if possible. There is a good chance it is fixed in 5.4 and/or amd-staging-drm-next, however.

I'm having the same problem with a dual-monitor setup: both monitors are 1920x1080 at 60 Hz, vsync, no G-Sync/FreeSync. After playing some Steam games for about 1-2 hours, I have around 2000 instances of that same message in the Ubuntu logs. I am using one of the forks of the following script, which lets me set custom fan curves using bash and the /sys interface: https://github.com/DominiLux/amdgpu-pro-fans

I noticed when I manually spam "cat /sys/class/drm/card0/device/hwmon/hwmon1/pwm1" in the console, sometimes the whole desktop freezes up and I have to restart. Not sure if related.

Using 5.3.0-050300-generic

I'm using Linux 5.4.1 and I'm having this problem. After a variable period of time after boot (seconds to tens of minutes) the desktop freezes and the system logs get spammed with these messages:

amdgpu: [powerplay] Failed to export SMU metrics table!
amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14) 	param: 0x00000080 response 0xffffffc2

I'm using dual monitor setup, both 1920x1200@60Hz, one on DisplayPort, the other on HDMI.

I'm also using the System-monitor Gnome shell extension to monitor temps and other stuff, if that matters.

$ uname -a
Linux marcows 5.4.1-050401-generic #201911290555 SMP Fri Nov 29 11:03:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ glxgears -info
GL_RENDERER   = AMD NAVI10 (DRM 3.35.0, 5.4.1-050401-generic, LLVM 9.0.0)
GL_VERSION    = 4.5 (Compatibility Profile) Mesa 19.2.1
GL_VENDOR     = X.Org

EDIT: adding a picture of the screen I grabbed after I managed to execute poweroff in a terminal; messages were emitted constantly 1 per second; "Failed to export SMU metrics table!" messages are out of screen.

Do you have some utility querying hwmon stuff periodically? If so does stopping that help?

Were there some fixes that prevented those powerplay deadlocks, or did they not make it into 5.4?

I have the System-monitor Gnome shell extension installed (used to monitor temps, system load, etc.), which I believe uses the hwmon facility (it lets me read the same sensors as the sensors CLI program).

I disabled it yesterday and I haven't had any desktop freezes since.

I noticed a [powerplay] failed send message: SetDriverDramAddrLow (15) once in dmesg, but no hangs or any other problems.

I also get this issue, on a single 60Hz monitor setup on a PowerColor Red Devil 5700 XT.

Since I installed the graphics card on 12th November, I have only had 3 lock-ups, even with the computer on nearly all day. Strangely, my lock-ups have so far only occurred while playing Dead Cells.

I do have a widget in Plasma that monitors GPU temperature, as well as ksysguard (not actively open, though) and radeon-profile.

Running Arch Linux kernel 5.4.1-arch1-1 and latest mesa-git and llvm from the mesa-git repository.

Also, when the issue occurs, the display only locks up sometimes, when something new is drawn to the screen, such as a menu or a new window. I can still use the computer and even shut down gracefully.

[ 8353.590432] amdgpu: [powerplay] failed send message: SetDriverDramAddrLow (15)       param: 0x004c6000 response 0xfffffffb
[ 8353.604407] amdgpu 0000:1f:00.0: [mmhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 8353.604411] amdgpu 0000:1f:00.0:   in page starting at address 0x00006000004c6000 from client 18
[ 8353.604412] amdgpu 0000:1f:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041152
[ 8353.604413] amdgpu 0000:1f:00.0:      MORE_FAULTS: 0x0
[ 8353.604414] amdgpu 0000:1f:00.0:      WALKER_ERROR: 0x1
[ 8353.604415] amdgpu 0000:1f:00.0:      PERMISSION_FAULTS: 0x5
[ 8353.604416] amdgpu 0000:1f:00.0:      MAPPING_ERROR: 0x1
[ 8353.604416] amdgpu 0000:1f:00.0:      RW: 0x1
[ 8355.703243] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18)      param: 0x00000006 response 0xffffffc2
[ 8355.703246] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 8355.704887] amdgpu: [powerplay] failed send message: SetDriverDramAddrLow (15)       param: 0x004c6000 response 0xffffffc2
[ 8357.816175] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[ 8357.823253] amdgpu: [powerplay] failed send message: SetDriverDramAddrLow (15)       param: 0x004c6000 response 0xffffffc2
[ 8357.823256] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 8359.929368] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[ 8359.929371] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 8359.947286] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2

[ 8394.339346] amdgpu: [powerplay] failed send message: EnableSmuFeaturesLow (8)        param: 0x00000000 response 0xffffffc2
[ 8394.339349] amdgpu: [powerplay] [smu_v11_0_auto_fan_control]Start smc FAN CONTROL feature failed!
[ 8394.339350] amdgpu: [powerplay] [smu_v11_0_set_fan_control_mode]Set fan control mode failed!
[ 8396.052751] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2

I definitely feel like there is some concurrency issue going on... I have some new evidence aside from just manually flooding my terminal with cat commands and the freezing of my desktop.

The bash service script I am running is from the following repository: https://github.com/grmat/amdgpu-fancontrol/blob/master/amdgpu-fancontrol. It clearly polls the PWM state of my GPU's fan:

FILE_PWM=$(echo /sys/class/drm/card0/device/hwmon/hwmon?/pwm1)
function set_pwm {
  NEW_PWM=$1
  OLD_PWM=$(cat $FILE_PWM)
  [...]

By default the script is set to update every second. All I did was set it to 60 seconds, run update.sh, and restart. Now my logs do not contain any powerplay messages even after several minutes, and I do not notice any micro-freezing of the desktop.

If you are having powerplay problems with desktop freezing, try disabling anything that automatically polls sensors from hwmon. Widgets, scripts, lm-sensors, gpu profiler apps, etc, and see if the problem goes away.
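
For tools that cannot simply be switched off, the same idea (poll less often) can be sketched as a small caching wrapper around a sysfs read. This is a hypothetical helper, not taken from any of the scripts mentioned in this thread; the class name `ThrottledSensor`, the 60-second default, and the example path are assumptions for illustration. It only reduces how often one process touches hwmon, and it does not address the apparent race between several pollers.

```python
import time


class ThrottledSensor:
    """Read a sysfs sensor file at most once per `min_interval` seconds.

    Calls made inside the interval return the cached value instead of
    touching the file again.
    """

    def __init__(self, path, min_interval=60.0, clock=time.monotonic):
        self.path = path
        self.min_interval = min_interval
        self.clock = clock          # injectable so the timing can be tested
        self._last_read = None
        self._cached = None

    def read(self):
        now = self.clock()
        stale = self._last_read is None or now - self._last_read >= self.min_interval
        if stale:
            with open(self.path) as f:
                self._cached = f.read().strip()
            self._last_read = now
        return self._cached


if __name__ == "__main__":
    # Example path only; the hwmon index differs between systems.
    pwm = ThrottledSensor("/sys/class/drm/card0/device/hwmon/hwmon1/pwm1")
    try:
        print(pwm.read())
    except OSError as exc:
        print("sensor not readable:", exc)
```

A fan-curve loop could keep its one-second tick for everything else and call `read()` on every tick; the file is only opened once per interval.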

Just built 5.5-rc1 and running ksysguard with amdgpu sensors that were producing those errors before. Now errors are gone!

I suppose this can be closed then.

Actually, that still happens. Just got this in dmesg:

[ 2121.740139] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18)      param: 0x00000006 response 0xffffffc2
[ 2121.740142] amdgpu: [powerplay] Failed to export SMU metrics table!
[ 2124.503801] amdgpu: [powerplay] failed send message: NumOfDisplays (64)      param: 0x00000001 response 0xffffffc2
[ 2127.272493] amdgpu: [powerplay] failed send message: NumOfDisplays (64)      param: 0x00000001 response 0xffffffc2

It's not causing any lock-ups though and it happened just once so far. So some race condition or something else wrong with powerplay still remains.

It's also preceded by the following (though it may just be a coincidence):

[ 2079.709132] WARNING: CPU: 16 PID: 990 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2950 dcn20_validate_bandwidth+0x99/0xb0 [amdgpu]
[ 2079.709132] Modules linked in: rfcomm(E) nf_tables(E) nfnetlink(E) bnep(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) btusb(E) btrtl(E) btbcm(E) snd_hda_codec_realtek(E) btintel(E) snd_hda_codec_generic(E) iwlmvm(E) crct10dif_pclmul(E) ledtrig_audio(E) crc32_pclmul(E) snd_hda_codec_hdmi(E) uvcvideo(E) videobuf2_vmalloc(E) bluetooth(E) videobuf2_memops(E) mac80211(E) ghash_clmulni_intel(E) videobuf2_v4l2(E) snd_usb_audio(E) snd_hda_intel(E) libarc4(E) snd_intel_dspcfg(E) videobuf2_common(E) nls_ascii(E) snd_usbmidi_lib(E) drbg(E) nls_cp437(E) snd_hda_codec(E) aesni_intel(E) snd_rawmidi(E) crypto_simd(E) ansi_cprng(E) videodev(E) snd_seq_device(E) vfat(E) iwlwifi(E) snd_hda_core(E) ecdh_generic(E) efi_pstore(E) cryptd(E) ecc(E) fat(E) mc(E) snd_hwdep(E) glue_helper(E) wmi_bmof(E) sp5100_tco(E) efivars(E) k10temp(E) pcspkr(E) crc16(E) watchdog(E) snd_pcm(E) snd_timer(E) cfg80211(E) snd(E) soundcore(E) rfkill(E) sg(E) ccp(E) rng_core(E) evdev(E) acpi_cpufreq(E) nct6775(E) hwmon_vid(E)
[ 2079.709152]  parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) amdgpu(E) gpu_sched(E) ttm(E) drm_kms_helper(E) ahci(E) libahci(E) crc32c_intel(E) xhci_pci(E) mxm_wmi(E) drm(E) libata(E) i2c_piix4(E) xhci_hcd(E) mfd_core(E) igb(E) scsi_mod(E) usbcore(E) dca(E) ptp(E) pps_core(E) i2c_algo_bit(E) nvme(E) nvme_core(E) wmi(E) button(E)
[ 2079.709166] CPU: 16 PID: 990 Comm: Xorg Tainted: G            E     5.5.0-rc1 #2
[ 2079.709167] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.50 11/02/2019
[ 2079.709229] RIP: 0010:dcn20_validate_bandwidth+0x99/0xb0 [amdgpu]
[ 2079.709230] Code: 00 00 00 5d 41 5c e9 16 f7 ff ff 31 d2 f2 0f 11 85 70 21 00 00 48 89 ee 4c 89 e7 e8 01 f7 ff ff 89 c2 22 95 b8 1d 00 00 75 04 <0f> 0b eb b3 c6 85 b8 1d 00 00 00 89 d0 eb a8 0f 1f 84 00 00 00 00
[ 2079.709231] RSP: 0018:ffffb074015cfad8 EFLAGS: 00010246
[ 2079.709232] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000002fd6
[ 2079.709232] RDX: 0000000000000000 RSI: ffff9d1e7ec2ff40 RDI: 000000000002ff40
[ 2079.709233] RBP: ffff9d1d67fa0000 R08: 0000000000000006 R09: 0000000000000000
[ 2079.709233] R10: ffff9d1e699b0000 R11: 0000000100000001 R12: ffff9d1e699b0000
[ 2079.709234] R13: ffff9d1dc2ed3d80 R14: 0000000000000000 R15: ffff9d1d67fa0000
[ 2079.709235] FS:  00007fe02c46cf00(0000) GS:ffff9d1e7ec00000(0000) knlGS:0000000000000000
[ 2079.709235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2079.709236] CR2: 0000555c86f277a8 CR3: 00000007f3f00000 CR4: 0000000000340ee0
[ 2079.709237] Call Trace:
[ 2079.709298]  dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[ 2079.709360]  amdgpu_dm_atomic_check+0x5d0/0x830 [amdgpu]
[ 2079.709374]  drm_atomic_check_only+0x554/0x7e0 [drm]
[ 2079.709385]  drm_atomic_commit+0x13/0x50 [drm]
[ 2079.709396]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[ 2079.709409]  drm_mode_obj_set_property_ioctl+0x159/0x2d0 [drm]
[ 2079.709421]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[ 2079.709430]  drm_connector_property_set_ioctl+0x39/0x60 [drm]
[ 2079.709440]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 2079.709449]  drm_ioctl+0x208/0x390 [drm]
[ 2079.709459]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[ 2079.709498]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 2079.709501]  do_vfs_ioctl+0x461/0x6d0
[ 2079.709503]  ksys_ioctl+0x5e/0x90
[ 2079.709504]  __x64_sys_ioctl+0x16/0x20
[ 2079.709506]  do_syscall_64+0x52/0x180
[ 2079.709509]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2079.709510] RIP: 0033:0x7fe02c9c05b7
[ 2079.709511] Code: 00 00 90 48 8b 05 d9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 78 0c 00 f7 d8 64 89 01 48
[ 2079.709512] RSP: 002b:00007ffd7bc0c768 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2079.709513] RAX: ffffffffffffffda RBX: 00007ffd7bc0c7a0 RCX: 00007fe02c9c05b7
[ 2079.709513] RDX: 00007ffd7bc0c7a0 RSI: 00000000c01064ab RDI: 000000000000000d
[ 2079.709514] RBP: 00000000c01064ab R08: 0000000000000000 R09: 00007fe02c095d10
[ 2079.709514] R10: 00007fe02c095d20 R11: 0000000000000246 R12: 0000555c885745e0
[ 2079.709515] R13: 000000000000000d R14: 0000000000000000 R15: 0000555c86f4a7c0

Submitted by Shmerl `@shmerl`