Memory clock locked at max increasing power and temp (5.14.3)
Upgrading from kernel 5.13.12 to 5.14.3 causes the memory clock to stay locked at 875MHz even when idle; it never goes back down to 100MHz. As a result, power usage goes from 5.00 W to 18.00 W and temperature goes from 35 C to 45 C.
Well, somewhere down to 5.13 (after 3 or 4 builds) - I'm unable to build the kernel due to the following, and I have no idea what it means:
No rule to make target 'libbpf_legacy.h', needed by \ '/home/birdspider/repos/linux-git/src/linux/tools/bpf/resolve_btfids/libbpf/staticobjs/libbpf.o'. Stop.
bisect log
$ git bisect log
git bisect start
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect good 62fb9874f5da54fdb243003b386128037319b219
# bad: [6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f] Linux 5.15-rc1
git bisect bad 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f
# bad: [af4cf6a5689a9ecc21722cb2bb6220dcaee89c6e] Merge tag 'arm-defconfig-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad af4cf6a5689a9ecc21722cb2bb6220dcaee89c6e
# bad: [e058a84bfddc42ba356a2316f2cf1141974625c9] Merge tag 'drm-next-2021-07-01' of git://anongit.freedesktop.org/drm/drm
git bisect bad e058a84bfddc42ba356a2316f2cf1141974625c9
# skip: [a6eaf3850cb171c328a8b0db6d3c79286a1eba9d] Merge tag 'sched-urgent-2021-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect skip a6eaf3850cb171c328a8b0db6d3c79286a1eba9d
I have decided to switch to the 5.10.66-1-lts longterm kernel, as the "stable" branch constantly breaks my graphics card. Kernels 5.14.5 and 5.14.6 have introduced an even more severe bug that completely prevents any graphical environment from loading.
Error log:
kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110).
kernel: [drm:process_one_work] *ERROR* ib ring test failed (-110).
kernel: amdgpu 0000:03:00.0: amdgpu: Failed to power gate JPEG!
kernel: [drm:jpeg_v2_0_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
kernel: amdgpu 0000:03:00.0: amdgpu: Failed to power gate VCN!
kernel: [drm:amdgpu_dpm_enable_uvd [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
I just tried the latest commit 9c16208fa4389807164491bdd8e47deab4594403 but it still fails to load an X session.
Errors:
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1, emitted seq=2
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
@siqueira Reverting this commit on 5.14.13-xanmod fixes the memory clock being stuck at 100% also on a RX 5700 XT with three identical 1920x1200 60 Hz monitors connected via DisplayPort MST daisy chain. The GPU power draw also dropped considerably. Haven't noticed any regressions.
Tried reverting commit 136e55e7a92726be4a858f9ad69bd53a9c5d07ec and also tried the last good commit 3f8518b60c10aa96f3efa38a967a0b4eb9211ac0, but the issue persists. Using an RX 5700 XT and the latest amd-staging-drm-next.
@siqueira I can also confirm that reverting that commit on kernel 5.15.0-rc6 fixes the power draw/memory clock problem on my multi-display setup (2560x1440@144hz and 1920x1080@60hz) on a 6700 XT. Sadly does not allow me to use a 1080p 72hz CVT-RB V2 mode on my secondary, but 60hz is what it's supposed to run at, so no big deal.
@siqueira, I can confirm I've been experiencing a similar issue, and that reverting the commit helped!
My GPU is a Gigabyte Radeon RX 5700 8GB GAMING OC v1.0, and the monitor setup is two screens, one at 1920x1080@60Hz and the other at 1280x1024@60Hz.
Running kernel 5.13.19, memory clocks idle at 100MHz and 8W of power usage.
Running 5.14.0 or later, memory clocks are stuck at 800MHz and 33W of power usage.
Running tkg-pds build of kernel 5.15.4 without any changes, I was experiencing the issue.
Adding revert patch for commit 136e55e7a92726be4a858f9ad69bd53a9c5d07ec and rebuilding, memory clocks once again correctly downgrade and power usage is reduced when idling!
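For anyone wanting to check the idle memory clock the same way, the active level can be read from the amdgpu sysfs file pp_dpm_mclk (the path /sys/class/drm/card0/device/pp_dpm_mclk is mentioned elsewhere in this thread). Below is a minimal Python sketch, assuming the usual layout where each line is a level such as "3: 875Mhz" and the active level carries a trailing asterisk. The helper name active_mclk_mhz and the sample text are made up for illustration, and the card index may differ on other systems.

```python
import re
from pathlib import Path

# Assumed path; the card index (card0) may differ on other systems.
MCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_mclk")


def active_mclk_mhz(text):
    """Return the clock (in MHz) of the level marked with '*', or None."""
    match = re.search(
        r"^\s*\d+:\s*(\d+)\s*mhz\s*\*\s*$",
        text,
        re.MULTILINE | re.IGNORECASE,
    )
    return int(match.group(1)) if match else None


# Hypothetical sample contents, not captured from a real card.
SAMPLE = "0: 100Mhz\n1: 500Mhz\n2: 625Mhz\n3: 875Mhz *\n"
print(active_mclk_mhz(SAMPLE))
```

On a live system, passing MCLK_PATH.read_text() to the helper instead of SAMPLE should report the level the card is currently sitting at.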
specifically at amdgpu_dm_dtn_log, amdgpu_pm_info and state, see: amddebug.tar.xz
filename-suffixes:
initial (36W) // 2nd Monitor is off and connected via HDMI
reboot-2ndmonitor-disconnected ( 9W) // I disconnected the HDMI cable before reboot
reboot-2ndmonitor-disconnected.then-connected ( 9W) // I connected cable of the 2nd monitor
reboot-2ndmonitor-disconnected.then-enabled (35W) // I enabled the 2nd (xrandr)
reboot-2ndmonitor-disconnected.then-disabled (35W -> 25W -> 9W after a few seconds) // I disabled the 2nd
reboot-2ndmonitor-connected (35W) // I left the HDMI cable connected, back to square 1
Observations:
state.initial: even if the 2nd-monitor is never in use, state looks like one was enabled at some point:
state.initial vs. state.reboot-2ndmonitor-disconnected.then-disabled are identical, as if there is a second monitor enabled then disabled - somehow
amdgpu_pm_info.initial vs. amdgpu_pm_info.reboot-2ndmonitor-disconnected.then-disabled shows that it's possible to go down to a lower clock in general - but not if monitor connected at boot time
EDIT: I just checked to make sure: when the cable is not connected at boot time, I can enable/disable/connect/reconnect however I want - it's always 10W vs 35W (1 or 2 screens) and it correctly clocks down.
EDIT2: I just realized that instead of disconnecting the cable - it is sufficient to turn off (really, not standby) the monitor.
So my workaround for now is:
enable auto-poweroff on secondary monitor after 1h, (once)
then
daily reboots start with a powered-off secondary monitor
hence the bug is deferred - it uses 11W idle
once I start using the second monitor - it peaks at 35W
once I stop using the second monitor - it drops to 11W even if the monitor is now on standby (which is not the case if the system was booted with the monitor on standby)
I've been tinkering and reading (C and kernel debugging).
Is dcn2_update_clocks the right place (for NAVI10) where the clocks should switch? If my debug output is correct, the driver somehow thinks it is running with both clk_mgr_base->clks.dramclk_khz and new_clocks->dramclk_khz at 100MHz[*], while /sys/class/drm/card0/device/pp_dpm_mclk (and the power consumption) suggest 875MHz.
During dcn20_prepare_bandwidth and dcn20_optimize_bandwidth, the new_clocks do not differ from the current ones, so should_set_clock returns false and the clocks are never set.
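To illustrate why a stale tracked value would block the downclock, here is a small Python model of the comparison described above. It is only a sketch under assumptions: the rule (raise whenever the request is higher, lower only when it is safe to do so) is my reading of what should_set_clock does, not code copied from the kernel, and the variable names and kHz values simply mirror the numbers quoted in this comment.

```python
def should_set_clock(safe_to_lower, calc_clk_khz, cur_clk_khz):
    """Assumed rule: always raise; lower only when it is safe to do so."""
    return (safe_to_lower and calc_clk_khz < cur_clk_khz) or calc_clk_khz > cur_clk_khz


tracked_khz = 100_000    # what the driver believes it is running at
requested_khz = 100_000  # new_clocks->dramclk_khz in the debug output
hardware_khz = 875_000   # what pp_dpm_mclk suggests

# Requested equals tracked, so nothing is ever programmed.
print(should_set_clock(True, requested_khz, tracked_khz))   # False

# Had the tracked value matched the hardware, the lower clock would be set.
print(should_set_clock(True, requested_khz, hardware_khz))  # True
```

If this model is right, the bug would be in whatever leaves the tracked clock out of sync with the hardware, rather than in the comparison itself.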
My git-fu is letting me down: against what should this patch/series cleanly apply? Neither linux/master (or v/rc-tags) nor any of the agd5f/drm-* tags/branches I cared to try would fit.
I also tried to apply this against 5.16.0-rc5-1-git-00247-g3f667b5d4053, but all hunks get rejected even after giving patch large fuzz values. That said, looking at the diff, it seems to do pretty much what reverting 136e55e7a92726be4a858f9ad69bd53a9c5d07ec does, which is changing pipe_split_policy from either MPC_SPLIT_AVOID or MPC_SPLIT_AVOID_MULT_DISP back to MPC_SPLIT_DYNAMIC. So it should help; I just can't test it as-is.
@agd5f, while I didn't apply/test the patch specifically, I tested linux 800829388818 which has this among other things (amd-drm-fixes-5.16-2021-12-29) merged.
It allowed the memory clock to decrease, in contrast to the previous cases in this ticket where it didn't.