By default the kernel sets a maximum GPU clock that exceeds the manufacturer's specifications, causing hardware crashes

added 7000 dGPU series label

The first problem is that even though the video card has a maximum clock of 2525 MHz, it seems that Linux takes it up to 2930 MHz.

Can you help me understand this?

When you say Linux takes it up, do you mean when you try to write an overclocking value? Are you always writing the overclocking value on every boot (using software)? My understanding was that the card would automatically manage the frequency, but if you've written an overclocking value you might have instability.

IOW I'm wondering if the instability is caused by an assumption of LACT that the default values in the OD table are reasonable boundaries to write. If you turn off the LACT daemon and reboot do you still have instability?

Hello, I opened this issue because by default, without me changing any value in sysfs (I didn't even have LACT installed when I discovered the issue), pp_od_clk_voltage returns 2930Mhz. I used LACT to lower the value back to the manufacturer's stated value.

i never thought of overclocking my gpu, i want to keep the default values stated by the manufacturer

the main problem is that amdgpu should not put such a high clock by default, but respect the manufacturer default limits

Ah got it. Thanks.

Also; tangentially related to this I wonder if we should actually be tainting the system when the overclocking values have been modified. It would sure make it clearer in bug reports that it could be part of a problem.

changed title from Sapphire PULSE Radeon RX 7900 XTX - wrong power limit and max gpu clock to Sapphire PULSE Radeon RX 7900 XTX - wrong default power limit and max gpu clock

From your system can you share specifically the values of this with nothing changing them after bootup?

cat /sys/bus/pci/drivers/amdgpu/*/pp_dpm_sclk

[root@arch ~]# cat /sys/bus/pci/drivers/amdgpu/*/pp_dpm_sclk
0: 500Mhz
1: 184Mhz *
2: 2371Mhz
0: 400Mhz 
1: 600Mhz *
2: 2200Mhz 

[root@arch ~]# cat /sys/class/drm/card1/device/pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2930Mhz
OD_MCLK:
0: 97Mhz
1: 1250MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       5000Mhz
MCLK:      97Mhz       1500Mhz
VDDGFX_OFFSET:    -450mv          0mv

I disabled any software that could interfere and rebooted before taking the values.

I guess you have two GPUs in the system.

So according to that Linux shouldn't ever be taking it up to 2930MHz by itself. The highest it will go is 2371MHz on one GPU and 2200MHz on the other. If you turn on manual DPM control then it's a different story.

So the fact that it didn't crash anymore was simply a coincidence?

Was my mistake just reading pp_od_clk_voltage instead of checking via pp_dpm_sclk?

In that case I'll close the issue.

So the fact that it didn't crash anymore was simply a coincidence?

Possibly; or maybe it's masking another issue?

Was my mistake just reading pp_od_clk_voltage instead of checking via pp_dpm_sclk?

Yeah so this is basically reading the values programmed for "min" and "max" while overclocked. I think you're just getting the card overclocking defaults. If you're not overclocking they shouldn't be in use.

In that case I'll close the issue.

Well feel free to collect more data on your findings, it's not to say there isn't a problem but it's not clear what it is right now if there is one.
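For anyone comparing the two files discussed above, here is a minimal sketch that pulls the top DPM level out of pp_dpm_sclk and the upper OD_SCLK entry out of pp_od_clk_voltage. The function names are made up for illustration, and the parsing assumes the output layout pasted earlier in this thread; other kernels or ASICs may print it differently.

```shell
# Print the highest clock (in MHz) found in pp_dpm_sclk-style text on stdin.
# Assumes lines of the form "<level>: <clock>Mhz [*]".
dpm_max_mhz() {
    awk '/^[0-9]+:/ { v = $2; sub(/[Mm][Hh][Zz]/, "", v); if (v + 0 > max) max = v + 0 }
         END { print max + 0 }'
}

# Print the upper OD_SCLK entry (in MHz) from pp_od_clk_voltage-style text on stdin.
# Only the "1:" line directly under the OD_SCLK: header is considered.
od_sclk_max_mhz() {
    awk '/^OD_SCLK:/ { in_sclk = 1; next }
         /^[A-Z_]+:/ { in_sclk = 0 }
         in_sclk && /^1:/ { v = $2; sub(/[Mm][Hh][Zz]/, "", v); print v + 0 }'
}
```

Usage would look like `od_sclk_max_mhz < /sys/class/drm/card1/device/pp_od_clk_voltage` and `dpm_max_mhz < /sys/class/drm/card1/device/pp_dpm_sclk`; the card index depends on the system.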

I'll do more tests with pp_od_clk_voltage set to the default limit and with the manufacturer's default limit to see whether there is a correlation.

@superm1 is it possible that the gpu can reach higher values (the pp_od_clk_voltage values) than those declared on pp_dpm_sclk in case of boost?

I don't believe it should be able to (barring a bug). @agd5f agree?

https://gist.github.com/andrew-ld/896f46efc94fb5e54453203ded0b5834

I ran glmark at a very high resolution (10000x10000) while printing info from amdgpu_pm_info, and it seems that SCLK reaches 2739 MHz.

I tried again after limiting the clock via pp_od_clk_voltage, and it does not seem to exceed the pp_od_clk_voltage limit. Incidentally, it reaches similar performance while consuming less power.
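A rough way to repeat that measurement is to sample amdgpu_pm_info during the benchmark and keep the highest SCLK seen. The helper name below is hypothetical, and the parsing assumes the clock is reported on a line of the form `<number> MHz (SCLK)`, which is an assumption about the debugfs output rather than something shown in this thread.

```shell
# Print the highest SCLK value (in MHz) found in amdgpu_pm_info-style text on stdin.
# Assumed line format: "<number> MHz (SCLK)", possibly with leading whitespace.
peak_sclk_mhz() {
    awk '/\(SCLK\)/ { if ($1 + 0 > max) max = $1 + 0 }
         END { print max + 0 }'
}
```

One possible use: collect samples as root with `while sleep 1; do cat /sys/kernel/debug/dri/0/amdgpu_pm_info; done > samples.txt` while the load runs (the `dri/0` index is also an assumption), then run `peak_sclk_mhz < samples.txt`.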

OK then there really is a bug somewhere along the way with what the driver is doing internally.

My interpretation of that is overclocking is getting turned on by default when it shouldn't be.

[root@arch ~]# cat /proc/cmdline
initrd=\amd-ucode.img initrd=\initramfs-linux-zen.img quiet rd.luks.name=ba2c6422-ed7d-48bb-ac44-a29af57c61fd=root root=/dev/mapper/root rw tsc=reliable clocksource=tsc

I did a final test after removing every amdgpu option from the cmdline (before, I had the feature mask 0xfffd7fff). The problem remains without any amdgpu option on the cmdline.
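Since the feature mask matters for whether the overdrive files are available at all, here is a minimal sketch for checking a mask value. It assumes the overdrive bit is 0x4000 (PP_OVERDRIVE_MASK in the amdgpu headers, to the best of my knowledge); treat that constant as an assumption and check it against your kernel source. The function name is made up for illustration.

```shell
# Succeed if the overdrive bit is set in the given ppfeaturemask value.
# Accepts hex (0x...) or decimal. 0x4000 is assumed to be PP_OVERDRIVE_MASK.
od_bit_set() {
    [ $(( $1 & 0x4000 )) -ne 0 ]
}
```

For example, `od_bit_set "$(cat /sys/module/amdgpu/parameters/ppfeaturemask)" && echo "overdrive bit set"` checks the value the loaded module is actually using.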

The max clock exposed in pp_dpm_sclk is the max sustainable clock. The AIB can define higher boost clocks which will be utilized if there is enough thermal/power headroom.

@agd5f in my case from pp_od_clk_voltage I deduce that the limit that has been set is 2930Mhz, the manufacturer stated that in boost mode it goes up to 2525mhz (without oc), is it a problem that on amdgpu it is not limited up to 2525mhz by default?

I have the same issue with my RX 7800 XT (PowerColor Fighter). It also starts with the wrong max clock and a far too high overdrive range:

OD_RANGE:
SCLK:     500Mhz       5000Mhz
MCLK:      97Mhz       1500Mhz

(Manufacturer spec says 2430Mhz for boost.)

That might also explain why some games crash the driver (and typically the whole machine) on Linux, while they work fine on Windows.

Did you solve the problem by setting a lower clock limit?

Seems so. Before, starting a game in Enshrouded (which is a good test subject, since it uses excessive GPU power with Proton) caused my GPU to lock up (with the same error as you have), the wayland session restarted and then the whole system froze unrecoverably. Every time I tried.

Now with the limits applied it starts to lag horribly, but it doesn't crash anymore. Some times it even runs fine and doesn't lag at all. But that is likely more a problem with Enshrouded/proton or radv. In any case: the hardware doesn't die anymore.

applying the manufacturer's recommended frequencies on my gpu reduces performance by very little but also reduces power consumption by a lot, it is definitely a good tradeoff

mentioned in issue #3128

mentioned in issue #3067

I don't see the pp_od_clk_voltage file. I do see the other sysfs files. Is there any way for me to investigate this as well?

I have been pointed here with the same issue. 7900xtx merc 310 version, constant crashes during Helldivers 2 and Enshrouded. Link to journalctl: https://pastebin.com/eFsW0TUq

pp_od_clk_voltage output:

OD_SCLK:
0: 500Mhz
1: 2500Mhz
OD_MCLK:
0: 97Mhz
1: 1250MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       5000Mhz
MCLK:      97Mhz       1500Mhz
VDDGFX_OFFSET:    -450mv          0mv

I have been looking up how to change the OD_RANGE, but the wiki seems to cover only Sclk and Mclk. Any guidance would be greatly appreciated, but I will continue my search on my end as well.

Thanks, Terry

I change the values on my 7900xtx using https://github.com/ilya-zlobintsev/LACT

Oh my gosh, I got it! I found an archwiki page, thank you! I am an idiot, the max is for the boost range and my card was set at 2500 MHz when the manufacturer said max is 2453 MHz. I didn't think that 47 MHz would matter, but I guess it does so far. I went 2 games in a row without a lockup. If this goes a week or so without a lockup, can I send you some beer money as a way of thanks? You have no idea how close I was to just giving up and just dual booting. Thank you thank you!

Still having lockups unfortunately. The randomness of these things is frustrating, but from my very anecdotal observations, it seems to happen less than when the clock is set at 2500MHz default. I will do some more testing tonight or tomorrow and see if I can get them completely suppressed with a low enough clock

I'm also seeing the same behavior on Helldivers 2 with my Merc 7900 XT. I haven't modified any of the overclock or undervolt settings.

i have same problem on amd rx 6950xt

cat pp_dpm_sclk
0: 500Mhz *
1: 2720Mhz

cat pp_dpm_mclk
0: 96Mhz
1: 456Mhz *
2: 673Mhz
3: 1124Mhz

cat pp_od_clk_voltage
OD_SCLK:
0: 500Mhz
1: 2649Mhz
OD_MCLK:
0: 97Mhz
1: 1124MHz
OD_VDDGFX_OFFSET:
0mV
OD_RANGE:
SCLK:     500Mhz       5000Mhz
MCLK:     674Mhz       1500Mhz

But the published specs of my GPU list different values: https://www.techpowerup.com/gpu-specs/radeon-rx-6950-xt.c3875

Yep, same thing here: it was overclocking to 2930 but the max should be 2560MHz (7900xt merc black edition 310). Setting the correct value in LACT solved my crashing in big, demanding games.

mentioned in issue mesa/mesa#10883 (closed)

Hello @superm1 @andrew-ld, I can confirm that by default it sets an incorrect power limit and quite incorrect max clock limits. I have a Sapphire RX7900XTX Pulse:

It sets the default power limit to 303W, but for this chip it should be at least 355W (as described on the AMD website) or even 370W (as described on the official website for this card).
It also seems to use a higher GPU clock than it should. I'm not sure, because going by runtime readings the clock under load is ~2600 MHz, but the max allowed limit is nearly 2900 MHz by default (for my card it should be about 2525MHz).

If I set the correct power limit of 370W, it tries to boost the GPU even higher, uses the whole 370W, and gets hotter and hotter. So as the next step I set the correct max allowed GPU clock of 2525MHz, and now it uses ~280W and runs cooler and quieter!

PS: I'm not sure, but technically the GPU should be allowed to use at least 355W so that it can, for example, use the 3D blocks and video acceleration at the same time without slowing either of them down. (If I understand correctly, without changes it uses the whole 303W power limit, so using an additional block forces a slowdown to fit within the incorrect limit, whereas it could use an additional 55-77W to avoid that.)

Here is how I work around it:

Set the kernel parameter amdgpu.ppfeaturemask=0xffffffff (or the equivalent module option in /etc/modprobe.d/amdgpu.conf)
Set the correct limits:

echo "370000000" | sudo tee /sys/class/drm/card0/device/hwmon/*/power1_cap && \
sleep 1 && \
echo 's 1 2525' | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage && \
echo 'c' | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage
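Note that power1_cap takes microwatts, which is why 370W is written as 370000000. A small sketch of that conversion with a sanity check against a maximum; the function name is hypothetical, and reading the maximum from the power1_cap_max file next to power1_cap is an assumption about what your hwmon directory exposes.

```shell
# Convert a power limit in watts to microwatts for power1_cap.
# Fails (non-zero status, nothing on stdout) if the result exceeds max_uw.
cap_microwatts() (
    uw=$(( $1 * 1000000 ))
    if [ "$uw" -gt "$2" ]; then
        echo "requested ${1}W exceeds the reported maximum of ${2} uW" >&2
        exit 1
    fi
    echo "$uw"
)
```

For example, `cap_microwatts 370 "$(cat /sys/class/drm/card0/device/hwmon/*/power1_cap_max)"` prints the value to write, or fails if the card reports a lower maximum.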

PPS: I load my card with AI training, which I think should be about as heavy a load as possible, but again it doesn't use, for example, the RT or AI blocks (which would mean additional power consumption).

Add another Sapphire 7900XTX Pulse to the list of wrong clocks and power limits being set automatically, causing GPU crashes and locking the system, having to use SysRq to reboot. OD_SCLK was reporting an upper clock limit of 5000Mhz! Power limit was being set to 303W. According to Sapphire, the card is good for a 2525Mhz boost clock and 370W TDP.

Enabled kernel parameter "amdgpu.ppfeaturemask=0xffffffff", installed LACT and set max clock to 2525Mhz and power limit to 370W and have been stable for a couple weeks now.

This comment made me actually go look, my ASRock 6800 XT Phantom Gaming D 16G OC was only set to 272W for default, the base tdp for a 6800xt is 300w!

It only goes up to 312w, even though this card has 3x8 pin connectors and other 3x8 pin OC versions look to have a 350w tdp.

I'm going to play a lot of heavy games tonight to see if this fixed my crashes!

@superm1 @agd5f is there any update regarding this issue? Many issues regarding crashes and freezes relate back to this one; it does not seem to be an isolated case.

@superm1 @agd5f Hello, I noticed an even stranger situation that can be really dangerous for GPUs affected by this issue:

After I set the correct clock I tried to set the correct power limit, and here is the problem: setting the power limit resets the clock to the previous, incorrect one.
I also noticed that setting fan_curve before setting the clocks prevents the correct clocks from being set.

The danger is that if someone applies the settings in, for example, the order below, it causes a huge automatic overclock (in my case 2.8-2.9GHz instead of 2.5GHz) and uses the whole power limit (in my case 370W):

Power limits - OK
Fan Curve - OK
Clocks - FAIL blocked by fan curve setting

I also noticed that setting the fan curve does not always work, depending on the order in which the settings are applied.

I found only one order that sets both the correct limits and the fan curve:

Power limit
Clock
Fan Curve

Here is the working config that I found:

echo "0 60 30" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "1 65 40" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "2 70 50" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "3 80 90" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "4 90 100" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve

echo 's 1 2525' > /sys/class/drm/card0/device/pp_od_clk_voltage

echo "370000000" > /sys/class/drm/card0/device/hwmon/*/power1_cap
echo 'c' > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve

PS: Knowing this, always check the clocks after applying settings, because it can skip one setting while applying the others, which can create a dangerous situation for the GPU.
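That check can be scripted. A minimal sketch, assuming the pp_od_clk_voltage layout shown earlier in this thread (an OD_SCLK: header followed by "0:" and "1:" lines); the function name is made up for illustration.

```shell
# Succeed only if the upper OD_SCLK entry in the given file equals the expected MHz value.
# $1 = expected clock in MHz, $2 = path to a pp_od_clk_voltage file
sclk_limit_is() (
    actual=$(awk '/^OD_SCLK:/ { s = 1; next }
                  /^[A-Z_]+:/ { s = 0 }
                  s && /^1:/  { print $2 + 0 }' "$2")
    [ "$actual" = "$1" ]
)
```

After applying the power limit, clock and fan curve, something like `sclk_limit_is 2525 /sys/class/drm/card0/device/pp_od_clk_voltage || echo "clock limit was reset"` would flag the case where a later write silently restored the old clock.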

I can reproduce this. Setting fan_curve, fan_minimum_pwm or fan_target_temperature often results in the custom clock being completely ignored, even in your specified order it sometimes gets reset.

Edit: https://github.com/ilya-zlobintsev/LACT/issues/329#issuecomment-2168466639

@superm1 @agd5f Technically this can affect all users with these GPUs who use software like LACT to fix this problem temporarily.

@serhii-nakon I noticed this problem too (fan curve). I had ignored it because I thought my GPU simply didn't allow it, so I hadn't investigated.
