I'm using an RX 5500 XT card on an Asus PRIME H270-PRO motherboard with an Intel i5-7500 CPU, running kernel 5.10.9 under Fedora 33. I noticed that in Linux, "lspci -vv" always shows the GPU PCIe link running at 2.5GT/s and never seems to change regardless of the application being run, while in Windows, GPU-Z shows the link running at the maximum supported 8GT/s when under graphical load.
It seems like the driver thinks that 2.5GT/s is the max allowable speed, based on the pp_dpm_pcie file:
I'm assuming that something is going wrong with the PCIe link speed detection in the driver. Using the "amdgpu.pcie_gen_cap=0x70007" kernel command line option seems to result in the driver detecting the proper 8GT/s maximum speed.
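For reference, here's how I'm decoding that override, based on the mask bits I see in drivers/gpu/drm/amd/include/amd_pcie.h (take the exact values with a grain of salt if they've changed between kernel versions):

/* As I read amd_pcie.h: lower 16 bits = link speeds the ASIC supports,
 * upper 16 bits = link speeds the platform/chipset supports. */
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN1  0x00000001
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN2  0x00000002
#define CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3  0x00000004

#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN1       0x00010000
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN2       0x00020000
#define CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3       0x00040000

/* So amdgpu.pcie_gen_cap=0x70007 marks Gen1|Gen2|Gen3 as supported on both
 * the platform side and the ASIC side, which would let the driver use the
 * 8GT/s state instead of sticking at 2.5GT/s. */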
lspci -vv output from booting without overriding the speed is attached.
The driver calls the pci core function pcie_bandwidth_available() to get the available link speed and lanes between it and the root complex. Presumably some link in that path is limited to 2.5GT/s or there is some issue in pcie_bandwidth_available() on your board.
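Roughly speaking, the query looks like this (illustrative sketch only, not the exact amdgpu code):

#include <linux/pci.h>

/* Illustrative only: the driver asks the PCI core for the bandwidth
 * available along the whole path to the root, and the core reports the
 * slowest hop it finds.  Passing a non-NULL second argument would also
 * return which device is the limiting one. */
static void query_available_link(struct pci_dev *pdev)
{
        enum pci_bus_speed speed;
        enum pcie_link_width width;
        u32 bw_mbps;

        bw_mbps = pcie_bandwidth_available(pdev, NULL, &speed, &width);
        pci_info(pdev, "available bandwidth along path: %u Mb/s\n", bw_mbps);
}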
I think that same function is used in the code that prints this output in dmesg on boot. It's saying the CPU PCIe root port is limiting the link speed:
pci 0000:01:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x8 link at 0000:00:01.0 (capable of 126.024 Gb/s with 16.0 GT/s PCIe x8 link)
pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x8 link at 0000:00:01.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
But that's wrong, since that link is clearly capable of 8GT/s:
It seems like pcie_bandwidth_available is just looking at the LnkSta register to determine how much bandwidth is available. After boot, it apparently shows a speed of 2.5GT/s. Maybe that's the speed the GPU's link comes up at on boot, which results in this code thinking it can't run any faster?
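For reference, this is roughly the loop as I read it in drivers/pci/pci.c (paraphrased from the 5.10 source, so details may differ) -- it reads LnkSta at every hop:

u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev,
                             enum pci_bus_speed *speed,
                             enum pcie_link_width *width)
{
        u16 lnksta;
        enum pci_bus_speed next_speed;
        enum pcie_link_width next_width;
        u32 bw = 0, next_bw;

        while (dev) {
                /* current (trained) state of this link, not its capability */
                pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);

                next_speed = pcie_link_speed[lnksta & PCI_EXP_LNKSTA_CLS];
                next_width = (lnksta & PCI_EXP_LNKSTA_NLW) >>
                             PCI_EXP_LNKSTA_NLW_SHIFT;
                next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);

                /* remember the slowest hop seen so far */
                if (!bw || next_bw <= bw) {
                        bw = next_bw;
                        if (limiting_dev)
                                *limiting_dev = dev;
                        if (speed)
                                *speed = next_speed;
                        if (width)
                                *width = next_width;
                }

                dev = pci_upstream_bridge(dev);
        }

        return bw;
}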
The GPU driver limits the speed of the device to the slowest link in its path. It doesn't make sense to run the link at a higher rate and burn more power if you are going to be limited by another link in the path.
There's only one link in the path here, between the CPU and GPU, and both ends are capable of 8GT/s, so nothing else should be limiting throughput.
Maybe pcie_bandwidth_available should be looking at the link capability rather than the link status for each link in the chain, to handle the case where a link is running below its maximum speed at boot. This usage in amdgpu may be a case that code wasn't designed for, though: amdgpu really wants to know the maximum bandwidth the link could be set to, not just what it currently is. Something like the sketch below is what I have in mind.
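Untested sketch (the function name is made up): the same walk as the loop above, but each hop is sized by the existing pcie_get_speed_cap()/pcie_get_width_cap() helpers, which look at LnkCap rather than LnkSta:

/* Hypothetical variant, untested: report the best bandwidth the path could
 * sustain, so a link that happens to be idling at 2.5GT/s at boot doesn't
 * hide the bandwidth it could retrain to. */
u32 pcie_bandwidth_capable_path(struct pci_dev *dev,
                                struct pci_dev **limiting_dev,
                                enum pci_bus_speed *speed,
                                enum pcie_link_width *width)
{
        enum pci_bus_speed next_speed;
        enum pcie_link_width next_width;
        u32 bw = 0, next_bw;

        while (dev) {
                /* capability (LnkCap) instead of current state (LnkSta) */
                next_speed = pcie_get_speed_cap(dev);
                next_width = pcie_get_width_cap(dev);
                next_bw = next_width * PCIE_SPEED2MBS_ENC(next_speed);

                /* still track the slowest hop in the path */
                if (!bw || next_bw <= bw) {
                        bw = next_bw;
                        if (limiting_dev)
                                *limiting_dev = dev;
                        if (speed)
                                *speed = next_speed;
                        if (width)
                                *width = next_width;
                }

                dev = pci_upstream_bridge(dev);
        }

        return bw;
}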
I'm not sure why the link ends up at 2.5GT/s on boot before the driver loads, though. Is that expected? There is a BIOS option for the PEG link speed; I tried Auto and Gen3, which seemed to behave the same. Maybe the GPU downclocks to 2.5GT/s on its own before the OS even loads? I'm not sure how that mechanism works.
My understanding is that links generally train at the maximum speed supported by both ends. I wonder if the Intel bridge port has some sort of power management feature that dynamically adjusts the link, similar to what the GPU does? In general, I've always found that the link comes up at the max speed when the endpoint comes out of reset.
So after experimenting with BIOS options a bit, it appears that the PEG ASPM option affects this behavior. If it's set to L0s, L1, or L0sL1, the link comes up at 2.5GT/s on boot. Oddly, even with L0s, which the card doesn't seem to support on its upstream PCIe port, it still comes up at 2.5GT/s. If it's set to Disabled or Auto, it comes up at 8GT/s.
Looking at some amdgpu code, it seems that some other chip generations have some (rather complex) code to enable ASPM support, but nv.c has a note that "The ASPM function is not fully enabled and verified on Navi yet. Temporarily skip this until ASPM enabled." Maybe some more driver-level support is required to enable ASPM properly? I'm not sure whether the weird link speed issue is expected behavior with ASPM enabled in the current state.
Minor detail, there are two links in this path: 00:01.0 - 01:00.0 (root port to switch) and 02:00.0 - 03:00.0 (switch to Navi). Per LnkCap, the best we can hope for is 8GT/s x8, limited by the rate of the root port and the width of the switch upstream port.
The first lspci (no override) LnkSta shows 2.5GT/s x8 on the upstream link and 16GT/s x16 on the downstream one, which lines up with the dmesg. The "available" comes from LnkSta, i.e., the current state of the link, and the "capable" part comes from LnkCap. I think the point of those messages was to identify cases like this where we're operating slower than we could.
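In case the bandwidth numbers in that message look odd: they're just the per-lane rate minus encoding overhead (8b/10b below 8 GT/s, 128b/130b at 8 GT/s and above), times the link width. A quick standalone illustration (not kernel code, but the same integer math as the PCIE_SPEED2MBS_ENC() macro, as I understand it):

#include <stdio.h>

/* Reproduce the dmesg arithmetic: rate * encoding efficiency * width,
 * in integer Mb/s.  gts_x10 is the signaling rate in tenths of GT/s. */
static unsigned int link_mbps(unsigned int gts_x10, unsigned int width)
{
        unsigned int per_lane = gts_x10 * 100;  /* raw Mb/s per lane */

        if (gts_x10 >= 80)
                per_lane = per_lane * 128 / 130;
        else
                per_lane = per_lane * 8 / 10;

        return per_lane * width;
}

int main(void)
{
        printf("%u Mb/s\n", link_mbps(25, 8));   /* 16000  -> "16.000 Gb/s available"   */
        printf("%u Mb/s\n", link_mbps(160, 8));  /* 126024 -> "capable of 126.024 Gb/s" */
        printf("%u Mb/s\n", link_mbps(160, 16)); /* 252048 -> "capable of 252.048 Gb/s" */
        return 0;
}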
I don't know why the upstream link comes up at 2.5GT/s x8 instead of 8GT/s x8. The PCI core doesn't do anything to influence that. We just assume the link trains at the highest rate/width supported by both ends, which I think is what the spec envisions.
The PCI core also does nothing to limit the downstream link to 4GT/s x16 or similar for power savings. Maybe it should, I dunno? But not relevant to this problem.
Just wanted to add that this isn't limited to this GPU. I'm getting the same thing on my RX 570 (as part of an eGPU), where Linux refuses to run it above Gen 1 speeds, whereas Windows has zero issues running it at full speed on the same hardware.
Note the "(downgraded)". This is on an XPS 13 Plus running Fedora 37 with kernel 6.1.13. I've tried adding amdgpu.pcie_gen_cap=0x00040000 to the kernel parameters and checked that it was set successfully in /proc/cmdline, but it still doesn't reach full speed, which leads to really significant performance issues in games such as Cyberpunk 2077.
Some more posts that have referenced this issue now:
The problem is the Thunderbolt link. The Thunderbolt bridge comes up in 2.5GT/s mode. The GPU driver queries the upstream bridges and limits its speed to the current max speed of the upstream bridge it's connected to. There is no reason to run the link clocks faster if you are limited by a bridge above you. If the Thunderbolt bridge can change its link speed at runtime, we probably need to rework the core PCI code to take that into account when reporting upstream bridge link speeds.
That RX 6600 GPU is not connected via Thunderbolt...
Is it normal that we cannot get the correct link speed and link width from the DPM table (pp_dpm_pcie)?
Oh, it looks like a porting mistake. SMUv13 sets up the pcie tables in smu_v13_0_0_set_default_dpm_table, and I had assumed sienna_cichlid_set_default_dpm_table did as well. I need to restore some of the logic dropped by that patch; I'll follow up with it later.