The application works fine for a time, but eventually it will freeze and this gets printed to the terminal:
amdgpu: The CS has been rejected, see dmesg for more information (-2).
amdgpu: The CS has been rejected, see dmesg for more information (-19).
(attaching dmesg)
At this point, I have to kill the application, and reboot if I want to use the GPU again.
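For anyone reading along: the number in parentheses at the end of those messages looks like a negated Linux errno value. That is an assumption based on the usual kernel convention, not something the message itself states. Here is a minimal sketch for decoding it. The table is hand-written with the generic Linux values for the codes that show up in this thread, and `decode_cs_error` is just a hypothetical helper name.

```python
import re

# Generic Linux errno values for the codes seen in this thread.
# Assumption: the value printed in parentheses is a negated errno.
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    16: ("EBUSY", "Device or resource busy"),
    19: ("ENODEV", "No such device"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches a trailing "(-N)" with an optional final period.
CODE_RE = re.compile(r"\((-\d+)\)\.?\s*$")


def decode_cs_error(line):
    """Return (errno_name, description) for a trailing '(-N)' code, or None."""
    match = CODE_RE.search(line)
    if match is None:
        return None
    code = -int(match.group(1))
    return LINUX_ERRNO.get(code, ("UNKNOWN", "code not in the local table"))


if __name__ == "__main__":
    for msg in (
        "amdgpu: The CS has been rejected, see dmesg for more information (-2).",
        "amdgpu: The CS has been rejected, see dmesg for more information (-19).",
    ):
        print(msg, "->", decode_cs_error(msg))
```

If that reading is right, the two messages above would correspond to ENOENT and ENODEV, which would fit the GPU no longer being usable until reboot.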
This seems to happen mainly when alt-tabbing between Godot and the desktop or terminal (both of which run on the Intel HD 630 IGP), so it might be an issue with context switching?
I don't have precise steps to reproduce yet apart from using Godot (debug build from git master branch) and other applications in parallel, to eventually see it crash within 5-10 min.
I think the bug started to happen when I upgraded to kernel 5.2.x (now running 5.3.2, still having the bug). That's what bug 111860 claims too, so I'll attempt running 5.1.20 for a while to see if the bug still happens.
Hey, I noticed a lot of 'amdgpu 0000:01:00.0: GPU pci config reset' there. Since I see no command submission timeout errors, it looks like you manually tried to reset the GPU multiple times, and one of those resets failed, after which the errors you described appeared. Is this correct?
I don't reset the GPU manually, no. I'm not sure why this happens, but I've had such output in dmesg as far as I can remember (since I got this laptop in March).
For reference, I've been using kernel 5.1.20 and have not experienced this crash. I'm not sure yet whether that is enough to call it a regression, though; I will test more in the coming days.
What happens if you disable GPU reset by booting the kernel with `amdgpu.gpu_recovery=0`?
Good point, I forgot to mention that I added `amdgpu.dc=0 amdgpu.gpu_recovery=1` in an attempt to work around this issue just before reproducing it again. So I can confirm that I could reproduce this issue both without any amdgpu kernel parameters and with the above two.
I now did some more testing with kernel 5.3.2 and `amdgpu.gpu_recovery=0` (removing the `amdgpu.dc=0` too). Initially I could not trigger the bug, but I got it when letting the desktop environment (KDE) trigger its screensaver while Godot was running on the AMD GPU. Once I resumed from the screensaver, the GPU crashed (note: I did trigger suspend-to-RAM, the laptop was still powered).
The dmesg output is attached.
To compare, I did another test with kernel 5.1.20 (using `amdgpu.dc=0 amdgpu.gpu_recovery=1`), letting it go to sleep with Godot running on the AMD GPU, and it resumed without crashing. I also attach the dmesg output for comparison.
Hey, I noticed a lot of 'amdgpu 0000:01:00.0: GPU pci config reset' there.
These actually happen every time I change the focus between an application running on the AMD GPU (with `DRI_PRIME=1`) and another application (e.g. desktop environment, firefox, terminal) running on the Intel HD 630 IGP (`DRI_PRIME=0`, default).
So I guess the problem only happens when you run in DRI PRIME mode, with different apps rendering on different GPUs?
Just ran into a similar issue after attempting to suspend my laptop, a ThinkPad t495s, running stock Ubuntu 19.10. There are a few possibly related amdgpu messages in syslog (see below). This may be truncated since the laptop crashed and had to be hard-reset using the power button.
Here's the smoking gun from syslog (slightly edited; full context below):
Feb 3 08:32:36 t495s kernel: [18032.388704] [drm] Fence fallback timer expired on ring sdma0
Feb 3 08:32:36 t495s kernel: [18032.933051] amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Feb 3 08:32:36 t495s kernel: [18032.933138] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: amdgpu: The CS has been rejected, see dmesg for more information (-16).
Full syslog excerpt:
Feb 3 08:32:21 t495s NetworkManager[1296]: <info> [1580715141.6207] manager: sleep: sleep requested (sleeping: no enabled: yes)
Feb 3 08:32:21 t495s NetworkManager[1296]: <info> [1580715141.6208] device (enp3s0f0): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Feb 3 08:32:21 t495s gsd-media-keys[2726]: Unable to get default sink
Feb 3 08:32:21 t495s NetworkManager[1296]: <info> [1580715141.6651] device (p2p-dev-wlp1s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Feb 3 08:32:21 t495s NetworkManager[1296]: <info> [1580715141.6658] manager: NetworkManager state is now ASLEEP
Feb 3 08:32:21 t495s whoopsie[2190]: [08:32:21] offline
Feb 3 08:32:21 t495s systemd[1]: Reached target Sleep.
Feb 3 08:32:21 t495s systemd[1]: Starting Suspend...
Feb 3 08:32:21 t495s kernel: [18018.308292] PM: suspend entry (deep)
Feb 3 08:32:21 t495s systemd-sleep[20721]: Suspending system...
Feb 3 08:32:36 t495s kernel: [18018.413888] Filesystems sync: 0.105 seconds
Feb 3 08:32:36 t495s kernel: [18018.414743] Freezing user space processes ... (elapsed 8.259 seconds) done.
Feb 3 08:32:36 t495s kernel: [18026.673873] OOM killer disabled.
Feb 3 08:32:36 t495s kernel: [18026.673874] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Feb 3 08:32:36 t495s kernel: [18026.675734] printk: Suspending console(s) (use no_console_suspend to debug)
Feb 3 08:32:36 t495s kernel: [18026.676107] wlp1s0: deauthenticating from 44:4e:6d:18:0c:39 by local choice (Reason: 3=DEAUTH_LEAVING)
Feb 3 08:32:36 t495s kernel: [18026.676149] thinkpad_acpi: acpi_evalf(GTRW, dd, ...) failed: AE_NOT_FOUND
Feb 3 08:32:36 t495s kernel: [18026.676150] thinkpad_acpi: Cannot read adaptive keyboard mode.
Feb 3 08:32:36 t495s kernel: [18030.792394] kfd2kgd: cp queue preemption time out.
Feb 3 08:32:36 t495s kernel: [18031.037927] ACPI: EC: interrupt blocked
Feb 3 08:32:36 t495s kernel: [18031.084852] ACPI: EC: interrupt unblocked
Feb 3 08:32:36 t495s kernel: [18031.172893] PM: noirq suspend of devices failed
Feb 3 08:32:36 t495s kernel: [18031.172922] pcieport 0000:00:08.1: PME: Spurious native interrupt!
Feb 3 08:32:36 t495s kernel: [18031.183743] iwlwifi 0000:01:00.0: Applying debug destination EXTERNAL_DRAM
Feb 3 08:32:36 t495s kernel: [18031.194112] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
Feb 3 08:32:36 t495s kernel: [18031.194155] [drm] PSP is resuming...
Feb 3 08:32:36 t495s kernel: [18031.214057] [drm] reserve 0x400000 from 0xf400c00000 for PSP TMR
Feb 3 08:32:36 t495s kernel: [18031.224391] [drm] psp command failed and response status is (-65529)
Feb 3 08:32:36 t495s kernel: [18031.302016] iwlwifi 0000:01:00.0: Applying debug destination EXTERNAL_DRAM
Feb 3 08:32:36 t495s kernel: [18031.370220] iwlwifi 0000:01:00.0: FW already configured (0) - re-configuring
Feb 3 08:32:36 t495s kernel: [18031.391940] nvme nvme0: Shutdown timeout set to 8 seconds
Feb 3 08:32:36 t495s kernel: [18031.411371] nvme nvme0: 16/0/0 default/read/poll queues
Feb 3 08:32:36 t495s kernel: [18031.727161] amdgpu: [powerplay] dpm has been enabled
Feb 3 08:32:36 t495s kernel: [18031.910883] [drm] VCN decode and encode initialized successfully(under DPG Mode).
Feb 3 08:32:36 t495s kernel: [18031.910902] amdgpu 0000:05:00.0: ring gfx uses VM inv eng 0 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910904] amdgpu 0000:05:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910906] amdgpu 0000:05:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910908] amdgpu 0000:05:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910910] amdgpu 0000:05:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910912] amdgpu 0000:05:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910914] amdgpu 0000:05:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910916] amdgpu 0000:05:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910918] amdgpu 0000:05:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910920] amdgpu 0000:05:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 3 08:32:36 t495s kernel: [18031.910921] amdgpu 0000:05:00.0: ring sdma0 uses VM inv eng 0 on hub 1
Feb 3 08:32:36 t495s kernel: [18031.910923] amdgpu 0000:05:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 3 08:32:36 t495s kernel: [18031.910925] amdgpu 0000:05:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 3 08:32:36 t495s kernel: [18031.910927] amdgpu 0000:05:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 3 08:32:36 t495s kernel: [18031.910928] amdgpu 0000:05:00.0: ring vcn_jpeg uses VM inv eng 6 on hub 1
Feb 3 08:32:36 t495s kernel: [18032.388704] [drm] Fence fallback timer expired on ring sdma0
Feb 3 08:32:36 t495s kernel: [18032.933051] amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Feb 3 08:32:36 t495s kernel: [18032.933138] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
Feb 3 08:32:36 t495s kernel: [18032.936651] thinkpad_acpi: acpi_evalf(STRW, vd, ...) failed: AE_NOT_FOUND
Feb 3 08:32:36 t495s kernel: [18032.936653] thinkpad_acpi: Cannot set adaptive keyboard mode.
Feb 3 08:32:36 t495s kernel: [18032.948268] acpi LNXPOWER:00: Turning OFF
Feb 3 08:32:36 t495s kernel: [18032.948293] OOM killer enabled.
Feb 3 08:32:36 t495s kernel: [18032.948294] Restarting tasks ... done.
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: amdgpu: The CS has been rejected, see dmesg for more information (-16).
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: message repeated 2 times: [ amdgpu: The CS has been rejected, see dmesg for more information (-16).]
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: (II) AMDGPU(0): EDID vendor "ENC", prod id 10513
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: (II) AMDGPU(0): Using hsync ranges from config file
Feb 3 08:32:36 t495s /usr/lib/gdm3/gdm-x-session[2250]: (II) AMDGPU(0): Using vrefresh ranges from config file
Feb 3 08:32:36 t495s kernel: [18033.007339] PM: suspend exit
Feb 3 08:32:36 t495s kernel: [18033.007487] PM: suspend entry (s2idle)
As ever, please let me know if I can assist in debugging the issue.
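In case it helps others compare their own logs, this is roughly how the GPU-related failure lines can be pulled out of a long syslog excerpt like the one above. It is only a sketch: the patterns in `ERROR_RE` are my own guess at what is relevant based on the messages in this report, not an official list of amdgpu error strings, and `gpu_error_lines` is a hypothetical helper name.

```python
import re

# Patterns chosen from the messages seen in this report (an assumption,
# not an exhaustive or official list of amdgpu failure strings).
ERROR_RE = re.compile(
    r"\*ERROR\*|CS has been rejected|timer expired|psp command failed"
)


def gpu_error_lines(log_text):
    """Return log lines that mention amdgpu/drm and match a failure pattern."""
    return [
        line
        for line in log_text.splitlines()
        if ("amdgpu" in line or "[drm" in line) and ERROR_RE.search(line)
    ]


SAMPLE = """\
Feb 3 08:32:36 t495s kernel: [18031.194155] [drm] PSP is resuming...
Feb 3 08:32:36 t495s kernel: [18032.388704] [drm] Fence fallback timer expired on ring sdma0
Feb 3 08:32:36 t495s kernel: [18032.933051] amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Feb 3 08:32:36 t495s kernel: [18032.948293] OOM killer enabled.
"""

if __name__ == "__main__":
    for line in gpu_error_lines(SAMPLE):
        print(line)
```

Run against a saved log, e.g. `gpu_error_lines(open("syslog.txt").read())` (the file name is a placeholder), it should narrow things down to the handful of lines worth attaching to a report.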
This issue hasn't had any activity since 2020-02-03. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.