I'm opening this topic to track ring_gfx-related bugs separately, since we should keep https://bugs.freedesktop.org/show_bug.cgi?id=111481 for sdma0/1-type freezes; those are the ones that seem to cause random "out of the blue" hangs on the desktop.
There is another type of freeze/hang happening when playing Starcraft II via D9VK. This one doesn't seem to be related to either ngg or dma, because I have both disabled with AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs occur anyway, in exactly the same place every time.
Error logs:
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] ERROR Waiting for fences timed out or interrupted!
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=2361623, emitted seq=2361625
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236
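For anyone collecting these reports, here is a minimal sketch of pulling the ring name, the sequence numbers and the offending process out of kernel log lines like the ones above. The function name and the regular expressions are my own, written against the messages quoted in this thread; they are not part of amdgpu or any tool. The "pending" value is simply emitted minus signaled, which presumably corresponds to the submissions that never completed.

```python
import re

# Patterns written against the amdgpu_job_timedout messages quoted in this thread.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_INFO = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def summarize_timeouts(lines):
    """Return one dict per ring timeout, attaching the process line that follows it."""
    events = []
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            events.append({
                "ring": m["ring"],
                "signaled": signaled,
                "emitted": emitted,
                "pending": emitted - signaled,  # submissions not yet signaled
                "process": None,
            })
            continue
        m = PROCESS_INFO.search(line)
        # Attach the process name to the most recent timeout that lacks one.
        if m and events and events[-1]["process"] is None:
            events[-1]["process"] = m["process"]
    return events


if __name__ == "__main__":
    sample = [
        "[drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, "
        "signaled seq=2361623, emitted seq=2361625",
        "[drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: "
        "process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236",
    ]
    for event in summarize_timeouts(sample):
        print(event)
```

Feeding it the output of `journalctl -k` or the contents of kern.log line by line should make it easier to see whether a given hang is a gfx ring timeout or an sdma one.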
I will try to provide RenderDoc trace files for the issues described. They also happen in native Vulkan games such as Rise of the Tomb Raider. I will provide as much info as possible.
Not sure if this will help anyone else, but I found a workaround in my case with DOOM. I was having the same crashes Marko described with Starcraft II, so I tried the following:
In Steam, I disabled the In Game Steam Overlay
I switched the Graphics API from OpenGL to Vulkan
I have not had any crashes so far, but I haven't tried to isolate which of the two changes made the difference.
I am seeing a similar hang in Starcraft II. Unlike Marko, I am not using d9vk; instead, I'm using wine-nine. The hang doesn't happen in all games but seems to be particularly frequent in the coop mission "dead of night".
For my particular case at least, AMD_DEBUG=nodma seems to fix it.
(In reply to Marko Popovic from comment 0)
> There is another type of freeze/hang happening when playing Starcraft II via
> D9VK. This one doesn't seem to be related to either ngg or dma because I
> have them both disabled by AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs
> occur anyway, on exactly the same place every time.
ring_gfx hangs are considerably more reproducible and are not affected by AMD_DEBUG=nodma or AMD_DEBUG=nongg, as I already mentioned in the bug description.
Sorry, but this is incorrect. My Minecraft hang is most definitely a ring gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if you'd like to check for yourself.
I can't explain why nodma isn't working for you; perhaps it isn't taking effect for the game? Have you tried putting it in /etc/environment so it's system-wide? I don't know what to tell you regarding nodma, but my hang is definitely ring gfx as well.
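If the concern is whether the variable actually reaches the game, one option is to set it per launch instead of system-wide. Below is a minimal sketch, assuming AMD_DEBUG accepts a comma-separated list (the form AMD_DEBUG="nongg,nodma" is used elsewhere in this thread). The helper name `with_debug_flags` is hypothetical and the command to launch is whatever you pass on the command line.

```python
import os
import subprocess
import sys


def with_debug_flags(base_env, flags, var="AMD_DEBUG"):
    """Return a copy of base_env with flags merged into a comma-separated variable."""
    env = dict(base_env)
    # Keep whatever is already set, dropping empty entries.
    current = [f for f in env.get(var, "").split(",") if f]
    for flag in flags:
        if flag not in current:
            current.append(flag)
    env[var] = ",".join(current)
    return env


if __name__ == "__main__":
    # Usage (hypothetical script name): python launch.py <command> [args...]
    env = with_debug_flags(os.environ, ["nodma", "nongg"])
    if len(sys.argv) > 1:
        # The child process inherits the merged environment.
        subprocess.run(sys.argv[1:], env=env, check=False)
```

Merging into one variable matters because setting AMD_DEBUG twice in the same environment just overwrites the first value with the second.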
I guess we just have many different types of hangs then... ring_gfx hangs seem more mysterious than sdma0/1 hangs, since there is no "universal" workaround for them. For me, nodma stops the global sdma-type hangs and nongg stops the citra-related ring_gfx hang, but neither variable stops the Starcraft II and RoTR ring_gfx-type hangs, so it's really confusing.
This might actually fix the ring_gfx-type hangs, or even the sdma ones, at least for the Vulkan API? I'm not exactly sure, but I will also be testing the latest Mesa builds from Oibaf's PPA in the following days and will report back on the issue :)
Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing Trackmania 2.
Oh yes I forgot to add a reply here. It didn't solve any of the hangs for me either.
RX 5700 XT, Pop OS 19.10, latest Oibaf Mesa, not sure which LLVM version.
Anomaly 1.5.0 update 3 standalone 64 bit mod for S.T.A.L.K.E.R. Call of Pripyat running under wine d3dx11_43->dxvk (winetricks dxvk d3dcompiler_43 d3dx11_43)
Oct 30 02:49:30 pop-os kernel: [ 4864.627343] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] ERROR Waiting for fences timed out!
Oct 30 02:49:30 pop-os kernel: [ 4869.231450] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=2626284, emitted seq=2626286
Oct 30 02:49:30 pop-os kernel: [ 4869.231486] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process AnomalyDX11.exe pid 5791 thread AnomalyDX11.exe pid 5791
Oct 30 02:49:30 pop-os kernel: [ 4869.231487] [drm] GPU recovery disabled.
Happens at random. Sometimes it hangs straight away, sometimes it can go over an hour without a crash. It is a complete crash, with no option available besides a hard reset. Not even the mouse pointer would move (as with sdma0 hang).
I'm sorry if it's not the right place to report this, I'm somewhat new to all of this.
Also, I'm not sure how up to date the Oibaf repo is, but ACO recently landed in Mesa git for Navi cards. If your Mesa is new enough, you can try setting the RADV_PERFTEST=aco environment variable; you might have better luck with the hangs.
Thank you so very much. There's no way to be sure, since the hangs seemed to happen at random, but I think I would normally have experienced at least 2 or 3 hangs in the time I've tested it, and it has been a smooth ride so far. No performance impact either, though running this game the way I do, I'm supposedly putting most of the calculations on the CPU, not the GPU.
It happened again. This time without a game or anything running, barely logged in and opened a program and boom.
Nov 2 12:42:07 pop-os kernel: [ 1675.883513] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] ERROR Waiting for fences timed out!
Nov 2 12:42:07 pop-os kernel: [ 1680.747513] [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=2714, emitted seq=2716
Nov 2 12:42:07 pop-os kernel: [ 1680.747549] [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process Xorg pid 2293 thread Xorg:cs0 pid 2294
Nov 2 12:42:07 pop-os kernel: [ 1680.747551] [drm] GPU recovery disabled.
Only cursor moved, no clicks registered, restart achieved with REISUB.
I tried registering at https://gitlab.freedesktop.org/mesa/mesa/issues but I'm getting no account confirmation mail so can't post it there.
Perhaps this needs a separate entry, but it's related (it didn't happen before I tried RADV_PERFTEST=aco and AMD_DEBUG="nongg,nodma"), so I'll post it in case someone else has had the same issue.
After some time in Witcher 3 GOTY, run through Lutris, the PC restarts on its own. I thought something was overheating (I've noticed the graphics card memory in PSensor sometimes reaching 90, so I thought that might be the cause), but I checked kern.log and this always appeared before the spontaneous reset:
Nov 2 22:01:53 pop-os kernel: [ 979.244964] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Nov 2 22:01:53 pop-os kernel: [ 979.244967] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER: device [1987:5012] error status/mask=00001000/00006000
Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER: [12] Timeout
Nov 2 22:01:53 pop-os kernel: [ 979.262629] Emergency Sync complete
A solution I found is to add pci=nommconf in /etc/default/grub to the line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" (so it looks like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf").
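In case it helps anyone apply the same change without editing by hand, here is a minimal sketch that appends a kernel parameter to that line only if it is not already there. The function name `add_kernel_param` is my own, and the sketch only transforms text; it assumes you read /etc/default/grub yourself, write the result back as root, and keep a backup.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Append param to the quoted value of key unless it is already present."""
    pattern = re.compile(rf'^({re.escape(key)}=")([^"]*)(")', re.MULTILINE)

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        # Rebuild the line with the original key and quotes.
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(repl, grub_text)


if __name__ == "__main__":
    original = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    print(add_kernel_param(original, "pci=nommconf"), end="")
```

On GRUB-based Debian/Ubuntu-style systems the boot configuration usually has to be regenerated afterwards (for example with `sudo update-grub`) and the machine rebooted before the parameter takes effect.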