AMD Navi GPU frequent freezes on both Manjaro/Ubuntu with kernel 5.3 and mesa 19.2 -git/llvm9

Marko Popovic said:

Adding error log from Manjaro:
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma0 timeout, signaled seq=1742, emitted seq=1743
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process gnome-shell pid 975 thread gnome-shell:cs0 pid 988
avg 23 16:05:37 Marko-PC kernel: [drm] GPU recovery disabled.

Pretty much the same type of error happens in different situations, and very often at random while using the desktop. Of these 2 logs, one is from an OpenGL launch in the Citra emulator, which is reproducible every time, and the second one, from Manjaro, is from browsing in GNOME Shell, where it crashes without any clear trigger.
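
Side note for anyone collecting these logs: a minimal sketch (Python, not part of the original report) for pulling the ring name and sequence numbers out of the `amdgpu_job_timedout` lines quoted above. The function name `parse_ring_timeout` is made up for illustration, and reading `emitted - signaled` as "jobs submitted but not yet completed" is my interpretation of the message, not something stated in the thread.

```python
import re

# Matches the amdgpu_job_timedout message format quoted in this thread.
# The asterisks around ERROR are optional because both forms appear here.
RING_TIMEOUT = re.compile(
    r"\*?ERROR\*? ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    # Assumption: the gap is the number of jobs that were submitted but never finished.
    return match.group("ring"), signaled, emitted, emitted - signaled


line = (
    "avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "ERROR ring sdma0 timeout, signaled seq=1742, emitted seq=1743"
)
print(parse_ring_timeout(line))
```

Running it over a saved journal makes it easy to see whether the hangs are always on the same ring (sdma0 vs. gfx) across crashes.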

Mathieu Belanger said:

I confirm that I have this bug or a very similar one.

For some reason, it happens most when I'm using my IDE (IntelliJ-based).
It happens most often while I type code, and the crash occurs when the IDE is about to propose a code completion.

I get one to two crashes a day.

Video card is RX5700
CPU is Ryzen R7-2700X

Software tested: LLVM 9 from git;
libdrm, Mesa and the DDX updated from git very frequently.

The bug has been there since I got the card, about 3 weeks ago.

Matthias Mueller @Termy said:

I don't know if I'm encountering the same bug, but it is at least similar.
I don't get hard freezes/lockups, but I get a strange "stuttering", as if the whole OS halted for a few seconds, then continued for a few seconds... and the halted periods grew while the "usable seconds" quickly got shorter, to the point of unusability.

It doesn't happen at regular intervals (anywhere between 30 and 120 minutes, it seems) and I haven't yet made out a direct cause, but in journalctl the same messages seem to appear every time it begins:

kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu 0000:0f:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)
kernel: amdgpu 0000:0f:00.0: at page 0x0000600000fd6000 from 18
kernel: amdgpu 0000:0f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00041152

after that there are a lot of these:

kernel: amdgpu: [powerplay] Failed to send message 0x40, response 0xffffffc2 param 0x2
kernel: amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80

until shutdown/hardreset.

Some observations that might help to narrow it down:
The first time it occurred, I had to do a few reboots that showed this behaviour right after startup until it finally worked again, for about 45 minutes.
As it didn't work again after around 10 reboots, I tried uninstalling corectrl (which I used for a custom fan curve), and it finally booted normally again!
I then installed radeon-profile to have fan control (I don't want the fans to stand still on the desktop, as the card gets over 80 °C hot before the fans kick in...).
The issue still occurs with radeon-profile, but at least every reboot runs fine...
Another thing I noticed is that after the first "freeze" with radeon-profile, lm_sensors stopped reporting the fan speed for the card; it always stays at zero.

So maybe it is related to fan-control or the sysfs interface in general?
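
To check the fan reading without going through lm_sensors, here is a rough sketch (Python, added for illustration) that reads the fan speed straight from the hwmon sysfs files. It assumes the card's hwmon device reports the name `amdgpu` and exposes the fan speed in RPM as `fan1_input`; the function name `amdgpu_fan_rpm` and the `hwmon_root` parameter are hypothetical, the latter added only so the lookup can be pointed at a different directory.

```python
from pathlib import Path


def amdgpu_fan_rpm(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon directory name: fan RPM or None} for each amdgpu hwmon device."""
    readings = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        try:
            name = (dev / "name").read_text().strip()
        except OSError:
            continue  # no readable name file, so not a device we can identify
        if name != "amdgpu":
            continue  # skip CPU sensors, NVMe sensors, and so on
        try:
            readings[dev.name] = int((dev / "fan1_input").read_text().strip())
        except (OSError, ValueError):
            # The file is missing or the read failed: record the reading as unknown.
            readings[dev.name] = None
    return readings
```

Calling `amdgpu_fan_rpm()` before and after a "freeze" would show whether the kernel itself reports zero (or fails the read), or whether only lm_sensors loses the value.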

Matthias Mueller @Termy said:

Forgot to mention: running Manjaro 5.3rc6.d0826.ga55aa89-1, mesa-git 1:19.3.0_devel.114849.0142dcb990e-1 and llvm-libs-git 10.0.0_r325376.70e158e09e9-1
And if it matters: firmware from https://aur.archlinux.org/packages/linux-firmware-agd5f-radeon-navi10/ v2019.08.26.14.36-1

Mathieu Belanger said:

It probably really depends on what we do on our desktops. I just remembered that I stopped using FileZilla after I got this GPU, as it was crashing almost every time I used it (I never got through a session without a crash while it was open and running). I still use it for work, but I keep it to a minimum (open, upload, close) instead of keeping it running.

Alexandr Kara @kara said:

Might be related to https://bugs.freedesktop.org/show_bug.cgi?id=111269. I also get the "ring gfx_0.0.0 timeout" error (but not the "ring sdma0 timeout" error).

Using LLVM from git + Mesa 19.2.0-rc1 on Fedora 30 with kernel from Fedora 31 (5.3.0-0.rc5.git0.1.fc31.x86_64). GPU AMD Radeon RX 5700 XT, CPU AMD Ryzen 7 1700, 32 GB RAM (EDD).

Mathieu Belanger submitted a patch:

OK, I looked at the recent kernel patches and commits, and they seem to have fixed a couple of bugs. I do not know if they include these, but I have not crashed once since I merged that code into kernel 5.3-rc6 (that code is staged for the 5.4 merge window).

I attached the patch so you can merge it if you wish to try. It adds all the latest bits for AMDGPU to 5.3-rc6, including Renoir support.

Patch 145225, "Merge last adg5f code":
merge_last_amdgpu-for-5.3-rc6.patch

Marko Popovic said:

(In reply to Mathieu Belanger from comment 7)

> Created attachment 145225
> Merge last adg5f code
>
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
>
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

How do I merge the patch myself? :) I'd like to try it

Matthias Mueller @Termy said:

On my side, I can report that the issue does not occur if I don't use a tool to modify the fans. Does any of you use something like that, or are these separate issues?

Marko Popovic said:

(In reply to Matthias Müller from comment 9)

> On my side i can report that the issue does not occur if i don't use a tool
> to modify the FANs - does anyone of you use something of the like or are
> this seperate issues?

I don't use any tools, all is stock.

(In reply to Mathieu Belanger from comment 7)
> Created attachment 145225
> Merge last adg5f code
>
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
>
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

After applying the patch, the same type of error occurs. Luckily it is very easy to reproduce with the Citra emulator; apparently it does something that AMD's driver really doesn't like, which makes the error more likely to occur. On my end, the error also seems more likely when the CPU is under heavy I/O load.

Last log after applying the latest patch from the merge posted in the attachment:
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.

It would be very nice to get any official AMD response, to at least make sure that we're being listened to.

Marko Popovic said:

The same bug is also reproducible when launching the native version of Rocket League.

Here are the logs:
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process pid 0 thread pid 0)
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: in page starting at address 0x0000000000fff000 from client 27
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00001B3C
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: MORE_FAULTS: 0x0
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: WALKER_ERROR: 0x6
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: PERMISSION_FAULTS: 0x3
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: MAPPING_ERROR: 0x1
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: RW: 0x0
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma0 timeout, signaled seq=7198, emitted seq=7200
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process RocketLeague pid 3035 thread RocketLeag:cs0 pid 3042
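
For readers wondering how the MORE_FAULTS/WALKER_ERROR/... lines relate to the raw status word: a small sketch (Python, added for illustration) that splits `GCVM_L2_PROTECTION_FAULT_STATUS` into bit fields. The bit positions in `FIELDS` are an assumption on my part, not taken from this thread; they are only cross-checked against the values the kernel printed in the log above.

```python
# Assumed bit layout as (shift, width); verified here only against the
# decoded values printed in the log for status 0x00001B3C.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a fault status word into the named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00001B3C)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

With this layout the result matches the logged lines (MORE_FAULTS 0x0, WALKER_ERROR 0x6, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1, RW 0x0), so the same function can be used on status words from logs that do not include the decoded fields.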

Mathieu Belanger said:

I did not crash and have a > 24h uptime.

I could not test Citra as I don't have a 3DS and the ROMs I found are encrypted.

I could not test Rocket League, as it would require me to pay for a game I will not play.

I will continue to test later today.

Mathieu Belanger said:

(In reply to Marko Popovic from comment 10)

> (In reply to Matthias Müller from comment 9)
> > On my side i can report that the issue does not occur if i don't use a tool
> > to modify the FANs - does anyone of you use something of the like or are
> > this seperate issues?
>
> I don't use any tools, all is stock.
>
> (In reply to Mathieu Belanger from comment 7)
> > Created attachment 145225
> > Merge last adg5f code
> >
> > Ok, I did look at the recent kernel patch and commit and they seam to have
> > fixed a couple bugs. I do not know it it include these but I did not crash
> > one time since I merged that into the kernel 5.3-rc6. (that code is staged
> > for 5.4 merge window).
> >
> > I did attach the patch so you can merge that if you wish to try. It add all
> > the latest bits for AMDGPU into 5.3-rc6, including Renoir support.
>
> After applying the patch, same type of error occurs, luckily very easy to
> reproduce with Citra emulator, apparently it does something that AMD's
> driver really doesn't like and makes chances higher for error to occur. Also
> when CPU is under heavy I/O load error seems more likely to occur as well on
> my end.
>
> Last log after applying the latest patch from the merge posted in the
> attachment:
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]]
> *ERROR* Waiting for fences timed out!
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
> sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.
>
> If we could get any official AMD responses to at least make sure that we're
> at least being listened to would be very nice.

I was able to reproduce that Citra crash.
I followed the instructions, and it crashed instantly after choosing Continue (or a fraction of a second after: the music lagged a little, then the complete system crashed; I was able to sync/umount/reboot with the magic SysRq keys).

Is your crash at exactly the same place? If so, then it's very reproducible, and it might be a good idea to run an OpenGL trace to see which commands were sent last to provoke the crash.

I am not familiar with the Ubuntu stuff: were these compiled on your system? If not, do you know the build dates of your Mesa, libdrm and xf86-video-amdgpu (X11 DDX)?

Also, can you tell what dates your microcode files have?

Libdrm : 07:49:10 PM 08/27/2019
Mesa : 05:37:07 PM 08/30/2019
Xorg amdgpu DDX : 07:55:17 PM 08/27/2019

The microcode files were not available in my distribution when I installed them. I downloaded and installed them on August 6, but I think they were from around July 15. I remember that the latest microcode at that time was crashing with a black screen on module load, which is why I installed an older version.

Marko Popovic said:

(In reply to Mathieu Belanger from comment 13)

> I was able to reproduce that Citra crash.
> Followed the instruction, it did crash instantly after choosing continue (or
> a fraction of a second after, the music lagged a lil and complete system
> crash (was able so sync/umount/reboot with the magics key)).
>
> Is your crash exactly at the same place? If so then it's very reproducible
> and it might be a good idea to run a opengl trace to see what commands was
> sent last to provoke the crash.
>
> I am not familiar with the Ubuntu stuff, is these got compiled on your
> system? if no do you know the build date of your Mesa, libdrm and
> xf86-video-amdgpu (x11 ddx).
>
> Also can you tell what microcode files dates you do have?
>
> Libdrm : 07:49:10 PM 08/27/2019
> Mesa : 05:37:07 PM 08/30/2019
> Xorg amdgpu DDX : 07:55:17 PM 08/27/2019
>
> The microcode files where not available on my distribution when I installed
> them. I did download/install them on August 6 but they where from July 15
> ish I think, I remember that the latest microcode at that time where
> crashing with a black screen on module load and that's why I did install an
> older version.

Yes, it always happens at the same place with the Citra emulator. However, what bothers me more about the bug is that it sometimes happens completely at random on my system, without any obvious trigger, while just browsing and using my desktop. So it's not exclusive to Citra, but luckily I've found the Citra method to provoke the bug, so we can do more detailed logging.

Further observations:
- The bug is the same type as the other crashes and is not exclusive to the Citra emulator; it also happens when launching Rocket League, and sometimes randomly while using the desktop
- The same type of crash IS NOT reproducible on Windows on the same GPU
- The same type of bug IS NOT reproducible on my Intel HD laptop with the same versions of Mesa/LLVM, which probably means either a faulty AMD kernel driver or faulty firmware binaries.

My versions are:
MESA: Mesa 19.3.0-devel (git-6775a52 2019-09-02 eoan-oibaf-ppa)
Kernel: Ubuntu mainline 5.3 daily build (I ALSO tried amd-drm-next-5.4; the same bug is reproducible)
Firmware binaries: 2019-08-26 from /~agd5f/radeon_ucode/navi10

Pierre-Eric Pelloux-Prayer @pepp said:

(In reply to Marko Popovic from comment 14)

> Yes, always happens at the same place with Citra emulator

Could you capture a trace of the problem (using Apitrace or Renderdoc)?

This would be very helpful to fix it.

Marko Popovic uploaded an attachment:

Attachment 145232, "APITrace log from Citra crash":
citra-qt.1.trace

Marko Popovic said:

(In reply to Pierre-Eric Pelloux-Prayer from comment 15)

> (In reply to Marko Popovic from comment 14)
> > Yes, always happens at the same place with Citra emulator
>
> Could you capture a trace of the problem (using Apitrace or Renderdoc)?
>
> This would be very helpful to fix it.

I added the reproduced Citra crash, recorded using the command:
apitrace trace ./citra-qt

I hope this is correct; if you need anything else, or need it done differently, please just let me know!

Marko Popovic uploaded an attachment:

I am adding Rocket League crash output from apitrace.

Attachment 145233, "APITrace log from RocketLeague crash":
RocketLeague.2.trace

Pierre-Eric Pelloux-Prayer @pepp said:

(In reply to Marko Popovic from comment 17)

> (In reply to Pierre-Eric Pelloux-Prayer from comment 15)
> > (In reply to Marko Popovic from comment 14)
> > > Yes, always happens at the same place with Citra emulator
> >
> > Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> >
> > This would be very helpful to fix it.
>
> I added reproduced Citra crash recorded by using command:
> apitrace trace ./citra-qt
>
> I hope this is correct, if you need anything else or done differently please
> just let me know!

Thanks for the trace!

Replaying the trace a few times is enough to reliably reproduce the hang.

Using AMD_DEBUG=nongg seems to prevent it, so it could be a temporary workaround until a proper fix is found.
Could you confirm this on your system?

>
> I am adding Rocket League crash output from apitrace.

This trace file is very small (only one frame) and doesn't hang here.
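
For anyone who wants to try the AMD_DEBUG=nongg workaround on the attached Citra trace, a hedged sketch (Python, added for illustration) of replaying it with and without the variable set. It assumes `apitrace` is on the PATH and that `citra-qt.1.trace` is in the current directory; the helper names `replay_command` and `replay` are made up. Since the replay can hang the GPU, the call may never return on an affected system.

```python
import os
import subprocess


def replay_command(trace_path, disable_ngg=True):
    """Build the (argv, env) pair for replaying a trace with apitrace."""
    env = dict(os.environ)
    if disable_ngg:
        # Note: this overwrites any AMD_DEBUG flags already set in the environment.
        env["AMD_DEBUG"] = "nongg"
    else:
        env.pop("AMD_DEBUG", None)
    return ["apitrace", "replay", trace_path], env


def replay(trace_path, disable_ngg=True):
    """Run the replay and return the exit code of apitrace."""
    argv, env = replay_command(trace_path, disable_ngg)
    return subprocess.run(argv, env=env).returncode
```

Calling `replay("citra-qt.1.trace")` a few times, then `replay("citra-qt.1.trace", disable_ngg=False)`, would be one way to confirm whether the variable makes the difference on a given system. Sync your disks first.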

Submitted by Marko Popovic