I've been getting random hangs in amdgpu_ring_lock, which causes X to hang, meaning I can't use the computer at all. I can sometimes switch to a tty, but even that doesn't always work.
I'm running Ubuntu 15.04 with mesa and libdrm from the oibaf PPA, a self-compiled xf86-video-amdgpu, and a self-compiled kernel from agd5f's drm-next-4.3-wip branch (commit 9066b0c318589f47b754a3def4fe8ec4688dc21a).
I haven't been able to predict when the hang will happen; sometimes I can use the machine for several hours before it hangs, other times it happens just a few minutes after booting.
I got a similar trace yesterday on the current agd5f drm-next-4.3 while trying to kill UVD by repeatedly starting mplayer.
I am slightly hopeful this is a different issue from the UVD one, as it starts with X, and I got way more starts than I have recently - 360 starts to get this trace, after a couple of OK runs of 250.
I haven't hit the lockup in normal use, but then my desktop setup is simple (fluxbox).
I've done some more testing; it turns out that in certain cases it never reaches amdgpu_ring_unlock_commit, and that's what causes the hang, since the mutex is never unlocked.
I added some debug output to the code: gfx/sdma0 is ring->name, 0/9 is ring->idx, and the address is that of the ring struct.
As you can see in the log, it calls amdgpu_ring_lock on ring 9 (sdma0), and then calls it again on ring 0 (gfx) without ever calling amdgpu_ring_unlock_commit in between.
I will add some more debug output in the hope of finding out exactly why it is never unlocked, and whether it is fixable. I should mention that these random lockups do not happen while using the proprietary Catalyst driver, so it must be something in the amdgpu driver.
Attachment 117967, "dmesg with added debug output": dmesg.txt
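For illustration, the bug shape described above - a submission path that takes the ring mutex but returns early without releasing it, so the next lock attempt blocks forever - can be reproduced in a self-contained userspace analogy. The names below are stand-ins modeled on amdgpu_ring_lock/amdgpu_ring_unlock_commit, not the actual amdgpu code:

```c
/* Userspace analogy of the suspected bug: a "submit" path that locks a
 * mutex but, on an error path, returns without unlocking it. The next
 * caller then blocks forever, like the hang described above.
 * Names are illustrative, not actual amdgpu code.
 * Compile with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ring_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Analogue of amdgpu_ring_lock(): take the lock, "allocate" ring space. */
static int ring_lock(int ndw)
{
    pthread_mutex_lock(&ring_mutex);
    printf("ring locked (ndw=%d)\n", ndw);
    if (ndw > 1024)     /* simulated allocation failure */
        return -1;      /* BUG: returns with the mutex still held */
    return 0;
}

/* Analogue of amdgpu_ring_unlock_commit(): commit and release. */
static void ring_unlock_commit(void)
{
    printf("ring committed and unlocked\n");
    pthread_mutex_unlock(&ring_mutex);
}

int main(void)
{
    if (ring_lock(2048) == 0)   /* fails and leaks the lock */
        ring_unlock_commit();

    printf("second submission...\n");
    if (ring_lock(16) == 0)     /* never returns: mirrors the hang */
        ring_unlock_commit();
    return 0;
}
```

Run as written, the second ring_lock() never returns - the same symptom as two amdgpu_ring_lock calls appearing in the log with no amdgpu_ring_unlock_commit between them.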
That could just be a symptom of a hardware hang which isn't detected for some reason.
Please take a look at amdgpu_fence_info as well to see if there are any outstanding submissions.
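For reference, amdgpu_fence_info is a debugfs file which, in kernels of this era, should live under /sys/kernel/debug/dri/<n>/ and report per-ring fence state (last emitted vs. last signaled sequence numbers); a signaled value lagging the emitted one after a hang would indicate outstanding submissions. A minimal reader sketch, assuming card 0 and debugfs mounted at the usual path (both assumptions; run as root):

```c
/* Sketch: dump amdgpu_fence_info for card 0. Assumes debugfs is
 * mounted at /sys/kernel/debug and the DRI node is 0; adjust as
 * needed. Must be run as root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/kernel/debug/dri/0/amdgpu_fence_info";
    char buf[4096];
    size_t n;
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        fwrite(buf, 1, n, stdout);  /* pass the report through to stdout */
    fclose(f);
    return 0;
}
```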
If it's a hardware hang, wouldn't it also happen when using Catalyst? It doesn't happen there, so it should at least be possible to work around (if it is a hardware problem).
I will continue investigating why this happens, but it does seem to me like this bug, #91278, and #91676 are all caused by the same thing, just with different log output depending on whether you use drm-next-4.3 or drm-next-4.2.
Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lockup; before, it used to lock up several times a day. Just wanted someone to confirm whether it is in fact working, or if it's just me.
I can imagine that it's far better for desktop lockups - I moved onto it when it got updated.
Initially, testing with Unigine Valley, I thought it was going to be good - I got further than ever before (about 4x through all the scenes, having not got through once previously), but it did lock up.
That's a shame. I'll try and see if I can find out what has caused the lockups to stop for me; maybe that could help in finding out what's still causing them for you.
Attachment 118056, "possible fix"
I think this patch should fix it.
No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
Assuming you can still access the box over the network after the lockup, please provide the output of the following as root:
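(The command list itself did not survive the migration. Judging from the follow-up below, it was presumably reading the amdgpu_regs and amdgpu_fence_info files under /sys/kernel/debug/dri/0/ as root - an assumption reconstructed from the next comment, not the original text.)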
I see drm-next-4.3 is now ahead again, haven't tested that yet.
With the patch + drm-next-4.3-wip, I haven't yet managed to lock up Valley - but I've only had time for a couple of runs (45 min, then 90 min) from a clean boot. Maybe later, when I've been up a while doing other things, I'll try harder.
The patch doesn't apply with git apply - I applied it by hand.
I have attached the output of amdgpu_regs and amdgpu_fence_info. "Hang" was captured right after the hang happened; "Normal" right after a reboot following the hang (for comparison).
Attachment 118060, "Output of amdgpu_regs and amdgpu_fence_info": info.txt
I managed to lock it up; it seems that doing "something" between runs changes things, or the first runs are just lucky.
FWIW, I tried running the Unreal 4.5 ElementalDemo after my long runs and got a signal 7.
After I later locked/hung Valley, I rebooted and tried Elemental again from a clean boot; it ran OK, but after quitting it now gives signal 7 again if I try to start it.
This issue hasn't had any activity since 2019-11-19. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.