I've been getting random hangs in amdgpu_ring_lock, which causes X to hang, meaning I can't use the computer at all. I can sometimes switch to a tty, but even that doesn't always work.
I'm running Ubuntu 15.04 with mesa and libdrm from the oibaf PPA, a self-compiled xf86-video-amdgpu, and a self-compiled kernel from agd5f's drm-next-4.3-wip branch (commit 9066b0c318589f47b754a3def4fe8ec4688dc21a).
I haven't been able to predict when the hang will happen; sometimes I can use the machine for several hours before it hangs, other times it happens just a few minutes after booting.
I got a similar trace yesterday on the current agd5f drm-next-4.3 while trying to kill UVD by repeatedly starting mplayer.
I am slightly hopeful this is a different issue from the UVD one, as it starts with X, and I got way more starts than I have recently - 360 starts to get this trace, after a couple of OK runs of 250.
I haven't hit the lockup in normal use, but then my desktop setup is simple (fluxbox).
I've done some more testing; it turns out that in certain cases it never reaches amdgpu_ring_unlock_commit, and that's what causes the hang, since the mutex is never unlocked.
I added some debug output to the code: gfx/sdma0 is ring->name, 0/9 is ring->idx, and the address is that of the ring struct.
As you can see in the log, it calls amdgpu_ring_lock on ring 9 (sdma0), and then calls it again on ring 0 (gfx) without ever calling amdgpu_ring_unlock_commit in between.
I will add some more debug output in the hope of finding out exactly why it is never unlocked, and whether it is fixable. I should mention that these random lockups do not happen while using the proprietary Catalyst driver, so it must be something in the amdgpu driver.
Attachment 117967, "dmesg with added debug output": dmesg.txt
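For illustration, the bug shape described above - a submission path that takes the ring mutex but returns early without releasing it, so the next lock attempt blocks forever - can be reproduced in a self-contained userspace analogy. The names below are stand-ins modeled on amdgpu_ring_lock/amdgpu_ring_unlock_commit, not the actual amdgpu code:

```c
/* Userspace analogy of the suspected bug: a "submit" path that locks a
 * mutex but, on an error path, returns without unlocking it. The next
 * caller then blocks forever, like the hang described above.
 * Names are illustrative, not actual amdgpu code.
 * Compile with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ring_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Analogue of amdgpu_ring_lock(): take the lock, "allocate" ring space. */
static int ring_lock(int ndw)
{
    pthread_mutex_lock(&ring_mutex);
    printf("ring locked (ndw=%d)\n", ndw);
    if (ndw > 1024)     /* simulated allocation failure */
        return -1;      /* BUG: returns with the mutex still held */
    return 0;
}

/* Analogue of amdgpu_ring_unlock_commit(): commit and release. */
static void ring_unlock_commit(void)
{
    printf("ring committed and unlocked\n");
    pthread_mutex_unlock(&ring_mutex);
}

int main(void)
{
    if (ring_lock(2048) == 0)   /* fails and leaks the lock */
        ring_unlock_commit();

    printf("second submission...\n");
    if (ring_lock(16) == 0)     /* never returns: mirrors the hang */
        ring_unlock_commit();
    return 0;
}
```

Run as written, the second ring_lock() never returns - the same symptom as two amdgpu_ring_lock calls appearing in the log with no amdgpu_ring_unlock_commit between them.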
That could just be a symptom of a hardware hang which isn't detected for some reason.
Please take a look at amdgpu_fence_info as well to see if there are any outstanding submissions.
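For reference, amdgpu_fence_info is a debugfs file which, in kernels of this era, should live under /sys/kernel/debug/dri/<n>/ and report per-ring fence state (last emitted vs. last signaled sequence numbers); a signaled value lagging the emitted one after a hang would indicate outstanding submissions. A minimal reader sketch, assuming card 0 and debugfs mounted at the usual path (both assumptions; run as root):

```c
/* Sketch: dump amdgpu_fence_info for card 0. Assumes debugfs is
 * mounted at /sys/kernel/debug and the DRI node is 0; adjust as
 * needed. Must be run as root. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/kernel/debug/dri/0/amdgpu_fence_info";
    char buf[4096];
    size_t n;
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        fwrite(buf, 1, n, stdout);  /* pass the report through to stdout */
    fclose(f);
    return 0;
}
```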
If it's a hardware hang, wouldn't it also happen when using Catalyst? It doesn't happen there, so it should at least be possible to work around (if it is a hardware problem).
I will continue investigating why this happens, but it does seem to me like this bug, #91278, and #91676 are all caused by the same thing, just with different log output depending on whether you use drm-next-4.3 or drm-next-4.2.
Andy: Could you try compiling the latest kernel from drm-next-4.3-wip? I've been running it all day without a single lockup; before, it used to lock up several times a day. Just wanted someone to confirm whether it is in fact working, or if it's just me.
I can imagine that it's far better for desktop lockups - I moved onto it when it got updated.
Initially, testing with Unigine Valley, I thought it was going to be good - I got further than ever before (about 4x through all the scenes, having not got through once previously), but it did lock up.
That's a shame. I'll try and see if I can find out what has caused the lockups to stop for me; maybe that could help in finding out what's still causing them for you.
Attachment 118056, "possible fix"
I think this patch should fix it.
No luck here I'm afraid - I'm having a hard time reproducing it during normal desktop usage (with or without the patch), but it did lock up while running Unigine Valley.
Assuming you can still access the box over the network after the lockup, please provide the output of the following as root:
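(The command list itself did not survive the migration. Judging from the follow-up below, it was presumably reading the amdgpu_regs and amdgpu_fence_info files under /sys/kernel/debug/dri/0/ as root - an assumption reconstructed from the next comment, not the original text.)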
I see drm-next-4.3 is now ahead again, haven't tested that yet.
With the patch + drm-next-4.3-wip, I haven't yet managed to lock up Valley - but I've only had time for a couple of runs (45 min, then 90 min) from a clean boot. Maybe later, when I've been up a while doing other things, I'll try harder.
The patch doesn't apply with git apply - I applied it by hand.
I have attached the output of amdgpu_regs and amdgpu_fence_info. "Hang" was captured right after the hang happened; "Normal" right after a reboot following the hang (for comparison).
Attachment 118060, "Output of amdgpu_regs and amdgpu_fence_info": info.txt
I managed to lock it up; it seems that doing "something" between runs changes things, or the first runs are just lucky.
FWIW, I tried running the Unreal 4.5 ElementalDemo after my long runs and got a signal 7.
After I later locked/hung Valley, I rebooted and tried Elemental again from a clean boot; it ran OK, but after quitting it now gives signal 7 again if I try to start it.
This issue hasn't had any activity since 2019-11-19. The AMD driver stack changes rapidly and contains lots of shared code across products so it's possible that it has already been fixed. Please upgrade to a current stable kernel and userspace stack and try again. If you still experience this issue with the latest driver stack, please capture relevant logging and open a new issue referring back to this one.