[NVE6] GPU lockups

Dainius Masiliūnas uploaded an attachment:

Attached the Xorg crash log. It seems to be fairly consistent across different crash instances.

Attachment 86732, "Xorg crash log":
Xorg.0.log

Dainius Masiliūnas uploaded an attachment:

Attached two kernel logs. The first one was captured at the same time as the attached Xorg crash log (in case the timestamps are important). The second log is from /dev/kmsg during another crash instance, which seems to have produced different errors but the same outcome.

Attachment 86733, "Second kernel log":
dmesg-older.log

Dainius Masiliūnas uploaded an attachment:

Attached another kernel log. It seems it has elements from both the previous logs.

Attachment 86735, "Third kernel log":
new-dmesg.log

Ilia Mirkin @imirkin said:

What version of mesa are you using? Could you try with mesa-git?

Dainius Masiliūnas said:

Mesa 9.2.0. And I suppose I can try the git version, although I've never tried that before, so I'm not entirely sure if I can get everything working correctly.

Dainius Masiliūnas said:

Tried the git version of Mesa, and the issue is still there, it just triggers less often.

However, I found a reliable way to reproduce the problem, on both 9.2 and git versions of Mesa. On KDE 4.11, setting the KWin compositing method to OpenGL 3.1 causes a lockup every time. With XRender I don't seem to hit this issue at all, and I think on OpenGL 2.0 the lockups happen randomly (but I need to do some more testing to make sure).

Dainius Masiliūnas said:

Actually, I think the lockups on KWin switch were induced by some openSUSE update. After another update, I could no longer reproduce that behaviour, and it's back to random lockups at any given time, no matter the compositing settings. Though it might still be notable that this issue can also be induced by certain bugs elsewhere in the system.

Matthias Nagel said:

I have the same problem on Gentoo with the following software components:

x11-base/xorg-x11-7.4-r2
sys-kernel/gentoo-sources-3.12.5
kde-base/kdelibs-4.11.2-r1

with a GTX 660 card. It also sounds very similar to bug #72180.

Matthias Nagel uploaded an attachment:

Attachment 90899, "Kernel log on gentoo 3.12.5":
dmesg.log

Matthias Nagel uploaded an attachment:

Attachment 90900, "lspci on gentoo 3.12.5":
ls-pci.log

Ilia Mirkin @imirkin said:

One quick way to check if you have the same problem as bug 72180 is to use the blob fw. If that works, then you have the same issue. I guess I didn't make the connection originally...

Matthias Nagel said:

I tried to use the blob firmware, but failed to do so. See my comment at bug #72180 for more.

Ilia Mirkin @imirkin said:

*** This bug has been marked as a duplicate of bug 72180 ***

Dainius Masiliūnas said:

Reopened as per bug #72180 suggestions.

To make it clear, this is about random GPU lockups of GTX 660 (mine's Gainward), where using PGRAPH firmware from the blob does not fix the issue.

Interestingly enough, it looks like there is an equivalent (albeit also messy) bug opened for Fedora (see See Also), and it appears to be a race condition. So trying the patch in that bug might be a good idea. Alternatively, they suggest booting with nouveau.noaccel=1. I'll see if I can test this.
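
For anyone testing the nouveau.noaccel=1 suggestion above, it helps to confirm the option actually reached the kernel before drawing conclusions from a lockup-free session. The following is only a minimal sketch: the helper names `has_kernel_param` and `running_cmdline` are made up for illustration, and it only inspects the boot command line, so it would not notice the option being set some other way (for example through module configuration).

```python
from pathlib import Path


def has_kernel_param(cmdline: str, name: str, value: str) -> bool:
    """Return True if `name=value` appears as a whole token on a kernel command line."""
    # Splitting on whitespace avoids matching "nouveau.noaccel=10" when looking for "=1".
    return f"{name}={value}" in cmdline.split()


def running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with (Linux only)."""
    return Path(path).read_text().strip()
```

On the affected machine, `has_kernel_param(running_cmdline(), "nouveau.noaccel", "1")` should be true after rebooting with the option.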

Ilia Mirkin @imirkin said:

Please refresh this issue with new information. Make sure you're using at least kernel 4.3 and Mesa 11.0.4. Both have had important fixes which may affect your situation.
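
A quick way to check a system against those two minimums (kernel 4.3, Mesa 11.0.4) is to compare version strings component by component. This is a rough sketch, not part of any nouveau tooling: `version_tuple` and `at_least` are hypothetical helper names, and it assumes the version string starts with dotted numbers, as in "4.3.0-1-default" or "11.0.4".

```python
import re


def version_tuple(text: str) -> tuple:
    """Extract the leading dotted numbers, e.g. '4.3.0-1-default' -> (4, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(found: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

The kernel string can be taken from Python's `platform.release()`, e.g. `at_least(platform.release(), "4.3")`. The Mesa version has to be copied by hand from wherever the system reports it, e.g. `at_least("9.2.0", "11.0.4")` for the Mesa version mentioned earlier in this report.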

Dainius Masiliūnas said:

Right. I retested now with both kernel 4.3 and Mesa 11.0.4, and... well, it locks up, but with the kernel warning "../include/drm/drm_crtc.h:1577 drm_helper_choose_encoder_dpms" which seems to point to http://lists.freedesktop.org/archives/dri-devel/2015-September/091091.html and isn't actually a nouveau issue.

This prevents me from testing for the nouveau issue until the kernel gets fixed...

Dainius Masiliūnas uploaded an attachment:

Reading a bit more into the kernel log, I see that the drm_crtc.h warning might have been triggered by nouveau after all, because above that I have:

nouveau 0000:01:00.0: fifo: read fault at 6ff792f000 engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel 31 [023e0c9000 xembedsniproxy[2833]]
nouveau 0000:01:00.0: fifo: fifo engine fault on channel 31, recovering...
------------[ cut here ]------------
WARNING: CPU: 0 PID: 4 at ../drivers/gpu/drm/nouveau/nvkm/engine/fifo/gk104.h:73 gk104_fifo_recover_work+0x22a/0x290 [nouveau]

Attached the systemd journal of this. The warning above is at line 1834. The Xorg.0.log file does not have any errors or warnings at all.

I'm not sure if this should be yet another bug report.

Attachment 119484, "Journal (fifo read fault and drm_crtc.h)":
nouveau-kernel.log
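
When comparing several crash logs, it can be useful to pull the fields out of the "fifo: read fault" line and tally which process owned the faulting channel. The sketch below is an assumption-laden helper, not anything from nouveau itself: the regular expression is derived only from the single line quoted above, so other kernel versions may format the message differently, and the names `FAULT_RE`, `parse_fault` and `faulting_processes` are invented for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one fault line quoted in this report (kernel 4.3).
FAULT_RE = re.compile(
    r"fifo: (?P<access>read|write) fault at (?P<addr>[0-9a-f]+) "
    r"engine (?P<engine_id>[0-9a-f]+) \[(?P<engine>[^\]]*)\] "
    r"client (?P<client_id>[0-9a-f]+) \[(?P<client>[^\]]*)\] "
    r"reason (?P<reason_id>[0-9a-f]+) \[(?P<reason>[^\]]*)\] "
    r"on channel (?P<channel>-?\d+) "
    r"\[(?P<inst>[0-9a-f]+) (?P<process>[^\[\]]+)\[(?P<pid>\d+)\]\]"
)


def parse_fault(line):
    """Return the fault fields as a dict, or None if the line is not a fifo fault."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["channel"] = int(fields["channel"])
    fields["pid"] = int(fields["pid"])
    return fields


def faulting_processes(lines):
    """Count how often each process name appears as the owner of a faulting channel."""
    counts = Counter()
    for line in lines:
        fault = parse_fault(line)
        if fault is not None:
            counts[fault["process"]] += 1
    return counts
```

Run over a saved log, for instance `faulting_processes(open("nouveau-kernel.log", errors="replace"))`, this gives a per-process count that makes it easier to see whether the same client is involved in every lockup.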

Dainius Masiliūnas said:

Testing it a few more times, it is indeed the read fault by nouveau that's causing the lockup in this case. The general DRM error does not appear during all boots, but the nouveau read fault does. When waiting around for a long time, the kernel log also has this:

INFO: task kworker/0:4:956 blocked for more than 480 seconds.
Tainted: G W O 4.3.0-1-default #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:4 D 0000000000000000 0 956 2 0x00000080
Workqueue: events gk104_fifo_recover_work [nouveau]
ffff8800d9d8bbc8 0000000000000046 ffff8801fd2b2080 ffff880214b0e040
ffff8800d9d8c000 ffff8800d9d8bd18 ffff8800d9d8bd10 ffff880214b0e040
ffff8802142d8810 ffff8800d9d8bbe0 ffffffff8166a1aa 7fffffffffffffff
Call Trace:
[<ffffffff8166a1aa>] schedule+0x3a/0x90
[<ffffffff8166cfb7>] schedule_timeout+0x197/0x260
[<ffffffff8166b526>] wait_for_completion+0x96/0x100
[<ffffffff8108019d>] flush_work+0xed/0x180
[<ffffffffa02b79dd>] gk104_fifo_fini+0x1d/0x50 [nouveau]
[<ffffffffa02b443c>] nvkm_fifo_fini+0x1c/0x30 [nouveau]
[<ffffffffa02546a0>] nvkm_engine_fini+0x20/0x30 [nouveau]
[<ffffffffa0258511>] nvkm_subdev_fini+0x61/0x1e0 [nouveau]
[<ffffffffa02b8d3b>] gk104_fifo_recover_work+0xeb/0x290 [nouveau]
[<ffffffff81080c89>] process_one_work+0x159/0x470
[<ffffffff81080fe8>] worker_thread+0x48/0x4a0
[<ffffffff81086c79>] kthread+0xc9/0xe0
[<ffffffff8166e80f>] ret_from_fork+0x3f/0x70
DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

Leftover inexact backtrace:
[<ffffffff81086bb0>] ? kthread_worker_fn+0x170/0x170

I'm still not sure if this should be a separate bug report.

Karol Herbst @karolherbst said:

Is xembedsniproxy always involved in the crash? If so, it might be worth running mmt until it crashes and checking what it is actually doing.

Lucas Ribeiro said:

Also having random lockups on a GTX 660 Ti (NVE4 according to glxinfo), I think since kernel 4.1, using DRI2.
[ 0.267666] nouveau 0000:02:00.0: NVIDIA GK104 (0e4030a2)
[ 0.378583] nouveau 0000:02:00.0: bios: version 80.04.4b.00.1a
[ 0.379302] nouveau 0000:02:00.0: fb: 2048 MiB GDDR5

Now on gentoo ~amd64 using:

sys-kernel/gentoo-sources-4.5.1
x11-base/xorg-server-1.18.3
x11-drivers/xf86-video-nouveau-1.0.12

Should I make a new entry for this card?

Submitted by Dainius Masiliūnas
