Sometimes the X server crashes due to a GPU lockup, caused by a page fault. It happens seemingly at random and at irregular intervals (sometimes after several hours, sometimes within half an hour).
Before that happens, I see a small amount of corruption (noise around the cursor), then everything but the mouse hangs. After a while, the mouse also hangs, the screen becomes black with a "_" symbol in the upper right corner of the screen (but the mouse is still displayed), and after some more time the whole screen becomes corrupted in vertical blocks. If I press Ctrl+Alt+F1 fast enough, I can switch out of X and use the console for a while; otherwise the whole PC hangs and I need to do a hard reboot.
This issue may or may not be related to bug #69029 (the symptoms seem similar, but the errors are different).
I am using a GeForce 660 card on openSUSE 13.1 x86_64 Beta. I also reported the bug downstream.
I have attached two kernel logs. The first one was captured at the same time as the attached Xorg crash log (in case the timestamps are important). The second log is /dev/kmsg during another crash, which seems to have produced different errors but the same outcome.
I am using Mesa 9.2.0. I suppose I can try the git version, although I've never done that before, so I'm not entirely sure I can get everything working correctly.
I tried the git version of Mesa and the issue is still there; it just triggers less often.
However, I found a reliable way to reproduce the problem, on both 9.2 and git versions of Mesa. On KDE 4.11, setting the KWin compositing method to OpenGL 3.1 causes a lockup every time. With XRender I don't seem to hit this issue at all, and I think on OpenGL 2.0 the lockups happen randomly (but I need to do some more testing to make sure).
Actually, I think the lockups on KWin switch were induced by some openSUSE update. After another update, I could no longer reproduce that behaviour, and it's back to random lockups at any given time, no matter the compositing settings. Though it might still be notable that this issue can also be induced by certain bugs elsewhere in the system.
One quick way to check if you have the same problem as bug 72180 is to use the blob fw. If that works, then you have the same issue. I guess I didn't make the connection originally...
To make it clear, this bug is about random GPU lockups on the GTX 660 (mine is a Gainward), where using the PGRAPH firmware from the blob does not fix the issue.
Interestingly enough, it looks like there is an equivalent (albeit also messy) bug open for Fedora (see See Also), and it appears to be a race condition. So trying the patch in that bug might be a good idea. Alternatively, they suggest booting with nouveau.noaccel=1. I'll see if I can test this.
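When testing the nouveau.noaccel=1 workaround, it helps to confirm that the option actually reached the running kernel. The following is a minimal sketch, assuming the option was passed on the kernel command line (readable from /proc/cmdline); the helper names kernel_param and noaccel_enabled are made up for this example.

```python
def kernel_param(cmdline, name):
    """Return the value of a kernel command-line parameter, or None if absent.

    A bare flag without "=value" is returned as an empty string.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


def noaccel_enabled(cmdline):
    """True if nouveau.noaccel=1 appears on the given command line."""
    return kernel_param(cmdline, "nouveau.noaccel") == "1"


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        booted_with = f.read()
    print("nouveau.noaccel=1 active:", noaccel_enabled(booted_with))
```

This only inspects the boot command line. If the option is set elsewhere (for example in a modprobe configuration file), this check will not see it.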
Please refresh this issue with new information. Make sure you're using at least kernel 4.3 and Mesa 11.0.4. Both have had important fixes which may affect your situation.
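A quick way to check a system against the minimum versions requested above (kernel 4.3, Mesa 11.0.4) is to compare the leading numeric parts of the version strings. This is only a sketch: version_tuple and at_least are hypothetical helper names, and the Mesa version has to be supplied by hand here (for example as reported by glxinfo).

```python
import platform
import re


def version_tuple(text):
    """Extract the leading dotted number from a version string.

    "4.3.0-1-default" -> (4, 3, 0); distribution suffixes are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError("no version number in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def at_least(found, minimum):
    """True if version string `found` is greater than or equal to `minimum`."""
    a, b = version_tuple(found), version_tuple(minimum)
    # Pad the shorter tuple with zeros so that "4.3" compares equal to "4.3.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    kernel = platform.release()  # e.g. "4.3.0-1-default"
    print("kernel", kernel, "ok" if at_least(kernel, "4.3") else "too old")
    mesa = "11.0.4"  # assumption: fill in the Mesa version you actually run
    print("Mesa", mesa, "ok" if at_least(mesa, "11.0.4") else "too old")
```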
Right. I retested now with both kernel 4.3 and Mesa 11.0.4, and... well, it locks up, but with the kernel warning "../include/drm/drm_crtc.h:1577 drm_helper_choose_encoder_dpms" which seems to point to http://lists.freedesktop.org/archives/dri-devel/2015-September/091091.html and isn't actually a nouveau issue.
This prevents me from testing for the nouveau issue until the kernel gets fixed...
After testing a few more times, it is indeed the read fault in nouveau that's causing the lockup in this case. The general DRM warning does not appear on every boot, but the nouveau read fault does. After waiting a long time, the kernel log also shows this:
INFO: task kworker/0:4:956 blocked for more than 480 seconds.
Tainted: G W O 4.3.0-1-default #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/0:4 D 0000000000000000 0 956 2 0x00000080
Workqueue: events gk104_fifo_recover_work [nouveau]
ffff8800d9d8bbc8 0000000000000046 ffff8801fd2b2080 ffff880214b0e040
ffff8800d9d8c000 ffff8800d9d8bd18 ffff8800d9d8bd10 ffff880214b0e040
ffff8802142d8810 ffff8800d9d8bbe0 ffffffff8166a1aa 7fffffffffffffff
Call Trace:
[<ffffffff8166a1aa>] schedule+0x3a/0x90
[<ffffffff8166cfb7>] schedule_timeout+0x197/0x260
[<ffffffff8166b526>] wait_for_completion+0x96/0x100
[<ffffffff8108019d>] flush_work+0xed/0x180
[<ffffffffa02b79dd>] gk104_fifo_fini+0x1d/0x50 [nouveau]
[<ffffffffa02b443c>] nvkm_fifo_fini+0x1c/0x30 [nouveau]
[<ffffffffa02546a0>] nvkm_engine_fini+0x20/0x30 [nouveau]
[<ffffffffa0258511>] nvkm_subdev_fini+0x61/0x1e0 [nouveau]
[<ffffffffa02b8d3b>] gk104_fifo_recover_work+0xeb/0x290 [nouveau]
[<ffffffff81080c89>] process_one_work+0x159/0x470
[<ffffffff81080fe8>] worker_thread+0x48/0x4a0
[<ffffffff81086c79>] kthread+0xc9/0xe0
[<ffffffff8166e80f>] ret_from_fork+0x3f/0x70
DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
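For anyone comparing their own logs against the trace above, the hung-task report can be picked out of a saved kernel log mechanically. The sketch below assumes the log has been saved to a text file and that the messages have the same shape as the two lines quoted above ("INFO: task ... blocked for more than ... seconds" followed by a "Workqueue:" line); find_hung_tasks is a made-up helper name, not part of any existing tool.

```python
import re
import sys

# Patterns modelled on the log lines quoted in this report.
HUNG = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
WORKQUEUE = re.compile(
    r"Workqueue: (?P<queue>\S+) (?P<work>\S+)(?: \[(?P<module>\w+)\])?"
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task report found in the kernel log text."""
    reports = []
    for line in log_text.splitlines():
        hung = HUNG.search(line)
        if hung:
            reports.append({
                "task": hung.group("task"),
                "pid": int(hung.group("pid")),
                "seconds": int(hung.group("secs")),
                "work": None,
                "module": None,
            })
            continue
        wq = WORKQUEUE.search(line)
        # Attach the first Workqueue line that follows a report to that report.
        if wq and reports and reports[-1]["work"] is None:
            reports[-1]["work"] = wq.group("work")
            reports[-1]["module"] = wq.group("module")
    return reports


if __name__ == "__main__":
    # Usage: python hung_tasks.py saved_kernel_log.txt
    with open(sys.argv[1], errors="replace") as f:
        for report in find_hung_tasks(f.read()):
            print(report)
```

A report whose work item is gk104_fifo_recover_work in the nouveau module would match the situation described here, where the FIFO recovery worker itself is stuck.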
I am also seeing random lockups on a GTX 660 Ti (NVE4 according to glxinfo), since kernel 4.1 I think, using DRI2.
[ 0.267666] nouveau 0000:02:00.0: NVIDIA GK104 (0e4030a2)
[ 0.378583] nouveau 0000:02:00.0: bios: version 80.04.4b.00.1a
[ 0.379302] nouveau 0000:02:00.0: fb: 2048 MiB GDDR5