[NV84] screen freezes
Hi Guys,
I am experiencing screen freezes with a fairly old card that was, however, working fine in another system. For no reason I have been able to determine so far, once every few days the screen freezes and the only way to recover is to reboot the machine remotely. In the logs I see the following:
Oct 19 10:36:10 falcor kernel: [72915.602331] ------------[ cut here ]------------
Oct 19 10:36:10 falcor kernel: [72915.602337] WARNING: CPU: 2 PID: 21589 at drivers/gpu/drm/nouveau/nvif/vmm.c:68 nvif_vmm_put+0x65/0x70
Oct 19 10:36:10 falcor kernel: [72915.602338] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common
Oct 19 10:36:10 falcor kernel: [72915.602343] CPU: 2 PID: 21589 Comm: kworker/2:1 Tainted: P O 5.4.60-gentoo #15
Oct 19 10:36:10 falcor kernel: [72915.602343] Hardware name: Dell Inc. OptiPlex 9020/06X1TJ, BIOS A12 05/06/2015
Oct 19 10:36:10 falcor kernel: [72915.602346] Workqueue: events nouveau_cli_work
Oct 19 10:36:10 falcor kernel: [72915.602350] RIP: 0010:nvif_vmm_put+0x65/0x70
Oct 19 10:36:10 falcor kernel: [72915.602351] Code: 00 00 48 89 e2 be 02 00 00 00 48 c7 04 24 00 00 00 00 48 89 44 24 08 e8 99 e6 ff ff 85 c0 75 0a 48 c7 43 08 00 00 00 00 eb b7 <0f> 0b eb f2 e8 02 ad b5 ff 66 90 53 48 83 ec 20 65 48 8b 04 25 28
Oct 19 10:36:10 falcor kernel: [72915.602352] RSP: 0018:ffffa6bac887fdc8 EFLAGS: 00010282
Oct 19 10:36:10 falcor kernel: [72915.602353] RAX: 00000000fffffffe RBX: ffffa6bac887fdf0 RCX: 0000000000000000
Oct 19 10:36:10 falcor kernel: [72915.602353] RDX: 0000000000000010 RSI: ffffa6bac887fd38 RDI: ffffa6bac887fdd8
Oct 19 10:36:10 falcor kernel: [72915.602354] RBP: ffffa6bac887fe20 R08: 00000000fffffffe R09: 0000000000000000
Oct 19 10:36:10 falcor kernel: [72915.602355] R10: 0000000000000030 R11: 0000000000000018 R12: ffff9f7f99d096b8
Oct 19 10:36:10 falcor kernel: [72915.602355] R13: dead000000000122 R14: dead000000000100 R15: ffff9f7f99d096a8
Oct 19 10:36:10 falcor kernel: [72915.602356] FS: 0000000000000000(0000) GS:ffff9f8026880000(0000) knlGS:0000000000000000
Oct 19 10:36:10 falcor kernel: [72915.602357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 19 10:36:10 falcor kernel: [72915.602357] CR2: 00007f607775c260 CR3: 000000019bafa003 CR4: 00000000001606e0
Oct 19 10:36:10 falcor kernel: [72915.602358] Call Trace:
Oct 19 10:36:10 falcor kernel: [72915.602362] nouveau_vma_del+0x6b/0xb0
Oct 19 10:36:10 falcor kernel: [72915.602364] nouveau_gem_object_delete_work+0x31/0x60
Oct 19 10:36:10 falcor kernel: [72915.602365] nouveau_cli_work+0xb7/0xf0
Oct 19 10:36:10 falcor kernel: [72915.602368] process_one_work+0x1da/0x390
Oct 19 10:36:10 falcor kernel: [72915.602370] worker_thread+0x45/0x3b0
Oct 19 10:36:10 falcor kernel: [72915.602372] kthread+0xf6/0x130
Oct 19 10:36:10 falcor kernel: [72915.602374] ? process_one_work+0x390/0x390
Oct 19 10:36:10 falcor kernel: [72915.602375] ? kthread_park+0x80/0x80
Oct 19 10:36:10 falcor kernel: [72915.602377] ret_from_fork+0x35/0x40
Oct 19 10:36:10 falcor kernel: [72915.602379] ---[ end trace 7cdfdadbc5e9cfec ]---
However, this warning did not lead to a crash, so I am not sure whether it is relevant. The following log appears right at the time of a crash:
Oct 20 08:03:32 falcor kernel: [150157.456211] ------------[ cut here ]------------
Oct 20 08:03:32 falcor kernel: [150157.456213] nouveau 0000:01:00.0: timeout
Oct 20 08:03:32 falcor kernel: [150157.456233] WARNING: CPU: 6 PID: 10436 at drivers/gpu/drm/nouveau/nvkm/engine/fifo/chang84.c:108 g84_fifo_chan_engine_fini+0x290/0x2f0
Oct 20 08:03:32 falcor kernel: [150157.456234] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev videobuf2_common
Oct 20 08:03:32 falcor kernel: [150157.456247] CPU: 6 PID: 10436 Comm: chrome:disk$3 Tainted: P W O 5.4.60-gentoo #15
Oct 20 08:03:32 falcor kernel: [150157.456248] Hardware name: Dell Inc. OptiPlex 9020/06X1TJ, BIOS A12 05/06/2015
Oct 20 08:03:32 falcor kernel: [150157.456249] RIP: 0010:g84_fifo_chan_engine_fini+0x290/0x2f0
Oct 20 08:03:32 falcor kernel: [150157.456250] Code: 10 48 8b 78 10 48 8b 57 50 48 85 d2 74 5c 48 89 14 24 e8 13 1e 07 00 48 8b 14 24 48 c7 c7 6e c3 4b 90 48 89 c6 e8 f9 08 ad ff <0f> 0b 48 8b 73 78 44 89 f7 48 81 c6 20 25 00 00 e8 9b 6a e3 ff 41
Oct 20 08:03:32 falcor kernel: [150157.456250] RSP: 0018:ffffa6bac10979e8 EFLAGS: 00010282
Oct 20 08:03:32 falcor kernel: [150157.456251] RAX: 0000000000000000 RBX: ffff9f802440f400 RCX: 0000000000000006
Oct 20 08:03:32 falcor kernel: [150157.456252] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8026996390
Oct 20 08:03:32 falcor kernel: [150157.456252] RBP: ffff9f8007ff6008 R08: 0000000000000001 R09: 0000000000000378
Oct 20 08:03:32 falcor kernel: [150157.456252] R10: 000000000001272c R11: 0000000000000001 R12: 0000000000000020
Oct 20 08:03:32 falcor kernel: [150157.456253] R13: 0000000000000000 R14: 00000000003b003b R15: ffff9f8024b09c00
Oct 20 08:03:32 falcor kernel: [150157.456253] FS: 00007f607bad4640(0000) GS:ffff9f8026980000(0000) knlGS:0000000000000000
Oct 20 08:03:32 falcor kernel: [150157.456254] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 20 08:03:32 falcor kernel: [150157.456254] CR2: 00003cfad5a12000 CR3: 000000007e00a002 CR4: 00000000001606e0
Oct 20 08:03:32 falcor kernel: [150157.456255] Call Trace:
Oct 20 08:03:32 falcor kernel: [150157.456259] nvkm_fifo_chan_child_fini+0x5d/0xe0
Oct 20 08:03:32 falcor kernel: [150157.456261] nvkm_oproxy_fini+0x26/0x80
Oct 20 08:03:32 falcor kernel: [150157.456262] nvkm_object_fini+0xb7/0x150
Oct 20 08:03:32 falcor kernel: [150157.456263] nvkm_object_fini+0x6e/0x150
Oct 20 08:03:32 falcor kernel: [150157.456264] nvkm_ioctl_del+0x2a/0x50
Oct 20 08:03:32 falcor kernel: [150157.456265] nvkm_ioctl+0xda/0x172
Oct 20 08:03:32 falcor kernel: [150157.456267] nvif_object_fini+0x54/0x80
Oct 20 08:03:32 falcor kernel: [150157.456269] nouveau_channel_del+0x84/0x110
Oct 20 08:03:32 falcor kernel: [150157.456271] nouveau_abi16_chan_fini.isra.0+0x94/0xf0
Oct 20 08:03:32 falcor kernel: [150157.456272] nouveau_abi16_fini+0x28/0x60
Oct 20 08:03:32 falcor kernel: [150157.456274] nouveau_drm_postclose+0x47/0xd0
Oct 20 08:03:32 falcor kernel: [150157.456277] drm_file_free.part.0+0x1be/0x260
Oct 20 08:03:32 falcor kernel: [150157.456278] drm_release+0x95/0xd0
Oct 20 08:03:32 falcor kernel: [150157.456280] __fput+0xb4/0x240
Oct 20 08:03:32 falcor kernel: [150157.456282] task_work_run+0x84/0xa0
Oct 20 08:03:32 falcor kernel: [150157.456284] do_exit+0x353/0xab0
Oct 20 08:03:32 falcor kernel: [150157.456285] do_group_exit+0x35/0x90
Oct 20 08:03:32 falcor kernel: [150157.456288] get_signal+0x14e/0x800
Oct 20 08:03:32 falcor kernel: [150157.456289] do_signal+0x2b/0x5f0
Oct 20 08:03:32 falcor kernel: [150157.456292] ? __x64_sys_futex+0x132/0x160
Oct 20 08:03:32 falcor kernel: [150157.456293] exit_to_usermode_loop+0x60/0xa0
Oct 20 08:03:32 falcor kernel: [150157.456294] do_syscall_64+0xea/0x110
Oct 20 08:03:32 falcor kernel: [150157.456296] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 08:03:32 falcor kernel: [150157.456298] RIP: 0033:0x7f608b19aeb7
Oct 20 08:03:32 falcor kernel: [150157.456301] Code: Bad RIP value.
Oct 20 08:03:32 falcor kernel: [150157.456301] RSP: 002b:00007f607bad39c0 EFLAGS: 00000282 ORIG_RAX: 00000000000000ca
Oct 20 08:03:32 falcor kernel: [150157.456302] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f608b19aeb7
Oct 20 08:03:32 falcor kernel: [150157.456302] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00000b2cf3d96d08
Oct 20 08:03:32 falcor kernel: [150157.456303] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Oct 20 08:03:32 falcor kernel: [150157.456303] R10: 0000000000000000 R11: 0000000000000282 R12: 00000b2cf3d96ce0
Oct 20 08:03:32 falcor kernel: [150157.456304] R13: 00000b2cf3d96cb8 R14: 00007f607bad39f0 R15: 00000b2cf3d96d08
Oct 20 08:03:32 falcor kernel: [150157.456304] ---[ end trace 7cdfdadbc5e9cfed ]---
Oct 20 08:05:12 falcor shutdown[30598]: shutting down for system reboot
To hunt it down, I tried the closed-source NVIDIA driver (legacy), and the freezes were much more frequent (a few times per hour). In that case, though, the mouse pointer was still moving and I could switch to a text console to restart the X server. With the nouveau driver the freeze is total and the system cannot be recovered.
Besides a software bug, I see two possible causes. Either the card was damaged in some way after it was removed from the old system, which would be a hardware fault, or the card draws too much power. The Dell system uses some unusual connections to supply power to the motherboard, which may be insufficient, and the card has no power connection of its own to the power supply. Is there any way to rule out one or the other? Is there any info I can provide?
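If it helps, the kernel warnings can be pulled out of the syslog together with their timestamps, so they can be compared against the times of the freezes. Below is a minimal sketch in Python, assuming syslog-style lines formatted like the ones above. The function name `find_kernel_warnings` and the sample input are only illustrative and not part of any existing tool.

```python
import re

# Matches syslog-style kernel lines of the form (assumed from the logs above):
# "<Mon> <day> <HH:MM:SS> <host> kernel: [<uptime>] WARNING: ... at <file>:<line> <symbol>"
WARNING_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+WARNING:.*?\sat\s+(?P<where>\S+)\s+(?P<symbol>\S+)"
)


def find_kernel_warnings(lines):
    """Return (timestamp, uptime_seconds, source_location, symbol) per WARNING line."""
    hits = []
    for line in lines:
        m = WARNING_RE.match(line)
        if m:
            hits.append((m["stamp"], float(m["uptime"]), m["where"], m["symbol"]))
    return hits


if __name__ == "__main__":
    # Sample input copied from the first trace above; in practice the lines
    # would be read from wherever the system logger writes kernel messages.
    sample = [
        "Oct 19 10:36:10 falcor kernel: [72915.602337] WARNING: CPU: 2 PID: 21589 "
        "at drivers/gpu/drm/nouveau/nvif/vmm.c:68 nvif_vmm_put+0x65/0x70",
        "Oct 19 10:36:10 falcor kernel: [72915.602346] Workqueue: events nouveau_cli_work",
    ]
    for hit in find_kernel_warnings(sample):
        print(hit)
```

Only the `WARNING:` header lines are matched. The register dumps and call traces are skipped, which keeps the output short enough to line up against the freeze times by eye.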
Cheers, Bartek