[NV50, Linux 5.9] Regression: Visual artifacts and eventual GPU crash

Yeah, 5de5b6ec causing this would be strange. Mind giving it another try, or figuring out whether there is an easy way to reproduce this issue?

Yes, disregard that version above. Currently the first bad version is fe4249af, but I'm still working on the bisect. It takes a while though because it turns out that it can take a few hours before the bug appears, which also means that it's easy to mistag bad versions as good.

mentioned in commit acba01a4

mentioned in commit cc5b828d

mentioned in commit 28fe1dcb

I just wanted to inform you that I'm facing a similar, and probably the same, issue. I see the same DRM: base-0: timeout messages. Additionally, my dmesg is completely filled with kernel warnings (see the attached file). I can also report that only one of my two monitors is frozen (or rather: reduced to around 0.3 FPS), and repeatedly calling dmesg sometimes seems to clear the lockup (though that may be coincidental).

added system freeze label

added regression label

@karolherbst - I finished bisection. This is the first commit with issues:

commit 0a96099691c8cd1ac0744ef30b6846869dc2b566
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Tue Jul 21 11:34:07 2020 +1000

    drm/nouveau/kms/nv50-: implement proper push buffer control logic
    
    We had a, what was supposed to be temporary, hack in the KMS code where we'd
    completely drain an EVO/NVD channel's push buffer when wrapping to the start
    again, instead of treating it as a ring buffer.
    
    Let's fix that, finally.
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

I have been running the commit before for two days now and didn't encounter any problems, so I would say it's very likely that the above is correct.
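
To illustrate the distinction the commit message draws, here is a minimal Python sketch. It is not nouveau's actual code. It assumes a single producer (the driver, writing at index `put`) and a single consumer (the display engine, reading at index `get`), with `get <= put` at the moment the producer runs out of room at the end of the buffer. The function names and example indices are hypothetical.

```python
def may_wrap_drain(put: int, get: int) -> bool:
    """The 'temporary hack': wrap to the start only once the consumer
    has caught up completely, i.e. the buffer is fully drained."""
    return get == put


def may_wrap_ring(get: int, needed: int) -> bool:
    """Ring-buffer behaviour: wrap as soon as the first `needed` words
    are free. Assumes the consumer has not wrapped yet (get <= put).
    The strict '<' keeps put from landing on get after the write,
    which would be indistinguishable from an empty buffer."""
    return needed < get


# Producer is near the end of the buffer, consumer is part-way through.
put, get, needed = 1000, 600, 16
print(may_wrap_drain(put, get))    # False: must wait for a full drain
print(may_wrap_ring(get, needed))  # True: the start of the buffer is already free
```

In this simplified model the ring-buffer variant lets the producer continue while the consumer is still working through older entries, which is the behaviour the commit set out to implement.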

added bisected label

thanks for doing the bisect!

Hello, just to mention the same is happening on my 01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [NVS 3100M] (rev a2) with the current OpenSUSE tumbleweed's kernel 5.9.1-1-default.

(It seems people on LKML are also reporting similar troubles.)

Hope it will get fixed soon.

added widespread label

The problem persists in kernel 5.9.8.

Yeah, I also have this on 5.9.9-arch1-1 with:

02:00.0 VGA compatible controller: NVIDIA Corporation GT216 [GeForce GT 220] (rev a2)
03:00.0 VGA compatible controller: NVIDIA Corporation G96CGL [Quadro FX 580] (rev a1)

and on the other debian box with:

01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)

but that is to be expected, as this problem has not been fixed yet (at least this ticket has not been updated).

mentioned in issue #27 (closed)

Ohh, maybe we should have pinged @skeggsb on this issue here... but I am also not sure whether we already addressed it in 5.10 or not. I seem to remember it came up, but maybe it was this issue all along.

@karolherbst I will give 5.10-rc6 a try.

mentioned in issue mesa/mesa#3899

Hi everyone,

I have been experiencing the same issue for a long time now, ever since upgrading my system to kernel 5.9.8.

I'm using Fedora 33 with ZFS on root. I had been looking for a thread like this for a month, and it was a real PITA to find. Since my kernel is tainted (zfs modules...), I also couldn't send a report about the issue earlier, due to Fedora's "abrt" policy of not allowing reports when the kernel is tainted.

So as I said, the issue recurs all the time here, and very frequently, as long as I boot a 5.9.X kernel. Tested with "5.9.8", "5.9.9", "5.9.10", and today I upgraded to "5.9.11" for testing. As some people said here, falling back to a 5.8 kernel (more precisely, in my case, 5.8.18-300) solves the problem. The following packages seem to be involved:

kernel-5.9
xorg-x11-drv-nouveau

Keeping 5.8 kernel for the moment solves the issue, but the downside is that Fedora upstream is going forward with kernel updates, and of course I would like to follow that.

The main problem I'm facing is that I'm running the system with ZFS on root, which implies DKMS modules, and some issue makes it impossible to boot previous kernels after an upgrade to a more recent one. It seems that something doesn't allow more than one set of zfs-dkms modules, compiled for a given kernel, to be installed at a time (that has nothing to do with this issue, of course). When the system crashes due to this regression, I first have to reboot into kernel 5.9.X, remove all installed DKMS modules, rebuild and install the DKMS modules for kernel 5.8.18 only, and then reboot again into the latter. One can imagine how hard it is to keep a workstation functional this way.

So, even though my knowledge of debugging is quite limited, I'm really willing to help here and provide whatever is needed to fix this, with someone's guidance.

Feel free to ask.

@karolherbst : I now experienced a crash with linux 5.10-rc6, so it still suffers from the bug.

Are you certain you're seeing the same issue? The commit you bisected to did indeed have issues, as there's a HW bug that we needed to work around. That was fixed by ca386aa7.

I'd be very interested in seeing a full kernel log after the issue has reproduced on 5.10.

Also, it's possible that one of the later commits before the fix could be causing this, but it's going to be harder to track down without isolating the HW bug we triggered first.

My suggestion here would be to rebase and move the commit I mentioned right after the one you bisected to (or just squash them together), see if you can reproduce it there. Hopefully you can't, and then you'll be able to bisect the later commits and find what else is broken.

I am 90% sure I am seeing the same problem in 5.10-rc6: All the dmesg messages look similar and the symptoms do too. Here is a log:

5.10-rc6_log.txt

The symptoms start to appear around 22:57:40 (which is when the first DRM: base-0: timeout is printed). After that, I think I logged in remotely about one minute later and triggered a system shutdown around 22:59:02, so the final block of nouveau messages appeared during the shutdown sequence.
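
A quick way to locate the onset in a saved log is to look for the first line containing the timeout message. The sketch below is a hypothetical helper, not part of any tool mentioned here. It only assumes that the message text matches the `DRM: base-0: timeout` lines quoted in this thread; letting the number after `base-` vary is an additional assumption. The sample input reuses lines quoted elsewhere in this thread.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches e.g. "DRM: base-0: timeout". Allowing other numbers after
# "base-" is an assumption, not something confirmed in the thread.
TIMEOUT_RE = re.compile(r"DRM: base-\d+: timeout")


def first_timeout(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line) of the first timeout message, or None."""
    for number, line in enumerate(lines, start=1):
        if TIMEOUT_RE.search(line):
            return number, line.rstrip("\n")
    return None


# Sample input built from log lines quoted in this thread.
sample = [
    "kernel: pci 0000:02:00.0: error -61 assigning properties",
    "nouveau 0000:02:00.0: DRM: base-0: timeout",
    "nouveau 0000:02:00.0: DRM: base-0: timeout",
]
print(first_timeout(sample))
```

For a real log, pass an open file object instead of the sample list; everything before the reported line is then the part of the log leading up to the first timeout.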

I see your point about squashing the two commits together. But ca386aa7 was already part of the 5.9.0 kernel, so that can't be the fix - otherwise I would have never seen the bug. Maybe 0a960996 introduced more than one problem? Or indeed one of the commits before 0a960996 could be responsible - it is tough to say since each iteration of bisection requires running the system for at least 12 - 24 hours and even then it's not guaranteed to trigger. However I will try again as soon as I can and I will squash the two commits that time.

The point of squashing them is to be able to bisect the commits immediately after it to see if one of them introduced additional issues that can't be bisected while the HW bug is present. The HW bug presents slightly differently to what you're seeing, there's a very specific error code thrown by the display engine for that one.

It's possible that 0a960996 has more issues (which is why I suggest testing it squashed before trying another bisect), but I'd be surprised, as that'd probably affect things across the board, and I haven't encountered any problems.

Same problem here with Fedora 33 kernel 5.9.11 x86_64.

01:00.0 VGA compatible controller: NVIDIA Corporation GF116 [GeForce GTS 450 Rev. 2] (rev a1)

Some hints:

Fast flashing horizontal stripes occur in windows (not screen) when content changes. A good way to show this is using the vertical slider on a tall page in firefox. I'm not sure, but I suspect the stripes color is the text color.
When the system becomes unstable (i.e. 90% frozen), I can switch to text mode (Ctrl+Alt+F2) and then back to restore stability, rather than rebooting.
Playing a video in FF increases the "chance" of instability.

Problem still occurs here with Fedora 33 and kernel upgrade to 5.9.13 x86_64.

I'm getting the SAME lockup on GT216GLM (Tesla 2). I can reproduce this crash extremely easily on this severely under-powered GPU, within about 10-20 minutes.

This driver does do something noticeably different in 5.9 with its draw on this GPU:

During alt-tabs in Gnome 3, I'm seeing a very discernible kind of screen tearing when the GPU is under load: The tear starts from the top left, heading on a perfect 45 degree angle in a down/right direction, until the tear gets to the horizontal center of the display, then it cuts horizontally rightward across the middle of the screen, until the tear gets to a point where it can draw a 45-degree angle downward to the bottom right corner of the display. This tear is not at all visible in 5.8, but it's extremely pronounced on this hardware.
I can lock up the display with 2 cores hitting 100% in kernel time once this happens, and it's pretty much impossible for me to get out of the crash -- but the system still responds via SSH when it does. Crash likelihood is greatly enhanced when playing videos in FF, alt-tabbing to VLC, and playing videos with higher resolution in VLC.

It's very quick for me to test revisions on this hardware. Is there a certain set of revisions / reverts you want to try? I'm on Gentoo so I can compile my own kernels.

Just quickly popping in to mention (as reported by @monnerat and @yuri_sevatz) that in my experience too, problems arise in blit-intensive tasks (e.g. in Firefox, scrolling and watching YouTube).

It should probably be possible to create a small blit-intensive synthetic benchmark that triggers the problem within minutes.

Same problem here

Fedora 33 with kernel-5.9.13-200

VGA compatible controller: NVIDIA Corporation GT218M [NVS 3100M] (rev a2)

I am having similar issues since 5.9 update on Fedora 33. Booting with 5.8 kernel fixes the problem.

lspci VGA compatible controller: NVIDIA Corporation GT216GLM [Quadro FX 880M] (rev a2)

Has anyone tested kernel 5.10.1? I just received the update from my distro (openSUSE), but due to all the work involved with the current pandemic (hunting the new strain), I won't have much time to spend testing. Is anybody else up for putting their GPU under some "scrolling/videoplaying/blitting" stress test? (a.k.a. slacking on YouTube a lot)

Reproduced on 5.10.3.

nouveau_DRM_timeout_crash.txt

I had tried 2 weeks ago with 5.10.1 and got the exact same result.

Confirmed:

kernel 5.10.3 - 5.10.8
VGA: 01:00.0 VGA compatible controller: NVIDIA Corporation GF106GLM [Quadro 2000M] (rev a1)
dmesg is similar to the one above: dmesg-nouveau.log

Btw if you'd like me to enable/patch in any extra prints for debugging's sake to run through this again, I have a quick turnaround for slacking! ;)

Don't forget old cards when fixing. I have the same problem with an NVIDIA G98 (298200a2).

As a side note, I never had problems using nouveau with that card during the last ten years (at least). The first problems began around 4.19 (as I remember), rarely and randomly at first, but then more and more often. Now I have freezes every day. It may be new bugs in nouveau, but more probably old bugs exposed by newer software…

And thanks for all.

mentioned in issue #7

Further tests if it can help to narrow down the root cause:

frequent timeouts and freezes with Gnome+Mutter+Wayland.
less frequent timeouts and freezes with Gnome+Mutter+Xorg. Some screen artifacts (or delays?) in Firefox though when switching tabs, but nothing in the logs for that.
Problems with LXQt+Mutter.
No problems with LXQt+Openbox (but I suppose it doesn't use DRM at all).

I also tried KWin+Wayland, KWin+Xorg and Enlightenment, but they had problems of their own before nouveau even came into play :-<.

I'll stick with Openbox for the time being though I would prefer my Gnome-shell back!

still occurring with 01:00.0 VGA compatible controller: NVIDIA Corporation GT216M [GeForce GT 330M] (rev a2)
and
linux 5.10.6
Using Gnome Wayland + Mutter

IMHO: this issue MUST be fixed before 5.10 is declared LTS.

I think we should try attracting the attention of Phoronix' Michael Larabel.

I've been experiencing the same with 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation C79 [GeForce 9400M] [10de:0863] (rev b1).

Doesn't take much to trigger it, making it practically unusable day-to-day for me.

I might be experiencing the same problem. I updated the kernel from 5.6 to 5.9.8 and started getting these artifacts/flickering. You can see them easily in the upper part of the screen: VID_20201117_084247

Then I downgraded to 5.8.18 and the problem was gone. My video card is:

06:00.0 VGA compatible controller [0300]: NVIDIA Corporation C79 [GeForce 9400] [10de:086a] (rev b1)

mentioned in issue plymouth/plymouth#137

Perhaps relevant - comparing dmesg following system boot between 5.4.89 (no graphical artefacts nor nouveau timeouts) and 5.10.x, I observe the following error during boot relating to my graphics card:

kernel: pci 0000:02:00.0: error -61 assigning properties

I'm trying to run another git bisect in the way suggested by @skeggsb just now, hopefully that will shed some more light on to this.

@karolherbst @skeggsb Here are the results of my bisection:

I investigated the range of commits from 0a960996 to ca386aa7, restricting git bisect to changes in drivers/gpu/drm/nouveau. This time I applied the diff of ca386aa7 as a patch after checking out each commit (except when building that commit itself of course), so that the change in ca386aa7 is always included.

Here is the bisect log:

git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [0a96099691c8cd1ac0744ef30b6846869dc2b566] drm/nouveau/kms/nv50-: implement proper push buffer control logic
git bisect good 0a96099691c8cd1ac0744ef30b6846869dc2b566
# bad: [ca386aa7155a5467fa7b2b8376f4da8f8e59be4d] drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug
git bisect bad ca386aa7155a5467fa7b2b8376f4da8f8e59be4d
# bad: [fb3939e232f6120387f20a26510894a17680db3a] drm/nouveau/kms/nv50-: use NVIDIA's headers for core head_view()
git bisect bad fb3939e232f6120387f20a26510894a17680db3a
# good: [fccc858003f3f3e7a8fa272f118eb71d218a2b32] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw sema_set()
git bisect good fccc858003f3f3e7a8fa272f118eb71d218a2b32
# bad: [1070832b1eab7309c59d9564ed26f84932fed817] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_clr()
git bisect bad 1070832b1eab7309c59d9564ed26f84932fed817
# good: [75bd8304e61c01d2bc5df46fcf9c2e9838b3a246] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw ntfy_wait_begun()
git bisect good 75bd8304e61c01d2bc5df46fcf9c2e9838b3a246
# good: [6833d2a0c778252929805fabfdc89e4e181fcb82] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw xlut_set()
git bisect good 6833d2a0c778252929805fabfdc89e4e181fcb82
# bad: [f844eb485eb056ad3b67e49f95cbc6c685a73db4] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()
git bisect bad f844eb485eb056ad3b67e49f95cbc6c685a73db4
# good: [66f7b7bddfe60a708c7711e47c95d20db05e2110] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw xlut_clr()
git bisect good 66f7b7bddfe60a708c7711e47c95d20db05e2110
# first bad commit: [f844eb485eb056ad3b67e49f95cbc6c685a73db4] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()

The first bad commit is:

commit f844eb485eb056ad3b67e49f95cbc6c685a73db4
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Sat Jun 20 13:08:47 2020 +1000

    drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
    Reviewed-by: Lyude Paul <lyude@redhat.com>

The issue can be reliably observed in glxgears: artifacts such as white stripes appear in f844eb48 and the following commits, but not before. I didn't wait for the subsequent crash of the GPU.

It turns out that the difficulty I reported earlier (sometimes more than 10 hours before "the bug" triggers), was probably actually another issue which was fixed by ca386aa7, because I didn't observe that anymore at all. In turn, the problem introduced by f844eb48 manifests within a few minutes usually and can be provoked with tools such as glxgears.

I'm now trying to revert f844eb48 on top of 5.10.7 and will report back if this fixes all problems.

EDIT: I reverted the actual code changes in f844eb48 only, but left the new headers and the includes: revert.patch. Applying this seems to cure the problem for me. Maybe someone else can try this as well.
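
The bisection procedure described above, checking out each candidate commit and applying the diff of ca386aa7 on top before building, could be scripted roughly as follows. This is a sketch, not the script that was actually used: the function names are hypothetical, the build-and-test step is left out, and it assumes it is run from the root of a kernel git checkout.

```python
import subprocess
from typing import List

FIX_COMMIT = "ca386aa7"  # workaround commit named in this thread


def overlay_commands(fix_commit: str) -> List[List[str]]:
    """Git commands that produce the fix's diff and apply it to the work tree.
    The output of the first command is meant to be fed to the second."""
    return [
        ["git", "diff", f"{fix_commit}^", fix_commit],  # changes introduced by the fix
        ["git", "apply"],                               # reads the patch from stdin
    ]


def apply_overlay(fix_commit: str = FIX_COMMIT) -> None:
    """Apply the fix on top of whatever commit git bisect has checked out."""
    diff_cmd, apply_cmd = overlay_commands(fix_commit)
    patch = subprocess.run(diff_cmd, check=True, capture_output=True).stdout
    subprocess.run(apply_cmd, input=patch, check=True)


def drop_overlay() -> None:
    """Discard the applied changes to tracked files so the next bisect
    step starts from a clean tree."""
    subprocess.run(["git", "reset", "--hard"], check=True)
```

One bisect iteration would then be: `apply_overlay()`, build and test the kernel, `drop_overlay()`, and finally `git bisect good` or `git bisect bad`. As noted above, the overlay has to be skipped when the checked-out commit is ca386aa7 itself.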

Thanks Bastian, I'll apply and test this today.

Fedora 5.10.7-200.fc33.x86_64 still the same issues. Visual 'glitches', for example when viewing file properties and sudden, random GPU freeze.

This has become a serious issue.

lspci VGA compatible controller: NVIDIA Corporation GT216GLM [Quadro FX 880M] (rev a2)

One way to recover without a hard reset is to suspend and resume. Works with 5.10.10-200.fc33.x86_64.

I have bisected the changes within f844eb48 on the file level and found that the bug (at least for my hw) is caused by changes to the file drivers/gpu/drm/nouveau/dispnv50/base827c.c. I then inspected the diff for that file and although I can't fully claim to know what is going on I think the following fixes the bug:

--- a/drivers/gpu/drm/nouveau/dispnv50/base827c.c
+++ b/drivers/gpu/drm/nouveau/dispnv50/base827c.c
@@ -49,7 +49,11 @@ base827c_image_set(struct nv50_wndw *wndw, struct nv50_wndw_atom *asyw)
                          NVVAL(NV827C, SET_CONVERSION, OFS, 0x64));
        } else {
                PUSH_MTHD(push, NV827C, SET_PROCESSING,
-                         NVDEF(NV827C, SET_PROCESSING, USE_GAIN_OFS, DISABLE));
+                         NVDEF(NV827C, SET_PROCESSING, USE_GAIN_OFS, DISABLE),
+
+                                       SET_CONVERSION,
+                         NVVAL(NV827C, SET_CONVERSION, GAIN, 0) |
+                         NVVAL(NV827C, SET_CONVERSION, OFS, 0));
        }
 
        PUSH_MTHD(push, NV827C, SURFACE_SET_OFFSET(0, 0), asyw->image.offset[0] >> 8,

Reason is that f844eb48 contained this part:

 	} else {
-		PUSH_NVSQ(push, NV827C, 0x0110, 0,
-					0x0114, 0);
+		PUSH_MTHD(push, NV827C, SET_PROCESSING,
+			  NVDEF(NV827C, SET_PROCESSING, USE_GAIN_OFS, DISABLE));
 	}

Note that the "0x0114, 0" section was not converted to the new macros.

Here's a new patch for any interested party to try: fix.patch
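
The omission can be made concrete by comparing which methods each version of the `else` branch pushes. The sketch below models a push as a list of (method, data) pairs; it is an illustration, not driver code. That `SET_PROCESSING` and `SET_CONVERSION` correspond to 0x0110 and 0x0114, and that `USE_GAIN_OFS = DISABLE` and zero `GAIN`/`OFS` encode as a data word of 0, is inferred from the two diffs above rather than checked against the hardware headers.

```python
from typing import List, Tuple

Push = List[Tuple[int, int]]  # (method offset, data word)

# Inferred from the diffs in this thread, not taken from NVIDIA's headers.
SET_PROCESSING = 0x0110
SET_CONVERSION = 0x0114

before_f844eb48: Push = [(0x0110, 0), (0x0114, 0)]           # PUSH_NVSQ version
after_f844eb48: Push = [(SET_PROCESSING, 0)]                 # converted version
with_fix: Push = [(SET_PROCESSING, 0), (SET_CONVERSION, 0)]  # proposed fix


def dropped_methods(old: Push, new: Push) -> List[int]:
    """Methods the old sequence pushed that the new one no longer does."""
    pushed = {method for method, _ in new}
    return [method for method, _ in old if method not in pushed]


print([hex(m) for m in dropped_methods(before_f844eb48, after_f844eb48)])  # ['0x114']
print(dropped_methods(before_f844eb48, with_fix))                          # []
```

The comparison only shows which methods are emitted; it does not model what the display engine does when 0x0114 is left unprogrammed.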

Applied the fix and stressed the GPU for a good hour now without any sign of artefacts or timeouts. I'll update if any occur as I continue testing it this afternoon, but it's looking good - thanks a lot!

I still observe the kernel: pci 0000:02:00.0: error -61 assigning properties message during boot but that now appears to be unrelated and I suffer no obvious effects as a result.

Thanks again Bastian, appreciate your efforts. Phil

Patch is working amazingly here too! No more tearing. Freezes seem to be gone too.

Verified against 5.10.7 /w GT216GLM (Tesla 2)

Yeah, I think this change looks alright. Mind sending it out to the mailing list, or at least uploading a proper git format-patch file?

@karolherbst done!

Somebody pointed out (https://bugzilla.kernel.org/show_bug.cgi?id=210333) that the file base507c.c should have been fixed as well, so I have sent a second version of the patch to the mailing list and I will also upload it here: 0001-drm-gpu-nouveau-dispnv50-Restore-pushing-of-all-data.patch

EDIT: reuploaded patch after a bug fix.

I have opened a bug at openSUSE so they can integrate it into the current kernel, 5.10.7.

Thanks, it seems to be working. I still have some freezes but it seems to be some other bug.

kernel.log

marked #27 (closed) as a duplicate of this issue

marked this issue as related to #27 (closed)

Working here too, no more freezes, but flickering is still present when switching between workspaces, when watching videos, during some web browsing, etc. Currently on Fedora 33, kernel 5.10.8-200.fc33.x86_64.

You're right. I also still have flickering, but not when I'm using a linux-rt kernel. That's why I didn't notice it before. Strange.

I don't observe any flickering with my HW. Could be a separate bug?

Yeah, no flickering here after 4 days running patched[1] v5.10.9 on GTX 560.

[1] https://copr.fedorainfracloud.org/coprs/mkrupcale/kernel/build/1891178/

Well, I used kernel 5.10.9-201.fc33.x86_64 on my Fedora 33 (there was an update since my last comment) for a couple of days until a few hours ago, when a new freeze occurred with the same messages (drm base-timeout...), so I switched back to kernel 5.8.18, which was the last stable one for me.

So I see two points here:

Maybe Fedora upstream maintainers haven't yet included the patch in their kernel packages.
Regarding the flickering, it may be a separate bug. But it appeared at the same time as this bug, and moreover, with kernel 5.8.18 there is neither freezing nor flickering, which is why I grouped the two together.

Anyway, I'll give Matthew Krupcale's patched kernel a test and report back.

I don't think the fedora kernel 5.10.9-201.fc33.x86_64 includes the patch. I don't use fedora, but a quick inspection of the kernel.spec file shows that it is not included there, yet.
There have been lots of individual changes in Linux 5.9 for nouveau, so it could well be a separate bug.

That's exactly what I told myself too when the freeze occurred a couple of hours ago. It hadn't occurred to me before, as things seemed to work for four or five days after the last update without rebooting.
Well, it could be. I'm already trying Matthew's patched kernel, and since I rebooted into it a few minutes ago there has been no freezing and no flickering either. I'll keep testing it for a couple of days, but it seems to be resolved.

Anyway, thank you for your involvement; this issue went unsolved for so long.

Just to note that I've been running patched versions of 5.10.{7,8,9,10} successively for more than 10 days now without any sign of flickering or crash.

Perhaps a question for @karolherbst - do you have any indication on when this patch might be merged into the 5.10 branch? I have an open bug report with Debian and I'm just thinking whether to suggest they patch bullseye pre-emptively or wait for the upstream fix.

working on it. But before we can backport it to stable releases it has to either land in the nouveau or drm-next tree. But we are also in the process of allowing MRs here to speed up such things.

Thanks Karol, much appreciated

Sorry for the noise. The linux-rt kernel I'm using most of the time is patched for sure, but the stock non-rt kernel coming from my distribution probably not. That's why I experienced flickering with it…

@bastianbeischer : Thanks for the patch. However it does apply but does not compile: there's still a bug in it.

File base507c.c, line 91: a premature extra closing parenthesis is left at the end of the line.

Hi @monnerat. Yes this was a mistake in a prior version, but I think I have uploaded a fixed one already. Could you try re-downloading the patch? It's this one: https://gitlab.freedesktop.org/drm/nouveau/uploads/8844d508dbe905daf9802007dc1c7e03/0001-drm-gpu-nouveau-dispnv50-Restore-pushing-of-all-data.patch

Thanks for this new link: this version is OK.

Thank you guys for your work. I'm having the same problems running arch with 5.10.10 kernel. My error is:

nouveau 0000:02:00.0: DRM: base-0: timeout

and when I open any video files using vlc:

nouveau 0000:02:00.0: Direct firmware load for nouveau/nvac_fuc084 failed with error -2

nouveau 0000:02:00.0: Direct firmware load for nouveau/nvac_fuc084d failed with error -2

nouveau 0000:02:00.0: msvld: unable to load firmware data

nouveau 0000:02:00.0: msvld: init failed, -19

The first error causes a GPU crash for me. I'm thankful to Bastian Beranek for his patch, but I have no clue how to apply it. Can anyone please guide me to a solution or point me to a manual on applying patches? Thank you all in advance.

Masoud, this is a bit offtopic, but you can use this PKGBUILD for Arch Linux PKGBUILD_linux-5.10.10.arch1-2.tar.gz (if you don't know what this is: Read up on the Arch Build System and "makepkg" on the Arch wiki). Beware that compiling the kernel takes a while and requires 15 to 20 GB of disk space.

Alternatively you could just use the -lts kernel for now and wait until the fix is available in the Arch kernel (either because it was included in the upstream kernel, or because the Arch kernel maintainer applied the patch manually).

@karolherbst I don't know the timeline for getting this included - will it make it for 5.11? I guess it would take a bit longer until we can see it in the stable kernels. Do you think the patch is safe enough to be applied downstream?

thank you Bastian, much appreciated --<@

mentioned in issue #47 (moved)
