[NV50, Linux 5.9] Regression: Visual artifacts and eventual GPU crash
With my GeForce GTX 285
04:00.0 VGA compatible controller: NVIDIA Corporation GT200b [GeForce GTX 285] (rev a1)
I experience a regression after updating from Linux 5.8.14 to Linux 5.9: visual artifacts appear on the screen after a few minutes, leading to a lockup of my display and forcing me to reboot the PC. Downgrading fixes the issue.
Other software versions:
X: 1.20.9
Mesa 20.2.0
KDE Plasma 5.20.0 (kwin)
The journalctl contains the following errors:
Oct 14 16:46:17 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
Oct 14 16:46:19 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
Oct 14 16:46:21 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
Oct 14 16:46:23 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
Oct 14 16:46:25 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
(repeats)
...issue command to shut down here...
Oct 14 16:48:18 bastian-desktop kernel: nouveau 0000:04:00.0: DRM: base-0: timeout
Oct 14 16:48:18 bastian-desktop kernel: nouveau 0000:04:00.0: disp: ERROR 5 [INVALID_STATE] 0b [] chid 1 mthd 0080 data 00000001
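For anyone who wants to compare their own logs with the excerpt above, here is a minimal sketch that counts the timeout lines and prints any display-engine errors. The function name is hypothetical and the patterns are taken only from the messages quoted in this report, so other variants of the error may not be matched.

```shell
#!/bin/sh
# Summarize nouveau display errors from kernel log text supplied on stdin.
# Counts "DRM: base-0: timeout" lines and echoes every "disp: ERROR" line.
summarize_nouveau_log() {
    awk '
        /nouveau .*DRM: base-0: timeout/ { timeouts++ }
        /nouveau .*disp: ERROR/          { errors[++n] = $0 }
        END {
            printf "timeouts: %d\n", timeouts
            for (i = 1; i <= n; i++) print errors[i]
        }
    '
}

# Possible usage on a systemd machine, after rebooting from a lockup
# (kernel messages of the previous boot):
#   journalctl -k -b -1 | summarize_nouveau_log
```

Attaching the full kernel log is still more useful to the developers than a summary like this.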
I already tried to run a bisect, but it wasn't conclusive: I think I made mistakes when tagging versions good or bad, because it sometimes takes a while for the issue to manifest itself. The first commit that had errors in the journal was 5de5b6ec, but there were probably bad commits before it that I didn't tag correctly.
Yes, disregard that version above. Currently the first bad version is fe4249af, but I'm still working on the bisect. It takes a while though because it turns out that it can take a few hours before the bug appears, which also means that it's easy to mistag bad versions as good.
I just wanted to inform you that I'm facing a similar, and probably the same, issue. I see the same DRM: base-0: timeout messages. Additionally, my dmesg is completely filled with kernel warnings (see the attached file). I can also report that only one of my two monitors is frozen (or better: reduced to around 0.3 FPS), and repeatedly calling dmesg seems to heal the lock sometimes (though that may be incidental).
@karolherbst - I finished bisection. This is the first commit with issues:
commit 0a96099691c8cd1ac0744ef30b6846869dc2b566
Author: Ben Skeggs <bskeggs@redhat.com>
Date: Tue Jul 21 11:34:07 2020 +1000

    drm/nouveau/kms/nv50-: implement proper push buffer control logic

    We had a, what was supposed to be temporary, hack in the KMS code where we'd completely drain an EVO/NVD channel's push buffer when wrapping to the start again, instead of treating it as a ring buffer.

    Let's fix that, finally.

    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
I have been running the commit before for two days now and didn't encounter any problems, so I would say it's very likely that the above is correct.
Hello, just to mention that the same is happening on my 01:00.0 VGA compatible controller: NVIDIA Corporation GT218M [NVS 3100M] (rev a2) with the current openSUSE Tumbleweed kernel 5.9.1-1-default.
Ohh, maybe we should have pinged @skeggsb on this issue... but I am also not sure whether we already addressed it in 5.10 or not. I seem to remember it came up, but maybe it was this issue all along.
I have been experiencing the same issue for a long time now, in fact ever since I upgraded my system to kernel 5.9.8.
I'm using Fedora 33 with ZFS on root. I had been looking for a thread like this for a month, and it was a real PITA to find. Because my kernel is tainted (zfs modules...), I also couldn't report the issue earlier: Fedora's "abrt" policy doesn't allow reports from tainted kernels.
So, as I said, the issue recurs very frequently here whenever I boot a 5.9.X kernel. I tested "5.9.8", "5.9.9", "5.9.10", and today I upgraded to "5.9.11" for testing. As some people said here, falling back to a 5.8 kernel (more precisely, in my case, 5.8.18-300) solves the problem. The following packages seem to be involved:
kernel-5.9
xorg-x11-drv-nouveau
Keeping the 5.8 kernel solves the issue for the moment, but the downside is that Fedora upstream is moving forward with kernel updates, and of course I would like to follow that.
The main problem I'm facing is that I run my system with ZFS on root, which implies DKMS modules, and some issue makes it impossible to boot previous kernels once an upgrade to a more recent one has been done. It seems that something doesn't allow more than one set of zfs-dkms modules, compiled for a given kernel, at a time (that has nothing to do with this issue, of course).
When the system crashes due to this regression, I first have to reboot into kernel 5.9.X, remove all installed DKMS modules, rebuild and install the DKMS modules for kernel 5.8.18 only, and then reboot again into that kernel. One can imagine how hard it is to keep a workstation functional this way.
So, even though my knowledge of debugging is quite limited, I'm really willing to help here and provide whatever is needed to fix this, with someone's guidance.
Are you certain you're seeing the same issue? The commit you bisected to did indeed have issues, as there's a HW bug that we needed to work around. That was fixed by ca386aa7.
I'd be very interested in seeing a full kernel log after the issue has reproduced on 5.10.
Also, it's possible that one of the later commits before the fix could be causing this, but it's going to be harder to track down without first isolating the HW bug we triggered.
My suggestion here would be to rebase and move the commit I mentioned right after the one you bisected to (or just squash them together), see if you can reproduce it there. Hopefully you can't, and then you'll be able to bisect the later commits and find what else is broken.
The symptoms start to appear around 22:57:40 (which is when the first DRM: base-0: timeout is printed). I think I logged in remotely about one minute later and triggered a system shutdown around 22:59:02, so the final block of nouveau messages appeared during the shutdown sequence.
I see your point about squashing the two commits together. But ca386aa7 was already part of the 5.9.0 kernel, so that can't be the fix - otherwise I would have never seen the bug. Maybe 0a960996 introduced more than one problem? Or indeed one of the commits before 0a960996 could be responsible - it is tough to say since each iteration of bisection requires running the system for at least 12 - 24 hours and even then it's not guaranteed to trigger. However I will try again as soon as I can and I will squash the two commits that time.
The point of squashing them is to be able to bisect the commits immediately after it to see if one of them introduced additional issues that can't be bisected while the HW bug is present. The HW bug presents slightly differently to what you're seeing, there's a very specific error code thrown by the display engine for that one.
It's possible that 0a960996 has more issues (which is why I suggest testing it squashed before trying another bisect), but I'd be surprised, as that would probably affect things across the board, and I haven't encountered any problems.
Fast flashing horizontal stripes occur in windows (not screen) when content changes. A good way to show this is using the vertical slider on a tall page in firefox. I'm not sure, but I suspect the stripes color is the text color.
When the system becomes unstable (i.e. 90% frozen), I can switch to text mode (Ctrl+Alt+F2) and then back to restore stability, rather than rebooting.
Playing a video in FF increases the "chance" of instability.
I'm getting the SAME lockup on GT216GLM (Tesla 2). I can reproduce this crash extremely easily on this severely under-powered GPU, within about 10-20 minutes.
This driver does do something noticeably different in 5.9 with its draw on this GPU:
During alt-tabs in Gnome 3, I'm seeing a very discernible kind of screen tearing when the GPU is under load: The tear starts from the top left, heading on a perfect 45 degree angle in a down/right direction, until the tear gets to the horizontal center of the display, then it cuts horizontally rightward across the middle of the screen, until the tear gets to a point where it can draw a 45-degree angle downward to the bottom right corner of the display. This tear is not at all visible in 5.8, but it's extremely pronounced on this hardware.
I can lock up the display with 2 cores hitting 100% in kernel time once this happens, and it's pretty much impossible for me to get out of the crash -- but the system still responds via SSH when it does. Crash likelihood is greatly enhanced when playing videos in FF, alt-tabbing to VLC, and playing videos with higher resolution in VLC.
It's very quick for me to test revisions on this hardware. Is there a certain set of revisions / reverts you want to try? I'm on Gentoo so I can compile my own kernels.
Just quickly popping in to mention (as reported by @monnerat and @yuri_sevatz) that in my experience too, problems arise in blit-intensive tasks (e.g. in Firefox, scrolling and watching YouTube).
It should probably be possible to create a small blit-intensive synthetic benchmark that quickly causes the problem, so that it can be triggered within minutes.
Has anyone tested kernel 5.10.1? I just received the update from my distro (openSUSE), but due to all the work involved with the current pandemic (hunting the new strain), I won't have much time to spend testing. Is anybody else in for putting their GPU under some "scrolling/video playing/blitting" stress test (a.k.a. slacking on YouTube a lot)?
Don't forget old cards when fixing. I have the same problem with an NVIDIA G98 (298200a2).
As a side note, I never had problems using nouveau with that card during the last ten years (at least). The first problems began around 4.19 (as I remember), rarely and randomly at first, but then more and more often. Now I have freezes every day. It may be new bugs in nouveau, but more probably old bugs exposed by newer software…
Further tests if it can help to narrow down the root cause:
frequent timeouts and freezes with Gnome+Mutter+Wayland.
less frequent timeouts and freezes with Gnome+Mutter+Xorg. Some screen artifacts (or delays?) in Firefox when switching tabs, though, but nothing in the logs for that.
Problems with LXQt+Mutter.
No problems with LXQt+Openbox (but I suppose it doesn't use DRM at all).
I also tried KWin+Wayland, KWin+Xorg, Enlightenment, but they had themselves problems before Nouveau :-<.
I'll stick with Openbox for the time being though I would prefer my Gnome-shell back!
still occurring with
01:00.0 VGA compatible controller: NVIDIA Corporation GT216M [GeForce GT 330M] (rev a2)
and linux 5.10.6
Using Gnome Wayland + Mutter
IMHO: this issue MUST be fixed before 5.10 is declared LTS.
I might be experiencing the same problem.
I updated the kernel from 5.6 to 5.9.8 and started getting these artifacts/flickering.
You can see them easily on the upper part of the screen:
VID_20201117_084247
Then I downgraded to 5.8.18 and the problem was gone.
My video card is:
Perhaps relevant - comparing dmesg following system boot between 5.4.89 (no graphical artefacts nor nouveau timeouts) and 5.10.x, I observe the following error during boot relating to my graphics card:
@karolherbst @skeggsb Here are the results of my bisection:
I investigated the range of commits from 0a960996 to ca386aa7, restricting git bisect to changes in drivers/gpu/drm/nouveau. This time I applied the diff of ca386aa7 as a patch after checking out each commit (except when building that commit itself, of course), so that the change in ca386aa7 is always included.
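For others who want to repeat this, the per-step procedure could be scripted roughly as follows. This is only a sketch of one way to do it, not the exact commands used here: the function names are hypothetical, and the kernel build and test step is left out.

```shell
#!/bin/sh
# Fix commit named in this thread (WAR for the EVO push buffer HW bug).
FIX=ca386aa7155a5467fa7b2b8376f4da8f8e59be4d

# Apply the diff of the given commit to the working tree without committing.
apply_fix() {
    git diff "$1^" "$1" | git apply
}

# Reverse the same diff so that git bisect can check out the next commit
# from a clean working tree.
unapply_fix() {
    git diff "$1^" "$1" | git apply -R
}

# One bisect iteration, done by hand:
#   apply_fix "$FIX"      # skip when the checked-out commit is $FIX itself
#   ...build, boot and test the kernel...
#   unapply_fix "$FIX"
#   git bisect good       # or: git bisect bad
```

Keeping the fix as an uncommitted working-tree change means the bisect history itself stays untouched, at the cost of having to undo the change before every `git bisect good` or `git bisect bad`.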
Here is the bisect log:
git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [0a96099691c8cd1ac0744ef30b6846869dc2b566] drm/nouveau/kms/nv50-: implement proper push buffer control logic
git bisect good 0a96099691c8cd1ac0744ef30b6846869dc2b566
# bad: [ca386aa7155a5467fa7b2b8376f4da8f8e59be4d] drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug
git bisect bad ca386aa7155a5467fa7b2b8376f4da8f8e59be4d
# bad: [fb3939e232f6120387f20a26510894a17680db3a] drm/nouveau/kms/nv50-: use NVIDIA's headers for core head_view()
git bisect bad fb3939e232f6120387f20a26510894a17680db3a
# good: [fccc858003f3f3e7a8fa272f118eb71d218a2b32] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw sema_set()
git bisect good fccc858003f3f3e7a8fa272f118eb71d218a2b32
# bad: [1070832b1eab7309c59d9564ed26f84932fed817] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_clr()
git bisect bad 1070832b1eab7309c59d9564ed26f84932fed817
# good: [75bd8304e61c01d2bc5df46fcf9c2e9838b3a246] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw ntfy_wait_begun()
git bisect good 75bd8304e61c01d2bc5df46fcf9c2e9838b3a246
# good: [6833d2a0c778252929805fabfdc89e4e181fcb82] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw xlut_set()
git bisect good 6833d2a0c778252929805fabfdc89e4e181fcb82
# bad: [f844eb485eb056ad3b67e49f95cbc6c685a73db4] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()
git bisect bad f844eb485eb056ad3b67e49f95cbc6c685a73db4
# good: [66f7b7bddfe60a708c7711e47c95d20db05e2110] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw xlut_clr()
git bisect good 66f7b7bddfe60a708c7711e47c95d20db05e2110
# first bad commit: [f844eb485eb056ad3b67e49f95cbc6c685a73db4] drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()
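A log in this format can be saved and replayed by others who want to continue from the same state. As a small sketch, assuming the log is stored one entry per line (the form `git bisect log` prints), the hypothetical helper below pulls out the hash of the first bad commit:

```shell
#!/bin/sh
# Print the hash from the "# first bad commit: [<hash>] <subject>" line of a
# git bisect log read from stdin. Prints nothing if the bisect is unfinished.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)].*/\1/p'
}

# Possible usage inside the kernel tree:
#   git bisect log > bisect.log          # save the current bisect state
#   first_bad_commit < bisect.log
#   git bisect reset
#   git bisect replay bisect.log         # restore that state later
```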
The first bad commit is:
commit f844eb485eb056ad3b67e49f95cbc6c685a73db4
Author: Ben Skeggs <bskeggs@redhat.com>
Date: Sat Jun 20 13:08:47 2020 +1000

    drm/nouveau/kms/nv50-: use NVIDIA's headers for wndw image_set()

    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
    Reviewed-by: Lyude Paul <lyude@redhat.com>
The issue can be reliably observed in glxgears: artifacts such as white stripes appear in f844eb48 and following commits, but not before. I didn't wait for the subsequent crash of the GPU.
It turns out that the difficulty I reported earlier (sometimes more than 10 hours before "the bug" triggers) was probably a different issue, which was fixed by ca386aa7, because I didn't observe that anymore at all. The problem introduced by f844eb48, in turn, usually manifests within a few minutes and can be provoked with tools such as glxgears.
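A reproduction run along these lines could be semi-automated by starting a GL load and then blocking until the kernel log shows the first timeout. This is a sketch under assumptions: the helper name is hypothetical, it matches only the timeout message quoted in this thread, and the visual artifacts themselves still have to be judged by eye.

```shell
#!/bin/sh
# Print the first "DRM: base-0: timeout" line seen on stdin and stop reading.
# Exit status is 0 if such a line was found, 1 if the input ended without one.
first_timeout() {
    grep -m 1 'DRM: base-0: timeout'
}

# Possible usage on a systemd machine (not run here):
#   glxgears >/dev/null 2>&1 &
#   journalctl -k -f --since now | first_timeout
#   kill $!
# Note: after grep exits, journalctl may keep running until it next tries
# to write a line to the closed pipe.
```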
I'm now trying to revert f844eb48 on top of 5.10.7 and will report back if this fixes all problems.
EDIT: I reverted the actual code changes in f844eb48 only, but left the new headers and the includes: revert.patch. Applying this seems to cure the problem for me. Maybe someone else can try this as well.
I have bisected the changes within f844eb48 on the file level and found that the bug (at least for my hw) is caused by changes to the file drivers/gpu/drm/nouveau/dispnv50/base827c.c. I then inspected the diff for that file and although I can't fully claim to know what is going on I think the following fixes the bug:
Applied the fix and stressed the GPU for a good hour now without any sign of artefacts or timeouts. I'll update if any occur as I continue testing it this afternoon, but it's looking good - thanks a lot!
I still observe the kernel: pci 0000:02:00.0: error -61 assigning properties message during boot but that now appears to be unrelated and I suffer no obvious effects as a result.
Thanks again Bastian, appreciate your efforts.
Phil
Working here too, no more freezes but flickering is still present when switching between workspaces and/or on video watching, some web browsing, etc...
Actually on Fedora 33, kernel 5.10.8-200.fc33.x86_64.
Well, I used kernel 5.10.9-201.fc33.x86_64 on my Fedora 33 (there was an update since my last comment) for a couple of days until a few hours ago, when a new freeze occurred with the same messages (drm base-timeout...), so I switched back to kernel 5.8.18, which was the last stable one for me.
So I see two points here:
Maybe the Fedora maintainers haven't yet included the patch in their kernel packages.
Regarding the flickering, it may be a separate bug. But it appeared at the same time as this bug, and with kernel 5.8.18 there is neither freezing nor flickering, which is why I grouped them together.
Anyway, I'll give Matthew Krupcale's patched kernel a try and report back.
I don't think the fedora kernel 5.10.9-201.fc33.x86_64 includes the patch. I don't use fedora, but a quick inspection of the kernel.spec file shows that it is not included there, yet.
There have been lots of individual changes in Linux 5.9 for nouveau, so it could well be a separate bug.
That's exactly what I told myself too when the freeze occurred a couple of hours ago. But it hadn't crossed my mind before, as it seemed to work for four or five days after the last update without rebooting.
Well, it could be. I'm already trying Matthew's patched kernel, and since I rebooted into it a few minutes ago there has been no freezing and no flickering either. I'll keep testing it for a couple of days, but it seems to be resolved.
Anyway, thank you for your involvement; this issue went unsolved for so long.
Just to note that I've been running patched versions of 5.10.{7,8,9,10} successively for more than 10 days now without any sign of flickering or crash.
Perhaps a question for @karolherbst - do you have any indication on when this patch might be merged into the 5.10 branch? I have an open bug report with Debian and I'm just thinking whether to suggest they patch bullseye pre-emptively or wait for the upstream fix.
working on it. But before we can backport it to stable releases it has to either land in the nouveau or drm-next tree. But we are also in the process of allowing MRs here to speed up such things.
Sorry for the noise. The linux-rt kernel I'm using most of the time is patched for sure, but the stock non-rt kernel coming from my distribution probably not. That's why I experienced flickering with it…
Thank you guys for your work. I'm having the same problems running arch with 5.10.10 kernel.
My error is:
nouveau 0000:02:00.0: DRM: base-0: timeout
and when I open any video files using vlc:
nouveau 0000:02:00.0: Direct firmware load for nouveau/nvac_fuc084 failed with error -2
nouveau 0000:02:00.0: Direct firmware load for nouveau/nvac_fuc084d failed with error -2
nouveau 0000:02:00.0: msvld: unable to load firmware data
nouveau 0000:02:00.0: msvld: init failed, -19
The first error causes a GPU crash for me.
I'm thankful to Bastian Beranek for his patch, but I have no clue how to apply the patch. Can anyone please guide me to a solution or give me a manual to read to apply the patch?
Thank you all in advance
Masoud, this is a bit offtopic, but you can use this PKGBUILD for Arch Linux PKGBUILD_linux-5.10.10.arch1-2.tar.gz (if you don't know what this is: Read up on the Arch Build System and "makepkg" on the Arch wiki). Beware that compiling the kernel takes a while and requires 15 to 20 GB of disk space.
Alternatively you could just use the -lts kernel for now and wait until the fix is available in the Arch kernel (either because it was included in the upstream kernel, or because the Arch kernel maintainer applied the patch manually).
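For a kernel built by hand from a source tree rather than through a PKGBUILD, applying a patch file generally looks like the sketch below. The function name and the paths in the usage comment are placeholders, and the patch path should be absolute because the function changes directory first.

```shell
#!/bin/sh
# Apply a patch file to a source tree, but only if it applies cleanly.
# Usage: apply_patch_checked <source-dir> <absolute-path-to-patch>
apply_patch_checked() {
    (
        cd "$1" || exit 1
        # --check is a dry run: it reports problems without changing files.
        git apply --check "$2" || exit 1
        git apply "$2"
    )
}

# Example with placeholder paths:
#   apply_patch_checked "$HOME/src/linux-5.10.10" "$HOME/Downloads/revert.patch"
```

After that the kernel has to be configured, built and installed in the usual way for the distribution.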
@karolherbst I don't know the timeline for getting this included - will it make it for 5.11? I guess it would take a bit longer until we can see it in the stable kernels. Do you think the patch is safe enough to be applied downstream?