Those lines were pulled from dmesg of 6.3-rc4 right after starting Xorg. No virtual machines running. If I drop out of X and
start it back up, there will be more each time.
I noticed them when I upgraded my kernel from the 6.1 series to the 6.2 series. The above message was from the 6.3-rc4 kernel I built this morning.
I did a kernel bisect starting from the last drm/nouveau commit of 6.1.21 through to a later drm/nouveau commit of 6.2.8.
Performance seems very degraded, and this is mainly noticeable when anything graphical is being displayed on the monitor, even
a web page (my CPU usage often exceeds 200% while Firefox loads a site on affected kernel versions). On the 6.3-rc4 kernel,
the CPU still spikes, but performance seems slightly better than on 6.2.8 (marginal, I think).
Contents of BISECT_LOG:
# bad: [4cc8ba135745ce729e21f42b94daeb3fed72a132] drm/nouveau/fb/gp102-: cache scrubber binary on first load
# good: [97061d441110528dc02972818f2f1dad485107f9] nouveau: fix migrate_to_ram() for faulting page
git bisect start '4cc8ba1' '97061d4' '--' 'drivers/gpu/drm/nouveau'
# good: [0b1bb1296f288bb7164d143ca82dc958f87cbff6] drm/nouveau/fifo: kill channel on NV_PPBDMA_INTR_1_CTXNOTVALID
git bisect good 0b1bb1296f288bb7164d143ca82dc958f87cbff6
# bad: [0d7557072414af191cefbaa7c908e1c09f5b7d7b] drm/nouveau/gr/gf100-: gpfifo_ctl zero before init
git bisect bad 0d7557072414af191cefbaa7c908e1c09f5b7d7b
# bad: [21876b0e4284169ddbc834d02f60940a3dd27471] drm/nouveau/gr/tu102: remove gv100_grctx_unkn88c
git bisect bad 21876b0e4284169ddbc834d02f60940a3dd27471
# bad: [ca081fff6ecc63c86a99918230cc9b947bebae8a] drm/nouveau/gr/gf100-: generate golden context during first object alloc
git bisect bad ca081fff6ecc63c86a99918230cc9b947bebae8a
# good: [ccdc043123d2a485e173e5e2627598151b7850b3] drm/nouveau/pmu: move init() falcon reset to non-nvfw code
git bisect good ccdc043123d2a485e173e5e2627598151b7850b3
# bad: [e3f324956a32d08a9361ee1e3beca383f1b01eba] drm/nouveau/fb/gp102-: unlock VPR right after devinit
git bisect bad e3f324956a32d08a9361ee1e3beca383f1b01eba
# good: [c7c0aac7421331baffdeb8f9c3e9702bdb1c0389] drm/nouveau/sec2: switch to newer style interrupt handler
git bisect good c7c0aac7421331baffdeb8f9c3e9702bdb1c0389
# good: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs (VPR scrubber)
git bisect good 0e44c21708761977dcbea9b846b51a6fb684907a
Contents of BISECT_START:
linux-6.2.y
Contents of BISECT_EXPECTED_REV:
5728d064190e169f1a42381bd7e5fc4d411f3188
My criterion for good or bad was solely whether I found any AMD-Vi IO_PAGE_FAULT messages after starting Xorg.
I realize that may have been faulty logic. Performance seemed more responsive on the 'good' tests, but that could
be subjective on my part.
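The good/bad criterion above can be captured in a small helper. This is a hypothetical sketch (the function name and the exact match pattern are my own invention), fed with saved dmesg output collected after starting Xorg:

```shell
# check_iopf: hypothetical helper -- print "bad" and fail if the given
# kernel log contains an AMD-Vi IO_PAGE_FAULT line, else print "good".
check_iopf() {
    if grep -q 'AMD-Vi.*IO_PAGE_FAULT' "$1"; then
        echo bad
        return 1
    fi
    echo good
}

# typical use after booting a candidate kernel and starting Xorg:
#   dmesg > /tmp/dmesg-after-xorg.txt
#   check_iopf /tmp/dmesg-after-xorg.txt
```

Since each bisect step here requires building and booting a kernel, a helper like this would be run by hand after each boot rather than through `git bisect run`.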
I decided to try the 6.3-rc kernel because while perusing recent commits, there seemed to be a lot of recent changes to the iommu. I am intending to build the 'next' kernel after I finish writing this.
Hardware specs for this system:
AMD FX-6100 cpu
Asus Sabertooth 990 FX v1 mobo
Evga GTX 750 TI
16G Mushkin ram
Running Alpine Linux edge, using modesetting driver xorg-server 21.1.7, mesa 23.0.1, libdrm 2.4.115.
Any other info needed, let me know, will supply.
Is your system running with IOMMU support enabled by default or something? Because AMD-Vi is this IOMMU/virtualization thing, and I'm just wondering why we don't see it on other systems.
Hmmm, that seems a good question. I see that it is listed in my lsmod output, though to give a more relevant answer I will have to reboot into an affected kernel with the kvm_amd module blacklisted (I'm running 6.1.21 while getting sources set up to build linux-next). I am git-cloning linux-next right now; once it is cloned, I will reboot and check.
Edit:
It seems odd that kvm_amd should be loaded; I haven't run any VMs since boot. I might have to look into why it is autoloading
even though there is no call for it. Might be a udev thing.
Ok, with kvm_amd blacklisted, not loaded, still get the IO_PAGE_FAULT errors, same as before.
My understanding of the IOMMU was that it also handles things like remapping interrupts (?), so it's not just for virtualization, I think (maybe I am wrong). I must admit, the AMD-Vi part does seem misplaced outside of virtualization. Perhaps the issue is in the iommu code, or maybe in the interaction between the iommu and nouveau? I will build and then boot linux-next and hope for the best; it's at least worth a try.
Any suggestions for another bisect, maybe a more general one, are welcome. If there are no suggestions, I will start another, generic bisect before I go to bed (I work nights, and it is nearing my bedtime). It will probably take a day or two, as I can't start kernel builds at work.
What I'm confused about is why that commit in particular; it's just moving things around a little. I hope it's not something random and the cause is something else. It might also be that we are setting something up in an order that matters for this issue.
This is why I want to do a more general bisect without specifying any paths. I am no graphics hardware programmer, but the suspect commit didn't look related to the problem to my naive eyes.
In theory you can reuse your old last-good and last-bad commits. Sometimes issues also only happen randomly, which can be a bit of a pain. But yeah... bisecting only over certain paths sometimes also leads to weird results.
Might be quicker than I said... I am pushing the builds a bit; instead of my usual
make -j$(nproc)
I'm using
make -j -l12
Might push the next one a bit more, maybe -l15 or -l18, since my core temp is hovering around 45 °C (the joys of doing my own CPU/heatsink repastes). Will report results ASAP. Thanks for the help.
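For anyone puzzled by the flags: in GNU make, a bare `-j` means "no limit on parallel jobs", while `-l N` tells make not to start new jobs while the load average is above N, so `make -j -l12` throttles by system load rather than by a fixed job count. A quick demonstration on a throwaway Makefile (file contents invented for illustration):

```shell
# Two independent targets built under load-average throttling; with only
# trivial jobs the load limit never kicks in, but the flags are the same
# ones discussed above.
dir=$(mktemp -d)
printf 'all: a b\na:\n\t@echo built-a\nb:\n\t@echo built-b\n' > "$dir/Makefile"
make --no-print-directory -C "$dir" -j -l12
```

A middle ground is `make -j"$(nproc)" -l12`, which caps both the job count and the load.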
About the AMD-Vi portion of the error message: it looks like that is the standard format for a page fault on an AMD system with amd_iommu. I also tripped over this vger discussion about a similar error (though for amdgpu) that seems to shed a little light on the kinds of problems that might trigger the page fault: https://lore.kernel.org/lkml/bc7142a1-82d3-43bf-dee2-25f9297e7182@arm.com/T/
The bisect is still going. It's on the 9th iteration, 4 to go after this one. Should be done some time tomorrow.
# bad: [c9c3395d5e3dcc6daee66c6908354d47bf98cb0c] Linux 6.2
# good: [830b3c68c1fb1e9176028d02ef86f3cf76aa2476] Linux 6.1
git bisect start 'v6.2' 'v6.1'
# bad: [1ca06f1c1acecbe02124f14a37cce347b8c1a90c] Merge tag 'xtensa-20221213' of https://github.com/jcmvbkbc/linux-xtensa
git bisect bad 1ca06f1c1acecbe02124f14a37cce347b8c1a90c
# good: [8715c6d3100fc7c6edddf29af4a399a1c12d028c] Merge tag 'for-6.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
git bisect good 8715c6d3100fc7c6edddf29af4a399a1c12d028c
# bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag 'amd-drm-next-6.2-2022-12-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
# good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag 'amd-drm-next-6.2-2022-11-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
# bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag 'amd-drm-next-6.2-2022-11-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
# bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
# bad: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-: make global attrib_cb actually global
git bisect bad 78a43c7e3b2ff5aed1809f93b4f87a418355789e
# good: [eb39c613481fd2fe6b2f66ec2ca21f8fdcdd4cac] drm/nouveau/fifo: expose per-runlist CHID information
git bisect good eb39c613481fd2fe6b2f66ec2ca21f8fdcdd4cac
# good: [8ab849d6dd4c2eb8880096e53e91dfb6ca37b589] drm/nouveau/fifo: add new engine context handling
git bisect good 8ab849d6dd4c2eb8880096e53e91dfb6ca37b589
# good: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs (VPR scrubber)
git bisect good 0e44c21708761977dcbea9b846b51a6fb684907a
# bad: [4500031f86691a44ecbbebfc77872c60c5a1b8e6] drm/nouveau/ltc: split color vs depth/stencil zbc counts
git bisect bad 4500031f86691a44ecbbebfc77872c60c5a1b8e6
# bad: [2541626cfb794e57ba0575a6920826f591f7ced0] drm/nouveau/acr: use common falcon HS FW code for ACR FWs
git bisect bad 2541626cfb794e57ba0575a6920826f591f7ced0
# bad: [e3f324956a32d08a9361ee1e3beca383f1b01eba] drm/nouveau/fb/gp102-: unlock VPR right after devinit
git bisect bad e3f324956a32d08a9361ee1e3beca383f1b01eba
# bad: [5728d064190e169f1a42381bd7e5fc4d411f3188] drm/nouveau/fb: handle sysmem flush page from common code
git bisect bad 5728d064190e169f1a42381bd7e5fc4d411f3188
# first bad commit: [5728d064190e169f1a42381bd7e5fc4d411f3188] drm/nouveau/fb: handle sysmem flush page from common code
What happens if dma_map_page() in nvkm_fb_ctor() fails? nvkm_fb_ctor() uses dma_mapping_error() and returns -EFAULT if the mapping failed, but none of the callers of nvkm_fb_ctor() check the return value. My guess is data corruption when flush_page_init is called.
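To illustrate the failure mode being suggested, here is a hypothetical userspace sketch of the pattern (all names and the stand-in mapping function are invented for illustration; this is not the actual nvkm_fb_ctor code): the constructor detects and reports the mapping failure, but that only protects anything if callers propagate the error instead of pressing on with a bogus address.

```c
#include <assert.h>

#define EFAULT 14

typedef unsigned long dma_addr_t;

struct fb {
    dma_addr_t sysmem_flush_addr;    /* 0 stands in for a failed mapping */
};

/* stand-in for dma_map_page(); force_fail simulates a mapping failure */
static dma_addr_t fake_dma_map_page(int force_fail)
{
    return force_fail ? 0 : 0x1000;
}

/* sketch of the constructor: detects the failure and reports -EFAULT */
static int fb_ctor(struct fb *fb, int force_fail)
{
    fb->sysmem_flush_addr = fake_dma_map_page(force_fail);
    if (fb->sysmem_flush_addr == 0)
        return -EFAULT;              /* callers must check this */
    return 0;
}

/* the check the callers are reportedly missing: if fb_ctor's return value
 * is ignored, the flush page would later be programmed with address 0 */
static int fb_init(struct fb *fb, int force_fail)
{
    int ret = fb_ctor(fb, force_fail);
    if (ret)
        return ret;                  /* bail out before using the address */
    /* only now is fb->sysmem_flush_addr safe to hand to the hardware */
    return 0;
}
```

Under this reading, an unchecked failure would leave the "flush page" address as whatever the error value is, which matches the data-corruption guess above.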
Looks promising. I am building with those commits now. If build completes before I go to work, will report shortly. Otherwise, will report in the morning.