Those lines were pulled from dmesg of 6.3-rc4 right after starting Xorg. No virtual machines running. If I drop out of X and
start it back up, there will be more each time.
I noticed them when I upgraded my kernel from the 6.1 series to the 6.2 series. The above message was from the 6.3-rc4 kernel I built this morning.
I did a kernel bisect starting from the last drm/nouveau commit of 6.1.21 through to a later drm/nouveau commit of 6.2.8.
Performance seems very degraded, and this is mainly noticeable when anything graphical is being displayed on the monitor, even
a web page (my CPU usage often exceeds 200% while Firefox loads a site on affected kernel versions). On the 6.3-rc4 kernel,
the CPU still spikes, but performance seems slightly better than on 6.2.8 (marginal, I think).
Contents of BISECT_LOG:
# bad: [4cc8ba135745ce729e21f42b94daeb3fed72a132] drm/nouveau/fb/gp102-: cache scrubber binary on first load
# good: [97061d441110528dc02972818f2f1dad485107f9] nouveau: fix migrate_to_ram() for faulting page
git bisect start '4cc8ba1' '97061d4' '--' 'drivers/gpu/drm/nouveau'
# good: [0b1bb1296f288bb7164d143ca82dc958f87cbff6] drm/nouveau/fifo: kill channel on NV_PPBDMA_INTR_1_CTXNOTVALID
git bisect good 0b1bb1296f288bb7164d143ca82dc958f87cbff6
# bad: [0d7557072414af191cefbaa7c908e1c09f5b7d7b] drm/nouveau/gr/gf100-: gpfifo_ctl zero before init
git bisect bad 0d7557072414af191cefbaa7c908e1c09f5b7d7b
# bad: [21876b0e4284169ddbc834d02f60940a3dd27471] drm/nouveau/gr/tu102: remove gv100_grctx_unkn88c
git bisect bad 21876b0e4284169ddbc834d02f60940a3dd27471
# bad: [ca081fff6ecc63c86a99918230cc9b947bebae8a] drm/nouveau/gr/gf100-: generate golden context during first object alloc
git bisect bad ca081fff6ecc63c86a99918230cc9b947bebae8a
# good: [ccdc043123d2a485e173e5e2627598151b7850b3] drm/nouveau/pmu: move init() falcon reset to non-nvfw code
git bisect good ccdc043123d2a485e173e5e2627598151b7850b3
# bad: [e3f324956a32d08a9361ee1e3beca383f1b01eba] drm/nouveau/fb/gp102-: unlock VPR right after devinit
git bisect bad e3f324956a32d08a9361ee1e3beca383f1b01eba
# good: [c7c0aac7421331baffdeb8f9c3e9702bdb1c0389] drm/nouveau/sec2: switch to newer style interrupt handler
git bisect good c7c0aac7421331baffdeb8f9c3e9702bdb1c0389
# good: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs (VPR scrubber)
git bisect good 0e44c21708761977dcbea9b846b51a6fb684907a
Contents of BISECT_START:
linux-6.2.y
Contents of BISECT_EXPECTED_REV:
5728d064190e169f1a42381bd7e5fc4d411f3188
My criterion for good or bad was solely whether I found any AMD-Vi IO_PAGE_FAULT messages after starting Xorg.
I realize that may have been faulty logic. Performance seemed more responsive on the 'good' tests, but that could
be subjective on my part.
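The good/bad criterion above can be captured in a small helper. This is a hypothetical sketch (the function name and the exact match pattern are my own invention), fed with saved dmesg output collected after starting Xorg:

```shell
# check_iopf: hypothetical helper -- print "bad" and fail if the given
# kernel log contains an AMD-Vi IO_PAGE_FAULT line, else print "good".
check_iopf() {
    if grep -q 'AMD-Vi.*IO_PAGE_FAULT' "$1"; then
        echo bad
        return 1
    fi
    echo good
}

# typical use after booting a candidate kernel and starting Xorg:
#   dmesg > /tmp/dmesg-after-xorg.txt
#   check_iopf /tmp/dmesg-after-xorg.txt
```

Since each bisect step here requires building and booting a kernel, a helper like this would be run by hand after each boot rather than through `git bisect run`.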
I decided to try the 6.3-rc kernel because while perusing recent commits, there seemed to be a lot of recent changes to the iommu. I am intending to build the 'next' kernel after I finish writing this.
Hardware specs for this system:
AMD FX-6100 cpu
Asus Sabertooth 990 FX v1 mobo
Evga GTX 750 TI
16G Mushkin ram
Running Alpine Linux edge, using modesetting driver xorg-server 21.1.7, mesa 23.0.1, libdrm 2.4.115.
Any other info needed, let me know, will supply.
Is your system running with IOMMU support enabled by default or something? Because AMD-Vi is this IOMMU/virtualization thing, and I'm just wondering why we don't see it on other systems.
Hmmm, that seems a good question. I see that it is listed in my lsmod output, though to give a more relevant answer I will have to reboot into an affected kernel with the kvm_amd module blacklisted (I'm running 6.1.21 while getting sources set up to build linux-next). I am git-cloning linux-next right now; once it is cloned, I will reboot and check.
Edit:
It seems odd that kvm_amd should be loaded; I haven't run any VMs since boot. I might have to look into why it is autoloading
even though there is no call for it. Might be a udev thing.
Ok, with kvm_amd blacklisted, not loaded, still get the IO_PAGE_FAULT errors, same as before.
My understanding of the IOMMU was that it also handles things like remapping interrupts (?), so it's not just for virtualization, I think (maybe I am wrong). I must admit, the AMD-Vi part does seem misplaced outside of virtualization. Perhaps the issue is in the iommu code, or maybe in the interaction between the iommu and nouveau? I will build and then boot linux-next and hope for the best; it's at least worth a try.
Any suggestions for another bisect, maybe a more general one, are welcome. If there are no suggestions, I will start another, generic bisect before I go to bed (I work nights, and it is nearing my bedtime). It will probably take a day or two, as I can't start kernel builds at work.
What I'm confused about is why that commit in particular; it's just moving things around a little. I hope it's not something random and the cause is something else. It might also be that we are setting something up in an order that matters for this issue.
This is why I want to do a more general bisect without specifying any paths. I am no graphics hardware programmer, but the suspect commit didn't look related to the problem to my naive eyes.
In theory you can reuse your old last-good and last-bad commits. Sometimes issues also only happen randomly, which can be a bit of a pain. But yeah... bisecting only over certain paths sometimes also leads to weird results.
Might be quicker than I said... I am pushing the builds a bit; instead of my usual
make -j$(nproc)
I'm using
make -j -l12
Might push the next one a bit more, maybe -l15 or -l18, since my core temp is hovering around 45 °C (the joys of doing my own CPU/heatsink repastes). Will report results ASAP. Thanks for the help.
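For anyone puzzled by the flags: in GNU make, a bare `-j` means "no limit on parallel jobs", while `-l N` tells make not to start new jobs while the load average is above N, so `make -j -l12` throttles by system load rather than by a fixed job count. A quick demonstration on a throwaway Makefile (file contents invented for illustration):

```shell
# Two independent targets built under load-average throttling; with only
# trivial jobs the load limit never kicks in, but the flags are the same
# ones discussed above.
dir=$(mktemp -d)
printf 'all: a b\na:\n\t@echo built-a\nb:\n\t@echo built-b\n' > "$dir/Makefile"
make --no-print-directory -C "$dir" -j -l12
```

A middle ground is `make -j"$(nproc)" -l12`, which caps both the job count and the load.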
About the AMD-Vi portion of the error message: it looks like that is the standard format for a page fault on an AMD system with amd_iommu. I also tripped over this vger discussion about a similar error (though for amdgpu) that seems to shed a little light on the kinds of problems that might trigger the page fault: https://lore.kernel.org/lkml/bc7142a1-82d3-43bf-dee2-25f9297e7182@arm.com/T/
The bisect is still going. It's on the 9th iteration, 4 to go after this one. Should be done some time tomorrow.
# bad: [c9c3395d5e3dcc6daee66c6908354d47bf98cb0c] Linux 6.2
# good: [830b3c68c1fb1e9176028d02ef86f3cf76aa2476] Linux 6.1
git bisect start 'v6.2' 'v6.1'
# bad: [1ca06f1c1acecbe02124f14a37cce347b8c1a90c] Merge tag 'xtensa-20221213' of https://github.com/jcmvbkbc/linux-xtensa
git bisect bad 1ca06f1c1acecbe02124f14a37cce347b8c1a90c
# good: [8715c6d3100fc7c6edddf29af4a399a1c12d028c] Merge tag 'for-6.2/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
git bisect good 8715c6d3100fc7c6edddf29af4a399a1c12d028c
# bad: [66efff515a6500d4b4976fbab3bee8b92a1137fb] Merge tag 'amd-drm-next-6.2-2022-12-07' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad 66efff515a6500d4b4976fbab3bee8b92a1137fb
# good: [49e8e6343df688d68b12c2af50791ca37520f0b7] Merge tag 'amd-drm-next-6.2-2022-11-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect good 49e8e6343df688d68b12c2af50791ca37520f0b7
# bad: [fc58764bbf602b65a6f63c53e5fd6feae76c510c] Merge tag 'amd-drm-next-6.2-2022-11-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
git bisect bad fc58764bbf602b65a6f63c53e5fd6feae76c510c
# bad: [4e291f2f585313efa5200cce655e17c94906e50a] Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect bad 4e291f2f585313efa5200cce655e17c94906e50a
# bad: [78a43c7e3b2ff5aed1809f93b4f87a418355789e] drm/nouveau/gr/gf100-: make global attrib_cb actually global
git bisect bad 78a43c7e3b2ff5aed1809f93b4f87a418355789e
# good: [eb39c613481fd2fe6b2f66ec2ca21f8fdcdd4cac] drm/nouveau/fifo: expose per-runlist CHID information
git bisect good eb39c613481fd2fe6b2f66ec2ca21f8fdcdd4cac
# good: [8ab849d6dd4c2eb8880096e53e91dfb6ca37b589] drm/nouveau/fifo: add new engine context handling
git bisect good 8ab849d6dd4c2eb8880096e53e91dfb6ca37b589
# good: [0e44c21708761977dcbea9b846b51a6fb684907a] drm/nouveau/flcn: new code to load+boot simple HS FWs (VPR scrubber)
git bisect good 0e44c21708761977dcbea9b846b51a6fb684907a
# bad: [4500031f86691a44ecbbebfc77872c60c5a1b8e6] drm/nouveau/ltc: split color vs depth/stencil zbc counts
git bisect bad 4500031f86691a44ecbbebfc77872c60c5a1b8e6
# bad: [2541626cfb794e57ba0575a6920826f591f7ced0] drm/nouveau/acr: use common falcon HS FW code for ACR FWs
git bisect bad 2541626cfb794e57ba0575a6920826f591f7ced0
# bad: [e3f324956a32d08a9361ee1e3beca383f1b01eba] drm/nouveau/fb/gp102-: unlock VPR right after devinit
git bisect bad e3f324956a32d08a9361ee1e3beca383f1b01eba
# bad: [5728d064190e169f1a42381bd7e5fc4d411f3188] drm/nouveau/fb: handle sysmem flush page from common code
git bisect bad 5728d064190e169f1a42381bd7e5fc4d411f3188
# first bad commit: [5728d064190e169f1a42381bd7e5fc4d411f3188] drm/nouveau/fb: handle sysmem flush page from common code
What happens if dma_map_page() in nvkm_fb_ctor() fails? nvkm_fb_ctor() uses dma_mapping_error() and returns -EFAULT if the mapping failed, but none of the callers of nvkm_fb_ctor() check the return value. My guess is data corruption when flush_page_init is called.
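To illustrate the failure mode being suggested, here is a hypothetical userspace sketch of the pattern (all names and the stand-in mapping function are invented for illustration; this is not the actual nvkm_fb_ctor code): the constructor detects and reports the mapping failure, but that only protects anything if callers propagate the error instead of pressing on with a bogus address.

```c
#include <assert.h>

#define EFAULT 14

typedef unsigned long dma_addr_t;

struct fb {
    dma_addr_t sysmem_flush_addr;    /* 0 stands in for a failed mapping */
};

/* stand-in for dma_map_page(); force_fail simulates a mapping failure */
static dma_addr_t fake_dma_map_page(int force_fail)
{
    return force_fail ? 0 : 0x1000;
}

/* sketch of the constructor: detects the failure and reports -EFAULT */
static int fb_ctor(struct fb *fb, int force_fail)
{
    fb->sysmem_flush_addr = fake_dma_map_page(force_fail);
    if (fb->sysmem_flush_addr == 0)
        return -EFAULT;              /* callers must check this */
    return 0;
}

/* the check the callers are reportedly missing: if fb_ctor's return value
 * is ignored, the flush page would later be programmed with address 0 */
static int fb_init(struct fb *fb, int force_fail)
{
    int ret = fb_ctor(fb, force_fail);
    if (ret)
        return ret;                  /* bail out before using the address */
    /* only now is fb->sysmem_flush_addr safe to hand to the hardware */
    return 0;
}
```

Under this reading, an unchecked failure would leave the "flush page" address as whatever the error value is, which matches the data-corruption guess above.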
Looks promising. I am building with those commits now. If build completes before I go to work, will report shortly. Otherwise, will report in the morning.