Xen PV uses a different PAT, which results in graphics glitches when running in Xen dom0. Patching Xen to use Linux’s PAT makes the problem go away, but this is an ABI break for PV guests and so is not an upstreamable solution. This needs to be fixed in i915.
So, for me it seems WC at index 4 is problematic for some reason.
(...)
Old CPUs have had hardware errata that caused the top bit of the PAT
entry to be ignored in certain cases. Could modern CPUs be ignoring
this bit when accessing iGPU memory or registers? With WC at position
4, this would cause WC to be treated as WB, which is consistent with the
observed behavior. WC at position 3 would not be impacted, and WC at
position 5 would be treated as WT which I expect to be safe. One way to
test this is to test 1=WB, 5=WC. If my hypothesis is correct, this
should trigger the bug, even if entry 1 in the PAT is unused because
entry 0 is also WB.
(...)
This looks like a very probable situation, indeed 1=WB, 5=WC does
trigger the bug! Specifically this layout:
WB WB UC- UC WP WC WT UC
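For anyone who wants to try such a layout themselves, here is a rough sketch of how a layout string written this way (entry 0 first) packs into a 64-bit PAT MSR value. The memory-type encodings are the architectural ones from the Intel SDM; the helper names `encode_pat` and `decode_pat` are made up for illustration and are not part of any kernel or Xen interface.

```python
# Architectural memory-type encodings for IA32_PAT entries (Intel SDM).
PAT_ENCODING = {"UC": 0x00, "WC": 0x01, "WT": 0x04, "WP": 0x05, "WB": 0x06, "UC-": 0x07}
PAT_DECODING = {code: name for name, code in PAT_ENCODING.items()}


def encode_pat(layout: str) -> int:
    """Pack eight space-separated memory types (entry 0 first) into a PAT value."""
    entries = layout.split()
    if len(entries) != 8:
        raise ValueError("a PAT layout needs exactly 8 entries")
    value = 0
    for index, name in enumerate(entries):
        # Each entry occupies one byte; only the low 3 bits carry the type.
        value |= PAT_ENCODING[name] << (8 * index)
    return value


def decode_pat(value: int) -> str:
    """Inverse of encode_pat: turn a PAT value back into a layout string."""
    return " ".join(PAT_DECODING[(value >> (8 * i)) & 0x07] for i in range(8))


layout = "WB WB UC- UC WP WC WT UC"
msr = encode_pat(layout)
print(f"{msr:#018x}")  # prints 0x0004010500070606
print(decode_pat(msr) == layout)  # prints True
```

How the value actually gets programmed (kernel patch, hypervisor patch) is outside this sketch.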
(...)
What about WB WT WB UC WB WP WC UC- and WB WT WT UC WB WP WC UC-? Those
only differ in entry 2, which will not be used as it duplicates entry 0
or 1. Therefore, architecturally, these should behave identically. If
I am correct, the second will work fine, but the first will trigger the
bug.
Bingo! This also behaves as predicted.
So, it indeed looks like the _PAGE_PAT bit is ignored by the hardware,
even though set in relevant PTEs.
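To make the hypothesis concrete, here is a small model of it, checked against the three layouts tested above. It assumes that WC mappings use the first PAT entry holding WC and that the hardware drops the top index bit (the _PAGE_PAT bit), so index i is looked up as i & 3. The function names are hypothetical and this only models the prediction; it does not touch real hardware.

```python
def pat_index(pat: int, pcd: int, pwt: int) -> int:
    """PAT index selected by a PTE: PAT is bit 2, PCD is bit 1, PWT is bit 0."""
    return (pat << 2) | (pcd << 1) | pwt


def effective_type(layout: str, index: int, ignore_pat_bit: bool) -> str:
    """Memory type used for a given index, optionally dropping the PAT bit."""
    entries = layout.split()
    if ignore_pat_bit:
        index &= 0b011  # hypothesis: the top bit of the index is ignored
    return entries[index]


def wc_seen_as(layout: str) -> str:
    """Type a WC mapping would really get if the PAT bit is ignored.

    Assumption: software picks the first entry that holds WC.
    """
    wc_index = layout.split().index("WC")
    return effective_type(layout, wc_index, ignore_pat_bit=True)


LAYOUTS = [
    "WB WB UC- UC WP WC WT UC",   # 1=WB, 5=WC
    "WB WT WB UC WB WP WC UC-",   # 2=WB, 6=WC
    "WB WT WT UC WB WP WC UC-",   # 2=WT, 6=WC
]

for layout in LAYOUTS:
    seen = wc_seen_as(layout)
    verdict = "glitches expected" if seen == "WB" else "expected to be safe"
    print(f"{layout}: WC treated as {seen} ({verdict})")
```

Under these assumptions the model predicts WB for the first two layouts and WT for the third, which matches what was observed.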
To push things forward, I need to be able to reproduce the issue quickly, in Linux, without installing QubesOS. @DemiMarie, @marmarek, could you please advise a minimal graphical environment and a test application sufficient for reproduction of those glitches?
I provided a patch that mimics Xen behavior on native Linux in https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/. With that, the issue is obvious on Alder Lake with any Xorg application, or even in the dracut LUKS prompt (if you have disk encryption). I haven't tried other graphical interfaces (Wayland), but they might be affected too. The issue can be observed on Tiger Lake too, but at least for me it affects only the dracut LUKS prompt there, not Xorg (unclear why).
Using your patch on top of v6.1.23, I've not been able to reproduce the issue (see any glitches) on ADL-N (8086:46d0). Userspace I used:
weston + weston-terminal + glxgears,
startx -> Xorg (tried with both modesetting and intel driver), gnome-session-binary, gnome-shell, gnome-initial-setup -- screenshot:
So there must still be something else in your test environment that makes your Xorg applications exhibit the issue, or not all Xorg apps are affected.
Having only remote access to an ADL (screenshot taken over KVM video), I'd rather avoid setting up disk encryption + dracut if not strictly required. Could you please point me to specific Xorg applications that exhibit the glitches for you?
EDIT: Please hold on, I can see some glitches with just Xorg + xterm.
I can see glitches only when using intel Xorg driver. No glitches when using modesetting driver.
UPDATE: No glitches with intel Xorg driver on pure v6.1.23 either. My conclusion is that the Xorg intel driver, not i915, can be the source of glitches under Xen PV.
Interesting, I had this behavior (glitches only with the "intel" Xorg driver) on an older system (Kaby Lake?), but I'm pretty sure it affected "modesetting" on newer systems too. I'll try to build you a minimal reproducer initramfs.
I can see a lot of glitches when running with the kernel patch applied and weston configured with use-pixman=true.
Without use-pixman=true, only the cursor image is affected. When only moving the cursor and the picture is otherwise static, transitions from one cursor icon to another are progressive, look like performed pixel by pixel in random order, and take seconds, depending on cursor movement. These effects are much less visible, though still present to some extent, when a graphics application that triggers frequent picture updates is running.
No similar glitches can be observed when running with standard Linux PAT mapping.
Since the visibility and intensity of the glitches depend on userspace configuration, I still suspect that something at the userspace level is responsible for choosing the wrong PAT indexes.
transitions from one cursor icon to another are progressive, look like performed pixel by pixel in random order, and take seconds, depending on cursor movement
Sounds exactly like this issue, but for me (at least on some machines) it affects other windows too. Usually, it will finally fully render, either after several seconds (sometimes a minute), or after some event that I guess flushes some caches.
As with hardware-accelerated weston, I can see the cursor glitches triggered by the Xen PAT table in Xorg/modesetting too, but again no glitches affecting other (non-cursor) areas.
Previous debugging showed that the _PAGE_PAT bit was ignored (cleared) - see the table I linked earlier. Maybe some user->kernel API (but not all of them) does that (either on the user space or the kernel space side)? The fact that it applies to plymouth too suggests to me that it's rather on the kernel side (it's a bit unlikely to have the same software bug in several places, but not impossible...).