i915 driver incorrectly assumes that the PAT will be the standard Linux value

@armurthy Please have a first look to find the next set of steps.

Some details are also provided at https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/, including reproduction steps outside of Xen.

Copying the interesting findings from the ML thread:

I did several tests with different PAT configurations (by modifying the Xen code that sets the MSR). The full table is at https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/. Some highlights:

1=WC, 4=WT - good

1=WT, 4=WC - bad

1=WT, 3=WC (4=WC too) - good

1=WT, 5=WC - good

So, for me it seems WC at index 4 is problematic for some reason.

(...)

Old CPUs have had hardware errata that caused the top bit of the PAT entry to be ignored in certain cases. Could modern CPUs be ignoring this bit when accessing iGPU memory or registers? With WC at position 4, this would cause WC to be treated as WB, which is consistent with the observed behavior. WC at position 3 would not be impacted, and WC at position 5 would be treated as WT which I expect to be safe. One way to test this is to test 1=WB, 5=WC. If my hypothesis is correct, this should trigger the bug, even if entry 1 in the PAT is unused because entry 0 is also WB.

(...)

This looks very probable: indeed, 1=WB, 5=WC does trigger the bug! Specifically this layout:

WB WB UC- UC WP WC WT UC

(...)

What about WB WT WB UC WB WP WC UC- and WB WT WT UC WB WP WC UC-? Those only differ in entry 2, which will not be used as it duplicates entry 0 or 1. Therefore, architecturally, these should behave identically. If I am correct, the second will work fine, but the first will trigger the bug.

Bingo! This also behaves as predicted.

So, it indeed looks like the _PAGE_PAT bit is ignored by the hardware, even though it is set in the relevant PTEs.
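The layouts tested above can be replayed against a small model of the hypothesis. This is only a sketch: it assumes the usual x86 encoding in which the PAT index of a 4K PTE is PAT<<2 | PCD<<1 | PWT, and it models "PAT bit ignored" as masking that index with 0b011. The function name `effective_wc_type` and the dictionary labels are made up for illustration.

```python
# PAT layouts quoted in this thread, entry 0 first.
LAYOUTS = {
    "1=WB, 5=WC": "WB WB UC- UC WP WC WT UC".split(),
    "entry 2 = WB": "WB WT WB UC WB WP WC UC-".split(),
    "entry 2 = WT": "WB WT WT UC WB WP WC UC-".split(),
}

def effective_wc_type(layout):
    """Memory type a WC mapping would get if the PAT bit were dropped."""
    wc_index = layout.index("WC")   # first entry holding WC
    seen_index = wc_index & 0b011   # assumption: top (PAT) bit is ignored
    return layout[seen_index]

for name, layout in LAYOUTS.items():
    print(f"{name}: WC requested, effective type {effective_wc_type(layout)}")
```

Under this model WC falls back to WB in the two layouts that triggered the bug and to WT in the one that worked, which matches the observations.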

FWIW, the broken state looks like this:

To push things forward, I need to be able to reproduce the issue quickly, in Linux, without installing QubesOS. @DemiMarie, @marmarek, could you please advise a minimal graphical environment and a test application sufficient for reproduction of those glitches?

I provided a patch that mimics Xen behavior on native Linux in https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/. With that, the issue is obvious on Alder Lake with any Xorg application, or even in the dracut LUKS prompt (if you have disk encryption). I haven't tried other graphical interfaces (Wayland), but they might be affected too. The issue can be observed on Tiger Lake too, but at least for me it affects only the dracut LUKS prompt there and not Xorg (unclear why).

Hi @marmarek,

Using your patch on top of v6.1.23, I've not been able to reproduce the issue (see any glitches) on ADL-N (8086:46d0). Userspace I used:

weston + weston-terminal + glxgears,
startx -> Xorg (tried with both modesetting and intel driver), gnome-session-binary, gnome-shell, gnome-initial-setup -- screenshot:

So there must still be something else in your test environment that makes all of your Xorg applications exhibit the issue, or not all Xorg apps are affected.

Having only remote access to an ADL (screenshot taken over KVM video), I'd rather avoid setting up disk encryption + dracut if not strictly required. Could you please name some specific Xorg applications that exhibit the glitches for you?

EDIT: Please hold on, I can see some glitches with just Xorg + xterm.

I can see glitches only when using intel Xorg driver. No glitches when using modesetting driver.

UPDATE: No glitches with intel Xorg driver on pure v6.1.23 either. My conclusion is that the Xorg intel driver, not i915, can be the source of glitches under Xen PV.

Interesting, I had this behavior (glitches only with the "intel" Xorg driver) on an older system (Kaby Lake?) but I'm pretty sure it affected "modesetting" on newer systems too. I'll try to build you a minimal reproducer initramfs.

UPDATE on wayland (using weston):

I can see a lot of glitches when running with the kernel patch applied and weston configured with use-pixman=true.

Without use-pixman=true, only the cursor image is affected. When only moving the cursor and the picture is otherwise static, transitions from one cursor icon to another are progressive, appear to be performed pixel by pixel in random order, and take seconds, depending on cursor movement. Those effects are much less visible, though still present to some extent, when a graphics application that triggers frequent picture updates is running.

No similar glitches can be observed when running with standard Linux PAT mapping.

Since the visibility and intensity of the glitches depend on userspace configuration, I still suspect that something at userspace level is responsible for choosing the wrong PAT indexes.

transitions from one cursor icon to another are progressive, appear to be performed pixel by pixel in random order, and take seconds, depending on cursor movement

Sounds exactly like this issue, but for me (at least on some machines) it affects other windows too. Usually, it will finally fully render, either after several seconds (sometimes a minute), or after some event that I guess flushes some caches.

As in hardware-accelerated weston, I can see those cursor glitches triggered by the Xen PAT table in Xorg/modesetting too, but again no glitches affecting other (non-cursor) areas.

Previous debugging showed that the _PAGE_PAT bit was ignored (cleared); see the table I linked earlier. Maybe some user->kernel API (but not all of them) does that (in either the user-space or the kernel-space part)? The fact that it applies to plymouth too suggests to me that it's rather on the kernel side (it's a bit unlikely to have the same software bug in several places, but not impossible...).

Sounds to me like the CPU is just broken.

To debug this properly someone should just write a small testcase (eg. in igt):

disable FBC and PSR
enable display
mmap the current front buffer BO (both mmap_wc and mmap_gtt paths should be tested)
write through the mapping with the CPU (using big SSE/AVX accesses is probably a good idea to make sure WC really gets tested as well)
flush WC buffer (mmio access/mb/etc. should suffice)
verify that cache dirt doesn't linger on the display

A potential fix has been submitted for review: x86/mm: Fix PAT bit missing from page protection modify mask
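As the patch title suggests, the bit can be lost in software when page protections are merged through a mask that does not include it. The sketch below is a simplified model, not the kernel code: the helper names `modify_prot` and `pat_index` and both masks are hypothetical, and only the PTE bit positions (PWT = bit 3, PCD = bit 4, PAT = bit 7 for 4K pages) are taken from the x86 architecture.

```python
PAGE_PWT = 1 << 3   # x86 PTE bit 3
PAGE_PCD = 1 << 4   # x86 PTE bit 4
PAGE_PAT = 1 << 7   # x86 PTE bit 7 (4K pages)

# Hypothetical masks of bits preserved from the old protection value.
MASK_WITHOUT_PAT = PAGE_PCD | PAGE_PWT
MASK_WITH_PAT = MASK_WITHOUT_PAT | PAGE_PAT

def modify_prot(old, new, preserve_mask):
    """Keep the masked bits of `old`, take all other bits from `new`."""
    return (old & preserve_mask) | (new & ~preserve_mask)

def pat_index(prot):
    """PAT index selected by a protection value: PAT<<2 | PCD<<1 | PWT."""
    pat = 1 if prot & PAGE_PAT else 0
    pcd = 1 if prot & PAGE_PCD else 0
    pwt = 1 if prot & PAGE_PWT else 0
    return (pat << 2) | (pcd << 1) | pwt

old = PAGE_PAT   # old protection selects PAT index 4
new = 0          # new protection carries no caching bits

print(pat_index(modify_prot(old, new, MASK_WITHOUT_PAT)))  # prints 0
print(pat_index(modify_prot(old, new, MASK_WITH_PAT)))     # prints 4
```

With the PAT bit absent from the mask, a request for index 4 silently becomes index 0, which is the same collapse from WC to WB that the PAT experiments in this thread pointed to.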

mentioned in commit igt-gpu-tools@0f075441

The fix has been applied to the x86/mm branch of the Linux tip git tree: x86/mm: Fix PAT bit missing from page protection modify mask

closed
