[xen iommu] After upgrading to Linux 3.19, desktop no longer works in Xen 4.5.0 dom0

added Community GPU hang platform: ILK priority::medium severity::major + 1 deleted label

Ting-Wei Lan @lantw uploaded an attachment:

Attachment 115079, "Screenshot when the system is running in single user mode":

Ting-Wei Lan @lantw uploaded an attachment:

~~Attachment 115080~~, "dmesg":
i915-dmesg

Ting-Wei Lan @lantw uploaded an attachment:

~~Attachment 115081~~, "/sys/class/drm/card0/error":
i915-error

Ting-Wei Lan @lantw said:

git bisect shows the bad commit is https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=47591df

Jani Nikula @jani said:

(In reply to Ting-Wei Lan from comment 4)

git bisect shows the bad commit is
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
?id=47591df

commit 47591df505129c9774af6cca2debf283a6e56ed7
Author: Juergen Gross
Date: Mon Nov 3 14:02:04 2014 +0100

xen: Support Xen pv-domains using PAT

Please report this to xen folks. I'll leave this open for tracking purposes for now, although I was tempted to resolve NOTOURBUG.

Ander Conselvan de Oliveira said:

Was this reported to Xen folks? I don't think i915 developers will attempt to fix this, and it has been over a month, so closing as NOTOURBUG.

Ting-Wei Lan @lantw said:

It seems this problem is related to Intel VT-d. If I disable VT-d by adding iommu=off to Xen boot options, this error will not happen.

Ting-Wei Lan @lantw said:

I think I should reopen this bug because the problem also happens without using Xen.

http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02394.html
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02387.html

This problem also happens on Linux >= 3.7 without using Xen when 'intel_iommu=on' is used. It can be worked around by adding 'intel_iommu=igfx_off'. Is it an expected behavior or a bug? Here are some 'dmesg | grep -i iommu' outputs.

Linux 3.6.11 with intel_iommu=on works fine.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005366] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005360] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005359] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ +0.003267] IOMMU 0 0xfed90000: using Register based invalidation
[ +0.006143] IOMMU 2 0xfed93000: using Register based invalidation
[ +0.006141] IOMMU: Setting RMRR:
[ +0.003298] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ +0.008310] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ +0.008269] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005376] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]

Linux >= 3.7 without any intel_iommu argument works fine.
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005384] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000

Linux >= 3.7 with intel_iommu=on causes graphics problems.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ +0.003430] IOMMU: dmar1 using Register based invalidation
[ +0.005553] IOMMU: dmar0 using Register based invalidation
[ +0.005559] IOMMU: dmar2 using Register based invalidation
[ +0.005560] IOMMU: Setting RMRR:
[ +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ +0.008334] IOMMU: Setting identity map for device 0000:00:02.0
[0xd7c00000 - 0xdfffffff]
[ +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ +0.007798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]

Linux >= 3.7 with intel_iommu=igfx_off works fine.
[ +0.000000] Intel-IOMMU: disable GFX device mapping
[ +0.005388] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000

Linux >= 3.7 with both intel_iommu=on and intel_iommu=igfx_off also
works fine.
[ 0.000000] Intel-IOMMU: disable GFX device mapping
[ 0.000000] Intel-IOMMU: enabled
[ 0.205011] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ 0.218432] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ 0.231848] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ 1.873199] IOMMU: dmar0 using Register based invalidation
[ 1.878757] IOMMU: dmar2 using Register based invalidation
[ 1.884315] IOMMU: Setting RMRR:
[ 1.887631] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ 1.895972] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ 1.904285] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ 1.912079] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ 1.919871] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]

It seems the difference between working and broken arguments is 'device 0000:00:02.0', which is the Intel integrated graphics controller.
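The comparison above can also be done mechanically. The following is a minimal sketch, assuming the logs use the "Setting identity map for device" line format quoted in this comment. The function name `identity_mapped_devices` and the shortened sample strings are illustrative only and are not part of any kernel tool.

```python
import re

# Matches the PCI address in lines such as
# "IOMMU: Setting identity map for device 0000:00:02.0"
_DEVICE_RE = re.compile(
    r"Setting identity map for device "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def identity_mapped_devices(dmesg_text):
    """Return the set of PCI addresses that received an identity map."""
    return set(_DEVICE_RE.findall(dmesg_text))


# Shortened excerpts of the logs quoted above (address ranges omitted).
working = """\
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:1f.0
"""
broken = working + "IOMMU: Setting identity map for device 0000:00:02.0\n"

only_in_broken = identity_mapped_devices(broken) - identity_mapped_devices(working)
print(sorted(only_in_broken))  # prints ['0000:00:02.0']
```

With real logs, the two strings would be replaced by the full 'dmesg | grep -i iommu' output of a working boot and a broken boot.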

David Woodhouse @dwmw2 said:

It's odd that it was triggered (in the Xen case) by a PAT patch.

What was the actual effect of that patch on the caching mode used by the machine in question?

[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000

cap & (1<<4) is set, which is the RWBF bit:

1: Indicates software must explicitly flush
the write buffers to ensure updates made to
memory-resident remapping structures are
visible to hardware.

ecap & (1<<0) is clear, which is the Coherency bit:

This field indicates if hardware access to the
root, context, extended-context and
interrupt-remap tables, and second-level
paging structures for requests-without-
PASID, are coherent (snooped) or not.
• 0:Indicates hardware accesses to
remapping structures are non-coherent.

So basically this hardware is in a mode where the IOMMU page tables are non-cache coherent. Not only do you have to clflush every cache line in the page tables to main memory when you write it, but you *also* have to jump through hoops to ensure that the writes are pushed through chipset-specific write buffers (see §6.8 of the VT-d specification).
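The two bit tests above can be repeated for all three DMAR units reported in the logs. This is a small sketch that uses only the bit positions named in this comment (bit 4 of cap for RWBF, bit 0 of ecap for Coherency). The helper name `decode_vtd_caps` is hypothetical, and the register values are copied from the dmesg excerpts in this report.

```python
RWBF_BIT = 1 << 4       # capability register: required write-buffer flushing
COHERENCY_BIT = 1 << 0  # extended capability register: coherent table accesses


def decode_vtd_caps(cap, ecap):
    """Return (needs_write_buffer_flush, coherent) for one DMAR unit."""
    return bool(cap & RWBF_BIT), bool(ecap & COHERENCY_BIT)


# cap / ecap values printed by the kernel for each IOMMU in this report.
units = {
    "IOMMU 0": (0xC9008020E30272, 0x1000),
    "IOMMU 1": (0xC0000020230272, 0x1000),
    "IOMMU 2": (0xC9008020630272, 0x1000),
}

for name, (cap, ecap) in units.items():
    rwbf, coherent = decode_vtd_caps(cap, ecap)
    print(f"{name}: RWBF={rwbf} coherent={coherent}")
```

All three units decode the same way: RWBF is set and the Coherency bit is clear, so the observation is not specific to IOMMU 1.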

That may help to explain why a seemingly innocent PAT change might have triggered something odd. But it would be good to know precisely what went wrong.

Also, does it help to add 'iommu=pt' to the kernel command line? That would make the IOMMU use a 1:1 mapping of all memory, rather than dynamically setting up mappings.

You say it can be reproduced without Xen, with Linux >= 3.7 — can you show the details of that please? And if it doesn't occur in 3.6, can you also bisect the non-Xen case to find when it started happening, please?

Thanks,

Ting-Wei Lan @lantw said:

(In reply to David Woodhouse from comment 9)

It's odd that it was triggered (in the Xen case) by a PAT patch.

What was the actual effect of that patch on the caching mode used by the
machine in question?

[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000

cap & (1<<4) is set, which is the RWBF bit:

1: Indicates software must explicitly flush
the write buffers to ensure updates made to
memory-resident remapping structures are
visible to hardware.

ecap & (1<<0) is clear, which is the Coherency bit:

This field indicates if hardware access to the
root, context, extended-context and
interrupt-remap tables, and second-level
paging structures for requests-without-
PASID, are coherent (snooped) or not.
• 0:Indicates hardware accesses to
remapping structures are non-coherent.

So basically this hardware is in a mode where the IOMMU page tables are
non-cache coherent. Not only do you have to clflush every cache line in the
page tables to main memory when you write it, but you *also* have to jump
through hoops to ensure that the writes are pushed through chipset-specific
write buffers (see §6.8 of the VT-d specification).

That may help to explain why a seemingly innocent PAT change might have
triggered something odd. But it would be good to know precisely what went
wrong.

Can you tell me how can I test it or provide me a link that describes steps to get needed information? I am not familiar with VT-d spec.

There was a discussion on Xen-devel when I tried to make a workaround.
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03642.html
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03723.html

>
> Also, does it help to add 'iommu=pt' to the kernel command line? That would
> make the IOMMU use a 1:1 mapping of all memory, rather than dynamically
> setting up mappings.

No, screen output is still broken.

>
> You say it can be reproduced without Xen, with Linux >= 3.7 — can you show
> the details of that please? And if it doesn't occur in 3.6, can you also
> bisect the non-Xen case to find when it started happening, please?

Non-Xen case is already reported here:
https://bugs.freedesktop.org/show_bug.cgi?id=91127

Bisect result:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=edef7e6

Non-Xen case is partially fixed now. Screen output works fine, but the system crashes after using for several hours.

>
> Thanks,

Ting-Wei Lan @lantw closed a related bug:

*** Bug 91400 has been marked as a duplicate of this bug. ***

Elizabeth said:

Good afternoon,
Sorry for the long delay. The last kernel reported in this case was 4.0, which is quite old, and many changes have been made since then, so I'm closing this bug as invalid. If the problem persists on the newest kernel versions (https://www.kernel.org/), please open a new bug with HW and SW information, logs, and steps to reproduce. Thank you.

Ting-Wei Lan @lantw said:

I can reproduce the problem with the same hardware running Xen 4.8.2 and Linux 4.13.2 unless iommu=no-igfx is passed to Xen hypervisor command line.

Elizabeth said:

Hello again,
Could you please attach a new dmesg log and error state from a newer kernel version, booted with the parameters drm.debug=0x1e log_buf_len=2M (or bigger) set in GRUB?
Thank you.

Elizabeth said:

I'm probably wrong, but this issue may be related to bug 89360.

Ting-Wei Lan @lantw uploaded an attachment:

It took me more than 1 hour to get this file ... It crashed too quickly.

Xen dmesg messages were obtained from serial console and 'xl dmesg' command. Linux dmesg messages earlier than timestamp 520.360867 were obtained from 'dmesg' command. All messages after it were obtained from serial console because the system crashed and the ssh connection was broken.

I disabled wayland in /etc/gdm/custom.conf in order to get the result. The system also crashed in wayland mode, but there was no crash dump file or drm message.

Steps of operations:

In GRUB menu, remove 'iommu=no-igfx' from Xen command line and add 'drm.debug=0x1e log_buf_len=64M s' to Linux command line.
Boot the system and wait 5 minutes to get single user shell.
Delete /var/run/nologin.
Mount /proc/xen.
Start NetworkManager and sshd.
Connect to the host from ssh and run 'xl dmesg' and 'dmesg -w' commands.
Leave single user shell to continue normal boot.
Once the screen output becomes more broken, type 'sudo cat /sys/class/drm/card0/error > gpu_crash_dump; sudo sync' command as soon as possible because the system will stop responding within a few seconds.
Reboot the system with Xen console command 'R'.
Boot the system normally to download 'gpu_crash_dump' file.
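Because the system stops responding within seconds, the capture step can be prepared in advance as a small script instead of being typed by hand. This is only a sketch of the 'cat ... > gpu_crash_dump; sync' step above. The function name `save_gpu_error_state` is hypothetical; the default source path is the one used in this report, and the script must run as root to read it.

```python
import os
import shutil


def save_gpu_error_state(src="/sys/class/drm/card0/error", dst="gpu_crash_dump"):
    """Copy the i915 error state to dst and force it to disk before returning."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        # Push the data to stable storage for this one file; this plays the
        # role of the 'sync' in the manual step, but only for the dump file.
        os.fsync(fout.fileno())
    return dst
```

Calling `save_gpu_error_state()` with no arguments writes `gpu_crash_dump` to the current directory, matching the file name used in the steps above.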

Attachment 136084, "dmesg (Xen 4.8.2 + Linux 4.14.4)":
linux-4.14.4-xen-crash-dmesg

Ting-Wei Lan @lantw uploaded an attachment:

Attachment 136085, "/sys/class/drm/card0/error":
i915-error-414

Elizabeth said:

(In reply to Elizabeth from comment 15)

I'm probably wrong, but this issue may be related to bug 89360.
Yep, wrong. By previous comments situation seems to be the same pointing to a NOTOURBUG, though there is the VT-d. Let me ping some people to verify.

Jani Saarinen @jani.saarinen said:

First of all. Sorry about spam.
This is mass update for our bugs.

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.

Ting-Wei Lan @lantw said:

I just downloaded and tested drm-tip commit c46052cde6a5, and I can still reproduce the problem on this machine.

Jani Saarinen @jani.saarinen said:

OK, thanks for the feedback. Chris, any help from you on this?

LAKSHMINARAYANA VUDUM @l4kshmi said:

Ting, sorry for the delay.

Do you still have the issue?
If so, try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

This will speed up the investigation.

Ting-Wei Lan @lantw said:

(In reply to Lakshmi from comment 22)

Ting, sorry for the delay.

Do you still have the issue?
If so, try to reproduce the issue using drm-tip
(https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

This will speed up the investigation.

Yes, the problem still exists. I could reproduce it with drm-tip commit 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results were similar: all characters are broken and the system was unable to show GDM login screen. The system was accessible from SSH but it couldn't reboot. I ended up pressing 'R' on the Xen hypervisor console to reboot it.

Ting-Wei Lan @lantw uploaded an attachment:

This is the log from the test of the first time. I am not sure why there is an ext4 error in the log, but the kernel starts printing call traces after showing the error.

I forgot to ask Xen to load Intel CPU microcode update in this test, but I think it should not affect the test result. There is a gap between 11.948721 and 315.808150 in the log because it took 10 minutes to activate LVM.

Attachment 141535, "dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #1 (moved)":
hello2

Ting-Wei Lan @lantw uploaded an attachment:

This is the log from the test of the second time. After the first test, I rebooted the system with 'iommu=no-igfx' set on Xen command line and hoped it could boot normally. However, it stopped and dropped into a shell in initramfs because the fsck on rootfs failed. I manually performed fsck and the system seemed to boot up normally to the desktop. I assumed all filesystem troubles caused by the previous test were now cleaned up, and I rebooted the system to do the second test.

This time I remembered to add 'ucode=-1' to Xen command line to let it load Intel CPU microcode update. The version of the microcode update file is 'revision 0x11, date = 2018-05-08'. The kernel printed a lot of repeated messages in this test and the log quickly grew over 30M. I reset the system from Xen once I saw it printed messages endlessly. Because of the large file size, I only uploaded the first 20000 lines of the log here.

Attachment 141536, "dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #2":
hello4

LAKSHMINARAYANA VUDUM @l4kshmi said:

(In reply to Ting-Wei Lan from comment 23)

(In reply to Lakshmi from comment 22)

Ting, sorry for the delay.

Do you still have the issue?
If so, try to reproduce the issue using drm-tip
(https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

This will speed up the investigation.

Yes, the problem still exists. I could reproduce it with drm-tip commit
6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
were similar: all characters are broken and the system was unable to show
GDM login screen. The system was accessible from SSH but it couldn't reboot.
I ended up pressing 'R' on the Xen hypervisor console to reboot it.

How often you see this issue? Every time you reboot?

Ting-Wei Lan @lantw said:

(In reply to Lakshmi from comment 26)

(In reply to Ting-Wei Lan from comment 23)

(In reply to Lakshmi from comment 22)

Ting, sorry for the delay.

Do you still have the issue?
If so, try to reproduce the issue using drm-tip
(https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

This will speed up the investigation.

Yes, the problem still exists. I could reproduce it with drm-tip commit
6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
were similar: all characters are broken and the system was unable to show
GDM login screen. The system was accessible from SSH but it couldn't reboot.
I ended up pressing 'R' on the Xen hypervisor console to reboot it.

How often you see this issue? Every time you reboot?

Yes, it happens on every boot unless I pass iommu=no-igfx to Xen command line.

Alexander Tsoy @puleglot said:

The problem still exists in 4.19.10. I also have an Ironlake iGPU:

$ grep 'model name' /proc/cpuinfo | head -n1
model name : Intel(R) Core(TM) i5 CPU 660 @ 3.33GHz

LAKSHMINARAYANA VUDUM @l4kshmi said:

@chris, any other suggestion for this issue apart from booting with intel_iommu=off?

Reporter, do you still have the issue?
If so, try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip).

Yes, I could reproduce it with drm-tip commit 87c99602f2beb1b0ee7bdb3310bf12133f4d3f7f. Screen output was broken in the same way as the old screenshot. GDM could not start. The system panicked and rebooted after a few minutes.

serial console dmesg /sys/class/drm/card0/error

Hi Reporter, can you try to reproduce this issue on latest drm-tip. https://cgit.freedesktop.org/drm-tip Please post your observations on it and any steps followed to reproduce the issue.

I can reproduce the issue using the same Xen and Linux kernel command line arguments. The screen output was broken in the same way, and the system crashed in one minute. I tested it with drm-tip commit f5d18496a7876fd70b9d96d6a87b3c910f5e2ef0.

serial console

@lantw, do you still see this issue on the latest drm-tip?

I tested the latest drm-tip, and it was even more broken than Linux 5.16.15. It couldn't even show broken output. The screen became completely black after loading i915, and the system crashed in a few seconds.

[   11.244967] Already setup the GSI :16
[   11.249436] pci 0000:00:00.0: Intel HD Graphics Chipset
[   11.249800] pci 0000:00:00.0: detected gtt size: 2097152K total, 262144K mappable
[   11.251319] pci 0000:00:00.0: detected 131072K stolen memory
[   11.252883] i915 0000:00:02.0: [drm] VT-d active for gfx access
[   11.252976] i915 0000:00:02.0: vgaarb: deactivate vga console
[   11.257104] Console: switching to colour dummy device 80x25
[   11.425257] i915 0000:00:02.0: [drm] Transparent Hugepage mode 'huge=within_size'
[   11.425304] tmpfs: Unsupported parameter 'huge'
[   11.425317] [drm] Unable to create a private tmpfs mount, hugepage support will be disabled(-22).
[   11.425334] i915 0000:00:02.0: [drm] DMAR active, disabling use of stolen memory
[   11.436600] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[   11.445579] i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_init+0xb6/0x300 [i915]
[   13.373538] ------------[ cut here ]------------
[   13.373572] WARNING: CPU: 1 PID: 379 at drivers/gpu/drm/drm_mode_config.c:504 drm_mode_config_cleanup+0x258/0x2b0 [drm]
[   13.373641] Modules linked in: i915(+) i2c_algo_bit drm_buddy video drm_dp_helper crct10dif_pclmul crc32_pclmul drm_kms_helper crc32c_intel cec ghash_clmulni_intel ttm ata_generic firewire_ohci uas pata_acpi r8169 serio_raw drm firewire_core usb_storage pata_marvell crc_itu_t xen_acpi_processor xen_scsiback target_core_mod xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn pcspkr ipmi_devintf ipmi_msghandler fuse
[   13.373727] CPU: 1 PID: 379 Comm: systemd-udevd Tainted: G        W         5.17.0+ #2
[   13.373743] Hardware name: System manufacturer System Product Name/P7H55D-M EVO, BIOS 1604    07/22/2010
[   13.373758] RIP: e030:drm_mode_config_cleanup+0x258/0x2b0 [drm]
[   13.373806] Code: 14 48 c1 48 8d bd e8 01 00 00 48 81 c5 b0 01 00 00 e8 ac 14 48 c1 48 8b 45 00 48 39 e8 75 4b 48 83 c4 30 5b 5d 41 5c 41 5d c3 <0f> 0b 48 89 e6 48 89 ef e8 db 79 ff ff eb 10 48 8b 70 60 48 c7 c7
[   13.373834] RSP: e02b:ffffc9004054bb80 EFLAGS: 00010216
[   13.373845] RAX: ffff888120960268 RBX: ffff8881209602a0 RCX: 0000000000000000
[   13.373858] RDX: ffff88812808d820 RSI: ffffc9004054bac8 RDI: 00000000ffffffff
[   13.373871] RBP: ffff888120960000 R08: 0000000000000000 R09: 0000000000000000
[   13.373883] R10: 0000000000007ff0 R11: 0000000000000001 R12: ffff8881209602a8
[   13.373896] R13: ffff8881209622b8 R14: ffff888120960000 R15: 00000000ffffffed
[   13.373923] FS:  00007f986686eb40(0000) GS:ffff88840a480000(0000) knlGS:0000000000000000
[   13.373938] CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.373949] CR2: 00007f0160003000 CR3: 000000011c322000 CR4: 0000000000000660
[   13.373967] Call Trace:
[   13.373976]  <TASK>
[   13.373986]  ? _raw_spin_unlock_irqrestore+0x25/0x40
[   13.374003]  intel_modeset_driver_remove_noirq+0x9f/0x100 [i915]
[   13.374194]  i915_driver_probe+0x972/0xd10 [i915]
[   13.374338]  ? intel_modeset_probe_defer+0x4f/0x60 [i915]
[   13.374529]  ? i915_pci_probe+0x31/0x110 [i915]
[   13.374673]  local_pci_probe+0x45/0x80
[   13.374686]  ? pci_match_device+0xd7/0x130
[   13.374697]  pci_device_probe+0xaa/0x1a0
[   13.374709]  really_probe+0x1f5/0x3d0
[   13.374722]  __driver_probe_device+0xfe/0x180
[   13.374734]  driver_probe_device+0x1e/0x90
[   13.374745]  __driver_attach+0xc0/0x1c0
[   13.374756]  ? __device_attach_driver+0xe0/0xe0
[   13.374767]  ? __device_attach_driver+0xe0/0xe0
[   13.374779]  bus_for_each_dev+0x64/0x90
[   13.374790]  bus_add_driver+0x149/0x1e0
[   13.374801]  driver_register+0x8f/0xe0
[   13.374811]  i915_init+0x20/0x7c [i915]
[   13.374953]  ? 0xffffffffc06a0000
[   13.374963]  do_one_initcall+0x44/0x200
[   13.374977]  ? kmem_cache_alloc_trace+0x163/0x2c0
[   13.375018]  do_init_module+0x4c/0x260
[   13.375030]  __do_sys_finit_module+0x9b/0xf0
[   13.375043]  do_syscall_64+0x3b/0x90
[   13.375056]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.375069] RIP: 0033:0x7f9867400ecd
[   13.375079] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2b ef 0e 00 f7 d8 64 89 01 48
[   13.375106] RSP: 002b:00007ffe2fd20b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   13.375121] RAX: ffffffffffffffda RBX: 000055bed05df8f0 RCX: 00007f9867400ecd
[   13.375134] RDX: 0000000000000000 RSI: 00007f986756732c RDI: 0000000000000014
[   13.375147] RBP: 0000000000020000 R08: 0000000000000000 R09: 0000000000000002
[   13.375159] R10: 0000000000000014 R11: 0000000000000246 R12: 00007f986756732c
[   13.375172] R13: 000055bed05b29a0 R14: 0000000000000007 R15: 000055bed05dd530
[   13.375187]  </TASK>
[   13.375193] ---[ end trace 0000000000000000 ]---
[   13.375204] [drm:drm_mode_config_cleanup [drm]] *ERROR* connector HDMI-A-1 leaked!
[   14.019619] i915 0000:00:02.0: Device initialization failed (-19)
[   14.019650] i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
[   14.019704] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   14.019717] #PF: supervisor read access in kernel mode
[   14.019727] #PF: error_code(0x0000) - not-present page
[   14.019766] PGD 0 P4D 0
[   14.019774] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   14.019785] CPU: 1 PID: 56 Comm: kworker/1:1 Tainted: G        W         5.17.0+ #2
[   14.019800] Hardware name: System manufacturer System Product Name/P7H55D-M EVO, BIOS 1604    07/22/2010
[   14.019815] Workqueue: events drm_connector_free_work_fn [drm]
[   14.019869] RIP: e030:ida_free+0x88/0x110
[   14.019881] Code: 48 89 c5 a8 01 74 23 83 fb 3e 77 24 48 d1 ed 48 0f a3 dd 73 1b 48 0f b3 dd 48 85 ed 75 74 31 f6 48 89 e7 e8 1a 03 01 00 eb 51 <48> 0f a3 18 72 28 48 8b 3c 24 4c 89 e6 e8 16 85 5d 00 44 89 ee 48
[   14.019908] RSP: e02b:ffffc9004038bdc8 EFLAGS: 00010046
[   14.019919] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[   14.019931] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffc9004038bdc8
[   14.019944] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8881209601e8
[   14.019956] R10: ffffc9004038be20 R11: 0000000000000002 R12: 0000000000000200
[   14.019969] R13: 0000000000000001 R14: ffff88812808d860 R15: ffff888120960000
[   14.019992] FS:  0000000000000000(0000) GS:ffff88840a480000(0000) knlGS:0000000000000000
[   14.020007] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.020019] CR2: 0000000000000000 CR3: 000000011c322000 CR4: 0000000000000660
[   14.020036] Call Trace:
[   14.020044]  <TASK>
[   14.020052]  drm_connector_cleanup+0x19e/0x2e0 [drm]
[   14.020103]  intel_connector_destroy+0x4b/0x70 [i915]
[   14.020281]  drm_connector_free_work_fn+0x6e/0x80 [drm]
[   14.020329]  process_one_work+0x1e5/0x3b0
[   14.020342]  ? rescuer_thread+0x370/0x370
[   14.020353]  worker_thread+0x1c4/0x3a0
[   14.020363]  ? rescuer_thread+0x370/0x370
[   14.020373]  kthread+0xe7/0x110
[   14.020383]  ? kthread_complete_and_exit+0x20/0x20
[   14.020397]  ret_from_fork+0x22/0x30
[   14.020412]  </TASK>
[   14.020418] Modules linked in: i915(+) i2c_algo_bit drm_buddy video drm_dp_helper crct10dif_pclmul crc32_pclmul drm_kms_helper crc32c_intel cec ghash_clmulni_intel ttm ata_generic firewire_ohci uas pata_acpi r8169 serio_raw drm firewire_core usb_storage pata_marvell crc_itu_t xen_acpi_processor xen_scsiback target_core_mod xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn pcspkr ipmi_devintf ipmi_msghandler fuse
[   14.020501] CR2: 0000000000000000
[   14.020510] ---[ end trace 0000000000000000 ]---
[   14.020520] RIP: e030:ida_free+0x88/0x110
[   14.020529] Code: 48 89 c5 a8 01 74 23 83 fb 3e 77 24 48 d1 ed 48 0f a3 dd 73 1b 48 0f b3 dd 48 85 ed 75 74 31 f6 48 89 e7 e8 1a 03 01 00 eb 51 <48> 0f a3 18 72 28 48 8b 3c 24 4c 89 e6 e8 16 85 5d 00 44 89 ee 48
[   14.020557] RSP: e02b:ffffc9004038bdc8 EFLAGS: 00010046
[   14.020567] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[   14.020580] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffc9004038bdc8
[   14.020593] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8881209601e8
[   14.020605] R10: ffffc9004038be20 R11: 0000000000000002 R12: 0000000000000200
[   14.020618] R13: 0000000000000001 R14: ffff88812808d860 R15: ffff888120960000
[   14.020639] FS:  0000000000000000(0000) GS:ffff88840a480000(0000) knlGS:0000000000000000
[   14.020654] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.020665] CR2: 0000000000000000 CR3: 000000011c322000 CR4: 0000000000000660
[   14.020683] note: kworker/1:1[56] exited with preempt_count 1

Worse, the issue couldn't be worked around with iommu=no-igfx option. The system could boot to text mode with iommu=no-igfx option when using Linux 5.16.15, but the option didn't seem to fix anything when using drm-tip.

I also checked if it could work without VT-d. No, Linux crashed with no message when VT-d was disabled:

(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 612kB init memory
mapping kernel into physical memory
about to get started...
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

I wonder if this is actually a hardware problem. It is known (per a past Intel advisory about SGX) that the integrated GPU doesn’t use the standard DMA paths, but rather uses a shortcut within the chip. Perhaps DMA by the iGPU is not being properly translated by the IOMMU? If so, one workaround would be to ensure that the iGPU’s DMA page tables are an identity mapping.

mentioned in issue #10654 (closed)

mentioned in commit f1897f2f

Can someone try the commit f1897f2f mentioned by @jani?

I did not. It's just gitlab being silly.

The backtrace in the commit message contains:

CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G        W          6.13.0-rc6 #22

and this is issue #22. I probably pushed the commit through a rebase; I didn't even apply it myself.

No, I can no longer test the issue. The desktop computer no longer boots, and I have no plan to repair this 14-year-old computer.

Submitted by Ting-Wei Lan `@lantw`