[xen iommu] After upgrading to Linux 3.19, desktop no longer works in Xen 4.5.0 dom0
When using Linux 3.19 or 4.0 as the dom0 kernel of Xen 4.5.0, characters on the screen become garbled after the graphics driver is loaded. Please see the attached screenshot.
After Xorg is started by GDM, more errors appear and my monitor turns off because it receives no signal.
[ 337.673979] [drm] stuck on render ring
[ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 337.676817] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 337.676818] [drm] Please file a new bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 337.676818] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 337.676819] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 337.676820] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 337.680940] drm/i915: Resetting chip after gpu hang
[ 343.665948] [drm] stuck on render ring
[ 343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 343.670016] [drm:i915_set_reset_status [i915]] ERROR gpu hanging too fast, banning!
[ 343.673893] drm/i915: Resetting chip after gpu hang
[ 345.086609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
Please see the attached dmesg and crash dump. This problem makes the desktop unstable and unusable.
This problem also happens on Linux >= 3.7 without Xen when 'intel_iommu=on' is used. It can be worked around by adding 'intel_iommu=igfx_off'. Is this expected behavior or a bug? Here are some 'dmesg | grep -i iommu' outputs.
Linux 3.6.11 with intel_iommu=on works fine.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005366] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005360] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005359] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ +0.003267] IOMMU 0 0xfed90000: using Register based invalidation
[ +0.006143] IOMMU 2 0xfed93000: using Register based invalidation
[ +0.006141] IOMMU: Setting RMRR:
[ +0.003298] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ +0.008310] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ +0.008269] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005376] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Linux >= 3.7 without any intel_iommu argument works fine.
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005384] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
Linux >= 3.7 with intel_iommu=on causes graphics problems.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ +0.003430] IOMMU: dmar1 using Register based invalidation
[ +0.005553] IOMMU: dmar0 using Register based invalidation
[ +0.005559] IOMMU: dmar2 using Register based invalidation
[ +0.005560] IOMMU: Setting RMRR:
[ +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
[ +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ +0.007798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Linux >= 3.7 with intel_iommu=igfx_off works fine.
[ +0.000000] Intel-IOMMU: disable GFX device mapping
[ +0.005388] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
Linux >= 3.7 with both intel_iommu=on and intel_iommu=igfx_off also works fine.
[ 0.000000] Intel-IOMMU: disable GFX device mapping
[ 0.000000] Intel-IOMMU: enabled
[ 0.205011] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ 0.218432] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ 0.231848] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ 1.873199] IOMMU: dmar0 using Register based invalidation
[ 1.878757] IOMMU: dmar2 using Register based invalidation
[ 1.884315] IOMMU: Setting RMRR:
[ 1.887631] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ 1.895972] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ 1.904285] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ 1.912079] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ 1.919871] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
It seems the difference between working and broken arguments is 'device 0000:00:02.0', which is the Intel integrated graphics controller.
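The comparison can also be done mechanically. This is only a minimal sketch: it assumes the dmesg excerpts are pasted in as strings (abbreviated here to the identity-map lines of the intel_iommu=on log and the intel_iommu=on plus igfx_off log), and the helper name identity_mapped_devices is made up for illustration.

```python
import re

# Matches the PCI address in "Setting identity map for device 0000:00:1a.0".
DEVICE_RE = re.compile(
    r"Setting identity map for device "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def identity_mapped_devices(dmesg_text):
    """Return the set of PCI devices that were given an identity map."""
    return set(DEVICE_RE.findall(dmesg_text))


# Abbreviated from the intel_iommu=on log (graphics broken).
BROKEN = """
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:02.0
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:1f.0
"""

# Abbreviated from the intel_iommu=on + igfx_off log (works fine).
WORKING = """
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:1f.0
"""

only_in_broken = identity_mapped_devices(BROKEN) - identity_mapped_devices(WORKING)
print(sorted(only_in_broken))  # ['0000:00:02.0']
```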
It's odd that it was triggered (in the Xen case) by a PAT patch.
What was the actual effect of that patch on the caching mode used by the machine in question?
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
cap & (1<<4) is set, which is the RWBF bit:
1: Indicates software must explicitly flush
the write buffers to ensure updates made to
memory-resident remapping structures are
visible to hardware.
ecap & (1<<0) is clear, which is the Coherency bit:
This field indicates if hardware access to the
root, context, extended-context and
interrupt-remap tables, and second-level
paging structures for requests-without-
PASID, are coherent (snooped) or not.
• 0:Indicates hardware accesses to
remapping structures are non-coherent.
So basically this hardware is in a mode where the IOMMU page tables are non-cache coherent. Not only do you have to clflush every cache line in the page tables to main memory when you write it, but you *also* have to jump through hoops to ensure that the writes are pushed through chipset-specific write buffers (see §6.8 of the VT-d specification).
That may help to explain why a seemingly innocent PAT change might have triggered something odd. But it would be good to know precisely what went wrong.
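For reference, the two bit tests can be repeated on all three DMAR units from the log. This is only a sketch: the bit positions are the ones quoted in this comment (RWBF is bit 4 of cap, Coherency is bit 0 of ecap), and the function name decode_dmar is hypothetical.

```python
RWBF_BIT = 4       # cap register: Required Write-Buffer Flushing
COHERENCY_BIT = 0  # ecap register: Coherency


def decode_dmar(cap_hex, ecap_hex):
    """Decode the two capability bits discussed above from dmesg hex strings."""
    cap = int(cap_hex, 16)
    ecap = int(ecap_hex, 16)
    return {
        "rwbf": bool((cap >> RWBF_BIT) & 1),
        "coherent": bool((ecap >> COHERENCY_BIT) & 1),
    }


# cap/ecap values copied from the "dmar: IOMMU n" lines in the report.
UNITS = {
    "IOMMU 0": ("c9008020e30272", "1000"),
    "IOMMU 1": ("c0000020230272", "1000"),
    "IOMMU 2": ("c9008020630272", "1000"),
}

results = {name: decode_dmar(cap, ecap) for name, (cap, ecap) in UNITS.items()}
for name, bits in results.items():
    print(name, bits)
```

With the values from the log, every unit decodes to rwbf=True and coherent=False, so the observation made for IOMMU 1 appears to hold for IOMMU 0 and IOMMU 2 as well.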
Also, does it help to add 'iommu=pt' to the kernel command line? That would make the IOMMU use a 1:1 mapping of all memory, rather than dynamically setting up mappings.
You say it can be reproduced without Xen, with Linux >= 3.7 — can you show the details of that please? And if it doesn't occur in 3.6, can you also bisect the non-Xen case to find when it started happening, please?
Can you tell me how I can test it, or provide a link that describes the steps to get the needed information? I am not familiar with the VT-d spec.
>
> Also, does it help to add 'iommu=pt' to the kernel command line? That would
> make the IOMMU use a 1:1 mapping of all memory, rather than dynamically
> setting up mappings.
No, screen output is still broken.
>
> You say it can be reproduced without Xen, with Linux >= 3.7 — can you show
> the details of that please? And if it doesn't occur in 3.6, can you also
> bisect the non-Xen case to find when it started happening, please?
Good afternoon,
Sorry for the long delay. The last kernel reported in this case is 4.0, which is quite old, and many changes have been made since then, so I'm closing this bug as invalid. If the problem persists on the newest kernel versions (https://www.kernel.org/), please open a new bug with HW and SW information, logs and steps to reproduce. Thank you.
Hello again,
Could you please attach a new dmesg log and error state from a newer kernel version, booted with the parameters drm.debug=0x1e log_buf_len=2M (or bigger) set in GRUB?
Thank you.
It took me more than 1 hour to get this file ... It crashed too quickly.
Xen dmesg messages were obtained from serial console and 'xl dmesg' command. Linux dmesg messages earlier than timestamp 520.360867 were obtained from 'dmesg' command. All messages after it were obtained from serial console because the system crashed and the ssh connection was broken.
I disabled Wayland in /etc/gdm/custom.conf in order to get the result. The system also crashed in Wayland mode, but there was no crash dump file or drm message.
Steps of operations:
1. In the GRUB menu, remove 'iommu=no-igfx' from the Xen command line and add 'drm.debug=0x1e log_buf_len=64M s' to the Linux command line.
2. Boot the system and wait 5 minutes to get a single-user shell.
3. Delete /var/run/nologin.
4. Mount /proc/xen.
5. Start NetworkManager and sshd.
6. Connect to the host via SSH and run the 'xl dmesg' and 'dmesg -w' commands.
7. Leave the single-user shell to continue the normal boot.
8. Once the screen output becomes more broken, type the 'sudo cat /sys/class/drm/card0/error > gpu_crash_dump; sudo sync' command as soon as possible, because the system will stop responding within a few seconds.
9. Reboot the system with the Xen console command 'R'.
10. Boot the system normally to download the 'gpu_crash_dump' file.
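Because the window for saving the crash dump is only a few seconds, that step could be automated instead of typed by hand. The following is only a sketch, not what was actually run for this report: the function names are hypothetical, and it assumes the sysfs file starts with the text "No error state collected" while no hang has been recorded.

```python
import os
import shutil
import time

# Assumption: i915 reports this text while no GPU error state is stored.
NO_ERROR_MARKER = b"No error state collected"


def save_crash_dump(src, dst):
    """Copy src to dst and force the data to disk before returning."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # same intent as running 'sync' afterwards


def wait_and_save(src, dst, interval=0.5, max_polls=None):
    """Poll src until it holds an error state, then save it to dst.

    Returns True once a dump was saved, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            with open(src, "rb") as f:
                head = f.read(len(NO_ERROR_MARKER))
        except OSError:
            head = b""  # file not readable yet, keep polling
        if head and not head.startswith(NO_ERROR_MARKER):
            save_crash_dump(src, dst)
            return True
        polls += 1
        time.sleep(interval)
    return False
```

Run as root before leaving the single-user shell, a call such as wait_and_save("/sys/class/drm/card0/error", "gpu_crash_dump") would then write the dump as soon as the first hang is recorded.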
I'm probably wrong, but this issue may be related to bug 89360.
Yep, wrong. Judging by the previous comments, the situation seems to be the same and points to NOTOURBUG, although VT-d is involved. Let me ping some people to verify.
First of all, sorry about the spam.
This is a mass update for our bugs.
Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!
If you think this is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
Do you still have the issue?
If so, try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
This will speed up the investigation.
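As a side note on what drm.debug=0x1e selects: the value is a bitmask of DRM debug categories. The sketch below assumes the bit assignments documented for the drm.debug module parameter (only the low six bits are listed; newer kernels define more), and the function name is hypothetical.

```python
# Assumed bit assignments of the drm.debug module parameter (low bits only).
DRM_DEBUG_CATEGORIES = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vblank",
}


def decode_drm_debug(mask):
    """List the debug categories enabled by a drm.debug bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_CATEGORIES.items()) if mask & bit]


print(decode_drm_debug(0x1E))  # ['driver', 'kms', 'prime', 'atomic']
```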
Yes, the problem still exists. I could reproduce it with drm-tip commit 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested twice and the results were similar: all characters were broken and the system was unable to show the GDM login screen. The system was accessible over SSH but it couldn't reboot. I ended up pressing 'R' on the Xen hypervisor console to reboot it.
This is the log from the first test. I am not sure why there is an ext4 error in the log, but the kernel starts printing call traces after showing the error.
I forgot to ask Xen to load Intel CPU microcode update in this test, but I think it should not affect the test result. There is a gap between 11.948721 and 315.808150 in the log because it took 10 minutes to activate LVM.
This is the log from the second test. After the first test, I rebooted the system with 'iommu=no-igfx' set on the Xen command line and hoped it could boot normally. However, it stopped and dropped into a shell in the initramfs because the fsck on the rootfs failed. I manually performed fsck and the system seemed to boot up normally to the desktop. I assumed all filesystem troubles caused by the previous test were now cleaned up, and I rebooted the system to do the second test.
This time I remembered to add 'ucode=-1' to the Xen command line to let it load the Intel CPU microcode update. The version of the microcode update file is 'revision 0x11, date = 2018-05-08'. The kernel printed a lot of repeated messages in this test and the log quickly grew over 30M. I reset the system from Xen once I saw it printing messages endlessly. Because of the large file size, I only uploaded the first 20000 lines of the log here.
Attachment 141536, "dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #2": hello4
How often do you see this issue? Every time you reboot?
Yes, it happens on every boot unless I pass iommu=no-igfx on the Xen command line.
Yes, I could reproduce it with drm-tip commit 87c99602f2beb1b0ee7bdb3310bf12133f4d3f7f. Screen output was broken in the same way as the old screenshot. GDM could not start. The system panicked and rebooted after a few minutes.
Hi Reporter, can you try to reproduce this issue on the latest drm-tip (https://cgit.freedesktop.org/drm-tip)?
Please post your observations and any steps followed to reproduce the issue.
I can reproduce the issue using the same Xen and Linux kernel command line arguments. The screen output was broken in the same way, and the system crashed in one minute. I tested it with drm-tip commit f5d18496a7876fd70b9d96d6a87b3c910f5e2ef0.
I tested the latest drm-tip, and it was even more broken than Linux 5.16.15. It couldn't even show broken output. The screen became completely black after loading i915, and the system crashed in a few seconds.
Worse, the issue couldn't be worked around with the iommu=no-igfx option. The system could boot to text mode with iommu=no-igfx when using Linux 5.16.15, but the option didn't seem to fix anything when using drm-tip.
I also checked if it could work without VT-d. No, Linux crashed with no message when VT-d was disabled:
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 612kB init memory
mapping kernel into physical memory
about to get started...
(XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
I wonder if this is actually a hardware problem. It is known (per a past Intel advisory about SGX) that the integrated GPU doesn’t use the standard DMA paths, but rather uses a shortcut within the chip. Perhaps DMA by the iGPU is not being properly translated by the IOMMU? If so, one workaround would be to ensure that the iGPU’s DMA page tables are an identity mapping.