[xen iommu] After upgrading to Linux 3.19, desktop no longer works in Xen 4.5.0 dom0
When using Linux 3.19 or 4.0 as the dom0 kernel of Xen 4.5.0, characters on the screen become corrupted after the graphics driver is loaded. Please see the attached screenshot.
After Xorg is started by GDM, more errors occur and my monitor turns off because it receives no signal.
[ 337.673979] [drm] stuck on render ring
[ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 337.676817] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 337.676818] [drm] Please file a new bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 337.676818] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 337.676819] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 337.676820] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 337.680940] drm/i915: Resetting chip after gpu hang
[ 343.665948] [drm] stuck on render ring
[ 343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 343.670016] [drm:i915_set_reset_status [i915]] ERROR gpu hanging too fast, banning!
[ 343.673893] drm/i915: Resetting chip after gpu hang
[ 345.086609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
Please see the attached dmesg and crash dump. This problem makes the desktop unstable and unusable.
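For triage, the hang records in the log above can be pulled out mechanically. The following is only a sketch: the regular expression is written against the two "GPU HANG" lines quoted in this report, and the names `HANG_RE` and `find_gpu_hangs` are made up for illustration. Other kernel versions may format these lines differently.

```python
import re

# Pattern modelled on the "[drm] GPU HANG" lines quoted in this report.
# Assumption: other kernels may use a different message format.
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>\S+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)

def find_gpu_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in the given dmesg text."""
    hangs = []
    for line in dmesg_text.splitlines():
        m = HANG_RE.search(line)
        if m:
            hangs.append({
                "timestamp": float(m.group("ts")),
                "ecode": m.group("ecode"),
                "process": m.group("proc"),
                "pid": int(m.group("pid")),
                "reason": m.group("reason"),
                "action": m.group("action"),
            })
    return hangs

# The two hang lines from the log above, copied verbatim.
SAMPLE = """\
[ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
"""

hangs = find_gpu_hangs(SAMPLE)
for h in hangs:
    print(h["timestamp"], h["process"], h["pid"], h["reason"])
```

The gap between consecutive timestamps shows how quickly the hang recurs after a reset, which is what leads to the "hanging too fast, banning" message.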
This problem also happens on Linux >= 3.7 without Xen when 'intel_iommu=on' is used. It can be worked around by adding 'intel_iommu=igfx_off'. Is this expected behavior or a bug? Here is some 'dmesg | grep -i iommu' output.
Linux 3.6.11 with intel_iommu=on works fine.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005366] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005360] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005359] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ +0.003267] IOMMU 0 0xfed90000: using Register based invalidation
[ +0.006143] IOMMU 2 0xfed93000: using Register based invalidation
[ +0.006141] IOMMU: Setting RMRR:
[ +0.003298] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ +0.008310] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ +0.008269] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005376] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]
Linux >= 3.7 without any intel_iommu argument works fine.
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005384] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
Linux >= 3.7 with intel_iommu=on causes graphics problems.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ +0.003430] IOMMU: dmar1 using Register based invalidation
[ +0.005553] IOMMU: dmar0 using Register based invalidation
[ +0.005559] IOMMU: dmar2 using Register based invalidation
[ +0.005560] IOMMU: Setting RMRR:
[ +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ +0.008334] IOMMU: Setting identity map for device 0000:00:02.0
[0xd7c00000 - 0xdfffffff]
[ +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ +0.007798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]
Linux >= 3.7 with intel_iommu=igfx_off works fine.
[ +0.000000] Intel-IOMMU: disable GFX device mapping
[ +0.005388] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
Linux >= 3.7 with both intel_iommu=on and intel_iommu=igfx_off also
works fine.
[ 0.000000] Intel-IOMMU: disable GFX device mapping
[ 0.000000] Intel-IOMMU: enabled
[ 0.205011] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap
c9008020e30272 ecap 1000
[ 0.218432] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
[ 0.231848] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap
c9008020630272 ecap 1000
[ 1.873199] IOMMU: dmar0 using Register based invalidation
[ 1.878757] IOMMU: dmar2 using Register based invalidation
[ 1.884315] IOMMU: Setting RMRR:
[ 1.887631] IOMMU: Setting identity map for device 0000:00:1a.0
[0xd7aec000 - 0xd7afffff]
[ 1.895972] IOMMU: Setting identity map for device 0000:00:1d.0
[0xd7aec000 - 0xd7afffff]
[ 1.904285] IOMMU: Setting identity map for device 0000:00:1a.0
[0xe4000 - 0xe7fff]
[ 1.912079] IOMMU: Setting identity map for device 0000:00:1d.0
[0xe4000 - 0xe7fff]
[ 1.919871] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
0xffffff]
It seems the difference between working and broken arguments is 'device 0000:00:02.0', which is the Intel integrated graphics controller.
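The comparison above can be repeated on any pair of logs with a short script. This is a minimal sketch, not an established tool: `mapped_devices` and `extra_devices` are hypothetical helper names, and the sample strings are abbreviated from the outputs quoted in this report.

```python
import re

# Matches the device address in "Setting identity map for device ..." lines.
MAP_RE = re.compile(r"Setting identity map for device (\S+)")

def mapped_devices(dmesg_text):
    """Return the set of PCI addresses that received an identity map."""
    return set(MAP_RE.findall(dmesg_text))

def extra_devices(working_log, broken_log):
    """Devices identity-mapped in the broken log but not in the working one."""
    return sorted(mapped_devices(broken_log) - mapped_devices(working_log))

# Abbreviated from the 3.6.11 (working) output above.
WORKING = """\
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
"""

# Abbreviated from the >= 3.7 intel_iommu=on (broken) output above.
BROKEN = """\
IOMMU: Setting identity map for device 0000:00:1a.0
IOMMU: Setting identity map for device 0000:00:1d.0
IOMMU: Setting identity map for device 0000:00:02.0
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 -
"""

print(extra_devices(WORKING, BROKEN))
```

Using a set difference ignores the ordering and repetition of the map lines, which differ between kernels, and reports only the devices unique to the broken configuration.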
It's odd that it was triggered (in the Xen case) by a PAT patch.
What was the actual effect of that patch on the caching mode used by the machine in question?
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
c0000020230272 ecap 1000
cap & (1<<4) is set, which is the RWBF bit:
1: Indicates software must explicitly flush
the write buffers to ensure updates made to
memory-resident remapping structures are
visible to hardware.
ecap & (1<<0) is clear, which is the Coherency bit:
This field indicates if hardware access to the
root, context, extended-context and
interrupt-remap tables, and second-level
paging structures for requests-without-
PASID, are coherent (snooped) or not.
• 0:Indicates hardware accesses to
remapping structures are non-coherent.
So basically this hardware is in a mode where the IOMMU page tables are non-cache coherent. Not only do you have to clflush every cache line in the page tables to main memory when you write it, but you *also* have to jump through hoops to ensure that the writes are pushed through chipset-specific write buffers (see §6.8 of the VT-d specification).
That may help to explain why a seemingly innocent PAT change might have triggered something odd. But it would be good to know precisely what went wrong.
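The two bit tests described above can be checked against all three units in the dmesg output. This sketch uses only the bit positions quoted in this comment (RWBF at bit 4 of cap, Coherency at bit 0 of ecap). The constant and function names are made up for illustration.

```python
# Bit positions as quoted in the comment above.
RWBF_BIT = 1 << 4       # cap: software must flush write buffers explicitly
COHERENCY_BIT = 1 << 0  # ecap: hardware walks of remapping structures are snooped

def decode_caps(cap, ecap):
    """Decode the two capability bits discussed in this thread."""
    return {
        "rwbf": bool(cap & RWBF_BIT),
        "coherent": bool(ecap & COHERENCY_BIT),
    }

# cap/ecap values copied from the "dmar: IOMMU n" lines in this report.
UNITS = {
    "IOMMU 0": (0xC9008020E30272, 0x1000),
    "IOMMU 1": (0xC0000020230272, 0x1000),
    "IOMMU 2": (0xC9008020630272, 0x1000),
}

decoded = {name: decode_caps(cap, ecap) for name, (cap, ecap) in UNITS.items()}
for name, bits in decoded.items():
    print(name, bits)
```

All three units share the same low-order cap bits and the same ecap value, so the decoding applies equally to each of them, not only to IOMMU 1.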
Also, does it help to add 'iommu=pt' to the kernel command line? That would make the IOMMU use a 1:1 mapping of all memory, rather than dynamically setting up mappings.
You say it can be reproduced without Xen, with Linux >= 3.7 — can you show the details of that please? And if it doesn't occur in 3.6, can you also bisect the non-Xen case to find when it started happening, please?
> It's odd that it was triggered (in the Xen case) by a PAT patch.
> What was the actual effect of that patch on the caching mode used by the
> machine in question?
> [ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
> c0000020230272 ecap 1000
> cap & (1<<4) is set, which is the RWBF bit:
> 1: Indicates software must explicitly flush
> the write buffers to ensure updates made to
> memory-resident remapping structures are
> visible to hardware.
> ecap & (1<<0) is clear, which is the Coherency bit:
> This field indicates if hardware access to the
> root, context, extended-context and
> interrupt-remap tables, and second-level
> paging structures for requests-without-
> PASID, are coherent (snooped) or not.
> • 0:Indicates hardware accesses to
> remapping structures are non-coherent.
> So basically this hardware is in a mode where the IOMMU page tables are
> non-cache coherent. Not only do you have to clflush every cache line in the
> page tables to main memory when you write it, but you *also* have to jump
> through hoops to ensure that the writes are pushed through chipset-specific
> write buffers (see §6.8 of the VT-d specification).
> That may help to explain why a seemingly innocent PAT change might have
> triggered something odd. But it would be good to know precisely what went
> wrong.
Can you tell me how to test this, or point me to a link describing the steps to collect the information you need? I am not familiar with the VT-d spec.
>
> Also, does it help to add 'iommu=pt' to the kernel command line? That would
> make the IOMMU use a 1:1 mapping of all memory, rather than dynamically
> setting up mappings.
No, screen output is still broken.
>
> You say it can be reproduced without Xen, with Linux >= 3.7 — can you show
> the details of that please? And if it doesn't occur in 3.6, can you also
> bisect the non-Xen case to find when it started happening, please?
Good afternoon,
Sorry for the long delay. The last kernel reported in this case was 4.0, which is quite old, and many changes have been made since then, so I'm closing this bug as invalid. If the problem persists on the newest kernel versions (https://www.kernel.org/), please open a new bug with HW and SW information, logs and steps to reproduce. Thank you.
Hello again,
Could you please attach a new dmesg log and error state from a newer kernel version, booted with the parameters drm.debug=0x1e log_buf_len=2M (or bigger) set in GRUB?
Thank you.
It took me more than 1 hour to get this file ... It crashed too quickly.
Xen dmesg messages were obtained from serial console and 'xl dmesg' command. Linux dmesg messages earlier than timestamp 520.360867 were obtained from 'dmesg' command. All messages after it were obtained from serial console because the system crashed and the ssh connection was broken.
I disabled wayland in /etc/gdm/custom.conf in order to get the result. The system also crashed in wayland mode, but there was no crash dump file or drm message.
Steps performed:
In the GRUB menu, remove 'iommu=no-igfx' from the Xen command line and add 'drm.debug=0x1e log_buf_len=64M s' to the Linux command line.
Boot the system and wait 5 minutes to get a single-user shell.
Delete /var/run/nologin.
Mount /proc/xen.
Start NetworkManager and sshd.
Connect to the host over ssh and run the 'xl dmesg' and 'dmesg -w' commands.
Leave the single-user shell to continue the normal boot.
Once the screen output becomes more corrupted, run 'sudo cat /sys/class/drm/card0/error > gpu_crash_dump; sudo sync' as soon as possible, because the system will stop responding within a few seconds.
Reboot the system with the Xen console command 'R'.
Boot the system normally to download the 'gpu_crash_dump' file.
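Because the window for saving the dump is only a few seconds, the capture step can be scripted ahead of time. The following is a rough sketch of the same 'cat ... > gpu_crash_dump; sync' step. `save_crash_dump` is a hypothetical helper, the default paths are the ones used in the steps above, and reading the sysfs file is assumed to require root.

```python
import os

def save_crash_dump(src="/sys/class/drm/card0/error", dst="gpu_crash_dump"):
    """Copy the GPU error state to dst and force it to disk before a crash.

    Returns the number of bytes written.
    """
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as out:
        out.write(data)
        out.flush()
        # Push the file contents to stable storage; the machine may
        # stop responding within seconds of the hang.
        os.fsync(out.fileno())
    return len(data)
```

Calling `save_crash_dump()` from a root shell over ssh as soon as the corruption worsens would play the role of the manual command. Calling `os.fsync` on the output file, instead of a global sync, limits the wait to the one file that matters.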
I'm probably wrong, but this issue may be related to bug 89360.
Yep, wrong. Judging by the previous comments, the situation seems to be the same and points to NOTOURBUG, though VT-d is involved here. Let me ping some people to verify.
First of all, sorry about the spam.
This is a mass update for our bugs.
Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!
If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you also do that to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.