Testing on multiple distro live ISOs and on an installed Fedora 35 system, all hang with the same error messages; dmesg output from the most recent example is attached:
/sys/class/drm/card0/error does not exist when the error occurs.
No /var/log/Xorg.*.log files are being generated either, indicating that Xorg crashes before it can write anything to disk.
Hardware has been tested thoroughly on a separate Windows 10 installation via AIDA64, Prime95, multiple games, Memtest86, and OCCT, so hardware issues have been ruled out as the cause of the crashes.
The only distros known to boot without the error are Rocky Linux and AlmaLinux, but only when passing "intel_idle.max_cstate=1 i915.enable_dc=0 ahci.mobile_lpm_policy=1" to the kernel on boot. Passing any of these parameters to the kernel on any other distro tested does not eliminate the GPU hang error.
Passing "nomodeset" to kernel on all distros does bypass the error, but makes distros completely unusable for daily usage.
Gigabyte Z87X-UD5H motherboard, using F10e BIOS (also happens with F9, stock settings)
Intel Core i5-4670K, stock clocks
24 GB of G.Skill DDR3-2133 RAM, currently set to XMP mode (issue still occurs regardless of RAM settings)
No discrete GPU.
3 HDDs (a mix of 1 FireCuda, 1 ST320, and 1 ST3200 series), running in AHCI mode. Have also used BIOS RAID to the same effect.
We're working on reproducing this issue on Fedora 35 (kernels 5.14.10-300.fc35 and 5.16.20-200.fc35). So far, we haven't experienced the GPU hang described here.
There's one thing I would like to confirm. When you described that you can only see the "version" file in the /sys/class/drm/ directory, did you mean when running a kernel with the nomodeset param, or when the GPU hang happens? If you meant the former, I'd try to inspect it after the GPU hang.
Assuming that there is no kernel panic, you should be able to escape the crashed X session by hitting a "magic" SysRq key combination (you'd need to enable it first; SysRq usually shares the PrtSc key) and navigate to a console to get the /sys/class/drm/card0/error file. If it's still missing, take a look at /sys/kernel/debug/dri/0/i915_error_state.
This file is essential for debugging the issue. Don't forget to attach the new dmesg output with it.
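For reference, a minimal sketch of enabling the "magic" SysRq key as described (the value 1 enables all SysRq functions):
$ echo 1 | sudo tee /proc/sys/kernel/sysrq
$ # or persist it across reboots:
$ echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/90-sysrq.conf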
Ideally, you should do this on drm-tip, like @raviteja suggested, but we can try recovering the files with an older kernel first.
The user provided me with comments via Twitter, as for some reason he can't log in to GitLab.
Here is the answer:
> I did enable sysrq, and when I invoke alt+sysrq+e, then alt+F3, dmesg keeps dumping the GPU hang error message on the screen over and over, making it impossible to deliver commands cleanly to the terminal. So I'm unable to retrieve what they're asking for, unless there's a way for me to cleanly input terminal commands while the system is REISUB'd.
> This is on Arch Linux, kernel version 5.17.4-arch1-1.
Many thanks for cross-posting the reply, @dizzer .
> I did enable sysrq, and when I invoke alt+sysrq+e, then alt+F3
This sysrq combination sends a SIGTERM signal, which is going to be ignored in this scenario. I'd try alt+sysrq+r (to take the keyboard away from X's control) and then alt+sysrq+k (kill all processes on the current virtual console). This should bring you back to the console with, hopefully, no more GPU hang error messages.
If it doesn't work, we can approach it differently. Here you can find a tiny script that copies a couple of files over to the user directory, including the error state one: gpu_logs.sh
Download the script when booted with nomodeset and make it executable, so you can call it when the GPU hang happens. Two commands there require sudo, so you need to execute it as a user from the sudo/wheel group. After doing so, please attach the text files from /home/$USER/gpu_logs here.
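The script itself isn't preserved in this thread; based on the description above, a hypothetical sketch might look like this (the output directory and file names are assumptions):
#!/bin/sh
# hypothetical sketch of gpu_logs.sh: copy the GPU error-state files
# to the user's home directory; these are the two commands needing sudo
mkdir -p "$HOME/gpu_logs"
sudo cat /sys/class/drm/card0/error > "$HOME/gpu_logs/card0_error.txt"
sudo cat /sys/kernel/debug/dri/0/i915_error_state > "$HOME/gpu_logs/i915_error_state.txt"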
> The user provided me with comments via Twitter, as for some reason he can't log in to GitLab.
That's going to make the last step a bit more complicated. I believe you've already tried resetting your password and checking the spam folder? If you still have problems logging in, you can provide the logs through other means and I'll post them here.
> This is on Arch Linux, kernel version 5.17.4-arch1-1.
Got it, thanks for the update. Do you use GNOME here as well?
Ok, so now that I've gotten my Gitlab woes in order... HOLY JESUS this has been a ride.
So, I have a present in the form of LOGS, thanks to the nice tidy script you provided. Keep in mind this is on the newly released Ubuntu 22.04, but the effect should be identical (Arch was giving me problems beyond the GPU, so I trashed that installation. Also, this will standardize testing in the future.)
In the case of Arch, yes, that was with Gnome as well. That being said, even on Ubuntu 22.04, REISUB'ing did cause the gpu hang error to constantly dump itself into the terminal, but I managed to get the script to run regardless, so that's a good thing.
With all that said and sent, I did a fun little test, as I had a hunch.
So I enabled RC6 Render Standby in the UEFI and rebooted, but this time into my Windows 10 partition. The result was nearly identical to what's happening in Linux: a hard freeze.
Tried rebooting once more, and this time Windows 10 gave me a BSOD: VIDEO_SCHEDULER_INTERNAL_ERROR, bug check 0x119.
When I disable RC6 Render Standby in the UEFI, and reboot into Windows 10, it works perfectly fine.
Now, with that in mind: when I had Alma or Rocky Linux 8.5 installed (both use the 4.18 kernel, same as RHEL) with RC6 Render Standby disabled in the UEFI (and i915.enable_dc=0 and intel_idle.max_cstate=1 passed to the kernel), powertop showed that RC6 was still being used, despite the feature being 100% disabled in the UEFI.
This all leads me to believe that the issue I'm experiencing is an RC6 Render Standby issue, with DRM ignoring the UEFI setting on top of that.
Again, just a hunch, may very well be something to look into.
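For anyone who wants to verify this themselves, the i915 driver reports its RC6 state in a couple of places (the exact debugfs path varies a little between kernel versions; on newer ones it may live under dri/0/gt/drpc):
$ cat /sys/class/drm/card0/power/rc6_enable
$ sudo cat /sys/kernel/debug/dri/0/i915_drpc_info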
> Ok, so now that I've gotten my Gitlab woes in order... HOLY JESUS this has been a ride
Welcome back!
> Keep in mind this is on the newly released Ubuntu 22.04, but the effect should be identical
Good, we can stick to Ubuntu then.
Thank you for providing all the files, I'm glad the script was useful. I can't see dmesg here, but probably it would be better to generate a new one anyway.
I'd ask you to do one more experiment for me - could you try booting up your Ubuntu with the drm.debug=0x1e log_buf_len=1M kernel params and RC6 disabled in UEFI, then run this script just like you did with the last one: gpu_logs_2.sh?
This is to see what RC6 state the driver reports, grab the newest error message, and get a more verbose dmesg output.
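If it helps, one common way to add those parameters on Ubuntu (assuming the default GRUB setup):
$ sudoedit /etc/default/grub
# set: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash drm.debug=0x1e log_buf_len=1M"
$ sudo update-grub && sudo reboot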
Ok, these populated, but dmesg for some reason is refusing to output to the file. Running ./gpu_logs_2.sh without sudo generates "Operation not permitted" and requests the password; I enter the password, and it dumps me back to the bash prompt.
Running the script with sudo simply dumps me back to the prompt after entering the password, and generates the files I've attached, plus an empty dmesg file.
Update: Ok, after having to reboot the system 15 times and trying various timings of the SysRq keys, I managed to get dmesg to dump into a file.
Now, all these files are while RC6 Render Standby was disabled in the UEFI, yet I'm seeing in drpc that it's enabled? Certainly confirms my suspicions that the driver is simply ignoring the bios settings.
Just as a fun test, since it was mentioned that 5.14 on Fedora 35 had no issues, I decided to install CentOS 9 Stream on my second drive, which also has 5.14.0-80.el9 as its kernel.
Guess what? No issue whatsoever, apart from the fact that it defaults to llvmpipe for OpenGL rendering. It works perfectly fine without passing any additional kernel options at all. So it seems that whatever is causing the hangs in 5.15, 5.3, 5.17, and some subversions of 5.16 is simply not present in 5.14 and 4.18.
The kicker is that powertop shows RC6 working correctly on CentOS 9 Stream, even though it's disabled in the UEFI. Truly curious.
Thank you for all your experiments and logs, they're very useful.
> Ok, these populated, but dmesg for some reason is refusing to output to the file.
I forgot that on Ubuntu/Debian you now need sudo to read dmesg, sorry about that. To fix the script, you can add "sudo" to line 11 so all the files are collected at once.
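Since the script isn't shown here, the fix would look something like this (the contents of line 11 are an assumption):
# before: fails because reading the kernel log now requires root
dmesg > "$HOME/gpu_logs/dmesg.txt"
# after:
sudo dmesg > "$HOME/gpu_logs/dmesg.txt"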
> Now, all these files are while RC6 Render Standby was disabled in the UEFI, yet I'm seeing in drpc that it's enabled? Certainly confirms my suspicions that the driver is simply ignoring the bios settings.
It looks that way, thanks for the logs. This shouldn't be happening, but it appears to be a separate issue.
To narrow down the scope of this problem, could you try passing i915.mitigations=off to one of the problematic kernels (e.g. 5.17) and see if the hang still happens? There were some changes in that area that might have contributed to this issue.
@sonsofblades It is hard to track this issue without knowing which patch caused the regression. We tried to reproduce this issue here without success, so I'd like to ask you to do a git bisect. It shouldn't be hard to do. Assuming that you have Fedora 36 installed on your machine, you would need to do this in order to build your own Kernel 5.17 from upstream:
$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git reset --hard v5.17
$ cp /boot/config-5.17.6-300.fc36.x86_64 .config
$ make && sudo make modules_install install
Then boot the new Kernel v5.17 and check if the bug is still happening. Assuming that the bug is still happening, you would do:
$ cd linux
$ git bisect start
$ git bisect bad
Then you need to check out a version that is known to work. As you mentioned that Kernel v5.14 works, you would now do:
$ git reset --hard v5.14
$ make && sudo make modules_install install
Then boot the new Kernel and test again. Assuming that it now works:
$ git bisect good
$ make && sudo make modules_install install
Then boot the new Kernel. Every time the test succeeds, you use git bisect good, and every time it fails: git bisect bad.
After ~16-20 boots/tests, it will show what patch broke the driver.
With regards to the speed-up, there are two things you can do that will greatly reduce the build time; see the reconstructed commands below.
1. Use more than one CPU at build time. By default, make uses just one CPU. This greatly reduces the build time (roughly 6-8 times faster on a machine with 8 CPU threads).
2. Use ccache to avoid recompiling headers and other files every time; it caches the compilation results on disk. It will likely cut the build time by up to half, and it is even more efficient the second time you build the Kernel.
In order to do both in Fedora, all you need is to do:
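The commands themselves appear to be missing from this comment; a plausible reconstruction, assuming Fedora's standard ccache package (which places compiler symlinks under /usr/lib64/ccache):
$ sudo dnf install ccache
$ export PATH="/usr/lib64/ccache:$PATH"
$ export CCACHE_DIR="$HOME/.cache/ccache"
$ make -j$(nproc) && sudo make modules_install install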
(I usually also place the exports in ~/.bashrc so this setup is always used.)
With regards to booting the new Kernel, you'll need to disable the Secure Boot option in your BIOS, as, when Secure Boot is enabled, the grub shim will refuse to boot an unsigned Kernel.
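As a quick check, mokutil (part of the standard shim tooling on Fedora) can report the current Secure Boot state:
$ mokutil --sb-state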
It could indeed be related to some problem on your system, but it's hard to tell without knowing which patch broke support for it. If it is related to that, it could also affect non-integrated GPUs.
FWIW, I was getting similar GPU hang errors and resets with some specific GPU loads (CPU load did not matter), and it turned out to be a hardware issue. There were two blown caps on the board; after replacement, it no longer hangs. Might be worth a visual inspection.