Testing on multiple distro live ISOs and on an installed Fedora 35 system, all hang with the same error messages; dmesg output from the most recent example is attached:
/sys/class/drm/card0/error does not exist when the error occurs.
No /var/log/Xorg.*.log files are being generated either, indicating that Xorg crashes before it can write anything to disk.
Hardware has been tested thoroughly on a separate Windows 10 installation via AIDA64, Prime95, multiple games, Memtest86, and OCCT, so hardware issues have been ruled out as the cause of the crashes.
The only distros known to boot without the error are Rocky Linux and AlmaLinux, but only when passing "intel_idle.max_cstate=1 i915.enable_dc=0 ahci.mobile_lpm_policy=1" to the kernel on boot. Passing any of these parameters to the kernel on any other distro tested does not eliminate the GPU hang error.
Passing "nomodeset" to kernel on all distros does bypass the error, but makes distros completely unusable for daily usage.
Gigabyte Z87X-UD5H motherboard, using F10e BIOS (also happens with F9, stock settings)
Intel Core i5-4670K, stock clocks
24 GB of G.Skill DDR3-2133 RAM, currently set to XMP mode (issue still occurs regardless of RAM settings)
No discrete GPU.
3 HDDs (a mix of 1 FireCuda, 1 ST320, and 1 ST3200 series), running in AHCI mode. Have also used BIOS RAID to the same effect.
We're working on reproducing this issue on Fedora 35 (kernels 5.14.10-300.fc35 and 5.16.20-200.fc35). So far, we haven't experienced the GPU hang described here.
There's one thing I would like to confirm. When you described that you can only see the "version" file in the /sys/class/drm/ directory, did you mean when running a kernel with the nomodeset param, or when the GPU hang happens? If you meant the former, I'd try to inspect it after the GPU hang.
Assuming that there is no kernel panic, you should be able to escape the crashed X session by hitting a "magic" SysRq key combination (you'd need to enable it first; SysRq usually shares the PrtSc key) and navigate to a console to get the /sys/class/drm/card0/error file. If it's still missing, take a look at /sys/kernel/debug/dri/0/i915_error_state.
This file is essential for debugging the issue. Don't forget to attach the new dmesg output with it.
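For reference, a minimal sketch of enabling the "magic" SysRq key as described (the value 1 enables all SysRq functions):
$ echo 1 | sudo tee /proc/sys/kernel/sysrq
$ # or persist it across reboots:
$ echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/90-sysrq.conf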
Ideally, you should do this on drm-tip, like @raviteja suggested, but we can try recovering the files with an older kernel first.
The user provided me with comments via Twitter, as for some reason he can't log in to GitLab.
Here is the answer:
> I did enable sysrq, and when I invoke alt+sysrq+e, then alt+F3, dmesg keeps dumping the GPU hang error message on the screen over and over, making it impossible to deliver commands cleanly to the terminal. So I'm unable to retrieve what they're asking for, unless there's a way for me to cleanly input terminal commands while the system is REISUB'd.
> This is on Arch Linux, kernel version 5.17.4-arch1-1.
Many thanks for cross-posting the reply, @dizzer .
> I did enable sysrq, and when I invoke alt+sysrq+e, then alt+F3
This sysrq combination sends a SIGTERM signal, which is going to be ignored in this scenario. I'd try alt+sysrq+r (to take the keyboard away from X's control) and then alt+sysrq+k (kill all processes on the current virtual console). This should bring you back to the console with, hopefully, no more GPU hang error messages.
If it doesn't work, we can approach it differently. Here you can find a tiny script that copies a couple of files over to the user directory, including the error state one: gpu_logs.sh
Download the script when booted with nomodeset and make it executable, so you can call it when the GPU hang happens. Two commands there require sudo, so you need to execute it as a user from the sudo/wheel group. After doing so, please attach the text files from /home/$USER/gpu_logs here.
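The script itself isn't preserved in this thread; based on the description above, a hypothetical sketch might look like this (the output directory and file names are assumptions):
#!/bin/sh
# hypothetical sketch of gpu_logs.sh: copy the GPU error-state files
# to the user's home directory; these are the two commands needing sudo
mkdir -p "$HOME/gpu_logs"
sudo cat /sys/class/drm/card0/error > "$HOME/gpu_logs/card0_error.txt"
sudo cat /sys/kernel/debug/dri/0/i915_error_state > "$HOME/gpu_logs/i915_error_state.txt"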
> The user provided me with comments via Twitter, as for some reason he can't log in to GitLab.
That's going to make the last step a bit more complicated. I believe you've already tried resetting your password and checking the spam folder? If you still have problems logging in, you can provide the logs through other means and I'll post them here.
> This is on Arch Linux, kernel version 5.17.4-arch1-1.
Got it, thanks for the update. Do you use GNOME here as well?
Ok, so now that I've gotten my Gitlab woes in order... HOLY JESUS this has been a ride.
So, I have a present in the form of LOGS, thanks to the nice tidy script you provided. Keep in mind this is on the newly released Ubuntu 22.04, but the effect should be identical (Arch was giving me problems beyond the GPU, so I trashed that installation. Also, this will standardize testing in the future.)
In the case of Arch, yes, that was with Gnome as well. That being said, even on Ubuntu 22.04, REISUB'ing did cause the gpu hang error to constantly dump itself into the terminal, but I managed to get the script to run regardless, so that's a good thing.
With all that said and sent, I did a fun little test, as I had a hunch.
So I enabled RC6 Render Standby in the UEFI and rebooted, but this time into my Windows 10 partition. The result was nearly identical to what's happening in Linux: a hard freeze.
Tried rebooting once more, and this time Windows 10 gave me a BSOD: VIDEO_SCHEDULER_INTERNAL_ERROR, bug check 0x119.
When I disable RC6 Render Standby in the UEFI, and reboot into Windows 10, it works perfectly fine.
Now, with that in mind: when I had Alma or Rocky Linux 8.5 installed (both use the 4.18 kernel, same as RHEL) with RC6 Render Standby disabled in the UEFI (and i915.enable_dc=0 and intel_idle.max_cstate=1 passed to the kernel), powertop showed that RC6 was still being used, despite the feature being 100% disabled in the UEFI.
This all leads me to believe that the issue I'm experiencing is an RC6 Render Standby issue, with DRM ignoring the UEFI setting on top of that.
Again, just a hunch, may very well be something to look into.
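For anyone who wants to verify this themselves, the i915 driver reports its RC6 state in a couple of places (the exact debugfs path varies a little between kernel versions; on newer ones it may live under dri/0/gt/drpc):
$ cat /sys/class/drm/card0/power/rc6_enable
$ sudo cat /sys/kernel/debug/dri/0/i915_drpc_info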
> Ok, so now that I've gotten my Gitlab woes in order... HOLY JESUS this has been a ride
Welcome back!
> Keep in mind this is on the newly released Ubuntu 22.04, but the effect should be identical
Good, we can stick to Ubuntu then.
Thank you for providing all the files, I'm glad the script was useful. I can't see dmesg here, but probably it would be better to generate a new one anyway.
I'd ask you to do one more experiment for me - could you try booting up your Ubuntu with the drm.debug=0x1e log_buf_len=1M kernel params and RC6 disabled in UEFI, then run this script just like you did with the last one: gpu_logs_2.sh?
This is to see what RC6 state the driver reports, grab the newest error message, and get a more verbose dmesg output.
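If it helps, one common way to add those parameters on Ubuntu (assuming the default GRUB setup):
$ sudoedit /etc/default/grub
# set: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash drm.debug=0x1e log_buf_len=1M"
$ sudo update-grub && sudo reboot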
Ok, these populated, but dmesg for some reason is refusing to output to the file. Running ./gpu_logs_2.sh without sudo generates "Operation not permitted" and requests the password; I enter the password, and it dumps me back to the bash prompt.
Running the script with sudo simply dumps me back to the prompt after entering the password, and generates the files I've attached, plus an empty dmesg file.
Update: Ok, after having to reboot the system 15 times and trying various timings of the SysRq keys, I managed to get dmesg to dump into a file.
Now, all these files are while RC6 Render Standby was disabled in the UEFI, yet I'm seeing in drpc that it's enabled? Certainly confirms my suspicions that the driver is simply ignoring the bios settings.
Just as a fun test, since it was mentioned that 5.14 on Fedora 35 had no issues, I decided to install CentOS 9 Stream on my second drive, which also has 5.14.0-80.el9 as its kernel.
Guess what? No issue whatsoever, apart from the fact that it defaults to llvmpipe for OpenGL rendering. It works perfectly fine without passing any additional kernel options at all. So it seems that whatever is causing the hangs in 5.15, 5.3, 5.17, and some subversions of 5.16 is simply not present in 5.14 and 4.18.
The kicker is that powertop shows RC6 working correctly on CentOS 9 Stream, even though it's disabled in the UEFI. Truly curious.
Thank you for all your experiments and logs, they're very useful.
> Ok, these populated, but dmesg for some reason is refusing to output to the file.
I forgot that on Ubuntu/Debian you now need sudo to read dmesg, sorry about that. To fix the script, you can add "sudo" to line 11 so all the files are collected at once.
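Since the script isn't shown here, the fix would look something like this (the contents of line 11 are an assumption):
# before: fails because reading the kernel log now requires root
dmesg > "$HOME/gpu_logs/dmesg.txt"
# after:
sudo dmesg > "$HOME/gpu_logs/dmesg.txt"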
> Now, all these files are while RC6 Render Standby was disabled in the UEFI, yet I'm seeing in drpc that it's enabled? Certainly confirms my suspicions that the driver is simply ignoring the bios settings.
It looks that way, thanks for the logs. This shouldn't be happening, but it appears to be a separate issue.
To narrow down the scope of this problem, could you try passing i915.mitigations=off to one of the problematic kernels (e.g. 5.17) and see if the hang still happens? There were some changes in that area that might have contributed to this issue.
@sonsofblades It is hard to track this issue without knowing which patch caused the regression. We tried to reproduce this issue here without success, so I'd like to ask you to do a git bisect. It shouldn't be hard to do. Assuming that you have Fedora 36 installed on your machine, you would need to do this in order to build your own Kernel 5.17 from upstream:
$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git reset --hard v5.17
$ cp /boot/config-5.17.6-300.fc36.x86_64 .config
$ make && sudo make modules_install install
Then boot the new Kernel v5.17 and check if the bug is still happening. Assuming that the bug is still happening, you would do:
$ cd linux
$ git bisect start
$ git bisect bad
Then you need to check out a version that is known to work. As you mentioned that Kernel v5.14 works, you would now do:
$ git reset --hard v5.14
$ make && sudo make modules_install install
Then boot the new Kernel and test again. Assuming that it now works:
$ git bisect good
$ make && sudo make modules_install install
Then boot the new Kernel. Every time the test succeeds, you use git bisect good, and every time it fails: git bisect bad.
After ~16-20 boots/tests, it will show what patch broke the driver.
With regards to the speed-up, there are two things you can do that will greatly reduce the build time; see the reconstructed commands below.
1. Use more than one CPU at build time. By default, make uses just one CPU. This greatly reduces the build time (roughly 6-8 times faster on a machine with 8 CPU threads).
2. Use ccache to avoid recompiling headers and other files every time; it caches the compilation results on disk. It will likely cut the build time by up to half, and it is even more efficient the second time you build the Kernel.
In order to do both in Fedora, all you need is to do:
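The commands themselves appear to be missing from this comment; a plausible reconstruction, assuming Fedora's standard ccache package (which places compiler symlinks under /usr/lib64/ccache):
$ sudo dnf install ccache
$ export PATH="/usr/lib64/ccache:$PATH"
$ export CCACHE_DIR="$HOME/.cache/ccache"
$ make -j$(nproc) && sudo make modules_install install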
(I usually also place the exports in ~/.bashrc so this setup is always used.)
With regards to booting the new Kernel, you'll need to disable the Secure Boot option in your BIOS, as, when Secure Boot is enabled, the grub shim will refuse to boot an unsigned Kernel.
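As a quick check, mokutil (part of the standard shim tooling on Fedora) can report the current Secure Boot state:
$ mokutil --sb-state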
It could indeed be related to some problem on your system, but it's hard to tell without knowing which patch broke support for it. If it is related to that, it could also affect non-integrated GPUs.
FWIW, I was getting similar GPU hang errors and resets with some specific GPU loads (CPU load did not matter), and it turned out to be a hardware issue. There were two blown caps on the board; after replacement, it no longer hangs. Might be worth a visual inspection.