After upgrading to 5.19 (locally built from the Ubuntu mainline https://git.launchpad.net/~ubuntu-kernel-test/ubuntu/+source/linux/+git/mainline-crack) I experienced two lockups where the display was unresponsive and the machine could not be cleanly shut down, even with SysRq keys. I could change VT but could not log in, and I had to force power off both times.
Display(s): laptop panel and ViewSonic 24in monitor
Type of Display Connection: eDP and DP
System information:
Distro name and Version: Ubuntu 22.04.1 LTS (with a custom kernel)
Kernel version: 5.19.0
Custom kernel: Ubuntu mainline-crack
AMD official driver version: ?
How to reproduce the issue:
Install kernel 5.19
Wait. (The first time, the screen was locked and when I came back, nothing worked. The second time, I was on a Teams call and everything locked up. Audio continued to work until the end of the call, so for about half an hour, and then I rebooted.)
Attached files:
Log files (for system lockups / game freezes / crashes)
Yeah sure, I'll try! It might take a long time since it takes about 3 hours for the machine to lock up, so it will be hard to know if a commit is good.
For what it's worth, this also happens on my Vega 56, but seemingly at random. This time it took over 12 hours to occur, and I had been gaming for several hours during that period (though not at the time of the crash).
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32783, for process firefox pid 62950 thread firefox:cs0 pid 63047)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: in page starting at address 0x000080010a0cb000 from IH client 0x1b (UTCL2)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00441051
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: PERMISSION_FAULTS: 0x5
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: RW: 0x1
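To keep a tally of which processes hit the fault, the first fault message quoted above can be picked apart with a regular expression. This is only a sketch in Python: the pattern is fitted to the single sample in this thread, `parse_fault` is a made-up helper name, and other kernel versions may format the message differently.

```python
import re

# Pattern fitted to the one sample line quoted in this thread (an assumption:
# other kernels or process names containing spaces may not match).
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] (?P<kind>no-retry|retry) page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) "
    r"pasid:(?P<pasid>\d+), for process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)\)"
)


def parse_fault(line):
    """Return the fields of a page-fault header line, or None if no match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the purely numeric fields to integers.
    for key in ("src_id", "ring", "vmid", "pasid", "pid", "tid"):
        info[key] = int(info[key])
    return info


sample = (
    "Aug 13 21:24:13 main kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub0] "
    "no-retry page fault (src_id:0 ring:40 vmid:4 pasid:32783, for process "
    "firefox pid 62950 thread firefox:cs0 pid 63047)"
)
fault = parse_fault(sample)
print(fault["process"], fault["pid"], fault["vmid"])
```

Feeding it the output of `journalctl -k` line by line and counting `fault["process"]` would show whether the same applications keep turning up.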
I am slowly but surely bisecting this. As mentioned before, without a reproducible test case it's pretty tedious waiting long enough to be sure a commit is bad!
However, during an attempt to shut down the machine after a lockup, I noticed this in the journal:
Nearly there... I'm down to these three commits to test:
bffa91dadf59 (refs/bisect/bad) drm/amdkfd: start using tlb_seq from the VM subsystem
5255e146c99a (HEAD) drm/amdgpu: rework TLB flushing
e997b82745a5 drm/amdgpu: simplify VM update tracking a bit
Hello. I'm a Vega 56 user and I've had these no-retry page faults as well. The issue first occurred either when I jumped from kernel 5.18.7 (released 25 Jun) to 5.18.14 (released 23 Jul), or after the linux-firmware update from 20220708 to 20220815 (released 15 Aug).
The file that changed between the two firmware versions is /lib/firmware/amdgpu/vega10_asd.bin
I am cautiously optimistic that I may have been able to work around it by downgrading said firmware (5 days of continuous uptime so far, while the bug usually triggered before day 3). As of now I can't be 100% sure of success, however, as I haven't found any means to deliberately and reliably trigger the bug. So far, I've seen it happen with firefox, steamwebhelper, mpv and vlc.
This is how the bug usually presented itself in dmesg.
Update (17 Sept): I had the no-retry page faults again after 5.5 days. The downgrade of linux-firmware doesn't seem to make a difference after all. I had lots(!) of YT tabs open with their videos paused, then opened another one, pressed play and boom, the screen hung (switching to a tty still worked though), with the same error messages as before in dmesg.
It seems like every time this occurred for me, applications using hardware-accelerated video decoding were involved... Maybe too many handles on the decoder, or still VM related??
5255e146c99a677d4d55fdb988544bd20c539a0b is the first bad commit
commit 5255e146c99a677d4d55fdb988544bd20c539a0b
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Mar 15 15:27:45 2022 +0100

    drm/amdgpu: rework TLB flushing

    Instead of tracking the VM updates through the dependencies just use a
    sequence counter for page table updates which indicates the need to
    flush the TLB.

    This reduces the need to flush the TLB drastically.

    v2: squash in NULL check fix (Christian)

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  8 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c  |  6 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 20 -----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.h |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   | 57 ++++++++++++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h   | 15 +++++++++
 6 files changed, 76 insertions(+), 32 deletions(-)
Trying to revert only this commit on the latest mainline produces a bunch of conflicts which I am not capable of resolving. Is it obvious to the authors whether this commit is responsible for the page faults and whether it is fixable?
It might be worth pointing out that I still don't have a reliable test case for this. A Teams call will often reproduce it after about ten minutes, but it's possible that I didn't leave my laptop running for enough days to be sure that a commit was really good. If there is some other debugging we could do I'd be happy to try that.
Especially, if you could create a patch which allows me to revert that commit, I would have some confidence after a few weeks that this really is the cause.
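On the question of how long a commit has to run cleanly before it can be marked good, a rough estimate is possible under one strong assumption that nothing in this thread confirms: that lockups arrive at random with a constant average rate (an exponential waiting time). The function names below are made up for this sketch, and the 3-hour mean is just the figure mentioned earlier for this machine.

```python
import math


def hours_needed(mean_hours_to_lockup, confidence):
    """Lockup-free hours needed before a bad commit would have shown itself
    with the given probability (assumes exponential waiting times)."""
    return -mean_hours_to_lockup * math.log(1.0 - confidence)


def chance_bad_commit_survives(mean_hours_to_lockup, hours_waited):
    """Probability that a bad commit runs this long without locking up."""
    return math.exp(-hours_waited / mean_hours_to_lockup)


# With a mean of about 3 hours to lockup, as reported earlier in the thread:
print(round(hours_needed(3.0, 0.95), 1))                # about 9.0 hours
print(round(chance_bad_commit_survives(3.0, 3.0), 2))   # about 0.37
```

If the real mean time to lockup is closer to days than hours, as the Vega 56 reports suggest, the required wait scales up by the same factor, which is why a "good" verdict in this bisect carries so much uncertainty.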
That won't be easily possible since we have a lot of dependencies on top of that.
What we could maybe do is to add the old tracking in parallel and then print a warning whenever the old behavior would have flushed the TLB while the new one doesn't.
Going to give that a try, but it might take me a day or two.
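The idea of running the old tracking in parallel can be illustrated with a toy model. This is not amdgpu code and not the patch being proposed: the class, its fields and the `bumps_seq` flag are all hypothetical, and the thread does not say which page table updates the real driver counts towards the sequence number. The sketch only shows the shape of the check: keep both bookkeeping schemes, act on the new one, and warn when the old one would have flushed but the new one does not.

```python
class ShadowTlbTracker:
    """Toy model: new sequence-counter scheme plus a shadow of the old one."""

    def __init__(self):
        self.tlb_seq = 0          # new scheme: bumped only by some updates
        self.flushed_seq = 0      # sequence value at the last real flush
        self.updates = 0          # shadow of old scheme: counts every update
        self.seen_updates = 0     # update count at the last old-style flush
        self.warnings = []

    def page_table_update(self, bumps_seq):
        """Record an update; bumps_seq is a hypothetical stand-in for
        whatever the real driver uses to decide a flush is required."""
        self.updates += 1
        if bumps_seq:
            self.tlb_seq += 1

    def submit(self, job):
        """Return True if the new scheme flushes the TLB before this job."""
        new_wants_flush = self.tlb_seq != self.flushed_seq
        old_wants_flush = self.updates != self.seen_updates
        if old_wants_flush and not new_wants_flush:
            self.warnings.append(f"{job}: old tracking would have flushed")
        if new_wants_flush:
            self.flushed_seq = self.tlb_seq
        # The old scheme would have flushed here, so reset its marker to
        # avoid repeating the same warning on every later submission.
        self.seen_updates = self.updates
        return new_wants_flush


tracker = ShadowTlbTracker()
tracker.page_table_update(bumps_seq=True)
tracker.submit("job-a")                     # both schemes agree: flush
tracker.page_table_update(bumps_seq=False)
tracker.submit("job-b")                     # only the old scheme would flush
print(tracker.warnings)
```

If a warning like this shows up shortly before a page fault on an affected machine, that would point at a flush the new scheme skipped.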