Linux 6.8-rc1: "Seemingly commit a6149f039369 broke amdgpu driver"
I'm reposting a kernel bugzilla issue filed by Niklāvs Koļesņikovs:
First of all, I'm very sorry I have to post this here, since I know it should go elsewhere but I can't use the correct issue tracker. I hope this report can be sent to the right people, because with any luck I did manage to bisect it correctly.
The issue I'm facing is kwin_wayland unpredictably hanging during regular use. I tried a large number of Mesa environment variables as well as both the RR and FIFO GPU schedulers, but nothing made the hang either happen or go away 100% of the time, so it's likely a timing-related bug. With one exception, across many boots there were no dmesg entries indicating any kind of issue, and the system journal does not show any obvious patterns or failures either, so the one time it did print anything might just be a consequence of the actual bug.
In one of the bootups, a "BUG: kernel NULL pointer dereference, address: 0000000000000008" was reported, but that was on commit b70438004a14, which should be well into the clear, since multiple other commits past it were tested for days of typical use in total without a single GUI hang.
When kwin_wayland's screen output freezes, SysRq+E can rarely get back to a working SDDM Wayland login prompt (running the Weston display server), but it almost always freezes there and, if not, it will freeze imminently during or after login. Likewise, switching to a tty is unlikely to be possible, or it will eventually freeze, too. In one case the mouse pointer was movable but nothing reacted to interactions.
The issue started happening between Linux 6.8 pre-rc1 commits 70d201a40823 (good) and 052d534373b7 (bad). With no reliable reproducer, bisecting this was not easy and I still can't say with full confidence that I got the right culprit, but my third round of git bisection arrived at a6149f0393699308fb00149be913044977bceb56 being the first bad commit. It may or may not be relevant that at some point (IIRC, between ca34d816558c and e013aa9ab01b) the kernel also started to hang severely when entering S3 sleep as well as at the end of the systemctl reboot process, but I do not know whether that's indicative of the same bug or not. When the S3 or reboot hangs happen, the PC case's reset button is required, i.e. SysRq+B does nothing.
I did encounter #3124, and it seems similar to my issue with kwin_wayland; however, the instant GNOME hang went away during bisection, so I'm not sure if it's a more severe form of the same underlying bug or a different one.
Hardware in use: Intel Core i5-12400 CPU and AMD RX 580 GPU, with the Intel UHD 730 iGPU in RC6 render standby, used for HEVC encoding and Vulkan compute roles. IOMMU and CET are enabled. Bisection was initially done with linux-firmware 20231211 and the third round with 20240115. If it's relevant, I have the second-newest Intel ME and UEFI firmware for my platform, since I'm still waiting for enough time to go by before I dare to flash the latest unsigned firmware update. sigh
a6149f0393699308fb00149be913044977bceb56 is the first bad commit
commit a6149f0393699308fb00149be913044977bceb56
Author: Matthew Brost <matthew.brost@intel.com>
Date: Mon Oct 30 20:24:36 2023 -0700
drm/sched: Convert drm scheduler to use a work queue rather than kthread
In Xe, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.
1. In Xe the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order even if targeting the same hardware
engine. This is because in Xe we have a firmware scheduler, the GuC,
which is allowed to reorder, timeslice, and preempt submissions. If using a
shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
2. In Xe, submissions are done by programming a ring buffer (circular
buffer), and a drm_gpu_scheduler provides a limit on the number of jobs; if
that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
control on the ring for free.
A problem with this design is that currently a drm_gpu_scheduler uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_schedulers are used. To work around the scaling issue,
use a worker rather than a kthread for submission / job cleanup.
v2:
- (Rob Clark) Fix msm build
- Pass in run work queue
v3:
- (Boris) don't have loop in worker
v4:
- (Tvrtko) break out submit ready, stop, start helpers into own patch
v5:
- (Boris) default to ordered work queue
v6:
- (Luben / checkpatch) fix alignment in msm_ringbuffer.c
- (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue
- (Luben) Update comment for drm_sched_wqueue_enqueue
- (Luben) Positive check for submit_wq in drm_sched_init
- (Luben) s/alloc_submit_wq/own_submit_wq
v7:
- (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue
v8:
- (Luben) Adjust var names / comments
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20231031032439.1558703-3-matthew.brost@intel.com
Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
drivers/gpu/drm/etnaviv/etnaviv_sched.c | 2 +-
drivers/gpu/drm/lima/lima_sched.c | 2 +-
drivers/gpu/drm/msm/msm_ringbuffer.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_sched.c | 2 +-
drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
drivers/gpu/drm/scheduler/sched_main.c | 131 +++++++++++++++--------------
drivers/gpu/drm/v3d/v3d_sched.c | 10 +--
include/drm/gpu_scheduler.h | 14 +--
9 files changed, 86 insertions(+), 81 deletions(-)
Slight correction regarding #3124: for me, the GNOME hang usually happens at the end of the login procedure, but once it also happened right after login via SDDM. However, since I use Weston rather than Mutter for the login prompt, it could still be the same bug w.r.t. GNOME hanging with 6.8-rc1.
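For context on what the blamed commit actually changes, here is a minimal, hypothetical C sketch of the pattern its message describes: a dedicated per-scheduler submission kthread is replaced by a work item queued on a (by default ordered) workqueue. This is not the real drm/sched code; struct my_sched, my_sched_job_pending(), my_sched_run_one_job() and the other names are made up purely for illustration.

#include <linux/errno.h>
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

struct my_sched {
        struct task_struct      *thread;        /* old model: one kthread per scheduler */
        struct workqueue_struct *submit_wq;     /* new model: owned or shared workqueue */
        struct work_struct      work_run_job;
        wait_queue_head_t       job_wait;
};

/* Placeholder helpers; a real scheduler would inspect its job queues here. */
static bool my_sched_job_pending(struct my_sched *sched) { return false; }
static void my_sched_run_one_job(struct my_sched *sched) { }

/* Old model: each scheduler owns a kthread that loops until it is stopped. */
static int my_sched_kthread(void *data)
{
        struct my_sched *sched = data;

        while (!kthread_should_stop()) {
                wait_event_interruptible(sched->job_wait,
                                         my_sched_job_pending(sched) ||
                                         kthread_should_stop());
                if (my_sched_job_pending(sched))
                        my_sched_run_one_job(sched);
        }
        return 0;
}

/*
 * New model: the loop body becomes a work item. It runs one job and
 * re-queues itself while work remains instead of looping in a thread,
 * so many schedulers can share a handful of worker threads.
 */
static void my_sched_run_job_work(struct work_struct *w)
{
        struct my_sched *sched = container_of(w, struct my_sched, work_run_job);

        my_sched_run_one_job(sched);
        if (my_sched_job_pending(sched))
                queue_work(sched->submit_wq, &sched->work_run_job);
}

static int my_sched_init(struct my_sched *sched)
{
        /* Mirrors the changelog's "default to ordered work queue" idea. */
        sched->submit_wq = alloc_ordered_workqueue("my_sched_submit", 0);
        if (!sched->submit_wq)
                return -ENOMEM;
        INIT_WORK(&sched->work_run_job, my_sched_run_job_work);
        init_waitqueue_head(&sched->job_wait);
        return 0;
}

Whether this conversion is really what breaks amdgpu remains to be confirmed; if it is, the difference in wakeup and ordering semantics between queue_work() and waking a dedicated kthread would be the obvious place to look, which would also fit the reporter's impression of a timing-related bug.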