NULL pointer dereference in dma_resv_add_fence
Brief summary of the problem:
I am getting a driver crash when I run a homegrown Python app that runs some ML models via PyTorch. I can trigger the error 100% of the time with kernel version 6.6.0; the issue does not happen with version 6.5.6.
The problem appears to be specific to my app and/or ML model: other PyTorch-based apps in the same environment do not trigger this error.
After the error is triggered, I can sometimes continue to work; at other times the computer locks up completely.
Hardware description:
- CPU: AMD Ryzen 7 3700X 8-Core Processor
- GPU: VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX] [1002:744c] (rev c8)
- System Memory: 64 GB DDR4-3200
- Display(s): DELL S2721DS (x2)
- Type of Display Connection: DisplayPort
System information:
- Distro name and Version: Gentoo Linux
- Kernel version: Linux 6.6.0-gentoo
How to reproduce the issue:
I only know how to trigger the issue with my custom app, which is a bit large. I am going to slim it down into something easier to test and share.
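Until the slimmed-down reproducer is ready, a small helper along these lines could pull the relevant part of the trace out of a saved kernel log. This is only a sketch: the marker strings are assumptions based on the issue title, and the `extract_trace` name and its `context` parameter are made up for illustration.

```python
import re

# Assumed markers: the function named in the title plus the generic oops text.
MARKERS = re.compile(r"dma_resv_add_fence|NULL pointer dereference", re.IGNORECASE)


def extract_trace(lines, context=20):
    """Return each matching line plus up to `context` lines that follow it."""
    out = []
    remaining = 0
    for line in lines:
        if MARKERS.search(line):
            # Restart the window on every match so overlapping hits are kept.
            remaining = context + 1
        if remaining > 0:
            out.append(line.rstrip("\n"))
            remaining -= 1
    return out
```

For example, after saving the output of `dmesg` to a file, the file can be opened and passed to `extract_trace` to print only the lines around the crash.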