[bisected][regression] GPU hangs running Vulkan CTS caused by noretry=1
I noticed a regression in recent amd-staging-drm-next kernels: running the full Vulkan CTS on radv on an RX 5700 XT results in GPU hangs.
I bisected it to:
commit dd2ae73d58cb558578c3cfe5322d79feee945c24 (refs/bisect/bad)
Author: Chengming Gui <Jack.Gui@amd.com>
Date: Tue Oct 13 12:18:27 2020 +0800
drm/amd/amdgpu: set the default value of noretry to 1 for some dGPUs
noretry = 0 cause some dGPU's kfd page fault tests fail,
so set noretry to 1 for these special ASICs:
vega20/navi10/navi14
v2: merge raven and default case due to the same setting
v3: remove ARCTURUS
Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Acked-by: Felix Kuhling <Felix.Kuehling@amd.com>
Change-Id: I3be70f463a49b0cd5c56456431d6c2cb98b13872
Indeed, a later kernel booted with amdgpu.noretry=0 fixes the hang for me.
When it hangs there is typically one vmfault logged, but no vmfaults are encountered with noretry=0.
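
For anyone trying to reproduce this, here is a minimal sketch for confirming which noretry value the loaded driver is actually using and for counting fault lines in a saved kernel log. It assumes the parameter is exported at the usual sysfs module-parameter path; the helper names and the "page fault" match string are my own assumptions, so adjust the string to whatever your dmesg prints.

```python
from pathlib import Path
from typing import Optional

# Usual sysfs location for module parameters (assumed to be readable here).
NORETRY_PARAM = Path("/sys/module/amdgpu/parameters/noretry")


def read_noretry(path: Path = NORETRY_PARAM) -> Optional[int]:
    """Return the current noretry value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Module not loaded, file missing, or unexpected contents.
        return None


def count_page_faults(log_text: str, needle: str = "page fault") -> int:
    """Count log lines that mention amdgpu and the needle.

    The needle is an assumed wording, not taken from the driver source.
    """
    return sum(
        1
        for line in log_text.splitlines()
        if "amdgpu" in line and needle in line.lower()
    )


if __name__ == "__main__":
    print("amdgpu.noretry =", read_noretry())
```

Running it once on the bad kernel and once with amdgpu.noretry=0 on the kernel command line, then feeding the saved dmesg output of each run to count_page_faults, should make the difference in fault counts easy to compare.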