AMDGPU hung after enabling HIP GPU acceleration in Blender Cycles 3.3
Using HIP for GPU acceleration in the Cycles render view of Blender 3.3 causes the X server to shut down.
Video showing the problem on YouTube (captured for the kernel Bugzilla report on an older software version): https://www.youtube.com/watch?v=tZzTuvRn3cw
Note that since the kernel report was filed, the behavior has changed: the X server no longer freezes but exits instead.
Hardware:
CPU: AMD Ryzen™ 5 3600
MOTHERBOARD: MSI X470 GAMING PLUS MAX
GPU: SAPPHIRE Radeon RX 6600 8192 MB PULSE (11310-01-20G)
lspci -nn | grep VGA: 29:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] [1002:73ff] (rev c7)
Software version:
Distro: Arch Linux x86-64
linux: 6.0.0-270-tkg-bore TKG SMP PREEMPT_DYNAMIC Wed, 05 Oct 2022 17:55:03
xf86-video-amdgpu: 22.0.0-1
mesa: 22.3.0_devel.160808.1f0a0a46d97-1
rocm-llvm: 5.2.3-2
hip-runtime-amd: 5.2.3-2
blender: 3.3.1-2
Partial log with the problem:
journalctl:
Oct 06 21:30:24 sanka telegram-desktop[5593]: qt.gui.imageio.jpeg: Corrupt JPEG data: premature end of data segment
Oct 06 21:32:09 sanka kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=250180, emitted seq=250182
Oct 06 21:32:09 sanka kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process blender pid 8977 thread blender:cs0 pid 9010
Oct 06 21:32:09 sanka kernel: amdgpu 0000:29:00.0: amdgpu: GPU reset begin!
Oct 06 21:32:09 sanka kernel: amdgpu: Failed to suspend process 0x8015
Oct 06 21:32:09 sanka kernel: amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Oct 06 21:32:09 sanka kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Oct 06 21:32:09 sanka kernel: amdgpu 0000:29:00.0: amdgpu: wait for kiq fence error: 0.
Oct 06 21:32:10 sanka /usr/lib/gdm-x-session[5615]: i3status: Cannot read temperature. Verify that you have a thermal zone in /sys/class/thermal or disable the cpu_temperature module in your >
Oct 06 21:32:10 sanka kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
Oct 06 21:32:10 sanka kernel: [drm] free PSP TMR buffer
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: MODE1 reset
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: GPU mode1 reset
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: GPU smu mode1 reset
Oct 06 21:32:10 sanka /usr/lib/gdm-x-session[6241]: [6241:6241:1006/213210.335315:ERROR:shared_context_state.cc(855)] SharedContextState context lost via ARB/EXT_robustness. Reset status = GL>
Oct 06 21:32:10 sanka /usr/lib/gdm-x-session[6241]: [6241:6241:1006/213210.335735:ERROR:gpu_service_impl.cc(967)] Exiting GPU process because some drivers can't recover from errors. GPU proce>
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 06 21:32:10 sanka kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
Oct 06 21:32:10 sanka kernel: [drm] VRAM is lost due to GPU reset!
Oct 06 21:32:10 sanka kernel: [drm] PSP is resuming...
Oct 06 21:32:10 sanka kernel: [drm] reserve 0xa00000 from 0x81bc000000 for PSP TMR
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: RAS: optional ras ta ucode is not available
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: SMU is resuming...
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2900 (59.41.0)
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: SMU driver if version not matched
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: use vbios provided pptable
Oct 06 21:32:10 sanka kernel: amdgpu 0000:29:00.0: amdgpu: SMU is resumed successfully!
Oct 06 21:32:10 sanka kernel: [drm] DMUB hardware initialized: version=0x02020013
Oct 06 21:32:11 sanka kernel: [drm] kiq ring mec 2 pipe 1 q 0
Oct 06 21:32:11 sanka kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
Oct 06 21:32:11 sanka kernel: [drm] JPEG decode initialized successfully.
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
Oct 06 21:32:11 sanka kernel: amdgpu 0000:29:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
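For triage, the ring timeout entries in the journal excerpt above can be pulled out of a saved log with a small script. This is only a minimal sketch, assuming the journal was saved as plain text in the same short format shown here; the function name `find_ring_timeouts` and the regular expression are illustrative and not part of any amdgpu tooling.

```python
import re

# Matches kernel lines such as:
#   ... *ERROR* ring gfx_0.0.0 timeout, signaled seq=250180, emitted seq=250182
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(lines):
    """Return (ring, signaled_seq, emitted_seq) for every ring timeout line."""
    hits = []
    for line in lines:
        match = RING_TIMEOUT.search(line)
        if match:
            hits.append(
                (
                    match.group("ring"),
                    int(match.group("signaled")),
                    int(match.group("emitted")),
                )
            )
    return hits


# Two lines copied from the journal excerpt above.
sample = [
    "Oct 06 21:32:09 sanka kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=250180, emitted seq=250182",
    "Oct 06 21:32:09 sanka kernel: amdgpu 0000:29:00.0: amdgpu: GPU reset begin!",
]

for ring, signaled, emitted in find_ring_timeouts(sample):
    # The gap between the two counters is how far the ring fell behind.
    print(f"{ring}: signaled={signaled} emitted={emitted} gap={emitted - signaled}")
```

On a full log, the same function could be fed the lines of a file written by `journalctl -k`, which makes it easier to see whether the timeout always hits the same ring.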
Blender log:
...
Read blend: /home/rea/Documents/tmon8/Blender/island-house.blend
[New Thread 0x7fff950ff000 (LWP 3581)]
[New Thread 0x7fff8c3ff000 (LWP 3582)]
[New Thread 0x7fff8bbfe000 (LWP 3583)]
[New Thread 0x7fff78dff000 (LWP 3584)]
[Thread 0x7fff8bbfe000 (LWP 3583) exited]
[Thread 0x7fff8c3ff000 (LWP 3582) exited]
[Thread 0x7fff950ff000 (LWP 3581) exited]
[New Thread 0x7fff950ff000 (LWP 3585)]
[New Thread 0x7fff8c3ff000 (LWP 3586)]
[New Thread 0x7fff8bbfe000 (LWP 3587)]
amdgpu: amdgpu_cs_query_fence_status failed.
X connection to :0 broken (explicit kill or server shutdown).
...
Blender backtrace:
(gdb) backtrace
#0 __pthread_setaffinity_new (th=140736236331008, cpusetsize=128, cpuset=0x7fffffffd4a0) at pthread_setaffinity.c:32
#1 0x00007fffb84b1624 in () at /usr/lib/dri/radeonsi_dri.so
#2 0x0000555558a583d2 in blender::gpu::GLShader::~GLShader() ()
#3 0x0000555558a583e9 in blender::gpu::GLShader::~GLShader() ()
#4 0x0000555557b1ab59 in OCIO_GPUDisplayShader::~OCIO_GPUDisplayShader() ()
#5 0x0000555557b1ab8f in std::__cxx11::list<OCIO_GPUDisplayShader, std::allocator<OCIO_GPUDisplayShader> >::~list() ()
#6 0x00007fffe3a54df5 in __run_exit_handlers (status=1, listp=0x7fffe3bf1760 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:113
#7 0x00007fffe3a54f70 in __GI_exit (status=<optimized out>) at exit.c:143
#8 0x00007ffff7afcc06 in _XDefaultIOError () at /usr/lib/libX11.so.6
#9 0x00007ffff7affe23 in _XIOError () at /usr/lib/libX11.so.6
#10 0x00007ffff7b04f64 in _XReply () at /usr/lib/libX11.so.6
#11 0x00007ffff7b0511d in XTranslateCoordinates () at /usr/lib/libX11.so.6
#12 0x000055555778e507 in GHOST_WindowX11::screenToClient(int, int, int&, int&) const ()
#13 0x0000555555e7fb81 in wm_cursor_position_from_ghost_screen_coords ()
#14 0x0000555555e8b797 in ghost_event_proc.lto_priv ()
#15 0x00005555577efac0 in GHOST_EventManager::dispatchEvent(GHOST_IEvent*) ()
#16 0x00005555577f00f8 in GHOST_EventManager::dispatchEvents() ()
#17 0x000055555779afd5 in GHOST_System::dispatchEvents() ()
#18 0x0000555555e60ccd in WM_main ()
#19 0x0000555555ce8f01 in main ()
rocminfo output:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 3600 6-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 5 3600 6-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 12
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32789316(0x1f45344) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32789316(0x1f45344) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32789316(0x1f45344) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1032
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6600
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 2048(0x800) KB
L3: 32768(0x8000) KB
Chip ID: 29695(0x73ff)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2750
BDFID: 10496
Internal Node ID: 1
Compute Unit: 28
SIMDs per CU: 2
Shader Engines: 4
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1032
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
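To compare this report with others, it can help to reduce the rocminfo dump above to one short record per agent. The following is a rough sketch under the assumption that the output is available as text with one `Key: value` pair per line, as pasted here; `list_agents` is a hypothetical helper, not a ROCm API.

```python
import re

WANTED_KEYS = ("Name", "Marketing Name", "Device Type")


def list_agents(rocminfo_text):
    """Return one dict per HSA agent with its name, marketing name and type."""
    agents = []
    current = None
    for raw in rocminfo_text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"Agent \d+", line):
            # A new "Agent N" header starts a new record.
            current = {}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Keep only the first occurrence: the ISA section reuses "Name".
        if key in WANTED_KEYS and key not in current:
            current[key] = value
    return agents


# Shortened excerpt of the rocminfo output above.
sample = """\
*******
Agent 1
*******
Name: AMD Ryzen 5 3600 6-Core Processor
Marketing Name: AMD Ryzen 5 3600 6-Core Processor
Device Type: CPU
*******
Agent 2
*******
Name: gfx1032
Marketing Name: AMD Radeon RX 6600
Device Type: GPU
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1032
"""

gpus = [a for a in list_agents(sample) if a.get("Device Type") == "GPU"]
for gpu in gpus:
    print(gpu["Name"], "/", gpu["Marketing Name"])
```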