Hello, I don't know if this is related, but I hadn't crashed like this in quite some time, except for #2911.
System:
Host: _HOST_ Kernel: 6.6.6-zen1-1-zen arch: x86_64 bits: 64 compiler: gcc
v: 13.2.1 Desktop: KDE Plasma v: 5.27.10 Distro: Arch Linux
Machine:
Type: Desktop System: Micro-Star product: MS-7B86 v: 3.0
serial: <superuser required>
Mobo: Micro-Star model: B450 GAMING PLUS MAX (MS-7B86) v: 3.0
serial: <superuser required> UEFI: American Megatrends LLC. v: H.J0
date: 08/16/2023
Battery:
ID-1: hidpp_battery_1 charge: 95% condition: N/A volts: 4.1 min: N/A
model: Logitech G Pro Wireless Gaming Mouse status: discharging
CPU:
Info: 8-core model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3+
rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 32 MiB
Speed (MHz): avg: 2468 high: 4851 min/max: 550/4851 boost: enabled cores:
1: 550 2: 3019 3: 3008 4: 3028 5: 550 6: 550 7: 550 8: 3741 9: 3019 10: 550
11: 550 12: 3020 13: 3018 14: 4637 15: 4851 16: 4851 bogomips: 121599
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: XFX RX-79XMERCB9
driver: amdgpu v: kernel arch: RDNA-3 bus-ID: 2b:00.0
Device-2: A4Tech USB Live camera driver: snd-usb-audio,uvcvideo type: USB
bus-ID: 1-2.1:4
Display: wayland server: X.org v: 1.21.1.10 with: Xwayland v: 23.2.3
compositors: 1: kwin_wayland 2: Gamescope driver: X: loaded: amdgpu
dri: radeonsi gpu: amdgpu resolution: 2560x1440
API: EGL v: 1.5 drivers: radeonsi,swrast platforms:
active: wayland,x11,surfaceless,device inactive: gbm
API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd v: N/A glx-v: 1.4
direct-render: yes renderer: AMD Radeon RX 7900 XTX (radeonsi navi31 LLVM
18.0.0 DRM 3.54 6.6.6-zen1-1-zen)
API: Vulkan v: 1.3.274 drivers: radv surfaces: xcb,xlib,wayland devices: 1
Audio:
Device-1: AMD Navi 31 HDMI/DP Audio driver: snd_hda_intel v: kernel
bus-ID: 2b:00.1
Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel bus-ID: 2d:00.4
Device-3: A4Tech USB Live camera driver: snd-usb-audio,uvcvideo type: USB
bus-ID: 1-2.1:4
API: ALSA v: k6.6.6-zen1-1-zen status: kernel-api
Server-1: sndiod v: N/A status: off
Server-2: JACK v: 1.9.22 status: off
Server-3: PipeWire v: 1.0.0 status: active
Info:
Processes: 861 Uptime: 29m Memory: total: 64 GiB note: est.
available: 62.72 GiB used: 9.98 GiB (15.9%) Init: systemd Compilers:
gcc: 13.2.1 clang: 16.0.6 Packages: 2763 Shell: fish v: 3.7.0 inxi: 3.3.31
EDIT:
I was playing Star Citizen under Wine when it crashed.
Plenty of these in the log as well:
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365653824 avail:9920 max:7680 skip:8960
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381575168 avail:15872 max:7680 skip:14912
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365678144 avail:8640 max:7680 skip:7680
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365686784 avail:9216 max:7680 skip:8256
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381598720 avail:20480 max:7680 skip:19520
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365706560 avail:12992 max:7680 skip:12032
janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381623040 avail:13056 max:7680 skip:12096
janv. 05 23:54:08 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381653376 avail:11904 max:7680 skip:10944
janv. 05 23:54:08 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381668160 avail:11456 max:7680 skip:10496
janv. 05 23:54:08 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365763712 avail:8064 max:7680 skip:7104
janv. 05 23:54:08 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381679616 avail:29184 max:7680 skip:28224
janv. 05 23:54:09 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381752960 avail:10112 max:7680 skip:9152
janv. 05 23:54:09 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381817792 avail:16960 max:7680 skip:16000
janv. 05 23:54:11 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381937472 avail:7872 max:7680 skip:6912
janv. 05 23:54:13 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:366260416 avail:8512 max:7680 skip:7552
janv. 05 23:54:19 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:382764224 avail:8000 max:7680 skip:7040
janv. 05 23:54:19 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:382777024 avail:8000 max:7680 skip:7040
What is the client that needs fixing? Is it on our end?
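To get a rough picture of which stream is overrunning, the journal lines above can be tallied per stream pointer. This is only a minimal sketch: the regular expression is written against the exact line format shown in this comment, the field names (`read`, `avail`, `max`, `skip`) are copied from the log text, and `summarize_overruns` is a hypothetical helper, not part of PipeWire.

```python
import re
from collections import defaultdict

# Pattern written against the "overrun recover" lines quoted in this thread.
# The group names mirror the labels printed by pipewire-pulse.
OVERRUN_RE = re.compile(
    r"mod\.protocol-pulse: (?P<stream>0x[0-9a-f]+): \[(?P<client>[^\]]+)\] "
    r"overrun recover read:(?P<read>\d+) avail:(?P<avail>\d+) "
    r"max:(?P<max>\d+) skip:(?P<skip>\d+)"
)

def summarize_overruns(lines):
    """Count overruns and track the largest 'avail' value per stream pointer."""
    summary = defaultdict(lambda: {"client": None, "count": 0, "max_avail": 0})
    for line in lines:
        m = OVERRUN_RE.search(line)
        if m is None:
            continue  # not an overrun line, ignore it
        entry = summary[m["stream"]]
        entry["client"] = m["client"]
        entry["count"] += 1
        entry["max_avail"] = max(entry["max_avail"], int(m["avail"]))
    return dict(summary)

# Three lines copied from the log above.
sample = [
    "janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365653824 avail:9920 max:7680 skip:8960",
    "janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bfa6526e0: [Star Citizen] overrun recover read:381575168 avail:15872 max:7680 skip:14912",
    "janv. 05 23:54:07 PHARCHXTI pipewire-pulse[4696]: mod.protocol-pulse: 0x558bf9f01860: [Star Citizen] overrun recover read:365678144 avail:8640 max:7680 skip:7680",
]
result = summarize_overruns(sample)
for stream, info in result.items():
    print(stream, info)
```

Both stream pointers in the log belong to the same client name, so grouping by pointer rather than by name keeps the two streams apart.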
The game worked fine on the 6900 XT and stopped working after switching to the 7900 XTX, with the same Mesa version at the time but compiled with LLVM >= 15 for RDNA3 compatibility.
When the problem began, the game would crash within the first few seconds, sometimes without even showing the publisher intro. Nowadays, on Mesa-git, it does get in-game and generally crashes a few seconds into the game world after the loading screen, or just before starting to render it.
amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:1 pasid:32826, for process Spider-Man.exe pid 891701 thread Spider-Man.exe pid 892344)
amdgpu 0000:28:00.0: amdgpu: in page starting at address 0x00008001741fa000 from client 10
amdgpu 0000:28:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
amdgpu 0000:28:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
amdgpu 0000:28:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:28:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:28:00.0: amdgpu: PERMISSION_FAULTS: 0x0
amdgpu 0000:28:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:28:00.0: amdgpu: RW: 0x0
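When comparing several of these reports, it can help to pull the process, thread, ring and VMID out of the first line of each fault. The following is a sketch under the assumption that the header line always has the shape shown above; `parse_fault_header` is a hypothetical helper and the regular expression is not taken from any kernel tooling.

```python
import re

# Written against the "[gfxhub] page fault (...)" line format quoted above.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] page fault \(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+), for process (?P<process>.+?) "
    r"pid (?P<pid>\d+) thread (?P<thread>.+?) pid (?P<tid>\d+)\)"
)

def parse_fault_header(line):
    """Return the fields of a page-fault header line as a dict, or None."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    # Convert the numeric fields so they can be compared or sorted.
    for key in ("src_id", "ring", "vmid", "pasid", "pid", "tid"):
        info[key] = int(info[key])
    return info

# Header line copied from the log above.
header = (
    "amdgpu 0000:28:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:1 "
    "pasid:32826, for process Spider-Man.exe pid 891701 thread Spider-Man.exe pid 892344)"
)
fault = parse_fault_header(header)
print(fault)
```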
Run the game. Load a save.
System:
Host: _HOST_ Kernel: 6.1.8-zen1-1-zen arch: x86_64 bits: 64 compiler: gcc
v: 12.2.1 Desktop: KDE Plasma v: 5.26.5 tk: Qt v: 5.15.8 wm: kwin_wayland
dm: 1: LightDM note: stopped 2: SDDM Distro: Arch Linux
CPU:
Info: 8-core model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3
rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 32 MiB
Speed (MHz): avg: 3727 high: 4295 min/max: 2200/5006 boost: enabled cores:
1: 3673 2: 3754 3: 3909 4: 3624 5: 3070 6: 3525 7: 3695 8: 3806 9: 4236
10: 3952 11: 4295 12: 3504 13: 3930 14: 2879 15: 3758 16: 4029
bogomips: 121600
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: XFX
driver: amdgpu v: kernel arch: RDNA-3 pcie: speed: 16 GT/s lanes: 16 ports:
active: DP-1 empty: DP-2,DP-3,HDMI-A-1 bus-ID: 28:00.0 chip-ID: 1002:744c
Device-2: A4Tech USB Live camera type: USB driver: snd-usb-audio,uvcvideo
bus-ID: 1-1.1:18 chip-ID: 09da:2690
Display: wayland server: X.org v: 1.21.1.6 with: Xwayland v: 22.1.7
compositor: kwin_wayland driver: X: loaded: amdgpu dri: radeonsi gpu: amdgpu
display-ID: 0
Monitor-1: DP-1 res: 2560x1440 size: N/A
API: OpenGL v: 4.6 Mesa 23.1.0-devel (git-e37f458207) renderer: AMD
Radeon Graphics (gfx1100 LLVM 15.0.7 DRM 3.49 6.1.8-zen1-1-zen)
direct render: Yes
Did it used to work in a previous Mesa version? Worked on previous stable Mesa with 6900 XT. Same driver on 7900XTX crashes the game. It seems the higher the resolution, the higher the chances of a crash.
Does the issue reproduce with the LLVM backend (RADV_DEBUG=llvm) or on the AMDGPU-PRO drivers? VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json: Seems to have the same issue
Does your environment set any of the variables ACO_DEBUG, RADV_DEBUG, and RADV_PERFTEST?
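For the last question, a quick way to see which of the three variables are set is to read them from the environment the game is launched from. This is a minimal sketch; `radv_env_report` is a hypothetical helper name, and the example values passed to it below are made up for illustration.

```python
import os

# The three variables named in the issue template.
RADV_VARS = ("ACO_DEBUG", "RADV_DEBUG", "RADV_PERFTEST")

def radv_env_report(environ=os.environ):
    """Map each variable to its value, or None when it is not set."""
    return {name: environ.get(name) for name in RADV_VARS}

# Illustrative environment (assumed values, not taken from the report).
report = radv_env_report({"RADV_DEBUG": "llvm", "HOME": "/home/user"})
for name, value in report.items():
    print(f"{name}={value if value is not None else '<unset>'}")
```

Calling `radv_env_report()` with no argument reads the current process environment instead, so it needs to run from the same shell or launcher wrapper as the game to be meaningful.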
I got to test this again and the issue seems gone. No workaround in use here; it is stable with ray tracing and after a suspend-to-RAM cycle. Looking good for me. @hakzsam Thanks for following along and Happy New Year!
Hello again, sorry to report, I had a random crash happen again:
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32806, for process Starfield.exe pid 2151708 thread vkd3d_queue pid 2152242)
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: in page starting at address 0x0000801100890000 from client 10
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301430
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MORE_FAULTS: 0x0
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: WALKER_ERROR: 0x0
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MAPPING_ERROR: 0x0
oct. 30 23:15:06 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: RW: 0x0
oct. 30 23:15:16 _HOST_ kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
Fortunately, the DE was able to survive this one.
Hopefully, this won't persist in main.
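As a cross-check, the raw GCVM_L2_PROTECTION_FAULT_STATUS value from the Starfield log above can be split into the fields the kernel prints after it. The bit layout below is an assumption: it is not taken from the driver source, only chosen so that it reproduces the decoded values printed in this thread's logs (for example `0x00301430` giving PERMISSION_FAULTS 0x3 and client ID 0xa). `decode_fault_status` is a hypothetical helper.

```python
# ASSUMED bit layout as (shift, width); it reproduces the decoded values
# printed in the logs in this thread but is not copied from the kernel.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # printed by the kernel as "Faulty UTCL2 client ID"
    "RW": (18, 1),
}

def decode_fault_status(status):
    """Split a raw status value into named fields using the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }

decoded = decode_fault_status(0x00301430)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

A status of `0x00000000`, as in the earliest Spider-Man report, decodes to all-zero fields under this layout, which matches what the kernel printed there.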
Just a note to underline that these issues seem particularly dominant after coming back from sleep.
Hello, no, I don't think I've had any new issues with this. Thanks.
One-time occurrence. Reproducible? Unknown. Note: this never happened with the 6900XT.
I was interacting with the Network widget in Latte Dock, expanding the details of one NIC.
Occurred again a few days ago:
I was running Stable Diffusion and had queued 5 picture generations with the SDXL model. Apparently one of the things that makes the model special is that it uses BF16?
I'm now on kernel 6.5.6 and haven't reproduced it yet. Will try again. I wasn't aware that support only goes up to 6.5.2. I guess I'm pioneering the issues ahead.
I'll check to see if SD can be used on a Torch version that is both compatible and has a ROCm 5.7 build.
Thanks for the info!
Do I close this or leave it open for RDNA < 3?
@hakzsam Hello, yes. I've logged in quite some hours since the fix and haven't noticed any crash or fault whatsoever.
Occurred again today:
I was running Stable Diffusion and had queued 5 picture generations with the SDXL model. Apparently one of the things that makes the model special is that it uses BF16?
System crash (black screen / no TTY with REISUB) when running the https://github.com/oobabooga/text-generation-webui OpenCL build, and https://github.com/easydiffusion/easydiffusion with Python ROCm.
System:
Host: _HOST_ Kernel: 6.5.4-zen2-1-zen arch: x86_64 bits: 64 compiler: gcc
v: 13.2.1 Desktop: KDE Plasma v: 5.27.8 tk: Qt v: 5.15.10 wm: kwin_wayland
dm: 1: LightDM note: stopped 2: SDDM Distro: Arch Linux
CPU:
Info: 8-core model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3+
rev: 0 cache: L1: 512 KiB L2: 4 MiB L3: 32 MiB
Speed (MHz): avg: 3128 high: 3631 min/max: 2200/4850 boost: enabled cores:
1: 3531 2: 3586 3: 2879 4: 2200 5: 2880 6: 3173 7: 2879 8: 2879 9: 3631
10: 3319 11: 2896 12: 3618 13: 2879 14: 3591 15: 2880 16: 3233
bogomips: 121595
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: AMD Navi 31 [Radeon RX 7900 XT/7900 XTX] vendor: XFX RX-79XMERCB9
driver: amdgpu v: kernel arch: RDNA-3 pcie: speed: 16 GT/s lanes: 16 ports:
active: DP-1 empty: DP-2,DP-3,HDMI-A-1 bus-ID: 2b:00.0 chip-ID: 1002:744c
Display: wayland server: X.org v: 1.21.1.8 with: Xwayland v: 23.2.1
compositor: kwin_wayland driver: X: loaded: amdgpu dri: radeonsi gpu: amdgpu
display-ID: 0
Monitor-1: DP-1 res: 2560x1440 size: N/A
API: EGL v: 1.5 platforms: device: 0 drv: radeonsi device: 1 drv: swrast
surfaceless: drv: radeonsi wayland: drv: radeonsi x11: drv: radeonsi
inactive: gbm
API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd v: N/A glx-v: 1.4
direct-render: yes renderer: AMD Radeon RX 7900 XTX (navi31 LLVM 18.0.0 DRM
3.54 6.5.4-zen2-1-zen) device-ID: 1002:744c display-ID: :1.0
API: Vulkan v: 1.3.264 surfaces: xcb,xlib,wayland device: 0
type: discrete-gpu driver: mesa radv device-ID: 1002:744c
Run the aforementioned workloads and generate things with model layers loaded in VRAM.
@hakzsam Unfortunately not for me.
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32812, for process Spider-Man.exe pid 529516 thread Spider-Man:cs0 pid 529541)
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: in page starting at address 0x0000800157860000 from client 10
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00541051
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MORE_FAULTS: 0x1
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: WALKER_ERROR: 0x0
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: PERMISSION_FAULTS: 0x5
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MAPPING_ERROR: 0x0
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: RW: 0x1
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32812, for process Spider-Man.exe pid 529516 thread Spider-Man:cs0 pid 529541)
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: in page starting at address 0x0000800157860000 from client 10
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00541051
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MORE_FAULTS: 0x1
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: WALKER_ERROR: 0x0
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: PERMISSION_FAULTS: 0x5
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: MAPPING_ERROR: 0x0
sept. 29 18:57:50 _HOST_ kernel: amdgpu 0000:2b:00.0: amdgpu: RW: 0x1
EDIT:
Looks like I was able to get a trace dump this time, contrary to the usual:
@hakzsam That was the first thing I tested after Starfield, but no, it didn't fix the crash for me. The attached RenderDoc capture was generated with latest main.
@Venemo Here's a RenderDoc capture I was able to generate.
You'll need a little under 8 GB of space for it uncompressed.
One thing I notice is that the crash is nearly instant when I go fullscreen vs windowed.
I haven't had a crash on Starfield trying to reproduce the use case. Will advise but I think we can close this in 1 week?
Hello, thank you for getting back, I shall test when I can and report!
Next time I have a save around that part, I'll try. Unfortunately, I wasn't smart enough to isolate that particular save.