Xe (Intel Arc A750) is unstable on aarch64, and capturing a devcoredump can trigger a kernel panic (SError)
After applying the patch from #1824 and compiling libdrm_intel and Mesa 24.1 for aarch64, I managed to get the driver running, and some 3D even works (Doom 3 on OpenGL runs stably).
Then I tried running Quake 2 RTX to see whether it works (in preparation for a Meson patch for Mesa).
This is what I saw in the kernel log:
[ 87.568481] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=3, flags=0x0
[ 87.669509] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=6, flags=0x0
[ 87.669537] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=5, flags=0x0
[ 120.840503] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967271, guc_id=5, flags=0x0
[ 120.840953] xe 0004:04:00.0: [drm] Xe device coredump has been created
[ 120.840957] xe 0004:04:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[ 120.841368] xe 0004:04:00.0: [drm] Engine reset: guc_id=5
[ 151.382080] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=3, flags=0x0
[ 151.479284] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=5, flags=0x0
[ 151.479315] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=6, flags=0x0
[ 180.269205] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=19, flags=0x0
[ 180.367578] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=21, flags=0x0
[ 180.367599] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=22, flags=0x0
[ 184.640066] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=10, flags=0x0
[ 184.652455] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=16, flags=0x0
[ 184.653192] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=10, flags=0x0
[ 184.655294] xe 0004:04:00.0: [drm] Timedout job: seqno=4294967169, guc_id=9, flags=0x0
[ 430.849748] xe 0004:04:00.0: [drm] Xe device coredump has been deleted.
[ 524.802327] xe 0004:04:00.0: [drm] Timedout job: seqno=49712, guc_id=11, flags=0x0
[ 524.802763] xe 0004:04:00.0: [drm] Xe device coredump has been created
[ 524.802766] xe 0004:04:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data
[ 524.803944] xe 0004:04:00.0: [drm] Engine reset: guc_id=11
[ 524.804283] xe 0004:04:00.0: [drm] Timedout job: seqno=49719, guc_id=11, flags=0x1
[ 524.804304] xe 0004:04:00.0: [drm] Timedout job: seqno=49720, guc_id=11, flags=0x1
Then, on the main console, I saw a kernel panic:
[ 524.968400] SError Interrupt on CPU38, code 0x00000000be000411 -- SError
[ 524.968409] CPU: 38 PID: 5610 Comm: kworker/u295:4 Tainted: G U 6.9.0-rc6+ #2
[ 524.968412] Hardware name: ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.05 04/12/2024
[ 524.968413] Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe]
[ 524.968461] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 524.968463] pc : __memcpy_fromio+0x54/0x98
[ 524.968469] lr : xe_vm_snapshot_capture_delayed+0x25c/0x310 [xe]
[ 524.968508] sp : ffff80008d2ebd00
[ 524.968509] x29: ffff80008d2ebd20 x28: ffff0800d2ab61b0 x27: ffff07ffebe28000
[ 524.968512] x26: 0000000000120000 x25: ffff07ffd2400000 x24: ffff0800c3b32c00
[ 524.968514] x23: 0000000000000000 x22: 0000000000100cc0 x21: 0000000000000009
[ 524.968516] x20: ffff0800d2ab6000 x19: 0000000000000009 x18: ffffffffffffffff
[ 524.968518] x17: 4d45545359534255 x16: ffffd09354044d18 x15: 2f706d756465726f
[ 524.968520] x14: 0000000000000000 x13: 0000000000000030 x12: ffff800080000000
[ 524.968523] x11: 0000000000040dc0 x10: dead000000000040 x9 : ffffd0931cc4f05c
[ 524.968525] x8 : 000028000011f000 x7 : 000000000000003f x6 : 0000000000120000
[ 524.968527] x5 : 0000000000000000 x4 : ffff80008e000060 x3 : ffff07ffd2520000
[ 524.968529] x2 : 0000000000120000 x1 : ffff80008e000000 x0 : ffff07ffd2400060
[ 524.968532] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 524.968534] CPU: 38 PID: 5610 Comm: kworker/u295:4 Tainted: G U 6.9.0-rc6+ #2
[ 524.968536] Hardware name: ALTRAD8UD-1L2T/ALTRAD8UD-1L2T, BIOS 2.05 04/12/2024
[ 524.968537] Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe]
[ 524.968576] Call trace:
[ 524.968577] dump_backtrace+0x9c/0x128
[ 524.968580] show_stack+0x20/0x38
[ 524.968582] dump_stack_lvl+0x34/0x90
[ 524.968587] dump_stack+0x18/0x28
[ 524.968589] panic+0x3b4/0x3f0
[ 524.968593] nmi_panic+0x50/0xa8
[ 524.968596] arm64_serror_panic+0x78/0x90
[ 524.968598] do_serror+0x30/0x78
[ 524.968600] el1h_64_error_handler+0x30/0x48
[ 524.968602] el1h_64_error+0x64/0x68
[ 524.968604] __memcpy_fromio+0x54/0x98
[ 524.968606] xe_devcoredump_deferred_snap_work+0x5c/0x90 [xe]
[ 524.968644] process_one_work+0x18c/0x400
[ 524.968648] worker_thread+0x204/0x420
[ 524.968650] kthread+0xe8/0xf8
[ 524.968653] ret_from_fork+0x10/0x20
[ 524.968656] SMP: stopping secondary CPUs
[ 524.968673] Kernel Offset: 0x5092d4020000 from 0xffff800080000000
[ 524.968675] PHYS_OFFSET: 0xfff1000080000000
[ 524.968676] CPU features: 0x0,0000010b,80140528,4241720b
[ 524.968677] Memory Limit: none
[ 525.304772] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
What is interesting about SError is how the Arm documentation (https://developer.arm.com/documentation/102412/0103/Exception-types/Asynchronous-exceptions) describes it:
A typical example of SError is what was previously referred to as an external asynchronous abort. Examples of SError interrupts include:
* Memory access which has passed all the MMU checks but then encounters an error on the memory bus
* Parity or Error Correction Code (ECC) checking on some RAMs, for example those in built-in caches
* An abort triggered by write-back of dirty data from a cache line to external memory
I'm not sure which of these applies here, but my gut feeling is that it is the third one.
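To help narrow this down, the syndrome value from the panic line (`code 0x00000000be000411`) can be split into its fields. Below is a minimal sketch in Python, assuming the value is a plain ESR_ELx dump and using the SError ISS layout as I understand it from the Arm architecture documentation (EC in bits [31:26], IL in bit 25, IDS in bit 24, AET in bits [12:10], EA in bit 9, DFSC in bits [5:0]). The function and table names are mine, not something from the kernel, and the AET field is only meaningful when the CPU implements the RAS extension and DFSC reports an asynchronous SError.

```python
# Hypothetical helper: decode an AArch64 ESR_ELx value reported for an SError.
# The field layout is an assumption taken from the Arm architecture docs.

EC_NAMES = {0x2F: "SError interrupt"}
DFSC_NAMES = {
    0x00: "Uncategorized error",
    0x11: "Asynchronous SError interrupt",
}
# Only valid with the RAS extension and DFSC == 0x11.
AET_NAMES = {
    0b000: "Uncontainable (UC)",
    0b001: "Unrecoverable state (UEU)",
    0b010: "Restartable state (UEO)",
    0b011: "Recoverable state (UER)",
    0b110: "Corrected (CE)",
}


def decode_serror_esr(esr: int) -> dict:
    """Split an ESR_ELx value into the fields used by SError syndromes."""
    iss = esr & 0x1FFFFFF  # ISS, bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,   # instruction length bit
        "ids": (iss >> 24) & 0x1,  # 1 = implementation-defined syndrome
        "aet": (iss >> 10) & 0x7,  # asynchronous error type
        "ea": (iss >> 9) & 0x1,    # external abort type
        "dfsc": iss & 0x3F,        # fault status code
    }


if __name__ == "__main__":
    fields = decode_serror_esr(0x00000000BE000411)
    print("EC   =", hex(fields["ec"]), EC_NAMES.get(fields["ec"], "unknown"))
    print("IDS  =", fields["ids"])
    print("DFSC =", hex(fields["dfsc"]), DFSC_NAMES.get(fields["dfsc"], "unknown"))
    print("AET  =", bin(fields["aet"]), AET_NAMES.get(fields["aet"], "reserved"))
    print("EA   =", fields["ea"])
```

If the layout assumed above is right, this value decodes to EC 0x2F, IDS 0, DFSC 0x11 and AET 0b001. That only says the error was reported asynchronously with an architecturally defined syndrome; it does not by itself say which of the three causes listed above was responsible.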