Out of bounds channel id in nv84_fence_context_new with GSP
Faith reported this on IRC/Discord. I'm copying it here for better tracking.
gfxstrand: Yup. The fence error appears to be a failed read, so either an OOB channel id or the BO just vanishes.
Looks like maybe we're not handling channel creation failure properly?
Still, the channel create failing is a problem.
airlied: okay it's weird we haven't seen that up until now, can you send me dmesg for it?
Stack trace:
[96538.341944] Call Trace:
[96538.341946] <TASK>
[96538.341950] ? __die+0x23/0x70
[96538.341959] ? page_fault_oops+0x171/0x4e0
[96538.341967] ? exc_page_fault+0x175/0x180
[96538.341975] ? asm_exc_page_fault+0x26/0x30
[96538.341983] ? __pfx_nouveau_drm_debugf+0x10/0x10 [nouveau]
[96538.342289] ? ioread32+0x32/0x60
[96538.342295] nv84_fence_context_new+0xb9/0x120 [nouveau]
[96538.342551] nvc0_fence_context_new+0x12/0x40 [nouveau]
[96538.342804] nouveau_channel_new+0x2f1/0x520 [nouveau]
[96538.343058] nouveau_abi16_ioctl_channel_alloc+0x165/0x450 [nouveau]
[96538.343320] ? __pfx_nouveau_abi16_ioctl_channel_alloc+0x10/0x10 [nouveau]
[96538.343574] drm_ioctl_kernel+0xd3/0x180
[96538.343580] drm_ioctl+0x26d/0x4b0
[96538.343585] ? __pfx_nouveau_abi16_ioctl_channel_alloc+0x10/0x10 [nouveau]
[96538.343844] nouveau_drm_ioctl+0x5a/0xb0 [nouveau]
[96538.344120] __x64_sys_ioctl+0x94/0xd0
[96538.344126] do_syscall_64+0x5d/0x90
[96538.344132] ? nouveau_drm_ioctl+0x7d/0xb0 [nouveau]
[96538.344409] ? __x64_sys_ioctl+0xaf/0xd0
[96538.344413] ? syscall_exit_to_user_mode+0x2b/0x40
[96538.344418] ? do_syscall_64+0x6c/0x90
[96538.344423] ? __x64_sys_ioctl+0xaf/0xd0
[96538.344426] ? syscall_exit_to_user_mode+0x2b/0x40
[96538.344431] ? do_syscall_64+0x6c/0x90
[96538.344435] ? do_syscall_64+0x6c/0x90
[96538.344439] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[96538.344445] RIP: 0033:0x7fcae810b13d
Full dmesg: fence-dmesg.txt
Later:
commit e123f3e78bfc3f05176b97717f60d953f7ee4c7c (HEAD)
Author: Faith Ekstrand <faith.ekstrand@collabora.com>
Date: Tue Nov 7 14:44:24 2023 -0600
nouveau: BUG_ON some invariants in fence_context_new
diff --git a/drivers/gpu/drm/nouveau/nv84_fence.c b/drivers/gpu/drm/nouveau/nv84_fence.c
index 812b8c62eeba..173edc19cb4d 100644
--- a/drivers/gpu/drm/nouveau/nv84_fence.c
+++ b/drivers/gpu/drm/nouveau/nv84_fence.c
@@ -131,6 +131,10 @@ nv84_fence_context_new(struct nouveau_channel *chan)
 	struct nv84_fence_chan *fctx;
 	int ret;
 
+	BUG_ON(priv == NULL);
+	BUG_ON(priv->bo == NULL);
+	BUG_ON(nv84_fence_chid(chan) * 16 >= priv->bo->bo.base.size);
+
 	fctx = chan->fence = kzalloc(sizeof(*fctx), GFP_KERNEL);
 	if (!fctx)
 		return -ENOMEM;
gfxstrand: airlied The third BUG_ON() triggers. There's your bug.
Either we're not properly recycling channel IDs or our fence BO needs to be bigger.
I don't know enough about nouveau and GSP to have opinions on which.
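For anyone following along, the third BUG_ON can be modelled in user space. This is only a sketch, assuming each channel owns one 16-byte slot in the fence BO (which is what the `* 16` in the check suggests); `fence_slot_in_bounds` and `FENCE_SLOT_BYTES` are hypothetical names for illustration, not nouveau symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one 16-byte fence slot per channel, matching the "* 16"
 * in the BUG_ON added to nv84_fence_context_new(). */
#define FENCE_SLOT_BYTES 16u

/* Hypothetical helper: returns true if the slot for `chid` starts inside
 * a fence buffer of `bo_size` bytes. This is the inverse of the BUG_ON
 * condition (chid * 16 >= size). */
static bool fence_slot_in_bounds(uint32_t chid, size_t bo_size)
{
    /* Widen before multiplying so a very large chid cannot wrap around. */
    uint64_t offset = (uint64_t)chid * FENCE_SLOT_BYTES;

    return offset < bo_size;
}
```

With a 4096-byte buffer, for example, channel ids 0 through 255 pass and 256 is the first id that would trip the check.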
airlied: can you check if throwing a * 8 or something in the nouveau_bo_new in nv84_fence_create helps? also knowing what nv84_fence_chid(chan) is when it blows up?
gfxstrand: I can after a bit
airlied: okay I've no idea how that whole chid stuff works, need more learning
airlied: also a printk with drm->chan_total in it might be good info
probably some disagreement with gsp and pre-gsp on some of those
Pass: 395042, Fail: 124, Crash: 93, Skip: 1619939, Flake: 45, Duration: 1:05:51, Remaining: 0 on ampere gsp
gfxstrand: Time to try with BO size x 8
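A rough sketch of what the x 8 experiment changes, assuming the fence BO is sized at 16 bytes per channel times the channel count. The helper names are hypothetical, and the 128 in the note below is an illustrative number, not a value read from the driver.

```c
#include <stddef.h>

/* Assumption: one 16-byte fence slot per channel. */
#define FENCE_SLOT_BYTES 16u

/* Hypothetical: buffer size for `chan_total` channels, multiplied by an
 * experimental `scale` factor (1 = current sizing, 8 = the experiment). */
static size_t fence_bo_size(size_t chan_total, size_t scale)
{
    return chan_total * FENCE_SLOT_BYTES * scale;
}

/* Hypothetical: number of channel ids (0 .. capacity-1) whose slots fit
 * in a buffer of `bo_size` bytes. */
static size_t fence_bo_capacity(size_t bo_size)
{
    return bo_size / FENCE_SLOT_BYTES;
}
```

If chan_total were 128, the unscaled buffer would cover ids 0..127 and the x 8 buffer ids 0..1023. If the crash goes away with x 8, the failing chid is somewhere below 8 * chan_total; that would not say whether the id itself is legitimate, which is why printing nv84_fence_chid(chan) and drm->chan_total is still useful.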