Graphics lockup at boot even with #213 fix

I am opening this new issue at the request of @karolherbst as per #213 (comment 1984045).

Summary:

The patch posted in #213 (comment 1959993) (now included in recent kernel sources and packages) does not solve the problem on some subset of older Nvidia GPUs, including my "NVIDIA G92GLM [Quadro FX 3800M]" and "NVIDIA GK107GL [Quadro 410]", and others as reported in #238 (closed) and at https://forums.opensuse.org/t/older-laptop-tumbleweed-nvidia-and-nouveau-drivers/166413/17 and other forums/lists.
Adding the patch I previously linked in the other issue (original source: https://lore.kernel.org/nouveau/CACrD=+-DUomkWxe0X5M5vMFS_JijjPGNqVXuq+qimie99vmwzw@mail.gmail.com/) does work.

From that "original source" patch I applied only the single changed line of code that actually computed something, leaving out all the extra added tests and nvkm_warn(), nvkm_trace(), and nvkm_debug() messaging and function-aborting returns. So the entire change was:

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
index dad942be6679..a286ca8df04c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c
@@ -86,7 +86,7 @@ nvkm_uconn_uevent(struct nvkm_object *object, void *argv, u32 argc, struct nvkm_
                        if (args->v0.types & NVIF_CONN_EVENT_V0_UNPLUG) bits |= NVKM_I2C_UNPLUG;
                        if (args->v0.types & NVIF_CONN_EVENT_V0_IRQ   ) bits |= NVKM_I2C_IRQ;
 
-                       return nvkm_uevent_add(uevent, &device->i2c->event, outp->dp.aux->id, bits,
+                       return nvkm_uevent_add(uevent, &device->i2c->event, (outp->dp.aux->id)&0xff, bits,
                                               nvkm_uconn_uevent_aux);
                }
        }

In #213 (comment 1984045), @karolherbst wrote:

I'll think about this a bit more, it just doesn't feel right to mask a value randomly and I'd rather have a cleaner solution here.

I'm all for cleaner solutions, and perhaps I'm misinterpreting your comment, but I don't consider the 0xff mask to be a random value. In the explanation accompanying the patch, the original poster laid out what to me seemed a very plausible rationale for the mask, regarding now-removed code that implicitly did the same thing via a uint8_t variable that is now wider and potentially has additional data in its upper bits The crucial point being that the value is later used as an array index which can be out of bounds, resulting in a kernel fault similar to what's been observed in dmesg output after the boot time graphics lockup.

I could not conclusively verify or dispute this hypothesis by reviewing the nouveau source code. The call to nvkm_uevent_add() in the patch in turn calls nvkm_event_ntfy_add() which stores the questionable value in an nvkm_event_ntfy struct's int id field, i.e. with no implicit truncation. That id can then be later used as an index into an nvkm_event_ntfy's struct nvkm_event *event's int *refs array, which has been allocated via kzalloc(array3_size(index_nr, types_nr, sizeof(*event->refs)), GFP_KERNEL) in __nvkm_event_init().

I'm sure that you, as the nouveau maintainer, are intimately familiar with all the above. I am not. I tried to find how large these refs arrays are -- to confirm or disprove that limiting indices into them to <= 255 was a necessary solution -- but only saw that they were typically much smaller:

$ find . -iname \*.[ch] | xargs fgrep nvkm_event_init        
./include/nvkm/core/event.h:int  __nvkm_event_init(const struct nvkm_event_func *func, struct nvkm_subdev *, int types_nr,
./include/nvkm/core/event.h:nvkm_event_init(const struct nvkm_event_func *func, struct nvkm_subdev *subdev,
./include/nvkm/core/event.h:	return __nvkm_event_init(func, subdev, types_nr, index_nr, event);
./nvkm/engine/sw/chan.c:	return nvkm_event_init(&nvkm_sw_chan_event, &sw->engine.subdev, 1, 1, &chan->event);
./nvkm/engine/disp/base.c:	return nvkm_event_init(&nvkm_disp_vblank_func, subdev, 1, i, &disp->vblank);
./nvkm/engine/disp/base.c:	return nvkm_event_init(func->uevent, &disp->engine.subdev, 1, ARRAY_SIZE(disp->chan),
./nvkm/engine/fifo/base.c:		ret = nvkm_event_init(func->nonstall, &fifo->engine.subdev, 1, 1,
./nvkm/engine/fifo/chid.c:	return nvkm_event_init(func, subdev, 1, nr, &chid->event);
./nvkm/subdev/gpio/base.c:	return nvkm_event_init(&nvkm_gpio_intr_func, &gpio->subdev, 2, func->lines, &gpio->event);
./nvkm/subdev/fault/base.c:	ret = nvkm_event_init(&nvkm_fault_ntfy, subdev, 1, fault->buffer_nr, &fault->event);
./nvkm/subdev/i2c/base.c:	return nvkm_event_init(&nvkm_i2c_intr_func, &i2c->subdev, 4, i, &i2c->event);
./nvkm/core/event.c:__nvkm_event_init(const struct nvkm_event_func *func, struct nvkm_subdev *subdev,

This is about as far as I can go in trying to contribute to (or obfuscate?) solving this issue. I have no experience in kernel debugging, and to climb that learning curve would take me weeks if not longer. I've already spent over a month away from my other open source work since first running into this problem that has prevented me from installing much-needed updates to my Linux systems. This is a "show-stopping" bug for however many users have these older Nvidia GPUs and have no other option than nouveau (because Nvidia has dropped support for their closed-source drivers for them). I'm hitting it on openSUSE Tumbleweed which is "bleeding edge", but as other distros update to more current kernels they'll have the same problem if it's not fixed.

So ... as much as the patch I'm proposing is admittedly a "hack", I'd like to advocate for its inclusion, even as a temporary stop-gap, unless and until a cleaner one can be developed. I can always patch and install my own kernels (as time-consuming as that is) but that's not a workable solution for a large percentage the potentially affected users.

Thanks for your time and interest in this issue.

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information

Admin message

Graphics lockup at boot even with #213 fix