Driver fails to initialise with error ENOSPC
The i915 driver fails to initialise on my system, with error ENOSPC (-28), and suggests that I file a bug here, so here it is.
I have attached the full dmesg output of a sample boot sequence of a custom-built kernel. Note that the kernel is built with i915 support as a module, and I inserted the module manually after the system was up and running, so the relevant part of the log is at the end. Also note that the system has a second, dedicated graphics card (otherwise there is no video output), so much of the drm debugging info relates to that second card. The integrated card is PCI device 0000:00:02.0.
The relevant part of the kernel log starts at line 824 (t=460s); the last four lines are the important ones (-28 is ENOSPC):
i915 0000:00:02.0: [drm:i915_init_ggtt [i915]] Failed to reserve top of GGTT for GuC
i915 0000:00:02.0: Device initialization failed (-28)
i915 0000:00:02.0: Please file a bug on drm/i915; see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
i915: probe of 0000:00:02.0 failed with error -28
The motherboard identifies itself as "Gigabyte Technology Co., Ltd. H610M S2H DDR4/H610M S2H DDR4, BIOS FL 11/15/2022" and uname on the built kernel says "Linux 6.5.0-rc4+ #47 SMP Mon Aug 7 08:18:34 CEST 2023 i686 GNU/Linux".
I have tracked the issue down to ggtt_reserve_guc_top, in drivers/gpu/drm/i915/gt/intel_ggtt.c. Specifically, the line "GEM_BUG_ON(ggtt->vm.total <= GUC_GGTT_TOP);" triggers a BUG() on my system when GEM_BUG_ON is enabled, because ggtt->vm.total is 0x80000000 (2G) but GUC_GGTT_TOP is 0xFEE00000. In fact, I wonder whether ggtt_reserve_guc_top, as it stands, can reasonably handle any ggtt->vm.total other than 4G.
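To make the arithmetic concrete, here is a small standalone sketch (userspace C, not kernel code). It assumes that the reservation size is computed as ggtt->vm.total minus GUC_GGTT_TOP, which is consistent with the 18M figure for a 4G GGTT; the helper names guc_top_size and range_fits are made up for illustration and do not exist in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value quoted in this report: 18M below the 4G mark. */
#define GUC_GGTT_TOP 0xFEE00000ull

/*
 * Hypothetical helper: the size of the region from GUC_GGTT_TOP up to
 * the end of the GGTT, assuming it is computed as an unsigned 64-bit
 * subtraction. If total is below GUC_GGTT_TOP the result wraps around.
 */
uint64_t guc_top_size(uint64_t total)
{
	return total - GUC_GGTT_TOP;
}

/*
 * Hypothetical helper: true if [offset, offset + size) lies entirely
 * inside [0, total). Written so that it cannot overflow.
 */
bool range_fits(uint64_t offset, uint64_t size, uint64_t total)
{
	return offset <= total && size <= total - offset;
}
```

With a 4G GGTT (total = 0x100000000), guc_top_size gives 0x1200000, which is exactly 18M, and the range fits. With my 2G GGTT (total = 0x80000000), the subtraction wraps to 0xFFFFFFFF81200000 and the range starting at GUC_GGTT_TOP cannot fit, which would be consistent with the reservation failing with ENOSPC.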
I am also attaching a very simple patch that I wrote to fix the issue. The patch reserves the last 18M of the address range, right before ggtt->vm.total, instead of blindly reserving at offset GUC_GGTT_TOP. I have only smoke-tested it: with the patch applied, the module loads and successfully initialises the device, and I have been using it for a few days with the card working and no ill effects so far, but I have not done any further testing. I have no previous experience in this part of the kernel codebase, so I do not know what other consequences the patch may have.
One issue that I see with the patch is that the original code reserves the whole range starting at GUC_GGTT_TOP because it is not accessible by GuC, and then opportunistically uses it for firmware images, because they fit there. If we reserve below the GUC_GGTT_TOP threshold, however, we are actually reducing the usable range, so perhaps we could do better and shrink the reservation (the patch currently reserves 18M as before, while the combined size of the firmware is <1M). As such, take the patch as a proof of concept; I can prepare an improved version if you have any suggestions.
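For reference, the offset calculation that the patch relies on can be sketched as follows. This is a standalone illustration of the idea, not the attached patch itself; the names GUC_TOP_RESERVE_SIZE and guc_reserve_offset are made up here, and the guard for a GGTT smaller than the reservation is an extra assumption of mine.

```c
#include <stdint.h>

/* Hypothetical name: the 18M that the original code reserves on a 4G GGTT. */
#define GUC_TOP_RESERVE_SIZE (18ull << 20)

/*
 * Hypothetical helper: start offset of a reservation covering the last
 * 18M of the GGTT, wherever the GGTT happens to end.
 */
uint64_t guc_reserve_offset(uint64_t total)
{
	/* Assumed guard: a GGTT smaller than 18M would be reserved from 0. */
	if (total < GUC_TOP_RESERVE_SIZE)
		return 0;

	return total - GUC_TOP_RESERVE_SIZE;
}
```

For a 4G GGTT this yields 0xFEE00000, the same offset as GUC_GGTT_TOP, so behaviour there should be unchanged. For my 2G GGTT it yields 0x7EE00000, which lies below GUC_GGTT_TOP and is therefore taken out of the otherwise usable range, as discussed above.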