NULL pointer dereference in linux-next-20230728 caused by amdgpu_atpx_handler
On my dual-GPU MSI Alpha15 laptop, linux-next-20230728 fails to boot (the OS is Debian stable/bookworm). Hardware:
$ lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c3)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c3)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
04:00.0 Network controller: MEDIATEK Corp. MT7921K (RZ608) Wi-Fi 6E 80MHz
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
06:00.0 Non-Volatile memory controller: Micron/Crucial Technology P1 NVMe PCIe SSD (rev 03)
07:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 500c (rev 01)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c5)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
I bisected this to commit
commit b0bd0a92b8158ea9c809d885e0f0c21518bdbd14
Author: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Date: Tue Jul 25 18:34:49 2023 +0530
drm/amdgpu: Prefer dev_* variant over printk in amdgpu_atpx_handler.c
Changed from printk to dev_* variants so that
we get better debug info when there are multiple GPUs
in the system.
Fixes other style issue:
ERROR: open brace '{' following function definitions go on the next line
WARNING: printk() should include KERN_<LEVEL> facility level
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
I then found that the following debug patch fixes the issue for me:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
index 6f241c574665..c5e223e4f8ce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c
@@ -236,7 +236,11 @@ static int amdgpu_atpx_validate(struct amdgpu_atpx *atpx)
atpx->functions.power_cntl = true;
atpx->is_hybrid = false;
} else {
- dev_info(dev, "ATPX Hybrid Graphics\n");
+ //dev_info(dev, "ATPX Hybrid Graphics\n");
+ printk(KERN_INFO "%s %d: dev = %px _dev_info = %px dev->class = %px dev->bus = %px\n",
+ __func__, __LINE__, dev, _dev_info, dev->class, dev->bus);
+ printk(KERN_INFO "%s %d: ATPX Hybrid Graphics\n", __func__, __LINE__);
+
/*
* Disable legacy PM methods only when pcie port PM is usable,
* otherwise the device might fail to power off or power on.
@@ -289,8 +293,12 @@ static int amdgpu_atpx_verify_interface(struct amdgpu_atpx *atpx)
memcpy(&output, info->buffer.pointer, size);
/* TODO: check version? */
- dev_info(dev, "ATPX version %u, functions 0x%08x\n",
- output.version, output.function_bits);
+ //dev_info(dev, "ATPX version %u, functions 0x%08x\n",
+ // output.version, output.function_bits);
+ printk(KERN_INFO "%s %d: dev = %px _dev_info = %px dev->class = %px dev->bus = %px\n",
+ __func__, __LINE__, dev, _dev_info, dev->class, dev->bus);
+ printk(KERN_INFO "%s %d: ATPX version %u, functions 0x%08x\n",
+ __func__, __LINE__, output.version, output.function_bits);
amdgpu_atpx_parse_functions(&atpx->functions, output.function_bits);
The debug output also points to the cause of the problem. dmesg now shows:
[ 0.580341] amdgpu: vga_switcheroo: detected switching method \_SB_.PCI0.GP17.VGA_.ATPX handle
[ 0.580464] amdgpu_atpx_verify_interface 298: dev = ffff8e3f00dc35b8 _dev_info = ffffffffbeb755f0 dev->class = 0000000000000009 dev->bus = 0000000000000000
[ 0.580466] amdgpu_atpx_verify_interface 300: ATPX version 1, functions 0x00000001
[ 0.580482] amdgpu_atpx_validate 240: dev = ffff8e3f00dc35b8 _dev_info = ffffffffbeb755f0 dev->class = 0000000000000009 dev->bus = 0000000000000000
[ 0.580483] amdgpu_atpx_validate 242: ATPX Hybrid Graphics
The problem here is dev->class = 0x9. This is not a valid pointer, but because it is non-NULL it passes the NULL check in set_dev_info() (in drivers/base/core.c), and the dev->class->name dereference that follows then faults:
static void
set_dev_info(const struct device *dev, struct dev_printk_info *dev_info)
{
	const char *subsys;
	memset(dev_info, 0, sizeof(*dev_info));
	if (dev->class)
		subsys = dev->class->name;
	else if (dev->bus)
		subsys = dev->bus->name;
	else
		return;
set_dev_info() is called indirectly by _dev_info() (_dev_info => __dev_printk => dev_printk_emit => dev_vprintk_emit => set_dev_info), so either the code should go back to the printk functions here, or someone should investigate why dev->class is not properly NULL.
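To illustrate why the check is bypassed, here is a minimal userspace sketch. It is not kernel code and not a proposed fix: the fake_* structs, FAKE_PAGE_SIZE, and all function names are hypothetical stand-ins I made up, modelling only the two fields printed in the dmesg output above. It shows that a value like 0x9 passes a plain NULL test, while a stricter (assumed) "is this address in the first page?" test rejects it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct class / struct bus_type / struct device.
 * Only the fields relevant to the check are modelled. */
struct fake_class { const char *name; };
struct fake_bus { const char *name; };
struct fake_device {
	const struct fake_class *class;
	const struct fake_bus *bus;
};

/* Assumption: addresses inside the first page are never valid objects. */
#define FAKE_PAGE_SIZE 4096u

/* Same shape as the "if (dev->class)" test: true for any non-NULL value. */
static int plain_null_check_passes(const struct fake_device *dev)
{
	return dev->class != NULL;
}

/* Stricter test: also rejects tiny non-NULL values such as 0x9. */
static int looks_like_real_pointer(const void *p)
{
	return (uintptr_t)p >= FAKE_PAGE_SIZE;
}

/* Picks a subsystem name without dereferencing an implausible pointer.
 * Returns NULL when neither class nor bus looks usable. */
static const char *pick_subsys(const struct fake_device *dev)
{
	if (looks_like_real_pointer(dev->class))
		return dev->class->name;
	if (looks_like_real_pointer(dev->bus))
		return dev->bus->name;
	return NULL;
}

/* Builds a device with the values seen in the dmesg output:
 * class = 0x9, bus = NULL. The pointer is never dereferenced here. */
static struct fake_device make_bogus_device(void)
{
	struct fake_device dev;

	dev.class = (const struct fake_class *)(uintptr_t)0x9;
	dev.bus = NULL;
	return dev;
}
```

For the device returned by make_bogus_device(), plain_null_check_passes() returns 1 (so the original-style code would go on to read dev->class->name), while pick_subsys() returns NULL without touching the bad address. Such a guard would only hide the symptom, though; the real question remains why dev->class holds 0x9 in the first place.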
Edit: As this seems to be related to ACPI, could this be a BIOS bug?