Encounter "kernel BUG at arch/x86/mm/init_64.c:159!" when booting up
Brief summary of the problem:
During a reboot stress test, the system intermittently hits the kernel BUG shown below and hangs while booting up.
The failure rate is low: sometimes it takes more than 500 reboots to reproduce the issue, and occasionally it happens within just 10 reboots.
We can reproduce the same issue on AMD W6600/W6800/W7500/W7600/W7900 cards with multiple kernel versions, including the latest mainline v6.8-rc6 (805d849d7c3c) and the latest drm-tip (8f85c978aed1) kernels.
[ 10.945108] u-Precision-5860-Tower kernel: kernel BUG at arch/x86/mm/init_64.c:159!
[ 10.945121] u-Precision-5860-Tower kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 10.945138] u-Precision-5860-Tower kernel: CPU: 0 PID: 163 Comm: kworker/0:2 Not tainted 6.8.0-rc7-8f85c978aed1+ #1
[ 10.945150] u-Precision-5860-Tower kernel: Hardware name: Dell Inc. Precision 5860 Tower/, BIOS 0.31.10 12/22/2023
[ 10.945157] u-Precision-5860-Tower kernel: Workqueue: events work_for_cpu_fn
[ 10.945176] u-Precision-5860-Tower kernel: RIP: 0010:sync_global_pgds+0x1ea/0x400
[ 10.945189] u-Precision-5860-Tower kernel: Code: 84 e3 96 01 49 89 c0 48 89 f8 0f 1f 00 48 23 05 94 0e 98 01 48 25 00 f0 ff ff 48 03 05 67 e3 96 01 4c 39 c0 0f 84 0f ff ff ff <0f> 0b 49 8b 75 00 4c 89 ff e8 58 6d ff ff 90 e9 1a ff ff ff 48 8b
[ 10.945201] u-Precision-5860-Tower kernel: RSP: 0018:ff6f3b5c407f3a80 EFLAGS: 00010287
[ 10.945210] u-Precision-5860-Tower kernel: RAX: ff4f25240df98000 RBX: fffff53c44406880 RCX: 0000000000000000
[ 10.945217] u-Precision-5860-Tower kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000010df98067
[ 10.945223] u-Precision-5860-Tower kernel: RBP: ff6f3b5c407f3ac0 R08: ff4f25241529a000 R09: 0000000000000000
[ 10.945229] u-Precision-5860-Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff4f25240def912c
[ 10.945234] u-Precision-5860-Tower kernel: R13: ffffffffaa43a000 R14: 0000353c38000000 R15: ff4f2524101a2000
[ 10.945241] u-Precision-5860-Tower kernel: FS: 0000000000000000(0000) GS:ff4f252b6b800000(0000) knlGS:0000000000000000
[ 10.945249] u-Precision-5860-Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 10.945255] u-Precision-5860-Tower kernel: CR2: 0000654ed985c3b8 CR3: 00000001da43a003 CR4: 0000000000771ef0
[ 10.945262] u-Precision-5860-Tower kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 10.945267] u-Precision-5860-Tower kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 10.945273] u-Precision-5860-Tower kernel: PKRU: 55555554
[ 10.945277] u-Precision-5860-Tower kernel: Call Trace:
[ 10.945283] u-Precision-5860-Tower kernel: <TASK>
[ 10.945289] u-Precision-5860-Tower kernel: ? show_regs+0x72/0x90
[ 10.945303] u-Precision-5860-Tower kernel: ? die+0x38/0xb0
[ 10.945313] u-Precision-5860-Tower kernel: ? do_trap+0xe3/0x100
[ 10.945326] u-Precision-5860-Tower kernel: ? do_error_trap+0x75/0xb0
[ 10.945335] u-Precision-5860-Tower kernel: ? sync_global_pgds+0x1ea/0x400
[ 10.945344] u-Precision-5860-Tower kernel: ? exc_invalid_op+0x53/0x80
[ 10.945354] u-Precision-5860-Tower kernel: ? sync_global_pgds+0x1ea/0x400
[ 10.945362] u-Precision-5860-Tower kernel: ? asm_exc_invalid_op+0x1b/0x20
[ 10.945375] u-Precision-5860-Tower kernel: ? sync_global_pgds+0x1ea/0x400
[ 10.945382] u-Precision-5860-Tower kernel: ? vmemmap_populate_hugepages+0x197/0x1f0
[ 10.945397] u-Precision-5860-Tower kernel: vmemmap_populate+0x73/0xd0
[ 10.945406] u-Precision-5860-Tower kernel: __populate_section_memmap+0x1fc/0x440
[ 10.945419] u-Precision-5860-Tower kernel: sparse_add_section+0x14d/0x310
[ 10.945432] u-Precision-5860-Tower kernel: __add_pages+0xb3/0x170
[ 10.945444] u-Precision-5860-Tower kernel: add_pages+0x17/0x70
[ 10.945452] u-Precision-5860-Tower kernel: memremap_pages+0x466/0x6c0
[ 10.945465] u-Precision-5860-Tower kernel: devm_memremap_pages+0x23/0x70
[ 10.945478] u-Precision-5860-Tower kernel: kgd2kfd_init_zone_device+0x121/0x220 [amdgpu]
[ 10.946581] u-Precision-5860-Tower kernel: amdgpu_device_init+0x2a7e/0x2d60 [amdgpu]
[ 10.947280] u-Precision-5860-Tower kernel: ? pci_read_config_word+0x27/0x60
[ 10.947292] u-Precision-5860-Tower kernel: ? do_pci_enable_device+0xe3/0x110
[ 10.947303] u-Precision-5860-Tower kernel: amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
[ 10.947986] u-Precision-5860-Tower kernel: amdgpu_pci_probe+0x1ba/0x610 [amdgpu]
[ 10.948657] u-Precision-5860-Tower kernel: local_pci_probe+0x48/0xb0
[ 10.948670] u-Precision-5860-Tower kernel: work_for_cpu_fn+0x17/0x30
[ 10.948682] u-Precision-5860-Tower kernel: process_one_work+0x178/0x360
[ 10.948693] u-Precision-5860-Tower kernel: ? __pfx_worker_thread+0x10/0x10
[ 10.948703] u-Precision-5860-Tower kernel: worker_thread+0x307/0x430
[ 10.948713] u-Precision-5860-Tower kernel: ? __pfx_worker_thread+0x10/0x10
[ 10.948722] u-Precision-5860-Tower kernel: kthread+0xf4/0x130
[ 10.948730] u-Precision-5860-Tower kernel: ? __pfx_kthread+0x10/0x10
[ 10.948737] u-Precision-5860-Tower kernel: ret_from_fork+0x43/0x70
[ 10.948748] u-Precision-5860-Tower kernel: ? __pfx_kthread+0x10/0x10
[ 10.948755] u-Precision-5860-Tower kernel: ret_from_fork_asm+0x1b/0x30
[ 10.948766] u-Precision-5860-Tower kernel: </TASK>
[ 10.948769] u-Precision-5860-Tower kernel: Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common amdgpu(+) i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp snd_sof_pci_intel_tgl snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof_intel_hda snd_ctl_led snd_sof coretemp snd_sof_utils snd_soc_acpi_intel_match input_leds snd_soc_acpi dell_wmi pmt_telemetry snd_hda_codec_realtek snd_soc_core ledtrig_audio kvm_intel dell_smbios snd_compress binfmt_misc snd_sof_intel_hda_mlink dell_wmi_sysman snd_hda_ext_core nls_iso8859_1 pmt_class intel_sdsi dell_wmi_ddv firmware_attributes_class sparse_keymap dell_wmi_descriptor wmi_bmof hid_generic snd_hda_codec_hdmi snd_hda_codec_generic kvm video irqbypass amdxcp snd_hda_intel crct10dif_pclmul i2c_algo_bit crc32_pclmul snd_intel_dspcfg dcdbas polyval_clmulni drm_ttm_helper snd_hda_codec polyval_generic ghash_clmulni_intel ttm sha256_ssse3 snd_hwdep sha1_ssse3 drm_exec aesni_intel snd_hda_core gpu_sched crypto_simd
[ 10.948893] u-Precision-5860-Tower kernel: drm_suballoc_helper cryptd drm_buddy snd_pcm drm_display_helper rapl intel_cstate drm_kms_helper snd_seq idxd cec snd_seq_device isst_if_mmio snd_timer isst_if_mbox_pci cmdlinepart rc_core isst_if_common intel_vsec idxd_bus spi_nor snd mtd mei_me soundcore mei wmi usbhid hid mac_hid sch_fq_codel msr parport_pc drm ppdev lp efi_pstore parport ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 nvme nvme_core dax_hmem rtsx_pci_sdmmc cxl_acpi cxl_core atlantic e1000e ahci i2c_i801 spi_intel_pci rtsx_pci xhci_pci macsec spi_intel i2c_smbus libahci vmd xhci_pci_renesas pinctrl_alderlake
[ 10.949124] u-Precision-5860-Tower kernel: ---[ end trace 0000000000000000 ]---
Hardware description:
- CPU: Intel(R) Xeon(R) w5-2545
- GPU: 0000:57:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:7489]
- System Memory: 32GB
System information:
- Distro name and Version: Ubuntu 22.04.4
- Kernel version: drm-tip 6.8.0-rc7-8f85c978aed1+
How to reproduce the issue:
$ cat /lib/systemd/system/cycletest.service
[Unit]
Description=Reboots unit after 30s
[Service]
StandardOutput=journal+console
ExecStart=/bin/sh -c "\
test -f /cycle-count || echo 0 > /cycle-count;\
echo 'starting cycletest';\
sleep 30;\
expr `cat /cycle-count` + 1 > /cycle-count;\
systemctl reboot;\
"
[Install]
WantedBy=multi-user.target
Then enable and start the reboot service:
sudo systemctl daemon-reload
sudo systemctl enable cycletest.service
sudo systemctl start cycletest.service
Once the system hangs, manually power cycle the machine and run the commands below to stop and disable the service:
sudo systemctl stop cycletest.service
sudo systemctl disable cycletest.service
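After the power cycle, it can help to confirm that the hang was this BUG and not an unrelated failure. The following is only a minimal sketch: the function name has_init64_bug is hypothetical, the signature string is copied from the trace above, and the usage line assumes a persistent journal so that the log of the failed boot survives the power cycle.

```shell
#!/bin/sh
# Sketch: scan a kernel log on stdin for the BUG signature from this report.
# SIGNATURE is copied from the trace above; has_init64_bug is a hypothetical
# helper name, not part of any existing tool.
SIGNATURE='kernel BUG at arch/x86/mm/init_64.c:159!'

has_init64_bug() {
    # -F matches the signature as a fixed string, -q suppresses output.
    # Exit status is 0 when the signature is found, non-zero otherwise.
    grep -qF "$SIGNATURE"
}

# Example use after the power cycle (assumes the previous boot's journal
# was written to disk before the hang):
#   journalctl -k -b -1 --no-pager | has_init64_bug \
#       && echo "BUG hit at cycle $(cat /cycle-count)"
```

Note that /cycle-count is only incremented after the 30 second sleep, so the value read after a hang is the number of boots that completed before the failing one.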