[amdgpu] kernel crash when trying to resume from suspend under memory pressure
Brief summary of the problem:
The system does not resume from suspend. The kernel crashes, and the log points to the amdgpu driver.
Issue for the same problem on NixOS and Arch Linux: #1575 (closed)
Hardware description:
- CPU: Intel i9-9900K
- GPU: AMD Radeon RX 6600XT
- System Memory: G.Skill Ripjaws 32 GB DDR4 RAM (3200 MHz)
- Display(s): Dell UP2716D 27" 2.5K IPS LED Display, Dell 24" 2K IPS Display
- Type of Display Connection: both DisplayPort
System information:
- Distro name and Version: Ubuntu 22.04.1
- Kernel version: 5.15.0-52-generic
- Custom kernel: No
- AMD official driver version: driver=amdgpu
The issue was also reproducible with these kernel versions:
- 5.15.70
- 5.11.13
- 5.4.111
How to reproduce the issue:
With these steps I can always reproduce it:
- close every program except the terminal
- disable swap
sudo swapoff -a
- fill 99% of the available RAM
stress-ng --vm-bytes $(awk '/MemAvailable/{printf "%d\n", $2 * 0.99;}' < /proc/meminfo)k --vm-keep -m 1
- suspend the system. It can take some time until the system actually turns off
- try to resume. The system powers on, but I only see a black screen. At that point, kernel errors are written to the journal
- hard shut down the system by holding the power button
- start the system and check the journal for kernel errors
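The steps above can be collected into a few shell helpers. This is only a sketch: the function names `vm_fill_kb`, `filter_gpu_errors` and `fill_memory` are made up for illustration and are not part of any tool, and the grep pattern is just a guess at which log lines are relevant. Nothing runs until a function is called explicitly.

```shell
#!/bin/sh
# Sketch of the reproduction steps; all helper names are hypothetical.

# Print 99% of MemAvailable in kB, read from a meminfo-style file
# (defaults to /proc/meminfo).
vm_fill_kb() {
    awk '/MemAvailable/ { printf "%d\n", $2 * 0.99 }' "${1:-/proc/meminfo}"
}

# Keep only log lines that mention amdgpu, drm, TTM or a failed page
# allocation. Reads from stdin.
filter_gpu_errors() {
    grep -E 'amdgpu|\[drm|\[TTM\]|page allocation failure'
}

# Disable swap and fill memory with stress-ng. This blocks until it is
# interrupted, so suspend the system from another terminal or the desktop.
fill_memory() {
    sudo swapoff -a
    stress-ng --vm-bytes "$(vm_fill_kb)k" --vm-keep -m 1
}
```

After the forced reboot, something like `journalctl -k -b -1 | filter_gpu_errors` should show the kernel messages of the failed boot, assuming the journal is stored persistently.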
Attached files:
Log files (for system lockups / game freezes / crashes)
Okt 22 00:44:32 davidak-Z390-UD kernel: kworker/u32:14: page allocation failure: order:0, mode:0x100c02(GFP_NOIO|__GFP_HIGHMEM|__GFP_HARDWALL), nodemask=(null),cpuse>
Okt 22 00:44:32 davidak-Z390-UD kernel: CPU: 6 PID: 70882 Comm: kworker/u32:14 Not tainted 5.15.0-52-generic #58-Ubuntu
Okt 22 00:44:32 davidak-Z390-UD kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 UD/Z390 UD, BIOS F10 11/05/2021
Okt 22 00:44:32 davidak-Z390-UD kernel: Workqueue: events_unbound async_run_entry_fn
Okt 22 00:44:32 davidak-Z390-UD kernel: Call Trace:
Okt 22 00:44:32 davidak-Z390-UD kernel: <TASK>
Okt 22 00:44:32 davidak-Z390-UD kernel: show_stack+0x52/0x5c
Okt 22 00:44:32 davidak-Z390-UD kernel: dump_stack_lvl+0x4a/0x63
Okt 22 00:44:32 davidak-Z390-UD kernel: dump_stack+0x10/0x16
Okt 22 00:44:32 davidak-Z390-UD kernel: warn_alloc+0x138/0x160
Okt 22 00:44:32 davidak-Z390-UD kernel: __alloc_pages_slowpath.constprop.0+0xa0b/0xa40
Okt 22 00:44:32 davidak-Z390-UD kernel: __alloc_pages+0x311/0x330
Okt 22 00:44:32 davidak-Z390-UD kernel: alloc_pages+0x9e/0x1e0
Okt 22 00:44:32 davidak-Z390-UD kernel: ttm_pool_alloc+0x38e/0x520 [ttm]
Okt 22 00:44:32 davidak-Z390-UD kernel: ? __vmalloc_node+0x4e/0x70
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu_ttm_tt_populate+0x33/0x70 [amdgpu]
Okt 22 00:44:32 davidak-Z390-UD kernel: ttm_tt_populate+0xb8/0x1c0 [ttm]
Okt 22 00:44:32 davidak-Z390-UD kernel: ttm_bo_handle_move_mem+0x1b9/0x220 [ttm]
Okt 22 00:44:32 davidak-Z390-UD kernel: ttm_mem_evict_first+0x443/0x680 [ttm]
Okt 22 00:44:32 davidak-Z390-UD kernel: ttm_resource_manager_evict_all+0x9c/0x1e0 [ttm]
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu_ttm_evict_resources+0x36/0x70 [amdgpu]
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu_device_suspend+0xcc/0x140 [amdgpu]
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu_pmops_suspend+0x33/0x50 [amdgpu]
Okt 22 00:44:32 davidak-Z390-UD kernel: pci_pm_suspend+0x7b/0x190
Okt 22 00:44:32 davidak-Z390-UD kernel: ? pci_pm_freeze+0xd0/0xd0
Okt 22 00:44:32 davidak-Z390-UD kernel: dpm_run_callback+0x69/0x130
Okt 22 00:44:32 davidak-Z390-UD kernel: __device_suspend+0x140/0x410
Okt 22 00:44:32 davidak-Z390-UD kernel: async_suspend+0x23/0x70
Okt 22 00:44:32 davidak-Z390-UD kernel: async_run_entry_fn+0x30/0x120
Okt 22 00:44:32 davidak-Z390-UD kernel: process_one_work+0x228/0x3d0
Okt 22 00:44:32 davidak-Z390-UD kernel: worker_thread+0x53/0x420
Okt 22 00:44:32 davidak-Z390-UD kernel: ? process_one_work+0x3d0/0x3d0
Okt 22 00:44:32 davidak-Z390-UD kernel: kthread+0x127/0x150
Okt 22 00:44:32 davidak-Z390-UD kernel: ? set_kthread_struct+0x50/0x50
Okt 22 00:44:32 davidak-Z390-UD kernel: ret_from_fork+0x1f/0x30
Okt 22 00:44:32 davidak-Z390-UD kernel: </TASK>
Okt 22 00:44:32 davidak-Z390-UD kernel: Mem-Info:
Okt 22 00:44:32 davidak-Z390-UD kernel: active_anon:1097 inactive_anon:7747899 isolated_anon:0
active_file:148 inactive_file:155 isolated_file:0
unevictable:0 dirty:49 writeback:2
slab_reclaimable:22505 slab_unreclaimable:49925
mapped:18029 shmem:24709 pagetables:21118 bounce:0
kernel_misc_reclaimable:0
free:99282 free_pcp:95 free_cma:0
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 active_anon:4388kB inactive_anon:30991596kB active_file:592kB inactive_file:620kB unevictable:0kB isolated(anon):0kB i>
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 DMA free:11264kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB>
Okt 22 00:44:32 davidak-Z390-UD kernel: lowmem_reserve[]: 0 832 31935 31935 31935
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 DMA32 free:126172kB min:1760kB low:2612kB high:3464kB reserved_highatomic:0KB active_anon:0kB inactive_anon:783736kB a>
Okt 22 00:44:32 davidak-Z390-UD kernel: lowmem_reserve[]: 0 0 31103 31103 31103
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 Normal free:259692kB min:260012kB low:291860kB high:323708kB reserved_highatomic:0KB active_anon:4388kB inactive_anon:>
Okt 22 00:44:32 davidak-Z390-UD kernel: lowmem_reserve[]: 0 0 0 0 0
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 DMA32: 1*4kB (U) 1*8kB (U) 1*16kB (M) 2*32kB (U) 2*64kB (UM) 2*128kB (UM) 1*256kB (M) 3*512kB (UM) 1*1024kB (M) 0*2048>
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 Normal: 55971*4kB (UME) 4476*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 259692>
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Okt 22 00:44:32 davidak-Z390-UD kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Okt 22 00:44:32 davidak-Z390-UD kernel: 25027 total pagecache pages
Okt 22 00:44:32 davidak-Z390-UD kernel: 0 pages in swap cache
Okt 22 00:44:32 davidak-Z390-UD kernel: Swap cache stats: add 0, delete 0, find 0/0
Okt 22 00:44:32 davidak-Z390-UD kernel: Free swap = 0kB
Okt 22 00:44:32 davidak-Z390-UD kernel: Total swap = 0kB
Okt 22 00:44:32 davidak-Z390-UD kernel: 8367214 pages RAM
Okt 22 00:44:32 davidak-Z390-UD kernel: 0 pages HighMem/MovableOnly
Okt 22 00:44:32 davidak-Z390-UD kernel: 170189 pages reserved
Okt 22 00:44:32 davidak-Z390-UD kernel: 0 pages hwpoisoned
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm] evicting device resources failed
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm] free PSP TMR buffer
Okt 22 00:44:32 davidak-Z390-UD kernel: [TTM] Failed allocating page table
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm] evicting device resources failed
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
Okt 22 00:44:32 davidak-Z390-UD kernel: ACPI: PM: Preparing to enter system sleep state S3
Okt 22 00:44:32 davidak-Z390-UD kernel: SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
Okt 22 00:44:32 davidak-Z390-UD kernel: cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
Okt 22 00:44:32 davidak-Z390-UD kernel: node 0: slabs: 22, objs: 1122, free: 0
Okt 22 00:44:32 davidak-Z390-UD kernel: ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
...
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
Okt 22 00:44:32 davidak-Z390-UD kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
Okt 22 00:44:32 davidak-Z390-UD kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
Okt 22 00:44:32 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: PM: failed to resume async: error -62
...
Okt 22 00:44:42 davidak-Z390-UD kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=66433, emitted seq=66435
Okt 22 00:44:42 davidak-Z390-UD kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Okt 22 00:44:42 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Okt 22 00:44:42 davidak-Z390-UD kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
Okt 22 00:44:43 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Okt 22 00:44:43 davidak-Z390-UD kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
Okt 22 00:44:43 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Okt 22 00:44:43 davidak-Z390-UD kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Okt 22 00:44:43 davidak-Z390-UD firefox_firefox.desktop[58658]: [GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
Okt 22 00:44:46 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command!
Okt 22 00:44:46 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable smu features.
Okt 22 00:44:46 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features!
Okt 22 00:44:46 davidak-Z390-UD kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Okt 22 00:44:46 davidak-Z390-UD kernel: [drm] free PSP TMR buffer
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm] psp gfx command DESTROY_TMR(0x7) failed and response status is (0x0)
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to terminate tmr
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU psp mode1 reset
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm] psp is not working correctly before mode1 reset!
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset failed
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:03:00.0
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm] VRAM is lost due to GPU reset!
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:173 vmid:0 pasid:0, for process pid 0 thread pid 0)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000400000 from client 0x12 (VMC)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000B3B
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: MP0 (0x5)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Okt 22 00:44:48 davidak-Z390-UD kernel: [drm] PSP is resuming...
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:173 vmid:0 pasid:0, for process pid 0 thread pid 0)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000400000 from client 0x12 (VMC)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: unknown (0x0)
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x0
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Okt 22 00:44:48 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Okt 22 00:44:49 davidak-Z390-UD systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm:psp_v11_0_memory_training [amdgpu]] *ERROR* send training msg failed.
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm:psp_resume [amdgpu]] *ERROR* Failed to process memory training!
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
Okt 22 00:44:53 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(1) failed
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
...
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
Okt 22 00:44:53 davidak-Z390-UD kernel: [drm] Skip scheduling IBs!
Okt 22 00:44:53 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset end with ret = -62
Okt 22 00:45:04 davidak-Z390-UD kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=66435, emitted seq=66437
Okt 22 00:45:04 davidak-Z390-UD kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Okt 22 00:45:04 davidak-Z390-UD kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Okt 22 00:45:12 davidak-Z390-UD gsd-power[14472]: Error setting property 'PowerSaveMode' on interface org.gnome.Mutter.DisplayConfig: Timeout was reached (g-io-error>
Okt 22 00:47:37 davidak-Z390-UD gnome-shell[14295]: libinput error: event5 - ZSA Ergodox EZ: client bug: event processing lagging behind by 21ms, your system is too>