[mce] random reboots Machine Check: 0 Bank 5: bea0000000000108
I've been battling this problem since the day I built my system back in December 2020. I've finally isolated it to the GPU, as I was lucky enough to have 2 almost identical systems, and the other system didn't have any problems.
Let's start with a system description:
CPU: Ryzen 5900X
MB: MSI X570 Tomahawk (currently at BIOS 7C84v161(Beta version))
RAM: G.Skill 32 GB (2 x 16) 3600 MHz/CL16 (XMP profile)
GPU: Aorus RX 5700 XT
- WD Black SN850 NVMe
- Samsung 970 EVO Plus NVMe
- Samsung 860 Pro SATA3
The problem presents as a sudden system restart, with an MCE error such as the following captured:
[ 0.544291] mce: [Hardware Error]: Machine check events logged
[ 0.544292] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
[ 0.544296] mce: [Hardware Error]: TSC 0 ADDR ffffffa6bd6bfa MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.544298] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1616750853 SOCKET 0 APIC 12 microcode a201009
[ 0.544319] #8 #9 #10 #11
[ 0.549297] mce: [Hardware Error]: Machine check events logged
[ 0.549298] mce: [Hardware Error]: CPU 11: Machine Check: 0 Bank 5: bea0000000000108
[ 0.549299] mce: [Hardware Error]: TSC 0 ADDR ffffffa6bd6bfa MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[ 0.549301] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1616750853 SOCKET 0 APIC 1a microcode a201009
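For anyone who wants to read the status value rather than just match it against other reports, here is a rough Python sketch that splits bea0000000000108 into its flag bits. The bit positions are my assumption: bits 63-57 follow the usual x86 MCA status layout, while TCC (55), SYNDV (53) and the 6-bit extended error code field are taken from how I understand the Linux kernel to treat AMD banks, so please check them against the AMD documentation. The helper name decode_mca_status is made up for this example.

```python
# Assumed bit layout of the MCA status register (see the caveats above).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt (AMD-specific, assumed)
    53: "SYNDV",  # SYND register is valid (AMD-specific, assumed)
}


def decode_mca_status(status):
    """Split a 64-bit MCA status value into set flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,            # low 16 bits
        "ext_error_code": (status >> 16) & 0x3F,  # assumed 6-bit field
    }


decoded = decode_mca_status(0xBEA0000000000108)
print(decoded["flags"])
print(hex(decoded["error_code"]), decoded["ext_error_code"])
```

Under these assumptions the value from the log has VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV set, with error code 0x108 and extended error code 0, i.e. an uncorrected error with corrupted processor context, which is consistent with the immediate reboot.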
At first I thought it was a problem with the CPU and spent ~1 month talking to AMD support. They couldn't reproduce the problem, confirmed that it is not with the CPU, and suggested talking to the motherboard vendor. I sent my motherboard first to the reseller (they didn't find any problem), who in turn sent it on to MSI. The motherboard came back in what looks like exactly the same state. In addition, I tried putting a 5800X CPU into my system and swapped in memory from a 100% working system. The issue persisted.
There is tons of advice on the BZ and forums about kernel parameters and BIOS settings; none of it solves the problem. Setting PSI Idle Current from Auto to Typical only reduces the frequency of the restarts.
The biggest trigger so far is the idle state or sleep. As soon as there is a sudden spike in GPU load (I presume), the system restarts.
As a final resort, I managed to swap the GPU for an RTX 3070 for ~10 days and had zero issues over that period. I tried to force the system to restart by putting it to sleep and waking it as often as I could, with no luck.
I'm quite positive that this is the same problem; however, I've been advised to open a new issue.
I would love to keep digging to isolate the problem to the specific card, model or GPU driver; however, in the current market climate that is simply impossible.
As the very last resort I can think of, I've switched my PCI-E to Gen3 mode to see whether the PCI-E generation might be the cause.
I'm attaching fresh dmesg output, but please do let me know what else I can provide to pinpoint the problem.
dmesg-3.log