Random crash on amdgpu due to temperature misreporting
Submitted by Michel
Assigned to mes..@..op.org
Link to original bug (#111080)
Description
Created attachment 144716 amdgpu_pm_info information from start of game to crash
Hi,
I have been experiencing random crashes in Dota 2 for the past 2 years. In that time I have changed everything in the computer: 6900K -> Threadripper, Corsair memory -> G.Skill, Radeon Frontier -> Radeon VII, Ubuntu 16.04 -> 16.10 -> 17.04 -> 17.10 -> 18.04 -> 18.10 -> 19.04. I have also gone through every Mesa version in between, and am currently on:
"
OpenGL renderer string: AMD Radeon VII (VEGA20, DRM 3.32.0, 5.2.0-rc7+, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel - padoka PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
"
Every one of these configurations shows the same random crash.
I finally got a lead on the problem when I saw the GPU reporting unrealistic values, e.g. the clock jumping into the 10,000 MHz range. Around the time of the crash, the temperature in the logs goes from 62 C to 500 C and back to 62 C within two seconds. I suspect this causes the GPU to apply its protection and freeze; if the reading were true, it would also violate some laws of physics.
Most other tools I use to test the graphics card, for example Unigine, report correct values within the supported range defined for the card, which is:
"
#OD_VDDC_CURVE:
#0: 808Mhz 704mV
#1: 1304Mhz 777mV
#2: 1801Mhz 1054mV
#OD_RANGE:
#SCLK: 808Mhz 2200Mhz
#MCLK: 351Mhz 1200Mhz
"
Attached is an example generated with "watch -t -n1 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info|grep -A 9 "GFX Clocks" | tee -a /home/mitch/tmp/gpulog.txt'"
Example, grep Temp:
"
GPU Temperature: 70 C
GPU Temperature: 511 C
GPU Temperature: 69 C
"
Example, grep (SCLK:
"
1924 MHz (SCLK)
5422 MHz (SCLK)
1999 MHz (SCLK)
"
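For anyone who wants to check their own log for the same spikes, here is a rough sketch of a scanner for lines in the amdgpu_pm_info format quoted above. The SCLK ceiling is taken from the OD_RANGE in this report; the temperature ceiling of 120 C and the names `find_outliers`, `SCLK_MAX_MHZ` and `MAX_PLAUSIBLE_TEMP_C` are assumptions made up for illustration, not anything the driver defines.

```python
import re

# Assumed limits for illustration only:
# - SCLK ceiling is the upper OD_RANGE bound quoted in this report.
# - The temperature ceiling is an arbitrary "clearly not real" threshold.
SCLK_MAX_MHZ = 2200
MAX_PLAUSIBLE_TEMP_C = 120

# Patterns match the amdgpu_pm_info lines shown in this report.
TEMP_RE = re.compile(r"GPU Temperature:\s*(\d+)\s*C")
SCLK_RE = re.compile(r"(\d+)\s*MHz\s*\(SCLK\)")


def find_outliers(lines):
    """Return (line_number, kind, value) for each implausible reading."""
    outliers = []
    for number, line in enumerate(lines, start=1):
        temp = TEMP_RE.search(line)
        if temp and int(temp.group(1)) > MAX_PLAUSIBLE_TEMP_C:
            outliers.append((number, "temp_c", int(temp.group(1))))
        sclk = SCLK_RE.search(line)
        if sclk and int(sclk.group(1)) > SCLK_MAX_MHZ:
            outliers.append((number, "sclk_mhz", int(sclk.group(1))))
    return outliers


if __name__ == "__main__":
    # Sample values copied from the grep output in this report.
    sample = [
        "GPU Temperature: 70 C",
        "GPU Temperature: 511 C",
        "GPU Temperature: 69 C",
        "1924 MHz (SCLK)",
        "5422 MHz (SCLK)",
        "1999 MHz (SCLK)",
    ]
    for number, kind, value in find_outliers(sample):
        print(f"line {number}: {kind} = {value}")
```

To run it over a captured log, the sample list could be replaced with the lines of the log file, for example `find_outliers(open(path))` where `path` points at the file written by the watch command.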
I realize the issue might be somewhere other than the Mesa driver, but I would like to know where it could be and whether anybody else has seen this kind of behaviour.
Thank you very much for any help
Attachment 144716, "amdgpu_pm_info information from start of game to crash":
gpulog_crash3.txt