[Regression][Bisected] pp_table writes began to fail for Navi10 on amd-staging-drm-next
Summary
Somewhere between the two commits listed below, writes to pp_table began to fail every time with an I/O error. The SMU appears to re-initialize, then immediately fail to send the GetDpmFreqByIndex SMU message, leaving it in a mild "limp-home" state.
I'm bisecting right now, and will follow up with a comment with the problem commit.
Commit | Value |
---|---|
6b7ad8618edb | good |
cefd5db37208 | bad |
Notes
- On any kernel where pp_table writes are working, any failed write to pp_table leaves pp_table unwritable until a reboot. That is livable as long as you don't push bad tables, so I had ignored the issue, but it might be related to how this all got started.
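For readers unfamiliar with the operation, a pp_table write is a plain write of the binary table to the amdgpu sysfs node. The following is a minimal sketch and not the actual setup-overclock.sh: the function name `write_pp_table`, the default path `/sys/class/drm/card0/device/pp_table` (assuming the card is card0), and the input file name are assumptions for illustration.

```shell
# Write a binary powerplay table to the pp_table sysfs node (needs root).
# $1: path to the table file (hypothetical, e.g. modified_pp_table.bin)
# $2: sysfs node; the default assumes the GPU is card0
write_pp_table() {
    table="$1"
    node="${2:-/sys/class/drm/card0/device/pp_table}"
    # cat exits non-zero if the write fails, which is what the
    # "cat: write error: Input/output error" line in the failing log shows.
    if cat "$table" > "$node"; then
        echo "pp_table write ok"
    else
        echo "pp_table write failed" >&2
        return 1
    fi
}
```

On an affected kernel the expectation is that this returns 1 on every attempt, matching the I/O error (EIO, -5) in the logs.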
Logs
Grepping for amdgpu-navi-overclock or setup-overclock.sh will find you the area in the logs where the pp_table is being written to.
- Success - amdgpu-pptable-logs-working.log
- Failure - amdgpu-pptable-logs.log
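As a convenience, the grep described above can be wrapped in a small helper. This is only a sketch; the function name `find_pp_table_writes` is made up for illustration, and the file names are the two attachments listed above.

```shell
# Print the log lines (with line numbers) that mention the overclock
# service or its script, i.e. the region around the pp_table write.
find_pp_table_writes() {
    grep -n -E 'amdgpu-navi-overclock|setup-overclock\.sh' "$1"
}

# Example: compare the two attached logs.
#   find_pp_table_writes amdgpu-pptable-logs-working.log
#   find_pp_table_writes amdgpu-pptable-logs.log
```

The kernel messages of interest sit between the matched lines, so the reported line numbers are mainly useful for jumping to the right place in the full log.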
Snipped Logs for Convenience
Working
From 6b7ad8618edb - the last known working case.
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[509]: Updating pp_table
Jul 30 06:07:24 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000035, smu fw version = >
Jul 30 06:07:24 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
Jul 30 06:07:24 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
Jul 30 06:07:24 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
Jul 30 06:07:24 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: SMU is initialized successfully!
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[509]: Setting pp_od_clk_voltage
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[809]: s 0 800 1 2160
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[809]: m 1 903
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[809]: vc 0 800 750
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[809]: vc 1 1505 950
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[809]: vc 2 2160 1195
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[509]: Final settings for card0:
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: OD_SCLK:
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 0: 800Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 1: 2160Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: OD_MCLK:
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 1: 903MHz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: OD_VDDC_CURVE:
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 0: 800MHz 750mV
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 1: 1505MHz 950mV
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: 2: 2160MHz 1195mV
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: OD_RANGE:
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: SCLK: 800Mhz 2300Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: MCLK: 625Mhz 1000Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_SCLK[0]: 800Mhz 2300Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_VOLT[0]: 750mV 1200mV
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_SCLK[1]: 800Mhz 2300Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_VOLT[1]: 750mV 1200mV
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_SCLK[2]: 800Mhz 2300Mhz
Jul 30 06:07:24 mcoffin-dev-tower setup-overclock.sh[810]: VDDC_CURVE_VOLT[2]: 750mV 1200mV
Jul 30 06:07:24 mcoffin-dev-tower systemd[1]: Finished Overclock for Navi card.
Failing
From cefd5db37208 - the first known failure case.
Jul 30 06:12:38 mcoffin-dev-tower setup-overclock.sh[522]: Updating pp_table
Jul 30 06:12:38 mcoffin-dev-tower setup-overclock.sh[820]: cat: write error: Input/output error
Jul 30 06:12:38 mcoffin-dev-tower systemd[1]: amdgpu-navi-overclock@card0.service: Main process exited, code=exited, status=1/FAILURE
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000035, smu fw version = >
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: SMU driver if version not matched
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: use vbios provided pptable
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: SMU is initialized successfully!
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: failed send message: GetDpmFreqByIndex (32) param: 0x000400ff response 0xffff>
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: [smu_v11_0_set_single_dpm_table] failed to get dpm levels!
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: Failed to setup default dpm clock tables!
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: smu reset failed, ret = -5
Jul 30 06:12:38 mcoffin-dev-tower audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=amdgpu-navi-overclock@card0 comm="systemd" ex>
Jul 30 06:12:38 mcoffin-dev-tower systemd[1]: amdgpu-navi-overclock@card0.service: Failed with result 'exit-code'.
Jul 30 06:12:38 mcoffin-dev-tower systemd[1]: Failed to start Overclock for Navi card.
Jul 30 06:12:38 mcoffin-dev-tower systemd[1]: Startup finished in 16.462s (firmware) + 3.059s (loader) + 4.076s (kernel) + 5.184s (userspace) = 28.782s.
Jul 30 06:12:38 mcoffin-dev-tower kernel: amdgpu 0000:0a:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
Bisect Info
First bad commit: ec8ee23f610578c71885a36ddfcf58d35cccab67.
git bisect log
# bad: [cefd5db37208da458fa10f83f866f2f37eef70e9] drm/amdgpu: enable ih CG for navy_flounder
# good: [6b7ad8618edbe6aecf1122e654d08a8237471800] drm/radeon: fix double free
# good: [6b7ad8618edbe6aecf1122e654d08a8237471800] drm/radeon: fix double free
git bisect start 'cefd5db37208' '6b7ad8618edb' 'drivers/gpu/drm/amd'
# bad: [23c22d5f619b4ef80e65e8406476a8d05ed4ac93] drm/amd/powerplay: apply gfxoff disablement/enablement for all SMU11 ASICs
git bisect bad 23c22d5f619b4ef80e65e8406476a8d05ed4ac93
# good: [8939ebcea3802e8c134cb4d2b7ec5c69661d2428] drm/amd/display: fix dcn3 p_state_change_support validation
git bisect good 8939ebcea3802e8c134cb4d2b7ec5c69661d2428
# good: [171c842518ade2f1a018e3267d788613f081c420] drm/amdgpu: RAS emergency restart logic refine
git bisect good 171c842518ade2f1a018e3267d788613f081c420
# bad: [447d4cdb5c702aa28e6090fe45dfd91da2334b45] drm/amd/powerplay: update Sienna Cichlid default dpm table setup
git bisect bad 447d4cdb5c702aa28e6090fe45dfd91da2334b45
# good: [74c3c3f5c263554d6450215a65c505bf9582877b] drm/amd/powerplay: add more members for dpm table
git bisect good 74c3c3f5c263554d6450215a65c505bf9582877b
# good: [1dc9ba1aa9604e5cd10630b6ec3b85007ac36672] drm/amd/powerplay: update Arcturus default dpm table setting
git bisect good 1dc9ba1aa9604e5cd10630b6ec3b85007ac36672
# bad: [ec8ee23f610578c71885a36ddfcf58d35cccab67] drm/amd/powerplay: update Navi10 default dpm table setup
git bisect bad ec8ee23f610578c71885a36ddfcf58d35cccab67
# first bad commit: [ec8ee23f610578c71885a36ddfcf58d35cccab67] drm/amd/powerplay: update Navi10 default dpm table setup
Log with notes
Here is a log of what happened with each attempted revision.
Commit | Value | Notes | Boots Attempted |
---|---|---|---|
23c22d5f619b | bad | pp_table write failure | 1 |
8939ebcea380 | good | working | 2 |
171c842518ad | good | working | 2 |
447d4cdb5c70 | bad | pp_table write failure | 1 |
74c3c3f5c263 | good | working | 2 |
1dc9ba1aa960 | good | working | 2 |
ec8ee23f6105 | bad | pp_table write failure | 1 |
880e5a74ea31 + patch 381849 | bad | pp_table write failure | 1 |
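For anyone repeating the bisect, the good/bad decision for each boot can be made mechanically from the kernel log. The sketch below is an assumption about how one might do that, not how the results above were produced; the function name `classify_boot` and the file name `boot.log` are hypothetical, and the two patterns are copied from the failing log in this report.

```shell
# Classify one boot's kernel log as "good" or "bad" for git bisect.
# $1: a file containing the kernel messages of that boot
classify_boot() {
    if grep -q -E 'failed send message: GetDpmFreqByIndex|smu reset failed' "$1"; then
        echo bad
    else
        echo good
    fi
}

# Hypothetical usage after booting a candidate kernel and letting the
# overclock service attempt its pp_table write:
#   journalctl -k -b > boot.log
#   git bisect "$(classify_boot boot.log)"
```

A "good" result only means the failure messages were absent from that one boot, which is why the working revisions in the table were each booted twice before being marked good.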