Commits · 7547510d4a915f4f6d9b1262182d8db6763508f4 · Alex Deucher / linux

Mar 18, 2025

drm/amdgpu: format old RAS eeprom data into V3 version · 05d50ea3

Tao Zhou authored 3 weeks ago


Clear old data and save it in V3 format.

v2: only format eeprom data for new ASICs.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

05d50ea3

Mar 14, 2025

drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14 · 13c13bdd

Xiang Liu authored 1 month ago


Enable ACA by default for psp v13_0_6/v13_0_14.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

13c13bdd

drm/amdgpu: Save PA of bad pages for old asics · a4b6e990

ganglxie authored 2 weeks ago


for old asics that do not support mca translating, we
just save PA for them

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a4b6e990

Feb 27, 2025

drm/amdgpu: Report generic instead of unknown boot time errors · d4bd7a50

Xiang Liu authored 1 month ago


Change the DMESG reporting of unknown errors to "Boot Controller
Generic Error" to align with the RAS SPEC and provide more clarity
to customers.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d4bd7a50

Feb 25, 2025

drm/amdgpu: Change page/record number calculation based on nps · a8f921a1

ganglxie authored 1 month ago


save only one record to save eeprom space,and
bad_page_num = pa_rec_num + mca_rec_num*16

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a8f921a1

drm/amdgpu: Refine bad page adding · 0153d276

ganglxie authored 1 month ago


bad page adding can be simpler with nps info

Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0153d276

Feb 17, 2025

drm/amdgpu: Enable ACA by default for psp v13_0_12 · 59af05d6

Candice Li authored 1 month ago


Enable ACA by default for psp v13_0_12.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

59af05d6

Feb 13, 2025

drm/amdgpu: Skip err_count sysfs creation on VF unsupported RAS blocks · 04893397

Victor Skvortsov authored 2 months ago


VFs are not able to query error counts for all RAS blocks. Rather than
returning error for queries on these blocks, skip sysfs the creation
all together.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

04893397

drm/amdgpu: Update usage for bad page threshold · 16b85a09

Hawking Zhang authored 2 months ago


The driver's behavior varies based on
the configuration of amdgpu_bad_page_threshold setting

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16b85a09

Jan 06, 2025

drm/amdgpu: Fix error handling in amdgpu_ras_add_bad_pages · 9095567b

SRINIVASAN SHANMUGAM authored 3 months ago


It ensures that appropriate error codes are returned when an error
condition is detected

Fixes the below;
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed.
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed.

v2: s/-EIO/-EINVAL, retained the use of -EINVAL from
    amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the
    RAS context is not initialized or the convert_ras_err_addr function is
    unavailable. (Thomas)

V3: Returning 0 as the absence of eh_data is acceptable. (Tao)

Fixes: a8d133e6 ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: YiPeng Chai <yipeng.chai@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9095567b

Dec 18, 2024

drm/amdgpu: Enable psp v14_0_3 RAS support for non-SRIOV configurations. · d1ebe307

Candice Li authored 3 months ago


Enable psp v14_0_3 RAS support for non-SRIOV configurations.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d1ebe307

Dec 10, 2024

drm/amdgpu: Support nbif v6_3_1 fatal error handling · ecd1191e

Candice Li authored 7 months ago


Add nbif v6_3_1 fatal error handling support.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ecd1191e

drm/amdgpu: Add psp v14_0_3 ras support · 2c2b84f1

Candice Li authored 3 months ago


Add psp v14_0_3 ras support.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2c2b84f1

drm/amdgpu: Enable RAS for psp v13_0_12 · 9a826c4a

Hawking Zhang authored 7 months ago


Enable RAS Cap check and initialize RAS funcs
for psp v13_0_12

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9a826c4a

drm/amdgpu: correct the calculation of RAS bad page · ae756cd8

Tao Zhou authored 4 months ago


After the introduction of NPS RAS, one bad page record on eeprom may be
related to 1 or 16 bad pages, so the bad page record and bad page are
two different concepts, define a new variable to store bad page number.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ae756cd8

drm/amdgpu: split ras_eeprom_init into init and check functions · 1f06e7f3

Tao Zhou authored 4 months ago


Init function is for ras table header read and check function is
responsible for the validation of the header. Call them in different
stages.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1f06e7f3

drm/amdgpu: remove is_mca_add for ras_add_bad_pages · d08fb663

Tao Zhou authored 4 months ago


Remove unnecessary variable and simplify the logic.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d08fb663

drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes · a8d133e6

Tao Zhou authored 4 months ago


All legacy RAS bad pages are generated in NPS1 mode, but new bad page
can be generated in any NPS mode, so we can't use retired_page stored
on eeprom directly in non-nps1 mode even for legacy data. We need to
take different actions for different data, new data can be identified
from old data by UMC_CHANNEL_IDX_V2 flag.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a8d133e6

drm/amdgpu: support to find RAS bad pages via old TA · 07dd49e1

Tao Zhou authored 5 months ago


Old version of RAS TA doesn't support to convert MCA address stored on
eeprom to physical address (PA), support to find all bad pages in one
memory row by PA with old RAS TA. This approach is only suitable for
nps1 mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

07dd49e1

drm/amdgpu: store only one RAS bad page record for all pages in one row · c3d4acf0

Tao Zhou authored 5 months ago


So eeprom space can be saved, compatible with legacy way.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c3d4acf0

drm/amdgpu: Prefer RAS recovery for scheduler hang · e1ee2111

lijo lazar authored 5 months ago


Before scheduling a recovery due to scheduler/job hang, check if a RAS
error is detected. If so, choose RAS recovery to handle the situation. A
scheduler/job hang could be the side effect of a RAS error. In such
cases, it is required to go through the RAS error recovery process. A
RAS error recovery process in certains cases also could avoid a full
device device reset.

An error state is maintained in RAS context to detect the block
affected. Fatal Error state uses unused block id. Set the block id when
error is detected. If the interrupt handler detected a poison error,
it's not required to look for a fatal error. Skip fatal error checking
in such cases.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e1ee2111

drm/amdgpu: do RAS MCA2PA conversion in device init phase · 0eecff79

Tao Zhou authored 5 months ago


NPS mode is introduced, the value of memory physical address (PA)
related to a MCA address varies per nps mode. We need to rely on
MCA address and convert it into PA accroding to nps mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0eecff79

drm/amdgpu: add flag to indicate the type of RAS eeprom record · 772df3df

Tao Zhou authored 5 months ago


One UMC MCA address could map to multiply physical address (PA):

AMDGPU_RAS_EEPROM_REC_PA: one record store one PA
AMDGPU_RAS_EEPROM_REC_MCA: one record store one MCA address, PA
is not cared about

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

772df3df

Nov 20, 2024

drm/amdgpu: Use reset recovery state checks · e283f4fb

lijo lazar authored 4 months ago


Some in_reset checks are infact checking whether the state is
reinitialization after reset. Replace with reset_in_recovery calls to
identify that it's really checking for recovery stage after reset.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e283f4fb

Nov 11, 2024

drm/amdgpu: Implement virt req_ras_err_count · 84a2947e

Victor Skvortsov authored 4 months ago


Enable RAS late init  if VF RAS Telemetry is supported.

When enabled, the VF can use this interface to query total
RAS error counts from the host.

The VF FB access may abruptly end due to a fatal error,
therefore the VF must cache and sanitize the input.

The Host allows 15 Telemetry messages every 60 seconds, afterwhich
the host will ignore any more in-coming telemetry messages. The VF will
rate limit its msg calling to once every 5 seconds (12 times in 60 seconds).
While the VF is rate limited, it will continue to report the last
good cached data.

v2: Flip generate report & update statistics order for VF

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Acked-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

84a2947e

drm/amdgpu: VF Query RAS Caps from Host if supported · 907fec2d

Victor Skvortsov authored 4 months ago


If VF RAS Capability support is enabled, guest is able to
retrieve the real RAS support from the host.

Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

907fec2d

Nov 04, 2024

drm/amdgpu: Skip IP coredump for RAS errors · 81db4eab

lijo lazar authored 4 months ago


For RAS errors, source of error is known. Skip the core dump of IP
states.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

81db4eab

Sep 26, 2024

drm/amdgpu: Refactor XGMI reset on init handling · 631af731

lijo lazar authored 7 months ago


Use XGMI hive information to rely on resetting XGMI devices on
initialization rather than using mgpu structure. mgpu structure may have
other devices as well.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <feifxu@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

631af731

drm/amdgpu: Add helper to initialize badpage info · b17f8732

lijo lazar authored 7 months ago


Add a separate function to read badpage data during initialization.
Reading bad pages will need hardware access and cannot be done during
reset. Hence in cases where device needs a full reset during
init itself, attempting to read will cause a deadlock.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b17f8732

drm/amdgpu: Use init level for pending_reset flag · 5839d27d

lijo lazar authored 7 months ago


Drop pending_reset flag in gmc block. Instead use init level to
determine which type of init is preferred - in this case MINIMAL.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5839d27d

amd/amdgpu: Reduce unnecessary repetitive GPU resets · 9e0feb79

YiPeng Chai authored 6 months ago


In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9e0feb79

Sep 18, 2024

drm/amdgpu: fix typo in the comment · 0110ac11

Yan Zhen authored 6 months ago


Correctly spelled comments make it easier for the reader to understand
the code.

Replace 'udpate' with 'update' in the comment &
replace 'recieved' with 'received' in the comment &
replace 'dsiable' with 'disable' in the comment &
replace 'Initiailize' with 'Initialize' in the comment &
replace 'disble' with 'disable' in the comment &
replace 'Disbale' with 'Disable' in the comment &
replace 'enogh' with 'enough' in the comment &
replace 'availabe' with 'available' in the comment.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0110ac11

Sep 17, 2024

drm/amdgpu: disable GPU RAS bad page feature for specific ASIC · c389a060

Tao Zhou authored 6 months ago


The feature is not applicable to specific app platform.

v2: update the disablement condition and commit description
v3: move the setting to amdgpu_ras_check_supported

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c389a060

Aug 06, 2024

drm/amdgpu: remove RAS unused paramter 'err_addr' · 671af066

Yang Wang authored 7 months ago


- amdgpu_ras_error_statistic_ue_count()
- amdgpu_ras_error_statistic_ce_count()
- amdgpu_ras_error_statistic_de_count()

The parameter 'err_addr' is no longer used since following patch.

Fixes: a7e8467f ("drm/amdgpu: Remove unused code")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

671af066

drm/amdgpu: create function to check RAS RMA status · 792be2e2

Tao Zhou authored 7 months ago


In the convenience of calling it globally.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

792be2e2

drm/amdgpu: Add more types for boot time error reporting · dfe9d047

Hawking Zhang authored 7 months ago


Data abort exception and unknown errors are supported.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dfe9d047

Jul 23, 2024

drm/amdgpu: Remove unused code · a7e8467f

YiPeng Chai authored 8 months ago


Remove unused code.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a7e8467f

Jul 10, 2024

drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is completed · e23300df

YiPeng Chai authored 8 months ago


The problem case is as follows:
1. GPU A triggers a gpu ras reset, and GPU A drives
   GPU B to also perform a gpu ras reset.
2. After gpu B ras reset started, gpu B queried a DE
   data. Since the DE data was queried in the ras reset
   thread instead of the page retirement thread, bad
   page retirement work would not be triggered. Then
   even if all gpu resets are completed, the bad pages
   will be cached in RAM until GPU B's bad page retirement
   work is triggered again and then saved to eeprom.

This patch can save the bad pages to eeprom in time after gpu
ras reset is completed.

v2:
  1. Add the above description to code comments.
  2. Reuse existing function.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e23300df

drm/amdgpu: flush all cached ras bad pages to eeprom · c0470691

YiPeng Chai authored 8 months ago


Before uninstalling gpu driver, flush all cached ras
bad pages to eeprom.

v2:
  Put the same code into a function and reuse the function.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c0470691

Jul 08, 2024

drm/amdgpu: add ras event state device attribute support · 59f488be

Yang Wang authored 8 months ago


add amdgpu ras 'event_state' sysfs device attribute support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

59f488be

Admin message

Admin message