- Mar 18, 2025
-
-
Tao Zhou authored
Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Mar 14, 2025
-
-
Xiang Liu authored
Enable ACA by default for psp v13_0_6/v13_0_14. Signed-off-by:
Xiang Liu <xiang.liu@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
ganglxie authored
for old asics that do not support mca translating, we just save PA for them Signed-off-by:
ganglxie <ganglxie@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 27, 2025
-
-
Xiang Liu authored
Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers. Signed-off-by:
Xiang Liu <xiang.liu@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 25, 2025
-
-
ganglxie authored
save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by:
ganglxie <ganglxie@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
ganglxie authored
bad page adding can be simpler with nps info Signed-off-by:
ganglxie <ganglxie@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 17, 2025
-
-
Candice Li authored
Enable ACA by default for psp v13_0_12. Signed-off-by:
Candice Li <candice.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Yang Wang <kevinyang.wang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Feb 13, 2025
-
-
Victor Skvortsov authored
VFs are not able to query error counts for all RAS blocks. Rather than returning error for queries on these blocks, skip sysfs the creation all together. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jan 06, 2025
-
-
SRINIVASAN SHANMUGAM authored
It ensures that appropriate error codes are returned when an error condition is detected Fixes the below; drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2849 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_umc_pages_in_a_row()' failed. drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2884 amdgpu_ras_add_bad_pages() warn: missing error code here? 'amdgpu_ras_mca2pa()' failed. v2: s/-EIO/-EINVAL, retained the use of -EINVAL from amdgpu_umc_pages_in_a_row & and amdgpu_ras_mca2pa_by_idx, when the RAS context is not initialized or the convert_ras_err_addr function is unavailable. (Thomas) V3: Returning 0 as the absence of eh_data is acceptable. (Tao) Fixes: a8d133e6 ("drm/amdgpu: parse legacy RAS bad page mixed with new data in various NPS modes") Reported-by:
Dan Carpenter <dan.carpenter@linaro.org> Cc: YiPeng Chai <yipeng.chai@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by:
Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Dec 18, 2024
-
-
Candice Li authored
Enable psp v14_0_3 RAS support for non-SRIOV configurations. Signed-off-by:
Candice Li <candice.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Dec 10, 2024
-
-
Candice Li authored
Add nbif v6_3_1 fatal error handling support. Signed-off-by:
Candice Li <candice.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Candice Li authored
Add psp v14_0_3 ras support. Signed-off-by:
Candice Li <candice.li@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
Enable RAS Cap check and initialize RAS funcs for psp v13_0_12 Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
Remove unnecessary variable and simplify the logic. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
All legacy RAS bad pages are generated in NPS1 mode, but new bad page can be generated in any NPS mode, so we can't use retired_page stored on eeprom directly in non-nps1 mode even for legacy data. We need to take different actions for different data, new data can be identified from old data by UMC_CHANNEL_IDX_V2 flag. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
Old version of RAS TA doesn't support to convert MCA address stored on eeprom to physical address (PA), support to find all bad pages in one memory row by PA with old RAS TA. This approach is only suitable for nps1 mode. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
So eeprom space can be saved, compatible with legacy way. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Before scheduling a recovery due to scheduler/job hang, check if a RAS error is detected. If so, choose RAS recovery to handle the situation. A scheduler/job hang could be the side effect of a RAS error. In such cases, it is required to go through the RAS error recovery process. A RAS error recovery process in certains cases also could avoid a full device device reset. An error state is maintained in RAS context to detect the block affected. Fatal Error state uses unused block id. Set the block id when error is detected. If the interrupt handler detected a poison error, it's not required to look for a fatal error. Skip fatal error checking in such cases. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
NPS mode is introduced, the value of memory physical address (PA) related to a MCA address varies per nps mode. We need to rely on MCA address and convert it into PA accroding to nps mode. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
One UMC MCA address could map to multiply physical address (PA): AMDGPU_RAS_EEPROM_REC_PA: one record store one PA AMDGPU_RAS_EEPROM_REC_MCA: one record store one MCA address, PA is not cared about Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 20, 2024
-
-
lijo lazar authored
Some in_reset checks are infact checking whether the state is reinitialization after reset. Replace with reset_in_recovery calls to identify that it's really checking for recovery stage after reset. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Acked-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 11, 2024
-
-
Victor Skvortsov authored
Enable RAS late init if VF RAS Telemetry is supported. When enabled, the VF can use this interface to query total RAS error counts from the host. The VF FB access may abruptly end due to a fatal error, therefore the VF must cache and sanitize the input. The Host allows 15 Telemetry messages every 60 seconds, afterwhich the host will ignore any more in-coming telemetry messages. The VF will rate limit its msg calling to once every 5 seconds (12 times in 60 seconds). While the VF is rate limited, it will continue to report the last good cached data. v2: Flip generate report & update statistics order for VF Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Acked-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Zhigang Luo <zhigang.luo@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Victor Skvortsov authored
If VF RAS Capability support is enabled, guest is able to retrieve the real RAS support from the host. Signed-off-by:
Victor Skvortsov <victor.skvortsov@amd.com> Reviewed-by:
Zhigang Luo <zhigang.luo@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Nov 04, 2024
-
-
lijo lazar authored
For RAS errors, source of error is known. Skip the core dump of IP states. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 26, 2024
-
-
lijo lazar authored
Use XGMI hive information to rely on resetting XGMI devices on initialization rather than using mgpu structure. mgpu structure may have other devices as well. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Feifei Xu <feifxu@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Add a separate function to read badpage data during initialization. Reading bad pages will need hardware access and cannot be done during reset. Hence in cases where device needs a full reset during init itself, attempting to read will cause a deadlock. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Reviewed-by:
Feifei Xu <Feifei.Xu@amd.com> Reviewed-by:
Alex Deucher <alexander.deucher@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
lijo lazar authored
Drop pending_reset flag in gmc block. Instead use init level to determine which type of init is preferred - in this case MINIMAL. Signed-off-by:
Lijo Lazar <lijo.lazar@amd.com> Acked-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Tested-by:
Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
YiPeng Chai authored
In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again. Signed-off-by:
YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 18, 2024
-
-
Yan Zhen authored
Correctly spelled comments make it easier for the reader to understand the code. Replace 'udpate' with 'update' in the comment & replace 'recieved' with 'received' in the comment & replace 'dsiable' with 'disable' in the comment & replace 'Initiailize' with 'Initialize' in the comment & replace 'disble' with 'disable' in the comment & replace 'Disbale' with 'Disable' in the comment & replace 'enogh' with 'enough' in the comment & replace 'availabe' with 'available' in the comment. Acked-by:
Christian König <christian.koenig@amd.com> Signed-off-by:
Yan Zhen <yanzhen@vivo.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Sep 17, 2024
-
-
Tao Zhou authored
The feature is not applicable to specific app platform. v2: update the disablement condition and commit description v3: move the setting to amdgpu_ras_check_supported Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Aug 06, 2024
-
-
Yang Wang authored
- amdgpu_ras_error_statistic_ue_count() - amdgpu_ras_error_statistic_ce_count() - amdgpu_ras_error_statistic_de_count() The parameter 'err_addr' is no longer used since following patch. Fixes: a7e8467f ("drm/amdgpu: Remove unused code") Signed-off-by:
Yang Wang <kevinyang.wang@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Tao Zhou authored
In the convenience of calling it globally. Signed-off-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
Hawking Zhang authored
Data abort exception and unknown errors are supported. Signed-off-by:
Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jul 23, 2024
-
-
YiPeng Chai authored
Remove unused code. Signed-off-by:
YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jul 10, 2024
-
-
YiPeng Chai authored
The problem case is as follows: 1. GPU A triggers a gpu ras reset, and GPU A drives GPU B to also perform a gpu ras reset. 2. After gpu B ras reset started, gpu B queried a DE data. Since the DE data was queried in the ras reset thread instead of the page retirement thread, bad page retirement work would not be triggered. Then even if all gpu resets are completed, the bad pages will be cached in RAM until GPU B's bad page retirement work is triggered again and then saved to eeprom. This patch can save the bad pages to eeprom in time after gpu ras reset is completed. v2: 1. Add the above description to code comments. 2. Reuse existing function. Signed-off-by:
YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
YiPeng Chai authored
Before uninstalling gpu driver, flush all cached ras bad pages to eeprom. v2: Put the same code into a function and reuse the function. Signed-off-by:
YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-
- Jul 08, 2024
-
-
Yang Wang authored
add amdgpu ras 'event_state' sysfs device attribute support Signed-off-by:
Yang Wang <kevinyang.wang@amd.com> Reviewed-by:
Tao Zhou <tao.zhou1@amd.com> Reviewed-by:
Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by:
Alex Deucher <alexander.deucher@amd.com>
-