[PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions
Yang, Stanley
Stanley.Yang at amd.com
Tue Oct 17 04:50:39 UTC 2023
[AMD Official Use Only - General]
The in_gpu_reset flag is set only after the reset error count and reset error status functions have been called, so amdgpu_in_reset() cannot be used here. Please check the ras->in_recovery flag instead.
Regards,
Stanley
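The suggestion above can be sketched as a standalone predicate. This is a minimal model, not the driver code: `mock_device`, `mock_ras`, `mock_mca_funcs`, `mock_set_debug_mode` and `mock_skip_ras_error_reset` are hypothetical stand-ins, and combining the in-reset state with the recovery flag through a logical OR is an assumption based on this discussion, not a confirmed final form of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct mock_mca_funcs {
	int (*mca_set_debug_mode)(void *adev, bool enable);
};

struct mock_ras {
	bool in_recovery;	/* models the ras->in_recovery flag named in the reply */
};

struct mock_device {
	bool in_gpu_reset;	/* models what amdgpu_in_reset() would report */
	struct mock_ras *ras;
	const struct mock_mca_funcs *mca_funcs;
};

/* Stub hook so that a device can advertise the debug-mode callback. */
int mock_set_debug_mode(void *adev, bool enable)
{
	(void)adev;
	(void)enable;
	return 0;
}

/* Returns true when the driver-side RAS error reset should be skipped. */
bool mock_skip_ras_error_reset(const struct mock_device *adev)
{
	bool recovering;
	bool has_hook;

	assert(adev != NULL);

	/* Assumption: RAS recovery is visible through the flag before the
	 * in-reset state is, so either one is enough to skip the reset.
	 */
	recovering = adev->ras && adev->ras->in_recovery;
	has_hook = adev->mca_funcs && adev->mca_funcs->mca_set_debug_mode;

	return (adev->in_gpu_reset || recovering) && has_hook;
}
```

With a device whose `in_gpu_reset` is still false but whose `ras->in_recovery` is already true, the predicate returns true, which is the case the amdgpu_in_reset() check alone would miss.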
From: Zhou1, Tao <Tao.Zhou1 at amd.com>
Sent: Friday, October 13, 2023 5:06 PM
To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Yang, Stanley <Stanley.Yang at amd.com>; Li, Candice <Candice.Li at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
Subject: Re: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions
How about this condition:
if ((amdgpu_in_reset(adev) || amdgpu_ras_intr_triggered()) &&
mca_funcs && mca_funcs->mca_set_debug_mode)
I use amdgpu_in_reset() to skip the error reset in all GPU resets, not only in resets triggered by a RAS fatal error.
Regards,
Tao
________________________________
From: Zhang, Hawking <Hawking.Zhang at amd.com>
Sent: Thursday, October 12, 2023 9:14 PM
To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Yang, Stanley <Stanley.Yang at amd.com>; Li, Candice <Candice.Li at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
Subject: RE: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions
- if (!amdgpu_ras_is_supported(adev, block))
+ /* skip ras error reset in gpu reset */
+ if (amdgpu_in_reset(adev) &&
+ mca_funcs && mca_funcs->mca_set_debug_mode)
+ return 0;
We should check the RAS in_recovery flag in this case. The reset domain is locked in a relatively late phase, at least *after* the error counter harvest. Please double-check.
Regards,
Hawking
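The ordering concern can be illustrated with a small model of the recovery sequence. The phase names and the point at which each flag becomes visible are assumptions drawn from this discussion (the reset domain is locked after the error counter harvest, and the recovery flag is assumed to be set before it); none of the identifiers below exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phases of a RAS-triggered recovery, in the assumed order. */
enum mock_phase {
	MOCK_IDLE,		/* no recovery in progress */
	MOCK_RECOVERY_BEGIN,	/* recovery started, recovery flag set */
	MOCK_HARVEST,		/* error counters harvested and reset */
	MOCK_DOMAIN_LOCKED,	/* reset domain locked, in-reset state visible */
	MOCK_DONE		/* recovery finished, both flags cleared */
};

struct mock_state {
	bool in_recovery;
	bool in_reset;
};

/* Flag values as they would be observed at a given phase of the model. */
struct mock_state mock_state_at(enum mock_phase phase)
{
	struct mock_state s = { false, false };
	int p = (int)phase;

	assert(p >= MOCK_IDLE && p <= MOCK_DONE);

	s.in_recovery = p >= MOCK_RECOVERY_BEGIN && p < MOCK_DONE;
	s.in_reset = p >= MOCK_DOMAIN_LOCKED && p < MOCK_DONE;
	return s;
}

/* Skip decision when only the in-reset state is consulted. */
bool mock_skip_with_in_reset(enum mock_phase phase)
{
	return mock_state_at(phase).in_reset;
}

/* Skip decision when the recovery flag is consulted instead. */
bool mock_skip_with_in_recovery(enum mock_phase phase)
{
	return mock_state_at(phase).in_recovery;
}
```

In this model, `mock_skip_with_in_reset(MOCK_HARVEST)` is false while `mock_skip_with_in_recovery(MOCK_HARVEST)` is true, so only the recovery flag prevents the driver from resetting the error state during the harvest.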
-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1 at amd.com>
Sent: Thursday, October 12, 2023 17:01
To: amd-gfx at lists.freedesktop.org; Yang, Stanley <Stanley.Yang at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Candice <Candice.Li at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
Subject: [PATCH 4/5] drm/amdgpu: bypass RAS error reset in some conditions
PMFW is responsible for RAS error reset in some conditions, so the driver can skip the operation.
Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 91ed4fd96ee1..6dddb0423411 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1105,11 +1105,18 @@ int amdgpu_ras_reset_error_count(struct amdgpu_device *adev,
enum amdgpu_ras_block block)
{
struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+ const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
if (!block_obj || !block_obj->hw_ops)
return 0;
- if (!amdgpu_ras_is_supported(adev, block))
+ /* skip ras error reset in gpu reset */
+ if (amdgpu_in_reset(adev) &&
+ mca_funcs && mca_funcs->mca_set_debug_mode)
+ return 0;
+
+ if (!amdgpu_ras_is_supported(adev, block) ||
+ !amdgpu_ras_get_mca_debug_mode(adev))
return 0;
if (block_obj->hw_ops->reset_ras_error_count)
@@ -1122,6 +1129,7 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
enum amdgpu_ras_block block)
{
struct amdgpu_ras_block_object *block_obj = amdgpu_ras_get_ras_block(adev, block, 0);
+ const struct amdgpu_mca_smu_funcs *mca_funcs = adev->mca.mca_funcs;
if (!block_obj || !block_obj->hw_ops) {
dev_dbg_once(adev->dev, "%s doesn't config RAS function\n",
@@ -1129,7 +1137,13 @@ int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
return 0;
}
- if (!amdgpu_ras_is_supported(adev, block))
+ /* skip ras error reset in gpu reset */
+ if (amdgpu_in_reset(adev) &&
+ mca_funcs && mca_funcs->mca_set_debug_mode)
+ return 0;
+
+ if (!amdgpu_ras_is_supported(adev, block) ||
+ !amdgpu_ras_get_mca_debug_mode(adev))
return 0;
if (block_obj->hw_ops->reset_ras_error_count)
--
2.35.1
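The control flow that the diff as posted gives amdgpu_ras_reset_error_count() can be modelled outside the kernel as follows. This is a rough sketch under stated assumptions: `mock_adev` and `mock_ras_reset_error_count` are hypothetical, plain boolean fields stand in for the real driver queries, and the hardware callback is replaced by a counter so that the effect of each early return is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; each field replaces one driver-side check. */
struct mock_adev {
	bool has_hw_ops;		/* block_obj && block_obj->hw_ops */
	bool in_reset;			/* amdgpu_in_reset(adev) */
	bool has_debug_mode_hook;	/* mca_funcs && mca_funcs->mca_set_debug_mode */
	bool ras_supported;		/* amdgpu_ras_is_supported(adev, block) */
	bool mca_debug_mode;		/* amdgpu_ras_get_mca_debug_mode(adev) */
	int hw_resets;			/* times the hardware reset hook would run */
};

/* Mirrors the order of the checks in the posted diff; always returns 0. */
int mock_ras_reset_error_count(struct mock_adev *adev)
{
	assert(adev != NULL);

	if (!adev->has_hw_ops)
		return 0;

	/* skip ras error reset in gpu reset */
	if (adev->in_reset && adev->has_debug_mode_hook)
		return 0;

	if (!adev->ras_supported || !adev->mca_debug_mode)
		return 0;

	/* Stand-in for calling hw_ops->reset_ras_error_count(adev). */
	adev->hw_resets++;
	return 0;
}
```

Only a device that has hardware ops, is outside a reset (or lacks the debug-mode hook), supports RAS for the block, and has MCA debug mode enabled reaches the increment; every other combination returns early without touching the error state.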