[PATCH v2] drm/amdgpu: Fix CPER error handling on VFs
Zhang, Hawking
Hawking.Zhang at amd.com
Wed Apr 2 04:24:23 UTC 2025
[AMD Official Use Only - AMD Internal Distribution Only]
Reviewed-by: Hawking Zhang <Hawking.Zhang at amd.com>
Regards,
Hawking
-----Original Message-----
From: Skvortsov, Victor <Victor.Skvortsov at amd.com>
Sent: Wednesday, April 2, 2025 04:59
To: amd-gfx at lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Zhao, Victor <Victor.Zhao at amd.com>; Yi, Tony <Tony.Yi at amd.com>; Skvortsov, Victor <Victor.Skvortsov at amd.com>
Subject: [PATCH v2] drm/amdgpu: Fix CPER error handling on VFs
From: Tony Yi <Tony.Yi at amd.com>
CPER read will loop infinitely if an error is encountered and the more bit is set. Add error checks to break upon failure.
v2: added function pointer checks
Suggested-by: Tony Yi <Tony.Yi at amd.com>
Signed-off-by: Victor Skvortsov <Victor.Skvortsov at amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 0bb8cbe0dcc0..83f3334b3931 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -1323,6 +1323,9 @@ static int amdgpu_virt_req_ras_err_count_internal(struct amdgpu_device *adev, bo {
struct amdgpu_virt *virt = &adev->virt;
+ if (!virt->ops || !virt->ops->req_ras_err_count)
+ return -EOPNOTSUPP;
+
/* Host allows 15 ras telemetry requests per 60 seconds. Afterwhich, the Host
* will ignore incoming guest messages. Ratelimit the guest messages to
* prevent guest self DOS.
@@ -1378,14 +1381,16 @@ amdgpu_virt_write_cpers_to_ring(struct amdgpu_device *adev,
used_size = host_telemetry->header.used_size;
if (used_size > (AMD_SRIOV_RAS_TELEMETRY_SIZE_KB << 10))
- return 0;
+ return -EINVAL;
cper_dump = kmemdup(&host_telemetry->body.cper_dump, used_size, GFP_KERNEL);
if (!cper_dump)
return -ENOMEM;
- if (checksum != amd_sriov_msg_checksum(cper_dump, used_size, 0, 0))
+ if (checksum != amd_sriov_msg_checksum(cper_dump, used_size, 0, 0)) {
+ ret = -EINVAL;
goto out;
+ }
*more = cper_dump->more;
@@ -1425,7 +1430,7 @@ static int amdgpu_virt_req_ras_cper_dump_internal(struct amdgpu_device *adev)
int ret = 0;
uint32_t more = 0;
- if (!amdgpu_sriov_ras_cper_en(adev))
+ if (!virt->ops || !virt->ops->req_ras_cper_dump)
return -EOPNOTSUPP;
do {
@@ -1434,7 +1439,7 @@ static int amdgpu_virt_req_ras_cper_dump_internal(struct amdgpu_device *adev)
adev, virt->fw_reserve.ras_telemetry, &more);
else
ret = 0;
- } while (more);
+ } while (more && !ret);
return ret;
}
@@ -1444,6 +1449,9 @@ int amdgpu_virt_req_ras_cper_dump(struct amdgpu_device *adev, bool force_update)
struct amdgpu_virt *virt = &adev->virt;
int ret = 0;
+ if (!amdgpu_sriov_ras_cper_en(adev))
+ return -EOPNOTSUPP;
+
if ((__ratelimit(&virt->ras.ras_cper_dump_rs) || force_update) &&
down_read_trylock(&adev->reset_domain->sem)) {
mutex_lock(&virt->ras.ras_telemetry_mutex);
--
2.34.1
More information about the amd-gfx
mailing list