[PATCH] drm/amd/amdgpu: Fix GPU reset handling after WGR is triggered
Alex Deucher
alexdeucher at gmail.com
Fri Apr 25 16:03:05 UTC 2025
On Fri, Apr 25, 2025 at 11:58 AM Nikola Petrovic <nipetrov at amd.com> wrote:
>
> I've identified a critical issue with the existing GPU reset mechanism that causes a BSOD on the Windows Hyper-V platform. The current function:
>
> static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>
> incorrectly sets the AMDGPU_HOST_FLR flag whenever any engine hangs. This wrongly assumes that the Host PF is always the one triggering the FLR. However, a VF (VM guest) can also cause a GPU hang, and in that case the VM reset fails. The failure causes a FULL_ACCESS_TIMEOUT on the host side, the host retries the Whole Guest Reset (WGR) up to six times, and the guest hits a BSOD after five to six failed restarts.
>
> Additionally, the current sequence sends a READY_TO_RESTART event and then requests FULL_ACCESS, which seems incorrect to me.
>
> My fix addresses this problem by using REQ_GPU_RESET to initiate the necessary restart while appropriately handling the FULL ACCESS request. My implementation has successfully passed 100 loop tests, confirming its effectiveness.
>
> Signed-off-by: Nikola Petrovic <nipetrov at amd.com>
Acked-by: Alex Deucher <alexander.deucher at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 7f354cd532dc..a2a436707200 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5314,11 +5314,12 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  	struct amdgpu_hive_info *hive = NULL;
>
>  	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
> +		r = amdgpu_virt_wait_reset(adev);
> +		if (r)
> +			return r;
>  		if (!amdgpu_ras_get_fed_status(adev))
> -			amdgpu_virt_ready_to_reset(adev);
> -		amdgpu_virt_wait_reset(adev);
> +			amdgpu_virt_reset_gpu(adev);
>  		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
> -		r = amdgpu_virt_request_full_gpu(adev, true);
>  	} else {
>  		r = amdgpu_virt_reset_gpu(adev);
>  	}
> --
> 2.43.0
>
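For anyone following the thread, the reordering in the AMDGPU_HOST_FLR branch can be modelled outside the kernel roughly as below. This is only a toy sketch, not driver code: struct trace, record(), host_flr_before() and host_flr_after() are hypothetical names, the event strings merely echo the amdgpu_virt_* helpers in the diff, the fed parameter stands in for the result of amdgpu_ras_get_fed_status(), and wait_rc stands in for the return value of amdgpu_virt_wait_reset().

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: records the order in which the helpers named in the
 * diff would be called. Nothing here talks to real hardware. */
struct trace {
	char buf[128];
};

/* Append one event name to the comma-separated trace. */
static void record(struct trace *t, const char *event)
{
	if (t->buf[0] != '\0')
		strncat(t->buf, ",", sizeof(t->buf) - strlen(t->buf) - 1);
	strncat(t->buf, event, sizeof(t->buf) - strlen(t->buf) - 1);
}

/* HOST_FLR branch before the patch: the wait result is ignored and
 * full GPU access is requested last. */
static int host_flr_before(struct trace *t, bool fed)
{
	if (!fed)
		record(t, "ready_to_reset");
	record(t, "wait_reset");
	record(t, "request_full_gpu");
	return 0;
}

/* HOST_FLR branch after the patch: wait first, bail out on error,
 * then issue the reset request unless a fatal error was flagged. */
static int host_flr_after(struct trace *t, bool fed, int wait_rc)
{
	record(t, "wait_reset");
	if (wait_rc)
		return wait_rc;
	if (!fed)
		record(t, "reset_gpu");
	return 0;
}
```

Calling host_flr_before() and host_flr_after() on zero-initialised traces and comparing the two buf strings shows the difference in ordering, and that the patched path stops early when the wait step fails.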