[PATCH 2/2] drm/amd/amdgpu: use the default reset for ras recovery

Alex Deucher alexdeucher at gmail.com
Mon May 6 19:30:11 UTC 2024


On Mon, Apr 29, 2024 at 4:07 AM Kenneth Feng <kenneth.feng at amd.com> wrote:
>
> use the default reset for ras recovery
>
> Signed-off-by: Kenneth Feng <kenneth.feng at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a037e8fba29f..f92b2c4f0d5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2437,6 +2437,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>         struct amdgpu_device *adev = ras->adev;
>         struct list_head device_list, *device_list_handle =  NULL;
>         struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
> +       int save_reset_method = amdgpu_reset_method;
>
>         if (hive) {
>                 atomic_set(&hive->ras_recovery, 1);
> @@ -2501,7 +2502,13 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>                         }
>                 }
>
> +               if (amdgpu_gpu_recovery == 2)
> +                       amdgpu_reset_method = -1;
> +
>                 amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
> +
> +               if (amdgpu_gpu_recovery == 2)
> +                       amdgpu_reset_method = save_reset_method;

This is racy.  amdgpu_reset_method, like amdgpu_gpu_recovery, is a
global variable that is referenced by every AMD GPU in the system
bound to amdgpu, so temporarily overwriting it here can change the
reset method another device sees while this recovery is running.
To handle this properly, we should store the selected reset method in
the adev structure and set that based on the module parameter at
driver bind time.  Then at runtime, if we need to change the reset
method, we can change the device-specific one in adev.  Maybe it would
be better to have two variables in adev, e.g., default_reset_method
and override_reset_method.  In cases where we have to use the default
method, we can use that.  In other cases, we can use the override
method.
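
A rough, standalone sketch of that idea, not actual driver code: the
struct and the demo_* helper names below are hypothetical stand-ins
for fields and functions that would live in or around struct
amdgpu_device, and it assumes -1 means "let the driver pick", as it
does for the module parameter in the patch above.

```c
#include <assert.h>
#include <stdbool.h>

/* -1 mirrors the "auto" value the patch writes to amdgpu_reset_method. */
#define DEMO_RESET_METHOD_AUTO (-1)

/* Hypothetical per-device state; the real struct amdgpu_device is far larger. */
struct demo_device {
	int default_reset_method;   /* always the driver's automatic choice */
	int override_reset_method;  /* snapshot of the module parameter */
};

/*
 * Called once per device at bind time.  This is the only place the
 * global module parameter value is consumed, and it is only read.
 */
static void demo_device_init_reset(struct demo_device *dev, int module_param)
{
	assert(dev);
	dev->default_reset_method = DEMO_RESET_METHOD_AUTO;
	dev->override_reset_method = module_param;
}

/*
 * Pick the method for one reset.  Nothing shared is written, so a
 * recovery on one device cannot change what another device sees.
 */
static int demo_device_reset_method(const struct demo_device *dev,
				    bool force_default)
{
	assert(dev);
	return force_default ? dev->default_reset_method
			     : dev->override_reset_method;
}
```

With something like this, the RAS recovery path would ask for the
default method (force_default = true) instead of saving, clobbering,
and restoring the global around amdgpu_device_gpu_recover().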

Alex


>         }
>         atomic_set(&ras->in_recovery, 0);
>         if (hive) {
> --
> 2.34.1
>

