[PATCH] amd/amdgpu: Reduce unnecessary repetitive GPU resets

Fri Sep 20 10:37:12 UTC 2024

On 20.09.24 at 09:36, YiPeng Chai wrote:
> In the multi-GPU case, once one GPU has started
> resetting all GPUs in the hive, the other GPUs do
> not need to trigger a GPU reset again.

Please drop any such handling. GPU resets in a hive are serialized using
a single-threaded workqueue.

If you want to prevent multiple GPU resets, you just need to cancel the
other queued-up resets before or after resetting the hive.

This handling here just duplicates this logic and is therefore a clear 
NAK from my side.
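
To illustrate the cancel-the-queued-resets approach, here is a minimal
userspace sketch. It is only a model under stated assumptions, not the
amdgpu code: the types and helpers (reset_work, reset_queue,
queue_reset, cancel_pending_resets, run_next_reset) are hypothetical
names made up for this example, and the FIFO stands in for a
single-threaded workqueue that runs one reset at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; not the amdgpu or kernel workqueue API. */
#define MAX_PENDING 8

struct reset_work {
	int gpu_id;   /* GPU that requested the reset */
	bool pending; /* true while queued and not yet run */
};

struct reset_queue {
	struct reset_work *items[MAX_PENDING]; /* FIFO, run one at a time */
	size_t count;                          /* number of queued requests */
	int hive_resets;                       /* hive-wide resets actually performed */
};

/* Queue a reset request; a request that is already pending is not queued twice. */
static bool queue_reset(struct reset_queue *q, struct reset_work *w)
{
	if (w->pending)
		return false;
	assert(q->count < MAX_PENDING);
	w->pending = true;
	q->items[q->count++] = w;
	return true;
}

/* Drop every request that is still waiting; returns how many were cancelled. */
static size_t cancel_pending_resets(struct reset_queue *q)
{
	size_t cancelled = q->count;

	for (size_t i = 0; i < q->count; i++)
		q->items[i]->pending = false;
	q->count = 0;
	return cancelled;
}

/* Run the oldest request: reset the hive once, then cancel what is left. */
static bool run_next_reset(struct reset_queue *q)
{
	if (q->count == 0)
		return false;

	struct reset_work *w = q->items[0];

	/* Shift the remaining requests forward to keep FIFO order. */
	for (size_t i = 1; i < q->count; i++)
		q->items[i - 1] = q->items[i];
	q->count--;
	w->pending = false;

	q->hive_resets++;         /* stands in for resetting all GPUs in the hive */
	cancel_pending_resets(q); /* the hive reset already covered these GPUs */
	return true;
}
```

In this model, if three GPUs queue a reset and run_next_reset() is
called once, hive_resets ends up at 1 and the queue is empty, so no
extra per-hive state flag is needed to suppress the duplicates.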

Regards,
Christian.

>
> Signed-off-by: YiPeng Chai <YiPeng.Chai at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 21 ++++++++++++++++++++-
>   1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index dbfc41ddc3c7..7d48541b03d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -4306,8 +4306,27 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>   		ras->gpu_reset_flags |= AMDGPU_RAS_GPU_RESET_MODE1_RESET;
>   	}
>   
> -	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> +	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) {
> +		struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
> +		int hive_ras_recovery = 0;
> +
> +		if (hive) {
> +			hive_ras_recovery = atomic_read(&hive->ras_recovery);
> +			amdgpu_put_xgmi_hive(hive);
> +		}
> +		/* In the case of multiple GPUs, after a GPU has started
> +		 * resetting all GPUs on hive, other GPUs do not need to
> +		 * trigger GPU reset again.
> +		 */
> +		if (!hive_ras_recovery)
> +			amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
> +		else
> +			atomic_set(&ras->in_recovery, 0);
> +	} else {
> +		flush_work(&ras->recovery_work);
>   		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
> +	}
> +
>   	return 0;
>   }
>
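
For readers unfamiliar with the atomic_cmpxchg(&ras->in_recovery, 0, 1)
== 0 test in the quoted diff: it is a "first caller wins" guard, where
only the caller that flips the flag from 0 to 1 proceeds. A rough
userspace analogue using C11 atomics might look like the sketch below.
The in_recovery variable and the try_begin_recovery/end_recovery
helpers are hypothetical names for illustration, and C11
atomic_compare_exchange_strong is used in place of the kernel's
atomic_cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in flag: 0 = idle, 1 = recovery in progress. */
static atomic_int in_recovery;

/* Returns true only for the caller that moves the flag from 0 to 1. */
static bool try_begin_recovery(void)
{
	int expected = 0;

	/* Succeeds only if the flag still holds 0; then stores 1 atomically. */
	return atomic_compare_exchange_strong(&in_recovery, &expected, 1);
}

/* Clears the flag so that a later caller can start a new recovery. */
static void end_recovery(void)
{
	/* Only the caller that won try_begin_recovery() should get here. */
	assert(atomic_load(&in_recovery) == 1);
	atomic_store(&in_recovery, 0);
}
```

A second call to try_begin_recovery() returns false until
end_recovery() has run, which mirrors the branch taken when the
cmpxchg in the patch does not return 0.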