[PATCH 2/2] drm/amdgpu: set job guilty if reset skipped

Tue Jan 19 14:55:54 UTC 2021

Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>

Andrey

On 1/19/21 7:22 AM, Horace Chen wrote:
> If 2 jobs on 2 different rings time out within a very short
> period, the reset for the second job will be skipped because a
> reset is already in progress.
>
> But that doesn't mean the second job is not guilty, since it
> also timed out and may be a bad job. So before bailing out
> of the reset, we need to increase the karma for this job too.
>
> Signed-off-by: Horace Chen <horace.chen at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9574da3abc32..1d6ff9fe37de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4574,6 +4574,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>   				job ? job->base.id : -1, hive->hive_id);
>   			amdgpu_put_xgmi_hive(hive);
> +			if (job)
> +				drm_sched_increase_karma(&job->base);
>   			return 0;
>   		}
>   		mutex_lock(&hive->hive_lock);
> @@ -4617,6 +4619,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   					job ? job->base.id : -1);
>   		r = 0;
>   		/* even we skipped this reset, still need to set the job to guilty */
> +		if (job)
> +			drm_sched_increase_karma(&job->base);
>   		goto skip_recovery;
>   	}
>
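
For anyone following the thread, the behaviour described in the commit message can be sketched as a small toy model in plain C. This is only an illustration under stated assumptions, not the driver code: `toy_job`, `toy_device`, `toy_gpu_recover` and `toy_reset_done` are hypothetical names, the `karma` field merely stands in for what `drm_sched_increase_karma()` updates, and the model assumes the normal recovery path also marks the job.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct toy_job {
    int karma;      /* models the counter bumped by drm_sched_increase_karma() */
};

struct toy_device {
    bool in_reset;  /* models "another reset is already in progress" */
};

/* Returns true if this call performed the reset, false if it bailed out. */
static bool toy_gpu_recover(struct toy_device *dev, struct toy_job *job)
{
    if (dev->in_reset) {
        /* Reset skipped: still mark the timed-out job, as the patch does. */
        if (job)
            job->karma++;
        return false;
    }

    dev->in_reset = true;
    /* Assumption: the normal recovery path marks the job as well. */
    if (job)
        job->karma++;
    /* ... the actual reset work would happen here ... */
    return true;
}

/* Models the end of a reset, after which a new one may start. */
static void toy_reset_done(struct toy_device *dev)
{
    dev->in_reset = false;
}
```

In this model, two jobs timing out back to back both end up with their karma increased, even though only the first call performs a reset. A NULL job is tolerated in both paths, matching the `if (job)` guard in the hunks above.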