[PATCH v2] drm/amd/amdgpu: consider kernel job always not guilty

Wed Jul 21 06:26:12 UTC 2021

On 21.07.21 at 04:05, Jingwen Chen wrote:
> [Why]
> Currently every timed-out job is considered guilty. In the SRIOV
> multi-VF use case, the VF FLR happens first and the job timeout is
> detected afterwards. Several jobs can time out within a very small
> time slice. If the timeout of an innocent SDMA job is detected before
> that of the real bad job, the innocent SDMA job is marked guilty. This
> leads to a page fault after the job is resubmitted.
>
> [How]
> If the job is a kernel job (job->vm is NULL), always consider it not
> guilty.
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>

Reviewed-by: Christian König <christian.koenig at amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 37fa199be8b3..40461547701a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4410,7 +4410,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_fence_driver_force_completion(ring);
>   	}
>   
> -	if(job)
> +	if (job && job->vm)
>   		drm_sched_increase_karma(&job->base);
>   
>   	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
> @@ -4874,7 +4874,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>   				job ? job->base.id : -1, hive->hive_id);
>   			amdgpu_put_xgmi_hive(hive);
> -			if (job)
> +			if (job && job->vm)
>   				drm_sched_increase_karma(&job->base);
>   			return 0;
>   		}
> @@ -4898,7 +4898,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   					job ? job->base.id : -1);
>   
>   		/* even we skipped this reset, still need to set the job to guilty */
> -		if (job)
> +		if (job && job->vm)
>   			drm_sched_increase_karma(&job->base);
>   		goto skip_recovery;
>   	}
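
For readers following along, the guard added in all three hunks can be sketched in isolation. This is only an illustrative userspace mock, not the driver code: the mock_* types and functions are hypothetical stand-ins, the karma field merely models the counter that drm_sched_increase_karma() bumps, and treating vm == NULL as "kernel job" is the assumption the patch itself relies on.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields that the
 * patched condition looks at are modeled here. */
struct mock_vm {
	int id;
};

struct mock_job {
	struct mock_vm *vm;	/* NULL for a kernel job (assumption taken from the patch) */
	int karma;		/* models the counter raised by drm_sched_increase_karma() */
};

/* Mock of the blame step; the real scheduler function does more work. */
static void mock_increase_karma(struct mock_job *job)
{
	job->karma++;
}

/* Mirrors the patched check: if (job && job->vm) */
static void mock_blame_timed_out_job(struct mock_job *job)
{
	if (job && job->vm)
		mock_increase_karma(job);
}
```

With this guard, a timed-out job that has a VM still gets its karma raised, while a job without a VM, or a NULL job pointer, is left untouched.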