[PATCH 3/6] drm/scheduler: Job timeout handler returns status

Wed Nov 25 16:48:47 UTC 2020

On 2020-11-25 04:50, Christian König wrote:
> Am 25.11.20 um 04:17 schrieb Luben Tuikov:
>> The job timeout handler now returns status
>> indicating back to the DRM layer whether the job
>> was successfully cancelled or whether more time
>> should be given to the job to complete.
>>
>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 ++++--
>>   include/drm/gpu_scheduler.h             | 13 ++++++++++---
>>   2 files changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index ff48101bab55..81b73790ecc6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -28,7 +28,7 @@
>>   #include "amdgpu.h"
>>   #include "amdgpu_trace.h"
>>   
>> -static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> +static int amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   {
>>   	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>   	struct amdgpu_job *job = to_amdgpu_job(s_job);
>> @@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>   		DRM_ERROR("ring %s timeout, but soft recovered\n",
>>   			  s_job->sched->name);
>> -		return;
>> +		return 0;
>>   	}
>>   
>>   	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>> @@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>   
>>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
>>   		amdgpu_device_gpu_recover(ring->adev, job);
>> +		return 0;
>>   	} else {
>>   		drm_sched_suspend_timeout(&ring->sched);
>>   		if (amdgpu_sriov_vf(adev))
>>   			adev->virt.tdr_debug = true;
>> +		return 1;
>>   	}
>>   }
>>   
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 2e0c368e19f6..61f7121e1c19 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -230,10 +230,17 @@ struct drm_sched_backend_ops {
>>   	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>>   
>>   	/**
>> -         * @timedout_job: Called when a job has taken too long to execute,
>> -         * to trigger GPU recovery.
>> +	 * @timedout_job: Called when a job has taken too long to execute,
>> +	 * to trigger GPU recovery.
>> +	 *
>> +	 * Return 0, if the job has been aborted successfully and will
>> +	 * never be heard of from the device. Return non-zero if the
>> +	 * job wasn't able to be aborted, i.e. if more time should be
>> +	 * given to this job. The result is not "bool" as this
>> +	 * function is not a predicate, although its result may seem
>> +	 * as one.
> 
> I think the whole approach of timing out a job needs to be rethinked. 
> What's timing out here is the hardware engine, not the job.
> 
> So we should also not have the job as parameter here. Maybe we should 
> make that the fence we are waiting for instead.

Yes, I wanted this patch to be minimal, and not to disrupt
too many things.

Yes, in the future we can totally revamp this, but this
is a minimal patch.

> 
>>   	 */
>> -	void (*timedout_job)(struct drm_sched_job *sched_job);
>> +	int (*timedout_job)(struct drm_sched_job *sched_job);
> 
> I would either return an error code, boolean or enum here. But not use a 
> number without a define.

Yes, that's a great idea--I'll make the change now, and resubmit.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>>   
>>   	/**
>>            * @free_job: Called once the job's finished fence has been signaled
>