[PATCH 3/6] drm/scheduler: Job timeout handler returns status
Luben Tuikov
luben.tuikov at amd.com
Wed Nov 25 16:48:47 UTC 2020
On 2020-11-25 04:50, Christian König wrote:
> Am 25.11.20 um 04:17 schrieb Luben Tuikov:
>> The job timeout handler now returns status
>> indicating back to the DRM layer whether the job
>> was successfully cancelled or whether more time
>> should be given to the job to complete.
>>
>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++--
>> include/drm/gpu_scheduler.h | 13 ++++++++++---
>> 2 files changed, 14 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index ff48101bab55..81b73790ecc6 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -28,7 +28,7 @@
>> #include "amdgpu.h"
>> #include "amdgpu_trace.h"
>>
>> -static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> +static int amdgpu_job_timedout(struct drm_sched_job *s_job)
>> {
>> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>> struct amdgpu_job *job = to_amdgpu_job(s_job);
>> @@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>> DRM_ERROR("ring %s timeout, but soft recovered\n",
>> s_job->sched->name);
>> - return;
>> + return 0;
>> }
>>
>> amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>> @@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>
>> if (amdgpu_device_should_recover_gpu(ring->adev)) {
>> amdgpu_device_gpu_recover(ring->adev, job);
>> + return 0;
>> } else {
>> drm_sched_suspend_timeout(&ring->sched);
>> if (amdgpu_sriov_vf(adev))
>> adev->virt.tdr_debug = true;
>> + return 1;
>> }
>> }
>>
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 2e0c368e19f6..61f7121e1c19 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -230,10 +230,17 @@ struct drm_sched_backend_ops {
>> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>>
>> /**
>> - * @timedout_job: Called when a job has taken too long to execute,
>> - * to trigger GPU recovery.
>> + * @timedout_job: Called when a job has taken too long to execute,
>> + * to trigger GPU recovery.
>> + *
>> + * Return 0, if the job has been aborted successfully and will
>> + * never be heard of from the device. Return non-zero if the
>> + * job wasn't able to be aborted, i.e. if more time should be
>> + * given to this job. The result is not "bool" as this
>> + * function is not a predicate, although its result may seem
>> + * as one.
>
> I think the whole approach of timing out a job needs to be rethinked.
> What's timing out here is the hardware engine, not the job.
>
> So we should also not have the job as parameter here. Maybe we should
> make that the fence we are waiting for instead.
Yes, I wanted this patch to be minimal, and not to disrupt
too many things.
Yes, in the future we can totally revamp this, but this
is a minimal patch.
>
>> */
>> - void (*timedout_job)(struct drm_sched_job *sched_job);
>> + int (*timedout_job)(struct drm_sched_job *sched_job);
>
> I would either return an error code, boolean or enum here. But not use a
> number without a define.
Yes, that's a great idea--I'll make the change now, and resubmit.
Regards,
Luben
>
> Regards,
> Christian.
>
>>
>> /**
>> * @free_job: Called once the job's finished fence has been signaled
>
More information about the amd-gfx
mailing list