[PATCH 1/1] drm/scheduler: Job timeout handler returns status (v2)

Steven Price steven.price at arm.com
Mon Dec 14 09:00:07 UTC 2020


On 11/12/2020 21:44, Luben Tuikov wrote:
> On 2020-12-10 4:46 a.m., Steven Price wrote:
>> On 10/12/2020 02:14, Luben Tuikov wrote:
>>> This patch does not change current behaviour.
>>>
>>> The driver's job timeout handler now returns
>>> status indicating back to the DRM layer whether
>>> the task (job) was successfully aborted or whether
>>> more time should be given to the task to complete.
>>
>> I find the definitions given a little confusing, see below.
>>
>>> Default behaviour as of this patch, is preserved,
>>> except in obvious-by-comment case in the Panfrost
>>> driver, as documented below.
>>>
>>> All drivers which make use of the
>>> drm_sched_backend_ops' .timedout_job() callback
>>> have been accordingly renamed and return the
>>> would've-been default value of
>>> DRM_TASK_STATUS_ALIVE to restart the task's
>>> timeout timer--this is the old behaviour, and
>>> is preserved by this patch.
>>>
>>> In the case of the Panfrost driver, its timedout
>>> callback correctly first checks if the job had
>>> completed in due time and if so, it now returns
>>> DRM_TASK_STATUS_COMPLETE to notify the DRM layer
>>> that the task can be moved to the done list, to be
>>> freed later. In the other two subsequent checks,
>>> the value of DRM_TASK_STATUS_ALIVE is returned, as
>>> per the default behaviour.
>>>
>>> A more involved driver's solutions can be had
>>> in subequent patches.
>>
>> NIT: ^^^^^^^^^ subsequent
> 
> Thank you--no idea how my spellcheck and checkpatch.pl missed that.
> 
>>
>>>
>>> v2: Use enum as the status of a driver's job
>>>       timeout callback method.
>>>
>>> Cc: Alexander Deucher <Alexander.Deucher at amd.com>
>>> Cc: Andrey Grodzovsky <Andrey.Grodzovsky at amd.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
>>> Cc: Lucas Stach <l.stach at pengutronix.de>
>>> Cc: Russell King <linux+etnaviv at armlinux.org.uk>
>>> Cc: Christian Gmeiner <christian.gmeiner at gmail.com>
>>> Cc: Qiang Yu <yuq825 at gmail.com>
>>> Cc: Rob Herring <robh at kernel.org>
>>> Cc: Tomeu Vizoso <tomeu.vizoso at collabora.com>
>>> Cc: Steven Price <steven.price at arm.com>
>>> Cc: Alyssa Rosenzweig <alyssa.rosenzweig at collabora.com>
>>> Cc: Eric Anholt <eric at anholt.net>
>>> Reported-by: kernel test robot <lkp at intel.com>
>>
>> This reported-by seems a little odd for this patch.
> 
> I got this because I wasn't (originally) changing all drivers
> which were using the task timeout callback.
> Should I remove it?

Well the kernel test robot would only have "reported" things like build 
failures and this patch isn't really 'fixing' anything (as you state it 
does not change current behaviour). So personally I'm not sure how the 
kernel test robot triggered you to write this patch. But equally if you 
were inspired by it and want to credit it, that's fine by me! I'd 
assumed that this was just a copy-paste error and you'd taken the list 
of CC's from another patch and accidentally included the Reported-by too.

>>
>>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 +++--
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c | 10 +++++++-
>>>    drivers/gpu/drm/lima/lima_sched.c       |  4 +++-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c |  9 ++++---
>>>    drivers/gpu/drm/scheduler/sched_main.c  |  4 +---
>>>    drivers/gpu/drm/v3d/v3d_sched.c         | 32 +++++++++++++------------
>>>    include/drm/gpu_scheduler.h             | 20 +++++++++++++---
>>>    7 files changed, 57 insertions(+), 28 deletions(-)
>>>
>>
>> [....]
>>
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 2e0c368e19f6..cedfc5394e52 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -206,6 +206,11 @@ static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
>>>    	return s_job && atomic_inc_return(&s_job->karma) > threshold;
>>>    }
>>>    
>>> +enum drm_task_status {
>>> +	DRM_TASK_STATUS_COMPLETE,
>>> +	DRM_TASK_STATUS_ALIVE
>>> +};
>>> +
>>>    /**
>>>     * struct drm_sched_backend_ops
>>>     *
>>> @@ -230,10 +235,19 @@ struct drm_sched_backend_ops {
>>>    	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>>>    
>>>    	/**
>>> -         * @timedout_job: Called when a job has taken too long to execute,
>>> -         * to trigger GPU recovery.
>>> +	 * @timedout_job: Called when a job has taken too long to execute,
>>> +	 * to trigger GPU recovery.
>>> +	 *
>>> +	 * Return DRM_TASK_STATUS_ALIVE, if the task (job) is healthy
>>> +	 * and executing in the hardware, i.e. it needs more time.
>>
>> So 'alive' means the job (was) alive, and GPU recovery is happening.
>> I.e. it's just that the job takes too long. Panfrost will trigger a GPU
>> reset (killing the job) in this case while returning DRM_TASK_STATUS_ALIVE.
> 
> "ALIVE" means "at this moment the task is in the hardware executing,
> using memories, DMA engines, compute units, the whole thing." You,
> can also call it active, executing, busy, etc.
> 
> "ALIVE" makes no assumptions about how the task got there. Maybe
> this was original submission, and the task is taking its sweet time.
> Maybe the driver aborted it and reissued it, all in the timer timeout
> callback, etc.
> 
> It merely tells the DRM layer that the task is actively executing
> in the hardware, so no assumptions can be made about freeing memories,
> decrementing krefs, etc. IOW, it's executing, check back later.

Ok, makes sense, although I think the comment about "it needs more time" 
could do with a clarification that it's up to the timeout handler 
whether more time is granted (i.e. it's not the DRM scheduler core that 
is responsible for deciding to kill the job).
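
Perhaps something along these lines for the kernel-doc wording (just a 
rough suggestion, and assuming the callback signature now returns the 
enum, as this patch intends):

	/**
	 * @timedout_job: Called when a job has taken too long to execute,
	 * to trigger GPU recovery.
	 *
	 * Return DRM_TASK_STATUS_ALIVE if the job is still on the
	 * hardware (or has been requeued by the driver), so the
	 * scheduler must not free it yet. Whether the job is actually
	 * given more time is entirely up to this handler, not the
	 * scheduler core.
	 *
	 * Return DRM_TASK_STATUS_COMPLETE if the job is no longer
	 * known to the hardware, so the scheduler may free it.
	 */
	enum drm_task_status (*timedout_job)(struct drm_sched_job *sched_job);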

>>
>>> +	 *
>>> +	 * Return DRM_TASK_STATUS_COMPLETE, if the task (job) has
>>> +	 * been aborted or is unknown to the hardware, i.e. if
>>> +	 * the task is out of the hardware, and maybe it is now
>>> +	 * in the done list, or it was completed long ago, or
>>> +	 * if it is unknown to the hardware.
>>
>> Where 'complete' seems to mean a variety of things:
>>
>>    * The job completed successfully (i.e. the timeout raced), this is the
>> situation that Panfrost detects. In this case (and only this case) the
>> GPU reset will *not* happen.
> 
> Sounds good!
> 
>>
>>    * The job failed (aborted) and is no longer on the hardware. Panfrost
>> currently handles a job failure by triggering drm_sched_fault() to
>> trigger the timeout handler. But the timeout handler doesn't handle this
>> differently so will return DRM_TASK_STATUS_ALIVE.
> 
> Panfrost seems to do the right thing here by triggering. But I wonder,
> could the Panfrost driver simply return the task back to the DRM
> layer?
> 
> If the task is out of the hardware, it should return COMPLETE, a la,
> 
> Driver: COMPLETE!
> DRM: Fine, I can free memories and decrement krefs and toss it over
>       to the application client. (The application client would
>       then find out if the execution status was success or failure.
>       For instance, division by 0, or tan(pi/2), etc.)
> 
> Driver: ALIVE!
> DRM: Fine, I cannot make any assumptions of any kind. I'll back off
>       for a bit, and if I don't hear from you in some time, I'll
>       check back later.

Yeah, this is definitely an area for improvement - there are a number of 
shortcuts taken in the current driver. The handling of job failure (and 
job timeout) is an area that could be improved a lot. But equally these 
are things that don't happen during normal operation, so they haven't 
been a focus of attention. Ideally, resetting the GPU should only be 
attempted if the GPU has actually locked up, rather than just because a 
job failed or took too long.

>>    * The job is "unknown to hardware". There are some corner cases in
>> Panfrost (specifically two early returns from panfrost_job_hw_submit())
>> where the job never actually lands on the hardware, but the scheduler
>> isn't informed. We currently rely on the timeout handling to recover
>> from that. However, again, the timeout handler doesn't know about this
>> so will return DRM_TASK_STATUS_ALIVE.
> 
> "We currently rely on the timeout handling to recover from that."
> Do you mean that when the timeout handler kicks in, you find
> that the task never landed in the hardware and then you send it
> to the hardware? Yes? Then return "ALIVE".
> If however, you never decide to send it to the hardware and simply
> abandon it (in hopes that the DRM or the Application Client will
> reissue it), then you should send it back to DRM with status "COMPLETE".

It's a corner case - the job never landed on the hardware, but therefore 
never 'completed' either. Since it's not expected to happen in normal 
operation (it requires the PM code to fail to power on the GPU, or an 
internal driver error causing the GPU to be unexpectedly busy when 
submitting the job), the handling is to pretend the job is stuck on the 
GPU and recover in the usual way with a GPU reset.

So I guess in this case, since we're already lying to ourselves, we 
should also lie to the DRM scheduler and return ALIVE.

> 
>> So of the four cases listed in these comments, Panfrost is only getting
>> 2 'correct' after this change.
> 
> The Panfrost driver knows the hardware intimately, and maybe the hardware,
> or a newer version of it, can give a finer control over task execution
> on a GPU. If you can abort the task and reissue it, all in the timer
> callback, return ALIVE. If you abort it and kick it out of the hardware,
> return COMPLETE.
> 
> I'd say, this is about information, and passing information back to
> the DRM scheduler. It could be any information you'd like. I thought
> that the minimal case of information described above is fine.

Ok, I think all that's really needed is to improve the wording in the 
comment. Specifically:

  * ALIVE: The DRM scheduler must maintain knowledge of the job (not 
free its data).

  * COMPLETE: The DRM scheduler can free the job.
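
In driver terms I imagine that looks roughly like the sketch below (a 
hand-wavy illustration, not the actual Panfrost code; the helper names 
are made up):

static enum drm_task_status my_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = to_my_job(sched_job);	/* made-up wrapper */

	/* The timeout raced with the job completing: the job is off
	 * the hardware, so the scheduler is free to clean it up.
	 */
	if (my_job_is_done(job))			/* made-up check */
		return DRM_TASK_STATUS_COMPLETE;

	/* The job is (or is treated as) still on the hardware: reset
	 * the GPU and keep the scheduler's bookkeeping around until
	 * recovery resolves it.
	 */
	my_gpu_reset(job);				/* made-up reset */
	return DRM_TASK_STATUS_ALIVE;
}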

>>
>> But what I really want to know is what the scheduler is planning to do
>> in these situations? The Panfrost return value in this patch is really a
>> "did we trigger a GPU reset" - and doesn't seem to match the
>> descriptions above.
> 
> The scheduler would like to know if the task is used by the GPU, i.e.
> it is in the hardware and its memories are used and krefs are upped,
> and DMA is taking place, etc. Or, if the task is out of the hardware,
> and the DRM can free relevant memories, decrement relevant krefs,
> free ancillary memories and return the task back to the application
> client.
> 
> It's just about information. Ultimately driver maintainers decide
> the most appropriate action in the timer timeout callback and then
> the most appropriate return code. This patch really mimics (tries to)
> the default (current) behaviour, so as to have as minimal an impact as
> possible, yet show that finer-grained status is attainable from the
> timer timeout
> callback.
> 
> No differing action is taken by the scheduler at the moment,
> as shown in this patch.

Thanks for the clarification. I think my confusion is mostly because 
Panfrost currently handles various failures in a rather brutal manner 
(resetting the GPU even when it isn't necessary), which means the terms 
don't match up as neatly as they should.

Ideally I'll find some free time to tidy some of this up, but sadly 
this isn't my main task.

Thanks,

Steve

