[PATCH v3 1/5] drm/scheduler: rework job destruction

Tue Apr 16 14:36:20 UTC 2019

On 4/16/19 5:47 AM, Christian König wrote:
> Am 15.04.19 um 23:17 schrieb Eric Anholt:
>> Andrey Grodzovsky <andrey.grodzovsky at amd.com> writes:
>>
>>> From: Christian König <christian.koenig at amd.com>
>>>
>>> We now destroy finished jobs from the worker thread to make sure that
>>> we never destroy a job currently in timeout processing.
>>> By this we avoid holding lock around ring mirror list in drm_sched_stop
>>> which should solve a deadlock reported by a user.
>>>
>>> v2: Remove unused variable.
>>>
>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>>   drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c     | 138 
>>> +++++++++++++++++------------
>>>   drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
>> Missing corresponding panfrost and lima updates.  You should probably
>> pull in drm-misc for hacking on the scheduler.
>>
>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c 
>>> b/drivers/gpu/drm/v3d/v3d_sched.c
>>> index ce7c737b..8efb091 100644
>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, 
>>> struct drm_sched_job *sched_job)
>>>         /* block scheduler */
>>>       for (q = 0; q < V3D_MAX_QUEUES; q++)
>>> -        drm_sched_stop(&v3d->queue[q].sched);
>>> +        drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>>         if(sched_job)
>>>           drm_sched_increase_karma(sched_job);
>>>   +    /*
>>> +     * Guilty job did complete and hence needs to be manually removed
>>> +     * See drm_sched_stop doc.
>>> +     */
>>> +    if (list_empty(&sched_job->node))
>>> +        sched_job->sched->ops->free_job(sched_job);
>> If the if (sched_job) is necessary up above, then this should clearly be
>> under it.
>>
>> But, can we please have a core scheduler thing we call here instead of
>> drivers all replicating it?
>
> Yeah that's also something I noted before.
>
> Essential problem is that we remove finished jobs from the mirror list 
> and so need to destruct them because we otherwise leak them.
>
> Alternative approach here would be to keep the jobs on the ring mirror 
> list, but not submit them again.
>
> Regards,
> Christian.

I really prefer to avoid this, it means adding extra flag to sched_job 
to check in each iteration of the ring mirror list. What about changing  
signature of drm_sched_backend_ops.timedout_job to return drm_sched_job* 
instead of void, this way we can return the guilty job back from the 
driver specific handler to the generic drm_sched_job_timedout and 
release it there.

Andrey

>
>>
>>> +
>>>       /* get the GPU back into the init state */
>>>       v3d_reset(v3d);
>