[PATCH v3 1/5] drm/scheduler: rework job destruction

Koenig, Christian Christian.Koenig at amd.com
Tue Apr 16 14:43:01 UTC 2019


Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey:
> On 4/16/19 5:47 AM, Christian König wrote:
>> Am 15.04.19 um 23:17 schrieb Eric Anholt:
>>> Andrey Grodzovsky <andrey.grodzovsky at amd.com> writes:
>>>
>>>> From: Christian König <christian.koenig at amd.com>
>>>>
>>>> We now destroy finished jobs from the worker thread to make sure that
>>>> we never destroy a job currently in timeout processing.
>>>> By this we avoid holding the lock around the ring mirror list in
>>>> drm_sched_stop, which should solve a deadlock reported by a user.
>>>>
>>>> v2: Remove unused variable.
>>>>
>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>>>    drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>>>    drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>>>>    drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
>>> Missing corresponding panfrost and lima updates.  You should probably
>>> pull in drm-misc for hacking on the scheduler.
>>>
>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>>>> index ce7c737b..8efb091 100644
>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>>>          /* block scheduler */
>>>>        for (q = 0; q < V3D_MAX_QUEUES; q++)
>>>> -        drm_sched_stop(&v3d->queue[q].sched);
>>>> +        drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>>>          if(sched_job)
>>>>            drm_sched_increase_karma(sched_job);
>>>>    +    /*
>>>> +     * Guilty job did complete and hence needs to be manually removed
>>>> +     * See drm_sched_stop doc.
>>>> +     */
>>>> +    if (list_empty(&sched_job->node))
>>>> +        sched_job->sched->ops->free_job(sched_job);
>>> If the if (sched_job) check is necessary up above, then this should
>>> clearly be under it.
>>>
>>> But, can we please have a core scheduler thing we call here instead of
>>> drivers all replicating it?
>> Yeah that's also something I noted before.
>>
>> The essential problem is that we remove finished jobs from the mirror
>> list and so need to destroy them, because otherwise we leak them.
>>
>> An alternative approach here would be to keep the jobs on the ring
>> mirror list, but not submit them again.
>>
>> Regards,
>> Christian.
>
> I would really prefer to avoid this; it means adding an extra flag to
> sched_job that we have to check on each iteration over the ring mirror
> list.

Mhm, why actually? We just need to check if the scheduler fence is signaled.
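
For illustration only, here is a rough sketch of that idea (the function 
name is made up and this is not part of the series; it just assumes the 
current ring_mirror_list/node layout and the s_fence->finished fence):

#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/*
 * Rough sketch: if finished jobs stayed on the ring mirror list, the
 * resubmit loop could skip them based on their scheduler fence instead
 * of needing a new flag in drm_sched_job.
 */
static void drm_sched_resubmit_unfinished_jobs(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job;

	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
		struct drm_sched_fence *s_fence = s_job->s_fence;

		/* Already completed: leave it on the list, don't run it again. */
		if (dma_fence_is_signaled(&s_fence->finished))
			continue;

		dma_fence_put(s_fence->parent);
		s_fence->parent = sched->ops->run_job(s_job);
	}
}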

> What about changing the signature of drm_sched_backend_ops.timedout_job
> to return a drm_sched_job* instead of void? This way we can return the
> guilty job from the driver-specific handler to the generic
> drm_sched_job_timedout and release it there.

Well, the timeout handler already has the job, so returning it doesn't
make much sense.

The problem is rather that the timeout handler doesn't know if it should 
destroy the job or not.
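
To sketch what Eric's "core scheduler thing" could look like: a small 
helper (name invented here, not part of the patch) that hides exactly 
that decision from the drivers, based on the list_empty(&job->node) 
check the patch currently open-codes in v3d:

/*
 * Made-up helper, not in this series: centralize the "guilty job
 * already finished, so nobody else will free it" check that
 * v3d/etnaviv/amdgpu would otherwise each open-code after
 * drm_sched_stop().
 */
void drm_sched_cleanup_guilty_job(struct drm_sched_job *job)
{
	if (!job)
		return;

	/*
	 * drm_sched_stop() removed finished jobs from the ring mirror
	 * list, so an empty list node means the job completed and has
	 * to be freed manually here.
	 */
	if (list_empty(&job->node))
		job->sched->ops->free_job(job);
}

A driver timeout handler would then just call drm_sched_stop(), 
drm_sched_increase_karma() and such a helper, without having to know 
how that decision is made.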

Christian.

>
> Andrey
>
>>>> +
>>>>        /* get the GPU back into the init state */
>>>>        v3d_reset(v3d);
