[PATCH v3 1/5] drm/scheduler: rework job destruction

Grodzovsky, Andrey Andrey.Grodzovsky at amd.com
Tue Apr 16 15:42:14 UTC 2019


On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote:
> On 4/16/19 10:43 AM, Koenig, Christian wrote:
>> Am 16.04.19 um 16:36 schrieb Grodzovsky, Andrey:
>>> On 4/16/19 5:47 AM, Christian König wrote:
>>>> Am 15.04.19 um 23:17 schrieb Eric Anholt:
>>>>> Andrey Grodzovsky <andrey.grodzovsky at amd.com> writes:
>>>>>
>>>>>> From: Christian König <christian.koenig at amd.com>
>>>>>>
>>>>>> We now destroy finished jobs from the worker thread to make sure that
>>>>>> we never destroy a job currently in timeout processing.
>>>>>> This avoids holding the lock around the ring mirror list in
>>>>>> drm_sched_stop, which should solve a deadlock reported by a user.
>>>>>>
>>>>>> v2: Remove unused variable.
>>>>>>
>>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>> ---
>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>>>>>      drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>>>>>      drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>>>>>      drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>>>>>>      drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
>>>>> Missing corresponding panfrost and lima updates.  You should probably
>>>>> pull in drm-misc for hacking on the scheduler.
>>>>>
>>>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>> index ce7c737b..8efb091 100644
>>>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>>>>>  	/* block scheduler */
>>>>>>  	for (q = 0; q < V3D_MAX_QUEUES; q++)
>>>>>> -		drm_sched_stop(&v3d->queue[q].sched);
>>>>>> +		drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>>>>>
>>>>>>  	if(sched_job)
>>>>>>  		drm_sched_increase_karma(sched_job);
>>>>>>
>>>>>> +	/*
>>>>>> +	 * Guilty job did complete and hence needs to be manually removed
>>>>>> +	 * See drm_sched_stop doc.
>>>>>> +	 */
>>>>>> +	if (list_empty(&sched_job->node))
>>>>>> +		sched_job->sched->ops->free_job(sched_job);
>>>>> If the if (sched_job) is necessary up above, then this should clearly be
>>>>> under it.
>>>>>
>>>>> But, can we please have a core scheduler thing we call here instead of
>>>>> drivers all replicating it?
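>>>>>
>>>>> For illustration, such a shared helper might look like this (the name
>>>>> drm_sched_cleanup_guilty_job is made up here, not an existing
>>>>> scheduler export):
>>>>>
>>>>> static void drm_sched_cleanup_guilty_job(struct drm_sched_job *sched_job)
>>>>> {
>>>>> 	if (!sched_job)
>>>>> 		return;
>>>>>
>>>>> 	/* drm_sched_stop() already unlinked a completed guilty job
>>>>> 	 * from the ring mirror list, so nobody else will free it. */
>>>>> 	if (list_empty(&sched_job->node))
>>>>> 		sched_job->sched->ops->free_job(sched_job);
>>>>> }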
>>>> Yeah that's also something I noted before.
>>>>
>>>> The essential problem is that we remove finished jobs from the mirror
>>>> list and therefore need to destroy them, because otherwise we leak
>>>> them.
>>>>
>>>> An alternative approach here would be to keep the jobs on the ring
>>>> mirror list, but not submit them again.
>>>>
>>>> Regards,
>>>> Christian.
>>> I'd really prefer to avoid this; it means adding an extra flag to
>>> sched_job and checking it in every iteration of the ring mirror list.
>> Mhm, why actually? We just need to check if the scheduler fence is signaled.
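>>
>> For illustration only, the equivalent check in the resubmission loop
>> might look roughly like this (a sketch against the fields this series
>> uses, not code from the patch):
>>
>> 	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
>> 		struct dma_fence *fence;
>>
>> 		/* A signaled finished fence plays the role of the extra
>> 		 * flag: the job already completed, so skip resubmission
>> 		 * but keep it on the mirror list. */
>> 		if (dma_fence_is_signaled(&s_job->s_fence->finished))
>> 			continue;
>>
>> 		fence = sched->ops->run_job(s_job);
>> 		/* ... install the new hardware fence as drm_sched_start
>> 		 * does today ... */
>> 	}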
> OK, I see it's equivalent, but it's still an extra check in every
> iteration.
>
>>> What about changing the signature of drm_sched_backend_ops.timedout_job
>>> to return a drm_sched_job* instead of void? This way we can return the
>>> guilty job from the driver-specific handler to the generic
>>> drm_sched_job_timedout and release it there.
>> Well the timeout handler already has the job, so returning it doesn't
>> make much sense.
>>
>> The problem is rather that the timeout handler doesn't know if it should
>> destroy the job or not.
>
> But the driver-specific handler does, and returning either the pointer
> to the job or NULL would indicate exactly that. We could even return a
> bool.
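>
> A sketch of the idea (illustrative only, not merged code):
>
> 	/* In drm_sched_backend_ops: the handler returns the guilty job
> 	 * when the core should release it, or NULL otherwise. */
> 	struct drm_sched_job *(*timedout_job)(struct drm_sched_job *sched_job);
>
> 	/* And in the generic drm_sched_job_timedout(): */
> 	struct drm_sched_job *guilty = job->sched->ops->timedout_job(job);
>
> 	if (guilty)
> 		guilty->sched->ops->free_job(guilty);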
>
> Andrey


Thinking a bit more about this: the way the check is done now, "if
(list_empty(&sched_job->node)) then free the sched_job", actually makes
it possible to move it as-is from the driver-specific callbacks into
drm_sched_job_timedout without any other changes.
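
Roughly like this (a sketch against this v3, untested; locking and the
rest of the handler omitted):

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job =
		list_first_entry_or_null(&sched->ring_mirror_list,
					 struct drm_sched_job, node);

	if (job) {
		job->sched->ops->timedout_job(job);

		/* The guilty job completed, so drm_sched_stop() already
		 * removed it from the mirror list; free it here once
		 * instead of in every driver callback. */
		if (list_empty(&job->node))
			job->sched->ops->free_job(job);
	}
}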

Andrey

>
>> Christian.
>>
>>> Andrey
>>>
>>>>>> +
>>>>>>  	/* get the GPU back into the init state */
>>>>>>  	v3d_reset(v3d);