[PATCH v3 1/5] drm/scheduler: rework job destruction
Koenig, Christian
Christian.Koenig at amd.com
Tue Apr 16 16:00:00 UTC 2019
On 16.04.19 at 17:42, Grodzovsky, Andrey wrote:
> On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote:
>> On 4/16/19 10:43 AM, Koenig, Christian wrote:
>>> On 16.04.19 at 16:36, Grodzovsky, Andrey wrote:
>>>> On 4/16/19 5:47 AM, Christian König wrote:
>>>>> On 15.04.19 at 23:17, Eric Anholt wrote:
>>>>>> Andrey Grodzovsky <andrey.grodzovsky at amd.com> writes:
>>>>>>
>>>>>>> From: Christian König <christian.koenig at amd.com>
>>>>>>>
>>>>>>> We now destroy finished jobs from the worker thread to make sure that
>>>>>>> we never destroy a job currently in timeout processing.
>>>>>>> This way we avoid holding the lock around the ring mirror list in
>>>>>>> drm_sched_stop, which should solve a deadlock reported by a user.
>>>>>>>
>>>>>>> v2: Remove unused variable.
>>>>>>>
>>>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692
>>>>>>>
>>>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>>>>>>> drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>>>>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>>>>>>> drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>>>>>>> drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
>>>>>> Missing corresponding panfrost and lima updates. You should probably
>>>>>> pull in drm-misc for hacking on the scheduler.
>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> index ce7c737b..8efb091 100644
>>>>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>>>> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>>>>>>>  	/* block scheduler */
>>>>>>>  	for (q = 0; q < V3D_MAX_QUEUES; q++)
>>>>>>> -		drm_sched_stop(&v3d->queue[q].sched);
>>>>>>> +		drm_sched_stop(&v3d->queue[q].sched, sched_job);
>>>>>>>  
>>>>>>>  	if (sched_job)
>>>>>>>  		drm_sched_increase_karma(sched_job);
>>>>>>>  
>>>>>>> +	/*
>>>>>>> +	 * Guilty job did complete and hence needs to be manually removed.
>>>>>>> +	 * See drm_sched_stop() doc.
>>>>>>> +	 */
>>>>>>> +	if (list_empty(&sched_job->node))
>>>>>>> +		sched_job->sched->ops->free_job(sched_job);
>>>>>> If the if (sched_job) is necessary up above, then this should clearly be
>>>>>> under it.
>>>>>>
>>>>>> But, can we please have a core scheduler thing we call here instead of
>>>>>> drivers all replicating it?
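>>>>>>
>>>>>> Something like a small core helper, maybe (hypothetical name, just
>>>>>> to sketch the idea):
>>>>>>
>>>>>>     /*
>>>>>>      * Called by drivers after drm_sched_stop() in their timeout
>>>>>>      * handler: frees the guilty job if it already completed and
>>>>>>      * was therefore removed from the ring mirror list.
>>>>>>      */
>>>>>>     void drm_sched_free_finished_guilty(struct drm_sched_job *sched_job)
>>>>>>     {
>>>>>>         if (list_empty(&sched_job->node))
>>>>>>             sched_job->sched->ops->free_job(sched_job);
>>>>>>     }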
>>>>> Yeah that's also something I noted before.
>>>>>
>>>>> The essential problem is that we remove finished jobs from the mirror
>>>>> list and so need to destroy them, because otherwise we leak them.
>>>>>
>>>>> Alternative approach here would be to keep the jobs on the ring mirror
>>>>> list, but not submit them again.
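>>>>>
>>>>> Untested sketch of what I mean: the resubmit loop would just skip
>>>>> jobs whose scheduler fence has already signaled, instead of
>>>>> requiring them to be removed from the list first:
>>>>>
>>>>>     list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
>>>>>         /* Job already finished, don't run it on the hardware again. */
>>>>>         if (dma_fence_is_signaled(&s_job->s_fence->finished))
>>>>>             continue;
>>>>>         ...
>>>>>     }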
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>> I'd really prefer to avoid this; it means adding an extra flag to
>>>> sched_job to check in each iteration over the ring mirror list.
>>> Mhm, why actually? We just need to check if the scheduler fence is signaled.
>> OK, I see it's equivalent, but this is still an extra check for all the
>> iterations.
>>
>>>> What about changing the
>>>> signature of drm_sched_backend_ops.timedout_job to return drm_sched_job *
>>>> instead of void? This way we can return the guilty job from the
>>>> driver-specific handler to the generic drm_sched_job_timedout and
>>>> release it there.
>>> Well the timeout handler already has the job, so returning it doesn't
>>> make much sense.
>>>
>>> The problem is rather that the timeout handler doesn't know if it should
>>> destroy the job or not.
>> But the driver-specific handler does, and returning either the pointer
>> to the job or NULL would give an indication of that. We could even
>> return a bool.
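>>
>> I.e. something along these lines (sketch only, not tested):
>>
>>     struct drm_sched_backend_ops {
>>         ...
>>         /*
>>          * Return the guilty job if the generic timeout handler should
>>          * free it, or NULL if the driver already took care of it.
>>          */
>>         struct drm_sched_job *(*timedout_job)(struct drm_sched_job *sched_job);
>>         ...
>>     };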
>>
>> Andrey
>
> Thinking a bit more about this: the way this check is done now, "if
> (list_empty(&sched_job->node)) then free the sched_job", actually makes
> it possible to just move it as-is from the driver-specific callbacks into
> drm_sched_job_timedout without any other changes.
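>
> I.e. at the end of drm_sched_job_timedout, roughly (untested):
>
>     job->sched->ops->timedout_job(job);
>
>     /*
>      * Guilty job did complete and hence needs to be manually removed.
>      * See drm_sched_stop() doc.
>      */
>     if (list_empty(&job->node))
>         job->sched->ops->free_job(job);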
Oh, well, that sounds like a good idea offhand.
I'd need to see the final code, but it's at least the best idea so far.
Christian.
>
> Andrey
>
>>> Christian.
>>>
>>>> Andrey
>>>>
>>>>>>> +
>>>>>>>  	/* get the GPU back into the init state */
>>>>>>>  	v3d_reset(v3d);