[PATCH] drm/scheduler: Remove entity->rq NULL check
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Tue Aug 14 16:32:42 UTC 2018
On 08/14/2018 11:26 AM, Christian König wrote:
> Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>>
>> I assume that this is the only code change and no locks are taken in
>> drm_sched_entity_push_job -
>>
>
> What are you talking about? You surely now take locks in
> drm_sched_entity_push_job():
>> + spin_lock(&entity->rq_lock);
>> + entity->last_user = current->group_leader;
>> + if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code in
drm_sched_entity_push_job, OK.
>
>> What happens if process A runs drm_sched_entity_push_job after this
>> code was executed from the (dying) process B, while there are still
>> jobs in the queue (the wait_event terminated prematurely) and the
>> entity has already been removed from the rq? The bool 'first' in
>> drm_sched_entity_push_job will be false, so the entity will not be
>> reinserted into the rq entity list and no wakeup will be triggered
>> for process A pushing a new job.
>>
>
> Thought about this as well, but in this case I would say: Shit happens!
>
> The dying process did some command submission and because of this the
> entity was killed as well when the process died and that is legitimate.
>
>>
>> Another issue below -
>>
>> Andrey
>>
>>
>> On 08/14/2018 03:05 AM, Christian König wrote:
>>> I would rather like to avoid taking the lock in the hot path.
>>>
>>> How about this:
>>>
>>> /* For killed process disable any more IBs enqueue right now */
>>> last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>>> if ((!last_user || last_user == current->group_leader) &&
>>>     (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
>>>         grab_lock();
>>>         drm_sched_rq_remove_entity(entity->rq, entity);
>>>         if (READ_ONCE(entity->last_user) != NULL)
>>
>> This condition is true because just now process A did
>> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
>> current->group_leader), so the line below executes and the entity is
>> reinserted into the rq. Let's say also that the entity's job queue is
>> empty now. For process A bool 'first' will be true, and hence
>> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
>> entity) will also take place, causing a double insertion of the
>> entity into the rq list.
>
> Calling drm_sched_rq_add_entity() is harmless, it is protected against
> double insertion.
Missed that one, right...
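For reference, the guard I missed - roughly what drm_sched_rq_add_entity()
does (paraphrased from memory, not verbatim from the tree):

        /* bail out if the entity is already on the rq's entity list,
         * which makes a second add a harmless no-op */
        if (!list_empty(&entity->list))
                return;

        spin_lock(&rq->lock);
        list_add_tail(&entity->list, &rq->entities);
        spin_unlock(&rq->lock);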
>
> But thinking more about it, your idea of adding a killed or finished
> flag becomes more and more appealing as a way to get consistent handling here.
>
> Christian.
So to be clear - you would like something like: remove entity->last_user,
add a 'stopped' flag to drm_sched_entity that gets set in
drm_sched_entity_flush, and have drm_sched_entity_push_job check
'if (entity->stopped)' and, when true, just return an error back to the
user instead of pushing the job?
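I.e. roughly something like this (untested sketch, not actual patch code;
assumes drm_sched_entity_push_job gains an error return):

        /* drm_sched_entity_flush(), killed-process case */
        spin_lock(&entity->rq_lock);
        entity->stopped = true;
        drm_sched_rq_remove_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);

        /* drm_sched_entity_push_job() */
        spin_lock(&entity->rq_lock);
        if (entity->stopped) {
                spin_unlock(&entity->rq_lock);
                DRM_ERROR("Trying to push to a stopped entity\n");
                return -ENOENT;
        }
        drm_sched_rq_add_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);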
Andrey
>
>>
>> Andrey
>>
>>>                 drm_sched_rq_add_entity(entity->rq, entity);
>>>         drop_lock();
>>> }
>>>
>>> Christian.
>>>
>>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>>
>>>> Attached.
>>>>
>>>> If the general idea in the patch is OK I can think of a test (and
>>>> maybe add it to the libdrm amdgpu tests) to actually simulate this
>>>> scenario with 2 forked concurrent processes working on the same
>>>> entity's job queue, one dying while the other keeps pushing to the
>>>> same queue. For now I only tested it with a normal boot and running
>>>> multiple glxgears concurrently - which doesn't really exercise this
>>>> code path since I think each of them works on its own FD.
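>>>>
>>>> Roughly the shape of such a test (pseudo-code sketch; drm_open_device()
>>>> and submit_jobs() are hypothetical stand-ins for the real libdrm
>>>> amdgpu boilerplate):
>>>>
>>>>         /* open the device once so the forked child inherits the fd
>>>>          * and both processes push to the same entity */
>>>>         int fd = drm_open_device();   /* hypothetical helper */
>>>>         pid_t pid = fork();
>>>>
>>>>         if (pid == 0) {
>>>>                 submit_jobs(fd, 100); /* hypothetical: queue many jobs */
>>>>                 raise(SIGKILL);       /* die with jobs still queued */
>>>>         }
>>>>
>>>>         /* parent keeps pushing while the child is being killed */
>>>>         while (waitpid(pid, NULL, WNOHANG) == 0)
>>>>                 submit_jobs(fd, 1);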
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>>
>>>>> Going to prepare a patch for that,
>>>>> Christian.
>>>>>
>>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>
>>>>>>
>>>>>> But I still have questions about entity->last_user (didn't
>>>>>> notice this before) -
>>>>>>
>>>>>> Looks to me like there is a race condition with its current
>>>>>> usage. Let's say process A was preempted after doing
>>>>>> drm_sched_entity_flush->cmpxchg(...); now process B, working on
>>>>>> the same entity (forked), is inside drm_sched_entity_push_job,
>>>>>> writes its PID to entity->last_user and also executes
>>>>>> drm_sched_rq_add_entity. Now process A runs again and executes
>>>>>> drm_sched_rq_remove_entity, inadvertently causing process B's
>>>>>> removal from its scheduler rq.
>>>>>>
>>>>>> Looks to me like instead we should lock together
>>>>>> entity->last_user accesses and adds/removals of the entity to the rq.
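>>>>>>
>>>>>> Roughly what I mean (untested sketch, reusing entity->rq_lock for
>>>>>> both sides):
>>>>>>
>>>>>>         /* push side - drm_sched_entity_push_job() */
>>>>>>         spin_lock(&entity->rq_lock);
>>>>>>         entity->last_user = current->group_leader;
>>>>>>         if (first)
>>>>>>                 drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>         spin_unlock(&entity->rq_lock);
>>>>>>
>>>>>>         /* flush side - only remove if nobody submitted meanwhile */
>>>>>>         spin_lock(&entity->rq_lock);
>>>>>>         if (entity->last_user == current->group_leader)
>>>>>>                 drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>         spin_unlock(&entity->rq_lock);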
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>>> I forgot about this since we started discussing possible
>>>>>>> scenarios of processes and threads.
>>>>>>>
>>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>>>> <nayan26deshmukh at gmail.com>
>>>>>>>
>>>>>>> Nayan
>>>>>>>
>>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>>>
>>>>>>> Ping. Any objections to that?
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>> > That is superfluous now.
>>>>>>> >
>>>>>>> > Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>> > ---
>>>>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>> > 1 file changed, 5 deletions(-)
>>>>>>> >
>>>>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>>>>>> >  	if (first) {
>>>>>>> >  		/* Add the entity to the run queue */
>>>>>>> >  		spin_lock(&entity->rq_lock);
>>>>>>> > -		if (!entity->rq) {
>>>>>>> > -			DRM_ERROR("Trying to push to a killed entity\n");
>>>>>>> > -			spin_unlock(&entity->rq_lock);
>>>>>>> > -			return;
>>>>>>> > -		}
>>>>>>> >  		drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>> >  		spin_unlock(&entity->rq_lock);
>>>>>>> >  		drm_sched_wakeup(entity->rq->sched);
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>