[PATCH] drm/scheduler: Remove entity->rq NULL check
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Tue Aug 14 16:32:42 UTC 2018
On 08/14/2018 11:26 AM, Christian König wrote:
> Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>>
>> I assume that this is the only code change and no locks are taken in
>> drm_sched_entity_push_job -
>>
>
> What are you talking about? You surely now take locks in
> drm_sched_entity_push_job():
>> + spin_lock(&entity->rq_lock);
>> + entity->last_user = current->group_leader;
>> + if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code in
drm_sched_entity_push_job, OK.
>
>> What happens if process A runs drm_sched_entity_push_job after this
>> code was executed from the (dying) process B, while there are still
>> jobs in the queue (the wait_event terminated prematurely) and the
>> entity has already been removed from the rq? The bool 'first' in
>> drm_sched_entity_push_job will be false, so the entity will not be
>> reinserted into the rq entity list and no wakeup will be triggered
>> for process A pushing a new job.
>>
>
> Thought about this as well, but in this case I would say: Shit happens!
>
> The dying process did some command submission and because of this the
> entity was killed as well when the process died and that is legitimate.
>
>>
>> Another issue below -
>>
>> Andrey
>>
>>
>> On 08/14/2018 03:05 AM, Christian König wrote:
>>> I would rather like to avoid taking the lock in the hot path.
>>>
>>> How about this:
>>>
>>> /* For killed process disable any more IBs enqueue right now */
>>> last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>>> if ((!last_user || last_user == current->group_leader) &&
>>>     (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
>>>         grab_lock();
>>>         drm_sched_rq_remove_entity(entity->rq, entity);
>>>         if (READ_ONCE(entity->last_user) != NULL)
>>
>> This condition is true because just now process A did
>> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
>> current->group_leader), so the line below executes and the entity is
>> reinserted into the rq. Let's say also that the entity's job queue is
>> empty now. For process A bool 'first' will be true, and hence
>> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
>> entity) will also take place, causing a double insertion of the
>> entity into the rq list.
>
> Calling drm_sched_rq_add_entity() is harmless, it is protected against
> double insertion.
Missed that one, right...
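For reference, the guard I missed - roughly what drm_sched_rq_add_entity()
does (paraphrased from memory, not verbatim from the tree):

        /* bail out if the entity is already on the rq's entity list,
         * which makes a second add a harmless no-op */
        if (!list_empty(&entity->list))
                return;

        spin_lock(&rq->lock);
        list_add_tail(&entity->list, &rq->entities);
        spin_unlock(&rq->lock);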
>
> But thinking more about it, your idea of adding a killed or finished
> flag becomes more and more appealing as a way to get consistent handling here.
>
> Christian.
So to be clear - you would like something like: remove entity->last_user,
add a 'stopped' flag to drm_sched_entity that gets set in
drm_sched_entity_flush, and have drm_sched_entity_push_job check
'if (entity->stopped)' and, when true, just return an error back to the
user instead of pushing the job?
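I.e. roughly something like this (untested sketch, not actual patch code;
assumes drm_sched_entity_push_job gains an error return):

        /* drm_sched_entity_flush(), killed-process case */
        spin_lock(&entity->rq_lock);
        entity->stopped = true;
        drm_sched_rq_remove_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);

        /* drm_sched_entity_push_job() */
        spin_lock(&entity->rq_lock);
        if (entity->stopped) {
                spin_unlock(&entity->rq_lock);
                DRM_ERROR("Trying to push to a stopped entity\n");
                return -ENOENT;
        }
        drm_sched_rq_add_entity(entity->rq, entity);
        spin_unlock(&entity->rq_lock);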
Andrey
>
>>
>> Andrey
>>
>>>                 drm_sched_rq_add_entity(entity->rq, entity);
>>>         drop_lock();
>>> }
>>>
>>> Christian.
>>>
>>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>>
>>>> Attached.
>>>>
>>>> If the general idea in the patch is OK I can think of a test (and
>>>> maybe add it to the libdrm amdgpu tests) to actually simulate this
>>>> scenario with 2 forked concurrent processes working on the same
>>>> entity's job queue, one dying while the other keeps pushing to the
>>>> same queue. For now I only tested it with a normal boot and running
>>>> multiple glxgears concurrently - which doesn't really exercise this
>>>> code path since I think each of them works on its own FD.
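>>>>
>>>> Roughly the shape of such a test (pseudo-code sketch; drm_open_device()
>>>> and submit_jobs() are hypothetical stand-ins for the real libdrm
>>>> amdgpu boilerplate):
>>>>
>>>>         /* open the device once so the forked child inherits the fd
>>>>          * and both processes push to the same entity */
>>>>         int fd = drm_open_device();   /* hypothetical helper */
>>>>         pid_t pid = fork();
>>>>
>>>>         if (pid == 0) {
>>>>                 submit_jobs(fd, 100); /* hypothetical: queue many jobs */
>>>>                 raise(SIGKILL);       /* die with jobs still queued */
>>>>         }
>>>>
>>>>         /* parent keeps pushing while the child is being killed */
>>>>         while (waitpid(pid, NULL, WNOHANG) == 0)
>>>>                 submit_jobs(fd, 1);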
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>>
>>>>> Going to prepare a patch for that,
>>>>> Christian.
>>>>>
>>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>>
>>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>>
>>>>>>
>>>>>> But I still have questions about entity->last_user (didn't
>>>>>> notice this before) -
>>>>>>
>>>>>> Looks to me like there is a race condition with its current
>>>>>> usage. Let's say process A was preempted after doing
>>>>>> drm_sched_entity_flush->cmpxchg(...); now process B, working on
>>>>>> the same entity (forked), is inside drm_sched_entity_push_job,
>>>>>> writes its PID to entity->last_user and also executes
>>>>>> drm_sched_rq_add_entity. Now process A runs again and executes
>>>>>> drm_sched_rq_remove_entity, inadvertently causing process B's
>>>>>> removal from its scheduler rq.
>>>>>>
>>>>>> Looks to me like instead we should lock together
>>>>>> entity->last_user accesses and adds/removals of the entity to the rq.
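>>>>>>
>>>>>> Roughly what I mean (untested sketch, reusing entity->rq_lock for
>>>>>> both sides):
>>>>>>
>>>>>>         /* push side - drm_sched_entity_push_job() */
>>>>>>         spin_lock(&entity->rq_lock);
>>>>>>         entity->last_user = current->group_leader;
>>>>>>         if (first)
>>>>>>                 drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>         spin_unlock(&entity->rq_lock);
>>>>>>
>>>>>>         /* flush side - only remove if nobody submitted meanwhile */
>>>>>>         spin_lock(&entity->rq_lock);
>>>>>>         if (entity->last_user == current->group_leader)
>>>>>>                 drm_sched_rq_remove_entity(entity->rq, entity);
>>>>>>         spin_unlock(&entity->rq_lock);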
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>>> I forgot about this since we started discussing possible
>>>>>>> scenarios of processes and threads.
>>>>>>>
>>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>>>> <nayan26deshmukh at gmail.com>
>>>>>>>
>>>>>>> Nayan
>>>>>>>
>>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>>>
>>>>>>> Ping. Any objections to that?
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>> > That is superfluous now.
>>>>>>> >
>>>>>>> > Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>> > ---
>>>>>>> > drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>> > 1 file changed, 5 deletions(-)
>>>>>>> >
>>>>>>> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > index 85908c7f913e..65078dd3c82c 100644
>>>>>>> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>> > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>>>>>> >  	if (first) {
>>>>>>> >  		/* Add the entity to the run queue */
>>>>>>> >  		spin_lock(&entity->rq_lock);
>>>>>>> > -		if (!entity->rq) {
>>>>>>> > -			DRM_ERROR("Trying to push to a killed entity\n");
>>>>>>> > -			spin_unlock(&entity->rq_lock);
>>>>>>> > -			return;
>>>>>>> > -		}
>>>>>>> >  		drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>> >  		spin_unlock(&entity->rq_lock);
>>>>>>> >  		drm_sched_wakeup(entity->rq->sched);
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>