[PATCH] drm/scheduler: Remove entity->rq NULL check

Tue Aug 14 15:26:56 UTC 2018

Am 14.08.2018 um 17:17 schrieb Andrey Grodzovsky:
>
> I assume that this is the only code change and no locks are taken in 
> drm_sched_entity_push_job -
>

What are you talking about? You surely now take looks in 
drm_sched_entity_push_job():
> +    spin_lock(&entity->rq_lock);
> +    entity->last_user = current->group_leader;
> +    if (list_empty(&entity->list))

> What happens if process A runs drm_sched_entity_push_job after this 
> code was executed from the  (dying) process B and there
>
> are still jobs in the queue (the wait_event terminated prematurely), 
> the entity already removed from rq , but bool 'first' in 
> drm_sched_entity_push_job
>
> will return false and so the entity will not be reinserted back into 
> rq entity list and no wake up trigger will happen for process A 
> pushing a new job.
>

Thought about this as well, but in this case I would say: Shit happens!

The dying process did some command submission and because of this the 
entity was killed as well when the process died and that is legitimate.

>
> Another issue bellow -
>
> Andrey
>
>
> On 08/14/2018 03:05 AM, Christian König wrote:
>> I would rather like to avoid taking the lock in the hot path.
>>
>> How about this:
>>
>>      /* For killed process disable any more IBs enqueue right now */
>>     last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>>      if ((!last_user || last_user == current->group_leader) &&
>>          (current->flags & PF_EXITING) && (current->exit_code == 
>> SIGKILL)) {
>>         grab_lock();
>>          drm_sched_rq_remove_entity(entity->rq, entity);
>>         if (READ_ONCE(&entity->last_user) != NULL)
>
> This condition is true because just exactly now process A did 
> drm_sched_entity_push_job->WRITE_ONCE(entity->last_user, 
> current->group_leader);
> and so the line bellow executed and entity reinserted into rq. Let's 
> say also that the entity job queue is empty now. For process A bool 
> 'first' will be true
> and hence also 
> drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity) 
> will take place causing double insertion of the entity queue into rq list.

Calling drm_sched_rq_add_entity() is harmless, it is protected against 
double insertion.

But thinking more about it your idea of adding a killed or finished flag 
becomes more and more appealing to have a consistent handling here.

Christian.

>
> Andrey
>
>> drm_sched_rq_add_entity(entity->rq, entity);
>>         drop_lock();
>>     }
>>
>> Christian.
>>
>> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>>
>>> Attached.
>>>
>>> If the general idea in the patch is OK I can think of a test (and 
>>> maybe add to libdrm amdgpu tests) to actually simulate this scenario 
>>> with 2 forked
>>>
>>> concurrent processes working on same entity's job queue when one is 
>>> dying while the other keeps pushing to the same queue. For now I 
>>> only tested it
>>>
>>> with normal boot and ruining multiple glxgears concurrently - which 
>>> doesn't really test this code path since i think each of them works 
>>> on it's own FD.
>>>
>>> Andrey
>>>
>>>
>>> On 08/10/2018 09:27 AM, Christian König wrote:
>>>> Crap, yeah indeed that needs to be protected by some lock.
>>>>
>>>> Going to prepare a patch for that,
>>>> Christian.
>>>>
>>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>>
>>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>
>>>>>
>>>>> But I still  have questions about entity->last_user (didn't notice 
>>>>> this before) -
>>>>>
>>>>> Looks to me there is a race condition with it's current usage, 
>>>>> let's say process A was preempted after doing 
>>>>> drm_sched_entity_flush->cmpxchg(...)
>>>>>
>>>>> now process B working on same entity (forked) is inside 
>>>>> drm_sched_entity_push_job, he writes his PID to entity->last_user 
>>>>> and also
>>>>>
>>>>> executes drm_sched_rq_add_entity. Now process A runs again and 
>>>>> execute drm_sched_rq_remove_entity inadvertently causing process B 
>>>>> removal
>>>>>
>>>>> from it's scheduler rq.
>>>>>
>>>>> Looks to me like instead we should lock together entity->last_user 
>>>>> accesses and adds/removals of entity to the rq.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>>> I forgot about this since we started discussing possible 
>>>>>> scenarios of processes and threads.
>>>>>>
>>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh 
>>>>>> <nayan26deshmukh at gmail.com <mailto:nayan26deshmukh at gmail.com>>
>>>>>>
>>>>>> Nayan
>>>>>>
>>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König 
>>>>>> <ckoenig.leichtzumerken at gmail.com 
>>>>>> <mailto:ckoenig.leichtzumerken at gmail.com>> wrote:
>>>>>>
>>>>>>     Ping. Any objections to that?
>>>>>>
>>>>>>     Christian.
>>>>>>
>>>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>>     > That is superflous now.
>>>>>>     >
>>>>>>     > Signed-off-by: Christian König <christian.koenig at amd.com
>>>>>>     <mailto:christian.koenig at amd.com>>
>>>>>>     > ---
>>>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>>     >   1 file changed, 5 deletions(-)
>>>>>>     >
>>>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct
>>>>>>     drm_sched_job *sched_job,
>>>>>>     >       if (first) {
>>>>>>     >               /* Add the entity to the run queue */
>>>>>>     >  spin_lock(&entity->rq_lock);
>>>>>>     > -             if (!entity->rq) {
>>>>>>     > -                     DRM_ERROR("Trying to push to a killed
>>>>>>     entity\n");
>>>>>>     > -  spin_unlock(&entity->rq_lock);
>>>>>>     > -                     return;
>>>>>>     > -             }
>>>>>>     >  drm_sched_rq_add_entity(entity->rq, entity);
>>>>>>     >  spin_unlock(&entity->rq_lock);
>>>>>>     >  drm_sched_wakeup(entity->rq->sched);
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20180814/742f0c6c/attachment.html>