[PATCH] drm/scheduler: Remove entity->rq NULL check

Tue Aug 14 15:17:26 UTC 2018

I assume that this is the only code change and that no locks are taken
in drm_sched_entity_push_job.

What happens if process A runs drm_sched_entity_push_job after this code
was executed by the (dying) process B while there are still jobs in the
queue (the wait_event terminated prematurely)? The entity has already
been removed from the rq, but bool 'first' in drm_sched_entity_push_job
will be false, so the entity will not be reinserted into the rq entity
list and no wakeup will be triggered for process A pushing a new job.
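
To make the ordering concrete, here is a minimal single-threaded userspace
sketch of that scenario. It is only a model under stated assumptions: the
struct and the model_* helpers are hypothetical stand-ins invented for
illustration, not the kernel's drm_sched_entity or its functions, and the
"rq" is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; NOT the kernel's struct drm_sched_entity. */
struct model_entity {
    int  queued_jobs; /* jobs still waiting in the entity's job queue */
    bool on_rq;       /* true while the entity is linked into the rq */
};

/*
 * Dying process B: the wait ended early, so the entity is unlinked
 * from the rq even though jobs may still be queued.
 */
static void model_flush_killed(struct model_entity *e)
{
    assert(e != NULL);
    e->on_rq = false;
}

/*
 * Process A: only the push that finds an empty queue ('first')
 * re-adds the entity and wakes the scheduler. Returns 'first'.
 */
static bool model_push_job(struct model_entity *e)
{
    assert(e != NULL);
    bool first = (e->queued_jobs == 0);

    e->queued_jobs++;
    if (first)
        e->on_rq = true; /* stands in for rq add + scheduler wakeup */
    return first;
}

/* Jobs are queued, but the entity is not on any rq to be picked up. */
static bool model_is_stalled(const struct model_entity *e)
{
    assert(e != NULL);
    return e->queued_jobs > 0 && !e->on_rq;
}
```

Starting from an entity with one queued job that is on the rq, calling
model_flush_killed() and then model_push_job() leaves 'first' false and
model_is_stalled() true, which is the lost-wakeup case described above.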

Another issue below:

Andrey

On 08/14/2018 03:05 AM, Christian König wrote:
> I would rather like to avoid taking the lock in the hot path.
>
> How about this:
>
>     /* For killed process disable any more IBs enqueue right now */
>     last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>     if ((!last_user || last_user == current->group_leader) &&
>         (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
>         grab_lock();
>         drm_sched_rq_remove_entity(entity->rq, entity);
>         if (READ_ONCE(entity->last_user) != NULL)

This condition can be true because process A has just done
drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
current->group_leader), so the line below executes and the entity is
reinserted into the rq. Let's also say that the entity job queue is
empty now. For process A bool 'first' will then be true, and hence
drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq, entity)
will also take place, causing double insertion of the entity into the
rq list.
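
The interleaving can be replayed step by step in a small userspace sketch.
Everything here is an assumption made for illustration: the dbl_* names
are hypothetical, last_user is a plain pointer instead of an atomic, and
the model's add helper deliberately has no guard against re-adding an
entity that is already linked (whether the real drm_sched_rq_add_entity
has such a guard is not modeled here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; NOT the kernel's struct drm_sched_entity. */
struct dbl_entity {
    const void *last_user; /* stands in for entity->last_user */
    int queued_jobs;       /* jobs in the entity's job queue */
    int rq_links;          /* times the entity is linked into the rq */
};

/* Unguarded add/remove, as assumed in the scenario above. */
static void dbl_rq_add(struct dbl_entity *e)    { e->rq_links++; }
static void dbl_rq_remove(struct dbl_entity *e) { if (e->rq_links > 0) e->rq_links--; }

/* Process B, flush step 1: models cmpxchg(&last_user, me, NULL). */
static const void *dbl_flush_cmpxchg(struct dbl_entity *e, const void *me)
{
    const void *old = e->last_user;

    if (old == me)
        e->last_user = NULL;
    return old;
}

/* Process A, push step 1: publish last_user and compute 'first'. */
static bool dbl_push_begin(struct dbl_entity *e, const void *me)
{
    bool first = (e->queued_jobs == 0);

    e->last_user = me; /* models the WRITE_ONCE in push_job */
    e->queued_jobs++;
    return first;
}

/* Process B, flush step 2: the locked section of the proposal. */
static void dbl_flush_locked(struct dbl_entity *e)
{
    dbl_rq_remove(e);
    if (e->last_user != NULL) /* sees the value process A just wrote */
        dbl_rq_add(e);
}

/* Process A, push step 2: 'first' was true, so it adds as well. */
static void dbl_push_finish(struct dbl_entity *e, bool first)
{
    if (first)
        dbl_rq_add(e);
}

/* Replays B, A, B, A in that order; returns the final link count. */
static int dbl_replay(void)
{
    int proc_a = 0, proc_b = 0; /* addresses act as process identities */
    struct dbl_entity e = { .last_user = &proc_b, .queued_jobs = 0, .rq_links = 1 };
    const void *old = dbl_flush_cmpxchg(&e, &proc_b);
    bool first;

    assert(old == &proc_b);
    first = dbl_push_begin(&e, &proc_a);
    dbl_flush_locked(&e);
    dbl_push_finish(&e, first);
    return e.rq_links;
}
```

In this model dbl_replay() ends with the entity linked twice, because both
the flush path and the push path decide independently that they must add
it. Taking the lock only around the remove/add in the flush path does not
help, since process A's 'first' decision was made outside of it.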

Andrey

>             drm_sched_rq_add_entity(entity->rq, entity);
>         drop_lock();
>     }
>
> Christian.
>
> Am 13.08.2018 um 18:43 schrieb Andrey Grodzovsky:
>>
>> Attached.
>>
>> If the general idea in the patch is OK I can think of a test (and
>> maybe add it to the libdrm amdgpu tests) to actually simulate this
>> scenario with 2 forked concurrent processes working on the same
>> entity's job queue, where one is dying while the other keeps pushing
>> to the same queue. For now I have only tested it with a normal boot
>> and by running multiple glxgears concurrently, which doesn't really
>> test this code path since I think each of them works on its own FD.
>>
>> Andrey
>>
>>
>> On 08/10/2018 09:27 AM, Christian König wrote:
>>> Crap, yeah indeed that needs to be protected by some lock.
>>>
>>> Going to prepare a patch for that,
>>> Christian.
>>>
>>> Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:
>>>>
>>>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>
>>>>
>>>> But I still have questions about entity->last_user (didn't notice
>>>> this before).
>>>>
>>>> It looks to me like there is a race condition with its current
>>>> usage. Let's say process A was preempted after doing
>>>> drm_sched_entity_flush->cmpxchg(...). Now process B, working on the
>>>> same entity (forked), is inside drm_sched_entity_push_job; it writes
>>>> its PID to entity->last_user and also executes
>>>> drm_sched_rq_add_entity. Now process A runs again and executes
>>>> drm_sched_rq_remove_entity, inadvertently causing process B's
>>>> removal from its scheduler rq.
>>>>
>>>> Looks to me like instead we should lock together entity->last_user 
>>>> accesses and adds/removals of entity to the rq.
>>>>
>>>> Andrey
>>>>
>>>>
>>>> On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
>>>>> I forgot about this since we started discussing possible scenarios 
>>>>> of processes and threads.
>>>>>
>>>>> In any case, this check is redundant. Acked-by: Nayan Deshmukh
>>>>> <nayan26deshmukh at gmail.com>
>>>>>
>>>>> Nayan
>>>>>
>>>>> On Mon, Aug 6, 2018 at 7:43 PM Christian König
>>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>>
>>>>>     Ping. Any objections to that?
>>>>>
>>>>>     Christian.
>>>>>
>>>>>     Am 03.08.2018 um 13:08 schrieb Christian König:
>>>>>     > That is superfluous now.
>>>>>     >
>>>>>     > Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>     > ---
>>>>>     >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>>>>>     >   1 file changed, 5 deletions(-)
>>>>>     >
>>>>>     > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > index 85908c7f913e..65078dd3c82c 100644
>>>>>     > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>>>>     > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>>>>     >       if (first) {
>>>>>     >               /* Add the entity to the run queue */
>>>>>     >               spin_lock(&entity->rq_lock);
>>>>>     > -             if (!entity->rq) {
>>>>>     > -                     DRM_ERROR("Trying to push to a killed entity\n");
>>>>>     > -                     spin_unlock(&entity->rq_lock);
>>>>>     > -                     return;
>>>>>     > -             }
>>>>>     >               drm_sched_rq_add_entity(entity->rq, entity);
>>>>>     >               spin_unlock(&entity->rq_lock);
>>>>>     >               drm_sched_wakeup(entity->rq->sched);
>>>>>
>>>>
>>>
>>
>>
>>
>
