[PATCH 2/2] drm/scheduler: stop setting rq to NULL

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Fri Aug 3 13:47:38 UTC 2018



On 08/03/2018 08:12 AM, Nayan Deshmukh wrote:
>
>
> On Thu, Aug 2, 2018, 8:09 PM Andrey Grodzovsky 
> <Andrey.Grodzovsky at amd.com <mailto:Andrey.Grodzovsky at amd.com>> wrote:
>
>
>
>     On 08/02/2018 02:43 AM, Nayan Deshmukh wrote:
>>     Hi Andrey,
>>
>>     Good catch. I guess since both Christian and I were working on it
>>     in parallel we didn't catch this problem.
>>
>>     If it is only causing a problem in push_job then we can handle it
>>     something like this:
>>
>>     ----------------------------------------------------------------------------------------------------------------------------
>>     diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     index 05dc6ecd4003..f344ce32f128 100644
>>     --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>     @@ -563,6 +563,11 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>             first = spsc_queue_count(&entity->job_queue) == 0;
>>             reschedule = idle && first && (entity->num_rq_list > 1);
>>
>>     +       if (first && list_empty(&entity->list)) {
>
>     first might be false if the other process was interrupted by SIGKILL
>     in the middle of wait_event_killable in drm_sched_entity_flush while
>     there were still items in the queue.
>     I don't see a good way besides adding a 'stopped' flag to
>     drm_sched_entity.
>
> Sorry I missed this mail. This case might happen but this was also not 
> being handled previously.
>
> Nayan

The original code before 'drm/scheduler: stop setting rq to NULL' worked
because it didn't use the queue's emptiness as the criterion for aborting
the next push to the queue, but rather checked for entity->rq != NULL.

Andrey

>
>     Andrey
>
>>     +               DRM_ERROR("Trying to push to a killed entity\n");
>>     +               return;
>>     +       }
>>     +
>>             if (reschedule) {
>>                     rq = drm_sched_entity_get_free_sched(entity);
>>                     spin_lock(&entity->rq_lock);
>>     @@ -580,11 +585,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>>             if (first) {
>>                     /* Add the entity to the run queue */
>>                     spin_lock(&entity->rq_lock);
>>     -               if (!entity->rq) {
>>     -                       DRM_ERROR("Trying to push to a killed entity\n");
>>     -                       spin_unlock(&entity->rq_lock);
>>     -                       return;
>>     -               }
>>                     drm_sched_rq_add_entity(entity->rq, entity);
>>                     spin_unlock(&entity->rq_lock);
>>                     drm_sched_wakeup(entity->rq->sched);
>>     -----------------------------------------------------------------------------------------------------------------------------
>>
>>
>>     On Thu, Aug 2, 2018 at 3:55 AM Andrey Grodzovsky
>>     <Andrey.Grodzovsky at amd.com <mailto:Andrey.Grodzovsky at amd.com>> wrote:
>>
>>         Thinking again about this change and 53d5f1e 'drm/scheduler:
>>         move idle entities to scheduler with less load v2',
>>
>>         it looks to me like the use case for which fc9a539
>>         'drm/scheduler: add NULL pointer check for run queue (v2)' was
>>         done will not work anymore.
>>
>>         First of all, in drm_sched_entity_push_job 'if (!entity->rq)'
>>         will never be true any more since we stopped setting entity->rq
>>         to NULL in drm_sched_entity_flush.
>>
>>         Second of all, even after we remove the entity from its rq in
>>         drm_sched_entity_flush to terminate any subsequent submissions
>>         to the queue, the other thread doing push job can just acquire
>>         a run queue again from drm_sched_entity_get_free_sched and
>>         continue submission, so you can't substitute 'if (!entity->rq)'
>>         with 'if (list_empty(&entity->list))'.
>>
>>         What about adding a 'stopped' flag to drm_sched_entity to be
>>         set in
>>         drm_sched_entity_flush and in
>>
>>     But it might be worth adding a stopped flag field if it is
>>     required at other places as well.
>>
>>     Thanks,
>>     Nayan
>>
>>         drm_sched_entity_push_job check for 'if (entity->stopped)'
>>         instead of 'if (!entity->rq)'?
>>
>>         Andrey
>>
>>
>>         On 07/30/2018 07:03 AM, Christian König wrote:
>>         > We removed the redundancy of having an extra scheduler
>>         field, so we
>>         > can't set the rq to NULL any more or otherwise won't know which
>>         > scheduler to use for the cleanup.
>>         >
>>         > Just remove the entity from the scheduling list instead.
>>         >
>>         > Signed-off-by: Christian König <christian.koenig at amd.com
>>         <mailto:christian.koenig at amd.com>>
>>         > ---
>>         >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 35 +++++++------------------------
>>         >   1 file changed, 8 insertions(+), 27 deletions(-)
>>         >
>>         > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>         > index f563e4fbb4b6..1b733229201e 100644
>>         > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>         > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
>>         > @@ -198,21 +198,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>         >   }
>>         >   EXPORT_SYMBOL(drm_sched_entity_init);
>>         >
>>         > -/**
>>         > - * drm_sched_entity_is_initialized - Query if entity is initialized
>>         > - *
>>         > - * @sched: Pointer to scheduler instance
>>         > - * @entity: The pointer to a valid scheduler entity
>>         > - *
>>         > - * return true if entity is initialized, false otherwise
>>         > -*/
>>         > -static bool drm_sched_entity_is_initialized(struct drm_gpu_scheduler *sched,
>>         > -                                            struct drm_sched_entity *entity)
>>         > -{
>>         > -     return entity->rq != NULL &&
>>         > -             entity->rq->sched == sched;
>>         > -}
>>         > -
>>         >   /**
>>         >    * drm_sched_entity_is_idle - Check if entity is idle
>>         >    *
>>         > @@ -224,7 +209,8 @@ static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>>         >   {
>>         >       rmb();
>>         >
>>         > -     if (!entity->rq || spsc_queue_peek(&entity->job_queue) == NULL)
>>         > +     if (list_empty(&entity->list) ||
>>         > +         spsc_queue_peek(&entity->job_queue) == NULL)
>>         >               return true;
>>         >
>>         >       return false;
>>         > @@ -279,8 +265,6 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>>         >       long ret = timeout;
>>         >
>>         >       sched = entity->rq->sched;
>>         > -     if (!drm_sched_entity_is_initialized(sched, entity))
>>         > -             return ret;
>>         >       /**
>>         >        * The client will not queue more IBs during this fini, consume existing
>>         >        * queued IBs or discard them on SIGKILL
>>         > @@ -299,7 +283,7 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>>         >       last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
>>         >       if ((!last_user || last_user == current->group_leader) &&
>>         >           (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
>>         > -             drm_sched_entity_set_rq(entity, NULL);
>>         > +             drm_sched_rq_remove_entity(entity->rq, entity);
>>         >
>>         >       return ret;
>>         >   }
>>         > @@ -320,7 +304,7 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
>>         >       struct drm_gpu_scheduler *sched;
>>         >
>>         >       sched = entity->rq->sched;
>>         > -     drm_sched_entity_set_rq(entity, NULL);
>>         > +     drm_sched_rq_remove_entity(entity->rq, entity);
>>         >
>>         >       /* Consumption of existing IBs wasn't completed. Forcefully
>>         >        * remove them here.
>>         > @@ -416,15 +400,12 @@ void drm_sched_entity_set_rq(struct drm_sched_entity *entity,
>>         >       if (entity->rq == rq)
>>         >               return;
>>         >
>>         > -     spin_lock(&entity->rq_lock);
>>         > -
>>         > -     if (entity->rq)
>>         > -             drm_sched_rq_remove_entity(entity->rq, entity);
>>         > +     BUG_ON(!rq);
>>         >
>>         > +     spin_lock(&entity->rq_lock);
>>         > +     drm_sched_rq_remove_entity(entity->rq, entity);
>>         >       entity->rq = rq;
>>         > -     if (rq)
>>         > -             drm_sched_rq_add_entity(rq, entity);
>>         > -
>>         > +     drm_sched_rq_add_entity(rq, entity);
>>         >       spin_unlock(&entity->rq_lock);
>>         >   }
>>         >   EXPORT_SYMBOL(drm_sched_entity_set_rq);
>>
>
