[PATCH] drm/gpu-sched: fix force APP kill hang

Thu Mar 29 02:45:54 UTC 2018

Hi Christian,
Thanks for your review. Could you please give some advice on how to resolve the issue? How about also checking for the ESRCH
fence status when checking whether a fence has signaled? If so, we need to define the detailed behavior for a fence whose status is ESRCH.
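
To make the idea concrete, here is a rough userspace sketch. It is not kernel code: struct mock_fence and all of the mock_* helpers are hypothetical names made up for illustration, and it assumes that a fence fake signaled by entity_fini() carries -ESRCH as its error status. It only shows how a dependency check could tell a real completion from a fake signal.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	bool signaled;	/* the fence has been signaled, for whatever reason */
	int error;	/* 0 on normal completion, negative errno otherwise */
};

/* Models entity_fini() force-signaling a queued job when SIGKILL is received
 * (assumption: the fake signal is tagged with -ESRCH). */
void mock_fence_fake_signal(struct mock_fence *f)
{
	f->error = -ESRCH;
	f->signaled = true;
}

/* Models a job that really finished on the hardware. */
void mock_fence_complete(struct mock_fence *f)
{
	f->error = 0;
	f->signaled = true;
}

/* True only if the fence is signaled and was not fake signaled with -ESRCH. */
bool mock_fence_really_done(const struct mock_fence *f)
{
	return f->signaled && f->error != -ESRCH;
}

/* A dependent job (for example the clear PTE job) would treat its dependency
 * as satisfied only on real completion. */
bool mock_dependency_satisfied(const struct mock_fence *dep)
{
	return mock_fence_really_done(dep);
}
```

What the dependent job should do when this check returns false (skip, fail with an error, or keep waiting) is exactly the behavior that still needs to be defined.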

Best Wishes,
Emily Deng

> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> Sent: Wednesday, March 28, 2018 7:57 PM
> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Liu, Monk <Monk.Liu at amd.com>
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
> 
> Am 28.03.2018 um 10:07 schrieb Emily Deng:
> > issue:
> > VMC page faults occur if the APP is force-killed during a 3dmark test.
> > The cause is that in entity_fini() we manually signal all the jobs in
> > the entity's queue, which confuses the sync/dep
> > mechanism:
> >
> > 1) A page fault occurred in sdma's clear job, which operates on the shadow
> > buffer: the shadow buffer's GART table is cleaned by ttm_bo_release
> > since the fence in its reservation was fake signaled by entity_fini()
> > when SIGKILL was received.
> >
> > 2) A page fault occurred in the gfx job because, during the lifetime of the gfx
> > job, we manually fake signal all jobs from its entity in entity_fini().
> > Thus the unmapping/clear PTE job that depends on those result fences is
> > satisfied, and sdma starts clearing the PTEs, which leads to the GFX page fault.
> 
> Nice catch, but the fixes won't work like this.
> 
> > fix:
> > 1) We should at least wait for all already scheduled jobs to complete in
> > entity_fini() in the SIGKILL case.
> 
> Well that is not a good idea because when we kill a process we actually want
> to tear down the task as fast as possible and not wait for anything. That is
> actually the reason why we have this handling.
> 
> > 2) If a fence is signaled and tries to clear some entity's dependency, we
> > should mark this entity guilty to prevent its job from really running, since the
> > dependency was fake signaled.
> 
> Well, that is a clear NAK. It would mean that things like the X server
> or Wayland queue are marked guilty because some client is killed.
> 
> And since unmapping/clear jobs don't have a guilty pointer it should actually
> not have any effect on the issue.
> 
> Regards,
> Christian.
> 
> 
> >
> > related issue ticket:
> > http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
> >
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > ---
> >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
> >   1 file changed, 36 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > index 2bd69c4..9b306d3 100644
> > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> >   	return true;
> >   }
> >
> > +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> > +				struct drm_sched_entity *entity)
> > +{
> > +	struct drm_sched_job *last;
> > +	signed long r;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> > +		if (last->s_fence->scheduled.context == entity->fence_context) {
> > +			dma_fence_get(&last->s_fence->finished);
> > +			break;
> > +		}
> > +	spin_unlock(&sched->job_list_lock);
> > +
> > +	if (&last->node != &sched->ring_mirror_list) {
> > +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> > +		if (r == 0)
> > +			DRM_WARN("wait on the fly job timeout\n");
> > +		dma_fence_put(&last->s_fence->finished);
> > +	}
> > +}
> > +
> >   /**
> >    * Destroy a context entity
> >    *
> > @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
> >   			entity->dependency = NULL;
> >   		}
> >
> > +		/* Wait until all jobs from this entity have really finished;
> > +		 * otherwise the fake signaling below would kickstart sdma's
> > +		 * clear PTE jobs and lead to a VM fault.
> > +		 */
> > +		drm_sched_entity_wait_otf_signal(sched, entity);
> > +
> >   		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> >   			struct drm_sched_fence *s_fence = job->s_fence;
> >   			drm_sched_fence_scheduled(s_fence);
> > @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
> >   {
> >   	struct drm_sched_entity *entity =
> >   		container_of(cb, struct drm_sched_entity, cb);
> > +
> > +	/* Set the entity guilty since its dependency is
> > +	 * not really cleared but fake signaled (by SIGKILL
> > +	 * or GPU recovery).
> > +	 */
> > +	if (f->error && entity->guilty)
> > +		atomic_set(entity->guilty, 1);
> > +
> >   	entity->dependency = NULL;
> >   	dma_fence_put(f);
> >   	drm_sched_wakeup(entity->sched);