[PATCH] drm/gpu-sched: fix force APP kill hang

Deng, Emily Emily.Deng at amd.com
Thu Mar 29 09:47:41 UTC 2018


Hi Christian,
     Thanks for your good idea, will try this.
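
Just to make sure I understand the split correctly, would it look roughly like the sketch below? The two function names, the simplified idle check and the -ESRCH error code here are only placeholders for the discussion, not a proposal for the final interface:

int drm_sched_entity_do_release(struct drm_gpu_scheduler *sched,
				struct drm_sched_entity *entity)
{
	int ret = 0;

	/* Simplified: the real condition would be the same "entity is
	 * idle" check that drm_sched_entity_fini() uses today.
	 */
	if (wait_event_killable(sched->job_scheduled,
				!spsc_queue_count(&entity->job_queue)))
		ret = -ERESTARTSYS;	/* process was killed, stop waiting */

	/* The scheduler will not pick new jobs from this entity any more. */
	drm_sched_rq_remove_entity(entity->rq, entity);

	return ret;
}

/* Only called after the function above ran for all contexts of the
 * process and for the VM update entity, so nothing signaled here can
 * still kick off PTE clear work for that process.
 */
void drm_sched_entity_cleanup(struct drm_gpu_scheduler *sched,
			      struct drm_sched_entity *entity)
{
	struct drm_sched_job *job;

	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		struct drm_sched_fence *s_fence = job->s_fence;

		/* Signal with an error instead of pretending success. */
		dma_fence_set_error(&s_fence->finished, -ESRCH);
		drm_sched_fence_scheduled(s_fence);
		drm_sched_fence_finished(s_fence);
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}
}

If that matches your idea, amdgpu would first call the release part for every ctx entity and for the VM update entity, and only then run the cleanup part for each of them.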

Best Wishes,
Emily Deng




> -----Original Message-----
> From: Koenig, Christian
> Sent: Thursday, March 29, 2018 5:05 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Deng, Emily <Emily.Deng at amd.com>;
> amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
> 
> I think the core of the problem is that we don't abort all entities of the
> process at the same time.
> 
> How about splitting drm_sched_entity_fini() into two functions?
> 
> The first one does the waiting, removes the entity from the runqueue and
> returns an error when the process was killed.
> 
> During destruction this one is called first for all contexts as well as the entity
> for VM updates.
> 
> The second one then goes over the entity and signals all jobs with an error
> code.
> 
> This way no VM updates should be executed any longer after the process is
> killed (something which doesn't make sense anyway and just costs us time).
> 
> Regards,
> Christian.
> 
> On 29.03.2018 at 06:14, Liu, Monk wrote:
> >> 2) if a fence is signaled and tries to clear some entity's dependency,
> >> we should set this entity guilty to prevent its job from really running,
> >> since the dependency was fake signaled.
> > Well, that is a clear NAK. It would mean that things like the X server or
> > Wayland queue are marked guilty because some client is killed.
> >
> > And since unmapping/clear jobs don't have a guilty pointer, it should
> > actually not have any effect on the issue.
> >
> >
> > [ML], yeah, that's a good point. How about this: if a fence is fake
> > signaled and tries to clear another entity's dependency, we only mark
> > that entity as guilty if it shares the same ctx (or even process?) as
> > the entity of the job behind that fake signaled fence?
> > This way, for a given process, a fake signaled GFX fence won't be able
> > to wake another VCE/SDMA job.
> >
> > /Monk
> >
> > -----Original Message-----
> > From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> > Sent: March 28, 2018 19:57
> > To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
> > Cc: Liu, Monk <Monk.Liu at amd.com>
> > Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
> >
> > On 28.03.2018 at 10:07, Emily Deng wrote:
> >> issue:
> >> VMC page faults occur if the APP is force killed during a 3dmark
> >> test. The cause is that in entity_fini we manually signal all the jobs
> >> in the entity's queue, which confuses the sync/dep
> >> mechanism:
> >>
> >> 1) a page fault occurs in sdma's clear job, which operates on the shadow
> >> buffer, because the shadow buffer's GART table was cleaned by ttm_bo_release
> >> since the fence in its reservation was fake signaled by entity_fini()
> >> after SIGKILL was received.
> >>
> >> 2) a page fault occurs in the gfx job because during the lifetime of the
> >> gfx job we manually fake signal all jobs from its entity in
> >> entity_fini(); thus the unmapping/clear PTE jobs that depend on those
> >> result fences are satisfied, and sdma starts clearing the PTEs, leading
> >> to a GFX page fault.
> > Nice catch, but the fixes won't work like this.
> >
> >> fix:
> >> 1) at least wait for all already scheduled jobs to complete in
> >> entity_fini() in the SIGKILL case.
> > Well, that is not a good idea because when we kill a process we actually
> > want to tear down the task as fast as possible and not wait for anything.
> > That is actually the reason why we have this handling.
> >
> >> 2) if a fence is signaled and tries to clear some entity's dependency,
> >> we should set this entity guilty to prevent its job from really running,
> >> since the dependency was fake signaled.
> > Well, that is a clear NAK. It would mean that things like the X server or
> > Wayland queue are marked guilty because some client is killed.
> >
> > And since unmapping/clear jobs don't have a guilty pointer, it should
> > actually not have any effect on the issue.
> >
> > Regards,
> > Christian.
> >
> >
> >> related issue ticket:
> >> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
> >>
> >> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> >> ---
> >>    drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
> >>    1 file changed, 36 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> index 2bd69c4..9b306d3 100644
> >> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> >>    	return true;
> >>    }
> >>
> >> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> >> +				struct drm_sched_entity *entity) {
> >> +	struct drm_sched_job *last;
> >> +	signed long r;
> >> +
> >> +	spin_lock(&sched->job_list_lock);
> >> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> >> +		if (last->s_fence->scheduled.context == entity->fence_context) {
> >> +			dma_fence_get(&last->s_fence->finished);
> >> +			break;
> >> +		}
> >> +	spin_unlock(&sched->job_list_lock);
> >> +
> >> +	if (&last->node != &sched->ring_mirror_list) {
> >> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> >> +		if (r == 0)
> >> +			DRM_WARN("wait on the fly job timeout\n");
> >> +		dma_fence_put(&last->s_fence->finished);
> >> +	}
> >> +}
> >> +
> >>    /**
> >>     * Destroy a context entity
> >>     *
> >> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
> >>    			entity->dependency = NULL;
> >>    		}
> >>
> >> +		/* Wait till all jobs from this entity really finished otherwise below
> >> +		 * fake signaling would kickstart sdma's clear PTE jobs and lead to
> >> +		 * vm fault
> >> +		 */
> >> +		drm_sched_entity_wait_otf_signal(sched, entity);
> >> +
> >>    		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> >>    			struct drm_sched_fence *s_fence = job->s_fence;
> >>    			drm_sched_fence_scheduled(s_fence);
> >> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
> >>    {
> >>    	struct drm_sched_entity *entity =
> >>    		container_of(cb, struct drm_sched_entity, cb);
> >> +
> >> +	/* set the entity guilty since its dependency is
> >> +	 * not really cleared but fake signaled (by SIGKILL
> >> +	 * or GPU recover)
> >> +	 */
> >> +	if (f->error && entity->guilty)
> >> +		atomic_set(entity->guilty, 1);
> >> +
> >>    	entity->dependency = NULL;
> >>    	dma_fence_put(f);
> >>    	drm_sched_wakeup(entity->sched);


