[PATCH] drm/gpu-sched: fix force APP kill hang
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Mar 28 11:56:34 UTC 2018
On 28.03.2018 at 10:07, Emily Deng wrote:
> issue:
> VMC page faults occur if the app is force-killed during a
> 3dmark test. The cause is that in entity_fini() we manually
> signal all the jobs in the entity's queue, which confuses the
> sync/dependency mechanism:
>
> 1) A page fault occurred in SDMA's clear job, which operates on the
> shadow buffer: the shadow buffer's GART table had been cleared by
> ttm_bo_release() because the fence in its reservation object was
> fake-signaled by entity_fini() after a SIGKILL was received.
>
> 2) A page fault occurred in a GFX job because during the lifetime
> of the GFX job we manually fake-signal all jobs from its entity in
> entity_fini(); thus the unmapping/clear-PTE jobs that depend on
> those result fences are satisfied, and SDMA starts clearing the
> PTEs, leading to a GFX page fault.
Nice catch, but the fixes won't work like this.
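For background on why the fake signaling trips the dependency handling:
a dependency here is just a dma_fence, and the scheduler treats any
signaled fence as satisfied, no matter why it signaled. Roughly
paraphrased from the dependency check in drm_sched_entity_pop_job()
(a sketch from memory, not an exact quote of the current code):

	/* Each dependency is a plain dma_fence; the callback only blocks
	 * the job while the fence is unsignaled.  A fake-signaled fence
	 * counts exactly like one that really completed, which is how the
	 * unmapping/clear-PTE jobs become runnable too early.
	 */
	while ((entity->dependency = sched->ops->dependency(sched_job, entity)))
		if (drm_sched_entity_add_dependency_cb(entity))
			return NULL;	/* still blocked, job stays queued */
	/* fall through: all dependencies satisfied, the job will be run */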
> fix:
> 1) In entity_fini() we should at least wait for all already scheduled
> jobs to complete if SIGKILL is the case.
Well, that is not a good idea, because when we kill a process we
actually want to tear down the task as fast as possible and not wait
for anything. That is actually the reason why we have this handling.
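For reference, that handling is the loop this patch inserts its wait in
front of: on teardown every job still queued on the entity is popped
and its fences are completed without the job ever reaching the
hardware. Simplified sketch (the lines beyond what the hunk below shows
are from memory, so treat them as approximate):

	/* Fast teardown in drm_sched_entity_fini(): complete the fences of
	 * all still-queued jobs without running them, so a SIGKILLed task
	 * goes away immediately instead of waiting on the GPU.
	 */
	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		struct drm_sched_fence *s_fence = job->s_fence;

		drm_sched_fence_scheduled(s_fence);	/* never really scheduled */
		drm_sched_fence_finished(s_fence);	/* never really ran */
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}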
> 2) If a fence is signaled and tries to clear some entity's dependency,
> we should mark that entity guilty to prevent its jobs from really
> running, since the dependency was only fake-signaled.
Well, that is a clear NAK. It would mean that queues like the X
server's or Wayland's get marked guilty just because some client was
killed.
And since the unmapping/clear jobs don't have a guilty pointer, it
should actually not have any effect on the issue anyway.
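To illustrate: the guilty flag is the last argument of
drm_sched_entity_init() and is optional per entity. A sketch of how
amdgpu sets up its kernel VM entity (assuming the init signature of
this period; the exact call site may look slightly different):

	/* The kernel VM entity that submits the unmapping/clear-PTE jobs
	 * passes NULL as the guilty pointer, so the proposed
	 * atomic_set(entity->guilty, 1) can never apply to exactly the
	 * jobs this bug is about.
	 */
	r = drm_sched_entity_init(&ring->sched, &vm->entity,
				  rq, amdgpu_sched_jobs, NULL);

A long-lived userspace entity (the X server's, for example) does have a
guilty pointer and would get flagged just because one of its dependency
fences signaled with an error.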
Regards,
Christian.
>
> related issue ticket:
> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
> drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 2bd69c4..9b306d3 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> 	return true;
> }
> 
> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> +					     struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *last;
> +	signed long r;
> +
> +	spin_lock(&sched->job_list_lock);
> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> +		if (last->s_fence->scheduled.context == entity->fence_context) {
> +			dma_fence_get(&last->s_fence->finished);
> +			break;
> +		}
> +	spin_unlock(&sched->job_list_lock);
> +
> +	if (&last->node != &sched->ring_mirror_list) {
> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> +		if (r == 0)
> +			DRM_WARN("wait on the fly job timeout\n");
> +		dma_fence_put(&last->s_fence->finished);
> +	}
> +}
> +
> /**
> * Destroy a context entity
> *
> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
> 			entity->dependency = NULL;
> 		}
> 
> +		/* Wait till all jobs from this entity have really finished,
> +		 * otherwise the fake signaling below would kick off SDMA's
> +		 * clear-PTE jobs and lead to a VM fault
> +		 */
> +		drm_sched_entity_wait_otf_signal(sched, entity);
> +
> 		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> 			struct drm_sched_fence *s_fence = job->s_fence;
> 			drm_sched_fence_scheduled(s_fence);
> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
> {
> 	struct drm_sched_entity *entity =
> 		container_of(cb, struct drm_sched_entity, cb);
> +
> +	/* Set the entity guilty since its dependency was
> +	 * not really cleared but fake-signaled (by SIGKILL
> +	 * or GPU recovery)
> +	 */
> +	if (f->error && entity->guilty)
> +		atomic_set(entity->guilty, 1);
> +
> 	entity->dependency = NULL;
> 	dma_fence_put(f);
> 	drm_sched_wakeup(entity->sched);