[PATCH] drm/gpu-sched: fix force APP kill hang
Christian König
ckoenig.leichtzumerken at gmail.com
Wed Mar 28 11:56:34 UTC 2018
On 28.03.2018 at 10:07, Emily Deng wrote:
> issue:
> VMC page faults occur if the app is force-killed during a
> 3dmark test. The cause is that in entity_fini() we manually
> signal all the jobs in the entity's queue, which confuses the
> sync/dependency mechanism:
>
> 1) A page fault occurred in SDMA's clear job, which operates on the
> shadow buffer: the shadow buffer's GART table had been cleared by
> ttm_bo_release() because the fence in its reservation object was
> fake-signaled by entity_fini() after a SIGKILL was received.
>
> 2) A page fault occurred in a GFX job because during the lifetime
> of the GFX job we manually fake-signal all jobs from its entity in
> entity_fini(); thus the unmapping/clear-PTE jobs that depend on
> those result fences are satisfied, and SDMA starts clearing the
> PTEs, leading to a GFX page fault.
Nice catch, but the fixes won't work like this.
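For background on why the fake signaling trips the dependency handling:
a dependency here is just a dma_fence, and the scheduler treats any
signaled fence as satisfied, no matter why it signaled. Roughly
paraphrased from the dependency check in drm_sched_entity_pop_job()
(a sketch from memory, not an exact quote of the current code):

	/* Each dependency is a plain dma_fence; the callback only blocks
	 * the job while the fence is unsignaled.  A fake-signaled fence
	 * counts exactly like one that really completed, which is how the
	 * unmapping/clear-PTE jobs become runnable too early.
	 */
	while ((entity->dependency = sched->ops->dependency(sched_job, entity)))
		if (drm_sched_entity_add_dependency_cb(entity))
			return NULL;	/* still blocked, job stays queued */
	/* fall through: all dependencies satisfied, the job will be run */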
> fix:
> 1) In entity_fini() we should at least wait for all already scheduled
> jobs to complete if SIGKILL is the case.
Well, that is not a good idea, because when we kill a process we
actually want to tear down the task as fast as possible and not wait
for anything. That is actually the reason why we have this handling.
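For reference, that handling is the loop this patch inserts its wait in
front of: on teardown every job still queued on the entity is popped
and its fences are completed without the job ever reaching the
hardware. Simplified sketch (the lines beyond what the hunk below shows
are from memory, so treat them as approximate):

	/* Fast teardown in drm_sched_entity_fini(): complete the fences of
	 * all still-queued jobs without running them, so a SIGKILLed task
	 * goes away immediately instead of waiting on the GPU.
	 */
	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
		struct drm_sched_fence *s_fence = job->s_fence;

		drm_sched_fence_scheduled(s_fence);	/* never really scheduled */
		drm_sched_fence_finished(s_fence);	/* never really ran */
		dma_fence_put(&s_fence->finished);
		sched->ops->free_job(job);
	}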
> 2) If a fence is signaled and tries to clear some entity's dependency,
> we should mark that entity guilty to prevent its jobs from really
> running, since the dependency was only fake-signaled.
Well, that is a clear NAK. It would mean that queues like the X
server's or Wayland's get marked guilty just because some client was
killed.
And since the unmapping/clear jobs don't have a guilty pointer, it
should actually not have any effect on the issue anyway.
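To illustrate: the guilty flag is the last argument of
drm_sched_entity_init() and is optional per entity. A sketch of how
amdgpu sets up its kernel VM entity (assuming the init signature of
this period; the exact call site may look slightly different):

	/* The kernel VM entity that submits the unmapping/clear-PTE jobs
	 * passes NULL as the guilty pointer, so the proposed
	 * atomic_set(entity->guilty, 1) can never apply to exactly the
	 * jobs this bug is about.
	 */
	r = drm_sched_entity_init(&ring->sched, &vm->entity,
				  rq, amdgpu_sched_jobs, NULL);

A long-lived userspace entity (the X server's, for example) does have a
guilty pointer and would get flagged just because one of its dependency
fences signaled with an error.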
Regards,
Christian.
>
> related issue ticket:
> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
> drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 2bd69c4..9b306d3 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> 	return true;
> }
> 
> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> +					     struct drm_sched_entity *entity)
> +{
> +	struct drm_sched_job *last;
> +	signed long r;
> +
> +	spin_lock(&sched->job_list_lock);
> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> +		if (last->s_fence->scheduled.context == entity->fence_context) {
> +			dma_fence_get(&last->s_fence->finished);
> +			break;
> +		}
> +	spin_unlock(&sched->job_list_lock);
> +
> +	if (&last->node != &sched->ring_mirror_list) {
> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> +		if (r == 0)
> +			DRM_WARN("wait on the fly job timeout\n");
> +		dma_fence_put(&last->s_fence->finished);
> +	}
> +}
> +
> /**
> * Destroy a context entity
> *
> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
> 			entity->dependency = NULL;
> 		}
> 
> +		/* Wait till all jobs from this entity have really finished,
> +		 * otherwise the fake signaling below would kick off SDMA's
> +		 * clear-PTE jobs and lead to a VM fault
> +		 */
> +		drm_sched_entity_wait_otf_signal(sched, entity);
> +
> 		while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> 			struct drm_sched_fence *s_fence = job->s_fence;
> 			drm_sched_fence_scheduled(s_fence);
> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
> {
> 	struct drm_sched_entity *entity =
> 		container_of(cb, struct drm_sched_entity, cb);
> +
> +	/* Set the entity guilty since its dependency was
> +	 * not really cleared but fake-signaled (by SIGKILL
> +	 * or GPU recovery)
> +	 */
> +	if (f->error && entity->guilty)
> +		atomic_set(entity->guilty, 1);
> +
> 	entity->dependency = NULL;
> 	dma_fence_put(f);
> 	drm_sched_wakeup(entity->sched);