[PATCH] drm/gpu-sched: fix force APP kill hang
Emily Deng
Emily.Deng at amd.com
Wed Mar 28 08:07:29 UTC 2018
issue:
VMC page faults occur if an application is forcibly killed during a
3DMark test. The cause is that in entity_fini() we manually signal
all jobs remaining in the entity's queue, which confuses the
sync/dependency mechanism:
1) A page fault occurs in SDMA's clear job, which operates on a
shadow buffer whose GART table has already been cleared by
ttm_bo_release, because the fence in the buffer's reservation object
was fake-signaled by entity_fini() after SIGKILL was received.
2) A page fault occurs in a GFX job because, during that job's
lifetime, entity_fini() fake-signals all jobs from its entity; the
unmapping/clear-PTE jobs that depend on those result fences are
thereby satisfied, so SDMA starts clearing the PTEs and triggers a
GFX page fault.
fix:
1) In entity_fini(), at least wait for all already-scheduled jobs to
complete when SIGKILL is the reason for the teardown.
2) When a signaled fence clears an entity's dependency, mark that
entity guilty if the fence carries an error, to prevent its jobs from
actually running, since the dependency was only fake-signaled.
related issue ticket:
http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
Signed-off-by: Monk Liu <Monk.Liu at amd.com>
---
drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 2bd69c4..9b306d3 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
return true;
}
+static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
+ struct drm_sched_entity *entity)
+{
+ struct drm_sched_job *last;
+ signed long r;
+
+ spin_lock(&sched->job_list_lock);
+ list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
+ if (last->s_fence->scheduled.context == entity->fence_context) {
+ dma_fence_get(&last->s_fence->finished);
+ break;
+ }
+ spin_unlock(&sched->job_list_lock);
+
+ if (&last->node != &sched->ring_mirror_list) {
+ r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
+ if (r == 0)
+ DRM_WARN("wait on the fly job timeout\n");
+ dma_fence_put(&last->s_fence->finished);
+ }
+}
+
/**
* Destroy a context entity
*
@@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
entity->dependency = NULL;
}
+ /* Wait until all jobs from this entity have really finished;
+ * otherwise the fake signaling below would kick off SDMA's
+ * clear-PTE jobs and lead to a VM fault.
+ */
+ drm_sched_entity_wait_otf_signal(sched, entity);
+
while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
struct drm_sched_fence *s_fence = job->s_fence;
drm_sched_fence_scheduled(s_fence);
@@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
{
struct drm_sched_entity *entity =
container_of(cb, struct drm_sched_entity, cb);
+
+ /* Set the entity guilty since its dependency was
+ * not really cleared but fake signaled (by SIGKILL
+ * or GPU recovery).
+ */
+ if (f->error && entity->guilty)
+ atomic_set(entity->guilty, 1);
+
entity->dependency = NULL;
dma_fence_put(f);
drm_sched_wakeup(entity->sched);
--
2.7.4
More information about the amd-gfx mailing list