[TWO BUGs] etnaviv crashes overnight

Lucas Stach l.stach at pengutronix.de
Wed Mar 6 13:33:01 UTC 2019


Hi Russell,

On Wednesday, 06.03.2019, 13:20 +0000, Russell King - ARM Linux admin wrote:
> On Tue, Feb 26, 2019 at 11:02:48AM +0100, Lucas Stach wrote:
> > Hi Russell,
> > 
> > On Tuesday, 26.02.2019, 09:24 +0000, Russell King - ARM Linux admin wrote:
> > > I'm not sure when this happened, only that it happened sometime
> > > overnight.  It was left running an Xfce desktop having only logged in,
> > > but with the monitor disconnected.  Fedora 23 plus xf86-video-armada.
> > > 
> > > I don't have any more information than that and the kernel messages
> > > below.
> > 
> > Thanks for the report! Both of those issues are caused by the GPU
> > scheduler failing to defer the scheduler fence callback execution to a
> > work item, as it normally does, in the dying sched_entity cleanup.
> > This causes multiple code paths that expect to be called from a clean
> > process context to instead be called from the fence-signaling IRQ
> > context, with a number of locks potentially already held.
> > 
> > It will take me some time to work through the corners of this code, but
> > I should have a patch fixing this later today.
> 
> Hi Lucas,
> 
> Did you get a chance to patch this bug?

I have the following patch, but it hasn't yet seen anywhere near the
amount of testing required for such a change. Maybe you can give it a
run and see if anything falls out?

Regards,
Lucas

------------------------------------>8---------------------------------
From badd5fa28e2e9369bcd765f110bc6b8756207bcf Mon Sep 17 00:00:00 2001
From: Lucas Stach <l.stach at pengutronix.de>
Date: Wed, 27 Feb 2019 18:11:48 +0100
Subject: [PATCH] drm/scheduler: put killed job cleanup to worker

drm_sched_entity_kill_jobs_cb() is called directly from the signaling of
the last scheduled job's finished fence. As this may happen from IRQ
context, we end up calling the GPU driver's free_job callback in IRQ
context, while all other paths call it from normal process context.

Etnaviv in particular calls core kernel functions that are only valid in
process context when freeing the job. Other drivers might have similar
issues, but I did not validate this. Fix this by punting the cleanup into
a work item, so that the drivers' expectations are met.

Signed-off-by: Lucas Stach <l.stach at pengutronix.de>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 30 ++++++++++++++----------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 3e22a54a99c2..dde875c6ca50 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -188,19 +188,10 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 }
 EXPORT_SYMBOL(drm_sched_entity_flush);
 
-/**
- * drm_sched_entity_kill_jobs - helper for drm_sched_entity_kill_jobs
- *
- * @f: signaled fence
- * @cb: our callback structure
- *
- * Signal the scheduler finished fence when the entity in question is killed.
- */
-static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
-					  struct dma_fence_cb *cb)
+static void drm_sched_entity_kill_work(struct work_struct *work)
 {
-	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
-						 finish_cb);
+	struct drm_sched_job *job = container_of(work, struct drm_sched_job,
+						 finish_work);
 
 	drm_sched_fence_finished(job->s_fence);
 	WARN_ON(job->s_fence->parent);
@@ -208,6 +199,15 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 	job->sched->ops->free_job(job);
 }
 
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+					  struct dma_fence_cb *cb)
+{
+	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
+						 finish_cb);
+
+	schedule_work(&job->finish_work);
+}
+
 /**
  * drm_sched_entity_kill_jobs - Make sure all remaining jobs are killed
  *
@@ -227,6 +227,12 @@ static void drm_sched_entity_kill_jobs(struct drm_sched_entity *entity)
 		drm_sched_fence_scheduled(s_fence);
 		dma_fence_set_error(&s_fence->finished, -ESRCH);
 
+		/*
+		 * Replace regular finish work function with one that just
+		 * kills the job.
+		 */
+		job->finish_work.func = drm_sched_entity_kill_work;
+
 		/*
 		 * When pipe is hanged by older entity, new entity might
 		 * not even have chance to submit it's first job to HW
-- 
2.20.1
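
(For context on why the calling context matters: a driver's free_job
callback typically ends up taking sleeping locks or freeing memory. A
purely hypothetical example -- made-up structs, not etnaviv's actual
code -- of the kind of thing free_job does:

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

#include <drm/gpu_scheduler.h>

/* Hypothetical driver structs, just to illustrate the point. */
struct my_device {
	struct mutex list_lock;		/* sleeping lock */
	struct list_head active_jobs;
};

struct my_job {
	struct drm_sched_job base;
	struct my_device *dev;
	struct list_head node;
};

static void my_sched_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/*
	 * mutex_lock() may sleep, so this is only valid when free_job is
	 * invoked from process context.
	 */
	mutex_lock(&job->dev->list_lock);
	list_del(&job->node);
	mutex_unlock(&job->dev->list_lock);

	kfree(job);
}

With the patch above, free_job always runs from a work item, i.e. from
process context, even for jobs killed on entity teardown.)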

