[Intel-gfx] [PATCH] drm/i915/guc: Move wait for GuC out of spinlock/unlock
Imre Deak
imre.deak at intel.com
Tue Nov 24 05:26:33 PST 2015
On Tue, 2015-11-24 at 14:04 +0100, Daniel Vetter wrote:
> On Mon, Nov 23, 2015 at 03:02:58PM -0800, yu.dai at intel.com wrote:
> > From: Alex Dai <yu.dai at intel.com>
> >
> > When the GuC work queue is full, the driver waits for GuC to free
> > space by delaying 1ms at a time. The wait needs to be outside the
> > spin_lock_irq / unlock section. Otherwise a lockup happens, because
> > jiffies won't be updated while interrupts are disabled.
> >
> > The issue was found by igt/gem_close_race.
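
For context, the failing pattern is essentially a jiffies-based busy
wait running with local interrupts off. A minimal illustrative sketch
(not the exact i915 code; space_available() is a stand-in):

	unsigned long timeout = jiffies + msecs_to_jiffies(1);

	spin_lock_irq(&gc->wq_lock);		/* local IRQs now off */
	while (!space_available(gc)) {
		/* The timer tick can't run on this CPU, so jiffies may
		 * never advance and the timeout may never trigger. */
		if (time_after(jiffies, timeout))
			break;
		cpu_relax();
	}
	spin_unlock_irq(&gc->wq_lock);
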
> >
> > Signed-off-by: Alex Dai <yu.dai at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_guc_submission.c | 27 +++++++++++++++++----------
> > 1 file changed, 17 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> > index 0a6b007..1418397 100644
> > --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> > @@ -201,10 +201,13 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> > union guc_doorbell_qw *db;
> > void *base;
> > int attempt = 2, ret = -EAGAIN;
> > + unsigned long flags;
> >
> > base = kmap_atomic(i915_gem_object_get_page(gc->client_obj, 0));
>
> We don't need kmap_atomic here anymore, since it's now outside of the
> spinlock.
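
Something like this untested sketch, I suppose, once the doorbell
update no longer runs under the spinlock:

	struct page *page = i915_gem_object_get_page(gc->client_obj, 0);
	void *base = kmap(page);	/* may sleep; ok without the lock */

	/* ... update desc->tail and ring the doorbell ... */

	kunmap(page);
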
>
> > desc = base + gc->proc_desc_offset;
> >
> > + spin_lock_irqsave(&gc->wq_lock, flags);
>
> Please don't use the super-generic _irqsave. It's expensive and
> results in fragile code when someone accidentally reuses something in
> an interrupt handler that was never meant to run in that context.
>
> Instead please use the most specific function:
> - spin_lock if you know you are in irq context.
> - spin_lock_irq if you know you are not.
Right, and simply spin_lock() if the lock is never taken in IRQ
context at all.
> - spin_lock_irqsave should be a big warning sign that your code has
> layering issues.
>
> Please audit the entire guc code for the above two issues.
Agreed, it looks inconsistent atm: we do spin_lock(wq_lock) from
debugfs and spin_lock_irq(wq_lock) from i915_guc_submit(). Neither of
them is called from IRQ context AFAICS, so a simple spin_lock() would
do.
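
Roughly, the decision tree (illustrative only):

	/* Known irq context: IRQs are already off on this CPU. */
	spin_lock(&gc->wq_lock);
	spin_unlock(&gc->wq_lock);

	/* Known process context, and the lock is also taken from an irq
	 * handler: disable IRQs unconditionally. */
	spin_lock_irq(&gc->wq_lock);
	spin_unlock_irq(&gc->wq_lock);

	/* Caller's IRQ state unknown: the expensive save/restore
	 * variant, and usually a sign of layering problems. */
	unsigned long flags;

	spin_lock_irqsave(&gc->wq_lock, flags);
	spin_unlock_irqrestore(&gc->wq_lock, flags);
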
--Imre
> > +
> > /* Update the tail so it is visible to GuC */
> > desc->tail = gc->wq_tail;
> >
> > @@ -248,7 +251,10 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
> > db_exc.cookie = 1;
> > }
> >
> > + spin_unlock_irqrestore(&gc->wq_lock, flags);
> > +
> > kunmap_atomic(base);
> > +
> > return ret;
> > }
> >
> > @@ -487,16 +493,16 @@ static int guc_get_workqueue_space(struct i915_guc_client *gc, u32 *offset)
> > struct guc_process_desc *desc;
> > void *base;
> > u32 size = sizeof(struct guc_wq_item);
> > - int ret = 0, timeout_counter = 200;
> > + int ret = -ETIMEDOUT, timeout_counter = 200;
> > + unsigned long flags;
> >
> > base = kmap_atomic(i915_gem_object_get_page(gc->client_obj, 0));
> > desc = base + gc->proc_desc_offset;
> >
> > while (timeout_counter-- > 0) {
> > - ret = wait_for_atomic(CIRC_SPACE(gc->wq_tail, desc->head,
> > - gc->wq_size) >= size, 1);
> > + spin_lock_irqsave(&gc->wq_lock, flags);
> >
> > - if (!ret) {
> > + if (CIRC_SPACE(gc->wq_tail, desc->head, gc->wq_size) >= size) {
> > *offset = gc->wq_tail;
> >
> > /* advance the tail for next workqueue item */
> > @@ -505,7 +511,13 @@ static int guc_get_workqueue_space(struct i915_guc_client *gc, u32 *offset)
> >
> > /* this will break the loop */
> > timeout_counter = 0;
> > + ret = 0;
> > }
> > +
> > + spin_unlock_irqrestore(&gc->wq_lock, flags);
> > +
> > + if (timeout_counter)
> > + usleep_range(1000, 2000);
>
> Do we really not have an interrupt/signal from the guc when it has
> cleared up some space?
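
If it did, the 1ms polling could become a sleeping wait. Purely
hypothetical sketch -- the guc_wq_space waitqueue and the
guc_wq_space_available() helper don't exist, and the GuC interrupt
handler would have to wake the queue when space frees up:

	static DECLARE_WAIT_QUEUE_HEAD(guc_wq_space);

	/* in guc_get_workqueue_space(): */
	if (!wait_event_timeout(guc_wq_space,
				guc_wq_space_available(gc, size),
				msecs_to_jiffies(200)))
		return -ETIMEDOUT;
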
>
> > };
> >
> > kunmap_atomic(base);
> > @@ -597,19 +609,17 @@ int i915_guc_submit(struct i915_guc_client *client,
> > {
> > struct intel_guc *guc = client->guc;
> > enum intel_ring_id ring_id = rq->ring->id;
> > - unsigned long flags;
> > int q_ret, b_ret;
> >
> > /* Need this because of the deferred pin ctx and ring */
> > /* Shall we move this right after ring is pinned? */
> > lr_context_update(rq);
> >
> > - spin_lock_irqsave(&client->wq_lock, flags);
> > -
> > q_ret = guc_add_workqueue_item(client, rq);
> > if (q_ret == 0)
> > b_ret = guc_ring_doorbell(client);
> >
> > + spin_lock(&guc->host2guc_lock);
>
> So at first I thought there's a race now, but then I looked at what
> host2guc and wq_lock protect. It seems like the only thing they do is
> protect against debugfs; all the real protection against inconsistent
> state is done through dev->struct_mutex.
>
> Can't we just rip out all this spinlock business from the guc code?
> It would be easier than fixing up the races in here.
> -Daniel
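
If struct_mutex really is the only guard needed, a lockdep annotation
could document that once the spinlocks are ripped out (sketch; the
field path is approximate, and it assumes every caller already holds
the mutex):

	int i915_guc_submit(struct i915_guc_client *client,
			    struct drm_i915_gem_request *rq)
	{
		lockdep_assert_held(&rq->ring->dev->struct_mutex);

		/* ... queue the work item and ring the doorbell, with
		 * no wq_lock/host2guc_lock needed ... */
	}
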
>
> > client->submissions[ring_id] += 1;
> > if (q_ret) {
> > client->q_fail += 1;
> > @@ -620,9 +630,6 @@ int i915_guc_submit(struct i915_guc_client *client,
> > } else {
> > client->retcode = 0;
> > }
> > - spin_unlock_irqrestore(&client->wq_lock, flags);
> > -
> > - spin_lock(&guc->host2guc_lock);
> > guc->submissions[ring_id] += 1;
> > guc->last_seqno[ring_id] = rq->seqno;
> > spin_unlock(&guc->host2guc_lock);
> > --
> > 2.5.0
> >
>