[Intel-gfx] [PATCH] drm/i915: Fix erroneous dereference of batch_obj inside reset_status

Chris Wilson chris at chris-wilson.co.uk
Wed Dec 4 13:32:02 CET 2013


On Wed, Dec 04, 2013 at 01:18:42PM +0100, Daniel Vetter wrote:
> On Wed, Dec 04, 2013 at 11:37:09AM +0000, Chris Wilson wrote:
> > As the rings may be processed and their requests deallocated in a
> > different order to the natural retirement during a reset,
> > 
> > /* Whilst this request exists, batch_obj will be on the
> >  * active_list, and so will hold the active reference. Only when this
> >  * request is retired will the the batch_obj be moved onto the
> >  * inactive_list and lose its active reference. Hence we do not need
> >  * to explicitly hold another reference here.
> >  */
> > 
> > is violated, and the batch_obj may be dereferenced after it had been
> > freed on another ring. This can be simply avoided by processing the
> > status update prior to deallocating any requests.
> > 
> > Fixes regression (a possible OOPS following a GPU hang) from
> > commit aa60c664e6df502578454621c3a9b1f087ff8d25
> > Author: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Date:   Wed Jun 12 15:13:20 2013 +0300
> > 
> >     drm/i915: find guilty batch buffer on ring resets
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > Cc: stable at vger.kernel.org
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c | 29 +++++++++++++++++++----------
> >  1 file changed, 19 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index ec4502034203..c1e481d36575 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -2442,15 +2442,24 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
> >  	kfree(request);
> >  }
> >  
> > -static void i915_gem_reset_ring_lists(struct drm_i915_private *dev_priv,
> > -				      struct intel_ring_buffer *ring)
> > +static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
> > +				       struct intel_ring_buffer *ring)
> >  {
> > -	u32 completed_seqno;
> > -	u32 acthd;
> > +	u32 completed_seqno = ring->get_seqno(ring, false);
> > +	u32 acthd = intel_ring_get_active_head(ring);
> > +	struct drm_i915_gem_request *request;
> > +
> > +	list_for_each_entry(request, &ring->request_list, list) {
> > +		if (i915_seqno_passed(completed_seqno, request->seqno))
> > +			continue;
> >  
> > -	acthd = intel_ring_get_active_head(ring);
> > -	completed_seqno = ring->get_seqno(ring, false);
> > +		i915_set_reset_status(ring, request, acthd);
> > +	}
> > +}
> 
> Indeed the fix in the gem reset code is a bit simpler than what I've
> feared. We still have fairly tricky code which depends upon that implicit
> reference in non-obvious ways. So I still think Mika's refcount patch with
> the comments updated is the better approach.

That batch_obj only exists for GPU hang accounting. It seems pointless
to make everything else more complicated. If you really wanted to
simplify it, you could store the actual batch offset+length rather than
a pointer, which would be even safer.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre



More information about the Intel-gfx mailing list