[Intel-gfx] [PATCH 07/32] drm/i915: Store the reset counter when constructing a request

Chris Wilson chris at chris-wilson.co.uk
Wed Dec 16 02:19:31 PST 2015


On Wed, Dec 16, 2015 at 10:44:19AM +0100, Daniel Vetter wrote:
> On Fri, Dec 11, 2015 at 11:33:03AM +0000, Chris Wilson wrote:
> > As the request is only valid during the same global reset epoch, we can
> > record the current reset_counter when constructing the request and reuse
> > it when waiting upon that request in future. This removes a very hairy
> > atomic check serialised by the struct_mutex at the time of waiting and
> > allows us to transfer those waits to a central dispatcher for all
> > waiters and all requests.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> "global reset epoch" got me thinking about what the implications for TDR
> are. Now the current TDR patches iirc are still somewhat tacked on the
> side of the existing reset handling, and so don't end up incrementing the
> reset counter.
> 
> But I don't agree with that design decision, and I think any reset must
> increment the counter (and do the reset-in-progress flag dance), since
> that's the only way to guarantee that waiters will get off locks. Even
> though we have fewer of those with every kernel release.
> 
> But then there's the problem that at least some requests will still
> be valid after the reset, and those would break with this design here.
> That can easily be fixed by incrementing the reset epoch for every
> request we decide should keep running (or which gets
> re-scheduled/emitted for another attempt), and doing that explicitly
> seems like a good idea.
> 
> The only implication I see is that for per-ring reset we might want to go
> to a per-engine reset epoch counter.

Yes, I think per-engine epochs will be required. (I've been trying to
choose my words carefully so that it is clear that this only applies to
today and not to a TDR future!)
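
Something like this, perhaps (purely a sketch to illustrate the idea;
the engine->reset_epoch field does not exist today):

  struct intel_engine_cs {
          ...
          /* bumped for every reset of this engine (hypothetical) */
          unsigned reset_epoch;
  };

  /* at request construction, instead of the global counter */
  req->reset_counter = READ_ONCE(req->ring->reset_epoch);

  /* in __i915_wait_request() */
  if (req->reset_counter != READ_ONCE(req->ring->reset_epoch))
          return -EAGAIN; /* this engine was reset behind our back */

Requests we decide should keep running would then simply pick up the
new per-engine epoch when they are regenerated.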

> So given all that I think this is a solid idea. But one comment below as a
> fallout.
> 
> > ---
> >  drivers/gpu/drm/i915/i915_drv.h         |  2 +-
> >  drivers/gpu/drm/i915/i915_gem.c         | 40 +++++++++++----------------------
> >  drivers/gpu/drm/i915/intel_display.c    |  7 +-----
> >  drivers/gpu/drm/i915/intel_lrc.c        |  7 ------
> >  drivers/gpu/drm/i915/intel_ringbuffer.c |  6 -----
> >  5 files changed, 15 insertions(+), 47 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 1043ddd670a5..f30c305a6889 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -2178,6 +2178,7 @@ struct drm_i915_gem_request {
> >  	/** On Which ring this request was generated */
> >  	struct drm_i915_private *i915;
> >  	struct intel_engine_cs *ring;
> > +	unsigned reset_counter;
> >  
> >  	 /** GEM sequence number associated with the previous request,
> >  	  * when the HWS breadcrumb is equal to this the GPU is processing
> > @@ -3059,7 +3060,6 @@ void __i915_add_request(struct drm_i915_gem_request *req,
> >  #define i915_add_request_no_flush(req) \
> >  	__i915_add_request(req, NULL, false)
> >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > -			unsigned reset_counter,
> >  			bool interruptible,
> >  			s64 *timeout,
> >  			struct intel_rps_client *rps);
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 27e617b76418..b17cc0e42a4f 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -1214,7 +1214,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> >  /**
> >   * __i915_wait_request - wait until execution of request has finished
> >   * @req: duh!
> > - * @reset_counter: reset sequence associated with the given request
> >   * @interruptible: do an interruptible wait (normally yes)
> >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> >   *
> > @@ -1229,7 +1228,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> >   * errno with remaining time filled in timeout argument.
> >   */
> >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > -			unsigned reset_counter,
> >  			bool interruptible,
> >  			s64 *timeout,
> >  			struct intel_rps_client *rps)
> > @@ -1288,7 +1286,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> >  
> >  		/* We need to check whether any gpu reset happened in between
> >  		 * the caller grabbing the seqno and now ... */
> > -		if (reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> 
> READ_ONCE(req->reset_counter) as future-proofing maybe for TDR? Or
> punt that to TDR? Could be confusing to READ_ONCE something that's
> (with the current code at least) invariant.

Right, my plan was to solve this in TDR! At the moment, I am thinking
that we generate a fresh request for each one we resubmit (which
requires patching the seqno already embedded in the ring before
restarting the ring), but otherwise we can just duplicate the contents
of the original request.
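
In pseudocode, something along these lines (hand-wavy, all of these
helpers are made up at this point):

  /* for every request we decide should survive the reset */
  new = duplicate_request(old);              /* copy payload + context */
  new->reset_counter = i915_reset_counter(&new->i915->gpu_error);
  patch_ring_seqno(new->ring, old->seqno, new->seqno);
  resubmit_request(new);

so that the copy is valid in the new reset epoch.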
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

