[Intel-gfx] [PATCH 12/12] drm/i915: Cache the reset_counter for the request

Chris Wilson chris at chris-wilson.co.uk
Wed Nov 25 04:17:08 PST 2015


On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > > Instead of querying the reset counter before every access to the ring,
> > > > query it the first time we touch the ring, and do a final compare when
> > > > submitting the request. For correctness, we need to then sanitize how
> > > > the reset_counter is incremented to prevent broken submission and
> > > > waiting across resets, in the process fixing the persistent -EIO we
> > > > still see today on failed waits.
> > > 
> > > tbh that explanation went straight over my head. Questions:
> > > - Where do we wait again over a reset? With all the wakeups we should
> > >   catch them properly. I guess this needs the detailed scenario to help me
> > >   out, since I have dim memories that there is something wrong ;-)
> > 
> > TDR. In the future (and in past speculative patches) we have proposed
> > keeping requests over a reset and requeuing them. That is a complication
> > to the simplification of bailing out from the wait. What I have in mind
> > is that the recovery code has to fix up the request list somehow, though
> > that will be fun.
> 
> But even with requeueing the waiters need to bail out and restart the
> ioctl. So I don't see why requeueing of requests matter.

But why should the waiter even wake up? It is not holding any locks; all
it is doing is waiting for the request to complete. It is only those
waiters holding a lock required by the reset that we need to kick.
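Hand-waving in C, the distinction I am drawing (a sketch using the
helpers we have today, not the patch itself):

	/* A waiter holding no locks can sleep straight through the reset;
	 * the reset completing its request is enough to wake it normally:
	 */
	wait_event(ring->irq_queue,
		   i915_gem_request_completed(req, false));

	/* Only a waiter the reset handler may block on, i.e. one holding
	 * struct_mutex, has to be kicked out early:
	 */
	wait_event(ring->irq_queue,
		   i915_gem_request_completed(req, false) ||
		   i915_reset_in_progress(&dev_priv->gpu_error));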
 
> And my question was about what exactly is broken when waiting over a
> reset? As long as we detect the reset and restart the ioctl we should pick
> up any kinds of state fixups the reset work has done and recover
> perfectly. Irrespective of whether the reset work has requeued some of the
> requests or not.

But... the reset handler clears the requests, so we cannot wait over a
reset. Waiting over a reset today is therefore clearly a bug in the kernel.

Only in the future do we contemplate situations where a request may
survive a reset.
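For reference, this is roughly what the cleanup does today, paraphrased
from memory: every outstanding request is retired, so nothing survives
for a waiter to keep sleeping on.

	static void i915_gem_reset_ring_cleanup(struct intel_engine_cs *ring)
	{
		/* Retire everything; afterwards the request_list is empty
		 * and any cached request reports as complete.
		 */
		while (!list_empty(&ring->request_list)) {
			struct drm_i915_gem_request *request;

			request = list_first_entry(&ring->request_list,
						   struct drm_i915_gem_request,
						   list);
			i915_gem_request_retire(request);
		}
	}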
 
> > > - What precisely is the effect for waiters? I only see moving the
> > >   atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
> > >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> > >   looks like it might result in missed wakeups and deadlocks with the
> > >   reset work.
> > 
> > At the moment, I can trivially wedge the machine. This patch fixes that.
> > The patch also ensures that we cannot raise unhandled errors from
> > wait-request (EIO/EAGAIN handling has to be explicitly added to the
> > caller).
> 
> Hm, how/where do we wedge the machine, and how does the fix work?

Being able to reset a previously wedged machine.
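That is, a failed reset currently wedges the machine terminally because
nothing ever clears I915_WEDGED again (IIRC). The intent of the hunk
below, as a simplified sketch rather than the exact patch:

	/* Forgive a previous failure before retrying the reset */
	atomic_clear_mask(I915_WEDGED, &error->reset_counter);

	ret = intel_gpu_reset(dev);
	if (ret) {
		/* Failed again: mark the GPU as terminally wedged */
		atomic_set_mask(I915_WEDGED, &error->reset_counter);
		return ret;
	}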

> > The wait-request interface still has the wait-seqno legacy of having to
> > acquire the reset_counter under the mutex before calling. That is quite
> > hairy and causes a major complication later where we want to amalgamate
> > waiters.
> 
> Yeah, that's a sensible change. But since I don't yet understand where
> exactly the current logic results in a wedged gpu I can't convince myself
> that the new one is ok.
> 
> > Re EAGAIN. No, it cannot result in missed wakeups since that is internal
> > to the wait_request function, nor can it result in new deadlocks with the
> > reset worker.
> 
> Yeah I think today with just struct_mutex it's impossible. But if we have
> more locks, the waiting thread could continue to grab more locks instead
> of bailing out straight away. And then the reset handler would be somewhat
> screwed since it isn't guaranteed to make forward progress.

So basically if you add a deadlock/livelock, it may deadlock/livelock.
 
> > > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > > that part seems essentially unchanged.
> > 
> > For two reasons: we need to protect the access to the ring, and you
> > argued that reporting EIO from previously wedged GPUs was part of the
> > ABI. The other ABI that I think is important is the reporting of EIO
> > if the user explicitly waits on a request and it is incomplete (i.e.
> > wait_ioctl).
> 
> Well then I don't get where the -EIO is from. That leaves a truly wedged
> gpu as the cause, and for that case I don't get how it can happen by
> accident with the current code.

We fail a reset, or fail to recover afterwards. It happens often enough.

> > > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > > > index ffd99d27fac7..5838882e10a1 100644
> > > > --- a/drivers/gpu/drm/i915/i915_drv.c
> > > > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > > > @@ -841,23 +841,31 @@ int i915_resume_legacy(struct drm_device *dev)
> > > >  int i915_reset(struct drm_device *dev)
> > > >  {
> > > >  	struct drm_i915_private *dev_priv = dev->dev_private;
> > > > +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> > > > +	unsigned reset_counter;
> > > >  	bool simulated;
> > > >  	int ret;
> > > >  
> > > >  	intel_reset_gt_powersave(dev);
> > > >  
> > > >  	mutex_lock(&dev->struct_mutex);
> > > > +	atomic_clear_mask(I915_WEDGED, &error->reset_counter);
> > > > +	reset_counter = atomic_inc_return(&error->reset_counter);
> > > 
> > > This publishes the reset-complete too early for the modeset code I think.
> > > At least it crucially relies upon dev->struct_mutex serializing
> > > everything in our driver, and I don't like to cement that assumption in
> > > even more.
> > 
> > Hmm? It is already cemented in, as the reset/continuation is ordered
> > with the struct_mutex. We have kicks in place for the modesetting code,
> > but we have not actually ordered that with anything inside the reset
> > recovery code. Judging by the few changes to i915_reset(), I am doubtful
> > that has actually been improved for atomic; it basically relies on the
> > display being unaffected by anything but pending work.
> 
> Currently the reset/continuation is ordered with the 2-phase atomic_inc of
> the reset counter, with the requirement that anyone who sees
> reset_in_progress hauls out of the kernel asap. That most of gem is
> serialized with struct_mutex is just a side-effect of the current gem
> locking. With the current design we could rework gem locking and it should
> all still work; with this approach we pretty much cement the BKL locking
> approach I think.

No. It doesn't add anything more to the scheme. All it does is change
the order in which the second increment is applied, so that after the
HW/SW reset happens we can start using the device from inside the reset
handler to reinitialise the state.
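To be concrete about the scheme (from memory, give or take naming):

	/* Low bit: odd counter = reset in progress; high bit: wedged */
	#define I915_RESET_IN_PROGRESS_FLAG	1
	#define I915_WEDGED			(1u << 31)

	/* Phase 1, from the error handler: even -> odd publishes
	 * "reset in progress" and everybody bails out of the kernel.
	 */
	atomic_inc(&error->reset_counter);

	/* ... perform the HW/SW reset ... */

	/* Phase 2: odd -> even publishes "reset complete". The only change
	 * here is that this now happens before we reinitialise the GEM
	 * state, so the recovery code itself can submit requests again.
	 */
	atomic_inc(&error->reset_counter);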
 
> > > Why do we need to move that atomic_inc again?
> > 
> > The inc is to acknowledge the reset and to allow us to begin using the
> > device again (inside the reset func). It is the equivalent of adding
> > another bit for reload_in_reset. I moved it to the beginning for my
> > convenience.
> 
> Ok, this needs a code comment I think.
> 
> But I thought that the crucial change was to move check_wedge from
> ring_begin to wait_for_space and so move it out from ever being hit from
> the reset code.

No. That has nothing to do with the reset code: by the time we ever
call intel_ring_begin() in the recovery code, all previous
waits/requests have completed, i.e. (excluding execlists bugs, ahem)
the rings should be empty.
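To spell out the ordering I am relying on (simplified, from memory):

	int i915_reset(struct drm_device *dev)
	{
		struct drm_i915_private *dev_priv = dev->dev_private;
		int ret;

		mutex_lock(&dev->struct_mutex);

		/* Completes and retires every outstanding request */
		i915_gem_reset(dev);

		ret = intel_gpu_reset(dev);	/* the actual HW reset */
		if (ret == 0) {
			/* The first intel_ring_begin() after the reset
			 * happens in here, with empty rings and no
			 * outstanding waits.
			 */
			ret = i915_gem_init_hw(dev);
		}

		mutex_unlock(&dev->struct_mutex);
		return ret;
	}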

> > > > @@ -1200,7 +1191,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > >  /**
> > > >   * __i915_wait_request - wait until execution of request has finished
> > > >   * @req: duh!
> > > > - * @reset_counter: reset sequence associated with the given request
> > > >   * @interruptible: do an interruptible wait (normally yes)
> > > >   * @timeout: in - how long to wait (NULL forever); out - how much time remaining
> > > >   *
> > > > @@ -1215,7 +1205,6 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
> > > >   * errno with remaining time filled in timeout argument.
> > > >   */
> > > >  int __i915_wait_request(struct drm_i915_gem_request *req,
> > > > -			unsigned reset_counter,
> > > >  			bool interruptible,
> > > >  			s64 *timeout,
> > > >  			struct intel_rps_client *rps)
> > > > @@ -1264,12 +1253,12 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
> > > >  
> > > >  		/* We need to check whether any gpu reset happened in between
> > > >  		 * the caller grabbing the seqno and now ... */
> > > > -		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
> > > > -			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
> > > > -			 * is truely gone. */
> > > > -			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
> > > > -			if (ret == 0)
> > > > -				ret = -EAGAIN;
> > > > +		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
> > > > +			/* As we do not requeue the request over a GPU reset,
> > > > +			 * if one does occur we know that the request is
> > > > +			 * effectively complete.
> > > > +			 */
> > > > +			ret = 0;
> > > >  			break;
> > > 
> > > This now no longer bails out straight away with -EAGAIN when a reset
> > > is in progress. I fear for the deadlocks.
> > 
> > You fear incorrectly here. Either we hold the struct_mutex in which case
> > the reset waits and we complete our work without needing anything else,
> > or we may want to take the struct_mutex to complete our work and so may
> > end up waiting for the reset to complete. Either way the state of this
> > request is the same as when the GPU reset actually occurs. There is
> > an inconsistency in that we don't mark it as complete though, so we
> > may try to wait on it again (only to find that the reset is in
> > progress and bail out again).
> 
> We only bail once per reset since we wait for resets to complete before
> acquiring dev->struct_mutex. On the modeset side we unfortunately can't do
> that since that code is in the drm core.
> 
> > The issue is not a potential deadlock (because that would be an already
> > existing ABBA in the code) but resources that can be depleted by
> > requests.
> 
> Besides deadlocks my other concern is that you encode reset behaviour
> outside of the reset code here. If we unconditionally EAGAIN here, then
> everyone will pick up whatever kind of fixup the reset code applied. Which
> means requeueing of requests is contained to the reset logic. But if we
> stop doing the unconditional EAGAIN then we have to fix up things here
> too.
> 
> On the flip side I don't see what's easier by returning 0 here instead of
> EAGAIN?

My fundamental argument for the last 5 years is that resets complete
requests, i.e. i915_wait_seqno/i915_wait_request should report 0 as the
object is no longer active on the GPU. Only in special circumstances do
we actually care that a reset happened (ABI), and rather than encode
that logic inside the core, it is better expressed as exceptions in
the caller.

e.g. set-to-domain works correctly even if the last request on the
object ended in a hang (and from that we can deduce that a very large
proportion of the API is also correct.)
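Concretely, the sort of caller-side exception I mean (hypothetical, not
lifted from this patch):

	/* wait_ioctl: the core wait reports 0 once a reset has completed
	 * the request; this ABI point must still report the hang itself.
	 */
	ret = __i915_wait_request(req, true, &args->timeout_ns, rps);
	if (ret == 0 &&
	    req->reset_counter != i915_reset_counter(&dev_priv->gpu_error))
		ret = -EIO; /* the wait ended in a GPU hang */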

From my perspective, by completing the requests following a hang, we
make the kernel and userspace much more fault tolerant.
 
> > > > @@ -2252,6 +2250,14 @@ static int wait_for_space(struct drm_i915_gem_request *req, int bytes)
> > > >  	if (ret)
> > > >  		return ret;
> > > >  
> > > > +	/* If the request was completed due to a GPU hang, we want to
> > > > +	 * error out before we continue to emit more commands to the GPU.
> > > > +	 */
> > > > +	ret = i915_gem_check_wedge(i915_reset_counter(&to_i915(engine->dev)->gpu_error),
> > > > +				   to_i915(engine->dev)->mm.interruptible);
> > > > +	if (ret)
> > > > +		return ret;
> > > 
> > > Don't we still have the problem with -EIO now here, just less likely than
> > > from intel_ring_begin?
> > 
> > What's the problem? intel_ring_begin() is supposed to return
> > -EIO/-EAGAIN.
> 
> Hm, I thought the entire problem was that ring_begin can return -EIO,
> since that's how mesa usually dies. But with your mail you're saying that
> we accidentally wedge the gpu, and that's the part I don't understand yet.

Nah, mesa dies all the time from unchecked errors elsewhere. It just
calls fprintf/exit on an error from intel_do_flush_locked(), so that is
the visible one.
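From memory, the offending path is roughly:

	/* mesa, paraphrased: any error from the flush is simply fatal */
	if (ret != 0) {
		fprintf(stderr, "intel_do_flush_locked failed: %s\n",
			strerror(-ret));
		exit(1);
	}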
By the way, how are you getting on reviewing my mesa patches to fix that?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

