[Intel-gfx] [PATCH 12/12] drm/i915: Cache the reset_counter for the request

Chris Wilson chris at chris-wilson.co.uk
Thu Nov 26 01:50:01 PST 2015


On Thu, Nov 26, 2015 at 10:21:38AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 12:17:08PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 10:12:53AM +0100, Daniel Vetter wrote:
> > > On Tue, Nov 24, 2015 at 09:43:32PM +0000, Chris Wilson wrote:
> > > > On Tue, Nov 24, 2015 at 05:43:22PM +0100, Daniel Vetter wrote:
> > > > > On Fri, Nov 20, 2015 at 12:43:52PM +0000, Chris Wilson wrote:
> > > > > > Instead of querying the reset counter before every access to the ring,
> > > > > > query it the first time we touch the ring, and do a final compare when
> > > > > > submitting the request. For correctness, we then need to sanitize how
> > > > > > the reset_counter is incremented to prevent broken submission and
> > > > > > waiting across resets, in the process fixing the persistent -EIO we
> > > > > > still see today on failed waits.
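
To make that description concrete, here is a rough sketch of the idea
(illustration only; the field and function names here are approximate and
not lifted from the patch): sample the reset counter once when the request
is created, then compare it against the live counter before the request is
committed to the ring.

	struct drm_i915_gem_request {
		/* ...existing members... */
		unsigned reset_counter;	/* sampled when we first touch the ring */
	};

	static int check_reset_since_request_alloc(struct drm_i915_gem_request *req,
						   struct i915_gpu_error *error)
	{
		/* Any reset (pending or completed) since the request was started
		 * means the ring contents are suspect: report it and let the
		 * caller back off and restart.
		 */
		if (req->reset_counter != atomic_read(&error->reset_counter))
			return -EAGAIN;

		return 0;
	}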
> > > > > 
> > > > > tbh that explanation went straight over my head. Questions:
> > > > > - Where do we wait again over a reset? With all the wakeups we should
> > > > >   catch them properly. I guess this needs the detailed scenario to help me
> > > > >   out, since I have dim memories that there is something wrong ;-)
> > > > 
> > > > TDR. In the future (and in past speculative patches) we have proposed
> > > > keeping requests over a reset and requeuing them. That is a complication
> > > > to the simplification of bailing out from the wait. What I have in mind
> > > > is that the recovery code has to fix up the request list somehow, though that
> > > > will be fun.
> > > 
> > > But even with requeueing the waiters need to bail out and restart the
> > > ioctl. So I don't see why requeueing of requests matters.
> > 
> > But why should the waiter even wake up? It is not holding any locks; all
> > it is doing is waiting for the request to complete. It is only those holding
> > a lock required to reset that we need to kick.
> 
> Because the request might not actually be completed when we do partial
> resets like with TDR. And then we'd have to undo this.

I don't think you have to undo. TDR and wait_request already don't
mesh well, so it needs to be solved either way. More than likely we
simply don't increment the global reset counter and just bump the seqno
for skipped requests.
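
Something along these lines, purely illustrative (the helper name is
invented here, and it assumes intel_write_status_page() as it exists in
the current ringbuffer code): the recovery path writes the skipped
request's seqno into the hardware status page so waiters observe it as
complete, and leaves the global reset counter untouched.

	static void skip_request(struct drm_i915_gem_request *req)
	{
		struct intel_engine_cs *ring = req->ring;

		/* Pretend the request completed: advance the breadcrumb in the
		 * status page past its seqno so any waiter sees it as done.
		 */
		intel_write_status_page(ring, I915_GEM_HWS_INDEX, req->seqno);

		/* The global reset counter is deliberately left alone. */
	}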
 
> > > And my question was about what exactly is broken when waiting over a
> > > reset? As long as we detect the reset and restart the ioctl we should pick
> > > up any kinds of state fixups the reset work has done and recover
> > > perfectly. Irrespective of whether the reset work has requeued some of the
> > > requests or not.
> > 
> > But... The reset handler clears the requests, so we cannot wait over a
> > reset. So waiting over a reset today is clearly a bug in the kernel.
> > 
> > Only in the future do we contemplate situations where a request may
> > survive a reset.
> 
> I think I phrased my question badly: What's broken with current waiters
> racing against resets? Currently they should all get woken and restart, and
> that seems to work. It does create some ioctl restart work when a reset
> happens, but I think that overhead is justified given that it allows us to
> have no knowledge of how exactly we reset hw and sw outside of reset
> functions.

Good. Because that doesn't change.
 
> > > > > - What precisely is the effect for waiters? I only see moving the
> > > > >   atomic_inc under the dev->struct_mutex, which shouldn't do a whole lot
> > > > >   for anyone. Plus not returning -EAGAIN when reset_in_progress, which
> > > > >   looks like it might result in missed wakeups and deadlocks with the
> > > > >   reset work.
> > > > 
> > > > At the moment, I can trivially wedge the machine. This patch fixes that.
> > > > The patch also ensures that we cannot raise unhandled errors from
> > > > wait-request (EIO/EAGAIN handling has to be explicitly added to the
> > > > caller).
> > > 
> > > Hm, how/where do we wedge the machine, and how does the fix work?
> > 
> > Being able to reset a previously wedged machine.
> 
> Ok, I think I get that problem now, after looking at gem_eio work.
> echo 1 > i915_wedged seems to have been broken for years. I haven't ever
> noticed that since I generally just kick the machine and restart when that
> happens.

It has been my get-out-of-jail card for a long time: manually grab the GPU
state and reset? Yes, please.
 
> The two-phase reset logic is some kind of hand-rolled special lock which
> makes sure nothing bad happens. As long as it's really two-phase.
> Essentially reset_in_progress means "everyone get off locks and stay off
> locks". With some small short-lived exceptions here and there (and well
> not so short-lived ones).
> 
> Of course we can just write reset code that doesn't deadlock, but given
> our history in this area I much prefer something that's more obviously
> correct.

No. It is nothing more than a crude leaky hack to allow you to call the ring
accessors from inside the reset handler.
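
For anyone following along, this is roughly how the two-phase state is
encoded today (simplified from the i915_drv.h of this vintage; treat it as
a sketch rather than the exact source): the low bit of the reset counter
marks a reset in progress, and a dedicated high bit marks a terminally
wedged GPU.

	#define I915_RESET_IN_PROGRESS_FLAG	1
	#define I915_WEDGED			(1 << 31)

	static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
	{
		return unlikely(atomic_read(&error->reset_counter)
				& I915_RESET_IN_PROGRESS_FLAG);
	}

	static inline bool i915_terminally_wedged(struct i915_gpu_error *error)
	{
		return atomic_read(&error->reset_counter) & I915_WEDGED;
	}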
 
> > > > > I /thought/ that the -EIO is from check_wedge in intel_ring_begin, but
> > > > > that part seems essentially unchanged.
> > > > 
> > > > For two reasons, we need to protect the access to the ring, and you
> > > > argued that (reporting of EIO from previous wedged GPUs) it was part of
> > > > the ABI. The other ABI that I think is important is the reporting of
> > > > EIO if the user explicitly waits on a request and it is incomplete (i.e.
> > > > wait_ioctl).
> > > 
> > > Well then I don't get where the -EIO is from. That leaves a truly wedged
> > > gpu as the cause, and for that case I don't get how it can happen by
> > > accident with the current code.
> > 
> > We fail a reset, or to recover. Happens often enough.
> 
> Ok, looking at gem_eio the only case where we put out an EIO and shouldn't
> seems to be set_domain_ioctl. Is that right?

Not really. That we see EIO from set-domain is just fallout from the igt
framework. The only things gem_eio tests are the ABI points where EIO is
expected to be seen; it doesn't contain the negative tests for all the paths
that should never generate EIO.
 
> Well there's all the modeset stuff, and that just became a bit worse with
> Maarten's recent changes. So probably we need the same -EIO eating there
> too.

I'm puzzled, since i915_reset() only handles GPU reset, not the display
engine. The modesetting code should only have to worry about its
interactions with the GPU (any MI commands it is waiting on, or
activity), both of which should be through requests.
 
> > No. It doesn't add anything more to the scheme. All it does is change
> > the order in which the second increment is applied so that after the
> > HW/SW reset happens we can start using it from inside the reset handler
> > to reinitialise the state.
> 
> Well we can make that work, but it still feels like trading considerable
> amounts of complexity to get rid of reload_in_reset. Well it's not really
> any complexity right now with the BKL we have, but I don't see things as
> that simple when/if we ever split things up.

I disagree (now and then). reload_in_reset is crude and adds fragility
to the reset mechanism that you have had to add extra code to handle.
 
> > From my perspective, by completing the requests following a hang, we
> > make the kernel and userspace much more fault tolerant.
> 
> Userspace must restart when we return -EAGAIN. That's also ABI. Which
> means yes it will only ever see 0 here.

Hahahaha. My point is that the kernel breaks quite often. It doesn't
matter that userspace survives if the kernel misbehaves and the screen
ends up frozen anyway. Keeping access to the screen despite GPU or
driver breakage is paramount for a robust system.
 
> What we can do is drop the check_wedge out of wait_request, since that
> should make a pile of things more robust. But imo the -EAGAIN really needs
> to stay. It doesn't hurt, userspace never sees it, and we keep
> encapsulation of how exactly sw/hw reset is done contained in the reset
> worker.

Why though? That is skipping over the question of "when is a request
complete" that I feel is central to the argument here. You still need
the code to decide whether to return 0 or EAGAIN, or else you still
cause a false WARN.
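
For illustration, the decision I mean at the tail of a wait could look
something like this (sketch only; it assumes the caller sampled
reset_counter before sleeping, and uses i915_gem_request_completed() as it
exists today):

	static int wait_complete_status(struct drm_i915_gem_request *req,
					unsigned reset_counter,
					struct i915_gpu_error *error)
	{
		/* The request finished before (or despite) any reset: success. */
		if (i915_gem_request_completed(req, true))
			return 0;

		/* A reset fired while we slept: the caller must drop its locks
		 * and restart the ioctl.
		 */
		if (reset_counter != atomic_read(&error->reset_counter))
			return -EAGAIN;

		return -ETIME;	/* plain timeout, no reset involved */
	}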
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

