[Intel-gfx] [PATCH 05/51] drm/i915: Add return code check to i915_gem_execbuffer_retire_commands()

Wed Feb 25 18:26:51 PST 2015

On Wed, Feb 25, 2015 at 11:17:00PM +0100, Daniel Vetter wrote:
> On Fri, Feb 13, 2015 at 11:48:14AM +0000, John.C.Harrison at Intel.com wrote:
> > From: John Harrison <John.C.Harrison at Intel.com>
> > 
> > For some reason, the i915_add_request() call in
> > i915_gem_execbuffer_retire_commands() was explicitly having its return code
> > ignored. The _retire_commands() function itself was 'void'. Given that
> > _add_request() can fail without dispatching the batch buffer, this seems odd.
> 
> I was so convinced we've had a commit somewhere explaining this, but
> apparently not.
> 
> The deal is that after the dispatch call we have the batch commit and
> there's no going back any more, which also means we can't return an error
> code from the ioctl. So if you return -EIO or -ENOMEM that's a pretty bad
> lie and you really have to ignore that error code.
> 
> Again I've tried to dig up the commit for that but that was lost in the
> maze of the past 5 years of changes. We've had piles of older approaches
> to deal with this issue:
> - Don't even emit a request, just mark objects as gpu dirty. Only when
>   waiting did we emit flushes and requests, which again again gave us a
>   context to return the error. This resulted in horrible latency since
>   flushes where wait too late and also all that book-keeping was not worth
>   it at all. Don't ask ;-)
> - Emit flushes right away, but if we fail to alloc the request set the
>   outstanding lazy request bit. The job of the check_olr function used in
>   waits was to notice that and retry the allocation.
> - Preallocate the request, but that still leaves the possibility that the
>   gpu dies. But since we've committed hangcheck will clean this up and we
>   can just ignore the -EIO.

It's actually worse since it's not just -EIO but also -EINTR, returned by
intel_ring_begin when we're thrashing the gpu a bit too badly with
requests. Which means we really need to guarantee that the request is
completed properly, eventually since it's not just for fatal gpu hangs.

Atm that's done by only clearing outstanding_lazy_request after we've
really emitted the request fully. That guarantees that even when parts of
the request emission to the ringbuf fails we'll retry on the next wait if
needed.

A possible fix to make this infallible would be to reserve some fixed
amount of ringbuf credit at request creation time and then consume it
here. Of course we'd need checks to make sure we never use more ringspace
than what we reserve. To avoid massive churn we could convert
I915_RING_FREE_SPACE into a variable and increase it enough when
allocating the request. And then reduce it again at the start of
add_request.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch