[Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

Mon Jul 2 18:04:48 CEST 2012

On Sun, 1 Jul 2012 12:41:19 +0200
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky <ben at bwidawsk.net> wrote:
> > On Tue, 26 Jun 2012 23:08:50 +0200
> > Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> >
> >> Having this early check in intel_ring_begin doesn't buy us anything,
> >> since we'll be calling into wait_request in the usual case already
> >> anyway. In the corner case of not waiting for free space using the
> >> last_retired_head we simply need to do the same check, too.
> >>
> >> With these changes we'll only ever get an -EIO from intel_ring_begin
> >> if the gpu has truely been declared dead.
> >>
> >> v2: Don't conflate the change to prevent intel_ring_begin from returning
> >> a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
> >> just prep work to avoid returning -EAGAIN in callsites that can't handle
> >> syscall restarting.
> >
> > I'm really scared by this change. It's diffuclt to review because there
> > are SO many callers of intel_ring_begin, and figuring out if they all
> > work in the wedged case is simply too difficult. I've yet to review the
> > rest of the series, but it may make more sense to put this change last
> > perhaps?
> 
> Well, this patch doesn't really affect much what error codes the
> callers get - we'll still throw both -EGAIN and -EIO at them (later on
> patches will fix this).
> 
> What this patch does though is prevent us from returning too many
> -EIO. Imagine the gpu died and we've noticed already (hence
> dev_priv->mm.wedged is set), but some process is stuck churning
> through the execbuf ioctl, holding dev->struct_mutex. While processing
> the execbuf we call intel_ring_begin to emit a few commands. Now
> usually, even when the gpu is dead, there is enough space in the ring
> to do so, which allows us to complete the execbuf ioctl and then later
> on we can block properly when trying to grab the mutex in the next
> ioctl for the gpu reset work handler to complete.

That in itself is a pretty big change, don't you think? It seems rather
strange and dangerous to modify HW (which we will if we allow execbuf to
continue when we write the tail pointer). I think we want some way to
not write the tail ptr in such a case. Now you might respond, well, who
cares about writes? But this imposes a pretty large restriction on any
code that can't work well after the GPU is hung.

I see the irony. I'm complaining that you can make GPU hangs unstable,
and the patch series is fixing GPU reset. Call it paranoia, it still
seems unsafe to me, and makes us have to think a bit more whenever
adding any code. Am I over-thinking this?

> 
> But thanks to that wedged check in intel_ring_begin we'll instead
> return -EIO, despite the fact that later on we could successfully
> reset the gpu. Returning -EIO forces the X server to fall back to s/w
> rendering and disabling dri2, and in the case of a 3d compositor
> usually results in a abort. After this patch we can still return -EIO
> if the gpu is dead but the reset work hasn't completed yet, but only
> so if the ring is full (which in many cases is unlikely).
> 
> Cheers, Daniel
> 
> >
> >>
> >> Signed-Off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> >> ---
> >>  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
> >>  1 file changed, 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> index f30a53a..501546e 100644
> >> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >> @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
> >>  int intel_ring_begin(struct intel_ring_buffer *ring,
> >>                    int num_dwords)
> >>  {
> >> -     struct drm_i915_private *dev_priv = ring->dev->dev_private;
> >>       int n = 4*num_dwords;
> >>       int ret;
> >>
> >> -     if (unlikely(atomic_read(&dev_priv->mm.wedged)))
> >> -             return -EIO;
> >> -
> >>       if (unlikely(ring->tail + n > ring->effective_size)) {
> >>               ret = intel_wrap_ring_buffer(ring);
> >>               if (unlikely(ret))
> >
> >
> >
> > --
> > Ben Widawsky, Intel Open Source Technology Center
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> 
> 

-- 
Ben Widawsky, Intel Open Source Technology Center