[Intel-gfx] [PATCH 5/6] drm/i915: Implement GPU reset

Sun Sep 13 22:40:11 CEST 2009

On Sun, Sep 13, 2009 at 03:24:03PM -0400, Ben Gamari wrote:
> On Sun, Sep 13, 2009 at 11:21:35AM -0700, Jesse Barnes wrote:
> > On Sun, 13 Sep 2009 18:27:48 +0100
> > Daniel J Blueman <daniel.blueman at gmail.com> wrote:
> > 
> > > A couple of things just caught my eye while looking through the patch,
> > > perhaps to consider tweaking?
> > 
> > Thanks for looking.
> > 
> > > > +void i965_reset(struct drm_device *dev, u8 flags)
> > > > +{
> > > > + ? ? ? drm_i915_private_t *dev_priv = dev->dev_private;
> > > > + ? ? ? unsigned long timeout;
> > > > + ? ? ? u8 gdrst;
> > > > + ? ? ? bool need_display = true; //!(flags & (GDRST_RENDER |
> > > > GDRST_MEDIA));
> > > 
> > > As the kernel's coding style isn't C99, is it worth using /* foo */
> > > here to be compliant?
> > 
> > Or just drop it and add a comment about the other two.  We'd need to
> > experiment some more before trying to use them.
> 
> Yeah, I think this is the right thing to do.
> 
> > 
> > > > +#if 1
> > > > + ? ? ? ? ? ? ? DRM_ERROR("i915 struct_mutex lock is still held by
> > > > %s. Giving on up reset.\n", dev->struct_mutex.owner->task->comm);
> > > > + ? ? ? ? ? ? ? return;
> > > > +#else
> > > > + ? ? ? ? ? ? ? struct task_struct *task =
> > > > dev->struct_mutex.owner->task;
> > > > + ? ? ? ? ? ? ? DRM_ERROR("Killing process %d (%s) for holding i915
> > > > device mutex\n",
> > > > + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? task->pid, task->comm);
> > > > + ? ? ? ? ? ? ? force_sig(SIGILL, task);
> > > 
> > > Should be SIGKILL here?
> > 
> > We should just decide whether we want the warning here or the kill.
> > Ben, I assume you tested the warning mostly?  Killing a waiter will
> > probably just kill the unlucky task who happened to be waiting for the
> > hung batch; might be friendlier to let it continue, but it also makes
> > the reset less reliable.
> 
> We do need to decide what to do with the task that gave us the bad
> batch. This particular code, however, doesn't handle that case. By the
> time we get here we have already woken up the waiters with the wedged
> flag set so in theory no one should be holding the lock. In the event
> that someone is, we really have three options,
> 
> 1) Try waking it up again and hope it doesn't take it this time
> 2) Wait for it to release it (probably will never happen)
> 3) Kill the process
> 
> I opted to kill the process.

my version of the code doesn't actually do this and it managed fine.

where we'll be locked (and blocking) is essentially whenever we hit
i915_wait_request. now, that fucntion in it's sleeping loop does check
whether wedged is set, if so it EIO's out of there. So we set wedged (we
MUST do that first), then we wakeup all sleepers. Now, anyone sleeping
(and since they always sleep with the lock) will check the condition and
return EIO, where they're release the lock and dump that return back to
userland. Anything else means that something else is buggy.

such contention problems on openbsd i've not seen. The only other option
I can see is in wait_ring, where we should really also check for wedged
and if so cede the lock and try again, but that's a different problem,
really and has been a problem for $LONG_TIME

-0-
-- 
Nice boy, but about as sharp as a sack of wet mice.
		-- Foghorn Leghorn