[Intel-gfx] [PATCH 01/12] drm/i915: Assert mutex_is_locked on context lookup

Mon May 6 19:59:15 CEST 2013

On Mon, May 06, 2013 at 11:44:22AM +0200, Daniel Vetter wrote:
> On Mon, May 06, 2013 at 11:40:06AM +0200, Daniel Vetter wrote:
> > On Thu, May 02, 2013 at 01:27:32PM -0700, Jesse Barnes wrote:
> > > On Tue, 23 Apr 2013 23:15:29 -0700
> > > Ben Widawsky <ben at bwidawsk.net> wrote:
> > > 
> > > > Because our context refcounting doesn't grab a ref at lookup time, it is
> > > > unsafe to do so without the lock.
> > > > 
> > > > NOTE: We don't have an easy way to put the assertion in the lookup
> > > > function which is where this really belongs. Context switching is good
> > > > enough because it actually asserts even more correctness by protecting
> > > > the default_context.
> > > > 
> > > > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_gem_context.c | 2 ++
> > > >  1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > index a1e8ecb..411ace0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > > > @@ -444,6 +444,8 @@ int i915_switch_context(struct intel_ring_buffer *ring,
> > > >  	if (dev_priv->hw_contexts_disabled)
> > > >  		return 0;
> > > >  
> > > > +	BUG_ON(!mutex_is_locked(&dev_priv->dev->struct_mutex));
> > > > +
> > > >  	if (ring != &dev_priv->ring[RCS])
> > > >  		return 0;
> > > >  
> > > 
> > > Simple enough.
> > > 
> > > Reviewed-by: Jesse Barnes <jbarnes at virtuousgeek.org>
> > > 
> > > We usually do WARN_ONs for this stuff though, in case a user actually
> > > does hit it, it may not be fatal so why crash the machine?
> > > 
> > > But that's a minor distinction since we shouldn't hit this except in
> > > development anyway.
> > 
> > Well, since this is a patch for upstream the focus should very much be on
> > supporting bug reporters and not developers. And for bug reporters a BUG
> > is much more annoying than a WARN and greatly reduces the chances that
> > we'll get a bug report.
> 
> Some more details why a WARN massively beats a BUG for us: BUG kills the
> current process and ensures all locks are stuck. Usually that means X is
> dead and you can't vt-switch away to the console to take a quick look at
> dmesg.
> 
> Now even when all rendering is down the toilet due to the follow-up damage
> after the WARN and the gpu a zombie, there's a non-zero chance that
> vt-switch (or sw rendering in X) will work long enough to grab log files
> and debugfs data.
> 
> Hence the first rule to only use a BUG on if we have a guaranteed OOPS
> otherwise (which again will kill the process and make all locks stuck).
> -Daniel
> 
> > 
> > There are imo only very few cases where a BUG instead of a WARN is
> > justified:
> > - The kernel is _guaranteed_ to oops in the next few lines anyway, so a
> >   BUG_ON will help in readability of the backtrace. Note that checking 3
> >   different things for non-NULL in the same BUG actually reduces OOPS
> >   readability (with an oops you can at least reconstruct the faulting
> >   address and so probably the pointer). Also, this means the BUG should
> >   have a neat description of what exactly blew up.
> > 
> >   The "a few lines" part is just a guideline with some big exceptions. A
> >   prime example is refcount over/underflows since those will blow up, but
> >   only sometimes later (and usually no one will have a clue why).
> > 
> > - BUGs are justified if there's a potential security hole awaiting, e.g.
> >   when something in the userspace input validation has gone wrong.
> > 
> > - I'm wary of special error handling for WARNs, but if an early return
> >   (with an error code if possible) transforms a BUG into a WARN I'm in.
> >   But trying to fix up e.g. modeset state is usually futile, since we'll
> >   end up with a black/fuzzy/corrupted screen most likely anyway.
> > 
> > But the most important rule is: In case of doubt, just WARN, don't BUG.
> > 
> > [I know, I violate it sometimes, too.]
> > 
> > Patch applied with the s/BUG/WARN bikeshed.
> > -Daniel

Thanks for applying the patch, it's certainly better than what we have
currently.

Why I wanted a BUG: When you get a ref to an object without holding a
lock you get a potentially unsafe pointer (to which we will be writing).
If the context object memory is freed, and we write to it, we have a
potential to late scribble over <insert your file system of choice>
memory. There is probably a similar security implication there as well.
Many of us are used to, and capable of recovering from GPU hangs, but
less of us like to deal with FS recovery.

I actually believe all "get" code like this (backed with refcounts)
should BUG and not WARN.

-- 
Ben Widawsky, Intel Open Source Technology Center