[Intel-gfx] [CI 13/19] drm/i915: Remove (struct_mutex) locking for busy-ioctl

Daniel Vetter daniel at ffwll.ch
Fri Aug 5 20:12:44 UTC 2016


On Fri, Aug 05, 2016 at 08:30:42PM +0100, Chris Wilson wrote:
> On Fri, Aug 05, 2016 at 09:08:34PM +0200, Daniel Vetter wrote:
> > On Fri, Aug 05, 2016 at 10:14:18AM +0100, Chris Wilson wrote:
> > > By applying the same logic as for wait-ioctl, we can query whether a
> > > request has completed without holding struct_mutex. The biggest impact
> > > system-wide is removing the flush_active and the contention that causes.
> > > 
> > > Testcase: igt/gem_busy
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Akash Goel <akash.goel at intel.com>
> > > Reviewed-by: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem.c | 131 +++++++++++++++++++++++++++++++---------
> > >  1 file changed, 101 insertions(+), 30 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > > index ceb00970b2da..b99d64bfb7eb 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > > @@ -3736,49 +3736,120 @@ i915_gem_object_ggtt_unpin_view(struct drm_i915_gem_object *obj,
> > >  	i915_vma_unpin(i915_gem_obj_to_ggtt_view(obj, view));
> > >  }
> > >  
> > > +static __always_inline unsigned __busy_read_flag(unsigned int id)
> > > +{
> > > +	/* Note that we could alias engines in the execbuf API, but
> > > +	 * that would be very unwise as it prevents userspace from
> > > +	 * fine control over engine selection. Ahem.
> > > +	 *
> > > +	 * This should be something like EXEC_MAX_ENGINE instead of
> > > +	 * I915_NUM_ENGINES.
> > > +	 */
> > > +	BUILD_BUG_ON(I915_NUM_ENGINES > 16);
> > > +	return 0x10000 << id;
> > > +}
> > > +
> > > +static __always_inline unsigned int __busy_write_id(unsigned int id)
> > > +{
> > > +	return id;
> > > +}
> > > +
> > > +static __always_inline unsigned
> > > +__busy_set_if_active(const struct i915_gem_active *active,
> > > +		     unsigned int (*flag)(unsigned int id))
> > > +{
> > > +	/* For more discussion about the barriers and locking concerns,
> > > +	 * see __i915_gem_active_get_rcu().
> > > +	 */
> > > +	do {
> > > +		struct drm_i915_gem_request *request;
> > > +		unsigned int id;
> > > +
> > > +		request = rcu_dereference(active->request);
> > > +		if (!request || i915_gem_request_completed(request))
> > > +			return 0;
> > > +
> > > +		id = request->engine->exec_id;
> > > +
> > > +		/* Check that the pointer wasn't reassigned and overwritten. */
> > 
> > cf. our discussion in active_get_rcu - there's no fence_get_rcu in sight
> > anywhere here, hence this needs an smp_rmb().
> 
> I toyed with smp_rmb().
> 
> The rcu_dereference() followed by rcu_access_pointer() is ordered.
> 
> So I was back to dancing around "were the dependent reads ordered by
> the first rcu_dereference() ordered in front of the second access, which
> was itself ordered after the first?" I probably should have stuck in the
> smp_rmb() and stopped worrying - it is still going to be cheaper than
> the refcount traffic.

It's the read of exec_id vs. the 2nd read of request which isn't ordered,
and which we want to be ordered to ensure we read the right engine id (and
not some bogus thing because the request was recycled meanwhile). And I
think the smp_rmb() is indeed required in there:

1. first active->request lookup
2. 2nd active->request lookup (the compiler/cpu is allowed to do that; I
think it could even reorder it ahead of 1 since it's not a dependent read)

<- gpu completes request, evil other thread does all the cleanup & recycling
with a new bogus engine

3. sample engine->exec_id
4. bail out of the loop since the requests looked up in 1 & 2 match.
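The race above can be modelled in userspace with C11 atomics - a
hypothetical sketch, not the kernel code: here atomic_thread_fence with
memory_order_acquire plays the role of smp_rmb(), forcing the exec_id
read to complete before the re-check of the request pointer, so a
recycled request cannot hand back a stale engine id that still passes
the pointer comparison. All names (struct request, sample_busy, etc.)
are illustrative stand-ins for the kernel's structures:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for drm_i915_gem_request: just the fields the
 * busy check samples. */
struct request {
	unsigned int exec_id;
	_Atomic int completed;
};

/* Stand-in for active->request (an RCU-protected pointer). */
static _Atomic(struct request *) active;

/* Mirrors __busy_read_flag() from the patch: read flags live in the
 * upper 16 bits, indexed by engine exec_id. */
static unsigned int busy_read_flag(unsigned int id)
{
	return 0x10000u << id;
}

static unsigned int sample_busy(void)
{
	for (;;) {
		/* 1st lookup (rcu_dereference analogue). */
		struct request *req =
			atomic_load_explicit(&active, memory_order_consume);
		unsigned int id;

		if (!req || atomic_load_explicit(&req->completed,
						 memory_order_acquire))
			return 0;

		id = req->exec_id;

		/* smp_rmb() analogue: the exec_id load above may not be
		 * reordered after the pointer re-read below.  Without it,
		 * the re-read could be hoisted and we could report the
		 * engine of a recycled request. */
		atomic_thread_fence(memory_order_acquire);

		/* 2nd lookup (rcu_access_pointer analogue). */
		if (req == atomic_load_explicit(&active,
						memory_order_relaxed))
			return busy_read_flag(id);
		/* Pointer was reassigned under us: retry. */
	}
}
```

Single-threaded this always takes the happy path, but the fence is what
makes the retry loop sound when another thread recycles the request
between steps 1 and 3.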

> > Also nitpick: The two
> > rcu_dereference(active->request) calls feel a bit silly. If we move the
> > first in front of the loop, and update the local request pointer (using a
> > tmp) it would look tidier, and we could even move the loop termination
> > condition into the while () check (and move the return flag(id) to the
> > end of the function).
> 
> I was quite content with only having to think of one phase through the
> loop and not worry about state being carried forward.
> 
>  __busy_set_if_active(const struct i915_gem_active *active,
>                      unsigned int (*flag)(unsigned int id))
>  {
> +       struct drm_i915_gem_request *request;
> +       unsigned int id;
> +
>         /* For more discussion about the barriers and locking concerns,
>          * see __i915_gem_active_get_rcu().
>          */
> +       request = rcu_dereference(active->request);
>         do {
> -               struct drm_i915_gem_request *request;
> -               unsigned int id;
> +               struct drm_i915_gem_request *tmp;
>  
> -               request = rcu_dereference(active->request);
>                 if (!request || i915_gem_request_completed(request))
>                         return 0;
>  
>                 id = request->engine->exec_id;
>  
>                 /* Check that the pointer wasn't reassigned and overwritten. */
> -               if (request == rcu_access_pointer(active->request))
> -                       return flag(id);
> +               tmp = rcu_dereference(active->request);
> +               if (tmp == request)
> +                       break;
> +
> +               request = tmp;
>         } while (1);
> +
> +       return flag(id);
>  }
> 
> is also not as well optimised by gcc, apparently.

Hm yeah, underwhelming. I'm ok with either I guess.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

