[Intel-gfx] [PATCH 08/13] drm/i915: Eliminate lots of iterations over the execobjects array

Chris Wilson chris at chris-wilson.co.uk
Mon Apr 10 12:17:39 UTC 2017


On Tue, Apr 04, 2017 at 05:57:34PM +0300, Joonas Lahtinen wrote:
> On ke, 2017-03-29 at 16:56 +0100, Chris Wilson wrote:
> > The major scaling bottleneck in execbuffer is the processing of the
> > execobjects. Creating an auxiliary list is inefficient when compared to
> > using the execobject array we already have allocated.
> > 
> > Reservation is then split into phases. As we look up the VMA, we
> > try to bind it back into its active location. Only if that fails do
> > we add it to the unbound list for phase 2. In phase 2, we try to add
> > all those objects that could not fit into their previous location,
> > with fallback to retrying all objects and evicting the VM in case of
> > severe fragmentation. (This is the same as before, except that phase
> > 1 is now done inline with looking up the VMA to avoid an iteration
> > over the execobject array. In the ideal case, we eliminate the
> > separate reservation phase.) During the reservation phase, we only
> > evict from the VM between passes (rather than on every attempt to
> > fit a new VMA, as is done currently). In testing with Unreal
> > Engine's Atlantis demo, which stresses the eviction logic on gen7
> > class hardware, this speeds up the framerate by a factor of 2.
> > 
> > The second loop amalgamation is between move_to_gpu and move_to_active.
> > As we always submit the request, even if incomplete, we can use the
> > current request to track active VMAs as we perform the flushes and
> > synchronisation required.
> > 
> > The next big advancement is to avoid copying back to the user any
> > execobjects and relocations that are not changed.
> > 
> > v2: Add a Theory of Operation spiel.
> > v3: Fall back to slow relocations in preparation for flushing userptrs.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> 
> <SNIP>

> > +	if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0 &&
> > +	    (vma->node.start + vma->node.size - 1) >> 32)
> 
> upper_32_bits for clarity?

Not sure. I'd rather keep it as is for the time being and think of a
macro for this and the one in i915_gem_gtt.c.
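For illustration, such a macro could look something like this (the name
is made up here, not something the patch or i915_gem_gtt.c defines):

#define range_crosses_4G(start, size) \
	(upper_32_bits((start) + (size) - 1) != 0)

	if ((entry->flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS) == 0 &&
	    range_crosses_4G(vma->node.start, vma->node.size))

upper_32_bits() is the existing kernel helper, so this is equivalent to
the open-coded shift above.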

> > +static void
> > +eb_pin_vma(struct i915_execbuffer *eb,
> > +	   struct drm_i915_gem_exec_object2 *entry,
> > +	   struct i915_vma *vma)
> > +{
> > +	u64 flags;
> > +
> > +	flags = vma->node.start;
> 
> I'd be more comfortable if some mask was applied here.
> 
> Or at least GEM_BUG_ON(flags & BAD_FLAGS);

BUILD_BUG_ON() already guards against the bits mixing.
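The guard amounts to something like the sketch below (the flag names
are made up for illustration, not the ones used in the patch): since
vma->node.start is page aligned, any pin flag that stays below
PAGE_SIZE cannot collide with the address bits, and BUILD_BUG_ON()
rejects a violating flag definition at compile time.

#define EXAMPLE_PIN_A	BIT(0)
#define EXAMPLE_PIN_B	BIT(1)
#define EXAMPLE_PIN_ALL	(EXAMPLE_PIN_A | EXAMPLE_PIN_B)

static inline u64 pack_node_start(u64 node_start, u64 pin_flags)
{
	/* No flag may overlap the page-aligned address bits. */
	BUILD_BUG_ON(EXAMPLE_PIN_ALL & ~(u64)(PAGE_SIZE - 1));
	return node_start | pin_flags;
}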

> > +	if (entry->flags & EXEC_OBJECT_NEEDS_FENCE) {
> > +		ret = i915_vma_get_fence(vma);
> > +		if (ret)
> > +			return ret;
> > +
> > +		if (i915_vma_pin_fence(vma))
> > +			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
> > +	}
> 
> Smells like duplicate code with eb_vma_pin.

Close, but the order is intentionally different. :|
Earlier, we don't take the error immediately and only fail if the result
doesn't match our requirements. This time, now that we are forced to
bind the vma, we do want to check each step and unwind on failure.
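Roughly, the ordering difference is (a sketch only, function name made
up, not the exact code from the patch):

static int reserve_vma_slow(struct i915_vma *vma,
			    struct drm_i915_gem_exec_object2 *entry,
			    u64 pin_flags)
{
	int err;

	err = i915_vma_pin(vma, entry->pad_to_size, entry->alignment, pin_flags);
	if (err)
		return err;

	if (entry->flags & EXEC_OBJECT_NEEDS_FENCE) {
		err = i915_vma_get_fence(vma);
		if (err) {
			/* Unlike the optimistic path, unwind the pin on failure. */
			i915_vma_unpin(vma);
			return err;
		}

		if (i915_vma_pin_fence(vma))
			entry->flags |= __EXEC_OBJECT_HAS_FENCE;
	}

	entry->flags |= __EXEC_OBJECT_HAS_PIN;
	return 0;
}

whereas eb_pin_vma() never returns an error; it simply leaves the vma
for the reserve phase if the optimistic pin doesn't give us what the
entry needs.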

> > +	return __get_user(c, end - 1);
> 
> What's the point in this final check?

There's no guarantee that the loop triggered a read on each page, so we
have to do a second read on the last byte of the address range to be
sure.
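In other words (a simplified sketch of that check; the function name is
made up):

static int prefault_reloc_range(const char __user *addr, unsigned long size)
{
	const char __user *end = addr + size;
	char c;

	/* addr need not be page aligned, so stepping by PAGE_SIZE may
	 * never land on the page holding the tail of the range...
	 */
	for (; addr < end; addr += PAGE_SIZE) {
		int err = __get_user(c, addr);
		if (err)
			return err;
	}

	/* ...hence the extra read of the very last byte. */
	return __get_user(c, end - 1);
}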

> > +static void eb_export_fence(struct drm_i915_gem_object *obj,
> > +			    struct drm_i915_gem_request *req,
> > +			    unsigned int flags)
> > +{
> > +	struct reservation_object *resv = obj->resv;
> > +
> > +	/* Ignore errors from failing to allocate the new fence, we can't
> > +	 * handle an error right now. Worst case should be missed
> > +	 * synchronisation leading to rendering corruption.
> > +	 */
> 
> Worthy DRM_DEBUG?

I think the oomkiller emanating from this spot will be instructive
enough. At some point in the future, when we start using ww_mutex for
serializing the objects between execbuf (rather than struct_mutex), we
should be able to do the reservation early and so catch an error before
we commit.

> > @@ -1155,10 +1524,33 @@ eb_move_to_gpu(struct i915_execbuffer *eb)
> >  		}
> >  
> >  		ret = i915_gem_request_await_object
> > -			(eb->request, obj, vma->exec_entry->flags & EXEC_OBJECT_WRITE);
> > +			(eb->request, obj, entry->flags & EXEC_OBJECT_WRITE);
> >  		if (ret)
> >  			return ret;
> > +
> > +skip_flushes:
> > +		obj->base.write_domain = 0;
> > +		if (entry->flags & EXEC_OBJECT_WRITE) {
> > +			obj->base.read_domains = 0;
> > +			if (!obj->cache_dirty && gpu_write_needs_clflush(obj))
> > +				obj->cache_dirty = true;
> > +			intel_fb_obj_invalidate(obj, ORIGIN_CS);
> > +		}
> > +		obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
> > +
> > +		i915_vma_move_to_active(vma, eb->request, entry->flags);
> > +		__eb_unreserve_vma(vma, entry);
> > +		vma->exec_entry = NULL;
> 
> This seems like a bugfix hunk lost to refactoring patch.

This one is just more loop combining.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

