[Intel-gfx] [PATCH 3/3] drm/i915: Prefer random replacement before eviction search

Chris Wilson chris at chris-wilson.co.uk
Wed Jan 11 09:37:55 UTC 2017


On Wed, Jan 11, 2017 at 09:47:41AM +0200, Joonas Lahtinen wrote:
> On Tue, 2017-01-10 at 21:55 +0000, Chris Wilson wrote:
> > Performing an eviction search can be very, very slow especially for a
> > range restricted replacement. For example, a workload like
> > gem_concurrent_blit will populate the entire GTT and then cause aperture
> > thrashing. Since the GTT is a mix of active and inactive tiny objects,
> > we have to search through almost 400k objects before finding anything
> > inside the mappable region, and as this search is required before every
> > operation performance falls off a cliff.
> > 
> > Instead of performing the full search, we do a trial replacement of the
> > node at a random location fitting the specified restrictions. We lose
> > the strict LRU property of the GTT in exchange for avoiding the slow
> > search (several orders of magnitude improvement in runtime for
> > gem_concurrent_blit 4KiB-global-gtt, e.g. from 5000s to 20s). The loss of
> > LRU replacement is (later) mitigated firstly by only doing replacement if
> > we find no free space and secondly by execbuf doing a PIN_NONBLOCK search
> > first before
> > it starts thrashing (i.e. the random replacement will only occur from the
> > already inactive set of objects).
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> 
> <SNIP>
> 
> > +static u64 random_offset(u64 start, u64 end, u64 len, u64 align)
> > +{
> 
> The usual GEM_BUG_ON dance to make sure the inputs make some sense. Or
> are you relying on the upper level callers?

It was static and the callers were checking, but yeah might as well catch
them whilst we think about it.
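
Something like this, as a rough sketch (the exact checks and which random
helper to use are still open; get_random_long()/div64_u64_rem() here are
just placeholders):

	static u64 random_offset(u64 start, u64 end, u64 len, u64 align)
	{
		u64 range, addr;

		/* Caller must hand us a range that can hold the request */
		GEM_BUG_ON(start >= end);
		GEM_BUG_ON(len > end - start);

		if (align == 0)
			align = I915_GTT_MIN_ALIGNMENT;
		GEM_BUG_ON(!is_power_of_2(align));

		/* Catch the underflow you point at below: no aligned slot fits */
		GEM_BUG_ON(round_up(start, align) > round_down(end - len, align));

		range = round_down(end - len, align) - round_up(start, align);
		if (range) {
			addr = get_random_long();
			div64_u64_rem(addr, range, &addr);
			start += addr;
		}

		return round_up(start, align);
	}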
 
> > +	u64 range, addr;
> > +
> > +	if (align == 0)
> > +		align = I915_GTT_MIN_ALIGNMENT;
> > +
> > +	range = round_down(end - len, align) - round_up(start, align);
> 
> For example, this may cause an odd result.
> 
> > @@ -3629,6 +3655,16 @@ int i915_gem_gtt_insert(struct i915_address_space *vm,
> >  	if (err != -ENOSPC)
> >  		return err;
> >  
> > +	/* No free space, pick a slot at random */
> > +	err = i915_gem_gtt_reserve(vm, node,
> > +				   size,
> > +				   random_offset(start, end, size, alignment),
> 
> I'd pull this to a line above just to make it more humane to read.

> > +				   color,
> > +				   flags);
> > +	if (err != -ENOSPC)
> > +		return err;
> > +
> > +	/* Randomly selected placement is pinned, do a search */
> >  	err = i915_gem_evict_something(vm, size, alignment, color,
> >  				       start, end, flags);
> >  	if (err)
> 
> I'm a bit unsure why it would make such a big difference, but if you've
> been running the numbers. Code itself is all good, so this is;


The pathological case we have is

	|<-- 256 MiB aperture (64k objects) -->||<-- 1792 MiB unmappable (448k objects) -->|

Now imagine that the eviction LRU is ordered top-down (just because
pathology meets real life), and that we need to evict an object to make
room inside the aperture. The eviction scan then has to walk past 448k
list entries before it finds one within range. And now imagine that it
has to repeat that search for a new hole for every byte of the memcpy,
for several simultaneous clients.
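
In numbers: the aperture holds 256 MiB / 4 KiB = 64k objects against
1792 MiB / 4 KiB = 448k objects outside it, so with that worst-case
ordering every aperture fault walks ~448k list entries before it even
reaches an in-range candidate, whereas the random trial should only have
to touch the vma overlapping the one slot it picked.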

If there are a few holes in the unmappable region, we also have a
similar problem with hole skipping inside the drm_mm range search. That
will be mitigated by using DRM_MM_INSERT_LOW, but only once we have that
support in drm_mm. Right now, the drm_mm search also has to walk the
hole list in MRU order, rejecting all the holes above the aperture.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

