[Intel-gfx] [PATCH 4/4] drm/i915: Opportunistically reduce flushing at execbuf

Sun Dec 14 23:55:32 PST 2014

On Sun, Dec 14, 2014 at 03:37:36PM -0800, Ben Widawsky wrote:
> On Sun, Dec 14, 2014 at 03:12:21PM +0200, Ville Syrjälä wrote:
> > On Sat, Dec 13, 2014 at 07:08:24PM -0800, Ben Widawsky wrote:
> > > If we're moving a bunch of buffers from the CPU domain to the GPU domain, and
> > > we've already blown out the entire cache via a wbinvd, there is nothing more to
> > > do.
> > > 
> > > With this and the previous patches, I am seeing a 3x FPS increase on a certain
> > > benchmark which uses a giant 2d array texture. Unless I missed something in the
> > > code, it should only effect non-LLC i915 platforms.
> > > 
> > > I haven't yet run any numbers for other benchmarks, nor have I attempted to
> > > check if various conformance tests still pass.
> > > 
> > > NOTE: As mentioned in the previous patch, if one can easily obtain the largest
> > > buffer and attempt to flush it first, the results would be even more desirable.
> > 
> > So even with that optimization if you only have tons of small buffers
> > that need to be flushed you'd still take the clflush path for every
> > single one.
> > 
> > How difficult would it to calculate the total size to be flushed first,
> > and then make the clflush vs. wbinvd decision base on that?
> > 
> 
> I'll write the patch and send it to Eero for test.
> 
> It's not hard, and I think that's a good idea as well. One reason I didn't put
> such code in this series is that moves away from a global DRM solution (and like
> I said in the cover-letter, I am fine with that). Implementing this, I think in
> the i915 code we'd just iterate through the BOs until we got to a certain
> threshold, then just call wbinvd() from i915 and not even both with drm_cache.
> You could also maybe try to shorcut if there are more than X buffers.

I don't mind an i915 specific solution (we have them already in many
places). So will wait for the results of this experiments before merging
more patches.

> However, for what you describe, I think it might make more sense to let
> userspace specify an execbuf flag to do the wbinvd(). Userspace can trivially
> determine such info, it prevents having to iterate through the buffers an extra
> time in the kernel.

If we keep the running tally of clflushing in the reservation step and
then use that in move_to_gpu we shouldn't need an additional loop. And
even if the estimate from the first loop is off a bit (e.g. due to
eviction to make space for new buffers) it won't matter - it's just an
optimization. Imo asking userspace to do this for the kernel doesn't make
much sense since the kernel already keeps track of domains.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch