[Intel-gfx] [PATCH v4] drm/i915: Optimistically spin for the request completion
Chris Wilson
chris at chris-wilson.co.uk
Mon Mar 23 02:09:53 PDT 2015
On Mon, Mar 23, 2015 at 09:31:38AM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > > Window, but rendering internally at 1080p] clients we can run in
> > > parallel whilst hitting 60fps. And then whether allowing multiple
> > > spinners helps or hinders.
> >
> > I was thinking of a nice easy test that could demonstrate any advantage
> > for spinning over waiting, and realised we already had such an igt. The
> > trick is that it has to generate sufficient GPU load to actually require
> > a wait, but not so much load that we can no longer see the impact of a
> > slow completion.
> >
> > I present igt/gem_exec_blt (modified to repeat the measurement and do an
> > average over several runs):
> >
> > Time to blt 16384 bytes x    1:  21.000µs -> 5.800µs
> > Time to blt 16384 bytes x    2:  11.500µs -> 4.500µs
> > Time to blt 16384 bytes x    4:   6.750µs -> 3.750µs
> > Time to blt 16384 bytes x    8:   4.950µs -> 3.375µs
> > Time to blt 16384 bytes x   16:   3.825µs -> 3.175µs
> > Time to blt 16384 bytes x   32:   3.356µs -> 3.000µs
> > Time to blt 16384 bytes x   64:   3.259µs -> 2.909µs
> > Time to blt 16384 bytes x  128:   3.083µs -> 3.095µs
> > Time to blt 16384 bytes x  256:   3.104µs -> 2.979µs
> > Time to blt 16384 bytes x  512:   3.080µs -> 3.089µs
> > Time to blt 16384 bytes x 1024:   3.077µs -> 3.040µs
> > Time to blt 16384 bytes x 2048:   3.127µs -> 3.304µs
> > Time to blt 16384 bytes x 4096:   3.279µs -> 3.265µs
>
> We probably need to revisit this when the scheduler lands - that one will
> want to keep a short queue and generally will block for some request to
> complete.
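Just to spell out what "spin" means in the numbers above: rather than
immediately going to sleep and paying the interrupt + wakeup latency, we
busy-wait on the request for a short, bounded window and only fall back
to the usual sleeping wait if it doesn't complete in time. Very roughly
(the helper name and the exact completion check are illustrative, so
treat this as a sketch of the idea rather than the patch itself):

static bool spin_for_request(struct drm_i915_gem_request *req)
{
	/* Bound the busy-wait so we burn at most ~2µs of CPU time */
	u64 timeout = local_clock() + 2 * NSEC_PER_USEC;

	while (!i915_gem_request_completed(req, true)) {
		if (need_resched() || local_clock() > timeout)
			return false; /* give up, take the sleeping wait */

		cpu_relax();
	}

	return true; /* completed without sleeping or an interrupt */
}

The need_resched() check is there so the spin never holds onto the CPU
when somebody else wants to run.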
Speaking of which, execlists! You may have noticed that I
surreptitiously chose hsw to avoid the execlists overhead...

I was messing around over the weekend looking at the submission overhead
on bdw-u:
            -nightly      +spin       +hax  execlists=0
   x1:      23.600µs   18.400µs   15.200µs      6.800µs
   x2:      19.700µs   16.500µs   15.900µs      5.000µs
   x4:      15.600µs   12.250µs   12.500µs      4.450µs
   x8:      13.575µs   11.000µs   11.650µs      4.050µs
   x16:     10.812µs    9.738µs    9.875µs      3.900µs
   x32:      9.281µs    8.613µs    9.406µs      3.750µs
   x64:      8.088µs    7.988µs    8.806µs      3.703µs
   x128:     7.683µs    7.838µs    8.617µs      3.647µs
   x256:     9.481µs    7.301µs    8.091µs      3.409µs
   x512:     5.579µs    5.992µs    6.177µs      3.561µs
   x1024:   10.093µs    3.963µs    4.187µs      3.531µs
   x2048:   11.497µs    3.794µs    3.873µs      3.477µs
   x4096:    8.926µs    5.269µs    3.813µs      3.461µs
The hax are to remove the extra atomic ops and spinlocks imposed by
execlists. Steady state seems to be roughly on a par, with the
difference appearing to be interrupt latency + the extra register
writes. What's interesting is the latency the ELSP submission mechanism
adds when kicking an idle GPU, which is a hard floor for us. It may even
be worth papering over it by starting execlists from a tasklet.
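Roughly what I have in mind (names made up, locking hand-waved; a sketch
of the idea, not a patch):

static void execlists_submit_tasklet(unsigned long data)
{
	struct intel_engine_cs *ring = (struct intel_engine_cs *)data;
	unsigned long flags;

	spin_lock_irqsave(&ring->execlist_lock, flags);
	elsp_write_ports(ring); /* the actual ELSP register writes */
	spin_unlock_irqrestore(&ring->execlist_lock, flags);
}

/* at engine init */
tasklet_init(&ring->submit_tasklet, execlists_submit_tasklet,
	     (unsigned long)ring);

/* in the submission path, instead of writing ELSP directly */
tasklet_hi_schedule(&ring->submit_tasklet);

The submitter then only pays for queueing the tasklet; the ELSP writes
happen off the critical path, at the cost of a little extra latency
before the GPU actually starts.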
I do feel this sort of information is missing from the execlists
merge...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre