[Intel-gfx] [PATCH v4] drm/i915: Optimistically spin for the request completion
Chris Wilson
chris at chris-wilson.co.uk
Mon Mar 23 02:09:53 PDT 2015
On Mon, Mar 23, 2015 at 09:31:38AM +0100, Daniel Vetter wrote:
> On Fri, Mar 20, 2015 at 10:59:50PM +0000, Chris Wilson wrote:
> > On Fri, Mar 20, 2015 at 04:19:02PM +0000, Chris Wilson wrote:
> > > I guess one test would be to see how many 1x1 [xN overdraw, say 1x1
> > > Window, but rendering internally at 1080p] clients we can run in
> > > parallel whilst hitting 60fps. And then whether allowing multiple
> > > spinners helps or hinders.
> >
> > I was thinking of a nice easy test that could demonstrate any advantage
> > for spinning over waiting, and realised we already had such an igt. The
> > trick is that it has to generate sufficient GPU load to actually require
> > a wait, but not so much load that we can no longer see the impact of a
> > slow completion.
> >
> > I present igt/gem_exec_blt (modified to repeat the measurement and do an
> > average over several runs):
> >
> > Time to blt 16384 bytes x    1:  21.000µs -> 5.800µs
> > Time to blt 16384 bytes x    2:  11.500µs -> 4.500µs
> > Time to blt 16384 bytes x    4:   6.750µs -> 3.750µs
> > Time to blt 16384 bytes x    8:   4.950µs -> 3.375µs
> > Time to blt 16384 bytes x   16:   3.825µs -> 3.175µs
> > Time to blt 16384 bytes x   32:   3.356µs -> 3.000µs
> > Time to blt 16384 bytes x   64:   3.259µs -> 2.909µs
> > Time to blt 16384 bytes x  128:   3.083µs -> 3.095µs
> > Time to blt 16384 bytes x  256:   3.104µs -> 2.979µs
> > Time to blt 16384 bytes x  512:   3.080µs -> 3.089µs
> > Time to blt 16384 bytes x 1024:   3.077µs -> 3.040µs
> > Time to blt 16384 bytes x 2048:   3.127µs -> 3.304µs
> > Time to blt 16384 bytes x 4096:   3.279µs -> 3.265µs
>
> We probably need to revisit this when the scheduler lands - that one will
> want to keep a short queue and generally will block for some request to
> complete.
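Just to spell out what "spin" means in the numbers above: rather than
immediately going to sleep and paying the interrupt + wakeup latency, we
busy-wait on the request for a short, bounded window and only fall back
to the usual sleeping wait if it doesn't complete in time. Very roughly
(the helper name and the exact completion check are illustrative, so
treat this as a sketch of the idea rather than the patch itself):

static bool spin_for_request(struct drm_i915_gem_request *req)
{
	/* Bound the busy-wait so we burn at most ~2µs of CPU time */
	u64 timeout = local_clock() + 2 * NSEC_PER_USEC;

	while (!i915_gem_request_completed(req, true)) {
		if (need_resched() || local_clock() > timeout)
			return false; /* give up, take the sleeping wait */

		cpu_relax();
	}

	return true; /* completed without sleeping or an interrupt */
}

The need_resched() check is there so the spin never holds onto the CPU
when somebody else wants to run.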
Speaking of which, execlists! You may have noticed that I
surreptitiously chose hsw to avoid the execlists overhead...

I was messing around over the weekend looking at the submission overhead
on bdw-u:
            -nightly      +spin       +hax  execlists=0
   x1:      23.600µs   18.400µs   15.200µs      6.800µs
   x2:      19.700µs   16.500µs   15.900µs      5.000µs
   x4:      15.600µs   12.250µs   12.500µs      4.450µs
   x8:      13.575µs   11.000µs   11.650µs      4.050µs
   x16:     10.812µs    9.738µs    9.875µs      3.900µs
   x32:      9.281µs    8.613µs    9.406µs      3.750µs
   x64:      8.088µs    7.988µs    8.806µs      3.703µs
   x128:     7.683µs    7.838µs    8.617µs      3.647µs
   x256:     9.481µs    7.301µs    8.091µs      3.409µs
   x512:     5.579µs    5.992µs    6.177µs      3.561µs
   x1024:   10.093µs    3.963µs    4.187µs      3.531µs
   x2048:   11.497µs    3.794µs    3.873µs      3.477µs
   x4096:    8.926µs    5.269µs    3.813µs      3.461µs
The hax are to remove the extra atomic ops and spinlocks imposed by
execlists. Steady state seems to be roughly on a par, with the
difference appearing to be interrupt latency + the extra register
writes. What's interesting is the latency the ELSP submission mechanism
adds when kicking an idle GPU, which is a hard floor for us. It may even
be worth papering over it by starting execlists from a tasklet.
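Roughly what I have in mind (names made up, locking hand-waved; a sketch
of the idea, not a patch):

static void execlists_submit_tasklet(unsigned long data)
{
	struct intel_engine_cs *ring = (struct intel_engine_cs *)data;
	unsigned long flags;

	spin_lock_irqsave(&ring->execlist_lock, flags);
	elsp_write_ports(ring); /* the actual ELSP register writes */
	spin_unlock_irqrestore(&ring->execlist_lock, flags);
}

/* at engine init */
tasklet_init(&ring->submit_tasklet, execlists_submit_tasklet,
	     (unsigned long)ring);

/* in the submission path, instead of writing ELSP directly */
tasklet_hi_schedule(&ring->submit_tasklet);

The submitter then only pays for queueing the tasklet; the ELSP writes
happen off the critical path, at the cost of a little extra latency
before the GPU actually starts.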
I do feel this sort of information is missing from the execlists
merge...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre