[Intel-gfx] [PATCH 4/4] [v4] drm/i915: Convert execbuf code to use vmas

Wed Aug 14 09:58:51 CEST 2013

On Tue, Aug 13, 2013 at 06:11:59PM -0700, Ben Widawsky wrote:
> On Tue, Aug 13, 2013 at 06:09:09PM -0700, Ben Widawsky wrote:
> > From: Ben Widawsky <ben at bwidawsk.net>
> > 
> > In order to transition more of our code over to using a VMA instead of
> > an <OBJ, VM> pair - we must have the vma accessible at execbuf time. Up
> > until now, we've only had a VMA when actually binding an object.
> > 
> > The previous patch helped handle the distinction on bound vs. unbound.
> > This patch will help us catch leaks, and other issues before we actually
> > shuffle a bunch of stuff around.
> > 
> > This attempts to convert all the execbuf code to speak in vmas. Since
> > the execbuf code is very self contained it was a nice isolated
> > conversion.
> > 
> > The meat of the code is about turning eb_objects into eb_vma, and then
> > wiring up the rest of the code to use vmas instead of obj, vm pairs.
> > 
> > Unfortunately, to do this, we must move the exec_list link from the obj
> > structure. This list is reused in the eviction code, so we must also
> > modify the eviction code to make this work.
> > 
> > WARNING: This patch makes an already hotly profiled path slower. The cost is
> > unavoidable. In reply to this mail, I will attach the extra data.
> > 
> 
> [snip]
> 
> Here is the output from gem_exec_lut_handle both before and after this
> patch. The results honestly don't make sense to me, but I'll set Chris
> parse it before scratching my head harder.
> 
> Before patch
> ============
> relocation: buffers=   1: old=   8060 + 165.3*reloc, lut=   7816 + 164.8*reloc (ns)
> relocation: buffers=   2: old=   6748 + 166.6*reloc, lut=   6952 + 165.4*reloc (ns)
> relocation: buffers=   4: old=   8140 + 165.9*reloc, lut=   8216 + 165.4*reloc (ns)
> relocation: buffers=   8: old=  10732 + 166.0*reloc, lut=  10615 + 165.2*reloc (ns)
> relocation: buffers=  16: old=  15099 + 167.8*reloc, lut=  15337 + 165.3*reloc (ns)
> relocation: buffers=  32: old=  26140 + 166.0*reloc, lut=  25488 + 165.5*reloc (ns)
> relocation: buffers=  64: old=  46300 + 170.5*reloc, lut=  44279 + 166.7*reloc (ns)
> relocation: buffers= 128: old=  84056 + 176.9*reloc, lut=  85379 + 166.3*reloc (ns)
> relocation: buffers= 256: old= 174398 + 167.9*reloc, lut= 167744 + 167.0*reloc (ns)
> relocation: buffers= 512: old= 349688 + 175.7*reloc, lut= 348590 + 170.8*reloc (ns)
> relocation: buffers=1024: old= 726265 + 191.2*reloc, lut= 719774 + 180.2*reloc (ns)
> relocation: buffers=2048: old=1456866 + 224.3*reloc, lut=1442087 + 173.0*reloc (ns)
> skip-relocs: buffers=   1: old=   4445 + 16.0*reloc, lut=   4433 + 15.6*reloc (ns)
> skip-relocs: buffers=   2: old=   4585 + 16.0*reloc, lut=   4571 + 15.6*reloc (ns)
> skip-relocs: buffers=   4: old=   5667 + 16.0*reloc, lut=   5340 + 15.6*reloc (ns)
> skip-relocs: buffers=   8: old=   6051 + 16.1*reloc, lut=   6026 + 15.6*reloc (ns)
> skip-relocs: buffers=  16: old=   7953 + 16.1*reloc, lut=   7914 + 15.6*reloc (ns)
> skip-relocs: buffers=  32: old=  11972 + 16.2*reloc, lut=  11875 + 15.7*reloc (ns)
> skip-relocs: buffers=  64: old=  19999 + 16.5*reloc, lut=  19832 + 15.7*reloc (ns)
> skip-relocs: buffers= 128: old=  37796 + 16.9*reloc, lut=  36539 + 15.9*reloc (ns)
> skip-relocs: buffers= 256: old=  71604 + 18.1*reloc, lut=  71313 + 16.5*reloc (ns)
> skip-relocs: buffers= 512: old= 152682 + 24.3*reloc, lut= 141379 + 27.9*reloc (ns)
> skip-relocs: buffers=1024: old= 314116 + 41.7*reloc, lut= 303019 + 20.1*reloc (ns)
> skip-relocs: buffers=2048: old= 619784 + 54.1*reloc, lut= 603931 + 20.0*reloc (ns)
> no-relocs: buffers=   1: old=   4194 + 5.1*reloc, lut=   4206 + 4.8*reloc (ns)
> no-relocs: buffers=   2: old=   4404 + 5.1*reloc, lut=   4381 + 4.8*reloc (ns)
> no-relocs: buffers=   4: old=   4926 + 5.1*reloc, lut=   4921 + 4.8*reloc (ns)
> no-relocs: buffers=   8: old=   5901 + 5.1*reloc, lut=   5822 + 4.9*reloc (ns)
> no-relocs: buffers=  16: old=   7840 + 5.1*reloc, lut=   7737 + 4.9*reloc (ns)
> no-relocs: buffers=  32: old=  11842 + 5.1*reloc, lut=  11681 + 4.9*reloc (ns)
> no-relocs: buffers=  64: old=  19741 + 5.1*reloc, lut=  19542 + 4.8*reloc (ns)
> no-relocs: buffers= 128: old=  36479 + 5.2*reloc, lut=  35958 + 4.9*reloc (ns)
> no-relocs: buffers= 256: old=  70171 + 5.4*reloc, lut=  69390 + 5.2*reloc (ns)
> no-relocs: buffers= 512: old= 147213 + 3.5*reloc, lut= 137953 + 13.0*reloc (ns)
> no-relocs: buffers=1024: old= 300165 + 4.8*reloc, lut= 293852 + 4.9*reloc (ns)
> no-relocs: buffers=2048: old= 597992 + 8.3*reloc, lut= 590185 + 2.1*reloc (ns)
> 
> 
> After patch
> ===========
> relocation: buffers=   1: old=   8075 + 81.4*reloc, lut=   7592 + 80.6*reloc (ns)
> relocation: buffers=   2: old=   5744 + 82.3*reloc, lut=   5837 + 81.1*reloc (ns)
> relocation: buffers=   4: old=   4875 + 82.7*reloc, lut=   4871 + 81.6*reloc (ns)
> relocation: buffers=   8: old=   5729 + 82.7*reloc, lut=   5698 + 81.5*reloc (ns)
> relocation: buffers=  16: old=   7952 + 83.0*reloc, lut=   7809 + 81.9*reloc (ns)
> relocation: buffers=  32: old=  11884 + 82.9*reloc, lut=  11702 + 81.6*reloc (ns)
> relocation: buffers=  64: old=  20388 + 83.4*reloc, lut=  19995 + 82.2*reloc (ns)
> relocation: buffers= 128: old=  38057 + 85.0*reloc, lut=  37675 + 83.4*reloc (ns)
> relocation: buffers= 256: old=  74912 + 87.0*reloc, lut=  74064 + 85.4*reloc (ns)
> relocation: buffers= 512: old= 161136 + 94.8*reloc, lut= 157046 + 87.5*reloc (ns)
> relocation: buffers=1024: old= 349443 + 107.0*reloc, lut= 342081 + 91.2*reloc (ns)
> relocation: buffers=2048: old= 707951 + 131.8*reloc, lut= 690754 + 96.9*reloc (ns)
> skip-relocs: buffers=   1: old=   2966 + 16.6*reloc, lut=   2963 + 15.6*reloc (ns)
> skip-relocs: buffers=   2: old=   3083 + 16.5*reloc, lut=   3056 + 15.5*reloc (ns)
> skip-relocs: buffers=   4: old=   3279 + 16.6*reloc, lut=   3242 + 15.6*reloc (ns)
> skip-relocs: buffers=   8: old=   3692 + 16.7*reloc, lut=   3654 + 15.6*reloc (ns)
> skip-relocs: buffers=  16: old=   4522 + 16.7*reloc, lut=   4461 + 15.5*reloc (ns)
> skip-relocs: buffers=  32: old=   6254 + 16.7*reloc, lut=   6138 + 15.7*reloc (ns)
> skip-relocs: buffers=  64: old=  10098 + 16.8*reloc, lut=   9939 + 15.7*reloc (ns)
> skip-relocs: buffers= 128: old=  17983 + 17.6*reloc, lut=  17729 + 16.3*reloc (ns)
> skip-relocs: buffers= 256: old=  34388 + 18.8*reloc, lut=  33981 + 17.6*reloc (ns)
> skip-relocs: buffers= 512: old=  74211 + 25.2*reloc, lut=  72185 + 18.6*reloc (ns)
> skip-relocs: buffers=1024: old= 160514 + 34.1*reloc, lut= 157086 + 20.3*reloc (ns)
> skip-relocs: buffers=2048: old= 323954 + 51.5*reloc, lut= 315928 + 22.5*reloc (ns)
> no-relocs: buffers=   1: old=   2840 + 5.1*reloc, lut=   2834 + 4.8*reloc (ns)
> no-relocs: buffers=   2: old=   2938 + 5.1*reloc, lut=   2917 + 4.8*reloc (ns)
> no-relocs: buffers=   4: old=   3220 + 5.1*reloc, lut=   3201 + 4.8*reloc (ns)
> no-relocs: buffers=   8: old=   3614 + 5.1*reloc, lut=   3545 + 4.8*reloc (ns)
> no-relocs: buffers=  16: old=   4437 + 5.1*reloc, lut=   4368 + 4.8*reloc (ns)
> no-relocs: buffers=  32: old=   6105 + 5.1*reloc, lut=   6024 + 4.9*reloc (ns)
> no-relocs: buffers=  64: old=   9864 + 5.1*reloc, lut=   9652 + 4.9*reloc (ns)
> no-relocs: buffers= 128: old=  17388 + 5.1*reloc, lut=  17126 + 4.9*reloc (ns)
> no-relocs: buffers= 256: old=  33087 + 5.4*reloc, lut=  32668 + 5.3*reloc (ns)
> no-relocs: buffers= 512: old=  71476 + 5.0*reloc, lut=  69464 + 4.9*reloc (ns)
> no-relocs: buffers=1024: old= 154379 + 4.9*reloc, lut= 152796 + 4.3*reloc (ns)
> no-relocs: buffers=2048: old= 309435 + 5.0*reloc, lut= 301095 + 4.9*reloc (ns)

Hmm, either the patch really did make things faster or the results are
being subjected to cpufreq.

relocation: do the full relocation processing
skip-relocs: all the relocation addresses are correct, so no rewrite
no-relocs: no buffers moved, no relocation processing

"Buffers" is the number of buffers passed into execbuffer, and so
we measure the cost of trying to process a certain complexity of a batch.
Then we do a least-squares plot through a number of batches with varying
numbers of relocations to estimate the overhead of processing the
execbuffer array versus the cost of each addition relocation in the
batch. The first number (the x_0 intercept in nanoseconds) should be the
measure of how long it takes to grab all the buffers into our local
structures - this is what I was expecting to be worst hit by the vma
patches.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre