[Intel-gfx] [PATCH 2/2] intel: Use I915_EXEC_NO_RELOC when available

Tue Jan 20 13:46:37 PST 2015

On Tue, Jan 20, 2015 at 12:53:35PM -0800, Kristian Høgsberg wrote:
> On Tue, Jan 20, 2015 at 12:42 AM, Daniel Vetter <daniel at ffwll.ch> wrote:
> > My idea for all this would have been to create a per-thread execbuf
> > relocation context with a hashtab to map buffer pointers to execbuf index
> > and a bunch of arrays to prepare the reloc entry tables. If you do it
> > correctly all the per-reloc work should be O(1) streaming writes to a
> > few arrays plus the hashtab lookup. With no code run at execbuf time
> > (except the ioctl ofc). Even the libdrm_bo->presumed_offset update after
> > execbuf could be done lockless (as long as readers are careful to never
> > reload it by using something similar to the kernel's READ_ONCE macro).
> >
> > But that means a completely new reloc api, so a lot more work. Also I
> > think it only makes sense to do that for drivers that really care about the
> > last bit of performance, and then do it within the driver so that there are
> > no constraints about abi.
> 
> Indeed, I moved it into mesa so I could rework that. bo_emit_reloc()
> is showing up in profiles. The patch below along with NO_RELOC and
> HANDLE_LUT flags gives me 5-10% improvement on CPU bound benchmarks, so
> it's certainly worth it.  I'm skeptical that a hashtable lookup per
> reloc emit is going to perform better than just fixing up the relocs
> at execbuf2 time though.  It would be nice to not do any work at ioctl
> time, but for that you need a very fast way to map from bo to
> per-thread bo state as you go. Maybe a per-thread array mapping from
> gem handle to exec_object could work...
> 
> WIP Patch is here:
> 
> http://cgit.freedesktop.org/~krh/mesa/commit/?h=b0e4ce7bbce2a79ad37d6de460af88b9581ea1d7

Hmm, that is actually pretty neat. My idle thought was to create a
per-context batchmgr with its own view of the bo (to counter the
multithreaded free-for-all). In your patch, you neatly demonstrate that
you don't need a per-context view of the bo, only of the relocations. And
it will make drm_intel_bo_emit_reloc() fixed cost, which should produce
most of your CPU overhead saving.

However, I think if you do take it a step further with a batchmgr_bo,
you can make drm_intel_bo_references() very cheap as well.
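
For illustration only, here is a rough sketch of the "per-thread array
mapping from gem handle to exec_object" idea quoted above. Everything in
it is hypothetical: the sketch_* names and structs are stand-ins, not
libdrm, mesa or kernel types, and it assumes each thread owns its own
struct sketch_batch so that no locking is needed. The point is only that
a bo's slot in the exec list can be found with one array load per reloc,
without a hashtab.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an exec object entry (not the kernel struct). */
struct sketch_exec_object {
    uint32_t handle;
    uint64_t presumed_offset;
};

/* Hypothetical per-thread batch state; one instance per thread, no locks. */
struct sketch_batch {
    int32_t *handle_to_index;          /* indexed by GEM handle, -1 = absent */
    uint32_t handle_capacity;
    struct sketch_exec_object *objects; /* exec list in submission order */
    uint32_t object_count;
    uint32_t object_capacity;
};

void sketch_batch_init(struct sketch_batch *b)
{
    b->handle_to_index = NULL;
    b->handle_capacity = 0;
    b->objects = NULL;
    b->object_count = 0;
    b->object_capacity = 0;
}

/* Grow the handle map so that 'handle' is a valid index; new slots get -1. */
static int sketch_batch_grow_map(struct sketch_batch *b, uint32_t handle)
{
    uint32_t new_cap = b->handle_capacity ? b->handle_capacity : 64;
    int32_t *map;
    uint32_t i;

    if (handle >= (1u << 30))          /* arbitrary cap, avoids overflow */
        return -1;
    while (new_cap <= handle)
        new_cap *= 2;
    map = realloc(b->handle_to_index, (size_t)new_cap * sizeof(*map));
    if (!map)
        return -1;
    for (i = b->handle_capacity; i < new_cap; i++)
        map[i] = -1;
    b->handle_to_index = map;
    b->handle_capacity = new_cap;
    return 0;
}

/* Return the exec-list index for 'handle', appending it on first use.
 * Returns -1 on allocation failure. */
int32_t sketch_batch_lookup_or_add(struct sketch_batch *b, uint32_t handle,
                                   uint64_t presumed_offset)
{
    int32_t index;

    if (handle >= b->handle_capacity && sketch_batch_grow_map(b, handle) < 0)
        return -1;
    index = b->handle_to_index[handle];
    if (index >= 0)
        return index;                  /* already in this batch */

    if (b->object_count == b->object_capacity) {
        uint32_t new_cap = b->object_capacity ? b->object_capacity * 2 : 16;
        struct sketch_exec_object *objs =
            realloc(b->objects, (size_t)new_cap * sizeof(*objs));
        if (!objs)
            return -1;
        b->objects = objs;
        b->object_capacity = new_cap;
    }
    index = (int32_t)b->object_count++;
    b->objects[index].handle = handle;
    b->objects[index].presumed_offset = presumed_offset;
    b->handle_to_index[handle] = index;
    return index;
}

/* After submission: clear only the slots this batch touched. */
void sketch_batch_reset(struct sketch_batch *b)
{
    uint32_t i;

    for (i = 0; i < b->object_count; i++) {
        assert(b->objects[i].handle < b->handle_capacity);
        b->handle_to_index[b->objects[i].handle] = -1;
    }
    b->object_count = 0;
}

void sketch_batch_fini(struct sketch_batch *b)
{
    free(b->handle_to_index);
    free(b->objects);
    sketch_batch_init(b);
}
```

The returned index is what a reloc entry would carry as its target when
HANDLE_LUT is in use. The array costs memory proportional to the largest
handle seen, which is the trade made against a hashtab here; whether that
is acceptable in a real driver is not something this sketch settles.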

Looks good.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre