[Intel-gfx] [PATCH 2/2] intel: Use I915_EXEC_NO_RELOC when available

Kristian Høgsberg krh at bitplanet.net
Tue Jan 20 22:10:57 PST 2015


On Tue, Jan 20, 2015 at 1:46 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Tue, Jan 20, 2015 at 12:53:35PM -0800, Kristian Høgsberg wrote:
>> On Tue, Jan 20, 2015 at 12:42 AM, Daniel Vetter <daniel at ffwll.ch> wrote:
>> > My idea for all this would have been to create a per-thread execbuf
>> > relocation context with a hashtab to map buffer pointers to execbuf index
>> > and a bunch of arrays to prepare the reloc entry tables. If you do it
>> > correctly, all the per-reloc work should be O(1) streaming writes to a
>> > few arrays plus the hashtab lookup. With no code run at execbuf time
>> > (except the ioctl ofc). Even the libdrm_bo->presumed_offset update after
>> > execbuf could be done lockless (as long as readers are careful to never
>> > reload it by using something similar to the kernel's READ_ONCE macro).
>> >
>> > But that means a completely new reloc api, so a lot more work. Also I
>> > think it only makes sense to do that for drivers that really care about the
>> > last bit of performance, and then do it within the driver so that there's
>> > no constraints about abi.
>>
>> Indeed, I moved it into mesa so I could rework that. bo_emit_reloc()
>> is showing up in profiles. The patch below along with NO_RELOC and
>> HANDLE_LUT flags gives me a 5-10% improvement on CPU-bound benchmarks, so
>> it's certainly worth it.  I'm skeptical that a hashtable lookup per
>> reloc emit is going to perform better than just fixing up the relocs
>> at execbuf2 time though.  It would be nice to not do any work at ioctl
>> time, but for that you need a very fast way to map from bo to
>> per-thread bo state as you go. Maybe a per-thread array mapping from
>> gem handle to exec_object could work...
>>
>> WIP Patch is here:
>>
>> http://cgit.freedesktop.org/~krh/mesa/commit/?h=b0e4ce7bbce2a79ad37d6de460af88b9581ea1d7
>
> Hmm, that is actually pretty neat. My idle thought was to create
> per-context batchmgr with its own view of the bo (to counter the
> multithreaded free-for-all). In your patch, you neatly demonstrate that
> you don't need per-context view of the bo, only of the relocations. And
> it will make drm_intel_bo_emit_reloc() fixed cost, which should account for
> most of your CPU overhead savings.
>
> However, I think if you do take it a step further with a batchmgr_bo,
> you can make the drm_intel_bo_references() very cheap as well.

I did think about a per-thread/context bo wrapper that could track
validation list index and presumed_offset, but the problem is that
there's only one intel_mipmap_tree/intel_buffer_object struct to store
it in.  We'd have to add a per-thread wrapper for those too, and I'm not
sure that's feasible.

However, drm_intel_bo_reference/unreference() is showing up on the
profiles now that relocations are cheaper, but I think the better way
to make those cheaper is to move the ref count to the public struct
and make the ref/unref inline functions (calling out to a destructor
for the refcount==0 case, of course).  On that note, do you know why
drm_intel_gem_bo_unreference() is so convoluted? What's the deal with
the atomic_add_unless?

> Looks good.
> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre
