[Intel-gfx] [PATCH 2/2] intel: Use I915_EXEC_NO_RELOC when available

Tue Jan 20 12:53:35 PST 2015

On Tue, Jan 20, 2015 at 12:42 AM, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Mon, Jan 19, 2015 at 09:45:35PM -0800, Kristian Høgsberg wrote:
>> On Fri, Jan 16, 2015 at 8:23 PM, Daniel Vetter <daniel at ffwll.ch> wrote:
>> > On Fri, Jan 16, 2015 at 05:46:00PM -0800, Kristian Høgsberg wrote:
>> >> The I915_EXEC_NO_RELOC flag lets us tell the kernel that the offset we
>> >> provide in the validate list entry is what we've used in all relocations
>> >> to the bo in question.  If the bo hasn't moved, the kernel can skip
>> >> relocations completely.
>> >>
>> >> Signed-off-by: Kristian Høgsberg <krh at bitplanet.net>
>> >> ---
>> >>  intel/intel_bufmgr_gem.c | 17 ++++++++++++++++-
>> >>  1 file changed, 16 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
>> >> index 8a51cea..a657a4d 100644
>> >> --- a/intel/intel_bufmgr_gem.c
>> >> +++ b/intel/intel_bufmgr_gem.c
>> >> @@ -131,6 +131,7 @@ typedef struct _drm_intel_bufmgr_gem {
>> >>   unsigned int no_exec : 1;
>> >>   unsigned int has_vebox : 1;
>> >>   unsigned int has_handle_lut : 1;
>> >> + unsigned int has_no_reloc : 1;
>> >>   bool fenced_relocs;
>> >>
>> >>   char *aub_filename;
>> >> @@ -504,7 +505,15 @@ drm_intel_add_validate_buffer2(drm_intel_bo *bo, int need_fence)
>> >>   bufmgr_gem->exec2_objects[index].relocation_count = bo_gem->reloc_count;
>> >>   bufmgr_gem->exec2_objects[index].relocs_ptr = (uintptr_t)bo_gem->relocs;
>> >>   bufmgr_gem->exec2_objects[index].alignment = 0;
>> >> - bufmgr_gem->exec2_objects[index].offset = 0;
>> >> +
>> >> + /* If the kernel supports I915_EXEC_NO_RELOC, it will compare
>> >> + * offset in struct drm_i915_gem_exec_object2 against the bos
>> >> + * current offset and if all bos haven't moved it will skip
>> >> + * relocation processing alltogether.  If I915_EXEC_NO_RELOC
>> >> + * is not supported, the kernel ignores the incoming value of
>> >> + * offset so we can set it either way.
>> >> + */
>> >> + bufmgr_gem->exec2_objects[index].offset = bo->offset64;
>> >>   bufmgr_gem->exec_bos[index] = bo;
>> >>   bufmgr_gem->exec2_objects[index].flags = 0;
>> >>   bufmgr_gem->exec2_objects[index].rsvd1 = 0;
>> >> @@ -2471,6 +2480,8 @@ do_exec2(drm_intel_bo *bo, int used, drm_intel_context *ctx,
>> >>
>> >>   if (bufmgr_gem->has_handle_lut)
>> >>   execbuf.flags |= I915_EXEC_HANDLE_LUT;
>> >> + if (bufmgr_gem->has_no_reloc)
>> >> + execbuf.flags |= I915_EXEC_NO_RELOC;
>> >
>> > You need some opt-in flag to not break existing userspace: Iirc both UXA
>> > and libva retain reloc trees partially, which means that we might have
>> > different presumed offsets for the same bo in different relocs.
>> >
>> > This is only safe when you throw away and rebuild the reloc tree with all
>> > buffers completely each time around (like current mesa does afaik).
>>
>> And it turns out that even inside mesa we cannot guarantee this.  In
>> case of multiple threads sharing objects, thread A could be halfway in
>> building up its batchbuffer when thread B does a execbuf ioctls and
>> causes some objects to be moved.  Thread A will then finish building
>> its reloc list with different offsets for the objects that were moved
>> and if we pass NO_RELOC in that case, nothing will fix up the wrong
>> presumed_offsets for the first half.
>>
>> We can just check that all presumed_offset of all relocs match
>> bo->offset64, and pass NO_RELOC if they do for all bos. This will also
>> work of UXA and libva and not require any opt-in.
>
> My idea for all this would have been to create a per-thread execbuf
> relocation context with a hashtab to map buffer pointers to execbuf index
> and a bunch of arrays to prepare the reloc entry tables. If you do it
> correctly all the per-reloc work should be a O(1) streaming writes to a
> few arrays plus the hashtab lookup. With no code run at execbuf time
> (except the ioctl ofc). Even the libdrm_bo->presumed_offset update after
> execbuf could be done lockless (as long as readers are careful to never
> reload it by using something similar to the kernel's READ_ONCE macro).
>
> But that means a completely new reloc api, so a lot more work. Also I
> think it only makes sense do that for drivers that really care about the
> last bit of performance, and then do it within the driver so that there's
> no constraints about abi.

Indeed, I moved it into mesa so I could rework that. bo_emit_reloc()
is showing up in profiles. The patch below along with NO_RELOC and
HANDLE_LUT flags gives me 5-10% improement on CPU bound benchmarks, so
it's certainly worth it.  I'm skeptical that a hashtable lookup per
reloc emit is going to perform better than just fixing up the relocs
at execbuf2 time though.  It would be nice to not do any work at ioctl
time, but for that you need a very fast way to map from bo to
per-thread bo state as you go. Maybe a per-thread array mapping from
gem handle to exec_object could work...

WIP Patch is here:

http://cgit.freedesktop.org/~krh/mesa/commit/?h=b0e4ce7bbce2a79ad37d6de460af88b9581ea1d7

it needs a couple of libdrm changes: expose exec_index in the public
bo struct, the drm_bo_clear_idle() function and I need to get at the
hw context id.

Kristian

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch