[Intel-gfx] [PATCH 2/2] intel: Use I915_EXEC_NO_RELOC when available

Tue Jan 20 00:42:51 PST 2015

On Mon, Jan 19, 2015 at 09:45:35PM -0800, Kristian Høgsberg wrote:
> On Fri, Jan 16, 2015 at 8:23 PM, Daniel Vetter <daniel at ffwll.ch> wrote:
> > On Fri, Jan 16, 2015 at 05:46:00PM -0800, Kristian Høgsberg wrote:
> >> The I915_EXEC_NO_RELOC flag lets us tell the kernel that the offset we
> >> provide in the validate list entry is what we've used in all relocations
> >> to the bo in question.  If the bo hasn't moved, the kernel can skip
> >> relocations completely.
> >>
> >> Signed-off-by: Kristian Høgsberg <krh at bitplanet.net>
> >> ---
> >>  intel/intel_bufmgr_gem.c | 17 ++++++++++++++++-
> >>  1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/intel/intel_bufmgr_gem.c b/intel/intel_bufmgr_gem.c
> >> index 8a51cea..a657a4d 100644
> >> --- a/intel/intel_bufmgr_gem.c
> >> +++ b/intel/intel_bufmgr_gem.c
> >> @@ -131,6 +131,7 @@ typedef struct _drm_intel_bufmgr_gem {
> >>   unsigned int no_exec : 1;
> >>   unsigned int has_vebox : 1;
> >>   unsigned int has_handle_lut : 1;
> >> + unsigned int has_no_reloc : 1;
> >>   bool fenced_relocs;
> >>
> >>   char *aub_filename;
> >> @@ -504,7 +505,15 @@ drm_intel_add_validate_buffer2(drm_intel_bo *bo, int need_fence)
> >>   bufmgr_gem->exec2_objects[index].relocation_count = bo_gem->reloc_count;
> >>   bufmgr_gem->exec2_objects[index].relocs_ptr = (uintptr_t)bo_gem->relocs;
> >>   bufmgr_gem->exec2_objects[index].alignment = 0;
> >> - bufmgr_gem->exec2_objects[index].offset = 0;
> >> +
> >> + /* If the kernel supports I915_EXEC_NO_RELOC, it will compare
> >> + * offset in struct drm_i915_gem_exec_object2 against the bos
> >> + * current offset and if all bos haven't moved it will skip
> >> + * relocation processing alltogether.  If I915_EXEC_NO_RELOC
> >> + * is not supported, the kernel ignores the incoming value of
> >> + * offset so we can set it either way.
> >> + */
> >> + bufmgr_gem->exec2_objects[index].offset = bo->offset64;
> >>   bufmgr_gem->exec_bos[index] = bo;
> >>   bufmgr_gem->exec2_objects[index].flags = 0;
> >>   bufmgr_gem->exec2_objects[index].rsvd1 = 0;
> >> @@ -2471,6 +2480,8 @@ do_exec2(drm_intel_bo *bo, int used, drm_intel_context *ctx,
> >>
> >>   if (bufmgr_gem->has_handle_lut)
> >>   execbuf.flags |= I915_EXEC_HANDLE_LUT;
> >> + if (bufmgr_gem->has_no_reloc)
> >> + execbuf.flags |= I915_EXEC_NO_RELOC;
> >
> > You need some opt-in flag to not break existing userspace: Iirc both UXA
> > and libva retain reloc trees partially, which means that we might have
> > different presumed offsets for the same bo in different relocs.
> >
> > This is only safe when you throw away and rebuild the reloc tree with all
> > buffers completely each time around (like current mesa does afaik).
> 
> And it turns out that even inside mesa we cannot guarantee this.  In
> case of multiple threads sharing objects, thread A could be halfway in
> building up its batchbuffer when thread B does a execbuf ioctls and
> causes some objects to be moved.  Thread A will then finish building
> its reloc list with different offsets for the objects that were moved
> and if we pass NO_RELOC in that case, nothing will fix up the wrong
> presumed_offsets for the first half.
> 
> We can just check that all presumed_offset of all relocs match
> bo->offset64, and pass NO_RELOC if they do for all bos. This will also
> work of UXA and libva and not require any opt-in.

My idea for all this would have been to create a per-thread execbuf
relocation context with a hashtab to map buffer pointers to execbuf index
and a bunch of arrays to prepare the reloc entry tables. If you do it
correctly all the per-reloc work should be a O(1) streaming writes to a
few arrays plus the hashtab lookup. With no code run at execbuf time
(except the ioctl ofc). Even the libdrm_bo->presumed_offset update after
execbuf could be done lockless (as long as readers are careful to never
reload it by using something similar to the kernel's READ_ONCE macro).

But that means a completely new reloc api, so a lot more work. Also I
think it only makes sense do that for drivers that really care about the
last bit of performance, and then do it within the driver so that there's
no constraints about abi.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch