[Mesa-dev] [RFC 3/3] anv: Do relocations in userspace before execbuf ioctl

Chris Wilson chris at chris-wilson.co.uk
Tue Nov 1 16:12:34 UTC 2016


On Tue, Nov 01, 2016 at 08:39:16AM -0700, Jason Ekstrand wrote:
>    On Tue, Nov 1, 2016 at 1:07 AM, Chris Wilson <[1]chris at chris-wilson.co.uk>
>    wrote:
> 
>      On Mon, Oct 31, 2016 at 09:48:15PM -0700, Jason Ekstrand wrote:
>      > From: Kristian Høgsberg Kristensen
>      <[2]kristian.h.kristensen at intel.com>
>      >
>      > This reduces the amount of stalling that the kernel does between
>      > batches and improves the performance of Dota 2 on a Sky Lake GT2
>      > desktop by around 30%.  Ideally, we would have a simple execbuf2
>      > flag that would let us tell the kernel "Trust me, you don't need
>      > to stall to relocate" but the kernel has no such flag at the
>      > moment.
> 
>      The NO_RELOC approach should work out to be faster since you then avoid
>      all the relocation handling in the kernel. You already have the buffers
>      coherently mapped, the kernel doesn't, the kernel needs to copy the
>      entire relocation tree if even one relocation is wrong, etc.
>      > Signed-off-by: Jason Ekstrand <[3]jason at jlekstrand.net>
>      > Cc: Kristian Høgsberg <[4]krh at bitplanet.net>
>      > Cc: Ben Widawsky <[5]ben at bwidawsk.net>
>      > Cc: Daniel Vetter <[6]daniel.vetter at ffwll.ch>
>      > Cc: Chris Wilson <[7]chris at chris-wilson.co.uk>
>      >
>      > ---
>      >  src/intel/vulkan/anv_device.c | 96 +++++++++++++++++++++++++++++++++++++++++--
>      >  1 file changed, 93 insertions(+), 3 deletions(-)
>      >
>      > diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
>      > index baa767e..ef371a7 100644
>      > --- a/src/intel/vulkan/anv_device.c
>      > +++ b/src/intel/vulkan/anv_device.c
>      > @@ -1068,6 +1068,89 @@ void anv_GetDeviceQueue(
>      >     *pQueue = anv_queue_to_handle(&device->queue);
>      >  }
>      >
>      > +static void
>      > +write_reloc(const struct anv_device *device, void *p, uint64_t v)
>      > +{
>      > +   if (device->info.gen >= 8)
>      > +      *(uint64_t *)p = v;
> 
>      Something to remember here is that some MI (I think) instructions
>      require canonical format. (bits 63:48 = bit 47).
> 
>    Is there anything you can point me to on this?  Looking at the i915
>    sources, it appears that it just writes in the relocation verbatim.

The ABI is that offset/presumed_offset are in canonical format:

/* Used to convert any address to canonical form.
 * Starting from gen8, some commands (e.g. STATE_BASE_ADDRESS,
 * MI_LOAD_REGISTER_MEM and others, see Broadwell PRM Vol2a) require the
 * addresses to be in a canonical form:
 * "GraphicsAddress[63:48] are ignored by the HW and assumed to be in correct
 * canonical form [63:48] == [47]."
 */
#define GEN8_HIGH_ADDRESS_BIT 47
static inline uint64_t gen8_canonical_addr(uint64_t address)
{
        return sign_extend64(address, GEN8_HIGH_ADDRESS_BIT);
}

and the relocation values we write are also in this format.

This was enforced in 

commit 934acce3c069a3d8b14085957248444145d9ec1b
Author: Michał Winiarski <michal.winiarski at intel.com>
Date:   Tue Dec 29 18:24:52 2015 +0100

    drm/i915: Avoid writing relocs with addresses in non-canonical form

and there was the suggestion that some workloads were seeing hangs due to
not using the canonical format for addresses - though I have no details
on those.
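Putting the two pieces together, a gen-aware userspace reloc write that honors the canonical-address contract might look like the following sketch. The canonicalization mirrors the kernel's gen8_canonical_addr() quoted above; the helper names here are illustrative, not the actual anv functions.

```c
#include <stdint.h>

#define GEN8_HIGH_ADDRESS_BIT 47

/* Sign-extend bit 47 into bits 63:48, mirroring the kernel's
 * gen8_canonical_addr() quoted above. */
static inline uint64_t
canonical_addr(uint64_t address)
{
   const int shift = 63 - GEN8_HIGH_ADDRESS_BIT;
   return (uint64_t)((int64_t)(address << shift) >> shift);
}

/* Hypothetical gen-aware reloc write: 64-bit canonical addresses on
 * gen8+, plain 32-bit writes before that. */
static inline void
write_reloc_canonical(int gen, void *p, uint64_t v)
{
   if (gen >= 8)
      *(uint64_t *)p = canonical_addr(v);
   else
      *(uint32_t *)p = (uint32_t)v;
}
```

With this shape, any address that happens to have bit 47 set comes out with bits 63:48 filled in, matching what the kernel writes for its own relocations.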

>      > +   else
>      > +      *(uint32_t *)p = v;
>      > +}
>      > +
>      > +static void
>      > +anv_reloc_list_apply(struct anv_reloc_list *list,
>      > +                     struct anv_device *device, struct anv_bo *bo)
>      > +{
>      > +   for (size_t i = 0; i < list->num_relocs; i++) {
>      > +      void *p = bo->map + list->relocs[i].offset;
>      > +
>      > +      struct anv_bo *target_bo = list->reloc_bos[i];
>      > +      write_reloc(device, p, target_bo->offset + list->relocs[i].delta);
>      > +      list->relocs[i].presumed_offset = target_bo->offset;
>      > +   }
>      > +}
>      > +
>      > +/**
>      > + * This function applies the relocations for a command buffer and
>      > + * writes the actual addresses into the buffers as per what we were
>      > + * told by the kernel on the previous execbuf2 call.  This should be
>      > + * safe to do because, for each relocated address, we have two cases:
>      > + *
>      > + *  1) The buffer to which the address points is not in use by the
>      > + *     GPU.  In this case, neither is the 32 or 64-bit chunk of GPU
>      > + *     memory storing the address, so it's completely safe to update
>      > + *     the address.
> 
>      A caveat here is that if the kernel believes the buffer is not in use,
>      assume the kernel moved it. Hence you always need to tell the kernel
>      about the buffers for the batch and your assumed locations. This is very
>      important to remember when dealing with workloads that cause GTT
>      thrashing or swap thrashing. (In the past you would have also caused
>      corruption here if you forgot to set the write markers.)
> 
>    Yes, I believe I am doing this properly.  Above, after doing the userspace
>    reloc, I set the presumed offset based on the reloc that was just
>    applied.  If the kernel shuffles things around due to GTT thrashing, it
>    has the information it needs to do the relocations correctly in that case.

Yes, the code is correct (afaict). I was just thinking of expanding upon
the comment, to at least touch upon the corner cases of when an object
that you thought was active may suddenly become inactive. And hence why
we still need the relocation information to recover. :)
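The contract being discussed can be sketched with a toy model (all names below are illustrative stand-ins, not the real anv/i915 types): userspace writes the address and records the offset it assumed, and the kernel only needs to rewrite a relocation whose presumed offset no longer matches the object's actual placement.

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for anv_bo and drm_i915_gem_relocation_entry. */
struct toy_bo {
   uint64_t offset;      /* GPU address, as last reported by the kernel */
   uint8_t  map[64];     /* CPU mapping of the buffer */
};

struct toy_reloc {
   uint32_t offset;           /* where in the bo the address lives */
   uint32_t delta;            /* constant added to the target address */
   uint64_t presumed_offset;  /* what userspace assumed when it wrote */
};

/* Userspace side, mirroring anv_reloc_list_apply(): write the address
 * and record the target offset we assumed. */
static void
apply_reloc(struct toy_bo *bo, const struct toy_bo *target,
            struct toy_reloc *r)
{
   uint64_t v = target->offset + r->delta;
   memcpy(bo->map + r->offset, &v, sizeof(v));
   r->presumed_offset = target->offset;
}

/* Kernel side of the NO_RELOC contract: only rewrite the address if the
 * target moved relative to what userspace presumed. */
static int
kernel_must_fix(const struct toy_bo *target, const struct toy_reloc *r)
{
   return r->presumed_offset != target->offset;
}
```

This is why keeping the relocation array around still matters: if an object you thought was active goes inactive and the kernel moves it, kernel_must_fix() fires and the kernel has everything it needs to recover.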

>      > + *
>      > + *  2) The buffer to which the address points is currently in use
>      > + *     by the GPU.  If it's still in use, then some execbuf2 call
>      > + *     before this one used it and it has been in use ever since.
>      > + *     In this case, it can't have moved since the last execbuf2
>      > + *     call.  We then have two subcases:
>      > + *
>      > + *      a) The 32 or 64-bit chunk of GPU memory storing the address
>      > + *         is not in use by the GPU.  In this case, it is safe to
>      > + *         write the relocation.
>      > + *
>      > + *      b) The 32 or 64-bit chunk of GPU memory storing the address
>      > + *         is currently in use by the GPU.  In this case, the
>      > + *         relocated address we will write is exactly the same as
>      > + *         the last address we wrote there.
> 
>      (But I don't see you checking on the kernel to confirm, e.g.
>      [8]https://cgit.freedesktop.org/~ickle/mesa/tree/src/mesa/drivers/dri/i965/brw_batch.c?h=brw-batch#n814
>      )
> 
>    The point here is that the BO containing the image data is in one of two
>    states (as seen by the kernel): active or inactive.  If it's active then
>    so is the surface state buffer so nothing is moving around and any address
>    that I write into the surface state buffer will be exactly what is already
>    there.  If it's inactive then I know the GPU isn't using it so changing
>    that value won't cause the GPU any problems.  If the kernel comes by and
>    has to change it again due to GTT thrashing, that's just fine but the
>    point is that changing it now is ok.

I was thinking of the games you can play. If you know the buffers
are active, you can treat them as pinned. That allows you to skip
building the relocation array in userspace (a moot point if you
already have one). The kernel will not read the relocation array for
either softpinned buffers or unmoved NO_RELOC buffers, so in principle
neither technique has an advantage in kernel overhead.
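For reference, the two ways of telling the kernel it may skip the relocation array can be sketched as follows. The flag values are copied from the i915 uapi (i915_drm.h), but the structs here are trimmed, illustrative subsets of drm_i915_gem_exec_object2 / drm_i915_gem_execbuffer2, not the full uapi definitions.

```c
#include <stdint.h>

/* Flag values from the i915 uapi (include/uapi/drm/i915_drm.h). */
#define EXEC_OBJECT_PINNED (1 << 4)   /* offset is fixed (softpin) */
#define I915_EXEC_NO_RELOC (1 << 11)  /* offsets are presumed correct */

/* Trimmed, illustrative subsets of the real execbuf2 structs. */
struct exec_object2 {
   uint32_t handle;
   uint32_t relocation_count;
   uint64_t relocs_ptr;
   uint64_t offset;  /* presumed (NO_RELOC) or required (PINNED) address */
   uint64_t flags;
};

struct execbuffer2 {
   uint64_t buffers_ptr;
   uint32_t buffer_count;
   uint64_t flags;
};

/* Softpin: pin every object at its current address; the kernel never
 * reads relocs_ptr for pinned objects. */
static void
use_softpin(struct exec_object2 *objs, uint32_t count)
{
   for (uint32_t i = 0; i < count; i++)
      objs[i].flags |= EXEC_OBJECT_PINNED;
}

/* NO_RELOC: promise that every offset (and every address already written
 * into the batch) is correct; the kernel then skips the relocation array
 * for any object that has not moved. */
static void
use_no_reloc(struct execbuffer2 *eb)
{
   eb->flags |= I915_EXEC_NO_RELOC;
}
```

Either way the kernel's fast path avoids touching the relocation tree, which is the overhead equivalence described above.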

As above, I like having a discussion of the corner cases in the comments
:) Mainly so that both parties understand them better when the time comes
to try to improve things (and the corner cases tend to be the most
interesting/frustrating when it comes to squeezing out consistency).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

