[Mesa-dev] [RFC 3/3] anv: Do relocations in userspace before execbuf ioctl

Tue Nov 1 08:07:36 UTC 2016

On Mon, Oct 31, 2016 at 09:48:15PM -0700, Jason Ekstrand wrote:
> From: Kristian Høgsberg Kristensen <kristian.h.kristensen at intel.com>
> 
> This reduces the amount of stalling that the kernel does between batches
> and improves the performance of Dota 2 on a Sky Lake GT2 desktop by around
> 30%.  Ideally, we would have a simple execbuf2 flag that would let us tell
> the kernel "Trust me, you don't need to stall to relocate" but the kernel
> has no such flag at the moment.

The NO_RELOC approach should work out to be faster since you then avoid
all the relocation handling in the kernel. You already have the buffers
coherently mapped, the kernel doesn't, the kernel needs to copy the
entire relocation tree if even one relocation is wrong, etc.

> Signed-off-by: Jason Ekstrand <jason at jlekstrand.net>
> Cc: Kristian Høgsberg <krh at bitplanet.net>
> Cc: Ben Widawsky <ben at bwidawsk.net>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> 
> ---
>  src/intel/vulkan/anv_device.c | 96 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 93 insertions(+), 3 deletions(-)
> 
> diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
> index baa767e..ef371a7 100644
> --- a/src/intel/vulkan/anv_device.c
> +++ b/src/intel/vulkan/anv_device.c
> @@ -1068,6 +1068,89 @@ void anv_GetDeviceQueue(
>     *pQueue = anv_queue_to_handle(&device->queue);
>  }
>  
> +static void
> +write_reloc(const struct anv_device *device, void *p, uint64_t v)
> +{
> +   if (device->info.gen >= 8)
> +      *(uint64_t *)p = v;

Something to remember here is that some MI (I think) instructions
require canonical format. (bits 63:48 = bit 47).

> +   else
> +      *(uint32_t *)p = v;
> +}
> +
> +static void
> +anv_reloc_list_apply(struct anv_reloc_list *list,
> +                     struct anv_device *device, struct anv_bo *bo)
> +{
> +   for (size_t i = 0; i < list->num_relocs; i++) {
> +      void *p = bo->map + list->relocs[i].offset;
> +
> +      struct anv_bo *target_bo = list->reloc_bos[i];
> +      write_reloc(device, p, target_bo->offset + list->relocs[i].delta);
> +      list->relocs[i].presumed_offset = bo->offset;
> +   }
> +}
> +
> +/**
> + * This function applies the relocation for a command buffer and writes the
> + * actual addresses into the buffers as per what we were told by the kernel on
> + * the previous execbuf2 call.  This should be safe to do because, for each
> + * relocated address, we have two cases:
> + *
> + *  1) The buffer to which the address points is not in use by the GPU.  In
> + *     this case, neither is the 32 or 64-bit chunk of GPU memory storing the
> + *     address so it's completely safe to update the address.

A caveat here is that if the kernel believes the buffer is not in use,
assume the kernel moved it. Hence you always need to tell the kernel
about the buffers for the batch and your assumed locations. This is very
important to remember when dealing with workloads that cause GTT
thrashing or swap thrashing. (In the past you would have also caused
corruption here if you forgot to set the write markers.)

> + *
> + *  2) The buffer to which the address points is currently in use by the GPU.
> + *     If it's still in use, then some execbuf2 call before this one used it
> + *     and it has been in use ever since.  In this case, it can't have moved
> + *     since the last execbuffer2 call.  We then have two subcases:
> + *
> + *      a) The 32 or 64-bit chunk of GPU memory storing the address is not in
> + *         use by the GPU.  In this case, it is safe to write the relocation.
> + *
> + *      a) The 32 or 64-bit chunk of GPU memory storing the address is
> + *         currently in use by the GPU.  In this case the the relocated
> + *         address we will write is exactly the same as the last address we
> + *         wrote there.

(But I don't see you checking on the kernel to confirm, e.g.
https://cgit.freedesktop.org/~ickle/mesa/tree/src/mesa/drivers/dri/i965/brw_batch.c?h=brw-batch#n814
)
> + *
> + * By doing relocations on the CPU, we can tell the kernel that it doesn't
> + * need to bother.  We want to do this because the surface state buffer is
> + * used by every command buffer so, if the kernel does the relocations, it
> + * will always be busy and the kernel will always stall.

Userspace relocation + NO_RELOC processing should also work out to be
fastest of the available approaches today. If you want to try an ignore
stalls flag, it shouldn't take too long to write (I'll get that done
today) and then you can compare the overhead of the kernel having to copy
the relocations from userspace. (You also want the fast execbuf relocation
processing patches from a few years ago...)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre