[Mesa-dev] [PATCH 40/53] i965/drm: Rewrite relocation handling.
Chris Wilson
chris at chris-wilson.co.uk
Wed Apr 5 11:46:27 UTC 2017
On Tue, Apr 04, 2017 at 05:10:30PM -0700, Kenneth Graunke wrote:
> The execbuf2 kernel API requires us to construct two kinds of lists.
> First is a "validation list" (struct drm_i915_gem_exec_object2[])
> containing each BO referenced by the batch. (The batch buffer itself
> must be the last entry in this list.) Each validation list entry
> contains a pointer to the second kind of list: a relocation list.
> The relocation list contains information about pointers to BOs that
> the kernel may need to patch up if it relocates objects within the VMA.
>
> This is a very general mechanism, allowing every BO to contain pointers
> to other BOs. libdrm_intel models this by giving each drm_intel_bo a
> list of relocations to other BOs. Together, these form "reloc trees".
>
> Processing relocations involves a depth-first-search of the relocation
> trees, starting from the batch buffer. Care has to be taken not to
> double-visit buffers. Creating the validation list has to be deferred
> until the last minute, after all relocations are emitted, so we have the
> full tree present. Calculating the amount of aperture space required to
> pin those BOs also involves tree walking, which is expensive, so libdrm
> has hacks to try and perform less expensive estimates.
>
> For some reason, it also stored the validation list in the global
> (per-screen) bufmgr structure, rather than as a local variable in the
> execbuffer function, requiring locking for no good reason.
>
> It also assumed that the batch would probably contain a relocation
> every 2 DWords - which is absurdly high - and simply aborted if there
> were more relocations than the max. This meant the first relocation
> from a BO would allocate 180kB of data structures!
>
> This is way too complicated for our needs. i965 only emits relocations
> from the batchbuffer - all GPU commands and state such as SURFACE_STATE
> live in the batch BO. No other buffer uses relocations. This means we
> can have a single relocation list for the batchbuffer. We can add a BO
> to the validation list (set) the first time we emit a relocation to it.
> We can easily keep a running tally of the aperture space required for
> that list by adding the BO size when we add it to the validation list.
>
> This patch overhauls the relocation system to do exactly that. There
> are many nice benefits:
>
> - We have a flat relocation list instead of trees.
> - We can produce the validation list up front.
> - We can allocate smaller arrays and dynamically grow them.
> - Aperture space checks are now (a + b <= c) instead of a tree walk.
> - brw_batch_references() is a trivial validation list walk.
> It should be straightforward to make it O(1) in the future.
> - We don't need to bloat each drm_bacon_bo with 32B of reloc data.
> - We don't need to lock in execbuffer, as the data structures are
> context-local, and not per-screen.
> - Significantly less code and a better match for what we're doing.
> - The simpler system should make it easier to take advantage of
> I915_EXEC_NO_RELOC in a future patch.
> ---
> src/mesa/drivers/dri/i965/brw_bufmgr.h | 53 +-
> src/mesa/drivers/dri/i965/brw_compute.c | 2 +-
> src/mesa/drivers/dri/i965/brw_context.h | 12 +
> src/mesa/drivers/dri/i965/brw_draw.c | 2 +-
> src/mesa/drivers/dri/i965/genX_blorp_exec.c | 2 +-
> src/mesa/drivers/dri/i965/intel_batchbuffer.c | 215 +++++++-
> src/mesa/drivers/dri/i965/intel_batchbuffer.h | 3 +
> src/mesa/drivers/dri/i965/intel_blit.c | 30 +-
> src/mesa/drivers/dri/i965/intel_bufmgr_gem.c | 718 +-------------------------
> 9 files changed, 229 insertions(+), 808 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.h b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> index 237f39bb078..d3db6a3967b 100644
> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.h
> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> @@ -88,6 +88,15 @@ struct _drm_bacon_bo {
> * entries when calling drm_bacon_bo_emit_reloc()
> */
> uint64_t offset64;
> +
> + /**
> + * Boolean of whether the GPU is definitely not accessing the buffer.
> + *
> + * This is only valid when reusable, since non-reusable
> + * buffers are those that have been shared with other
> + * processes, so we don't know their state.
> + */
> + bool idle;
> };
>
> #define BO_ALLOC_FOR_RENDER (1<<0)
> @@ -178,37 +187,6 @@ void drm_bacon_bo_wait_rendering(drm_bacon_bo *bo);
> */
> void drm_bacon_bufmgr_destroy(drm_bacon_bufmgr *bufmgr);
>
> -/** Executes the command buffer pointed to by bo. */
> -int drm_bacon_bo_exec(drm_bacon_bo *bo, int used);
> -
> -/** Executes the command buffer pointed to by bo on the selected ring buffer */
> -int drm_bacon_bo_mrb_exec(drm_bacon_bo *bo, int used, unsigned int flags);
> -int drm_bacon_bufmgr_check_aperture_space(drm_bacon_bo ** bo_array, int count);
> -
> -/**
> - * Add relocation entry in reloc_buf, which will be updated with the
> - * target buffer's real offset on on command submission.
> - *
> - * Relocations remain in place for the lifetime of the buffer object.
> - *
> - * \param bo Buffer to write the relocation into.
> - * \param offset Byte offset within reloc_bo of the pointer to
> - * target_bo.
> - * \param target_bo Buffer whose offset should be written into the
> - * relocation entry.
> - * \param target_offset Constant value to be added to target_bo's
> - * offset in relocation entry.
> - * \param read_domains GEM read domains which the buffer will be
> - * read into by the command that this relocation
> - * is part of.
> - * \param write_domains GEM read domains which the buffer will be
> - * dirtied in by the command that this
> - * relocation is part of.
> - */
> -int drm_bacon_bo_emit_reloc(drm_bacon_bo *bo, uint32_t offset,
> - drm_bacon_bo *target_bo, uint32_t target_offset,
> - uint32_t read_domains, uint32_t write_domain);
> -
> /**
> * Ask that the buffer be placed in tiling mode
> *
> @@ -271,9 +249,6 @@ int drm_bacon_bo_disable_reuse(drm_bacon_bo *bo);
> */
> int drm_bacon_bo_is_reusable(drm_bacon_bo *bo);
>
> -/** Returns true if target_bo is in the relocation tree rooted at bo. */
> -int drm_bacon_bo_references(drm_bacon_bo *bo, drm_bacon_bo *target_bo);
> -
> /* drm_bacon_bufmgr_gem.c */
> drm_bacon_bufmgr *drm_bacon_bufmgr_gem_init(struct gen_device_info *devinfo,
> int fd, int batch_size);
> @@ -290,8 +265,6 @@ void *drm_bacon_gem_bo_map__cpu(drm_bacon_bo *bo);
> void *drm_bacon_gem_bo_map__gtt(drm_bacon_bo *bo);
> void *drm_bacon_gem_bo_map__wc(drm_bacon_bo *bo);
>
> -int drm_bacon_gem_bo_get_reloc_count(drm_bacon_bo *bo);
> -void drm_bacon_gem_bo_clear_relocs(drm_bacon_bo *bo, int start);
> void drm_bacon_gem_bo_start_gtt_access(drm_bacon_bo *bo, int write_enable);
>
> int drm_bacon_gem_bo_wait(drm_bacon_bo *bo, int64_t timeout_ns);
> @@ -300,14 +273,6 @@ drm_bacon_context *drm_bacon_gem_context_create(drm_bacon_bufmgr *bufmgr);
> int drm_bacon_gem_context_get_id(drm_bacon_context *ctx,
> uint32_t *ctx_id);
> void drm_bacon_gem_context_destroy(drm_bacon_context *ctx);
> -int drm_bacon_gem_bo_context_exec(drm_bacon_bo *bo, drm_bacon_context *ctx,
> - int used, unsigned int flags);
> -int drm_bacon_gem_bo_fence_exec(drm_bacon_bo *bo,
> - drm_bacon_context *ctx,
> - int used,
> - int in_fence,
> - int *out_fence,
> - unsigned int flags);
>
> int drm_bacon_bo_gem_export_to_prime(drm_bacon_bo *bo, int *prime_fd);
> drm_bacon_bo *drm_bacon_bo_gem_create_from_prime(drm_bacon_bufmgr *bufmgr,
> diff --git a/src/mesa/drivers/dri/i965/brw_compute.c b/src/mesa/drivers/dri/i965/brw_compute.c
> index d816d056aad..e924401c3af 100644
> --- a/src/mesa/drivers/dri/i965/brw_compute.c
> +++ b/src/mesa/drivers/dri/i965/brw_compute.c
> @@ -212,7 +212,7 @@ brw_dispatch_compute_common(struct gl_context *ctx)
>
> brw->no_batch_wrap = false;
>
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!fail_next) {
> intel_batchbuffer_reset_to_saved(brw);
> intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> index 00e9224d7d7..186ce826801 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.h
> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> @@ -477,9 +477,21 @@ struct intel_batchbuffer {
> bool needs_sol_reset;
> bool state_base_address_emitted;
>
> + struct drm_i915_gem_relocation_entry *relocs;
> + int reloc_count;
> + int reloc_array_size;
> + /** The validation list */
> + struct drm_i915_gem_exec_object2 *exec_objects;
> + drm_bacon_bo **exec_bos;
> + int exec_count;
> + int exec_array_size;
> + /** The amount of aperture space (in bytes) used by all exec_bos */
> + int aperture_space;
You are not measuring the aperture (that is the CPU-addressable BAR). This
is just the total size of the batch's BOs, and not to be confused with the
RSS that you may also want to track in future.
> +
> struct {
> uint32_t *map_next;
> int reloc_count;
> + int exec_count;
> } saved;
>
> /** Map from batch offset to brw_state_batch data (with DEBUG_BATCH) */
> diff --git a/src/mesa/drivers/dri/i965/brw_draw.c b/src/mesa/drivers/dri/i965/brw_draw.c
> index a6e229f2210..bf09915d0c3 100644
> --- a/src/mesa/drivers/dri/i965/brw_draw.c
> +++ b/src/mesa/drivers/dri/i965/brw_draw.c
> @@ -601,7 +601,7 @@ retry:
>
> brw->no_batch_wrap = false;
>
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!fail_next) {
> intel_batchbuffer_reset_to_saved(brw);
> intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/genX_blorp_exec.c b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> index dc1c1ea2993..e1a799457a2 100644
> --- a/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> +++ b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> @@ -231,7 +231,7 @@ retry:
> * map all the BOs into the GPU at batch exec time later. If so, flush the
> * batch and try again with nothing else in the batch.
> */
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!check_aperture_failed_once) {
> check_aperture_failed_once = true;
> intel_batchbuffer_reset_to_saved(brw);
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index 59ab55d31f7..444e33c48ee 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -38,6 +38,8 @@
> #include <xf86drm.h>
> #include <i915_drm.h>
>
> +#define FILE_DEBUG_FLAG DEBUG_BUFMGR
> +
> static void
> intel_batchbuffer_reset(struct intel_batchbuffer *batch,
> drm_bacon_bufmgr *bufmgr,
> @@ -68,6 +70,17 @@ intel_batchbuffer_init(struct intel_batchbuffer *batch,
> batch->map_next = batch->cpu_map;
> }
>
> + batch->reloc_count = 0;
> + batch->reloc_array_size = 768;
Wowser!
> + batch->relocs = malloc(batch->reloc_array_size *
> + sizeof(struct drm_i915_gem_relocation_entry));
Allocation failure handling? Deferred to first use? Even so, you will need
to record that you didn't actually allocate array_size entries.
> void
> intel_batchbuffer_reset_to_saved(struct brw_context *brw)
> {
> - drm_bacon_gem_bo_clear_relocs(brw->batch.bo, brw->batch.saved.reloc_count);
> + for (int i = brw->batch.saved.exec_count;
> + i < brw->batch.exec_count; i++) {
> + if (brw->batch.exec_bos[i] != brw->batch.bo) {
> + drm_bacon_bo_unreference(brw->batch.exec_bos[i]);
> + }
There's a trick to add to your todo list - treat being busy as an
implicit reference.
> +static void
> +add_exec_bo(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
> +{
> + if (bo != batch->bo) {
> + for (int i = 0; i < batch->exec_count; i++) {
> + if (batch->exec_bos[i] == bo)
> + return;
> + }
> +
> + drm_bacon_bo_reference(bo);
> + }
> +
> + if (batch->exec_count == batch->exec_array_size) {
> + batch->exec_array_size *= 2;
> + batch->exec_bos =
> + realloc(batch->exec_bos,
> + batch->exec_array_size * sizeof(batch->exec_bos[0]));
> + batch->exec_objects =
> + realloc(batch->exec_objects,
> + batch->exec_array_size * sizeof(batch->exec_objects[0]));
This has to be robust! What I suggested was to install a setjmp during
batch_begin so that you could unwind and escape after an allocation
failure and have it reported back to GL. Those patches to refactor the
aperture checking / error checking around batch buffer handling can be
applied to this series.
> + }
> +
> + struct drm_i915_gem_exec_object2 *validation_entry =
> + &batch->exec_objects[batch->exec_count];
> + validation_entry->handle = bo->handle;
> + if (bo == batch->bo) {
> + validation_entry->relocation_count = batch->reloc_count;
> + validation_entry->relocs_ptr = (uintptr_t) batch->relocs;
> + } else {
> + validation_entry->relocation_count = 0;
> + validation_entry->relocs_ptr = 0;
> + }
> + validation_entry->alignment = bo->align;
> + validation_entry->offset = bo->offset64;
> + validation_entry->flags = 0;
> + validation_entry->rsvd1 = 0;
> + validation_entry->rsvd2 = 0;
> +
> + batch->exec_bos[batch->exec_count] = bo;
> + batch->exec_count++;
> + batch->aperture_space += bo->size;
> +}
> +
> +static int
> +execbuffer(int fd,
> + struct intel_batchbuffer *batch,
> + drm_bacon_context *ctx,
> + int used,
> + int in_fence,
> + int *out_fence,
> + int flags)
> +{
> + struct drm_i915_gem_execbuffer2 execbuf = {
> + .buffers_ptr = (uintptr_t) batch->exec_objects,
> + .buffer_count = batch->exec_count,
> + .batch_start_offset = 0,
> + .batch_len = used,
> + .cliprects_ptr = 0,
> + .num_cliprects = 0,
> + .DR1 = 0,
> + .DR4 = 0,
Just don't mention the garbage.
> + .flags = flags,
> + .rsvd2 = 0,
> + };
> +
> + uint32_t ctx_id = 0;
> + drm_bacon_gem_context_get_id(ctx, &ctx_id);
> + i915_execbuffer2_set_context_id(execbuf, ctx_id);
Urgh.
> +
> + if (in_fence != -1) {
> + execbuf.rsvd2 = in_fence;
> + execbuf.flags |= I915_EXEC_FENCE_IN;
> + }
> +
> + if (out_fence != NULL) {
> + *out_fence = -1;
> + execbuf.flags |= I915_EXEC_FENCE_OUT;
> + }
> +
> + int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, &execbuf);
You could improve this by

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2;
	if (out_fence)
		cmd |= _IOC_WRITE << _IOC_DIRSHIFT;

or

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2_WR;
> + if (ret != 0)
> + ret = -errno;
> +
> + for (int i = 0; i < batch->exec_count; i++) {
> + drm_bacon_bo *bo = batch->exec_bos[i];
> +
> + bo->idle = false;
> +
> + /* Update drm_bacon_bo::offset64 */
> + if (batch->exec_objects[i].offset != bo->offset64) {
> + DBG("BO %d migrated: 0x%" PRIx64 " -> 0x%llx\n",
> + bo->handle, bo->offset64, batch->exec_objects[i].offset);
> + bo->offset64 = batch->exec_objects[i].offset;
> + }
> + }
> +
> + if (ret == 0 && out_fence != NULL)
> + *out_fence = execbuf.rsvd2 >> 32;
> +
> + return ret;
> +}
> +
> static int
> do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
> {
> + __DRIscreen *dri_screen = brw->screen->driScrnPriv;
> struct intel_batchbuffer *batch = &brw->batch;
> int ret = 0;
>
> @@ -484,17 +625,18 @@ do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
> flags |= I915_EXEC_GEN7_SOL_RESET;
>
> if (ret == 0) {
> - if (brw->hw_ctx == NULL || batch->ring != RENDER_RING) {
> + void *hw_ctx = batch->ring != RENDER_RING ? NULL : brw->hw_ctx;
> + if (hw_ctx == NULL) {
> assert(in_fence_fd == -1);
> assert(out_fence_fd == NULL);
These assertions are no longer relevant here; they only applied to the
libdrm_intel API.
> - ret = drm_bacon_bo_mrb_exec(batch->bo, 4 * USED_BATCH(*batch),
> - flags);
> - } else {
> - ret = drm_bacon_gem_bo_fence_exec(batch->bo, brw->hw_ctx,
> - 4 * USED_BATCH(*batch),
> - in_fence_fd, out_fence_fd,
> - flags);
> - }
> + }
> +
> + /* Add the batch itself to the end of the validation list */
> + add_exec_bo(batch, batch->bo);
> +
> + ret = execbuffer(dri_screen->fd, batch, hw_ctx,
> + 4 * USED_BATCH(*batch),
> + in_fence_fd, out_fence_fd, flags);
> }
>
> throttle(brw);
> @@ -577,9 +719,20 @@ _intel_batchbuffer_flush_fence(struct brw_context *brw,
> }
>
> bool
> +brw_batch_has_aperture_space(struct brw_context *brw, unsigned extra_space)
> +{
> + return brw->batch.aperture_space + extra_space <=
> + brw->screen->aperture_threshold;
> +}
> +
> +bool
> brw_batch_references(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
> {
> - return drm_bacon_bo_references(batch->bo, bo);
> + for (int i = 0; i < batch->exec_count; i++) {
> + if (batch->exec_bos[i] == bo)
> + return true;
> + }
> + return false;
> }
>
> /* This is the only way buffers get added to the validate list.
> @@ -589,13 +742,31 @@ brw_emit_reloc(struct intel_batchbuffer *batch, uint32_t batch_offset,
> drm_bacon_bo *target, uint32_t target_offset,
> uint32_t read_domains, uint32_t write_domain)
> {
> - int ret;
> + if (batch->reloc_count == batch->reloc_array_size) {
> + batch->reloc_array_size *= 2;
> + batch->relocs = realloc(batch->relocs,
> + batch->reloc_array_size *
> + sizeof(struct drm_i915_gem_relocation_entry));
See above for ideas on handling malloc failures.
> + }
> +
> + /* Check args */
> + assert(batch_offset <= BATCH_SZ - sizeof(uint32_t));
> + assert(_mesa_bitcount(write_domain) <= 1);
> +
> + if (target != batch->bo)
> + add_exec_bo(batch, target);
> +
> + struct drm_i915_gem_relocation_entry *reloc =
> + &batch->relocs[batch->reloc_count];
> +
> + batch->reloc_count++;
>
> - ret = drm_bacon_bo_emit_reloc(batch->bo, batch_offset,
> - target, target_offset,
> - read_domains, write_domain);
> - assert(ret == 0);
> - (void)ret;
> + reloc->offset = batch_offset;
> + reloc->delta = target_offset;
> + reloc->target_handle = target->handle;
> + reloc->read_domains = read_domains;
> + reloc->write_domain = write_domain;
> + reloc->presumed_offset = target->offset64;
Hmm, this is the wrong offset (it's the global offset, not the per-context
one). This is not the final version of relocation handling...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre