[Mesa-dev] [PATCH 40/53] i965/drm: Rewrite relocation handling.

Chris Wilson chris at chris-wilson.co.uk
Thu Apr 6 10:52:06 UTC 2017


On Wed, Apr 05, 2017 at 04:56:42PM -0700, Kenneth Graunke wrote:
> On Wednesday, April 5, 2017 4:46:27 AM PDT Chris Wilson wrote:
> > On Tue, Apr 04, 2017 at 05:10:30PM -0700, Kenneth Graunke wrote:
> > > The execbuf2 kernel API requires us to construct two kinds of lists.
> > > First is a "validation list" (struct drm_i915_gem_exec_object2[])
> > > containing each BO referenced by the batch.  (The batch buffer itself
> > > must be the last entry in this list.)  Each validation list entry
> > > contains a pointer to the second kind of list: a relocation list.
> > > The relocation list contains information about pointers to BOs that
> > > the kernel may need to patch up if it relocates objects within the VMA.
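[To make the structure concrete, here is a minimal sketch of the two lists using the types from i915_drm.h; the batch_bo and surf_bo GEM handles are hypothetical, for illustration only.]

#include <stdint.h>
#include <i915_drm.h>

static void
build_exec_lists(uint32_t batch_bo, uint32_t surf_bo)
{
   /* One relocation: "the pointer at byte 0x40 of the batch refers to
    * surf_bo"; the kernel rewrites that location if surf_bo moves. */
   struct drm_i915_gem_relocation_entry reloc = {
      .offset = 0x40,            /* where the pointer lives in the batch */
      .target_handle = surf_bo,  /* BO whose address is written there */
      .delta = 0,                /* constant added to the final address */
   };

   /* The validation list: every BO the batch references, batch last. */
   struct drm_i915_gem_exec_object2 validation_list[2] = {
      { .handle = surf_bo },
      { .handle = batch_bo,
        .relocation_count = 1,
        .relocs_ptr = (uintptr_t) &reloc },
   };
   (void) validation_list;
}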
> > > 
> > > This is a very general mechanism, allowing every BO to contain pointers
> > > to other BOs.  libdrm_intel models this by giving each drm_intel_bo a
> > > list of relocations to other BOs.  Together, these form "reloc trees".
> > > 
> > > Processing relocations involves a depth-first-search of the relocation
> > > trees, starting from the batch buffer.  Care has to be taken not to
> > > double-visit buffers.  Creating the validation list has to be deferred
> > > until the last minute, after all relocations are emitted, so we have the
> > > full tree present.  Calculating the amount of aperture space required to
> > > pin those BOs also involves tree walking, which is expensive, so libdrm
> > > has hacks to try to perform less expensive estimates.
> > > 
> > > For some reason, it also stored the validation list in the global
> > > (per-screen) bufmgr structure, rather than as a local variable in the
> > > execbuffer function, requiring locking for no good reason.
> > > 
> > > It also assumed that the batch would probably contain a relocation
> > > every 2 DWords - which is absurdly high - and simply aborted if there
> > > were more relocations than the max.  This meant the first relocation
> > > from a BO would allocate 180kB of data structures!
> > > 
> > > This is way too complicated for our needs.  i965 only emits relocations
> > > from the batchbuffer - all GPU commands and state such as SURFACE_STATE
> > > live in the batch BO.  No other buffer uses relocations.  This means we
> > > can have a single relocation list for the batchbuffer.  We can add a BO
> > > to the validation list (set) the first time we emit a relocation to it.
> > > We can easily keep a running tally of the aperture space required for
> > > that list by adding the BO size when we add it to the validation list.
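[An illustrative sketch of that scheme, using the fields this patch adds to struct intel_batchbuffer; names and details are illustrative, not necessarily the patch's exact code, and the bo->size/bo->handle fields are assumed from the public drm_bacon_bo layout.]

static unsigned
add_exec_bo(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
{
   for (int i = 0; i < batch->exec_count; i++) {
      if (batch->exec_bos[i] == bo)
         return i;   /* already on the validation list */
   }

   if (batch->exec_count == batch->exec_array_size) {
      batch->exec_array_size *= 2;   /* grow both arrays in lock-step */
      batch->exec_bos =
         realloc(batch->exec_bos,
                 batch->exec_array_size * sizeof(batch->exec_bos[0]));
      batch->exec_objects =
         realloc(batch->exec_objects,
                 batch->exec_array_size * sizeof(batch->exec_objects[0]));
   }

   batch->exec_objects[batch->exec_count] =
      (struct drm_i915_gem_exec_object2) {
         .handle = bo->handle,    /* assuming the BO's GEM handle field */
         .offset = bo->offset64,  /* presumed address for the kernel */
      };
   batch->exec_bos[batch->exec_count] = bo;
   batch->aperture_space += bo->size;  /* running total: "a" in a + b <= c */
   return batch->exec_count++;
}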
> > > 
> > > This patch overhauls the relocation system to do exactly that.  There
> > > are many nice benefits:
> > > 
> > > - We have a flat relocation list instead of trees.
> > > - We can produce the validation list up front.
> > > - We can allocate smaller arrays and dynamically grow them.
> > > - Aperture space checks are now (a + b <= c) instead of a tree walk
> > >   (see the sketch after this list).
> > > - brw_batch_references() is a trivial validation list walk.
> > >   It should be straightforward to make it O(1) in the future.
> > > - We don't need to bloat each drm_bacon_bo with 32B of reloc data.
> > > - We don't need to lock in execbuffer, as the data structures are
> > >   context-local, and not per-screen.
> > > - Significantly less code and a better match for what we're doing.
> > > - The simpler system should make it easier to take advantage of
> > >   I915_EXEC_NO_RELOC in a future patch.
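[The (a + b <= c) check referenced in the list above, as a minimal sketch; the per-screen limit field name is an assumption, not quoted from the patch.]

static bool
brw_batch_has_aperture_space(struct brw_context *brw, unsigned extra_space)
{
   /* a: bytes pinned by the BOs already on the validation list,
    * b: whatever the caller is about to add,
    * c: an assumed per-screen aperture limit. */
   return brw->batch.aperture_space + extra_space <=
          brw->screen->aperture_threshold;
}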
> > > ---
> > >  src/mesa/drivers/dri/i965/brw_bufmgr.h        |  53 +-
> > >  src/mesa/drivers/dri/i965/brw_compute.c       |   2 +-
> > >  src/mesa/drivers/dri/i965/brw_context.h       |  12 +
> > >  src/mesa/drivers/dri/i965/brw_draw.c          |   2 +-
> > >  src/mesa/drivers/dri/i965/genX_blorp_exec.c   |   2 +-
> > >  src/mesa/drivers/dri/i965/intel_batchbuffer.c | 215 +++++++-
> > >  src/mesa/drivers/dri/i965/intel_batchbuffer.h |   3 +
> > >  src/mesa/drivers/dri/i965/intel_blit.c        |  30 +-
> > >  src/mesa/drivers/dri/i965/intel_bufmgr_gem.c  | 718 +-------------------------
> > >  9 files changed, 229 insertions(+), 808 deletions(-)
> > > 
> > > diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.h b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> > > index 237f39bb078..d3db6a3967b 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_bufmgr.h
> > > +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> > > @@ -88,6 +88,15 @@ struct _drm_bacon_bo {
> > >  	 * entries when calling drm_bacon_bo_emit_reloc()
> > >  	 */
> > >  	uint64_t offset64;
> > > +
> > > +	/**
> > > +	 * Boolean of whether the GPU is definitely not accessing the buffer.
> > > +	 *
> > > +	 * This is only valid when reusable, since non-reusable
> > > +	 * buffers are those that have been shared with other
> > > +	 * processes, so we don't know their state.
> > > +	 */
> > > +	bool idle;
> > >  };
> > >  
> > >  #define BO_ALLOC_FOR_RENDER (1<<0)
> > > @@ -178,37 +187,6 @@ void drm_bacon_bo_wait_rendering(drm_bacon_bo *bo);
> > >   */
> > >  void drm_bacon_bufmgr_destroy(drm_bacon_bufmgr *bufmgr);
> > >  
> > > -/** Executes the command buffer pointed to by bo. */
> > > -int drm_bacon_bo_exec(drm_bacon_bo *bo, int used);
> > > -
> > > -/** Executes the command buffer pointed to by bo on the selected ring buffer */
> > > -int drm_bacon_bo_mrb_exec(drm_bacon_bo *bo, int used, unsigned int flags);
> > > -int drm_bacon_bufmgr_check_aperture_space(drm_bacon_bo ** bo_array, int count);
> > > -
> > > -/**
> > > - * Add relocation entry in reloc_buf, which will be updated with the
> > > - * target buffer's real offset on on command submission.
> > > - *
> > > - * Relocations remain in place for the lifetime of the buffer object.
> > > - *
> > > - * \param bo Buffer to write the relocation into.
> > > - * \param offset Byte offset within reloc_bo of the pointer to
> > > - *                      target_bo.
> > > - * \param target_bo Buffer whose offset should be written into the
> > > - *                  relocation entry.
> > > - * \param target_offset Constant value to be added to target_bo's
> > > - *                      offset in relocation entry.
> > > - * \param read_domains GEM read domains which the buffer will be
> > > - *                      read into by the command that this relocation
> > > - *                      is part of.
> > > - * \param write_domains GEM read domains which the buffer will be
> > > - *                      dirtied in by the command that this
> > > - *                      relocation is part of.
> > > - */
> > > -int drm_bacon_bo_emit_reloc(drm_bacon_bo *bo, uint32_t offset,
> > > -			    drm_bacon_bo *target_bo, uint32_t target_offset,
> > > -			    uint32_t read_domains, uint32_t write_domain);
> > > -
> > >  /**
> > >   * Ask that the buffer be placed in tiling mode
> > >   *
> > > @@ -271,9 +249,6 @@ int drm_bacon_bo_disable_reuse(drm_bacon_bo *bo);
> > >   */
> > >  int drm_bacon_bo_is_reusable(drm_bacon_bo *bo);
> > >  
> > > -/** Returns true if target_bo is in the relocation tree rooted at bo. */
> > > -int drm_bacon_bo_references(drm_bacon_bo *bo, drm_bacon_bo *target_bo);
> > > -
> > >  /* drm_bacon_bufmgr_gem.c */
> > >  drm_bacon_bufmgr *drm_bacon_bufmgr_gem_init(struct gen_device_info *devinfo,
> > >  					    int fd, int batch_size);
> > > @@ -290,8 +265,6 @@ void *drm_bacon_gem_bo_map__cpu(drm_bacon_bo *bo);
> > >  void *drm_bacon_gem_bo_map__gtt(drm_bacon_bo *bo);
> > >  void *drm_bacon_gem_bo_map__wc(drm_bacon_bo *bo);
> > >  
> > > -int drm_bacon_gem_bo_get_reloc_count(drm_bacon_bo *bo);
> > > -void drm_bacon_gem_bo_clear_relocs(drm_bacon_bo *bo, int start);
> > >  void drm_bacon_gem_bo_start_gtt_access(drm_bacon_bo *bo, int write_enable);
> > >  
> > >  int drm_bacon_gem_bo_wait(drm_bacon_bo *bo, int64_t timeout_ns);
> > > @@ -300,14 +273,6 @@ drm_bacon_context *drm_bacon_gem_context_create(drm_bacon_bufmgr *bufmgr);
> > >  int drm_bacon_gem_context_get_id(drm_bacon_context *ctx,
> > >                                   uint32_t *ctx_id);
> > >  void drm_bacon_gem_context_destroy(drm_bacon_context *ctx);
> > > -int drm_bacon_gem_bo_context_exec(drm_bacon_bo *bo, drm_bacon_context *ctx,
> > > -				  int used, unsigned int flags);
> > > -int drm_bacon_gem_bo_fence_exec(drm_bacon_bo *bo,
> > > -				drm_bacon_context *ctx,
> > > -				int used,
> > > -				int in_fence,
> > > -				int *out_fence,
> > > -				unsigned int flags);
> > >  
> > >  int drm_bacon_bo_gem_export_to_prime(drm_bacon_bo *bo, int *prime_fd);
> > >  drm_bacon_bo *drm_bacon_bo_gem_create_from_prime(drm_bacon_bufmgr *bufmgr,
> > > diff --git a/src/mesa/drivers/dri/i965/brw_compute.c b/src/mesa/drivers/dri/i965/brw_compute.c
> > > index d816d056aad..e924401c3af 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_compute.c
> > > +++ b/src/mesa/drivers/dri/i965/brw_compute.c
> > > @@ -212,7 +212,7 @@ brw_dispatch_compute_common(struct gl_context *ctx)
> > >  
> > >     brw->no_batch_wrap = false;
> > >  
> > > -   if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> > > +   if (!brw_batch_has_aperture_space(brw, 0)) {
> > >        if (!fail_next) {
> > >           intel_batchbuffer_reset_to_saved(brw);
> > >           intel_batchbuffer_flush(brw);
> > > diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> > > index 00e9224d7d7..186ce826801 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_context.h
> > > +++ b/src/mesa/drivers/dri/i965/brw_context.h
> > > @@ -477,9 +477,21 @@ struct intel_batchbuffer {
> > >     bool needs_sol_reset;
> > >     bool state_base_address_emitted;
> > >  
> > > +   struct drm_i915_gem_relocation_entry *relocs;
> > > +   int reloc_count;
> > > +   int reloc_array_size;
> > > +   /** The validation list */
> > > +   struct drm_i915_gem_exec_object2 *exec_objects;
> > > +   drm_bacon_bo **exec_bos;
> > > +   int exec_count;
> > > +   int exec_array_size;
> > > +   /** The amount of aperture space (in bytes) used by all exec_bos */
> > > +   int aperture_space;
> > 
> > You are not using the aperture (that is the CPU addressable BAR). This
> > is just the batch_size, and not to be confused with the rss that you may
> > also want to track in future.
> > 
> > > +
> > >     struct {
> > >        uint32_t *map_next;
> > >        int reloc_count;
> > > +      int exec_count;
> > >     } saved;
> > >  
> > >     /** Map from batch offset to brw_state_batch data (with DEBUG_BATCH) */
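[For reference, a hypothetical sketch of how those saved counts roll back a half-built batch in intel_batchbuffer_reset_to_saved(); that the validation list holds a reference per BO is an assumption here.]

static void
intel_batchbuffer_reset_to_saved(struct brw_context *brw)
{
   struct intel_batchbuffer *batch = &brw->batch;

   /* Drop anything added since the save point: release the references
    * taken when these BOs joined the validation list (assumed), and
    * un-account their aperture usage. */
   for (int i = batch->saved.exec_count; i < batch->exec_count; i++) {
      batch->aperture_space -= batch->exec_bos[i]->size;
      drm_bacon_bo_unreference(batch->exec_bos[i]);
   }

   batch->map_next = batch->saved.map_next;
   batch->reloc_count = batch->saved.reloc_count;
   batch->exec_count = batch->saved.exec_count;
}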
> > > diff --git a/src/mesa/drivers/dri/i965/brw_draw.c b/src/mesa/drivers/dri/i965/brw_draw.c
> > > index a6e229f2210..bf09915d0c3 100644
> > > --- a/src/mesa/drivers/dri/i965/brw_draw.c
> > > +++ b/src/mesa/drivers/dri/i965/brw_draw.c
> > > @@ -601,7 +601,7 @@ retry:
> > >  
> > >        brw->no_batch_wrap = false;
> > >  
> > > -      if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> > > +      if (!brw_batch_has_aperture_space(brw, 0)) {
> > >           if (!fail_next) {
> > >              intel_batchbuffer_reset_to_saved(brw);
> > >              intel_batchbuffer_flush(brw);
> > > diff --git a/src/mesa/drivers/dri/i965/genX_blorp_exec.c b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> > > index dc1c1ea2993..e1a799457a2 100644
> > > --- a/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> > > +++ b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> > > @@ -231,7 +231,7 @@ retry:
> > >      * map all the BOs into the GPU at batch exec time later.  If so, flush the
> > >      * batch and try again with nothing else in the batch.
> > >      */
> > > -   if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> > > +   if (!brw_batch_has_aperture_space(brw, 0)) {
> > >        if (!check_aperture_failed_once) {
> > >           check_aperture_failed_once = true;
> > >           intel_batchbuffer_reset_to_saved(brw);
> > > diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > > index 59ab55d31f7..444e33c48ee 100644
> > > --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > > +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> > > @@ -38,6 +38,8 @@
> > >  #include <xf86drm.h>
> > >  #include <i915_drm.h>
> > >  
> > > +#define FILE_DEBUG_FLAG DEBUG_BUFMGR
> > > +
> > >  static void
> > >  intel_batchbuffer_reset(struct intel_batchbuffer *batch,
> > >                          drm_bacon_bufmgr *bufmgr,
> > > @@ -68,6 +70,17 @@ intel_batchbuffer_init(struct intel_batchbuffer *batch,
> > >        batch->map_next = batch->cpu_map;
> > >     }
> > >  
> > > +   batch->reloc_count = 0;
> > > +   batch->reloc_array_size = 768;
> > 
> > Wowser!
> 
> I ran a couple of programs when trying to decide how big to make these
> arrays.  The Unreal Elemental demo had 749 relocations in a draw, and
> 200 BOs used by a batch.  Other simpler games had 264 relocations and
> 55 BOs.

I had 100 obj, 300 relocs as my idea of typical GL, with most of those
relocs being vertex/index buffers.

> These entries are 32 bytes, so 768 is 24kB.  It might make sense to pick
> a smaller size and let it adjust up once or twice.  It shouldn't be that
> costly to realloc as long as we aren't doing it all the time...
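Something like this, say (illustrative only, assuming Mesa's MAX2 macro):
start small and double on demand, so a typical batch reallocates at most
once or twice while a pathological one still works.

static bool
grow_reloc_array(struct intel_batchbuffer *batch)
{
   if (batch->reloc_count < batch->reloc_array_size)
      return true;

   /* ~750 relocs would grow 256 -> 512 -> 1024: two reallocs total. */
   int new_size = MAX2(batch->reloc_array_size * 2, 256);
   struct drm_i915_gem_relocation_entry *new_relocs =
      realloc(batch->relocs, new_size * sizeof(*new_relocs));
   if (new_relocs == NULL)
      return false;   /* leave the old array intact on failure */

   batch->relocs = new_relocs;
   batch->reloc_array_size = new_size;
   return true;
}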

Mostly I'm thinking about how even handling that many buffers with NORELOC
takes 10-20us (the cost of execbuf is more or less linear in the number of
execobjs). Bah, just make it as light as possible and parallelisable. On
full-ppgtt, we should be able to execute execbuf from different contexts
in parallel (so long as they don't overlap their buffers). Lots of fun
to get to that point. :|

> > > +   batch->relocs = malloc(batch->reloc_array_size *
> > > +                          sizeof(struct drm_i915_gem_relocation_entry));
> > 
> > Allocation failure handling? Deferred to use? Even so, you will need to
> > record you didn't allocate the array_size.
> 
> Frankly, if you run out of VMA such that malloc doesn't work, things are
> going to catch fire and there's not much we can do about it.  Obviously
> you need to be careful in the kernel, but we tend to play fast and loose
> with this in userspace.
> 
> It looks like the old libdrm_intel code checked for NULL pointers and
> failed the execbuffer call with -ENOMEM.  Of course, Mesa would then
> happily call abort().
> 
> libdrm_intel would also assert fail if you used more than 4094 relocs.
> 
> So I'm not sure this is actually worse.

I can accept "status quo with a plan to improve", but Mesa is used in
system-critical services (e.g. display servers); if Mesa dies, that is data
loss in many unrelated apps. I do think you can neatly do the allocation
checks and report the execbuf errors back to make it someone else's
problem (such as who should install the sigbus handlers to catch page
allocation failures in the middle of a mmap). Refactoring libdrm is the
perfect time to remove those aborts and exits :)
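For the execbuf side that can be as simple as (sketch; intel_batchbuffer.c
already pulls in xf86drm.h and i915_drm.h, plus errno.h here):

static int
do_execbuf(int fd, struct drm_i915_gem_execbuffer2 *execbuf)
{
   if (drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, execbuf) != 0)
      return -errno;   /* hand -ENOMEM and friends back to the caller */
   return 0;
}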

> > > +static int
> > > +execbuffer(int fd,
> > > +           struct intel_batchbuffer *batch,
> > > +           drm_bacon_context *ctx,
> > > +           int used,
> > > +           int in_fence,
> > > +           int *out_fence,
> > > +           int flags)
> > > +{
> > > +   struct drm_i915_gem_execbuffer2 execbuf = {
> > > +      .buffers_ptr = (uintptr_t) batch->exec_objects,
> > > +      .buffer_count = batch->exec_count,
> > > +      .batch_start_offset = 0,
> > > +      .batch_len = used,
> > > +      .cliprects_ptr = 0,
> > > +      .num_cliprects = 0,
> > > +      .DR1 = 0,
> > > +      .DR4 = 0,
> > 
> > Just don't mention the garbage.
> 
> Good call.
> 
> > > +      .flags = flags,
> > > +      .rsvd2 = 0,
> > > +   };
> > > +
> > > +   uint32_t ctx_id = 0;
> > > +   drm_bacon_gem_context_get_id(ctx, &ctx_id);
> > > +   i915_execbuffer2_set_context_id(execbuf, ctx_id);
> > 
> > Urgh.
> 
> I hadn't looked at what that did - urgh indeed.  Should I just do
> 
>       .rsvd1 = ctx_id; /* rsvd1 is actually the context ID */
> 
> instead?

Yes. Short of updating the headers to use an anonymous union, a
comment to explain the wart will suffice. This should be the only
location that's required.
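i.e. the initializer would end up as something like (sketch, with the
zero-filled fields left unmentioned as above):

struct drm_i915_gem_execbuffer2 execbuf = {
   .buffers_ptr = (uintptr_t) batch->exec_objects,
   .buffer_count = batch->exec_count,
   .batch_len = used,
   .flags = flags,
   .rsvd1 = ctx_id,   /* rsvd1 is actually the context ID */
};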
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

