[Mesa-dev] [PATCH 40/53] i965/drm: Rewrite relocation handling.
Chris Wilson
chris at chris-wilson.co.uk
Wed Apr 5 11:46:27 UTC 2017
On Tue, Apr 04, 2017 at 05:10:30PM -0700, Kenneth Graunke wrote:
> The execbuf2 kernel API requires us to construct two kinds of lists.
> First is a "validation list" (struct drm_i915_gem_exec_object2[])
> containing each BO referenced by the batch. (The batch buffer itself
> must be the last entry in this list.) Each validation list entry
> contains a pointer to the second kind of list: a relocation list.
> The relocation list contains information about pointers to BOs that
> the kernel may need to patch up if it relocates objects within the VMA.
>
> This is a very general mechanism, allowing every BO to contain pointers
> to other BOs. libdrm_intel models this by giving each drm_intel_bo a
> list of relocations to other BOs. Together, these form "reloc trees".
>
> Processing relocations involves a depth-first-search of the relocation
> trees, starting from the batch buffer. Care has to be taken not to
> double-visit buffers. Creating the validation list has to be deferred
> until the last minute, after all relocations are emitted, so we have the
> full tree present. Calculating the amount of aperture space required to
> pin those BOs also involves tree walking, which is expensive, so libdrm
> has hacks to try and perform less expensive estimates.
>
> For some reason, it also stored the validation list in the global
> (per-screen) bufmgr structure, rather than as a local variable in the
> execbuffer function, requiring locking for no good reason.
>
> It also assumed that the batch would probably contain a relocation
> every 2 DWords - which is absurdly high - and simply aborted if there
> were more relocations than the max. This meant the first relocation
> from a BO would allocate 180kB of data structures!
>
> This is way too complicated for our needs. i965 only emits relocations
> from the batchbuffer - all GPU commands and state such as SURFACE_STATE
> live in the batch BO. No other buffer uses relocations. This means we
> can have a single relocation list for the batchbuffer. We can add a BO
> to the validation list (set) the first time we emit a relocation to it.
> We can easily keep a running tally of the aperture space required for
> that list by adding the BO size when we add it to the validation list.
>
> This patch overhauls the relocation system to do exactly that. There
> are many nice benefits:
>
> - We have a flat relocation list instead of trees.
> - We can produce the validation list up front.
> - We can allocate smaller arrays and dynamically grow them.
> - Aperture space checks are now (a + b <= c) instead of a tree walk.
> - brw_batch_references() is a trivial validation list walk.
> It should be straightforward to make it O(1) in the future.
> - We don't need to bloat each drm_bacon_bo with 32B of reloc data.
> - We don't need to lock in execbuffer, as the data structures are
> context-local, and not per-screen.
> - Significantly less code and a better match for what we're doing.
> - The simpler system should make it easier to take advantage of
> I915_EXEC_NO_RELOC in a future patch.
> ---
> src/mesa/drivers/dri/i965/brw_bufmgr.h | 53 +-
> src/mesa/drivers/dri/i965/brw_compute.c | 2 +-
> src/mesa/drivers/dri/i965/brw_context.h | 12 +
> src/mesa/drivers/dri/i965/brw_draw.c | 2 +-
> src/mesa/drivers/dri/i965/genX_blorp_exec.c | 2 +-
> src/mesa/drivers/dri/i965/intel_batchbuffer.c | 215 +++++++-
> src/mesa/drivers/dri/i965/intel_batchbuffer.h | 3 +
> src/mesa/drivers/dri/i965/intel_blit.c | 30 +-
> src/mesa/drivers/dri/i965/intel_bufmgr_gem.c | 718 +-------------------------
> 9 files changed, 229 insertions(+), 808 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_bufmgr.h b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> index 237f39bb078..d3db6a3967b 100644
> --- a/src/mesa/drivers/dri/i965/brw_bufmgr.h
> +++ b/src/mesa/drivers/dri/i965/brw_bufmgr.h
> @@ -88,6 +88,15 @@ struct _drm_bacon_bo {
> * entries when calling drm_bacon_bo_emit_reloc()
> */
> uint64_t offset64;
> +
> + /**
> + * Boolean of whether the GPU is definitely not accessing the buffer.
> + *
> + * This is only valid when reusable, since non-reusable
> + * buffers are those that have been shared with other
> + * processes, so we don't know their state.
> + */
> + bool idle;
> };
>
> #define BO_ALLOC_FOR_RENDER (1<<0)
> @@ -178,37 +187,6 @@ void drm_bacon_bo_wait_rendering(drm_bacon_bo *bo);
> */
> void drm_bacon_bufmgr_destroy(drm_bacon_bufmgr *bufmgr);
>
> -/** Executes the command buffer pointed to by bo. */
> -int drm_bacon_bo_exec(drm_bacon_bo *bo, int used);
> -
> -/** Executes the command buffer pointed to by bo on the selected ring buffer */
> -int drm_bacon_bo_mrb_exec(drm_bacon_bo *bo, int used, unsigned int flags);
> -int drm_bacon_bufmgr_check_aperture_space(drm_bacon_bo ** bo_array, int count);
> -
> -/**
> - * Add relocation entry in reloc_buf, which will be updated with the
> - * target buffer's real offset on on command submission.
> - *
> - * Relocations remain in place for the lifetime of the buffer object.
> - *
> - * \param bo Buffer to write the relocation into.
> - * \param offset Byte offset within reloc_bo of the pointer to
> - * target_bo.
> - * \param target_bo Buffer whose offset should be written into the
> - * relocation entry.
> - * \param target_offset Constant value to be added to target_bo's
> - * offset in relocation entry.
> - * \param read_domains GEM read domains which the buffer will be
> - * read into by the command that this relocation
> - * is part of.
> - * \param write_domains GEM read domains which the buffer will be
> - * dirtied in by the command that this
> - * relocation is part of.
> - */
> -int drm_bacon_bo_emit_reloc(drm_bacon_bo *bo, uint32_t offset,
> - drm_bacon_bo *target_bo, uint32_t target_offset,
> - uint32_t read_domains, uint32_t write_domain);
> -
> /**
> * Ask that the buffer be placed in tiling mode
> *
> @@ -271,9 +249,6 @@ int drm_bacon_bo_disable_reuse(drm_bacon_bo *bo);
> */
> int drm_bacon_bo_is_reusable(drm_bacon_bo *bo);
>
> -/** Returns true if target_bo is in the relocation tree rooted at bo. */
> -int drm_bacon_bo_references(drm_bacon_bo *bo, drm_bacon_bo *target_bo);
> -
> /* drm_bacon_bufmgr_gem.c */
> drm_bacon_bufmgr *drm_bacon_bufmgr_gem_init(struct gen_device_info *devinfo,
> int fd, int batch_size);
> @@ -290,8 +265,6 @@ void *drm_bacon_gem_bo_map__cpu(drm_bacon_bo *bo);
> void *drm_bacon_gem_bo_map__gtt(drm_bacon_bo *bo);
> void *drm_bacon_gem_bo_map__wc(drm_bacon_bo *bo);
>
> -int drm_bacon_gem_bo_get_reloc_count(drm_bacon_bo *bo);
> -void drm_bacon_gem_bo_clear_relocs(drm_bacon_bo *bo, int start);
> void drm_bacon_gem_bo_start_gtt_access(drm_bacon_bo *bo, int write_enable);
>
> int drm_bacon_gem_bo_wait(drm_bacon_bo *bo, int64_t timeout_ns);
> @@ -300,14 +273,6 @@ drm_bacon_context *drm_bacon_gem_context_create(drm_bacon_bufmgr *bufmgr);
> int drm_bacon_gem_context_get_id(drm_bacon_context *ctx,
> uint32_t *ctx_id);
> void drm_bacon_gem_context_destroy(drm_bacon_context *ctx);
> -int drm_bacon_gem_bo_context_exec(drm_bacon_bo *bo, drm_bacon_context *ctx,
> - int used, unsigned int flags);
> -int drm_bacon_gem_bo_fence_exec(drm_bacon_bo *bo,
> - drm_bacon_context *ctx,
> - int used,
> - int in_fence,
> - int *out_fence,
> - unsigned int flags);
>
> int drm_bacon_bo_gem_export_to_prime(drm_bacon_bo *bo, int *prime_fd);
> drm_bacon_bo *drm_bacon_bo_gem_create_from_prime(drm_bacon_bufmgr *bufmgr,
> diff --git a/src/mesa/drivers/dri/i965/brw_compute.c b/src/mesa/drivers/dri/i965/brw_compute.c
> index d816d056aad..e924401c3af 100644
> --- a/src/mesa/drivers/dri/i965/brw_compute.c
> +++ b/src/mesa/drivers/dri/i965/brw_compute.c
> @@ -212,7 +212,7 @@ brw_dispatch_compute_common(struct gl_context *ctx)
>
> brw->no_batch_wrap = false;
>
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!fail_next) {
> intel_batchbuffer_reset_to_saved(brw);
> intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> index 00e9224d7d7..186ce826801 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.h
> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> @@ -477,9 +477,21 @@ struct intel_batchbuffer {
> bool needs_sol_reset;
> bool state_base_address_emitted;
>
> + struct drm_i915_gem_relocation_entry *relocs;
> + int reloc_count;
> + int reloc_array_size;
> + /** The validation list */
> + struct drm_i915_gem_exec_object2 *exec_objects;
> + drm_bacon_bo **exec_bos;
> + int exec_count;
> + int exec_array_size;
> + /** The amount of aperture space (in bytes) used by all exec_bos */
> + int aperture_space;
You are not measuring the aperture (that is the CPU-addressable BAR). This
is just the total size of the batch's BOs, and not to be confused with the
RSS that you may also want to track in future.
> +
> struct {
> uint32_t *map_next;
> int reloc_count;
> + int exec_count;
> } saved;
>
> /** Map from batch offset to brw_state_batch data (with DEBUG_BATCH) */
> diff --git a/src/mesa/drivers/dri/i965/brw_draw.c b/src/mesa/drivers/dri/i965/brw_draw.c
> index a6e229f2210..bf09915d0c3 100644
> --- a/src/mesa/drivers/dri/i965/brw_draw.c
> +++ b/src/mesa/drivers/dri/i965/brw_draw.c
> @@ -601,7 +601,7 @@ retry:
>
> brw->no_batch_wrap = false;
>
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!fail_next) {
> intel_batchbuffer_reset_to_saved(brw);
> intel_batchbuffer_flush(brw);
> diff --git a/src/mesa/drivers/dri/i965/genX_blorp_exec.c b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> index dc1c1ea2993..e1a799457a2 100644
> --- a/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> +++ b/src/mesa/drivers/dri/i965/genX_blorp_exec.c
> @@ -231,7 +231,7 @@ retry:
> * map all the BOs into the GPU at batch exec time later. If so, flush the
> * batch and try again with nothing else in the batch.
> */
> - if (drm_bacon_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> + if (!brw_batch_has_aperture_space(brw, 0)) {
> if (!check_aperture_failed_once) {
> check_aperture_failed_once = true;
> intel_batchbuffer_reset_to_saved(brw);
> diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> index 59ab55d31f7..444e33c48ee 100644
> --- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> +++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
> @@ -38,6 +38,8 @@
> #include <xf86drm.h>
> #include <i915_drm.h>
>
> +#define FILE_DEBUG_FLAG DEBUG_BUFMGR
> +
> static void
> intel_batchbuffer_reset(struct intel_batchbuffer *batch,
> drm_bacon_bufmgr *bufmgr,
> @@ -68,6 +70,17 @@ intel_batchbuffer_init(struct intel_batchbuffer *batch,
> batch->map_next = batch->cpu_map;
> }
>
> + batch->reloc_count = 0;
> + batch->reloc_array_size = 768;
Wowser!
> + batch->relocs = malloc(batch->reloc_array_size *
> + sizeof(struct drm_i915_gem_relocation_entry));
Allocation failure handling? Deferred to first use? Even so, you will need
to record that you didn't actually allocate array_size entries.
> void
> intel_batchbuffer_reset_to_saved(struct brw_context *brw)
> {
> - drm_bacon_gem_bo_clear_relocs(brw->batch.bo, brw->batch.saved.reloc_count);
> + for (int i = brw->batch.saved.exec_count;
> + i < brw->batch.exec_count; i++) {
> + if (brw->batch.exec_bos[i] != brw->batch.bo) {
> + drm_bacon_bo_unreference(brw->batch.exec_bos[i]);
> + }
There's a trick to add to your todo list - treat being busy as an
implicit reference.
> +static void
> +add_exec_bo(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
> +{
> + if (bo != batch->bo) {
> + for (int i = 0; i < batch->exec_count; i++) {
> + if (batch->exec_bos[i] == bo)
> + return;
> + }
> +
> + drm_bacon_bo_reference(bo);
> + }
> +
> + if (batch->exec_count == batch->exec_array_size) {
> + batch->exec_array_size *= 2;
> + batch->exec_bos =
> + realloc(batch->exec_bos,
> + batch->exec_array_size * sizeof(batch->exec_bos[0]));
> + batch->exec_objects =
> + realloc(batch->exec_objects,
> + batch->exec_array_size * sizeof(batch->exec_objects[0]));
This has to be robust! What I suggested was to install a setjmp during
batch_begin so that you could unwind and escape after an allocation
failure and have it reported back to GL. Those patches to refactor the
aperture checking / error checking around batch buffer handling can be
applied to this series.
> + }
> +
> + struct drm_i915_gem_exec_object2 *validation_entry =
> + &batch->exec_objects[batch->exec_count];
> + validation_entry->handle = bo->handle;
> + if (bo == batch->bo) {
> + validation_entry->relocation_count = batch->reloc_count;
> + validation_entry->relocs_ptr = (uintptr_t) batch->relocs;
> + } else {
> + validation_entry->relocation_count = 0;
> + validation_entry->relocs_ptr = 0;
> + }
> + validation_entry->alignment = bo->align;
> + validation_entry->offset = bo->offset64;
> + validation_entry->flags = 0;
> + validation_entry->rsvd1 = 0;
> + validation_entry->rsvd2 = 0;
> +
> + batch->exec_bos[batch->exec_count] = bo;
> + batch->exec_count++;
> + batch->aperture_space += bo->size;
> +}
> +
> +static int
> +execbuffer(int fd,
> + struct intel_batchbuffer *batch,
> + drm_bacon_context *ctx,
> + int used,
> + int in_fence,
> + int *out_fence,
> + int flags)
> +{
> + struct drm_i915_gem_execbuffer2 execbuf = {
> + .buffers_ptr = (uintptr_t) batch->exec_objects,
> + .buffer_count = batch->exec_count,
> + .batch_start_offset = 0,
> + .batch_len = used,
> + .cliprects_ptr = 0,
> + .num_cliprects = 0,
> + .DR1 = 0,
> + .DR4 = 0,
Just don't mention the garbage.
> + .flags = flags,
> + .rsvd2 = 0,
> + };
> +
> + uint32_t ctx_id = 0;
> + drm_bacon_gem_context_get_id(ctx, &ctx_id);
> + i915_execbuffer2_set_context_id(execbuf, ctx_id);
Urgh.
> +
> + if (in_fence != -1) {
> + execbuf.rsvd2 = in_fence;
> + execbuf.flags |= I915_EXEC_FENCE_IN;
> + }
> +
> + if (out_fence != NULL) {
> + *out_fence = -1;
> + execbuf.flags |= I915_EXEC_FENCE_OUT;
> + }
> +
> + int ret = drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2_WR, &execbuf);
You could improve this by

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2;
	if (out_fence)
		cmd |= _IOC_WRITE << _IOC_DIRSHIFT;

or

	cmd = DRM_IOCTL_I915_GEM_EXECBUFFER2_WR;
> + if (ret != 0)
> + ret = -errno;
> +
> + for (int i = 0; i < batch->exec_count; i++) {
> + drm_bacon_bo *bo = batch->exec_bos[i];
> +
> + bo->idle = false;
> +
> + /* Update drm_bacon_bo::offset64 */
> + if (batch->exec_objects[i].offset != bo->offset64) {
> + DBG("BO %d migrated: 0x%" PRIx64 " -> 0x%llx\n",
> + bo->handle, bo->offset64, batch->exec_objects[i].offset);
> + bo->offset64 = batch->exec_objects[i].offset;
> + }
> + }
> +
> + if (ret == 0 && out_fence != NULL)
> + *out_fence = execbuf.rsvd2 >> 32;
> +
> + return ret;
> +}
> +
> static int
> do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
> {
> + __DRIscreen *dri_screen = brw->screen->driScrnPriv;
> struct intel_batchbuffer *batch = &brw->batch;
> int ret = 0;
>
> @@ -484,17 +625,18 @@ do_flush_locked(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
> flags |= I915_EXEC_GEN7_SOL_RESET;
>
> if (ret == 0) {
> - if (brw->hw_ctx == NULL || batch->ring != RENDER_RING) {
> + void *hw_ctx = batch->ring != RENDER_RING ? NULL : brw->hw_ctx;
> + if (hw_ctx == NULL) {
> assert(in_fence_fd == -1);
> assert(out_fence_fd == NULL);
These assertions are no longer relevant here; they only applied to the
libdrm_intel API.
> - ret = drm_bacon_bo_mrb_exec(batch->bo, 4 * USED_BATCH(*batch),
> - flags);
> - } else {
> - ret = drm_bacon_gem_bo_fence_exec(batch->bo, brw->hw_ctx,
> - 4 * USED_BATCH(*batch),
> - in_fence_fd, out_fence_fd,
> - flags);
> - }
> + }
> +
> + /* Add the batch itself to the end of the validation list */
> + add_exec_bo(batch, batch->bo);
> +
> + ret = execbuffer(dri_screen->fd, batch, hw_ctx,
> + 4 * USED_BATCH(*batch),
> + in_fence_fd, out_fence_fd, flags);
> }
>
> throttle(brw);
> @@ -577,9 +719,20 @@ _intel_batchbuffer_flush_fence(struct brw_context *brw,
> }
>
> bool
> +brw_batch_has_aperture_space(struct brw_context *brw, unsigned extra_space)
> +{
> + return brw->batch.aperture_space + extra_space <=
> + brw->screen->aperture_threshold;
> +}
> +
> +bool
> brw_batch_references(struct intel_batchbuffer *batch, drm_bacon_bo *bo)
> {
> - return drm_bacon_bo_references(batch->bo, bo);
> + for (int i = 0; i < batch->exec_count; i++) {
> + if (batch->exec_bos[i] == bo)
> + return true;
> + }
> + return false;
> }
>
> /* This is the only way buffers get added to the validate list.
> @@ -589,13 +742,31 @@ brw_emit_reloc(struct intel_batchbuffer *batch, uint32_t batch_offset,
> drm_bacon_bo *target, uint32_t target_offset,
> uint32_t read_domains, uint32_t write_domain)
> {
> - int ret;
> + if (batch->reloc_count == batch->reloc_array_size) {
> + batch->reloc_array_size *= 2;
> + batch->relocs = realloc(batch->relocs,
> + batch->reloc_array_size *
> + sizeof(struct drm_i915_gem_relocation_entry));
See above for ideas on handling malloc failures.
> + }
> +
> + /* Check args */
> + assert(batch_offset <= BATCH_SZ - sizeof(uint32_t));
> + assert(_mesa_bitcount(write_domain) <= 1);
> +
> + if (target != batch->bo)
> + add_exec_bo(batch, target);
> +
> + struct drm_i915_gem_relocation_entry *reloc =
> + &batch->relocs[batch->reloc_count];
> +
> + batch->reloc_count++;
>
> - ret = drm_bacon_bo_emit_reloc(batch->bo, batch_offset,
> - target, target_offset,
> - read_domains, write_domain);
> - assert(ret == 0);
> - (void)ret;
> + reloc->offset = batch_offset;
> + reloc->delta = target_offset;
> + reloc->target_handle = target->handle;
> + reloc->read_domains = read_domains;
> + reloc->write_domain = write_domain;
> + reloc->presumed_offset = target->offset64;
Hmm, this is the wrong offset (it's the global offset, not the per-context
one). This is not the final version of relocation handling...
-Chris
--
Chris Wilson, Intel Open Source Technology Centre