[Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
Kenneth Graunke
kenneth at whitecape.org
Mon Jul 6 12:34:00 PDT 2015
On Monday, July 06, 2015 11:33:09 AM Chris Wilson wrote:
> When submitting commands to the GPU every cycle of latency counts;
> mutexes, spinlocks, even atomics quickly add to substantial overhead.
>
> This "batch manager" acts as thread-local shim over the buffer manager
> (drm_intel_bufmgr_gem). As we are only ever used from within a single
> context, we can rely on the upper layers providing thread safety.
> This allows us to import buffers from the shared screen (sharing buffers
> between multiple contexts, threads and users) and wrap that handle in
> our own. Similarly, we want to share the buffer cache between all
> users on the file and so allocate from the global threadsafe buffer
> manager, with a very small and transient local cache of active buffers.
>
> The batch manager provides a cheap way of busyness tracking and very
> efficient batch construction and kernel submission.
>
> The restrictions over and above the generic submission engine in
> intel_bufmgr_gem are:
> - not thread-safe
> - flat relocations, only the batch buffer itself carries
> relocations. Relocations relative to auxiliary buffers
> must be performed via STATE_BASE
> - direct mapping of the batch for writes, expect reads
> from the batch to be slow
> - the batch is a fixed 64k in size
> - access to the batch must be wrapped by brw_batch_begin/_end
> - all relocations must be immediately written into the batch
>
> The importance of the flat relocation tree with local offset handling is
> that it allows us to use the "relocation-less" execbuffer interfaces,
> dramatically reducing the overhead of batch submission. However, that
> can be relaxed to allow other buffers than the batch buffer to carry
> relocations, if need be.
>
> ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation
> speedups.
>
> ISSUES:
> * shared mipmap trees
> - we instantiate a context local copy on use, but what are the semantics for
> serializing read/writes between them - do we need automagic flushing of
> execution on other contexts and common busyness tracking?
> - we retain references to the bo past the lifetime of its parent
> batchmgr as the mipmap_tree is retained past the lifetime of its
> original context, see glx_arb_create_context/default_major_version
> * OglMultithread is nevertheless unhappy; but that looks like undefined
> behaviour - i.e. a buggy client concurrently executing the same GL
> context in multiple threads, unpatched is equally buggy.
> * Add full-ppgtt softpinning support (no more relocations, at least for
> the first 256TiB), at the moment there is a limited proof-of-principle
> demonstration
> * polish and move to libdrm; though at the cost of sealing the structs?
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Kristian Høgsberg <krh at bitplanet.net>
> Cc: Kenneth Graunke <kenneth at whitecape.org>
> Cc: Jesse Barnes <jbarnes at virtuousgeek.org>
> Cc: Ian Romanick <ian.d.romanick at intel.com>
> Cc: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen at intel.com>
> Cc: Martin Peres <martin.peres at linux.intel.com>
> ---
> src/mesa/drivers/dri/i965/Makefile.sources | 4 +-
> src/mesa/drivers/dri/i965/brw_batch.c | 1946 ++++++++++++++++++++
> src/mesa/drivers/dri/i965/brw_batch.h | 377 ++++
> src/mesa/drivers/dri/i965/brw_binding_tables.c | 1 -
> src/mesa/drivers/dri/i965/brw_blorp.cpp | 46 +-
> src/mesa/drivers/dri/i965/brw_cc.c | 16 +-
> src/mesa/drivers/dri/i965/brw_clear.c | 1 -
> src/mesa/drivers/dri/i965/brw_clip.c | 2 -
> src/mesa/drivers/dri/i965/brw_clip_line.c | 2 -
> src/mesa/drivers/dri/i965/brw_clip_point.c | 2 -
> src/mesa/drivers/dri/i965/brw_clip_state.c | 14 +-
> src/mesa/drivers/dri/i965/brw_clip_tri.c | 2 -
> src/mesa/drivers/dri/i965/brw_clip_unfilled.c | 2 -
> src/mesa/drivers/dri/i965/brw_clip_util.c | 2 -
> src/mesa/drivers/dri/i965/brw_compute.c | 42 +-
> src/mesa/drivers/dri/i965/brw_conditional_render.c | 2 +-
> src/mesa/drivers/dri/i965/brw_context.c | 233 ++-
> src/mesa/drivers/dri/i965/brw_context.h | 144 +-
> src/mesa/drivers/dri/i965/brw_cs.cpp | 6 +-
> src/mesa/drivers/dri/i965/brw_curbe.c | 1 -
> src/mesa/drivers/dri/i965/brw_draw.c | 103 +-
> src/mesa/drivers/dri/i965/brw_draw_upload.c | 23 +-
> src/mesa/drivers/dri/i965/brw_ff_gs.c | 2 -
> src/mesa/drivers/dri/i965/brw_ff_gs_emit.c | 1 -
> src/mesa/drivers/dri/i965/brw_fs.cpp | 5 +-
> src/mesa/drivers/dri/i965/brw_meta_fast_clear.c | 11 +-
> src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c | 1 -
> src/mesa/drivers/dri/i965/brw_meta_updownsample.c | 1 -
> src/mesa/drivers/dri/i965/brw_misc_state.c | 10 +-
> src/mesa/drivers/dri/i965/brw_object_purgeable.c | 8 +-
> .../drivers/dri/i965/brw_performance_monitor.c | 88 +-
> src/mesa/drivers/dri/i965/brw_pipe_control.c | 24 +-
> src/mesa/drivers/dri/i965/brw_primitive_restart.c | 2 -
> src/mesa/drivers/dri/i965/brw_program.c | 23 +-
> src/mesa/drivers/dri/i965/brw_queryobj.c | 68 +-
> src/mesa/drivers/dri/i965/brw_reset.c | 14 +-
> src/mesa/drivers/dri/i965/brw_sampler_state.c | 8 +-
> src/mesa/drivers/dri/i965/brw_sf.c | 2 -
> src/mesa/drivers/dri/i965/brw_sf_emit.c | 2 -
> src/mesa/drivers/dri/i965/brw_sf_state.c | 21 +-
> src/mesa/drivers/dri/i965/brw_state.h | 2 +-
> src/mesa/drivers/dri/i965/brw_state_batch.c | 41 +-
> src/mesa/drivers/dri/i965/brw_state_cache.c | 70 +-
> src/mesa/drivers/dri/i965/brw_state_dump.c | 77 +-
> src/mesa/drivers/dri/i965/brw_state_upload.c | 16 +-
> src/mesa/drivers/dri/i965/brw_structs.h | 33 +-
> src/mesa/drivers/dri/i965/brw_urb.c | 9 +-
> src/mesa/drivers/dri/i965/brw_vec4.cpp | 5 +-
> src/mesa/drivers/dri/i965/brw_vs_state.c | 33 +-
> src/mesa/drivers/dri/i965/brw_vs_surface_state.c | 4 +-
> src/mesa/drivers/dri/i965/brw_wm_state.c | 38 +-
> src/mesa/drivers/dri/i965/brw_wm_surface_state.c | 76 +-
> src/mesa/drivers/dri/i965/gen6_blorp.cpp | 17 +-
> src/mesa/drivers/dri/i965/gen6_cc.c | 1 -
> src/mesa/drivers/dri/i965/gen6_clip_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_depth_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_depthstencil.c | 1 -
> src/mesa/drivers/dri/i965/gen6_gs_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_multisample_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_queryobj.c | 56 +-
> src/mesa/drivers/dri/i965/gen6_sampler_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_scissor_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_sf_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_sol.c | 9 +-
> src/mesa/drivers/dri/i965/gen6_surface_state.c | 13 +-
> src/mesa/drivers/dri/i965/gen6_urb.c | 1 -
> src/mesa/drivers/dri/i965/gen6_viewport_state.c | 1 -
> src/mesa/drivers/dri/i965/gen6_vs_state.c | 2 +-
> src/mesa/drivers/dri/i965/gen6_wm_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_blorp.cpp | 16 +-
> src/mesa/drivers/dri/i965/gen7_disable.c | 1 -
> src/mesa/drivers/dri/i965/gen7_gs_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_misc_state.c | 3 +-
> src/mesa/drivers/dri/i965/gen7_sf_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_sol_state.c | 49 +-
> src/mesa/drivers/dri/i965/gen7_urb.c | 1 -
> src/mesa/drivers/dri/i965/gen7_viewport_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_vs_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_wm_state.c | 1 -
> src/mesa/drivers/dri/i965/gen7_wm_surface_state.c | 60 +-
> src/mesa/drivers/dri/i965/gen8_blend_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_depth_state.c | 16 +-
> src/mesa/drivers/dri/i965/gen8_disable.c | 1 -
> src/mesa/drivers/dri/i965/gen8_draw_upload.c | 1 -
> src/mesa/drivers/dri/i965/gen8_gs_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_misc_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_multisample_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_ps_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_sf_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_sol_state.c | 3 +-
> src/mesa/drivers/dri/i965/gen8_surface_state.c | 73 +-
> src/mesa/drivers/dri/i965/gen8_viewport_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_vs_state.c | 1 -
> src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c | 1 -
> src/mesa/drivers/dri/i965/intel_batchbuffer.c | 480 -----
> src/mesa/drivers/dri/i965/intel_batchbuffer.h | 179 --
> src/mesa/drivers/dri/i965/intel_blit.c | 68 +-
> src/mesa/drivers/dri/i965/intel_blit.h | 10 +-
> src/mesa/drivers/dri/i965/intel_buffer_objects.c | 222 +--
> src/mesa/drivers/dri/i965/intel_buffer_objects.h | 18 +-
> src/mesa/drivers/dri/i965/intel_debug.c | 6 -
> src/mesa/drivers/dri/i965/intel_extensions.c | 48 +-
> src/mesa/drivers/dri/i965/intel_fbo.c | 46 +-
> src/mesa/drivers/dri/i965/intel_fbo.h | 4 -
> src/mesa/drivers/dri/i965/intel_image.h | 6 +-
> src/mesa/drivers/dri/i965/intel_mipmap_tree.c | 98 +-
> src/mesa/drivers/dri/i965/intel_mipmap_tree.h | 11 +-
> src/mesa/drivers/dri/i965/intel_pixel_bitmap.c | 3 +-
> src/mesa/drivers/dri/i965/intel_pixel_copy.c | 3 -
> src/mesa/drivers/dri/i965/intel_pixel_draw.c | 2 +-
> src/mesa/drivers/dri/i965/intel_pixel_read.c | 28 +-
> src/mesa/drivers/dri/i965/intel_screen.c | 68 +-
> src/mesa/drivers/dri/i965/intel_screen.h | 16 +-
> src/mesa/drivers/dri/i965/intel_syncobj.c | 86 +-
> src/mesa/drivers/dri/i965/intel_tex.c | 6 +-
> src/mesa/drivers/dri/i965/intel_tex_image.c | 35 +-
> src/mesa/drivers/dri/i965/intel_tex_subimage.c | 33 +-
> src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 14 +-
> src/mesa/drivers/dri/i965/intel_tiled_memcpy.h | 4 +-
> src/mesa/drivers/dri/i965/intel_upload.c | 33 +-
> 120 files changed, 3341 insertions(+), 2199 deletions(-)
> create mode 100644 src/mesa/drivers/dri/i965/brw_batch.c
> create mode 100644 src/mesa/drivers/dri/i965/brw_batch.h
> delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.c
> delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.h
Hi Chris,
Thanks so much for taking the time to do this - I've known that we
needed to revamp this code for a while, but am not terribly familiar
with the details at the kernel level. You understand it much better
than most of us. 20% is a huge improvement!
While I really like this idea in principle, the current patch is rather
huge, making it difficult to review; bisectability would also suffer.
Would it be possible to split it up into several smaller patches?
One idea I had for how you might accomplish that is to introduce the
brw_bo struct and related functions, but make it simply contain a
drm_intel_bo * field, and fall back to the existing libdrm-based code.
That way, you could put all the mechanical renaming and refactoring
across the entire code-base in one patch that should have no functional
change. You could then replace the contents or brw_bo and those
functions with your new improved batch manager.
We've talked about moving away from libdrm for a while, so having our
own functions and structures makes a lot of sense.
I'm also curious about the performance on non-LLC platforms (CHV or
BYT)? You've dropped a number of non-LLC paths - which is probably
fine, honestly...they were always of dubious value - but it'd be nice
to know the record the effects of this change on non-LLC in the commit
message.
[snip]
> +/**
> + * Number of bytes to reserve for commands necessary to complete a batch.
> + *
> + * This includes:
> + * - MI_BATCHBUFFER_END (4 bytes)
> + * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
> + * - Any state emitted by vtbl->finish_batch():
> + * - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
> + * - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
> + * - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
> + * - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
> + * which are 4 DWords each ==> 2 * 3 * 4 * 4 = 96 bytes
> + * - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+. ==> 12 bytes.
> + * On Ironlake, it's 6 DWords, but we have some slack due to the lack of
> + * Sandybridge PIPE_CONTROL madness.
> + *
> + * Total: 140 bytes
> + */
> +#define BATCH_RESERVED 140
I see you noticed that 146 was the result of me flubbing the arithmetic.
:) I apparently also flubbed up the Gen6 PIPE_CONTROL size, and just
landed a patch to fix that. It should be 152 now, IIRC.
> +
> +/* Surface offsets are limited to a maximum of 64k from the surface base */
> +#define BATCH_SIZE (64 << 10)
So...our current batches are 32k (8192 * sizeof(uint32_t)). Doubling
the size of the batch has always seemed like a good idea; we have a
bunch of overhead when BRW_NEW_BATCH triggers, and the "other driver"
uses larger batches too.
I'd experienced GPU hangs in the past when going beyond 64k, and I think
you've explained that with this comment. Good catch!
> +/* XXX Temporary home until kernel patches land */
> +#define I915_PARAM_HAS_EXEC_SOFTPIN 37
> +#define EXEC_OBJECT_PINNED (1<<4)
> +#define I915_PARAM_HAS_EXEC_BATCH_FIRST 38
> +#define I915_EXEC_BATCH_FIRST (1<<16)
> +
> +#define DBG_NO_FAST_RELOC 0
> +#define DBG_NO_HANDLE_LUT 0
> +#define DBG_NO_BATCH_FIRST 0
> +#define DBG_NO_SOFTPIN 0
> +
> +#define PERF_IDLE 0 /* ring mask */
> +
> +inline static void list_move(struct list_head *from, struct list_head *to)
> +{
> + list_del(from);
> + list_add(from, to);
> +}
> +
> +inline static void list_move_tail(struct list_head *from, struct list_head *to)
> +{
> + list_del(from);
> + list_addtail(from, to);
> +}
These should go into src/util/list.h in a separate patch.
> diff --git a/src/mesa/drivers/dri/i965/brw_batch.h b/src/mesa/drivers/dri/i965/brw_batch.h
> new file mode 100644
> index 0000000..0b5468b
> --- /dev/null
> +++ b/src/mesa/drivers/dri/i965/brw_batch.h
> @@ -0,0 +1,377 @@
> +#ifndef BRW_BATCH_H
> +#define BRW_BATCH_H
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <setjmp.h>
> +#include <assert.h>
> +
> +#include <intel_aub.h>
> +#include <intel_bufmgr.h>
> +
> +#include "util/list.h"
> +
> +#define HAS_GCC(major, minor) defined(__GNUC__) && (__GNUC__ > (major) || __GNUC__ == (major) && __GNUC_MINOR__ >= (minor))
> +
> +#if HAS_GCC(3, 4)
> +#define must_check __attribute__((warn_unused_result))
> +#else
> +#define must_check
> +#endif
I've sent out a patch that adds MUST_CHECK to Mesa as a whole, with the
appropriate configure.ac magic, so you should be able to just use that.
> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.cpp b/src/mesa/drivers/dri/i965/brw_blorp.cpp
> index 2ccfae1..e1a9f56 100644
> --- a/src/mesa/drivers/dri/i965/brw_blorp.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_blorp.cpp
> @@ -22,7 +22,6 @@
> */
>
> #include <errno.h>
> -#include "intel_batchbuffer.h"
> #include "intel_fbo.h"
>
> #include "brw_blorp.h"
> @@ -211,7 +210,9 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
> {
> struct gl_context *ctx = &brw->ctx;
> uint32_t estimated_max_batch_usage = 1500;
> - bool check_aperture_failed_once = false;
> +
> + if (brw_batch_begin(&brw->batch, estimated_max_batch_usage, RENDER_RING) < 0)
> + return;
>
> /* Flush the sampler and render caches. We definitely need to flush the
> * sampler cache so that we get updated contents from the render cache for
> @@ -222,13 +223,6 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
> */
> brw_emit_mi_flush(brw);
>
> -retry:
> - intel_batchbuffer_require_space(brw, estimated_max_batch_usage, RENDER_RING);
> - intel_batchbuffer_save_state(brw);
> - drm_intel_bo *saved_bo = brw->batch.bo;
> - uint32_t saved_used = brw->batch.used;
> - uint32_t saved_state_batch_offset = brw->batch.state_batch_offset;
> -
> switch (brw->gen) {
> case 6:
> gen6_blorp_exec(brw, params);
> @@ -241,37 +235,18 @@ retry:
> unreachable("not reached");
> }
>
> - /* Make sure we didn't wrap the batch unintentionally, and make sure we
> - * reserved enough space that a wrap will never happen.
> - */
> - assert(brw->batch.bo == saved_bo);
> - assert((brw->batch.used - saved_used) * 4 +
> - (saved_state_batch_offset - brw->batch.state_batch_offset) <
> - estimated_max_batch_usage);
> - /* Shut up compiler warnings on release build */
> - (void)saved_bo;
> - (void)saved_used;
> - (void)saved_state_batch_offset;
> + brw_emit_mi_flush(brw);
>
> /* Check if the blorp op we just did would make our batch likely to fail to
> * map all the BOs into the GPU at batch exec time later. If so, flush the
> * batch and try again with nothing else in the batch.
> */
> - if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> - if (!check_aperture_failed_once) {
> - check_aperture_failed_once = true;
> - intel_batchbuffer_reset_to_saved(brw);
> - intel_batchbuffer_flush(brw);
> - goto retry;
> - } else {
> - int ret = intel_batchbuffer_flush(brw);
> - WARN_ONCE(ret == -ENOSPC,
> - "i965: blorp emit exceeded available aperture space\n");
> - }
> + if (brw_batch_end(&brw->batch)) {
> + WARN_ONCE(1, "i965: blorp emit exceeded available aperture space\n");
> + return;
> }
>
> - if (unlikely(brw->always_flush_batch))
> - intel_batchbuffer_flush(brw);
> + brw_batch_maybe_flush(&brw->batch);
>
> /* We've smashed all state compared to what the normal 3D pipeline
> * rendering tracks for GL.
> @@ -279,11 +254,6 @@ retry:
> brw->ctx.NewDriverState = ~0ull;
> brw->no_depth_or_stencil = false;
> brw->ib.type = -1;
> -
> - /* Flush the sampler cache so any texturing from the destination is
> - * coherent.
> - */
> - brw_emit_mi_flush(brw);
It looks like you moved this flush earlier, so it's in the section of
code that retries? That's probably reasonable. Seems worth keeping the
comment, and this could be done as a separate patch...
[snip]
> +static void
> +load_sized_register_mem(struct brw_context *brw,
> + uint32_t reg,
> + struct brw_bo *bo,
> + uint32_t read_domains, uint32_t write_domain,
> + uint32_t offset,
> + int size)
> +{
> + int i;
> +
> + /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */
> + assert(brw->gen >= 7);
> +
> + if (brw->gen >= 8) {
> + BEGIN_BATCH(4 * size);
> + for (i = 0; i < size; i++) {
> + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2));
> + OUT_BATCH(reg + i * 4);
> + OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4);
> + }
> + ADVANCE_BATCH();
> + } else {
> + BEGIN_BATCH(3 * size);
> + for (i = 0; i < size; i++) {
> + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (3 - 2));
> + OUT_BATCH(reg + i * 4);
> + OUT_RELOC(bo, read_domains, write_domain, offset + i * 4);
> + }
> + ADVANCE_BATCH();
> + }
> +}
> +
> +void
> +brw_load_register_mem(struct brw_context *brw,
> + uint32_t reg,
> + struct brw_bo *bo,
> + uint32_t read_domains, uint32_t write_domain,
> + uint32_t offset)
> +{
> + load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 1);
> +}
> +
> +void
> +brw_load_register_mem64(struct brw_context *brw,
> + uint32_t reg,
> + struct brw_bo *bo,
> + uint32_t read_domains, uint32_t write_domain,
> + uint32_t offset)
> +{
> + load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 2);
> +}
Maybe just make these static inlines in brw_state.h or somewhere?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150706/90e5d605/attachment-0001.sig>
More information about the mesa-dev
mailing list