[Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

Mon Jul 6 12:34:00 PDT 2015

On Monday, July 06, 2015 11:33:09 AM Chris Wilson wrote:
> When submitting commands to the GPU every cycle of latency counts;
> mutexes, spinlocks, even atomics quickly add to substantial overhead.
> 
> This "batch manager" acts as thread-local shim over the buffer manager
> (drm_intel_bufmgr_gem). As we are only ever used from within a single
> context, we can rely on the upper layers providing thread safety.
> This allows us to import buffers from the shared screen (sharing buffers
> between multiple contexts, threads and users) and wrap that handle in
> our own. Similarly, we want to share the buffer cache between all
> users on the file and so allocate from the global threadsafe buffer
> manager, with a very small and transient local cache of active buffers.
> 
> The batch manager provides a cheap way of busyness tracking and very
> efficient batch construction and kernel submission.
> 
> The restrictions over and above the generic submission engine in
> intel_bufmgr_gem are:
>      - not thread-safe
>      - flat relocations, only the batch buffer itself carries
>        relocations. Relocations relative to auxiliary buffers
>        must be performed via STATE_BASE
>      - direct mapping of the batch for writes, expect reads
>        from the batch to be slow
>      - the batch is a fixed 64k in size
>      - access to the batch must be wrapped by brw_batch_begin/_end
>      - all relocations must be immediately written into the batch
> 
> The importance of the flat relocation tree with local offset handling is
> that it allows us to use the "relocation-less" execbuffer interfaces,
> dramatically reducing the overhead of batch submission. However, that
> can be relaxed to allow other buffers than the batch buffer to carry
> relocations, if need be.
> 
> ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation
> speedups.
> 
> ISSUES:
> * shared mipmap trees
>   - we instantiate a context local copy on use, but what are the semantics for
>     serializing read/writes between them - do we need automagic flushing of
>     execution on other contexts and common busyness tracking?
>   - we retain references to the bo past the lifetime of its parent
>     batchmgr as the mipmap_tree is retained past the lifetime of its
>     original context, see glx_arb_create_context/default_major_version
> * OglMultithread is nevertheless unhappy; but that looks like undefined
>   behaviour - i.e. a buggy client concurrently executing the same GL
>   context in multiple threads, unpatched is equally buggy.
> * Add full-ppgtt softpinning support (no more relocations, at least for
>   the first 256TiB), at the moment there is a limited proof-of-principle
>   demonstration
> * polish and move to libdrm; though at the cost of sealing the structs?
> 
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> Cc: Kristian Høgsberg <krh at bitplanet.net>
> Cc: Kenneth Graunke <kenneth at whitecape.org>
> Cc: Jesse Barnes <jbarnes at virtuousgeek.org>
> Cc: Ian Romanick <ian.d.romanick at intel.com>
> Cc: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
> Cc: Eero Tamminen <eero.t.tamminen at intel.com>
> Cc: Martin Peres <martin.peres at linux.intel.com>
> ---
>  src/mesa/drivers/dri/i965/Makefile.sources         |    4 +-
>  src/mesa/drivers/dri/i965/brw_batch.c              | 1946 ++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_batch.h              |  377 ++++
>  src/mesa/drivers/dri/i965/brw_binding_tables.c     |    1 -
>  src/mesa/drivers/dri/i965/brw_blorp.cpp            |   46 +-
>  src/mesa/drivers/dri/i965/brw_cc.c                 |   16 +-
>  src/mesa/drivers/dri/i965/brw_clear.c              |    1 -
>  src/mesa/drivers/dri/i965/brw_clip.c               |    2 -
>  src/mesa/drivers/dri/i965/brw_clip_line.c          |    2 -
>  src/mesa/drivers/dri/i965/brw_clip_point.c         |    2 -
>  src/mesa/drivers/dri/i965/brw_clip_state.c         |   14 +-
>  src/mesa/drivers/dri/i965/brw_clip_tri.c           |    2 -
>  src/mesa/drivers/dri/i965/brw_clip_unfilled.c      |    2 -
>  src/mesa/drivers/dri/i965/brw_clip_util.c          |    2 -
>  src/mesa/drivers/dri/i965/brw_compute.c            |   42 +-
>  src/mesa/drivers/dri/i965/brw_conditional_render.c |    2 +-
>  src/mesa/drivers/dri/i965/brw_context.c            |  233 ++-
>  src/mesa/drivers/dri/i965/brw_context.h            |  144 +-
>  src/mesa/drivers/dri/i965/brw_cs.cpp               |    6 +-
>  src/mesa/drivers/dri/i965/brw_curbe.c              |    1 -
>  src/mesa/drivers/dri/i965/brw_draw.c               |  103 +-
>  src/mesa/drivers/dri/i965/brw_draw_upload.c        |   23 +-
>  src/mesa/drivers/dri/i965/brw_ff_gs.c              |    2 -
>  src/mesa/drivers/dri/i965/brw_ff_gs_emit.c         |    1 -
>  src/mesa/drivers/dri/i965/brw_fs.cpp               |    5 +-
>  src/mesa/drivers/dri/i965/brw_meta_fast_clear.c    |   11 +-
>  src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c  |    1 -
>  src/mesa/drivers/dri/i965/brw_meta_updownsample.c  |    1 -
>  src/mesa/drivers/dri/i965/brw_misc_state.c         |   10 +-
>  src/mesa/drivers/dri/i965/brw_object_purgeable.c   |    8 +-
>  .../drivers/dri/i965/brw_performance_monitor.c     |   88 +-
>  src/mesa/drivers/dri/i965/brw_pipe_control.c       |   24 +-
>  src/mesa/drivers/dri/i965/brw_primitive_restart.c  |    2 -
>  src/mesa/drivers/dri/i965/brw_program.c            |   23 +-
>  src/mesa/drivers/dri/i965/brw_queryobj.c           |   68 +-
>  src/mesa/drivers/dri/i965/brw_reset.c              |   14 +-
>  src/mesa/drivers/dri/i965/brw_sampler_state.c      |    8 +-
>  src/mesa/drivers/dri/i965/brw_sf.c                 |    2 -
>  src/mesa/drivers/dri/i965/brw_sf_emit.c            |    2 -
>  src/mesa/drivers/dri/i965/brw_sf_state.c           |   21 +-
>  src/mesa/drivers/dri/i965/brw_state.h              |    2 +-
>  src/mesa/drivers/dri/i965/brw_state_batch.c        |   41 +-
>  src/mesa/drivers/dri/i965/brw_state_cache.c        |   70 +-
>  src/mesa/drivers/dri/i965/brw_state_dump.c         |   77 +-
>  src/mesa/drivers/dri/i965/brw_state_upload.c       |   16 +-
>  src/mesa/drivers/dri/i965/brw_structs.h            |   33 +-
>  src/mesa/drivers/dri/i965/brw_urb.c                |    9 +-
>  src/mesa/drivers/dri/i965/brw_vec4.cpp             |    5 +-
>  src/mesa/drivers/dri/i965/brw_vs_state.c           |   33 +-
>  src/mesa/drivers/dri/i965/brw_vs_surface_state.c   |    4 +-
>  src/mesa/drivers/dri/i965/brw_wm_state.c           |   38 +-
>  src/mesa/drivers/dri/i965/brw_wm_surface_state.c   |   76 +-
>  src/mesa/drivers/dri/i965/gen6_blorp.cpp           |   17 +-
>  src/mesa/drivers/dri/i965/gen6_cc.c                |    1 -
>  src/mesa/drivers/dri/i965/gen6_clip_state.c        |    1 -
>  src/mesa/drivers/dri/i965/gen6_depth_state.c       |    1 -
>  src/mesa/drivers/dri/i965/gen6_depthstencil.c      |    1 -
>  src/mesa/drivers/dri/i965/gen6_gs_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen6_multisample_state.c |    1 -
>  src/mesa/drivers/dri/i965/gen6_queryobj.c          |   56 +-
>  src/mesa/drivers/dri/i965/gen6_sampler_state.c     |    1 -
>  src/mesa/drivers/dri/i965/gen6_scissor_state.c     |    1 -
>  src/mesa/drivers/dri/i965/gen6_sf_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen6_sol.c               |    9 +-
>  src/mesa/drivers/dri/i965/gen6_surface_state.c     |   13 +-
>  src/mesa/drivers/dri/i965/gen6_urb.c               |    1 -
>  src/mesa/drivers/dri/i965/gen6_viewport_state.c    |    1 -
>  src/mesa/drivers/dri/i965/gen6_vs_state.c          |    2 +-
>  src/mesa/drivers/dri/i965/gen6_wm_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen7_blorp.cpp           |   16 +-
>  src/mesa/drivers/dri/i965/gen7_disable.c           |    1 -
>  src/mesa/drivers/dri/i965/gen7_gs_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen7_misc_state.c        |    3 +-
>  src/mesa/drivers/dri/i965/gen7_sf_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen7_sol_state.c         |   49 +-
>  src/mesa/drivers/dri/i965/gen7_urb.c               |    1 -
>  src/mesa/drivers/dri/i965/gen7_viewport_state.c    |    1 -
>  src/mesa/drivers/dri/i965/gen7_vs_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen7_wm_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen7_wm_surface_state.c  |   60 +-
>  src/mesa/drivers/dri/i965/gen8_blend_state.c       |    1 -
>  src/mesa/drivers/dri/i965/gen8_depth_state.c       |   16 +-
>  src/mesa/drivers/dri/i965/gen8_disable.c           |    1 -
>  src/mesa/drivers/dri/i965/gen8_draw_upload.c       |    1 -
>  src/mesa/drivers/dri/i965/gen8_gs_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen8_misc_state.c        |    1 -
>  src/mesa/drivers/dri/i965/gen8_multisample_state.c |    1 -
>  src/mesa/drivers/dri/i965/gen8_ps_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen8_sf_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen8_sol_state.c         |    3 +-
>  src/mesa/drivers/dri/i965/gen8_surface_state.c     |   73 +-
>  src/mesa/drivers/dri/i965/gen8_viewport_state.c    |    1 -
>  src/mesa/drivers/dri/i965/gen8_vs_state.c          |    1 -
>  src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c  |    1 -
>  src/mesa/drivers/dri/i965/intel_batchbuffer.c      |  480 -----
>  src/mesa/drivers/dri/i965/intel_batchbuffer.h      |  179 --
>  src/mesa/drivers/dri/i965/intel_blit.c             |   68 +-
>  src/mesa/drivers/dri/i965/intel_blit.h             |   10 +-
>  src/mesa/drivers/dri/i965/intel_buffer_objects.c   |  222 +--
>  src/mesa/drivers/dri/i965/intel_buffer_objects.h   |   18 +-
>  src/mesa/drivers/dri/i965/intel_debug.c            |    6 -
>  src/mesa/drivers/dri/i965/intel_extensions.c       |   48 +-
>  src/mesa/drivers/dri/i965/intel_fbo.c              |   46 +-
>  src/mesa/drivers/dri/i965/intel_fbo.h              |    4 -
>  src/mesa/drivers/dri/i965/intel_image.h            |    6 +-
>  src/mesa/drivers/dri/i965/intel_mipmap_tree.c      |   98 +-
>  src/mesa/drivers/dri/i965/intel_mipmap_tree.h      |   11 +-
>  src/mesa/drivers/dri/i965/intel_pixel_bitmap.c     |    3 +-
>  src/mesa/drivers/dri/i965/intel_pixel_copy.c       |    3 -
>  src/mesa/drivers/dri/i965/intel_pixel_draw.c       |    2 +-
>  src/mesa/drivers/dri/i965/intel_pixel_read.c       |   28 +-
>  src/mesa/drivers/dri/i965/intel_screen.c           |   68 +-
>  src/mesa/drivers/dri/i965/intel_screen.h           |   16 +-
>  src/mesa/drivers/dri/i965/intel_syncobj.c          |   86 +-
>  src/mesa/drivers/dri/i965/intel_tex.c              |    6 +-
>  src/mesa/drivers/dri/i965/intel_tex_image.c        |   35 +-
>  src/mesa/drivers/dri/i965/intel_tex_subimage.c     |   33 +-
>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c     |   14 +-
>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.h     |    4 +-
>  src/mesa/drivers/dri/i965/intel_upload.c           |   33 +-
>  120 files changed, 3341 insertions(+), 2199 deletions(-)
>  create mode 100644 src/mesa/drivers/dri/i965/brw_batch.c
>  create mode 100644 src/mesa/drivers/dri/i965/brw_batch.h
>  delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.c
>  delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.h

Hi Chris,

Thanks so much for taking the time to do this - I've known that we
needed to revamp this code for a while, but am not terribly familiar
with the details at the kernel level.  You understand it much better
than most of us.  20% is a huge improvement!

While I really like this idea in principle, the current patch is rather
huge, making it difficult to review; bisectability would also suffer.
Would it be possible to split it up into several smaller patches?

One idea I had for how you might accomplish that is to introduce the
brw_bo struct and related functions, but make it simply contain a
drm_intel_bo * field, and fall back to the existing libdrm-based code.

That way, you could put all the mechanical renaming and refactoring
across the entire code-base in one patch that should have no functional
change.  You could then replace the contents or brw_bo and those
functions with your new improved batch manager.

We've talked about moving away from libdrm for a while, so having our
own functions and structures makes a lot of sense.

I'm also curious about the performance on non-LLC platforms (CHV or
BYT)?  You've dropped a number of non-LLC paths - which is probably
fine, honestly...they were always of dubious value - but it'd be nice
to know the record the effects of this change on non-LLC in the commit
message.

[snip]
> +/**
> + * Number of bytes to reserve for commands necessary to complete a batch.
> + *
> + * This includes:
> + * - MI_BATCHBUFFER_END (4 bytes)
> + * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
> + * - Any state emitted by vtbl->finish_batch():
> + *   - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
> + *   - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
> + *   - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
> + *     - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
> + *       which are 4 DWords each ==> 2 * 3 * 4 * 4 = 96 bytes
> + *     - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+.  ==> 12 bytes.
> + *       On Ironlake, it's 6 DWords, but we have some slack due to the lack of
> + *       Sandybridge PIPE_CONTROL madness.
> + *
> + * Total: 140 bytes
> + */
> +#define BATCH_RESERVED 140

I see you noticed that 146 was the result of me flubbing the arithmetic.
:) I apparently also flubbed up the Gen6 PIPE_CONTROL size, and just
landed a patch to fix that.  It should be 152 now, IIRC.

> +
> +/* Surface offsets are limited to a maximum of 64k from the surface base */
> +#define BATCH_SIZE (64 << 10)

So...our current batches are 32k (8192 * sizeof(uint32_t)).  Doubling
the size of the batch has always seemed like a good idea; we have a
bunch of overhead when BRW_NEW_BATCH triggers, and the "other driver"
uses larger batches too.

I'd experienced GPU hangs in the past when going beyond 64k, and I think
you've explained that with this comment.  Good catch!

> +/* XXX Temporary home until kernel patches land */
> +#define I915_PARAM_HAS_EXEC_SOFTPIN 37
> +#define EXEC_OBJECT_PINNED	(1<<4)
> +#define I915_PARAM_HAS_EXEC_BATCH_FIRST 38
> +#define I915_EXEC_BATCH_FIRST (1<<16)
> +
> +#define DBG_NO_FAST_RELOC 0
> +#define DBG_NO_HANDLE_LUT 0
> +#define DBG_NO_BATCH_FIRST 0
> +#define DBG_NO_SOFTPIN 0
> +
> +#define PERF_IDLE 0 /* ring mask */
> +
> +inline static void list_move(struct list_head *from, struct list_head *to)
> +{
> +   list_del(from);
> +   list_add(from, to);
> +}
> +
> +inline static void list_move_tail(struct list_head *from, struct list_head *to)
> +{
> +   list_del(from);
> +   list_addtail(from, to);
> +}

These should go into src/util/list.h in a separate patch.

> diff --git a/src/mesa/drivers/dri/i965/brw_batch.h b/src/mesa/drivers/dri/i965/brw_batch.h
> new file mode 100644
> index 0000000..0b5468b
> --- /dev/null
> +++ b/src/mesa/drivers/dri/i965/brw_batch.h
> @@ -0,0 +1,377 @@
> +#ifndef BRW_BATCH_H
> +#define BRW_BATCH_H
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <setjmp.h>
> +#include <assert.h>
> +
> +#include <intel_aub.h>
> +#include <intel_bufmgr.h>
> +
> +#include "util/list.h"
> +
> +#define HAS_GCC(major, minor) defined(__GNUC__) && (__GNUC__ > (major) || __GNUC__ == (major) && __GNUC_MINOR__ >= (minor))
> +
> +#if HAS_GCC(3, 4)
> +#define must_check  __attribute__((warn_unused_result))
> +#else
> +#define must_check
> +#endif

I've sent out a patch that adds MUST_CHECK to Mesa as a whole, with the
appropriate configure.ac magic, so you should be able to just use that.

> diff --git a/src/mesa/drivers/dri/i965/brw_blorp.cpp b/src/mesa/drivers/dri/i965/brw_blorp.cpp
> index 2ccfae1..e1a9f56 100644
> --- a/src/mesa/drivers/dri/i965/brw_blorp.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_blorp.cpp
> @@ -22,7 +22,6 @@
>   */
>  
>  #include <errno.h>
> -#include "intel_batchbuffer.h"
>  #include "intel_fbo.h"
>  
>  #include "brw_blorp.h"
> @@ -211,7 +210,9 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
>  {
>     struct gl_context *ctx = &brw->ctx;
>     uint32_t estimated_max_batch_usage = 1500;
> -   bool check_aperture_failed_once = false;
> +
> +   if (brw_batch_begin(&brw->batch, estimated_max_batch_usage, RENDER_RING) < 0)
> +      return;
>  
>     /* Flush the sampler and render caches.  We definitely need to flush the
>      * sampler cache so that we get updated contents from the render cache for
> @@ -222,13 +223,6 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
>      */
>     brw_emit_mi_flush(brw);
>  
> -retry:
> -   intel_batchbuffer_require_space(brw, estimated_max_batch_usage, RENDER_RING);
> -   intel_batchbuffer_save_state(brw);
> -   drm_intel_bo *saved_bo = brw->batch.bo;
> -   uint32_t saved_used = brw->batch.used;
> -   uint32_t saved_state_batch_offset = brw->batch.state_batch_offset;
> -
>     switch (brw->gen) {
>     case 6:
>        gen6_blorp_exec(brw, params);
> @@ -241,37 +235,18 @@ retry:
>        unreachable("not reached");
>     }
>  
> -   /* Make sure we didn't wrap the batch unintentionally, and make sure we
> -    * reserved enough space that a wrap will never happen.
> -    */
> -   assert(brw->batch.bo == saved_bo);
> -   assert((brw->batch.used - saved_used) * 4 +
> -          (saved_state_batch_offset - brw->batch.state_batch_offset) <
> -          estimated_max_batch_usage);
> -   /* Shut up compiler warnings on release build */
> -   (void)saved_bo;
> -   (void)saved_used;
> -   (void)saved_state_batch_offset;
> +   brw_emit_mi_flush(brw);
>  
>     /* Check if the blorp op we just did would make our batch likely to fail to
>      * map all the BOs into the GPU at batch exec time later.  If so, flush the
>      * batch and try again with nothing else in the batch.
>      */
> -   if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
> -      if (!check_aperture_failed_once) {
> -         check_aperture_failed_once = true;
> -         intel_batchbuffer_reset_to_saved(brw);
> -         intel_batchbuffer_flush(brw);
> -         goto retry;
> -      } else {
> -         int ret = intel_batchbuffer_flush(brw);
> -         WARN_ONCE(ret == -ENOSPC,
> -                   "i965: blorp emit exceeded available aperture space\n");
> -      }
> +   if (brw_batch_end(&brw->batch)) {
> +      WARN_ONCE(1, "i965: blorp emit exceeded available aperture space\n");
> +      return;
>     }
>  
> -   if (unlikely(brw->always_flush_batch))
> -      intel_batchbuffer_flush(brw);
> +   brw_batch_maybe_flush(&brw->batch);
>  
>     /* We've smashed all state compared to what the normal 3D pipeline
>      * rendering tracks for GL.
> @@ -279,11 +254,6 @@ retry:
>     brw->ctx.NewDriverState = ~0ull;
>     brw->no_depth_or_stencil = false;
>     brw->ib.type = -1;
> -
> -   /* Flush the sampler cache so any texturing from the destination is
> -    * coherent.
> -    */
> -   brw_emit_mi_flush(brw);

It looks like you moved this flush earlier, so it's in the section of
code that retries?  That's probably reasonable.  Seems worth keeping the
comment, and this could be done as a separate patch...

[snip]
> +static void
> +load_sized_register_mem(struct brw_context *brw,
> +                        uint32_t reg,
> +                        struct brw_bo *bo,
> +                        uint32_t read_domains, uint32_t write_domain,
> +                        uint32_t offset,
> +                        int size)
> +{
> +   int i;
> +
> +   /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */
> +   assert(brw->gen >= 7);
> +
> +   if (brw->gen >= 8) {
> +      BEGIN_BATCH(4 * size);
> +      for (i = 0; i < size; i++) {
> +         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2));
> +         OUT_BATCH(reg + i * 4);
> +         OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4);
> +      }
> +      ADVANCE_BATCH();
> +   } else {
> +      BEGIN_BATCH(3 * size);
> +      for (i = 0; i < size; i++) {
> +         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (3 - 2));
> +         OUT_BATCH(reg + i * 4);
> +         OUT_RELOC(bo, read_domains, write_domain, offset + i * 4);
> +      }
> +      ADVANCE_BATCH();
> +   }
> +}
> +
> +void
> +brw_load_register_mem(struct brw_context *brw,
> +                      uint32_t reg,
> +                      struct brw_bo *bo,
> +                      uint32_t read_domains, uint32_t write_domain,
> +                      uint32_t offset)
> +{
> +   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 1);
> +}
> +
> +void
> +brw_load_register_mem64(struct brw_context *brw,
> +                        uint32_t reg,
> +                        struct brw_bo *bo,
> +                        uint32_t read_domains, uint32_t write_domain,
> +                        uint32_t offset)
> +{
> +   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 2);
> +}

Maybe just make these static inlines in brw_state.h or somewhere?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20150706/90e5d605/attachment-0001.sig>