[Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

Chris Wilson chris at chris-wilson.co.uk
Mon Jul 6 03:33:09 PDT 2015


When submitting commands to the GPU, every cycle of latency counts;
mutexes, spinlocks, even atomics quickly add up to a substantial overhead.

This "batch manager" acts as thread-local shim over the buffer manager
(drm_intel_bufmgr_gem). As we are only ever used from within a single
context, we can rely on the upper layers providing thread safety.
This allows us to import buffers from the shared screen (sharing buffers
between multiple contexts, threads and users) and wrap that handle in
our own. Similarly, we want to share the buffer cache between all
users on the file and so allocate from the global threadsafe buffer
manager, with a very small and transient local cache of active buffers.

The batch manager provides cheap busyness tracking and very
efficient batch construction and kernel submission.

The restrictions over and above the generic submission engine in
intel_bufmgr_gem are:
     - not thread-safe
     - flat relocations, only the batch buffer itself carries
       relocations. Relocations relative to auxiliary buffers
       must be performed via STATE_BASE
     - direct mapping of the batch for writes, expect reads
       from the batch to be slow
     - the batch is a fixed 64KiB in size
     - access to the batch must be wrapped by brw_batch_begin/_end
       (see the sketch below)
     - all relocations must be immediately written into the batch
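
To illustrate the begin/end rule, a state emission ends up bracketed
like this (a minimal sketch; the exact signatures and return conventions
shown here are assumptions, not the final API):

    if (brw_batch_begin(&brw->batch, estimated_bytes, RENDER_RING) < 0)
       return; /* allocation failure or lost context, drop the update */

    /* ... emit commands, writing each relocation immediately ... */

    if (brw_batch_end(&brw->batch))
       ; /* the implicit flush failed, e.g. -ENOSPC */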

The importance of the flat relocation tree with local offset handling is
that it allows us to use the "relocation-less" execbuffer interfaces,
dramatically reducing the overhead of batch submission. However, that
can be relaxed to allow buffers other than the batch buffer to carry
relocations, if need be.
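
The invariant that makes the relocation-less path work is that the
address written into the batch always matches the presumed_offset we
pass to the kernel for that buffer. A sketch in terms of this patch's
__brw_batch_reloc() (hypothetical caller, for illustration only):

    uint32_t *out = batch->map + batch->emit.nbatch;
    *out = __brw_batch_reloc(batch, 4*batch->emit.nbatch,
                             target_bo, delta,
                             I915_GEM_DOMAIN_RENDER, 0);
    batch->emit.nbatch++;

If execbuf then finds every buffer at its presumed offset, the kernel's
I915_EXEC_NO_RELOC fast path skips relocation processing entirely.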

ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation
speedups.

ISSUES:
* shared mipmap trees
  - we instantiate a context local copy on use, but what are the semantics for
    serializing read/writes between them - do we need automagic flushing of
    execution on other contexts and common busyness tracking?
  - we retain references to the bo past the lifetime of its parent
    batchmgr as the mipmap_tree is retained past the lifetime of its
    original context, see glx_arb_create_context/default_major_version
* OglMultithread is nevertheless unhappy; but that looks like undefined
  behaviour - i.e. a buggy client concurrently executing the same GL
  context in multiple threads; the unpatched driver is equally buggy.
* Add full-ppgtt softpinning support (no more relocations, at least for
  the first 256TiB), at the moment there is a limited proof-of-principle
  demonstration
* polish and move to libdrm; though at the cost of sealing the structs?

Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
Cc: Kristian Høgsberg <krh at bitplanet.net>
Cc: Kenneth Graunke <kenneth at whitecape.org>
Cc: Jesse Barnes <jbarnes at virtuousgeek.org>
Cc: Ian Romanick <ian.d.romanick at intel.com>
Cc: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen at intel.com>
Cc: Martin Peres <martin.peres at linux.intel.com>
---
 src/mesa/drivers/dri/i965/Makefile.sources         |    4 +-
 src/mesa/drivers/dri/i965/brw_batch.c              | 1946 ++++++++++++++++++++
 src/mesa/drivers/dri/i965/brw_batch.h              |  377 ++++
 src/mesa/drivers/dri/i965/brw_binding_tables.c     |    1 -
 src/mesa/drivers/dri/i965/brw_blorp.cpp            |   46 +-
 src/mesa/drivers/dri/i965/brw_cc.c                 |   16 +-
 src/mesa/drivers/dri/i965/brw_clear.c              |    1 -
 src/mesa/drivers/dri/i965/brw_clip.c               |    2 -
 src/mesa/drivers/dri/i965/brw_clip_line.c          |    2 -
 src/mesa/drivers/dri/i965/brw_clip_point.c         |    2 -
 src/mesa/drivers/dri/i965/brw_clip_state.c         |   14 +-
 src/mesa/drivers/dri/i965/brw_clip_tri.c           |    2 -
 src/mesa/drivers/dri/i965/brw_clip_unfilled.c      |    2 -
 src/mesa/drivers/dri/i965/brw_clip_util.c          |    2 -
 src/mesa/drivers/dri/i965/brw_compute.c            |   42 +-
 src/mesa/drivers/dri/i965/brw_conditional_render.c |    2 +-
 src/mesa/drivers/dri/i965/brw_context.c            |  233 ++-
 src/mesa/drivers/dri/i965/brw_context.h            |  144 +-
 src/mesa/drivers/dri/i965/brw_cs.cpp               |    6 +-
 src/mesa/drivers/dri/i965/brw_curbe.c              |    1 -
 src/mesa/drivers/dri/i965/brw_draw.c               |  103 +-
 src/mesa/drivers/dri/i965/brw_draw_upload.c        |   23 +-
 src/mesa/drivers/dri/i965/brw_ff_gs.c              |    2 -
 src/mesa/drivers/dri/i965/brw_ff_gs_emit.c         |    1 -
 src/mesa/drivers/dri/i965/brw_fs.cpp               |    5 +-
 src/mesa/drivers/dri/i965/brw_meta_fast_clear.c    |   11 +-
 src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c  |    1 -
 src/mesa/drivers/dri/i965/brw_meta_updownsample.c  |    1 -
 src/mesa/drivers/dri/i965/brw_misc_state.c         |   10 +-
 src/mesa/drivers/dri/i965/brw_object_purgeable.c   |    8 +-
 .../drivers/dri/i965/brw_performance_monitor.c     |   88 +-
 src/mesa/drivers/dri/i965/brw_pipe_control.c       |   24 +-
 src/mesa/drivers/dri/i965/brw_primitive_restart.c  |    2 -
 src/mesa/drivers/dri/i965/brw_program.c            |   23 +-
 src/mesa/drivers/dri/i965/brw_queryobj.c           |   68 +-
 src/mesa/drivers/dri/i965/brw_reset.c              |   14 +-
 src/mesa/drivers/dri/i965/brw_sampler_state.c      |    8 +-
 src/mesa/drivers/dri/i965/brw_sf.c                 |    2 -
 src/mesa/drivers/dri/i965/brw_sf_emit.c            |    2 -
 src/mesa/drivers/dri/i965/brw_sf_state.c           |   21 +-
 src/mesa/drivers/dri/i965/brw_state.h              |    2 +-
 src/mesa/drivers/dri/i965/brw_state_batch.c        |   41 +-
 src/mesa/drivers/dri/i965/brw_state_cache.c        |   70 +-
 src/mesa/drivers/dri/i965/brw_state_dump.c         |   77 +-
 src/mesa/drivers/dri/i965/brw_state_upload.c       |   16 +-
 src/mesa/drivers/dri/i965/brw_structs.h            |   33 +-
 src/mesa/drivers/dri/i965/brw_urb.c                |    9 +-
 src/mesa/drivers/dri/i965/brw_vec4.cpp             |    5 +-
 src/mesa/drivers/dri/i965/brw_vs_state.c           |   33 +-
 src/mesa/drivers/dri/i965/brw_vs_surface_state.c   |    4 +-
 src/mesa/drivers/dri/i965/brw_wm_state.c           |   38 +-
 src/mesa/drivers/dri/i965/brw_wm_surface_state.c   |   76 +-
 src/mesa/drivers/dri/i965/gen6_blorp.cpp           |   17 +-
 src/mesa/drivers/dri/i965/gen6_cc.c                |    1 -
 src/mesa/drivers/dri/i965/gen6_clip_state.c        |    1 -
 src/mesa/drivers/dri/i965/gen6_depth_state.c       |    1 -
 src/mesa/drivers/dri/i965/gen6_depthstencil.c      |    1 -
 src/mesa/drivers/dri/i965/gen6_gs_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen6_multisample_state.c |    1 -
 src/mesa/drivers/dri/i965/gen6_queryobj.c          |   56 +-
 src/mesa/drivers/dri/i965/gen6_sampler_state.c     |    1 -
 src/mesa/drivers/dri/i965/gen6_scissor_state.c     |    1 -
 src/mesa/drivers/dri/i965/gen6_sf_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen6_sol.c               |    9 +-
 src/mesa/drivers/dri/i965/gen6_surface_state.c     |   13 +-
 src/mesa/drivers/dri/i965/gen6_urb.c               |    1 -
 src/mesa/drivers/dri/i965/gen6_viewport_state.c    |    1 -
 src/mesa/drivers/dri/i965/gen6_vs_state.c          |    2 +-
 src/mesa/drivers/dri/i965/gen6_wm_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen7_blorp.cpp           |   16 +-
 src/mesa/drivers/dri/i965/gen7_disable.c           |    1 -
 src/mesa/drivers/dri/i965/gen7_gs_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen7_misc_state.c        |    3 +-
 src/mesa/drivers/dri/i965/gen7_sf_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen7_sol_state.c         |   49 +-
 src/mesa/drivers/dri/i965/gen7_urb.c               |    1 -
 src/mesa/drivers/dri/i965/gen7_viewport_state.c    |    1 -
 src/mesa/drivers/dri/i965/gen7_vs_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen7_wm_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen7_wm_surface_state.c  |   60 +-
 src/mesa/drivers/dri/i965/gen8_blend_state.c       |    1 -
 src/mesa/drivers/dri/i965/gen8_depth_state.c       |   16 +-
 src/mesa/drivers/dri/i965/gen8_disable.c           |    1 -
 src/mesa/drivers/dri/i965/gen8_draw_upload.c       |    1 -
 src/mesa/drivers/dri/i965/gen8_gs_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen8_misc_state.c        |    1 -
 src/mesa/drivers/dri/i965/gen8_multisample_state.c |    1 -
 src/mesa/drivers/dri/i965/gen8_ps_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen8_sf_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen8_sol_state.c         |    3 +-
 src/mesa/drivers/dri/i965/gen8_surface_state.c     |   73 +-
 src/mesa/drivers/dri/i965/gen8_viewport_state.c    |    1 -
 src/mesa/drivers/dri/i965/gen8_vs_state.c          |    1 -
 src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c  |    1 -
 src/mesa/drivers/dri/i965/intel_batchbuffer.c      |  480 -----
 src/mesa/drivers/dri/i965/intel_batchbuffer.h      |  179 --
 src/mesa/drivers/dri/i965/intel_blit.c             |   68 +-
 src/mesa/drivers/dri/i965/intel_blit.h             |   10 +-
 src/mesa/drivers/dri/i965/intel_buffer_objects.c   |  222 +--
 src/mesa/drivers/dri/i965/intel_buffer_objects.h   |   18 +-
 src/mesa/drivers/dri/i965/intel_debug.c            |    6 -
 src/mesa/drivers/dri/i965/intel_extensions.c       |   48 +-
 src/mesa/drivers/dri/i965/intel_fbo.c              |   46 +-
 src/mesa/drivers/dri/i965/intel_fbo.h              |    4 -
 src/mesa/drivers/dri/i965/intel_image.h            |    6 +-
 src/mesa/drivers/dri/i965/intel_mipmap_tree.c      |   98 +-
 src/mesa/drivers/dri/i965/intel_mipmap_tree.h      |   11 +-
 src/mesa/drivers/dri/i965/intel_pixel_bitmap.c     |    3 +-
 src/mesa/drivers/dri/i965/intel_pixel_copy.c       |    3 -
 src/mesa/drivers/dri/i965/intel_pixel_draw.c       |    2 +-
 src/mesa/drivers/dri/i965/intel_pixel_read.c       |   28 +-
 src/mesa/drivers/dri/i965/intel_screen.c           |   68 +-
 src/mesa/drivers/dri/i965/intel_screen.h           |   16 +-
 src/mesa/drivers/dri/i965/intel_syncobj.c          |   86 +-
 src/mesa/drivers/dri/i965/intel_tex.c              |    6 +-
 src/mesa/drivers/dri/i965/intel_tex_image.c        |   35 +-
 src/mesa/drivers/dri/i965/intel_tex_subimage.c     |   33 +-
 src/mesa/drivers/dri/i965/intel_tiled_memcpy.c     |   14 +-
 src/mesa/drivers/dri/i965/intel_tiled_memcpy.h     |    4 +-
 src/mesa/drivers/dri/i965/intel_upload.c           |   33 +-
 120 files changed, 3341 insertions(+), 2199 deletions(-)
 create mode 100644 src/mesa/drivers/dri/i965/brw_batch.c
 create mode 100644 src/mesa/drivers/dri/i965/brw_batch.h
 delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.c
 delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.h

diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
index 5a33aac..92e2ad0 100644
--- a/src/mesa/drivers/dri/i965/Makefile.sources
+++ b/src/mesa/drivers/dri/i965/Makefile.sources
@@ -1,4 +1,6 @@
 i965_FILES = \
+	brw_batch.c \
+	brw_batch.h \
 	brw_binding_tables.c \
 	brw_blorp_blit.cpp \
 	brw_blorp_blit_eu.cpp \
@@ -185,8 +187,6 @@ i965_FILES = \
 	gen8_wm_depth_stencil.c \
 	intel_asm_annotation.c \
 	intel_asm_annotation.h \
-	intel_batchbuffer.c \
-	intel_batchbuffer.h \
 	intel_blit.c \
 	intel_blit.h \
 	intel_buffer_objects.c \
diff --git a/src/mesa/drivers/dri/i965/brw_batch.c b/src/mesa/drivers/dri/i965/brw_batch.c
new file mode 100644
index 0000000..24e96c6
--- /dev/null
+++ b/src/mesa/drivers/dri/i965/brw_batch.c
@@ -0,0 +1,1946 @@
+/*
+ * Copyright (c) 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Authors:
+ *    Chris Wilson <chris at chris-wilson.co.uk>
+ *
+ */
+#include "brw_batch.h"
+#include "brw_context.h" /* XXX brw_finish_batch() */
+
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <stdlib.h>
+#include <setjmp.h>
+
+#include <intel_bufmgr.h>
+#include <i915_drm.h>
+#include <xf86drm.h>
+#include <errno.h>
+
+#include "intel_screen.h"
+
+/*
+ * When submitting commands to the GPU, every cycle of latency counts;
+ * mutexes, spinlocks, even atomics quickly add up to a substantial overhead.
+ *
+ * This "batch manager" acts as thread-local shim over the buffer manager
+ * (drm_intel_bufmgr_gem). As we are only ever used from within a single
+ * context, we can rely on the upper layers providing thread safety. This
+ * allows us to import buffers from the shared screen (sharing buffers
+ * between multiple contexts, threads and users) and wrap that handle in
+ * our own. Similarly, we want to share the buffer cache between all users
+ * on the file and so allocate from the global threadsafe buffer manager,
+ * with a very small and transient local cache of active buffers.
+ *
+ * The batch manager provides cheap busyness tracking and very
+ * efficient batch construction and kernel submission.
+ *
+ * The restrictions over and above the generic submission engine in
+ * intel_bufmgr_gem are:
+ * 	- not thread-safe
+ * 	- flat relocations, only the batch buffer itself carries
+ * 	  relocations. Relocations relative to auxiliary buffers
+ * 	  must be performed via STATE_BASE
+ * 	- direct mapping of the batch for writes, expect reads
+ * 	  from the batch to be slow
+ * 	- the batch is a fixed 64KiB in size
+ * 	- access to the batch must be wrapped by brw_batch_begin/_end
+ * 	- all relocations must be immediately written into the batch
+ */
+
+/**
+ * Number of bytes to reserve for commands necessary to complete a batch.
+ *
+ * This includes:
+ * - MI_BATCHBUFFER_END (4 bytes)
+ * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
+ * - Any state emitted by vtbl->finish_batch():
+ *   - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
+ *   - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
+ *   - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
+ *     - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
+ *       which are 4 DWords each ==> 2 * 3 * 4 * 4 = 96 bytes
+ *     - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+.  ==> 12 bytes.
+ *       On Ironlake, it's 6 DWords, but we have some slack due to the lack of
+ *       Sandybridge PIPE_CONTROL madness.
+ *
+ * Total: 140 bytes
+ */
+#define BATCH_RESERVED 140
+
+/* Surface offsets are limited to a maximum of 64k from the surface base */
+#define BATCH_SIZE (64 << 10)
+
+/* XXX Temporary home until kernel patches land */
+#define I915_PARAM_HAS_EXEC_SOFTPIN 37
+#define EXEC_OBJECT_PINNED	(1<<4)
+#define I915_PARAM_HAS_EXEC_BATCH_FIRST 38
+#define I915_EXEC_BATCH_FIRST (1<<16)
+
+#define DBG_NO_FAST_RELOC 0
+#define DBG_NO_HANDLE_LUT 0
+#define DBG_NO_BATCH_FIRST 0
+#define DBG_NO_SOFTPIN 0
+
+#define PERF_IDLE 0 /* ring mask */
+
+inline static void list_move(struct list_head *from, struct list_head *to)
+{
+   list_del(from);
+   list_add(from, to);
+}
+
+inline static void list_move_tail(struct list_head *from, struct list_head *to)
+{
+   list_del(from);
+   list_addtail(from, to);
+}
+
+static const unsigned hw_ring[] = {
+   [RENDER_RING] = I915_EXEC_RENDER,
+   [BLT_RING] = I915_EXEC_BLT,
+};
+
+/*
+ * The struct brw_request is central to efficiently tracking GPU activity,
+ * and the busyness of all buffers. It serves as both a read and a write
+ * fence on the buffers (and as the external GL fence). This is done by
+ * associating each relocation (i.e. every use of a buffer by a GPU within
+ * a batch) with the request as a read fence (for a read-only relocation)
+ * or as both the read/write fences (for a writeable relocation).
+ *
+ * Then if we ever need to query whether a particular buffer is active,
+ * we can look at the appropriate fence and see whether it has expired.
+ * If not, we can then ask the kernel if it has just retired and report back.
+ * If the request is still under construction and has not been submitted,
+ * we have that information immediately available and can report busyness
+ * without having to search.
+ *
+ * Periodically (after every new request) we poll for request completion,
+ * asking if the oldest is complete. This allows us to then maintain the
+ * busyness state of all buffers without having to query every buffer
+ * every time.
+ *
+ * After certain events (such as mapping or waiting on a buffer), we know that
+ * the buffer is idle, and so are the associated fence and all older fences.
+ *
+ * A nice side-effect of tracking requests and buffer busyness is that we
+ * can also track a reasonable measure of how much of the aperture is filled
+ * by active buffers (a resident set size). This is useful for predicting
+ * when the kernel will start evicting our buffers, for example.
+ */
+struct brw_request {
+   struct brw_bo *bo;
+   struct list_head link, write, read, fence;
+};
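+
+/* The ring id is packed into the two low bits of the rq->bo pointer
+ * (the allocation is comfortably more than 4-byte aligned), letting each
+ * request remember which ring it was submitted on at no extra cost;
+ * RQ_BO() strips the tag again and RQ_RING() extracts it.
+ */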
+#define RQ_MARK_RING(rq, ring) ((struct brw_bo *)((uintptr_t)((rq)->bo) | (ring)))
+#define RQ_BO(rq) ((struct brw_bo *)((uintptr_t)(rq)->bo & ~3))
+#define RQ_RING(rq) (((unsigned)(uintptr_t)(rq)->bo & 3))
+
+static bool __brw_bo_busy(struct brw_bo *bo)
+{
+   struct drm_i915_gem_busy busy;
+
+   memset(&busy, 0, sizeof(busy));
+   busy.handle = bo->handle;
+   busy.busy = ~0;
+   drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_BUSY, &busy);
+   /* If an error occurs here, it can only be due to flushing the
+    * buffer on the hardware i.e. the buffer itself is still busy.
+    * Possible errors are:
+    * 	-ENOENT: the buffer didn't exist, impossible!
+    * 	-ENOMEM: the kernel failed to flush due to allocation failures
+    * 	         scary, but the buffer is busy.
+    * 	-EIO:    the kernel should have marked the buffer as idle during
+    * 	         the reset; if it hasn't by now, it never will, and the
+    * 	         buffer itself will never become idle.
+    * 	(-EINTR, -EAGAIN eaten by drmIoctl()).
+    */
+   return busy.busy;
+}
+
+/*
+ * Retire this and all older requests.
+ */
+static void __brw_request_retire(struct brw_request *rq)
+{
+   struct brw_batch *batch = RQ_BO(rq)->batch;
+   struct list_head *list = &batch->requests[RQ_RING(rq)];
+   struct brw_request *tmp;
+
+   assert(!__brw_bo_busy(RQ_BO(rq)) || batch->fini);
+
+   if (PERF_IDLE & (1 << RQ_RING(rq)) && rq->link.next == list)
+      batch->idle_time[RQ_RING(rq)] = -get_time();
+
+   do {
+      tmp = list_last_entry(list, struct brw_request, link);
+      assert(!__brw_bo_busy(RQ_BO(tmp)) || batch->fini);
+      list_del(&tmp->link);
+      assert(RQ_BO(tmp)->exec == NULL);
+
+      while (!list_empty(&tmp->write)) {
+         struct brw_bo *bo;
+
+         bo = list_first_entry(&tmp->write,
+                               struct brw_bo,
+                               write.link);
+
+         assert(bo->write.rq == tmp);
+         assert(bo->read.rq);
+
+         list_delinit(&bo->write.link);
+         bo->write.rq = NULL;
+      }
+      assert(RQ_BO(tmp)->write.rq == NULL);
+
+      while (!list_empty(&tmp->read)) {
+         struct brw_bo *bo;
+
+         bo = list_first_entry(&tmp->read,
+                               struct brw_bo,
+                               read.link);
+         assert(bo->exec == NULL);
+         assert(bo->read.rq == tmp);
+         if (bo->write.rq) {
+            assert(RQ_RING(bo->write.rq) != RQ_RING(rq));
+            __brw_request_retire(bo->write.rq);
+         }
+         assert(bo->write.rq == NULL);
+
+         list_delinit(&bo->read.link);
+         bo->read.rq = NULL;
+
+         assert(batch->rss >= bo->size);
+         batch->rss -= bo->size;
+
+         if (bo->reusable)
+            list_move(&bo->link, &batch->inactive);
+
+         if (!bo->refcnt)
+            __brw_bo_free(bo);
+      }
+      assert(RQ_BO(tmp)->read.rq == NULL);
+
+      while (!list_empty(&tmp->fence)) {
+         struct brw_fence *fence;
+
+         fence = list_first_entry(&tmp->fence,
+                                  struct brw_fence,
+                                  link);
+         assert(fence->rq == tmp);
+         list_del(&fence->link);
+         fence->rq = NULL;
+
+         if (fence->signal)
+            fence->signal(fence);
+      }
+
+      if (tmp == batch->throttle)
+         batch->throttle = NULL;
+
+      tmp->bo = RQ_BO(tmp); /* strip off the ring id */
+      tmp->link.next = (struct list_head *)batch->freed_rq;
+      batch->freed_rq = tmp;
+   } while (tmp != rq);
+}
+
+/*
+ * Is the request busy? First we check whether the request has already
+ * been retired (idle), or whether it is still under construction (busy).
+ * Failing that, to the best of our knowledge it is still being processed
+ * by the GPU, so we must ask the kernel whether the request is now idle.
+ * If we find it is idle, we then know that it and all older requests
+ * are also idle.
+ */
+bool __brw_request_busy(struct brw_request *rq, unsigned flags)
+{
+   struct brw_bo *bo;
+   if (rq == NULL)
+      return false;
+
+   bo = RQ_BO(rq);
+   if (bo->read.rq == NULL)
+      return false;
+
+   assert(bo->read.rq == rq);
+
+   if (bo->exec != NULL) {
+      if (flags & BUSY_FLUSH)
+         brw_batch_flush(bo->batch);
+      return true;
+   }
+
+   if (__brw_bo_busy(bo))
+      return true;
+
+   __brw_request_retire(rq);
+   return false;
+}
+
+/*
+ * Update the cache domain tracked by the kernel. This can have a number
+ * of side-effects but is essential for maintaining coherency and
+ * serialisation between the GPU and CPU. If there is conflicting GPU access
+ * then set-domain will wait until the GPU has finished accessing the buffer
+ * before proceeding to change the domain. If the buffer is not cache coherent
+ * and we request CPU access, the kernel will clflush that buffer to make it
+ * coherent with the CPU access. Both of these imply delays and overhead, so
+ * we do our best to avoid moving buffers to the GTT/CPU domains. However,
+ * if we do, we know the buffer and its request are idle, so we can update
+ * our request tracking after the blocking call.
+ */
+static void __brw_bo_set_domain(struct brw_bo *bo, unsigned domain, bool write)
+{
+   struct drm_i915_gem_set_domain set_domain;
+   struct brw_request *rq;
+
+   if (bo->exec) /* flush failed, pretend we are ASYNC | INCOHERENT */
+      return;
+
+   memset(&set_domain, 0, sizeof(set_domain));
+   set_domain.handle = bo->handle;
+   set_domain.read_domains = domain;
+   if (write)
+      set_domain.write_domain = domain;
+
+   domain = domain == I915_GEM_DOMAIN_CPU ? DOMAIN_CPU : DOMAIN_GTT;
+   if (bo->domain == domain)
+      return;
+
+   if (drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &set_domain))
+      return;
+
+   rq = write ? bo->read.rq : bo->write.rq;
+   if (rq)
+      __brw_request_retire(rq);
+
+   bo->domain = write ? domain : DOMAIN_NONE;
+   assert(bo->refcnt);
+}
+
+/*
+ * Wait for the buffer to become completely idle, i.e. not being accessed by
+ * the GPU at all (neither outstanding reads nor writes).
+ * This is equivalent to setting the buffer write domain to GTT, but the
+ * wait ioctl avoids the set-domain side-effects (e.g. clflushing in
+ * some circumstances).
+ */
+static int __brw_bo_wait(struct brw_bo *bo, int64_t timeout)
+{
+   struct drm_i915_gem_wait wait;
+
+   assert(bo->exec == NULL);
+
+   if (!brw_bo_busy(bo, BUSY_WRITE | BUSY_RETIRE))
+      return 0;
+
+   memset(&wait, 0, sizeof(wait));
+   wait.bo_handle = bo->handle;
+   wait.timeout_ns = timeout;
+   wait.flags = 0;
+
+   if (drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_WAIT, &wait)) {
+      int ret = -errno;
+      if (timeout < 0) {
+         __brw_bo_set_domain(bo, I915_GEM_DOMAIN_GTT, true);
+         ret = 0;
+      }
+      return ret;
+   }
+
+   assert(bo->read.rq);
+   __brw_request_retire(bo->read.rq);
+   return 0;
+}
+
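+/* Multiplicative hash, keeping the top 'bits' bits of the scrambled
+ * handle as the bucket index for the borrowed-handle hashtable below.
+ */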
+static inline uint32_t hash_32(uint32_t hash, unsigned bits)
+{
+   return (hash * 0x9e37001) >> (32 - bits);
+}
+
+static inline struct list_head *borrowed(struct brw_batch *batch, uint32_t handle)
+{
+   return &batch->borrowed[hash_32(handle, BORROWED_BITS)];
+}
+
+/*
+ * We have context-local bo, but those may be shared between contexts via
+ * shared mipmaps and other buffers. If we find we are dealing with a bo
+ * belonging to another batch, we need to translate that into a local bo
+ * for associating with our fences.
+ */
+static struct brw_bo *__brw_batch_lookup_handle(struct brw_batch *batch,
+                                                uint32_t handle)
+{
+   /* XXX may need a resizable ht? */
+   struct list_head *hlist = borrowed(batch, handle);
+
+   list_for_each_entry(struct brw_bo, bo, hlist, link)
+      if (bo->handle == handle)
+         return bo;
+
+   return NULL;
+}
+
+inline static bool has_lut(struct brw_batch *batch)
+{
+   return batch->batch_base_flags & I915_EXEC_HANDLE_LUT;
+}
+
+/*
+ * Prepare the batch manager for constructing a new batch/request.
+ *
+ * Reset all the accounting we do per-batch, and allocate ourselves a new
+ * batch bo.
+ */
+static int __brw_batch_reset(struct brw_batch *batch)
+{
+   struct brw_request *rq;
+
+retry:
+   rq = batch->freed_rq;
+   if (unlikely(rq == NULL)) {
+      rq = malloc(sizeof(*rq));
+      if (unlikely(rq == NULL))
+         goto oom;
+
+      rq->bo = brw_bo_create(batch, "batch", BATCH_SIZE, 0, 0);
+      if (unlikely(rq->bo == NULL)) {
+         free(rq);
+         goto oom;
+      }
+      rq->bo->target_handle = -1;
+
+      /* We are inheriting a foreign buffer, so call set-domain */
+      brw_bo_map(rq->bo, MAP_WRITE);
+
+      list_inithead(&rq->read);
+      list_inithead(&rq->write);
+      list_inithead(&rq->fence);
+   } else {
+      assert(list_empty(&rq->read));
+      assert(list_empty(&rq->write));
+      assert(list_empty(&rq->fence));
+      batch->freed_rq = (struct brw_request *)rq->link.next;
+   }
+
+   assert(RQ_BO(rq) == rq->bo);
+   batch->map = brw_bo_map(rq->bo, MAP_WRITE | MAP_ASYNC);
+   if (unlikely(batch->map == NULL)) {
+      brw_bo_put(rq->bo);
+      free(rq);
+
+oom:
+      if (list_empty(&batch->requests[batch->ring])) {
+         batch->next_request = NULL;
+         return -ENOMEM;
+      }
+
+      /* force the synchronization to recover some memory */
+      rq = list_last_entry(&batch->requests[batch->ring],
+                           struct brw_request, link);
+      __brw_bo_wait(RQ_BO(rq), -1);
+      goto retry;
+   }
+
+   batch->bo = rq->bo;
+   memset(&batch->emit, 0, sizeof(batch->emit));
+   batch->next_request = rq;
+
+   batch->reserved = BATCH_RESERVED / 4;
+   batch->state = BATCH_SIZE / 4;
+   batch->aperture = 0;
+   batch->batch_flags = batch->batch_base_flags;
+
+   list_add(&rq->bo->read.link, &rq->read);
+   if (batch->batch_base_flags & I915_EXEC_BATCH_FIRST) {
+      rq->bo->target_handle =
+	 has_lut(batch) ? batch->emit.nexec : rq->bo->handle;
+      rq->bo->exec =
+	 memset(&batch->exec[batch->emit.nexec++], 0, sizeof(*rq->bo->exec));
+   } else
+      rq->bo->exec = (void *)1;
+   rq->bo->read.rq = rq;
+   batch->rss += BATCH_SIZE;
+   return 0;
+}
+
+static int gem_param(int fd, int name)
+{
+   drm_i915_getparam_t gp;
+   int v = -1; /* No param uses (yet) the sign bit, reserve it for errors */
+
+   memset(&gp, 0, sizeof(gp));
+   gp.param = name;
+   gp.value = &v;
+   if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
+      return -1;
+
+   return v;
+}
+
+static bool test_has_fast_reloc(int fd)
+{
+   if (DBG_NO_FAST_RELOC)
+      return DBG_NO_FAST_RELOC < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_NO_RELOC) > 0;
+}
+
+static bool test_has_handle_lut(int fd)
+{
+   if (DBG_NO_HANDLE_LUT)
+      return DBG_NO_HANDLE_LUT < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_HANDLE_LUT) > 0;
+}
+
+static bool test_has_batch_first(int fd)
+{
+   if (DBG_NO_BATCH_FIRST)
+      return DBG_NO_BATCH_FIRST < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_BATCH_FIRST) > 0;
+}
+
+static bool test_has_softpin(int fd)
+{
+   if (DBG_NO_SOFTPIN)
+      return DBG_NO_SOFTPIN < 0;
+
+   if (gem_param(fd, I915_PARAM_HAS_ALIASING_PPGTT) < 2)
+      return false;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_SOFTPIN) > 0;
+}
+
+static uint64_t __get_max_aperture(int fd)
+{
+	struct drm_i915_gem_get_aperture aperture;
+
+	if (gem_param(fd, I915_PARAM_HAS_ALIASING_PPGTT) > 2)
+		return (uint64_t)1 << 48;
+
+	memset(&aperture, 0, sizeof(aperture));
+	if (drmIoctl(fd, DRM_IOCTL_I915_GEM_GET_APERTURE, &aperture))
+		return 512 << 20; /* Minimum found on gen4+ */
+
+	return aperture.aper_size;
+}
+
+static uint64_t get_max_aperture(int fd)
+{
+	static uint64_t max_aperture;
+
+	if (max_aperture == 0)
+		max_aperture = __get_max_aperture(fd);
+
+	return max_aperture;
+}
+
+/*
+ * Initialise the batch-manager for the context.
+ *
+ * We use the devinfo and settings found in intel_screen to set ourselves up
+ * for the hardware environment, and supplement that with our own feature
+ * tests. (These too should probably move to intel_screen and shared between
+ * all contexts.)
+ */
+int brw_batch_init(struct brw_batch *batch,
+		   struct intel_screen *screen)
+{
+   const struct brw_device_info *devinfo;
+   int ret;
+   int n;
+
+   batch->fd = intel_screen_to_fd(screen);
+   batch->bufmgr = screen->bufmgr;
+   batch->screen = screen;
+
+   devinfo = screen->devinfo;
+
+   batch->no_hw = screen->no_hw;
+
+   batch->needs_pipecontrol_ggtt_wa = devinfo->gen == 6;
+   batch->reloc_size = 512;
+   batch->exec_size = 256;
+   batch->reloc = malloc(sizeof(batch->reloc[0])*batch->reloc_size);
+   batch->exec = malloc(sizeof(batch->exec[0])*batch->exec_size);
+   if (unlikely(batch->reloc == NULL || batch->exec == NULL)) {
+      ret = -ENOMEM;
+      goto err;
+   }
+
+   for (n = 0; n < 1 << BORROWED_BITS; n++)
+      list_inithead(&batch->borrowed[n]);
+   list_inithead(&batch->active);
+   list_inithead(&batch->inactive);
+   for (n = 0; n < __BRW_NUM_RINGS; n++)
+      list_inithead(&batch->requests[n]);
+
+   batch->actual_ring[RENDER_RING] = RENDER_RING;
+   batch->actual_ring[BLT_RING] = BLT_RING;
+   if (devinfo->gen < 6)
+      batch->actual_ring[BLT_RING] = RENDER_RING;
+
+   batch->has_mmap_wc = true; /* tested on first use */
+   batch->has_llc = devinfo->has_llc;
+   batch->has_softpin = test_has_softpin(batch->fd);
+   batch->max_aperture = 3*get_max_aperture(batch->fd)/4;
+
+   if (test_has_fast_reloc(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_NO_RELOC;
+   if (test_has_handle_lut(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_HANDLE_LUT;
+   if (test_has_batch_first(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_BATCH_FIRST;
+
+   if (devinfo->gen >= 6) {
+      /* Create a new hardware context.  Using a hardware context means that
+       * our GPU state will be saved/restored on context switch, allowing us
+       * to assume that the GPU is in the same state we left it in.
+       *
+       * This is required for transform feedback buffer offsets, query objects,
+       * and also allows us to reduce how much state we have to emit.
+       */
+      struct drm_i915_gem_context_create create;
+
+      memset(&create, 0, sizeof(create));
+      drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
+      batch->hw_ctx = create.ctx_id;
+      if (!batch->hw_ctx) {
+	 ret = -errno;
+	 fprintf(stderr, "Gen6+ requires Kernel 3.6 or later.\n");
+	 goto err;
+      }
+   }
+
+   if (unlikely(INTEL_DEBUG & DEBUG_AUB))
+      drm_intel_aub_enable(&batch->aub, batch->fd, devinfo->gen);
+
+   ret = __brw_batch_reset(batch);
+   if (ret)
+      goto err;
+
+   return 0;
+
+err:
+   drm_intel_aub_enable(&batch->aub, -1, 0);
+   free(batch->reloc);
+   free(batch->exec);
+   return ret;
+}
+
+/*
+ * Mark a bo as being written to by this batch.
+ *
+ * We frequently dirty a buffer and then emit a global cache flush
+ * cleaning all the dirty buffers within a batch. Afterwards, we may
+ * then write to the same buffer, but may not re-emit a relocation and
+ * so we need to notify that the buffer is now dirty again. Normally
+ * we can rely on the relocation marking the written buffers as dirty.
+ *
+ * All caches are flushed by the kernel between batches, so at the end
+ * of each batch we can mark all buffers as clean again. (Before we can
+ * access the buffer, either by the GPU in the next batch or by the CPU
+ * following a set-domain call, that access will be after the flush has
+ * finished.)
+ */
+void brw_bo_mark_dirty(struct brw_batch *batch, struct brw_bo *bo)
+{
+   if (unlikely(bo->batch != batch)) {
+      bo = __brw_batch_lookup_handle(batch, bo->handle);
+      assert(bo);
+   }
+   assert(bo->batch == batch);
+   assert(bo != bo->batch->bo);
+
+   /* We should only be called on objects already in the batch for writing */
+   assert(bo->exec);
+   assert(bo->read.rq == batch->next_request);
+   assert(bo->write.rq == batch->next_request);
+   assert(bo->domain == DOMAIN_GPU);
+
+   if (bo->dirty)
+      return;
+
+   list_move(&bo->write.link, &batch->next_request->write);
+   bo->dirty = true;
+}
+
+/*
+ * At the end of each batch and when explicitly flushing caches within
+ * a batch, we can mark all the buffers within that batch as now clean.
+ */
+void brw_batch_clear_dirty(struct brw_batch *batch)
+{
+   struct list_head *write = &batch->next_request->write;
+
+   list_for_each_entry(struct brw_bo, bo, write, write.link) {
+      if (!bo->dirty)
+         break;
+
+      bo->dirty = false;
+   }
+}
+
+static void __brw_batch_grow_exec(struct brw_batch *batch)
+{
+   struct drm_i915_gem_exec_object2 *new_exec;
+   uint16_t new_size;
+
+   new_size = batch->exec_size * 2;
+   new_exec = NULL;
+   if (likely(new_size > batch->exec_size))
+      new_exec = realloc(batch->exec, new_size*sizeof(new_exec[0]));
+   if (unlikely(new_exec == NULL))
+      longjmp(batch->jmpbuf, -ENOMEM);
+
+   if (new_exec != batch->exec) {
+      struct list_head *read = &batch->next_request->read;
+
+      list_for_each_entry(struct brw_bo, bo, read, read.link)
+         bo->exec = new_exec + (bo->exec - batch->exec);
+
+      batch->exec = new_exec;
+   }
+
+   batch->exec_size = new_size;
+}
+
+static void __brw_batch_grow_reloc(struct brw_batch *batch)
+{
+   struct drm_i915_gem_relocation_entry *new_reloc;
+   uint16_t new_size;
+
+   new_size = batch->reloc_size * 2;
+   new_reloc = NULL;
+   if (likely(new_size > batch->reloc_size))
+      new_reloc = realloc(batch->reloc, new_size*sizeof(new_reloc[0]));
+   if (unlikely(new_reloc == NULL))
+      longjmp(batch->jmpbuf, -ENOMEM);
+
+   batch->reloc = new_reloc;
+   batch->reloc_size = new_size;
+}
+
+/*
+ * Add a relocation entry for the target buffer into the current batch.
+ *
+ * This is the heart of performing fast relocations, both here and in
+ * the corresponding kernel relocation routines.
+ *
+ * - Instead of passing in handles for the kernel to convert back into
+ *   buffers for every relocation, we tell the kernel which
+ *   execobject slot corresponds with the relocation. The kernel is
+ *   able to use a simple LUT, constructed as it first looks up each buffer
+ *   for the batch, rather than searching a small, overfull hashtable. As both
+ *   the number of relocations and buffers in a batch grow, the simple
+ *   LUT is much more efficient (though the LUT itself is less cache
+ *   friendly).
+ *   However, as the batch buffer is by definition the last object in
+ *   the execbuffer array we have to perform a pass to relabel the
+ *   target of all relocations pointing to the batch. (Except when
+ *   the kernel supports batch-first, in which case we can do the relocation
+ *   target processing for the batch inline.)
+ *
+ * - If the kernel has not moved the buffer, it will still be in the same
+ *   location as last time we used it. If we tell the kernel that all the
+ *   relocation entries are the same as the offset for the buffer, then
+ *   the kernel need only check that all the buffers are still in the same
+ *   location and then skip performing relocations entirely. A huge win.
+ *
+ * - As a consequence of telling the kernel to skip processing the relocations,
+ *   we need to tell the kernel about the read/write domains and special needs
+ *   of the buffers.
+ *
+ * - Alternatively, we can request the kernel place the buffer exactly
+ *   where we want it and forgo all relocations to that buffer entirely.
+ *   The buffer is effectively pinned for its lifetime (if the kernel
+ *   does have to move it, for example to swap it out to recover memory,
+ *   the kernel will return it back to our requested location at the start
+ *   of the next batch.) This of course imposes a lot of constraints on where
+ *   we can say the buffers are, they must meet all the alignment constraints
+ *   and not overlap.
+ *
+ * - Essential to all these techniques is that we always use the same
+ *   presumed_offset for the relocations as for submitting the execobject.
+ *   That value must be written into the batch and it must match the value
+ *   we tell the kernel. (This breaks down when using relocation trees shared
+ *   between multiple contexts, hence the need for context-local batch
+ *   management.)
+ *
+ * In contrast to libdrm, we can build the execbuffer array along with
+ * the batch by forgoing the ability to handle general relocation trees.
+ * This avoids having multiple passes to build the execbuffer parameter,
+ * and also gives us a means to cheaply track when a buffer has been
+ * referenced by the batch.
+ */
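+/* Hypothetical caller, for illustration only: the value returned below is
+ * both written into the batch and recorded as the presumed_offset, e.g.
+ *
+ *    batch->map[n] = __brw_batch_reloc(batch, 4*n, bo, 0,
+ *                                      I915_GEM_DOMAIN_SAMPLER, 0);
+ *
+ * keeping the dword in the batch equal to the offset the kernel checks,
+ * which is what keeps the NO_RELOC fast path valid.
+ */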
+uint64_t __brw_batch_reloc(struct brw_batch *batch,
+                           uint32_t batch_offset,
+                           struct brw_bo *target_bo,
+                           uint64_t target_offset,
+                           unsigned read_domains,
+                           unsigned write_domain)
+{
+   assert(target_bo->refcnt);
+   if (unlikely(target_bo->batch != batch)) {
+      /* XXX legal sharing between contexts/threads? */
+      target_bo = brw_bo_import(batch, target_bo->base, true);
+      if (unlikely(target_bo == NULL))
+         longjmp(batch->jmpbuf, -ENOMEM);
+      target_bo->refcnt--; /* kept alive by the implicit active reference */
+   }
+   assert(target_bo->batch == batch);
+
+   if (target_bo->exec == NULL) {
+      int n;
+
+      /* reserve one exec entry for the batch */
+      if (unlikely(batch->emit.nexec + 1 == batch->exec_size))
+         __brw_batch_grow_exec(batch);
+
+      n = batch->emit.nexec++;
+      target_bo->target_handle = has_lut(batch) ? n : target_bo->handle;
+      target_bo->exec = memset(batch->exec + n, 0, sizeof(*target_bo->exec));
+      target_bo->exec->handle = target_bo->handle;
+      target_bo->exec->alignment = target_bo->alignment;
+      target_bo->exec->offset = target_bo->offset;
+      if (target_bo->pinned)
+         target_bo->exec->flags = EXEC_OBJECT_PINNED;
+
+      /* Track the total amount of memory in use by all active requests */
+      if (target_bo->read.rq == NULL) {
+	      batch->rss += target_bo->size;
+	      if (batch->rss > batch->peak_rss)
+		      batch->peak_rss = batch->rss;
+      }
+      target_bo->read.rq = batch->next_request;
+      list_move_tail(&target_bo->read.link, &batch->next_request->read);
+
+      batch->aperture += target_bo->size;
+   }
+
+   if (!target_bo->pinned) {
+      int n;
+
+      if (unlikely(batch->emit.nreloc == batch->reloc_size))
+         __brw_batch_grow_reloc(batch);
+
+      n = batch->emit.nreloc++;
+      batch->reloc[n].offset = batch_offset;
+      batch->reloc[n].delta = target_offset;
+      batch->reloc[n].target_handle = target_bo->target_handle;
+      batch->reloc[n].presumed_offset = target_bo->offset;
+      batch->reloc[n].read_domains = read_domains;
+      batch->reloc[n].write_domain = write_domain;
+
+      /* If we haven't added the batch to the execobject array yet, we
+       * will have to process all the relocations pointing to the
+       * batch when finalizing the request for submission.
+       */
+      if (target_bo->target_handle == -1) {
+         int m = batch->emit.nself++;
+         if (m < 256)
+            batch->self_reloc[m] = n;
+      }
+   }
+
+   if (write_domain && !target_bo->dirty) {
+      assert(target_bo != batch->bo);
+      target_bo->write.rq = batch->next_request;
+      list_move(&target_bo->write.link, &batch->next_request->write);
+      assert(target_bo->write.rq == target_bo->read.rq);
+      target_bo->dirty = true;
+      target_bo->domain = DOMAIN_GPU;
+      if (has_lut(batch)) {
+         target_bo->exec->flags |= EXEC_OBJECT_WRITE;
+         if (write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
+             batch->needs_pipecontrol_ggtt_wa)
+            target_bo->exec->flags |= EXEC_OBJECT_NEEDS_GTT;
+      }
+   }
+
+   return target_bo->offset + target_offset;
+}
+
+/*
+ * Close the batch by writing all the tail commands (to store register
+ * values between batches, disable profiling, etc). And then to end it all
+ * we set MI_BATCH_BUFFER_END.
+ */
+static uint32_t __brw_batch_finish(struct brw_batch *batch)
+{
+   batch->reserved = 0;
+
+   /* Catch any final allocation errors, rolling back is marginally safer */
+   batch->saved = batch->emit;
+   if (setjmp(batch->jmpbuf) == 0)
+	   brw_finish_batch(batch);
+   else
+	   batch->emit = batch->saved;
+
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
+      int bytes_for_commands = 4 * batch->emit.nbatch;
+      int bytes_for_state = batch->bo->size - 4*batch->state;
+      int total_bytes = bytes_for_commands + bytes_for_state;
+      fprintf(stderr, "Batchbuffer flush on ring %d with %4db (pkt) + "
+              "%4db (state) = %4db (%0.1f%%), with %d buffers and %d relocations [%d self], RSS %d KiB (cap %dKiB)\n",
+              batch->ring, bytes_for_commands, bytes_for_state,
+              total_bytes, 100.0f * total_bytes / BATCH_SIZE,
+              batch->emit.nexec, batch->emit.nreloc, batch->emit.nself,
+              (int)(batch->aperture>>10), (int)(batch->max_aperture>>10));
+   }
+
+   batch->map[batch->emit.nbatch] = 0xa << 23;
+   return 4*((batch->emit.nbatch + 2) & ~1);
+}
+
+static void
+__brw_batch_throttle(struct brw_batch *batch, struct brw_request *rq)
+{
+   /* Wait for the swapbuffers before the one we just emitted, so we
+    * don't get too many swaps outstanding for apps that are GPU-heavy
+    * but not CPU-heavy.
+    *
+    * We're using intelDRI2Flush (called from the loader before
+    * swapbuffer) and glFlush (for front buffer rendering) as the
+    * indicator that a frame is done and then throttle when we get
+    * here as we prepare to render the next frame.  At this point the
+    * round trips for swap/copy and getting new buffers are done and
+    * we'll spend less time waiting on the GPU.
+    *
+    * Unfortunately, we don't have a handle to the batch containing
+    * the swap, and getting our hands on that doesn't seem worth it,
+    * so we just use the first batch we emitted after the last swap.
+    */
+   if (batch->need_swap_throttle) {
+      if (batch->throttle && !batch->disable_throttling)
+         __brw_bo_wait(RQ_BO(batch->throttle), -1);
+      batch->throttle = rq;
+      batch->need_flush_throttle = false;
+      batch->need_swap_throttle = false;
+   }
+
+   if (batch->need_flush_throttle) {
+      drmCommandNone(batch->fd, DRM_I915_GEM_THROTTLE);
+      batch->need_flush_throttle = false;
+   }
+
+   if (unlikely(INTEL_DEBUG & DEBUG_SYNC)) {
+      fprintf(stderr, "waiting for idle\n");
+      __brw_bo_wait(RQ_BO(rq), -1);
+   }
+}
+
+/*
+ * If we added relocations pointing to the batch before we knew
+ * its final index (the kernel assumes that the batch is last unless
+ * told otherwise), then we have to go through all the relocations
+ * and point them back to the batch.
+ */
+static void __brw_batch_fixup_self_relocations(struct brw_batch *batch)
+{
+   uint32_t target = batch->bo->target_handle;
+   int n, count;
+
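+   /* The first 256 batch-relative relocations were cached by index in
+    * self_reloc[]; any overflow is found by scanning the remaining
+    * relocations for the -1 placeholder target_handle.
+    */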
+   count = MIN2(batch->emit.nself, 256);
+   for (n = 0; n < count; n++)
+      batch->reloc[batch->self_reloc[n]].target_handle = target;
+   if (n == 256) {
+      for (n = batch->self_reloc[255] + 1; n < batch->emit.nreloc; n++) {
+         if (batch->reloc[n].target_handle == -1)
+            batch->reloc[n].target_handle = target;
+      }
+   }
+}
+
+static void
+__brw_batch_dump(struct brw_batch *batch)
+{
+   struct drm_intel_decode *decode;
+
+   decode = drm_intel_decode_context_alloc(batch->screen->deviceID);
+   if (!decode)
+      return;
+
+   drm_intel_decode_set_batch_pointer(decode,
+                                      batch->map, batch->bo->offset,
+                                      batch->emit.nbatch + 1);
+
+   drm_intel_decode_set_output_file(decode, stderr);
+   drm_intel_decode(decode);
+
+   drm_intel_decode_context_free(decode);
+
+   brw_debug_batch(batch);
+}
+
+static void
+__brw_batch_aub(struct brw_batch *batch)
+{
+   drm_intel_bo **bos;
+   struct brw_request *rq = batch->next_request;
+   int count;
+
+   bos = malloc(sizeof(drm_intel_bo *) * batch->emit.nexec);
+   if (unlikely(bos == NULL))
+      return;
+
+   count = 0;
+   list_for_each_entry(struct brw_bo, bo, &rq->read, read.link)
+      bos[count++] = bo->base;
+   assert(count == batch->emit.nexec);
+
+   drm_intel_aub_exec(&batch->aub, batch->bo->base, bos, count,
+                      hw_ring[batch->ring], batch->emit.nbatch*4);
+
+   free(bos);
+}
+
+/*
+ * Check to see if the oldest requests have completed and retire them.
+ */
+static void __brw_batch_retire(struct brw_batch *batch)
+{
+   struct list_head *list = &batch->requests[batch->ring];
+
+   do {
+      struct brw_request *rq;
+
+      rq = list_last_entry(list, struct brw_request, link);
+      if (__brw_bo_busy(RQ_BO(rq)))
+	 break;
+
+      __brw_request_retire(rq);
+   } while (!list_empty(list));
+}
+
+/*
+ * Finalize the batch, submit it to hardware, and start a new batch/request.
+ */
+int brw_batch_flush(struct brw_batch *batch)
+{
+   struct drm_i915_gem_execbuffer2 execbuf;
+   struct drm_i915_gem_exec_object2 *exec;
+   struct brw_request *rq = batch->next_request;
+
+   if (unlikely(batch->emit.nbatch == 0))
+      return 0;
+
+   if (unlikely(rq == NULL))
+      return -ENOMEM;
+
+   if (unlikely(INTEL_DEBUG & DEBUG_AUB))
+      brw_annotate_batch(batch);
+
+   memset(&execbuf, 0, sizeof(execbuf));
+   execbuf.batch_len = __brw_batch_finish(batch);
+
+   assert(rq->bo == batch->bo);
+   assert(rq->bo->write.rq == NULL);
+   assert(rq->bo->read.rq == rq);
+   assert(rq->bo->exec != NULL);
+   assert(rq->bo->dirty);
+
+   /* This must be done after __brw_batch_finish(), as the callbacks may add relocs! */
+   if (rq->bo->target_handle == -1) {
+      rq->bo->target_handle =
+	 has_lut(batch) ? batch->emit.nexec : rq->bo->handle;
+      rq->bo->exec =
+	 memset(&batch->exec[batch->emit.nexec++], 0, sizeof(*exec));
+
+      __brw_batch_fixup_self_relocations(batch);
+   }
+
+   exec = rq->bo->exec;
+   exec->handle = rq->bo->handle;
+   exec->offset = rq->bo->offset;
+   exec->alignment = rq->bo->alignment;
+   exec->relocation_count = batch->emit.nreloc;
+   exec->relocs_ptr = (uintptr_t)batch->reloc;
+   if (rq->bo->pinned)
+      exec->flags |= EXEC_OBJECT_PINNED;
+   assert((exec->flags & EXEC_OBJECT_WRITE) == 0);
+
+   execbuf.buffers_ptr = (uintptr_t)batch->exec;
+   execbuf.buffer_count = batch->emit.nexec;
+   if (batch->ring == RENDER_RING || batch->has_softpin)
+      execbuf.rsvd1 = batch->hw_ctx;
+   execbuf.flags = hw_ring[batch->ring] | batch->batch_flags;
+
+   if (unlikely(batch->no_hw)) {
+      /* submit a dummy execbuf to keep the fences accurate */
+      batch->map[0] = 0xa << 23;
+      execbuf.batch_len = 8;
+   }
+
+   if (unlikely(drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))){
+      if (errno == ENOSPC)
+         return -ENOSPC;
+
+      fprintf(stderr,
+              "Failed to submit batch buffer, rendering will be incorrect: %s [%d]\n",
+              strerror(errno), errno);
+
+      /* submit a dummy execbuf to keep the fences accurate */
+      batch->map[0] = 0xa << 23;
+      execbuf.batch_len = 8;
+
+      if (drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))
+         return -errno;
+   }
+
+   if (PERF_IDLE && batch->idle_time[batch->ring] < 0) {
+      batch->idle_time[batch->ring] += get_time();
+      fprintf(stderr, "GPU command queue %d idle for %.3fms\n",
+	      batch->ring, batch->idle_time[batch->ring] * 1000);
+   }
+
+   list_for_each_entry(struct brw_bo, bo, &rq->read, read.link) {
+      assert(bo->exec);
+      assert(bo->read.rq == rq);
+      bo->offset = bo->exec->offset;
+      bo->exec = NULL;
+      bo->dirty = false;
+      bo->target_handle = -1;
+      if (bo->domain != DOMAIN_GPU)
+         bo->domain = DOMAIN_NONE;
+   }
+   assert(!rq->bo->dirty);
+   list_add(&rq->link, &batch->requests[batch->ring]);
+   rq->bo->pinned = batch->has_softpin;
+   rq->bo = RQ_MARK_RING(rq, batch->ring);
+
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH))
+      __brw_batch_dump(batch);
+
+   if (unlikely(batch->aub.file))
+      __brw_batch_aub(batch);
+
+   __brw_batch_throttle(batch, rq);
+   __brw_batch_retire(batch);
+
+   return __brw_batch_reset(batch);
+}
+
+/*
+ * Is the GPU still processing the most recent batch submitted?
+ * (Note this does not include the batch currently being constructed.)
+ */
+bool brw_batch_busy(struct brw_batch *batch)
+{
+   struct brw_request *rq;
+
+   if (list_empty(&batch->requests[batch->ring]))
+      return false;
+
+   rq = list_first_entry(&batch->requests[batch->ring],
+                         struct brw_request, link);
+
+   return __brw_request_busy(rq, 0);
+}
+
+/*
+ * Wait for all GPU processing to complete.
+ */
+void brw_batch_wait(struct brw_batch *batch)
+{
+   int n;
+
+   for (n = 0; n < __BRW_NUM_RINGS; n++) {
+      struct brw_request *rq;
+
+      if (list_empty(&batch->requests[n]))
+         continue;
+
+      rq = list_first_entry(&batch->requests[n],
+                            struct brw_request, link);
+
+      __brw_bo_wait(rq->bo, -1);
+   }
+}
+
+static bool __is_uncached(int fd, uint32_t handle)
+{
+   struct drm_i915_gem_caching arg;
+
+   memset(&arg, 0, sizeof(arg));
+   arg.handle = handle;
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_GET_CACHING, &arg);
+   /* There is no right answer if an error occurs here. Fortunately, the
+    * only error is ENOENT and that's impossible!
+    */
+   return arg.caching != I915_CACHING_CACHED;
+}
+
+/*
+ * Wrap a drm_intel_bo reference in a struct brw_bo. Ownership
+ * of that reference is transferred to the struct brw_bo.
+ */
+struct brw_bo *brw_bo_import(struct brw_batch *batch,
+			     drm_intel_bo *base,
+			     bool borrow)
+{
+   struct brw_bo *bo;
+   uint32_t tiling, swizzling;
+
+   if (unlikely(base == NULL))
+      return NULL;
+
+   assert(base->handle);
+   assert(base->size);
+
+   if (borrow) {
+      bo = __brw_batch_lookup_handle(batch, base->handle);
+      if (bo) {
+         bo->refcnt++;
+         return bo;
+      }
+   }
+
+   if (batch->freed_bo) {
+      bo = batch->freed_bo;
+      batch->freed_bo = (struct brw_bo *)bo->base;
+   } else {
+      bo = malloc(sizeof(*bo));
+      if (unlikely(bo == NULL))
+         return NULL;
+   }
+
+   memset(bo, 0, sizeof(*bo));
+
+   bo->handle = base->handle;
+   bo->batch = batch;
+   bo->refcnt = 1;
+   bo->offset = base->offset64;
+   bo->alignment = base->align;
+   bo->size = base->size;
+
+   drm_intel_bo_get_tiling(base, &tiling, &swizzling);
+   bo->tiling = tiling;
+   bo->swizzling = swizzling;
+   bo->reusable = !borrow;
+   bo->cache_coherent = batch->has_llc; /* XXX libdrm bookkeeping */
+
+   batch->vmsize += bo->size;
+
+   list_inithead(&bo->read.link);
+   list_inithead(&bo->write.link);
+
+   bo->base = base;
+   if (borrow) {
+      list_add(&bo->link, borrowed(batch, bo->handle));
+      drm_intel_bo_reference(base);
+      if (bo->cache_coherent)
+         bo->cache_coherent = !__is_uncached(batch->fd, bo->handle);
+   } else {
+      list_add(&bo->link, &batch->inactive);
+      /* If the buffer hasn't been used before on the GPU, presume it is a
+       * new buffer in the CPU write domain. However, a buffer may have been
+       * mapped and unused - but that should be relatively rare compared to
+       * the optimisation chance of first writing through the CPU.
+       */
+      if (bo->offset == 0)
+         __brw_bo_set_domain(bo, I915_GEM_DOMAIN_CPU, true);
+   }
+
+   return bo;
+}
+
+/*
+ * Search the list of active buffers (a local short lived cache) for
+ * something of the right size to reuse for the allocation request.
+ */
+static struct brw_bo *__brw_bo_create__cached(struct brw_batch *batch,
+					      uint64_t size)
+{
+   list_for_each_entry(struct brw_bo, bo, &batch->active, link) {
+      assert(bo->batch == batch);
+      assert(bo->read.rq != NULL);
+
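+      /* Reuse only a buffer that is large enough for the request, but
+       * no more than a third larger, to bound the wasted space.
+       */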
+      if (bo->size < size || 3*bo->size > 4*size)
+	 continue;
+
+      list_move(&bo->link, &batch->inactive);
+      bo->refcnt++;
+      return bo;
+   }
+
+   return NULL;
+}
+
+struct brw_bo *brw_bo_create(struct brw_batch *batch,
+			     const char *name,
+			     uint64_t size,
+			     uint64_t alignment,
+			     unsigned flags)
+{
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   if (flags & BO_ALLOC_FOR_RENDER) {
+      bo = __brw_bo_create__cached(batch, size);
+      if (bo) {
+	 uint32_t tiling = I915_TILING_NONE;
+
+	 /* XXX rename */
+	 bo->alignment = alignment;
+	 drm_intel_bo_set_tiling(bo->base, &tiling, 0);
+	 return bo;
+      }
+   }
+
+   base = drm_intel_bo_alloc(batch->bufmgr, name, size, alignment);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   bo = brw_bo_import(batch, base, false);
+   if (unlikely(bo == NULL)) {
+      drm_intel_bo_unreference(base);
+      return NULL;
+   }
+
+   return bo;
+}
+
+static uint64_t brw_surface_size(int cpp,
+				 uint32_t width,
+				 uint32_t height,
+				 uint32_t tiling,
+				 uint32_t *pitch)
+{
+   uint32_t tile_width, tile_height;
+
+   switch (tiling) {
+   default:
+   case I915_TILING_NONE:
+      tile_width = 64;
+      tile_height = 2;
+      break;
+   case I915_TILING_X:
+      tile_width = 512;
+      tile_height = 8;
+      break;
+   case I915_TILING_Y:
+      tile_width = 128;
+      tile_height = 32;
+      break;
+   }
+
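+   /* Worked example (illustrative): 1920x1080 with cpp=4, X-tiled:
+    * pitch = ALIGN(7680, 512) = 7680, height = ALIGN(1080, 8) = 1080,
+    * giving ALIGN(7680 * 1080, 4096) = 8294400 bytes.
+    */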
+   *pitch = ALIGN(width * cpp, tile_width);
+   height = ALIGN(height, tile_height);
+   height *= *pitch;
+   return ALIGN(height, 4096);
+}
+
+struct brw_bo *
+brw_bo_create_tiled(struct brw_batch *batch,
+		    const char *name,
+		    uint32_t width,
+		    uint32_t height,
+		    int cpp,
+		    uint32_t *tiling,
+		    uint32_t *pitch,
+		    unsigned flags)
+{
+   unsigned long __pitch;
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   if (flags & BO_ALLOC_FOR_RENDER) {
+      uint64_t size = brw_surface_size(cpp, width, height, *tiling, pitch);
+
+      bo = __brw_bo_create__cached(batch, size);
+      if (bo) {
+	 /* XXX rename */
+	 bo->alignment = 0;
+         drm_intel_bo_set_tiling(bo->base, tiling, *pitch);
+	 return bo;
+      }
+   }
+
+   base = drm_intel_bo_alloc_tiled(batch->bufmgr, name,
+                                   width, height, cpp,
+                                   tiling, &__pitch, flags);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   *pitch = __pitch;
+   bo = brw_bo_import(batch, base, false);
+   if (unlikely(bo == NULL)) {
+      drm_intel_bo_unreference(base);
+      return NULL;
+   }
+
+   return bo;
+}
+
+/*
+ * Import a foreign buffer from another process using the global
+ * (flinked) name.
+ */
+struct brw_bo *brw_bo_create_from_name(struct brw_batch *batch,
+				       const char *name,
+				       uint32_t global_name)
+{
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   base = drm_intel_bo_gem_create_from_name(batch->bufmgr, name, global_name);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   bo = brw_bo_import(batch, base, true);
+   drm_intel_bo_unreference(base);
+
+   return bo;
+}
+
+bool brw_batch_create_fence(struct brw_batch *batch,
+                            struct brw_fence *fence)
+{
+   brw_batch_flush(batch);
+
+   if (list_empty(&batch->requests[batch->ring]))
+      return false;
+
+   fence->rq = list_first_entry(&batch->requests[batch->ring],
+                                struct brw_request, link);
+   list_add(&fence->link, &fence->rq->fence);
+   return true;
+}
+
+int brw_fence_wait(struct brw_fence *fence, int64_t timeout)
+{
+   if (fence->rq == NULL)
+      return 0;
+
+   return __brw_bo_wait(fence->rq->bo, timeout);
+}
+
+void brw_fence_destroy(struct brw_fence *fence)
+{
+   if (fence->rq == NULL)
+      return;
+
+   list_del(&fence->link);
+   fence->rq = NULL;
+}
+
+/*
+ * Write a portion of the *linear* buffer using the pointer provided.
+ *
+ * This is conceptually equivalent to calling
+ *   memcpy(brw_bo_map(bo, MAP_WRITE | MAP_DETILED | flags) + offset, data, length)
+ * but can be much more efficient as it will try to avoid cache domain
+ * side-effects (if any).
+ */
+void brw_bo_write(struct brw_bo *bo,
+		  uint64_t offset,
+		  const void *data,
+		  uint64_t length,
+		  unsigned flags)
+{
+   struct drm_i915_gem_pwrite pwrite;
+   void *map;
+
+   assert(offset < bo->size);
+   assert(length <= bo->size - offset);
+
+   map = brw_bo_map(bo, MAP_WRITE | MAP_DETILED | flags);
+   if (map) {
+      memcpy(map + offset, data, length);
+      return;
+   }
+
+   memset(&pwrite, 0, sizeof(pwrite));
+   pwrite.handle = bo->handle;
+   pwrite.offset = offset;
+   pwrite.size = length;
+   pwrite.data_ptr = (uintptr_t)data;
+   if (drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite))
+      return;
+
+   if (bo->read.rq)
+      __brw_request_retire(bo->read.rq);
+
+   assert(bo->refcnt);
+   bo->domain = DOMAIN_GTT;
+}
+
+/*
+ * Read a portion of the *linear* buffer into the pointer provided.
+ *
+ * This is conceptually equivalent to calling
+ *   memcpy(data, brw_bo_map(bo, MAP_READ | MAP_DETILED | flags) + offset, length)
+ * but can be much more efficient as it will try to avoid cache domain
+ * side-effects (if any).
+ */
+void brw_bo_read(struct brw_bo *bo,
+		 uint64_t offset,
+		 void *data,
+		 uint64_t length,
+		 unsigned flags)
+{
+   struct drm_i915_gem_pread pread;
+   void *map;
+
+   assert(offset < bo->size);
+   assert(length <= bo->size - offset);
+
+   if (bo->cache_coherent) {
+      map = brw_bo_map(bo, MAP_READ | MAP_DETILED | flags);
+      if (map) {
+         memcpy(data, map + offset, length);
+         return;
+      }
+   } else {
+      if ((flags & MAP_ASYNC) == 0) {
+         struct brw_request *rq = bo->write.rq;
+         if (rq && rq->bo->exec)
+            brw_batch_flush(bo->batch);
+      }
+   }
+
+   memset(&pread, 0, sizeof(pread));
+   pread.handle = bo->handle;
+   pread.offset = offset;
+   pread.size = length;
+   pread.data_ptr = (uintptr_t)data;
+   if (drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_PREAD, &pread))
+      return;
+
+   if (bo->write.rq)
+      __brw_request_retire(bo->write.rq);
+
+   assert(bo->refcnt);
+   if (bo->domain != DOMAIN_CPU)
+      bo->domain = DOMAIN_NONE;
+}
+
+/*
+ * Provide a WC mmapping of the buffer. Coherent everywhere, but
+ * reads are very slow (as they are uncached). Fenced, so automatically
+ * detiled by hardware and constrained to fit in the aperture.
+ */
+static void *brw_bo_map__gtt(struct brw_bo *bo, unsigned flags)
+{
+   if (flags & MAP_DETILED && bo->tiling)
+      return NULL;
+
+   if (bo->map__gtt == NULL)
+      bo->map__gtt = drm_intel_gem_bo_map__gtt(bo->base);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, I915_GEM_DOMAIN_GTT, flags & MAP_WRITE);
+
+   return bo->map__gtt;
+}
+
+/*
+ * Provide a WC mmapping of the buffer. Coherent everywhere, but
+ * reads are very slow (as they are uncached). Unfenced, not
+ * constrained by the mappable aperture.
+ */
+static void *brw_bo_map__wc(struct brw_bo *bo, unsigned flags)
+{
+   if (!bo->batch->has_mmap_wc)
+      return brw_bo_map__gtt(bo, flags);
+
+   if (bo->map__wc == NULL) {
+      bo->map__wc = drm_intel_gem_bo_map__wc(bo->base);
+      if (bo->map__wc == NULL) {
+         bo->batch->has_mmap_wc = false;
+         return brw_bo_map__gtt(bo, flags);
+      }
+   }
+   assert(bo->map__wc);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, I915_GEM_DOMAIN_GTT, flags & MAP_WRITE);
+
+   return bo->map__wc;
+}
+
+/*
+ * Provide a WB mmapping of the buffer. Incoherent on non-LLC platforms
+ * and will trigger clflushes of the entire buffer. Unfenced, not
+ * constrained by the mappable aperture.
+ */
+static void *brw_bo_map__cpu(struct brw_bo *bo, unsigned flags)
+{
+   if (bo->map__cpu == NULL)
+      bo->map__cpu = drm_intel_gem_bo_map__cpu(bo->base);
+   assert(bo->map__cpu);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, I915_GEM_DOMAIN_CPU, flags & MAP_WRITE);
+
+   return bo->map__cpu;
+}
+
+static bool can_map__cpu(struct brw_bo *bo, unsigned flags)
+{
+   if (bo->cache_coherent)
+      return true;
+
+   if (flags & MAP_COHERENT)
+      return false;
+
+   if (bo->domain == DOMAIN_CPU)
+      return true;
+
+   return (flags & MAP_WRITE) == 0;
+}
+
+/*
+ * Map the buffer for access by the CPU, either for writing or reading,
+ * and return a pointer for that access.
+ *
+ * If the async flag is not set, any previous writing by the GPU is
+ * waited upon, and if write access is required all GPU reads as well.
+ *
+ * If the async flag is set, the kernel is not informed of the access
+ * and the access may be concurrent with GPU access. Also importantly,
+ * cache domain tracking for the buffer is *not* maintained and so access
+ * modes are limited to coherent modes (taking into account the current
+ * cache domain).
+ *
+ * If the detiled flag is set, the caller will perform manual detiling
+ * through the mapping, and so we do not allocate a fence for the operation.
+ * This can return NULL on failure, for example if the kernel doesn't support
+ * such an operation.
+ *
+ * The method for mapping the buffer is chosen based on the hardware
+ * architecture (LLC has fast coherent reads and writes, non-LLC has fast
+ * coherent writes, slow coherent reads but faster incoherent reads)
+ * and mode of operation. In theory, for every desired access mode,
+ * the pointer is the fastest direct CPU access to the immediate buffer.
+ * However, direct CPU access to this buffer may not always be the fastest
+ * method of accessing the data within that buffer by the CPU!
+ *
+ * Returns NULL on error.
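+ *
+ * A minimal usage sketch (hypothetical, ignoring error handling):
+ *
+ *    void *ptr = brw_bo_map(bo, MAP_WRITE);      // serialised against the GPU
+ *    if (ptr)
+ *       memcpy(ptr, data, size);
+ *    ptr = brw_bo_map(bo, MAP_READ | MAP_ASYNC); // unserialised; no domain
+ *                                                // tracking, reads may race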
+ */
+void *brw_bo_map(struct brw_bo *bo, unsigned flags)
+{
+   assert(bo->refcnt);
+
+   if ((flags & MAP_ASYNC) == 0) {
+      struct brw_request *rq;
+
+      rq = flags & MAP_WRITE ? bo->read.rq : bo->write.rq;
+      if (rq && RQ_BO(rq)->dirty)
+         brw_batch_flush(bo->batch);
+   }
+
+   if (bo->tiling && (flags & MAP_DETILED) == 0)
+      return brw_bo_map__gtt(bo, flags);
+   else if (can_map__cpu(bo, flags))
+      return brw_bo_map__cpu(bo, flags);
+   else
+      return brw_bo_map__wc(bo, flags);
+}
+
+/*
+ * After the final reference to a bo is released, free the buffer.
+ *
+ * If the buffer is still active, and it is reusable, the buffer is
+ * transferred to the local active cache and may be reallocated on the
+ * next call to brw_bo_create() or brw_bo_create_tiled(). Otherwise the
+ * buffer is returned back to the shared screen bufmgr pool.
+ */
+void  __brw_bo_free(struct brw_bo *bo)
+{
+   struct brw_batch *batch;
+
+   assert(bo->refcnt == 0);
+
+   if (bo->read.rq) {
+      assert(bo->batch);
+      if (bo->reusable)
+         list_move(&bo->link, &bo->batch->active);
+      return;
+   }
+
+   assert(!bo->write.rq);
+   list_del(&bo->link);
+
+   if (bo->offset)
+      bo->base->offset64 = bo->offset;
+   drm_intel_bo_unreference(bo->base);
+
+   batch = bo->batch;
+   if (batch == NULL) {
+      free(bo);
+      return;
+   }
+
+   batch->vmsize -= bo->size;
+   if (batch->vmsize < batch->peak_rss)
+      batch->peak_rss = batch->vmsize;
+
+   bo->base = (drm_intel_bo *)batch->freed_bo;
+   batch->freed_bo = bo;
+}
+
+/*
+ * Mark the beginning of a batch construction critical section, during which
+ * the batch is not allowed to be flushed. Access to the batch prior to this
+ * call is invalid. Access after this call but with instructions for another
+ * ring is also invalid. All command emission (brw_batch_emit()/OUT_BATCH()
+ * and friends) must be inside a brw_batch_begin()/brw_batch_end() pairing -
+ * the exceptions to this rule are the brw_start_batch() and
+ * brw_finish_batch() callbacks.
+ *
+ * Control returns to the caller of brw_batch_begin() if an error is
+ * encountered whilst inside the critical section. If the return code
+ * is negative, a fatal error occurred. If the return code is positive,
+ * the batch had to be flushed and the critical section needs to be restarted.
+ *
+ * On success 0 is returned.
+ *
+ * Must be paired with brw_batch_end().
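+ *
+ * A typical critical section is sketched below (hypothetical usage; a
+ * positive return simply re-runs the section, as the longjmp from
+ * brw_batch_end() resumes at the setjmp inside brw_batch_begin()):
+ *
+ *    if (brw_batch_begin(&brw->batch, bytes, RENDER_RING) < 0)
+ *       return;                  // fatal error
+ *    ... emit commands via OUT_BATCH() et al ...
+ *    if (brw_batch_end(&brw->batch))
+ *       return;                  // flush failed even after rollback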
+ */
+int brw_batch_begin(struct brw_batch *batch,
+		    uint32_t bytes,
+		    enum brw_gpu_ring ring)
+{
+   uint16_t space;
+
+   if (unlikely(batch->next_request == NULL))
+      return -ENOMEM;
+
+   ring = batch->actual_ring[ring];
+   if (ring != batch->ring)
+      space = 0;
+   else
+      space = batch->state - batch->reserved - batch->emit.nbatch;
+   if (unlikely(bytes/4 > space)) {
+      int ret = brw_batch_flush(batch);
+      if (ret)
+         return ret;
+   }
+
+   batch->ring = ring;
+   batch->bo->dirty = true;
+
+   if (batch->emit.nbatch == 0)
+      /* An early allocation error should be impossible */
+      brw_start_batch(batch);
+
+   assert(batch->ring == ring);
+   batch->saved = batch->emit;
+   return setjmp(batch->jmpbuf);
+}
+
+/*
+ * Mark the end of a batch construction critical section. After this call
+ * the batch is inaccessible until the next brw_batch_begin().
+ *
+ * We may flush the batch to hardware if it exceeds the aperture
+ * high water mark. If the batch submission fails, we rollback to the
+ * end of the previous critical section and try flushing again. If that
+ * should fail, we report the error back to the caller. If the rollback
+ * succeeds, we jump back to the brw_batch_begin() with a fresh request
+ * and run through the critical section again.
+ *
+ * Returns 0 on success and no errors have occurred.
+ *
+ * Must be paired with brw_batch_begin().
+ */
+int brw_batch_end(struct brw_batch *batch)
+{
+   int ret;
+
+   if (batch->aperture < batch->max_aperture)
+      return 0;
+
+   ret = brw_batch_flush(batch);
+   if (likely(ret == 0))
+      return 0;
+
+   if (batch->saved.nbatch == batch->emit.nbatch)
+      return ret;
+
+   batch->emit = batch->saved;
+
+   ret = brw_batch_flush(batch);
+   if (ret)
+      return ret;
+
+   /* The rollback flushed cleanly; jump back to the setjmp inside
+    * brw_batch_begin() (which will return 1) so that the caller re-runs
+    * the critical section with a fresh batch.
+    */
+   longjmp(batch->jmpbuf, 1);
+}
+
+/*
+ * Return how much of the batch is used: both the commands emitted from the
+ * start of the batch and the indirect state packed in from the end.
+ */
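+/* Batch layout sketch (the fixed 64k batch is BATCH_SIZE/4 dwords):
+ *
+ *    [ commands: 0 .. emit.nbatch )  ...free...  ( state .. BATCH_SIZE/4 ]
+ *
+ * Commands grow upwards from the start while indirect state grows downwards
+ * from the end; the batch is full when the two cursors meet, less the
+ * reserved space.
+ */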
+inline static int __brw_batch_count(struct brw_batch *batch)
+{
+   return batch->emit.nbatch + BATCH_SIZE/4 - batch->state;
+}
+
+/*
+ * After a high-level draw command, check to see if we want to flush
+ * the batch to the hardware for either debug reasons or for sanity.
+ */
+int brw_batch_maybe_flush(struct brw_batch *batch)
+{
+   if (batch->always_flush)
+      goto flush;
+
+   /* If the working set exceeds the GTT's limits, we will need to evict
+    * textures in order to execute batches. As we have no method for predicting
+    * when we need to evict, we need to frequently flush the batch so that any
+    * stalls are minimised.
+    */
+   if (batch->peak_rss > batch->max_aperture && __brw_batch_count(batch) > 2048)
+      goto flush;
+
+   return 0;
+
+flush:
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
+      fprintf(stderr, "Forcing batchbuffer flush after %d: debug.always_flush?=%d, rss=%d [cap %d], vmasize=%d\n",
+	      batch->emit.nbatch,
+	      batch->always_flush,
+              (int)(batch->peak_rss >> 20), (int)(batch->max_aperture >> 20),
+	      (int)(batch->vmsize >> 20));
+   }
+   return brw_batch_flush(batch);
+}
+
+/*
+ * Query the kernel for the number of times our hardware context has been
+ * implicated in a reset event - either as the guilty party or as an
+ * innocent victim - and for the total number of resets that have occurred
+ * overall.
+ */
+int brw_batch_get_reset_stats(struct brw_batch *batch,
+			      uint32_t *reset_count,
+			      uint32_t *active,
+			      uint32_t *pending)
+{
+   struct drm_i915_reset_stats stats;
+
+   if (!batch->hw_ctx)
+      return -ENODEV;
+
+   memset(&stats, 0, sizeof(stats));
+   stats.ctx_id = batch->hw_ctx;
+   if (drmIoctl(batch->fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats))
+      return -errno;
+
+   *reset_count = stats.reset_count;
+   *active = stats.batch_active;
+   *pending = stats.batch_pending;
+   return 0;
+}
+
+/*
+ * Mark the buffers as being invalid to prevent stale dereferences when
+ * tearing down shared resources.
+ */
+static void __brw_bo_list_fini(struct list_head *list)
+{
+   while (!list_empty(list)) {
+      struct brw_bo *bo = list_first_entry(list, struct brw_bo, link);
+
+      assert(bo->batch);
+      assert(bo->read.rq == NULL);
+
+      bo->batch = NULL;
+      list_delinit(&bo->link);
+   }
+}
+
+/* Normally we never free a request as they get recycled between batches -
+ * except when we tear down the batch manager and free everything.
+ */
+static void __brw_request_free(struct brw_request *rq)
+{
+   /* Opencode the free(bo) here to handle batch->next_request */
+   assert(RQ_BO(rq) == rq->bo);
+   list_delinit(&rq->bo->link);
+   free(rq->bo);
+   free(rq);
+}
+
+/*
+ * Teardown the batch manager and free all associated memory and resources.
+ */
+void brw_batch_fini(struct brw_batch *batch)
+{
+   int n;
+
+   /* All bos should have been released before the destructor is called */
+   batch->fini = true;
+
+   drm_intel_aub_enable(&batch->aub, -1, 0);
+
+   for (n = 0; n < __BRW_NUM_RINGS; n++) {
+      struct brw_request *rq;
+
+      if (list_empty(&batch->requests[n]))
+         continue;
+
+      /* Note that the request and buffers are not truly idle here. It is
+       * safe as the kernel will keep a reference whilst the buffers are
+       * active (so we can shutdown ahead of time), but we need to disable
+       * our runtime assertions that the request is idle at the time of
+       * retiring.
+       */
+      rq = list_first_entry(&batch->requests[n], struct brw_request, link);
+      __brw_request_retire(rq);
+
+      assert(list_empty(&batch->requests[n]));
+   }
+
+   while (batch->freed_rq) {
+      struct brw_request *rq = batch->freed_rq;
+      batch->freed_rq = (struct brw_request *)rq->link.next;
+      __brw_request_free(rq);
+   }
+   __brw_request_free(batch->next_request);
+
+   assert(list_empty(&batch->active));
+   for (n = 0; n < 1 << BORROWED_BITS; n++)
+      __brw_bo_list_fini(&batch->borrowed[n]);
+   __brw_bo_list_fini(&batch->inactive);
+
+   while (batch->freed_bo) {
+      struct brw_bo *bo = batch->freed_bo;
+      batch->freed_bo = (struct brw_bo *)bo->base;
+      free(bo);
+   }
+
+   free(batch->exec);
+   free(batch->reloc);
+
+   if (batch->hw_ctx) {
+      struct drm_i915_gem_context_destroy destroy;
+
+      memset(&destroy, 0, sizeof(destroy));
+      destroy.ctx_id = batch->hw_ctx;
+      drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
+   }
+}
diff --git a/src/mesa/drivers/dri/i965/brw_batch.h b/src/mesa/drivers/dri/i965/brw_batch.h
new file mode 100644
index 0000000..0b5468b
--- /dev/null
+++ b/src/mesa/drivers/dri/i965/brw_batch.h
@@ -0,0 +1,377 @@
+#ifndef BRW_BATCH_H
+#define BRW_BATCH_H
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
+#include <setjmp.h>
+#include <assert.h>
+
+#include <intel_aub.h>
+#include <intel_bufmgr.h>
+
+#include "util/list.h"
+
+#define HAS_GCC(major, minor) (defined(__GNUC__) && (__GNUC__ > (major) || (__GNUC__ == (major) && __GNUC_MINOR__ >= (minor))))
+
+#if HAS_GCC(3, 4)
+#define must_check  __attribute__((warn_unused_result))
+#else
+#define must_check
+#endif
+
+struct _drm_intel_bufmgr;
+struct _drm_intel_bo;
+
+struct intel_screen;
+
+enum brw_gpu_ring {
+   RENDER_RING = 0,
+   BLT_RING,
+   __BRW_NUM_RINGS,
+};
+
+struct brw_batch;
+struct brw_bo;
+struct brw_request;
+
+enum brw_bo_domain { DOMAIN_NONE, DOMAIN_CPU, DOMAIN_GTT, DOMAIN_GPU };
+
+struct brw_bo {
+   struct brw_batch *batch;
+   struct drm_i915_gem_exec_object2 *exec;
+   struct brw_request_node {
+      struct brw_request *rq;
+      struct list_head link;
+   } read, write;
+
+   unsigned dirty : 1;
+   unsigned domain : 2;
+   unsigned tiling : 4;
+   unsigned swizzling : 4;
+   unsigned pinned : 1;
+   unsigned cache_coherent : 1;
+   unsigned reusable : 1;
+
+   unsigned refcnt;
+   uint32_t handle;
+   uint32_t target_handle;
+   uint64_t size;
+   uint64_t alignment;
+   uint64_t offset;
+
+   struct _drm_intel_bo *base;
+   struct list_head link;
+
+   void *map__cpu;
+   void *map__gtt;
+   void *map__wc;
+};
+
+struct brw_batch {
+   int fd;
+
+   struct brw_bo *bo;
+   uint32_t *map;
+
+   uint32_t batch_flags;
+   uint32_t batch_base_flags;
+
+   enum brw_gpu_ring ring;
+   uint32_t hw_ctx;
+
+   uint16_t reserved;
+   uint16_t state;
+
+   struct brw_batch_state {
+      uint16_t nbatch;
+      uint16_t nexec;
+      uint16_t nreloc;
+      uint16_t nself;
+   } emit, saved;
+
+   uint64_t aperture;
+   uint64_t max_aperture;
+   uint64_t rss, peak_rss, vmsize;
+
+   bool has_softpin : 1;
+   bool has_llc : 1;
+   bool has_mmap_wc : 1;
+   bool needs_pipecontrol_ggtt_wa : 1;
+
+   bool always_flush : 1;
+
+   /** Framerate throttling: @{ */
+   /* Limit the number of outstanding SwapBuffers by waiting for an earlier
+    * frame of rendering to complete. This gives a very precise cap to the
+    * latency between input and output such that rendering never gets more
+    * than a frame behind the user. (With the caveat that we technically are
+    * not using the SwapBuffers itself as a barrier but the first batch
+    * submitted afterwards, which may be immediately prior to the next
+    * SwapBuffers.)
+    */
+   bool need_swap_throttle : 1;
+
+   /** General throttling, not caught by throttling between SwapBuffers */
+   bool need_flush_throttle : 1;
+   bool disable_throttling : 1;
+   /** @} */
+
+   bool no_hw : 1;
+   bool fini : 1;
+
+   jmp_buf jmpbuf;
+
+   uint16_t exec_size;
+   uint16_t reloc_size;
+
+   struct drm_i915_gem_exec_object2 *exec;
+   struct drm_i915_gem_relocation_entry *reloc;
+   uint16_t self_reloc[256];
+
+   int actual_ring[__BRW_NUM_RINGS];
+   struct list_head requests[__BRW_NUM_RINGS];
+   struct brw_request *throttle;
+   struct brw_request *next_request;
+   struct brw_request *freed_rq;
+
+   double idle_time[__BRW_NUM_RINGS];
+
+   struct intel_screen *screen;
+   struct _drm_intel_bufmgr *bufmgr;
+   struct list_head active, inactive;
+
+#define BORROWED_BITS 3
+   struct list_head borrowed[1<<BORROWED_BITS];
+
+   struct brw_bo *freed_bo;
+
+   drm_intel_aub aub;
+};
+
+int brw_batch_init(struct brw_batch *batch,
+		   struct intel_screen *screen);
+
+void brw_batch_clear_dirty(struct brw_batch *batch);
+
+/** Add a relocation entry to the current batch
+ * XXX worth specialising 32bit variant?
+ */
+uint64_t __brw_batch_reloc(struct brw_batch *batch,
+                           uint32_t batch_offset,
+                           struct brw_bo *target_bo,
+                           uint64_t target_offset,
+                           unsigned read_domains,
+                           unsigned write_domain);
+must_check static inline uint64_t brw_batch_reloc(struct brw_batch *batch,
+						  uint32_t batch_offset,
+						  struct brw_bo *target_bo,
+						  uint64_t target_offset,
+						  unsigned read_domains,
+						  unsigned write_domain)
+{
+   if (target_bo == NULL)
+      return target_offset;
+
+   return __brw_batch_reloc(batch, batch_offset,
+                            target_bo, target_offset,
+                            read_domains, write_domain);
+}
+
+int brw_batch_get_reset_stats(struct brw_batch *batch,
+			      uint32_t *reset_count,
+			      uint32_t *active,
+			      uint32_t *pending);
+
+bool brw_batch_busy(struct brw_batch *batch);
+/** Wait for the last submitted rendering to complete */
+void brw_batch_wait(struct brw_batch *batch);
+
+void brw_batch_fini(struct brw_batch *batch);
+
+/* Wrap a drm_intel_bo within a local struct brw_bo */
+struct brw_bo *
+brw_bo_import(struct brw_batch *batch,
+	      struct _drm_intel_bo *base,
+	      bool borrow);
+
+/* Create a local brw_bo for a linear/unfenced buffer and allocate the buffer */
+struct brw_bo *
+brw_bo_create(struct brw_batch *batch,
+	      const char *name,
+	      uint64_t size,
+	      uint64_t alignment,
+	      unsigned flags);
+
+/* Create a local brw_bo for a tiled buffer and allocate the buffer */
+struct brw_bo *
+brw_bo_create_tiled(struct brw_batch *batch,
+		    const char *name,
+		    uint32_t width,
+		    uint32_t height,
+		    int cpp,
+		    uint32_t *tiling,
+		    uint32_t *pitch,
+		    unsigned flags);
+
+/* Create a local brw_bo for a foreign buffer using its global flinked name */
+struct brw_bo *brw_bo_create_from_name(struct brw_batch *batch,
+				       const char *name,
+				       uint32_t global_name);
+
+void brw_bo_mark_dirty(struct brw_batch *batch, struct brw_bo *bo);
+
+inline static int brw_bo_madvise(struct brw_bo *bo, int state)
+{
+   return drm_intel_bo_madvise(bo->base, state);
+}
+
+inline static uint32_t brw_bo_flink(struct brw_bo *bo)
+{
+   uint32_t name = 0;
+   drm_intel_bo_flink(bo->base, &name);
+   return name;
+}
+
+int brw_bo_wait(struct brw_bo *bo, int64_t timeout);
+
+void brw_bo_write(struct brw_bo *bo, uint64_t offset,
+		  const void *data, uint64_t length,
+		  unsigned flags);
+void brw_bo_read(struct brw_bo *bo, uint64_t offset,
+		 void *data, uint64_t length,
+		 unsigned flags);
+
+bool __brw_request_busy(struct brw_request *rq, unsigned flags);
+
+#define BUSY_READ 0
+#define BUSY_WRITE 1
+#define BUSY_FLUSH 2
+#define BUSY_RETIRE 4
+
+static inline bool brw_bo_busy(struct brw_bo *bo, unsigned flags)
+{
+   struct brw_request *rq;
+
+   if (!bo)
+      return false;
+
+   assert(bo->refcnt);
+   rq = flags & BUSY_WRITE ? bo->read.rq : bo->write.rq;
+   if (!rq) {
+      assert(!bo->exec);
+      return false;
+   }
+
+   if (flags & (BUSY_FLUSH | BUSY_RETIRE))
+      return __brw_request_busy(rq, flags);
+
+   return true;
+}
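+
+/* For example (hypothetical usage): before overwriting a buffer, a caller
+ * might avoid stalling on the GPU by checking
+ *
+ *    if (brw_bo_busy(bo, BUSY_WRITE | BUSY_FLUSH))
+ *       ... allocate a fresh buffer rather than wait ...
+ */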
+
+void *brw_bo_map(struct brw_bo *bo, unsigned flags);
+#define MAP_READ 0x0
+#define MAP_WRITE 0x1
+#define MAP_ASYNC 0x2
+#define MAP_COHERENT 0x4
+#define MAP_DETILED 0x8
+
+/* Take a new reference to the brw_bo */
+static inline struct brw_bo *brw_bo_get(struct brw_bo *bo)
+{
+   assert(bo != NULL && bo->refcnt > 0);
+   bo->refcnt++;
+   return bo;
+}
+
+/* Release a reference to the brw_bo */
+void  __brw_bo_free(struct brw_bo *bo);
+static inline void brw_bo_put(struct brw_bo *bo)
+{
+   assert(bo == NULL || bo->refcnt > 0);
+   if (bo && --bo->refcnt == 0)
+      __brw_bo_free(bo);
+}
+
+/* A fence is created at the current point on the ordered batch timeline. When
+ * the GPU passes that point, the fence will be signalled; alternatively you
+ * can block and wait for the fence to complete.
+ */
+struct brw_fence {
+   struct brw_request *rq;
+   struct list_head link;
+   void (*signal)(struct brw_fence *);
+};
+
+bool
+brw_batch_create_fence(struct brw_batch *batch,
+                       struct brw_fence *fence);
+
+static inline bool
+brw_fence_busy(struct brw_fence *fence)
+{
+   return __brw_request_busy(fence->rq, BUSY_FLUSH | BUSY_RETIRE);
+}
+
+int brw_fence_wait(struct brw_fence *fence, int64_t timeout);
+void brw_fence_destroy(struct brw_fence *fence);
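+
+/* A sketch of throttling with a fence (hypothetical usage; the fence is
+ * attached to the most recent request on the current ring, and a negative
+ * timeout is presumed to mean wait-forever):
+ *
+ *    struct brw_fence fence;
+ *    if (brw_batch_create_fence(&brw->batch, &fence)) {
+ *       brw_fence_wait(&fence, -1);
+ *       brw_fence_destroy(&fence);
+ *    }
+ */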
+
+/* Control batch command insertion and submission to hw */
+must_check int brw_batch_begin(struct brw_batch *batch,
+			       uint32_t estimated_bytes,
+			       enum brw_gpu_ring ring);
+int brw_batch_end(struct brw_batch *batch);
+int brw_batch_flush(struct brw_batch *batch);
+int brw_batch_maybe_flush(struct brw_batch *batch);
+
+/* Interfaces for writing commands into the batch */
+static inline void __brw_batch_check(struct brw_batch *batch, int count, enum brw_gpu_ring ring)
+{
+   assert(batch->emit.nbatch + count < batch->state - batch->reserved);
+   assert(batch->ring == batch->actual_ring[ring]);
+}
+
+static inline void brw_batch_emit(struct brw_batch *batch, uint32_t dw)
+{
+   batch->map[batch->emit.nbatch++] = dw;
+}
+
+static inline void brw_batch_data(struct brw_batch *batch,
+				  const void *data,
+				  int bytes)
+{
+   assert(batch->emit.nbatch + bytes/4 < batch->state - batch->reserved);
+   assert((bytes & 3) == 0);
+   memcpy(batch->map + batch->emit.nbatch, data, bytes);
+   batch->emit.nbatch += bytes / 4;
+}
+
+static inline uint32_t float_as_int(float f)
+{
+   union {
+      float f;
+      uint32_t dw;
+   } fi;
+
+   fi.f = f;
+   return fi.dw;
+}
+
+static inline void brw_batch_emit64(struct brw_batch *batch, uint64_t qw)
+{
+   *(uint64_t *)(batch->map + batch->emit.nbatch) = qw;
+   batch->emit.nbatch += 2;
+}
+
+#define BEGIN_BATCH(n) __brw_batch_check(&brw->batch, n, RENDER_RING)
+#define BEGIN_BATCH_BLT(n) __brw_batch_check(&brw->batch, n, BLT_RING)
+#define OUT_BATCH(dw) brw_batch_emit(&brw->batch, dw)
+#define OUT_BATCH_F(f) brw_batch_emit(&brw->batch, float_as_int(f))
+#define OUT_RELOC(bo, read_domains, write_domain, delta) \
+	OUT_BATCH(brw_batch_reloc(&brw->batch, brw->batch.emit.nbatch*4, \
+				  bo, delta, read_domains, write_domain))
+#define OUT_BATCH64(qw) brw_batch_emit64(&brw->batch, qw)
+#define OUT_RELOC64(bo, read_domains, write_domain, delta) \
+	OUT_BATCH64(brw_batch_reloc(&brw->batch, brw->batch.emit.nbatch*4,\
+				    bo, delta, read_domains, write_domain))
+#define ADVANCE_BATCH()
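+
+/* Example emission through the macros above (hypothetical; MI_FLUSH and
+ * MI_NOOP are assumed to come from intel_reg.h):
+ *
+ *    if (brw_batch_begin(&brw->batch, 8, RENDER_RING) < 0)
+ *       return;
+ *    BEGIN_BATCH(2);
+ *    OUT_BATCH(MI_FLUSH);
+ *    OUT_BATCH(MI_NOOP);
+ *    ADVANCE_BATCH();
+ *    if (brw_batch_end(&brw->batch))
+ *       return;
+ */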
+
+#endif /* BRW_BATCH_H */
diff --git a/src/mesa/drivers/dri/i965/brw_binding_tables.c b/src/mesa/drivers/dri/i965/brw_binding_tables.c
index 98ff0dd..697b4c7 100644
--- a/src/mesa/drivers/dri/i965/brw_binding_tables.c
+++ b/src/mesa/drivers/dri/i965/brw_binding_tables.c
@@ -42,7 +42,6 @@
 #include "brw_context.h"
 #include "brw_defines.h"
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
 
 /**
  * Upload a shader stage's binding table as indirect state.
diff --git a/src/mesa/drivers/dri/i965/brw_blorp.cpp b/src/mesa/drivers/dri/i965/brw_blorp.cpp
index 2ccfae1..e1a9f56 100644
--- a/src/mesa/drivers/dri/i965/brw_blorp.cpp
+++ b/src/mesa/drivers/dri/i965/brw_blorp.cpp
@@ -22,7 +22,6 @@
  */
 
 #include <errno.h>
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 
 #include "brw_blorp.h"
@@ -211,7 +210,9 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
 {
    struct gl_context *ctx = &brw->ctx;
    uint32_t estimated_max_batch_usage = 1500;
-   bool check_aperture_failed_once = false;
+
+   if (brw_batch_begin(&brw->batch, estimated_max_batch_usage, RENDER_RING) < 0)
+      return;
 
    /* Flush the sampler and render caches.  We definitely need to flush the
     * sampler cache so that we get updated contents from the render cache for
@@ -222,13 +223,6 @@ brw_blorp_exec(struct brw_context *brw, const brw_blorp_params *params)
     */
    brw_emit_mi_flush(brw);
 
-retry:
-   intel_batchbuffer_require_space(brw, estimated_max_batch_usage, RENDER_RING);
-   intel_batchbuffer_save_state(brw);
-   drm_intel_bo *saved_bo = brw->batch.bo;
-   uint32_t saved_used = brw->batch.used;
-   uint32_t saved_state_batch_offset = brw->batch.state_batch_offset;
-
    switch (brw->gen) {
    case 6:
       gen6_blorp_exec(brw, params);
@@ -241,37 +235,18 @@ retry:
       unreachable("not reached");
    }
 
-   /* Make sure we didn't wrap the batch unintentionally, and make sure we
-    * reserved enough space that a wrap will never happen.
-    */
-   assert(brw->batch.bo == saved_bo);
-   assert((brw->batch.used - saved_used) * 4 +
-          (saved_state_batch_offset - brw->batch.state_batch_offset) <
-          estimated_max_batch_usage);
-   /* Shut up compiler warnings on release build */
-   (void)saved_bo;
-   (void)saved_used;
-   (void)saved_state_batch_offset;
+   brw_emit_mi_flush(brw);
 
    /* Check if the blorp op we just did would make our batch likely to fail to
     * map all the BOs into the GPU at batch exec time later.  If so, flush the
     * batch and try again with nothing else in the batch.
     */
-   if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
-      if (!check_aperture_failed_once) {
-         check_aperture_failed_once = true;
-         intel_batchbuffer_reset_to_saved(brw);
-         intel_batchbuffer_flush(brw);
-         goto retry;
-      } else {
-         int ret = intel_batchbuffer_flush(brw);
-         WARN_ONCE(ret == -ENOSPC,
-                   "i965: blorp emit exceeded available aperture space\n");
-      }
+   if (brw_batch_end(&brw->batch)) {
+      WARN_ONCE(1, "i965: blorp emit exceeded available aperture space\n");
+      return;
    }
 
-   if (unlikely(brw->always_flush_batch))
-      intel_batchbuffer_flush(brw);
+   brw_batch_maybe_flush(&brw->batch);
 
    /* We've smashed all state compared to what the normal 3D pipeline
     * rendering tracks for GL.
@@ -279,11 +254,6 @@ retry:
    brw->ctx.NewDriverState = ~0ull;
    brw->no_depth_or_stencil = false;
    brw->ib.type = -1;
-
-   /* Flush the sampler cache so any texturing from the destination is
-    * coherent.
-    */
-   brw_emit_mi_flush(brw);
 }
 
 brw_hiz_op_params::brw_hiz_op_params(struct intel_mipmap_tree *mt,
diff --git a/src/mesa/drivers/dri/i965/brw_cc.c b/src/mesa/drivers/dri/i965/brw_cc.c
index 354c733..4f62b29 100644
--- a/src/mesa/drivers/dri/i965/brw_cc.c
+++ b/src/mesa/drivers/dri/i965/brw_cc.c
@@ -36,7 +36,6 @@
 #include "brw_util.h"
 #include "main/macros.h"
 #include "main/stencil.h"
-#include "intel_batchbuffer.h"
 
 static void
 brw_upload_cc_vp(struct brw_context *brw)
@@ -227,17 +226,14 @@ static void upload_cc_unit(struct brw_context *brw)
       cc->cc5.statistics_enable = 1;
 
    /* BRW_NEW_CC_VP */
-   cc->cc4.cc_viewport_state_offset = (brw->batch.bo->offset64 +
-				       brw->cc.vp_offset) >> 5; /* reloc */
+   cc->cc4.cc_viewport_state_offset =
+      brw_batch_reloc(&brw->batch,
+		      (brw->cc.state_offset +
+		       offsetof(struct brw_cc_unit_state, cc4)),
+		      brw->batch.bo, brw->cc.vp_offset,
+		      I915_GEM_DOMAIN_INSTRUCTION, 0) >> 5;
 
    brw->ctx.NewDriverState |= BRW_NEW_GEN4_UNIT_STATE;
-
-   /* Emit CC viewport relocation */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-			   (brw->cc.state_offset +
-			    offsetof(struct brw_cc_unit_state, cc4)),
-			   brw->batch.bo, brw->cc.vp_offset,
-			   I915_GEM_DOMAIN_INSTRUCTION, 0);
 }
 
 const struct brw_tracked_state brw_cc_unit = {
diff --git a/src/mesa/drivers/dri/i965/brw_clear.c b/src/mesa/drivers/dri/i965/brw_clear.c
index f981388..571e692 100644
--- a/src/mesa/drivers/dri/i965/brw_clear.c
+++ b/src/mesa/drivers/dri/i965/brw_clear.c
@@ -32,7 +32,6 @@
 #include "swrast/swrast.h"
 #include "drivers/common/meta.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_blit.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip.c b/src/mesa/drivers/dri/i965/brw_clip.c
index 3a73c64..e044375 100644
--- a/src/mesa/drivers/dri/i965/brw_clip.c
+++ b/src/mesa/drivers/dri/i965/brw_clip.c
@@ -33,8 +33,6 @@
 #include "main/macros.h"
 #include "main/enums.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip_line.c b/src/mesa/drivers/dri/i965/brw_clip_line.c
index 8e34f7c..65db789 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_line.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_line.c
@@ -34,8 +34,6 @@
 #include "main/enums.h"
 #include "program/program.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip_point.c b/src/mesa/drivers/dri/i965/brw_clip_point.c
index 81487d3..9c886ff 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_point.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_point.c
@@ -34,8 +34,6 @@
 #include "main/enums.h"
 #include "program/program.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip_state.c b/src/mesa/drivers/dri/i965/brw_clip_state.c
index dee74db..8307ecd 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_state.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_state.c
@@ -133,16 +133,14 @@ brw_upload_clip_unit(struct brw_context *brw)
        ctx->ViewportArray[0].Width == fb_width &&
        ctx->ViewportArray[0].Height == fb_height)
    {
+      /* emit clip viewport relocation */
       clip->clip5.guard_band_enable = 1;
       clip->clip6.clipper_viewport_state_ptr =
-         (brw->batch.bo->offset64 + brw->clip.vp_offset) >> 5;
-
-      /* emit clip viewport relocation */
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-                              (brw->clip.state_offset +
-                               offsetof(struct brw_clip_unit_state, clip6)),
-                              brw->batch.bo, brw->clip.vp_offset,
-                              I915_GEM_DOMAIN_INSTRUCTION, 0);
+	 brw_batch_reloc(&brw->batch,
+			 (brw->clip.state_offset +
+			  offsetof(struct brw_clip_unit_state, clip6)),
+			 brw->batch.bo, brw->clip.vp_offset,
+			 I915_GEM_DOMAIN_INSTRUCTION, 0) >> 5;
    }
 
    /* _NEW_TRANSFORM */
diff --git a/src/mesa/drivers/dri/i965/brw_clip_tri.c b/src/mesa/drivers/dri/i965/brw_clip_tri.c
index cca7eb1..64db7e4 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_tri.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_tri.c
@@ -34,8 +34,6 @@
 #include "main/enums.h"
 #include "program/program.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip_unfilled.c b/src/mesa/drivers/dri/i965/brw_clip_unfilled.c
index 6baf620..48c2648 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_unfilled.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_unfilled.c
@@ -34,8 +34,6 @@
 #include "main/enums.h"
 #include "program/program.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_clip_util.c b/src/mesa/drivers/dri/i965/brw_clip_util.c
index 40ad144..7b953b2 100644
--- a/src/mesa/drivers/dri/i965/brw_clip_util.c
+++ b/src/mesa/drivers/dri/i965/brw_clip_util.c
@@ -35,8 +35,6 @@
 #include "main/enums.h"
 #include "program/program.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_compute.c b/src/mesa/drivers/dri/i965/brw_compute.c
index 5693ab5..7dfcd6b 100644
--- a/src/mesa/drivers/dri/i965/brw_compute.c
+++ b/src/mesa/drivers/dri/i965/brw_compute.c
@@ -30,7 +30,6 @@
 #include "brw_context.h"
 #include "brw_draw.h"
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
 #include "brw_defines.h"
 
 
@@ -87,7 +86,7 @@ brw_dispatch_compute(struct gl_context *ctx, const GLuint *num_groups)
 {
    struct brw_context *brw = brw_context(ctx);
    int estimated_buffer_space_needed;
-   bool fail_next = false;
+   int ret;
 
    if (!_mesa_check_conditional_render(ctx))
       return;
@@ -105,48 +104,23 @@ brw_dispatch_compute(struct gl_context *ctx, const GLuint *num_groups)
    estimated_buffer_space_needed += 1024; /* push constants */
    estimated_buffer_space_needed += 512; /* misc. pad */
 
-   /* Flush the batch if it's approaching full, so that we don't wrap while
-    * we've got validated state that needs to be in the same batch as the
-    * primitives.
-    */
-   intel_batchbuffer_require_space(brw, estimated_buffer_space_needed,
-                                   RENDER_RING);
-   intel_batchbuffer_save_state(brw);
+   ret = brw_batch_begin(&brw->batch, estimated_buffer_space_needed, RENDER_RING);
+   if (ret < 0)
+      return;
 
- retry:
-   brw->no_batch_wrap = true;
    brw_upload_compute_state(brw);
-
    brw_emit_gpgpu_walker(brw, num_groups);
 
-   brw->no_batch_wrap = false;
-
-   if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
-      if (!fail_next) {
-         intel_batchbuffer_reset_to_saved(brw);
-         intel_batchbuffer_flush(brw);
-         fail_next = true;
-         goto retry;
-      } else {
-         if (intel_batchbuffer_flush(brw) == -ENOSPC) {
-            static bool warned = false;
-
-            if (!warned) {
-               fprintf(stderr, "i965: Single compute shader dispatch "
-                       "exceeded available aperture space\n");
-               warned = true;
-            }
-         }
-      }
-   }
+   ret = brw_batch_end(&brw->batch);
+   if (ret)
+      return;
 
    /* Now that we know we haven't run out of aperture space, we can safely
     * reset the dirty bits.
     */
    brw_compute_state_finished(brw);
 
-   if (brw->always_flush_batch)
-      intel_batchbuffer_flush(brw);
+   brw_batch_maybe_flush(&brw->batch);
 
    brw_state_cache_check_size(brw);
 
diff --git a/src/mesa/drivers/dri/i965/brw_conditional_render.c b/src/mesa/drivers/dri/i965/brw_conditional_render.c
index 6d37c3b..ffd10a6 100644
--- a/src/mesa/drivers/dri/i965/brw_conditional_render.c
+++ b/src/mesa/drivers/dri/i965/brw_conditional_render.c
@@ -35,7 +35,7 @@
 
 #include "brw_context.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
+#include "intel_reg.h"
 
 static void
 set_predicate_enable(struct brw_context *brw,
diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
index 8150b94..a8ed0b8 100644
--- a/src/mesa/drivers/dri/i965/brw_context.c
+++ b/src/mesa/drivers/dri/i965/brw_context.c
@@ -54,13 +54,13 @@
 #include "brw_draw.h"
 #include "brw_state.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
 #include "intel_buffers.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 #include "intel_pixel.h"
 #include "intel_image.h"
+#include "intel_reg.h"
 #include "intel_tex.h"
 #include "intel_tex_obj.h"
 
@@ -186,7 +186,8 @@ intel_update_state(struct gl_context * ctx, GLuint new_state)
 	 continue;
       intel_miptree_all_slices_resolve_depth(brw, tex_obj->mt);
       intel_miptree_resolve_color(brw, tex_obj->mt);
-      brw_render_cache_set_check_flush(brw, tex_obj->mt->bo);
+      if (tex_obj->mt->bo->dirty)
+	 brw_emit_mi_flush(brw);
    }
 
    _mesa_lock_context_textures(ctx);
@@ -214,7 +215,7 @@ intel_flush_front(struct gl_context *ctx)
           * performance.
           */
          intel_resolve_for_dri2_flush(brw, driDrawable);
-         intel_batchbuffer_flush(brw);
+         brw_batch_flush(&brw->batch);
 
          flushFront(screen)(driDrawable, driDrawable->loaderPrivate);
 
@@ -224,6 +225,7 @@ intel_flush_front(struct gl_context *ctx)
          brw->front_buffer_dirty = false;
       }
    }
+   brw_batch_flush(&brw->batch);
 }
 
 static void
@@ -231,10 +233,9 @@ intel_glFlush(struct gl_context *ctx)
 {
    struct brw_context *brw = brw_context(ctx);
 
-   intel_batchbuffer_flush(brw);
    intel_flush_front(ctx);
 
-   brw->need_flush_throttle = true;
+   brw->batch.need_flush_throttle = true;
 }
 
 static void
@@ -244,8 +245,7 @@ intel_finish(struct gl_context * ctx)
 
    intel_glFlush(ctx);
 
-   if (brw->batch.last_bo)
-      drm_intel_bo_wait_rendering(brw->batch.last_bo);
+   brw_batch_wait(&brw->batch);
 }
 
 static void
@@ -623,15 +623,6 @@ brw_process_driconf_options(struct brw_context *brw)
    driParseConfigFiles(options, &brw->intelScreen->optionCache,
                        brw->driContext->driScreenPriv->myNum, "i965");
 
-   int bo_reuse_mode = driQueryOptioni(options, "bo_reuse");
-   switch (bo_reuse_mode) {
-   case DRI_CONF_BO_REUSE_DISABLED:
-      break;
-   case DRI_CONF_BO_REUSE_ALL:
-      intel_bufmgr_gem_enable_reuse(brw->bufmgr);
-      break;
-   }
-
    if (!driQueryOptionb(options, "hiz")) {
        brw->has_hiz = false;
        /* On gen6, you can only do separate stencil with HIZ. */
@@ -641,7 +632,7 @@ brw_process_driconf_options(struct brw_context *brw)
 
    if (driQueryOptionb(options, "always_flush_batch")) {
       fprintf(stderr, "flushing batchbuffer before/after each draw call\n");
-      brw->always_flush_batch = true;
+      brw->batch.always_flush = true;
    }
 
    if (driQueryOptionb(options, "always_flush_cache")) {
@@ -651,7 +642,7 @@ brw_process_driconf_options(struct brw_context *brw)
 
    if (driQueryOptionb(options, "disable_throttling")) {
       fprintf(stderr, "disabling flush throttling\n");
-      brw->disable_throttling = true;
+      brw->batch.disable_throttling = true;
    }
 
    brw->precompile = driQueryOptionb(&brw->optionCache, "shader_precompile");
@@ -707,7 +698,18 @@ brwCreateContext(gl_api api,
    driContextPriv->driverPrivate = brw;
    brw->driContext = driContextPriv;
    brw->intelScreen = screen;
-   brw->bufmgr = screen->bufmgr;
+
+   if (brw_batch_init(&brw->batch, screen)) {
+      fprintf(stderr, "%s: failed to alloc batch\n", __func__);
+      *dri_ctx_error = __DRI_CTX_ERROR_NO_MEMORY;
+      return false;
+   }
+
+   if (brw_init_pipe_control(brw, devinfo)) {
+      fprintf(stderr, "%s: failed to alloc workarounds\n", __func__);
+      *dri_ctx_error = __DRI_CTX_ERROR_NO_MEMORY;
+      return false;
+   }
 
    brw->gen = devinfo->gen;
    brw->gt = devinfo->gt;
@@ -800,31 +802,6 @@ brwCreateContext(gl_api api,
 
    intel_fbo_init(brw);
 
-   intel_batchbuffer_init(brw);
-
-   if (brw->gen >= 6) {
-      /* Create a new hardware context.  Using a hardware context means that
-       * our GPU state will be saved/restored on context switch, allowing us
-       * to assume that the GPU is in the same state we left it in.
-       *
-       * This is required for transform feedback buffer offsets, query objects,
-       * and also allows us to reduce how much state we have to emit.
-       */
-      brw->hw_ctx = drm_intel_gem_context_create(brw->bufmgr);
-
-      if (!brw->hw_ctx) {
-         fprintf(stderr, "Gen6+ requires Kernel 3.6 or later.\n");
-         intelDestroyContext(driContextPriv);
-         return false;
-      }
-   }
-
-   if (brw_init_pipe_control(brw, devinfo)) {
-      *dri_ctx_error = __DRI_CTX_ERROR_NO_MEMORY;
-      intelDestroyContext(driContextPriv);
-      return false;
-   }
-
    brw_init_state(brw);
 
    intelInitExtensions(ctx);
@@ -911,8 +888,8 @@ intelDestroyContext(__DRIcontext * driContextPriv)
 
    /* Dump a final BMP in case the application doesn't call SwapBuffers */
    if (INTEL_DEBUG & DEBUG_AUB) {
-      intel_batchbuffer_flush(brw);
-      aub_dump_bmp(&brw->ctx);
+      brw_batch_flush(&brw->batch);
+      aub_dump_bmp(brw);
    }
 
    _mesa_meta_free(&brw->ctx);
@@ -929,15 +906,10 @@ intelDestroyContext(__DRIcontext * driContextPriv)
    brw_destroy_state(brw);
    brw_draw_destroy(brw);
 
-   drm_intel_bo_unreference(brw->curbe.curbe_bo);
-   if (brw->vs.base.scratch_bo)
-      drm_intel_bo_unreference(brw->vs.base.scratch_bo);
-   if (brw->gs.base.scratch_bo)
-      drm_intel_bo_unreference(brw->gs.base.scratch_bo);
-   if (brw->wm.base.scratch_bo)
-      drm_intel_bo_unreference(brw->wm.base.scratch_bo);
-
-   drm_intel_gem_context_destroy(brw->hw_ctx);
+   brw_bo_put(brw->curbe.curbe_bo);
+   brw_bo_put(brw->vs.base.scratch_bo);
+   brw_bo_put(brw->gs.base.scratch_bo);
+   brw_bo_put(brw->wm.base.scratch_bo);
 
    if (ctx->swrast_context) {
       _swsetup_DestroyContext(&brw->ctx);
@@ -948,19 +920,14 @@ intelDestroyContext(__DRIcontext * driContextPriv)
    if (ctx->swrast_context)
       _swrast_DestroyContext(&brw->ctx);
 
-   brw_fini_pipe_control(brw);
-   intel_batchbuffer_free(brw);
+   /* free the Mesa context */
+   _mesa_free_context_data(&brw->ctx);
 
-   drm_intel_bo_unreference(brw->throttle_batch[1]);
-   drm_intel_bo_unreference(brw->throttle_batch[0]);
-   brw->throttle_batch[1] = NULL;
-   brw->throttle_batch[0] = NULL;
+   brw_fini_pipe_control(brw);
+   brw_batch_fini(&brw->batch);
 
    driDestroyOptionCache(&brw->optionCache);
 
-   /* free the Mesa context */
-   _mesa_free_context_data(&brw->ctx);
-
    ralloc_free(brw);
    driContextPriv->driverPrivate = NULL;
 }
@@ -1293,7 +1260,6 @@ intel_query_dri2_buffers(struct brw_context *brw,
        * query, we need to make sure all the pending drawing has landed in the
        * real front buffer.
        */
-      intel_batchbuffer_flush(brw);
       intel_flush_front(&brw->ctx);
 
       attachments[i++] = __DRI_BUFFER_FRONT_LEFT;
@@ -1305,7 +1271,6 @@ intel_query_dri2_buffers(struct brw_context *brw,
        * So before doing the query, make sure all the pending drawing has
        * landed in the real front buffer.
        */
-      intel_batchbuffer_flush(brw);
       intel_flush_front(&brw->ctx);
    }
 
@@ -1346,7 +1311,7 @@ intel_process_dri2_buffer(struct brw_context *brw,
                           const char *buffer_name)
 {
    struct gl_framebuffer *fb = drawable->driverPrivate;
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
 
    if (!rb)
       return;
@@ -1370,7 +1335,7 @@ intel_process_dri2_buffer(struct brw_context *brw,
 	* name, then drm_intel_bo_flink() is a low-cost getter.  It does not
 	* create a new name.
 	*/
-      drm_intel_bo_flink(last_mt->bo, &old_name);
+      old_name = brw_bo_flink(last_mt->bo);
    }
 
    if (old_name == buffer->name)
@@ -1383,9 +1348,7 @@ intel_process_dri2_buffer(struct brw_context *brw,
               buffer->cpp, buffer->pitch);
    }
 
-   intel_miptree_release(&rb->mt);
-   bo = drm_intel_bo_gem_create_from_name(brw->bufmgr, buffer_name,
-                                          buffer->name);
+   bo = brw_bo_create_from_name(&brw->batch, buffer_name, buffer->name);
    if (!bo) {
       fprintf(stderr,
               "Failed to open BO for returned DRI2 buffer "
@@ -1396,9 +1359,11 @@ intel_process_dri2_buffer(struct brw_context *brw,
       return;
    }
 
+   //intel_miptree_release(&rb->mt);
    intel_update_winsys_renderbuffer_miptree(brw, rb, bo,
                                             drawable->w, drawable->h,
                                             buffer->pitch);
+   brw_bo_put(bo);
 
    if (brw_is_front_buffer_drawing(fb) &&
        (buffer->attachment == __DRI_BUFFER_FRONT_LEFT ||
@@ -1408,8 +1373,6 @@ intel_process_dri2_buffer(struct brw_context *brw,
    }
 
    assert(rb->mt);
-
-   drm_intel_bo_unreference(bo);
 }
 
 /**
@@ -1451,12 +1414,14 @@ intel_update_image_buffer(struct brw_context *intel,
    else
       last_mt = rb->singlesample_mt;
 
-   if (last_mt && last_mt->bo == buffer->bo)
+   if (last_mt && last_mt->bo->handle == buffer->bo->handle)
       return;
 
-   intel_update_winsys_renderbuffer_miptree(intel, rb, buffer->bo,
+   struct brw_bo *bo = brw_bo_import(&intel->batch, buffer->bo, true);
+   intel_update_winsys_renderbuffer_miptree(intel, rb, bo,
                                             buffer->width, buffer->height,
                                             buffer->pitch);
+   brw_bo_put(bo);
 
    if (brw_is_front_buffer_drawing(fb) &&
        buffer_type == __DRI_IMAGE_BUFFER_FRONT &&
@@ -1520,3 +1485,121 @@ intel_update_image_buffers(struct brw_context *brw, __DRIdrawable *drawable)
                                 __DRI_IMAGE_BUFFER_BACK);
    }
 }
+
+/**
+ * Called when starting a new batch buffer.
+ */
+void
+brw_start_batch(struct brw_batch *batch)
+{
+   struct brw_context *brw = container_of(batch, brw, batch);
+
+   if (batch->ring != RENDER_RING)
+      return;
+
+   /* If the kernel supports hardware contexts, then most hardware state is
+    * preserved between batches; we only need to re-emit state that is required
+    * to be in every batch.  Otherwise we need to re-emit all the state that
+    * would otherwise be stored in the context (which for all intents and
+    * purposes means everything).
+    */
+   if (!batch->hw_ctx)
+      brw->ctx.NewDriverState |= BRW_NEW_CONTEXT;
+
+   brw->ctx.NewDriverState |= BRW_NEW_BATCH;
+
+   brw->emitted_mi_flush = 0;
+   brw->state_batch_count = 0;
+
+   brw->ib.type = -1;
+
+   /* We need to periodically reap the shader time results, because rollover
+    * happens every few seconds.  We also want to see results every once in a
+    * while, because many programs won't cleanly destroy our context, so the
+    * end-of-run printout may not happen.
+    */
+   if (INTEL_DEBUG & DEBUG_SHADER_TIME)
+      brw_collect_and_report_shader_time(brw);
+
+   if (INTEL_DEBUG & DEBUG_PERFMON)
+      brw_dump_perf_monitors(brw);
+
+   brw_perf_monitor_new_batch(brw);
+}
+
+/**
+ * Called from brw_batch_flush before emitting MI_BATCHBUFFER_END and sending
+ * it off.
+ *
+ * This function can emit state (say, to preserve registers that aren't saved
+ * between batches).  All of this state MUST fit in the reserved space at the
+ * end of the batchbuffer.  If you add more GPU state, increase the reserved
+ * space by updating the BATCH_RESERVED macro.
+ */
+void brw_finish_batch(struct brw_batch *batch)
+{
+   struct brw_context *brw = container_of(batch, brw, batch);
+
+   if (batch->ring != RENDER_RING)
+      return;
+
+   /* Capture the closing pipeline statistics register values necessary to
+    * support query objects (in the non-hardware context world).
+    */
+   brw_emit_query_end(brw);
+
+   /* We may also need to snapshot and disable OA counters. */
+   brw_perf_monitor_finish_batch(brw);
+}
+
+static void
+load_sized_register_mem(struct brw_context *brw,
+                        uint32_t reg,
+                        struct brw_bo *bo,
+                        uint32_t read_domains, uint32_t write_domain,
+                        uint32_t offset,
+                        int size)
+{
+   int i;
+
+   /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */
+   assert(brw->gen >= 7);
+
+   if (brw->gen >= 8) {
+      BEGIN_BATCH(4 * size);
+      for (i = 0; i < size; i++) {
+         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2));
+         OUT_BATCH(reg + i * 4);
+         OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4);
+      }
+      ADVANCE_BATCH();
+   } else {
+      BEGIN_BATCH(3 * size);
+      for (i = 0; i < size; i++) {
+         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (3 - 2));
+         OUT_BATCH(reg + i * 4);
+         OUT_RELOC(bo, read_domains, write_domain, offset + i * 4);
+      }
+      ADVANCE_BATCH();
+   }
+}
+
+void
+brw_load_register_mem(struct brw_context *brw,
+                      uint32_t reg,
+                      struct brw_bo *bo,
+                      uint32_t read_domains, uint32_t write_domain,
+                      uint32_t offset)
+{
+   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 1);
+}
+
+void
+brw_load_register_mem64(struct brw_context *brw,
+                        uint32_t reg,
+                        struct brw_bo *bo,
+                        uint32_t read_domains, uint32_t write_domain,
+                        uint32_t offset)
+{
+   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 2);
+}
diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
index db0fc48..e4fded3 100644
--- a/src/mesa/drivers/dri/i965/brw_context.h
+++ b/src/mesa/drivers/dri/i965/brw_context.h
@@ -50,7 +50,6 @@ extern "C" {
 #endif
 
 #include <drm.h>
-#include <intel_bufmgr.h>
 #include <i915_drm.h>
 #ifdef __cplusplus
 	#undef virtual
@@ -65,6 +64,8 @@ extern "C" {
 #include "intel_tex_obj.h"
 #include "intel_resolve_map.h"
 
+#include "brw_batch.h"
+
 /* Glossary:
  *
  * URB - uniform resource buffer.  A mid-sized buffer which is
@@ -790,11 +791,10 @@ struct brw_cache {
    struct brw_context *brw;
 
    struct brw_cache_item **items;
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
    GLuint size, n_items;
 
    uint32_t next_offset;
-   bool bo_used_by_gpu;
 
    /**
     * Optional functions used in determining whether the prog_data for a new
@@ -829,7 +829,7 @@ enum shader_time_shader_type {
 
 struct brw_vertex_buffer {
    /** Buffer object containing the uploaded vertex data */
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
    uint32_t offset;
    /** Byte stride between elements in the uploaded array */
    GLuint stride;
@@ -848,41 +848,10 @@ struct brw_query_object {
    struct gl_query_object Base;
 
    /** Last query BO associated with this query. */
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
 
    /** Last index in bo with query data for this object. */
    int last_index;
-
-   /** True if we know the batch has been flushed since we ended the query. */
-   bool flushed;
-};
-
-enum brw_gpu_ring {
-   UNKNOWN_RING,
-   RENDER_RING,
-   BLT_RING,
-};
-
-struct intel_batchbuffer {
-   /** Current batchbuffer being queued up. */
-   drm_intel_bo *bo;
-   /** Last BO submitted to the hardware.  Used for glFinish(). */
-   drm_intel_bo *last_bo;
-
-   uint16_t emit, total;
-   uint16_t used, reserved_space;
-   uint32_t *map;
-   uint32_t *cpu_map;
-#define BATCH_SZ (8192*sizeof(uint32_t))
-
-   uint32_t state_batch_offset;
-   enum brw_gpu_ring ring;
-   bool needs_sol_reset;
-
-   struct {
-      uint16_t used;
-      int reloc_count;
-   } saved;
 };
 
 #define BRW_MAX_XFB_STREAMS 4
@@ -891,7 +860,7 @@ struct brw_transform_feedback_object {
    struct gl_transform_feedback_object base;
 
    /** A buffer to hold SO_WRITE_OFFSET(n) values while paused. */
-   drm_intel_bo *offset_bo;
+   struct brw_bo *offset_bo;
 
    /** If true, SO_WRITE_OFFSET(n) should be reset to zero at next use. */
    bool zero_offsets;
@@ -904,7 +873,7 @@ struct brw_transform_feedback_object {
     *  @{
     */
    uint64_t prims_generated[BRW_MAX_XFB_STREAMS];
-   drm_intel_bo *prim_count_bo;
+   struct brw_bo *prim_count_bo;
    unsigned prim_count_buffer_index; /**< in number of uint64_t units */
    /** @} */
 
@@ -930,7 +899,7 @@ struct brw_stage_state
     * Optional scratch buffer used to store spilled register values and
     * variably-indexed GRF arrays.
     */
-   drm_intel_bo *scratch_bo;
+   struct brw_bo *scratch_bo;
 
    /** Offset in the program cache to the program */
    uint32_t prog_offset;
@@ -998,7 +967,7 @@ struct brw_context
                                          bool rw, bool for_gather);
       void (*emit_buffer_surface_state)(struct brw_context *brw,
                                         uint32_t *out_offset,
-                                        drm_intel_bo *bo,
+                                        struct brw_bo *bo,
                                         unsigned buffer_offset,
                                         unsigned surface_format,
                                         unsigned buffer_size,
@@ -1026,22 +995,13 @@ struct brw_context
 
    } vtbl;
 
-   dri_bufmgr *bufmgr;
-
-   drm_intel_context *hw_ctx;
+   struct brw_batch batch;
 
    /** BO for post-sync nonzero writes for gen6 workaround. */
-   drm_intel_bo *workaround_bo;
+   struct brw_bo *workaround_bo;
    uint8_t pipe_controls_since_last_cs_stall;
 
    /**
-    * Set of drm_intel_bo * that have been rendered to within this batchbuffer
-    * and would need flushing before being used from another cache domain that
-    * isn't coherent with it (i.e. the sampler).
-    */
-   struct set *render_cache;
-
-   /**
     * Number of resets observed in the system at context creation.
     *
     * This is tracked in the context so that we can determine that another
@@ -1049,11 +1009,8 @@ struct brw_context
     */
    uint32_t reset_count;
 
-   struct intel_batchbuffer batch;
-   bool no_batch_wrap;
-
    struct {
-      drm_intel_bo *bo;
+      struct brw_bo *bo;
       uint32_t next_offset;
    } upload;
 
@@ -1065,23 +1022,6 @@ struct brw_context
     */
    bool front_buffer_dirty;
 
-   /** Framerate throttling: @{ */
-   drm_intel_bo *throttle_batch[2];
-
-   /* Limit the number of outstanding SwapBuffers by waiting for an earlier
-    * frame of rendering to complete. This gives a very precise cap to the
-    * latency between input and output such that rendering never gets more
-    * than a frame behind the user. (With the caveat that we technically are
-    * not using the SwapBuffers itself as a barrier but the first batch
-    * submitted afterwards, which may be immediately prior to the next
-    * SwapBuffers.)
-    */
-   bool need_swap_throttle;
-
-   /** General throttling, not caught by throttling between SwapBuffers */
-   bool need_flush_throttle;
-   /** @} */
-
    GLuint stats_wm;
 
    /**
@@ -1089,9 +1029,7 @@ struct brw_context
     * @{
     */
    bool no_rast;
-   bool always_flush_batch;
    bool always_flush_cache;
-   bool disable_throttling;
    bool precompile;
 
    driOptionCache optionCache;
@@ -1169,7 +1107,7 @@ struct brw_context
        * Buffer and offset used for GL_ARB_shader_draw_parameters
        * (for now, only gl_BaseVertex).
        */
-      drm_intel_bo *draw_params_bo;
+      struct brw_bo *draw_params_bo;
       uint32_t draw_params_offset;
    } draw;
 
@@ -1209,7 +1147,7 @@ struct brw_context
       const struct _mesa_index_buffer *ib;
 
       /* Updates are signaled by BRW_NEW_INDEX_BUFFER. */
-      drm_intel_bo *bo;
+      struct brw_bo *bo;
       GLuint type;
 
       /* Offset to index buffer index to use in CMD_3D_PRIM so that we can
@@ -1294,7 +1232,7 @@ struct brw_context
        * Pointer to the (intel_upload.c-generated) BO containing the uniforms
        * for upload to the CURBE.
        */
-      drm_intel_bo *curbe_bo;
+      struct brw_bo *curbe_bo;
       /** Offset within curbe_bo of space for current curbe entry */
       GLuint curbe_offset;
    } curbe;
@@ -1384,7 +1322,7 @@ struct brw_context
        * Buffer object used in place of multisampled null render targets on
        * Gen6.  See brw_emit_null_surface_state().
        */
-      drm_intel_bo *multisampled_null_render_target_bo;
+      struct brw_bo *multisampled_null_render_target_bo;
       uint32_t fast_clear_op;
    } wm;
 
@@ -1421,7 +1359,7 @@ struct brw_context
        * A buffer object storing OA counter snapshots taken at the start and
        * end of each batch (creating "bookends" around the batch).
        */
-      drm_intel_bo *bookend_bo;
+      struct brw_bo *bookend_bo;
 
       /** The number of snapshots written to bookend_bo. */
       int bookend_snapshots;
@@ -1459,6 +1397,7 @@ struct brw_context
       int index;
    } *state_batch_list;
    int state_batch_count;
+   int emitted_mi_flush;
 
    uint32_t render_target_format[MESA_FORMAT_COUNT];
    bool format_supported_as_render_target[MESA_FORMAT_COUNT];
@@ -1492,7 +1431,7 @@ struct brw_context
    int basevertex;
 
    struct {
-      drm_intel_bo *bo;
+      struct brw_bo *bo;
       const char **names;
       int *ids;
       enum shader_time_shader_type *types;
@@ -1508,6 +1447,14 @@ struct brw_context
    struct intel_screen *intelScreen;
 };
 
+static inline int brw_to_fd(struct brw_context *brw)
+{
+   return intel_screen_to_fd(brw->intelScreen);
+}
+
+void brw_start_batch(struct brw_batch *batch);
+void brw_finish_batch(struct brw_batch *batch);
+
 /*======================================================================
  * brw_vtbl.c
  */
@@ -1605,10 +1552,10 @@ void brw_emit_query_end(struct brw_context *brw);
 
 /** gen6_queryobj.c */
 void gen6_init_queryobj_functions(struct dd_function_table *functions);
-void brw_write_timestamp(struct brw_context *brw, drm_intel_bo *bo, int idx);
-void brw_write_depth_count(struct brw_context *brw, drm_intel_bo *bo, int idx);
+void brw_write_timestamp(struct brw_context *brw, struct brw_bo *bo, int idx);
+void brw_write_depth_count(struct brw_context *brw, struct brw_bo *bo, int idx);
 void brw_store_register_mem64(struct brw_context *brw,
-                              drm_intel_bo *bo, uint32_t reg, int idx);
+                              struct brw_bo *bo, uint32_t reg, int idx);
 
 /** brw_conditional_render.c */
 void brw_init_conditional_render_functions(struct dd_function_table *functions);
@@ -1617,20 +1564,21 @@ bool brw_check_conditional_render(struct brw_context *brw);
 /** intel_batchbuffer.c */
 void brw_load_register_mem(struct brw_context *brw,
                            uint32_t reg,
-                           drm_intel_bo *bo,
+                           struct brw_bo *bo,
                            uint32_t read_domains, uint32_t write_domain,
                            uint32_t offset);
 void brw_load_register_mem64(struct brw_context *brw,
                              uint32_t reg,
-                             drm_intel_bo *bo,
+                             struct brw_bo *bo,
                              uint32_t read_domains, uint32_t write_domain,
                              uint32_t offset);
 
 /*======================================================================
  * brw_state_dump.c
  */
-void brw_debug_batch(struct brw_context *brw);
-void brw_annotate_aub(struct brw_context *brw);
+void brw_debug_batch(struct brw_batch *batch);
+void brw_annotate_batch(struct brw_batch *batch);
+void aub_dump_bmp(struct brw_context *brw);
 
 /*======================================================================
  * brw_tex.c
@@ -1645,7 +1593,7 @@ void brwInitFragProgFuncs( struct dd_function_table *functions );
 
 int brw_get_scratch_size(int size);
 void brw_get_scratch_bo(struct brw_context *brw,
-			drm_intel_bo **scratch_bo, int size);
+			struct brw_bo **scratch_bo, int size);
 void brw_init_shader_time(struct brw_context *brw);
 int brw_get_shader_time_index(struct brw_context *brw,
                               struct gl_shader_program *shader_prog,
@@ -1705,7 +1653,7 @@ void brw_prepare_vertices(struct brw_context *brw);
 /* brw_wm_surface_state.c */
 void brw_init_surface_formats(struct brw_context *brw);
 void brw_create_constant_surface(struct brw_context *brw,
-                                 drm_intel_bo *bo,
+                                 struct brw_bo *bo,
                                  uint32_t offset,
                                  uint32_t size,
                                  uint32_t *out_offset,
@@ -1739,12 +1687,6 @@ void brw_dump_perf_monitors(struct brw_context *brw);
 void brw_perf_monitor_new_batch(struct brw_context *brw);
 void brw_perf_monitor_finish_batch(struct brw_context *brw);
 
-/* intel_buffer_objects.c */
-int brw_bo_map(struct brw_context *brw, drm_intel_bo *bo, int write_enable,
-               const char *bo_name);
-int brw_bo_map_gtt(struct brw_context *brw, drm_intel_bo *bo,
-                   const char *bo_name);
-
 /* intel_extensions.c */
 extern void intelInitExtensions(struct gl_context *ctx);
 
@@ -1913,13 +1855,9 @@ brw_program_reloc(struct brw_context *brw, uint32_t state_offset,
       return prog_offset;
    }
 
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-			   state_offset,
-			   brw->cache.bo,
-			   prog_offset,
-			   I915_GEM_DOMAIN_INSTRUCTION, 0);
-
-   return brw->cache.bo->offset64 + prog_offset;
+   return brw_batch_reloc(&brw->batch, state_offset,
+			  brw->cache.bo, prog_offset,
+			  I915_GEM_DOMAIN_INSTRUCTION, 0);
 }
 
 bool brw_do_cubemap_normalize(struct exec_list *instructions);
@@ -2006,7 +1944,7 @@ void brw_fini_pipe_control(struct brw_context *brw);
 
 void brw_emit_pipe_control_flush(struct brw_context *brw, uint32_t flags);
 void brw_emit_pipe_control_write(struct brw_context *brw, uint32_t flags,
-                                 drm_intel_bo *bo, uint32_t offset,
+                                 struct brw_bo *bo, uint32_t offset,
                                  uint32_t imm_lower, uint32_t imm_upper);
 void brw_emit_mi_flush(struct brw_context *brw);
 void brw_emit_post_sync_nonzero_flush(struct brw_context *brw);
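
With intel_batchbuffer_require_space() and no_batch_wrap gone, every
command emitter brackets its work in brw_batch_begin()/brw_batch_end().
A minimal sketch of that protocol, following the converted callers later
in this patch (the 60-byte estimate is illustrative, not normative):

    static void example_emit(struct brw_context *brw)
    {
       /* Reserve an estimated upper bound of batch space; on error
        * (negative return), skip emitting the commands altogether.
        */
       if (brw_batch_begin(&brw->batch, 60, RENDER_RING) < 0)
          return;

       brw_emit_mi_flush(brw); /* ...emit commands and relocations... */

       /* Close the critical section; submission and aperture checking
        * happen here rather than around every OUT_BATCH.
        */
       if (brw_batch_end(&brw->batch) == -ENOSPC)
          perf_debug("emit exceeded available aperture space\n");
    }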
diff --git a/src/mesa/drivers/dri/i965/brw_cs.cpp b/src/mesa/drivers/dri/i965/brw_cs.cpp
index 4c5082c..ee1f481 100644
--- a/src/mesa/drivers/dri/i965/brw_cs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_cs.cpp
@@ -30,7 +30,6 @@
 #include "brw_wm.h"
 #include "intel_mipmap_tree.h"
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
 
 extern "C"
 bool
@@ -68,8 +67,7 @@ brw_cs_emit(struct brw_context *brw,
    double start_time = 0;
 
    if (unlikely(brw->perf_debug)) {
-      start_busy = (brw->batch.last_bo &&
-                    drm_intel_bo_busy(brw->batch.last_bo));
+      start_busy = brw_batch_busy(&brw->batch);
       start_time = get_time();
    }
 
@@ -152,7 +150,7 @@ brw_cs_emit(struct brw_context *brw,
       }
       shader->compiled_once = true;
 
-      if (start_busy && !drm_intel_bo_busy(brw->batch.last_bo)) {
+      if (start_busy && !brw_batch_busy(&brw->batch)) {
          perf_debug("CS compile took %.03f ms and stalled the GPU\n",
                     (get_time() - start_time) * 1000);
       }
diff --git a/src/mesa/drivers/dri/i965/brw_curbe.c b/src/mesa/drivers/dri/i965/brw_curbe.c
index befd7a9..29b75bc 100644
--- a/src/mesa/drivers/dri/i965/brw_curbe.c
+++ b/src/mesa/drivers/dri/i965/brw_curbe.c
@@ -57,7 +57,6 @@
 #include "program/prog_parameter.h"
 #include "program/prog_print.h"
 #include "program/prog_statevars.h"
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
 #include "brw_context.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/brw_draw.c b/src/mesa/drivers/dri/i965/brw_draw.c
index 69ad4d4..1e967da 100644
--- a/src/mesa/drivers/dri/i965/brw_draw.c
+++ b/src/mesa/drivers/dri/i965/brw_draw.c
@@ -48,11 +48,11 @@
 #include "brw_state.h"
 #include "brw_vs.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_buffers.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 #include "intel_buffer_objects.h"
+#include "intel_reg.h"
 
 #define FILE_DEBUG_FLAG DEBUG_PRIMS
 
@@ -172,6 +172,37 @@ static GLuint trim(GLenum prim, GLuint length)
       return length;
 }
 
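+/* brw_emit_mi_flush() clears the per-bo dirty flags and then sets
+ * emitted_mi_flush; after such a flush the next draw must re-mark the
+ * current framebuffer's buffers as holding unflushed render output. If
+ * no flush has intervened, the buffers are still marked and we can
+ * return early.
+ */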
+static void mark_fb_dirty(struct brw_context *brw)
+{
+   const struct gl_framebuffer *fb = brw->ctx.DrawBuffer;
+   struct intel_renderbuffer *irb;
+
+   if (!brw->emitted_mi_flush)
+      return;
+
+   for (int i = 0; i < fb->_NumColorDrawBuffers; i++) {
+      irb = intel_renderbuffer(fb->_ColorDrawBuffers[i]);
+      if (irb)
+         brw_bo_mark_dirty(&brw->batch, irb->mt->bo);
+   }
+
+   irb = intel_renderbuffer(fb->Attachment[BUFFER_DEPTH].Renderbuffer);
+   if (irb)
+      brw_bo_mark_dirty(&brw->batch, irb->mt->bo);
+
+   if (brw->ctx.Stencil._Enabled) {
+      irb = intel_renderbuffer(fb->Attachment[BUFFER_STENCIL].Renderbuffer);
+      if (irb) {
+         struct intel_mipmap_tree *mt = irb->mt;
+         if (mt && mt->stencil_mt)
+            mt = mt->stencil_mt;
+         if (mt)
+            brw_bo_mark_dirty(&brw->batch, mt->bo);
+      }
+   }
+
+   brw->emitted_mi_flush = 0;
+}
 
 static void brw_emit_prim(struct brw_context *brw,
 			  const struct _mesa_prim *prim,
@@ -223,9 +254,10 @@ static void brw_emit_prim(struct brw_context *brw,
    /* If indirect, emit a bunch of loads from the indirect BO. */
    if (prim->is_indirect) {
       struct gl_buffer_object *indirect_buffer = brw->ctx.DrawIndirectBuffer;
-      drm_intel_bo *bo = intel_bufferobj_buffer(brw,
-            intel_buffer_object(indirect_buffer),
-            prim->indirect_offset, 5 * sizeof(GLuint));
+      struct brw_bo *bo =
+	 intel_bufferobj_buffer(brw,
+				intel_buffer_object(indirect_buffer),
+				prim->indirect_offset, 5 * sizeof(GLuint));
 
       indirect_flag = GEN7_3DPRIM_INDIRECT_PARAMETER_ENABLE;
 
@@ -283,6 +315,8 @@ static void brw_emit_prim(struct brw_context *brw,
    OUT_BATCH(base_vertex_location);
    ADVANCE_BATCH();
 
+   mark_fb_dirty(brw);
+
    if (brw->always_flush_cache) {
       brw_emit_mi_flush(brw);
    }
@@ -296,7 +330,7 @@ static void brw_merge_inputs( struct brw_context *brw,
    GLuint i;
 
    for (i = 0; i < brw->vb.nr_buffers; i++) {
-      drm_intel_bo_unreference(brw->vb.buffers[i].bo);
+      brw_bo_put(brw->vb.buffers[i].bo);
       brw->vb.buffers[i].bo = NULL;
    }
    brw->vb.nr_buffers = 0;
@@ -367,7 +401,6 @@ static void brw_postdraw_set_buffers_need_resolve(struct brw_context *brw)
    struct intel_renderbuffer *front_irb = NULL;
    struct intel_renderbuffer *back_irb = intel_get_renderbuffer(fb, BUFFER_BACK_LEFT);
    struct intel_renderbuffer *depth_irb = intel_get_renderbuffer(fb, BUFFER_DEPTH);
-   struct intel_renderbuffer *stencil_irb = intel_get_renderbuffer(fb, BUFFER_STENCIL);
    struct gl_renderbuffer_attachment *depth_att = &fb->Attachment[BUFFER_DEPTH];
 
    if (brw_is_front_buffer_drawing(fb))
@@ -379,20 +412,6 @@ static void brw_postdraw_set_buffers_need_resolve(struct brw_context *brw)
       back_irb->need_downsample = true;
    if (depth_irb && ctx->Depth.Mask) {
       intel_renderbuffer_att_set_needs_depth_resolve(depth_att);
-      brw_render_cache_set_add_bo(brw, depth_irb->mt->bo);
-   }
-
-   if (ctx->Extensions.ARB_stencil_texturing &&
-       stencil_irb && ctx->Stencil._WriteEnabled) {
-      brw_render_cache_set_add_bo(brw, stencil_irb->mt->bo);
-   }
-
-   for (int i = 0; i < fb->_NumColorDrawBuffers; i++) {
-      struct intel_renderbuffer *irb =
-         intel_renderbuffer(fb->_ColorDrawBuffers[i]);
-
-      if (irb)
-         brw_render_cache_set_add_bo(brw, irb->mt->bo);
    }
 }
 
@@ -410,7 +429,6 @@ static void brw_try_draw_prims( struct gl_context *ctx,
 {
    struct brw_context *brw = brw_context(ctx);
    GLuint i;
-   bool fail_next = false;
 
    if (ctx->NewState)
       _mesa_update_state( ctx );
@@ -457,6 +475,7 @@ static void brw_try_draw_prims( struct gl_context *ctx,
    for (i = 0; i < nr_prims; i++) {
       int estimated_max_prim_size;
       const int sampler_state_size = 16;
+      int ret;
 
       estimated_max_prim_size = 512; /* batchbuffer commands */
       estimated_max_prim_size += BRW_MAX_TEX_UNIT *
@@ -469,8 +488,9 @@ static void brw_try_draw_prims( struct gl_context *ctx,
        * we've got validated state that needs to be in the same batch as the
        * primitives.
        */
-      intel_batchbuffer_require_space(brw, estimated_max_prim_size, RENDER_RING);
-      intel_batchbuffer_save_state(brw);
+      ret = brw_batch_begin(&brw->batch, estimated_max_prim_size, RENDER_RING);
+      if (ret < 0)
+	 break;
 
       if (brw->num_instances != prims[i].num_instances ||
           brw->basevertex != prims[i].basevertex) {
@@ -485,13 +505,12 @@ static void brw_try_draw_prims( struct gl_context *ctx,
       brw->draw.gl_basevertex =
          prims[i].indexed ? prims[i].basevertex : prims[i].start;
 
-      drm_intel_bo_unreference(brw->draw.draw_params_bo);
+      brw_bo_put(brw->draw.draw_params_bo);
 
       if (prims[i].is_indirect) {
          /* Point draw_params_bo at the indirect buffer. */
          brw->draw.draw_params_bo =
-            intel_buffer_object(ctx->DrawIndirectBuffer)->buffer;
-         drm_intel_bo_reference(brw->draw.draw_params_bo);
+            brw_bo_get(intel_buffer_object(ctx->DrawIndirectBuffer)->buffer);
          brw->draw.draw_params_offset =
             prims[i].indirect_offset + (prims[i].indexed ? 12 : 8);
       } else {
@@ -507,35 +526,20 @@ static void brw_try_draw_prims( struct gl_context *ctx,
       else
 	 gen6_set_prim(brw, &prims[i]);
 
-retry:
-
       /* Note that before the loop, brw->ctx.NewDriverState was set to != 0, and
        * that the state updated in the loop outside of this block is that in
-       * *_set_prim or intel_batchbuffer_flush(), which only impacts
-       * brw->ctx.NewDriverState.
+       * *_set_prim, which only impacts brw->ctx.NewDriverState.
        */
       if (brw->ctx.NewDriverState) {
-	 brw->no_batch_wrap = true;
 	 brw_upload_render_state(brw);
       }
 
       brw_emit_prim(brw, &prims[i], brw->primitive);
 
-      brw->no_batch_wrap = false;
-
-      if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) {
-	 if (!fail_next) {
-	    intel_batchbuffer_reset_to_saved(brw);
-	    intel_batchbuffer_flush(brw);
-	    fail_next = true;
-	    goto retry;
-	 } else {
-            int ret = intel_batchbuffer_flush(brw);
-            WARN_ONCE(ret == -ENOSPC,
-                      "i965: Single primitive emit exceeded "
-                      "available aperture space\n");
-	 }
-      }
+      ret = brw_batch_end(&brw->batch);
+      WARN_ONCE(ret == -ENOSPC,
+		"i965: Single primitive emit exceeded "
+		"available aperture space\n");
 
       /* Now that we know we haven't run out of aperture space, we can safely
        * reset the dirty bits.
@@ -544,8 +548,7 @@ retry:
          brw_render_state_finished(brw);
    }
 
-   if (brw->always_flush_batch)
-      intel_batchbuffer_flush(brw);
+   brw_batch_maybe_flush(&brw->batch);
 
    brw_state_cache_check_size(brw);
    brw_postdraw_set_buffers_need_resolve(brw);
@@ -628,7 +631,7 @@ void brw_draw_destroy( struct brw_context *brw )
    int i;
 
    for (i = 0; i < brw->vb.nr_buffers; i++) {
-      drm_intel_bo_unreference(brw->vb.buffers[i].bo);
+      brw_bo_put(brw->vb.buffers[i].bo);
       brw->vb.buffers[i].bo = NULL;
    }
    brw->vb.nr_buffers = 0;
@@ -638,6 +641,6 @@ void brw_draw_destroy( struct brw_context *brw )
    }
    brw->vb.nr_enabled = 0;
 
-   drm_intel_bo_unreference(brw->ib.bo);
+   brw_bo_put(brw->ib.bo);
    brw->ib.bo = NULL;
 }
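
The drm_intel_bo_reference()/unreference() pairs above collapse into
brw_bo_get()/brw_bo_put(). A sketch of the replacement idiom as used in
this file; both calls are assumed NULL-safe, since the converted sites
pass possibly-NULL pointers:

    static void retain_bo(struct brw_bo **slot, struct brw_bo *bo)
    {
       brw_bo_put(*slot);      /* drop the old reference, if any */
       *slot = brw_bo_get(bo); /* take a new reference and record it */
    }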
diff --git a/src/mesa/drivers/dri/i965/brw_draw_upload.c b/src/mesa/drivers/dri/i965/brw_draw_upload.c
index 320e40e..bc2e8fa 100644
--- a/src/mesa/drivers/dri/i965/brw_draw_upload.c
+++ b/src/mesa/drivers/dri/i965/brw_draw_upload.c
@@ -37,7 +37,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
 
 static GLuint double_types[5] = {
@@ -361,7 +360,7 @@ copy_array_to_vbo_array(struct brw_context *brw,
       intel_upload_data(brw, element->glarray->Ptr,
                         element->glarray->_ElementSize,
                         element->glarray->_ElementSize,
-			&buffer->bo, &buffer->offset);
+                        &buffer->bo, &buffer->offset);
 
       buffer->stride = 0;
       return;
@@ -480,9 +479,8 @@ brw_prepare_vertices(struct brw_context *brw)
                           glarray->_ElementSize);
                }
             }
-            buffer->bo = intel_bufferobj_buffer(brw, intel_buffer,
-                                                offset, size);
-            drm_intel_bo_reference(buffer->bo);
+            buffer->bo = brw_bo_get(intel_bufferobj_buffer(brw, intel_buffer,
+							   offset, size));
 
 	    input->buffer = j++;
 	    input->offset = 0;
@@ -596,7 +594,7 @@ brw_prepare_shader_draw_parameters(struct brw_context *brw)
    /* For non-indirect draws, upload gl_BaseVertex. */
    if (brw->vs.prog_data->uses_vertexid && brw->draw.draw_params_bo == NULL) {
       intel_upload_data(brw, &brw->draw.gl_basevertex, 4, 4,
-			&brw->draw.draw_params_bo,
+                        &brw->draw.draw_params_bo,
                         &brw->draw.draw_params_offset);
    }
 }
@@ -607,7 +605,7 @@ brw_prepare_shader_draw_parameters(struct brw_context *brw)
 static void
 emit_vertex_buffer_state(struct brw_context *brw,
                          unsigned buffer_nr,
-                         drm_intel_bo *bo,
+                         struct brw_bo *bo,
                          unsigned bo_ending_address,
                          unsigned bo_offset,
                          unsigned stride,
@@ -860,7 +858,7 @@ static void brw_upload_indices(struct brw_context *brw)
    struct gl_context *ctx = &brw->ctx;
    const struct _mesa_index_buffer *index_buffer = brw->ib.ib;
    GLuint ib_size;
-   drm_intel_bo *old_bo = brw->ib.bo;
+   struct brw_bo *old_bo = brw->ib.bo;
    struct gl_buffer_object *bufferobj;
    GLuint offset;
    GLuint ib_type_size;
@@ -878,7 +876,7 @@ static void brw_upload_indices(struct brw_context *brw)
       /* Get new bufferobj, offset:
        */
       intel_upload_data(brw, index_buffer->ptr, ib_size, ib_type_size,
-			&brw->ib.bo, &offset);
+                        &brw->ib.bo, &offset);
    } else {
       offset = (GLuint) (unsigned long) index_buffer->ptr;
 
@@ -901,13 +899,12 @@ static void brw_upload_indices(struct brw_context *brw)
 
          ctx->Driver.UnmapBuffer(ctx, bufferobj, MAP_INTERNAL);
       } else {
-         drm_intel_bo *bo =
+         struct brw_bo *bo =
             intel_bufferobj_buffer(brw, intel_buffer_object(bufferobj),
                                    offset, ib_size);
          if (bo != brw->ib.bo) {
-            drm_intel_bo_unreference(brw->ib.bo);
-            brw->ib.bo = bo;
-            drm_intel_bo_reference(bo);
+            brw_bo_put(brw->ib.bo);
+            brw->ib.bo = brw_bo_get(bo);
          }
       }
    }
diff --git a/src/mesa/drivers/dri/i965/brw_ff_gs.c b/src/mesa/drivers/dri/i965/brw_ff_gs.c
index f72f37f..bd452c3 100644
--- a/src/mesa/drivers/dri/i965/brw_ff_gs.c
+++ b/src/mesa/drivers/dri/i965/brw_ff_gs.c
@@ -34,8 +34,6 @@
 #include "main/enums.h"
 #include "main/transformfeedback.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_ff_gs_emit.c b/src/mesa/drivers/dri/i965/brw_ff_gs_emit.c
index 50bda61..56a29b4 100644
--- a/src/mesa/drivers/dri/i965/brw_ff_gs_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_ff_gs_emit.c
@@ -35,7 +35,6 @@
 #include "main/enums.h"
 
 #include "program/program.h"
-#include "intel_batchbuffer.h"
 
 #include "brw_defines.h"
 #include "brw_context.h"
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 189da1d..5a0fef8 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -3928,8 +3928,7 @@ brw_wm_fs_emit(struct brw_context *brw,
    double start_time = 0;
 
    if (unlikely(brw->perf_debug)) {
-      start_busy = (brw->batch.last_bo &&
-                    drm_intel_bo_busy(brw->batch.last_bo));
+      start_busy = brw_batch_busy(&brw->batch);
       start_time = get_time();
    }
 
@@ -4015,7 +4014,7 @@ brw_wm_fs_emit(struct brw_context *brw,
          brw_wm_debug_recompile(brw, prog, key);
       shader->compiled_once = true;
 
-      if (start_busy && !drm_intel_bo_busy(brw->batch.last_bo)) {
+      if (start_busy && !brw_batch_busy(&brw->batch)) {
          perf_debug("FS compile took %.03f ms and stalled the GPU\n",
                     (get_time() - start_time) * 1000);
       }
diff --git a/src/mesa/drivers/dri/i965/brw_meta_fast_clear.c b/src/mesa/drivers/dri/i965/brw_meta_fast_clear.c
index 5b8191c..a571a74 100644
--- a/src/mesa/drivers/dri/i965/brw_meta_fast_clear.c
+++ b/src/mesa/drivers/dri/i965/brw_meta_fast_clear.c
@@ -49,7 +49,6 @@
 #include "brw_context.h"
 #include "brw_draw.h"
 #include "intel_fbo.h"
-#include "intel_batchbuffer.h"
 
 #include "brw_blorp.h"
 
@@ -623,7 +622,10 @@ brw_meta_fast_clear(struct brw_context *brw, struct gl_framebuffer *fb,
     *     write-flush must be issued before sending any DRAW commands on that
     *     render target.
     */
-   brw_emit_mi_flush(brw);
+   if (fast_clear_buffers && brw_batch_begin(&brw->batch, 60, RENDER_RING) >= 0) {
+      brw_emit_mi_flush(brw);
+      brw_batch_end(&brw->batch);
+   }
 
    /* If we had to fall back to plain clear for any buffers, clear those now
     * by calling into meta.
@@ -677,7 +679,10 @@ brw_meta_resolve_color(struct brw_context *brw,
    GLuint fbo, rbo;
    struct rect rect;
 
-   brw_emit_mi_flush(brw);
+   if (brw_batch_begin(&brw->batch, 60, RENDER_RING) >= 0) {
+      brw_emit_mi_flush(brw);
+      brw_batch_end(&brw->batch);
+   }
 
    _mesa_meta_begin(ctx, MESA_META_ALL);
 
diff --git a/src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c b/src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c
index d4abfe6..6a35163 100644
--- a/src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c
+++ b/src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c
@@ -43,7 +43,6 @@
  */
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 
 #include "main/blit.h"
diff --git a/src/mesa/drivers/dri/i965/brw_meta_updownsample.c b/src/mesa/drivers/dri/i965/brw_meta_updownsample.c
index f39d50a..37a0968 100644
--- a/src/mesa/drivers/dri/i965/brw_meta_updownsample.c
+++ b/src/mesa/drivers/dri/i965/brw_meta_updownsample.c
@@ -22,7 +22,6 @@
  */
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 
 #include "main/blit.h"
diff --git a/src/mesa/drivers/dri/i965/brw_misc_state.c b/src/mesa/drivers/dri/i965/brw_misc_state.c
index 1bbb16c..5dd45da 100644
--- a/src/mesa/drivers/dri/i965/brw_misc_state.c
+++ b/src/mesa/drivers/dri/i965/brw_misc_state.c
@@ -31,9 +31,9 @@
 
 
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
+#include "intel_reg.h"
 
 #include "brw_context.h"
 #include "brw_state.h"
@@ -543,10 +543,10 @@ brw_emit_depthbuffer(struct brw_context *brw)
       height = stencil_irb->Base.Base.Height;
    }
 
-   if (depth_mt)
-      brw_render_cache_set_check_flush(brw, depth_mt->bo);
-   if (stencil_mt)
-      brw_render_cache_set_check_flush(brw, stencil_mt->bo);
+   if (depth_mt && depth_mt->bo->dirty)
+      brw_emit_mi_flush(brw);
+   if (stencil_mt && stencil_mt->bo->dirty)
+      brw_emit_mi_flush(brw);
 
    brw->vtbl.emit_depth_stencil_hiz(brw, depth_mt, depth_offset,
                                     depthbuffer_format, depth_surface_type,
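
brw_render_cache_set_check_flush() gives way to a dirty bit carried on
the bo itself. A sketch of the protocol assumed here: writers mark a bo
after rendering into it (see mark_fb_dirty() in brw_draw.c), and a
reader from a non-coherent domain flushes once while the mark is set:

    static void flush_if_dirty(struct brw_context *brw, struct brw_bo *bo)
    {
       if (bo->dirty)
          brw_emit_mi_flush(brw); /* clears every dirty flag in the batch */
    }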
diff --git a/src/mesa/drivers/dri/i965/brw_object_purgeable.c b/src/mesa/drivers/dri/i965/brw_object_purgeable.c
index 20f66f2..3240ee3 100644
--- a/src/mesa/drivers/dri/i965/brw_object_purgeable.c
+++ b/src/mesa/drivers/dri/i965/brw_object_purgeable.c
@@ -38,12 +38,12 @@
 #include "intel_mipmap_tree.h"
 
 static GLenum
-intel_buffer_purgeable(drm_intel_bo *buffer)
+intel_buffer_purgeable(struct brw_bo *buffer)
 {
    int retained = 0;
 
    if (buffer != NULL)
-      retained = drm_intel_bo_madvise(buffer, I915_MADV_DONTNEED);
+      retained = brw_bo_madvise(buffer, I915_MADV_DONTNEED);
 
    return retained ? GL_VOLATILE_APPLE : GL_RELEASED_APPLE;
 }
@@ -101,13 +101,13 @@ intel_render_object_purgeable(struct gl_context * ctx,
 }
 
 static GLenum
-intel_buffer_unpurgeable(drm_intel_bo *buffer)
+intel_buffer_unpurgeable(struct brw_bo *buffer)
 {
    int retained;
 
    retained = 0;
    if (buffer != NULL)
-      retained = drm_intel_bo_madvise(buffer, I915_MADV_WILLNEED);
+      retained = brw_bo_madvise(buffer, I915_MADV_WILLNEED);
 
    return retained ? GL_RETAINED_APPLE : GL_UNDEFINED_APPLE;
 }
diff --git a/src/mesa/drivers/dri/i965/brw_performance_monitor.c b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
index 0a12375..fd94348 100644
--- a/src/mesa/drivers/dri/i965/brw_performance_monitor.c
+++ b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
@@ -54,7 +54,8 @@
 
 #include "brw_context.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
+
+#include "intel_reg.h"
 
 #define FILE_DEBUG_FLAG DEBUG_PERFMON
 
@@ -69,7 +70,7 @@ struct brw_perf_monitor_object
    /**
     * BO containing OA counter snapshots at monitor Begin/End time.
     */
-   drm_intel_bo *oa_bo;
+   struct brw_bo *oa_bo;
 
    /** Indexes into bookend_bo (snapshot numbers) for various segments. */
    int oa_head_end;
@@ -90,7 +91,7 @@ struct brw_perf_monitor_object
     * BO containing starting and ending snapshots for any active pipeline
     * statistics counters.
     */
-   drm_intel_bo *pipeline_stats_bo;
+   struct brw_bo *pipeline_stats_bo;
 
    /**
     * Storage for final pipeline statistics counter results.
@@ -615,15 +616,13 @@ gather_statistics_results(struct brw_context *brw,
       return;
    }
 
-   drm_intel_bo_map(monitor->pipeline_stats_bo, false);
-   uint64_t *start = monitor->pipeline_stats_bo->virtual;
+   uint64_t *start = brw_bo_map(monitor->pipeline_stats_bo, MAP_READ);
    uint64_t *end = start + (SECOND_SNAPSHOT_OFFSET_IN_BYTES / sizeof(uint64_t));
 
    for (int i = 0; i < num_counters; i++) {
       monitor->pipeline_stats_results[i] = end[i] - start[i];
    }
-   drm_intel_bo_unmap(monitor->pipeline_stats_bo);
-   drm_intel_bo_unreference(monitor->pipeline_stats_bo);
+   brw_bo_put(monitor->pipeline_stats_bo);
    monitor->pipeline_stats_bo = NULL;
 }
 
@@ -701,16 +700,21 @@ stop_oa_counters(struct brw_context *brw)
  */
 static void
 emit_mi_report_perf_count(struct brw_context *brw,
-                          drm_intel_bo *bo,
+                          struct brw_bo *bo,
                           uint32_t offset_in_bytes,
                           uint32_t report_id)
 {
    assert(offset_in_bytes % 64 == 0);
 
    /* Make sure the commands to take a snapshot fits in a single batch. */
-   intel_batchbuffer_require_space(brw, MI_REPORT_PERF_COUNT_BATCH_DWORDS * 4,
-                                   RENDER_RING);
-   int batch_used = brw->batch.used;
+   if (brw_batch_begin(&brw->batch,
+                       MI_REPORT_PERF_COUNT_BATCH_DWORDS * 4,
+                       RENDER_RING) < 0)
+      return;
+
+   /* If the OA counters aren't already on, enable them. */
+   if (brw->perfmon.oa_users == 0)
+      start_oa_counters(brw);
 
    /* Reports apparently don't always get written unless we flush first. */
    brw_emit_mi_flush(brw);
@@ -752,9 +756,7 @@ emit_mi_report_perf_count(struct brw_context *brw,
 
    /* Reports apparently don't always get written unless we flush after. */
    brw_emit_mi_flush(brw);
-
-   (void) batch_used;
-   assert(brw->batch.used - batch_used <= MI_REPORT_PERF_COUNT_BATCH_DWORDS * 4);
+   brw_batch_end(&brw->batch);
 }
 
 /**
@@ -892,8 +894,7 @@ gather_oa_results(struct brw_context *brw,
    struct gl_perf_monitor_object *m = &monitor->base;
    assert(monitor->oa_bo != NULL);
 
-   drm_intel_bo_map(monitor->oa_bo, false);
-   uint32_t *monitor_buffer = monitor->oa_bo->virtual;
+   uint32_t *monitor_buffer = brw_bo_map(monitor->oa_bo, MAP_READ);
 
    /* If monitoring was entirely contained within a single batch, then the
     * bookend BO is irrelevant.  Just subtract monitor->bo's two snapshots.
@@ -903,7 +904,6 @@ gather_oa_results(struct brw_context *brw,
                  monitor_buffer,
                  monitor_buffer + (SECOND_SNAPSHOT_OFFSET_IN_BYTES /
                                    sizeof(uint32_t)));
-      drm_intel_bo_unmap(monitor->oa_bo);
       return;
    }
 
@@ -950,13 +950,11 @@ gather_oa_results(struct brw_context *brw,
                                    sizeof(uint32_t)));
    }
 
-   drm_intel_bo_unmap(monitor->oa_bo);
-
    /* If the monitor has ended, then we've gathered all the results, and
     * can free the monitor's OA BO.
     */
    if (m->Ended) {
-      drm_intel_bo_unreference(monitor->oa_bo);
+      brw_bo_put(monitor->oa_bo);
       monitor->oa_bo = NULL;
 
       /* The monitor's OA result is now resolved. */
@@ -989,8 +987,7 @@ wrap_bookend_bo(struct brw_context *brw)
     */
    assert(brw->perfmon.oa_users > 0);
 
-   drm_intel_bo_map(brw->perfmon.bookend_bo, false);
-   uint32_t *bookend_buffer = brw->perfmon.bookend_bo->virtual;
+   uint32_t *bookend_buffer = brw_bo_map(brw->perfmon.bookend_bo, MAP_READ);
    for (int i = 0; i < brw->perfmon.unresolved_elements; i++) {
       struct brw_perf_monitor_object *monitor = brw->perfmon.unresolved[i];
       struct gl_perf_monitor_object *m = &monitor->base;
@@ -1011,7 +1008,6 @@ wrap_bookend_bo(struct brw_context *brw)
          assert(monitor->oa_tail_start == -1);
       }
    }
-   drm_intel_bo_unmap(brw->perfmon.bookend_bo);
 
    brw->perfmon.bookend_snapshots = 0;
 }
@@ -1060,7 +1056,7 @@ reinitialize_perf_monitor(struct brw_context *brw,
                           struct brw_perf_monitor_object *monitor)
 {
    if (monitor->oa_bo) {
-      drm_intel_bo_unreference(monitor->oa_bo);
+      brw_bo_put(monitor->oa_bo);
       monitor->oa_bo = NULL;
    }
 
@@ -1077,7 +1073,7 @@ reinitialize_perf_monitor(struct brw_context *brw,
    monitor->oa_results = NULL;
 
    if (monitor->pipeline_stats_bo) {
-      drm_intel_bo_unreference(monitor->pipeline_stats_bo);
+      brw_bo_put(monitor->pipeline_stats_bo);
       monitor->pipeline_stats_bo = NULL;
    }
 
@@ -1105,34 +1101,25 @@ brw_begin_perf_monitor(struct gl_context *ctx,
        * wasting memory for contexts that don't use performance monitors.
        */
       if (!brw->perfmon.bookend_bo) {
-         brw->perfmon.bookend_bo = drm_intel_bo_alloc(brw->bufmgr,
-                                                      "OA bookend BO",
-                                                      BOOKEND_BO_SIZE_BYTES, 64);
+	 brw->perfmon.bookend_bo = brw_bo_create(&brw->batch,
+						 "OA bookend BO",
+						 BOOKEND_BO_SIZE_BYTES, 0, 0);
       }
 
       monitor->oa_bo =
-         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor OA bo", 4096, 64);
+         brw_bo_create(&brw->batch, "perf. monitor OA bo", 4096, 0, 0);
 #ifdef DEBUG
       /* Pre-filling the BO helps debug whether writes landed. */
-      drm_intel_bo_map(monitor->oa_bo, true);
-      memset((char *) monitor->oa_bo->virtual, 0xff, 4096);
-      drm_intel_bo_unmap(monitor->oa_bo);
+      memset(brw_bo_map(monitor->oa_bo, MAP_WRITE), 0xff, 4096);
 #endif
 
       /* Allocate storage for accumulated OA counter values. */
       monitor->oa_results =
          calloc(brw->perfmon.entries_per_oa_snapshot, sizeof(uint32_t));
 
-      /* If the OA counters aren't already on, enable them. */
-      if (brw->perfmon.oa_users == 0) {
-         /* Ensure the OACONTROL enable and snapshot land in the same batch. */
-         int space = (MI_REPORT_PERF_COUNT_BATCH_DWORDS + 3) * 4;
-         intel_batchbuffer_require_space(brw, space, RENDER_RING);
-         start_oa_counters(brw);
-      }
-
       /* Take a starting OA counter snapshot. */
       emit_mi_report_perf_count(brw, monitor->oa_bo, 0, REPORT_ID);
+      brw->perfmon.oa_users++;
 
       monitor->oa_head_end = brw->perfmon.bookend_snapshots;
       monitor->oa_middle_start = brw->perfmon.bookend_snapshots + 1;
@@ -1140,13 +1127,11 @@ brw_begin_perf_monitor(struct gl_context *ctx,
 
       /* Add the monitor to the unresolved list. */
       add_to_unresolved_monitor_list(brw, monitor);
-
-      ++brw->perfmon.oa_users;
    }
 
    if (monitor_needs_statistics_registers(brw, m)) {
       monitor->pipeline_stats_bo =
-         drm_intel_bo_alloc(brw->bufmgr, "perf. monitor stats bo", 4096, 64);
+         brw_bo_create(&brw->batch, "perf. monitor stats bo", 4096, 0, 0);
 
       /* Take starting snapshots. */
       snapshot_statistics_registers(brw, monitor, 0);
@@ -1238,15 +1223,11 @@ brw_is_perf_monitor_result_available(struct gl_context *ctx,
    bool stats_available = true;
 
    if (monitor_needs_oa(brw, m)) {
-      oa_available = !monitor->oa_bo ||
-         (!drm_intel_bo_references(brw->batch.bo, monitor->oa_bo) &&
-          !drm_intel_bo_busy(monitor->oa_bo));
+      oa_available = !brw_bo_busy(monitor->oa_bo, BUSY_READ);
    }
 
    if (monitor_needs_statistics_registers(brw, m)) {
-      stats_available = !monitor->pipeline_stats_bo ||
-         (!drm_intel_bo_references(brw->batch.bo, monitor->pipeline_stats_bo) &&
-          !drm_intel_bo_busy(monitor->pipeline_stats_bo));
+      stats_available = !brw_bo_busy(monitor->pipeline_stats_bo, BUSY_READ);
    }
 
    return oa_available && stats_available;
@@ -1293,11 +1274,9 @@ brw_get_perf_monitor_result(struct gl_context *ctx,
           * Using an unsynchronized mapping avoids stalling for an
           * indeterminate amount of time.
           */
-         drm_intel_gem_bo_map_unsynchronized(brw->perfmon.bookend_bo);
-
-         gather_oa_results(brw, monitor, brw->perfmon.bookend_bo->virtual);
-
-         drm_intel_bo_unmap(brw->perfmon.bookend_bo);
+         gather_oa_results(brw, monitor,
+			   brw_bo_map(brw->perfmon.bookend_bo,
+				      MAP_READ | MAP_ASYNC));
       }
 
       for (int i = 0; i < brw->perfmon.entries_per_oa_snapshot; i++) {
@@ -1386,7 +1365,6 @@ void
 brw_perf_monitor_new_batch(struct brw_context *brw)
 {
    assert(brw->batch.ring == RENDER_RING);
-   assert(brw->gen < 6 || brw->batch.used == 0);
 
    if (brw->perfmon.oa_users == 0)
       return;
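
The map/unmap pairs above disappear because brw_bo_map() is assumed to
return a mapping that stays valid for the lifetime of the bo: MAP_READ
synchronizes against outstanding GPU writes, while MAP_READ | MAP_ASYNC
maps without stalling, taking over from
drm_intel_gem_bo_map_unsynchronized(). A sketch:

    static uint64_t read_counter(struct brw_bo *bo, int idx)
    {
       uint64_t *vaddr = brw_bo_map(bo, MAP_READ); /* no unmap step */
       return vaddr[idx];
    }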
diff --git a/src/mesa/drivers/dri/i965/brw_pipe_control.c b/src/mesa/drivers/dri/i965/brw_pipe_control.c
index 05e14cd..c64fbb1 100644
--- a/src/mesa/drivers/dri/i965/brw_pipe_control.c
+++ b/src/mesa/drivers/dri/i965/brw_pipe_control.c
@@ -22,7 +22,7 @@
  */
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
+
 #include "intel_fbo.h"
 #include "intel_reg.h"
 
@@ -135,7 +135,7 @@ brw_emit_pipe_control_flush(struct brw_context *brw, uint32_t flags)
  */
 void
 brw_emit_pipe_control_write(struct brw_context *brw, uint32_t flags,
-                            drm_intel_bo *bo, uint32_t offset,
+                            struct brw_bo *bo, uint32_t offset,
                             uint32_t imm_lower, uint32_t imm_upper)
 {
    if (brw->gen >= 8) {
@@ -289,7 +289,15 @@ brw_emit_post_sync_nonzero_flush(struct brw_context *brw)
 void
 brw_emit_mi_flush(struct brw_context *brw)
 {
-   if (brw->batch.ring == BLT_RING && brw->gen >= 6) {
+   if (brw->batch.emit.nbatch == 0)
+      return;
+
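+   /* Not enough headroom left between the command stream and the state
+    * section for a full flush sequence (~20 dwords as an upper bound);
+    * submitting the batch flushes at least as thoroughly.
+    */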
+   if (brw->batch.state - brw->batch.emit.nbatch < 20) {
+      brw_batch_flush(&brw->batch);
+      return;
+   }
+
+   if (brw->batch.ring == BLT_RING) {
       BEGIN_BATCH_BLT(4);
       OUT_BATCH(MI_FLUSH_DW);
       OUT_BATCH(0);
@@ -327,7 +335,8 @@ brw_emit_mi_flush(struct brw_context *brw)
       brw_emit_pipe_control_flush(brw, flags);
    }
 
-   brw_render_cache_set_clear(brw);
+   brw_batch_clear_dirty(&brw->batch);
+   brw->emitted_mi_flush = 1;
 }
 
 int
@@ -341,12 +350,11 @@ brw_init_pipe_control(struct brw_context *brw,
     * the gen6 workaround because it involves actually writing to
     * the buffer, and the kernel doesn't let us write to the batch.
     */
-   brw->workaround_bo = brw->intelScreen->workaround_bo;
+   brw->workaround_bo =
+      brw_bo_import(&brw->batch, brw->intelScreen->workaround_bo, true);
    if (brw->workaround_bo == NULL)
       return -ENOMEM;
 
-   drm_intel_bo_reference(brw->workaround_bo);
-
    brw->pipe_controls_since_last_cs_stall = 0;
 
    return 0;
@@ -355,5 +363,5 @@ brw_init_pipe_control(struct brw_context *brw,
 void
 brw_fini_pipe_control(struct brw_context *brw)
 {
-   drm_intel_bo_unreference(brw->workaround_bo);
+   brw_bo_put(brw->workaround_bo);
 }
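
brw_init_pipe_control() now wraps the screen's shared workaround page
with brw_bo_import() rather than referencing the drm_intel_bo directly.
A sketch of the assumed lifecycle (the trailing bool is not defined in
this excerpt; it presumably controls whether the import takes its own
reference):

    struct brw_bo *bo =
       brw_bo_import(&brw->batch, brw->intelScreen->workaround_bo, true);
    if (bo == NULL)
       return -ENOMEM;
    /* ... use bo; release with brw_bo_put(bo) on teardown ... */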
diff --git a/src/mesa/drivers/dri/i965/brw_primitive_restart.c b/src/mesa/drivers/dri/i965/brw_primitive_restart.c
index 2c7a7e8..163d8a2 100644
--- a/src/mesa/drivers/dri/i965/brw_primitive_restart.c
+++ b/src/mesa/drivers/dri/i965/brw_primitive_restart.c
@@ -33,8 +33,6 @@
 #include "brw_defines.h"
 #include "brw_draw.h"
 
-#include "intel_batchbuffer.h"
-
 /**
  * Check if the hardware's cut index support can handle the primitive
  * restart index value (pre-Haswell only).
diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
index 85e271d..3c047e2 100644
--- a/src/mesa/drivers/dri/i965/brw_program.c
+++ b/src/mesa/drivers/dri/i965/brw_program.c
@@ -45,7 +45,8 @@
 #include "brw_shader.h"
 #include "brw_nir.h"
 #include "brw_wm.h"
-#include "intel_batchbuffer.h"
+
+#include "intel_reg.h"
 
 static unsigned
 get_new_program_id(struct intel_screen *screen)
@@ -259,17 +260,17 @@ brw_get_scratch_size(int size)
 
 void
 brw_get_scratch_bo(struct brw_context *brw,
-		   drm_intel_bo **scratch_bo, int size)
+		   struct brw_bo **scratch_bo, int size)
 {
-   drm_intel_bo *old_bo = *scratch_bo;
+   struct brw_bo *old_bo = *scratch_bo;
 
    if (old_bo && old_bo->size < size) {
-      drm_intel_bo_unreference(old_bo);
+      brw_bo_put(old_bo);
       old_bo = NULL;
    }
 
    if (!old_bo) {
-      *scratch_bo = drm_intel_bo_alloc(brw->bufmgr, "scratch bo", size, 4096);
+      *scratch_bo = brw_bo_create(&brw->batch, "scratch bo", size, 4096, 0);
    }
 }
 
@@ -297,9 +298,9 @@ void
 brw_init_shader_time(struct brw_context *brw)
 {
    const int max_entries = 2048;
-   brw->shader_time.bo =
-      drm_intel_bo_alloc(brw->bufmgr, "shader time",
-                         max_entries * SHADER_TIME_STRIDE * 3, 4096);
+   brw->shader_time.bo = brw_bo_create(&brw->batch, "shader time",
+				       max_entries * SHADER_TIME_STRIDE * 3,
+				       4096, 0);
    brw->shader_time.names = rzalloc_array(brw, const char *, max_entries);
    brw->shader_time.ids = rzalloc_array(brw, int, max_entries);
    brw->shader_time.types = rzalloc_array(brw, enum shader_time_shader_type,
@@ -462,8 +463,7 @@ brw_collect_shader_time(struct brw_context *brw)
     * delaying reading the reports, but it doesn't look like it's a big
     * overhead compared to the cost of tracking the time in the first place.
     */
-   drm_intel_bo_map(brw->shader_time.bo, true);
-   void *bo_map = brw->shader_time.bo->virtual;
+   void *bo_map = brw_bo_map(brw->shader_time.bo, MAP_WRITE);
 
    for (int i = 0; i < brw->shader_time.num_entries; i++) {
       uint32_t *times = bo_map + i * 3 * SHADER_TIME_STRIDE;
@@ -476,7 +476,6 @@ brw_collect_shader_time(struct brw_context *brw)
    /* Zero the BO out to clear it out for our next collection.
     */
    memset(bo_map, 0, brw->shader_time.bo->size);
-   drm_intel_bo_unmap(brw->shader_time.bo);
 }
 
 void
@@ -529,7 +528,7 @@ brw_get_shader_time_index(struct brw_context *brw,
 void
 brw_destroy_shader_time(struct brw_context *brw)
 {
-   drm_intel_bo_unreference(brw->shader_time.bo);
+   brw_bo_put(brw->shader_time.bo);
    brw->shader_time.bo = NULL;
 }
 
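
Allocation moves from the context-wide drm bufmgr to the batch manager.
A sketch of the brw_bo_create() signature as used throughout this patch
(name, size, alignment, flags); zero alignment and zero flags appear to
request the defaults:

    struct brw_bo *bo =
       brw_bo_create(&brw->batch, "scratch bo", size, 4096, 0);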
diff --git a/src/mesa/drivers/dri/i965/brw_queryobj.c b/src/mesa/drivers/dri/i965/brw_queryobj.c
index aea4d9b..4d248d7 100644
--- a/src/mesa/drivers/dri/i965/brw_queryobj.c
+++ b/src/mesa/drivers/dri/i965/brw_queryobj.c
@@ -40,15 +40,17 @@
 #include "brw_context.h"
 #include "brw_defines.h"
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
 #include "intel_reg.h"
 
 /**
  * Emit PIPE_CONTROLs to write the current GPU timestamp into a buffer.
  */
 void
-brw_write_timestamp(struct brw_context *brw, drm_intel_bo *query_bo, int idx)
+brw_write_timestamp(struct brw_context *brw, struct brw_bo *query_bo, int idx)
 {
+   if (brw_batch_begin(&brw->batch, 60, RENDER_RING) < 0)
+      return;
+
    if (brw->gen == 6) {
       /* Emit Sandybridge workaround flush: */
       brw_emit_pipe_control_flush(brw,
@@ -58,13 +60,15 @@ brw_write_timestamp(struct brw_context *brw, drm_intel_bo *query_bo, int idx)
 
    brw_emit_pipe_control_write(brw, PIPE_CONTROL_WRITE_TIMESTAMP,
                                query_bo, idx * sizeof(uint64_t), 0, 0);
+
+   brw_batch_end(&brw->batch);
 }
 
 /**
  * Emit PIPE_CONTROLs to write the PS_DEPTH_COUNT register into a buffer.
  */
 void
-brw_write_depth_count(struct brw_context *brw, drm_intel_bo *query_bo, int idx)
+brw_write_depth_count(struct brw_context *brw, struct brw_bo *query_bo, int idx)
 {
    uint32_t flags;
 
@@ -78,8 +82,13 @@ brw_write_depth_count(struct brw_context *brw, drm_intel_bo *query_bo, int idx)
    if (brw->predicate.supported)
       flags |= PIPE_CONTROL_FLUSH_ENABLE;
 
+   if (brw_batch_begin(&brw->batch, 40, RENDER_RING) < 0)
+      return;
+
    brw_emit_pipe_control_write(brw, flags, query_bo,
                                idx * sizeof(uint64_t), 0, 0);
+
+   brw_batch_end(&brw->batch);
 }
 
 /**
@@ -89,12 +98,10 @@ static void
 brw_queryobj_get_results(struct gl_context *ctx,
 			 struct brw_query_object *query)
 {
-   struct brw_context *brw = brw_context(ctx);
-
    int i;
    uint64_t *results;
 
-   assert(brw->gen < 6);
+   assert(brw_context(ctx)->gen < 6);
 
    if (query->bo == NULL)
       return;
@@ -103,17 +110,7 @@ brw_queryobj_get_results(struct gl_context *ctx,
     * still contributing to it, flush it now so the results will be present
     * when mapped.
     */
-   if (drm_intel_bo_references(brw->batch.bo, query->bo))
-      intel_batchbuffer_flush(brw);
-
-   if (unlikely(brw->perf_debug)) {
-      if (drm_intel_bo_busy(query->bo)) {
-         perf_debug("Stalling on the GPU waiting for a query object.\n");
-      }
-   }
-
-   drm_intel_bo_map(query->bo, false);
-   results = query->bo->virtual;
+   results = brw_bo_map(query->bo, MAP_READ);
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED_EXT:
       /* The query BO contains the starting and ending timestamps.
@@ -159,12 +156,11 @@ brw_queryobj_get_results(struct gl_context *ctx,
    default:
       unreachable("Unrecognized query target in brw_queryobj_get_results()");
    }
-   drm_intel_bo_unmap(query->bo);
 
    /* Now that we've processed the data stored in the query's buffer object,
     * we can release it.
     */
-   drm_intel_bo_unreference(query->bo);
+   brw_bo_put(query->bo);
    query->bo = NULL;
 }
 
@@ -196,7 +192,7 @@ brw_delete_query(struct gl_context *ctx, struct gl_query_object *q)
 {
    struct brw_query_object *query = (struct brw_query_object *)q;
 
-   drm_intel_bo_unreference(query->bo);
+   brw_bo_put(query->bo);
    free(query);
 }
 
@@ -235,8 +231,8 @@ brw_begin_query(struct gl_context *ctx, struct gl_query_object *q)
        * obtain the time elapsed.  Notably, this includes time elapsed while
        * the system was doing other work, such as running other applications.
        */
-      drm_intel_bo_unreference(query->bo);
-      query->bo = drm_intel_bo_alloc(brw->bufmgr, "timer query", 4096, 4096);
+      brw_bo_put(query->bo);
+      query->bo = brw_bo_create(&brw->batch, "timer query", 4096, 4096, 0);
       brw_write_timestamp(brw, query->bo, 0);
       break;
 
@@ -250,7 +246,7 @@ brw_begin_query(struct gl_context *ctx, struct gl_query_object *q)
        * Since we're starting a new query, we need to be sure to throw away
        * any previous occlusion query results.
        */
-      drm_intel_bo_unreference(query->bo);
+      brw_bo_put(query->bo);
       query->bo = NULL;
       query->last_index = -1;
 
@@ -350,10 +346,9 @@ static void brw_wait_query(struct gl_context *ctx, struct gl_query_object *q)
  */
 static void brw_check_query(struct gl_context *ctx, struct gl_query_object *q)
 {
-   struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
-   assert(brw->gen < 6);
+   assert(brw_context(ctx)->gen < 6);
 
    /* From the GL_ARB_occlusion_query spec:
     *
@@ -362,10 +357,7 @@ static void brw_check_query(struct gl_context *ctx, struct gl_query_object *q)
     *      not ready yet on the first time it is queried.  This ensures that
     *      the async query will return true in finite time.
     */
-   if (query->bo && drm_intel_bo_references(brw->batch.bo, query->bo))
-      intel_batchbuffer_flush(brw);
-
-   if (query->bo == NULL || !drm_intel_bo_busy(query->bo)) {
+   if (!brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH)) {
       brw_queryobj_get_results(ctx, query);
       query->Base.Ready = true;
    }
@@ -394,7 +386,7 @@ ensure_bo_has_space(struct gl_context *ctx, struct brw_query_object *query)
          brw_queryobj_get_results(ctx, query);
       }
 
-      query->bo = drm_intel_bo_alloc(brw->bufmgr, "query", 4096, 1);
+      query->bo = brw_bo_create(&brw->batch, "query", 4096, 0, 0);
       query->last_index = 0;
    }
 }
@@ -425,7 +417,7 @@ brw_emit_query_begin(struct brw_context *brw)
    struct gl_context *ctx = &brw->ctx;
    struct brw_query_object *query = brw->query.obj;
 
-   if (brw->hw_ctx)
+   if (brw->batch.hw_ctx)
       return;
 
    /* Skip if we're not doing any queries, or we've already recorded the
@@ -452,7 +444,7 @@ brw_emit_query_end(struct brw_context *brw)
 {
    struct brw_query_object *query = brw->query.obj;
 
-   if (brw->hw_ctx)
+   if (brw->batch.hw_ctx)
       return;
 
    if (!brw->query.begin_emitted)
@@ -479,11 +471,10 @@ brw_query_counter(struct gl_context *ctx, struct gl_query_object *q)
 
    assert(q->Target == GL_TIMESTAMP);
 
-   drm_intel_bo_unreference(query->bo);
-   query->bo = drm_intel_bo_alloc(brw->bufmgr, "timestamp query", 4096, 4096);
-   brw_write_timestamp(brw, query->bo, 0);
+   brw_bo_put(query->bo);
+   query->bo = brw_bo_create(&brw->batch, "timestamp query", 4096, 0, 0);
 
-   query->flushed = false;
+   brw_write_timestamp(brw, query->bo, 0);
 }
 
 /**
@@ -495,9 +486,10 @@ static uint64_t
 brw_get_timestamp(struct gl_context *ctx)
 {
    struct brw_context *brw = brw_context(ctx);
-   uint64_t result = 0;
+   uint64_t result;
 
-   drm_intel_reg_read(brw->bufmgr, TIMESTAMP, &result);
+   if (drm_intel_reg_read(brw->intelScreen->bufmgr, TIMESTAMP, &result))
+      return 0;
 
    /* See logic in brw_queryobj_get_results() */
    result = result >> 32;
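
The old readiness test took three steps: check bo != NULL, flush if the
unsubmitted batch references the bo, then poll drm_intel_bo_busy().
brw_bo_busy() folds these together: it is NULL-safe, and BUSY_FLUSH is
understood to submit the batch itself when that is all that keeps the
bo busy. A sketch:

    static bool query_ready(struct brw_query_object *query)
    {
       return !brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH);
    }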
diff --git a/src/mesa/drivers/dri/i965/brw_reset.c b/src/mesa/drivers/dri/i965/brw_reset.c
index e3182b1..e9b280b 100644
--- a/src/mesa/drivers/dri/i965/brw_reset.c
+++ b/src/mesa/drivers/dri/i965/brw_reset.c
@@ -36,12 +36,6 @@ brw_get_graphics_reset_status(struct gl_context *ctx)
    uint32_t active;
    uint32_t pending;
 
-   /* If hardware contexts are not being used (or
-    * DRM_IOCTL_I915_GET_RESET_STATS is not supported), this function should
-    * not be accessible.
-    */
-   assert(brw->hw_ctx != NULL);
-
    /* A reset status other than NO_ERROR was returned last time. I915 returns
     * nonzero active/pending only if reset has been encountered and completed.
     * Return NO_ERROR from now on.
@@ -49,8 +43,12 @@ brw_get_graphics_reset_status(struct gl_context *ctx)
    if (brw->reset_count != 0)
       return GL_NO_ERROR;
 
-   err = drm_intel_get_reset_stats(brw->hw_ctx, &reset_count, &active,
-                                   &pending);
+   /* If hardware contexts are not being used (or
+    * DRM_IOCTL_I915_GET_RESET_STATS is not supported), this function should
+    * not be accessible.
+    */
+   err = brw_batch_get_reset_stats(&brw->batch,
+				   &reset_count, &active, &pending);
    if (err)
       return GL_NO_ERROR;
 
diff --git a/src/mesa/drivers/dri/i965/brw_sampler_state.c b/src/mesa/drivers/dri/i965/brw_sampler_state.c
index 22ccbfe..a56356b 100644
--- a/src/mesa/drivers/dri/i965/brw_sampler_state.c
+++ b/src/mesa/drivers/dri/i965/brw_sampler_state.c
@@ -39,7 +39,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "intel_mipmap_tree.h"
 
 #include "main/macros.h"
@@ -99,14 +98,13 @@ brw_emit_sampler_state(struct brw_context *brw,
            SET_FIELD(mag_filter, BRW_SAMPLER_MAG_FILTER) |
            SET_FIELD(min_filter, BRW_SAMPLER_MIN_FILTER);
 
-   ss[2] = border_color_offset;
    if (brw->gen < 6) {
-      ss[2] += brw->batch.bo->offset64; /* reloc */
-      drm_intel_bo_emit_reloc(brw->batch.bo,
+      ss[2] = brw_batch_reloc(&brw->batch,
                               batch_offset_for_sampler_state + 8,
                               brw->batch.bo, border_color_offset,
                               I915_GEM_DOMAIN_SAMPLER, 0);
-   }
+   } else {
+      ss[2] = border_color_offset;
+   }
 
    ss[3] = SET_FIELD(max_anisotropy, BRW_SAMPLER_MAX_ANISOTROPY) |
            SET_FIELD(address_rounding, BRW_SAMPLER_ADDRESS_ROUNDING);
diff --git a/src/mesa/drivers/dri/i965/brw_sf.c b/src/mesa/drivers/dri/i965/brw_sf.c
index 872464c..52deb57 100644
--- a/src/mesa/drivers/dri/i965/brw_sf.c
+++ b/src/mesa/drivers/dri/i965/brw_sf.c
@@ -36,8 +36,6 @@
 #include "main/enums.h"
 #include "main/fbobject.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_sf_emit.c b/src/mesa/drivers/dri/i965/brw_sf_emit.c
index b3ee5c1..28e2e56 100644
--- a/src/mesa/drivers/dri/i965/brw_sf_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_sf_emit.c
@@ -34,8 +34,6 @@
 #include "main/macros.h"
 #include "main/enums.h"
 
-#include "intel_batchbuffer.h"
-
 #include "brw_defines.h"
 #include "brw_context.h"
 #include "brw_eu.h"
diff --git a/src/mesa/drivers/dri/i965/brw_sf_state.c b/src/mesa/drivers/dri/i965/brw_sf_state.c
index 5d98922..076166b 100644
--- a/src/mesa/drivers/dri/i965/brw_sf_state.c
+++ b/src/mesa/drivers/dri/i965/brw_sf_state.c
@@ -133,7 +133,6 @@ static void upload_sf_unit( struct brw_context *brw )
 {
    struct gl_context *ctx = &brw->ctx;
    struct brw_sf_unit_state *sf;
-   drm_intel_bo *bo = brw->batch.bo;
    int chipset_max_threads;
    bool render_to_fbo = _mesa_is_user_fbo(ctx->DrawBuffer);
 
@@ -179,9 +178,6 @@ static void upload_sf_unit( struct brw_context *brw )
       sf->thread4.stats_enable = 1;
 
    /* BRW_NEW_SF_VP */
-   sf->sf5.sf_viewport_state_offset = (brw->batch.bo->offset64 +
-				       brw->sf.vp_offset) >> 5; /* reloc */
-
    sf->sf5.viewport_transform = 1;
 
    /* _NEW_SCISSOR */
@@ -200,6 +196,15 @@ static void upload_sf_unit( struct brw_context *brw )
     */
    sf->sf5.front_winding ^= render_to_fbo;
 
+   sf->sf5.sf_viewport_state_offset =
+      brw_batch_reloc(&brw->batch,
+		      (brw->sf.state_offset + offsetof(struct brw_sf_unit_state, sf5)),
+		      brw->batch.bo,
+		      brw->sf.vp_offset | sf->dw5,
+		      I915_GEM_DOMAIN_INSTRUCTION, 0) >> 5;
+
    /* _NEW_POLYGON */
    switch (ctx->Polygon.CullFlag ? ctx->Polygon.CullFaceMode : GL_NONE) {
    case GL_FRONT:
@@ -290,14 +295,6 @@ static void upload_sf_unit( struct brw_context *brw )
     * something loaded through the GPE (L2 ISC), so it's INSTRUCTION domain.
     */
 
-   /* Emit SF viewport relocation */
-   drm_intel_bo_emit_reloc(bo, (brw->sf.state_offset +
-				offsetof(struct brw_sf_unit_state, sf5)),
-			   brw->batch.bo, (brw->sf.vp_offset |
-					     sf->sf5.front_winding |
-					     (sf->sf5.viewport_transform << 1)),
-			   I915_GEM_DOMAIN_INSTRUCTION, 0);
-
    brw->ctx.NewDriverState |= BRW_NEW_GEN4_UNIT_STATE;
 }
 
diff --git a/src/mesa/drivers/dri/i965/brw_state.h b/src/mesa/drivers/dri/i965/brw_state.h
index 987672f..02c83db 100644
--- a/src/mesa/drivers/dri/i965/brw_state.h
+++ b/src/mesa/drivers/dri/i965/brw_state.h
@@ -227,7 +227,7 @@ void brw_destroy_caches( struct brw_context *brw );
  * brw_state_batch.c
  */
 #define BRW_BATCH_STRUCT(brw, s) \
-   intel_batchbuffer_data(brw, (s), sizeof(*(s)), RENDER_RING)
+   brw_batch_data(&brw->batch, (s), sizeof(*(s)))
 
 void *__brw_state_batch(struct brw_context *brw,
                         enum aub_state_struct_type type,
diff --git a/src/mesa/drivers/dri/i965/brw_state_batch.c b/src/mesa/drivers/dri/i965/brw_state_batch.c
index a405a80..d79cb48 100644
--- a/src/mesa/drivers/dri/i965/brw_state_batch.c
+++ b/src/mesa/drivers/dri/i965/brw_state_batch.c
@@ -30,7 +30,7 @@
   */
 
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
+
 #include "main/imports.h"
 #include "util/ralloc.h"
 
@@ -41,14 +41,12 @@ brw_track_state_batch(struct brw_context *brw,
                       int size,
                       int index)
 {
-   struct intel_batchbuffer *batch = &brw->batch;
-
    if (!brw->state_batch_list) {
       /* Our structs are always aligned to at least 32 bytes, so
        * our array doesn't need to be any larger
        */
       brw->state_batch_list = ralloc_size(brw, sizeof(*brw->state_batch_list) *
-					  batch->bo->size / 32);
+					  brw->batch.bo->size / 32);
    }
 
    brw->state_batch_list[brw->state_batch_count].offset = offset;
@@ -81,13 +79,14 @@ make_annotation(drm_intel_aub_annotation *annotation, uint32_t type,
  * is annotated according to the type of each data structure.
  */
 void
-brw_annotate_aub(struct brw_context *brw)
+brw_annotate_batch(struct brw_batch *batch)
 {
+   struct brw_context *brw = container_of(batch, brw, batch);
    unsigned annotation_count = 2 * brw->state_batch_count + 1;
    drm_intel_aub_annotation annotations[annotation_count];
    int a = 0;
    make_annotation(&annotations[a++], AUB_TRACE_TYPE_BATCH, 0,
-                   4*brw->batch.used);
+                   4*brw->batch.emit.nbatch);
    for (int i = brw->state_batch_count; i-- > 0; ) {
       uint32_t type = brw->state_batch_list[i].type;
       uint32_t start_offset = brw->state_batch_list[i].offset;
@@ -98,8 +97,8 @@ brw_annotate_aub(struct brw_context *brw)
                       AUB_TRACE_SUBTYPE(type), end_offset);
    }
    assert(a == annotation_count);
-   drm_intel_bufmgr_gem_set_aub_annotations(brw->batch.bo, annotations,
-                                            annotation_count);
+   drm_intel_bufmgr_gem_set_aub_annotations(brw->batch.bo->base,
+					    annotations, annotation_count);
 }
 
 /**
@@ -125,27 +124,13 @@ __brw_state_batch(struct brw_context *brw,
                   uint32_t *out_offset)
 
 {
-   struct intel_batchbuffer *batch = &brw->batch;
-   uint32_t offset;
-
-   assert(size < batch->bo->size);
-   offset = ROUND_DOWN_TO(batch->state_batch_offset - size, alignment);
-
-   /* If allocating from the top would wrap below the batchbuffer, or
-    * if the batch's used space (plus the reserved pad) collides with our
-    * space, then flush and try again.
-    */
-   if (batch->state_batch_offset < size ||
-       offset < 4*batch->used + batch->reserved_space) {
-      intel_batchbuffer_flush(brw);
-      offset = ROUND_DOWN_TO(batch->state_batch_offset - size, alignment);
-   }
-
-   batch->state_batch_offset = offset;
+   assert(size < brw->batch.bo->size);
+   brw->batch.state = ROUND_DOWN_TO(4*brw->batch.state - size, alignment)/4;
+   assert(brw->batch.state > brw->batch.emit.nbatch);
 
    if (unlikely(INTEL_DEBUG & (DEBUG_BATCH | DEBUG_AUB)))
-      brw_track_state_batch(brw, type, offset, size, index);
+      brw_track_state_batch(brw, type, 4*brw->batch.state, size, index);
 
-   *out_offset = offset;
-   return batch->map + (offset>>2);
+   *out_offset = 4*brw->batch.state;
+   return brw->batch.map + brw->batch.state;
 }
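
A standalone sketch (not driver code) of the allocation scheme __brw_state_batch
now relies on: commands grow up from dword 0 (batch.emit.nbatch) while indirect
state grows down from batch.state, both dword indices into one fixed-size
mapping, and the two cursors must never cross. Struct and function names here
are illustrative only.

#include <assert.h>
#include <stdint.h>

#define BATCH_DWORDS (64 * 1024 / 4)    /* fixed-size batch, in dwords */

struct sketch_batch {
   uint32_t map[BATCH_DWORDS];
   uint32_t nbatch;                     /* command cursor, grows up */
   uint32_t state;                      /* state cursor, grows down */
};

/* Mirrors the arithmetic above: convert the state cursor to bytes, make
 * room for size bytes, round down to alignment (a power of two), then
 * convert back to a dword index.
 */
static uint32_t *
sketch_state_alloc(struct sketch_batch *b, uint32_t size, uint32_t alignment)
{
   b->state = ((4 * b->state - size) & ~(alignment - 1)) / 4;
   assert(b->state > b->nbatch);        /* state must not meet commands */
   return b->map + b->state;
}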
diff --git a/src/mesa/drivers/dri/i965/brw_state_cache.c b/src/mesa/drivers/dri/i965/brw_state_cache.c
index 157b33d..1210bb7 100644
--- a/src/mesa/drivers/dri/i965/brw_state_cache.c
+++ b/src/mesa/drivers/dri/i965/brw_state_cache.c
@@ -45,7 +45,6 @@
  */
 
 #include "main/imports.h"
-#include "intel_batchbuffer.h"
 #include "brw_state.h"
 #include "brw_vs.h"
 #include "brw_wm.h"
@@ -169,29 +168,19 @@ static void
 brw_cache_new_bo(struct brw_cache *cache, uint32_t new_size)
 {
    struct brw_context *brw = cache->brw;
-   drm_intel_bo *new_bo;
+   struct brw_bo *new_bo;
 
-   new_bo = drm_intel_bo_alloc(brw->bufmgr, "program cache", new_size, 64);
-   if (brw->has_llc)
-      drm_intel_gem_bo_map_unsynchronized(new_bo);
+   new_bo = brw_bo_create(&brw->batch, "program cache", new_size, 64, 0);
 
    /* Copy any existing data that needs to be saved. */
    if (cache->next_offset != 0) {
-      if (brw->has_llc) {
-         memcpy(new_bo->virtual, cache->bo->virtual, cache->next_offset);
-      } else {
-         drm_intel_bo_map(cache->bo, false);
-         drm_intel_bo_subdata(new_bo, 0, cache->next_offset,
-                              cache->bo->virtual);
-         drm_intel_bo_unmap(cache->bo);
-      }
+      brw_bo_read(cache->bo, 0,
+                  brw_bo_map(new_bo, MAP_WRITE), cache->next_offset,
+                  MAP_ASYNC);
    }
 
-   if (brw->has_llc)
-      drm_intel_bo_unmap(cache->bo);
-   drm_intel_bo_unreference(cache->bo);
+   brw_bo_put(cache->bo);
    cache->bo = new_bo;
-   cache->bo_used_by_gpu = false;
 
    /* Since we have a new BO in place, we need to signal the units
     * that depend on it (state base address on gen5+, or unit state before).
@@ -209,7 +198,6 @@ brw_try_upload_using_copy(struct brw_cache *cache,
 			  const void *data,
 			  const void *aux)
 {
-   struct brw_context *brw = cache->brw;
    int i;
    struct brw_cache_item *item;
 
@@ -231,11 +219,9 @@ brw_try_upload_using_copy(struct brw_cache *cache,
 	    continue;
 	 }
 
-         if (!brw->has_llc)
-            drm_intel_bo_map(cache->bo, false);
-	 ret = memcmp(cache->bo->virtual + item->offset, data, item->size);
-         if (!brw->has_llc)
-            drm_intel_bo_unmap(cache->bo);
+	 ret = memcmp(brw_bo_map(cache->bo, MAP_READ | MAP_ASYNC) + item->offset,
+		      data,
+		      item->size);
 	 if (ret)
 	    continue;
 
@@ -253,8 +239,6 @@ brw_upload_item_data(struct brw_cache *cache,
 		     struct brw_cache_item *item,
 		     const void *data)
 {
-   struct brw_context *brw = cache->brw;
-
    /* Allocate space in the cache BO for our new program. */
    if (cache->next_offset + item->size > cache->bo->size) {
       uint32_t new_size = cache->bo->size * 2;
@@ -265,16 +249,11 @@ brw_upload_item_data(struct brw_cache *cache,
       brw_cache_new_bo(cache, new_size);
    }
 
-   /* If we would block on writing to an in-use program BO, just
-    * recreate it.
-    */
-   if (!brw->has_llc && cache->bo_used_by_gpu) {
-      perf_debug("Copying busy program cache buffer.\n");
-      brw_cache_new_bo(cache, cache->bo->size);
-   }
-
    item->offset = cache->next_offset;
 
+   /* Copy data to the buffer */
+   brw_bo_write(cache->bo, item->offset, data, item->size, MAP_ASYNC);
+
    /* Programs are always 64-byte aligned, so set up the next one now */
    cache->next_offset = ALIGN(item->offset + item->size, 64);
 }
@@ -291,7 +270,6 @@ brw_upload_cache(struct brw_cache *cache,
 		 uint32_t *out_offset,
 		 void *out_aux)
 {
-   struct brw_context *brw = cache->brw;
    struct brw_cache_item *item = CALLOC_STRUCT(brw_cache_item);
    GLuint hash;
    void *tmp;
@@ -331,13 +309,6 @@ brw_upload_cache(struct brw_cache *cache,
    cache->items[hash] = item;
    cache->n_items++;
 
-   /* Copy data to the buffer */
-   if (brw->has_llc) {
-      memcpy((char *) cache->bo->virtual + item->offset, data, data_size);
-   } else {
-      drm_intel_bo_subdata(cache->bo, item->offset, data_size, data);
-   }
-
    *out_offset = item->offset;
    *(void **)out_aux = (void *)((char *)item->key + item->key_size);
    cache->brw->ctx.NewDriverState |= 1 << cache_id;
@@ -355,11 +326,7 @@ brw_init_caches(struct brw_context *brw)
    cache->items =
       calloc(cache->size, sizeof(struct brw_cache_item *));
 
-   cache->bo = drm_intel_bo_alloc(brw->bufmgr,
-				  "program cache",
-				  4096, 64);
-   if (brw->has_llc)
-      drm_intel_gem_bo_map_unsynchronized(cache->bo);
+   cache->bo = brw_bo_create(&brw->batch, "program cache", 4096, 64, 0);
 
    cache->aux_compare[BRW_CACHE_VS_PROG] = brw_vs_prog_data_compare;
    cache->aux_compare[BRW_CACHE_GS_PROG] = brw_gs_prog_data_compare;
@@ -394,6 +361,9 @@ brw_clear_cache(struct brw_context *brw, struct brw_cache *cache)
 
    cache->n_items = 0;
 
+   brw_bo_put(cache->bo);
+   cache->bo = brw_bo_create(&brw->batch, "program cache", 4096, 64, 0);
+
    /* Start putting programs into the start of the BO again, since
     * we'll never find the old results.
     */
@@ -404,7 +374,6 @@ brw_clear_cache(struct brw_context *brw, struct brw_cache *cache)
     */
    brw->NewGLState |= ~0;
    brw->ctx.NewDriverState |= ~0ull;
-   intel_batchbuffer_flush(brw);
 }
 
 void
@@ -427,11 +396,10 @@ brw_destroy_cache(struct brw_context *brw, struct brw_cache *cache)
 
    DBG("%s\n", __func__);
 
-   if (brw->has_llc)
-      drm_intel_bo_unmap(cache->bo);
-   drm_intel_bo_unreference(cache->bo);
-   cache->bo = NULL;
    brw_clear_cache(brw, cache);
+   brw_bo_put(cache->bo);
+   cache->bo = NULL;
+
    free(cache->items);
    cache->items = NULL;
    cache->size = 0;
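
For review, the brw_bo entry points this file now uses, gathered in one
place. These prototypes are inferred from the call sites in this patch (the
real declarations live in brw_batch.h), so treat the exact integer types and
flag spellings as approximations rather than the final API.

struct brw_bo *brw_bo_create(struct brw_batch *batch, const char *name,
                             uint64_t size, uint64_t alignment,
                             unsigned flags);
void *brw_bo_map(struct brw_bo *bo, unsigned flags);  /* MAP_READ/MAP_WRITE,
                                                       * plus MAP_ASYNC to
                                                       * skip the wait */
void brw_bo_read(struct brw_bo *bo, uint64_t offset,
                 void *dst, uint64_t length, unsigned flags);
void brw_bo_write(struct brw_bo *bo, uint64_t offset,
                  const void *src, uint64_t length, unsigned flags);
void brw_bo_put(struct brw_bo *bo);                   /* drops a reference */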
diff --git a/src/mesa/drivers/dri/i965/brw_state_dump.c b/src/mesa/drivers/dri/i965/brw_state_dump.c
index b6f4d59..efa0cb0 100644
--- a/src/mesa/drivers/dri/i965/brw_state_dump.c
+++ b/src/mesa/drivers/dri/i965/brw_state_dump.c
@@ -26,7 +26,6 @@
  */
 
 #include "main/mtypes.h"
-#include "intel_batchbuffer.h"
 
 #include "brw_context.h"
 #include "brw_defines.h"
@@ -67,6 +66,11 @@ static const char *surface_tiling[] = {
    "Y-tiled"
 };
 
+static void *batch_in(struct brw_context *brw, unsigned offset)
+{
+	return (void *)brw->batch.map + offset;
+}
+
 static void
 batch_out(struct brw_context *brw, const char *name, uint32_t offset,
 	  int index, char *fmt, ...) PRINTFLIKE(5, 6);
@@ -75,7 +79,7 @@ static void
 batch_out(struct brw_context *brw, const char *name, uint32_t offset,
 	  int index, char *fmt, ...)
 {
-   uint32_t *data = brw->batch.bo->virtual + offset;
+   uint32_t *data = batch_in(brw, offset);
    va_list va;
 
    fprintf(stderr, "0x%08x:      0x%08x: %8s: ",
@@ -89,7 +93,7 @@ static void
 batch_out64(struct brw_context *brw, const char *name, uint32_t offset,
             int index, char *fmt, ...)
 {
-   uint32_t *tmp = brw->batch.bo->virtual + offset;
+   uint32_t *tmp = batch_in(brw, offset);
 
    /* Swap the dwords since we want to handle this as a 64b value, but the data
     * is typically emitted as dwords.
@@ -121,7 +125,7 @@ get_965_surfacetype(unsigned int surfacetype)
 static void dump_vs_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "VS_STATE";
-   struct brw_vs_unit_state *vs = brw->batch.bo->virtual + offset;
+   struct brw_vs_unit_state *vs = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "thread0\n");
    batch_out(brw, name, offset, 1, "thread1\n");
@@ -136,7 +140,7 @@ static void dump_vs_state(struct brw_context *brw, uint32_t offset)
 static void dump_gs_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "GS_STATE";
-   struct brw_gs_unit_state *gs = brw->batch.bo->virtual + offset;
+   struct brw_gs_unit_state *gs = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "thread0\n");
    batch_out(brw, name, offset, 1, "thread1\n");
@@ -151,7 +155,7 @@ static void dump_gs_state(struct brw_context *brw, uint32_t offset)
 static void dump_clip_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "CLIP_STATE";
-   struct brw_clip_unit_state *clip = brw->batch.bo->virtual + offset;
+   struct brw_clip_unit_state *clip = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "thread0\n");
    batch_out(brw, name, offset, 1, "thread1\n");
@@ -170,7 +174,7 @@ static void dump_clip_state(struct brw_context *brw, uint32_t offset)
 static void dump_sf_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "SF_STATE";
-   struct brw_sf_unit_state *sf = brw->batch.bo->virtual + offset;
+   struct brw_sf_unit_state *sf = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "thread0\n");
    batch_out(brw, name, offset, 1, "thread1\n");
@@ -186,7 +190,7 @@ static void dump_sf_state(struct brw_context *brw, uint32_t offset)
 static void dump_wm_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "WM_STATE";
-   struct brw_wm_unit_state *wm = brw->batch.bo->virtual + offset;
+   struct brw_wm_unit_state *wm = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "thread0\n");
    batch_out(brw, name, offset, 1, "thread1\n");
@@ -213,7 +217,7 @@ static void dump_wm_state(struct brw_context *brw, uint32_t offset)
 static void dump_surface_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "SURF";
-   uint32_t *surf = brw->batch.bo->virtual + offset;
+   uint32_t *surf = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "%s %s\n",
 	     get_965_surfacetype(GET_FIELD(surf[0], BRW_SURFACE_TYPE)),
@@ -237,7 +241,7 @@ static void dump_surface_state(struct brw_context *brw, uint32_t offset)
 static void dump_gen7_surface_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "SURF";
-   uint32_t *surf = brw->batch.bo->virtual + offset;
+   uint32_t *surf = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "%s %s %s\n",
              get_965_surfacetype(GET_FIELD(surf[0], BRW_SURFACE_TYPE)),
@@ -276,7 +280,7 @@ static float q_to_float(uint32_t data, int integer_end, int integer_start,
 static void
 dump_gen8_surface_state(struct brw_context *brw, uint32_t offset, int index)
 {
-   uint32_t *surf = brw->batch.bo->virtual + offset;
+   uint32_t *surf = batch_in(brw, offset);
    int aux_mode = surf[6] & INTEL_MASK(2, 0);
    const char *aux_str;
    char *name;
@@ -350,8 +354,7 @@ dump_sdc(struct brw_context *brw, uint32_t offset)
    const char *name = "SDC";
 
    if (brw->gen >= 5 && brw->gen <= 6) {
-      struct gen5_sampler_default_color *sdc = (brw->batch.bo->virtual +
-                                                offset);
+      struct gen5_sampler_default_color *sdc = batch_in(brw, offset);
       batch_out(brw, name, offset, 0, "unorm rgba\n");
       batch_out(brw, name, offset, 1, "r %f\n", sdc->f[0]);
       batch_out(brw, name, offset, 2, "b %f\n", sdc->f[1]);
@@ -365,7 +368,7 @@ dump_sdc(struct brw_context *brw, uint32_t offset)
       batch_out(brw, name, offset, 10, "s16 ba\n");
       batch_out(brw, name, offset, 11, "s8 rgba\n");
    } else {
-      float *sdc = brw->batch.bo->virtual + offset;
+      float *sdc = batch_in(brw, offset);
       batch_out(brw, name, offset, 0, "r %f\n", sdc[0]);
       batch_out(brw, name, offset, 1, "g %f\n", sdc[1]);
       batch_out(brw, name, offset, 2, "b %f\n", sdc[2]);
@@ -377,7 +380,7 @@ static void dump_sampler_state(struct brw_context *brw,
 			       uint32_t offset, uint32_t size)
 {
    int i;
-   uint32_t *samp = brw->batch.bo->virtual + offset;
+   uint32_t *samp = batch_in(brw, offset);
 
    for (i = 0; i < size / 16; i++) {
       char name[20];
@@ -396,7 +399,7 @@ static void dump_sampler_state(struct brw_context *brw,
 static void gen7_dump_sampler_state(struct brw_context *brw,
                                     uint32_t offset, uint32_t size)
 {
-   const uint32_t *samp = brw->batch.bo->virtual + offset;
+   const uint32_t *samp = batch_in(brw, offset);
    char name[20];
 
    for (int i = 0; i < size / 16; i++) {
@@ -436,7 +439,7 @@ static void dump_sf_viewport_state(struct brw_context *brw,
 				   uint32_t offset)
 {
    const char *name = "SF VP";
-   struct brw_sf_viewport *vp = brw->batch.bo->virtual + offset;
+   struct brw_sf_viewport *vp = batch_in(brw, offset);
 
    assert(brw->gen < 7);
 
@@ -457,7 +460,7 @@ static void dump_clip_viewport_state(struct brw_context *brw,
 				     uint32_t offset)
 {
    const char *name = "CLIP VP";
-   struct brw_clipper_viewport *vp = brw->batch.bo->virtual + offset;
+   struct brw_clipper_viewport *vp = batch_in(brw, offset);
 
    assert(brw->gen < 7);
 
@@ -471,7 +474,7 @@ static void dump_sf_clip_viewport_state(struct brw_context *brw,
 					uint32_t offset)
 {
    const char *name = "SF_CLIP VP";
-   struct gen7_sf_clip_viewport *vp = brw->batch.bo->virtual + offset;
+   struct gen7_sf_clip_viewport *vp = batch_in(brw, offset);
 
    assert(brw->gen >= 7);
 
@@ -486,7 +489,7 @@ static void dump_sf_clip_viewport_state(struct brw_context *brw,
    batch_out(brw, name, offset, 9, "guardband ymin = %f\n", vp->guardband.ymin);
    batch_out(brw, name, offset, 10, "guardband ymax = %f\n", vp->guardband.ymax);
    if (brw->gen >= 8) {
-      float *cc_vp = brw->batch.bo->virtual + offset;
+      float *cc_vp = batch_in(brw, offset);
       batch_out(brw, name, offset, 12, "Min extents: %.2fx%.2f\n",
                 cc_vp[12], cc_vp[14]);
       batch_out(brw, name, offset, 14, "Max extents: %.2fx%.2f\n",
@@ -498,7 +501,7 @@ static void dump_sf_clip_viewport_state(struct brw_context *brw,
 static void dump_cc_viewport_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "CC VP";
-   struct brw_cc_viewport *vp = brw->batch.bo->virtual + offset;
+   struct brw_cc_viewport *vp = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "min_depth = %f\n", vp->min_depth);
    batch_out(brw, name, offset, 1, "max_depth = %f\n", vp->max_depth);
@@ -507,7 +510,7 @@ static void dump_cc_viewport_state(struct brw_context *brw, uint32_t offset)
 static void dump_depth_stencil_state(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "D_S";
-   struct gen6_depth_stencil_state *ds = brw->batch.bo->virtual + offset;
+   struct gen6_depth_stencil_state *ds = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0,
 	     "stencil %sable, func %d, write %sable\n",
@@ -541,7 +544,7 @@ static void dump_cc_state_gen4(struct brw_context *brw, uint32_t offset)
 static void dump_cc_state_gen6(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "CC";
-   struct gen6_color_calc_state *cc = brw->batch.bo->virtual + offset;
+   struct gen6_color_calc_state *cc = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0,
 	     "alpha test format %s, round disable %d, stencil ref %d, "
@@ -568,7 +571,7 @@ static void dump_blend_state(struct brw_context *brw, uint32_t offset)
 static void
 gen8_dump_blend_state(struct brw_context *brw, uint32_t offset, uint32_t size)
 {
-   const uint32_t *blend = brw->batch.bo->virtual + offset;
+   const uint32_t *blend = batch_in(brw, offset);
    const char *logicop[] =
    {
         "LOGICOP_CLEAR (BLACK)",
@@ -655,7 +658,7 @@ static void
 dump_scissor(struct brw_context *brw, uint32_t offset)
 {
    const char *name = "SCISSOR";
-   struct gen6_scissor_rect *scissor = brw->batch.bo->virtual + offset;
+   struct gen6_scissor_rect *scissor = batch_in(brw, offset);
 
    batch_out(brw, name, offset, 0, "xmin %d, ymin %d\n",
 	     scissor->xmin, scissor->ymin);
@@ -667,8 +670,8 @@ static void
 dump_vs_constants(struct brw_context *brw, uint32_t offset, uint32_t size)
 {
    const char *name = "VS_CONST";
-   uint32_t *as_uint = brw->batch.bo->virtual + offset;
-   float *as_float = brw->batch.bo->virtual + offset;
+   uint32_t *as_uint = batch_in(brw, offset);
+   float *as_float = batch_in(brw, offset);
    int i;
 
    for (i = 0; i < size / 4; i += 4) {
@@ -683,8 +686,8 @@ static void
 dump_wm_constants(struct brw_context *brw, uint32_t offset, uint32_t size)
 {
    const char *name = "WM_CONST";
-   uint32_t *as_uint = brw->batch.bo->virtual + offset;
-   float *as_float = brw->batch.bo->virtual + offset;
+   uint32_t *as_uint = batch_in(brw, offset);
+   float *as_float = batch_in(brw, offset);
    int i;
 
    for (i = 0; i < size / 4; i += 4) {
@@ -700,7 +703,7 @@ static void dump_binding_table(struct brw_context *brw, uint32_t offset,
 {
    char name[20];
    int i;
-   uint32_t *data = brw->batch.bo->virtual + offset;
+   uint32_t *data = batch_in(brw, offset);
 
    for (i = 0; i < size / 4; i++) {
       if (data[i] == 0)
@@ -717,8 +720,6 @@ dump_prog_cache(struct brw_context *brw)
    struct brw_cache *cache = &brw->cache;
    unsigned int b;
 
-   drm_intel_bo_map(brw->cache.bo, false);
-
    for (b = 0; b < cache->size; b++) {
       struct brw_cache_item *item;
 
@@ -753,12 +754,11 @@ dump_prog_cache(struct brw_context *brw)
 	 }
 
          fprintf(stderr, "%s:\n", name);
-         brw_disassemble(brw->intelScreen->devinfo, brw->cache.bo->virtual,
+         brw_disassemble(brw->intelScreen->devinfo,
+			 brw_bo_map(brw->cache.bo, MAP_READ | MAP_ASYNC),
                          item->offset, item->size, stderr);
       }
    }
-
-   drm_intel_bo_unmap(brw->cache.bo);
 }
 
 static void
@@ -861,12 +861,11 @@ dump_state_batch(struct brw_context *brw)
  * The buffer offsets printed rely on the buffer containing the last offset
  * it was validated at.
  */
-void brw_debug_batch(struct brw_context *brw)
+void brw_debug_batch(struct brw_batch *batch)
 {
-   drm_intel_bo_map(brw->batch.bo, false);
-   dump_state_batch(brw);
-   drm_intel_bo_unmap(brw->batch.bo);
+   struct brw_context *brw = container_of(batch, brw, batch);
 
+   dump_state_batch(brw);
    if (0)
       dump_prog_cache(brw);
 }
diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c
index 7662c3b..285c6c1 100644
--- a/src/mesa/drivers/dri/i965/brw_state_upload.c
+++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
@@ -34,8 +34,8 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "drivers/common/meta.h"
-#include "intel_batchbuffer.h"
 #include "intel_buffers.h"
+#include "intel_reg.h"
 #include "brw_vs.h"
 #include "brw_ff_gs.h"
 #include "brw_gs.h"
@@ -338,15 +338,21 @@ static const struct brw_tracked_state *gen8_compute_atoms[] =
    &brw_cs_state,
 };
 
-static void
+static int
 brw_upload_initial_gpu_state(struct brw_context *brw)
 {
+   int ret;
+
    /* On platforms with hardware contexts, we can set our initial GPU state
     * right away rather than doing it via state atoms.  This saves a small
     * amount of overhead on every draw call.
     */
-   if (!brw->hw_ctx)
-      return;
+   if (!brw->batch.hw_ctx)
+      return 0;
+
+   ret = brw_batch_begin(&brw->batch, 200, RENDER_RING);
+   if (ret < 0)
+      return ret;
 
    if (brw->gen == 6)
       brw_emit_post_sync_nonzero_flush(brw);
@@ -366,6 +372,8 @@ brw_upload_initial_gpu_state(struct brw_context *brw)
    if (brw->gen >= 8) {
       gen8_emit_3dstate_sample_pattern(brw);
    }
+
+   return brw_batch_end(&brw->batch);
 }
 
 static inline const struct brw_tracked_state *
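
The conversion above is the pattern every caller that emits commands now
follows: reserve a dword estimate with brw_batch_begin(), check it, emit, and
return the result of brw_batch_end(). A minimal sketch under those assumptions
(the negative-return error convention is taken from the code above):

static int
emit_some_state(struct brw_context *brw)
{
   int ret = brw_batch_begin(&brw->batch, 200, RENDER_RING);
   if (ret < 0)
      return ret;      /* nothing was emitted; safe to propagate */

   /* ... emit at most the reserved number of dwords ... */

   return brw_batch_end(&brw->batch);   /* may flush and submit */
}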
diff --git a/src/mesa/drivers/dri/i965/brw_structs.h b/src/mesa/drivers/dri/i965/brw_structs.h
index 55338c0..e167254 100644
--- a/src/mesa/drivers/dri/i965/brw_structs.h
+++ b/src/mesa/drivers/dri/i965/brw_structs.h
@@ -391,13 +391,16 @@ struct brw_sf_unit_state
       unsigned pad3:1;
    } thread4;
 
-   struct
+   union
    {
-      unsigned front_winding:1;
-      unsigned viewport_transform:1;
-      unsigned pad0:3;
-      unsigned sf_viewport_state_offset:27; /* Offset from GENERAL_STATE_BASE */
-   } sf5;
+      struct {
+	 unsigned front_winding:1;
+	 unsigned viewport_transform:1;
+	 unsigned pad0:3;
+	 unsigned sf_viewport_state_offset:27; /* Offset from GENERAL_STATE_BASE */
+      } sf5;
+      uint32_t dw5;
+   };
 
    struct
    {
@@ -525,15 +528,17 @@ struct brw_wm_unit_state
    struct thread2 thread2;
    struct thread3 thread3;
 
-   struct {
-      unsigned stats_enable:1;
-      unsigned depth_buffer_clear:1;
-      unsigned sampler_count:3;
-      unsigned sampler_state_pointer:27;
-   } wm4;
+   union {
+      struct {
+	 unsigned stats_enable:1;
+	 unsigned depth_buffer_clear:1;
+	 unsigned sampler_count:3;
+	 unsigned sampler_state_pointer:27;
+      } wm4;
+      uint32_t dw4;
+   };
 
-   struct
-   {
+   struct {
       unsigned enable_8_pix:1;
       unsigned enable_16_pix:1;
       unsigned enable_32_pix:1;
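
The point of the new dw4/dw5 unions: brw_batch_reloc() wants the complete
packed dword as its delta, and the union lets callers read the already-set
bitfields back as one uint32_t instead of re-assembling them by hand (compare
the wm4 hunk in brw_wm_state.c below). A hypothetical illustration, with
placeholder field values:

static uint32_t
pack_wm4_delta(uint32_t sampler_offset)
{
   struct brw_wm_unit_state wm = { 0 };

   wm.wm4.stats_enable = 1;             /* placeholder flag values */
   wm.wm4.sampler_count = 2;
   /* dw4 aliases the bitfields above, so the flag bits ride along with
    * the relocation delta in a single expression:
    */
   return sampler_offset | wm.dw4;
}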
diff --git a/src/mesa/drivers/dri/i965/brw_urb.c b/src/mesa/drivers/dri/i965/brw_urb.c
index 6fcf1b0..fa79fba 100644
--- a/src/mesa/drivers/dri/i965/brw_urb.c
+++ b/src/mesa/drivers/dri/i965/brw_urb.c
@@ -31,11 +31,12 @@
 
 
 
-#include "intel_batchbuffer.h"
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
 
+#include "intel_reg.h"
+
 #define VS 0
 #define GS 1
 #define CLP 2
@@ -249,10 +250,10 @@ void brw_upload_urb_fence(struct brw_context *brw)
    uf.bits1.cs_fence  = brw->urb.size;
 
    /* erratum: URB_FENCE must not cross a 64byte cacheline */
-   if ((brw->batch.used & 15) > 12) {
-      int pad = 16 - (brw->batch.used & 15);
+   if ((brw->batch.emit.nbatch & 15) > 12) {
+      int pad = 16 - (brw->batch.emit.nbatch & 15);
       do
-	 brw->batch.map[brw->batch.used++] = MI_NOOP;
+	 brw_batch_emit(&brw->batch, MI_NOOP);
       while (--pad);
    }
 
diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index a5c686c..8902729 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -1868,8 +1868,7 @@ brw_vs_emit(struct brw_context *brw,
    const unsigned *assembly = NULL;
 
    if (unlikely(brw->perf_debug)) {
-      start_busy = (brw->batch.last_bo &&
-                    drm_intel_bo_busy(brw->batch.last_bo));
+      start_busy = brw_batch_busy(&brw->batch);
       start_time = get_time();
    }
 
@@ -1965,7 +1964,7 @@ brw_vs_emit(struct brw_context *brw,
       if (shader->compiled_once) {
          brw_vs_debug_recompile(brw, prog, &c->key);
       }
-      if (start_busy && !drm_intel_bo_busy(brw->batch.last_bo)) {
+      if (start_busy && !brw_batch_busy(&brw->batch)) {
          perf_debug("VS compile took %.03f ms and stalled the GPU\n",
                     (get_time() - start_time) * 1000);
       }
diff --git a/src/mesa/drivers/dri/i965/brw_vs_state.c b/src/mesa/drivers/dri/i965/brw_vs_state.c
index b9b97a7..59d8c8c 100644
--- a/src/mesa/drivers/dri/i965/brw_vs_state.c
+++ b/src/mesa/drivers/dri/i965/brw_vs_state.c
@@ -80,10 +80,16 @@ brw_upload_vs_unit(struct brw_context *brw)
       brw->vs.prog_data->base.base.binding_table.size_bytes / 4;
 
    if (brw->vs.prog_data->base.base.total_scratch != 0) {
-      vs->thread2.scratch_space_base_pointer =
-	 stage_state->scratch_bo->offset64 >> 10; /* reloc */
       vs->thread2.per_thread_scratch_space =
 	 ffs(brw->vs.prog_data->base.base.total_scratch) - 11;
+
+      vs->thread2.scratch_space_base_pointer =
+	 brw_batch_reloc(&brw->batch,
+			 stage_state->state_offset + offsetof(struct brw_vs_unit_state, thread2),
+			 stage_state->scratch_bo,
+			 vs->thread2.per_thread_scratch_space,
+			 I915_GEM_DOMAIN_RENDER,
+			 I915_GEM_DOMAIN_RENDER) >> 10;
    } else {
       vs->thread2.scratch_space_base_pointer = 0;
       vs->thread2.per_thread_scratch_space = 0;
@@ -158,24 +164,11 @@ brw_upload_vs_unit(struct brw_context *brw)
    if (stage_state->sampler_count) {
       /* BRW_NEW_SAMPLER_STATE_TABLE - reloc */
       vs->vs5.sampler_state_pointer =
-         (brw->batch.bo->offset64 + stage_state->sampler_offset) >> 5;
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-                              stage_state->state_offset +
-                              offsetof(struct brw_vs_unit_state, vs5),
-                              brw->batch.bo,
-                              (stage_state->sampler_offset |
-                               vs->vs5.sampler_count),
-                              I915_GEM_DOMAIN_INSTRUCTION, 0);
-   }
-
-   /* Emit scratch space relocation */
-   if (brw->vs.prog_data->base.base.total_scratch != 0) {
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-			      stage_state->state_offset +
-			      offsetof(struct brw_vs_unit_state, thread2),
-			      stage_state->scratch_bo,
-			      vs->thread2.per_thread_scratch_space,
-			      I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
+	 brw_batch_reloc(&brw->batch,
+			 stage_state->state_offset + offsetof(struct brw_vs_unit_state, vs5),
+			 brw->batch.bo,
+			 (stage_state->sampler_offset | vs->vs5.sampler_count),
+			 I915_GEM_DOMAIN_INSTRUCTION, 0) >> 5;
    }
 
    brw->ctx.NewDriverState |= BRW_NEW_GEN4_UNIT_STATE;
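
This file is the template for the reloc conversions in the rest of the
series: brw_batch_reloc() records the relocation and returns the presumed
address of the target (its last-known offset plus the delta), so the shifted
value can be stored straight into the unit state. Its shape as inferred from
the call sites; the return type is an assumption based on the 64-bit offsets
it replaces:

uint64_t brw_batch_reloc(struct brw_batch *batch,
                         uint32_t batch_offset,   /* where the value lands */
                         struct brw_bo *target_bo,
                         uint64_t target_offset,  /* delta within target_bo */
                         unsigned read_domains,
                         unsigned write_domain);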
diff --git a/src/mesa/drivers/dri/i965/brw_vs_surface_state.c b/src/mesa/drivers/dri/i965/brw_vs_surface_state.c
index b2f91bd..b6e0ad8 100644
--- a/src/mesa/drivers/dri/i965/brw_vs_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_vs_surface_state.c
@@ -74,7 +74,7 @@ brw_upload_pull_constants(struct brw_context *brw,
 
    /* BRW_NEW_*_PROG_DATA | _NEW_PROGRAM_CONSTANTS */
    uint32_t size = prog_data->nr_pull_params * 4;
-   drm_intel_bo *const_bo = NULL;
+   struct brw_bo *const_bo = NULL;
    uint32_t const_offset;
    gl_constant_value *constants = intel_upload_space(brw, size, 64,
                                                      &const_bo, &const_offset);
@@ -96,7 +96,7 @@ brw_upload_pull_constants(struct brw_context *brw,
    brw_create_constant_surface(brw, const_bo, const_offset, size,
                                &stage_state->surf_offset[surf_index],
                                dword_pitch);
-   drm_intel_bo_unreference(const_bo);
+   brw_bo_put(const_bo);
 
    brw->ctx.NewDriverState |= brw_new_constbuf;
 }
diff --git a/src/mesa/drivers/dri/i965/brw_wm_state.c b/src/mesa/drivers/dri/i965/brw_wm_state.c
index 0cd4390..1195f55 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_state.c
@@ -140,10 +140,15 @@ brw_upload_wm_unit(struct brw_context *brw)
       prog_data->base.binding_table.size_bytes / 4;
 
    if (prog_data->base.total_scratch != 0) {
-      wm->thread2.scratch_space_base_pointer =
-	 brw->wm.base.scratch_bo->offset64 >> 10; /* reloc */
       wm->thread2.per_thread_scratch_space =
 	 ffs(prog_data->base.total_scratch) - 11;
+
+      wm->thread2.scratch_space_base_pointer =
+	 brw_batch_reloc(&brw->batch,
+			 brw->wm.base.state_offset + offsetof(struct brw_wm_unit_state, thread2),
+			 brw->wm.base.scratch_bo,
+			 wm->thread2.per_thread_scratch_space,
+			 I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER) >> 10;
    } else {
       wm->thread2.scratch_space_base_pointer = 0;
       wm->thread2.per_thread_scratch_space = 0;
@@ -167,8 +172,12 @@ brw_upload_wm_unit(struct brw_context *brw)
 
    if (brw->wm.base.sampler_count) {
       /* BRW_NEW_SAMPLER_STATE_TABLE - reloc */
-      wm->wm4.sampler_state_pointer = (brw->batch.bo->offset64 +
-				       brw->wm.base.sampler_offset) >> 5;
+      wm->wm4.sampler_state_pointer =
+	 brw_batch_reloc(&brw->batch,
+			 brw->wm.base.state_offset + offsetof(struct brw_wm_unit_state, wm4),
+			 brw->batch.bo,
+			 brw->wm.base.sampler_offset | wm->dw4,
+			 I915_GEM_DOMAIN_INSTRUCTION, 0) >> 5;
    } else {
       wm->wm4.sampler_state_pointer = 0;
    }
@@ -229,27 +238,6 @@ brw_upload_wm_unit(struct brw_context *brw)
    if (unlikely(INTEL_DEBUG & DEBUG_STATS) || brw->stats_wm)
       wm->wm4.stats_enable = 1;
 
-   /* Emit scratch space relocation */
-   if (prog_data->base.total_scratch != 0) {
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-			      brw->wm.base.state_offset +
-			      offsetof(struct brw_wm_unit_state, thread2),
-			      brw->wm.base.scratch_bo,
-			      wm->thread2.per_thread_scratch_space,
-			      I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
-   }
-
-   /* Emit sampler state relocation */
-   if (brw->wm.base.sampler_count != 0) {
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-			      brw->wm.base.state_offset +
-			      offsetof(struct brw_wm_unit_state, wm4),
-			      brw->batch.bo, (brw->wm.base.sampler_offset |
-                                              wm->wm4.stats_enable |
-                                              (wm->wm4.sampler_count << 2)),
-			      I915_GEM_DOMAIN_INSTRUCTION, 0);
-   }
-
    brw->ctx.NewDriverState |= BRW_NEW_GEN4_UNIT_STATE;
 }
 
diff --git a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
index 72aad96..f488557 100644
--- a/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
@@ -38,7 +38,6 @@
 #include "main/framebuffer.h"
 
 #include "intel_mipmap_tree.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_fbo.h"
 #include "intel_buffer_objects.h"
@@ -243,7 +242,7 @@ brw_get_texture_swizzle(const struct gl_context *ctx,
 static void
 gen4_emit_buffer_surface_state(struct brw_context *brw,
                                uint32_t *out_offset,
-                               drm_intel_bo *bo,
+                               struct brw_bo *bo,
                                unsigned buffer_offset,
                                unsigned surface_format,
                                unsigned buffer_size,
@@ -257,7 +256,10 @@ gen4_emit_buffer_surface_state(struct brw_context *brw,
    surf[0] = BRW_SURFACE_BUFFER << BRW_SURFACE_TYPE_SHIFT |
              surface_format << BRW_SURFACE_FORMAT_SHIFT |
              (brw->gen >= 6 ? BRW_SURFACE_RC_READ_WRITE : 0);
-   surf[1] = (bo ? bo->offset64 : 0) + buffer_offset; /* reloc */
+   surf[1] = brw_batch_reloc(&brw->batch, *out_offset + 4,
+                             bo, buffer_offset,
+                             I915_GEM_DOMAIN_SAMPLER,
+                             (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
    surf[2] = (buffer_size & 0x7f) << BRW_SURFACE_WIDTH_SHIFT |
              ((buffer_size >> 7) & 0x1fff) << BRW_SURFACE_HEIGHT_SHIFT;
    surf[3] = ((buffer_size >> 20) & 0x7f) << BRW_SURFACE_DEPTH_SHIFT |
@@ -268,10 +270,6 @@ gen4_emit_buffer_surface_state(struct brw_context *brw,
     * physical cache.  It is mapped in hardware to the sampler cache."
     */
    if (bo) {
-      drm_intel_bo_emit_reloc(brw->batch.bo, *out_offset + 4,
-                              bo, buffer_offset,
-                              I915_GEM_DOMAIN_SAMPLER,
-                              (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
    }
 }
 
@@ -285,7 +283,7 @@ brw_update_buffer_texture_surface(struct gl_context *ctx,
    struct intel_buffer_object *intel_obj =
       intel_buffer_object(tObj->BufferObject);
    uint32_t size = tObj->BufferSize;
-   drm_intel_bo *bo = NULL;
+   struct brw_bo *bo = NULL;
    mesa_format format = tObj->_BufferObjectFormat;
    uint32_t brw_format = brw_format_for_mesa_format(format);
    int texel_size = _mesa_get_format_bytes(format);
@@ -367,7 +365,11 @@ brw_update_texture_surface(struct gl_context *ctx,
 	      BRW_SURFACE_CUBEFACE_ENABLES |
 	      tex_format << BRW_SURFACE_FORMAT_SHIFT);
 
-   surf[1] = mt->bo->offset64 + mt->offset; /* reloc */
+   surf[1] = brw_batch_reloc(&brw->batch,
+			     *surf_offset + 4,
+			     mt->bo,
+			     mt->offset,
+			     I915_GEM_DOMAIN_SAMPLER, 0);
 
    surf[2] = ((intelObj->_MaxLevel - tObj->BaseLevel) << BRW_SURFACE_LOD_SHIFT |
 	      (mt->logical_width0 - 1) << BRW_SURFACE_WIDTH_SHIFT |
@@ -381,13 +383,6 @@ brw_update_texture_surface(struct gl_context *ctx,
               SET_FIELD(tObj->BaseLevel - mt->first_level, BRW_SURFACE_MIN_LOD));
 
    surf[5] = mt->align_h == 4 ? BRW_SURFACE_VERTICAL_ALIGN_ENABLE : 0;
-
-   /* Emit relocation to surface contents */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           *surf_offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           I915_GEM_DOMAIN_SAMPLER, 0);
 }
 
 /**
@@ -396,7 +391,7 @@ brw_update_texture_surface(struct gl_context *ctx,
  */
 void
 brw_create_constant_surface(struct brw_context *brw,
-			    drm_intel_bo *bo,
+			    struct brw_bo *bo,
 			    uint32_t offset,
 			    uint32_t size,
 			    uint32_t *out_offset,
@@ -424,7 +419,7 @@ brw_update_sol_surface(struct brw_context *brw,
 {
    struct intel_buffer_object *intel_bo = intel_buffer_object(buffer_obj);
    uint32_t offset_bytes = 4 * offset_dwords;
-   drm_intel_bo *bo = intel_bufferobj_buffer(brw, intel_bo,
+   struct brw_bo *bo = intel_bufferobj_buffer(brw, intel_bo,
                                              offset_bytes,
                                              buffer_obj->Size - offset_bytes);
    uint32_t *surf = brw_state_batch(brw, AUB_TRACE_SURFACE_STATE, 6 * 4, 32,
@@ -480,19 +475,17 @@ brw_update_sol_surface(struct brw_context *brw,
       BRW_SURFACE_MIPMAPLAYOUT_BELOW << BRW_SURFACE_MIPLAYOUT_SHIFT |
       surface_format << BRW_SURFACE_FORMAT_SHIFT |
       BRW_SURFACE_RC_READ_WRITE;
-   surf[1] = bo->offset64 + offset_bytes; /* reloc */
+   surf[1] = brw_batch_reloc(&brw->batch,
+			     *out_offset + 4,
+			     bo, offset_bytes,
+			     I915_GEM_DOMAIN_RENDER,
+			     I915_GEM_DOMAIN_RENDER);
    surf[2] = (width << BRW_SURFACE_WIDTH_SHIFT |
 	      height << BRW_SURFACE_HEIGHT_SHIFT);
    surf[3] = (depth << BRW_SURFACE_DEPTH_SHIFT |
               pitch_minus_1 << BRW_SURFACE_PITCH_SHIFT);
    surf[4] = 0;
    surf[5] = 0;
-
-   /* Emit relocation to surface contents. */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-			   *out_offset + 4,
-			   bo, offset_bytes,
-			   I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
 }
 
 /* Creates a new WM constant buffer reflecting the current fragment program's
@@ -560,7 +553,7 @@ brw_emit_null_surface_state(struct brw_context *brw,
     *     - Surface Format must be R8G8B8A8_UNORM.
     */
    unsigned surface_type = BRW_SURFACE_NULL;
-   drm_intel_bo *bo = NULL;
+   struct brw_bo *bo = NULL;
    unsigned pitch_minus_1 = 0;
    uint32_t multisampling_state = 0;
    uint32_t *surf = brw_state_batch(brw, AUB_TRACE_SURFACE_STATE, 6 * 4, 32,
@@ -600,7 +593,10 @@ brw_emit_null_surface_state(struct brw_context *brw,
 		  1 << BRW_SURFACE_WRITEDISABLE_B_SHIFT |
 		  1 << BRW_SURFACE_WRITEDISABLE_A_SHIFT);
    }
-   surf[1] = bo ? bo->offset64 : 0;
+   surf[1] = brw_batch_reloc(&brw->batch, *out_offset + 4,
+                             bo, 0,
+                             I915_GEM_DOMAIN_RENDER,
+                             I915_GEM_DOMAIN_RENDER);
    surf[2] = ((width - 1) << BRW_SURFACE_WIDTH_SHIFT |
               (height - 1) << BRW_SURFACE_HEIGHT_SHIFT);
 
@@ -613,13 +609,6 @@ brw_emit_null_surface_state(struct brw_context *brw,
               pitch_minus_1 << BRW_SURFACE_PITCH_SHIFT);
    surf[4] = multisampling_state;
    surf[5] = 0;
-
-   if (bo) {
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-                              *out_offset + 4,
-                              bo, 0,
-                              I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
-   }
 }
 
 /**
@@ -676,8 +665,12 @@ brw_update_renderbuffer_surface(struct brw_context *brw,
 
    /* reloc */
    assert(mt->offset % mt->cpp == 0);
-   surf[1] = (intel_renderbuffer_get_tile_offsets(irb, &tile_x, &tile_y) +
-	      mt->bo->offset64 + mt->offset);
+   surf[1] = brw_batch_reloc(&brw->batch, offset + 4,
+			     mt->bo,
+			     mt->offset +
+			     intel_renderbuffer_get_tile_offsets(irb, &tile_x, &tile_y),
+			     I915_GEM_DOMAIN_RENDER,
+			     I915_GEM_DOMAIN_RENDER);
 
    surf[2] = ((rb->Width - 1) << BRW_SURFACE_WIDTH_SHIFT |
 	      (rb->Height - 1) << BRW_SURFACE_HEIGHT_SHIFT);
@@ -719,13 +712,6 @@ brw_update_renderbuffer_surface(struct brw_context *brw,
       }
    }
 
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           I915_GEM_DOMAIN_RENDER,
-                           I915_GEM_DOMAIN_RENDER);
-
    return offset;
 }
 
@@ -904,7 +890,7 @@ brw_upload_ubo_surfaces(struct brw_context *brw,
 
       binding = &ctx->UniformBufferBindings[shader->UniformBlocks[i].Binding];
       intel_bo = intel_buffer_object(binding->BufferObject);
-      drm_intel_bo *bo =
+      struct brw_bo *bo =
          intel_bufferobj_buffer(brw, intel_bo,
                                 binding->Offset,
                                 binding->BufferObject->Size - binding->Offset);
@@ -963,7 +949,7 @@ brw_upload_abo_surfaces(struct brw_context *brw,
          &ctx->AtomicBufferBindings[prog->AtomicBuffers[i].Binding];
       struct intel_buffer_object *intel_bo =
          intel_buffer_object(binding->BufferObject);
-      drm_intel_bo *bo = intel_bufferobj_buffer(
+      struct brw_bo *bo = intel_bufferobj_buffer(
          brw, intel_bo, binding->Offset, intel_bo->Base.Size - binding->Offset);
 
       brw->vtbl.emit_buffer_surface_state(brw, &surf_offsets[i], bo,
diff --git a/src/mesa/drivers/dri/i965/gen6_blorp.cpp b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
index 54c4a6d..6aa772c 100644
--- a/src/mesa/drivers/dri/i965/gen6_blorp.cpp
+++ b/src/mesa/drivers/dri/i965/gen6_blorp.cpp
@@ -23,7 +23,6 @@
 
 #include <assert.h>
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 
@@ -386,9 +385,11 @@
               BRW_SURFACE_CUBEFACE_ENABLES |
               surface->brw_surfaceformat << BRW_SURFACE_FORMAT_SHIFT);
 
-   /* reloc */
-   surf[1] = (surface->compute_tile_offsets(&tile_x, &tile_y) +
-              mt->bo->offset64);
+   surf[1] = brw_batch_reloc(&brw->batch,
+			     wm_surf_offset + 4,
+			     mt->bo,
+			     surface->compute_tile_offsets(&tile_x, &tile_y),
+			     read_domains, write_domain);
 
    surf[2] = (0 << BRW_SURFACE_LOD_SHIFT |
               (width - 1) << BRW_SURFACE_WIDTH_SHIFT |
@@ -416,13 +418,6 @@ gen6_blorp_emit_surface_state(struct brw_context *brw,
               (surface->mt->align_h == 4 ?
                BRW_SURFACE_VERTICAL_ALIGN_ENABLE : 0));
 
-   /* Emit relocation to surface contents */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           wm_surf_offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           read_domains, write_domain);
-
    return wm_surf_offset;
 }
 
diff --git a/src/mesa/drivers/dri/i965/gen6_cc.c b/src/mesa/drivers/dri/i965/gen6_cc.c
index 2b76e24..2fbb075 100644
--- a/src/mesa/drivers/dri/i965/gen6_cc.c
+++ b/src/mesa/drivers/dri/i965/gen6_cc.c
@@ -29,7 +29,6 @@
 #include "brw_state.h"
 #include "brw_defines.h"
 #include "brw_util.h"
-#include "intel_batchbuffer.h"
 #include "main/macros.h"
 #include "main/enums.h"
 #include "main/glformats.h"
diff --git a/src/mesa/drivers/dri/i965/gen6_clip_state.c b/src/mesa/drivers/dri/i965/gen6_clip_state.c
index 9a29366..d7fe872 100644
--- a/src/mesa/drivers/dri/i965/gen6_clip_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_clip_state.c
@@ -29,7 +29,6 @@
 #include "brw_state.h"
 #include "brw_defines.h"
 #include "brw_util.h"
-#include "intel_batchbuffer.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
 
diff --git a/src/mesa/drivers/dri/i965/gen6_depth_state.c b/src/mesa/drivers/dri/i965/gen6_depth_state.c
index 8f0d7dc..b1a9dd1 100644
--- a/src/mesa/drivers/dri/i965/gen6_depth_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_depth_state.c
@@ -22,7 +22,6 @@
  */
 
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 
diff --git a/src/mesa/drivers/dri/i965/gen6_depthstencil.c b/src/mesa/drivers/dri/i965/gen6_depthstencil.c
index 2c625c9..ed731c5 100644
--- a/src/mesa/drivers/dri/i965/gen6_depthstencil.c
+++ b/src/mesa/drivers/dri/i965/gen6_depthstencil.c
@@ -25,7 +25,6 @@
  *
  */
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "brw_context.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen6_gs_state.c b/src/mesa/drivers/dri/i965/gen6_gs_state.c
index eb4c586..3d4bb68 100644
--- a/src/mesa/drivers/dri/i965/gen6_gs_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_gs_state.c
@@ -28,7 +28,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 gen6_upload_gs_push_constants(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen6_multisample_state.c b/src/mesa/drivers/dri/i965/gen6_multisample_state.c
index 36734f5..878c13a 100644
--- a/src/mesa/drivers/dri/i965/gen6_multisample_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_multisample_state.c
@@ -21,7 +21,6 @@
  * IN THE SOFTWARE.
  */
 
-#include "intel_batchbuffer.h"
 
 #include "brw_context.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen6_queryobj.c b/src/mesa/drivers/dri/i965/gen6_queryobj.c
index 9f4a5db..120aedb 100644
--- a/src/mesa/drivers/dri/i965/gen6_queryobj.c
+++ b/src/mesa/drivers/dri/i965/gen6_queryobj.c
@@ -36,7 +36,6 @@
 #include "brw_context.h"
 #include "brw_defines.h"
 #include "brw_state.h"
-#include "intel_batchbuffer.h"
 #include "intel_reg.h"
 
 /*
@@ -50,7 +49,7 @@
  */
 void
 brw_store_register_mem64(struct brw_context *brw,
-                         drm_intel_bo *bo, uint32_t reg, int idx)
+                         struct brw_bo *bo, uint32_t reg, int idx)
 {
    assert(brw->gen >= 6);
 
@@ -84,7 +83,7 @@ brw_store_register_mem64(struct brw_context *brw,
 
 static void
 write_primitives_generated(struct brw_context *brw,
-                           drm_intel_bo *query_bo, int stream, int idx)
+                           struct brw_bo *query_bo, int stream, int idx)
 {
    brw_emit_mi_flush(brw);
 
@@ -98,7 +97,7 @@ write_primitives_generated(struct brw_context *brw,
 
 static void
 write_xfb_primitives_written(struct brw_context *brw,
-                             drm_intel_bo *bo, int stream, int idx)
+                             struct brw_bo *bo, int stream, int idx)
 {
    brw_emit_mi_flush(brw);
 
@@ -119,7 +118,7 @@ pipeline_target_to_index(int target)
 }
 
 static void
-emit_pipeline_stat(struct brw_context *brw, drm_intel_bo *bo,
+emit_pipeline_stat(struct brw_context *brw, struct brw_bo *bo,
                    int stream, int target, int idx)
 {
    /* One source of confusion is the tessellation shader statistics. The
@@ -175,8 +174,7 @@ gen6_queryobj_get_results(struct gl_context *ctx,
    if (query->bo == NULL)
       return;
 
-   brw_bo_map(brw, query->bo, false, "query object");
-   uint64_t *results = query->bo->virtual;
+   uint64_t *results = brw_bo_map(query->bo, MAP_READ);
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED:
       /* The query BO contains the starting and ending timestamps.
@@ -255,12 +253,11 @@ gen6_queryobj_get_results(struct gl_context *ctx,
    default:
       unreachable("Unrecognized query target in brw_queryobj_get_results()");
    }
-   drm_intel_bo_unmap(query->bo);
 
    /* Now that we've processed the data stored in the query's buffer object,
     * we can release it.
     */
-   drm_intel_bo_unreference(query->bo);
+   brw_bo_put(query->bo);
    query->bo = NULL;
 
    query->Base.Ready = true;
@@ -279,8 +276,11 @@ gen6_begin_query(struct gl_context *ctx, struct gl_query_object *q)
    struct brw_query_object *query = (struct brw_query_object *)q;
 
    /* Since we're starting a new query, we need to throw away old results. */
-   drm_intel_bo_unreference(query->bo);
-   query->bo = drm_intel_bo_alloc(brw->bufmgr, "query results", 4096, 4096);
+   brw_bo_put(query->bo);
+   query->bo = brw_bo_create(&brw->batch, "query results", 4096, 4096, 0);
+
+   if (brw_batch_begin(&brw->batch, 120, RENDER_RING) < 0)
+      return;
 
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED:
@@ -337,6 +337,8 @@ gen6_begin_query(struct gl_context *ctx, struct gl_query_object *q)
    default:
       unreachable("Unrecognized query target in brw_begin_query()");
    }
+
+   brw_batch_end(&brw->batch);
 }
 
 /**
@@ -353,6 +355,9 @@ gen6_end_query(struct gl_context *ctx, struct gl_query_object *q)
    struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
+   if (brw_batch_begin(&brw->batch, 120, RENDER_RING) < 0)
+      return;
+
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED:
       brw_write_timestamp(brw, query->bo, 1);
@@ -391,26 +396,7 @@ gen6_end_query(struct gl_context *ctx, struct gl_query_object *q)
       unreachable("Unrecognized query target in brw_end_query()");
    }
 
-   /* The current batch contains the commands to handle EndQuery(),
-    * but they won't actually execute until it is flushed.
-    */
-   query->flushed = false;
-}
-
-/**
- * Flush the batch if it still references the query object BO.
- */
-static void
-flush_batch_if_needed(struct brw_context *brw, struct brw_query_object *query)
-{
-   /* If the batch doesn't reference the BO, it must have been flushed
-    * (for example, due to being full).  Record that it's been flushed.
-    */
-   query->flushed = query->flushed ||
-      !drm_intel_bo_references(brw->batch.bo, query->bo);
-
-   if (!query->flushed)
-      intel_batchbuffer_flush(brw);
+   brw_batch_end(&brw->batch);
 }
 
 /**
@@ -421,15 +407,12 @@ flush_batch_if_needed(struct brw_context *brw, struct brw_query_object *query)
  */
 static void gen6_wait_query(struct gl_context *ctx, struct gl_query_object *q)
 {
-   struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
    /* If the application has requested the query result, but this batch is
     * still contributing to it, flush it now to finish that work so the
     * result will become available (eventually).
     */
-   flush_batch_if_needed(brw, query);
-
    gen6_queryobj_get_results(ctx, query);
 }
 
@@ -441,7 +424,6 @@ static void gen6_wait_query(struct gl_context *ctx, struct gl_query_object *q)
  */
 static void gen6_check_query(struct gl_context *ctx, struct gl_query_object *q)
 {
-   struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
    /* If query->bo is NULL, we've already gathered the results - this is a
@@ -457,9 +439,7 @@ static void gen6_check_query(struct gl_context *ctx, struct gl_query_object *q)
     *      not ready yet on the first time it is queried.  This ensures that
     *      the async query will return true in finite time.
     */
-   flush_batch_if_needed(brw, query);
-
-   if (!drm_intel_bo_busy(query->bo)) {
+   if (!brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH)) {
       gen6_queryobj_get_results(ctx, query);
    }
 }
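
Worth calling out: BUSY_FLUSH appears to fold the removed
flush_batch_if_needed() into the busyness query itself, so a bo that is only
"busy" because the unsubmitted local batch still references it gets flushed
as part of the check, preserving the finite-time guarantee in the retained
comment. A polling sketch under that assumption:

static bool
query_result_ready(struct brw_query_object *query)
{
   /* Flushes the local batch if it still references query->bo, then
    * reports whether the GPU is done writing it (inferred semantics).
    */
   return !brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH);
}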
diff --git a/src/mesa/drivers/dri/i965/gen6_sampler_state.c b/src/mesa/drivers/dri/i965/gen6_sampler_state.c
index 9e7da58..ecc6b21 100644
--- a/src/mesa/drivers/dri/i965/gen6_sampler_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_sampler_state.c
@@ -28,7 +28,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_sampler_state_pointers(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen6_scissor_state.c b/src/mesa/drivers/dri/i965/gen6_scissor_state.c
index 17b4a7f..4a343f4 100644
--- a/src/mesa/drivers/dri/i965/gen6_scissor_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_scissor_state.c
@@ -28,7 +28,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
 
diff --git a/src/mesa/drivers/dri/i965/gen6_sf_state.c b/src/mesa/drivers/dri/i965/gen6_sf_state.c
index b00517e..026fee3 100644
--- a/src/mesa/drivers/dri/i965/gen6_sf_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_sf_state.c
@@ -32,7 +32,6 @@
 #include "main/macros.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
-#include "intel_batchbuffer.h"
 
 /**
  * Determine the appropriate attribute override value to store into the
diff --git a/src/mesa/drivers/dri/i965/gen6_sol.c b/src/mesa/drivers/dri/i965/gen6_sol.c
index 3899ce9..39bf8e9 100644
--- a/src/mesa/drivers/dri/i965/gen6_sol.c
+++ b/src/mesa/drivers/dri/i965/gen6_sol.c
@@ -29,7 +29,6 @@
 #include "main/bufferobj.h"
 #include "main/macros.h"
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "brw_defines.h"
 #include "brw_state.h"
 #include "main/transformfeedback.h"
@@ -205,9 +204,9 @@ brw_new_transform_feedback(struct gl_context *ctx, GLuint name)
    _mesa_init_transform_feedback_object(&brw_obj->base, name);
 
    brw_obj->offset_bo =
-      drm_intel_bo_alloc(brw->bufmgr, "transform feedback offsets", 16, 64);
+      brw_bo_create(&brw->batch, "transform feedback offsets", 16, 64, 0);
    brw_obj->prim_count_bo =
-      drm_intel_bo_alloc(brw->bufmgr, "xfb primitive counts", 4096, 64);
+      brw_bo_create(&brw->batch, "xfb primitive counts", 4096, 64, 0);
 
    return &brw_obj->base;
 }
@@ -223,8 +222,8 @@ brw_delete_transform_feedback(struct gl_context *ctx,
       _mesa_reference_buffer_object(ctx, &obj->Buffers[i], NULL);
    }
 
-   drm_intel_bo_unreference(brw_obj->offset_bo);
-   drm_intel_bo_unreference(brw_obj->prim_count_bo);
+   brw_bo_put(brw_obj->offset_bo);
+   brw_bo_put(brw_obj->prim_count_bo);
 
    free(brw_obj);
 }
diff --git a/src/mesa/drivers/dri/i965/gen6_surface_state.c b/src/mesa/drivers/dri/i965/gen6_surface_state.c
index 03e913a..53abbba 100644
--- a/src/mesa/drivers/dri/i965/gen6_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_surface_state.c
@@ -30,7 +30,6 @@
 #include "program/prog_parameter.h"
 
 #include "intel_mipmap_tree.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_fbo.h"
 #include "intel_buffer_objects.h"
@@ -95,7 +94,10 @@ gen6_update_renderbuffer_surface(struct brw_context *brw,
 
    /* reloc */
    assert(mt->offset % mt->cpp == 0);
-   surf[1] = mt->bo->offset64 + mt->offset;
+   surf[1] = brw_batch_reloc(&brw->batch, offset + 4,
+			     mt->bo, mt->offset,
+			     I915_GEM_DOMAIN_RENDER,
+			     I915_GEM_DOMAIN_RENDER);
 
    /* In the gen6 PRM Volume 1 Part 1: Graphics Core, Section 7.18.3.7.1
     * (Surface Arrays For all surfaces other than separate stencil buffer):
@@ -127,13 +129,6 @@ gen6_update_renderbuffer_surface(struct brw_context *brw,
 
    surf[5] = (mt->align_h == 4 ? BRW_SURFACE_VERTICAL_ALIGN_ENABLE : 0);
 
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           I915_GEM_DOMAIN_RENDER,
-                           I915_GEM_DOMAIN_RENDER);
-
    return offset;
 }
 
diff --git a/src/mesa/drivers/dri/i965/gen6_urb.c b/src/mesa/drivers/dri/i965/gen6_urb.c
index c7311fd..78cb973 100644
--- a/src/mesa/drivers/dri/i965/gen6_urb.c
+++ b/src/mesa/drivers/dri/i965/gen6_urb.c
@@ -26,7 +26,6 @@
  */
 
 #include "main/macros.h"
-#include "intel_batchbuffer.h"
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen6_viewport_state.c b/src/mesa/drivers/dri/i965/gen6_viewport_state.c
index 7c8d884..2845f3b 100644
--- a/src/mesa/drivers/dri/i965/gen6_viewport_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_viewport_state.c
@@ -28,7 +28,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
 #include "main/viewport.h"
diff --git a/src/mesa/drivers/dri/i965/gen6_vs_state.c b/src/mesa/drivers/dri/i965/gen6_vs_state.c
index 35d10ef..89bb426 100644
--- a/src/mesa/drivers/dri/i965/gen6_vs_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_vs_state.c
@@ -29,9 +29,9 @@
 #include "brw_state.h"
 #include "brw_defines.h"
 #include "brw_util.h"
+#include "intel_reg.h"
 #include "program/prog_parameter.h"
 #include "program/prog_statevars.h"
-#include "intel_batchbuffer.h"
 #include "glsl/glsl_parser_extras.h"
 
 /**
diff --git a/src/mesa/drivers/dri/i965/gen6_wm_state.c b/src/mesa/drivers/dri/i965/gen6_wm_state.c
index d1748ba..e3b6054 100644
--- a/src/mesa/drivers/dri/i965/gen6_wm_state.c
+++ b/src/mesa/drivers/dri/i965/gen6_wm_state.c
@@ -34,7 +34,6 @@
 #include "program/prog_parameter.h"
 #include "program/prog_statevars.h"
 #include "main/framebuffer.h"
-#include "intel_batchbuffer.h"
 
 static void
 gen6_upload_wm_push_constants(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen7_blorp.cpp b/src/mesa/drivers/dri/i965/gen7_blorp.cpp
index abace6d..62d735e 100644
--- a/src/mesa/drivers/dri/i965/gen7_blorp.cpp
+++ b/src/mesa/drivers/dri/i965/gen7_blorp.cpp
@@ -23,7 +23,6 @@
 
 #include <assert.h>
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 
@@ -168,9 +167,11 @@ gen7_blorp_emit_surface_state(struct brw_context *brw,
    else
       surf[0] |= GEN7_SURFACE_ARYSPC_FULL;
 
-   /* reloc */
-   surf[1] =
-      surface->compute_tile_offsets(&tile_x, &tile_y) + mt->bo->offset64;
+   surf[1] = brw_batch_reloc(&brw->batch,
+			     wm_surf_offset + 4,
+			     mt->bo,
+			     surface->compute_tile_offsets(&tile_x, &tile_y),
+			     read_domains, write_domain);
 
    /* Note that the low bits of these fields are missing, so
     * there's the possibility of getting in trouble.
@@ -204,13 +205,6 @@ gen7_blorp_emit_surface_state(struct brw_context *brw,
                   SET_FIELD(HSW_SCS_ALPHA, GEN7_SURFACE_SCS_A));
    }
 
-   /* Emit relocation to surface contents */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           wm_surf_offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           read_domains, write_domain);
-
    gen7_check_surface_setup(surf, is_render_target);
 
    return wm_surf_offset;
diff --git a/src/mesa/drivers/dri/i965/gen7_disable.c b/src/mesa/drivers/dri/i965/gen7_disable.c
index 2c43cd7..6d0be45 100644
--- a/src/mesa/drivers/dri/i965/gen7_disable.c
+++ b/src/mesa/drivers/dri/i965/gen7_disable.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 disable_stages(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen7_gs_state.c b/src/mesa/drivers/dri/i965/gen7_gs_state.c
index 8d6d3fe..02add4c 100644
--- a/src/mesa/drivers/dri/i965/gen7_gs_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_gs_state.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_gs_state(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen7_misc_state.c b/src/mesa/drivers/dri/i965/gen7_misc_state.c
index a14d4a0..fb20b22 100644
--- a/src/mesa/drivers/dri/i965/gen7_misc_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_misc_state.c
@@ -22,7 +22,6 @@
  */
 
 #include "main/mtypes.h"
-#include "intel_batchbuffer.h"
 #include "intel_mipmap_tree.h"
 #include "intel_fbo.h"
 #include "brw_context.h"
@@ -53,7 +52,7 @@ gen7_emit_depth_stencil_hiz(struct brw_context *brw,
 
    /* Skip repeated NULL depth/stencil emits (think 2D rendering). */
    if (!mt && brw->no_depth_or_stencil) {
-      assert(brw->hw_ctx);
+      assert(brw->batch.hw_ctx);
       return;
    }
 
diff --git a/src/mesa/drivers/dri/i965/gen7_sf_state.c b/src/mesa/drivers/dri/i965/gen7_sf_state.c
index 4fa46a8..3742cb7 100644
--- a/src/mesa/drivers/dri/i965/gen7_sf_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_sf_state.c
@@ -28,7 +28,6 @@
 #include "main/macros.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_sbe_state(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen7_sol_state.c b/src/mesa/drivers/dri/i965/gen7_sol_state.c
index 41573a8..cabd77a 100644
--- a/src/mesa/drivers/dri/i965/gen7_sol_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_sol_state.c
@@ -31,8 +31,8 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
+#include "intel_reg.h"
 #include "main/transformfeedback.h"
 
 static void
@@ -52,7 +52,7 @@ upload_3dstate_so_buffers(struct brw_context *brw)
    for (i = 0; i < 4; i++) {
       struct intel_buffer_object *bufferobj =
 	 intel_buffer_object(xfb_obj->Buffers[i]);
-      drm_intel_bo *bo;
+      struct brw_bo *bo;
       uint32_t start, end;
       uint32_t stride;
 
@@ -314,14 +314,7 @@ gen7_tally_prims_generated(struct brw_context *brw,
    /* If the current batch is still contributing to the number of primitives
     * generated, flush it now so the results will be present when mapped.
     */
-   if (drm_intel_bo_references(brw->batch.bo, obj->prim_count_bo))
-      intel_batchbuffer_flush(brw);
-
-   if (unlikely(brw->perf_debug && drm_intel_bo_busy(obj->prim_count_bo)))
-      perf_debug("Stalling for # of transform feedback primitives written.\n");
-
-   drm_intel_bo_map(obj->prim_count_bo, false);
-   uint64_t *prim_counts = obj->prim_count_bo->virtual;
+   uint64_t *prim_counts = brw_bo_map(obj->prim_count_bo, MAP_READ);
 
    assert(obj->prim_count_buffer_index % (2 * BRW_MAX_XFB_STREAMS) == 0);
    int pairs = obj->prim_count_buffer_index / (2 * BRW_MAX_XFB_STREAMS);
@@ -334,8 +327,6 @@ gen7_tally_prims_generated(struct brw_context *brw,
       prim_counts += 2 * BRW_MAX_XFB_STREAMS; /* move to the next pair */
    }
 
-   drm_intel_bo_unmap(obj->prim_count_bo);
-
    /* We've already gathered up the old data; we can safely overwrite it now. */
    obj->prim_count_buffer_index = 0;
 }
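
(A note on the mapping change above: the explicit
flush-if-referenced / busy-warning / map / unmap sequence collapses
into a single brw_bo_map(). The sketch below spells out the contract
this relies on — assumed from usage in this patch, not a documented
API: any batch still writing the bo is flushed and waited upon inside
the call, and no explicit unmap is required afterwards.

   /* Flushes and stalls internally as needed (assumed semantics);
    * the returned pointer is read directly, with no unmap call.
    */
   uint64_t *prim_counts = brw_bo_map(obj->prim_count_bo, MAP_READ);
)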
@@ -446,9 +437,6 @@ gen7_begin_transform_feedback(struct gl_context *ctx, GLenum mode,
    /* Reset the SO buffer offsets to 0. */
    if (brw->gen >= 8) {
       brw_obj->zero_offsets = true;
-   } else {
-      intel_batchbuffer_flush(brw);
-      brw->batch.needs_sol_reset = true;
    }
 
    /* We're about to lose the information needed to compute the number of
@@ -457,14 +445,30 @@ gen7_begin_transform_feedback(struct gl_context *ctx, GLenum mode,
     */
    brw_compute_xfb_vertices_written(brw, brw_obj);
 
+   if (brw_batch_begin(&brw->batch, 300, RENDER_RING) < 0)
+      return;
+
    /* No primitives have been generated yet. */
    for (int i = 0; i < BRW_MAX_XFB_STREAMS; i++) {
       brw_obj->prims_generated[i] = 0;
    }
 
+   /* Reset the SOL buffer offset registers. */
+   if (brw->gen < 8) {
+      for (int i = 0; i < 4; i++) {
+         BEGIN_BATCH(3);
+         OUT_BATCH(MI_LOAD_REGISTER_IMM | (3 - 2));
+         OUT_BATCH(GEN7_SO_WRITE_OFFSET(i));
+         OUT_BATCH(0);
+         ADVANCE_BATCH();
+      }
+   }
+
    /* Store the starting value of the SO_NUM_PRIMS_WRITTEN counters. */
    gen7_save_primitives_written_counters(brw, brw_obj);
 
+   brw_batch_end(&brw->batch);
+
    brw_obj->primitive_mode = mode;
 }
 
@@ -482,9 +486,14 @@ gen7_end_transform_feedback(struct gl_context *ctx,
    struct brw_transform_feedback_object *brw_obj =
       (struct brw_transform_feedback_object *) obj;
 
+   if (brw_batch_begin(&brw->batch, 300, RENDER_RING) < 0)
+      return;
+
    /* Store the ending value of the SO_NUM_PRIMS_WRITTEN counters. */
    gen7_save_primitives_written_counters(brw, brw_obj);
 
+   brw_batch_end(&brw->batch);
+
    /* EndTransformFeedback() means that we need to update the number of
     * vertices written.  Since it's only necessary if DrawTransformFeedback()
     * is called and it means mapping a buffer object, we delay computing it
@@ -501,6 +510,9 @@ gen7_pause_transform_feedback(struct gl_context *ctx,
    struct brw_transform_feedback_object *brw_obj =
       (struct brw_transform_feedback_object *) obj;
 
+   if (brw_batch_begin(&brw->batch, 300, RENDER_RING) < 0)
+      return;
+
    /* Flush any drawing so that the counters have the right values. */
    brw_emit_mi_flush(brw);
 
@@ -523,6 +535,8 @@ gen7_pause_transform_feedback(struct gl_context *ctx,
     * from our counts.
     */
    gen7_save_primitives_written_counters(brw, brw_obj);
+
+   brw_batch_end(&brw->batch);
 }
 
 void
@@ -533,6 +547,9 @@ gen7_resume_transform_feedback(struct gl_context *ctx,
    struct brw_transform_feedback_object *brw_obj =
       (struct brw_transform_feedback_object *) obj;
 
+   if (brw_batch_begin(&brw->batch, 300, RENDER_RING) < 0)
+      return;
+
    /* Reload the SOL buffer offset registers. */
    if (brw->gen < 8) {
       for (int i = 0; i < 4; i++) {
@@ -548,4 +565,6 @@ gen7_resume_transform_feedback(struct gl_context *ctx,
 
    /* Store the new starting value of the SO_NUM_PRIMS_WRITTEN counters. */
    gen7_save_primitives_written_counters(brw, brw_obj);
+
+   brw_batch_end(&brw->batch);
 }
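
(All of the emit paths in this file now follow the bracketing the
batch manager requires: brw_batch_begin() reserves a conservative
upper bound of dwords on a given ring — 300 here — and may fail, and
brw_batch_end() submits the accumulated commands. A condensed sketch
of the pattern, using only calls that appear above:

   if (brw_batch_begin(&brw->batch, 300, RENDER_RING) < 0)
      return; /* reservation failed; nothing was emitted */

   BEGIN_BATCH(3);
   OUT_BATCH(MI_LOAD_REGISTER_IMM | (3 - 2));
   OUT_BATCH(GEN7_SO_WRITE_OFFSET(0));
   OUT_BATCH(0);
   ADVANCE_BATCH();

   brw_batch_end(&brw->batch); /* non-zero on failure */
)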
diff --git a/src/mesa/drivers/dri/i965/gen7_urb.c b/src/mesa/drivers/dri/i965/gen7_urb.c
index d371c19..bc631b1 100644
--- a/src/mesa/drivers/dri/i965/gen7_urb.c
+++ b/src/mesa/drivers/dri/i965/gen7_urb.c
@@ -22,7 +22,6 @@
  */
 
 #include "main/macros.h"
-#include "intel_batchbuffer.h"
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen7_viewport_state.c b/src/mesa/drivers/dri/i965/gen7_viewport_state.c
index b655205..a32e25f 100644
--- a/src/mesa/drivers/dri/i965/gen7_viewport_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_viewport_state.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
 #include "main/viewport.h"
diff --git a/src/mesa/drivers/dri/i965/gen7_vs_state.c b/src/mesa/drivers/dri/i965/gen7_vs_state.c
index 4b17d06..b6e90bc 100644
--- a/src/mesa/drivers/dri/i965/gen7_vs_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_vs_state.c
@@ -27,7 +27,6 @@
 #include "brw_util.h"
 #include "program/prog_parameter.h"
 #include "program/prog_statevars.h"
-#include "intel_batchbuffer.h"
 
 
 void
diff --git a/src/mesa/drivers/dri/i965/gen7_wm_state.c b/src/mesa/drivers/dri/i965/gen7_wm_state.c
index ea11ae8..56400e3 100644
--- a/src/mesa/drivers/dri/i965/gen7_wm_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_wm_state.c
@@ -31,7 +31,6 @@
 #include "program/prog_parameter.h"
 #include "program/prog_statevars.h"
 #include "main/framebuffer.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_wm_state(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
index 15ab2b0..991e7bc 100644
--- a/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_wm_surface_state.c
@@ -28,7 +28,6 @@
 #include "program/prog_parameter.h"
 
 #include "intel_mipmap_tree.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_fbo.h"
 #include "intel_buffer_objects.h"
@@ -116,19 +115,14 @@ gen7_set_surface_mcs_info(struct brw_context *brw,
     * thus have their lower 12 bits zero), we can use an ordinary reloc to do
     * the necessary address translation.
     */
-   assert ((mcs_mt->bo->offset64 & 0xfff) == 0);
-
-   surf[6] = GEN7_SURFACE_MCS_ENABLE |
-             SET_FIELD(pitch_tiles - 1, GEN7_SURFACE_MCS_PITCH) |
-             mcs_mt->bo->offset64;
-
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           surf_offset + 6 * 4,
-                           mcs_mt->bo,
-                           surf[6] & 0xfff,
-                           is_render_target ? I915_GEM_DOMAIN_RENDER
-                           : I915_GEM_DOMAIN_SAMPLER,
-                           is_render_target ? I915_GEM_DOMAIN_RENDER : 0);
+   surf[6] = brw_batch_reloc(&brw->batch,
+			     surf_offset + 6 * 4,
+			     mcs_mt->bo,
+			     GEN7_SURFACE_MCS_ENABLE |
+			     SET_FIELD(pitch_tiles - 1, GEN7_SURFACE_MCS_PITCH),
+			     is_render_target ? I915_GEM_DOMAIN_RENDER
+			     : I915_GEM_DOMAIN_SAMPLER,
+			     is_render_target ? I915_GEM_DOMAIN_RENDER : 0);
 }
 
 
@@ -221,7 +215,7 @@ gen7_check_surface_setup(uint32_t *surf, bool is_render_target)
 static void
 gen7_emit_buffer_surface_state(struct brw_context *brw,
                                uint32_t *out_offset,
-                               drm_intel_bo *bo,
+                               struct brw_bo *bo,
                                unsigned buffer_offset,
                                unsigned surface_format,
                                unsigned buffer_size,
@@ -235,7 +229,9 @@ gen7_emit_buffer_surface_state(struct brw_context *brw,
    surf[0] = BRW_SURFACE_BUFFER << BRW_SURFACE_TYPE_SHIFT |
              surface_format << BRW_SURFACE_FORMAT_SHIFT |
              BRW_SURFACE_RC_READ_WRITE;
-   surf[1] = (bo ? bo->offset64 : 0) + buffer_offset; /* reloc */
+   surf[1] = brw_batch_reloc(&brw->batch, *out_offset + 4,
+			     bo, buffer_offset, I915_GEM_DOMAIN_SAMPLER,
+			     (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
    surf[2] = SET_FIELD((buffer_size - 1) & 0x7f, GEN7_SURFACE_WIDTH) |
              SET_FIELD(((buffer_size - 1) >> 7) & 0x3fff, GEN7_SURFACE_HEIGHT);
    if (surface_format == BRW_SURFACEFORMAT_RAW)
@@ -253,13 +249,6 @@ gen7_emit_buffer_surface_state(struct brw_context *brw,
                   SET_FIELD(HSW_SCS_ALPHA, GEN7_SURFACE_SCS_A));
    }
 
-   /* Emit relocation to surface contents */
-   if (bo) {
-      drm_intel_bo_emit_reloc(brw->batch.bo, *out_offset + 4,
-                              bo, buffer_offset, I915_GEM_DOMAIN_SAMPLER,
-                              (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
-   }
-
    gen7_check_surface_setup(surf, false /* is_render_target */);
 }
 
@@ -299,7 +288,10 @@ gen7_emit_texture_surface_state(struct brw_context *brw,
    if (mt->array_layout == ALL_SLICES_AT_EACH_LOD)
       surf[0] |= GEN7_SURFACE_ARYSPC_LOD0;
 
-   surf[1] = mt->bo->offset64 + mt->offset; /* reloc */
+   surf[1] = brw_batch_reloc(&brw->batch, *surf_offset + 4,
+			     mt->bo, mt->offset,
+			     I915_GEM_DOMAIN_SAMPLER,
+			     (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
 
    surf[2] = SET_FIELD(mt->logical_width0 - 1, GEN7_SURFACE_WIDTH) |
              SET_FIELD(mt->logical_height0 - 1, GEN7_SURFACE_HEIGHT);
@@ -336,14 +328,6 @@ gen7_emit_texture_surface_state(struct brw_context *brw,
                                 mt->mcs_mt, false /* is RT */);
    }
 
-   /* Emit relocation to surface contents */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           *surf_offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           I915_GEM_DOMAIN_SAMPLER,
-                           (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
-
    gen7_check_surface_setup(surf, false /* is_render_target */);
 }
 
@@ -519,7 +503,10 @@ gen7_update_renderbuffer_surface(struct brw_context *brw,
    }
 
    assert(mt->offset % mt->cpp == 0);
-   surf[1] = mt->bo->offset64 + mt->offset;
+   surf[1] = brw_batch_reloc(&brw->batch, offset + 4,
+			     mt->bo, mt->offset,
+			     I915_GEM_DOMAIN_RENDER,
+			     I915_GEM_DOMAIN_RENDER);
 
    assert(brw->has_surface_tile_offset);
 
@@ -550,13 +537,6 @@ gen7_update_renderbuffer_surface(struct brw_context *brw,
                   SET_FIELD(HSW_SCS_ALPHA, GEN7_SURFACE_SCS_A));
    }
 
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           offset + 4,
-                           mt->bo,
-                           surf[1] - mt->bo->offset64,
-                           I915_GEM_DOMAIN_RENDER,
-                           I915_GEM_DOMAIN_RENDER);
-
    gen7_check_surface_setup(surf, true /* is_render_target */);
 
    return offset;
diff --git a/src/mesa/drivers/dri/i965/gen8_blend_state.c b/src/mesa/drivers/dri/i965/gen8_blend_state.c
index 786c79a..31a3deb 100644
--- a/src/mesa/drivers/dri/i965/gen8_blend_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_blend_state.c
@@ -26,7 +26,6 @@
 #include "brw_defines.h"
 #include "brw_util.h"
 #include "brw_wm.h"
-#include "intel_batchbuffer.h"
 #include "main/macros.h"
 #include "main/enums.h"
 #include "main/glformats.h"
diff --git a/src/mesa/drivers/dri/i965/gen8_depth_state.c b/src/mesa/drivers/dri/i965/gen8_depth_state.c
index 8f23702..739e974 100644
--- a/src/mesa/drivers/dri/i965/gen8_depth_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_depth_state.c
@@ -21,9 +21,9 @@
  * IN THE SOFTWARE.
  */
 
-#include "intel_batchbuffer.h"
 #include "intel_mipmap_tree.h"
 #include "intel_fbo.h"
+#include "intel_reg.h"
 #include "intel_resolve_map.h"
 #include "brw_context.h"
 #include "brw_state.h"
@@ -52,7 +52,7 @@ emit_depth_packets(struct brw_context *brw,
 
    /* Skip repeated NULL depth/stencil emits (think 2D rendering). */
    if (!depth_mt && !stencil_mt && brw->no_depth_or_stencil) {
-      assert(brw->hw_ctx);
+      assert(brw->batch.hw_ctx);
       return;
    }
 
@@ -400,6 +400,9 @@ gen8_hiz_exec(struct brw_context *brw, struct intel_mipmap_tree *mt,
    if (op == GEN6_HIZ_OP_NONE)
       return;
 
+   if (brw_batch_begin(&brw->batch, 1000, RENDER_RING) < 0)
+      return;
+
    /* Disable the PMA stall fix since we're about to do a HiZ operation. */
    if (brw->gen == 8)
       write_pma_stall_bits(brw, 0);
@@ -508,7 +511,14 @@ gen8_hiz_exec(struct brw_context *brw, struct intel_mipmap_tree *mt,
    ADVANCE_BATCH();
 
    /* Mark this buffer as needing a TC flush, as we've rendered to it. */
-   brw_render_cache_set_add_bo(brw, mt->bo);
+   assert(mt->bo->dirty);
+
+   if (brw_batch_end(&brw->batch)) {
+      struct gl_context *ctx = &brw->ctx;
+      WARN_ONCE(1, "i965: hiz emit exceeded available aperture space\n");
+   }
+
+   brw_batch_maybe_flush(&brw->batch);
 
    /* We've clobbered all of the depth packets, and the drawing rectangle,
     * so we need to ensure those packets are re-emitted before the next
diff --git a/src/mesa/drivers/dri/i965/gen8_disable.c b/src/mesa/drivers/dri/i965/gen8_disable.c
index da0d4a5..ec86fee 100644
--- a/src/mesa/drivers/dri/i965/gen8_disable.c
+++ b/src/mesa/drivers/dri/i965/gen8_disable.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 disable_stages(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen8_draw_upload.c b/src/mesa/drivers/dri/i965/gen8_draw_upload.c
index 1af90ec..dc5e915 100644
--- a/src/mesa/drivers/dri/i965/gen8_draw_upload.c
+++ b/src/mesa/drivers/dri/i965/gen8_draw_upload.c
@@ -32,7 +32,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
 
 static void
diff --git a/src/mesa/drivers/dri/i965/gen8_gs_state.c b/src/mesa/drivers/dri/i965/gen8_gs_state.c
index 26a02d3..e5c3d23 100644
--- a/src/mesa/drivers/dri/i965/gen8_gs_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_gs_state.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 static void
 gen8_upload_gs_state(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen8_misc_state.c b/src/mesa/drivers/dri/i965/gen8_misc_state.c
index b20038e..83376cd 100644
--- a/src/mesa/drivers/dri/i965/gen8_misc_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_misc_state.c
@@ -21,7 +21,6 @@
  * IN THE SOFTWARE.
  */
 
-#include "intel_batchbuffer.h"
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen8_multisample_state.c b/src/mesa/drivers/dri/i965/gen8_multisample_state.c
index 75cbe06..da5b32b 100644
--- a/src/mesa/drivers/dri/i965/gen8_multisample_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_multisample_state.c
@@ -21,7 +21,6 @@
  * IN THE SOFTWARE.
  */
 
-#include "intel_batchbuffer.h"
 
 #include "brw_context.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/gen8_ps_state.c b/src/mesa/drivers/dri/i965/gen8_ps_state.c
index a88f109..0c66e91 100644
--- a/src/mesa/drivers/dri/i965/gen8_ps_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_ps_state.c
@@ -25,7 +25,6 @@
 #include "program/program.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 
 void
 gen8_upload_ps_extra(struct brw_context *brw,
diff --git a/src/mesa/drivers/dri/i965/gen8_sf_state.c b/src/mesa/drivers/dri/i965/gen8_sf_state.c
index c2b585d..d0174a1 100644
--- a/src/mesa/drivers/dri/i965/gen8_sf_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_sf_state.c
@@ -27,7 +27,6 @@
 #include "brw_util.h"
 #include "main/macros.h"
 #include "main/fbobject.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_sbe(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen8_sol_state.c b/src/mesa/drivers/dri/i965/gen8_sol_state.c
index 58ead68..07212ab 100644
--- a/src/mesa/drivers/dri/i965/gen8_sol_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_sol_state.c
@@ -31,7 +31,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "intel_buffer_objects.h"
 #include "main/transformfeedback.h"
 
@@ -70,7 +69,7 @@ gen8_upload_3dstate_so_buffers(struct brw_context *brw)
       uint32_t start = xfb_obj->Offset[i];
       assert(start % 4 == 0);
       uint32_t end = ALIGN(start + xfb_obj->Size[i], 4);
-      drm_intel_bo *bo =
+      struct brw_bo *bo =
          intel_bufferobj_buffer(brw, bufferobj, start, end - start);
       assert(end <= bo->size);
 
diff --git a/src/mesa/drivers/dri/i965/gen8_surface_state.c b/src/mesa/drivers/dri/i965/gen8_surface_state.c
index bd3eb00..2a815d7 100644
--- a/src/mesa/drivers/dri/i965/gen8_surface_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_surface_state.c
@@ -29,7 +29,6 @@
 #include "program/prog_parameter.h"
 
 #include "intel_mipmap_tree.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_fbo.h"
 #include "intel_buffer_objects.h"
@@ -145,7 +144,7 @@ allocate_surface_state(struct brw_context *brw, uint32_t *out_offset, int index)
 static void
 gen8_emit_buffer_surface_state(struct brw_context *brw,
                                uint32_t *out_offset,
-                               drm_intel_bo *bo,
+                               struct brw_bo *bo,
                                unsigned buffer_offset,
                                unsigned surface_format,
                                unsigned buffer_size,
@@ -171,15 +170,11 @@ gen8_emit_buffer_surface_state(struct brw_context *brw,
              SET_FIELD(HSW_SCS_GREEN, GEN7_SURFACE_SCS_G) |
              SET_FIELD(HSW_SCS_BLUE,  GEN7_SURFACE_SCS_B) |
              SET_FIELD(HSW_SCS_ALPHA, GEN7_SURFACE_SCS_A);
-   /* reloc */
-   *((uint64_t *) &surf[8]) = (bo ? bo->offset64 : 0) + buffer_offset;
-
    /* Emit relocation to surface contents. */
-   if (bo) {
-      drm_intel_bo_emit_reloc(brw->batch.bo, *out_offset + 8 * 4,
-                              bo, buffer_offset, I915_GEM_DOMAIN_SAMPLER,
-                              rw ? I915_GEM_DOMAIN_SAMPLER : 0);
-   }
+   *((uint64_t *)&surf[8]) =
+	   brw_batch_reloc(&brw->batch, *out_offset + 8 * 4,
+			   bo, buffer_offset, I915_GEM_DOMAIN_SAMPLER,
+			   rw ? I915_GEM_DOMAIN_SAMPLER : 0);
 }
 
 static void
@@ -275,27 +270,18 @@ gen8_emit_texture_surface_state(struct brw_context *brw,
       SET_FIELD(swizzle_to_scs(GET_SWZ(swizzle, 2)), GEN7_SURFACE_SCS_B) |
       SET_FIELD(swizzle_to_scs(GET_SWZ(swizzle, 3)), GEN7_SURFACE_SCS_A);
 
-   *((uint64_t *) &surf[8]) = mt->bo->offset64 + mt->offset; /* reloc */
-
-   if (aux_mt) {
-      *((uint64_t *) &surf[10]) = aux_mt->bo->offset64;
-      drm_intel_bo_emit_reloc(brw->batch.bo, *surf_offset + 10 * 4,
-                              aux_mt->bo, 0,
-                              I915_GEM_DOMAIN_SAMPLER,
-                              (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
-   } else {
-      surf[10] = 0;
-      surf[11] = 0;
-   }
-   surf[12] = 0;
-
-   /* Emit relocation to surface contents */
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           *surf_offset + 8 * 4,
-                           mt->bo,
-                           mt->offset,
+   *((uint64_t *)&surf[8]) =
+	   brw_batch_reloc(&brw->batch, *surf_offset + 8 * 4,
+			   mt->bo, mt->offset,
                            I915_GEM_DOMAIN_SAMPLER,
                            (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
+
+   *((uint64_t *)&surf[10]) =
+	   brw_batch_reloc(&brw->batch, *surf_offset + 10 * 4,
+			   aux_mt ? aux_mt->bo : NULL, 0,
+			   I915_GEM_DOMAIN_SAMPLER,
+			   (rw ? I915_GEM_DOMAIN_SAMPLER : 0));
+   surf[12] = 0;
 }
 
 static void
@@ -502,27 +488,18 @@ gen8_update_renderbuffer_surface(struct brw_context *brw,
              SET_FIELD(HSW_SCS_ALPHA, GEN7_SURFACE_SCS_A);
 
    assert(mt->offset % mt->cpp == 0);
-   *((uint64_t *) &surf[8]) = mt->bo->offset64 + mt->offset; /* reloc */
-
-   if (aux_mt) {
-      *((uint64_t *) &surf[10]) = aux_mt->bo->offset64;
-      drm_intel_bo_emit_reloc(brw->batch.bo,
-                              offset + 10 * 4,
-                              aux_mt->bo, 0,
-                              I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
-   } else {
-      surf[10] = 0;
-      surf[11] = 0;
-   }
+   *((uint64_t *) &surf[8]) =
+      brw_batch_reloc(&brw->batch, offset + 8*4,
+		      mt->bo, mt->offset,
+		      I915_GEM_DOMAIN_RENDER,
+		      I915_GEM_DOMAIN_RENDER);
+
+   *((uint64_t *)&surf[10]) =
+      brw_batch_reloc(&brw->batch, offset + 10 * 4,
+		      aux_mt ? aux_mt->bo : NULL, 0,
+		      I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
    surf[12] = 0;
 
-   drm_intel_bo_emit_reloc(brw->batch.bo,
-                           offset + 8 * 4,
-                           mt->bo,
-                           mt->offset,
-                           I915_GEM_DOMAIN_RENDER,
-                           I915_GEM_DOMAIN_RENDER);
-
    return offset;
 }
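
(One property the gen8 conversion leans on, worth making explicit:
brw_batch_reloc() is assumed to treat a NULL bo as "no target",
writing a zero address and queuing no relocation, which is what lets
the aux_mt special-casing collapse into an unconditional call:

   /* aux_mt may be absent: NULL folds to address 0 with no reloc
    * queued (assumed from the unconditional calls above).
    */
   *((uint64_t *)&surf[10]) =
      brw_batch_reloc(&brw->batch, offset + 10 * 4,
                      aux_mt ? aux_mt->bo : NULL, 0,
                      I915_GEM_DOMAIN_RENDER, I915_GEM_DOMAIN_RENDER);
)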
 
diff --git a/src/mesa/drivers/dri/i965/gen8_viewport_state.c b/src/mesa/drivers/dri/i965/gen8_viewport_state.c
index 2d8eeb1..2cc6f61 100644
--- a/src/mesa/drivers/dri/i965/gen8_viewport_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_viewport_state.c
@@ -24,7 +24,6 @@
 #include "brw_context.h"
 #include "brw_state.h"
 #include "brw_defines.h"
-#include "intel_batchbuffer.h"
 #include "main/fbobject.h"
 #include "main/framebuffer.h"
 #include "main/viewport.h"
diff --git a/src/mesa/drivers/dri/i965/gen8_vs_state.c b/src/mesa/drivers/dri/i965/gen8_vs_state.c
index 28f5add..f14951d 100644
--- a/src/mesa/drivers/dri/i965/gen8_vs_state.c
+++ b/src/mesa/drivers/dri/i965/gen8_vs_state.c
@@ -27,7 +27,6 @@
 #include "brw_util.h"
 #include "program/prog_parameter.h"
 #include "program/prog_statevars.h"
-#include "intel_batchbuffer.h"
 
 static void
 upload_vs_state(struct brw_context *brw)
diff --git a/src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c b/src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c
index 2c843b2..0f1128a 100644
--- a/src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c
+++ b/src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c
@@ -21,7 +21,6 @@
  * IN THE SOFTWARE.
  */
 
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "brw_context.h"
 #include "brw_defines.h"
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
deleted file mode 100644
index 969d92c..0000000
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ /dev/null
@@ -1,480 +0,0 @@
-/**************************************************************************
- *
- * Copyright 2006 VMware, Inc.
- * All Rights Reserved.
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sub license, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice (including the
- * next paragraph) shall be included in all copies or substantial portions
- * of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
- * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
- * IN NO EVENT SHALL VMWARE AND/OR ITS SUPPLIERS BE LIABLE FOR
- * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- **************************************************************************/
-
-#include "intel_batchbuffer.h"
-#include "intel_buffer_objects.h"
-#include "intel_reg.h"
-#include "intel_bufmgr.h"
-#include "intel_buffers.h"
-#include "intel_fbo.h"
-#include "brw_context.h"
-
-#include <xf86drm.h>
-#include <i915_drm.h>
-
-static void
-intel_batchbuffer_reset(struct brw_context *brw);
-
-void
-intel_batchbuffer_init(struct brw_context *brw)
-{
-   intel_batchbuffer_reset(brw);
-
-   if (!brw->has_llc) {
-      brw->batch.cpu_map = malloc(BATCH_SZ);
-      brw->batch.map = brw->batch.cpu_map;
-   }
-}
-
-static void
-intel_batchbuffer_reset(struct brw_context *brw)
-{
-   if (brw->batch.last_bo != NULL) {
-      drm_intel_bo_unreference(brw->batch.last_bo);
-      brw->batch.last_bo = NULL;
-   }
-   brw->batch.last_bo = brw->batch.bo;
-
-   brw_render_cache_set_clear(brw);
-
-   brw->batch.bo = drm_intel_bo_alloc(brw->bufmgr, "batchbuffer",
-					BATCH_SZ, 4096);
-   if (brw->has_llc) {
-      drm_intel_bo_map(brw->batch.bo, true);
-      brw->batch.map = brw->batch.bo->virtual;
-   }
-
-   brw->batch.reserved_space = BATCH_RESERVED;
-   brw->batch.state_batch_offset = brw->batch.bo->size;
-   brw->batch.used = 0;
-   brw->batch.needs_sol_reset = false;
-
-   /* We don't know what ring the new batch will be sent to until we see the
-    * first BEGIN_BATCH or BEGIN_BATCH_BLT.  Mark it as unknown.
-    */
-   brw->batch.ring = UNKNOWN_RING;
-}
-
-void
-intel_batchbuffer_save_state(struct brw_context *brw)
-{
-   brw->batch.saved.used = brw->batch.used;
-   brw->batch.saved.reloc_count =
-      drm_intel_gem_bo_get_reloc_count(brw->batch.bo);
-}
-
-void
-intel_batchbuffer_reset_to_saved(struct brw_context *brw)
-{
-   drm_intel_gem_bo_clear_relocs(brw->batch.bo, brw->batch.saved.reloc_count);
-
-   brw->batch.used = brw->batch.saved.used;
-   if (brw->batch.used == 0)
-      brw->batch.ring = UNKNOWN_RING;
-}
-
-void
-intel_batchbuffer_free(struct brw_context *brw)
-{
-   free(brw->batch.cpu_map);
-   drm_intel_bo_unreference(brw->batch.last_bo);
-   drm_intel_bo_unreference(brw->batch.bo);
-}
-
-static void
-do_batch_dump(struct brw_context *brw)
-{
-   struct drm_intel_decode *decode;
-   struct intel_batchbuffer *batch = &brw->batch;
-   int ret;
-
-   decode = drm_intel_decode_context_alloc(brw->intelScreen->deviceID);
-   if (!decode)
-      return;
-
-   ret = drm_intel_bo_map(batch->bo, false);
-   if (ret == 0) {
-      drm_intel_decode_set_batch_pointer(decode,
-					 batch->bo->virtual,
-					 batch->bo->offset64,
-					 batch->used);
-   } else {
-      fprintf(stderr,
-	      "WARNING: failed to map batchbuffer (%s), "
-	      "dumping uploaded data instead.\n", strerror(ret));
-
-      drm_intel_decode_set_batch_pointer(decode,
-					 batch->map,
-					 batch->bo->offset64,
-					 batch->used);
-   }
-
-   drm_intel_decode_set_output_file(decode, stderr);
-   drm_intel_decode(decode);
-
-   drm_intel_decode_context_free(decode);
-
-   if (ret == 0) {
-      drm_intel_bo_unmap(batch->bo);
-
-      brw_debug_batch(brw);
-   }
-}
-
-void
-intel_batchbuffer_emit_render_ring_prelude(struct brw_context *brw)
-{
-   /* We may need to enable and snapshot OA counters. */
-   brw_perf_monitor_new_batch(brw);
-}
-
-/**
- * Called when starting a new batch buffer.
- */
-static void
-brw_new_batch(struct brw_context *brw)
-{
-   /* Create a new batchbuffer and reset the associated state: */
-   drm_intel_gem_bo_clear_relocs(brw->batch.bo, 0);
-   intel_batchbuffer_reset(brw);
-
-   /* If the kernel supports hardware contexts, then most hardware state is
-    * preserved between batches; we only need to re-emit state that is required
-    * to be in every batch.  Otherwise we need to re-emit all the state that
-    * would otherwise be stored in the context (which for all intents and
-    * purposes means everything).
-    */
-   if (brw->hw_ctx == NULL)
-      brw->ctx.NewDriverState |= BRW_NEW_CONTEXT;
-
-   brw->ctx.NewDriverState |= BRW_NEW_BATCH;
-
-   brw->state_batch_count = 0;
-
-   brw->ib.type = -1;
-
-   /* We need to periodically reap the shader time results, because rollover
-    * happens every few seconds.  We also want to see results every once in a
-    * while, because many programs won't cleanly destroy our context, so the
-    * end-of-run printout may not happen.
-    */
-   if (INTEL_DEBUG & DEBUG_SHADER_TIME)
-      brw_collect_and_report_shader_time(brw);
-
-   if (INTEL_DEBUG & DEBUG_PERFMON)
-      brw_dump_perf_monitors(brw);
-}
-
-/**
- * Called from intel_batchbuffer_flush before emitting MI_BATCHBUFFER_END and
- * sending it off.
- *
- * This function can emit state (say, to preserve registers that aren't saved
- * between batches).  All of this state MUST fit in the reserved space at the
- * end of the batchbuffer.  If you add more GPU state, increase the reserved
- * space by updating the BATCH_RESERVED macro.
- */
-static void
-brw_finish_batch(struct brw_context *brw)
-{
-   /* Capture the closing pipeline statistics register values necessary to
-    * support query objects (in the non-hardware context world).
-    */
-   brw_emit_query_end(brw);
-
-   /* We may also need to snapshot and disable OA counters. */
-   if (brw->batch.ring == RENDER_RING)
-      brw_perf_monitor_finish_batch(brw);
-
-   /* Mark that the current program cache BO has been used by the GPU.
-    * It will be reallocated if we need to put new programs in for the
-    * next batch.
-    */
-   brw->cache.bo_used_by_gpu = true;
-}
-
-static void
-throttle(struct brw_context *brw)
-{
-   /* Wait for the swapbuffers before the one we just emitted, so we
-    * don't get too many swaps outstanding for apps that are GPU-heavy
-    * but not CPU-heavy.
-    *
-    * We're using intelDRI2Flush (called from the loader before
-    * swapbuffer) and glFlush (for front buffer rendering) as the
-    * indicator that a frame is done and then throttle when we get
-    * here as we prepare to render the next frame.  At this point for
-    * round trips for swap/copy and getting new buffers are done and
-    * we'll spend less time waiting on the GPU.
-    *
-    * Unfortunately, we don't have a handle to the batch containing
-    * the swap, and getting our hands on that doesn't seem worth it,
-    * so we just use the first batch we emitted after the last swap.
-    */
-   if (brw->need_swap_throttle && brw->throttle_batch[0]) {
-      if (brw->throttle_batch[1]) {
-         if (!brw->disable_throttling)
-            drm_intel_bo_wait_rendering(brw->throttle_batch[1]);
-         drm_intel_bo_unreference(brw->throttle_batch[1]);
-      }
-      brw->throttle_batch[1] = brw->throttle_batch[0];
-      brw->throttle_batch[0] = NULL;
-      brw->need_swap_throttle = false;
-      /* Throttling here is more precise than the throttle ioctl, so skip it */
-      brw->need_flush_throttle = false;
-   }
-
-   if (brw->need_flush_throttle) {
-      __DRIscreen *psp = brw->intelScreen->driScrnPriv;
-      drmCommandNone(psp->fd, DRM_I915_GEM_THROTTLE);
-      brw->need_flush_throttle = false;
-   }
-}
-
-/* TODO: Push this whole function into bufmgr.
- */
-static int
-do_flush_locked(struct brw_context *brw)
-{
-   struct intel_batchbuffer *batch = &brw->batch;
-   int ret = 0;
-
-   if (brw->has_llc) {
-      drm_intel_bo_unmap(batch->bo);
-   } else {
-      ret = drm_intel_bo_subdata(batch->bo, 0, 4*batch->used, batch->map);
-      if (ret == 0 && batch->state_batch_offset != batch->bo->size) {
-	 ret = drm_intel_bo_subdata(batch->bo,
-				    batch->state_batch_offset,
-				    batch->bo->size - batch->state_batch_offset,
-				    (char *)batch->map + batch->state_batch_offset);
-      }
-   }
-
-   if (!brw->intelScreen->no_hw) {
-      int flags;
-
-      if (brw->gen >= 6 && batch->ring == BLT_RING) {
-         flags = I915_EXEC_BLT;
-      } else {
-         flags = I915_EXEC_RENDER;
-      }
-      if (batch->needs_sol_reset)
-	 flags |= I915_EXEC_GEN7_SOL_RESET;
-
-      if (ret == 0) {
-         if (unlikely(INTEL_DEBUG & DEBUG_AUB))
-            brw_annotate_aub(brw);
-
-	 if (brw->hw_ctx == NULL || batch->ring != RENDER_RING) {
-	    ret = drm_intel_bo_mrb_exec(batch->bo, 4 * batch->used, NULL, 0, 0,
-					flags);
-	 } else {
-	    ret = drm_intel_gem_bo_context_exec(batch->bo, brw->hw_ctx,
-						4 * batch->used, flags);
-	 }
-      }
-
-      throttle(brw);
-   }
-
-   if (unlikely(INTEL_DEBUG & DEBUG_BATCH))
-      do_batch_dump(brw);
-
-   if (ret != 0) {
-      fprintf(stderr, "intel_do_flush_locked failed: %s\n", strerror(-ret));
-      exit(1);
-   }
-
-   return ret;
-}
-
-int
-_intel_batchbuffer_flush(struct brw_context *brw,
-			 const char *file, int line)
-{
-   int ret;
-
-   if (brw->batch.used == 0)
-      return 0;
-
-   if (brw->throttle_batch[0] == NULL) {
-      brw->throttle_batch[0] = brw->batch.bo;
-      drm_intel_bo_reference(brw->throttle_batch[0]);
-   }
-
-   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
-      int bytes_for_commands = 4 * brw->batch.used;
-      int bytes_for_state = brw->batch.bo->size - brw->batch.state_batch_offset;
-      int total_bytes = bytes_for_commands + bytes_for_state;
-      fprintf(stderr, "%s:%d: Batchbuffer flush with %4db (pkt) + "
-              "%4db (state) = %4db (%0.1f%%)\n", file, line,
-              bytes_for_commands, bytes_for_state,
-              total_bytes,
-              100.0f * total_bytes / BATCH_SZ);
-   }
-
-   brw->batch.reserved_space = 0;
-
-   brw_finish_batch(brw);
-
-   /* Mark the end of the buffer. */
-   intel_batchbuffer_emit_dword(brw, MI_BATCH_BUFFER_END);
-   if (brw->batch.used & 1) {
-      /* Round batchbuffer usage to 2 DWORDs. */
-      intel_batchbuffer_emit_dword(brw, MI_NOOP);
-   }
-
-   intel_upload_finish(brw);
-
-   /* Check that we didn't just wrap our batchbuffer at a bad time. */
-   assert(!brw->no_batch_wrap);
-
-   ret = do_flush_locked(brw);
-
-   if (unlikely(INTEL_DEBUG & DEBUG_SYNC)) {
-      fprintf(stderr, "waiting for idle\n");
-      drm_intel_bo_wait_rendering(brw->batch.bo);
-   }
-
-   /* Start a new batch buffer. */
-   brw_new_batch(brw);
-
-   return ret;
-}
-
-
-/*  This is the only way buffers get added to the validate list.
- */
-bool
-intel_batchbuffer_emit_reloc(struct brw_context *brw,
-                             drm_intel_bo *buffer,
-                             uint32_t read_domains, uint32_t write_domain,
-			     uint32_t delta)
-{
-   int ret;
-
-   ret = drm_intel_bo_emit_reloc(brw->batch.bo, 4*brw->batch.used,
-				 buffer, delta,
-				 read_domains, write_domain);
-   assert(ret == 0);
-   (void)ret;
-
-   /* Using the old buffer offset, write in what the right data would be, in
-    * case the buffer doesn't move and we can short-circuit the relocation
-    * processing in the kernel
-    */
-   intel_batchbuffer_emit_dword(brw, buffer->offset64 + delta);
-
-   return true;
-}
-
-bool
-intel_batchbuffer_emit_reloc64(struct brw_context *brw,
-                               drm_intel_bo *buffer,
-                               uint32_t read_domains, uint32_t write_domain,
-			       uint32_t delta)
-{
-   int ret = drm_intel_bo_emit_reloc(brw->batch.bo, 4*brw->batch.used,
-                                     buffer, delta,
-                                     read_domains, write_domain);
-   assert(ret == 0);
-   (void) ret;
-
-   /* Using the old buffer offset, write in what the right data would be, in
-    * case the buffer doesn't move and we can short-circuit the relocation
-    * processing in the kernel
-    */
-   uint64_t offset = buffer->offset64 + delta;
-   intel_batchbuffer_emit_dword(brw, offset);
-   intel_batchbuffer_emit_dword(brw, offset >> 32);
-
-   return true;
-}
-
-
-void
-intel_batchbuffer_data(struct brw_context *brw,
-                       const void *data, GLuint bytes, enum brw_gpu_ring ring)
-{
-   assert((bytes & 3) == 0);
-   intel_batchbuffer_require_space(brw, bytes, ring);
-   memcpy(brw->batch.map + brw->batch.used, data, bytes);
-   brw->batch.used += bytes >> 2;
-}
-
-static void
-load_sized_register_mem(struct brw_context *brw,
-                        uint32_t reg,
-                        drm_intel_bo *bo,
-                        uint32_t read_domains, uint32_t write_domain,
-                        uint32_t offset,
-                        int size)
-{
-   int i;
-
-   /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */
-   assert(brw->gen >= 7);
-
-   if (brw->gen >= 8) {
-      BEGIN_BATCH(4 * size);
-      for (i = 0; i < size; i++) {
-         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2));
-         OUT_BATCH(reg + i * 4);
-         OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4);
-      }
-      ADVANCE_BATCH();
-   } else {
-      BEGIN_BATCH(3 * size);
-      for (i = 0; i < size; i++) {
-         OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (3 - 2));
-         OUT_BATCH(reg + i * 4);
-         OUT_RELOC(bo, read_domains, write_domain, offset + i * 4);
-      }
-      ADVANCE_BATCH();
-   }
-}
-
-void
-brw_load_register_mem(struct brw_context *brw,
-                      uint32_t reg,
-                      drm_intel_bo *bo,
-                      uint32_t read_domains, uint32_t write_domain,
-                      uint32_t offset)
-{
-   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 1);
-}
-
-void
-brw_load_register_mem64(struct brw_context *brw,
-                        uint32_t reg,
-                        drm_intel_bo *bo,
-                        uint32_t read_domains, uint32_t write_domain,
-                        uint32_t offset)
-{
-   load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, 2);
-}
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.h b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
deleted file mode 100644
index ef8a6ff..0000000
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.h
+++ /dev/null
@@ -1,179 +0,0 @@
-#ifndef INTEL_BATCHBUFFER_H
-#define INTEL_BATCHBUFFER_H
-
-#include "main/mtypes.h"
-
-#include "brw_context.h"
-#include "intel_bufmgr.h"
-#include "intel_reg.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-/**
- * Number of bytes to reserve for commands necessary to complete a batch.
- *
- * This includes:
- * - MI_BATCHBUFFER_END (4 bytes)
- * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
- * - Any state emitted by vtbl->finish_batch():
- *   - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
- *   - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
- *   - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
- *     - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
- *       which are 4 DWords each ==> 2 * 3 * 4 * 4 = 96 bytes
- *     - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+.  ==> 12 bytes.
- *       On Ironlake, it's 6 DWords, but we have some slack due to the lack of
- *       Sandybridge PIPE_CONTROL madness.
- */
-#define BATCH_RESERVED 146
-
-struct intel_batchbuffer;
-
-void intel_batchbuffer_emit_render_ring_prelude(struct brw_context *brw);
-void intel_batchbuffer_init(struct brw_context *brw);
-void intel_batchbuffer_free(struct brw_context *brw);
-void intel_batchbuffer_save_state(struct brw_context *brw);
-void intel_batchbuffer_reset_to_saved(struct brw_context *brw);
-
-int _intel_batchbuffer_flush(struct brw_context *brw,
-			     const char *file, int line);
-
-#define intel_batchbuffer_flush(intel) \
-	_intel_batchbuffer_flush(intel, __FILE__, __LINE__)
-
-
-
-/* Unlike bmBufferData, this currently requires the buffer be mapped.
- * Consider it a convenience function wrapping multple
- * intel_buffer_dword() calls.
- */
-void intel_batchbuffer_data(struct brw_context *brw,
-                            const void *data, GLuint bytes,
-                            enum brw_gpu_ring ring);
-
-bool intel_batchbuffer_emit_reloc(struct brw_context *brw,
-                                       drm_intel_bo *buffer,
-				       uint32_t read_domains,
-				       uint32_t write_domain,
-				       uint32_t offset);
-bool intel_batchbuffer_emit_reloc64(struct brw_context *brw,
-                                    drm_intel_bo *buffer,
-                                    uint32_t read_domains,
-                                    uint32_t write_domain,
-                                    uint32_t offset);
-static inline uint32_t float_as_int(float f)
-{
-   union {
-      float f;
-      uint32_t d;
-   } fi;
-
-   fi.f = f;
-   return fi.d;
-}
-
-/* Inline functions - might actually be better off with these
- * non-inlined.  Certainly better off switching all command packets to
- * be passed as structs rather than dwords, but that's a little bit of
- * work...
- */
-static inline unsigned
-intel_batchbuffer_space(struct brw_context *brw)
-{
-   return (brw->batch.state_batch_offset - brw->batch.reserved_space)
-      - brw->batch.used*4;
-}
-
-
-static inline void
-intel_batchbuffer_emit_dword(struct brw_context *brw, GLuint dword)
-{
-#ifdef DEBUG
-   assert(intel_batchbuffer_space(brw) >= 4);
-#endif
-   brw->batch.map[brw->batch.used++] = dword;
-   assert(brw->batch.ring != UNKNOWN_RING);
-}
-
-static inline void
-intel_batchbuffer_emit_float(struct brw_context *brw, float f)
-{
-   intel_batchbuffer_emit_dword(brw, float_as_int(f));
-}
-
-static inline void
-intel_batchbuffer_require_space(struct brw_context *brw, GLuint sz,
-                                enum brw_gpu_ring ring)
-{
-   /* If we're switching rings, implicitly flush the batch. */
-   if (unlikely(ring != brw->batch.ring) && brw->batch.ring != UNKNOWN_RING &&
-       brw->gen >= 6) {
-      intel_batchbuffer_flush(brw);
-   }
-
-#ifdef DEBUG
-   assert(sz < BATCH_SZ - BATCH_RESERVED);
-#endif
-   if (intel_batchbuffer_space(brw) < sz)
-      intel_batchbuffer_flush(brw);
-
-   enum brw_gpu_ring prev_ring = brw->batch.ring;
-   /* The intel_batchbuffer_flush() calls above might have changed
-    * brw->batch.ring to UNKNOWN_RING, so we need to set it here at the end.
-    */
-   brw->batch.ring = ring;
-
-   if (unlikely(prev_ring == UNKNOWN_RING && ring == RENDER_RING))
-      intel_batchbuffer_emit_render_ring_prelude(brw);
-}
-
-static inline void
-intel_batchbuffer_begin(struct brw_context *brw, int n, enum brw_gpu_ring ring)
-{
-   intel_batchbuffer_require_space(brw, n * 4, ring);
-
-   brw->batch.emit = brw->batch.used;
-#ifdef DEBUG
-   brw->batch.total = n;
-#endif
-}
-
-static inline void
-intel_batchbuffer_advance(struct brw_context *brw)
-{
-#ifdef DEBUG
-   struct intel_batchbuffer *batch = &brw->batch;
-   unsigned int _n = batch->used - batch->emit;
-   assert(batch->total != 0);
-   if (_n != batch->total) {
-      fprintf(stderr, "ADVANCE_BATCH: %d of %d dwords emitted\n",
-	      _n, batch->total);
-      abort();
-   }
-   batch->total = 0;
-#endif
-}
-
-#define BEGIN_BATCH(n) intel_batchbuffer_begin(brw, n, RENDER_RING)
-#define BEGIN_BATCH_BLT(n) intel_batchbuffer_begin(brw, n, BLT_RING)
-#define OUT_BATCH(d) intel_batchbuffer_emit_dword(brw, d)
-#define OUT_BATCH_F(f) intel_batchbuffer_emit_float(brw, f)
-#define OUT_RELOC(buf, read_domains, write_domain, delta) do {		\
-   intel_batchbuffer_emit_reloc(brw, buf,			\
-				read_domains, write_domain, delta);	\
-} while (0)
-
-/* Handle 48-bit address relocations for Gen8+ */
-#define OUT_RELOC64(buf, read_domains, write_domain, delta) do { \
-   intel_batchbuffer_emit_reloc64(brw, buf, read_domains, write_domain, delta);	\
-} while (0)
-
-#define ADVANCE_BATCH() intel_batchbuffer_advance(brw);
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif
diff --git a/src/mesa/drivers/dri/i965/intel_blit.c b/src/mesa/drivers/dri/i965/intel_blit.c
index bc39053..1acbace 100644
--- a/src/mesa/drivers/dri/i965/intel_blit.c
+++ b/src/mesa/drivers/dri/i965/intel_blit.c
@@ -39,7 +39,6 @@
 #include "intel_buffers.h"
 #include "intel_fbo.h"
 #include "intel_reg.h"
-#include "intel_batchbuffer.h"
 #include "intel_mipmap_tree.h"
 
 #define FILE_DEBUG_FLAG DEBUG_BLIT
@@ -397,11 +396,11 @@ alignment_valid(struct brw_context *brw, unsigned offset, uint32_t tiling)
 
 static bool
 can_fast_copy_blit(struct brw_context *brw,
-		   drm_intel_bo *src_buffer,
+		   struct brw_bo *src_buffer,
                    int16_t src_x, int16_t src_y,
                    uintptr_t src_offset, uint32_t src_pitch,
                    uint32_t src_tiling, uint32_t src_tr_mode,
-		   drm_intel_bo *dst_buffer,
+		   struct brw_bo *dst_buffer,
                    int16_t dst_x, int16_t dst_y,
                    uintptr_t dst_offset, uint32_t dst_pitch,
                    uint32_t dst_tiling, uint32_t dst_tr_mode,
@@ -507,12 +506,12 @@ bool
 intelEmitCopyBlit(struct brw_context *brw,
 		  GLuint cpp,
 		  GLshort src_pitch,
-		  drm_intel_bo *src_buffer,
+		  struct brw_bo *src_buffer,
 		  GLuint src_offset,
 		  uint32_t src_tiling,
 		  uint32_t src_tr_mode,
 		  GLshort dst_pitch,
-		  drm_intel_bo *dst_buffer,
+		  struct brw_bo *dst_buffer,
 		  GLuint dst_offset,
 		  uint32_t dst_tiling,
 		  uint32_t dst_tr_mode,
@@ -521,10 +520,9 @@ intelEmitCopyBlit(struct brw_context *brw,
 		  GLshort w, GLshort h,
 		  GLenum logic_op)
 {
-   GLuint CMD, BR13, pass = 0;
+   GLuint CMD, BR13;
    int dst_y2 = dst_y + h;
    int dst_x2 = dst_x + w;
-   drm_intel_bo *aper_array[3];
    bool dst_y_tiled = dst_tiling == I915_TILING_Y;
    bool src_y_tiled = src_tiling == I915_TILING_Y;
    bool use_fast_copy_blit = false;
@@ -532,25 +530,8 @@ intelEmitCopyBlit(struct brw_context *brw,
    if ((dst_y_tiled || src_y_tiled) && brw->gen < 6)
       return false;
 
-   /* do space check before going any further */
-   do {
-       aper_array[0] = brw->batch.bo;
-       aper_array[1] = dst_buffer;
-       aper_array[2] = src_buffer;
-
-       if (dri_bufmgr_check_aperture_space(aper_array, 3) != 0) {
-           intel_batchbuffer_flush(brw);
-           pass++;
-       } else
-           break;
-   } while (pass < 2);
-
-   if (pass >= 2)
-      return false;
-
    unsigned length = brw->gen >= 8 ? 10 : 8;
 
-   intel_batchbuffer_require_space(brw, length * 4, BLT_RING);
    DBG("%s src:buf(%p)/%d+%d %d,%d dst:buf(%p)/%d+%d %d,%d sz:%dx%d\n",
        __func__,
        src_buffer, src_pitch, src_offset, src_x, src_y,
@@ -661,6 +642,9 @@ intelEmitCopyBlit(struct brw_context *brw,
    assert(dst_offset + (dst_y + h - 1) * abs(dst_pitch) +
           (w * cpp) <= dst_buffer->size);
 
+   if (brw_batch_begin(&brw->batch, 20, BLT_RING) < 0)
+      return false;
+
    BEGIN_BATCH_BLT_TILED(length, dst_y_tiled, src_y_tiled);
    OUT_BATCH(CMD | (length - 2));
    OUT_BATCH(BR13 | (uint16_t)dst_pitch);
@@ -688,10 +672,7 @@ intelEmitCopyBlit(struct brw_context *brw,
    }
 
    ADVANCE_BATCH_TILED(dst_y_tiled, src_y_tiled);
-
-   brw_emit_mi_flush(brw);
-
-   return true;
+   return brw_batch_end(&brw->batch) == 0;
 }
 
 bool
@@ -700,7 +681,7 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
 				  GLubyte *src_bits, GLuint src_size,
 				  GLuint fg_color,
 				  GLshort dst_pitch,
-				  drm_intel_bo *dst_buffer,
+				  struct brw_bo *dst_buffer,
 				  GLuint dst_offset,
 				  uint32_t dst_tiling,
 				  GLshort x, GLshort y,
@@ -728,9 +709,6 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
        dst_buffer, dst_pitch, dst_offset, x, y, w, h, src_size, dwords);
 
    unsigned xy_setup_blt_length = brw->gen >= 8 ? 10 : 8;
-   intel_batchbuffer_require_space(brw, (xy_setup_blt_length * 4) +
-                                        (3 * 4) + dwords * 4, BLT_RING);
-
    opcode = XY_SETUP_BLT_CMD;
    if (cpp == 4)
       opcode |= XY_BLT_WRITE_ALPHA | XY_BLT_WRITE_RGB;
@@ -746,6 +724,9 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
    if (dst_tiling != I915_TILING_NONE)
       blit_cmd |= XY_DST_TILED;
 
+   if (brw_batch_begin(&brw->batch, 20 + dwords, BLT_RING) < 0)
+      return false;
+
    BEGIN_BATCH_BLT(xy_setup_blt_length + 3);
    OUT_BATCH(opcode | (xy_setup_blt_length - 2));
    OUT_BATCH(br13);
@@ -771,11 +752,9 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
    OUT_BATCH(SET_FIELD(y + h, BLT_Y) | SET_FIELD(x + w, BLT_X));
    ADVANCE_BATCH();
 
-   intel_batchbuffer_data(brw, src_bits, dwords * 4, BLT_RING);
+   brw_batch_data(&brw->batch, src_bits, dwords * 4);
 
-   brw_emit_mi_flush(brw);
-
-   return true;
+   return brw_batch_end(&brw->batch) == 0;
 }
 
 /* We don't have a memmove-type blit like some other hardware, so we'll do a
@@ -784,9 +763,9 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
  */
 void
 intel_emit_linear_blit(struct brw_context *brw,
-		       drm_intel_bo *dst_bo,
+		       struct brw_bo *dst_bo,
 		       unsigned int dst_offset,
-		       drm_intel_bo *src_bo,
+		       struct brw_bo *src_bo,
 		       unsigned int src_offset,
 		       unsigned int size)
 {
@@ -853,7 +832,6 @@ intel_miptree_set_alpha_to_one(struct brw_context *brw,
 {
    uint32_t BR13, CMD;
    int pitch, cpp;
-   drm_intel_bo *aper_array[2];
 
    pitch = mt->pitch;
    cpp = mt->cpp;
@@ -871,14 +849,8 @@ intel_miptree_set_alpha_to_one(struct brw_context *brw,
    }
    BR13 |= pitch;
 
-   /* do space check before going any further */
-   aper_array[0] = brw->batch.bo;
-   aper_array[1] = mt->bo;
-
-   if (drm_intel_bufmgr_check_aperture_space(aper_array,
-					     ARRAY_SIZE(aper_array)) != 0) {
-      intel_batchbuffer_flush(brw);
-   }
+   if (brw_batch_begin(&brw->batch, 20, BLT_RING) < 0)
+      return;
 
    unsigned length = brw->gen >= 8 ? 7 : 6;
    bool dst_y_tiled = mt->tiling == I915_TILING_Y;
@@ -900,5 +872,5 @@ intel_miptree_set_alpha_to_one(struct brw_context *brw,
    OUT_BATCH(0xffffffff); /* white, but only alpha gets written */
    ADVANCE_BATCH_TILED(dst_y_tiled, false);
 
-   brw_emit_mi_flush(brw);
+   brw_batch_end(&brw->batch);
 }
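
(The deleted dri_bufmgr_check_aperture_space() retry loops are
subsumed by the begin/end contract: brw_batch_begin() can refuse up
front, and brw_batch_end() reports when the accumulated buffers no
longer fit, so the blit paths simply propagate those results. A
sketch of the resulting error handling, with the dword estimates used
above:

   if (brw_batch_begin(&brw->batch, 20, BLT_RING) < 0)
      return false;           /* could not start a BLT batch */

   /* ... emit the XY_*_BLT commands ... */

   return brw_batch_end(&brw->batch) == 0; /* false on overflow */
)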
diff --git a/src/mesa/drivers/dri/i965/intel_blit.h b/src/mesa/drivers/dri/i965/intel_blit.h
index c3d19a5..a0e4e2b 100644
--- a/src/mesa/drivers/dri/i965/intel_blit.h
+++ b/src/mesa/drivers/dri/i965/intel_blit.h
@@ -34,12 +34,12 @@ bool
 intelEmitCopyBlit(struct brw_context *brw,
                   GLuint cpp,
                   GLshort src_pitch,
-                  drm_intel_bo *src_buffer,
+                  struct brw_bo *src_buffer,
                   GLuint src_offset,
                   uint32_t src_tiling,
                   uint32_t src_tr_mode,
                   GLshort dst_pitch,
-                  drm_intel_bo *dst_buffer,
+                  struct brw_bo *dst_buffer,
                   GLuint dst_offset,
                   uint32_t dst_tiling,
                   uint32_t dst_tr_mode,
@@ -66,16 +66,16 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
 				  GLubyte *src_bits, GLuint src_size,
 				  GLuint fg_color,
 				  GLshort dst_pitch,
-				  drm_intel_bo *dst_buffer,
+				  struct brw_bo *dst_buffer,
 				  GLuint dst_offset,
 				  uint32_t dst_tiling,
 				  GLshort x, GLshort y,
 				  GLshort w, GLshort h,
 				  GLenum logic_op);
 void intel_emit_linear_blit(struct brw_context *brw,
-			    drm_intel_bo *dst_bo,
+			    struct brw_bo *dst_bo,
 			    unsigned int dst_offset,
-			    drm_intel_bo *src_bo,
+			    struct brw_bo *src_bo,
 			    unsigned int src_offset,
 			    unsigned int size);
 
diff --git a/src/mesa/drivers/dri/i965/intel_buffer_objects.c b/src/mesa/drivers/dri/i965/intel_buffer_objects.c
index ff05b5c..fda5c9f 100644
--- a/src/mesa/drivers/dri/i965/intel_buffer_objects.c
+++ b/src/mesa/drivers/dri/i965/intel_buffer_objects.c
@@ -39,47 +39,6 @@
 #include "brw_context.h"
 #include "intel_blit.h"
 #include "intel_buffer_objects.h"
-#include "intel_batchbuffer.h"
-
-/**
- * Map a buffer object; issue performance warnings if mapping causes stalls.
- *
- * This matches the drm_intel_bo_map API, but takes an additional human-readable
- * name for the buffer object to use in the performance debug message.
- */
-int
-brw_bo_map(struct brw_context *brw,
-           drm_intel_bo *bo, int write_enable,
-           const char *bo_name)
-{
-   if (likely(!brw->perf_debug) || !drm_intel_bo_busy(bo))
-      return drm_intel_bo_map(bo, write_enable);
-
-   double start_time = get_time();
-
-   int ret = drm_intel_bo_map(bo, write_enable);
-
-   perf_debug("CPU mapping a busy %s BO stalled and took %.03f ms.\n",
-              bo_name, (get_time() - start_time) * 1000);
-
-   return ret;
-}
-
-int
-brw_bo_map_gtt(struct brw_context *brw, drm_intel_bo *bo, const char *bo_name)
-{
-   if (likely(!brw->perf_debug) || !drm_intel_bo_busy(bo))
-      return drm_intel_gem_bo_map_gtt(bo);
-
-   double start_time = get_time();
-
-   int ret = drm_intel_gem_bo_map_gtt(bo);
-
-   perf_debug("GTT mapping a busy %s BO stalled and took %.03f ms.\n",
-              bo_name, (get_time() - start_time) * 1000);
-
-   return ret;
-}
 
 static void
 mark_buffer_gpu_usage(struct intel_buffer_object *intel_obj,
@@ -92,17 +51,20 @@ mark_buffer_gpu_usage(struct intel_buffer_object *intel_obj,
 static void
 mark_buffer_inactive(struct intel_buffer_object *intel_obj)
 {
+   if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE))
+      return;
+
    intel_obj->gpu_active_start = ~0;
    intel_obj->gpu_active_end = 0;
 }
 
-/** Allocates a new drm_intel_bo to store the data for the buffer object. */
+/** Allocates a new brw_bo to store the data for the buffer object. */
 static void
 alloc_buffer_object(struct brw_context *brw,
                     struct intel_buffer_object *intel_obj)
 {
-   intel_obj->buffer = drm_intel_bo_alloc(brw->bufmgr, "bufferobj",
-					  intel_obj->Base.Size, 64);
+   intel_obj->buffer =
+      brw_bo_create(&brw->batch, "bufferobj", intel_obj->Base.Size, 64, 0);
 
    /* the buffer might be bound as a uniform buffer, need to update it
     */
@@ -119,7 +81,7 @@ alloc_buffer_object(struct brw_context *brw,
 static void
 release_buffer(struct intel_buffer_object *intel_obj)
 {
-   drm_intel_bo_unreference(intel_obj->buffer);
+   brw_bo_put(intel_obj->buffer);
    intel_obj->buffer = NULL;
 }
 
@@ -166,7 +128,7 @@ brw_delete_buffer(struct gl_context * ctx, struct gl_buffer_object *obj)
     */
    _mesa_buffer_unmap_all_mappings(ctx, obj);
 
-   drm_intel_bo_unreference(intel_obj->buffer);
+   brw_bo_put(intel_obj->buffer);
    free(intel_obj);
 }
 
@@ -213,7 +175,7 @@ brw_buffer_data(struct gl_context *ctx,
          return false;
 
       if (data != NULL)
-	 drm_intel_bo_subdata(intel_obj->buffer, 0, size, data);
+	 brw_bo_write(intel_obj->buffer, 0, data, size, 0);
    }
 
    return true;
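
(For reference while reading the rest of this file, the bo calls
translate one-for-one; note that brw_bo_write() swaps the data/size
argument order relative to drm_intel_bo_subdata(), and the final
argument of brw_bo_create()/brw_bo_read()/brw_bo_write() is a flags
word, 0 being the default synchronous behaviour:

   /* drm_intel_bo_alloc(bufmgr, name, size, align)
    *    -> brw_bo_create(&brw->batch, name, size, align, 0)
    * drm_intel_bo_subdata(bo, offset, size, data)
    *    -> brw_bo_write(bo, offset, data, size, 0)
    * drm_intel_bo_get_subdata(bo, offset, size, data)
    *    -> brw_bo_read(bo, offset, data, size, 0)
    * drm_intel_bo_unreference(bo)
    *    -> brw_bo_put(bo)
    */
)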
@@ -238,64 +200,49 @@ brw_buffer_subdata(struct gl_context *ctx,
 {
    struct brw_context *brw = brw_context(ctx);
    struct intel_buffer_object *intel_obj = intel_buffer_object(obj);
-   bool busy;
 
    if (size == 0)
       return;
 
    assert(intel_obj);
 
-   /* See if we can unsynchronized write the data into the user's BO. This
-    * avoids GPU stalls in unfortunately common user patterns (uploading
-    * sequentially into a BO, with draw calls in between each upload).
-    *
-    * Once we've hit this path, we mark this GL BO as preferring stalling to
-    * blits, so that we can hopefully hit this path again in the future
-    * (otherwise, an app that might occasionally stall but mostly not will end
-    * up with blitting all the time, at the cost of bandwidth)
-    */
-   if (offset + size <= intel_obj->gpu_active_start ||
-       intel_obj->gpu_active_end <= offset) {
-      if (brw->has_llc) {
-         drm_intel_gem_bo_map_unsynchronized(intel_obj->buffer);
-         memcpy(intel_obj->buffer->virtual + offset, data, size);
-         drm_intel_bo_unmap(intel_obj->buffer);
-
-         if (intel_obj->gpu_active_end > intel_obj->gpu_active_start)
-            intel_obj->prefer_stall_to_blit = true;
-         return;
-      } else {
-         perf_debug("BufferSubData could be unsynchronized, but !LLC doesn't support it yet\n");
-      }
-   }
-
-   busy =
-      drm_intel_bo_busy(intel_obj->buffer) ||
-      drm_intel_bo_references(brw->batch.bo, intel_obj->buffer);
-
-   if (busy) {
+   if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE)) {
       if (size == intel_obj->Base.Size) {
 	 /* Replace the current busy bo so the subdata doesn't stall. */
-	 drm_intel_bo_unreference(intel_obj->buffer);
+	 brw_bo_put(intel_obj->buffer);
 	 alloc_buffer_object(brw, intel_obj);
+      } else if (offset + size <= intel_obj->gpu_active_start ||
+                 intel_obj->gpu_active_end <= offset) {
+         /* See if we can write the data into the user's BO without
+          * synchronization. This avoids GPU stalls in unfortunately common
+          * user patterns (uploading sequentially into a BO, with draw calls
+          * in between each upload).
+          *
+          * Once we've hit this path, we mark this GL BO as preferring
+          * stalling to blits, so that we can hopefully hit this path again
+          * in the future (otherwise, an app that might occasionally stall
+          * but mostly not will end up with blitting all the time, at the
+          * cost of bandwidth).
+          */
+         brw_bo_write(intel_obj->buffer, offset, data, size, MAP_ASYNC);
+         if (intel_obj->gpu_active_end > intel_obj->gpu_active_start)
+            intel_obj->prefer_stall_to_blit = intel_obj->buffer->cache_coherent;
+	 return;
       } else if (!intel_obj->prefer_stall_to_blit) {
+	 uint32_t upload;
          perf_debug("Using a blit copy to avoid stalling on "
                     "glBufferSubData(%ld, %ld) (%ldkb) to a busy "
                     "(%d-%d) buffer object.\n",
                     (long)offset, (long)offset + size, (long)(size/1024),
                     intel_obj->gpu_active_start,
                     intel_obj->gpu_active_end);
-	 drm_intel_bo *temp_bo =
-	    drm_intel_bo_alloc(brw->bufmgr, "subdata temp", size, 64);
-
-	 drm_intel_bo_subdata(temp_bo, 0, size, data);
-
+	 struct brw_bo *bo = NULL;
+         intel_upload_data(brw, data, size, 64, &bo, &upload);
 	 intel_emit_linear_blit(brw,
 				intel_obj->buffer, offset,
-				temp_bo, 0,
+				bo, upload,
 				size);
-
-	 drm_intel_bo_unreference(temp_bo);
+	 brw_bo_put(bo);
          return;
       } else {
          perf_debug("Stalling on glBufferSubData(%ld, %ld) (%ldkb) to a busy "
@@ -304,11 +251,10 @@ brw_buffer_subdata(struct gl_context *ctx,
                     (long)offset, (long)offset + size, (long)(size/1024),
                     intel_obj->gpu_active_start,
                     intel_obj->gpu_active_end);
-         intel_batchbuffer_flush(brw);
       }
    }
 
-   drm_intel_bo_subdata(intel_obj->buffer, offset, size, data);
+   brw_bo_write(intel_obj->buffer, offset, data, size, 0);
    mark_buffer_inactive(intel_obj);
 }
 
@@ -327,14 +273,9 @@ brw_get_buffer_subdata(struct gl_context *ctx,
                        struct gl_buffer_object *obj)
 {
    struct intel_buffer_object *intel_obj = intel_buffer_object(obj);
-   struct brw_context *brw = brw_context(ctx);
 
    assert(intel_obj);
-   if (drm_intel_bo_references(brw->batch.bo, intel_obj->buffer)) {
-      intel_batchbuffer_flush(brw);
-   }
-   drm_intel_bo_get_subdata(intel_obj->buffer, offset, size, data);
-
+   brw_bo_read(intel_obj->buffer, offset, data, size, 0);
    mark_buffer_inactive(intel_obj);
 }
 
@@ -365,6 +306,7 @@ brw_map_buffer_range(struct gl_context *ctx,
 {
    struct brw_context *brw = brw_context(ctx);
    struct intel_buffer_object *intel_obj = intel_buffer_object(obj);
+   unsigned map_flags;
 
    assert(intel_obj);
 
@@ -389,19 +331,11 @@ brw_map_buffer_range(struct gl_context *ctx,
     * achieve the required synchronization.
     */
    if (!(access & GL_MAP_UNSYNCHRONIZED_BIT)) {
-      if (drm_intel_bo_references(brw->batch.bo, intel_obj->buffer)) {
-	 if (access & GL_MAP_INVALIDATE_BUFFER_BIT) {
-	    drm_intel_bo_unreference(intel_obj->buffer);
+      if ((access & GL_MAP_INVALIDATE_BUFFER_BIT)) {
+	 if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE)) {
+	    brw_bo_put(intel_obj->buffer);
 	    alloc_buffer_object(brw, intel_obj);
-	 } else {
-            perf_debug("Stalling on the GPU for mapping a busy buffer "
-                       "object\n");
-	    intel_batchbuffer_flush(brw);
 	 }
-      } else if (drm_intel_bo_busy(intel_obj->buffer) &&
-		 (access & GL_MAP_INVALIDATE_BUFFER_BIT)) {
-	 drm_intel_bo_unreference(intel_obj->buffer);
-	 alloc_buffer_object(brw, intel_obj);
       }
    }
 
@@ -416,46 +350,41 @@ brw_map_buffer_range(struct gl_context *ctx,
     */
    if (!(access & (GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_PERSISTENT_BIT)) &&
        (access & GL_MAP_INVALIDATE_RANGE_BIT) &&
-       drm_intel_bo_busy(intel_obj->buffer)) {
+       brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE)) {
       /* Ensure that the base alignment of the allocation meets the alignment
        * guarantees the driver has advertised to the application.
        */
       const unsigned alignment = ctx->Const.MinMapBufferAlignment;
 
       intel_obj->map_extra[index] = (uintptr_t) offset % alignment;
-      intel_obj->range_map_bo[index] = drm_intel_bo_alloc(brw->bufmgr,
-                                                          "BO blit temp",
-                                                          length +
-                                                          intel_obj->map_extra[index],
-                                                          alignment);
-      if (brw->has_llc) {
-         brw_bo_map(brw, intel_obj->range_map_bo[index],
-                    (access & GL_MAP_WRITE_BIT) != 0, "range-map");
-      } else {
-         drm_intel_gem_bo_map_gtt(intel_obj->range_map_bo[index]);
-      }
+      intel_obj->range_map_bo[index] =
+	      brw_bo_create(&brw->batch,
+			    "BO blit temp",
+			    length + intel_obj->map_extra[index],
+			    alignment, 0);
+
       obj->Mappings[index].Pointer =
-         intel_obj->range_map_bo[index]->virtual + intel_obj->map_extra[index];
+	 brw_bo_map(intel_obj->range_map_bo[index], MAP_WRITE) +
+	 intel_obj->map_extra[index];
+
       return obj->Mappings[index].Pointer;
    }
 
-   if (access & GL_MAP_UNSYNCHRONIZED_BIT) {
-      if (!brw->has_llc && brw->perf_debug &&
-          drm_intel_bo_busy(intel_obj->buffer)) {
-         perf_debug("MapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT stalling (it's actually synchronized on non-LLC platforms)\n");
-      }
-      drm_intel_gem_bo_map_unsynchronized(intel_obj->buffer);
-   } else if (!brw->has_llc && (!(access & GL_MAP_READ_BIT) ||
-                              (access & GL_MAP_PERSISTENT_BIT))) {
-      drm_intel_gem_bo_map_gtt(intel_obj->buffer);
-      mark_buffer_inactive(intel_obj);
-   } else {
-      brw_bo_map(brw, intel_obj->buffer, (access & GL_MAP_WRITE_BIT) != 0,
-                 "MapBufferRange");
-      mark_buffer_inactive(intel_obj);
-   }
+   map_flags = 0;
+   if (access & GL_MAP_UNSYNCHRONIZED_BIT)
+      map_flags |= MAP_ASYNC;
+   if (access & GL_MAP_WRITE_BIT)
+      map_flags |= MAP_WRITE;
+   if (access & GL_MAP_READ_BIT)
+      map_flags |= MAP_READ;
+   if (access & GL_MAP_PERSISTENT_BIT)
+      map_flags |= MAP_COHERENT;
+
+   obj->Mappings[index].Pointer =
+      brw_bo_map(intel_obj->buffer, map_flags) + offset;
+
+   mark_buffer_inactive(intel_obj);
 
-   obj->Mappings[index].Pointer = intel_obj->buffer->virtual + offset;
    return obj->Mappings[index].Pointer;
 }
 
@@ -543,8 +472,6 @@ brw_unmap_buffer(struct gl_context *ctx,
    assert(intel_obj);
    assert(obj->Mappings[index].Pointer);
    if (intel_obj->range_map_bo[index] != NULL) {
-      drm_intel_bo_unmap(intel_obj->range_map_bo[index]);
-
       if (!(obj->Mappings[index].AccessFlags & GL_MAP_FLUSH_EXPLICIT_BIT)) {
          intel_emit_linear_blit(brw,
                                 intel_obj->buffer, obj->Mappings[index].Offset,
@@ -555,18 +482,10 @@ brw_unmap_buffer(struct gl_context *ctx,
                                obj->Mappings[index].Length);
       }
 
-      /* Since we've emitted some blits to buffers that will (likely) be used
-       * in rendering operations in other cache domains in this batch, emit a
-       * flush.  Once again, we wish for a domain tracker in libdrm to cover
-       * usage inside of a batchbuffer.
-       */
-      brw_emit_mi_flush(brw);
-
-      drm_intel_bo_unreference(intel_obj->range_map_bo[index]);
+      brw_bo_put(intel_obj->range_map_bo[index]);
       intel_obj->range_map_bo[index] = NULL;
-   } else if (intel_obj->buffer != NULL) {
-      drm_intel_bo_unmap(intel_obj->buffer);
    }
+
    obj->Mappings[index].Pointer = NULL;
    obj->Mappings[index].Offset = 0;
    obj->Mappings[index].Length = 0;
@@ -581,7 +500,7 @@ brw_unmap_buffer(struct gl_context *ctx,
  * Anywhere that uses buffer objects in the pipeline should be using this to
  * mark the range of the buffer that is being accessed by the pipeline.
  */
-drm_intel_bo *
+struct brw_bo *
 intel_bufferobj_buffer(struct brw_context *brw,
                        struct intel_buffer_object *intel_obj,
                        uint32_t offset, uint32_t size)
@@ -615,7 +534,7 @@ brw_copy_buffer_subdata(struct gl_context *ctx,
    struct brw_context *brw = brw_context(ctx);
    struct intel_buffer_object *intel_src = intel_buffer_object(src);
    struct intel_buffer_object *intel_dst = intel_buffer_object(dst);
-   drm_intel_bo *src_bo, *dst_bo;
+   struct brw_bo *src_bo, *dst_bo;
 
    if (size == 0)
       return;
@@ -626,13 +545,6 @@ brw_copy_buffer_subdata(struct gl_context *ctx,
    intel_emit_linear_blit(brw,
 			  dst_bo, write_offset,
 			  src_bo, read_offset, size);
-
-   /* Since we've emitted some blits to buffers that will (likely) be used
-    * in rendering operations in other cache domains in this batch, emit a
-    * flush.  Once again, we wish for a domain tracker in libdrm to cover
-    * usage inside of a batchbuffer.
-    */
-   brw_emit_mi_flush(brw);
 }
 
 void
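
For reference, the rewritten brw_buffer_subdata() folds the old
drm_intel_bo_busy() + drm_intel_bo_references() pair into a single
brw_bo_busy() query. The decision ladder condenses to roughly this
(control flow flattened, perf_debug noise dropped; BUSY_WRITE | BUSY_RETIRE
is assumed to mean "a write now would stall"):

   if (!brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE)) {
      brw_bo_write(intel_obj->buffer, offset, data, size, 0);
   } else if (size == intel_obj->Base.Size) {
      /* Whole-buffer upload: orphan the busy bo and start afresh. */
      brw_bo_put(intel_obj->buffer);
      alloc_buffer_object(brw, intel_obj);
      brw_bo_write(intel_obj->buffer, offset, data, size, 0);
   } else if (offset + size <= intel_obj->gpu_active_start ||
              intel_obj->gpu_active_end <= offset) {
      /* Range the GPU has not touched: write it unsynchronized. */
      brw_bo_write(intel_obj->buffer, offset, data, size, MAP_ASYNC);
   } else if (!intel_obj->prefer_stall_to_blit) {
      /* Stage through intel_upload_data() and blit into place. */
   } else {
      brw_bo_write(intel_obj->buffer, offset, data, size, 0); /* stall */
   }
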
diff --git a/src/mesa/drivers/dri/i965/intel_buffer_objects.h b/src/mesa/drivers/dri/i965/intel_buffer_objects.h
index 5eaf9dc..179c0576 100644
--- a/src/mesa/drivers/dri/i965/intel_buffer_objects.h
+++ b/src/mesa/drivers/dri/i965/intel_buffer_objects.h
@@ -40,9 +40,9 @@ struct gl_buffer_object;
 struct intel_buffer_object
 {
    struct gl_buffer_object Base;
-   drm_intel_bo *buffer;     /* the low-level buffer manager's buffer handle */
+   struct brw_bo *buffer; /* the low-level buffer manager's buffer handle */
 
-   drm_intel_bo *range_map_bo[MAP_COUNT];
+   struct brw_bo *range_map_bo[MAP_COUNT];
 
    /**
     * Alignment offset from the range_map_bo temporary mapping to the returned
@@ -84,26 +84,24 @@ struct intel_buffer_object
 
 /* Get the bm buffer associated with a GL bufferobject:
  */
-drm_intel_bo *intel_bufferobj_buffer(struct brw_context *brw,
-                                     struct intel_buffer_object *obj,
-                                     uint32_t offset,
-                                     uint32_t size);
+struct brw_bo *intel_bufferobj_buffer(struct brw_context *brw,
+				      struct intel_buffer_object *obj,
+				      uint32_t offset,
+				      uint32_t size);
 
 void intel_upload_data(struct brw_context *brw,
                        const void *data,
                        uint32_t size,
                        uint32_t alignment,
-                       drm_intel_bo **out_bo,
+                       struct brw_bo **out_bo,
                        uint32_t *out_offset);
 
 void *intel_upload_space(struct brw_context *brw,
                          uint32_t size,
                          uint32_t alignment,
-                         drm_intel_bo **out_bo,
+                         struct brw_bo **out_bo,
                          uint32_t *out_offset);
 
-void intel_upload_finish(struct brw_context *brw);
-
 /* Hook the bufferobject implementation into mesa:
  */
 void intelInitBufferObjectFuncs(struct dd_function_table *functions);
diff --git a/src/mesa/drivers/dri/i965/intel_debug.c b/src/mesa/drivers/dri/i965/intel_debug.c
index b68c212..be786ec 100644
--- a/src/mesa/drivers/dri/i965/intel_debug.c
+++ b/src/mesa/drivers/dri/i965/intel_debug.c
@@ -93,17 +93,11 @@ brw_process_intel_debug_variable(struct intel_screen *screen)
    uint64_t intel_debug = driParseDebugString(getenv("INTEL_DEBUG"), debug_control);
    (void) p_atomic_cmpxchg(&INTEL_DEBUG, 0, intel_debug);
 
-   if (INTEL_DEBUG & DEBUG_BUFMGR)
-      dri_bufmgr_set_debug(screen->bufmgr, true);
-
    if ((INTEL_DEBUG & DEBUG_SHADER_TIME) && screen->devinfo->gen < 7) {
       fprintf(stderr,
               "shader_time debugging requires gen7 (Ivybridge) or better.\n");
       INTEL_DEBUG &= ~DEBUG_SHADER_TIME;
    }
-
-   if (INTEL_DEBUG & DEBUG_AUB)
-      drm_intel_bufmgr_gem_set_aub_dump(screen->bufmgr, true);
 }
 
 /**
diff --git a/src/mesa/drivers/dri/i965/intel_extensions.c b/src/mesa/drivers/dri/i965/intel_extensions.c
index 6b3bd12..c9a2007 100644
--- a/src/mesa/drivers/dri/i965/intel_extensions.c
+++ b/src/mesa/drivers/dri/i965/intel_extensions.c
@@ -28,7 +28,6 @@
 #include "main/version.h"
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "intel_reg.h"
 #include "utils.h"
 
@@ -50,12 +49,15 @@ can_do_pipelined_register_writes(struct brw_context *brw)
    if (result != -1)
       return result;
 
+   result = false;
+
    /* We use SO_WRITE_OFFSET0 since you're supposed to write it (unlike the
     * statistics registers), and we already reset it to zero before using it.
     */
    const int reg = GEN7_SO_WRITE_OFFSET(0);
    const int expected_value = 0x1337d0d0;
    const int offset = 100;
+   int ret;
 
    /* The register we picked only exists on Gen7+. */
    assert(brw->gen == 7);
@@ -64,10 +66,12 @@ can_do_pipelined_register_writes(struct brw_context *brw)
    /* Set a value in a BO to a known quantity.  The workaround BO already
     * exists and doesn't contain anything important, so we may as well use it.
     */
-   drm_intel_bo_map(brw->workaround_bo, true);
-   data = brw->workaround_bo->virtual;
+   data = brw_bo_map(brw->workaround_bo, MAP_WRITE);
    data[offset] = 0xffffffff;
-   drm_intel_bo_unmap(brw->workaround_bo);
+
+   ret = brw_batch_begin(&brw->batch, 60, RENDER_RING);
+   if (ret < 0)
+      return result;
 
    /* Write the register. */
    BEGIN_BATCH(3);
@@ -87,17 +91,13 @@ can_do_pipelined_register_writes(struct brw_context *brw)
              offset * sizeof(uint32_t));
    ADVANCE_BATCH();
 
-   intel_batchbuffer_flush(brw);
+   if (brw_batch_end(&brw->batch))
+      return result;
 
    /* Check whether the value got written. */
-   drm_intel_bo_map(brw->workaround_bo, false);
-   data = brw->workaround_bo->virtual;
-   bool success = data[offset] == expected_value;
-   drm_intel_bo_unmap(brw->workaround_bo);
-
-   result = success;
-
-   return success;
+   data = brw_bo_map(brw->workaround_bo, MAP_READ);
+   result = data[offset] == expected_value;
+   return result;
 }
 
 static bool
@@ -120,10 +120,12 @@ can_write_oacontrol(struct brw_context *brw)
    /* Set a value in a BO to a known quantity.  The workaround BO already
     * exists and doesn't contain anything important, so we may as well use it.
     */
-   drm_intel_bo_map(brw->workaround_bo, true);
-   data = brw->workaround_bo->virtual;
+   data = brw_bo_map(brw->workaround_bo, MAP_WRITE);
    data[offset] = 0xffffffff;
-   drm_intel_bo_unmap(brw->workaround_bo);
+
+   result = false;
+   if (brw_batch_begin(&brw->batch, 60, RENDER_RING) < 0)
+      return result;
 
    /* Write OACONTROL. */
    BEGIN_BATCH(3);
@@ -152,17 +154,13 @@ can_write_oacontrol(struct brw_context *brw)
    OUT_BATCH(0);
    ADVANCE_BATCH();
 
-   intel_batchbuffer_flush(brw);
+   if (brw_batch_end(&brw->batch))
+      return result;
 
    /* Check whether the value got written. */
-   drm_intel_bo_map(brw->workaround_bo, false);
-   data = brw->workaround_bo->virtual;
-   bool success = data[offset] == expected_value;
-   drm_intel_bo_unmap(brw->workaround_bo);
-
-   result = success;
-
-   return success;
+   data = brw_bo_map(brw->workaround_bo, MAP_READ);
+   result = data[offset] == expected_value;
+   return result;
 }
 
 /**
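
For reference, both register probes now share one shape: map the scratch bo
for write, bracket the batch commands with brw_batch_begin()/brw_batch_end()
(either of which can fail, hence initialising result up front), then map for
read to pick up the value. A condensed sketch, assuming brw_bo_map(MAP_READ)
synchronises against the batch just submitted:

   bool result = false;
   uint32_t *data;

   data = brw_bo_map(brw->workaround_bo, MAP_WRITE);
   data[offset] = 0xffffffff;

   if (brw_batch_begin(&brw->batch, 60, RENDER_RING) < 0)
      return result;

   /* BEGIN_BATCH()/OUT_BATCH()/ADVANCE_BATCH() as before */

   if (brw_batch_end(&brw->batch))
      return result;

   data = brw_bo_map(brw->workaround_bo, MAP_READ);
   result = data[offset] == expected_value;
   return result;
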
diff --git a/src/mesa/drivers/dri/i965/intel_fbo.c b/src/mesa/drivers/dri/i965/intel_fbo.c
index 05e3f8b..f5395c1 100644
--- a/src/mesa/drivers/dri/i965/intel_fbo.c
+++ b/src/mesa/drivers/dri/i965/intel_fbo.c
@@ -43,7 +43,6 @@
 #include "swrast/swrast.h"
 #include "drivers/common/meta.h"
 
-#include "intel_batchbuffer.h"
 #include "intel_buffers.h"
 #include "intel_blit.h"
 #include "intel_fbo.h"
@@ -377,13 +376,15 @@ intel_image_target_renderbuffer_storage(struct gl_context *ctx,
    irb = intel_renderbuffer(rb);
    intel_miptree_release(&irb->mt);
 
+   struct brw_bo *bo = brw_bo_import(&brw->batch, image->bo, true);
+
    /* Disable creation of the miptree's aux buffers because the driver exposes
     * no EGL API to manage them. That is, there is no API for resolving the aux
     * buffer's content to the main buffer nor for invalidating the aux buffer's
     * content.
     */
    irb->mt = intel_miptree_create_for_bo(brw,
-                                         image->bo,
+                                         bo,
                                          image->format,
                                          image->offset,
                                          image->width,
@@ -391,6 +392,7 @@ intel_image_target_renderbuffer_storage(struct gl_context *ctx,
                                          1,
                                          image->pitch,
                                          MIPTREE_LAYOUT_DISABLE_AUX);
+   brw_bo_put(bo);
    if (!irb->mt)
       return;
 
@@ -1044,43 +1046,6 @@ intel_renderbuffer_move_to_temp(struct brw_context *brw,
    intel_miptree_release(&new_mt);
 }
 
-void
-brw_render_cache_set_clear(struct brw_context *brw)
-{
-   struct set_entry *entry;
-
-   set_foreach(brw->render_cache, entry) {
-      _mesa_set_remove(brw->render_cache, entry);
-   }
-}
-
-void
-brw_render_cache_set_add_bo(struct brw_context *brw, drm_intel_bo *bo)
-{
-   _mesa_set_add(brw->render_cache, bo);
-}
-
-/**
- * Emits an appropriate flush for a BO if it has been rendered to within the
- * same batchbuffer as a read that's about to be emitted.
- *
- * The GPU has separate, incoherent caches for the render cache and the
- * sampler cache, along with other caches.  Usually data in the different
- * caches don't interact (e.g. we don't render to our driver-generated
- * immediate constant data), but for render-to-texture in FBOs we definitely
- * do.  When a batchbuffer is flushed, the kernel will ensure that everything
- * necessary is flushed before another use of that BO, but for reuse from
- * different caches within a batchbuffer, it's all our responsibility.
- */
-void
-brw_render_cache_set_check_flush(struct brw_context *brw, drm_intel_bo *bo)
-{
-   if (!_mesa_set_search(brw->render_cache, bo))
-      return;
-
-   brw_emit_mi_flush(brw);
-}
-
 /**
  * Do one-time context initializations related to GL_EXT_framebuffer_object.
  * Hook in device driver functions.
@@ -1101,7 +1066,4 @@ intel_fbo_init(struct brw_context *brw)
       dd->BlitFramebuffer = gen4_blit_framebuffer;
    dd->EGLImageTargetRenderbufferStorage =
       intel_image_target_renderbuffer_storage;
-
-   brw->render_cache = _mesa_set_create(brw, _mesa_hash_pointer,
-                                        _mesa_key_pointer_equal);
 }
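
Imported (screen-shared) buffers are wrapped in a context-local brw_bo at
the point of use. The wrapper is reference counted, so the idiom is import,
hand off, put; a sketch, assuming intel_miptree_create_for_bo() takes its
own reference via brw_bo_get() as in the miptree change below:

   struct brw_bo *bo = brw_bo_import(&brw->batch, image->bo, true);

   irb->mt = intel_miptree_create_for_bo(brw, bo, image->format,
                                         image->offset, image->width,
                                         image->height, 1, image->pitch,
                                         MIPTREE_LAYOUT_DISABLE_AUX);
   brw_bo_put(bo); /* the miptree holds its own reference */
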
diff --git a/src/mesa/drivers/dri/i965/intel_fbo.h b/src/mesa/drivers/dri/i965/intel_fbo.h
index c7cc570..6c761e6 100644
--- a/src/mesa/drivers/dri/i965/intel_fbo.h
+++ b/src/mesa/drivers/dri/i965/intel_fbo.h
@@ -240,10 +240,6 @@ void
 intel_renderbuffer_upsample(struct brw_context *brw,
                             struct intel_renderbuffer *irb);
 
-void brw_render_cache_set_clear(struct brw_context *brw);
-void brw_render_cache_set_add_bo(struct brw_context *brw, drm_intel_bo *bo);
-void brw_render_cache_set_check_flush(struct brw_context *brw, drm_intel_bo *bo);
-
 unsigned
 intel_quantize_num_samples(struct intel_screen *intel, unsigned num_samples);
 
diff --git a/src/mesa/drivers/dri/i965/intel_image.h b/src/mesa/drivers/dri/i965/intel_image.h
index a82cf3b..c464a17 100644
--- a/src/mesa/drivers/dri/i965/intel_image.h
+++ b/src/mesa/drivers/dri/i965/intel_image.h
@@ -42,7 +42,6 @@
 #include <xf86drm.h>
 
 #include "main/mtypes.h"
-#include "intel_bufmgr.h"
 #include <GL/internal/dri_interface.h>
 
 #ifdef __cplusplus
@@ -66,8 +65,11 @@ struct intel_image_format {
    } planes[3];
 };
 
+struct _drm_intel_bo;
+
 struct __DRIimageRec {
-   drm_intel_bo *bo;
+   struct _drm_intel_bo *bo;
+
    uint32_t pitch; /**< in bytes */
    GLenum internal_format;
    uint32_t dri_format;
diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
index fb896a9..e8bbc04 100644
--- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
+++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
@@ -28,7 +28,6 @@
 #include <GL/gl.h>
 #include <GL/internal/dri_interface.h>
 
-#include "intel_batchbuffer.h"
 #include "intel_mipmap_tree.h"
 #include "intel_resolve_map.h"
 #include "intel_tex.h"
@@ -561,12 +560,12 @@ intel_lower_compressed_format(struct brw_context *brw, mesa_format format)
 /* This function computes Yf/Ys tiled bo size, alignment and pitch. */
 static uint64_t
 intel_get_yf_ys_bo_size(struct intel_mipmap_tree *mt, unsigned *alignment,
-                        uint64_t *pitch)
+                        uint32_t *pitch)
 {
    const uint32_t bpp = mt->cpp * 8;
    const uint32_t aspect_ratio = (bpp == 16 || bpp == 64) ? 2 : 1;
-   uint32_t tile_width, tile_height;
-   uint64_t stride, size, aligned_y;
+   uint32_t tile_width, tile_height, stride;
+   uint64_t size, aligned_y;
 
    assert(mt->tr_mode != INTEL_MIPTREE_TRMODE_NONE);
 
@@ -662,21 +661,21 @@ intel_miptree_create(struct brw_context *brw,
    if (layout_flags & MIPTREE_LAYOUT_ACCELERATED_UPLOAD)
       alloc_flags |= BO_ALLOC_FOR_RENDER;
 
-   unsigned long pitch;
+   uint32_t pitch;
    mt->etc_format = etc_format;
 
    if (mt->tr_mode != INTEL_MIPTREE_TRMODE_NONE) {
-      unsigned alignment = 0;
-      unsigned long size;
+      uint32_t alignment;
+      uint64_t size;
       size = intel_get_yf_ys_bo_size(mt, &alignment, &pitch);
       assert(size);
-      mt->bo = drm_intel_bo_alloc_for_render(brw->bufmgr, "miptree",
-                                             size, alignment);
+      mt->bo = brw_bo_create(&brw->batch, "miptree",
+			     size, alignment,
+			     alloc_flags);
    } else {
-      mt->bo = drm_intel_bo_alloc_tiled(brw->bufmgr, "miptree",
-                                        total_width, total_height, mt->cpp,
-                                        &mt->tiling, &pitch,
-                                        alloc_flags);
+      mt->bo = brw_bo_create_tiled(&brw->batch, "miptree",
+				   total_width, total_height, mt->cpp,
+				   &mt->tiling, &pitch, alloc_flags);
    }
 
    mt->pitch = pitch;
@@ -690,10 +689,10 @@ intel_miptree_create(struct brw_context *brw,
                  mt->total_width, mt->total_height);
 
       mt->tiling = I915_TILING_X;
-      drm_intel_bo_unreference(mt->bo);
-      mt->bo = drm_intel_bo_alloc_tiled(brw->bufmgr, "miptree",
-                                  total_width, total_height, mt->cpp,
-                                  &mt->tiling, &pitch, alloc_flags);
+      brw_bo_put(mt->bo);
+      mt->bo = brw_bo_create_tiled(&brw->batch, "miptree",
+				   total_width, total_height, mt->cpp,
+				   &mt->tiling, &pitch, alloc_flags);
       mt->pitch = pitch;
    }
 
@@ -729,7 +728,7 @@ intel_miptree_create(struct brw_context *brw,
 
 struct intel_mipmap_tree *
 intel_miptree_create_for_bo(struct brw_context *brw,
-                            drm_intel_bo *bo,
+                            struct brw_bo *bo,
                             mesa_format format,
                             uint32_t offset,
                             uint32_t width,
@@ -739,15 +738,12 @@ intel_miptree_create_for_bo(struct brw_context *brw,
                             uint32_t layout_flags)
 {
    struct intel_mipmap_tree *mt;
-   uint32_t tiling, swizzle;
    GLenum target;
 
-   drm_intel_bo_get_tiling(bo, &tiling, &swizzle);
-
    /* Nothing will be able to use this miptree with the BO if the offset isn't
     * aligned.
     */
-   if (tiling != I915_TILING_NONE)
+   if (bo->tiling != I915_TILING_NONE)
       assert(offset % 4096 == 0);
 
    /* miptrees can't handle negative pitch.  If you need flipping of images,
@@ -772,11 +768,10 @@ intel_miptree_create_for_bo(struct brw_context *brw,
    if (!mt)
       return NULL;
 
-   drm_intel_bo_reference(bo);
-   mt->bo = bo;
+   mt->bo = brw_bo_get(bo);
    mt->pitch = pitch;
    mt->offset = offset;
-   mt->tiling = tiling;
+   mt->tiling = bo->tiling;
 
    return mt;
 }
@@ -794,7 +789,7 @@ intel_miptree_create_for_bo(struct brw_context *brw,
 void
 intel_update_winsys_renderbuffer_miptree(struct brw_context *intel,
                                          struct intel_renderbuffer *irb,
-                                         drm_intel_bo *bo,
+                                         struct brw_bo *bo,
                                          uint32_t width, uint32_t height,
                                          uint32_t pitch)
 {
@@ -926,13 +921,13 @@ intel_miptree_release(struct intel_mipmap_tree **mt)
 
       DBG("%s deleting %p\n", __func__, *mt);
 
-      drm_intel_bo_unreference((*mt)->bo);
+      brw_bo_put((*mt)->bo);
       intel_miptree_release(&(*mt)->stencil_mt);
       if ((*mt)->hiz_buf) {
          if ((*mt)->hiz_buf->mt)
             intel_miptree_release(&(*mt)->hiz_buf->mt);
          else
-            drm_intel_bo_unreference((*mt)->hiz_buf->bo);
+            brw_bo_put((*mt)->hiz_buf->bo);
          free((*mt)->hiz_buf);
       }
       intel_miptree_release(&(*mt)->mcs_mt);
@@ -1570,17 +1565,17 @@ intel_gen7_hiz_buf_create(struct brw_context *brw,
       }
    }
 
-   unsigned long pitch;
+   uint32_t pitch;
    uint32_t tiling = I915_TILING_Y;
-   buf->bo = drm_intel_bo_alloc_tiled(brw->bufmgr, "hiz",
-                                      hz_width, hz_height, 1,
-                                      &tiling, &pitch,
-                                      BO_ALLOC_FOR_RENDER);
+   buf->bo = brw_bo_create_tiled(&brw->batch, "hiz",
+				 hz_width, hz_height, 1,
+				 &tiling, &pitch,
+				 BO_ALLOC_FOR_RENDER);
    if (!buf->bo) {
       free(buf);
       return NULL;
    } else if (tiling != I915_TILING_Y) {
-      drm_intel_bo_unreference(buf->bo);
+      brw_bo_put(buf->bo);
       free(buf);
       return NULL;
    }
@@ -1673,17 +1668,17 @@ intel_gen8_hiz_buf_create(struct brw_context *brw,
       }
    }
 
-   unsigned long pitch;
+   uint32_t pitch;
    uint32_t tiling = I915_TILING_Y;
-   buf->bo = drm_intel_bo_alloc_tiled(brw->bufmgr, "hiz",
-                                      hz_width, hz_height, 1,
-                                      &tiling, &pitch,
-                                      BO_ALLOC_FOR_RENDER);
+   buf->bo = brw_bo_create_tiled(&brw->batch, "hiz",
+				 hz_width, hz_height, 1,
+				 &tiling, &pitch,
+				 BO_ALLOC_FOR_RENDER);
    if (!buf->bo) {
       free(buf);
       return NULL;
    } else if (tiling != I915_TILING_Y) {
-      drm_intel_bo_unreference(buf->bo);
+      brw_bo_put(buf->bo);
       free(buf);
       return NULL;
    }
@@ -2064,25 +2059,13 @@ intel_miptree_map_raw(struct brw_context *brw, struct intel_mipmap_tree *mt)
     * resolve any pending fast color clears before we map.
     */
    intel_miptree_resolve_color(brw, mt);
-
-   drm_intel_bo *bo = mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo))
-      intel_batchbuffer_flush(brw);
-
-   if (mt->tiling != I915_TILING_NONE)
-      brw_bo_map_gtt(brw, bo, "miptree");
-   else
-      brw_bo_map(brw, bo, true, "miptree");
-
-   return bo->virtual;
+   return brw_bo_map(mt->bo, MAP_WRITE);
 }
 
 void
 intel_miptree_unmap_raw(struct brw_context *brw,
                         struct intel_mipmap_tree *mt)
 {
-   drm_intel_bo_unmap(mt->bo);
 }
 
 static void
@@ -2622,11 +2605,10 @@ use_intel_mipree_map_blit(struct brw_context *brw,
                           unsigned int level,
                           unsigned int slice)
 {
-   if (brw->has_llc &&
-      /* It's probably not worth swapping to the blit ring because of
-       * all the overhead involved.
-       */
-       !(mode & GL_MAP_WRITE_BIT) &&
+   /* It's probably not worth swapping to the blit ring because of
+    * all the overhead involved.
+    */
+   if (!(mode & GL_MAP_WRITE_BIT) &&
        !mt->compressed &&
        (mt->tiling == I915_TILING_X ||
         /* Prior to Sandybridge, the blitter can't handle Y tiling */
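
brw_bo_create_tiled() keeps libdrm's contract that the requested tiling may
be downgraded, so callers needing a specific mode still check the returned
tiling. A sketch of the hiz allocation, relying on brw_bo_put() accepting
NULL (as the new intel_upload_finish() already does):

   uint32_t tiling = I915_TILING_Y;
   uint32_t pitch;

   buf->bo = brw_bo_create_tiled(&brw->batch, "hiz",
                                 hz_width, hz_height, 1,
                                 &tiling, &pitch, BO_ALLOC_FOR_RENDER);
   if (!buf->bo || tiling != I915_TILING_Y) {
      brw_bo_put(buf->bo); /* no-op when the allocation itself failed */
      free(buf);
      return NULL;
   }
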
diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.h b/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
index bde6daa..7e91c97 100644
--- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
+++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
@@ -33,7 +33,7 @@
  * The hardware has a fixed layout of a texture depending on parameters such
  * as the target/type (2D, 3D, CUBE), width, height, pitch, and number of
  * mipmap levels.  The individual level/layer slices are each 2D rectangles of
- * pixels at some x/y offset from the start of the drm_intel_bo.
+ * pixels at some x/y offset from the start of the brw_bo.
  *
  * Original OpenGL allowed texture miplevels to be specified in arbitrary
  * order, and a texture may change size over time.  Thus, each
@@ -49,7 +49,6 @@
 #include <assert.h>
 
 #include "main/mtypes.h"
-#include "intel_bufmgr.h"
 #include "intel_resolve_map.h"
 #include <GL/internal/dri_interface.h>
 
@@ -321,7 +320,7 @@ enum miptree_array_layout {
 struct intel_miptree_aux_buffer
 {
    /** Buffer object containing the pixel data. */
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
 
    uint32_t pitch; /**< pitch in bytes. */
 
@@ -340,7 +339,7 @@ enum intel_miptree_tr_mode {
 struct intel_mipmap_tree
 {
    /** Buffer object containing the pixel data. */
-   drm_intel_bo *bo;
+   struct brw_bo *bo;
 
    uint32_t pitch; /**< pitch in bytes. */
 
@@ -557,7 +556,7 @@ struct intel_mipmap_tree *intel_miptree_create(struct brw_context *brw,
 
 struct intel_mipmap_tree *
 intel_miptree_create_for_bo(struct brw_context *brw,
-                            drm_intel_bo *bo,
+                            struct brw_bo *bo,
                             mesa_format format,
                             uint32_t offset,
                             uint32_t width,
@@ -569,7 +568,7 @@ intel_miptree_create_for_bo(struct brw_context *brw,
 void
 intel_update_winsys_renderbuffer_miptree(struct brw_context *intel,
                                          struct intel_renderbuffer *irb,
-                                         drm_intel_bo *bo,
+					 struct brw_bo *bo,
                                          uint32_t width, uint32_t height,
                                          uint32_t pitch);
 
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_bitmap.c b/src/mesa/drivers/dri/i965/intel_pixel_bitmap.c
index 224dc65..bd40a92 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_bitmap.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_bitmap.c
@@ -44,7 +44,6 @@
 
 #include "brw_context.h"
 #include "intel_screen.h"
-#include "intel_batchbuffer.h"
 #include "intel_blit.h"
 #include "intel_fbo.h"
 #include "intel_image.h"
@@ -314,7 +313,7 @@ do_blit_bitmap( struct gl_context *ctx,
 out:
 
    if (unlikely(INTEL_DEBUG & DEBUG_SYNC))
-      intel_batchbuffer_flush(brw);
+      brw_batch_flush(&brw->batch);
 
    if (_mesa_is_bufferobj(unpack->BufferObj)) {
       /* done with PBO so unmap it now */
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_copy.c b/src/mesa/drivers/dri/i965/intel_pixel_copy.c
index ce053ed..4313588 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_copy.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_copy.c
@@ -39,7 +39,6 @@
 #include "intel_pixel.h"
 #include "intel_fbo.h"
 #include "intel_blit.h"
-#include "intel_batchbuffer.h"
 
 #define FILE_DEBUG_FLAG DEBUG_PIXEL
 
@@ -149,8 +148,6 @@ do_blit_copypixels(struct gl_context * ctx,
       return false;
    }
 
-   intel_batchbuffer_flush(brw);
-
    /* Clip to destination buffer. */
    orig_dstx = dstx;
    orig_dsty = dsty;
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_draw.c b/src/mesa/drivers/dri/i965/intel_pixel_draw.c
index 6c6bd86..09eea3e 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_draw.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_draw.c
@@ -60,7 +60,7 @@ do_blit_drawpixels(struct gl_context * ctx,
    struct brw_context *brw = brw_context(ctx);
    struct intel_buffer_object *src = intel_buffer_object(unpack->BufferObj);
    GLuint src_offset;
-   drm_intel_bo *src_buffer;
+   struct brw_bo *src_buffer;
 
    DBG("%s\n", __func__);
 
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_read.c b/src/mesa/drivers/dri/i965/intel_pixel_read.c
index 3fe506e..1789023 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_read.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_read.c
@@ -39,7 +39,6 @@
 
 #include "brw_context.h"
 #include "intel_screen.h"
-#include "intel_batchbuffer.h"
 #include "intel_blit.h"
 #include "intel_buffers.h"
 #include "intel_fbo.h"
@@ -84,11 +83,6 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
    struct intel_renderbuffer *irb = intel_renderbuffer(rb);
    int dst_pitch;
 
-   /* The miptree's buffer. */
-   drm_intel_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -96,7 +90,7 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
     * a 2D BGRA, RGBA, L8 or A8 texture. It could be generalized to support
     * more types.
     */
-   if (!brw->has_llc ||
+   if (!irb->mt->bo->cache_coherent ||
        !(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
        pixels == NULL ||
        _mesa_is_bufferobj(pack->BufferObj) ||
@@ -155,19 +149,6 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
     */
    intel_miptree_resolve_color(brw, irb->mt);
 
-   bo = irb->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo)) {
-      perf_debug("Flushing before mapping a referenced bo.\n");
-      intel_batchbuffer_flush(brw);
-   }
-
-   error = brw_bo_map(brw, bo, false /* write enable */, "miptree");
-   if (error) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
-
    dst_pitch = _mesa_image_row_stride(pack, width, format, type);
 
    /* For a window-system renderbuffer, the buffer is actually flipped
@@ -196,19 +177,16 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
        pack->Alignment, pack->RowLength, pack->SkipPixels,
        pack->SkipRows);
 
-   tiled_to_linear(
+   return tiled_to_linear(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
       pixels - (ptrdiff_t) yoffset * dst_pitch - (ptrdiff_t) xoffset * cpp,
-      bo->virtual,
+      brw_bo_map(irb->mt->bo, MAP_READ | MAP_DETILED),
       dst_pitch, irb->mt->pitch,
       brw->has_swizzling,
       irb->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 void
diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c
index cd8e6eb..fcf50fc 100644
--- a/src/mesa/drivers/dri/i965/intel_screen.c
+++ b/src/mesa/drivers/dri/i965/intel_screen.c
@@ -44,6 +44,8 @@
 #include "utils.h"
 #include "xmlpool.h"
 
+#include "intel_reg.h"
+
 static const __DRIconfigOptionsExtension brw_config_options = {
    .base = { __DRI_CONFIG_OPTIONS, 1 },
    .xml =
@@ -91,9 +93,7 @@ DRI_CONF_BEGIN
 DRI_CONF_END
 };
 
-#include "intel_batchbuffer.h"
 #include "intel_buffers.h"
-#include "intel_bufmgr.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 #include "intel_screen.h"
@@ -118,9 +118,9 @@ get_time(void)
 }
 
 void
-aub_dump_bmp(struct gl_context *ctx)
+aub_dump_bmp(struct brw_context *brw)
 {
-   struct gl_framebuffer *fb = ctx->DrawBuffer;
+   struct gl_framebuffer *fb = brw->ctx.DrawBuffer;
 
    for (int i = 0; i < fb->_NumColorDrawBuffers; i++) {
       struct intel_renderbuffer *irb =
@@ -138,14 +138,16 @@ aub_dump_bmp(struct gl_context *ctx)
 	    continue;
 	 }
 
-         drm_intel_gem_bo_aub_dump_bmp(irb->mt->bo,
-				       irb->draw_x,
-				       irb->draw_y,
-				       irb->Base.Base.Width,
-				       irb->Base.Base.Height,
-				       format,
-				       irb->mt->pitch,
-				       0);
+	 drm_intel_aub_bo_bmp(&brw->batch.aub,
+                              irb->mt->bo->base,
+                              irb->draw_x,
+                              irb->draw_y,
+                              irb->Base.Base.Width,
+                              irb->Base.Base.Height,
+                              format,
+                              irb->mt->bo->tiling,
+                              irb->mt->pitch,
+                              0);
       }
    }
 }
@@ -176,15 +178,15 @@ intel_dri2_flush_with_flags(__DRIcontext *cPriv,
    if (flags & __DRI2_FLUSH_DRAWABLE)
       intel_resolve_for_dri2_flush(brw, dPriv);
 
+   brw_batch_flush(&brw->batch);
+
    if (reason == __DRI2_THROTTLE_SWAPBUFFER)
-      brw->need_swap_throttle = true;
+      brw->batch.need_swap_throttle = true;
    if (reason == __DRI2_THROTTLE_FLUSHFRONT)
-      brw->need_flush_throttle = true;
-
-   intel_batchbuffer_flush(brw);
+      brw->batch.need_flush_throttle = true;
 
    if (INTEL_DEBUG & DEBUG_AUB) {
-      aub_dump_bmp(ctx);
+      aub_dump_bmp(brw);
    }
 }
 
@@ -359,8 +361,8 @@ intel_setup_image_from_mipmap_tree(struct brw_context *brw, __DRIimage *image,
                                                   &image->tile_y);
 
    drm_intel_bo_unreference(image->bo);
-   image->bo = mt->bo;
-   drm_intel_bo_reference(mt->bo);
+   image->bo = mt->bo->base;
+   drm_intel_bo_reference(image->bo);
 }
 
 static __DRIimage *
@@ -421,8 +423,8 @@ intel_create_image_from_renderbuffer(__DRIcontext *context,
    image->offset = 0;
    image->data = loaderPrivate;
    drm_intel_bo_unreference(image->bo);
-   image->bo = irb->mt->bo;
-   drm_intel_bo_reference(irb->mt->bo);
+   image->bo = irb->mt->bo->base;
+   drm_intel_bo_reference(image->bo);
    image->width = rb->Width;
    image->height = rb->Height;
    image->pitch = irb->mt->pitch;
@@ -526,7 +528,7 @@ intel_create_image(__DRIscreen *screen,
    if (image == NULL)
       return NULL;
 
-   
+
    cpp = _mesa_get_format_bytes(image->format);
    image->bo = drm_intel_bo_alloc_tiled(intelScreen->bufmgr, "image",
                                         width, height, cpp, &tiling,
@@ -553,7 +555,7 @@ intel_query_image(__DRIimage *image, int attrib, int *value)
       *value = image->bo->handle;
       return true;
    case __DRI_IMAGE_ATTRIB_NAME:
-      return !drm_intel_bo_flink(image->bo, (uint32_t *) value);
+      return drm_intel_bo_flink(image->bo, (uint32_t *)value) == 0;
    case __DRI_IMAGE_ATTRIB_FORMAT:
       *value = image->dri_format;
       return true;
@@ -569,9 +571,7 @@ intel_query_image(__DRIimage *image, int attrib, int *value)
       *value = image->planar_format->components;
       return true;
    case __DRI_IMAGE_ATTRIB_FD:
-      if (drm_intel_bo_gem_export_to_prime(image->bo, value) == 0)
-         return true;
-      return false;
+      return drm_intel_bo_gem_export_to_prime(image->bo, value) == 0;
    case __DRI_IMAGE_ATTRIB_FOURCC:
       if (intel_lookup_fourcc(image->dri_format, value))
          return true;
@@ -1083,13 +1083,27 @@ intel_init_bufmgr(struct intel_screen *intelScreen)
 
    intelScreen->no_hw = getenv("INTEL_NO_HW") != NULL;
 
-   intelScreen->bufmgr = intel_bufmgr_gem_init(spriv->fd, BATCH_SZ);
+   intelScreen->bufmgr = intel_bufmgr_gem_init(spriv->fd, 0);
    if (intelScreen->bufmgr == NULL) {
       fprintf(stderr, "[%s:%u] Error initializing buffer manager.\n",
 	      __func__, __LINE__);
       return false;
    }
 
+#if 0 /* XXX */
+   driParseConfigFiles(options, &brw->intelScreen->optionCache,
+                       brw->driContext->driScreenPriv->myNum, "i965");
+   switch (driQueryOptioni(options, "bo_reuse")) {
+   case DRI_CONF_BO_REUSE_DISABLED:
+      break;
+   case DRI_CONF_BO_REUSE_ALL:
+      drm_intel_bufmgr_gem_enable_reuse(intelScreen->bufmgr);
+      break;
+   }
+#else
+   drm_intel_bufmgr_gem_enable_reuse(intelScreen->bufmgr);
+#endif
+
    drm_intel_bufmgr_gem_enable_fenced_relocs(intelScreen->bufmgr);
 
    if (!intel_get_boolean(spriv, I915_PARAM_HAS_RELAXED_DELTA)) {
diff --git a/src/mesa/drivers/dri/i965/intel_screen.h b/src/mesa/drivers/dri/i965/intel_screen.h
index e55fddb..733654c 100644
--- a/src/mesa/drivers/dri/i965/intel_screen.h
+++ b/src/mesa/drivers/dri/i965/intel_screen.h
@@ -34,16 +34,20 @@
 #include <GL/internal/dri_interface.h>
 
 #include "dri_util.h"
-#include "intel_bufmgr.h"
 #include "brw_device_info.h"
 #include "i915_drm.h"
 #include "xmlconfig.h"
 
+#include <intel_bufmgr.h>
+
 struct intel_screen
 {
    int deviceID;
    const struct brw_device_info *devinfo;
 
+   drm_intel_bufmgr *bufmgr;
+   drm_intel_bo *workaround_bo;
+
    __DRIscreen *driScrnPriv;
 
    bool no_hw;
@@ -59,9 +63,6 @@ struct intel_screen
     */
    bool has_context_reset_notification;
 
-   dri_bufmgr *bufmgr;
-   drm_intel_bo *workaround_bo;
-
    /**
     * A unique ID for shader programs.
     */
@@ -83,6 +84,12 @@ struct intel_screen
    int cmd_parser_version;
  };
 
+static inline int intel_screen_to_fd(struct intel_screen *scr)
+{
+   __DRIscreen *psp = scr->driScrnPriv;
+   return psp->fd;
+}
+
 extern void intelDestroyContext(__DRIcontext * driContextPriv);
 
 extern GLboolean intelUnbindContext(__DRIcontext * driContextPriv);
@@ -96,7 +103,6 @@ intelMakeCurrent(__DRIcontext * driContextPriv,
                  __DRIdrawable * driReadPriv);
 
 double get_time(void);
-void aub_dump_bmp(struct gl_context *ctx);
 
 const int*
 intel_supported_msaa_modes(const struct intel_screen  *screen);
diff --git a/src/mesa/drivers/dri/i965/intel_syncobj.c b/src/mesa/drivers/dri/i965/intel_syncobj.c
index c44c4be..544380e 100644
--- a/src/mesa/drivers/dri/i965/intel_syncobj.c
+++ b/src/mesa/drivers/dri/i965/intel_syncobj.c
@@ -41,56 +41,13 @@
 #include "main/imports.h"
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "intel_reg.h"
 
-struct brw_fence {
-   /** The fence waits for completion of this batch. */
-   drm_intel_bo *batch_bo;
-
-   bool signalled;
-};
-
 struct intel_gl_sync_object {
    struct gl_sync_object Base;
    struct brw_fence fence;
 };
 
-static void
-brw_fence_finish(struct brw_fence *fence)
-{
-   if (fence->batch_bo)
-      drm_intel_bo_unreference(fence->batch_bo);
-}
-
-static void
-brw_fence_insert(struct brw_context *brw, struct brw_fence *fence)
-{
-   assert(!fence->batch_bo);
-   assert(!fence->signalled);
-
-   brw_emit_mi_flush(brw);
-   fence->batch_bo = brw->batch.bo;
-   drm_intel_bo_reference(fence->batch_bo);
-   intel_batchbuffer_flush(brw);
-}
-
-static bool
-brw_fence_has_completed(struct brw_fence *fence)
-{
-   if (fence->signalled)
-      return true;
-
-   if (fence->batch_bo && !drm_intel_bo_busy(fence->batch_bo)) {
-      drm_intel_bo_unreference(fence->batch_bo);
-      fence->batch_bo = NULL;
-      fence->signalled = true;
-      return true;
-   }
-
-   return false;
-}
-
 /**
  * Return true if the function successfully signals or has already signalled.
  * (This matches the behavior expected from __DRI2fence::client_wait_sync).
@@ -99,27 +56,15 @@ static bool
 brw_fence_client_wait(struct brw_context *brw, struct brw_fence *fence,
                       uint64_t timeout)
 {
-   if (fence->signalled)
-      return true;
-
-   assert(fence->batch_bo);
-
    /* DRM_IOCTL_I915_GEM_WAIT uses a signed 64 bit timeout and returns
-    * immediately for timeouts <= 0.  The best we can do is to clamp the
-    * timeout to INT64_MAX.  This limits the maximum timeout from 584 years to
-    * 292 years - likely not a big deal.
+    * immediately for timeout == 0, and waits indefinitely when the timeout
+    * is negative.  The best we can do is to clamp the timeout to INT64_MAX,
+    * limiting the maximum timeout from 584 years to 292 - likely not a big deal.
     */
    if (timeout > INT64_MAX)
       timeout = INT64_MAX;
 
-   if (drm_intel_gem_bo_wait(fence->batch_bo, timeout) != 0)
-      return false;
-
-   fence->signalled = true;
-   drm_intel_bo_unreference(fence->batch_bo);
-   fence->batch_bo = NULL;
-
-   return true;
+   return brw_fence_wait(fence, timeout) == 0;
 }
 
 static void
@@ -149,18 +94,27 @@ intel_gl_delete_sync_object(struct gl_context *ctx, struct gl_sync_object *s)
 {
    struct intel_gl_sync_object *sync = (struct intel_gl_sync_object *)s;
 
-   brw_fence_finish(&sync->fence);
+   brw_fence_destroy(&sync->fence);
    free(sync);
 }
 
 static void
+__intel_fence_signal(struct brw_fence *fence)
+{
+   struct intel_gl_sync_object *sync = container_of(fence, sync, fence);
+
+   sync->Base.StatusFlag = 1;
+}
+
+static void
 intel_gl_fence_sync(struct gl_context *ctx, struct gl_sync_object *s,
                     GLenum condition, GLbitfield flags)
 {
    struct brw_context *brw = brw_context(ctx);
    struct intel_gl_sync_object *sync = (struct intel_gl_sync_object *)s;
 
-   brw_fence_insert(brw, &sync->fence);
+   sync->fence.signal = __intel_fence_signal;
+   s->StatusFlag = !brw_batch_create_fence(&brw->batch, &sync->fence);
 }
 
 static void
@@ -170,8 +124,7 @@ intel_gl_client_wait_sync(struct gl_context *ctx, struct gl_sync_object *s,
    struct brw_context *brw = brw_context(ctx);
    struct intel_gl_sync_object *sync = (struct intel_gl_sync_object *)s;
 
-   if (brw_fence_client_wait(brw, &sync->fence, timeout))
-      s->StatusFlag = 1;
+   brw_fence_client_wait(brw, &sync->fence, timeout);
 }
 
 static void
@@ -189,8 +142,7 @@ intel_gl_check_sync(struct gl_context *ctx, struct gl_sync_object *s)
 {
    struct intel_gl_sync_object *sync = (struct intel_gl_sync_object *)s;
 
-   if (brw_fence_has_completed(&sync->fence))
-      s->StatusFlag = 1;
+   brw_fence_busy(&sync->fence);
 }
 
 void
@@ -214,7 +166,7 @@ intel_dri_create_fence(__DRIcontext *ctx)
    if (!fence)
       return NULL;
 
-   brw_fence_insert(brw, fence);
+   brw_batch_create_fence(&brw->batch, fence);
 
    return fence;
 }
@@ -224,7 +176,7 @@ intel_dri_destroy_fence(__DRIscreen *screen, void *driver_fence)
 {
    struct brw_fence *fence = driver_fence;
 
-   brw_fence_finish(fence);
+   brw_fence_destroy(fence);
    free(fence);
 }
 
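The fence rework moves the wait/poll logic behind brw_batch:
brw_batch_create_fence() attaches a fence to the current batch,
brw_fence_busy() polls, brw_fence_wait() blocks, and completion is
delivered through the fence's signal callback. A sketch of typical usage
(semantics assumed from this series; brw_fence_wait() returning 0 on
completion as in brw_fence_client_wait() above):

   /* Non-blocking query; completion raises StatusFlag via the callback. */
   brw_fence_busy(&sync->fence);

   /* Blocking wait; the ioctl treats negative timeouts as infinite. */
   if (timeout > INT64_MAX)
      timeout = INT64_MAX;
   bool complete = brw_fence_wait(&sync->fence, timeout) == 0;
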
diff --git a/src/mesa/drivers/dri/i965/intel_tex.c b/src/mesa/drivers/dri/i965/intel_tex.c
index b0181ad..831f104 100644
--- a/src/mesa/drivers/dri/i965/intel_tex.c
+++ b/src/mesa/drivers/dri/i965/intel_tex.c
@@ -330,9 +330,9 @@ intel_set_texture_storage_for_buffer_object(struct gl_context *ctx,
 
    assert(intel_texobj->mt == NULL);
 
-   drm_intel_bo *bo = intel_bufferobj_buffer(brw, intel_buffer_obj,
-                                             buffer_offset,
-                                             row_stride * image->Height);
+   struct brw_bo *bo = intel_bufferobj_buffer(brw, intel_buffer_obj,
+					      buffer_offset,
+					      row_stride * image->Height);
    intel_texobj->mt =
       intel_miptree_create_for_bo(brw, bo,
                                   image->TexFormat,
diff --git a/src/mesa/drivers/dri/i965/intel_tex_image.c b/src/mesa/drivers/dri/i965/intel_tex_image.c
index e077d5e..50f3352 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_image.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_image.c
@@ -19,7 +19,6 @@
 
 #include "intel_mipmap_tree.h"
 #include "intel_buffer_objects.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_blit.h"
 #include "intel_fbo.h"
@@ -94,7 +93,9 @@ intelTexImage(struct gl_context * ctx,
    struct intel_texture_image *intelImage = intel_texture_image(texImage);
    bool ok;
 
-   bool tex_busy = intelImage->mt && drm_intel_bo_busy(intelImage->mt->bo);
+   bool tex_busy =
+      intelImage->mt &&
+      brw_bo_busy(intelImage->mt->bo, BUSY_WRITE | BUSY_RETIRE);
 
    DBG("%s mesa_format %s target %s format %s type %s level %d %dx%dx%d\n",
        __func__, _mesa_get_format_name(texImage->TexFormat),
@@ -146,7 +147,7 @@ intelTexImage(struct gl_context * ctx,
 static void
 intel_set_texture_image_bo(struct gl_context *ctx,
                            struct gl_texture_image *image,
-                           drm_intel_bo *bo,
+                           struct brw_bo *bo,
                            GLenum target,
                            GLenum internalFormat,
                            mesa_format format,
@@ -341,13 +342,15 @@ intel_image_target_texture_2d(struct gl_context *ctx, GLenum target,
     * buffer's content to the main buffer nor for invalidating the aux buffer's
     * content.
     */
-   intel_set_texture_image_bo(ctx, texImage, image->bo,
+   struct brw_bo *bo = brw_bo_import(&brw->batch, image->bo, true);
+   intel_set_texture_image_bo(ctx, texImage, bo,
                               target, image->internal_format,
                               image->format, image->offset,
                               image->width,  image->height,
                               image->pitch,
                               image->tile_x, image->tile_y,
                               MIPTREE_LAYOUT_DISABLE_AUX);
+   brw_bo_put(bo);
 }
 
 /**
@@ -368,11 +371,6 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
    struct intel_texture_image *image = intel_texture_image(texImage);
    int dst_pitch;
 
-   /* The miptree's buffer. */
-   drm_intel_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -427,18 +425,6 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
     */
    intel_miptree_resolve_color(brw, image->mt);
 
-   bo = image->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo)) {
-      perf_debug("Flushing before mapping a referenced bo.\n");
-      intel_batchbuffer_flush(brw);
-   }
-
-   error = brw_bo_map(brw, bo, false /* write enable */, "miptree");
-   if (error) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
 
    dst_pitch = _mesa_image_row_stride(packing, width, format, type);
 
@@ -456,19 +442,16 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
    xoffset += image->mt->level[level].level_x;
    yoffset += image->mt->level[level].level_y;
 
-   tiled_to_linear(
+   return tiled_to_linear(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
       pixels - (ptrdiff_t) yoffset * dst_pitch - (ptrdiff_t) xoffset * cpp,
-      bo->virtual,
+      brw_bo_map(image->mt->bo, MAP_READ | MAP_DETILED),
       dst_pitch, image->mt->pitch,
       brw->has_swizzling,
       image->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 static void
diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
index 7507f76..6b1bff5 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
@@ -38,7 +38,6 @@
 #include "drivers/common/meta.h"
 
 #include "brw_context.h"
-#include "intel_batchbuffer.h"
 #include "intel_tex.h"
 #include "intel_mipmap_tree.h"
 #include "intel_blit.h"
@@ -86,11 +85,6 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
    struct intel_texture_image *image = intel_texture_image(texImage);
    int src_pitch;
 
-   /* The miptree's buffer. */
-   drm_intel_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -103,8 +97,7 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
     * with _mesa_image_row_stride. However, before removing the restrictions
     * we need tests.
     */
-   if (!brw->has_llc ||
-       !(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
+   if (!(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
        !(texImage->TexObject->Target == GL_TEXTURE_2D ||
          texImage->TexObject->Target == GL_TEXTURE_RECTANGLE) ||
        pixels == NULL ||
@@ -141,19 +134,6 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
     */
    intel_miptree_resolve_color(brw, image->mt);
 
-   bo = image->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo)) {
-      perf_debug("Flushing before mapping a referenced bo.\n");
-      intel_batchbuffer_flush(brw);
-   }
-
-   error = brw_bo_map(brw, bo, true /* write enable */, "miptree");
-   if (error || bo->virtual == NULL) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
-
    src_pitch = _mesa_image_row_stride(packing, width, format, type);
 
    /* We postponed printing this message until having committed to executing
@@ -174,19 +154,16 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
    xoffset += image->mt->level[level].level_x;
    yoffset += image->mt->level[level].level_y;
 
-   linear_to_tiled(
+   return linear_to_tiled(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
-      bo->virtual,
+      brw_bo_map(image->mt->bo, MAP_WRITE | MAP_DETILED),
       pixels - (ptrdiff_t) yoffset * src_pitch - (ptrdiff_t) xoffset * cpp,
       image->mt->pitch, src_pitch,
       brw->has_swizzling,
       image->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 static void
@@ -202,7 +179,9 @@ intelTexSubImage(struct gl_context * ctx,
    struct intel_texture_image *intelImage = intel_texture_image(texImage);
    bool ok;
 
-   bool tex_busy = intelImage->mt && drm_intel_bo_busy(intelImage->mt->bo);
+   bool tex_busy =
+      intelImage->mt &&
+      brw_bo_busy(intelImage->mt->bo, BUSY_WRITE | BUSY_RETIRE);
 
    DBG("%s mesa_format %s target %s format %s type %s level %d %dx%dx%d\n",
        __func__, _mesa_get_format_name(texImage->TexFormat),
diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
index dcf0462..404bc2f 100644
--- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
+++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
@@ -552,7 +552,7 @@ ytiled_to_linear_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
  * 'dst' is the start of the texture and 'src' is the corresponding
  * address to copy from, though copying begins at (xt1, yt1).
  */
-void
+bool
 linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -568,6 +568,9 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
    uint32_t tw, th, span;
    uint32_t swizzle_bit = has_swizzling ? 1<<6 : 0;
 
+   if (!dst)
+      return false;
+
    if (tiling == I915_TILING_X) {
       tw = xtile_width;
       th = xtile_height;
@@ -630,6 +633,8 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
                    mem_copy);
       }
    }
+
+   return true;
 }
 
 /**
@@ -643,7 +648,7 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
  * 'dst' is the start of the texture and 'src' is the corresponding
  * address to copy from, though copying begins at (xt1, yt1).
  */
-void
+bool
 tiled_to_linear(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -659,6 +664,9 @@ tiled_to_linear(uint32_t xt1, uint32_t xt2,
    uint32_t tw, th, span;
    uint32_t swizzle_bit = has_swizzling ? 1<<6 : 0;
 
+   if (!src)
+      return false;
+
    if (tiling == I915_TILING_X) {
       tw = xtile_width;
       th = xtile_height;
@@ -721,6 +729,8 @@ tiled_to_linear(uint32_t xt1, uint32_t xt2,
                    mem_copy);
       }
    }
+
+   return true;
 }
 
 
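Having linear_to_tiled()/tiled_to_linear() return bool lets the map result
feed straight in: brw_bo_map(..., MAP_DETILED) yielding NULL simply makes
the copy report failure, and no unmap is needed on any path. Caller-side
sketch (names as in the readpixels path, dst being the linear destination
pointer):

   return tiled_to_linear(
      xoffset * cpp, (xoffset + width) * cpp,
      yoffset, yoffset + height,
      dst,
      brw_bo_map(irb->mt->bo, MAP_READ | MAP_DETILED), /* NULL => false */
      dst_pitch, irb->mt->pitch,
      brw->has_swizzling, irb->mt->tiling, mem_copy);
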
diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
index 9dc1088..a64e516 100644
--- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
+++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
@@ -37,7 +37,7 @@
 
 typedef void *(*mem_copy_fn)(void *dest, const void *src, size_t n);
 
-void
+bool
 linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -46,7 +46,7 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t tiling,
                 mem_copy_fn mem_copy);
 
-void
+bool
 tiled_to_linear(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
diff --git a/src/mesa/drivers/dri/i965/intel_upload.c b/src/mesa/drivers/dri/i965/intel_upload.c
index 870aabc..0da5936 100644
--- a/src/mesa/drivers/dri/i965/intel_upload.c
+++ b/src/mesa/drivers/dri/i965/intel_upload.c
@@ -36,7 +36,6 @@
 #include "brw_context.h"
 #include "intel_blit.h"
 #include "intel_buffer_objects.h"
-#include "intel_batchbuffer.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 
@@ -50,14 +49,10 @@
 #define ALIGN_NPOT(value, alignment) \
    (((value) + (alignment) - 1) / (alignment) * (alignment))
 
-void
+static void
 intel_upload_finish(struct brw_context *brw)
 {
-   if (!brw->upload.bo)
-      return;
-
-   drm_intel_bo_unmap(brw->upload.bo);
-   drm_intel_bo_unreference(brw->upload.bo);
+   brw_bo_put(brw->upload.bo);
    brw->upload.bo = NULL;
    brw->upload.next_offset = 0;
 }
@@ -89,7 +84,7 @@ void *
 intel_upload_space(struct brw_context *brw,
                    uint32_t size,
                    uint32_t alignment,
-                   drm_intel_bo **out_bo,
+                   struct brw_bo **out_bo,
                    uint32_t *out_offset)
 {
    uint32_t offset;
@@ -101,24 +96,20 @@ intel_upload_space(struct brw_context *brw,
    }
 
    if (!brw->upload.bo) {
-      brw->upload.bo = drm_intel_bo_alloc(brw->bufmgr, "streamed data",
-                                          MAX2(INTEL_UPLOAD_SIZE, size), 4096);
-      if (brw->has_llc)
-         drm_intel_bo_map(brw->upload.bo, true);
-      else
-         drm_intel_gem_bo_map_gtt(brw->upload.bo);
+      brw->upload.bo = brw_bo_create(&brw->batch, "streamed data",
+				     MAX2(INTEL_UPLOAD_SIZE, size), 4096, 0);
    }
 
    brw->upload.next_offset = offset + size;
 
    *out_offset = offset;
+
    if (*out_bo != brw->upload.bo) {
-      drm_intel_bo_unreference(*out_bo);
-      *out_bo = brw->upload.bo;
-      drm_intel_bo_reference(brw->upload.bo);
+      brw_bo_put(*out_bo);
+      *out_bo = brw_bo_get(brw->upload.bo);
    }
 
-   return brw->upload.bo->virtual + offset;
+   return brw_bo_map(brw->upload.bo, MAP_WRITE | MAP_ASYNC) + offset;
 }
 
 /**
@@ -131,9 +122,9 @@ intel_upload_data(struct brw_context *brw,
                   const void *data,
                   uint32_t size,
                   uint32_t alignment,
-                  drm_intel_bo **out_bo,
+		  struct brw_bo **out_bo,
                   uint32_t *out_offset)
 {
-   void *dst = intel_upload_space(brw, size, alignment, out_bo, out_offset);
-   memcpy(dst, data, size);
+   memcpy(intel_upload_space(brw, size, alignment, out_bo, out_offset),
+	  data, size);
 }
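
intel_upload_space() now returns a pointer into a persistent
MAP_WRITE | MAP_ASYNC mapping instead of juggling map/unmap pairs, and
intel_upload_finish() leans on brw_bo_put(NULL) being a no-op. Typical
streamed-upload usage:

   struct brw_bo *bo = NULL;
   uint32_t offset;

   void *dst = intel_upload_space(brw, size, 64, &bo, &offset);
   memcpy(dst, data, size);   /* write through the CPU map; no unmap */
   /* ... emit batch commands referencing (bo, offset) ... */
   brw_bo_put(bo);            /* drop our reference when done */
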
-- 
2.1.4
