[Mesa-dev] [PATCH 48/70] i965: Introduce a context-local batch manager

Chris Wilson chris at chris-wilson.co.uk
Fri Aug 7 13:13:52 PDT 2015


When submitting commands to the GPU, every cycle of latency counts;
mutexes, spinlocks, even atomics quickly add up to substantial overhead.

This "batch manager" acts as thread-local shim over the buffer manager
(drm_intel_bufmgr_gem). As we are only ever used from within a single
context, we can rely on the upper layers providing thread safety.
This allows us to import buffers from the shared screen (sharing buffers
between multiple contexts, threads and users) and wrap that handle in
our own. Similarly, we want to share the buffer cache between all
users on the file and so allocate from the global threadsafe buffer
manager, with a very small and transient local cache of active buffers.

The batch manager provides cheap busyness tracking and very efficient
batch construction and kernel submission.
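
For illustration, a rough sketch of the intended flow (the variable names
here are hypothetical; brw_bo_import() and brw_bo_busy() are entry points
added by this patch):

    /* Wrap a drm_intel_bo shared via the screen in a context-local bo */
    struct brw_bo *bo = brw_bo_import(batch, shared_base, true /* borrow */);

    /* Busyness is answered from the local request tracking where possible,
     * falling back to the busy ioctl only when the local fences cannot say.
     */
    if (brw_bo_busy(bo, BUSY_WRITE | BUSY_FLUSH, NULL)) {
       /* still in flight on the GPU; prefer a GPU copy or staging upload */
    }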

The restrictions over and above the generic submission engine in
intel_bufmgr_gem are:
     - not thread-safe
     - flat relocations, only the batch buffer itself carries
       relocations. Relocations relative to auxiliary buffers
       must be performed via STATE_BASE
     - direct mapping of the batch for writes, expect reads
       from the batch to be slow
     - the batch is a fixed 64k in size
     - access to the batch must be wrapped by brw_batch_begin/_end
     - all relocations must be immediately written into the batch

The importance of the flat relocation tree with local offset handling is
that it allows us to use the "relocation-less" execbuffer interfaces,
dramatically reducing the overhead of batch submission. However, that
can be relaxed to allow buffers other than the batch buffer to carry
relocations, if need be.
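
For example, the shape every state emitter is expected to follow under the
last two restrictions above (a sketch only: the emission in the middle
stands in for the driver's usual BATCH_EMIT()-style macros, brw_batch_end()
is the documented counterpart assumed to be declared in brw_batch.h, and a
positive return from brw_batch_begin() means the batch had to be flushed
and the section replayed):

    if (brw_batch_begin(batch, 4 * 4 /* estimated bytes */, RENDER_RING) < 0)
       return;   /* fatal, e.g. -ENOMEM */

    /* ... emit commands, writing every relocation into the batch
     *     immediately via __brw_batch_reloc() ... */

    brw_batch_end(batch);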

ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation
speedups.

ISSUES:
* shared mipmap trees
  - we instantiate a context-local copy on use, but what are the semantics for
    serialising reads/writes between them - do we need automagic flushing of
    execution on other contexts and common busyness tracking?
  - we retain references to the bo past the lifetime of its parent
    batchmgr as the mipmap_tree is retained past the lifetime of its
    original context, see glx_arb_create_context/default_major_version
* OglMultithread is nevertheless unhappy; but that looks like undefined
  behaviour - i.e. a buggy client concurrently executing the same GL
  context in multiple threads - and the unpatched driver is equally buggy.
* Add full-ppgtt softpinning support (no more relocations, at least for
  the first 256TiB); at the moment there is only a limited proof-of-principle
  demonstration
* polish and move to libdrm; though at the cost of sealing the structs?

Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
Cc: Kristian Høgsberg <krh at bitplanet.net>
Cc: Kenneth Graunke <kenneth at whitecape.org>
Cc: Jesse Barnes <jbarnes at virtuousgeek.org>
Cc: Ian Romanick <ian.d.romanick at intel.com>
Cc: Abdiel Janulgue <abdiel.janulgue at linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen at intel.com>
Cc: Martin Peres <martin.peres at linux.intel.com>
---
 src/mesa/drivers/dri/i965/Makefile.sources         |    3 +-
 src/mesa/drivers/dri/i965/brw_batch.c              | 1914 ++++++++++++++++++++
 src/mesa/drivers/dri/i965/brw_batch.h              |  464 +++--
 src/mesa/drivers/dri/i965/brw_context.c            |   46 +-
 src/mesa/drivers/dri/i965/brw_context.h            |   20 +-
 .../drivers/dri/i965/brw_performance_monitor.c     |   37 +-
 src/mesa/drivers/dri/i965/brw_pipe_control.c       |   23 +-
 src/mesa/drivers/dri/i965/brw_program.c            |    4 +-
 src/mesa/drivers/dri/i965/brw_queryobj.c           |   21 +-
 src/mesa/drivers/dri/i965/brw_reset.c              |    2 +-
 src/mesa/drivers/dri/i965/brw_state.h              |    2 +-
 src/mesa/drivers/dri/i965/brw_state_batch.c        |   26 +-
 src/mesa/drivers/dri/i965/brw_state_cache.c        |   57 +-
 src/mesa/drivers/dri/i965/brw_state_dump.c         |   16 +-
 src/mesa/drivers/dri/i965/brw_urb.c                |    8 +-
 src/mesa/drivers/dri/i965/gen6_queryobj.c          |   35 +-
 src/mesa/drivers/dri/i965/gen7_sol_state.c         |   13 +-
 src/mesa/drivers/dri/i965/intel_batchbuffer.c      |  439 -----
 src/mesa/drivers/dri/i965/intel_batchbuffer.h      |  142 --
 src/mesa/drivers/dri/i965/intel_blit.c             |    2 +-
 src/mesa/drivers/dri/i965/intel_buffer_objects.c   |  161 +-
 src/mesa/drivers/dri/i965/intel_buffer_objects.h   |    2 -
 src/mesa/drivers/dri/i965/intel_debug.c            |    3 -
 src/mesa/drivers/dri/i965/intel_fbo.c              |    5 +-
 src/mesa/drivers/dri/i965/intel_mipmap_tree.c      |   36 +-
 src/mesa/drivers/dri/i965/intel_mipmap_tree.h      |    1 -
 src/mesa/drivers/dri/i965/intel_pixel_copy.c       |    2 -
 src/mesa/drivers/dri/i965/intel_pixel_read.c       |   34 +-
 src/mesa/drivers/dri/i965/intel_screen.c           |    7 +-
 src/mesa/drivers/dri/i965/intel_screen.h           |    9 +-
 src/mesa/drivers/dri/i965/intel_syncobj.c          |    6 +-
 src/mesa/drivers/dri/i965/intel_tex_image.c        |   41 +-
 src/mesa/drivers/dri/i965/intel_tex_subimage.c     |   38 +-
 src/mesa/drivers/dri/i965/intel_tiled_memcpy.c     |   14 +-
 src/mesa/drivers/dri/i965/intel_tiled_memcpy.h     |    4 +-
 src/mesa/drivers/dri/i965/intel_upload.c           |   12 +-
 36 files changed, 2452 insertions(+), 1197 deletions(-)
 create mode 100644 src/mesa/drivers/dri/i965/brw_batch.c
 delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.c
 delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.h

diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
index be80246..ba3c0c7 100644
--- a/src/mesa/drivers/dri/i965/Makefile.sources
+++ b/src/mesa/drivers/dri/i965/Makefile.sources
@@ -1,4 +1,5 @@
 i965_FILES = \
+	brw_batch.c \
 	brw_batch.h \
 	brw_binding_tables.c \
 	brw_blorp_blit.cpp \
@@ -192,8 +193,6 @@ i965_FILES = \
 	gen8_wm_depth_stencil.c \
 	intel_asm_annotation.c \
 	intel_asm_annotation.h \
-	intel_batchbuffer.c \
-	intel_batchbuffer.h \
 	intel_blit.c \
 	intel_blit.h \
 	intel_buffer_objects.c \
diff --git a/src/mesa/drivers/dri/i965/brw_batch.c b/src/mesa/drivers/dri/i965/brw_batch.c
new file mode 100644
index 0000000..d9386bb
--- /dev/null
+++ b/src/mesa/drivers/dri/i965/brw_batch.c
@@ -0,0 +1,1914 @@
+/*
+ * Copyright (c) 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Authors:
+ *    Chris Wilson <chris at chris-wilson.co.uk>
+ *
+ */
+#include "brw_batch.h"
+#include "brw_context.h" /* XXX brw_batch_start_hook() et al*/
+
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <stdlib.h>
+#include <setjmp.h>
+
+#include <intel_bufmgr.h>
+#include <i915_drm.h>
+#include <xf86drm.h>
+#include <errno.h>
+
+#include "intel_screen.h"
+
+/*
+ * When submitting commands to the GPU, every cycle of latency counts;
+ * mutexes, spinlocks, even atomics quickly add up to substantial overhead.
+ *
+ * This "batch manager" acts as thread-local shim over the buffer manager
+ * (drm_intel_bufmgr_gem). As we are only ever used from within a single
+ * context, we can rely on the upper layers providing thread safety. This
+ * allows us to import buffers from the shared screen (sharing buffers
+ * between multiple contexts, threads and users) and wrap that handle in
+ * our own. Similarly, we want to share the buffer cache between all users
+ * on the file and so allocate from the global threadsafe buffer manager,
+ * with a very small and transient local cache of active buffers.
+ *
+ * The batch manager provides cheap busyness tracking and very
+ * efficient batch construction and kernel submission.
+ *
+ * The restrictions over and above the generic submission engine in
+ * intel_bufmgr_gem are:
+ * 	- not thread-safe
+ * 	- flat relocations, only the batch buffer itself carries
+ * 	  relocations. Relocations relative to auxiliary buffers
+ * 	  must be performed via STATE_BASE
+ * 	- direct mapping of the batch for writes, expect reads
+ * 	  from the batch to be slow
+ * 	- the batch is a fixed 64k in size
+ * 	- access to the batch must be wrapped by brw_batch_begin/_end
+ * 	- all relocations must be immediately written into the batch
+ */
+
+/**
+ * Number of bytes to reserve for commands necessary to complete a batch.
+ *
+ * This includes:
+ * - MI_BATCHBUFFER_END (4 bytes)
+ * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
+ * - Any state emitted by vtbl->finish_batch():
+ *   - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
+ *   - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
+ *   - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
+ *     - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
+ *       which are 5 DWords each ==> 2 * 3 * 5 * 4 = 120 bytes
+ *     - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+.  ==> 12 bytes.
+ *       On Ironlake, it's 6 DWords, but we have some slack due to the lack of
+ *       Sandybridge PIPE_CONTROL madness.
+ */
+#define BATCH_RESERVED 152
+
+/* Surface offsets are limited to a maximum of 64k from the surface base */
+#define BATCH_SIZE (64 << 10)
+
+/* XXX Temporary home until kernel patches land */
+#define I915_PARAM_HAS_EXEC_SOFTPIN 37
+#define EXEC_OBJECT_PINNED (1<<5)
+#define I915_PARAM_HAS_EXEC_BATCH_FIRST 38
+#define I915_EXEC_BATCH_FIRST (1<<16)
+
+#define DBG_NO_FAST_RELOC 0
+#define DBG_NO_HANDLE_LUT 0
+#define DBG_NO_BATCH_FIRST 0
+#define DBG_NO_SOFTPIN 0
+#define DBG_NO_MMAP_WC 0
+
+#define PERF_IDLE 0 /* ring mask */
+
+#define READ_SIGNAL 0
+#define WRITE_SIGNAL 1
+#define NO_SIGNAL 2
+
+static const unsigned hw_ring[] = {
+   [RENDER_RING] = I915_EXEC_RENDER,
+   [BLT_RING] = I915_EXEC_BLT,
+};
+
+/*
+ * The struct brw_request is central to efficiently tracking GPU activity,
+ * and the busyness of all buffers. It serves as both a read and a write
+ * fence on the buffers (and as the external GL fence). This is done by
+ * associating each relocation (i.e. every use of a buffer by a GPU within
+ * a batch) with the request as a read fence (for a read-only relocation)
+ * or as both the read/write fences (for a writeable relocation).
+ *
+ * Then if we ever need to query whether a particular buffer is active,
+ * we can look at the appropriate fence and see whether it has expired.
+ * If not, we can then ask the kernel if it has just retired and report back.
+ * If the request is still undergoing construction and not been submitted,
+ * we have that information immediately available and can report busyness
+ * without having to search.
+ *
+ * Periodically (after every new request) we poll for request completion,
+ * asking if the oldest is complete. This allows us to then maintain the
+ * busyness state of all buffers without having to query every buffer
+ * every time.
+ *
+ * After certain events (such as mapping or waiting on a buffer), we know that
+ * the buffer is idle and so is the associated fence and all fences older.
+ *
+ * A nice side-effect of tracking requests and buffer busyness is that we
+ * can also track a reasonable measure of how much of the aperture is filled
+ * by active buffers (a resident set size). This is useful for predicting
+ * when the kernel will start evicting our buffers, for example.
+ */
+struct brw_request {
+   struct brw_bo *bo;
+   struct brw_request *next;
+   struct list_head fences;
+};
+#define RQ_MARK_RING(rq, ring) ((struct brw_bo *)((uintptr_t)((rq)->bo) | (ring)))
+#define RQ_BO(rq) ((struct brw_bo *)((uintptr_t)(rq)->bo & ~3))
+#define RQ_RING(rq) (((unsigned)(uintptr_t)(rq)->bo & 3))
+
+static bool __brw_bo_busy(struct brw_bo *bo)
+{
+   struct drm_i915_gem_busy busy;
+
+   memset(&busy, 0, sizeof(busy));
+   busy.handle = bo->handle;
+   busy.busy = ~0;
+   drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_BUSY, &busy);
+   /* If an error occurs here, it can only be due to flushing the
+    * buffer on the hardware i.e. the buffer itself is still busy.
+    * Possible errors are:
+    * 	-ENOENT: the buffer didn't exist, impossible!
+    * 	-ENOMEM: the kernel failed to flush due to allocation failures
+    * 	         scary, but the buffer is busy.
+    * 	-EIO:    the kernel should have marked the buffer as idle during
+    * 	         the reset; if it hasn't, it never will, and the buffer
+    * 	         itself will never become idle.
+    * 	(-EINTR, -EAGAIN eaten by drmIoctl()).
+    */
+   return busy.busy;
+}
+
+/*
+ * Retire this and all older requests.
+ */
+static void __brw_request_retire(struct brw_request * const rq)
+{
+   const int ring = RQ_RING(rq);
+   struct brw_batch * const batch = RQ_BO(rq)->batch;
+   struct brw_request * const tail = rq->next;
+   struct brw_request *tmp;
+
+   assert(!__brw_bo_busy(RQ_BO(rq)) || batch->fini);
+
+   if (PERF_IDLE & (1 << RQ_RING(rq)) && rq->next == NULL)
+      batch->idle_time[RQ_RING(rq)] = -get_time();
+
+   tmp = batch->requests[ring].lru;
+   do {
+      assert(!__brw_bo_busy(RQ_BO(tmp)) || batch->fini);
+      assert(RQ_BO(tmp)->exec == NULL);
+
+      list_for_each_entry_safe(struct __brw_fence, fence, &tmp->fences, link) {
+         struct brw_bo *bo;
+
+         assert(fence->rq == tmp);
+         list_inithead(&fence->link);
+         fence->rq = NULL;
+
+         switch ((uintptr_t)fence->signal) {
+         case READ_SIGNAL:
+            bo = container_of(fence, bo, read);
+            assert(bo->exec == NULL);
+
+            if (unlikely(bo->write.rq)) {
+               assert(RQ_RING(bo->write.rq) != RQ_RING(rq));
+               assert(bo->write.rq != rq);
+               __brw_request_retire(bo->write.rq);
+            }
+            assert(bo->write.rq == NULL);
+
+            assert(batch->rss >= bo->size);
+            batch->rss -= bo->size;
+
+            if (likely(bo->reusable))
+               list_move(&bo->link, &batch->inactive);
+
+            if (unlikely(!bo->refcnt))
+               __brw_bo_free(bo);
+            break;
+
+         case WRITE_SIGNAL:
+         case NO_SIGNAL:
+            break;
+
+         default:
+            fence->signal(fence);
+            break;
+         }
+      }
+      assert(RQ_BO(tmp)->write.rq == NULL);
+      assert(RQ_BO(tmp)->read.rq == NULL);
+
+      if (tmp == batch->throttle[0])
+         batch->throttle[0] = NULL;
+      if (tmp == batch->throttle[1])
+         batch->throttle[1] = NULL;
+
+      tmp->bo = RQ_BO(tmp); /* strip off the ring id */
+      tmp = tmp->next;
+   } while (tmp != tail);
+
+   rq->next = batch->freed_rq;
+   batch->freed_rq = batch->requests[ring].lru;
+
+   batch->requests[ring].lru = tmp;
+   if (tmp == NULL)
+      batch->requests[ring].mru = NULL;
+}
+
+/*
+ * Is the request busy? First we can see if this request
+ * has already been retired (idle), or if this request is still under
+ * construction (busy). Failing that, to the best of our knowledge it is
+ * still being processed by the GPU, so we must ask the kernel if the
+ * request is now idle. If we find it is idle, we now know this and all
+ * older requests are also idle.
+ */
+bool __brw_request_busy(struct brw_request *rq,
+                        unsigned flags,
+                        struct perf_debug *perf)
+{
+   struct brw_bo *bo;
+   if (rq == NULL)
+      return false;
+
+   bo = RQ_BO(rq);
+   if (bo->read.rq == NULL)
+      return false;
+
+   assert(bo->read.rq == rq);
+
+   if (bo->dirty) {
+      if (flags & BUSY_FLUSH)
+         brw_batch_flush(bo->batch, perf);
+      return true;
+   }
+
+   if (__brw_bo_busy(bo))
+      return true;
+
+   __brw_request_retire(rq);
+   return false;
+}
+
+/*
+ * Update the cache domain tracked by the kernel. This can have a number
+ * of side-effects but is essential in order to maintain coherency and
+ * serialisation between the GPU and CPU. If there is conflicting GPU access
+ * then set-domain will wait until the GPU has finished accessing the buffer
+ * before proceeding to change the domain. If the buffer is not cache coherent
+ * and we request CPU access, the kernel will clflush that buffer to make it
+ * coherent with the CPU access. Both of these imply delays and overhead, so
+ * we do our best to avoid moving buffers to the GTT/CPU domains. However,
+ * if we do, we know the buffer and its requst are idle so we can update
+ * our request tracking after a blocking call.
+ */
+static void __brw_bo_set_domain(struct brw_bo *bo, unsigned domain, bool write)
+{
+   struct drm_i915_gem_set_domain set;
+   struct brw_request *rq;
+
+   if (bo->domain == domain)
+      return;
+
+   if (bo->exec) /* flush failed, pretend we are ASYNC | INCOHERENT */
+      return;
+
+   memset(&set, 0, sizeof(set));
+   set.handle = bo->handle;
+   set.read_domains =
+      domain == DOMAIN_CPU ? I915_GEM_DOMAIN_CPU : I915_GEM_DOMAIN_GTT;
+   if (write)
+      set.write_domain = set.read_domains;
+
+   if (unlikely(drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &set)))
+      return;
+
+   rq = write ? bo->read.rq : bo->write.rq;
+   if (rq)
+      __brw_request_retire(rq);
+
+   bo->domain = write ? domain : DOMAIN_NONE;
+   assert(bo->refcnt);
+}
+
+/*
+ * Wait for the buffer to become completely idle, i.e. not being accessed by
+ * the GPU at all (neither for outstanding reads nor writes).
+ * This is equivalent to setting the buffer write domain to GTT, but the
+ * wait ioctl avoids the set-domain side-effects (e.g. clflushing in
+ * some circumstances).
+ */
+static int __brw_bo_wait(struct brw_bo *bo,
+                         int64_t timeout,
+                         struct perf_debug *perf)
+{
+   struct drm_i915_gem_wait wait;
+   int ret;
+
+   assert(bo->exec == NULL);
+
+   if (!brw_bo_busy(bo, BUSY_WRITE | BUSY_FLUSH | BUSY_RETIRE, perf))
+      return 0;
+
+   memset(&wait, 0, sizeof(wait));
+   wait.bo_handle = bo->handle;
+   wait.timeout_ns = timeout;
+   wait.flags = 0;
+
+   if (unlikely(perf))
+      perf->elapsed = -get_time();
+
+   if (unlikely(drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_WAIT, &wait))) {
+      ret = -errno;
+      if (timeout < 0) {
+         __brw_bo_set_domain(bo, DOMAIN_GTT, true);
+         ret = 0;
+      }
+   } else {
+      assert(bo->read.rq);
+      __brw_request_retire(bo->read.rq);
+      ret = 0;
+   }
+
+   if (unlikely(perf)) {
+      perf->elapsed += get_time();
+      if (perf->elapsed > 1e-5) /* 0.01ms */
+         brw_batch_report_stall_hook(bo->batch, perf);
+   }
+
+   return ret;
+}
+
+static inline uint32_t hash_32(uint32_t hash, unsigned bits)
+{
+   return (hash * 0x9e37001) >> (32 - bits);
+}
+
+static inline struct list_head *borrowed(struct brw_batch *batch, uint32_t handle)
+{
+   return &batch->borrowed[hash_32(handle, BORROWED_BITS)];
+}
+
+/*
+ * We have context-local bos, but those may be shared between contexts by
+ * shared mipmaps and other buffers. If we find we are dealing with a bo
+ * belonging to another batch, we need to translate that into a local bo
+ * for associating with our fences.
+ */
+static struct brw_bo *__brw_batch_lookup_handle(struct brw_batch *batch,
+                                                uint32_t handle)
+{
+   /* XXX may need a resizable ht? */
+   struct list_head *hlist = borrowed(batch, handle);
+
+   list_for_each_entry(struct brw_bo, bo, hlist, link)
+      if (bo->handle == handle)
+         return bo;
+
+   return NULL;
+}
+
+inline static bool has_lut(struct brw_batch *batch)
+{
+   return batch->batch_base_flags & I915_EXEC_HANDLE_LUT;
+}
+
+static void __brw_batch_clear(struct brw_batch *batch)
+{
+   memset(&batch->emit, 0, sizeof(batch->emit));
+   batch->_ptr = batch->map;
+   batch->reserved = BATCH_RESERVED / 4;
+   batch->state = BATCH_SIZE / 4;
+   batch->aperture = 0;
+   batch->batch_flags = batch->batch_base_flags;
+}
+
+/*
+ * Prepare the batch manager for constructing a new batch/request.
+ *
+ * Reset all the accounting we do per-batch, and allocate ourselves a new
+ * batch bo.
+ */
+static int __brw_batch_reset(struct brw_batch *batch)
+{
+   struct brw_request *rq;
+
+retry:
+   rq = batch->freed_rq;
+   if (unlikely(rq == NULL)) {
+      rq = malloc(sizeof(*rq));
+      if (unlikely(rq == NULL))
+         goto oom;
+
+      rq->bo = brw_bo_create(batch, "batch", BATCH_SIZE, 0, 0);
+      if (unlikely(rq->bo == NULL)) {
+         free(rq);
+         goto oom;
+      }
+      rq->bo->target_handle = -1;
+
+      /* We are inheriting a foreign buffer, so call set-domain */
+      brw_bo_map(rq->bo, MAP_WRITE, NULL);
+   } else
+      batch->freed_rq = rq->next;
+   rq->next = NULL;
+
+   assert(RQ_BO(rq) == rq->bo);
+   batch->map = brw_bo_map(rq->bo, MAP_WRITE | MAP_ASYNC, NULL);
+   if (unlikely(batch->map == NULL)) {
+      brw_bo_put(rq->bo);
+      free(rq);
+
+oom:
+      /* force the synchronization to recover some memory */
+      rq = batch->requests[batch->ring].mru;
+      if (rq == NULL) {
+         batch->next_request = NULL;
+         return -ENOMEM;
+      }
+
+      __brw_bo_wait(RQ_BO(rq), -1, NULL);
+      goto retry;
+   }
+
+   batch->next_request = rq;
+   batch->bo = rq->bo;
+
+   __brw_batch_clear(batch);
+
+   assert(rq->bo->target_handle == -1);
+   list_inithead(&rq->fences);
+   list_add(&rq->bo->read.link, &rq->fences);
+   if (batch->batch_base_flags & I915_EXEC_BATCH_FIRST) {
+      rq->bo->target_handle =
+         has_lut(batch) ? batch->emit.nexec : rq->bo->handle;
+      rq->bo->exec =
+         memset(&batch->exec[batch->emit.nexec++], 0, sizeof(*rq->bo->exec));
+   } else
+      rq->bo->exec = (void *)1;
+   rq->bo->read.rq = rq;
+   batch->rss += BATCH_SIZE;
+   return 0;
+}
+
+static int gem_param(int fd, int name)
+{
+   drm_i915_getparam_t gp;
+   int v = -1; /* No param uses (yet) the sign bit, reserve it for errors */
+
+   memset(&gp, 0, sizeof(gp));
+   gp.param = name;
+   gp.value = &v;
+   if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
+      return -1;
+
+   return v;
+}
+
+static bool test_has_fast_reloc(int fd)
+{
+   if (DBG_NO_FAST_RELOC)
+      return DBG_NO_FAST_RELOC < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_NO_RELOC) > 0;
+}
+
+static bool test_has_handle_lut(int fd)
+{
+   if (DBG_NO_HANDLE_LUT)
+      return DBG_NO_HANDLE_LUT < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_HANDLE_LUT) > 0;
+}
+
+static bool test_has_batch_first(int fd)
+{
+   if (DBG_NO_BATCH_FIRST)
+      return DBG_NO_BATCH_FIRST < 0;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_BATCH_FIRST) > 0;
+}
+
+static bool test_has_mmap_wc(int fd)
+{
+   if (DBG_NO_MMAP_WC)
+      return DBG_NO_MMAP_WC < 0;
+
+   return gem_param(fd, I915_PARAM_MMAP_VERSION) > 0;
+}
+
+static bool test_has_softpin(int fd)
+{
+   if (DBG_NO_SOFTPIN)
+      return DBG_NO_SOFTPIN < 0;
+
+   if (gem_param(fd, I915_PARAM_HAS_ALIASING_PPGTT) < 2)
+      return false;
+
+   return gem_param(fd, I915_PARAM_HAS_EXEC_SOFTPIN) > 0;
+}
+
+static uint64_t __get_max_aperture(int fd)
+{
+   struct drm_i915_gem_get_aperture aperture;
+
+   if (gem_param(fd, I915_PARAM_HAS_ALIASING_PPGTT) > 2)
+      return (uint64_t)1 << 48;
+
+   memset(&aperture, 0, sizeof(aperture));
+   if (unlikely(drmIoctl(fd, DRM_IOCTL_I915_GEM_GET_APERTURE, &aperture)))
+      return 512 << 20; /* Minimum found on gen4+ */
+
+   return aperture.aper_size;
+}
+
+static uint64_t get_max_aperture(int fd)
+{
+   static uint64_t max_aperture;
+
+   if (max_aperture == 0)
+      max_aperture = __get_max_aperture(fd);
+
+   return max_aperture;
+}
+
+/*
+ * Initialise the batch-manager for the context.
+ *
+ * We use the devinfo and settings found in intel_screen to set ourselves up
+ * for the hardware environment, and supplement that with our own feature
+ * tests. (These too should probably move to intel_screen and be shared between
+ * all contexts.)
+ */
+int brw_batch_init(struct brw_batch *batch,
+                   struct intel_screen *screen)
+{
+   const struct brw_device_info *devinfo;
+   struct drm_i915_gem_context_create create;
+   int ret;
+   int n;
+
+   batch->fd = intel_screen_to_fd(screen);
+   batch->bufmgr = screen->bufmgr;
+   batch->screen = screen;
+
+   devinfo = screen->devinfo;
+
+   batch->no_hw = screen->no_hw;
+
+   batch->needs_pipecontrol_ggtt_wa = devinfo->gen == 6;
+   batch->reloc_size = 512;
+   batch->exec_size = 256;
+   batch->reloc = malloc(sizeof(batch->reloc[0])*batch->reloc_size);
+   batch->exec = malloc(sizeof(batch->exec[0])*batch->exec_size);
+   if (unlikely(batch->reloc == NULL || batch->exec == NULL)) {
+      ret = -ENOMEM;
+      goto err;
+   }
+
+   for (n = 0; n < 1 << BORROWED_BITS; n++)
+      list_inithead(&batch->borrowed[n]);
+   list_inithead(&batch->active);
+   list_inithead(&batch->inactive);
+
+   batch->actual_ring[RENDER_RING] = RENDER_RING;
+   batch->actual_ring[BLT_RING] = BLT_RING;
+   if (devinfo->gen < 6)
+      batch->actual_ring[BLT_RING] = RENDER_RING;
+
+   batch->has_llc = devinfo->has_llc;
+   batch->has_mmap_wc = test_has_mmap_wc(batch->fd);
+   batch->has_softpin = test_has_softpin(batch->fd);
+   batch->max_aperture = 3*get_max_aperture(batch->fd)/4;
+
+   if (test_has_fast_reloc(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_NO_RELOC;
+   if (test_has_handle_lut(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_HANDLE_LUT;
+   if (test_has_batch_first(batch->fd))
+      batch->batch_base_flags |= I915_EXEC_BATCH_FIRST;
+
+   /* Create a new hardware context.  Using a hardware context means that
+    * our GPU state will be saved/restored on context switch, allowing us
+    * to assume that the GPU is in the same state we left it in.
+    *
+    * This is required for transform feedback buffer offsets, query objects,
+    * and also allows us to reduce how much state we have to emit.
+    */
+   memset(&create, 0, sizeof(create));
+   drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &create);
+   batch->hw_ctx = create.ctx_id;
+   if (!batch->hw_ctx) {
+      if (devinfo->gen >= 6) {
+         ret = -errno;
+         fprintf(stderr, "Gen6+ requires Kernel 3.6 or later.\n");
+         goto err;
+      }
+   }
+
+   ret = __brw_batch_reset(batch);
+   if (ret)
+      goto err;
+
+   return 0;
+
+err:
+   free(batch->reloc);
+   free(batch->exec);
+   return ret;
+}
+
+static void __brw_batch_grow_exec(struct brw_batch *batch)
+{
+   struct drm_i915_gem_exec_object2 *new_exec;
+   uint16_t new_size;
+
+   new_size = batch->exec_size * 2;
+   new_exec = NULL;
+   if (likely(new_size > batch->exec_size))
+      new_exec = realloc(batch->exec, new_size*sizeof(new_exec[0]));
+   if (unlikely(new_exec == NULL))
+      longjmp(batch->jmpbuf, -ENOMEM);
+
+   if (new_exec != batch->exec) {
+      struct list_head * const list = &batch->next_request->fences;
+
+      list_for_each_entry_rev(struct __brw_fence, fence, list, link) {
+         struct brw_bo *bo;
+
+         if (unlikely(fence->signal != (void *)READ_SIGNAL)) {
+            if (fence->signal == (void *)WRITE_SIGNAL)
+               break;
+            else
+               continue;
+         }
+
+         bo = container_of(fence, bo, read);
+         bo->exec = new_exec + (bo->exec - batch->exec);
+      }
+
+      batch->exec = new_exec;
+   }
+
+   batch->exec_size = new_size;
+}
+
+static void __brw_batch_grow_reloc(struct brw_batch *batch)
+{
+   struct drm_i915_gem_relocation_entry *new_reloc;
+   uint16_t new_size;
+
+   new_size = batch->reloc_size * 2;
+   new_reloc = NULL;
+   if (likely(new_size > batch->reloc_size))
+      new_reloc = realloc(batch->reloc, new_size*sizeof(new_reloc[0]));
+   if (unlikely(new_reloc == NULL))
+      longjmp(batch->jmpbuf, -ENOMEM);
+
+   batch->reloc = new_reloc;
+   batch->reloc_size = new_size;
+}
+
+/*
+ * Add a relocation entry for the target buffer into the current batch.
+ *
+ * This is the heart of performing fast relocations, both here and in
+ * the corresponding kernel relocation routines.
+ *
+ * - Instead of passing in handles for the kernel to convert back into
+ *   buffers for every relocation, we tell the kernel which
+ *   execobject slot corresponds to the relocation. The kernel is
+ *   able to use a simple LUT constructed as it first looks up each buffer
+ *   for the batch rather than search a small, overfull hashtable. As both
+ *   the number of relocations and buffers in a batch grow, the simple
+ *   LUT is much more efficient (though the LUT itself is less cache
+ *   friendly).
+ *   However, as the batch buffer is by definition the last object in
+ *   the execbuffer array we have to perform a pass to relabel the
+ *   target of all relocations pointing to the batch. (Except when
+ *   the kernel supports batch-first, in which case we can do the relocation
+ *   target processing for the batch inline.)
+ *
+ * - If the kernel has not moved the buffer, it will still be in the same
+ *   location as last time we used it. If we tell the kernel that all the
+ *   relocation entries are the same as the offset for the buffer, then
+ *   the kernel need only check that all the buffers are still in the same
+ *   location and then skip performing relocations entirely. A huge win.
+ *
+ * - As a consequence of telling the kernel to skip processing the relocations,
+ *   we need to tell the kernel about the read/write domains and special needs
+ *   of the buffers.
+ *
+ * - Alternatively, we can request the kernel place the buffer exactly
+ *   where we want it and forgo all relocations to that buffer entirely.
+ *   The buffer is effectively pinned for its lifetime (if the kernel
+ *   does have to move it, for example to swap it out to recover memory,
+ *   the kernel will return it back to our requested location at the start
+ *   of the next batch.) This of course imposes a lot of constraints on where
+ *   we can say the buffers are: they must meet all the alignment constraints
+ *   and not overlap.
+ *
+ * - Essential to all these techniques is that we always use the same
+ *   presumed_offset for the relocations as for submitting the execobject.
+ *   That value must be written into the batch and it must match the value
+ *   we tell the kernel. (This breaks down when using relocation trees shared
+ *   between multiple contexts, hence the need for context-local batch
+ *   management.)
+ *
+ * In contrast to libdrm, we can build the execbuffer array along with
+ * the batch by forgoing the ability to handle general relocation trees.
+ * This avoids having multiple passes to build the execbuffer parameter,
+ * and also gives us a means to cheaply track when a buffer has been
+ * referenced by the batch.
+ */
+uint64_t __brw_batch_reloc(struct brw_batch *batch,
+                           uint32_t batch_offset,
+                           struct brw_bo *target_bo,
+                           uint64_t target_offset,
+                           unsigned read_domains,
+                           unsigned write_domain)
+{
+   assert(batch->inside_begin_count);
+
+   assert(target_bo->refcnt);
+   if (unlikely(target_bo->batch != batch)) {
+      /* XXX legal sharing between contexts/threads? */
+      target_bo = brw_bo_import(batch, target_bo->base, true);
+      if (unlikely(target_bo == NULL))
+         longjmp(batch->jmpbuf, -ENOMEM);
+      target_bo->refcnt--; /* kept alive by the implicit active reference */
+   }
+   assert(target_bo->batch == batch);
+
+   if (target_bo->exec == NULL) {
+      int n;
+
+      /* reserve one exec entry for the batch */
+      if (unlikely(batch->emit.nexec + 1 == batch->exec_size))
+         __brw_batch_grow_exec(batch);
+
+      n = batch->emit.nexec++;
+      target_bo->target_handle = has_lut(batch) ? n : target_bo->handle;
+      target_bo->exec = memset(batch->exec + n, 0, sizeof(*target_bo->exec));
+      target_bo->exec->handle = target_bo->handle;
+      target_bo->exec->alignment = target_bo->alignment;
+      target_bo->exec->offset = target_bo->offset;
+      if (target_bo->pinned)
+         target_bo->exec->flags = EXEC_OBJECT_PINNED;
+
+      /* Track the total amount of memory in use by all active requests */
+      if (target_bo->read.rq == NULL) {
+         batch->rss += target_bo->size;
+         if (batch->rss > batch->peak_rss)
+            batch->peak_rss = batch->rss;
+      }
+      target_bo->read.rq = batch->next_request;
+      list_movetail(&target_bo->read.link, &batch->next_request->fences);
+
+      batch->aperture += target_bo->size;
+   }
+
+   if (!target_bo->pinned) {
+      int n;
+
+      if (unlikely(batch->emit.nreloc == batch->reloc_size))
+         __brw_batch_grow_reloc(batch);
+
+      n = batch->emit.nreloc++;
+      batch->reloc[n].offset = batch_offset;
+      batch->reloc[n].delta = target_offset;
+      batch->reloc[n].target_handle = target_bo->target_handle;
+      batch->reloc[n].presumed_offset = target_bo->offset;
+      batch->reloc[n].read_domains = read_domains;
+      batch->reloc[n].write_domain = write_domain;
+
+      /* If we haven't added the batch to the execobject array yet, we
+       * will have to process all the relocations pointing to the
+       * batch when finalizing the request for submission.
+       */
+      if (target_bo->target_handle == -1) {
+         int m = batch->emit.nself++;
+         if (m < 256)
+            batch->self_reloc[m] = n;
+      }
+   }
+
+   if (write_domain && !target_bo->dirty) {
+      assert(target_bo != batch->bo);
+      target_bo->write.rq = batch->next_request;
+      list_move(&target_bo->write.link, &batch->next_request->fences);
+      assert(target_bo->write.rq == target_bo->read.rq);
+      target_bo->dirty = true;
+      target_bo->domain = DOMAIN_GPU;
+      if (has_lut(batch)) {
+         target_bo->exec->flags |= EXEC_OBJECT_WRITE;
+         if (write_domain == I915_GEM_DOMAIN_INSTRUCTION &&
+             batch->needs_pipecontrol_ggtt_wa)
+            target_bo->exec->flags |= EXEC_OBJECT_NEEDS_GTT;
+      }
+   }
+
+   return target_bo->offset + target_offset;
+}
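+
+/*
+ * A minimal usage sketch (illustrative, not a caller added by this patch):
+ * the emitter writes the returned presumed address into the batch at the
+ * very offset it passed in, so the dword in the batch always matches
+ * reloc[].presumed_offset and the kernel can skip the relocation entirely
+ * if the buffer has not moved:
+ *
+ *    uint32_t *out = batch->_ptr;
+ *    uint64_t presumed =
+ *       __brw_batch_reloc(batch, 4 * (out - batch->map), bo, 0,
+ *                         I915_GEM_DOMAIN_RENDER, 0);
+ *    // the low 32 bits; a 64-bit relocation would also emit the high dword
+ *    *out++ = presumed;
+ *    batch->_ptr = out;
+ */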
+
+/*
+ * Close the batch by writing all the tail commands (to store register
+ * values between batches, disable profiling, etc.) and then terminate it
+ * with MI_BATCH_BUFFER_END.
+ */
+static uint32_t __brw_batch_finish(struct brw_batch *batch,
+                                   struct perf_debug *info)
+{
+   batch->reserved = 0;
+
+   /* Catch any final allocation errors, rolling back is marginally safer */
+   batch->saved = batch->emit;
+   if (setjmp(batch->jmpbuf) == 0) {
+      batch->inside_begin_count++;
+      brw_batch_finish_hook(batch);
+      batch->inside_begin_count--;
+      batch->emit.nbatch = batch->_ptr - batch->map;
+   } else
+      batch->emit = batch->saved;
+
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
+      int bytes_for_commands = 4*batch->emit.nbatch;
+      int bytes_for_state = batch->bo->size - 4*batch->state;
+      int total_bytes = bytes_for_commands + bytes_for_state;
+      fprintf(stderr, "%d: Batchbuffer flush at %s:%d (%s) on ring %d with %4db (pkt) + "
+              "%4db (state) = %4db (%0.1f%%), with %d buffers and %d relocations [%d self], RSS %d KiB (cap %dKiB)\n",
+              batch->hw_ctx,
+              info ? info->file : "???",
+              info ? info->line : -1,
+              info ? info->string : "???",
+              batch->ring, bytes_for_commands, bytes_for_state,
+              total_bytes, 100.0f * total_bytes / BATCH_SIZE,
+              batch->emit.nexec, batch->emit.nreloc, batch->emit.nself,
+              (int)(batch->aperture>>10), (int)(batch->max_aperture>>10));
+   }
+
+   batch->map[batch->emit.nbatch] = 0xa << 23;
+   return 4*((batch->emit.nbatch + 2) & ~1);
+}
+
+static void
+__brw_batch_throttle(struct brw_batch *batch, struct brw_request *rq)
+{
+   if (unlikely(batch->disable_throttling))
+      return;
+
+   /* Wait for the swapbuffers before the one we just emitted, so we
+    * don't get too many swaps outstanding for apps that are GPU-heavy
+    * but not CPU-heavy.
+    *
+    * We're using intelDRI2Flush (called from the loader before
+    * swapbuffer) and glFlush (for front buffer rendering) as the
+    * indicator that a frame is done and then throttle when we get
+    * here as we prepare to render the next frame.  At this point the
+    * round trips for swap/copy and getting new buffers are done and
+    * we'll spend less time waiting on the GPU.
+    *
+    * Unfortunately, we don't have a handle to the batch containing
+    * the swap, and getting our hands on that doesn't seem worth it,
+    * so we just use the first batch we emitted after the last swap.
+    */
+   if (batch->need_swap_throttle) {
+      if (batch->throttle[0])
+         __brw_bo_wait(RQ_BO(batch->throttle[0]), -1, NULL);
+      batch->throttle[0] = batch->throttle[1];
+      batch->throttle[1] = rq;
+      batch->need_flush_throttle = false;
+      batch->need_swap_throttle = false;
+   }
+
+   if (batch->need_flush_throttle) {
+      drmCommandNone(batch->fd, DRM_I915_GEM_THROTTLE);
+      batch->need_flush_throttle = false;
+   }
+
+   if (unlikely(INTEL_DEBUG & DEBUG_SYNC)) {
+      struct drm_i915_gem_wait wait;
+
+      memset(&wait, 0, sizeof(wait));
+      wait.bo_handle = RQ_BO(rq)->handle;
+      wait.timeout_ns = -1;
+
+      drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
+   }
+}
+
+/*
+ * If we added relocations pointing to the batch before we knew
+ * its final index (the kernel assumes that the batch is last unless
+ * told otherwise), then we have to go through all the relocations
+ * and point them back to the batch.
+ */
+static void __brw_batch_fixup_self_relocations(struct brw_batch *batch)
+{
+   uint32_t target = batch->bo->target_handle;
+   int n, count;
+
+   count = MIN2(batch->emit.nself, 256);
+   for (n = 0; n < count; n++)
+      batch->reloc[batch->self_reloc[n]].target_handle = target;
+   if (n == 256) {
+      for (n = batch->self_reloc[255] + 1; n < batch->emit.nreloc; n++) {
+         if (batch->reloc[n].target_handle == -1)
+            batch->reloc[n].target_handle = target;
+      }
+   }
+}
+
+static void
+__brw_batch_dump(struct brw_batch *batch)
+{
+   struct drm_intel_decode *decode;
+
+   decode = drm_intel_decode_context_alloc(batch->screen->deviceID);
+   if (unlikely(decode == NULL))
+      return;
+
+   drm_intel_decode_set_batch_pointer(decode,
+                                      batch->map, batch->bo->offset,
+                                      batch->emit.nbatch + 1);
+
+   drm_intel_decode_set_output_file(decode, stderr);
+   drm_intel_decode(decode);
+
+   drm_intel_decode_context_free(decode);
+
+   brw_debug_batch(batch);
+}
+
+/*
+ * Check to see if the oldest requests have completed and retire them.
+ */
+static void __brw_batch_retire(struct brw_batch *batch)
+{
+   do {
+      struct brw_request *rq;
+
+      rq = batch->requests[batch->ring].lru;
+      if (rq->next == NULL || __brw_bo_busy(RQ_BO(rq)))
+         break;
+
+      __brw_request_retire(rq);
+   } while (1);
+}
+
+/*
+ * Finalize the batch, submit it to hardware, and start a new batch/request.
+ */
+int brw_batch_flush(struct brw_batch *batch, struct perf_debug *perf)
+{
+   struct drm_i915_gem_execbuffer2 execbuf;
+   struct drm_i915_gem_exec_object2 *exec;
+   struct brw_request *rq = batch->next_request;
+
+   assert(!batch->inside_begin_count);
+   assert(batch->_ptr == batch->map + batch->emit.nbatch);
+   if (unlikely(batch->emit.nbatch == 0))
+      return 0;
+
+   if (unlikely(rq == NULL))
+      return -ENOMEM;
+
+   if (unlikely(perf))
+      brw_batch_report_flush_hook(batch, perf);
+
+   memset(&execbuf, 0, sizeof(execbuf));
+   execbuf.batch_len = __brw_batch_finish(batch, perf);
+   assert(execbuf.batch_len % 8 == 0);
+
+   assert(rq->bo == batch->bo);
+   assert(rq->bo->write.rq == NULL);
+   assert(rq->bo->read.rq == rq);
+   assert(rq->bo->exec != NULL);
+   assert(rq->bo->dirty);
+
+   /* Only do this after __brw_batch_finish(), as the callbacks may add relocs! */
+   if (rq->bo->target_handle == -1) {
+      rq->bo->target_handle =
+         has_lut(batch) ? batch->emit.nexec : rq->bo->handle;
+      rq->bo->exec =
+         memset(&batch->exec[batch->emit.nexec++], 0, sizeof(*exec));
+
+      __brw_batch_fixup_self_relocations(batch);
+   }
+
+   exec = rq->bo->exec;
+   exec->handle = rq->bo->handle;
+   exec->offset = rq->bo->offset;
+   exec->alignment = rq->bo->alignment;
+   exec->relocation_count = batch->emit.nreloc;
+   exec->relocs_ptr = (uintptr_t)batch->reloc;
+   if (rq->bo->pinned)
+      exec->flags |= EXEC_OBJECT_PINNED;
+   assert((exec->flags & EXEC_OBJECT_WRITE) == 0);
+
+   execbuf.buffers_ptr = (uintptr_t)batch->exec;
+   execbuf.buffer_count = batch->emit.nexec;
+   if (batch->ring == RENDER_RING || batch->has_softpin)
+      execbuf.rsvd1 = batch->hw_ctx;
+   execbuf.flags = hw_ring[batch->ring] | batch->batch_flags;
+
+   if (unlikely(batch->no_hw))
+      goto skip;
+
+   if (unlikely(drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))){
+      if (errno == ENOSPC)
+         return -ENOSPC;
+
+      fprintf(stderr,
+              "Failed to submit batch buffer, rendering will be incorrect: %s [%d]\n",
+              strerror(errno), errno);
+
+      /* submit a dummy execbuf to keep the fences accurate */
+      batch->map[0] = 0xa << 23;
+      execbuf.batch_len = 0;
+
+      if (drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf)) {
+         assert(errno != ENOSPC);
+         __brw_batch_clear(batch);
+         return -errno;
+      }
+   }
+
+   if (PERF_IDLE && batch->idle_time[batch->ring] < 0) {
+      batch->idle_time[batch->ring] += get_time();
+      fprintf(stderr, "GPU command queue %d idle for %.3fms\n",
+              batch->ring, batch->idle_time[batch->ring] * 1000);
+   }
+
+skip:
+   list_for_each_entry_rev(struct __brw_fence, fence, &rq->fences, link) {
+      struct brw_bo *bo;
+
+      if (unlikely(fence->signal != (void *)READ_SIGNAL)) {
+         if (fence->signal == (void *)WRITE_SIGNAL)
+            break;
+         else
+            continue;
+      }
+
+      bo = container_of(fence, bo, read);
+      assert(bo->exec);
+      assert(bo->read.rq == rq);
+
+      bo->offset = bo->exec->offset;
+      bo->exec = NULL;
+      bo->dirty = false;
+      bo->target_handle = -1;
+      if (bo->domain != DOMAIN_GPU)
+         bo->domain = DOMAIN_NONE;
+   }
+   assert(!rq->bo->dirty);
+   if (batch->requests[batch->ring].mru)
+      batch->requests[batch->ring].mru->next = rq;
+   else
+      batch->requests[batch->ring].lru = rq;
+   batch->requests[batch->ring].mru = rq;
+   rq->bo->pinned = batch->has_softpin;
+   rq->bo = RQ_MARK_RING(rq, batch->ring);
+
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH))
+      __brw_batch_dump(batch);
+
+   __brw_batch_throttle(batch, rq);
+   __brw_batch_retire(batch);
+
+   brw_batch_clear_dirty(batch);
+
+   return __brw_batch_reset(batch);
+}
+
+/*
+ * Is the GPU still processing the most recent batch submitted?
+ * (Note this does not include the batch currently being constructed.)
+ */
+bool brw_batch_busy(struct brw_batch *batch)
+{
+   struct brw_request *rq = batch->requests[batch->ring].mru;
+   return rq && __brw_request_busy(rq, 0, NULL);
+}
+
+/*
+ * Wait for all GPU processing to complete.
+ */
+void brw_batch_wait(struct brw_batch *batch, struct perf_debug *perf)
+{
+   int n;
+
+   brw_batch_flush(batch, perf);
+
+   for (n = 0; n < __BRW_NUM_RINGS; n++) {
+      struct brw_request *rq;
+
+      rq = batch->requests[n].mru;
+      if (rq == NULL)
+         continue;
+
+      __brw_bo_wait(rq->bo, -1, perf);
+   }
+}
+
+static bool __is_uncached(int fd, uint32_t handle)
+{
+   struct drm_i915_gem_caching arg;
+
+   memset(&arg, 0, sizeof(arg));
+   arg.handle = handle;
+   drmIoctl(fd, DRM_IOCTL_I915_GEM_GET_CACHING, &arg);
+   /* There is no right answer if an error occurs here. Fortunately, the
+    * only error is ENOENT and that's impossible!
+    */
+   return arg.caching != I915_CACHING_CACHED;
+}
+
+/*
+ * Wrap a drm_intel_bo reference in a struct brw_bo. Ownership
+ * of that reference is transferred to the struct brw_bo.
+ */
+struct brw_bo *brw_bo_import(struct brw_batch *batch,
+                             drm_intel_bo *base,
+                             bool borrow)
+{
+   struct brw_bo *bo;
+   uint32_t tiling, swizzling;
+
+   if (unlikely(base == NULL))
+      return NULL;
+
+   assert(base->handle);
+   assert(base->size);
+
+   if (borrow) {
+      bo = __brw_batch_lookup_handle(batch, base->handle);
+      if (bo) {
+         bo->refcnt++;
+         return bo;
+      }
+   }
+
+   if (batch->freed_bo) {
+      bo = batch->freed_bo;
+      batch->freed_bo = (struct brw_bo *)bo->base;
+   } else {
+      bo = malloc(sizeof(*bo));
+      if (unlikely(bo == NULL))
+         return NULL;
+   }
+
+   memset(bo, 0, sizeof(*bo));
+
+   bo->handle = base->handle;
+   bo->batch = batch;
+   bo->refcnt = 1;
+   bo->offset = base->offset64;
+   bo->alignment = base->align;
+   bo->size = base->size;
+
+   drm_intel_bo_get_tiling(base, &tiling, &swizzling);
+   bo->tiling = tiling;
+   bo->reusable = !borrow;
+   bo->cache_coherent = batch->has_llc; /* XXX libdrm bookkeeping */
+
+   batch->vmsize += bo->size;
+
+   list_inithead(&bo->read.link);
+   list_inithead(&bo->write.link);
+
+   bo->read.signal = (void *)READ_SIGNAL;
+   bo->write.signal = (void *)WRITE_SIGNAL;
+
+   bo->base = base;
+   if (borrow) {
+      list_add(&bo->link, borrowed(batch, bo->handle));
+      drm_intel_bo_reference(base);
+      if (bo->cache_coherent)
+         bo->cache_coherent = !__is_uncached(batch->fd, bo->handle);
+   } else {
+      list_add(&bo->link, &batch->inactive);
+      /* If the buffer hasn't been used before on the GPU, presume it is a
+       * new buffer in the CPU write domain. However, a buffer may have been
+       * mapped and unused - but that should be relatively rare compared to
+       * the optimisation chance of first writing through the CPU.
+       */
+      if (bo->offset == 0)
+         __brw_bo_set_domain(bo, DOMAIN_CPU, true);
+   }
+
+   return bo;
+}
+
+/*
+ * Search the list of active buffers (a local short-lived cache) for
+ * something of the right size to reuse for the allocation request.
+ */
+static struct brw_bo *__brw_bo_create__cached(struct brw_batch *batch,
+                                              uint64_t size)
+{
+   list_for_each_entry(struct brw_bo, bo, &batch->active, link) {
+      assert(bo->batch == batch);
+      assert(bo->read.rq != NULL);
+
+      if (bo->size < size || 3*size > 4*bo->size)
+         continue;
+
+      list_move(&bo->link, &batch->inactive);
+      bo->refcnt++;
+      return bo;
+   }
+
+   return NULL;
+}
+
+struct brw_bo *brw_bo_create(struct brw_batch *batch,
+                             const char *name,
+                             uint64_t size,
+                             uint64_t alignment,
+                             unsigned flags)
+{
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   if (flags & BO_ALLOC_FOR_RENDER) {
+      bo = __brw_bo_create__cached(batch, size);
+      if (bo) {
+         /* XXX rename */
+         bo->alignment = alignment;
+         if (bo->tiling != I915_TILING_NONE) {
+            uint32_t tiling = I915_TILING_NONE;
+            drm_intel_bo_set_tiling(bo->base, &tiling, 0);
+            bo->tiling = tiling;
+         }
+         if (bo->tiling != I915_TILING_NONE) {
+            list_move(&bo->link, &batch->active);
+            bo->refcnt--;
+         } else
+            return bo;
+      }
+   }
+
+   base = drm_intel_bo_alloc(batch->bufmgr, name, size, alignment);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   bo = brw_bo_import(batch, base, false);
+   if (unlikely(bo == NULL)) {
+      drm_intel_bo_unreference(base);
+      return NULL;
+   }
+
+   return bo;
+}
+
+static uint64_t brw_surface_size(int cpp,
+                                 uint32_t width,
+                                 uint32_t height,
+                                 uint32_t tiling,
+                                 uint32_t *pitch)
+{
+   uint32_t tile_width, tile_height;
+
+   switch (tiling) {
+   default:
+   case I915_TILING_NONE:
+      tile_width = 64;
+      tile_height = 2;
+      break;
+   case I915_TILING_X:
+      tile_width = 512;
+      tile_height = 8;
+      break;
+   case I915_TILING_Y:
+      tile_width = 128;
+      tile_height = 32;
+      break;
+   }
+
+   *pitch = ALIGN(width * cpp, tile_width);
+   height = ALIGN(height, tile_height);
+   height *= *pitch;
+   return ALIGN(height, 4096);
+}
+
+struct brw_bo *
+brw_bo_create_tiled(struct brw_batch *batch,
+                    const char *name,
+                    uint32_t width,
+                    uint32_t height,
+                    int cpp,
+                    uint32_t *tiling,
+                    uint32_t *pitch,
+                    unsigned flags)
+{
+   unsigned long __pitch;
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   if (flags & BO_ALLOC_FOR_RENDER) {
+      uint64_t size = brw_surface_size(cpp, width, height, *tiling, pitch);
+
+      bo = __brw_bo_create__cached(batch, size);
+      if (bo) {
+         /* XXX rename */
+         bo->alignment = 0;
+         drm_intel_bo_set_tiling(bo->base, tiling, *pitch);
+         bo->tiling = *tiling;
+         return bo;
+      }
+   }
+
+   base = drm_intel_bo_alloc_tiled(batch->bufmgr, name,
+                                   width, height, cpp,
+                                   tiling, &__pitch, flags);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   *pitch = __pitch;
+   bo = brw_bo_import(batch, base, false);
+   if (unlikely(bo == NULL)) {
+      drm_intel_bo_unreference(base);
+      return NULL;
+   }
+
+   return bo;
+}
+
+/*
+ * Import a foreign buffer from another process using the global
+ * (flinked) name.
+ */
+struct brw_bo *brw_bo_create_from_name(struct brw_batch *batch,
+                                       const char *name,
+                                       uint32_t global_name)
+{
+   drm_intel_bo *base;
+   struct brw_bo *bo;
+
+   base = drm_intel_bo_gem_create_from_name(batch->bufmgr, name, global_name);
+   if (unlikely(base == NULL))
+      return NULL;
+
+   bo = brw_bo_import(batch, base, true);
+   drm_intel_bo_unreference(base);
+
+   return bo;
+}
+
+/*
+ * Write a portion of the *linear* buffer using the pointer provided.
+ *
+ * This is conceptually equivalent to calling
+ *   memcpy(brw_bo_map(MAP_WRITE | MAP_DETILED | flags) + offset, data, size)
+ * but can be much more efficient as it will try to avoid cache domain
+ * side-effects (if any).
+ */
+void brw_bo_write(struct brw_bo *bo,
+                  uint64_t offset,
+                  const void *data,
+                  uint64_t length,
+                  unsigned flags,
+                  struct perf_debug *perf)
+{
+   struct drm_i915_gem_pwrite pwrite;
+   void *map;
+
+   assert(offset < bo->size);
+   assert(length <= bo->size - offset);
+
+   map = brw_bo_map(bo, MAP_WRITE | MAP_DETILED | flags, perf);
+   if (map) {
+      memcpy(map + offset, data, length);
+      return;
+   }
+
+   memset(&pwrite, 0, sizeof(pwrite));
+   pwrite.handle = bo->handle;
+   pwrite.offset = offset;
+   pwrite.size = length;
+   pwrite.data_ptr = (uintptr_t)data;
+   if (unlikely(drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite)))
+      return;
+
+   if (bo->read.rq)
+      __brw_request_retire(bo->read.rq);
+
+   assert(bo->refcnt);
+   bo->domain = DOMAIN_GTT;
+}
+
+/*
+ * Read a portion of the *linear* buffer into the pointer provided.
+ *
+ * This is conceptually equivalent to calling
+ *   memcpy(data, brw_bo_map(MAP_READ | MAP_DETILED | flags) + offset, size)
+ * but can be much more efficient as it will try to avoid cache domain
+ * side-effects (if any).
+ */
+void brw_bo_read(struct brw_bo *bo,
+                 uint64_t offset,
+                 void *data,
+                 uint64_t length,
+                 unsigned flags,
+                 struct perf_debug *perf)
+{
+   struct drm_i915_gem_pread pread;
+   void *map;
+
+   assert(offset < bo->size);
+   assert(length <= bo->size - offset);
+
+   if (bo->cache_coherent) {
+      map = brw_bo_map(bo, MAP_READ | MAP_DETILED | flags, perf);
+      if (map) {
+         memcpy(data, map + offset, length);
+         return;
+      }
+   } else {
+      if ((flags & MAP_ASYNC) == 0) {
+         struct brw_request *rq = bo->write.rq;
+         if (rq && RQ_BO(rq)->dirty)
+            brw_batch_flush(bo->batch, perf);
+      }
+   }
+
+   memset(&pread, 0, sizeof(pread));
+   pread.handle = bo->handle;
+   pread.offset = offset;
+   pread.size = length;
+   pread.data_ptr = (uintptr_t)data;
+
+   if (unlikely(perf))
+      __brw_bo_wait(RQ_BO(bo->write.rq), -1, perf);
+
+   if (unlikely(drmIoctl(bo->batch->fd, DRM_IOCTL_I915_GEM_PREAD, &pread)))
+      return;
+
+   if (bo->write.rq)
+      __brw_request_retire(bo->write.rq);
+
+   assert(bo->refcnt);
+   if (bo->domain != DOMAIN_CPU)
+      bo->domain = DOMAIN_NONE;
+}
+
+/*
+ * Provide a WC mmapping of the buffer. Coherent everywhere, but
+ * reads are very slow (as they are uncached). Fenced, so automatically
+ * detiled by hardware and constrained to fit in the aperture.
+ */
+static void *brw_bo_map__gtt(struct brw_bo *bo, unsigned flags)
+{
+   if (flags & MAP_DETILED && bo->tiling)
+      return NULL;
+
+   if (bo->map__gtt == NULL)
+      bo->map__gtt = drm_intel_gem_bo_map__gtt(bo->base);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, DOMAIN_GTT, flags & MAP_WRITE);
+
+   return bo->map__gtt;
+}
+
+/*
+ * Provide a WC mmapping of the buffer. Coherent everywhere, but
+ * reads are very slow (as they are uncached). Unfenced, not
+ * constrained by the mappable aperture.
+ */
+static void *brw_bo_map__wc(struct brw_bo *bo, unsigned flags)
+{
+   if (bo->map__wc == NULL)
+      bo->map__wc = drm_intel_gem_bo_map__wc(bo->base);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, DOMAIN_GTT, flags & MAP_WRITE);
+
+   return bo->map__wc;
+}
+
+/*
+ * Provide a WB mmapping of the buffer. Incoherent on non-LLC platforms
+ * and will trigger clflushes of the entire buffer. Unfenced, not
+ * constrained by the mappable aperture.
+ */
+static void *brw_bo_map__cpu(struct brw_bo *bo, unsigned flags)
+{
+   if (bo->map__cpu == NULL)
+      bo->map__cpu = drm_intel_gem_bo_map__cpu(bo->base);
+   assert(bo->map__cpu);
+
+   if ((flags & MAP_ASYNC) == 0)
+      __brw_bo_set_domain(bo, DOMAIN_CPU, flags & MAP_WRITE);
+
+   return bo->map__cpu;
+}
+
+static bool can_map__cpu(struct brw_bo *bo, unsigned flags)
+{
+   if (bo->cache_coherent)
+      return true;
+
+   if (flags & MAP_PERSISTENT)
+      return false;
+
+   if (bo->domain == DOMAIN_CPU)
+      return true;
+
+   if (flags & MAP_COHERENT)
+      return false;
+
+   return (flags & MAP_WRITE) == 0;
+}
+
+/*
+ * Map the buffer for access by the CPU, either for writing or reading,
+ * and return a pointer for that access.
+ *
+ * If the async flag is not set, any previous writing by the GPU is
+ * waited upon, and if write access is required all GPU reads as well.
+ *
+ * If the async flag is set, the kernel is not informed of the access
+ * and the access may be concurrent with GPU access. Also importantly,
+ * cache domain tracking for the buffer is *not* maintained and so access
+ * modes are limited to coherent modes (taking into account the current
+ * cache domain).
+ *
+ * If the detiled flag is set, the caller will perform manual detiling
+ * through the mapping, and so we do not allocate a fence for the operation.
+ * This can return NULL on failure, for example if the kernel doesn't support
+ * such an operation.
+ *
+ * The method for mapping the buffer is chosen based on the hardware
+ * architecture (LLC has fast coherent reads and writes, non-LLC has fast
+ * coherent writes, slow coherent reads but faster incoherent reads)
+ * and mode of operation. In theory, for every desired access mode,
+ * the pointer is the fastest direct CPU access to the immediate buffer.
+ * However, a direct CPU mapping may still not be the fastest way for the
+ * CPU to access the data - for instance, copying through a staging buffer
+ * can win for tiled or uncached buffers.
+ *
+ * Returns NULL on error.
+ */
+void *brw_bo_map(struct brw_bo *bo, unsigned flags, struct perf_debug *perf)
+{
+   assert(bo->refcnt);
+
+   if ((flags & MAP_ASYNC) == 0) {
+      struct brw_request *rq;
+
+      rq = flags & MAP_WRITE ? bo->read.rq : bo->write.rq;
+      if (rq && RQ_BO(rq)->dirty)
+         brw_batch_flush(bo->batch, perf);
+
+      if (unlikely(rq && perf))
+         __brw_bo_wait(RQ_BO(rq), -1, perf);
+   }
+
+   if (bo->tiling && (flags & MAP_DETILED) == 0)
+      return brw_bo_map__gtt(bo, flags);
+   else if (can_map__cpu(bo, flags))
+      return brw_bo_map__cpu(bo, flags);
+   else if (bo->batch->has_mmap_wc)
+      return brw_bo_map__wc(bo, flags);
+   else
+      return brw_bo_map__gtt(bo, flags);
+}
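+
+/* A minimal usage sketch (illustrative only - 'query', 'perf' and
+ * read_values() stand in for whatever the caller already has):
+ *
+ *    uint64_t *results = brw_bo_map(query->bo, MAP_READ, perf);
+ *    if (results)
+ *       read_values(results);
+ *
+ * Adding MAP_ASYNC skips the wait on outstanding GPU work, so it is only
+ * safe when the caller knows the access is already coherent (for example,
+ * reading back results written by an already retired batch).
+ */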
+
+/*
+ * After the final reference to a bo is released, free the buffer.
+ *
+ * If the buffer is still active, and it is reusable, the buffer is
+ * transferred to the local active cache and may be reallocated on the
+ * next call to brw_bo_create() or brw_bo_create_tiled(). Otherwise the
+ * buffer is returned back to the shared screen bufmgr pool.
+ */
+void  __brw_bo_free(struct brw_bo *bo)
+{
+   struct brw_batch *batch;
+
+   assert(bo->refcnt == 0);
+
+   if (bo->read.rq) {
+      assert(bo->batch);
+      if (bo->reusable)
+         list_move(&bo->link, &bo->batch->active);
+      return;
+   }
+
+   assert(!bo->write.rq);
+   list_del(&bo->link);
+
+   if (bo->offset)
+      bo->base->offset64 = bo->offset;
+   drm_intel_bo_unreference(bo->base);
+
+   batch = bo->batch;
+   if (batch == NULL) {
+      free(bo);
+      return;
+   }
+
+   batch->vmsize -= bo->size;
+   if (batch->vmsize < batch->peak_rss)
+      batch->peak_rss = batch->vmsize;
+
+   bo->base = (drm_intel_bo *)batch->freed_bo;
+   batch->freed_bo = bo;
+}
+
+/*
+ * Mark the beginning of a batch construction critical section, during which
+ * the batch is not allowed to be flushed. Access to the batch prior to this
+ * call is invalid. Access after this call but with instructions for another
+ * ring is also invalid. All batch emission (BEGIN_BATCH()/OUT_BATCH()) must
+ * be inside a brw_batch_begin()/brw_batch_end() pairing - the exceptions to
+ * this rule are the brw_batch_start_hook() and brw_batch_finish_hook()
+ * callbacks.
+ *
+ * Control returns to the caller of brw_batch_begin() if an error is
+ * encountered whilst inside the critical section. If the return code
+ * is negative, a fatal error occurred. If the return code is positive,
+ * the batch had to be flushed and the critical section needs to be restarted.
+ *
+ * On success 0 is returned.
+ *
+ * Must be paired with brw_batch_end().
+ */
+int brw_batch_begin(struct brw_batch *batch,
+                    uint32_t bytes,
+                    enum brw_gpu_ring ring)
+{
+   uint16_t space;
+
+   if (unlikely(batch->next_request == NULL))
+      return -ENOMEM;
+
+   assert(!batch->repeat);
+   if (batch->inside_begin_count++)
+      return 0;
+
+   ring = batch->actual_ring[ring];
+   if (ring != batch->ring)
+      space = 0;
+   else
+      space = batch->state - batch->reserved - batch->emit.nbatch;
+   if (unlikely(bytes/4 > space)) {
+      int ret;
+
+      batch->inside_begin_count = 0;
+
+      ret = brw_batch_flush(batch, NULL);
+      if (unlikely(ret))
+         return ret;
+
+      assert(batch->inside_begin_count == 0);
+      batch->inside_begin_count = 1;
+   }
+
+   batch->ring = ring;
+
+   if (!batch->bo->dirty) {
+      /* An early allocation error should be impossible */
+      brw_batch_start_hook(batch);
+      batch->bo->dirty = true;
+   }
+
+   assert(batch->ring == ring);
+   batch->saved = batch->emit;
+   return setjmp(batch->jmpbuf);
+}
+
+/*
+ * Mark the end of a batch construction critical section. After this call
+ * the batch is inaccessible until the next brw_batch_begin().
+ *
+ * We may flush the batch to hardware if it exceeds the aperture
+ * high water mark. If that submission fails, we roll back to the
+ * end of the previous critical section and try flushing again. If that
+ * also fails, we report the error back to the caller. If the rollback
+ * succeeds, we jump back to brw_batch_begin() with a fresh request
+ * and run through the critical section again.
+ *
+ * Returns 0 on success, or a negative error code otherwise.
+ *
+ * Must be paired with brw_batch_begin().
+ */
+int brw_batch_end(struct brw_batch *batch)
+{
+   int ret;
+
+   assert(batch->inside_begin_count);
+   if (--batch->inside_begin_count)
+      return 0;
+
+   batch->emit.nbatch = batch->_ptr - batch->map;
+
+   ret = 0;
+   if (batch->aperture > batch->max_aperture)
+      ret = brw_batch_flush(batch, NULL);
+   if (likely(ret == 0 || batch->repeat)) {
+      batch->repeat = false;
+      return ret;
+   }
+
+   batch->emit = batch->saved;
+   batch->_ptr = batch->map + batch->emit.nbatch;
+
+   ret = brw_batch_flush(batch, NULL);
+   if (ret != -ENOSPC)
+      return ret;
+
+   assert(!batch->repeat);
+   batch->repeat = true;
+
+   batch->inside_begin_count++;
+   brw_batch_start_hook(batch);
+   batch->bo->dirty = true;
+
+   longjmp(batch->jmpbuf, 1);
+}
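+
+/* The canonical calling pattern (as used by brw_mi_flush() in
+ * brw_pipe_control.c) is:
+ *
+ *    if (brw_batch_begin(&brw->batch, 60, ring) >= 0) {
+ *       ...emit commands...
+ *       brw_batch_end(&brw->batch);
+ *    }
+ *
+ * A negative return from brw_batch_begin() skips the critical section
+ * entirely; a positive return (the longjmp from brw_batch_end()) simply
+ * runs it again against a fresh request.
+ */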
+
+/*
+ * How much of the batch is used, by both the commands emitted at the start
+ * of the batch and the indirect state allocated from its end?
+ */
+inline static int __brw_batch_size(struct brw_batch *batch)
+{
+   return batch->emit.nbatch + BATCH_SIZE/4 - batch->state;
+}
+
+/*
+ * After a high-level draw command, check whether we want to flush the
+ * batch to the hardware, either for debugging or to keep the working set
+ * (and so any eviction stalls) in check.
+ */
+int brw_batch_maybe_flush(struct brw_batch *batch)
+{
+   if (unlikely(batch->always_flush))
+      goto flush;
+
+   /* If the working set exceeds the GTT's limits, we will need to evict
+    * textures in order to execute batches. As we have no method for predicting
+    * when we need to evict, we need to frequently flush the batch so that any
+    * stalls are minimised.
+    */
+   if (batch->peak_rss > batch->max_aperture && __brw_batch_size(batch) > 2048)
+      goto flush;
+
+   return 0;
+
+flush:
+   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
+      fprintf(stderr, "Forcing batchbuffer flush after %d dwords: always_flush=%d, peak rss=%dMiB [cap %dMiB], vmsize=%dMiB\n",
+              batch->emit.nbatch,
+              batch->always_flush,
+              (int)(batch->peak_rss >> 20), (int)(batch->max_aperture >> 20),
+              (int)(batch->vmsize >> 20));
+   }
+   return brw_batch_flush(batch, NULL);
+}
+
+/*
+ * Query the kernel for the number of times our hardware context has
+ * been implicated in a reset event - either as the guilty party or just
+ * a victim - and for the total number of resets that have occurred.
+ */
+int brw_batch_get_reset_stats(struct brw_batch *batch,
+                              uint32_t *reset_count,
+                              uint32_t *active,
+                              uint32_t *pending)
+{
+   struct drm_i915_reset_stats stats;
+
+   memset(&stats, 0, sizeof(stats));
+   stats.ctx_id = batch->hw_ctx;
+   if (unlikely(drmIoctl(batch->fd, DRM_IOCTL_I915_GET_RESET_STATS, &stats)))
+      return -errno;
+
+   *reset_count = stats.reset_count;
+   *active = stats.batch_active;
+   *pending = stats.batch_pending;
+   return 0;
+}
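+
+/* Callers such as brw_get_graphics_reset_status() in brw_reset.c treat any
+ * failure here as "no information":
+ *
+ *    uint32_t reset_count, active, pending;
+ *    if (brw_batch_get_reset_stats(&brw->batch,
+ *                                  &reset_count, &active, &pending))
+ *       return GL_NO_ERROR;
+ */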
+
+/*
+ * Mark the buffers as being invalid to prevent stale dereferences when
+ * tearing down shared resources.
+ */
+static void __brw_bo_list_fini(struct list_head *list)
+{
+   while (!list_empty(list)) {
+      struct brw_bo *bo = list_first_entry(list, struct brw_bo, link);
+
+      assert(bo->batch);
+      assert(bo->read.rq == NULL);
+
+      bo->batch = NULL;
+      list_delinit(&bo->link);
+   }
+}
+
+/* Normally we never free a request, as they are recycled between batches -
+ * except when we tear down the batch manager and free everything.
+ */
+static void __brw_request_free(struct brw_request *rq)
+{
+   /* Opencode the free(bo) here to handle batch->next_request */
+   assert(RQ_BO(rq) == rq->bo);
+   list_delinit(&rq->bo->link);
+   free(rq->bo);
+   free(rq);
+}
+
+/*
+ * Teardown the batch manager and free all associated memory and resources.
+ */
+void brw_batch_fini(struct brw_batch *batch)
+{
+   int n;
+
+   /* All bos should have been released before the destructor is called */
+   batch->fini = true;
+
+   for (n = 0; n < __BRW_NUM_RINGS; n++) {
+      struct brw_request *rq;
+
+      rq = batch->requests[n].mru;
+      if (rq == NULL)
+         continue;
+
+      /* Note that the request and buffers are not truly idle here. It is
+       * safe as the kernel will keep a reference whilst the buffers are
+       * active (so we can shut down ahead of time), but we need to disable
+       * our runtime assertions that the request is idle at the time of
+       * retiring.
+       */
+      __brw_request_retire(rq);
+
+      assert(batch->requests[n].lru == NULL);
+      assert(batch->requests[n].mru == NULL);
+   }
+
+   while (batch->freed_rq) {
+      struct brw_request *rq = batch->freed_rq;
+      batch->freed_rq = rq->next;
+      __brw_request_free(rq);
+   }
+   __brw_request_free(batch->next_request);
+
+   assert(list_empty(&batch->active));
+   for (n = 0; n < 1 << BORROWED_BITS; n++)
+      __brw_bo_list_fini(&batch->borrowed[n]);
+   __brw_bo_list_fini(&batch->inactive);
+
+   while (batch->freed_bo) {
+      struct brw_bo *bo = batch->freed_bo;
+      batch->freed_bo = (struct brw_bo *)bo->base;
+      free(bo);
+   }
+
+   free(batch->exec);
+   free(batch->reloc);
+
+   if (batch->hw_ctx) {
+      struct drm_i915_gem_context_destroy destroy;
+
+      memset(&destroy, 0, sizeof(destroy));
+      destroy.ctx_id = batch->hw_ctx;
+      drmIoctl(batch->fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &destroy);
+   }
+}
diff --git a/src/mesa/drivers/dri/i965/brw_batch.h b/src/mesa/drivers/dri/i965/brw_batch.h
index bbfb736..fee4d80 100644
--- a/src/mesa/drivers/dri/i965/brw_batch.h
+++ b/src/mesa/drivers/dri/i965/brw_batch.h
@@ -31,55 +31,108 @@
 extern "C" {
 #endif
 
+#include <stdbool.h>
+#include <stdint.h>
+#include <string.h>
 #include <setjmp.h>
+#include <assert.h>
+
+#include <intel_aub.h>
 
 #include <intel_bufmgr.h>
 
 #include "util/list.h"
+#include "util/macros.h"
+
+struct _drm_intel_bufmgr;
+struct _drm_intel_bo;
 
-typedef drm_intel_bo brw_bo;
+struct intel_screen;
+struct perf_debug;
 
 enum brw_gpu_ring {
-   UNKNOWN_RING,
-   RENDER_RING,
+   RENDER_RING = 0,
    BLT_RING,
+   __BRW_NUM_RINGS,
+};
+
+struct brw_batch;
+struct brw_request;
+
+enum brw_bo_domain { DOMAIN_NONE, DOMAIN_CPU, DOMAIN_GTT, DOMAIN_GPU };
+
+/* A fence is created at the current point on the ordered batch timeline.
+ * When the GPU passes that point the fence is signalled; alternatively you
+ * can wait for the fence to complete.
+ */
+struct __brw_fence {
+   struct brw_request *rq;
+   struct list_head link;
+   void (*signal)(struct __brw_fence *);
 };
 
+typedef struct brw_bo {
+   struct brw_batch *batch;
+   struct drm_i915_gem_exec_object2 *exec;
+   struct __brw_fence read, write;
+
+   unsigned dirty : 1;
+   unsigned domain : 2;
+   unsigned tiling : 4;
+   unsigned pinned : 1;
+   unsigned cache_coherent : 1;
+   unsigned reusable : 1;
+
+   unsigned refcnt;
+   uint32_t handle;
+   uint32_t target_handle;
+   uint64_t size;
+   uint64_t alignment;
+   uint64_t offset;
+
+   struct _drm_intel_bo *base;
+   struct list_head link;
+
+   void *map__cpu;
+   void *map__gtt;
+   void *map__wc;
+} brw_bo;
+
 typedef struct brw_batch {
-   /** Current batchbuffer being queued up. */
-   brw_bo *bo;
-   /** Last BO submitted to the hardware.  Used for glFinish(). */
-   brw_bo *last_bo;
+   int fd;
 
-#ifdef DEBUG
-   uint16_t emit, total;
-#endif
-   uint16_t reserved_space;
-   uint32_t *map_next;
+   struct brw_bo *bo;
    uint32_t *map;
-   uint32_t *cpu_map;
-#define BATCH_SZ (8192*sizeof(uint32_t))
+   uint32_t *_ptr;
+
+   uint32_t batch_flags;
+   uint32_t batch_base_flags;
 
-   uint32_t state_batch_offset;
    enum brw_gpu_ring ring;
-   bool needs_sol_reset;
-   int gen;
+   uint32_t hw_ctx;
 
-   jmp_buf jmpbuf;
-   bool repeat;
-   unsigned begin_count;
-   bool no_batch_wrap;
+   uint16_t reserved;
+   uint16_t state;
 
-   struct {
-      uint32_t *map_next;
-      int reloc_count;
-   } saved;
+   struct brw_batch_state {
+      uint16_t nbatch;
+      uint16_t nexec;
+      uint16_t nreloc;
+      uint16_t nself;
+   } emit, saved;
 
-   dri_bufmgr *bufmgr;
+   uint64_t aperture;
+   uint64_t max_aperture;
+   uint64_t rss, peak_rss, vmsize;
 
-   /** Framerate throttling: @{ */
-   brw_bo *throttle_batch[2];
+   bool has_softpin : 1;
+   bool has_llc : 1;
+   bool has_mmap_wc : 1;
+   bool needs_pipecontrol_ggtt_wa : 1;
+
+   bool always_flush : 1;
 
+   /** Framerate throttling: @{ */
    /* Limit the number of outstanding SwapBuffers by waiting for an earlier
     * frame of rendering to complete. This gives a very precise cap to the
     * latency between input and output such that rendering never gets more
@@ -88,16 +141,45 @@ typedef struct brw_batch {
     * submitted afterwards, which may be immediately prior to the next
     * SwapBuffers.)
     */
-   bool need_swap_throttle;
+   bool need_swap_throttle : 1;
 
    /** General throttling, not caught by throttling between SwapBuffers */
-   bool need_flush_throttle;
+   bool need_flush_throttle : 1;
+   bool disable_throttling : 1;
    /** @} */
 
-   bool always_flush : 1;
-   bool disable_throttling : 1;
+   bool no_hw : 1;
+   bool repeat : 1;
+   bool fini : 1;
+
+   unsigned inside_begin_count;
+   jmp_buf jmpbuf;
+
+   uint16_t exec_size;
+   uint16_t reloc_size;
+
+   struct drm_i915_gem_exec_object2 *exec;
+   struct drm_i915_gem_relocation_entry *reloc;
+   uint16_t self_reloc[256];
 
-   drm_intel_context *hw_ctx;
+   int actual_ring[__BRW_NUM_RINGS];
+   struct brw_request *next_request;
+   struct {
+      struct brw_request *lru, *mru;
+   } requests[__BRW_NUM_RINGS];
+   struct brw_request *throttle[2];
+   struct brw_request *freed_rq;
+
+   double idle_time[__BRW_NUM_RINGS];
+
+   struct intel_screen *screen;
+   struct _drm_intel_bufmgr *bufmgr;
+   struct list_head active, inactive;
+
+#define BORROWED_BITS 3
+   struct list_head borrowed[1<<BORROWED_BITS];
+
+   struct brw_bo *freed_bo;
 
    /**
     * Set of brw_bo* that have been rendered to within this batchbuffer
@@ -107,169 +189,235 @@ typedef struct brw_batch {
    struct set *render_cache;
 } brw_batch;
 
-/**
- * Number of bytes to reserve for commands necessary to complete a batch.
- *
- * This includes:
- * - MI_BATCHBUFFER_END (4 bytes)
- * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 bytes)
- * - Any state emitted by vtbl->finish_batch():
- *   - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes)
- *   - Disabling OA counters on Gen6+ (3 DWords = 12 bytes)
- *   - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs:
- *     - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB,
- *       which are 5 DWords each ==> 2 * 3 * 5 * 4 = 120 bytes
- *     - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+.  ==> 12 bytes.
- *       On Ironlake, it's 6 DWords, but we have some slack due to the lack of
- *       Sandybridge PIPE_CONTROL madness.
- *   - CC_STATE workaround on HSW (12 * 4 = 48 bytes)
- *     - 5 dwords for initial mi_flush
- *     - 2 dwords for CC state setup
- *     - 5 dwords for the required pipe control at the end
- */
-#define BATCH_RESERVED 152
+int brw_batch_init(struct brw_batch *batch,
+                   struct intel_screen *screen);
 
-inline static brw_bo *brw_bo_create(brw_batch *batch,
-                                    const char *name,
-                                    uint64_t size,
-                                    uint64_t alignment,
-                                    unsigned flags)
+/** Add a relocation entry to the current batch
+ * XXX worth specialising 32bit variant?
+ */
+uint64_t __brw_batch_reloc(struct brw_batch *batch,
+                           uint32_t batch_offset,
+                           struct brw_bo *target_bo,
+                           uint64_t target_offset,
+                           unsigned read_domains,
+                           unsigned write_domain);
+MUST_CHECK static inline uint64_t brw_batch_reloc(struct brw_batch *batch,
+                                                  uint32_t batch_offset,
+                                                  struct brw_bo *target_bo,
+                                                  uint64_t target_offset,
+                                                  unsigned read_domains,
+                                                  unsigned write_domain)
 {
-   return drm_intel_bo_alloc(batch->bufmgr, name, size, alignment);
-}
+   if (target_bo == NULL)
+      return target_offset;
 
-inline static brw_bo *brw_bo_create_tiled(brw_batch *batch,
-                                          const char *name,
-                                          uint32_t width,
-                                          uint32_t height,
-                                          uint32_t cpp,
-                                          uint32_t *tiling,
-                                          uint32_t *pitch,
-                                          unsigned flags)
-{
-   unsigned long __pitch;
-   brw_bo *bo = drm_intel_bo_alloc_tiled(batch->bufmgr, name,
-                                         width, height, cpp,
-                                         tiling, &__pitch,
-                                         flags);
-   *pitch = __pitch;
-   return bo;
+   return __brw_batch_reloc(batch, batch_offset,
+                            target_bo, target_offset,
+                            read_domains, write_domain);
 }
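+
+/* The returned value is the presumed address of the target and must be
+ * written straight into the batch at batch_offset - see the OUT_RELOC()
+ * and OUT_RELOC64() macros below, e.g.:
+ *
+ *    *__map = brw_batch_reloc(&brw->batch,
+ *                             4*(__map - brw->batch.map),
+ *                             bo, delta, read, write);
+ *    __map++;
+ */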
 
-inline static brw_bo *brw_bo_create_from_name(brw_batch *batch,
-                                              const char *name,
-                                              uint32_t global_name)
+int brw_batch_get_reset_stats(struct brw_batch *batch,
+                              uint32_t *reset_count,
+                              uint32_t *active,
+                              uint32_t *pending);
+
+bool brw_batch_busy(struct brw_batch *batch);
+/** Wait for the last submitted rendering to complete */
+void brw_batch_wait(struct brw_batch *batch,
+                    struct perf_debug *stall);
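+/* For example, intel_finish() (glFinish()) simply does:
+ *
+ *    brw_batch_wait(&brw->batch, PERF_DEBUG(brw, "Finish"));
+ */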
+
+void brw_batch_fini(struct brw_batch *batch);
+
+/* Wrap a drm_intel_bo within a local struct brw_bo */
+struct brw_bo *
+brw_bo_import(struct brw_batch *batch,
+              struct _drm_intel_bo *base,
+              bool borrow);
+
+/* Create a local brw_bo for a linear/unfenced buffer and allocate the buffer */
+struct brw_bo *
+brw_bo_create(struct brw_batch *batch,
+              const char *name,
+              uint64_t size,
+              uint64_t alignment,
+              unsigned flags);
+
+/* Create a local brw_bo for a tiled buffer and allocate the buffer */
+struct brw_bo *
+brw_bo_create_tiled(struct brw_batch *batch,
+                    const char *name,
+                    uint32_t width,
+                    uint32_t height,
+                    int cpp,
+                    uint32_t *tiling,
+                    uint32_t *pitch,
+                    unsigned flags);
+
+/* Create a local brw_bo for a foreign buffer using its global flinked name */
+struct brw_bo *brw_bo_create_from_name(struct brw_batch *batch,
+                                       const char *name,
+                                       uint32_t global_name);
+
+void brw_bo_mark_dirty(struct brw_batch *batch, brw_bo *bo);
+void brw_batch_clear_dirty(struct brw_batch *batch);
+
+inline static int brw_bo_madvise(struct brw_bo *bo, int state)
 {
-   return drm_intel_bo_gem_create_from_name(batch->bufmgr, name, global_name);
+   return drm_intel_bo_madvise(bo->base, state);
 }
 
-inline static brw_bo *brw_bo_get(brw_bo *bo)
+inline static uint32_t brw_bo_flink(struct brw_bo *bo)
 {
-   drm_intel_bo_reference(bo);
-   return bo;
+   uint32_t name = 0;
+   drm_intel_bo_flink(bo->base, &name);
+   return name;
 }
 
-inline static void brw_bo_put(brw_bo *bo)
+int brw_bo_wait(struct brw_bo *bo, int64_t timeout);
+
+void brw_bo_write(struct brw_bo *bo, uint64_t offset,
+                  const void *data, uint64_t length,
+                  unsigned flags,
+                  struct perf_debug *perf);
+void brw_bo_read(struct brw_bo *bo, uint64_t offset,
+                 void *data, uint64_t length,
+                 unsigned flags,
+                 struct perf_debug *perf);
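+
+/* For example, brw_cache_new_bo() copies the old program cache into a new
+ * bo by reading the old contents straight into a writable mapping of the
+ * new one:
+ *
+ *    brw_bo_read(cache->bo, 0,
+ *                brw_bo_map(new_bo, MAP_WRITE, NULL), cache->next_offset,
+ *                MAP_ASYNC, NULL);
+ */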
+
+bool __brw_request_busy(struct brw_request *rq,
+                        unsigned flags,
+                        struct perf_debug *perf);
+static inline bool brw_bo_busy(struct brw_bo *bo,
+                               unsigned flags,
+                               struct perf_debug *perf)
+#define BUSY_READ 0
+#define BUSY_WRITE 1
+#define BUSY_FLUSH 2
+#define BUSY_RETIRE 4
 {
-   if (bo)
-      drm_intel_bo_unreference(bo);
+   struct brw_request *rq;
+
+   if (!bo)
+      return false;
+
+   assert(bo->refcnt);
+   rq = flags & BUSY_WRITE ? bo->read.rq : bo->write.rq;
+   if (!rq) {
+      assert(!bo->exec);
+      return false;
+   }
+
+   if (flags & (BUSY_FLUSH | BUSY_RETIRE))
+      return __brw_request_busy(rq, flags, perf);
+
+   return true;
 }
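+
+/* Typical polling use (see brw_check_query()): ask whether the bo is still
+ * busy for reading, and let BUSY_FLUSH submit the batch first if it is the
+ * batch itself that keeps the bo busy:
+ *
+ *    if (!brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH,
+ *                     PERF_DEBUG(brw, "CheckQuery"))) {
+ *       brw_queryobj_get_results(ctx, query);
+ *       query->Base.Ready = true;
+ *    }
+ */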
 
-inline static int brw_bo_madvise(brw_bo *bo, int state)
+void *brw_bo_map(struct brw_bo *bo, unsigned flags, struct perf_debug *perf);
+/* Must match MapBufferRange interface (for convenience) */
+#define MAP_READ        0x1
+#define MAP_WRITE       0x2
+#define MAP_ASYNC       0x20
+#define MAP_PERSISTENT  0x40
+#define MAP_COHERENT    0x80
+#define MAP_INTERNAL_MASK (~0xff)
+/* internal */
+#define MAP_DETILED     0x100
+
+/* Take a new reference to the brw_bo */
+static inline struct brw_bo *brw_bo_get(struct brw_bo *bo)
 {
-   return drm_intel_bo_madvise(bo, state);
+   assert(bo != NULL && bo->refcnt > 0);
+   bo->refcnt++;
+   return bo;
 }
 
-inline static uint32_t brw_bo_flink(brw_bo *bo)
+/* Release a reference to the brw_bo */
+void  __brw_bo_free(struct brw_bo *bo);
+static inline void brw_bo_put(struct brw_bo *bo)
 {
-   uint32_t name = 0;
-   drm_intel_bo_flink(bo, &name);
-   return name;
+   assert(bo == NULL || bo->refcnt > 0);
+   if (bo && --bo->refcnt == 0)
+      __brw_bo_free(bo);
 }
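+
+/* Ownership follows the usual get/put pattern, e.g. importing a winsys
+ * buffer just for the duration of one call (from intel_update_image_buffer()
+ * in brw_context.c):
+ *
+ *    brw_bo *bo = brw_bo_import(&intel->batch, buffer->bo, true);
+ *    intel_update_winsys_renderbuffer_miptree(intel, rb, bo,
+ *                                             buffer->width, buffer->height,
+ *                                             buffer->pitch);
+ *    brw_bo_put(bo);
+ */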
 
-void brw_batch_clear_dirty(brw_batch *batch);
-void brw_bo_mark_dirty(brw_batch *batch, brw_bo *bo);
+/* Control batch command insertion and submission to hw */
+MUST_CHECK int brw_batch_begin(struct brw_batch *batch,
+                               uint32_t estimated_bytes,
+                               enum brw_gpu_ring ring);
+int brw_batch_end(struct brw_batch *batch);
+int brw_batch_flush(struct brw_batch *batch, struct perf_debug *perf);
+int brw_batch_maybe_flush(struct brw_batch *batch);
 
-inline static bool brw_batch_busy(brw_batch *batch)
+/* Interfaces for writing commands into the batch */
+static inline int brw_batch_count(struct brw_batch *batch)
 {
-   return batch->last_bo && drm_intel_bo_busy(batch->last_bo);
+   return batch->_ptr - batch->map;
 }
 
-MUST_CHECK inline static uint64_t
-brw_batch_reloc(brw_batch *batch,
-                uint32_t batch_offset,
-                brw_bo *target_bo,
-                uint64_t target_offset,
-                unsigned read_domains,
-                unsigned write_domain)
+inline static void brw_batch_cacheline_evade(struct brw_batch *batch,
+                                             unsigned sz)
 {
-   int ret;
-
-   if (target_bo == NULL)
-      return 0;
-
-   ret = drm_intel_bo_emit_reloc(batch->bo, batch_offset,
-                                 target_bo, target_offset,
-                                 read_domains, write_domain);
-   assert(ret == 0);
-   (void)ret;
-
-   return target_bo->offset64 + target_offset;
+#define CACHELINE 64
+   if (((uintptr_t)batch->_ptr & (CACHELINE - 1)) > (CACHELINE - sz)) {
+      int pad = CACHELINE - ((uintptr_t)batch->_ptr & (CACHELINE - 1));
+      memset(batch->_ptr, 0, pad);
+      batch->_ptr += pad / sizeof(*batch->_ptr);
+   }
+#undef CACHELINE
 }
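+
+/* Example (see brw_upload_urb_fence() in brw_urb.c): pad with zero (MI_NOOP)
+ * dwords so that a command which must not straddle a 64-byte cacheline, such
+ * as URB_FENCE, can be emitted safely:
+ *
+ *    brw_batch_cacheline_evade(&brw->batch, sizeof(uf));
+ *    BRW_BATCH_STRUCT(brw, &uf);
+ */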
 
-struct perf_debug;
-int brw_batch_flush(struct brw_batch *batch, struct perf_debug *info);
-
-inline static void brw_batch_maybe_flush(struct brw_batch *batch)
+static inline uint32_t * __brw_batch_check(struct brw_batch *batch,
+                                           int count,
+                                           enum brw_gpu_ring ring)
 {
-   if (unlikely(batch->always_flush))
-      brw_batch_flush(batch, NULL);
-}
+   uint32_t *ptr;
 
-void intel_batchbuffer_save_state(struct brw_batch *batch);
-void intel_batchbuffer_reset_to_saved(struct brw_batch *batch);
+   assert(batch->inside_begin_count);
+   assert(brw_batch_count(batch) + count < batch->state - batch->reserved);
+   assert(batch->ring == batch->actual_ring[ring]);
 
-void brw_batch_start_hook(struct brw_batch *batch);
-
-#define USED_BATCH(batch) ((uintptr_t)((batch)->map_next - (batch)->map))
+   ptr = batch->_ptr;
+   batch->_ptr += count;
+   return ptr;
+}
 
-static inline unsigned
-intel_batchbuffer_space(struct brw_batch *batch)
+static inline void brw_batch_data(struct brw_batch *batch,
+                                  const void *data,
+                                  int bytes)
 {
-   return (batch->state_batch_offset - batch->reserved_space)
-      - USED_BATCH(batch)*4;
+   assert(brw_batch_count(batch) + bytes/4 < batch->state - batch->reserved);
+   assert((bytes & 3) == 0);
+   memcpy(batch->_ptr, data, bytes);
+   batch->_ptr += bytes / sizeof(*batch->_ptr);
 }
 
-static inline void
-intel_batchbuffer_require_space(struct brw_batch *batch, GLuint sz,
-                                enum brw_gpu_ring ring)
+static inline uint32_t float_as_int(float f)
 {
-   /* If we're switching rings, implicitly flush the batch. */
-   if (unlikely(ring != batch->ring) && batch->ring != UNKNOWN_RING &&
-       batch->gen >= 6) {
-      brw_batch_flush(batch, NULL);
-   }
-
-#ifdef DEBUG
-   assert(sz < BATCH_SZ - BATCH_RESERVED);
-#endif
-   if (intel_batchbuffer_space(batch) < sz)
-      brw_batch_flush(batch, NULL);
-
-   enum brw_gpu_ring prev_ring = batch->ring;
-   /* The brw_batch_flush() calls above might have changed
-    * brw->batch.ring to UNKNOWN_RING, so we need to set it here at the end.
-    */
-   batch->ring = ring;
+   union {
+      float f;
+      uint32_t dw;
+   } fi;
 
-   if (unlikely(prev_ring == UNKNOWN_RING))
-      brw_batch_start_hook(batch);
+   fi.f = f;
+   return fi.dw;
 }
 
-int brw_batch_begin(struct brw_batch *batch,
-                    const int sz_bytes,
-                    enum brw_gpu_ring ring);
-int brw_batch_end(struct brw_batch *batch);
+#define BEGIN_BATCH(n) do { \
+   uint32_t *__map = __brw_batch_check(&brw->batch, n, RENDER_RING)
+#define BEGIN_BATCH_BLT(n) do { \
+   uint32_t *__map = __brw_batch_check(&brw->batch, n, BLT_RING)
+#define OUT_BATCH(dw) *__map++ = (dw)
+#define OUT_BATCH_F(f) *__map++ = float_as_int(f)
+#define OUT_RELOC(bo, read, write, delta) \
+   *__map = brw_batch_reloc(&brw->batch, \
+                            4*(__map - brw->batch.map), \
+                            bo, delta, read, write), __map++
+#define OUT_RELOC64(bo, read, write, delta) \
+   *(uint64_t *)__map = brw_batch_reloc(&brw->batch, \
+                                        4*(__map - brw->batch.map), \
+                                        bo, delta, read, write), __map += 2
+#define ADVANCE_BATCH() assert(__map == brw->batch._ptr); } while(0)
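+
+/* These macros keep the existing emission style, e.g. (abridged) from
+ * brw_emit_mi_flush() for the BLT ring:
+ *
+ *    BEGIN_BATCH_BLT(4);
+ *    OUT_BATCH(MI_FLUSH_DW);
+ *    OUT_BATCH(0);
+ *    OUT_BATCH(0);
+ *    OUT_BATCH(0);
+ *    ADVANCE_BATCH();
+ *
+ * but now write through the local __map pointer reserved up front by
+ * __brw_batch_check(), with ADVANCE_BATCH() asserting that exactly the
+ * declared number of dwords was written.
+ */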
 
 #ifdef __cplusplus
 }
diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
index 8857b08..ad8ddee 100644
--- a/src/mesa/drivers/dri/i965/brw_context.c
+++ b/src/mesa/drivers/dri/i965/brw_context.c
@@ -245,9 +245,7 @@ intel_finish(struct gl_context * ctx)
    struct brw_context *brw = brw_context(ctx);
 
    intel_flush_front(ctx, PERF_DEBUG(brw, "Finish"));
-
-   if (brw->batch.last_bo)
-      drm_intel_bo_wait_rendering(brw->batch.last_bo);
+   brw_batch_wait(&brw->batch, PERF_DEBUG(brw, "Finish"));
 }
 
 static void
@@ -709,7 +707,18 @@ brwCreateContext(gl_api api,
    driContextPriv->driverPrivate = brw;
    brw->driContext = driContextPriv;
    brw->intelScreen = screen;
-   brw->batch.bufmgr = screen->bufmgr;
+
+   if (brw_batch_init(&brw->batch, screen)) {
+      fprintf(stderr, "%s: failed to alloc batch\n", __func__);
+      *dri_ctx_error = __DRI_CTX_ERROR_NO_MEMORY;
+      return false;
+   }
+
+   if (brw_init_pipe_control(brw, devinfo)) {
+      fprintf(stderr, "%s: failed to alloc workarounds\n", __func__);
+      *dri_ctx_error = __DRI_CTX_ERROR_NO_MEMORY;
+      return false;
+   }
 
    brw->gen = devinfo->gen;
    brw->gt = devinfo->gt;
@@ -804,12 +813,6 @@ brwCreateContext(gl_api api,
 
    intel_fbo_init(brw);
 
-   if (!intel_batchbuffer_init(brw)) {
-      intelDestroyContext(driContextPriv);
-      return false;
-   }
-
-   brw_init_pipe_control(brw, devinfo);
    brw_init_state(brw);
 
    intelInitExtensions(ctx);
@@ -927,14 +930,14 @@ intelDestroyContext(__DRIcontext * driContextPriv)
    if (ctx->swrast_context)
       _swrast_DestroyContext(&brw->ctx);
 
+   /* free the Mesa context */
+   _mesa_free_context_data(&brw->ctx);
+
    brw_fini_pipe_control(brw);
-   intel_batchbuffer_free(brw);
+   brw_batch_fini(&brw->batch);
 
    driDestroyOptionCache(&brw->optionCache);
 
-   /* free the Mesa context */
-   _mesa_free_context_data(&brw->ctx);
-
    ralloc_free(brw);
    driContextPriv->driverPrivate = NULL;
 }
@@ -1419,12 +1422,14 @@ intel_update_image_buffer(struct brw_context *intel,
    else
       last_mt = rb->singlesample_mt;
 
-   if (last_mt && last_mt->bo == buffer->bo)
+   if (last_mt && last_mt->bo->handle == buffer->bo->handle)
       return;
 
-   intel_update_winsys_renderbuffer_miptree(intel, rb, buffer->bo,
+   brw_bo *bo = brw_bo_import(&intel->batch, buffer->bo, true);
+   intel_update_winsys_renderbuffer_miptree(intel, rb, bo,
                                             buffer->width, buffer->height,
                                             buffer->pitch);
+   brw_bo_put(bo);
 
    if (brw_is_front_buffer_drawing(fb) &&
        buffer_type == __DRI_IMAGE_BUFFER_FRONT &&
@@ -1511,6 +1516,13 @@ void brw_batch_start_hook(brw_batch *batch)
       brw_dump_perf_monitors(brw);
 
    brw_perf_monitor_new_batch(brw);
+
+/* Drop when RS headers get pulled to libdrm */
+#ifndef I915_EXEC_RESOURCE_STREAMER
+#define I915_EXEC_RESOURCE_STREAMER (1<<15)
+#endif
+   if (brw->use_resource_streamer)
+      brw->batch.batch_flags |= I915_EXEC_RESOURCE_STREAMER;
 }
 
 /**
@@ -1562,8 +1574,6 @@ void brw_batch_finish_hook(brw_batch *batch)
    /* We may also need to snapshot and disable OA counters. */
    brw_perf_monitor_finish_batch(brw);
 
-   brw->cache.bo_used_by_gpu = true;
-
    brw->state_batch_count = 0;
 
    brw->ib.type = -1;
diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
index f326a45..647bb4f 100644
--- a/src/mesa/drivers/dri/i965/brw_context.h
+++ b/src/mesa/drivers/dri/i965/brw_context.h
@@ -51,7 +51,6 @@ extern "C" {
 #endif
 
 #include <drm.h>
-#include <intel_bufmgr.h>
 #include <i915_drm.h>
 #ifdef __cplusplus
 	#undef virtual
@@ -796,7 +795,6 @@ struct brw_cache {
    GLuint size, n_items;
 
    uint32_t next_offset;
-   bool bo_used_by_gpu;
 
    /**
     * Optional functions used in determining whether the prog_data for a new
@@ -854,9 +852,6 @@ struct brw_query_object {
 
    /** Last index in bo with query data for this object. */
    int last_index;
-
-   /** True if we know the batch has been flushed since we ended the query. */
-   bool flushed;
 };
 
 #define BRW_MAX_XFB_STREAMS 4
@@ -1600,7 +1595,7 @@ bool brw_check_conditional_render(struct brw_context *brw);
 /*======================================================================
  * brw_state_dump.c
  */
-void brw_debug_batch(struct brw_context *brw);
+void brw_debug_batch(brw_batch *batch);
 
 /*======================================================================
  * brw_tex.c
@@ -1709,12 +1704,6 @@ void brw_dump_perf_monitors(struct brw_context *brw);
 void brw_perf_monitor_new_batch(struct brw_context *brw);
 void brw_perf_monitor_finish_batch(struct brw_context *brw);
 
-/* intel_buffer_objects.c */
-int brw_bo_map(struct brw_context *brw, brw_bo *bo, int write_enable,
-               const char *bo_name);
-int brw_bo_map_gtt(struct brw_context *brw, brw_bo *bo,
-                   const char *bo_name);
-
 /* intel_extensions.c */
 extern void intelInitExtensions(struct gl_context *ctx);
 
@@ -1966,8 +1955,8 @@ gen9_use_linear_1d_layout(const struct brw_context *brw,
                           const struct intel_mipmap_tree *mt);
 
 /* brw_pipe_control.c */
-void brw_init_pipe_control(struct brw_context *brw,
-                           const struct brw_device_info *info);
+int brw_init_pipe_control(struct brw_context *brw,
+                          const struct brw_device_info *info);
 void brw_fini_pipe_control(struct brw_context *brw);
 
 void brw_emit_pipe_control_flush(struct brw_context *brw, uint32_t flags);
@@ -1988,7 +1977,4 @@ bool brw_check_dirty(struct brw_context *ctx, brw_bo *bo);
 }
 #endif
 
-/* Temporary include to hide some mechanical changes for brw-batch */
-#include "intel_batchbuffer.h"
-
 #endif
diff --git a/src/mesa/drivers/dri/i965/brw_performance_monitor.c b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
index 5545ac7..c09db4d 100644
--- a/src/mesa/drivers/dri/i965/brw_performance_monitor.c
+++ b/src/mesa/drivers/dri/i965/brw_performance_monitor.c
@@ -617,14 +617,12 @@ gather_statistics_results(struct brw_context *brw,
       return;
    }
 
-   drm_intel_bo_map(monitor->pipeline_stats_bo, false);
-   uint64_t *start = monitor->pipeline_stats_bo->virtual;
+   uint64_t *start = brw_bo_map(monitor->pipeline_stats_bo, MAP_READ, NULL);
    uint64_t *end = start + (SECOND_SNAPSHOT_OFFSET_IN_BYTES / sizeof(uint64_t));
 
    for (int i = 0; i < num_counters; i++) {
       monitor->pipeline_stats_results[i] = end[i] - start[i];
    }
-   drm_intel_bo_unmap(monitor->pipeline_stats_bo);
    brw_bo_put(monitor->pipeline_stats_bo);
    monitor->pipeline_stats_bo = NULL;
 }
@@ -880,8 +878,7 @@ gather_oa_results(struct brw_context *brw,
    struct gl_perf_monitor_object *m = &monitor->base;
    assert(monitor->oa_bo != NULL);
 
-   drm_intel_bo_map(monitor->oa_bo, false);
-   uint32_t *monitor_buffer = monitor->oa_bo->virtual;
+   uint32_t *monitor_buffer = brw_bo_map(monitor->oa_bo, MAP_READ, NULL);
 
    /* If monitoring was entirely contained within a single batch, then the
     * bookend BO is irrelevant.  Just subtract monitor->bo's two snapshots.
@@ -891,7 +888,6 @@ gather_oa_results(struct brw_context *brw,
                  monitor_buffer,
                  monitor_buffer + (SECOND_SNAPSHOT_OFFSET_IN_BYTES /
                                    sizeof(uint32_t)));
-      drm_intel_bo_unmap(monitor->oa_bo);
       return;
    }
 
@@ -938,8 +934,6 @@ gather_oa_results(struct brw_context *brw,
                                    sizeof(uint32_t)));
    }
 
-   drm_intel_bo_unmap(monitor->oa_bo);
-
    /* If the monitor has ended, then we've gathered all the results, and
     * can free the monitor's OA BO.
     */
@@ -977,8 +971,7 @@ wrap_bookend_bo(struct brw_context *brw)
     */
    assert(brw->perfmon.oa_users > 0);
 
-   drm_intel_bo_map(brw->perfmon.bookend_bo, false);
-   uint32_t *bookend_buffer = brw->perfmon.bookend_bo->virtual;
+   uint32_t *bookend_buffer = brw_bo_map(brw->perfmon.bookend_bo, MAP_READ, NULL);
    for (int i = 0; i < brw->perfmon.unresolved_elements; i++) {
       struct brw_perf_monitor_object *monitor = brw->perfmon.unresolved[i];
       struct gl_perf_monitor_object *m = &monitor->base;
@@ -999,7 +992,6 @@ wrap_bookend_bo(struct brw_context *brw)
          assert(monitor->oa_tail_start == -1);
       }
    }
-   drm_intel_bo_unmap(brw->perfmon.bookend_bo);
 
    brw->perfmon.bookend_snapshots = 0;
 }
@@ -1114,9 +1106,7 @@ brw_begin_perf_monitor(struct gl_context *ctx,
                                      4096, 64, DBG_BO_ALLOC_FLAG);
 #ifdef DEBUG
       /* Pre-filling the BO helps debug whether writes landed. */
-      drm_intel_bo_map(monitor->oa_bo, true);
-      memset((char *) monitor->oa_bo->virtual, 0xff, 4096);
-      drm_intel_bo_unmap(monitor->oa_bo);
+      memset(brw_bo_map(monitor->oa_bo, MAP_WRITE, NULL), 0xff, 4096);
 #endif
 
       /* Allocate storage for accumulated OA counter values. */
@@ -1237,15 +1227,13 @@ brw_is_perf_monitor_result_available(struct gl_context *ctx,
    bool stats_available = true;
 
    if (monitor_needs_oa(brw, m)) {
-      oa_available = !monitor->oa_bo ||
-         (!drm_intel_bo_references(brw->batch.bo, monitor->oa_bo) &&
-          !drm_intel_bo_busy(monitor->oa_bo));
+      oa_available = !brw_bo_busy(monitor->oa_bo, BUSY_READ,
+                                  PERF_DEBUG(brw, "IsPerfMonitorResultAvailable"));
    }
 
    if (monitor_needs_statistics_registers(brw, m)) {
-      stats_available = !monitor->pipeline_stats_bo ||
-         (!drm_intel_bo_references(brw->batch.bo, monitor->pipeline_stats_bo) &&
-          !drm_intel_bo_busy(monitor->pipeline_stats_bo));
+      stats_available = !brw_bo_busy(monitor->pipeline_stats_bo, BUSY_READ,
+                                     PERF_DEBUG(brw, "IsPerfMonitorResultAvailable"));
    }
 
    return oa_available && stats_available;
@@ -1292,11 +1280,9 @@ brw_get_perf_monitor_result(struct gl_context *ctx,
           * Using an unsynchronized mapping avoids stalling for an
           * indeterminate amount of time.
           */
-         drm_intel_gem_bo_map_unsynchronized(brw->perfmon.bookend_bo);
-
-         gather_oa_results(brw, monitor, brw->perfmon.bookend_bo->virtual);
-
-         drm_intel_bo_unmap(brw->perfmon.bookend_bo);
+         gather_oa_results(brw, monitor,
+                           brw_bo_map(brw->perfmon.bookend_bo,
+                                      MAP_READ | MAP_ASYNC, NULL));
       }
 
       for (int i = 0; i < brw->perfmon.entries_per_oa_snapshot; i++) {
@@ -1385,7 +1371,6 @@ void
 brw_perf_monitor_new_batch(struct brw_context *brw)
 {
    assert(brw->batch.ring == RENDER_RING);
-   assert(brw->gen < 6 || USED_BATCH(&brw->batch) == 0);
 
    if (brw->perfmon.oa_users == 0)
       return;
diff --git a/src/mesa/drivers/dri/i965/brw_pipe_control.c b/src/mesa/drivers/dri/i965/brw_pipe_control.c
index 67ff19b..16b882a 100644
--- a/src/mesa/drivers/dri/i965/brw_pipe_control.c
+++ b/src/mesa/drivers/dri/i965/brw_pipe_control.c
@@ -288,7 +288,10 @@ brw_emit_post_sync_nonzero_flush(struct brw_context *brw)
 void
 brw_emit_mi_flush(struct brw_context *brw)
 {
-   if (brw->batch.ring == BLT_RING && brw->gen >= 6) {
+   if (brw_batch_count(&brw->batch) == 0)
+      return;
+
+   if (brw->batch.ring == BLT_RING) {
       BEGIN_BATCH_BLT(4);
       OUT_BATCH(MI_FLUSH_DW);
       OUT_BATCH(0);
@@ -332,26 +335,38 @@ brw_emit_mi_flush(struct brw_context *brw)
 void
 brw_mi_flush(struct brw_context *brw, enum brw_gpu_ring ring)
 {
+   /* Nothing in the batch yet; rely on the kernel's flush before the batch */
+   if (brw_batch_count(&brw->batch) == 0)
+      return;
+
+   /* This would switch rings; again rely on the kernel's flush between batches */
+   if (brw->batch.actual_ring[ring] != brw->batch.ring)
+      return;
+
    if (brw_batch_begin(&brw->batch, 60, ring) >= 0) {
       brw_emit_mi_flush(brw);
       brw_batch_end(&brw->batch);
    }
 }
 
-void
+int
 brw_init_pipe_control(struct brw_context *brw,
                       const struct brw_device_info *devinfo)
 {
    if (devinfo->gen < 6)
-      return;
+      return 0;
 
    /* We can't just use brw_state_batch to get a chunk of space for
     * the gen6 workaround because it involves actually writing to
     * the buffer, and the kernel doesn't let us write to the batch.
     */
-   brw->workaround_bo = brw_bo_get(brw->intelScreen->workaround_bo);
+   brw->workaround_bo =
+      brw_bo_import(&brw->batch, brw->intelScreen->workaround_bo, true);
+   if (brw->workaround_bo == NULL)
+      return -ENOMEM;
 
    brw->pipe_controls_since_last_cs_stall = 0;
+   return 0;
 }
 
 void
diff --git a/src/mesa/drivers/dri/i965/brw_program.c b/src/mesa/drivers/dri/i965/brw_program.c
index cd9cfc6..8edcd42 100644
--- a/src/mesa/drivers/dri/i965/brw_program.c
+++ b/src/mesa/drivers/dri/i965/brw_program.c
@@ -464,8 +464,7 @@ brw_collect_shader_time(struct brw_context *brw)
     * delaying reading the reports, but it doesn't look like it's a big
     * overhead compared to the cost of tracking the time in the first place.
     */
-   drm_intel_bo_map(brw->shader_time.bo, true);
-   void *bo_map = brw->shader_time.bo->virtual;
+   void *bo_map = brw_bo_map(brw->shader_time.bo, MAP_WRITE, NULL);
 
    for (int i = 0; i < brw->shader_time.num_entries; i++) {
       uint32_t *times = bo_map + i * 3 * SHADER_TIME_STRIDE;
@@ -478,7 +477,6 @@ brw_collect_shader_time(struct brw_context *brw)
    /* Zero the BO out to clear it out for our next collection.
     */
    memset(bo_map, 0, brw->shader_time.bo->size);
-   drm_intel_bo_unmap(brw->shader_time.bo);
 }
 
 void
diff --git a/src/mesa/drivers/dri/i965/brw_queryobj.c b/src/mesa/drivers/dri/i965/brw_queryobj.c
index 28d01c1..556e691 100644
--- a/src/mesa/drivers/dri/i965/brw_queryobj.c
+++ b/src/mesa/drivers/dri/i965/brw_queryobj.c
@@ -103,17 +103,7 @@ brw_queryobj_get_results(struct gl_context *ctx,
     * still contributing to it, flush it now so the results will be present
     * when mapped.
     */
-   if (drm_intel_bo_references(brw->batch.bo, query->bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "GetQuery"));
-
-   if (unlikely(brw->perf_debug)) {
-      if (drm_intel_bo_busy(query->bo)) {
-         perf_debug("Stalling on the GPU waiting for a query object.\n");
-      }
-   }
-
-   drm_intel_bo_map(query->bo, false);
-   results = query->bo->virtual;
+   results = brw_bo_map(query->bo, MAP_READ, PERF_DEBUG(brw, "GetQuery"));
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED_EXT:
       /* The query BO contains the starting and ending timestamps.
@@ -159,7 +149,6 @@ brw_queryobj_get_results(struct gl_context *ctx,
    default:
       unreachable("Unrecognized query target in brw_queryobj_get_results()");
    }
-   drm_intel_bo_unmap(query->bo);
 
    /* Now that we've processed the data stored in the query's buffer object,
     * we can release it.
@@ -373,10 +362,8 @@ static void brw_check_query(struct gl_context *ctx, struct gl_query_object *q)
     *      not ready yet on the first time it is queried.  This ensures that
     *      the async query will return true in finite time.
     */
-   if (query->bo && drm_intel_bo_references(brw->batch.bo, query->bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "CheckQuery"));
-
-   if (query->bo == NULL || !drm_intel_bo_busy(query->bo)) {
+   if (!brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH,
+                    PERF_DEBUG(brw, "CheckQuery"))) {
       brw_queryobj_get_results(ctx, query);
       query->Base.Ready = true;
    }
@@ -500,8 +487,6 @@ brw_query_counter(struct gl_context *ctx, struct gl_query_object *q)
 
    brw_write_timestamp(brw, query->bo, 0);
    brw_batch_end(&brw->batch);
-
-   query->flushed = false;
 }
 
 /**
diff --git a/src/mesa/drivers/dri/i965/brw_reset.c b/src/mesa/drivers/dri/i965/brw_reset.c
index f84df22..6c50a0f 100644
--- a/src/mesa/drivers/dri/i965/brw_reset.c
+++ b/src/mesa/drivers/dri/i965/brw_reset.c
@@ -49,7 +49,7 @@ brw_get_graphics_reset_status(struct gl_context *ctx)
    if (brw->reset_count != 0)
       return GL_NO_ERROR;
 
-   err = drm_intel_get_reset_stats(brw->batch.hw_ctx,
+   err = brw_batch_get_reset_stats(&brw->batch,
                                    &reset_count, &active, &pending);
    if (err)
       return GL_NO_ERROR;
diff --git a/src/mesa/drivers/dri/i965/brw_state.h b/src/mesa/drivers/dri/i965/brw_state.h
index 943e8c8..34d133a 100644
--- a/src/mesa/drivers/dri/i965/brw_state.h
+++ b/src/mesa/drivers/dri/i965/brw_state.h
@@ -227,7 +227,7 @@ void brw_destroy_caches( struct brw_context *brw );
  * brw_state_batch.c
  */
 #define BRW_BATCH_STRUCT(brw, s) \
-   intel_batchbuffer_data(brw, (s), sizeof(*(s)), RENDER_RING)
+   brw_batch_data(&brw->batch, (s), sizeof(*(s)))
 
 void *__brw_state_batch(struct brw_context *brw,
                         enum aub_state_struct_type type,
diff --git a/src/mesa/drivers/dri/i965/brw_state_batch.c b/src/mesa/drivers/dri/i965/brw_state_batch.c
index 8b99b26..315f6eb 100644
--- a/src/mesa/drivers/dri/i965/brw_state_batch.c
+++ b/src/mesa/drivers/dri/i965/brw_state_batch.c
@@ -78,27 +78,13 @@ __brw_state_batch(struct brw_context *brw,
                   uint32_t *out_offset)
 
 {
-   brw_batch *batch = &brw->batch;
-   uint32_t offset;
-
-   assert(size < batch->bo->size);
-   offset = ROUND_DOWN_TO(batch->state_batch_offset - size, alignment);
-
-   /* If allocating from the top would wrap below the batchbuffer, or
-    * if the batch's used space (plus the reserved pad) collides with our
-    * space, then flush and try again.
-    */
-   if (batch->state_batch_offset < size ||
-       offset < 4 * USED_BATCH(batch) + batch->reserved_space) {
-      brw_batch_flush(batch, NULL);
-      offset = ROUND_DOWN_TO(batch->state_batch_offset - size, alignment);
-   }
-
-   batch->state_batch_offset = offset;
+   assert(size < brw->batch.bo->size);
+   brw->batch.state = ROUND_DOWN_TO(4*brw->batch.state - size, alignment)/4;
+   assert(brw->batch.state > brw->batch._ptr - brw->batch.map);
 
    if (unlikely(INTEL_DEBUG & DEBUG_BATCH))
-      brw_track_state_batch(brw, type, offset, size, index);
+      brw_track_state_batch(brw, type, 4*brw->batch.state, size, index);
 
-   *out_offset = offset;
-   return batch->map + (offset>>2);
+   *out_offset = 4*brw->batch.state;
+   return brw->batch.map + brw->batch.state;
 }
diff --git a/src/mesa/drivers/dri/i965/brw_state_cache.c b/src/mesa/drivers/dri/i965/brw_state_cache.c
index 3ba12c4..9c3d4ea 100644
--- a/src/mesa/drivers/dri/i965/brw_state_cache.c
+++ b/src/mesa/drivers/dri/i965/brw_state_cache.c
@@ -171,26 +171,16 @@ brw_cache_new_bo(struct brw_cache *cache, uint32_t new_size)
    brw_bo *new_bo;
 
    new_bo = brw_bo_create(&brw->batch, "program cache", new_size, 64, 0);
-   if (brw->has_llc)
-      drm_intel_gem_bo_map_unsynchronized(new_bo);
 
    /* Copy any existing data that needs to be saved. */
    if (cache->next_offset != 0) {
-      if (brw->has_llc) {
-         memcpy(new_bo->virtual, cache->bo->virtual, cache->next_offset);
-      } else {
-         drm_intel_bo_map(cache->bo, false);
-         drm_intel_bo_subdata(new_bo, 0, cache->next_offset,
-                              cache->bo->virtual);
-         drm_intel_bo_unmap(cache->bo);
-      }
+      brw_bo_read(cache->bo, 0,
+                  brw_bo_map(new_bo, MAP_WRITE, NULL), cache->next_offset,
+                  MAP_ASYNC, NULL);
    }
 
-   if (brw->has_llc)
-      drm_intel_bo_unmap(cache->bo);
    brw_bo_put(cache->bo);
    cache->bo = new_bo;
-   cache->bo_used_by_gpu = false;
 
    /* Since we have a new BO in place, we need to signal the units
     * that depend on it (state base address on gen5+, or unit state before).
@@ -208,7 +198,6 @@ brw_try_upload_using_copy(struct brw_cache *cache,
 			  const void *data,
 			  const void *aux)
 {
-   struct brw_context *brw = cache->brw;
    int i;
    struct brw_cache_item *item;
 
@@ -230,11 +219,9 @@ brw_try_upload_using_copy(struct brw_cache *cache,
 	    continue;
 	 }
 
-         if (!brw->has_llc)
-            drm_intel_bo_map(cache->bo, false);
-	 ret = memcmp(cache->bo->virtual + item->offset, data, item->size);
-         if (!brw->has_llc)
-            drm_intel_bo_unmap(cache->bo);
+         void *old =
+            brw_bo_map(cache->bo, MAP_READ | MAP_ASYNC, NULL) + item->offset;
+         ret = memcmp(old, data, item->size);
 	 if (ret)
 	    continue;
 
@@ -252,8 +239,6 @@ brw_upload_item_data(struct brw_cache *cache,
 		     struct brw_cache_item *item,
 		     const void *data)
 {
-   struct brw_context *brw = cache->brw;
-
    /* Allocate space in the cache BO for our new program. */
    if (cache->next_offset + item->size > cache->bo->size) {
       uint32_t new_size = cache->bo->size * 2;
@@ -264,16 +249,11 @@ brw_upload_item_data(struct brw_cache *cache,
       brw_cache_new_bo(cache, new_size);
    }
 
-   /* If we would block on writing to an in-use program BO, just
-    * recreate it.
-    */
-   if (!brw->has_llc && cache->bo_used_by_gpu) {
-      perf_debug("Copying busy program cache buffer.\n");
-      brw_cache_new_bo(cache, cache->bo->size);
-   }
-
    item->offset = cache->next_offset;
 
+   /* Copy data to the buffer */
+   brw_bo_write(cache->bo, item->offset, data, item->size, MAP_ASYNC, NULL);
+
    /* Programs are always 64-byte aligned, so set up the next one now */
    cache->next_offset = ALIGN(item->offset + item->size, 64);
 }
@@ -290,7 +270,6 @@ brw_upload_cache(struct brw_cache *cache,
 		 uint32_t *out_offset,
 		 void *out_aux)
 {
-   struct brw_context *brw = cache->brw;
    struct brw_cache_item *item = CALLOC_STRUCT(brw_cache_item);
    GLuint hash;
    void *tmp;
@@ -330,13 +309,6 @@ brw_upload_cache(struct brw_cache *cache,
    cache->items[hash] = item;
    cache->n_items++;
 
-   /* Copy data to the buffer */
-   if (brw->has_llc) {
-      memcpy((char *) cache->bo->virtual + item->offset, data, data_size);
-   } else {
-      drm_intel_bo_subdata(cache->bo, item->offset, data_size, data);
-   }
-
    *out_offset = item->offset;
    *(void **)out_aux = (void *)((char *)item->key + item->key_size);
    cache->brw->ctx.NewDriverState |= 1 << cache_id;
@@ -355,8 +327,6 @@ brw_init_caches(struct brw_context *brw)
       calloc(cache->size, sizeof(struct brw_cache_item *));
 
    cache->bo = brw_bo_create(&brw->batch, "program cache", 4096, 64, 0);
-   if (brw->has_llc)
-      drm_intel_gem_bo_map_unsynchronized(cache->bo);
 
    cache->aux_compare[BRW_CACHE_VS_PROG] = brw_vs_prog_data_compare;
    cache->aux_compare[BRW_CACHE_GS_PROG] = brw_gs_prog_data_compare;
@@ -391,6 +361,9 @@ brw_clear_cache(struct brw_context *brw, struct brw_cache *cache)
 
    cache->n_items = 0;
 
+   brw_bo_put(cache->bo);
+   cache->bo = brw_bo_create(&brw->batch, "program cache", 4096, 64, 0);
+
    /* Start putting programs into the start of the BO again, since
     * we'll never find the old results.
     */
@@ -401,7 +374,6 @@ brw_clear_cache(struct brw_context *brw, struct brw_cache *cache)
     */
    brw->NewGLState |= ~0;
    brw->ctx.NewDriverState |= ~0ull;
-   brw_batch_flush(&brw->batch, NULL);
 }
 
 void
@@ -424,11 +396,10 @@ brw_destroy_cache(struct brw_context *brw, struct brw_cache *cache)
 
    DBG("%s\n", __func__);
 
-   if (brw->has_llc)
-      drm_intel_bo_unmap(cache->bo);
+   brw_clear_cache(brw, cache);
    brw_bo_put(cache->bo);
    cache->bo = NULL;
-   brw_clear_cache(brw, cache);
+
    free(cache->items);
    cache->items = NULL;
    cache->size = 0;
diff --git a/src/mesa/drivers/dri/i965/brw_state_dump.c b/src/mesa/drivers/dri/i965/brw_state_dump.c
index a597c1f..9bdc080 100644
--- a/src/mesa/drivers/dri/i965/brw_state_dump.c
+++ b/src/mesa/drivers/dri/i965/brw_state_dump.c
@@ -68,7 +68,7 @@ static const char *surface_tiling[] = {
 
 static void *batch_in(struct brw_context *brw, unsigned offset)
 {
-   return (void *)brw->batch.bo->virtual + offset;
+   return (void *)brw->batch.map + offset;
 }
 
 static void
@@ -720,8 +720,6 @@ dump_prog_cache(struct brw_context *brw)
    struct brw_cache *cache = &brw->cache;
    unsigned int b;
 
-   drm_intel_bo_map(brw->cache.bo, false);
-
    for (b = 0; b < cache->size; b++) {
       struct brw_cache_item *item;
 
@@ -756,12 +754,11 @@ dump_prog_cache(struct brw_context *brw)
 	 }
 
          fprintf(stderr, "%s:\n", name);
-         brw_disassemble(brw->intelScreen->devinfo, brw->cache.bo->virtual,
+         brw_disassemble(brw->intelScreen->devinfo,
+                         brw_bo_map(brw->cache.bo, MAP_READ | MAP_ASYNC, NULL),
                          item->offset, item->size, stderr);
       }
    }
-
-   drm_intel_bo_unmap(brw->cache.bo);
 }
 
 static void
@@ -864,12 +861,11 @@ dump_state_batch(struct brw_context *brw)
  * The buffer offsets printed rely on the buffer containing the last offset
  * it was validated at.
  */
-void brw_debug_batch(struct brw_context *brw)
+void brw_debug_batch(struct brw_batch *batch)
 {
-   drm_intel_bo_map(brw->batch.bo, false);
-   dump_state_batch(brw);
-   drm_intel_bo_unmap(brw->batch.bo);
+   struct brw_context *brw = container_of(batch, brw, batch);
 
+   dump_state_batch(brw);
    if (0)
       dump_prog_cache(brw);
 }
diff --git a/src/mesa/drivers/dri/i965/brw_urb.c b/src/mesa/drivers/dri/i965/brw_urb.c
index f4215c7..b19c810 100644
--- a/src/mesa/drivers/dri/i965/brw_urb.c
+++ b/src/mesa/drivers/dri/i965/brw_urb.c
@@ -250,12 +250,6 @@ void brw_upload_urb_fence(struct brw_context *brw)
    uf.bits1.cs_fence  = brw->urb.size;
 
    /* erratum: URB_FENCE must not cross a 64byte cacheline */
-   if ((USED_BATCH(&brw->batch) & 15) > 12) {
-      int pad = 16 - (USED_BATCH(&brw->batch) & 15);
-      do
-         *brw->batch.map_next++ = MI_NOOP;
-      while (--pad);
-   }
-
+   brw_batch_cacheline_evade(&brw->batch, sizeof(uf));
    BRW_BATCH_STRUCT(brw, &uf);
 }
diff --git a/src/mesa/drivers/dri/i965/gen6_queryobj.c b/src/mesa/drivers/dri/i965/gen6_queryobj.c
index 935dcfe..7b0e884 100644
--- a/src/mesa/drivers/dri/i965/gen6_queryobj.c
+++ b/src/mesa/drivers/dri/i965/gen6_queryobj.c
@@ -174,8 +174,8 @@ gen6_queryobj_get_results(struct gl_context *ctx,
    if (query->bo == NULL)
       return;
 
-   brw_bo_map(brw, query->bo, false, "query object");
-   uint64_t *results = query->bo->virtual;
+   uint64_t *results =
+      brw_bo_map(query->bo, MAP_READ, PERF_DEBUG(brw, "GetQuery"));
    switch (query->Base.Target) {
    case GL_TIME_ELAPSED:
       /* The query BO contains the starting and ending timestamps.
@@ -254,7 +254,6 @@ gen6_queryobj_get_results(struct gl_context *ctx,
    default:
       unreachable("Unrecognized query target in brw_queryobj_get_results()");
    }
-   drm_intel_bo_unmap(query->bo);
 
    /* Now that we've processed the data stored in the query's buffer object,
     * we can release it.
@@ -400,27 +399,6 @@ gen6_end_query(struct gl_context *ctx, struct gl_query_object *q)
    }
 
    brw_batch_end(&brw->batch);
-
-   /* The current batch contains the commands to handle EndQuery(),
-    * but they won't actually execute until it is flushed.
-    */
-   query->flushed = false;
-}
-
-/**
- * Flush the batch if it still references the query object BO.
- */
-static void
-flush_batch_if_needed(struct brw_context *brw, struct brw_query_object *query)
-{
-   /* If the batch doesn't reference the BO, it must have been flushed
-    * (for example, due to being full).  Record that it's been flushed.
-    */
-   query->flushed = query->flushed ||
-      !drm_intel_bo_references(brw->batch.bo, query->bo);
-
-   if (!query->flushed)
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "GetQuery"));
 }
 
 /**
@@ -431,15 +409,12 @@ flush_batch_if_needed(struct brw_context *brw, struct brw_query_object *query)
  */
 static void gen6_wait_query(struct gl_context *ctx, struct gl_query_object *q)
 {
-   struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
    /* If the application has requested the query result, but this batch is
     * still contributing to it, flush it now to finish that work so the
     * result will become available (eventually).
     */
-   flush_batch_if_needed(brw, query);
-
    gen6_queryobj_get_results(ctx, query);
 }
 
@@ -451,7 +426,6 @@ static void gen6_wait_query(struct gl_context *ctx, struct gl_query_object *q)
  */
 static void gen6_check_query(struct gl_context *ctx, struct gl_query_object *q)
 {
-   struct brw_context *brw = brw_context(ctx);
    struct brw_query_object *query = (struct brw_query_object *)q;
 
    /* If query->bo is NULL, we've already gathered the results - this is a
@@ -467,9 +441,8 @@ static void gen6_check_query(struct gl_context *ctx, struct gl_query_object *q)
     *      not ready yet on the first time it is queried.  This ensures that
     *      the async query will return true in finite time.
     */
-   flush_batch_if_needed(brw, query);
-
-   if (!drm_intel_bo_busy(query->bo)) {
+   if (!brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH,
+                    PERF_DEBUG(brw_context(ctx), "CheckQuery"))) {
       gen6_queryobj_get_results(ctx, query);
    }
 }
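
The query paths above show the new idiom: brw_bo_busy() with BUSY_FLUSH
subsumes the old "does the current batch still reference this bo" check (so
flush_batch_if_needed() disappears), and brw_bo_map() hands back a pointer
directly, with the explicit unmap calls dropped. A hedged sketch of a
non-blocking readback under this API, using only the signatures visible in
this patch (the two-slot start/end result layout is just an illustration):

   static void
   poll_query_result(struct brw_context *brw, struct brw_query_object *query)
   {
      /* BUSY_FLUSH: if the current batch still writes this bo, flush it so
       * the result becomes available in finite time; return while busy.
       */
      if (brw_bo_busy(query->bo, BUSY_READ | BUSY_FLUSH,
                      PERF_DEBUG(brw, "CheckQuery")))
         return;

      uint64_t *results =
         brw_bo_map(query->bo, MAP_READ, PERF_DEBUG(brw, "GetQuery"));
      query->Base.Result = results[1] - results[0];
      query->Base.Ready = true;
      /* no drm_intel_bo_unmap(): this series drops the explicit unmaps */
   }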
diff --git a/src/mesa/drivers/dri/i965/gen7_sol_state.c b/src/mesa/drivers/dri/i965/gen7_sol_state.c
index b177521..d8bfb71 100644
--- a/src/mesa/drivers/dri/i965/gen7_sol_state.c
+++ b/src/mesa/drivers/dri/i965/gen7_sol_state.c
@@ -316,14 +316,7 @@ gen7_tally_prims_generated(struct brw_context *brw,
    /* If the current batch is still contributing to the number of primitives
     * generated, flush it now so the results will be present when mapped.
     */
-   if (drm_intel_bo_references(brw->batch.bo, obj->prim_count_bo))
-      brw_batch_flush(&brw->batch, perf);
-
-   if (unlikely(brw->perf_debug && drm_intel_bo_busy(obj->prim_count_bo)))
-      perf_debug("Stalling for # of transform feedback primitives written.\n");
-
-   drm_intel_bo_map(obj->prim_count_bo, false);
-   uint64_t *prim_counts = obj->prim_count_bo->virtual;
+   uint64_t *prim_counts = brw_bo_map(obj->prim_count_bo, MAP_READ, perf);
 
    assert(obj->prim_count_buffer_index % (2 * BRW_MAX_XFB_STREAMS) == 0);
    int pairs = obj->prim_count_buffer_index / (2 * BRW_MAX_XFB_STREAMS);
@@ -336,8 +329,6 @@ gen7_tally_prims_generated(struct brw_context *brw,
       prim_counts += 2 * BRW_MAX_XFB_STREAMS; /* move to the next pair */
    }
 
-   drm_intel_bo_unmap(obj->prim_count_bo);
-
    /* We've already gathered up the old data; we can safely overwrite it now. */
    obj->prim_count_buffer_index = 0;
 }
@@ -452,7 +443,7 @@ gen7_begin_transform_feedback(struct gl_context *ctx, GLenum mode,
       brw_obj->zero_offsets = true;
    } else if (!brw->has_pipelined_so) {
       brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "BeginTransformFeedback"));
-      brw->batch.needs_sol_reset = true;
+      brw->batch.batch_flags |= I915_EXEC_GEN7_SOL_RESET;
    }
 
    /* We're about to lose the information needed to compute the number of
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
deleted file mode 100644
index 6e6b794..0000000
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ /dev/null
@@ -1,439 +0,0 @@
-/**************************************************************************
- *
- * Copyright 2006 VMware, Inc.
- * All Rights Reserved.
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and associated documentation files (the
- * "Software"), to deal in the Software without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sub license, and/or sell copies of the Software, and to
- * permit persons to whom the Software is furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice (including the
- * next paragraph) shall be included in all copies or substantial portions
- * of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
- * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
- * IN NO EVENT SHALL VMWARE AND/OR ITS SUPPLIERS BE LIABLE FOR
- * ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- *
- **************************************************************************/
-
-#include "intel_batchbuffer.h"
-#include "intel_buffer_objects.h"
-#include "intel_reg.h"
-#include "intel_bufmgr.h"
-#include "intel_buffers.h"
-#include "intel_fbo.h"
-#include "brw_context.h"
-#include "brw_defines.h"
-#include "brw_state.h"
-
-#include <xf86drm.h>
-#include <i915_drm.h>
-
-static void
-intel_batchbuffer_reset(struct brw_context *brw);
-
-int
-intel_batchbuffer_init(struct brw_context *brw)
-{
-   brw->batch.gen = brw->gen;
-
-   intel_batchbuffer_reset(brw);
-
-   if (!brw->has_llc) {
-      brw->batch.cpu_map = malloc(BATCH_SZ);
-      brw->batch.map = brw->batch.cpu_map;
-      brw->batch.map_next = brw->batch.cpu_map;
-   }
-
-   if (brw->gen >= 6) {
-      /* Create a new hardware context.  Using a hardware context means that
-       * our GPU state will be saved/restored on context switch, allowing us
-       * to assume that the GPU is in the same state we left it in.
-       *
-       * This is required for transform feedback buffer offsets, query objects,
-       * and also allows us to reduce how much state we have to emit.
-       */
-      brw->batch.hw_ctx = drm_intel_gem_context_create(brw->batch.bufmgr);
-
-      if (!brw->batch.hw_ctx) {
-         fprintf(stderr, "Gen6+ requires Kernel 3.6 or later.\n");
-         return false;
-      }
-   }
-
-   return true;
-}
-
-static void
-intel_batchbuffer_reset(struct brw_context *brw)
-{
-   brw_bo_put(brw->batch.last_bo);
-   brw->batch.last_bo = brw->batch.bo;
-
-   brw_batch_clear_dirty(&brw->batch);
-
-   brw->batch.bo = brw_bo_create(&brw->batch, "batchbuffer", BATCH_SZ, 4096, 0);
-   if (brw->has_llc) {
-      drm_intel_bo_map(brw->batch.bo, true);
-      brw->batch.map = brw->batch.bo->virtual;
-   }
-   brw->batch.map_next = brw->batch.map;
-
-   brw->batch.reserved_space = BATCH_RESERVED;
-   brw->batch.state_batch_offset = brw->batch.bo->size;
-   brw->batch.needs_sol_reset = false;
-
-   /* We don't know what ring the new batch will be sent to until we see the
-    * first BEGIN_BATCH or BEGIN_BATCH_BLT.  Mark it as unknown.
-    */
-   brw->batch.ring = UNKNOWN_RING;
-}
-
-void
-intel_batchbuffer_save_state(struct brw_batch *batch)
-{
-   batch->saved.map_next = batch->map_next;
-   batch->saved.reloc_count = drm_intel_gem_bo_get_reloc_count(batch->bo);
-}
-
-void
-intel_batchbuffer_reset_to_saved(struct brw_batch *batch)
-{
-   drm_intel_gem_bo_clear_relocs(batch->bo, batch->saved.reloc_count);
-
-   batch->map_next = batch->saved.map_next;
-   if (USED_BATCH(batch) == 0)
-      batch->ring = UNKNOWN_RING;
-}
-
-void
-intel_batchbuffer_free(struct brw_context *brw)
-{
-   free(brw->batch.cpu_map);
-   brw_bo_put(brw->batch.last_bo);
-   brw_bo_put(brw->batch.bo);
-
-   brw_bo_put(brw->batch.throttle_batch[1]);
-   brw_bo_put(brw->batch.throttle_batch[0]);
-
-   drm_intel_gem_context_destroy(brw->batch.hw_ctx);
-}
-
-static void
-do_batch_dump(struct brw_context *brw)
-{
-   struct drm_intel_decode *decode;
-   brw_batch *batch = &brw->batch;
-   int ret;
-
-   decode = drm_intel_decode_context_alloc(brw->intelScreen->deviceID);
-   if (!decode)
-      return;
-
-   ret = drm_intel_bo_map(batch->bo, false);
-   if (ret == 0) {
-      drm_intel_decode_set_batch_pointer(decode,
-					 batch->bo->virtual,
-					 batch->bo->offset64,
-                                         USED_BATCH(batch));
-   } else {
-      fprintf(stderr,
-	      "WARNING: failed to map batchbuffer (%s), "
-	      "dumping uploaded data instead.\n", strerror(ret));
-
-      drm_intel_decode_set_batch_pointer(decode,
-					 batch->map,
-					 batch->bo->offset64,
-                                         USED_BATCH(batch));
-   }
-
-   drm_intel_decode_set_output_file(decode, stderr);
-   drm_intel_decode(decode);
-
-   drm_intel_decode_context_free(decode);
-
-   if (ret == 0) {
-      drm_intel_bo_unmap(batch->bo);
-
-      brw_debug_batch(brw);
-   }
-}
-
-/**
- * Called when starting a new batch buffer.
- */
-static void
-brw_new_batch(struct brw_context *brw)
-{
-   /* Create a new batchbuffer and reset the associated state: */
-   drm_intel_gem_bo_clear_relocs(brw->batch.bo, 0);
-   intel_batchbuffer_reset(brw);
-}
-
-static void
-throttle(struct brw_context *brw)
-{
-   /* Wait for the swapbuffers before the one we just emitted, so we
-    * don't get too many swaps outstanding for apps that are GPU-heavy
-    * but not CPU-heavy.
-    *
-    * We're using intelDRI2Flush (called from the loader before
-    * swapbuffer) and glFlush (for front buffer rendering) as the
-    * indicator that a frame is done and then throttle when we get
-    * here as we prepare to render the next frame.  At this point for
-    * round trips for swap/copy and getting new buffers are done and
-    * we'll spend less time waiting on the GPU.
-    *
-    * Unfortunately, we don't have a handle to the batch containing
-    * the swap, and getting our hands on that doesn't seem worth it,
-    * so we just use the first batch we emitted after the last swap.
-    */
-   if (brw->batch.need_swap_throttle && brw->batch.throttle_batch[0]) {
-      if (brw->batch.throttle_batch[1]) {
-         if (!brw->batch.disable_throttling)
-            drm_intel_bo_wait_rendering(brw->batch.throttle_batch[1]);
-         brw_bo_put(brw->batch.throttle_batch[1]);
-      }
-      brw->batch.throttle_batch[1] = brw->batch.throttle_batch[0];
-      brw->batch.throttle_batch[0] = NULL;
-      brw->batch.need_swap_throttle = false;
-      /* Throttling here is more precise than the throttle ioctl, so skip it */
-      brw->batch.need_flush_throttle = false;
-   }
-
-   if (brw->batch.need_flush_throttle) {
-      __DRIscreen *psp = brw->intelScreen->driScrnPriv;
-      drmCommandNone(psp->fd, DRM_I915_GEM_THROTTLE);
-      brw->batch.need_flush_throttle = false;
-   }
-}
-
-/* Drop when RS headers get pulled to libdrm */
-#ifndef I915_EXEC_RESOURCE_STREAMER
-#define I915_EXEC_RESOURCE_STREAMER (1<<15)
-#endif
-
-/* TODO: Push this whole function into bufmgr.
- */
-static int
-do_flush_locked(struct brw_context *brw)
-{
-   brw_batch *batch = &brw->batch;
-   int ret = 0;
-
-   if (brw->has_llc) {
-      drm_intel_bo_unmap(batch->bo);
-   } else {
-      ret = drm_intel_bo_subdata(batch->bo, 0, 4 * USED_BATCH(batch), batch->map);
-      if (ret == 0 && batch->state_batch_offset != batch->bo->size) {
-	 ret = drm_intel_bo_subdata(batch->bo,
-				    batch->state_batch_offset,
-				    batch->bo->size - batch->state_batch_offset,
-				    (char *)batch->map + batch->state_batch_offset);
-      }
-   }
-
-   if (!brw->intelScreen->no_hw) {
-      int flags;
-
-      if (brw->gen >= 6 && batch->ring == BLT_RING) {
-         flags = I915_EXEC_BLT;
-      } else {
-         flags = I915_EXEC_RENDER |
-            (brw->use_resource_streamer ? I915_EXEC_RESOURCE_STREAMER : 0);
-      }
-      if (batch->needs_sol_reset)
-	 flags |= I915_EXEC_GEN7_SOL_RESET;
-
-      if (ret == 0) {
-         if (batch->hw_ctx == NULL || batch->ring != RENDER_RING) {
-            ret = drm_intel_bo_mrb_exec(batch->bo, 4 * USED_BATCH(batch),
-                                        NULL, 0, 0, flags);
-         } else {
-            ret = drm_intel_gem_bo_context_exec(batch->bo, batch->hw_ctx,
-                                                4 * USED_BATCH(batch), flags);
-         }
-      }
-
-      throttle(brw);
-   }
-
-   if (unlikely(INTEL_DEBUG & DEBUG_BATCH))
-      do_batch_dump(brw);
-
-   if (ret != 0) {
-      fprintf(stderr, "intel_do_flush_locked failed: %s\n", strerror(-ret));
-      exit(1);
-   }
-
-   return ret;
-}
-
-int
-brw_batch_flush(struct brw_batch *batch, struct perf_debug *info)
-{
-   struct brw_context *brw = container_of(batch, brw, batch);
-   int ret;
-
-   if (USED_BATCH(batch) == 0)
-      return 0;
-
-   if (brw->batch.throttle_batch[0] == NULL)
-      brw->batch.throttle_batch[0] = brw_bo_get(brw->batch.bo);
-
-   if (unlikely(INTEL_DEBUG & DEBUG_BATCH)) {
-      int bytes_for_commands = 4 * USED_BATCH(batch);
-      int bytes_for_state = brw->batch.bo->size - brw->batch.state_batch_offset;
-      int total_bytes = bytes_for_commands + bytes_for_state;
-      fprintf(stderr, "%s:%d: Batchbuffer flush with %4db (pkt) + "
-              "%4db (state) = %4db (%0.1f%%)\n",
-              info ? info->file : "???", info ? info->line : -1,
-              bytes_for_commands, bytes_for_state,
-              total_bytes,
-              100.0f * total_bytes / BATCH_SZ);
-   }
-
-   if (unlikely(info))
-      brw_batch_report_flush_hook(batch, info);
-
-   brw->batch.reserved_space = 0;
-
-   brw->batch.begin_count++;
-   brw_batch_finish_hook(&brw->batch);
-   brw->batch.begin_count--;
-
-   /* Mark the end of the buffer. */
-   intel_batchbuffer_emit_dword(brw, MI_BATCH_BUFFER_END);
-   if (USED_BATCH(&brw->batch) & 1) {
-      /* Round batchbuffer usage to 2 DWORDs. */
-      intel_batchbuffer_emit_dword(brw, MI_NOOP);
-   }
-
-   intel_upload_finish(brw);
-
-   /* Check that we didn't just wrap our batchbuffer at a bad time. */
-   assert(!brw->batch.no_batch_wrap);
-
-   ret = do_flush_locked(brw);
-
-   if (unlikely(INTEL_DEBUG & DEBUG_SYNC)) {
-      fprintf(stderr, "waiting for idle\n");
-      drm_intel_bo_wait_rendering(brw->batch.bo);
-   }
-
-   /* Start a new batch buffer. */
-   brw_new_batch(brw);
-
-   return ret;
-}
-
-
-/*  This is the only way buffers get added to the validate list.
- */
-uint32_t
-intel_batchbuffer_reloc(struct brw_context *brw,
-                        brw_bo *buffer, uint32_t offset,
-                        uint32_t read_domains, uint32_t write_domain,
-                        uint32_t delta)
-{
-   int ret;
-
-   ret = drm_intel_bo_emit_reloc(brw->batch.bo, offset,
-				 buffer, delta,
-				 read_domains, write_domain);
-   assert(ret == 0);
-   (void)ret;
-
-   /* Using the old buffer offset, write in what the right data would be, in
-    * case the buffer doesn't move and we can short-circuit the relocation
-    * processing in the kernel
-    */
-   return buffer->offset64 + delta;
-}
-
-uint64_t
-intel_batchbuffer_reloc64(struct brw_context *brw,
-                          brw_bo *buffer, uint32_t offset,
-                          uint32_t read_domains, uint32_t write_domain,
-                          uint32_t delta)
-{
-   int ret = drm_intel_bo_emit_reloc(brw->batch.bo, offset,
-                                     buffer, delta,
-                                     read_domains, write_domain);
-   assert(ret == 0);
-   (void) ret;
-
-   /* Using the old buffer offset, write in what the right data would be, in
-    * case the buffer doesn't move and we can short-circuit the relocation
-    * processing in the kernel
-    */
-   return buffer->offset64 + delta;
-}
-
-
-void
-intel_batchbuffer_data(struct brw_context *brw,
-                       const void *data, GLuint bytes, enum brw_gpu_ring ring)
-{
-   assert((bytes & 3) == 0);
-   intel_batchbuffer_require_space(&brw->batch, bytes, ring);
-   memcpy(brw->batch.map_next, data, bytes);
-   brw->batch.map_next += bytes >> 2;
-}
-
-int brw_batch_begin(struct brw_batch *batch,
-                    const int sz_bytes,
-                    enum brw_gpu_ring ring)
-{
-   if (batch->begin_count++)
-      return 0;
-
-   intel_batchbuffer_require_space(batch, sz_bytes, ring);
-   intel_batchbuffer_save_state(batch);
-
-   batch->repeat = false;
-   batch->no_batch_wrap = true;
-
-   return setjmp(batch->jmpbuf);
-}
-
-int brw_batch_end(struct brw_batch *batch)
-{
-   assert(batch->begin_count);
-   if (--batch->begin_count)
-      return 0;
-
-   batch->no_batch_wrap = false;
-
-   if (dri_bufmgr_check_aperture_space(&batch->bo, 1)) {
-      if (!batch->repeat) {
-         enum brw_gpu_ring ring = batch->ring;
-
-         intel_batchbuffer_reset_to_saved(batch);
-         brw_batch_flush(batch, NULL);
-
-         batch->begin_count++;
-         batch->no_batch_wrap = true;
-
-         batch->ring = ring;
-         if (ring == RENDER_RING)
-            brw_batch_start_hook(batch);
-
-         batch->repeat = true;
-         longjmp(batch->jmpbuf, 1);
-      }
-
-      return brw_batch_flush(batch, NULL);
-   }
-
-   return 0;
-}
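
With intel_batchbuffer.c gone, batch construction lives behind the
brw_batch_begin()/brw_batch_end() pair (reimplemented in brw_batch.c). The
deleted code above shows the mechanism: _begin saves the batch state and
setjmps, _end checks the aperture and, on overflow, flushes and longjmps back
so the caller's emission is replayed against a fresh batch. A hedged usage
sketch, assuming the same macros and signatures shown in this patch:

   static void
   emit_two_noops(struct brw_context *brw)
   {
      /* First pass returns 0; if brw_batch_end() hits the aperture limit it
       * flushes and longjmps, so brw_batch_begin() "returns" a second time
       * and the emission below is simply re-run into the fresh batch.
       */
      brw_batch_begin(&brw->batch, 2 * 4, RENDER_RING);

      BEGIN_BATCH(2);
      OUT_BATCH(MI_NOOP);
      OUT_BATCH(MI_NOOP);
      ADVANCE_BATCH();

      brw_batch_end(&brw->batch);
   }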
diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.h b/src/mesa/drivers/dri/i965/intel_batchbuffer.h
deleted file mode 100644
index 3c505a2..0000000
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.h
+++ /dev/null
@@ -1,142 +0,0 @@
-#ifndef INTEL_BATCHBUFFER_H
-#define INTEL_BATCHBUFFER_H
-
-#include "main/mtypes.h"
-
-#include "brw_context.h"
-#include "intel_bufmgr.h"
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-struct brw_batch;
-struct brw_context;
-enum brw_gpu_ring;
-
-int intel_batchbuffer_init(struct brw_context *brw);
-void intel_batchbuffer_free(struct brw_context *brw);
-
-
-/* Unlike bmBufferData, this currently requires the buffer be mapped.
- * Consider it a convenience function wrapping multple
- * intel_buffer_dword() calls.
- */
-void intel_batchbuffer_data(struct brw_context *brw,
-                            const void *data, GLuint bytes,
-                            enum brw_gpu_ring ring);
-
-uint32_t intel_batchbuffer_reloc(struct brw_context *brw,
-                                 brw_bo *buffer,
-                                 uint32_t offset,
-                                 uint32_t read_domains,
-                                 uint32_t write_domain,
-                                 uint32_t delta);
-uint64_t intel_batchbuffer_reloc64(struct brw_context *brw,
-                                   brw_bo *buffer,
-                                   uint32_t offset,
-                                   uint32_t read_domains,
-                                   uint32_t write_domain,
-                                   uint32_t delta);
-
-static inline uint32_t float_as_int(float f)
-{
-   union {
-      float f;
-      uint32_t d;
-   } fi;
-
-   fi.f = f;
-   return fi.d;
-}
-
-/* Inline functions - might actually be better off with these
- * non-inlined.  Certainly better off switching all command packets to
- * be passed as structs rather than dwords, but that's a little bit of
- * work...
- */
-static inline void
-intel_batchbuffer_emit_dword(struct brw_context *brw, GLuint dword)
-{
-#ifdef DEBUG
-   assert(intel_batchbuffer_space(&brw->batch) >= 4);
-#endif
-   *brw->batch.map_next++ = dword;
-   assert(brw->batch.ring != UNKNOWN_RING);
-}
-
-static inline void
-intel_batchbuffer_emit_float(struct brw_context *brw, float f)
-{
-   intel_batchbuffer_emit_dword(brw, float_as_int(f));
-}
-
-static inline void
-intel_batchbuffer_begin(struct brw_context *brw, int n, enum brw_gpu_ring ring)
-{
-   intel_batchbuffer_require_space(&brw->batch, n * 4, ring);
-
-#ifdef DEBUG
-   brw->batch.emit = USED_BATCH(&brw->batch);
-   brw->batch.total = n;
-#endif
-}
-
-static inline void
-intel_batchbuffer_advance(struct brw_context *brw)
-{
-#ifdef DEBUG
-   brw_batch *batch = &brw->batch;
-   unsigned int _n = USED_BATCH(batch) - batch->emit;
-   assert(batch->total != 0);
-   if (_n != batch->total) {
-      fprintf(stderr, "ADVANCE_BATCH: %d of %d dwords emitted\n",
-	      _n, batch->total);
-      abort();
-   }
-   batch->total = 0;
-#endif
-}
-
-#define BEGIN_BATCH(n) do {                            \
-   intel_batchbuffer_begin(brw, (n), RENDER_RING);     \
-   uint32_t *__map = brw->batch.map_next;              \
-   brw->batch.map_next += (n)
-
-#define BEGIN_BATCH_BLT(n) do {                        \
-   intel_batchbuffer_begin(brw, (n), BLT_RING);        \
-   uint32_t *__map = brw->batch.map_next;              \
-   brw->batch.map_next += (n)
-
-#define OUT_BATCH(d) *__map++ = (d)
-#define OUT_BATCH_F(f) OUT_BATCH(float_as_int((f)))
-
-#define OUT_RELOC(buf, read_domains, write_domain, delta) do { \
-   uint32_t __offset = (__map - brw->batch.map) * 4;           \
-   OUT_BATCH(intel_batchbuffer_reloc(brw, (buf), __offset,     \
-                                     (read_domains),           \
-                                     (write_domain),           \
-                                     (delta)));                \
-} while (0)
-
-/* Handle 48-bit address relocations for Gen8+ */
-#define OUT_RELOC64(buf, read_domains, write_domain, delta) do {      \
-   uint32_t __offset = (__map - brw->batch.map) * 4;                  \
-   uint64_t reloc64 = intel_batchbuffer_reloc64(brw, (buf), __offset, \
-                                                (read_domains),       \
-                                                (write_domain),       \
-                                                (delta));             \
-   OUT_BATCH(reloc64);                                                \
-   OUT_BATCH(reloc64 >> 32);                                          \
-} while (0)
-
-#define ADVANCE_BATCH()                  \
-   assert(__map == brw->batch.map_next); \
-   intel_batchbuffer_advance(brw);       \
-} while (0)
-
-#ifdef __cplusplus
-}
-#endif
-
-#endif
diff --git a/src/mesa/drivers/dri/i965/intel_blit.c b/src/mesa/drivers/dri/i965/intel_blit.c
index 636e48a..a35c8df 100644
--- a/src/mesa/drivers/dri/i965/intel_blit.c
+++ b/src/mesa/drivers/dri/i965/intel_blit.c
@@ -756,7 +756,7 @@ intelEmitImmediateColorExpandBlit(struct brw_context *brw,
    OUT_BATCH(SET_FIELD(y + h, BLT_Y) | SET_FIELD(x + w, BLT_X));
    ADVANCE_BATCH();
 
-   intel_batchbuffer_data(brw, src_bits, dwords * 4, BLT_RING);
+   brw_batch_data(&brw->batch, src_bits, dwords * 4);
 
    brw_emit_mi_flush(brw);
 
diff --git a/src/mesa/drivers/dri/i965/intel_buffer_objects.c b/src/mesa/drivers/dri/i965/intel_buffer_objects.c
index 0cd997e..8ef01f0 100644
--- a/src/mesa/drivers/dri/i965/intel_buffer_objects.c
+++ b/src/mesa/drivers/dri/i965/intel_buffer_objects.c
@@ -40,46 +40,6 @@
 #include "intel_blit.h"
 #include "intel_buffer_objects.h"
 
-/**
- * Map a buffer object; issue performance warnings if mapping causes stalls.
- *
- * This matches the drm_intel_bo_map API, but takes an additional human-readable
- * name for the buffer object to use in the performance debug message.
- */
-int
-brw_bo_map(struct brw_context *brw,
-           brw_bo *bo, int write_enable,
-           const char *bo_name)
-{
-   if (likely(!brw->perf_debug) || !drm_intel_bo_busy(bo))
-      return drm_intel_bo_map(bo, write_enable);
-
-   double start_time = get_time();
-
-   int ret = drm_intel_bo_map(bo, write_enable);
-
-   perf_debug("CPU mapping a busy %s BO stalled and took %.03f ms.\n",
-              bo_name, (get_time() - start_time) * 1000);
-
-   return ret;
-}
-
-int
-brw_bo_map_gtt(struct brw_context *brw, brw_bo *bo, const char *bo_name)
-{
-   if (likely(!brw->perf_debug) || !drm_intel_bo_busy(bo))
-      return drm_intel_gem_bo_map_gtt(bo);
-
-   double start_time = get_time();
-
-   int ret = drm_intel_gem_bo_map_gtt(bo);
-
-   perf_debug("GTT mapping a busy %s BO stalled and took %.03f ms.\n",
-              bo_name, (get_time() - start_time) * 1000);
-
-   return ret;
-}
-
 static void
 mark_buffer_gpu_usage(struct intel_buffer_object *intel_obj,
                                uint32_t offset, uint32_t size)
@@ -91,6 +51,9 @@ mark_buffer_gpu_usage(struct intel_buffer_object *intel_obj,
 static void
 mark_buffer_inactive(struct intel_buffer_object *intel_obj)
 {
+   if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE, NULL))
+      return;
+
    intel_obj->gpu_active_start = ~0;
    intel_obj->gpu_active_end = 0;
 }
@@ -212,7 +175,7 @@ brw_buffer_data(struct gl_context *ctx,
          return false;
 
       if (data != NULL)
-	 drm_intel_bo_subdata(intel_obj->buffer, 0, size, data);
+         brw_bo_write(intel_obj->buffer, 0, data, size, 0, NULL);
    }
 
    return true;
@@ -237,7 +200,6 @@ brw_buffer_subdata(struct gl_context *ctx,
 {
    struct brw_context *brw = brw_context(ctx);
    struct intel_buffer_object *intel_obj = intel_buffer_object(obj);
-   bool busy;
 
    if (size == 0)
       return;
@@ -255,28 +217,17 @@ brw_buffer_subdata(struct gl_context *ctx,
     */
    if (offset + size <= intel_obj->gpu_active_start ||
        intel_obj->gpu_active_end <= offset) {
-      if (brw->has_llc) {
-         drm_intel_gem_bo_map_unsynchronized(intel_obj->buffer);
-         memcpy(intel_obj->buffer->virtual + offset, data, size);
-         drm_intel_bo_unmap(intel_obj->buffer);
-
-         if (intel_obj->gpu_active_end > intel_obj->gpu_active_start)
-            intel_obj->prefer_stall_to_blit = true;
-         return;
-      } else {
-         perf_debug("BufferSubData could be unsynchronized, but !LLC doesn't support it yet\n");
-      }
+      brw_bo_write(intel_obj->buffer, offset, data, size, MAP_ASYNC, NULL);
+      if (intel_obj->gpu_active_end > intel_obj->gpu_active_start)
+         intel_obj->prefer_stall_to_blit = intel_obj->buffer->cache_coherent;
+      return;
    }
 
-   busy =
-      drm_intel_bo_busy(intel_obj->buffer) ||
-      drm_intel_bo_references(brw->batch.bo, intel_obj->buffer);
-
-   if (busy) {
+   if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE, NULL)) {
       if (size == intel_obj->Base.Size) {
 	 /* Replace the current busy bo so the subdata doesn't stall. */
          brw_bo_put(intel_obj->buffer);
-	 alloc_buffer_object(brw, intel_obj);
+         alloc_buffer_object(brw, intel_obj);
       } else if (!intel_obj->prefer_stall_to_blit) {
          perf_debug("Using a blit copy to avoid stalling on "
                     "glBufferSubData(%ld, %ld) (%ldkb) to a busy "
@@ -287,12 +238,13 @@ brw_buffer_subdata(struct gl_context *ctx,
          brw_bo *temp_bo =
             brw_bo_create(&brw->batch, "subdata temp", size, 64, 0);
 
-	 drm_intel_bo_subdata(temp_bo, 0, size, data);
+         brw_bo_write(temp_bo, 0, data, size, 0,
+                      PERF_DEBUG(brw, "BufferSubData"));
 
-	 intel_emit_linear_blit(brw,
-				intel_obj->buffer, offset,
-				temp_bo, 0,
-				size);
+         intel_emit_linear_blit(brw,
+                                intel_obj->buffer, offset,
+                                temp_bo, 0,
+                                size);
 
          brw_bo_put(temp_bo);
          return;
@@ -303,11 +255,11 @@ brw_buffer_subdata(struct gl_context *ctx,
                     (long)offset, (long)offset + size, (long)(size/1024),
                     intel_obj->gpu_active_start,
                     intel_obj->gpu_active_end);
-         brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "BufferSubData"));
       }
    }
 
-   drm_intel_bo_subdata(intel_obj->buffer, offset, size, data);
+   brw_bo_write(intel_obj->buffer, offset, data, size, 0,
+                PERF_DEBUG(brw, "BufferSubData"));
    mark_buffer_inactive(intel_obj);
 }
 
@@ -326,14 +278,10 @@ brw_get_buffer_subdata(struct gl_context *ctx,
                        struct gl_buffer_object *obj)
 {
    struct intel_buffer_object *intel_obj = intel_buffer_object(obj);
-   struct brw_context *brw = brw_context(ctx);
 
    assert(intel_obj);
-   if (drm_intel_bo_references(brw->batch.bo, intel_obj->buffer)) {
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "BufferSubData"));
-   }
-   drm_intel_bo_get_subdata(intel_obj->buffer, offset, size, data);
-
+   brw_bo_read(intel_obj->buffer, offset, data, size, 0,
+               PERF_DEBUG(brw_context(ctx), "GetBufferSubData"));
    mark_buffer_inactive(intel_obj);
 }
 
@@ -388,19 +336,11 @@ brw_map_buffer_range(struct gl_context *ctx,
     * achieve the required synchronization.
     */
    if (!(access & GL_MAP_UNSYNCHRONIZED_BIT)) {
-      if (drm_intel_bo_references(brw->batch.bo, intel_obj->buffer)) {
-	 if (access & GL_MAP_INVALIDATE_BUFFER_BIT) {
+      if ((access & GL_MAP_INVALIDATE_BUFFER_BIT)) {
+         if (brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE, NULL)) {
             brw_bo_put(intel_obj->buffer);
-	    alloc_buffer_object(brw, intel_obj);
-	 } else {
-            perf_debug("Stalling on the GPU for mapping a busy buffer "
-                       "object\n");
-            brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "miptree"));
-	 }
-      } else if (drm_intel_bo_busy(intel_obj->buffer) &&
-		 (access & GL_MAP_INVALIDATE_BUFFER_BIT)) {
-         brw_bo_put(intel_obj->buffer);
-	 alloc_buffer_object(brw, intel_obj);
+            alloc_buffer_object(brw, intel_obj);
+         }
       }
    }
 
@@ -415,46 +355,37 @@ brw_map_buffer_range(struct gl_context *ctx,
     */
    if (!(access & (GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_PERSISTENT_BIT)) &&
        (access & GL_MAP_INVALIDATE_RANGE_BIT) &&
-       drm_intel_bo_busy(intel_obj->buffer)) {
+       brw_bo_busy(intel_obj->buffer, BUSY_WRITE | BUSY_RETIRE, NULL)) {
       /* Ensure that the base alignment of the allocation meets the alignment
        * guarantees the driver has advertised to the application.
        */
       const unsigned alignment = ctx->Const.MinMapBufferAlignment;
 
       intel_obj->map_extra[index] = (uintptr_t) offset % alignment;
-      intel_obj->range_map_bo[index] = brw_bo_create(&brw->batch,
-                                                     "BO blit temp",
-                                                     length +
-                                                     intel_obj->map_extra[index],
-                                                     alignment, 0);
-      if (brw->has_llc) {
-         brw_bo_map(brw, intel_obj->range_map_bo[index],
-                    (access & GL_MAP_WRITE_BIT) != 0, "range-map");
-      } else {
-         drm_intel_gem_bo_map_gtt(intel_obj->range_map_bo[index]);
-      }
+      intel_obj->range_map_bo[index] =
+         brw_bo_create(&brw->batch, "BO blit temp",
+                       length + intel_obj->map_extra[index], alignment, 0);
+
       obj->Mappings[index].Pointer =
-         intel_obj->range_map_bo[index]->virtual + intel_obj->map_extra[index];
+         brw_bo_map(intel_obj->range_map_bo[index], MAP_WRITE, NULL) +
+         intel_obj->map_extra[index];
+
       return obj->Mappings[index].Pointer;
    }
 
-   if (access & GL_MAP_UNSYNCHRONIZED_BIT) {
-      if (!brw->has_llc && brw->perf_debug &&
-          drm_intel_bo_busy(intel_obj->buffer)) {
-         perf_debug("MapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT stalling (it's actually synchronized on non-LLC platforms)\n");
-      }
-      drm_intel_gem_bo_map_unsynchronized(intel_obj->buffer);
-   } else if (!brw->has_llc && (!(access & GL_MAP_READ_BIT) ||
-                              (access & GL_MAP_PERSISTENT_BIT))) {
-      drm_intel_gem_bo_map_gtt(intel_obj->buffer);
-      mark_buffer_inactive(intel_obj);
-   } else {
-      brw_bo_map(brw, intel_obj->buffer, (access & GL_MAP_WRITE_BIT) != 0,
-                 "MapBufferRange");
-      mark_buffer_inactive(intel_obj);
-   }
+   STATIC_ASSERT(GL_MAP_UNSYNCHRONIZED_BIT == MAP_ASYNC);
+   STATIC_ASSERT(GL_MAP_WRITE_BIT == MAP_WRITE);
+   STATIC_ASSERT(GL_MAP_READ_BIT == MAP_READ);
+   STATIC_ASSERT(GL_MAP_PERSISTENT_BIT == MAP_PERSISTENT);
+   STATIC_ASSERT(GL_MAP_COHERENT_BIT == MAP_COHERENT);
+   assert((access & MAP_INTERNAL_MASK) == 0);
+
+   obj->Mappings[index].Pointer =
+      brw_bo_map(intel_obj->buffer, access,
+                 PERF_DEBUG(brw, "MapBufferRange")) + offset;
+
+   mark_buffer_inactive(intel_obj);
 
-   obj->Mappings[index].Pointer = intel_obj->buffer->virtual + offset;
    return obj->Mappings[index].Pointer;
 }
 
@@ -542,8 +473,6 @@ brw_unmap_buffer(struct gl_context *ctx,
    assert(intel_obj);
    assert(obj->Mappings[index].Pointer);
    if (intel_obj->range_map_bo[index] != NULL) {
-      drm_intel_bo_unmap(intel_obj->range_map_bo[index]);
-
       if (!(obj->Mappings[index].AccessFlags & GL_MAP_FLUSH_EXPLICIT_BIT)) {
          intel_emit_linear_blit(brw,
                                 intel_obj->buffer, obj->Mappings[index].Offset,
@@ -563,8 +492,6 @@ brw_unmap_buffer(struct gl_context *ctx,
 
       brw_bo_put(intel_obj->range_map_bo[index]);
       intel_obj->range_map_bo[index] = NULL;
-   } else if (intel_obj->buffer != NULL) {
-      drm_intel_bo_unmap(intel_obj->buffer);
    }
    obj->Mappings[index].Pointer = NULL;
    obj->Mappings[index].Offset = 0;
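
Because the STATIC_ASSERTs above pin the MAP_* bits to the corresponding
GL_MAP_* bits, the GL access mask is passed to brw_bo_map() untranslated. A
small sketch of that mapping path (map_user_range() is a hypothetical helper;
the signatures are the ones used in this patch):

   static void *
   map_user_range(struct brw_context *brw, brw_bo *bo,
                  GLintptr offset, GLbitfield gl_access)
   {
      /* GL_MAP_READ/WRITE/PERSISTENT/COHERENT/UNSYNCHRONIZED_BIT alias
       * MAP_READ/WRITE/PERSISTENT/COHERENT/ASYNC; internal-only flags
       * (MAP_INTERNAL_MASK) must never come in from the API.
       */
      assert((gl_access & MAP_INTERNAL_MASK) == 0);
      return brw_bo_map(bo, gl_access,
                        PERF_DEBUG(brw, "MapBufferRange")) + offset;
   }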
diff --git a/src/mesa/drivers/dri/i965/intel_buffer_objects.h b/src/mesa/drivers/dri/i965/intel_buffer_objects.h
index a31ac0d..0cb5a48 100644
--- a/src/mesa/drivers/dri/i965/intel_buffer_objects.h
+++ b/src/mesa/drivers/dri/i965/intel_buffer_objects.h
@@ -103,8 +103,6 @@ void *intel_upload_space(struct brw_context *brw,
                          brw_bo **out_bo,
                          uint32_t *out_offset);
 
-void intel_upload_finish(struct brw_context *brw);
-
 /* Hook the bufferobject implementation into mesa:
  */
 void intelInitBufferObjectFuncs(struct dd_function_table *functions);
diff --git a/src/mesa/drivers/dri/i965/intel_debug.c b/src/mesa/drivers/dri/i965/intel_debug.c
index 8c8f6a6..5c2dfa4 100644
--- a/src/mesa/drivers/dri/i965/intel_debug.c
+++ b/src/mesa/drivers/dri/i965/intel_debug.c
@@ -94,9 +94,6 @@ brw_process_intel_debug_variable(struct intel_screen *screen)
    uint64_t intel_debug = driParseDebugString(getenv("INTEL_DEBUG"), debug_control);
    (void) p_atomic_cmpxchg(&INTEL_DEBUG, 0, intel_debug);
 
-   if (INTEL_DEBUG & DEBUG_BUFMGR)
-      dri_bufmgr_set_debug(screen->bufmgr, true);
-
    if ((INTEL_DEBUG & DEBUG_SHADER_TIME) && screen->devinfo->gen < 7) {
       fprintf(stderr,
               "shader_time debugging requires gen7 (Ivybridge) or better.\n");
diff --git a/src/mesa/drivers/dri/i965/intel_fbo.c b/src/mesa/drivers/dri/i965/intel_fbo.c
index 5c86655..8344791 100644
--- a/src/mesa/drivers/dri/i965/intel_fbo.c
+++ b/src/mesa/drivers/dri/i965/intel_fbo.c
@@ -376,13 +376,15 @@ intel_image_target_renderbuffer_storage(struct gl_context *ctx,
    irb = intel_renderbuffer(rb);
    intel_miptree_release(&irb->mt);
 
+   struct brw_bo *bo = brw_bo_import(&brw->batch, image->bo, true);
+
    /* Disable creation of the miptree's aux buffers because the driver exposes
     * no EGL API to manage them. That is, there is no API for resolving the aux
     * buffer's content to the main buffer nor for invalidating the aux buffer's
     * content.
     */
    irb->mt = intel_miptree_create_for_bo(brw,
-                                         image->bo,
+                                         bo,
                                          image->format,
                                          image->offset,
                                          image->width,
@@ -390,6 +392,7 @@ intel_image_target_renderbuffer_storage(struct gl_context *ctx,
                                          1,
                                          image->pitch,
                                          MIPTREE_LAYOUT_DISABLE_AUX);
+   brw_bo_put(bo);
    if (!irb->mt)
       return;
 
diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
index 1bd9fa2..0410d06 100644
--- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
+++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.c
@@ -564,8 +564,8 @@ intel_get_yf_ys_bo_size(struct intel_mipmap_tree *mt,
 {
    const uint32_t bpp = mt->cpp * 8;
    const uint32_t aspect_ratio = (bpp == 16 || bpp == 64) ? 2 : 1;
-   uint32_t tile_width, tile_height;
-   unsigned long stride, size, aligned_y;
+   uint32_t tile_width, tile_height, stride, aligned_y;
+   uint64_t size;
 
    assert(mt->tr_mode != INTEL_MIPTREE_TRMODE_NONE);
 
@@ -736,15 +736,12 @@ intel_miptree_create_for_bo(struct brw_context *brw,
                             uint32_t layout_flags)
 {
    struct intel_mipmap_tree *mt;
-   uint32_t tiling, swizzle;
    GLenum target;
 
-   drm_intel_bo_get_tiling(bo, &tiling, &swizzle);
-
    /* Nothing will be able to use this miptree with the BO if the offset isn't
     * aligned.
     */
-   if (tiling != I915_TILING_NONE)
+   if (bo->tiling != I915_TILING_NONE)
       assert(offset % 4096 == 0);
 
    /* miptrees can't handle negative pitch.  If you need flipping of images,
@@ -771,7 +768,7 @@ intel_miptree_create_for_bo(struct brw_context *brw,
    mt->bo = brw_bo_get(bo);
    mt->pitch = pitch;
    mt->offset = offset;
-   mt->tiling = tiling;
+   mt->tiling = bo->tiling;
 
    return mt;
 }
@@ -1353,25 +1350,13 @@ intel_miptree_map_raw(struct brw_context *brw,
     * resolve any pending fast color clears before we map.
     */
    intel_miptree_resolve_color(brw, mt);
-
-   brw_bo *bo = mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "miptree"));
-
-   if (mt->tiling != I915_TILING_NONE)
-      brw_bo_map_gtt(brw, bo, "miptree");
-   else
-      brw_bo_map(brw, bo, mode & GL_MAP_WRITE_BIT, "miptree");
-
-   return bo->virtual;
+   return brw_bo_map(mt->bo, mode, PERF_DEBUG(brw, "TexImage"));
 }
 
 static void
 intel_miptree_unmap_raw(struct brw_context *brw,
                         struct intel_mipmap_tree *mt)
 {
-   drm_intel_bo_unmap(mt->bo);
 }
 
 static bool
@@ -2237,7 +2222,7 @@ intel_miptree_map_movntdqa(struct brw_context *brw,
    image_x += map->x;
    image_y += map->y;
 
-   void *src = intel_miptree_map_raw(brw, mt, map->mode);
+   void *src = intel_miptree_map_raw(brw, mt, map->mode | GL_MAP_COHERENT_BIT);
    if (!src)
       return;
    src += image_y * mt->pitch;
@@ -2623,11 +2608,10 @@ use_intel_mipree_map_blit(struct brw_context *brw,
                           unsigned int level,
                           unsigned int slice)
 {
-   if (brw->has_llc &&
-      /* It's probably not worth swapping to the blit ring because of
-       * all the overhead involved.
-       */
-       !(mode & GL_MAP_WRITE_BIT) &&
+   /* It's probably not worth swapping to the blit ring because of
+    * all the overhead involved.
+    */
+   if (!(mode & GL_MAP_WRITE_BIT) &&
        !mt->compressed &&
        (mt->tiling == I915_TILING_X ||
         /* Prior to Sandybridge, the blitter can't handle Y tiling */
diff --git a/src/mesa/drivers/dri/i965/intel_mipmap_tree.h b/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
index 1b7cf64..fd03956 100644
--- a/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
+++ b/src/mesa/drivers/dri/i965/intel_mipmap_tree.h
@@ -49,7 +49,6 @@
 #include <assert.h>
 
 #include "main/mtypes.h"
-#include "intel_bufmgr.h"
 #include "intel_resolve_map.h"
 #include <GL/internal/dri_interface.h>
 
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_copy.c b/src/mesa/drivers/dri/i965/intel_pixel_copy.c
index f1013ff..4313588 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_copy.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_copy.c
@@ -148,8 +148,6 @@ do_blit_copypixels(struct gl_context * ctx,
       return false;
    }
 
-   brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "CopyPixels"));
-
    /* Clip to destination buffer. */
    orig_dstx = dstx;
    orig_dsty = dsty;
diff --git a/src/mesa/drivers/dri/i965/intel_pixel_read.c b/src/mesa/drivers/dri/i965/intel_pixel_read.c
index c0cf5d6..7b86f9c 100644
--- a/src/mesa/drivers/dri/i965/intel_pixel_read.c
+++ b/src/mesa/drivers/dri/i965/intel_pixel_read.c
@@ -83,11 +83,6 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
    struct intel_renderbuffer *irb = intel_renderbuffer(rb);
    int dst_pitch;
 
-   /* The miptree's buffer. */
-   brw_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -95,8 +90,7 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
     * a 2D BGRA, RGBA, L8 or A8 texture. It could be generalized to support
     * more types.
     */
-   if (!brw->has_llc ||
-       !(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
+   if (!(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
        pixels == NULL ||
        _mesa_is_bufferobj(pack->BufferObj) ||
        pack->Alignment > 4 ||
@@ -149,22 +143,18 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
       return false;
    }
 
+   /* tiled_to_linear() assumes that if the object is swizzled, it
+    * is using I915_BIT6_SWIZZLE_9_10 for X and I915_BIT6_SWIZZLE_9 for Y.
+    * This is only true on gen5 and above.
+    */
+   if (brw->gen < 5 && brw->has_swizzling)
+      return false;
+
    /* Since we are going to read raw data to the miptree, we need to resolve
     * any pending fast color clears before we start.
     */
    intel_miptree_resolve_color(brw, irb->mt);
 
-   bo = irb->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "ReadPixels"));
-
-   error = brw_bo_map(brw, bo, false /* write enable */, "miptree");
-   if (error) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
-
    dst_pitch = _mesa_image_row_stride(pack, width, format, type);
 
    /* For a window-system renderbuffer, the buffer is actually flipped
@@ -193,19 +183,17 @@ intel_readpixels_tiled_memcpy(struct gl_context * ctx,
        pack->Alignment, pack->RowLength, pack->SkipPixels,
        pack->SkipRows);
 
-   tiled_to_linear(
+   return tiled_to_linear(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
       pixels - (ptrdiff_t) yoffset * dst_pitch - (ptrdiff_t) xoffset * cpp,
-      bo->virtual,
+      brw_bo_map(irb->mt->bo, MAP_READ | MAP_DETILED,
+                 PERF_DEBUG(brw, "ReadPixels")),
       dst_pitch, irb->mt->pitch,
       brw->has_swizzling,
       irb->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 void
diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c
index e5fd887..2c9d362 100644
--- a/src/mesa/drivers/dri/i965/intel_screen.c
+++ b/src/mesa/drivers/dri/i965/intel_screen.c
@@ -93,7 +93,6 @@ DRI_CONF_END
 };
 
 #include "intel_buffers.h"
-#include "intel_bufmgr.h"
 #include "intel_fbo.h"
 #include "intel_mipmap_tree.h"
 #include "intel_screen.h"
@@ -328,7 +327,7 @@ intel_setup_image_from_mipmap_tree(struct brw_context *brw, __DRIimage *image,
                                                   &image->tile_y);
 
    drm_intel_bo_unreference(image->bo);
-   image->bo = mt->bo;
+   image->bo = mt->bo->base;
    drm_intel_bo_reference(image->bo);
 }
 
@@ -390,7 +389,7 @@ intel_create_image_from_renderbuffer(__DRIcontext *context,
    image->offset = 0;
    image->data = loaderPrivate;
    drm_intel_bo_unreference(image->bo);
-   image->bo = irb->mt->bo;
+   image->bo = irb->mt->bo->base;
    drm_intel_bo_reference(image->bo);
    image->width = rb->Width;
    image->height = rb->Height;
@@ -1050,7 +1049,7 @@ intel_init_bufmgr(struct intel_screen *intelScreen)
 
    intelScreen->no_hw = getenv("INTEL_NO_HW") != NULL;
 
-   intelScreen->bufmgr = intel_bufmgr_gem_init(spriv->fd, BATCH_SZ);
+   intelScreen->bufmgr = intel_bufmgr_gem_init(spriv->fd, 4096);
    if (intelScreen->bufmgr == NULL) {
       fprintf(stderr, "[%s:%u] Error initializing buffer manager.\n",
 	      __func__, __LINE__);
diff --git a/src/mesa/drivers/dri/i965/intel_screen.h b/src/mesa/drivers/dri/i965/intel_screen.h
index d6e80a0..3356ebf 100644
--- a/src/mesa/drivers/dri/i965/intel_screen.h
+++ b/src/mesa/drivers/dri/i965/intel_screen.h
@@ -34,11 +34,12 @@
 #include <GL/internal/dri_interface.h>
 
 #include "dri_util.h"
-#include "intel_bufmgr.h"
 #include "brw_device_info.h"
 #include "i915_drm.h"
 #include "xmlconfig.h"
 
+#include <intel_bufmgr.h>
+
 struct intel_screen
 {
    int deviceID;
@@ -97,6 +98,12 @@ struct intel_screen
    int cmd_parser_version;
  };
 
+static inline int intel_screen_to_fd(struct intel_screen *scr)
+{
+   __DRIscreen *psp = scr->driScrnPriv;
+   return psp->fd;
+}
+
 extern void intelDestroyContext(__DRIcontext * driContextPriv);
 
 extern GLboolean intelUnbindContext(__DRIcontext * driContextPriv);
diff --git a/src/mesa/drivers/dri/i965/intel_syncobj.c b/src/mesa/drivers/dri/i965/intel_syncobj.c
index 00b9e73..d55cf4b 100644
--- a/src/mesa/drivers/dri/i965/intel_syncobj.c
+++ b/src/mesa/drivers/dri/i965/intel_syncobj.c
@@ -67,7 +67,7 @@ brw_fence_insert(struct brw_context *brw, struct brw_fence *fence)
    assert(!fence->batch_bo);
    assert(!fence->signalled);
 
-   brw_mi_flush(brw, RENDER_RING);
+   brw_mi_flush(brw, brw->batch.ring);
    fence->batch_bo = brw_bo_get(brw->batch.bo);
    brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "SyncFence"));
 }
@@ -78,7 +78,7 @@ brw_fence_has_completed(struct brw_fence *fence)
    if (fence->signalled)
       return true;
 
-   if (fence->batch_bo && !drm_intel_bo_busy(fence->batch_bo)) {
+   if (fence->batch_bo && !brw_bo_busy(fence->batch_bo, BUSY_WRITE | BUSY_RETIRE, NULL)) {
       brw_bo_put(fence->batch_bo);
       fence->batch_bo = NULL;
       fence->signalled = true;
@@ -109,7 +109,7 @@ brw_fence_client_wait(struct brw_context *brw, struct brw_fence *fence,
    if (timeout > INT64_MAX)
       timeout = INT64_MAX;
 
-   if (drm_intel_gem_bo_wait(fence->batch_bo, timeout) != 0)
+   if (drm_intel_gem_bo_wait(fence->batch_bo->base, timeout) != 0)
       return false;
 
    fence->signalled = true;
diff --git a/src/mesa/drivers/dri/i965/intel_tex_image.c b/src/mesa/drivers/dri/i965/intel_tex_image.c
index 123b05e..553f0b0 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_image.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_image.c
@@ -92,7 +92,9 @@ intelTexImage(struct gl_context * ctx,
    struct intel_texture_image *intelImage = intel_texture_image(texImage);
    bool ok;
 
-   bool tex_busy = intelImage->mt && drm_intel_bo_busy(intelImage->mt->bo);
+   bool tex_busy =
+      intelImage->mt &&
+      brw_bo_busy(intelImage->mt->bo, BUSY_WRITE | BUSY_RETIRE, NULL);
 
    DBG("%s mesa_format %s target %s format %s type %s level %d %dx%dx%d\n",
        __func__, _mesa_get_format_name(texImage->TexFormat),
@@ -339,13 +341,15 @@ intel_image_target_texture_2d(struct gl_context *ctx, GLenum target,
     * buffer's content to the main buffer nor for invalidating the aux buffer's
     * content.
     */
-   intel_set_texture_image_bo(ctx, texImage, image->bo,
+   struct brw_bo *bo = brw_bo_import(&brw->batch, image->bo, true);
+   intel_set_texture_image_bo(ctx, texImage, bo,
                               target, image->internal_format,
                               image->format, image->offset,
                               image->width,  image->height,
                               image->pitch,
                               image->tile_x, image->tile_y,
                               MIPTREE_LAYOUT_DISABLE_AUX);
+   brw_bo_put(bo);
 }
 
 /**
@@ -366,11 +370,6 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
    struct intel_texture_image *image = intel_texture_image(texImage);
    int dst_pitch;
 
-   /* The miptree's buffer. */
-   brw_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -383,8 +382,7 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
     * with _mesa_image_row_stride. However, before removing the restrictions
     * we need tests.
     */
-   if (!brw->has_llc ||
-       !(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
+   if (!(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
        !(texImage->TexObject->Target == GL_TEXTURE_2D ||
          texImage->TexObject->Target == GL_TEXTURE_RECTANGLE) ||
        pixels == NULL ||
@@ -420,21 +418,18 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
       return false;
    }
 
+   /* tiled_to_linear() assumes that if the object is swizzled, it
+    * is using I915_BIT6_SWIZZLE_9_10 for X and I915_BIT6_SWIZZLE_9 for Y.
+    * This is only true on gen5 and above.
+    */
+   if (brw->gen < 5 && brw->has_swizzling)
+      return false;
+
    /* Since we are going to write raw data to the miptree, we need to resolve
     * any pending fast color clears before we start.
     */
    intel_miptree_resolve_color(brw, image->mt);
 
-   bo = image->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "miptree"));
-
-   error = brw_bo_map(brw, bo, false /* write enable */, "miptree");
-   if (error) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
 
    dst_pitch = _mesa_image_row_stride(packing, width, format, type);
 
@@ -452,19 +447,17 @@ intel_gettexsubimage_tiled_memcpy(struct gl_context *ctx,
    xoffset += image->mt->level[level].level_x;
    yoffset += image->mt->level[level].level_y;
 
-   tiled_to_linear(
+   return tiled_to_linear(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
       pixels - (ptrdiff_t) yoffset * dst_pitch - (ptrdiff_t) xoffset * cpp,
-      bo->virtual,
+      brw_bo_map(image->mt->bo, MAP_READ | MAP_DETILED,
+                 PERF_DEBUG(brw, "TexGetSubImage")),
       dst_pitch, image->mt->pitch,
       brw->has_swizzling,
       image->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 static void
diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
index 79341d7..f93a975 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
@@ -85,11 +85,6 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
    struct intel_texture_image *image = intel_texture_image(texImage);
    int src_pitch;
 
-   /* The miptree's buffer. */
-   brw_bo *bo;
-
-   int error = 0;
-
    uint32_t cpp;
    mem_copy_fn mem_copy = NULL;
 
@@ -102,8 +97,7 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
     * with _mesa_image_row_stride. However, before removing the restrictions
     * we need tests.
     */
-   if (!brw->has_llc ||
-       !(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
+   if (!(type == GL_UNSIGNED_BYTE || type == GL_UNSIGNED_INT_8_8_8_8_REV) ||
        !(texImage->TexObject->Target == GL_TEXTURE_2D ||
          texImage->TexObject->Target == GL_TEXTURE_RECTANGLE) ||
        pixels == NULL ||
@@ -135,22 +129,18 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
       return false;
    }
 
+   /* linear_to_tiled() assumes that if the object is swizzled, it
+    * is using I915_BIT6_SWIZZLE_9_10 for X and I915_BIT6_SWIZZLE_9 for Y.
+    * This is only true on gen5 and above.
+    */
+   if (brw->gen < 5 && brw->has_swizzling)
+      return false;
+
    /* Since we are going to write raw data to the miptree, we need to resolve
     * any pending fast color clears before we start.
     */
    intel_miptree_resolve_color(brw, image->mt);
 
-   bo = image->mt->bo;
-
-   if (drm_intel_bo_references(brw->batch.bo, bo))
-      brw_batch_flush(&brw->batch, PERF_DEBUG(brw, "miptree"));
-
-   error = brw_bo_map(brw, bo, true /* write enable */, "miptree");
-   if (error || bo->virtual == NULL) {
-      DBG("%s: failed to map bo\n", __func__);
-      return false;
-   }
-
    src_pitch = _mesa_image_row_stride(packing, width, format, type);
 
    /* We postponed printing this message until having committed to executing
@@ -171,19 +161,17 @@ intel_texsubimage_tiled_memcpy(struct gl_context * ctx,
    xoffset += image->mt->level[level].level_x;
    yoffset += image->mt->level[level].level_y;
 
-   linear_to_tiled(
+   return linear_to_tiled(
       xoffset * cpp, (xoffset + width) * cpp,
       yoffset, yoffset + height,
-      bo->virtual,
+      brw_bo_map(image->mt->bo, MAP_WRITE | MAP_DETILED,
+                 PERF_DEBUG(brw, "TexSubImage")),
       pixels - (ptrdiff_t) yoffset * src_pitch - (ptrdiff_t) xoffset * cpp,
       image->mt->pitch, src_pitch,
       brw->has_swizzling,
       image->mt->tiling,
       mem_copy
    );
-
-   drm_intel_bo_unmap(bo);
-   return true;
 }
 
 static void
@@ -199,7 +187,9 @@ intelTexSubImage(struct gl_context * ctx,
    struct intel_texture_image *intelImage = intel_texture_image(texImage);
    bool ok;
 
-   bool tex_busy = intelImage->mt && drm_intel_bo_busy(intelImage->mt->bo);
+   bool tex_busy =
+      intelImage->mt &&
+      brw_bo_busy(intelImage->mt->bo, BUSY_WRITE | BUSY_RETIRE, NULL);
 
    DBG("%s mesa_format %s target %s format %s type %s level %d %dx%dx%d\n",
        __func__, _mesa_get_format_name(texImage->TexFormat),
diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
index dcf0462..500ac1f 100644
--- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
+++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
@@ -552,7 +552,7 @@ ytiled_to_linear_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
  * 'dst' is the start of the texture and 'src' is the corresponding
  * address to copy from, though copying begins at (xt1, yt1).
  */
-void
+bool
 linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -568,6 +568,9 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
    uint32_t tw, th, span;
    uint32_t swizzle_bit = has_swizzling ? 1<<6 : 0;
 
+   if (unlikely(!dst))
+      return false;
+
    if (tiling == I915_TILING_X) {
       tw = xtile_width;
       th = xtile_height;
@@ -630,6 +633,8 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
                    mem_copy);
       }
    }
+
+   return true;
 }
 
 /**
@@ -643,7 +648,7 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
  * 'dst' is the start of the texture and 'src' is the corresponding
  * address to copy from, though copying begins at (xt1, yt1).
  */
-void
+bool
 tiled_to_linear(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -659,6 +664,9 @@ tiled_to_linear(uint32_t xt1, uint32_t xt2,
    uint32_t tw, th, span;
    uint32_t swizzle_bit = has_swizzling ? 1<<6 : 0;
 
+   if (unlikely(!src))
+      return false;
+
    if (tiling == I915_TILING_X) {
       tw = xtile_width;
       th = xtile_height;
@@ -721,6 +729,8 @@ tiled_to_linear(uint32_t xt1, uint32_t xt2,
                    mem_copy);
       }
    }
+
+   return true;
 }
 
 
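Returning a bool from the detiling copies lets callers feed brw_bo_map()
straight into the call and have a failed (NULL) mapping propagate as false,
which is how the ReadPixels/GetTexSubImage paths above now bail out. A
condensed sketch of that caller pattern (readback_tiled() is a hypothetical
wrapper; coordinates are already in bytes/rows):

   static bool
   readback_tiled(struct brw_context *brw, struct intel_mipmap_tree *mt,
                  char *dst, int dst_pitch,
                  uint32_t x0, uint32_t x1, uint32_t y0, uint32_t y1,
                  mem_copy_fn mem_copy)
   {
      /* tiled_to_linear() returns false if the map below came back NULL */
      return tiled_to_linear(x0, x1, y0, y1, dst,
                             brw_bo_map(mt->bo, MAP_READ | MAP_DETILED,
                                        PERF_DEBUG(brw, "ReadPixels")),
                             dst_pitch, mt->pitch,
                             brw->has_swizzling, mt->tiling, mem_copy);
   }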
diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
index 9dc1088..a64e516 100644
--- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
+++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.h
@@ -37,7 +37,7 @@
 
 typedef void *(*mem_copy_fn)(void *dest, const void *src, size_t n);
 
-void
+bool
 linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
@@ -46,7 +46,7 @@ linear_to_tiled(uint32_t xt1, uint32_t xt2,
                 uint32_t tiling,
                 mem_copy_fn mem_copy);
 
-void
+bool
 tiled_to_linear(uint32_t xt1, uint32_t xt2,
                 uint32_t yt1, uint32_t yt2,
                 char *dst, const char *src,
diff --git a/src/mesa/drivers/dri/i965/intel_upload.c b/src/mesa/drivers/dri/i965/intel_upload.c
index 435f56f..1d51a37 100644
--- a/src/mesa/drivers/dri/i965/intel_upload.c
+++ b/src/mesa/drivers/dri/i965/intel_upload.c
@@ -49,13 +49,9 @@
 #define ALIGN_NPOT(value, alignment) \
    (((value) + (alignment) - 1) / (alignment) * (alignment))
 
-void
+static void
 intel_upload_finish(struct brw_context *brw)
 {
-   if (!brw->upload.bo)
-      return;
-
-   drm_intel_bo_unmap(brw->upload.bo);
    brw_bo_put(brw->upload.bo);
    brw->upload.bo = NULL;
    brw->upload.next_offset = 0;
@@ -102,10 +98,6 @@ intel_upload_space(struct brw_context *brw,
    if (!brw->upload.bo) {
       brw->upload.bo = brw_bo_create(&brw->batch, "streamed data",
                                      MAX2(INTEL_UPLOAD_SIZE, size), 4096, 0);
-      if (brw->has_llc)
-         drm_intel_bo_map(brw->upload.bo, true);
-      else
-         drm_intel_gem_bo_map_gtt(brw->upload.bo);
    }
 
    brw->upload.next_offset = offset + size;
@@ -116,7 +108,7 @@ intel_upload_space(struct brw_context *brw,
       *out_bo = brw_bo_get(brw->upload.bo);
    }
 
-   return brw->upload.bo->virtual + offset;
+   return brw_bo_map(brw->upload.bo, MAP_WRITE | MAP_ASYNC, NULL) + offset;
 }
 
 /**
-- 
2.5.0


