[Mesa-dev] [PATCH 00/12] render reordering for optimized tile buffer usage

Sat Jul 2 16:52:03 UTC 2016

So, games/apps that are aware of how a tiler gpu works will make an
effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
since this will force a flush, and extra resolve (tile->mem) and
restore (mem->tile) in the next batch.  They also avoid unnecessary
framebuffer switches, for the same reason.

But it turns out that many games, benchmarks, etc, aren't very good
at this.  So what if we could re-order the batches (and potentially
shadow texture/UBO/etc resources) to minimize the tile passes and
unnecessary resolve/restore?

This is based on a rough idea that Eric suggested a while back, and
a few other experiments that I have been trying recently.  It boils
down to three parts:

1) Add an fd_batch object, which tracks cmdstream being built for that
   particular tile pass.  State that is global to the tile pass is
   moved from fd_context to fd_batch.  (Mostly the framebuffer state,
   but also some internal tracking that is done to decide whether to
   use GMEM or sysmem/bypass mode, etc.)

   Tracking of resources written/read in the batch is also moved from
   ctx to batch.

2) Add a batch-cache.  Previously, whenever new framebuffer state is
   set, it forced a flush.  Now (if reordering is enabled), we use
   the framebuffer state as key into a hashtable to map it to an
   existing batch (if there is one, otherwise construct a new batch
   and add it to the table).

   When a resource is marked as read/written by a batch, which is
   already pending access by another batch, a dependency between the
   two batches is added.

   TODO there is probably a bit more room for improvement here.  See
   the analysis of supertuxkart below.

3) Shadow resources.  Mid-batch UBO updates, or uploads of new contents
   to an in-use texture, are sadly too common.  Traditional (non-tiler)
   GPUs could solve this with a staging buffer, and blitting from the
   staging to real buffer at the appropriate spot in the cmdstream.
   But this doesn't work for a tiling gpu, since we'll need the old
   contents again when we move on to the next tile.  To solve this,
   allocate a new buffer and back-blit the previous contents to the
   new buffer.  The existing buffer becomes a shadow and is unref'd
   (the backing GEM object is kept alive since it is referenced by
   the cmdstream).

   For example, a texture upload + mipmap gen turns into transfer_map
   for level zero (glTexSubImage*, etc), followed by blits to the
   remaining mipmap levels (glGenerateMipmap()).  So in transfer_map()
   if writing new contents into the buffer would trigger a flush or
   stall, we shadow the existing buffer, and blit the remaining levels
   from old to new.  Each blit turns into a batch (different
   framebuffer state), and is not immediately flushed, but just hangs
   out in the batch cache.  When the next blit (from
   glGenerateMipmap()) overwrites the contents from the back-blit, we
   realize this and drop the previous rendering to the batch, so in
   many cases the back-blit ends up discarded.
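
To make the shadowing step in (3) a bit more concrete, here is a rough
standalone sketch.  It is not the freedreno code: toy_bo, toy_resource
and toy_shadow() are made-up names, memcpy() stands in for the GPU
back-blit, and a plain refcount stands in for the cmdstream holding a
reference on the GEM object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a GEM buffer object: refcounted storage. */
struct toy_bo {
    int refcnt;
    size_t size;
    unsigned char *data;
};

/* Hypothetical stand-in for a resource.  'busy' means a pending batch
 * still needs the current contents. */
struct toy_resource {
    struct toy_bo *bo;
    bool busy;
};

static struct toy_bo *toy_bo_new(size_t size)
{
    struct toy_bo *bo = calloc(1, sizeof(*bo));
    if (!bo)
        return NULL;
    bo->data = calloc(1, size);
    if (!bo->data) {
        free(bo);
        return NULL;
    }
    bo->refcnt = 1;
    bo->size = size;
    return bo;
}

static void toy_bo_unref(struct toy_bo *bo)
{
    if (bo && --bo->refcnt == 0) {
        free(bo->data);
        free(bo);
    }
}

/* Give the resource a fresh buffer so that a CPU write does not have
 * to wait for (or flush) the pending batch.  Returns false if the
 * allocation fails, in which case the caller would fall back to the
 * flush/stall path. */
static bool toy_shadow(struct toy_resource *rsc)
{
    struct toy_bo *old = rsc->bo;
    struct toy_bo *fresh = toy_bo_new(old->size);

    if (!fresh)
        return false;

    /* "back-blit": carry the previous contents over to the new buffer */
    memcpy(fresh->data, old->data, old->size);

    rsc->bo = fresh;
    rsc->busy = false;

    /* drop the resource's reference; the old buffer stays alive for
     * as long as the pending batch holds its own reference */
    toy_bo_unref(old);
    return true;
}
```

The point of the sketch is only the ownership hand-off: after
toy_shadow() the resource points at the new buffer, and the old buffer
is kept alive solely by whoever still references it.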

Results:

supertuxkart was a big winner, with an overall ~30% boost, making the
new render engine finally playable on most levels.  Fps varies a lot
by level, but on average going from 14-19fps to 20-25fps.

(Sadly, the old render engine, which was much faster on lower end hw,
seems to be in disrepair.)

I also added some instrumentation to collect stats on the number of
each sort of batch.  Since supertuxkart --profile-laps is not
repeatable, I could not directly compare results there, but I could
compare an apitrace replay of an stk level:

  normal:  batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
  reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
  (for 792 frames)

I was expecting a drop in gmem batches and restores (both are
essentially unchanged), because stk does two problematic things:
(1) render target switches, ie. clear, switch fb, clear, switch fb,
draw, etc., and (2) mid-batch UBO update.

I've looked a bit into the render target switches, but it seems like
it is mixing/matching zsbuf and cbuf's in a way that makes them map
to different batches.  Ie:

   set fb: zsbuf=A, cbuf[0]=B
   clear color0
   clear stencil
   set fb: zsbuf=A, cbuf[0]=C
   draw

Not entirely sure what to do about that.  I suppose I could track the
cmdstream for the clears individually, and juggle them between batches
somehow to avoid the flush?
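
To illustrate why the sequence above lands in two batches, here is a
simplified sketch of a lookup keyed on framebuffer state.  The names
(toy_fb_key, toy_batch_from_fb(), etc.) are invented for this example,
the key is reduced to just zsbuf and cbuf[0], and a linear search
stands in for the hashtable:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BATCHES 8

/* Hypothetical key: only the surfaces bound as zsbuf and cbuf[0]. */
struct toy_fb_key {
    const void *zsbuf;
    const void *cbuf0;
};

struct toy_batch {
    struct toy_fb_key key;
    int num_clears;
    int num_draws;
};

struct toy_cache {
    struct toy_batch batches[TOY_MAX_BATCHES];
    size_t count;
};

static int toy_key_equal(struct toy_fb_key a, struct toy_fb_key b)
{
    return a.zsbuf == b.zsbuf && a.cbuf0 == b.cbuf0;
}

/* Return the pending batch for this fb state, creating one if there
 * is none yet.  Returns NULL when the toy cache is full (a real
 * driver would presumably flush something instead). */
static struct toy_batch *
toy_batch_from_fb(struct toy_cache *c, struct toy_fb_key key)
{
    for (size_t i = 0; i < c->count; i++)
        if (toy_key_equal(c->batches[i].key, key))
            return &c->batches[i];

    if (c->count == TOY_MAX_BATCHES)
        return NULL;

    struct toy_batch *b = &c->batches[c->count++];
    b->key = key;
    b->num_clears = 0;
    b->num_draws = 0;
    return b;
}
```

With zsbuf=A in both states but cbuf[0] switching from B to C, the two
keys compare unequal, so in this sketch the clears end up in one batch
and the draw in another, even though the stencil clear touches a
buffer that both states share.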

The mid-batch UBO update seems to actually happen between two fb states
with the same color0 and zs, but the first treats color0 as
R8G8B8A8_SRGB and the next as R8G8B8A8_UNORM.  Probably we need a flush
here anyways, but use of glDiscardFramebuffer() in the app (and wiring
up the driver bits) could avoid a lot of restores.

Most of the gain seems to come from simply not stalling on the UBO
update.

xonotic also seems to be a winner, although I haven't analyzed it as
closely:

   med:   48fps -> 52fps
   high:  25fps -> 31fps
   ultra: 15fps -> 19fps

and the batch stats show more of an improvement:

med:
  normal:  batch_sysmem=0,    batch_gmem=18055, batch_restore=3748
  reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174
  (10510 frames)

high:
  normal:  batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
  reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
  (10510 frames)

ultra:
  normal:  batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
  reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
  (10510 frames)

So in all cases a nice drop in tile passes (batch_gmem) and reduction
in number of times we need to move back from system memory to tile
buffer (batch_restore).  High/ultra still have a lot of restores per
frame, so maybe there is still some room for improvement.  Not sure
yet if it is the same sort of thing going on as supertuxkart.
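
For a per-frame view of those counters, a trivial helper like the one
below is enough (toy_per_frame() is just a name made up for this
note, not something in the series):

```c
#include <assert.h>

/* Average number of events per frame; 0 if no frames were counted. */
static double toy_per_frame(unsigned count, unsigned frames)
{
    return frames ? (double)count / (double)frames : 0.0;
}
```

Plugging in the batch_restore numbers above over 10510 frames: med
goes from roughly 0.36 to under 0.02 restores per frame, high from
roughly 4.6 to 4.2, and ultra from roughly 6.4 to 5.4.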

I would expect to see some gains in manhattan and possibly trex, but
unfortunately they mostly use compressed textures that util_blitter
cannot blit, so the resource shadowing back-blit ends up on the CPU
(which ends up flushing previous mipmap generation and stalling, which
kind of defeats the purpose).  I'm not entirely sure what to do here.
Since we don't need scaling/filtering/etc we could map things to a
different format which can be rendered to, but I think we end up
needing to also lie about the width/height.  Which works ok for fb
state (we take w/h from the pipe_surface, not the pipe_resource).  But
not on the src (tex state) side.  Possibly we could add w/h to
pipe_sampler_view to solve this?  Solving this should at least bring
about +15% in manhattan, and maybe a bit in trex.
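
As a sketch of the "lie about the width/height" part: if each
compressed block is treated as a single texel of some same-sized
renderable format (eg. a 64-bit integer format for a compressed format
with 4x4 blocks of 64 bits, which is an assumption for illustration),
the dimensions to advertise would be the block counts rather than the
texel counts.  toy_extent and toy_block_extent() are made-up names:

```c
#include <assert.h>

struct toy_extent {
    unsigned width;
    unsigned height;
};

/* Number of blocks needed to cover w x h texels, rounding partial
 * blocks up.  bw/bh are the block dimensions in texels and must be
 * non-zero. */
static struct toy_extent
toy_block_extent(unsigned w, unsigned h, unsigned bw, unsigned bh)
{
    struct toy_extent e = {
        (w + bw - 1) / bw,
        (h + bh - 1) / bh,
    };
    return e;
}
```

So a 256x256 level with 4x4 blocks would be blitted as 64x64, and the
small mipmap levels (anything at or below 4x4) collapse to a single
block.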

At any rate, the freedreno bits end up depending on some libdrm
patches[1] which in turn depend on some kernel stuff I have queued up
for 4.8.  So it will be some time before it lands.  But I'd like to
get the first three patches reviewed and pushed.  And suggestions
about the remaining issues are welcome, since there is still some room
for further gains.

[1] https://github.com/freedreno/libdrm/commits/fd-next

Rob Clark (12):
  gallium/util: make util_copy_framebuffer_state(src=NULL) work
  gallium: un-inline pipe_surface_desc
  list: fix list_replace() for empty lists
  freedreno: introduce fd_batch
  freedreno: push resource tracking down into batch
  freedreno: dynamically sized/growable cmd buffers
  freedreno: move more batch related tracking to fd_batch
  freedreno: add batch-cache
  freedreno: batch re-ordering support
  freedreno: spiff up some debug traces
  freedreno: shadow textures if possible to avoid stall/flush
  freedreno: support discarding previous rendering in special cases

 src/gallium/auxiliary/util/u_framebuffer.c         |  37 ++-
 src/gallium/drivers/freedreno/Makefile.sources     |   4 +
 src/gallium/drivers/freedreno/a2xx/fd2_draw.c      |  12 +-
 src/gallium/drivers/freedreno/a2xx/fd2_emit.c      |  15 +-
 src/gallium/drivers/freedreno/a2xx/fd2_gmem.c      |  63 ++---
 src/gallium/drivers/freedreno/a3xx/fd3_context.c   |   4 -
 src/gallium/drivers/freedreno/a3xx/fd3_context.h   |   5 -
 src/gallium/drivers/freedreno/a3xx/fd3_draw.c      |  23 +-
 src/gallium/drivers/freedreno/a3xx/fd3_emit.c      |  23 +-
 src/gallium/drivers/freedreno/a3xx/fd3_emit.h      |   2 +-
 src/gallium/drivers/freedreno/a3xx/fd3_gmem.c      | 146 +++++------
 src/gallium/drivers/freedreno/a4xx/fd4_draw.c      |  41 +--
 src/gallium/drivers/freedreno/a4xx/fd4_draw.h      |  13 +-
 src/gallium/drivers/freedreno/a4xx/fd4_emit.c      |  24 +-
 src/gallium/drivers/freedreno/a4xx/fd4_emit.h      |   2 +-
 src/gallium/drivers/freedreno/a4xx/fd4_gmem.c      | 122 ++++-----
 src/gallium/drivers/freedreno/freedreno_batch.c    | 280 +++++++++++++++++++++
 src/gallium/drivers/freedreno/freedreno_batch.h    | 152 +++++++++++
 .../drivers/freedreno/freedreno_batch_cache.c      | 246 ++++++++++++++++++
 .../drivers/freedreno/freedreno_batch_cache.h      |  51 ++++
 src/gallium/drivers/freedreno/freedreno_context.c  | 131 ++--------
 src/gallium/drivers/freedreno/freedreno_context.h  | 123 ++-------
 src/gallium/drivers/freedreno/freedreno_draw.c     | 132 +++++-----
 src/gallium/drivers/freedreno/freedreno_draw.h     |  15 +-
 src/gallium/drivers/freedreno/freedreno_gmem.c     | 110 ++++----
 src/gallium/drivers/freedreno/freedreno_gmem.h     |   6 +-
 src/gallium/drivers/freedreno/freedreno_query_hw.c |   8 +-
 src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
 src/gallium/drivers/freedreno/freedreno_resource.h |  10 +-
 src/gallium/drivers/freedreno/freedreno_screen.c   |   9 +
 src/gallium/drivers/freedreno/freedreno_screen.h   |   2 +
 src/gallium/drivers/freedreno/freedreno_state.c    |  19 +-
 src/gallium/drivers/freedreno/freedreno_util.h     |  43 ++--
 src/gallium/include/pipe/p_state.h                 |  23 +-
 src/util/list.h                                    |  14 +-
 35 files changed, 1486 insertions(+), 666 deletions(-)
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
 create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h

-- 
2.7.4