[Mesa-dev] [PATCH 00/12] render reordering for optimized tile buffer usage

Fri Jul 8 16:14:18 UTC 2016

On Sat, Jul 2, 2016 at 12:52 PM, Rob Clark <robdclark at gmail.com> wrote:
> So, games/apps that are aware of how a tiler gpu works will make an
> effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
> since this will force a flush, and extra resolve (tile->mem) and
> restore (mem->tile) in the next batch.  They also avoid unnecessary
> framebuffer switches, for the same reason.
>
> But turns out that many games, benchmarks, etc, aren't very good at
> this.  But what if we could re-order the batches (and potentially
> shadow texture/UBO/etc resources) to minimize the tile passes and
> unnecessary resolve/restore?
>
> This is based on a rough idea that Eric suggested a while back, and
> a few other experiments that I have been trying recently.  It boils
> down to three parts:
>
> 1) Add an fd_batch object, which tracks cmdstream being built for that
>    particular tile pass.  State that is global to the tile pass is
>    move from fd_context to fd_batch.  (Mostly the framebuffer state,
>    but also so internal tracking that is done to decide whether to
>    use GMEM or sysmem/bypass mode, etc.)
>
>    Tracking of resources written/read in the batch is also moved from
>    ctx to batch.

So, it turned out that tracking only the most recent batch that reads
a resource leads to unnecessary dependencies, and results in batches
getting force-flushed (to avoid a dependency loop) when otherwise not
needed.

I initially did things this way so I could have a single list_head in
the pipe_resource, and to avoid needing to track a 'struct set' of
batches per pipe_resource.  But we really need a way to allow tracking
of multiple batches that read a resource without introducing an
artificial dependency between the reading batches.

So I came up with a different approach after discussing a few
different options with glisse.  It involves putting an upper bound on
the # of batches at 32 (although 64 would be a possibility).  In the
batch, we end up needing a hash-set to track resources accessed by the
batch.  But in the resource we only need a bitmask of which batches
access this resource (and a single 'struct fd_batch *write_batch' for
most recent writer).  (And I check the bitmask to short-circuit the
hashset lookup/insert in the common case.)

So now I'm getting ~+20% on manhattan, and a bit more improvement in
xonotic than before.  There are still a few glitches in xonotic (the
increased re-ordering exposes that occlusion query is completely
broken and queries need some work to dtrt in the face of re-ordering).
And the map in the upper left corner somehow doesn't show the
outline/map of the level (just the dots where the players are at).
Not sure yet what is going on there.

Mostly I only hit forced flushes due to hitting upper limit on # of
batches during game startup, when it is doing a lot of texture uploads
and mipmap generation, but not yet submitted any rendering that uses
those textures.  And an upper-bound on un-flushed batches in that sort
of scenario actually seems like a good thing.  Although I could
probably be more clever about picking which batch(es) to flush in that
scenario.  The upper limit could be problematic if someone uploaded
layer 0 to a bunch of textures, and then generated mipmap for all of
the textures (as opposed to interleaving upload/genmipmap).  I guess
you probably have to go out of your way to be that stupid, so meh?

One of the annoying things, since pipe_resource is per-screen, not
per-context, I end up having to push batch_cache down into screen.
Which means that, for example, one context switching fb state could
force flushing a batch from another context.  Eventually if I push of
gmem+submit to a per-context helper thread, that should help keep
things properly serialized.  Although I still need some (currently
missing) mutexes to serialize batch_cache access, etc.  Also it means
that the upper limit on # of batches is per-screen, not per-context.
Not really sure what to do about that.  I really wish resources were
not shared across contexts (but rather just use flink or dmabuf when
you need to share), but I guess it is too late for that now :-(

https://github.com/freedreno/mesa/commit/e23dac02234de1c688efbad58758fdf9d837c94b

BR,
-R

> 2) Add a batch-cache.  Previously, whenever new framebuffer state is
>    set, it forced a flush.  Now (if reordering is enabled), we use
>    the framebuffer state as key into a hashtable to map it to an
>    existing batch (if there is one, otherwise construct a new batch
>    and add it to the table).
>
>    When a resource is marked as read/written by a batch, which is
>    already pending access by another batch, a dependency between the
>    two batches is added.
>
>    TODO there is probably a bit more room for improvement here.  See
>    below analysis of supertuxkart.
>
> 3) Shadow resources.  Mid-batch UBO updates or uploading new contents
>    to an in-use texture is sadly too common.  Traditional (non-tiler)
>    gpu's could solve this with a staging buffer, and blitting from the
>    staging to real buffer at the appropriate spot in the cmdstream.
>    But this doesn't work for a tiling gpu, since we'll need the old
>    contents again when we move on to the next tile.  To solve this,
>    allocate a new buffer and back-blit the previous contents to the
>    new buffer.  The existing buffer becomes a shadow and is unref'd
>    (the backing GEM object is kept alive since it is referenced by
>    the cmdstream).
>
>    For example, a texture upload + mipmap gen turns into transfer_map
>    for level zero (glTexSubImage*, etc), followed by blits to the
>    remaining mipmap levels (glGenerateMipmap()).  So in transfer_map()
>    if writing new contents into the buffer would trigger a flush or
>    stall, we shadow the existing buffer, and blit the remaining levels
>    from old to new.  Each blit turns into a batch (different frame-
>    buffer state), and is not immediately flushed, but just hangs out
>    in the batch cache.  When the next blit (from glGenerateMipmap()
>    overwrites the contents from the back-blit, we realize this and
>    drop the previous rendering to the batch, so in many cases the
>    back-blit ends up discarded.
>
>
>
> Results:
>
> supertuxkart was a big winner, with an overall ~30% boost, making the
> new render engine finally playable on most levels.  Fps varies a lot
> by level, but on average going from 14-19fps to 20-25fps.
>
> (Sadly, the old render engine, which was much faster on lower end hw,
> seems to be in disrepair.)
>
> I did also add some instrumentation to collect some stats on # of
> different sorts of batches.  Since supertuxkart --profile-laps is
> not repeatable, I could not directly compare results there, but I
> could compare an apitrace replay of stk level:
>
>   normal:  batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
>   reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
>   (for 792 frames)
>
> I was expecting a drop in gmem batches, and restores, because stk
> does two problematic things: (1) render target switches, ie. clear,
> switch fb, clear, switch fb, draw, etc., and (2) mid-batch UBO
> update.
>
> I've looked a bit into the render target switches, but it seems like
> it is mixing/matching zsbuf and cbuf's in a way that makes them map
> to different batches.  Ie:
>
>    set fb: zsbuf=A, cbuf[0]=B
>    clear color0
>    clear stencil
>    set fb: zsbuf=A, cbuf[0]=C
>    draw
>
> Not entirely sure what to do about that.  I suppose I could track the
> cmdstream for the clears individually, and juggle them between batches
> somehow to avoid the flush?
>
> The mid-batch UBO update seems to actually happen between two fb states
> with the same color0 and zs, but first treats color0 as R8G8B8A8_SRGB
> and the next R8G8B8A8_UNORM.  Probably we need a flush here anyways,
> but use of glDiscardFramebuffer() in the app (and wiring up the driver
> bits) could avoid a lot of restores.
>
> Most of the gain seems to come from simply not stalling on the UBO
> update.
>
>
> xonitic also seems to be a winner, although I haven't analyzed it as
> closely:
>
>    med:   48fps -> 52fps
>    high:  25fps -> 31fps
>    ultra: 15fps -> 19fps
>
> and the batch stats show more of an improvement:
>
> med:
>   normal:  batch_sysmem=0,    batch_gmem=18055, batch_restore=3748
>   reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174
>   (10510 frames)
>
> high:
>   normal:  batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
>   reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
>   (10510 frames)
>
> ultra:
>   normal:  batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
>   reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
>   (10510 frames)
>
> So in all cases a nice drop in tile passes (batch_gmem) and reduction
> in number of times we need to move back from system memory to tile
> buffer (batch_restore).  High/ultra still has a lot of restore's per
> frame, so maybe there is still some room for improvement.  Not sure
> yet if it is the same sort of thing going on as supertuxkart.
>
> I would expect to see some gains in manhattan and possibly trex, but
> unfortunately it is mostly using compressed textures that util_blitter
> cannot blit, so the resource shadowing back-blit ends up on the CPU
> (which ends up flushing previous mipmap generation and stalling, which
> kind of defeats the purpose).  I'm not entirely sure what to do here.
> Since we don't need scaling/filtering/etc we could map things to a
> different format which can be rendered to, but I think we end up
> needing to also lie about the width/height.  Which works ok for fb
> state (we take w/h from the pipe_surface, not the pipe_resource).  But
> not on the src (tex state) side.  Possibly we could add w/h to
> pipe_sampler_view to solve this?  Solving this should at least bring
> about +15% in manhattan, and maybe a bit in trex.
>
>
> At any rate, the freedreno bits end up depending on some libdrm
> patches[1] which in turn depend on some kernel stuff I have queued up
> for 4.8.  So it will be some time before it lands.  But I'd like to
> get the first three patches reviewed and pushed.  And suggestions
> about the remaining issues welcome, since there is still some room
> for further gains.
>
> [1] https://github.com/freedreno/libdrm/commits/fd-next
>
> Rob Clark (12):
>   gallium/util: make util_copy_framebuffer_state(src=NULL) work
>   gallium: un-inline pipe_surface_desc
>   list: fix list_replace() for empty lists
>   freedreno: introduce fd_batch
>   freedreno: push resource tracking down into batch
>   freedreno: dynamically sized/growable cmd buffers
>   freedreno: move more batch related tracking to fd_batch
>   freedreno: add batch-cache
>   freedreno: batch re-ordering support
>   freedreno: spiff up some debug traces
>   freedreno: shadow textures if possible to avoid stall/flush
>   freedreno: support discarding previous rendering in special cases
>
>  src/gallium/auxiliary/util/u_framebuffer.c         |  37 ++-
>  src/gallium/drivers/freedreno/Makefile.sources     |   4 +
>  src/gallium/drivers/freedreno/a2xx/fd2_draw.c      |  12 +-
>  src/gallium/drivers/freedreno/a2xx/fd2_emit.c      |  15 +-
>  src/gallium/drivers/freedreno/a2xx/fd2_gmem.c      |  63 ++---
>  src/gallium/drivers/freedreno/a3xx/fd3_context.c   |   4 -
>  src/gallium/drivers/freedreno/a3xx/fd3_context.h   |   5 -
>  src/gallium/drivers/freedreno/a3xx/fd3_draw.c      |  23 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_emit.c      |  23 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_emit.h      |   2 +-
>  src/gallium/drivers/freedreno/a3xx/fd3_gmem.c      | 146 +++++------
>  src/gallium/drivers/freedreno/a4xx/fd4_draw.c      |  41 +--
>  src/gallium/drivers/freedreno/a4xx/fd4_draw.h      |  13 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_emit.c      |  24 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_emit.h      |   2 +-
>  src/gallium/drivers/freedreno/a4xx/fd4_gmem.c      | 122 ++++-----
>  src/gallium/drivers/freedreno/freedreno_batch.c    | 280 +++++++++++++++++++++
>  src/gallium/drivers/freedreno/freedreno_batch.h    | 152 +++++++++++
>  .../drivers/freedreno/freedreno_batch_cache.c      | 246 ++++++++++++++++++
>  .../drivers/freedreno/freedreno_batch_cache.h      |  51 ++++
>  src/gallium/drivers/freedreno/freedreno_context.c  | 131 ++--------
>  src/gallium/drivers/freedreno/freedreno_context.h  | 123 ++-------
>  src/gallium/drivers/freedreno/freedreno_draw.c     | 132 +++++-----
>  src/gallium/drivers/freedreno/freedreno_draw.h     |  15 +-
>  src/gallium/drivers/freedreno/freedreno_gmem.c     | 110 ++++----
>  src/gallium/drivers/freedreno/freedreno_gmem.h     |   6 +-
>  src/gallium/drivers/freedreno/freedreno_query_hw.c |   8 +-
>  src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
>  src/gallium/drivers/freedreno/freedreno_resource.h |  10 +-
>  src/gallium/drivers/freedreno/freedreno_screen.c   |   9 +
>  src/gallium/drivers/freedreno/freedreno_screen.h   |   2 +
>  src/gallium/drivers/freedreno/freedreno_state.c    |  19 +-
>  src/gallium/drivers/freedreno/freedreno_util.h     |  43 ++--
>  src/gallium/include/pipe/p_state.h                 |  23 +-
>  src/util/list.h                                    |  14 +-
>  35 files changed, 1486 insertions(+), 666 deletions(-)
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
>  create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h
>
> --
> 2.7.4
>