[Mesa-dev] [PATCH 00/12] render reordering for optimized tile buffer usage
Rob Clark
robdclark at gmail.com
Fri Jul 8 16:14:18 UTC 2016
On Sat, Jul 2, 2016 at 12:52 PM, Rob Clark <robdclark at gmail.com> wrote:
> So, games/apps that are aware of how a tiler gpu works will make an
> effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
> since this will force a flush, and extra resolve (tile->mem) and
> restore (mem->tile) in the next batch. They also avoid unnecessary
> framebuffer switches, for the same reason.
>
> But turns out that many games, benchmarks, etc, aren't very good at
> this. But what if we could re-order the batches (and potentially
> shadow texture/UBO/etc resources) to minimize the tile passes and
> unnecessary resolve/restore?
>
> This is based on a rough idea that Eric suggested a while back, and
> a few other experiments that I have been trying recently. It boils
> down to three parts:
>
> 1) Add an fd_batch object, which tracks cmdstream being built for that
> particular tile pass. State that is global to the tile pass is
> moved from fd_context to fd_batch. (Mostly the framebuffer state,
> but also some internal tracking that is done to decide whether to
> use GMEM or sysmem/bypass mode, etc.)
>
> Tracking of resources written/read in the batch is also moved from
> ctx to batch.
So, it turned out that tracking only the most recent batch that reads
a resource leads to unnecessary dependencies, and results in batches
getting force-flushed (to avoid a dependency loop) when otherwise not
needed.
I initially did things this way so I could have a single list_head in
the pipe_resource, and to avoid needing to track a 'struct set' of
batches per pipe_resource. But we really need a way to allow tracking
of multiple batches that read a resource without introducing an
artificial dependency between the reading batches.
So I came up with a different approach after discussing a few
different options with glisse. It involves putting an upper bound on
the # of batches at 32 (although 64 would be a possibility). In the
batch, we end up needing a hash-set to track resources accessed by the
batch. But in the resource we only need a bitmask of which batches
access this resource (and a single 'struct fd_batch *write_batch' for
most recent writer). (And I check the bitmask to short-circuit the
hashset lookup/insert in the common case.)
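
A minimal sketch of the per-resource bitmask idea, to make the above concrete. This is not the code from the branch: the struct and function names here (batch, resource, resource_mark_read, etc.) and the 'idx' field are made up for illustration, and the per-batch hash-set is left out. The only things taken from the description above are the 32-batch bound, the bitmask, and the single write_batch pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BATCHES 32   /* bound from the mail; 64 would need a uint64_t mask */

/* Hypothetical stand-in for fd_batch: only the slot index matters here. */
struct batch {
    unsigned idx;        /* slot in [0, MAX_BATCHES), assumed unique per live batch */
};

/* Hypothetical stand-in for the per-resource tracking state. */
struct resource {
    uint32_t batch_mask;         /* bit N set => batch N accesses this resource */
    struct batch *write_batch;   /* most recent writer, or NULL */
};

/* Mark 'b' as a reader of 'rsc'.  Returns true only if the batch was not
 * already tracked, i.e. only then would the caller need to touch the
 * batch's hash-set.  The bitmask test is the short-circuit for the
 * common case. */
static bool resource_mark_read(struct resource *rsc, struct batch *b)
{
    assert(b->idx < MAX_BATCHES);
    uint32_t bit = UINT32_C(1) << b->idx;

    if (rsc->batch_mask & bit)
        return false;
    rsc->batch_mask |= bit;
    return true;
}

/* Mark 'b' as the most recent writer; a writer is also tracked in the mask. */
static bool resource_mark_written(struct resource *rsc, struct batch *b)
{
    bool added = resource_mark_read(rsc, b);
    rsc->write_batch = b;
    return added;
}

/* Once 'b' has been flushed it no longer references 'rsc'. */
static void resource_batch_flushed(struct resource *rsc, struct batch *b)
{
    assert(b->idx < MAX_BATCHES);
    rsc->batch_mask &= ~(UINT32_C(1) << b->idx);
    if (rsc->write_batch == b)
        rsc->write_batch = NULL;
}
```

Note that multiple readers just set independent bits, so no ordering between the reading batches is implied.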
So now I'm getting ~+20% on manhattan, and a bit more improvement in
xonotic than before. There are still a few glitches in xonotic (the
increased re-ordering exposes that occlusion query is completely
broken and queries need some work to dtrt in the face of re-ordering).
And the map in the upper left corner somehow doesn't show the
outline/map of the level (just the dots where the players are at).
Not sure yet what is going on there.
Mostly I only hit forced flushes due to hitting upper limit on # of
batches during game startup, when it is doing a lot of texture uploads
and mipmap generation, but has not yet submitted any rendering that uses
those textures. And an upper-bound on un-flushed batches in that sort
of scenario actually seems like a good thing. Although I could
probably be more clever about picking which batch(es) to flush in that
scenario. The upper limit could be problematic if someone uploaded
layer 0 to a bunch of textures, and then generated mipmap for all of
the textures (as opposed to interleaving upload/genmipmap). I guess
you probably have to go out of your way to be that stupid, so meh?
One of the annoying things: since pipe_resource is per-screen, not
per-context, I end up having to push batch_cache down into screen.
Which means that, for example, one context switching fb state could
force flushing a batch from another context. Eventually if I push off
gmem+submit to a per-context helper thread, that should help keep
things properly serialized. Although I still need some (currently
missing) mutexes to serialize batch_cache access, etc. Also it means
that the upper limit on # of batches is per-screen, not per-context.
Not really sure what to do about that. I really wish resources were
not shared across contexts (but rather just use flink or dmabuf when
you need to share), but I guess it is too late for that now :-(
https://github.com/freedreno/mesa/commit/e23dac02234de1c688efbad58758fdf9d837c94b
BR,
-R
> 2) Add a batch-cache. Previously, whenever new framebuffer state is
> set, it forced a flush. Now (if reordering is enabled), we use
> the framebuffer state as key into a hashtable to map it to an
> existing batch (if there is one, otherwise construct a new batch
> and add it to the table).
>
> When a resource is marked as read/written by a batch, which is
> already pending access by another batch, a dependency between the
> two batches is added.
>
> TODO there is probably a bit more room for improvement here. See
> below analysis of supertuxkart.
>
> 3) Shadow resources. Mid-batch UBO updates or uploading new contents
> to an in-use texture is sadly too common. Traditional (non-tiler)
> gpu's could solve this with a staging buffer, and blitting from the
> staging to real buffer at the appropriate spot in the cmdstream.
> But this doesn't work for a tiling gpu, since we'll need the old
> contents again when we move on to the next tile. To solve this,
> allocate a new buffer and back-blit the previous contents to the
> new buffer. The existing buffer becomes a shadow and is unref'd
> (the backing GEM object is kept alive since it is referenced by
> the cmdstream).
>
> For example, a texture upload + mipmap gen turns into transfer_map
> for level zero (glTexSubImage*, etc), followed by blits to the
> remaining mipmap levels (glGenerateMipmap()). So in transfer_map()
> if writing new contents into the buffer would trigger a flush or
> stall, we shadow the existing buffer, and blit the remaining levels
> from old to new. Each blit turns into a batch (different frame-
> buffer state), and is not immediately flushed, but just hangs out
> in the batch cache. When the next blit (from glGenerateMipmap())
> overwrites the contents from the back-blit, we realize this and
> drop the previous rendering to the batch, so in many cases the
> back-blit ends up discarded.
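
To illustrate the batch-cache lookup in (2): a rough sketch, not the actual freedreno_batch_cache.c. The fb_key struct is a hypothetical, heavily simplified framebuffer key (just opaque ids for zsbuf and cbuf[0]), and a linear scan over a fixed array stands in for the hashtable to keep the example short. The names batch_cache and batch_from_fb are assumptions as well.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BATCHES 32   /* same upper bound as discussed above */

/* Hypothetical, simplified key: real framebuffer state has more fields. */
struct fb_key {
    unsigned zsbuf;      /* opaque id of the depth/stencil attachment */
    unsigned cbuf0;      /* opaque id of color attachment 0 */
};

struct batch {
    struct fb_key key;
    int in_use;
};

struct batch_cache {
    struct batch batches[MAX_BATCHES];
};

static int fb_key_equal(const struct fb_key *a, const struct fb_key *b)
{
    return a->zsbuf == b->zsbuf && a->cbuf0 == b->cbuf0;
}

/* Return the pending batch for this framebuffer state if there is one,
 * otherwise claim a free slot for a new batch.  Returns NULL when the
 * cache is full, in which case the caller would have to flush some
 * batch first. */
static struct batch *batch_from_fb(struct batch_cache *cache,
                                   const struct fb_key *key)
{
    struct batch *free_slot = NULL;

    assert(cache != NULL && key != NULL);

    for (size_t i = 0; i < MAX_BATCHES; i++) {
        struct batch *b = &cache->batches[i];

        if (b->in_use && fb_key_equal(&b->key, key))
            return b;                 /* re-use the existing batch */
        if (!b->in_use && !free_slot)
            free_slot = b;            /* remember the first free slot */
    }

    if (free_slot) {
        free_slot->key = *key;
        free_slot->in_use = 1;
    }
    return free_slot;
}
```

With a key like this, two framebuffer states that share zsbuf but differ in cbuf[0] land in different batches, while switching back to an earlier state picks up its pending batch again instead of forcing a flush.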
>
>
>
> Results:
>
> supertuxkart was a big winner, with an overall ~30% boost, making the
> new render engine finally playable on most levels. Fps varies a lot
> by level, but on average going from 14-19fps to 20-25fps.
>
> (Sadly, the old render engine, which was much faster on lower end hw,
> seems to be in disrepair.)
>
> I did also add some instrumentation to collect some stats on # of
> different sorts of batches. Since supertuxkart --profile-laps is
> not repeatable, I could not directly compare results there, but I
> could compare an apitrace replay of stk level:
>
> normal: batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
> reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
> (for 792 frames)
>
> I was expecting a drop in gmem batches, and restores, because stk
> does two problematic things: (1) render target switches, ie. clear,
> switch fb, clear, switch fb, draw, etc., and (2) mid-batch UBO
> update.
>
> I've looked a bit into the render target switches, but it seems like
> it is mixing/matching zsbuf and cbuf's in a way that makes them map
> to different batches. Ie:
>
> set fb: zsbuf=A, cbuf[0]=B
> clear color0
> clear stencil
> set fb: zsbuf=A, cbuf[0]=C
> draw
>
> Not entirely sure what to do about that. I suppose I could track the
> cmdstream for the clears individually, and juggle them between batches
> somehow to avoid the flush?
>
> The mid-batch UBO update seems to actually happen between two fb states
> with the same color0 and zs, but first treats color0 as R8G8B8A8_SRGB
> and the next R8G8B8A8_UNORM. Probably we need a flush here anyways,
> but use of glDiscardFramebuffer() in the app (and wiring up the driver
> bits) could avoid a lot of restores.
>
> Most of the gain seems to come from simply not stalling on the UBO
> update.
>
>
> xonotic also seems to be a winner, although I haven't analyzed it as
> closely:
>
> med: 48fps -> 52fps
> high: 25fps -> 31fps
> ultra: 15fps -> 19fps
>
> and the batch stats show more of an improvement:
>
> med:
> normal: batch_sysmem=0, batch_gmem=18055, batch_restore=3748
> reorder: batch_sysmem=2220, batch_gmem=14483, batch_restore=174
> (10510 frames)
>
> high:
> normal: batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
> reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
> (10510 frames)
>
> ultra:
> normal: batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
> reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
> (10510 frames)
>
> So in all cases a nice drop in tile passes (batch_gmem) and reduction
> in number of times we need to move back from system memory to tile
> buffer (batch_restore). High/ultra still has a lot of restore's per
> frame, so maybe there is still some room for improvement. Not sure
> yet if it is the same sort of thing going on as supertuxkart.
>
> I would expect to see some gains in manhattan and possibly trex, but
> unfortunately it is mostly using compressed textures that util_blitter
> cannot blit, so the resource shadowing back-blit ends up on the CPU
> (which ends up flushing previous mipmap generation and stalling, which
> kind of defeats the purpose). I'm not entirely sure what to do here.
> Since we don't need scaling/filtering/etc we could map things to a
> different format which can be rendered to, but I think we end up
> needing to also lie about the width/height. Which works ok for fb
> state (we take w/h from the pipe_surface, not the pipe_resource). But
> not on the src (tex state) side. Possibly we could add w/h to
> pipe_sampler_view to solve this? Solving this should at least bring
> about +15% in manhattan, and maybe a bit in trex.
>
>
> At any rate, the freedreno bits end up depending on some libdrm
> patches[1] which in turn depend on some kernel stuff I have queued up
> for 4.8. So it will be some time before it lands. But I'd like to
> get the first three patches reviewed and pushed. And suggestions
> about the remaining issues welcome, since there is still some room
> for further gains.
>
> [1] https://github.com/freedreno/libdrm/commits/fd-next
>
> Rob Clark (12):
> gallium/util: make util_copy_framebuffer_state(src=NULL) work
> gallium: un-inline pipe_surface_desc
> list: fix list_replace() for empty lists
> freedreno: introduce fd_batch
> freedreno: push resource tracking down into batch
> freedreno: dynamically sized/growable cmd buffers
> freedreno: move more batch related tracking to fd_batch
> freedreno: add batch-cache
> freedreno: batch re-ordering support
> freedreno: spiff up some debug traces
> freedreno: shadow textures if possible to avoid stall/flush
> freedreno: support discarding previous rendering in special cases
>
> src/gallium/auxiliary/util/u_framebuffer.c | 37 ++-
> src/gallium/drivers/freedreno/Makefile.sources | 4 +
> src/gallium/drivers/freedreno/a2xx/fd2_draw.c | 12 +-
> src/gallium/drivers/freedreno/a2xx/fd2_emit.c | 15 +-
> src/gallium/drivers/freedreno/a2xx/fd2_gmem.c | 63 ++---
> src/gallium/drivers/freedreno/a3xx/fd3_context.c | 4 -
> src/gallium/drivers/freedreno/a3xx/fd3_context.h | 5 -
> src/gallium/drivers/freedreno/a3xx/fd3_draw.c | 23 +-
> src/gallium/drivers/freedreno/a3xx/fd3_emit.c | 23 +-
> src/gallium/drivers/freedreno/a3xx/fd3_emit.h | 2 +-
> src/gallium/drivers/freedreno/a3xx/fd3_gmem.c | 146 +++++------
> src/gallium/drivers/freedreno/a4xx/fd4_draw.c | 41 +--
> src/gallium/drivers/freedreno/a4xx/fd4_draw.h | 13 +-
> src/gallium/drivers/freedreno/a4xx/fd4_emit.c | 24 +-
> src/gallium/drivers/freedreno/a4xx/fd4_emit.h | 2 +-
> src/gallium/drivers/freedreno/a4xx/fd4_gmem.c | 122 ++++-----
> src/gallium/drivers/freedreno/freedreno_batch.c | 280 +++++++++++++++++++++
> src/gallium/drivers/freedreno/freedreno_batch.h | 152 +++++++++++
> .../drivers/freedreno/freedreno_batch_cache.c | 246 ++++++++++++++++++
> .../drivers/freedreno/freedreno_batch_cache.h | 51 ++++
> src/gallium/drivers/freedreno/freedreno_context.c | 131 ++--------
> src/gallium/drivers/freedreno/freedreno_context.h | 123 ++-------
> src/gallium/drivers/freedreno/freedreno_draw.c | 132 +++++-----
> src/gallium/drivers/freedreno/freedreno_draw.h | 15 +-
> src/gallium/drivers/freedreno/freedreno_gmem.c | 110 ++++----
> src/gallium/drivers/freedreno/freedreno_gmem.h | 6 +-
> src/gallium/drivers/freedreno/freedreno_query_hw.c | 8 +-
> src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
> src/gallium/drivers/freedreno/freedreno_resource.h | 10 +-
> src/gallium/drivers/freedreno/freedreno_screen.c | 9 +
> src/gallium/drivers/freedreno/freedreno_screen.h | 2 +
> src/gallium/drivers/freedreno/freedreno_state.c | 19 +-
> src/gallium/drivers/freedreno/freedreno_util.h | 43 ++--
> src/gallium/include/pipe/p_state.h | 23 +-
> src/util/list.h | 14 +-
> 35 files changed, 1486 insertions(+), 666 deletions(-)
> create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
> create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
> create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
> create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h
>
> --
> 2.7.4
>