[Mesa-dev] [PATCH 00/12] render reordering for optimized tile buffer usage
Rob Clark
robdclark at gmail.com
Sat Jul 2 16:52:03 UTC 2016
So, games/apps that are aware of how a tiler gpu works will make an
effort to avoid mid-batch (tile pass) updates to textures, UBOs, etc,
since this will force a flush, and extra resolve (tile->mem) and
restore (mem->tile) in the next batch. They also avoid unnecessary
framebuffer switches, for the same reason.
But it turns out that many games, benchmarks, etc, aren't very good at
this. So what if we could re-order the batches (and potentially
shadow texture/UBO/etc resources) to minimize the tile passes and
unnecessary resolve/restore?
This is based on a rough idea that Eric suggested a while back, and
a few other experiments that I have been trying recently. It boils
down to three parts:
1) Add an fd_batch object, which tracks cmdstream being built for that
particular tile pass. State that is global to the tile pass is
moved from fd_context to fd_batch. (Mostly the framebuffer state,
but also some internal tracking that is done to decide whether to
use GMEM or sysmem/bypass mode, etc.)
Tracking of resources written/read in the batch is also moved from
ctx to batch.
2) Add a batch-cache. Previously, whenever new framebuffer state is
set, it forced a flush. Now (if reordering is enabled), we use
the framebuffer state as key into a hashtable to map it to an
existing batch (if there is one, otherwise construct a new batch
and add it to the table).
When a resource is marked as read/written by a batch, which is
already pending access by another batch, a dependency between the
two batches is added.
TODO there is probably a bit more room for improvement here. See
below analysis of supertuxkart.
3) Shadow resources. Mid-batch UBO updates or uploading new contents
to an in-use texture is sadly too common. Traditional (non-tiler)
gpus could solve this with a staging buffer, and blitting from the
staging to real buffer at the appropriate spot in the cmdstream.
But this doesn't work for a tiling gpu, since we'll need the old
contents again when we move on to the next tile. To solve this,
allocate a new buffer and back-blit the previous contents to the
new buffer. The existing buffer becomes a shadow and is unref'd
(the backing GEM object is kept alive since it is referenced by
the cmdstream).
For example, a texture upload + mipmap gen turns into transfer_map
for level zero (glTexSubImage*, etc), followed by blits to the
remaining mipmap levels (glGenerateMipmap()). So in transfer_map()
if writing new contents into the buffer would trigger a flush or
stall, we shadow the existing buffer, and blit the remaining levels
from old to new. Each blit turns into a batch (different frame-
buffer state), and is not immediately flushed, but just hangs out
in the batch cache. When the next blit (from glGenerateMipmap())
overwrites the contents from the back-blit, we realize this and
drop the previous rendering to the batch, so in many cases the
back-blit ends up discarded.
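
To make parts 1 and 2 a bit more concrete, here is a rough sketch of
the batch-cache lookup and the dependency tracking. It is not the code
from the patches: all of the names (fb_key, batch_from_fb, batch_read,
batch_write) are made up for illustration, the framebuffer key is cut
down to two surface ids, and a linear scan over a small array stands
in for the hashtable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the real freedreno structs. */
#define MAX_BATCHES 8

struct fb_key {
    int zsbuf;          /* id of the depth/stencil surface, 0 = none */
    int cbuf0;          /* id of the first color surface, 0 = none */
};

struct batch {
    bool in_use;
    struct fb_key key;  /* framebuffer state this batch renders to */
    unsigned deps;      /* bitmask of batches that must be flushed first */
};

struct resource {
    int writer;         /* batch with a pending write, -1 = none */
    unsigned readers;   /* bitmask of batches with pending reads */
};

struct batch_cache {
    struct batch batches[MAX_BATCHES];
};

static bool key_equal(struct fb_key a, struct fb_key b)
{
    return a.zsbuf == b.zsbuf && a.cbuf0 == b.cbuf0;
}

/* Map framebuffer state to an existing batch, or set up a new one.
 * Returns the batch index, or -1 if the cache is full (where a real
 * driver would have to flush something to make room). */
static int batch_from_fb(struct batch_cache *cache, struct fb_key key)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_BATCHES; i++) {
        struct batch *b = &cache->batches[i];
        if (b->in_use && key_equal(b->key, key))
            return i;
        if (!b->in_use && free_slot < 0)
            free_slot = i;
    }

    if (free_slot >= 0)
        cache->batches[free_slot] =
            (struct batch){ .in_use = true, .key = key };

    return free_slot;
}

/* Batch idx reads rsc: it must run after any other pending writer. */
static void batch_read(struct batch_cache *cache, int idx,
                       struct resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    if (rsc->writer >= 0 && rsc->writer != idx)
        cache->batches[idx].deps |= 1u << rsc->writer;
    rsc->readers |= 1u << idx;
}

/* Batch idx writes rsc: it must run after all other pending accesses. */
static void batch_write(struct batch_cache *cache, int idx,
                        struct resource *rsc)
{
    assert(idx >= 0 && idx < MAX_BATCHES);
    unsigned pending = rsc->readers;
    if (rsc->writer >= 0)
        pending |= 1u << rsc->writer;
    cache->batches[idx].deps |= pending & ~(1u << idx);
    rsc->writer = idx;
    rsc->readers = 0;
}
```

With this, two framebuffer states that share a zsbuf but differ in
cbuf0 end up in two different batches, and setting the first state
again returns the first batch instead of forcing a flush. The sketch
leaves out flushing in dependency order and what to do when a new
dependency would create a cycle.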
Results:
supertuxkart was a big winner, with an overall ~30% boost, making the
new render engine finally playable on most levels. Fps varies a lot
by level, but on average going from 14-19fps to 20-25fps.
(Sadly, the old render engine, which was much faster on lower end hw,
seems to be in disrepair.)
I did also add some instrumentation to collect some stats on # of
different sorts of batches. Since supertuxkart --profile-laps is
not repeatable, I could not directly compare results there, but I
could compare an apitrace replay of stk level:
  normal:  batch_sysmem=10398, batch_gmem=6958, batch_restore=3864
  reorder: batch_sysmem=16825, batch_gmem=6956, batch_restore=3863
  (for 792 frames)
I was expecting a drop in gmem batches, and restores, because stk
does two problematic things: (1) render target switches, ie. clear,
switch fb, clear, switch fb, draw, etc., and (2) mid-batch UBO
update.
I've looked a bit into the render target switches, but it seems like
it is mixing/matching zsbuf and cbufs in a way that makes them map
to different batches. Ie:
  set fb: zsbuf=A, cbuf[0]=B
  clear color0
  clear stencil
  set fb: zsbuf=A, cbuf[0]=C
  draw
Not entirely sure what to do about that. I suppose I could track the
cmdstream for the clears individually, and juggle them between batches
somehow to avoid the flush?
The mid-batch UBO update seems to actually happen between two fb states
with the same color0 and zs, but first treats color0 as R8G8B8A8_SRGB
and the next R8G8B8A8_UNORM. Probably we need a flush here anyways,
but use of glDiscardFramebuffer() in the app (and wiring up the driver
bits) could avoid a lot of restores.
Most of the gain seems to come from simply not stalling on the UBO
update.
xonotic also seems to be a winner, although I haven't analyzed it as
closely:
med: 48fps -> 52fps
high: 25fps -> 31fps
ultra: 15fps -> 19fps
and the batch stats show more of an improvement:
  med:
    normal:  batch_sysmem=0,     batch_gmem=18055, batch_restore=3748
    reorder: batch_sysmem=2220,  batch_gmem=14483, batch_restore=174
    (10510 frames)
  high:
    normal:  batch_sysmem=63072, batch_gmem=62692, batch_restore=48384
    reorder: batch_sysmem=65429, batch_gmem=58284, batch_restore=43971
    (10510 frames)
  ultra:
    normal:  batch_sysmem=63072, batch_gmem=81318, batch_restore=66863
    reorder: batch_sysmem=65869, batch_gmem=71360, batch_restore=56939
    (10510 frames)
So in all cases a nice drop in tile passes (batch_gmem) and reduction
in number of times we need to move back from system memory to tile
buffer (batch_restore). High/ultra still has a lot of restores per
frame, so maybe there is still some room for improvement. Not sure
yet if it is the same sort of thing going on as supertuxkart.
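
To put "a lot of restores per frame" into numbers, the raw counters
can be divided by the frame count. A small helper for that might look
like the following. The counters are copied from the xonotic results
above; the struct and function names are just made up for this
example.

```c
#include <assert.h>
#include <stdio.h>

struct batch_stats {
    const char *label;
    unsigned sysmem, gmem, restore, frames;
};

/* Counters copied from the xonotic results quoted above. */
static const struct batch_stats xonotic[] = {
    { "med/normal",        0, 18055,  3748, 10510 },
    { "med/reorder",    2220, 14483,   174, 10510 },
    { "high/normal",   63072, 62692, 48384, 10510 },
    { "high/reorder",  65429, 58284, 43971, 10510 },
    { "ultra/normal",  63072, 81318, 66863, 10510 },
    { "ultra/reorder", 65869, 71360, 56939, 10510 },
};

/* Average number of batches of one kind per frame. */
static double per_frame(unsigned count, unsigned frames)
{
    assert(frames > 0);
    return (double)count / (double)frames;
}

static void print_rates(const struct batch_stats *s)
{
    printf("%-14s sysmem=%.2f gmem=%.2f restore=%.2f (per frame)\n",
           s->label,
           per_frame(s->sysmem, s->frames),
           per_frame(s->gmem, s->frames),
           per_frame(s->restore, s->frames));
}
```

Calling print_rates() on each entry shows restores going from roughly
4.6 to 4.2 per frame on high and from roughly 6.4 to 5.4 on ultra,
while med drops to well under one restore per frame.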
I would expect to see some gains in manhattan and possibly trex, but
unfortunately it is mostly using compressed textures that util_blitter
cannot blit, so the resource shadowing back-blit ends up on the CPU
(which ends up flushing previous mipmap generation and stalling, which
kind of defeats the purpose). I'm not entirely sure what to do here.
Since we don't need scaling/filtering/etc we could map things to a
different format which can be rendered to, but I think we end up
needing to also lie about the width/height. Which works ok for fb
state (we take w/h from the pipe_surface, not the pipe_resource). But
not on the src (tex state) side. Possibly we could add w/h to
pipe_sampler_view to solve this? Solving this should at least bring
about +15% in manhattan, and maybe a bit in trex.
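
As a sketch of what "lying about the width/height" would mean: if the
compressed format stores fixed-size blocks (for example 4x4 texels per
block, as the DXT/S3TC formats do), then each block could be treated
as one texel of an uncompressed format with the same number of bytes,
and the aliased surface would be measured in blocks rather than
texels. The helper below only does that arithmetic. The names are
hypothetical and the block size is passed in by the caller; picking a
renderable format with a matching texel size is left out.

```c
#include <assert.h>

struct extent {
    unsigned width, height;
};

static unsigned div_round_up(unsigned n, unsigned d)
{
    return (n + d - 1) / d;
}

/* Size of a mipmap level along one axis, never smaller than 1. */
static unsigned minify(unsigned size, unsigned level)
{
    unsigned v = size >> level;
    return v ? v : 1;
}

/* Dimensions (in blocks) of one mipmap level of a block-compressed
 * texture, ie. the w/h to use when each block is aliased as a single
 * texel of an uncompressed format. */
static struct extent aliased_extent(unsigned width0, unsigned height0,
                                    unsigned level,
                                    unsigned block_w, unsigned block_h)
{
    assert(block_w > 0 && block_h > 0);
    struct extent e = {
        .width  = div_round_up(minify(width0, level), block_w),
        .height = div_round_up(minify(height0, level), block_h),
    };
    return e;
}
```

For example, level 0 of a 256x256 texture with 4x4 blocks would be
blitted as a 64x64 surface, and a 130x70 texture as 33x18, since
partial blocks at the edges still have to be copied whole.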
At any rate, the freedreno bits end up depending on some libdrm
patches[1] which in turn depend on some kernel stuff I have queued up
for 4.8. So it will be some time before it lands. But I'd like to
get the first three patches reviewed and pushed. And suggestions
about the remaining issues welcome, since there is still some room
for further gains.
[1] https://github.com/freedreno/libdrm/commits/fd-next
Rob Clark (12):
gallium/util: make util_copy_framebuffer_state(src=NULL) work
gallium: un-inline pipe_surface_desc
list: fix list_replace() for empty lists
freedreno: introduce fd_batch
freedreno: push resource tracking down into batch
freedreno: dynamically sized/growable cmd buffers
freedreno: move more batch related tracking to fd_batch
freedreno: add batch-cache
freedreno: batch re-ordering support
freedreno: spiff up some debug traces
freedreno: shadow textures if possible to avoid stall/flush
freedreno: support discarding previous rendering in special cases
src/gallium/auxiliary/util/u_framebuffer.c | 37 ++-
src/gallium/drivers/freedreno/Makefile.sources | 4 +
src/gallium/drivers/freedreno/a2xx/fd2_draw.c | 12 +-
src/gallium/drivers/freedreno/a2xx/fd2_emit.c | 15 +-
src/gallium/drivers/freedreno/a2xx/fd2_gmem.c | 63 ++---
src/gallium/drivers/freedreno/a3xx/fd3_context.c | 4 -
src/gallium/drivers/freedreno/a3xx/fd3_context.h | 5 -
src/gallium/drivers/freedreno/a3xx/fd3_draw.c | 23 +-
src/gallium/drivers/freedreno/a3xx/fd3_emit.c | 23 +-
src/gallium/drivers/freedreno/a3xx/fd3_emit.h | 2 +-
src/gallium/drivers/freedreno/a3xx/fd3_gmem.c | 146 +++++------
src/gallium/drivers/freedreno/a4xx/fd4_draw.c | 41 +--
src/gallium/drivers/freedreno/a4xx/fd4_draw.h | 13 +-
src/gallium/drivers/freedreno/a4xx/fd4_emit.c | 24 +-
src/gallium/drivers/freedreno/a4xx/fd4_emit.h | 2 +-
src/gallium/drivers/freedreno/a4xx/fd4_gmem.c | 122 ++++-----
src/gallium/drivers/freedreno/freedreno_batch.c | 280 +++++++++++++++++++++
src/gallium/drivers/freedreno/freedreno_batch.h | 152 +++++++++++
.../drivers/freedreno/freedreno_batch_cache.c | 246 ++++++++++++++++++
.../drivers/freedreno/freedreno_batch_cache.h | 51 ++++
src/gallium/drivers/freedreno/freedreno_context.c | 131 ++--------
src/gallium/drivers/freedreno/freedreno_context.h | 123 ++-------
src/gallium/drivers/freedreno/freedreno_draw.c | 132 +++++-----
src/gallium/drivers/freedreno/freedreno_draw.h | 15 +-
src/gallium/drivers/freedreno/freedreno_gmem.c | 110 ++++----
src/gallium/drivers/freedreno/freedreno_gmem.h | 6 +-
src/gallium/drivers/freedreno/freedreno_query_hw.c | 8 +-
src/gallium/drivers/freedreno/freedreno_resource.c | 242 ++++++++++++++++--
src/gallium/drivers/freedreno/freedreno_resource.h | 10 +-
src/gallium/drivers/freedreno/freedreno_screen.c | 9 +
src/gallium/drivers/freedreno/freedreno_screen.h | 2 +
src/gallium/drivers/freedreno/freedreno_state.c | 19 +-
src/gallium/drivers/freedreno/freedreno_util.h | 43 ++--
src/gallium/include/pipe/p_state.h | 23 +-
src/util/list.h | 14 +-
35 files changed, 1486 insertions(+), 666 deletions(-)
create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.c
create mode 100644 src/gallium/drivers/freedreno/freedreno_batch.h
create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.c
create mode 100644 src/gallium/drivers/freedreno/freedreno_batch_cache.h
--
2.7.4