[Freedreno] tilers and out-of-order rendering..

Thu May 19 21:14:35 UTC 2016

So some rendering patterns that I've seen in apps turn out to be
somewhat evil for tiling GPUs.. a couple of cases I've seen:

1) stk has some silliness where it binds an fbo, clears, binds another
fbo, clears, binds the previous fbo and draws, and so on.  This one is
probably not too hard to just fix in stk.

2) I've seen a render pattern in manhattan where the app does a bunch
of texture uploads mid-frame via a pbo (and then generates mipmap
levels for the updated texture, which hits the blit path, which changes
fb state and forces a flush).  This one is probably not something that
can be fixed in the app ;-)

There are probably other cases where this comes up which I haven't
noticed yet.  I'm not entirely sure how common the pattern that I see
in manhattan is.

At one point, Eric Anholt mentioned the idea of tracking rendering
cmdstream per render-target, as well as dependency information between
these different sets of cmdstream (if you render to one fbo, then turn
around and sample from it, the rendering needs to happen before the
sampling).  I've been thinking a bit about how this would actually
work, and trying to do some experiments to get an idea about how
useful this would be.

In the manhattan case, via a bit of a hack (to basically no-op the
pipe->blit() to avoid interrupting the tiling pass), I guesstimate that
if we were able to re-order the rendering it would gain us something
around 15%.  (This is on ifc6540.. the win might be bigger on
something more memory bandwidth constrained.)

To realize the benefit we would need a bit more cleverness in
pipe->transfer_map to notice that the whole texture contents are being
updated and turn the DISCARD_RANGE into DISCARD_WHOLE_RESOURCE.  The
problem being, I think, that it is only discarding the first mipmap
level, so we'd need to realize that in the new buffer the additional
mipmap levels aren't valid.. no idea how that would work.. but in this
case it seems like mostly a smallish (128x128) texture, so maybe it is
a win to just memcpy the rest of the old texture data over to the new
texture bo to avoid the stall/flush.
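A rough sketch of that memcpy fallback, just to make the idea concrete.
Everything here is hypothetical (struct mip_slice, copy_untouched_levels
and the packed linear layout are made up for illustration and are not
freedreno's real resource layout), and it assumes both buffers are
already CPU-mapped:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of where each mip level lives inside a
 * linearly packed texture buffer (assumption, not the real layout).
 */
struct mip_slice {
   size_t offset;   /* byte offset of the level within the bo */
   size_t size;     /* size of the level in bytes */
};

/* Copy every level except 'discarded_level' from the old buffer into
 * the freshly allocated one, so the app can overwrite the discarded
 * level in new_bo without waiting on rendering that still uses old_bo.
 * Returns the number of bytes copied.
 */
static size_t
copy_untouched_levels(uint8_t *new_bo, const uint8_t *old_bo,
                      const struct mip_slice *slices, unsigned num_levels,
                      unsigned discarded_level)
{
   size_t copied = 0;

   assert(discarded_level < num_levels);

   for (unsigned i = 0; i < num_levels; i++) {
      if (i == discarded_level)
         continue;   /* about to be overwritten, old contents not needed */
      memcpy(new_bo + slices[i].offset, old_bo + slices[i].offset,
             slices[i].size);
      copied += slices[i].size;
   }

   return copied;
}
```

For a 128x128 texture the levels other than level 0 add up to roughly a
third of the size of level 0, so the copy should be cheap compared to a
stall, but that would need measuring.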

Anyways, the basic idea involves turning pipe_framebuffer_state into a
refcnt'd CSO inside the driver, and using that as the point to track
rendering cmds and dependency info.  (It would be kinda nice if fb
state was already a CSO.. but I guess we can work around that in the
driver using the pipe_framebuffer_state as the hashtable key..
hopefully we can rely on not having garbage data in unused cbuf slots?
 Otherwise we might need a custom hash/equals fxn.)  So something
like:

   /* framebuffer CSO: */
   /* TODO maybe it is more clear to call it fd_batch? */
   struct fd_framebuffer_state {
      struct pipe_reference refcnt;
      struct pipe_framebuffer_state base;
      struct fd_context *ctx;
      struct fd_ringbuffer *ring;
      /* hashset of fd_framebuffer_state(s) that we depend on: */
      struct set *dependencies;
      bool dirty;
   };

When new fb state is set, do a hashtable lookup and increment the
refcnt of the existing CSO if there is one, else create a new state
object.  And unref the outgoing CSO.  Whenever there is unflushed
rendering to a prsc (pipe_resource), the prsc would also need to hold a
ref to the most recent fb CSO which renders to the prsc, to keep the fb
CSO live as long as something depends on it.  Also we need to hold refs
to all the entries in the dependencies table.
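To illustrate the lookup-or-create plus refcnt flow, here is a minimal
standalone sketch.  It is not the driver code: struct fb_key is a
simplified stand-in for pipe_framebuffer_state, a small linear array
stands in for the hashtable, and all of the names (fb_key, fb_cso,
fb_cache_get, fb_cso_unref) are hypothetical.  The custom equals fxn
only compares the first nr_cbufs slots, which is one way to cope with
garbage in unused cbuf slots:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_CBUFS  8
#define MAX_CACHED 16

/* Simplified stand-in for pipe_framebuffer_state (assumption) */
struct fb_key {
   unsigned width, height;
   unsigned nr_cbufs;
   const void *cbufs[MAX_CBUFS];
   const void *zsbuf;
};

struct fb_cso {
   int refcnt;
   struct fb_key key;
};

/* Linear array standing in for the hashtable */
struct fb_cache {
   struct fb_cso *entries[MAX_CACHED];
};

/* Custom equals: ignores whatever is in cbufs[nr_cbufs..] */
static bool
fb_key_equal(const struct fb_key *a, const struct fb_key *b)
{
   if (a->width != b->width || a->height != b->height ||
       a->nr_cbufs != b->nr_cbufs || a->zsbuf != b->zsbuf)
      return false;
   for (unsigned i = 0; i < a->nr_cbufs; i++)
      if (a->cbufs[i] != b->cbufs[i])
         return false;
   return true;
}

/* Take a ref on the existing CSO if there is one, else create it */
static struct fb_cso *
fb_cache_get(struct fb_cache *cache, const struct fb_key *key)
{
   int free_slot = -1;

   for (int i = 0; i < MAX_CACHED; i++) {
      struct fb_cso *cso = cache->entries[i];
      if (!cso) {
         if (free_slot < 0)
            free_slot = i;
      } else if (fb_key_equal(&cso->key, key)) {
         cso->refcnt++;
         return cso;
      }
   }

   assert(free_slot >= 0);   /* sketch: no eviction when full */
   struct fb_cso *cso = calloc(1, sizeof(*cso));
   assert(cso);
   cso->refcnt = 1;
   cso->key = *key;
   cache->entries[free_slot] = cso;
   return cso;
}

/* Drop a ref; the last unref removes the CSO from the cache */
static void
fb_cso_unref(struct fb_cache *cache, struct fb_cso *cso)
{
   if (--cso->refcnt > 0)
      return;
   for (int i = 0; i < MAX_CACHED; i++)
      if (cache->entries[i] == cso)
         cache->entries[i] = NULL;
   free(cso);
}
```

In the real thing the context, each prsc with unflushed rendering, and
each dependent fb CSO would all be holders of these refs.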

Whenever we emit a reference to another prsc (texture, vbo, index
buffer, etc), we'd have to check if it has pending rendering in a
different fb CSO.  I think for the most part we could replace
OUT_RELOC(fd_bo *) helper with OUT_PRSC(pipe_resource *).. so
something roughly like:

   struct fd_resource {
      struct u_resource base;
      ...
-     struct fd_context *pending_ctx;
+     /* hold ref to most recent fb CSO that rendered to us: */
+     struct fd_framebuffer_state *pending_fb;
   };

   static inline void
   OUT_RSC(struct fd_ringbuffer *ring, struct fd_resource *rsc)
   {
       if (rsc->pending_fb && rsc->pending_fb->dirty) {
          /* a bit ugly to chase the current ctx ptr this way, but
           * OUT_RING() is already used in a lot of places that
           * don't have ctx ptr handy..
           */
          struct fd_context *ctx = rsc->pending_fb->ctx;

          /* check for reverse dependency.. if other fb CSO already
           * depends on current fb then we cannot create a loop:
           */
          if (depends_on(rsc->pending_fb, ctx->fb)) {
             fd_context_render(ctx, ctx->fb);
          } else {
             .. add rsc->pending_fb to ctx->fb->dependencies ..
          }
       }
       OUT_RING(ring, rsc->bo);
   }

   static inline void
   OUT_PRSC(struct fd_ringbuffer *ring, struct pipe_resource *prsc)
   {
       OUT_RSC(ring, fd_resource(prsc));
   }
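The depends_on() check and the "render dependencies first" ordering
could look roughly like the standalone sketch below.  This is only an
illustration under assumptions: struct batch is a hypothetical
stand-in for the fb CSO, a small fixed array stands in for the
dependencies hashset, and add_dependency()/flush_batch() are made-up
names.  It relies on the dependency graph staying acyclic, which is
exactly what the reverse-dependency check is there to guarantee:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPS 8

/* Hypothetical stand-in for the fb CSO / batch */
struct batch {
   const char *name;
   struct batch *deps[MAX_DEPS];   /* batches that must render before us */
   unsigned num_deps;
   bool dirty;                     /* has unflushed rendering */
};

/* True if 'a' depends on 'b', directly or transitively.  Terminates
 * because add_dependency() never lets a cycle form.
 */
static bool
depends_on(const struct batch *a, const struct batch *b)
{
   for (unsigned i = 0; i < a->num_deps; i++)
      if (a->deps[i] == b || depends_on(a->deps[i], b))
         return true;
   return false;
}

/* Record that 'batch' reads something that 'producer' renders.
 * Returns false if that would create a loop, in which case the caller
 * has to flush instead of re-ordering.
 */
static bool
add_dependency(struct batch *batch, struct batch *producer)
{
   if (producer == batch)
      return true;                 /* nothing to order against */
   if (depends_on(producer, batch))
      return false;                /* reverse dependency already exists */
   for (unsigned i = 0; i < batch->num_deps; i++)
      if (batch->deps[i] == producer)
         return true;              /* already recorded */
   assert(batch->num_deps < MAX_DEPS);
   batch->deps[batch->num_deps++] = producer;
   return true;
}

/* Flush 'batch' after everything it depends on.  Appends the batches
 * to 'order' in the order they would be submitted and returns the new
 * count.
 */
static unsigned
flush_batch(struct batch *batch, const struct batch **order, unsigned n)
{
   if (!batch->dirty)
      return n;
   batch->dirty = false;
   for (unsigned i = 0; i < batch->num_deps; i++)
      n = flush_batch(batch->deps[i], order, n);
   batch->num_deps = 0;            /* dependencies are satisfied now */
   order[n++] = batch;
   return n;
}
```

So with a shadow pass, a scene pass that samples the shadow map, and a
post pass that samples the scene, flushing the post pass would submit
shadow, then scene, then post, no matter what order the app issued
them in.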

TODO:
1) how would queries work when we start re-ordering rendering?
   I guess we need a query results bo per fb CSO and the query
   needs to hold ref's to all the fb CSO's that were active for
   the duration of the query?  Timestamp queries would have
   truly nonsensical results (but that is already more or less
   the case for tilers)
2) what happens w/ prsc's shared across multiple pipe_context's?
   I guess we get a pipe->flush() otherwise sharing would never
   work, so maybe that is good enough?
3) anything useful to extract out into helpers?  I guess vc4 and
   freedreno more or less want the same thing..
4) what gremlins have I not imagined yet?  There seem to be a
   lot of ways to get this wrong..