[Freedreno] [Mesa-dev] tilers and out-of-order rendering..

Fri Jun 3 12:53:20 UTC 2016

Ok, so I had a really evil thought that I wanted to bounce off
people..  it's a quite different approach from the more obvious one
discussed below (and which I've already started implementing)

Basically, the idea is to have a wrapper pipe driver, similar to
ddebug/rbug/trace/etc, which re-orders draw calls.  All the CSO
objects would have to be wrapped in a refcounted thing, so
pending draws could hang on to their associated state.  For things
that are not refcounted (draw_info, and all the non-CSO state) there
would unfortunately be some memcpy involved.. not sure how bad that
would be, but it seems like the thing that could doom the idea?
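
To make the refcount-vs-memcpy split concrete, here is a rough sketch
of what a queued draw might hold on to.  None of this is real gallium
code: wrapped_cso, draw_params (a stand-in for pipe_draw_info) and
pending_draw are made-up names, and only one CSO slot is shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted wrapper around a driver CSO (blend, etc). */
struct wrapped_cso {
    int refcount;
    void *driver_cso;                   /* opaque object from the real driver */
    void (*destroy)(void *driver_cso);  /* hypothetical driver destroy hook */
};

/* Stand-in for pipe_draw_info; the real struct has many more fields. */
struct draw_params {
    unsigned start, count, instance_count;
};

/* A queued draw: references to CSOs, by-value copies of everything else. */
struct pending_draw {
    struct wrapped_cso *blend;
    struct draw_params info;            /* not refcounted, so copied */
};

static struct wrapped_cso *cso_create(void *driver_cso,
                                      void (*destroy)(void *))
{
    struct wrapped_cso *cso = calloc(1, sizeof(*cso));
    if (!cso)
        return NULL;
    cso->refcount = 1;                  /* reference owned by the app/state tracker */
    cso->driver_cso = driver_cso;
    cso->destroy = destroy;
    return cso;
}

static struct wrapped_cso *cso_ref(struct wrapped_cso *cso)
{
    cso->refcount++;
    return cso;
}

static void cso_unref(struct wrapped_cso *cso)
{
    assert(cso->refcount > 0);
    if (--cso->refcount == 0) {
        if (cso->destroy)
            cso->destroy(cso->driver_cso);
        free(cso);
    }
}

/* Queue a draw: take a reference on the CSO, memcpy the draw info. */
static void pending_draw_init(struct pending_draw *draw,
                              struct wrapped_cso *blend,
                              const struct draw_params *info)
{
    draw->blend = cso_ref(blend);
    memcpy(&draw->info, info, sizeof(*info));
}

/* Called once the draw has been replayed into the real driver. */
static void pending_draw_fini(struct pending_draw *draw)
{
    cso_unref(draw->blend);
    draw->blend = NULL;
}
```

The point of the sketch is that the app deleting a CSO only drops the
app's reference; the wrapped object stays alive until the last pending
draw that uses it has been replayed.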

The nice thing is it becomes basically free to turn on/off for
different drivers, at least at screen create time.. it gets 100%
re-use, rather than having to re-implement the concepts in each
(tiler) driver.

Not sure if we need a way to turn it on/off at context create time,
but either way it would be nice if it were somehow a driconf option so
that it could be enabled/disabled per app, so as not to penalize
properly written apps.

Thoughts?

----

Semi-related issue, which applies to either of the draw-reordering
approaches.  A frequent pattern is:

   ... bunch of draws ...
   glTexSubImage2D()
   glGenerateMipmap()
   ... bunch more draws ...
   ... repeat sequence a bunch of times with same texture ...

That glTexSubImage2D() comes to the driver as
transfer_map(DISCARD_RANGE).  At this point the backing bo is likely
to be busy (since the above sequence repeats a bunch of times with the
same texture).  So the best we can do is discard the whole bo and
schedule blit(s) for the remaining levels into the new bo.
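
Roughly, the decision looks like the sketch below.  This is not the
actual freedreno code: struct tex and map_discard_range() are made up
for illustration, and it assumes the upload covers the whole mapped
level, so only the other levels need to be carried over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified texture: one backing bo holds all mip levels. */
struct tex {
    unsigned last_level;     /* index of the smallest mip level */
    bool bo_busy;            /* GPU still has pending access to the bo */
    unsigned bo_generation;  /* bumped each time the bo is replaced */
};

/*
 * Handle a DISCARD_RANGE map of one level.  Returns a bitmask of the
 * levels that must be blitted from the old bo into the new one, or 0
 * if the existing bo is idle and can be mapped in place.
 */
static uint32_t map_discard_range(struct tex *t, unsigned level)
{
    assert(t->last_level < 31 && level <= t->last_level);

    if (!t->bo_busy)
        return 0;                       /* no stall, no reallocation needed */

    /* Swap in a fresh bo instead of waiting for the GPU. */
    t->bo_generation++;
    t->bo_busy = false;

    /* Every level except the one being uploaded has to be preserved. */
    uint32_t all_levels = (1u << (t->last_level + 1)) - 1;
    return all_levels & ~(1u << level);
}
```

So for a four-level texture with a busy bo, mapping level 0 leaves
three blits queued (levels 1..3).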

But then at the glGenerateMipmap() step, we overwrite the contents of
all the other levels.  Which means if the driver (or re-ordering
wrapper layer) had some extra hints, the blits triggered by the
transfer_map() could be skipped.

What I'm thinking would be a simple solution is to have an extra field
in pipe_draw_info so that internal blits (like mipmap generation)
could hint to the driver that the entire previous contents of the
render target are discarded.  (Or possibly we want it more
fine-grained, to indicate which render-targets and z/s are discarded,
if not all?  But that doesn't seem useful.)  This could help tell
tilers that they could discard previous blits (and even skip
system-memory -> tile transfer).
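
Something like the following is what I have in mind, very roughly.
The field name discard_contents, the draw_hint_info stand-in for
pipe_draw_info, and the rt_state bookkeeping are all hypothetical;
nothing like this exists in gallium today as far as this mail is
concerned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for pipe_draw_info with the proposed hint added. */
struct draw_hint_info {
    unsigned count;
    bool discard_contents;   /* hypothetical: draw overwrites the whole target */
};

/* Hypothetical per-render-target bookkeeping in a tiler driver. */
struct rt_state {
    uint32_t pending_blits;  /* queued blits that write this target */
    bool needs_restore;      /* load system memory into the tile before drawing */
};

/*
 * Apply the hint before queuing a draw.  Returns how many pending
 * blits were dropped because the draw overwrites their results.
 */
static unsigned apply_discard_hint(struct rt_state *rt,
                                   const struct draw_hint_info *info)
{
    if (!info->discard_contents)
        return 0;

    unsigned dropped = rt->pending_blits;
    rt->pending_blits = 0;       /* results would be overwritten anyway */
    rt->needs_restore = false;   /* skip system-memory -> tile transfer */
    return dropped;
}
```

The mipmap-generation blit would set the hint on its draws, so the
level copies queued at transfer_map() time get thrown away instead of
executed.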

(Hell, there might even be some use to apps to expose the "this draw
discards previous contents" type extension..  given some of the wonky
vendor extensions I've seen, I wouldn't be surprised if it already
existed.)

Thoughts?

BR,
-R

On Fri, May 20, 2016 at 10:51 AM, Rob Clark <robdclark at gmail.com> wrote:
> On Fri, May 20, 2016 at 3:35 AM, Jose Fonseca <jfonseca at vmware.com> wrote:
>> On 20/05/16 00:34, Rob Clark wrote:
>>>
>>> On Thu, May 19, 2016 at 6:21 PM, Eric Anholt <eric at anholt.net> wrote:
>>>>
>>>> Rob Clark <robdclark at gmail.com> writes:
>>>>
>>>>> So some rendering patterns that I've seen in apps turn out to be
>>>>> somewhat evil for tiling gpu's.. couple cases I've seen:
>>>>>
>>>>> 1) stk has some silliness where it binds an fbo, clears, binds another
>>>>> fbo, clears, binds the previous fbo and draws, and so on.  This one is
>>>>> probably not too hard to just fix in stk.
>>>>>
>>>>> 2) I've seen a render pattern in manhattan where app does a bunch of
>>>>> texture uploads mid-frame via a pbo (and then generates mipmap levels
>>>>> for the updated texture, which hits the blit path which changes fb
>>>>> state and forces a flush).  This one probably not something that can
>>>>> be fixed in the app ;-)
>>>>>
>>>>> There are probably other cases where this comes up which I haven't
>>>>> noticed yet.  I'm not entirely sure how common the pattern that I see
>>>>> in manhattan is.
>>>>>
>>>>> At one point, Eric Anholt mentioned the idea of tracking rendering
>>>>> cmdstream per render-target, as well as dependency information between
>>>>> these different sets of cmdstream (if you render to one fbo, then turn
>>>>> around and sample from it, the rendering needs to happen before the
>>>>> sampling).  I've been thinking a bit about how this would actually
>>>>> work, and trying to do some experiments to get an idea about how
>>>>> useful this would be.
>>>>
>>>>
>>>> My plan was pretty much what you laid out here, except I was going to
>>>> just map to my CL struct with a little hash table from the FB state
>>>> members since FB state isn't a CSO.
>>>
>>>
>>> ok, yeah, I guess that solves the naming conflict (fd_batch(_state)
>>> sounds nicer for what its purpose really is than
>>> fd_framebuffer_state)
>>>
>>> BR,
>>> -R
>>
>>
>> llvmpipe is also a tiler and we've seen similar patterns.  Flushing reduces
>> caching effectiveness, however in llvmpipe quite often texture sampling is
>> the bottleneck, and an additional flush doesn't make a huge difference.
>>
>
> interesting, it hadn't occurred to me about llvmpipe
>
>>
>> I think the internal hash table as Eric proposes seems a better first step.
>>
>> Later on we could try to make framebuffer state a first class cso, but I
>> suspect you'll probably want to walk internally all pending FBOs' CLs anyway
>> (to see which need to be flushed on transfers.)
>>
>> So first changing the driver internals, then abstracting if there are
>> commonalities, seems a more effective way forward.
>
>
> yeah, makes sense.. and I'm planning to go w/ Eric's idea to keep
> fd_batch separate from framebuffer state.
>
> It did occur to me that I forgot to think about the write-after-read
> hazard case.  Those need to be handled with an extra dependency
> between batches too.
>
> And at least for this particular case, I need somehow some cleverness
> to discard or clone the old bo to avoid that write-after-read forcing
> a flush.  (Maybe in transfer_map?  But I guess there are other paths..
> hmm..)
>
> BR,
> -R
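
For reference, the per-render-target batch tracking discussed in the
quoted thread could be sketched along these lines.  This is only an
illustration under simplifying assumptions: a linear scan stands in
for the hash table keyed on framebuffer state, resources are bits in a
mask, and all names (batch, batch_ctx, batch_get, ...) are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BATCHES 8
#define FLUSH_LOG_SIZE 32

/* Hypothetical per-render-target batch of queued cmdstream. */
struct batch {
    bool in_use;
    uint32_t fb_key;   /* stand-in for a hash of the framebuffer state */
    uint32_t deps;     /* bit i set: batch i must be flushed before this one */
    uint32_t reads;    /* resource ids sampled by this batch */
    uint32_t writes;   /* resource ids rendered to by this batch */
};

struct batch_ctx {
    struct batch batches[MAX_BATCHES];
    uint32_t flushed[FLUSH_LOG_SIZE];  /* fb_key of flushed batches, in order */
    unsigned num_flushed;
};

/* Find the batch for this framebuffer state or start one; -1 if full. */
static int batch_get(struct batch_ctx *c, uint32_t fb_key)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_BATCHES; i++) {
        struct batch *b = &c->batches[i];
        if (b->in_use && b->fb_key == fb_key)
            return i;
        if (!b->in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        struct batch fresh = { .in_use = true, .fb_key = fb_key };
        c->batches[free_slot] = fresh;
    }
    return free_slot;
}

/* Record that batch idx reads or writes the given resources. */
static void batch_access(struct batch_ctx *c, int idx, uint32_t res, bool write)
{
    struct batch *b = &c->batches[idx];
    for (int i = 0; i < MAX_BATCHES; i++) {
        const struct batch *o = &c->batches[i];
        if (i == idx || !o->in_use)
            continue;
        if (o->writes & res)            /* read/write-after-write */
            b->deps |= 1u << i;
        if (write && (o->reads & res))  /* write-after-read */
            b->deps |= 1u << i;
    }
    if (write)
        b->writes |= res;
    else
        b->reads |= res;
}

/* Flush a batch after flushing everything it depends on. */
static void batch_flush(struct batch_ctx *c, int idx)
{
    struct batch *b = &c->batches[idx];
    if (!b->in_use)
        return;
    uint32_t deps = b->deps;
    b->deps = 0;   /* cleared first so a dependency cycle cannot recurse forever */
    for (int i = 0; i < MAX_BATCHES; i++)
        if (deps & (1u << i))
            batch_flush(c, i);
    if (!b->in_use)
        return;    /* already flushed while walking a cycle */
    assert(c->num_flushed < FLUSH_LOG_SIZE);
    c->flushed[c->num_flushed++] = b->fb_key;
    b->in_use = false;
    for (int i = 0; i < MAX_BATCHES; i++)
        c->batches[i].deps &= ~(1u << idx);  /* slot may be reused later */
}
```

Rendering to a texture in one batch and then sampling it from another
makes the second batch depend on the first, so flushing the second
submits both in the right order.  The write-after-read case adds the
same kind of dependency in the other direction, which is exactly what
discarding or cloning the old bo would avoid.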