[Mesa-dev] RFC: Haswell resource streamer/hw-generated binding tables (v2)

Wed Oct 16 04:50:25 PDT 2013

On Tuesday, October 15, 2013 10:29:16 AM Eric Anholt wrote:
> Abdiel Janulgue <abdiel.janulgue at linux.intel.com> writes:
> > On Monday, October 14, 2013 10:50:24 AM Eric Anholt wrote:
> >> Abdiel Janulgue <abdiel.janulgue at linux.intel.com> writes:
> >> > One optimization idea that I had in mind a few months ago was to find a
> >> > way to reduce emission of surface state objects. Currently, we rebuild
> >> > surface states every time we generate binding tables. The idea is to
> >> > basically relocate the surface state indirect state objects to a
> >> > separate buffer object from the command batch. Using the resource
> >> > streamer, we can then publish the deltas when the indices referring
> >> > to them need to be changed.
> >> > 
> >> > So whenever a surface needs to be used, instead of rebuilding the whole
> >> > binding table structure the driver can essentially say on a per-slot
> >> > basis "hey, a surface got activated but it was previously bound to
> >> > index 10, let's rebind it to index 12".
> >> > 
> >> > This potentially reduces the CPU overhead of generating and uploading
> >> > binding tables. I did a previous experiment and found that it reduced
> >> > generation of surface states by as much as 99% with this approach.
> >> > What do you think?
> >> 
> >> This has the downside that new batches implicitly reference the surfaces
> >> that were referenced by old batches.  Imagine a video player that's
> >> uploading a new frame to a new BO every time -- until the surface cache
> >> BO wraps, the old BO stays referenced and app memory usage just goes up
> >> and up.  The workaround is to have a hash table you look into at BO free
> >> time that tells you what relocations to rip out of the surface cache.
> > 
> > Why not use a sufficiently-sized surface cache buffer and as soon as
> > it approaches the wrap limit either (1) do a clear_relocs then flush
> > batch or (2) just throw it away and reallocate a new one? To reduce
> > overhead, it has to be sufficiently sized so that the wrap limit does
> > not get reached too often.
> > 
> > Aside from the video player case, this would help games that do not
> > upload textures every frame. But I guess most games should do that
> > already, no?
> 
> I'm not clear on what you're proposing that would not (effectively) leak
> memory when applications are doing new buffer allocation every frame.

I'm not sure which allocation you had in mind. If you are referring 
to the bo of the actual surface itself, isn't it already
refcounted so that at some point it will free itself?

What I had in mind is something similar to this:

batch bo              
        | (reloc)    
        |         
        +--> surface_cache bo --+--> (reloc) surf a
        |                       |
        | (reloc)               +--> (reloc) surf b 
batch bo                        |
                                +--> (reloc) surf c

Yes, each surface's relocation entry in the surface_cache bo does add an
extra reference to that surface (surfaces already in the cache should be
skipped). But clearing the surface_cache bo once in a while would
effectively unreference those old surfaces. Unless I'm still missing
something, I don't see why this would leak?

Note that the relocation offsets of the surfaces are relative to the
surface_cache base address rather than to the batchbuffer.
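
To make the reference-counting argument concrete, here is a minimal toy sketch in C of the scheme described above. It is not driver code: `toy_bo`, `surface_cache`, the slot count and the 32-byte entry size are all hypothetical stand-ins chosen for illustration, and the "relocation" is modelled as nothing more than a pointer plus a refcount bump. It only shows that a cached surface is skipped on re-use, that the returned offset is relative to the cache base, and that clearing the cache at the wrap limit drops the extra references.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SURFACE_STATE_SIZE 32u /* assumed bytes per cached surface state */
#define CACHE_SLOTS 4u         /* tiny on purpose, so the wrap is easy to hit */

/* Toy stand-in for a refcounted buffer object (not the real libdrm type). */
struct toy_bo {
    int refcount;
};

/* Toy surface cache: each occupied slot models one relocation to a surface. */
struct surface_cache {
    struct toy_bo *surf[CACHE_SLOTS];
    unsigned used;
};

static void toy_bo_ref(struct toy_bo *bo)
{
    bo->refcount++;
}

static void toy_bo_unref(struct toy_bo *bo)
{
    assert(bo->refcount > 0);
    bo->refcount--;
}

/* Drop every relocation held by the cache, releasing the extra references. */
static void surface_cache_clear(struct surface_cache *cache)
{
    for (unsigned i = 0; i < cache->used; i++) {
        toy_bo_unref(cache->surf[i]);
        cache->surf[i] = NULL;
    }
    cache->used = 0;
}

/*
 * Return the byte offset of the surface's state, relative to the cache base.
 * A surface that is already cached is skipped: no new slot, no new reference.
 */
static uint32_t surface_cache_get(struct surface_cache *cache,
                                  struct toy_bo *bo)
{
    for (unsigned i = 0; i < cache->used; i++) {
        if (cache->surf[i] == bo)
            return i * SURFACE_STATE_SIZE;
    }

    /* Wrap limit reached: clear the cache, which unreferences old surfaces. */
    if (cache->used == CACHE_SLOTS)
        surface_cache_clear(cache);

    cache->surf[cache->used] = bo;
    toy_bo_ref(bo);
    return cache->used++ * SURFACE_STATE_SIZE;
}
```

In this model a surface that the application has already released keeps one extra reference only until the next `surface_cache_clear()`, so the lifetime extension is bounded by how often the cache wraps. A real driver would also have to flush or otherwise retire batches that still point into the cache before clearing it, which the sketch does not model.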