[Mesa-dev] GBM and the Device Memory Allocator Proposals

Wed Dec 6 15:45:41 UTC 2017

On 06.12.2017 14:25, Rob Clark wrote:
> On Wed, Dec 6, 2017 at 2:07 AM, James Jones <jajones at nvidia.com> wrote:
>> Note I also posed the question of whether things like cached (and similarly
>> compression, since I view compression as roughly an equivalent mechanism to
>> a cache) should be capabilities at all in one of the open issues on my XDC
>> 2017 slides, because of this very problem of over-pruning they cause.  It's
>> on slide 15, as "No device-local capabilities".  You'll have to listen to my
>> coverage of it in the recorded presentation for that slide to make any
>> sense, but it's the same thing Nicolai has laid out here.
>>
>> As I continued working through our prototype driver support, I found I
>> didn't actually need to include cached or compressed as capabilities: The
>> GPU just applies them as needed and the usage transitions make it
>> transparent to the non-GPU engines.  That does mean the GPU driver currently
>> needs to be the one to realize the allocation from the capability set to get
>> optimal behavior.  We could fix that by reworking our driver though.  At
>> this point, not including device-local properties like on-device caching in
>> capabilities seems like the right solution to me.  I'm curious whether this
>> applies universally though, or if other hardware doesn't fit the
>> "compression and stuff all behaves like a cache" idiom.
>>
> 
> Possibly a SoC(ish) type device which has a "system" cache that some
> but not all devices fall into.  I *think* the intel chips w/ EDRAM
> might fall into this category.  I know the idea has come up elsewhere,
> although not sure if anything like that ended up in production.  It
> seems like something we'd at least want to have an idea how to deal
> with, even if it isn't used for device internal caches.
> 
> Not sure if a similar situation could come up w/ discrete GPU and video
> decode/encode engines on the same die?

It definitely could. Our GPUs currently don't have shared caches between 
gfx and video engines, but moving more and more clients under a shared 
L2 cache has been a theme over the last few generations. I doubt that's 
going to happen for the video engines any time soon, but you never know.

I don't think we really need caches as a capability for our current 
GPUs, but it may change, and in any case, we do want compression as a 
capability.

> [snip]
>>> I think I like the idea of having transitions being part of the
>>> per-device/engine cap sets, so that such information can be used upon
>>> merging to know which capabilities may remain or have to be dropped.
>>>
>>> I think James's proposal for usage transitions was intended to work
>>> with flows like:
>>>
>>>     1. App gets GPU caps for RENDER usage
>>>     2. App allocates GPU memory using a layout from (1)
>>>     3. App now decides it wants to use the buffer for SCANOUT
>>>     4. App queries usage transition metadata from RENDER to SCANOUT,
>>>        given the current memory layout.
>>>     5. Do the transition and hand the buffer off to display
>>
>>
>> No, all usages the app intends to transition to must be specified up front
>> when initially querying caps in the model I assumed.  The app then specifies
>> some subset (up to the full set) of the specified usages as a src and dst
>> when querying transition metadata.
>>
>>> The problem I see with this is that it isn't guaranteed that there will
>>> be a chain of transitions for the buffer to be usable by display.
>>
> 
> hmm, I guess if a buffer *can* be shared across all uses, there by
> definition has to be a chain of transitions to go from any
> usage+device to any other usage+device.
> 
> Possibly a separate step to query transitions avoids solving for every
> possible transition when merging the caps set.. although until you do
> that query I don't think you know the resulting merged caps set is
> valid.
> 
> Maybe in practice for every cap FOO there exists a FOO->null (or
> FOO->generic if you prefer) transition, ie. compressed->uncompressed,
> cached->clean, etc.  I suppose that makes the problem easier to solve.

It really would, to the extent that I would prefer if we could bake it 
into the system as an assumption.

I have my doubts about how to manage calculating transitions cleanly at 
all without it. The metadata stuff is very vague to me.
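
To make the "every cap FOO has a FOO->generic transition" assumption
concrete, here is a rough sketch of how merging could look if we baked it
in. None of this is the prototype's actual API: the cap names, merge_caps()
and transitions_for() are all made up for illustration, and caps are reduced
to plain strings.

```python
# Hypothetical sketch: capability names and both helpers are illustrative
# assumptions, not part of any existing allocator API.

def merge_caps(device_caps, null_transitions):
    """Merge per-device cap sets.

    A cap survives the merge if every device supports it, or if it can
    be dropped again via a FOO->generic transition before the buffer is
    handed to a device that does not understand it.
    """
    all_sets = list(device_caps.values())
    common = set.intersection(*all_sets)
    union = set.union(*all_sets)
    droppable = {cap for cap in union - common if cap in null_transitions}
    return common | droppable


def transitions_for(active_caps, dst_device, device_caps):
    """List the FOO->generic transitions needed before dst_device can use the buffer."""
    unsupported = active_caps - device_caps[dst_device]
    return sorted(f"{cap}->generic" for cap in unsupported)


device_caps = {
    "gpu": {"linear", "compressed", "cached"},
    "display": {"linear"},
}

# Assumption under discussion: these caps can always be undone.
merged = merge_caps(device_caps, null_transitions={"compressed", "cached"})
print(sorted(merged))
print(transitions_for(merged, "display", device_caps))
```

With a plain intersection the merged set would be pruned down to "linear"
only. With the assumption, the GPU keeps compression and caching, and the
hand-off to display is just the list of caps to undo, with no search over
transition chains.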

>> I hadn't thought hard about it, but my initial thoughts were that it would
>> be required that the driver support transitioning to any single usage given
>> the capabilities returned.  However, transitioning to multiple usages (E.g.,
>> to simultaneously rendering and scanning out) could fail to produce a valid
>> transition, in which case the app would have to fall back to a copy or
>> avoid that simultaneous usage combination in some other way.
>>
>>> Adding transition metadata to the original capability sets, and using
>>> that information when merging could give us a compatible memory layout
>>> that would be usable by both GPU and display.
>>>
>>> I'll look into extending the current merging logic to also take into
>>> account transitions.
>>
>>
>> Yes, it'll be good to see whether this can be made to work.  I agree Rob's
>> example outcomes above are ideal, but it's not clear to me how to code up
>> such an algorithm.  This also all seems unnecessary if "device local"
>> capabilities aren't needed, as posited above.
> 
> Probably things like device private caches, and transitions between
> usages on the same device(+driver?[1]) could be left out.  For the
> cache case, if you have a cache shared between some but not all
> devices, that problem looks to me to be basically the same problem as
> compressed buffers when some but not all devices support a particular
> compression scheme.
> 
> [1] can we assume magic under the hood for vk and gl interop with
> drivers from same vendor on same device?

In my book, the fewer assumptions we have to make for that, the better.

Cheers,
Nicolai

-- 
Learn how the world really is,
but never forget how it should be.