[Mesa-dev] GBM and the Device Memory Allocator Proposals

Wed Dec 6 13:25:19 UTC 2017

On Wed, Dec 6, 2017 at 2:07 AM, James Jones <jajones at nvidia.com> wrote:
> On 12/01/2017 01:52 PM, Miguel Angel Vico wrote:
>>
>>
>>
>> On Fri, 1 Dec 2017 13:38:41 -0500
>> Rob Clark <robdclark at gmail.com> wrote:
>>
>>>
>>> sure, this was just a hypothetical example.  But to take this case as
>>> another example, if you didn't want to expose uncached rendering (or
>>> cached w/ cache flushes after each draw), you would exclude the entry
>>> from the GPU set which didn't have FOO/cached (I'm adding back a
>>> cached but not CC config just to make it interesting), and end up
>>> with:
>>>
>>>     trans_a: FOO/CC -> null
>>>     trans_b: FOO/cached -> null
>>>
>>> GPU:
>>>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
>>>    2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
>>>
>>> Display:
>>>    1: caps(FOO/tiled); constraints(alignment=64k)
>>>
>>> Merged Result:
>>>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
>>>       transition(GPU->display: trans_a, trans_b; display->GPU: none)
>>>    2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
>>>       transition(GPU->display: trans_b; display->GPU: none)
>>>
>>> So there isn't anything in the result set that doesn't have GPU cache,
>>> and the cache-flush transition is always in the set of required
>>> transitions going from GPU -> display
>>>
>>> Hmm, I guess this does require the concept of a required cap..
>>
>>
>> Which we already introduced to the allocator API when we realized we
>> would need them as we were prototyping.
>
>
> Note I also posed the question of whether things like cached (and similarly
> compression, since I view compression as roughly an equivalent mechanism to
> a cache) should be capabilities at all in one of the open issues on my XDC
> 2017 slides, because of this very problem of over-pruning they cause.  It's
> on slide 15, as "No device-local capabilities".  You'll have to listen to my
> coverage of it in the recorded presentation for that slide to make any
> sense, but it's the same thing Nicolai has laid out here.
>
> As I continued working through our prototype driver support, I found I
> didn't actually need to include cached or compressed as capabilities: The
> GPU just applies them as needed and the usage transitions make it
> transparent to the non-GPU engines.  That does mean the GPU driver currently
> needs to be the one to realize the allocation from the capability set to get
> optimal behavior.  We could fix that by reworking our driver though.  At
> this point, not including device-local properties like on-device caching in
> capabilities seems like the right solution to me.  I'm curious whether this
> applies universally though, or if other hardware doesn't fit the
> "compression and stuff all behaves like a cache" idiom.
>

Possibly a SoC(ish) type device which has a "system" cache that some
but not all devices fall into.  I *think* the Intel chips w/ eDRAM
might fall into this category.  I know the idea has come up elsewhere,
although not sure if anything like that ended up in production.  It
seems like something we'd at least want to have an idea how to deal
with, even if it isn't used for device internal caches.

Not sure if a similar situation could come up w/ a discrete GPU and video
decode/encode engines on the same die?

[snip]

>> I think I like the idea of having transitions being part of the
>> per-device/engine cap sets, so that such information can be used upon
>> merging to know which capabilities may remain or have to be dropped.
>>
>> I think James's proposal for usage transitions was intended to work
>> with flows like:
>>
>>    1. App gets GPU caps for RENDER usage
>>    2. App allocates GPU memory using a layout from (1)
>>    3. App now decides it wants to use the buffer for SCANOUT
>>    4. App queries usage transition metadata from RENDER to SCANOUT,
>>       given the current memory layout.
>>    5. Do the transition and hand the buffer off to display
>
>
> No, all usages the app intends to transition to must be specified up front
> when initially querying caps in the model I assumed.  The app then specifies
> some subset (up to the full set) of the specified usages as a src and dst
> when querying transition metadata.
>
>> The problem I see with this is that it isn't guaranteed that there will
>> be a chain of transitions for the buffer to be usable by display.
>

hmm, I guess if a buffer *can* be shared across all uses, there by
definition has to be a chain of transitions to go from any
usage+device to any other usage+device.

Possibly a separate step to query transitions avoids solving for every
possible transition when merging the caps set.. although until you do
that query I don't think you know the resulting merged caps set is
valid.

Maybe in practice for every cap FOO there exists a FOO->null (or
FOO->generic if you prefer) transition, ie. compressed->uncompressed,
cached->clean, etc.  I suppose that makes the problem easier to solve.
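
A minimal sketch of that idea, assuming every device-local cap comes with
a cap->null transition and that a cap set is just a set of names plus an
alignment.  None of these names (merge, TO_NULL, ...) come from the
allocator prototype; they are made up for illustration.  It reproduces
the GPU/display example from earlier in the thread:

```python
# Hypothetical sketch; none of these names come from the allocator prototype.

# Each device-local cap maps to the transition that drops it (cap -> null).
TO_NULL = {"FOO/CC": "trans_a", "FOO/cached": "trans_b"}

def merge(src_sets, dst_sets, to_null):
    """Merge producer (src) and consumer (dst) cap sets.

    Each input set is a (caps, alignment) pair.  A cap that only one side
    understands is kept if it has a cap->null transition; otherwise that
    combination is pruned.
    """
    merged = []
    for s_caps, s_align in src_sets:
        for d_caps, d_align in dst_sets:
            src_only = s_caps - d_caps
            dst_only = d_caps - s_caps
            if not all(c in to_null for c in src_only | dst_only):
                continue  # no transition chain, so prune this combination
            merged.append({
                "caps": s_caps | d_caps,
                # Assumes power-of-two alignments, so max satisfies both.
                "alignment": max(s_align, d_align),
                "src_to_dst": sorted(to_null[c] for c in src_only),
                "dst_to_src": sorted(to_null[c] for c in dst_only),
            })
    return merged

K = 1024
gpu = [
    (frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32 * K),
    (frozenset({"FOO/tiled", "FOO/cached"}), 32 * K),
]
display = [(frozenset({"FOO/tiled"}), 64 * K)]

result = merge(gpu, display, TO_NULL)
for r in result:
    print(sorted(r["caps"]), r["alignment"], r["src_to_dst"], r["dst_to_src"])
```

With a cap->null transition guaranteed for every device-local cap, the
merge never has to search for a chain: it only has to list which
transitions are needed in each direction.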

>
> I hadn't thought hard about it, but my initial thoughts were that it would
> be required that the driver support transitioning to any single usage given
> the capabilities returned.  However, transitioning to multiple usages (E.g.,
> to simultaneously rendering and scanning out) could fail to produce a valid
> transition, in which case the app would have to fall back to a copy or
> avoid that simultaneous usage combination in some other way.
>
>> Adding transition metadata to the original capability sets, and using
>> that information when merging could give us a compatible memory layout
>> that would be usable by both GPU and display.
>>
>> I'll look into extending the current merging logic to also take into
>> account transitions.
>
>
> Yes, it'll be good to see whether this can be made to work.  I agree Rob's
> example outcomes above are ideal, but it's not clear to me how to code up
> such an algorithm.  This also all seems unnecessary if "device local"
> capabilities aren't needed, as posited above.

Probably things like device private caches, and transitions between
usages on the same device(+driver?[1]) could be left out.  For the
cache case, if you have a cache shared between some but not all
devices, that problem looks to me to be basically the same problem as
compressed buffers when some but not all devices support a particular
compression scheme.

[1] can we assume magic under the hood for vk and gl interop with
drivers from same vendor on same device?

[snip]

>>>
>>> yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
>>
>>
>> That just seems an implementation detail of the representation the
>> particular vendor chooses for the CACHE capability, right?
>
>
> Agreed.
>
> One final note:  When I initially wrote up the capability merging logic, I
> treated "layout" as a sort of "special" capability, basically like Nicolai
> originally outlined above.  Miguel suggested I add the "required" bit
> instead to generalize things, and it ended up working out much cleaner.
> Besides the layout, there is at least one other obvious candidate for a
> "required" capability that became obvious as soon as I started coding up the
> prototype driver: memory location.  It might seem like memory location is a
> simple device-agnostic constraint rather than a capability, but it's
> actually too complicated (we need more memory locations than "device" and
> "host").  It has to be vendor specific, and hence fits in better as a
> capability.
>
> I think if possible, we should try to keep the design generalized to as few
> types of objects and special cases as possible.  The more we can generalize
> the solutions to our existing problem set, the better the mechanism should
> hold up as we apply it to new and unknown problems as they arise.
>

agreed
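
A rough sketch of how a generic "required" bit could replace a
special-cased layout when merging, assuming "required" means the cap
must be understood by both sides for the set to stay valid, while
optional caps known to only one side are simply dropped.  That reading,
and every name below (intersect, the cap names, the memory locations),
is an assumption for illustration, not the prototype's actual code:

```python
# Hypothetical sketch; cap names and semantics are assumptions.

def intersect(set_a, set_b):
    """Intersect two cap sets given as {cap_name: required_flag}.

    Returns the merged set, or None if a required cap is missing
    from one side (the combination is invalid).
    """
    merged = {}
    for name in set_a.keys() | set_b.keys():
        required = set_a.get(name, False) or set_b.get(name, False)
        if name in set_a and name in set_b:
            merged[name] = required
        elif required:
            return None  # a required cap the other side does not know
        # else: optional cap known to one side only, dropped
    return merged

# Layout and memory location are both just "required" caps here.
gpu = {"layout/tiled": True, "mem/vidmem": True, "FOO/CC": False}
display = {"layout/tiled": True, "mem/vidmem": True}
camera = {"layout/linear": True, "mem/host": True}

print(intersect(gpu, display))  # FOO/CC is dropped, the rest survives
print(intersect(gpu, camera))   # None: required caps do not match
```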

BR,
-R