[Mesa-dev] GBM and the Device Memory Allocator Proposals

James Jones jajones at nvidia.com
Wed Dec 6 07:07:49 UTC 2017

On 12/01/2017 01:52 PM, Miguel Angel Vico wrote:
> On Fri, 1 Dec 2017 13:38:41 -0500
> Rob Clark <robdclark at gmail.com> wrote:
>> On Fri, Dec 1, 2017 at 12:09 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>>> On 01.12.2017 16:06, Rob Clark wrote:
>>>> On Thu, Nov 30, 2017 at 5:43 PM, Nicolai Hähnle <nhaehnle at gmail.com>
>>>> wrote:
>>>>> Hi,
>>>>> I've had a chance to look a bit more closely at the allocator prototype
>>>>> repository now. There's a whole bunch of low-level API design feedback,
>>>>> but
>>>>> for now let's focus on the high-level stuff first.

Thanks for taking a look.

>>>>> Going by the 4.5 major object types (as also seen on slide 5 of your
>>>>> presentation [0]), assertions and usages make sense to me.
>>>>> Capabilities and capability sets should be cleaned up in my opinion, as
>>>>> the
>>>>> status quo is overly obfuscating things. What capability sets really
>>>>> represent, as far as I understand them, is *memory layouts*, and so
>>>>> that's
>>>>> what they should be called.
>>>>> This conceptually simplifies `derive_capabilities` significantly without
>>>>> any
>>>>> loss of expressiveness as far as I can see. Given two lists of memory
>>>>> layouts, we simply look for which memory layouts appear in both lists,
>>>>> and
>>>>> then merge their constraints and capabilities.
>>>>> Merging constraints looks good to me.
>>>>> Capabilities need some more thought. The prototype removes capabilities
>>>>> when
>>>>> merging layouts, but I'd argue that that is often undesirable. (In fact,
>>>>> I
>>>>> cannot think of capabilities which we'd always want to remove.)
>>>>> A typical example for this is compression (i.e. DCC in our case). For
>>>>> rendering usage, we'd return something like:
>>>>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
>>>>> For display usage, we might return (depending on hardware):
>>>>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
>>>>> Merging these in the prototype would remove the DCC capability, even
>>>>> though
>>>>> it might well make sense to keep it there for rendering. Dealing with the
>>>>> fact that display usage does not have this capability is precisely one of
>>>>> the two things that transitions are about! The other thing that
>>>>> transitions
>>>>> are about is caches.
>>>>> I think this is kind of what Rob was saying in one of his mails.
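For reference, the merge rule Nicolai describes could be sketched roughly as follows. This is illustrative Python only, not the prototype's actual C API, and every name in it is invented:

```python
# Sketch of merging two lists of memory layouts: intersect on layout
# name, take the stricter constraints, and KEEP capabilities such as
# AMD/DCC rather than pruning them, leaving it to transitions to deal
# with consumers that lack a given cap. All names are hypothetical.

def merge_layouts(a_list, b_list):
    """Each entry is (layout_name, constraints_dict, caps_set)."""
    merged = []
    for name_a, cons_a, caps_a in a_list:
        for name_b, cons_b, caps_b in b_list:
            if name_a != name_b:
                continue  # no common memory layout, skip
            cons = {"alignment": max(cons_a["alignment"],
                                     cons_b["alignment"])}
            # Union instead of intersection: the DCC cap survives.
            merged.append((name_a, cons, caps_a | caps_b))
    return merged

render  = [("AMD/tiled", {"alignment": 64 * 1024}, {"AMD/DCC"})]
display = [("AMD/tiled", {"alignment": 64 * 1024}, set())]
print(merge_layouts(render, display))
```

With intersection semantics the prototype would have dropped AMD/DCC here; with the union plus a transition, rendering can stay compressed.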
>>>> Perhaps "layout" is a better name than "caps".. either way I think of
>>>> both AMD/tiled and AMD/DCC as the same type of "thing".. the
>>>> difference between AMD/tiled and AMD/DCC is that a transition can be
>>>> provided for AMD/DCC.  Other than that they are both things describing
>>>> the layout.
>>> The reason that a transition can be provided is that they aren't quite the
>>> same thing, though. In a very real sense, AMD/DCC is a "child" property of
>>> AMD/tiled: DCC is implemented as a meta surface whose memory layout depends
>>> on the layout of the main surface.
>> I suppose this is six-of-one, half-dozen of the other..
>> what you are calling a layout is what I'm calling a cap that just
>> happens not to have an associated transition
>>> Although, if there are GPUs that can do an in-place "transition" between
>>> different tiling layouts, then the distinction is perhaps really not as
>>> clear-cut. I guess that would only apply to tiled renderers.
>> I suppose the advantage of just calling both layout and caps the same
>> thing, and just saying that a "cap" (or "layout" if you prefer that
>> name) can optionally have one or more associated transitions, is that
>> you can deal with cases where sometimes a tiled format might actually
>> have an in-place transition ;-)
>>>> So lets say you have a setup where both display and GPU supported
>>>> FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
>>>> (FOO/cached).  But the GPU supported the following transitions:
>>>>     trans_a: FOO/CC -> null
>>>>     trans_b: FOO/cached -> null
>>>> Then the sets for each device (in order of preference):
>>>> GPU:
>>>>     1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
>>>>     2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
>>>>     3: caps(FOO/tiled); constraints(alignment=32k)
>>>> Display:
>>>>     1: caps(FOO/tiled); constraints(alignment=64k)
>>>> Merged Result:
>>>>     1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
>>>>        transition(GPU->display: trans_a, trans_b; display->GPU: none)
>>>>     2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
>>>>        transition(GPU->display: trans_a; display->GPU: none)
>>>>     3: caps(FOO/tiled); constraints(alignment=64k);
>>>>        transition(GPU->display: none; display->GPU: none)
>>> We definitely don't want to expose a way of getting uncached rendering
>>> surfaces for radeonsi. I mean, I think we are supposed to be able to program
>>> our hardware so that the backend bypasses all caches, but (a) nobody
>>> validates that and (b) it's basically suicide in terms of performance. Let's
>>> build fewer footguns :)
>> sure, this was just a hypothetical example.  But to take this case as
>> another example, if you didn't want to expose uncached rendering (or
>> cached w/ cache flushes after each draw), you would exclude the entry
>> from the GPU set which didn't have FOO/cached (I'm adding back a
>> cached but not CC config just to make it interesting), and end up
>> with:
>>     trans_a: FOO/CC -> null
>>     trans_b: FOO/cached -> null
>> GPU:
>>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
>>    2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
>> Display:
>>    1: caps(FOO/tiled); constraints(alignment=64k)
>> Merged Result:
>>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
>>       transition(GPU->display: trans_a, trans_b; display->GPU: none)
>>    2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
>>       transition(GPU->display: trans_b; display->GPU: none)
>> So there isn't anything in the result set that doesn't have GPU cache,
>> and the cache-flush transition is always in the set of required
>> transitions going from GPU -> display
>> Hmm, I guess this does require the concept of a required cap..
> Which we already introduced to the allocator API when we realized we
> would need them as we were prototyping.

Note I also posed the question of whether things like cached (and 
similarly compression, since I view compression as roughly an 
equivalent mechanism to a cache) belong in capability sets in one of 
the open issues on my XDC 2017 slides, because of this very 
over-pruning problem.  It's on slide 15, as "No device-local 
capabilities".  You'll have to listen to my coverage of it in the 
recorded presentation for that slide to make sense, but it's the same 
thing Nicolai has laid out here.

As I continued working through our prototype driver support, I found I 
didn't actually need to include cached or compressed as capabilities: 
The GPU just applies them as needed and the usage transitions make it 
transparent to the non-GPU engines.  That does mean the GPU driver 
currently needs to be the one to realize the allocation from the 
capability set to get optimal behavior.  We could fix that by reworking 
our driver though.  At this point, not including device-local properties 
like on-device caching in capabilities seems like the right solution to 
me.  I'm curious whether this applies universally though, or if other 
hardware doesn't fit the "compression and stuff all behaves like a 
cache" idiom.

>>> So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
>>> we'd still want to have a transition between the GPU and display precisely
>>> to flush caches.
>>>>> Two interesting questions:
>>>>> 1. If we query for multiple usages on the same device, can we get a
>>>>> capability which can only be used for a subset of those usages?
>>>> I think the original idea was, "no"..  perhaps that restriction
>>>> could be lifted if transitions were part of the result.  Or maybe you
>>>> just query independently the same device for multiple different
>>>> usages, and then merge that cap-set.
>>>> (Do we need to care about intra-device transitions?  Or can we just
>>>> let the driver care about that, same as it always has?)
>>>>> 2. What happens when we merge memory layouts with sets of capabilities
>>>>> where
>>>>> neither is a subset of the other?
>>>> I think this is a case where no zero-copy sharing is possible, right?
>>> Not necessarily. Let's say we have some industry-standard tiling layout foo,
>>> and vendors support their own proprietary framebuffer compression on top of
>>> that.
>>> In that case, we may get:
>>> Device 1, rendering: caps(BASE/foo, VND1/compressed)
>>> Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)
>>> It should be possible to allocate a surface as
>>> caps(BASE/foo, VND1/compressed)
>>> and just transition the cap(VND1/compressed) away after rendering before
>>> accessing it with device 2.
>> In this case presumably VND1 defines transitions VND1/cc <-> null, and
>> VND2 does the same for VND2/cc ?
>>> The interesting question is whether it would be possible or ever useful to
>>> have a surface allocated as caps(BASE/foo, VND1/compressed,
>>> VND2/compressed).
>> not sure if it is useful or not, but I think the idea of defining cap
>> transitions and returning the associated transitions when merging a
>> caps set could express this..
> Yeah. I guess if we define per-cap transitions, the allocator should be
> able to find a transition chain to express layouts like the above when
> merging.
>>> My guess is: there will be cases where it's possible, but there won't be
>>> cases where it's useful (because you tend to render on device 1 and just
>>> sample or scanout on device 2).
>>> So it makes sense to say that derive_capabilities should just provide both
>>> layouts in this case.
>>>>> As for the actual transition API, I accept that some metadata may be
>>>>> required, and the metadata probably needs to depend on the memory layout,
>>>>> which is often vendor-specific. But even linear layouts need some
>>>>> transitions for caches. We probably need at least some generic
>>>>> "off-device
>>>>> usage" bit.
>>>> I've started thinking of cached as a capability with a transition.. I
>>>> think that helps.  Maybe it needs to somehow be more specific (ie. if
>>>> you have two devices both with there own cache with no coherency
>>>> between the two)
>>> As I wrote above, I'd prefer not to think of "cached" as a capability at
>>> least for radeonsi.
>>>  From the desktop perspective, I would say let's ignore caches, the drivers
>>> know which caches they need to flush to make data visible to other devices
>>> on the system.
>>> On the other hand, there are probably SoC cases where non-coherent caches
>>> are shared between some but not all devices, and in that case perhaps we do
>>> need to communicate this.
>>> So perhaps we should have two kinds of "capabilities".
>>> The first, like framebuffer compression, is a capability of the allocated
>>> memory layout (because the compression requires a meta surface), and devices
>>> that expose it may opportunistically use it.
>>> The second, like caches, is a capability that the device/driver will use and
>>> you don't get a say in it, but other devices/drivers also don't need to be
>>> aware of them.
>> yeah, a required cap..  we had tried to avoid this, since unlike
>> constraints which are well defined, the core constraint/capability
>> merging wouldn't know what to do about merging parameterized caps.
>> But I guess if transitions are provided then it doesn't have to.
> We are going to need required caps either way, right? The core
> capability merging logic would try to find a compatible layout to be
> used across engines by either using available transitions or dropping
> caps. We'd need a way to indicate that a particular engine won't be
> able to handle the resulting layout if a certain capability was dropped.
>>> So then you could theoretically have a system that gives you:
>>> GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
>>> Display: FOO/tiled(layout-caps=FOO/cc)
>>> Video:   FOO/tiled(dev-caps=FOO/vid-cache)
>>> Camera:  FOO/tiled(dev-caps=FOO/vid-cache)
>>> ... from which a FOO/tiled(FOO/cc) surface would be allocated.
>>> The idea here is that whether a transition is required is fully visible from
>>> the capabilities:
>>> 1. Moving an image from the camera to the video engine for immediate
>>> compression requires no transition.
>>> 2. Moving an image from the camera or video engine to the display requires a
>>> transition by the video/camera device/API, which may flush the video cache.
>>> 3. Moving an image from the camera or video engine to the GPU additionally
>>> requires a transition by the GPU, which may invalidate the GPU cache.
>>> 4. Moving an image from the GPU anywhere else requires a transition by the
>>> GPU; in all cases, GPU caches may be flushed. When moving to the video
>>> engine or camera, the image additionally needs to be decompressed. When
>>> moving to the video engine (or camera? :)), a transition by the video engine
>>> is also required, which may invalidate the video cache.
>>> 5. Moving an image from the display to the video engine requires a
>>> decompression -- oops! :)
>> I guess it should be possible for devices to provide transitions in
>> both directions, which would deal with this..
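Nicolai's rule that the required transitions are fully derivable from the two capability sets could be sketched like this. Illustrative Python only; the split into layout-caps and dev-caps mirrors his example above, and the step names are invented:

```python
# Deriving transition steps by comparing capability sets: the exporter
# flushes its own device-local caps and strips layout caps the importer
# does not understand (e.g. decompress); the importer invalidates its
# own device-local caps. Step names are purely illustrative.

def export_steps(src, dst):
    steps = [f"flush {c}" for c in sorted(src["dev_caps"])]
    steps += [f"strip {c}" for c in
              sorted(src["layout_caps"] - dst["layout_caps"])]
    return steps

def import_steps(dst):
    return [f"invalidate {c}" for c in sorted(dst["dev_caps"])]

gpu     = {"layout_caps": {"FOO/cc"}, "dev_caps": {"FOO/gpu-cache"}}
display = {"layout_caps": {"FOO/cc"}, "dev_caps": set()}
video   = {"layout_caps": set(),      "dev_caps": {"FOO/vid-cache"}}

# Case 4 above, GPU -> display: flush caches, keep compression.
print(export_steps(gpu, display), import_steps(display))
# GPU -> video: flush caches and decompress; video invalidates its cache.
print(export_steps(gpu, video), import_steps(video))
```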
>>> Ignoring that last point for now, I don't think you actually need a
>>> "query_transition" function in libdevicealloc with this approach, for the
>>> most part.
>> with the idea of being able to provide optional transitions, I'm
>> leaning towards just having one of outputs of merging the caps sets
>> being the sets of transitions required to pass a buffer between
>> different devices..
> I think I like the idea of having transitions being part of the
> per-device/engine cap sets, so that such information can be used upon
> merging to know which capabilities may remain or have to be dropped.
> I think James's proposal for usage transitions was intended to work
> with flows like:
>    1. App gets GPU caps for RENDER usage
>    2. App allocates GPU memory using a layout from (1)
>    3. App now decides it wants use the buffer for SCANOUT
>    4. App queries usage transition metadata from RENDER to SCANOUT,
>       given the current memory layout.
>    5. Do the transition and hand the buffer off to display

No, all usages the app intends to transition to must be specified up 
front when initially querying caps in the model I assumed.  The app then 
specifies some subset (up to the full set) of the specified usages as a 
src and dst when querying transition metadata.
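In other words, the flow I had in mind looks roughly like this (illustrative Python; the function names are invented, not a real API):

```python
# All usages the app may ever transition between are specified when
# caps are first queried; transition metadata is then requested for
# (src, dst) subsets of that original usage set. Invented names only.

ALL_USAGES = frozenset({"RENDER", "SCANOUT", "TEXTURE"})

def query_caps(usages):
    # Caps must support transitioning to any *single* usage in the
    # set; some combinations of usages may still have no transition.
    return {"layout": "FOO/tiled", "usages": frozenset(usages)}

def query_transition(caps, src, dst):
    assert src <= caps["usages"] and dst <= caps["usages"], \
        "src/dst must be subsets of the usages caps were queried for"
    return {"from": frozenset(src), "to": frozenset(dst)}

caps = query_caps(ALL_USAGES)              # query caps, then allocate
meta = query_transition(caps,              # RENDER -> SCANOUT metadata
                        {"RENDER"}, {"SCANOUT"})
print(meta)
```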

> The problem I see with this is that it isn't guaranteed that there will
> be a chain of transitions for the buffer to be usable by display.

I hadn't thought hard about it, but my initial thoughts were that it 
would be required that the driver support transitioning to any single 
usage given the capabilities returned.  However, transitioning to 
multiple usages (e.g., to simultaneously render and scan out) could 
fail to produce a valid transition, in which case the app would have 
to fall back to a copy or avoid that simultaneous usage combination in 
some other way.

> Adding transition metadata to the original capability sets, and using
> that information when merging could give us a compatible memory layout
> that would be usable by both GPU and display.
> I'll look into extending the current merging logic to also take into
> account transitions.

Yes, it'll be good to see whether this can be made to work.  I agree 
Rob's example outcomes above are ideal, but it's not clear to me how to 
code up such an algorithm.  This also all seems unnecessary if "device 
local" capabilities aren't needed, as posited above.

>> although maybe the user doesn't need to know every possible transition
>> between devices once you have more than two devices..
> We should be able to infer how buffers are going to be moved around
> from the list of usages, shouldn't we?
> Maybe we are missing some bits of information there, but I think the
> allocator should be able to know what transitions the app will care
> about and provide only those.

Currently, the allocator only knows the requested union of all usages. 
I believe the number of possible transitions grows combinatorially 
with every usage requested.  I expect there will be cases where ~10 
usages are specified, so generating all possible transitions up front 
may be excessive when the app will generally only care about 2 or 3 
states, and in practice there will probably only be 2 or 3 different 
underlying combinations of operations.
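To put a number on that growth: if the src and dst of a transition can each be any non-empty subset of the requested usages, the count is (2^n - 1)^2. This is a worst-case upper bound, assuming every subset pair is distinct:

```python
# Worst-case count of (src, dst) transition queries when src and dst
# may each be any non-empty subset of the n requested usages.

def possible_transitions(n_usages):
    nonempty_subsets = 2 ** n_usages - 1
    return nonempty_subsets ** 2

for n in (2, 3, 10):
    print(n, possible_transitions(n))
# At n = 10 this is already over a million combinations, while a real
# app typically exercises only two or three of them.
```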

>> /me shrugs
>>> Instead, each API needs to provide import and export transition/barrier
>>> functions which receive the previous/next layout- and capability-set.
>>> Basically, to import a frame from the camera to OpenGL/Vulkan in the above
>>> system, you'd first do the camera transition:
>>>    struct layout_capability cc_cap = { FOO, FOO_CC };
>>>    struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };
>>>    cameraExportTransition(image, 1, &layoutCaps, 1, &gpu_cache, &fence);
>>> and then e.g. an OpenGL import transition:
>>>    struct device_capability vid_cache = { FOO, FOO_VID_CACHE };
>>>    glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);
>>> By looking at the capabilities for the other device, each API's driver can
>>> derive the required transition steps.
>>> There are probably more gaps, but these are the two I can think of right
>>> now, and both related to the initialization status of meta surfaces, i.e.
>>> FOO/cc:
>>> 1. Point 5 above about moving away from the display engine in the example.
>>> This is an ugly asymmetry in the rule that each engine performs its required
>>> import and export transitions.
>>> 2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
>>> surface can be in one of two states:
>>> - reflecting a fully decompressed surface (if the surface was previously
>>> exported from the GPU), or
>>> - garbage (if the surface was allocated by the GPU driver, but then handed
>>> off to the camera before being re-imported for processing)
>>> The GPU's import transition needs to distinguish the two, but it can't with
>>> the scheme above.
>> hmm, so I suppose this is also true in the cache case.. you want to
>> know if the buffer was written by someone else since you saw it last..
>>> Something to think about :)
>>> Also, not really a gap, but something to keep in mind: for multi-GPU
>>> systems, the cache-capability needs to carry the device number or PCI bus id
>>> or something, at least as long as those caches are not coherent between
>>> GPUs.
>> yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
> That just seems an implementation detail of the representation the
> particular vendor chooses for the CACHE capability, right?


One final note:  When I initially wrote up the capability merging logic, 
I treated "layout" as a sort of "special" capability, basically like 
Nicolai originally outlined above.  Miguel suggested I add the 
"required" bit instead to generalize things, and it ended up working out 
much cleaner.  Besides the layout, at least one other candidate for a 
"required" capability became obvious as soon as I started coding up 
the prototype driver: memory location.  It might seem 
like memory location is a simple device-agnostic constraint rather than 
a capability, but it's actually too complicated (we need more memory 
locations than "device" and "host").  It has to be vendor specific, and 
hence fits in better as a capability.
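A minimal sketch of that generalization, with memory location modeled as just another vendor-specific required capability (illustrative Python, not the prototype's C structures; all names invented):

```python
# Every capability carries a "required" bit; merging may drop a cap
# present on only one side unless it is required, in which case the
# sets are simply incompatible. Memory location appears as an ordinary
# vendor-specific required cap rather than a special object type.

from dataclasses import dataclass

@dataclass(frozen=True)
class Cap:
    vendor: str
    name: str
    required: bool = False

def merge_caps(a, b):
    if any(c.required for c in a ^ b):
        return None  # a required cap would be lost: no zero-copy path
    return a & b     # non-required caps on one side only are dropped

gpu = {Cap("BASE", "layout/tiled", required=True),
       Cap("VND1", "memory/vram",  required=True),
       Cap("VND1", "compressed")}
display = {Cap("BASE", "layout/tiled", required=True),
           Cap("VND1", "memory/vram",  required=True)}

print(merge_caps(gpu, display))
```

Here the layout and the memory location get no special-case code at all; they are just caps whose required bit happens to be set.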

I think if possible, we should try to keep the design generalized to as 
few types of objects and special cases as possible.  The more we can 
generalize the solutions to our existing problem set, the better the 
mechanism should hold up as we apply it to new and unknown problems as 
they arise.


> Thanks,
> Miguel.
>> BR,
>> -R
