[Mesa-dev] GBM and the Device Memory Allocator Proposals
Miguel Angel Vico
mvicomoya at nvidia.com
Fri Dec 1 21:52:22 UTC 2017
On Fri, 1 Dec 2017 13:38:41 -0500
Rob Clark <robdclark at gmail.com> wrote:
> On Fri, Dec 1, 2017 at 12:09 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> > On 01.12.2017 16:06, Rob Clark wrote:
> >>
> >> On Thu, Nov 30, 2017 at 5:43 PM, Nicolai Hähnle <nhaehnle at gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've had a chance to look a bit more closely at the allocator prototype
> >>> repository now. There's a whole bunch of low-level API design feedback,
> >>> but
> >>> for now let's focus on the high-level stuff first.
> >>>
> >>> Going by the 4.5 major object types (as also seen on slide 5 of your
> >>> presentation [0]), assertions and usages make sense to me.
> >>>
> >>> Capabilities and capability sets should be cleaned up in my opinion, as
> >>> the
> >>> status quo is overly obfuscating things. What capability sets really
> >>> represent, as far as I understand them, is *memory layouts*, and so
> >>> that's
> >>> what they should be called.
> >>>
> >>> This conceptually simplifies `derive_capabilities` significantly without
> >>> any
> >>> loss of expressiveness as far as I can see. Given two lists of memory
> >>> layouts, we simply look for which memory layouts appear in both lists,
> >>> and
> >>> then merge their constraints and capabilities.
> >>>
> >>> Merging constraints looks good to me.
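The merge Nicolai describes could be sketched roughly as follows; the types and names here are hypothetical, not the prototype's actual API (constraints reduced to just an alignment, stricter value wins):

```c
/* Hedged sketch: intersect two lists of memory layouts by name and
 * merge their constraints by taking the stricter value. Hypothetical
 * types, not the allocator prototype's API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mem_layout {
    const char *name;      /* e.g. "AMD/tiled" */
    uint64_t alignment;    /* constraint: required alignment */
};

/* Returns the number of layouts common to both lists; common layouts
 * are written to out with merged (stricter) constraints. */
static int
merge_layout_lists(const struct mem_layout *a, size_t na,
                   const struct mem_layout *b, size_t nb,
                   struct mem_layout *out)
{
    int n = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (strcmp(a[i].name, b[j].name) == 0) {
                out[n].name = a[i].name;
                out[n].alignment = a[i].alignment > b[j].alignment
                                 ? a[i].alignment : b[j].alignment;
                n++;
                break;
            }
        }
    }
    return n;
}
```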
> >>>
> >>> Capabilities need some more thought. The prototype removes capabilities
> >>> when
> >>> merging layouts, but I'd argue that that is often undesirable. (In fact,
> >>> I
> >>> cannot think of capabilities which we'd always want to remove.)
> >>>
> >>> A typical example for this is compression (i.e. DCC in our case). For
> >>> rendering usage, we'd return something like:
> >>>
> >>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
> >>>
> >>> For display usage, we might return (depending on hardware):
> >>>
> >>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
> >>>
> >>> Merging these in the prototype would remove the DCC capability, even
> >>> though
> >>> it might well make sense to keep it there for rendering. Dealing with the
> >>> fact that display usage does not have this capability is precisely one of
> >>> the two things that transitions are about! The other thing that
> >>> transitions
> >>> are about is caches.
> >>>
> >>> I think this is kind of what Rob was saying in one of his mails.
> >>
> >>
> >> Perhaps "layout" is a better name than "caps".. either way I think of
> >> both AMD/tiled and AMD/DCC as the same type of "thing".. the
> >> difference between AMD/tiled and AMD/DCC is that a transition can be
> >> provided for AMD/DCC. Other than that they are both things describing
> >> the layout.
> >
> >
> > The reason that a transition can be provided is that they aren't quite the
> > same thing, though. In a very real sense, AMD/DCC is a "child" property of
> > AMD/tiled: DCC is implemented as a meta surface whose memory layout depends
> > on the layout of the main surface.
>
> I suppose this is six-of-one, half-dozen of the other..
>
> what you are calling a layout is what I'm calling a cap that just
> happens not to have an associated transition
>
> > Although, if there are GPUs that can do an in-place "transition" between
> > different tiling layouts, then the distinction is perhaps really not as
> > clear-cut. I guess that would only apply to tiled renderers.
>
> I suppose the advantage of just calling both layout and caps the same
> thing, and just saying that a "cap" (or "layout" if you prefer that
> name) can optionally have one or more associated transitions, is that
> you can deal with cases where sometimes a tiled format might actually
> have an in-place transition ;-)
>
> >
> >> So let's say you have a setup where both display and GPU supported
> >> FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
> >> (FOO/cached). But the GPU supported the following transitions:
> >>
> >> trans_a: FOO/CC -> null
> >> trans_b: FOO/cached -> null
> >>
> >> Then the sets for each device (in order of preference):
> >>
> >> GPU:
> >> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
> >> 2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
> >> 3: caps(FOO/tiled); constraints(alignment=32k)
> >>
> >> Display:
> >> 1: caps(FOO/tiled); constraints(alignment=64k)
> >>
> >> Merged Result:
> >> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
> >> transition(GPU->display: trans_a, trans_b; display->GPU: none)
> >> 2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
> >> transition(GPU->display: trans_a; display->GPU: none)
> >> 3: caps(FOO/tiled); constraints(alignment=64k);
> >> transition(GPU->display: none; display->GPU: none)
> >
> >
> > We definitely don't want to expose a way of getting uncached rendering
> > surfaces for radeonsi. I mean, I think we are supposed to be able to program
> > our hardware so that the backend bypasses all caches, but (a) nobody
> > validates that and (b) it's basically suicide in terms of performance. Let's
> > build fewer footguns :)
>
> sure, this was just a hypothetical example. But to take this case as
> another example, if you didn't want to expose uncached rendering (or
> cached w/ cache flushes after each draw), you would exclude the entry
> from the GPU set which didn't have FOO/cached (I'm adding back a
> cached but not CC config just to make it interesting), and end up
> with:
>
> trans_a: FOO/CC -> null
> trans_b: FOO/cached -> null
>
> GPU:
> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
> 2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
>
> Display:
> 1: caps(FOO/tiled); constraints(alignment=64k)
>
> Merged Result:
> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
> transition(GPU->display: trans_a, trans_b; display->GPU: none)
> 2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
> transition(GPU->display: trans_b; display->GPU: none)
>
> So there isn't anything in the result set that doesn't have GPU cache,
> and the cache-flush transition is always in the set of required
> transitions going from GPU -> display
>
> Hmm, I guess this does require the concept of a required cap..
Which we already introduced to the allocator API when we realized we
would need them as we were prototyping.
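Rob's merged result above could be computed along these lines; this is a hedged sketch with invented types, ignoring constraints: caps shared by both devices are kept as-is, while caps only the source device has are kept only if they carry a transition to null, which is then recorded as required for the source -> destination handoff.

```c
/* Hedged sketch of merging cap sets with per-cap transitions.
 * Names and types are invented for illustration. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { FOO_TILED = 1, FOO_CC = 2, FOO_CACHED = 3 };

struct dev_cap {
    uint32_t id;
    bool has_null_transition;   /* e.g. FOO/CC -> null */
};

static bool
id_in_list(const uint32_t *ids, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (ids[i] == id)
            return true;
    return false;
}

/* Returns the number of caps kept; required src->dst transitions are
 * written to out_trans (count in *out_ntrans). */
static int
merge_with_transitions(const struct dev_cap *src, size_t nsrc,
                       const uint32_t *dst_ids, size_t ndst,
                       struct dev_cap *out_caps,
                       uint32_t *out_trans, size_t *out_ntrans)
{
    int ncaps = 0;
    *out_ntrans = 0;
    for (size_t i = 0; i < nsrc; i++) {
        if (id_in_list(dst_ids, ndst, src[i].id)) {
            out_caps[ncaps++] = src[i];
        } else if (src[i].has_null_transition) {
            out_caps[ncaps++] = src[i];
            out_trans[(*out_ntrans)++] = src[i].id;
        }
        /* else: the cap is simply dropped, as the prototype does today */
    }
    return ncaps;
}
```

With the GPU set (tiled, CC, cached) and the display set (tiled), this keeps all three caps and reports trans_a and trans_b as the required GPU -> display transitions, matching the first merged entry above.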
>
> > So at least for radeonsi, we wouldn't want to have an AMD/cached bit, but
> > we'd still want to have a transition between the GPU and display precisely
> > to flush caches.
> >
> >
> >>> Two interesting questions:
> >>>
> >>> 1. If we query for multiple usages on the same device, can we get a
> >>> capability which can only be used for a subset of those usages?
> >>
> >>
> >> I think the original idea was, "no".. perhaps that restriction
> >> could be lifted if transitions were part of the result. Or maybe you
> >> just query the same device independently for multiple different
> >> usages, and then merge the resulting cap-sets.
> >>
> >> (Do we need to care about intra-device transitions? Or can we just
> >> let the driver care about that, same as it always has?)
> >>
> >>> 2. What happens when we merge memory layouts with sets of capabilities
> >>> where
> >>> neither is a subset of the other?
> >>
> >>
> >> I think this is a case where no zero-copy sharing is possible, right?
> >
> >
> > Not necessarily. Let's say we have some industry-standard tiling layout foo,
> > and vendors support their own proprietary framebuffer compression on top of
> > that.
> >
> > In that case, we may get:
> >
> > Device 1, rendering: caps(BASE/foo, VND1/compressed)
> > Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)
> >
> > It should be possible to allocate a surface as
> >
> > caps(BASE/foo, VND1/compressed)
> >
> > and just transition the cap(VND1/compressed) away after rendering before
> > accessing it with device 2.
>
> In this case presumably VND1 defines transitions VND1/cc <-> null, and
> VND2 does the same for VND2/cc ?
>
> > The interesting question is whether it would be possible or ever useful to
> > have a surface allocated as caps(BASE/foo, VND1/compressed,
> > VND2/compressed).
>
> not sure if it is useful or not, but I think the idea of defining cap
> transitions and returning the associated transitions when merging a
> caps set could express this..
Yeah. I guess if we define per-cap transitions, the allocator should be
able to find a transition chain to express layouts like the above when
merging.
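One hedged way to model such chains, assuming (hypothetically) that each vendor cap declares whether it can be transitioned away to null and/or materialized from null:

```c
/* Hedged sketch: a transition chain from layout A to layout B exists
 * if every cap only A has can be dropped (cap -> null) and every cap
 * only B has can be created (null -> cap). Invented names. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vnd_cap {
    uint32_t id;     /* e.g. VND1/compressed, VND2/compressed */
    bool to_null;    /* a "decompress"-style transition exists */
    bool from_null;  /* a "recompress"-style transition exists */
};

static const struct vnd_cap *
find_cap(const struct vnd_cap *caps, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (caps[i].id == id)
            return &caps[i];
    return NULL;
}

static bool
chain_exists(const struct vnd_cap *a, size_t na,
             const struct vnd_cap *b, size_t nb)
{
    /* Caps only A has must be transitionable away... */
    for (size_t i = 0; i < na; i++)
        if (!find_cap(b, nb, a[i].id) && !a[i].to_null)
            return false;
    /* ...and caps only B has must be creatable from null. */
    for (size_t i = 0; i < nb; i++)
        if (!find_cap(a, na, b[i].id) && !b[i].from_null)
            return false;
    return true;
}
```

In the VND1/VND2 example, device 1's VND1/compressed is transitioned away after rendering, and device 2's VND2/compressed could then be materialized before sampling, so the chain exists.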
>
> > My guess is: there will be cases where it's possible, but there won't be
> > cases where it's useful (because you tend to render on device 1 and just
> > sample or scanout on device 2).
> >
> > So it makes sense to say that derive_capabilities should just provide both
> > layouts in this case.
> >
> >
> >>> As for the actual transition API, I accept that some metadata may be
> >>> required, and the metadata probably needs to depend on the memory layout,
> >>> which is often vendor-specific. But even linear layouts need some
> >>> transitions for caches. We probably need at least some generic
> >>> "off-device
> >>> usage" bit.
> >>
> >>
> >> I've started thinking of cached as a capability with a transition.. I
> >> think that helps. Maybe it needs to somehow be more specific (i.e. if
> >> you have two devices, each with their own cache and no coherency
> >> between the two)
> >
> >
> > As I wrote above, I'd prefer not to think of "cached" as a capability at
> > least for radeonsi.
> >
> > From the desktop perspective, I would say let's ignore caches, the drivers
> > know which caches they need to flush to make data visible to other devices
> > on the system.
> >
> > On the other hand, there are probably SoC cases where non-coherent caches
> > are shared between some but not all devices, and in that case perhaps we do
> > need to communicate this.
> >
> > So perhaps we should have two kinds of "capabilities".
> >
> > The first, like framebuffer compression, is a capability of the allocated
> > memory layout (because the compression requires a meta surface), and devices
> > that expose it may opportunistically use it.
> >
> > The second, like caches, is a capability that the device/driver will use and
> > you don't get a say in it, but other devices/drivers also don't need to be
> > aware of them.
>
> yeah, a required cap.. we had tried to avoid this, since unlike
> constraints which are well defined, the core constraint/capability
> merging wouldn't know what to do about merging parameterized caps.
> But I guess if transitions are provided then it doesn't have to.
We are going to need required caps either way, right? The core
capability merging logic would try to find a compatible layout to be
used across engines by either using available transitions or dropping
caps. We'd need a way to indicate that a particular engine won't be
able to handle the resulting layout if a certain capability was dropped.
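Ignoring transitions for simplicity, the required-cap check could look something like this (a hedged sketch with invented types): a merge candidate is rejected outright if dropping a cap would leave some engine unable to handle the resulting layout.

```c
/* Hedged sketch of cap intersection with a "required" flag.
 * Hypothetical types, not the allocator prototype's API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct req_cap {
    uint32_t id;
    bool required;   /* this engine can't cope if the cap is dropped */
};

static bool
req_present(const struct req_cap *caps, size_t n, uint32_t id)
{
    for (size_t i = 0; i < n; i++)
        if (caps[i].id == id)
            return true;
    return false;
}

/* Intersect two cap lists; returns the merged count, or -1 if either
 * side marked a cap "required" that the other side lacks. */
static int
merge_required(const struct req_cap *a, size_t na,
               const struct req_cap *b, size_t nb,
               struct req_cap *out)
{
    int n = 0;
    for (size_t i = 0; i < na; i++) {
        if (req_present(b, nb, a[i].id))
            out[n++] = a[i];
        else if (a[i].required)
            return -1;  /* would drop a cap engine A depends on */
    }
    for (size_t i = 0; i < nb; i++)
        if (b[i].required && !req_present(a, na, b[i].id))
            return -1;  /* would drop a cap engine B depends on */
    return n;
}
```

This captures the radeonsi case: if the GPU marks its cached cap required and the display set lacks it, the candidate fails instead of silently producing an uncached rendering layout.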
>
> > So then you could theoretically have a system that gives you:
> >
> > GPU: FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
> > Display: FOO/tiled(layout-caps=FOO/cc)
> > Video: FOO/tiled(dev-caps=FOO/vid-cache)
> > Camera: FOO/tiled(dev-caps=FOO/vid-cache)
> >
> > ... from which a FOO/tiled(FOO/cc) surface would be allocated.
> >
> > The idea here is that whether a transition is required is fully visible from
> > the capabilities:
> >
> > 1. Moving an image from the camera to the video engine for immediate
> > compression requires no transition.
> >
> > 2. Moving an image from the camera or video engine to the display requires a
> > transition by the video/camera device/API, which may flush the video cache.
> >
> > 3. Moving an image from the camera or video engine to the GPU additionally
> > requires a transition by the GPU, which may invalidate the GPU cache.
> >
> > 4. Moving an image from the GPU anywhere else requires a transition by the
> > GPU; in all cases, GPU caches may be flushed. When moving to the video
> > engine or camera, the image additionally needs to be decompressed. When
> > moving to the video engine (or camera? :)), a transition by the video engine
> > is also required, which may invalidate the video cache.
> >
> > 5. Moving an image from the display to the video engine requires a
> > decompression -- oops! :)
>
> I guess it should be possible for devices to provide transitions in
> both directions, which would deal with this..
>
> > Ignoring that last point for now, I don't think you actually need a
> > "query_transition" function in libdevicealloc with this approach, for the
> > most part.
>
> with the idea of being able to provide optional transitions, I'm
> leaning towards just having one of outputs of merging the caps sets
> being the sets of transitions required to pass a buffer between
> different devices..
>
I think I like the idea of having transitions being part of the
per-device/engine cap sets, so that such information can be used upon
merging to know which capabilities may remain or have to be dropped.
I think James's proposal for usage transitions was intended to work
with flows like:
1. App gets GPU caps for RENDER usage
2. App allocates GPU memory using a layout from (1)
3. App now decides it wants to use the buffer for SCANOUT
4. App queries usage transition metadata from RENDER to SCANOUT,
given the current memory layout.
5. Do the transition and hand the buffer off to display
The problem I see with this is that it isn't guaranteed that there will
be a chain of transitions for the buffer to be usable by display.
Adding transition metadata to the original capability sets, and using
that information when merging could give us a compatible memory layout
that would be usable by both GPU and display.
I'll look into extending the current merging logic to also take into
account transitions.
> although maybe the user doesn't need to know every possible transition
> between devices once you have more than two devices..
We should be able to infer how buffers are going to be moved around
from the list of usages, shouldn't we?
Maybe we are missing some bits of information there, but I think the
allocator should be able to know what transitions the app will care
about and provide only those.
>
> /me shrugs
>
> > Instead, each API needs to provide import and export transition/barrier
> > functions which receive the previous/next layout- and capability-sets.
> >
> > Basically, to import a frame from the camera to OpenGL/Vulkan in the above
> > system, you'd first do the camera transition:
> >
> > struct layout_capability cc_cap = { FOO, FOO_CC };
> > struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };
> >
> > cameraExportTransition(image, 1, &layoutCaps, 1, &gpu_cache, &fence);
> >
> > and then e.g. an OpenGL import transition:
> >
> > struct device_capability vid_cache = { FOO, FOO_VID_CACHE };
> >
> > glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);
> >
> > By looking at the capabilities for the other device, each API's driver can
> > derive the required transition steps.
> >
> > There are probably more gaps, but these are the two I can think of right
> > now, and both related to the initialization status of meta surfaces, i.e.
> > FOO/cc:
> >
> > 1. Point 5 above about moving away from the display engine in the example.
> > This is an ugly asymmetry in the rule that each engine performs its required
> > import and export transitions.
> >
> > 2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression meta
> > surface can be in one of two states:
> > - reflecting a fully decompressed surface (if the surface was previously
> > exported from the GPU), or
> > - garbage (if the surface was allocated by the GPU driver, but then handed
> > off to the camera before being re-imported for processing)
> > The GPU's import transition needs to distinguish the two, but it can't with
> > the scheme above.
>
> hmm, so I suppose this is also true in the cache case.. you want to
> know if the buffer was written by someone else since you saw it last..
>
> > Something to think about :)
> >
> > Also, not really a gap, but something to keep in mind: for multi-GPU
> > systems, the cache-capability needs to carry the device number or PCI bus id
> > or something, at least as long as those caches are not coherent between
> > GPUs.
> >
>
> yeah, maybe shouldn't be FOO/gpucache but FOO/gpucache($id)..
That just seems an implementation detail of the representation the
particular vendor chooses for the CACHE capability, right?
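For instance, a vendor might fold the device id into the cap's payload and treat it as part of equality; a sketch with invented names:

```c
/* Hedged sketch: FOO/gpucache($id) as a cap whose payload carries the
 * device id, so caches of different GPUs never merge as the same cap.
 * The allocator core can treat the payload as opaque bytes; only the
 * vendor needs to know the device id is in there. Invented names. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cache_cap {
    uint32_t vendor;   /* FOO */
    uint32_t id;       /* FOO_GPU_CACHE */
    uint32_t device;   /* PCI bus id or similar */
};

static bool
cache_cap_equal(struct cache_cap a, struct cache_cap b)
{
    return a.vendor == b.vendor && a.id == b.id && a.device == b.device;
}
```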
Thanks,
Miguel.
>
>
> BR,
> -R
--
Miguel