[Mesa-dev] GBM and the Device Memory Allocator Proposals
Miguel Angel Vico
mvicomoya at nvidia.com
Fri Dec 8 18:52:39 UTC 2017
On Wed, 6 Dec 2017 16:57:45 -0800
James Jones <jajones at nvidia.com> wrote:
> On 12/06/2017 03:25 AM, Nicolai Hähnle wrote:
> > On 06.12.2017 08:07, James Jones wrote:
> > [snip]
> >>>>>> So let's say you have a setup where both display and GPU supported
> >>>>>> FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
> >>>>>> (FOO/cached). But the GPU supported the following transitions:
> >>>>>>
> >>>>>> trans_a: FOO/CC -> null
> >>>>>> trans_b: FOO/cached -> null
> >>>>>>
> >>>>>> Then the sets for each device (in order of preference):
> >>>>>>
> >>>>>> GPU:
> >>>>>> 1: caps(FOO/tiled, FOO/CC, FOO/cached);
> >>>>>> constraints(alignment=32k)
> >>>>>> 2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
> >>>>>> 3: caps(FOO/tiled); constraints(alignment=32k)
> >>>>>>
> >>>>>> Display:
> >>>>>> 1: caps(FOO/tiled); constraints(alignment=64k)
> >>>>>>
> >>>>>> Merged Result:
> >>>>>> 1: caps(FOO/tiled, FOO/CC, FOO/cached);
> >>>>>> constraints(alignment=64k);
> >>>>>> transition(GPU->display: trans_a, trans_b; display->GPU: none)
> >>>>>> 2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
> >>>>>> transition(GPU->display: trans_a; display->GPU: none)
> >>>>>> 3: caps(FOO/tiled); constraints(alignment=64k);
> >>>>>> transition(GPU->display: none; display->GPU: none)
> >>>>>
> >>>>>
> >>>>> We definitely don't want to expose a way of getting uncached rendering
> >>>>> surfaces for radeonsi. I mean, I think we are supposed to be able
> >>>>> to program
> >>>>> our hardware so that the backend bypasses all caches, but (a) nobody
> >>>>> validates that and (b) it's basically suicide in terms of
> >>>>> performance. Let's
> >>>>> build fewer footguns :)
> >>>>
> >>>> sure, this was just a hypothetical example. But to take this case as
> >>>> another example, if you didn't want to expose uncached rendering (or
> >>>> cached w/ cache flushes after each draw), you would exclude the entry
> >>>> from the GPU set which didn't have FOO/cached (I'm adding back a
> >>>> cached but not CC config just to make it interesting), and end up
> >>>> with:
> >>>>
> >>>> trans_a: FOO/CC -> null
> >>>> trans_b: FOO/cached -> null
> >>>>
> >>>> GPU:
> >>>> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
> >>>> 2: caps(FOO/tiled, FOO/cached); constraints(alignment=32k)
> >>>>
> >>>> Display:
> >>>> 1: caps(FOO/tiled); constraints(alignment=64k)
> >>>>
> >>>> Merged Result:
> >>>> 1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
> >>>> transition(GPU->display: trans_a, trans_b; display->GPU: none)
> >>>> 2: caps(FOO/tiled, FOO/cached); constraints(alignment=64k);
> >>>> transition(GPU->display: trans_b; display->GPU: none)
> >>>>
> >>>> So there isn't anything in the result set that doesn't have GPU cache,
> >>>> and the cache-flush transition is always in the set of required
> >>>> transitions going from GPU -> display
> >>>>
> >>>> Hmm, I guess this does require the concept of a required cap..
> >>>
> >>> Which we already introduced to the allocator API when we realized we
> >>> would need them as we were prototyping.
> >>
> >> Note I also posed the question of whether things like cached (and
> >> similarly compression, since I view compression as roughly an
> >> equivalent mechanism to a cache) in one of the open issues on my XDC
> >> 2017 slides because of this very problem of over-pruning it causes.
> >> It's on slide 15, as "No device-local capabilities". You'll have to
> >> listen to my coverage of it in the recorded presentation for that
> >> slide to make any sense, but it's the same thing Nicolai has laid out
> >> here.
> >>
> >> As I continued working through our prototype driver support, I found I
> >> didn't actually need to include cached or compressed as capabilities:
> >> The GPU just applies them as needed and the usage transitions make it
> >> transparent to the non-GPU engines. That does mean the GPU driver
> >> currently needs to be the one to realize the allocation from the
> >> capability set to get optimal behavior. We could fix that by
> >> reworking our driver though. At this point, not including
> >> device-local properties like on-device caching in capabilities seems
> >> like the right solution to me. I'm curious whether this applies
> >> universally though, or if other hardware doesn't fit the "compression
> >> and stuff all behaves like a cache" idiom.
> >
> > Compression is a part of the memory layout for us: framebuffer
> > compression uses an additional "meta surface". At the most basic level,
> > an allocation with loss-less compression support is by necessity bigger
> > than an allocation without.
> >
> > We can allocate this meta surface separately, but then we're forced to
> > decompress when passing the surface around (e.g. to a compositor.)
> >
> > Consider also the example I gave elsewhere, where a cross-vendor tiling
> > layout is combined with vendor-specific compression:
> >
> > Device 1, rendering: caps(BASE/foo-tiling, VND1/compression)
> > Device 2, sampling/scanout: caps(BASE/foo-tiling, VND2/compression)
> >
> > Some more thoughts on caching or "device-local" properties below.
>
> Compression requires extra resources for us as well. That's probably
> universal. I think the distinction between the two approaches is
> whether the allocating driver deduces that compression can be used with
> a given capability set and hence adds the resources implicitly, or
> whether the capability set indicates it explicitly. My theory is that
> the implicit path is possible, but it has downsides. The explicit path
> is attractive due to its exact nature, as I alluded to in my talk: You
> can tell the exact properties of an allocation given the capability set
> used to allocate it. If that can be made to work, I prefer that path as
> well. Agreed that your path also works better for the
> multi-vendor+device example.
>
> >
> > [snip]
> >>> I think I like the idea of having transitions being part of the
> >>> per-device/engine cap sets, so that such information can be used upon
> >>> merging to know which capabilities may remain or have to be dropped.
> >>>
> >>> I think James's proposal for usage transitions was intended to work
> >>> with flows like:
> >>>
> >>> 1. App gets GPU caps for RENDER usage
> >>> 2. App allocates GPU memory using a layout from (1)
> >>> 3. App now decides it wants to use the buffer for SCANOUT
> >>> 4. App queries usage transition metadata from RENDER to SCANOUT,
> >>> given the current memory layout.
> >>> 5. Do the transition and hand the buffer off to display
> >>
> >> No, all usages the app intends to transition to must be specified up
> >> front when initially querying caps in the model I assumed. The app
> >> then specifies some subset (up to the full set) of the specified
> >> usages as a src and dst when querying transition metadata.
> >>
> >>> The problem I see with this is that it isn't guaranteed that there will
> >>> be a chain of transitions for the buffer to be usable by display.
> >>
> >> I hadn't thought hard about it, but my initial thoughts were that it
> >> would be required that the driver support transitioning to any single
> >> usage given the capabilities returned. However, transitioning to
> >> multiple usages (E.g., to simultaneously rendering and scanning out)
> >> could fail to produce a valid transition, in which case the app would
> >> have to fall back to a copy in that case, or avoid that simultaneous
> >> usage combination in some other way.
> >>
> >>> Adding transition metadata to the original capability sets, and using
> >>> that information when merging could give us a compatible memory layout
> >>> that would be usable by both GPU and display.
> >>>
> >>> I'll look into extending the current merging logic to also take into
> >>> account transitions.
> >>
> >> Yes, it'll be good to see whether this can be made to work. I agree
> >> Rob's example outcomes above are ideal, but it's not clear to me how
> >> to code up such an algorithm. This also all seems unnecessary if
> >> "device local" capabilities aren't needed, as posited above.
Even if "device local" capabilities aren't exposed in the capability
set, we might still want to have capabilities exposed that may have an
associated transition, right? And as already mentioned, it looks like
some capabilities, such as shared caches, might not qualify as "device
local" and must be exposed either way.

If we don't embed transition information in the capability set somehow,
I don't see how we can avoid the merge operation dropping certain
capabilities because they aren't found in all sets.

My background in the matter is limited, so I'm probably missing some
points, but here's an idea of how to implement Rob's suggestion, which
I think also ties to Nicolai's transition pseudo-algorithm below:

 1. Upon capabilities query, a device puts together a list of optimal
    capability sets to best satisfy the intended usage for that
    particular device.

 2. Since the list of all usages is provided, we can guess whether we
    might need subsets of those sets from (1) to satisfy all usages. We
    also know whether we can provide transitions that convert those
    sets from (1) to simpler subsets.

 3. We add to the list of returned capability sets all optimal sets
    from (1) plus all suboptimal sets from (2) (that can actually be
    obtained through transitions from (1)).

 4. We add information to each capability set so that we know what
    other sets in the list are obtained by applying transitions. Each
    set has a list of 'source transitions' (pointer to source superset +
    pointer to transition to apply) and a list of 'destination
    transitions' (pointer to destination subset + pointer to transition
    to apply).

Thus, we end up with a list of sets and subsets connected to each other
according to the available transitions.
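
To make the linked-sets idea a bit more concrete, here is a rough
Python sketch of what a device could return for the GPU in Rob's first
FOO example. This is not code from the allocator prototype: the
CapabilitySet class, its field names and the link() helper are all
made up for illustration, and a single alignment value stands in for
the whole constraint list.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class CapabilitySet:
    """Hypothetical capability set carrying its transition links."""
    caps: frozenset
    alignment: int
    # (source superset, transition name) pairs that lead to this set.
    source_transitions: list = field(default_factory=list)
    # (destination subset, transition name) pairs reachable from this set.
    destination_transitions: list = field(default_factory=list)


def link(src, dst, transition):
    """Record that applying `transition` to `src` yields the subset `dst`."""
    assert dst.caps < src.caps, "destination must be a strict subset"
    src.destination_transitions.append((dst, transition))
    dst.source_transitions.append((src, transition))


# GPU sets from the FOO example, in order of preference.
full = CapabilitySet(frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32 * 1024)
no_cache = CapabilitySet(frozenset({"FOO/tiled", "FOO/CC"}), 32 * 1024)
plain = CapabilitySet(frozenset({"FOO/tiled"}), 32 * 1024)

link(full, no_cache, "trans_b")   # trans_b: FOO/cached -> null
link(no_cache, plain, "trans_a")  # trans_a: FOO/CC -> null

gpu_sets = [full, no_cache, plain]
```

Each set only stores its direct neighbours, so the list returned by the
query doubles as the graph itself.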
Then, we can modify the capability merge logic such that:

 1. We compute the union of constraints.

 2. We search for the set whose capabilities are found in both provided
    lists. Since we have the connectivity information, we can actually
    return more complex sets that will be converted to the simpler
    found one (no need for dropping capabilities).

 3. If no intersection of sets is found by (2), we start dropping
    capabilities until we find an intersection, or fail the merge
    operation.

Note that transition information will be preserved in the new returned
list of sets.
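
As a rough illustration of merge steps (1) and (2), here is a small
Python sketch run against Rob's first FOO example. Everything in it is
an assumption made for the example: transitions are modelled as "name
-> capability that gets dropped", constraints are reduced to a single
alignment value, and reachable() and merge() are hypothetical helpers
rather than functions from the prototype. The fallback in step (3) is
not covered.

```python
def reachable(caps, drops):
    """Return every set reachable from `caps` via transitions, plus itself.

    `drops` maps a transition name to the one capability it removes.
    """
    seen = {caps}
    stack = [caps]
    while stack:
        cur = stack.pop()
        for dropped in drops.values():
            if dropped in cur:
                nxt = cur - {dropped}
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def merge(list_a, list_b, drops_a, drops_b):
    """Merge two preference-ordered lists of (caps, alignment) pairs."""
    merged = []
    for caps_a, align_a in list_a:
        for caps_b, align_b in list_b:
            common = reachable(caps_a, drops_a) & reachable(caps_b, drops_b)
            if not common:
                continue
            merged.append({
                # Step 2: keep the complex sets; transitions reduce them
                # to the richest set both sides can reach.
                "caps_a": caps_a,
                "caps_b": caps_b,
                "common": max(common, key=lambda s: (len(s), sorted(s))),
                # Step 1: union of constraints, here the stricter alignment.
                "alignment": max(align_a, align_b),
            })
    return merged


gpu = [
    (frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"}), 32 * 1024),
    (frozenset({"FOO/tiled", "FOO/CC"}), 32 * 1024),
    (frozenset({"FOO/tiled"}), 32 * 1024),
]
display = [(frozenset({"FOO/tiled"}), 64 * 1024)]
gpu_drops = {"trans_a": "FOO/CC", "trans_b": "FOO/cached"}

merged = merge(gpu, display, gpu_drops, {})
```

With these inputs, all three GPU sets survive the merge with a 64k
alignment, and each one records FOO/tiled as the set it has to be
reduced to before display can use it.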
When actually transitioning from one usage to another, we just navigate
the capability set graph from the corresponding source set to the
destination set, applying any transitions required, which are encoded
in the sets themselves.
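
Navigating the graph could then be a plain breadth-first search. The
Python sketch below assumes the connectivity has been flattened into a
dictionary mapping each capability set to its outgoing (transition,
destination set) edges. That layout and the transition_path() helper
are hypothetical, chosen only to keep the example short.

```python
from collections import deque


def transition_path(graph, source, destination):
    """Breadth-first search for the shortest chain of transitions.

    `graph` maps a capability set to a list of (transition, next set)
    edges. Returns the transition names in order, or None if no chain
    of transitions connects the two sets.
    """
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        current, path = queue.popleft()
        if current == destination:
            return path
        for name, nxt in graph.get(current, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [name]))
    return None


full = frozenset({"FOO/tiled", "FOO/CC", "FOO/cached"})
no_cache = frozenset({"FOO/tiled", "FOO/CC"})
plain = frozenset({"FOO/tiled"})

graph = {
    full: [("trans_b", no_cache)],   # trans_b: FOO/cached -> null
    no_cache: [("trans_a", plain)],  # trans_a: FOO/CC -> null
}

gpu_to_display = transition_path(graph, full, plain)
display_to_gpu = transition_path(graph, plain, full)
```

A None result is the "no chain of transitions" case raised earlier in
the thread, where the app would have to fall back to a copy.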
> >>
> >>>> although maybe the user doesn't need to know every possible transition
> >>>> between devices once you have more than two devices..
> >>>
> >>> We should be able to infer how buffers are going to be moved around
> >>> from the list of usages, shouldn't we?
> >>>
> >>> Maybe we are missing some bits of information there, but I think the
> >>> allocator should be able to know what transitions the app will care
> >>> about and provide only those.
> >>
> >> The allocator only knows the requested union of all usages currently.
> >> The number of possible transitions grows combinatorially for every
> >> usage requested I believe. I expect there will be cases where ~10
> >> usages are specified, so generating all possible transitions all the
> >> time may be excessive, when the app will probably generally only care
> >> about 2 or 3 states, and in practice, there will probably only
> >> actually be 2 or 3 different underlying possible combinations of
> >> operations.
> >
> > Exactly. So I wonder if we can't just "cut through the bullshit" somehow?
Rather than expressing usage as a union of uses, would it make sense to
express it as a directed graph somehow so that the application can
specify how it intends to move the allocation around?

If we had a directed usage graph, upon capability query, we'd know what
transitions the application is going to care about, and expose one set
of capabilities and transitions or another accordingly.
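
To illustrate the difference, here is a small Python sketch of what a
directed usage graph could look like next to a plain union. The
adjacency-dictionary representation, the TEXTURE usage name and both
helper functions are assumptions made for the example and are not part
of the current allocator API.

```python
# Hypothetical directed usage graph: each key is a usage, each value
# lists the usages the app intends to transition to from it.
usage_graph = {
    "RENDER": ["SCANOUT", "TEXTURE"],
    "TEXTURE": ["RENDER"],
    "SCANOUT": [],
}


def declared_transitions(graph):
    """List the (src, dst) usage pairs the application declared as edges."""
    return [(src, dst) for src, dsts in graph.items() for dst in dsts]


def all_ordered_pairs(graph):
    """Every ordered pair of distinct usages, as a plain union implies."""
    usages = list(graph)
    return [(a, b) for a in usages for b in usages if a != b]


wanted = declared_transitions(usage_graph)
implied = all_ordered_pairs(usage_graph)
```

Here the app declares 3 transitions, while the union of the same three
usages implies 6 ordered pairs even before counting transitions to
multiple simultaneous usages. The gap widens quickly as more usages
are requested.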
> >
> > I'm looking for something that would also eliminate another part of the
> > design that makes me uncomfortable: the metadata for transitions. This
> > makes me uncomfortable for a number of reasons. Who computes the
> > metadata? How is the representation of the metadata? With cross-device
> > usages (which is the whole point of the exercise), this quickly becomes
> > infeasible.
> >
> > So instead as a thought experiment, let's just use what we already have:
> > capabilities and constraints (or properties/attributes).
> >
> > I kind of already outlined this with the long example in my email here
> > https://lists.freedesktop.org/archives/mesa-dev/2017-December/179055.html
> >
> > Let me try to summarize the transition algorithm. Its inputs are:
> > - the current (source) capability set
> > - the desired new usages
> > - the capability sets associated with these usages, as queried when the
> > surface was allocated
> >
> > Steps of the algorithm:
> >
> > 1. Compute the merged capability set for the new usages (the destination
> > capability set).
> > 2. Compute the transition capability set, which is the merger of the
> > source and destination sets.
> > 3. Determine whether a "release" transition is required on the source
> > device(s):
> > 3a. For global properties, a transition is required if the source
> > capability set is a superset of the transition set.
> > 3b. For device-local properties, a transition is required if there is
> > some destination device for which the device-local properties are a
> > subset of the source set.
> > 4. Determine whether an "acquire" transition is required on the
> > destination device(s) in a similar way.
> >
> > Finally, execute the transitions using corresponding APIs, where the
> > APIs simply receive the computed capability sets.
> >
> > For example, release transitions would receive the source capability set
> > (and perhaps the source usages), the transition capability set, and the
> > set difference of device-local capabilities, and nothing else.
> >
> > The point is that all steps of the algorithm can be implemented in a
> > device-agnostic way in libdevicealloc, without calling into any
> > device/driver callbacks.
> >
> > I'm pretty sure this or something like it can be made to work. We need
> > to think through a lot of example cases, but at least we'll have thought
> > them through, which is better than relying on some opaque metadata thing
> > and then finding out later that there are some new cross-device cases
> > where things don't work out because the piece of (presumably
> > device-specific driver) code that computes the metadata isn't aware of
> > them.
>
> This sounds pretty good. I'd like to see more detailed pseudo-code of a
> full cycle (cap query, allocation, transition to and from a few usages),
> but it seems pretty solid. I very much like that it enables the
> explicit capability sets, but I'm mildly worried it might add API
> complexity overall rather than reduce it.
>
> I think in the end our two proposals are very similar: Yours just moves
> the conversion from high-level properties -> device commands to the
> driver applying the transition. That's fine in theory, though it shifts
> some minor overhead to the time of the transition. We could design the
> APIs such that it's possible to cache/pre-bake the device commands for a
> given transition though to alleviate that if it proves meaningful.
>
> To make it clearer what the "metadata" is in my version and hence
> perhaps make it clearer how similar the two are, a few notes:
>
> Transitions are queried per device in my proposal. Note this means you
> need to query two different sets of transition metadata for a
> cross-device transition, one from the source that would be applied on
> that device in the source API, and one from the destination that would
> be applied on that device in the destination API. APIs/engines that
> don't require transitions would return some NULL metadata indicating no
> required transition on that side.
>
> Some examples of the metadata approach:
>
> 1) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
> Vulkan, same device:
>
> -Query transition. You'd get some metadata representing very simple
> cache management stuff if anything. You'd apply it using some form of
> pipeline barrier on the relevant image.
>
> 2) transition from NVIDIA dev rendering -> NVIDIA dev texturing both in
> Vulkan, different device:
>
> -Query transition from each device. You'd get some metadata
> representing more complex cache management, and potentially a decompress
> depending on the compatibility of the two devices. The driver is the
> same for both devices in this case, so it can calculate the similarities
> exactly by examining the capability set and each device's properties.
> You'd apply it using some form of pipeline barrier with the respective
> metadata on the relevant image on each device.
>
>
> 3) transition from NVIDIA dev rendering -> AMD dev texturing both in Vulkan:
>
> -Query transition from each device. NVIDIA driver would see the
> destination usage is a foreign device it has no knowledge of and perform
> a complete cache flush and decompress. AMD driver would see the source
> usage is something it doesn't recognize and perform a full cache
> invalidate (and compression surface invalidate, if any?). You'd apply
> it using some form of pipeline barrier with the respective metadata on
> the relevant image on each device.
>
> 4) transition from NVIDIA dev rendering -> NVIDIA encoder with cache
> coherence
>
> -Query source transition on GPU dev. Query destination transition on
> video encoder dev. GPU recognizes the destination is a device it is
> aware has certain properties and hence returns a decompress only since
> it knows it has cache coherence. Video encoder dev returns NULL
> transition. Apply source transition on source graphics API. Note this
> case requires some careful coordination across a vendor's various driver
> stacks to perform optimally. It would automatically degrade to the
> foreign device case for naive/incomplete drivers though.
>
> > [snip]
> >> One final note: When I initially wrote up the capability merging
> >> logic, I treated "layout" as a sort of "special" capability, basically
> >> like Nicolai originally outlined above. Miguel suggested I add the
> >> "required" bit instead to generalize things, and it ended up working
> >> out much cleaner. Besides the layout, there is at least one other
> >> obvious candidate for a "required" capability that became obvious as
> >> soon as I started coding up the prototype driver: memory location. It
> >> might seem like memory location is a simple device-agnostic constraint
> >> rather than a capability, but it's actually too complicated (we need
> >> more memory locations than "device" and "host"). It has to be vendor
> >> specific, and hence fits in better as a capability.
> >
> > Could you give more concrete examples of what you'd like to see, and why
> > having this as constraints is insufficient?
>
> We have more than one "device local" memory with different capabilities
> on some devices. I think you guys have this situation as well with your
> cards with an SSD on them or something if I'm interpreting the
> marketing stuff right. I'd like to be able to express those all without
> needing to code them into the device-agnostic portion of the allocator
> library ahead of time. That way, if we come up with any new clever
> ones, we don't need to wait for everyone to update their allocator
> library to make use of them.
>
> Additionally, with things like SLI/Crossfire, we end up with a sort of
> NUMA memory architecture, where memory on a "remote" card might have
> similar but not exactly the same capabilities as device-local memory.
> This would be rather complex to represent in the generic constraints as
> well.
>
> >> I think if possible, we should try to keep the design generalized to
> >> as few types of objects and special cases as possible. The more we
> >> can generalize the solutions to our existing problem set, the better
> >> the mechanism should hold up as we apply it to new and unknown
> >> problems as they arise.
> >
> > I'm coming around to the fact that those things should perhaps live in a
> > single list/array, but I still don't like the term "capability".
> >
> > I admit it's a bit of bike-shedding, but I'm starting to think it would
> > be better to go with the generic term "property" or "attribute", and
> > then add flags/adjectives to that based on how merging should work.
> >
> > This would include the constraints as well -- it seems arbitrary to me
> > that those would be singled out into their own list.
> >
> > Basically, the underlying principle is that a good API would have either
> > one list that includes all the properties, or one list per
> > merging-behavior. And I think one single list is easier on the API
> > consumer and easier to extend.
>
> Agreed with Rob. Constraints are different for a reason: They're
> non-extensible and hence can merge in more complex ways. Capabilities
> are extensible, but must be merged by simple memcmp()-style operations,
> currently more or less simple intersection.
>
> However, I also don't care about naming. "Constraints" was chosen
> because it connotates negatively since they "limit" what an allocation
> created from a capability set can do, and similarly "capabilities"
> connotates positively because it indicates things that are built up
> additively to describe abilities of an allocation. However, I don't
> know that that metaphor held up entirely as the design was realized, so
> it might be a good time to bikeshed new names anyway.
There have been several naming suggestions for different pieces of the
library. I'll start separate threads with patches with some of the
name changes, so we can keep the bike-shedding separate from the design
discussion.
Thanks.
>
> Thanks,
> -James
>
> > Cheers,
> > Nicolai
> >
--
Miguel