[Mesa-dev] GBM and the Device Memory Allocator Proposals

Fri Dec 1 17:09:32 UTC 2017

On 01.12.2017 16:06, Rob Clark wrote:
> On Thu, Nov 30, 2017 at 5:43 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> Hi,
>>
>> I've had a chance to look a bit more closely at the allocator prototype
>> repository now. There's a whole bunch of low-level API design feedback, but
>> for now let's focus on the high-level stuff first.
>>
>> Going by the 4.5 major object types (as also seen on slide 5 of your
>> presentation [0]), assertions and usages make sense to me.
>>
>> Capabilities and capability sets should be cleaned up in my opinion, as the
>> status quo is overly obfuscating things. What capability sets really
>> represent, as far as I understand them, is *memory layouts*, and so that's
>> what they should be called.
>>
>> This conceptually simplifies `derive_capabilities` significantly without any
>> loss of expressiveness as far as I can see. Given two lists of memory
>> layouts, we simply look for which memory layouts appear in both lists, and
>> then merge their constraints and capabilities.
>>
>> Merging constraints looks good to me.
>>
>> Capabilities need some more thought. The prototype removes capabilities when
>> merging layouts, but I'd argue that that is often undesirable. (In fact, I
>> cannot think of capabilities which we'd always want to remove.)
>>
>> A typical example for this is compression (i.e. DCC in our case). For
>> rendering usage, we'd return something like:
>>
>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(AMD/DCC)
>>
>> For display usage, we might return (depending on hardware):
>>
>> Memory layout: AMD/tiled; constraints(alignment=64k); caps(none)
>>
>> Merging these in the prototype would remove the DCC capability, even though
>> it might well make sense to keep it there for rendering. Dealing with the
>> fact that display usage does not have this capability is precisely one of
>> the two things that transitions are about! The other thing that transitions
>> are about is caches.
>>
>> I think this is kind of what Rob was saying in one of his mails.
> 
> Perhaps "layout" is a better name than "caps".. either way I think of
> both AMD/tiled and AMD/DCC as the same type of "thing".. the
> difference between AMD/tiled and AMD/DCC is that a transition can be
> provided for AMD/DCC.  Other than that they are both things describing
> the layout.

The reason that a transition can be provided is that they aren't quite 
the same thing, though. In a very real sense, AMD/DCC is a "child" 
property of AMD/tiled: DCC is implemented as a meta surface whose memory 
layout depends on the layout of the main surface.

Although, if there are GPUs that can do an in-place "transition" between 
different tiling layouts, then the distinction is perhaps really not as 
clear-cut. I guess that would only apply to tiled renderers.

> So let's say you have a setup where both display and GPU supported
> FOO/tiled, but only GPU supported compressed (FOO/CC) and cached
> (FOO/cached).  But the GPU supported the following transitions:
> 
>    trans_a: FOO/CC -> null
>    trans_b: FOO/cached -> null
> 
> Then the sets for each device (in order of preference):
> 
> GPU:
>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=32k)
>    2: caps(FOO/tiled, FOO/CC); constraints(alignment=32k)
>    3: caps(FOO/tiled); constraints(alignment=32k)
> 
> Display:
>    1: caps(FOO/tiled); constraints(alignment=64k)
> 
> Merged Result:
>    1: caps(FOO/tiled, FOO/CC, FOO/cached); constraints(alignment=64k);
>       transition(GPU->display: trans_a, trans_b; display->GPU: none)
>    2: caps(FOO/tiled, FOO/CC); constraints(alignment=64k);
>       transition(GPU->display: trans_a; display->GPU: none)
>    3: caps(FOO/tiled); constraints(alignment=64k);
>       transition(GPU->display: none; display->GPU: none)

We definitely don't want to expose a way of getting uncached rendering 
surfaces for radeonsi. I mean, I think we are supposed to be able to 
program our hardware so that the backend bypasses all caches, but (a) 
nobody validates that and (b) it's basically suicide in terms of 
performance. Let's build fewer footguns :)

So at least for radeonsi, we wouldn't want to have an AMD/cached bit, 
but we'd still want to have a transition between the GPU and display 
precisely to flush caches.

>> Two interesting questions:
>>
>> 1. If we query for multiple usages on the same device, can we get a
>> capability which can only be used for a subset of those usages?
> 
> I think the original idea was, "no"..  perhaps that restriction
> could be lifted if transitions were part of the result.  Or maybe you
> just query independently the same device for multiple different
> usages, and then merge that cap-set.
> 
> (Do we need to care about intra-device transitions?  Or can we just
> let the driver care about that, same as it always has?)
> 
>> 2. What happens when we merge memory layouts with sets of capabilities where
>> neither is a subset of the other?
> 
> I think this is a case where no zero-copy sharing is possible, right?

Not necessarily. Let's say we have some industry-standard tiling layout 
foo, and vendors support their own proprietary framebuffer compression 
on top of that.

In that case, we may get:

Device 1, rendering: caps(BASE/foo, VND1/compressed)
Device 2, sampling/scanout: caps(BASE/foo, VND2/compressed)

It should be possible to allocate a surface as

caps(BASE/foo, VND1/compressed)

and just transition the cap(VND1/compressed) away after rendering before 
accessing it with device 2.

The interesting question is whether it would be possible or ever useful 
to have a surface allocated as caps(BASE/foo, VND1/compressed, 
VND2/compressed).

My guess is: there will be cases where it's possible, but there won't be 
cases where it's useful (because you tend to render on device 1 and just 
sample or scanout on device 2).

So it makes sense to say that derive_capabilities should just provide 
both layouts in this case.
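
A minimal C sketch of that rule, under stated assumptions: `struct layout`, `derive_layouts` and the `CAP_*` bits are hypothetical names, the base layout is reduced to a plain integer id, and every extra cap is assumed to come with a transition that removes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor compression bits layered on top of a base layout. */
enum {
    CAP_VND1_COMPRESSED = 1u << 0,
    CAP_VND2_COMPRESSED = 1u << 1,
};

struct layout {
    uint32_t base; /* id of the base tiling layout, e.g. BASE/foo */
    uint32_t caps; /* optional caps on top of the base layout */
};

/*
 * Given one layout per device, write the candidate allocation layouts
 * to out[] and return how many there are (0, 1 or 2).
 */
static size_t derive_layouts(const struct layout *a, const struct layout *b,
                             struct layout out[2])
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Different base layouts: nothing to share in this sketch. */
    if (a->base != b->base)
        return 0;

    /* a's caps are a subset of b's: keep the richer layout b. */
    if ((a->caps & ~b->caps) == 0) {
        out[0] = *b;
        return 1;
    }
    /* b's caps are a subset of a's: keep the richer layout a. */
    if ((b->caps & ~a->caps) == 0) {
        out[0] = *a;
        return 1;
    }

    /* Neither is a subset of the other: provide both layouts. */
    out[0] = *a;
    out[1] = *b;
    return 2;
}
```

For the VND1/VND2 example this returns two candidates, and the caller would pick the one matching the device that renders.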

>> As for the actual transition API, I accept that some metadata may be
>> required, and the metadata probably needs to depend on the memory layout,
>> which is often vendor-specific. But even linear layouts need some
>> transitions for caches. We probably need at least some generic "off-device
>> usage" bit.
> 
> I've started thinking of cached as a capability with a transition.. I
> think that helps.  Maybe it needs to somehow be more specific (ie. if
> you have two devices both with their own cache with no coherency
> between the two)

As I wrote above, I'd prefer not to think of "cached" as a capability at 
least for radeonsi.

From the desktop perspective, I would say let's ignore caches, the 
drivers know which caches they need to flush to make data visible to 
other devices on the system.

On the other hand, there are probably SoC cases where non-coherent 
caches are shared between some but not all devices, and in that case 
perhaps we do need to communicate this.

So perhaps we should have two kinds of "capabilities".

The first, like framebuffer compression, is a capability of the 
allocated memory layout (because the compression requires a meta 
surface), and devices that expose it may opportunistically use it.

The second, like caches, is a capability that the device/driver will use 
and you don't get a say in it, but other devices/drivers also don't need 
to be aware of it.

So then you could theoretically have a system that gives you:

GPU:     FOO/tiled(layout-caps=FOO/cc, dev-caps=FOO/gpu-cache)
Display: FOO/tiled(layout-caps=FOO/cc)
Video:   FOO/tiled(dev-caps=FOO/vid-cache)
Camera:  FOO/tiled(dev-caps=FOO/vid-cache)

... from which a FOO/tiled(FOO/cc) surface would be allocated.

The idea here is that whether a transition is required is fully visible 
from the capabilities:

1. Moving an image from the camera to the video engine for immediate 
compression requires no transition.

2. Moving an image from the camera or video engine to the display 
requires a transition by the video/camera device/API, which may flush 
the video cache.

3. Moving an image from the camera or video engine to the GPU 
additionally requires a transition by the GPU, which may invalidate the 
GPU cache.

4. Moving an image from the GPU anywhere else requires a transition by 
the GPU; in all cases, GPU caches may be flushed. When moving to the 
video engine or camera, the image additionally needs to be decompressed. 
When moving to the video engine (or camera? :)), a transition by the 
video engine is also required, which may invalidate the video cache.

5. Moving an image from the display to the video engine requires a 
decompression -- oops! :)

Ignoring that last point for now, I don't think you actually need a 
"query_transition" function in libdevicealloc with this approach, for 
the most part.

Instead, each API needs to provide import and export transition/barrier 
functions which receive the previous/next layout and capability set.

Basically, to import a frame from the camera to OpenGL/Vulkan in the 
above system, you'd first do the camera transition:

   struct layout_capability cc_cap = { FOO, FOO_CC };
   struct device_capability gpu_cache = { FOO, FOO_GPU_CACHE };

   cameraExportTransition(image, 1, &cc_cap, 1, &gpu_cache, &fence);

and then e.g. an OpenGL import transition:

   struct device_capability vid_cache = { FOO, FOO_VID_CACHE };

   glImportTransitionEXT(texture, 0, NULL, 1, &vid_cache, fence);

By looking at the capabilities for the other device, each API's driver 
can derive the required transition steps.

There are probably more gaps, but these are the two I can think of right 
now, and both are related to the initialization status of meta surfaces, 
i.e. FOO/cc:

1. Point 5 above about moving away from the display engine in the 
example. This is an ugly asymmetry in the rule that each engine performs 
its required import and export transitions.

2. When the GPU imports a FOO/tiled(FOO/cc) surface, the compression 
meta surface can be in one of two states:
- reflecting a fully decompressed surface (if the surface was previously 
exported from the GPU), or
- garbage (if the surface was allocated by the GPU driver, but then 
handed off to the camera before being re-imported for processing)
The GPU's import transition needs to distinguish the two, but it can't 
with the scheme above.

Something to think about :)

Also, not really a gap, but something to keep in mind: for multi-GPU 
systems, the cache-capability needs to carry the device number or PCI 
bus id or something, at least as long as those caches are not coherent 
between GPUs.
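
One possible shape for such a capability, as a sketch only: the struct layout, the field names and `same_cache` are hypothetical, and a PCI address is just one of the identifiers mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* PCI location of the device that owns the cache. */
struct pci_address {
    uint16_t domain;
    uint8_t bus;
    uint8_t dev;
    uint8_t func;
};

/* Hypothetical cache capability tagged with its owning device. */
struct cache_capability {
    uint32_t vendor;          /* vendor namespace, e.g. FOO */
    uint32_t cap;             /* cache id within that namespace */
    struct pci_address owner; /* which physical device the cache belongs to */
};

/* Two capabilities name the same cache only if the owner matches too. */
static bool same_cache(const struct cache_capability *a,
                       const struct cache_capability *b)
{
    assert(a != NULL && b != NULL);
    return a->vendor == b->vendor && a->cap == b->cap &&
           a->owner.domain == b->owner.domain &&
           a->owner.bus == b->owner.bus &&
           a->owner.dev == b->owner.dev &&
           a->owner.func == b->owner.func;
}
```

With this, two identical GPUs in different slots report caches that compare as different, so a transition between them is still recognized as necessary.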

Cheers,
Nicolai

> 
> BR,
> -R
> 
>>
>> Cheers,
>> Nicolai
>>
>> [0] https://www.x.org/wiki/Events/XDC2017/jones_allocator.pdf
>>
>>
>> On 21.11.2017 02:11, James Jones wrote:
>>>
>>> As many here know at this point, I've been working on solving issues
>>> related to DMA-capable memory allocation for various devices for some time
>>> now.  I'd like to take this opportunity to apologize for the way I handled
>>> the EGL stream proposals.  I understand now that the development process
>>> followed there was unacceptable to the community and likely offended many
>>> great engineers.
>>>
>>> Moving forward, I attempted to reboot talks in a more constructive manner
>>> with the generic allocator library proposals & discussion forum at XDC 2016.
>>> Some great design ideas came out of that, and I've since been prototyping
>>> some code to prove them out before bringing them back as official proposals.
>>> Again, I understand some people are growing concerned that I've been doing
>>> this off on the side in a github project that has primarily NVIDIA
>>> contributors.  My goal was only to avoid wasting everyone's time with
>>> unproven ideas.  The intent was never to dump the prototype code as-is on
>>> the community and presume acceptance. It's just a public research project.
>>>
>>> Now the prototyping is nearing completion, and I'd like to renew
>>> discussion on whether and how the new mechanisms can be integrated with the
>>> Linux graphics stack.
>>>
>>> I'd be interested to know if more work is needed to demonstrate the
>>> usefulness of the new mechanisms, or whether people think they have value at
>>> this point.
>>>
>>> After talking with people on the hallway track at XDC this year, I've
>>> heard several proposals for incorporating the new mechanisms:
>>>
>>> -Include ideas from the generic allocator design into GBM.  This could
>>> take the form of designing a "GBM 2.0" API, or incrementally adding to the
>>> existing GBM API.
>>>
>>> -Develop a library to replace GBM.  The allocator prototype code could be
>>> massaged into something production worthy to jump start this process.
>>>
>>> -Develop a library that sits beside or on top of GBM, using GBM for
>>> low-level graphics buffer allocation, while supporting non-graphics kernel
>>> APIs directly.  The additional cross-device negotiation and sorting of
>>> capabilities would be handled in this slightly higher-level API before
>>> handing off to GBM and other APIs for actual allocation somehow.
>>>
>>> -I have also heard some general comments that regardless of the
>>> relationship between GBM and the new allocator mechanisms, it might be time
>>> to move GBM out of Mesa so it can be developed as a stand-alone project.
>>> I'd be interested what others think about that, as it would be something
>>> worth coordinating with any other new development based on or inside of GBM.
>>>
>>> And of course I'm open to any other ideas for integration.  Beyond just
>>> where this code would live, there is much to debate about the mechanisms
>>> themselves and all the implementation details.  I was just hoping to kick
>>> things off with something high level to start.
>>>
>>> For reference, the code Miguel and I have been developing for the
>>> prototype is here:
>>>
>>>      https://github.com/cubanismo/allocator
>>>
>>> And we've posted a port of kmscube that uses the new interfaces as a
>>> demonstration here:
>>>
>>>      https://github.com/cubanismo/kmscube
>>>
>>> There are still some proposed mechanisms (usage transitions mainly) that
>>> aren't prototyped, but I think it makes sense to start discussing
>>> integration while prototyping continues.
>>>
>>> In addition, I'd like to note that NVIDIA is committed to providing open
>>> source driver implementations of these mechanisms for our hardware, in
>>> addition to support in our proprietary drivers.  In other words, wherever
>>> modifications to the nouveau kernel & userspace drivers are needed to
>>> implement the improved allocator mechanisms, we'll be contributing patches
>>> if no one beats us to it.
>>>
>>> Thanks in advance for any feedback!
>>>
>>> -James Jones
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>>
>>
>> --
>> Lerne, wie die Welt wirklich ist,
>> Aber vergiss niemals, wie sie sein sollte.

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.