XDC allocator workshop and Wayland dmabuf hints

Mon Oct 14 13:02:59 UTC 2019

On 10/13/19 2:05 PM, Scott Anderson wrote:
> (Sorry to the CCs for the spam; I made an error in my first posting.)
> 
> Hi,
> 
> There were certainly some interesting changes discussed at the allocator
> workshop during XDC this year, and I'd like to just summarise my
> thoughts on it and make sure everybody is on the same page.
> 
> For those who don't know who I am or my stake in this, I'm the
> maintainer of the DRM and graphics code for the wlroots Wayland
> compositor library. I'm ascent12 on Github and Freenode.
> 
> 
> My understanding of the issue Nvidia was trying to solve was the
> in-place transition between different format modifiers. E.g. if a client
> is to be scanned out, the buffer would need to be transitioned to a
> non-compressed format that the display controller can work with, but if
> the client is to be composited, a compressed format would be used,
> saving on memory bandwidth. Hardware may have more efficient ways to
> transition between different formats, so it would be good to use
> these rather than rely on a blit when we don't need one. The
> problem is more general than this, but that was just the example given.
> 
> The original solution proposed in James' talk was to add functions to
> EGL/OpenGL/Vulkan and have the display server perform transitions where
> required.

FWIW, I didn't intend to imply the display server should be the thing
doing transitions.  It is a possible implementation, but I assumed
display servers would only do these transitions in fallback paths or as
part of some in-between period before clients picked up on the need for
them.  Beyond the design goals you imply below, I wanted to note that
it's better to perform transitions in the client.  Since transitions
were intended to be persistent (paralleling Vulkan layout transitions),
if the client hadn't picked up on the transition and agreed to handle
it, the compositor would need to transition the image back to the
client's view of it anyway, which would not be ideal and could cost
additional perf in some cases.
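
To make the cost argument above concrete, here is a toy model (not a proposed API) that counts in-place layout transitions when they are persistent, as with Vulkan image layouts. The enum and transitions_needed() are hypothetical; a real implementation would track DRM format modifiers or driver-internal layouts rather than two enum values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a compressed (render) layout and a
 * scanout-compatible layout of the same image. */
enum layout { LAYOUT_COMPRESSED, LAYOUT_SCANOUT };

/*
 * Count the in-place transitions needed to scan out `frames` frames.
 * Transitions are persistent: the image stays in whatever layout it
 * was last moved to.
 *
 * client_adopts == true:  the client picked up the transition and keeps
 *                         rendering in the scanout layout.
 * client_adopts == false: the client still expects the compressed layout,
 *                         so the image is moved back to the client's view
 *                         before each frame and to scanout layout after it.
 */
static unsigned transitions_needed(unsigned frames, bool client_adopts)
{
    const enum layout client_view =
        client_adopts ? LAYOUT_SCANOUT : LAYOUT_COMPRESSED;
    enum layout current = LAYOUT_COMPRESSED; /* initial allocation */
    unsigned count = 0;

    for (unsigned i = 0; i < frames; i++) {
        /* Client renders: the image must be in the client's layout. */
        if (current != client_view) {
            current = client_view;
            count++;
        }
        /* Compositor scans out: the image must be in scanout layout. */
        if (current != LAYOUT_SCANOUT) {
            current = LAYOUT_SCANOUT;
            count++;
        }
        assert(current == LAYOUT_SCANOUT);
    }
    return count;
}
```

Under these assumptions, ten frames cost 1 transition when the client adopts the scanout layout, versus 19 when the image has to be moved back and forth every frame.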

> Early in the workshop, discussion tended towards having libliftoff
> handle all of this, but that would require libliftoff to have its own
> rendering context, which I think bloats the purpose of the library.
> Also discussed was having libliftoff ask the compositor to perform the
> transition if it thinks it is possible.
> 
> 
> Another suggestion I made was to make use of Simon's dmabuf hints patch
> to the wp_linux_dmabuf protocol [1] and leave it up to the client's GPU
> driver to handle any transitions. This wasn't adequately represented in
> the lightning talk summarising the workshop, so I'll go over it here
> now, making sure everyone understands what it is and why I think it is
> the way we should go forward.
> 
> Right now, a Wayland compositor will advertise all of the
> format+modifier pairs that it supports, but does not provide
> any context for clients as to which one they should actually choose.
> It's basically up to chance whether a client can be scanned out, which
> is likely to lead to several suboptimal situations.
> 
> The dmabuf hints patch adds a way to suggest a better format to use,
> based on the current context. This is dynamic, and can be sent multiple
> times over the lifetime of a surface. The patch also adds a way for the
> compositor to tell the client which GPU it's using, which is useful for
> clients to know in multi-GPU situations.
> 
> These hints are in various "tranches", which are just groups of
> format+modifier pairs of the same preference. The tranches are ordered
> from most optimal to least optimal. The most optimal tranche would imply
> direct scanout, while a less optimal tranche would imply compositing,
> but the protocol doesn't actually define them like that.
> 
> If a client becomes fullscreen, we would send the format+modifier pairs
> for the primary plane as the most optimal tranche. If a client is
> eligible to be scanned out on an overlay plane, we would send the
> format+modifier pairs for that plane. If a client is partially occluded
> or otherwise not possible to be scanned out, we'd just have the normal
> format+modifier pairs that we can use as a texture. Note that the
> compositor won't send format+modifier pairs which we cannot texture
> from, even if the plane advertises support for them. We always need to be
> able to fall back to compositing.
> 
> 
> The hard part of figuring out which clients are "eligible" for being
> scanned out on an overlay plane could be handled by libliftoff (or
> something similar) and given back to the compositor to forward to
> clients. For libliftoff to make a properly informed decision, I think
> the atomic KMS API needs to be changed. We can only use TEST_ONLY with
> valid buffers, which tests the immediate configuration but doesn't let
> us test a configuration we WANT to go to. We need some sort of fake
> framebuffer that isn't backed by any real memory but can still be used
> in a TEST_ONLY commit. Without this, we may send the client
> format+modifier pairs that we think will work for scanout but don't,
> due to hardware limitations or transient issues like memory bandwidth,
> and we could actually make things worse by having the client transition
> formats.
> 
> As an aside, I would really like these fake framebuffers to make my
> modesetting setup a lot cleaner too.
> 
> I'm sure this has been discussed before, and I'm not really sure what
> the implications are from a driver perspective. I'd have to leave it up
> to people more familiar with KMS and driver internals to comment on
> this. Even if the solution isn't 100%, something that works most of the
> time would be hugely helpful (especially with RGB formats). Perhaps this
> is not possible, and it would need to live in driver-specific code
> inside libraries like libliftoff, but it would be nice not to come to
> that. It seems useful enough for a generic KMS userspace.

IIRC, the only concern I mentioned about this type of mechanism is that 
we'd need to validate the "real" surface was in local device memory. 
However, we probably wouldn't succeed creating a "real" FB for a surface 
that wasn't in/couldn't get in device memory in the first place, so 
that's probably moot.  I don't know if others have limitations like this 
that are tied to having actual memory and don't have the same solution, 
but regardless, I agree: an empty proxy FB object of some sort sounds
useful.  I'll never claim to be a modesetting expert, but from some
quick grepping, it looks like the input our HW needs to validate a 
configuration is covered by bpp/format + outputs used + overlay planes 
used on those outputs.

> As to how dmabuf hints would look client-side, I think this could be
> managed by the GPU driver pretty easily.
> 
> For EGL, if the driver is capable of in-place transitions, it can
> simply do that as required. If the GPU cannot transition to a new format
> directly, the driver can deallocate and then reallocate buffers in the
> new format as they are consumed. This would only lead to a couple of
> inefficient frames when a new hint is sent, but will reach an optimal
> situation in the steady state.
> 
> Vulkan already has VK_SUBOPTIMAL_KHR for telling the application that it
> should reallocate its swapchain.
> 
> If either the EGL driver or Vulkan application doesn't take these hints
> into account, things will still continue to work, just not in the most
> optimal way, basically as it works now.
> 
> 
> I believe that the dmabuf hints patch should address the transition issue
> that Nvidia was trying to fix. It keeps a lot of the complexity inside
> of the drivers, and keeps the rendering complexity outside of
> libliftoff, and doesn't require extensions to EGL/OpenGL/Vulkan as far
> as I know.
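
The client-side behaviour described above can be sketched in C, assuming the hints arrive as an ordered list of tranches. The client takes the first pair its driver can render to, and an implementation that cannot transition in place moves buffers to the new format one at a time as the compositor releases them. struct tranche, pick_format() and on_release() are hypothetical names for illustration, not protocol or driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A DRM fourcc format code plus a format modifier. */
struct fmt_mod {
    uint32_t format;
    uint64_t modifier;
};

/* Hypothetical client-side view of one hint tranche: pairs of equal
 * preference. Tranches arrive ordered from most to least preferred. */
struct tranche {
    const struct fmt_mod *pairs;
    size_t n;
};

static bool same_pair(struct fmt_mod a, struct fmt_mod b)
{
    return a.format == b.format && a.modifier == b.modifier;
}

/* Choose the first pair, in tranche order, that the driver can render to. */
static bool pick_format(const struct tranche *tranches, size_t n_tranches,
                        const struct fmt_mod *renderable, size_t n_render,
                        struct fmt_mod *chosen)
{
    assert(chosen != NULL);
    for (size_t t = 0; t < n_tranches; t++) {
        for (size_t i = 0; i < tranches[t].n; i++) {
            for (size_t j = 0; j < n_render; j++) {
                if (same_pair(tranches[t].pairs[i], renderable[j])) {
                    *chosen = tranches[t].pairs[i];
                    return true;
                }
            }
        }
    }
    return false;
}

/* A swapchain buffer as a hypothetical EGL/WSI implementation tracks it. */
struct buffer {
    struct fmt_mod fm;
    bool busy; /* held by the compositor */
};

/*
 * Called when the compositor releases a buffer. If the buffer no longer
 * matches the current preference, "reallocate" it in the new format;
 * buffers still held by the compositor are left alone until released.
 * Returns true if the buffer was reallocated.
 */
static bool on_release(struct buffer *buf, struct fmt_mod preferred)
{
    assert(buf->busy);
    buf->busy = false;
    if (same_pair(buf->fm, preferred))
        return false;
    buf->fm = preferred; /* real code would free and allocate a dmabuf here */
    return true;
}
```

Because only released buffers are reallocated, the old and new formats coexist for a few frames before the steady state is reached. A Vulkan application would instead recreate its swapchain after seeing VK_SUBOPTIMAL_KHR.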

I agree this all sounds like the right path forward.  However, a few 
things to note:

-I still think the extensions are useful.  Our proprietary driver's 
Wayland EGL layer is written entirely in EGL+OpenGL (i.e., it's just a
"layer"):  https://github.com/NVIDIA/egl-wayland  Currently it's using
EGLStream, but it could have a GBM+dma-buf path as well.  Similarly,
being able to implement Vulkan WSI in a layer using only standardized
Vulkan code in the layer was an explicit goal of Vulkan 1.1-era SI
(external objects, the dma-buf extensions layered on top, etc.), and is
clearly the preferable future (see the related XDC talk).  Finally,
low-level clients shouldn't be required to go through 
eglSwapBuffers()/vkQueuePresentKHR() at all.  There shouldn't be a 
required perf hit for rolling your own presentation code.  For all these 
reasons, I think having GL/EGL/Vulkan extensions available is still the 
way to go, and these are the primary reasons I wrote things up this way 
as opposed to hacking on eglSwapBuffers() in nouveau.  You could say 
this can be deferred, but I'd prefer to develop it up front to prove the 
concept.  It would suck to get this wrong somehow and then have to go 
back and adjust the interfaces again to fix it.  Internally, it 
shouldn't matter that much either way.  Drivers can use the same 
gallium/DRI/internal code paths to implement the extension and some 
automatic transition.

-As you note, this limits things to formats/layouts that can be 
composited (basically, things that can be textures).  "Things that can 
be textures" is a superset of "Things that can be scanned out" for these 
purposes on our HW, so that's fine for NVIDIA.  Does that hold up 
elsewhere?  A secondary motivation for me was that the compositor could 
switch from overlay-plane scanout back to compositing without
requiring a blit or a new frame from the client in cases where that 
didn't hold up, but I don't know if that's interesting or not.
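
Whether the "only advertise texturable pairs" rule costs anything on a given driver comes down to a subset check like the following sketch. The names are hypothetical; the two lists would, for example, come from a KMS plane's IN_FORMATS property and the renderer's dma-buf import modifier query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A DRM fourcc format code plus a format modifier. */
struct fmt_mod {
    uint32_t format;
    uint64_t modifier;
};

/*
 * Count scanout-capable pairs that cannot be textured from. A compositor
 * that only advertises texturable pairs will never offer these for direct
 * scanout, so a result of 0 means "texturable" is a superset of
 * "scanout-capable" and the rule loses nothing on this hardware.
 */
static size_t scanout_only_pairs(const struct fmt_mod *scanout, size_t n_scanout,
                                 const struct fmt_mod *texturable, size_t n_tex)
{
    size_t missing = 0;

    assert(scanout != NULL || n_scanout == 0);
    for (size_t i = 0; i < n_scanout; i++) {
        bool found = false;
        for (size_t j = 0; j < n_tex && !found; j++)
            found = scanout[i].format == texturable[j].format &&
                    scanout[i].modifier == texturable[j].modifier;
        if (!found)
            missing++;
    }
    return missing;
}
```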

I don't think any of this prevents moving forward with your proposals. 
Just wanted to note it here for posterity.

> In the future, if any sort of constraints interface is worked out,
> wp_linux_dmabuf could be extended again to accommodate it. There may
> already be a need for some extra flags which would be the equivalent of
> GBM_BO_USE_SCANOUT, but I'm not going to try and design that interface
> here.

I agree deferring constraints is the best way to make progress.

Thanks,
-James

> 
> Thanks for your time.
> Any feedback is welcome.
> 
> Scott
> 
> ---
> [1] https://patchwork.freedesktop.org/patch/263061/
> _______________________________________________
> wayland-devel mailing list
> wayland-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/wayland-devel