About migrating framebuffers in multi-GPU compositors

Pekka Paalanen ppaalanen at gmail.com
Thu Mar 24 14:44:57 UTC 2022


On Thu, 24 Mar 2022 13:43:02 +0000
"Hoosier, Matt" <Matt.Hoosier at garmin.com> wrote:

> On Thu, 2022-03-24 at 11:56 +0200, Pekka Paalanen wrote:

...

> > This MR cover letter has a better overview of all the methods:
> > 
> > <https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/810>
> 
> Ah, even nicer. Thanks!
> 
> In the ranked-order list of strategies there, the zero-copy
> technique is less preferred than the secondary GPU copy
> technique. Seems like you'd rarely ever fall through to the
> zero-copy strategy even if the GPU drivers do both support it.
> Anything subtle going on there that's good to be aware of? Like
> maybe a given driver typically supports secondary-GPU-copy XOR
> zero-copy, so you're fairly likely to reach the second strategy
> on systems that can handle it.
> 

Hi Matt,

the main reason was to not regress existing working systems. Unless
you actually own all the different kinds and combinations of
systems, there isn't really any way to test anything except on the
one machine you happen to have at hand. You rely on end users to
test things.

The zero-copy path means rendering directly into "foreign" buffers.
If that works at all, the performance might be a nasty surprise.
Performance can also depend on the scenegraph: how much blending
there is, how much damage needs to be painted at a time, how many
pixels there are. There are no guarantees. If you want to know, you
have to benchmark every system individually.

I may be pessimistic here, but this avoided regressions, I hope.

The secondary GPU should be able to execute in parallel with the
primary GPU, so doing the copy on the secondary GPU might be more
responsive than the zero-copy path, because accessing foreign
buffers has a high cost in some hardware configurations. Zero-copy
might be much slower than the secondary or even the primary GPU copy
path. Or zero-copy might be the fastest path.

In the DisplayLink case, the secondary GPU does not support
hardware OpenGL (there is no GPU), so the secondary GPU copy path is
skipped outright.

> > > If I follow this right, the blit occurs directly between video
> > > memory owned by the primary GPU into dumb-buffer memory owned by
> > > the secondary GPU, without laboriously using the CPU to do PIO.
> > 
> > Correct.
> > 
> > > Does this imply that the two GPUs' drivers have to be at least
> > > minimally aware of each other to negotiate some kind of DMA path
> > > directly between the two?
> > 
> > I don't know the details. It depends on whether you can map
> > secondary GPU memory to be written by the primary GPU. The specific
> > use case here is iGPU as primary and virtual as secondary, which
> > means that video memory for both is more or less "system RAM". No
> > discrete VRAM involved.
> 
> Oh interesting. I hadn't realized that on the hybrid GPU systems
> even the dGPU uses system RAM. But on thinking about it, that's
> probably the only efficient way for the hardware to be designed.

This is not a hybrid GPU system. There is no dGPU at all in the
intended use case. There is only an iGPU and a virtual device (no
hardware, no acceleration, no passthrough, nothing - just a virtual
KMS device shoveling framebuffers back to userspace, to a daemon
that compresses them for USB transmission).

I'm not aware of any dGPU that did not come with VRAM of its own. A
dGPU can access system RAM in some ways, but it's obviously slower
than accessing on-card VRAM. Displaying directly from system RAM on
a dGPU is usually not a thing, or at least not preferred, due to the
bus load it causes.

> > It is accomplished through the kernel dmabuf framework where
> > drivers export and import dmabuf.
> 
> Right, makes sense.
> 
> So I wonder how I should reason about a system that's configured
> with 2x of the same discrete graphics card (AMD, if it matters).
> The compositor would arbitrarily pick whichever of those happened
> to enumerate first as the primary, and then it's down to the
> driver details as to which of the four migration paths gets
> chosen? For the moment, let's assume that none of the stock
> applications is bothering to use any sort of advanced dmabuf
> hinting to pick the right GPU node to correspond to the output on
> which it will eventually display.

Until now, we have been talking exclusively about the display
server's own rendering, and how to display that on another GPU.
There is essentially just one rendering device (the one the display
server uses for composition), and multiple sink devices. Some of
the sinks may be simple (KMS only), some may be complex (a GPU that
can actually blit, plus KMS).
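
As an aside that was not part of the original discussion: one rough
way to tell the two kinds of sinks apart is to check whether a DRM
device exposes a render node at all. A minimal sketch using libdrm
(built with "cc probe.c $(pkg-config --cflags --libs libdrm)"):

#include <stdio.h>
#include <xf86drm.h>

int main(void)
{
    drmDevicePtr devices[64];
    int n = drmGetDevices2(0, devices, 64);

    if (n < 0)
        return 1;

    for (int i = 0; i < n; i++) {
        drmDevicePtr d = devices[i];
        int has_kms = d->available_nodes & (1 << DRM_NODE_PRIMARY);
        int has_render = d->available_nodes & (1 << DRM_NODE_RENDER);

        /* Only a primary (KMS) node: a plain display sink.
         * A render node suggests the device can also render/blit. */
        printf("%s: %s\n",
               has_kms ? d->nodes[DRM_NODE_PRIMARY] : "(no KMS node)",
               has_render ? "render node present" : "KMS only");
    }

    drmFreeDevices(devices, n);
    return 0;
}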

If you start thinking about applications as well, you add a new
dimension to the problem. An application might use any
acceleration-capable GPU (a real GPU, not what Mutter calls a "GPU"
in the code), which may or may not be the same device where the
resulting image should be composited or displayed directly. That
opens up a lot of
combinations, and the most complicated one is this:

- display server composites on GPU1
- display server has outputs on GPU1 and GPU2
- application renders on GPU2
- application is fullscreen on an output on GPU2

Ideally the image rendered on GPU2 will be displayed directly on
GPU2. That may not actually happen, because the display server must
always be able to composite the application window as well, and
composition happens on GPU1. So it's easy to end up with a
situation where:
- application renders on GPU2
- display server migrates the window image to GPU1, so it could be
  composited when necessary
- the image is on GPU1 now, so direct scanout on GPU2 is not
  possible
- the display server composites on GPU1
- the display server uses GPU2 to copy the composition to GPU2
- GPU2 displays

Obviously a display server could and should be smarter than that,
but that takes engineering. This used to be a problem with Mutter,
especially with external GPUs where the bus is even slower, and I
don't know if it has been fixed yet.

We also did not even discuss the different ways to migrate an image
between two GPUs. Maybe some cross-access is possible so the buffer
can live in GPU2 all the time.
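
To make that concrete, here is a rough, illustrative sketch (this is
not Mutter code; the device paths and the dumb buffer standing in
for a real rendered frame are assumptions for the example) of
sharing one buffer between two DRM devices through PRIME/dmabuf, and
then asking whether the importing device could scan it out directly:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

int main(void)
{
    /* Assumed roles: card0 exports (GPU1), card1 imports (GPU2). */
    int gpu1 = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
    int gpu2 = open("/dev/dri/card1", O_RDWR | O_CLOEXEC);
    if (gpu1 < 0 || gpu2 < 0)
        return 1;

    /* Stand-in for a rendered frame: 640x480 XRGB8888 dumb buffer on GPU1. */
    struct drm_mode_create_dumb create = {
        .height = 480, .width = 640, .bpp = 32,
    };
    if (drmIoctl(gpu1, DRM_IOCTL_MODE_CREATE_DUMB, &create) < 0)
        return 1;

    /* Export the GEM handle from GPU1 as a dmabuf file descriptor... */
    int dmabuf_fd;
    if (drmPrimeHandleToFD(gpu1, create.handle, DRM_CLOEXEC, &dmabuf_fd) < 0)
        return 1;

    /* ...and import it on GPU2, which gets its own handle to the memory. */
    uint32_t handle2;
    if (drmPrimeFDToHandle(gpu2, dmabuf_fd, &handle2) < 0) {
        fprintf(stderr, "GPU2 cannot import this buffer\n");
        return 1;
    }

    /* Could GPU2 scan this buffer out directly? Try wrapping it in a KMS
     * framebuffer; if that fails, some copy path is needed instead. */
    uint32_t handles[4] = { handle2 };
    uint32_t pitches[4] = { create.pitch };
    uint32_t offsets[4] = { 0 };
    uint32_t fb_id;
    int ret = drmModeAddFB2(gpu2, 640, 480, DRM_FORMAT_XRGB8888,
                            handles, pitches, offsets, &fb_id, 0);
    printf("direct scanout on GPU2: %s\n", ret == 0 ? "looks possible" : "no");

    if (ret == 0)
        drmModeRmFB(gpu2, fb_id);
    close(dmabuf_fd);
    close(gpu1);
    close(gpu2);
    return 0;
}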

What is possible and how well it performs depends on the exact
hardware combination, drivers, and how smart the display server is.
And of course applications have an impact via choosing the GPU to
use.

In Mutter's case, I suspect it still picks the primary GPU based on
which one was "boot VGA", if the concept exists on the system.
That's what UEFI or BIOS initializes as the first display. Maybe
it's configurable nowadays, I don't know.
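
FWIW, the "boot VGA" flag is visible in sysfs as a standard PCI
attribute, so the heuristic itself is easy to reproduce. A tiny
sketch (again, not Mutter's actual code):

#include <glob.h>
#include <stdio.h>

int main(void)
{
    glob_t g;

    /* boot_vga reads "1" on the PCI display device the firmware
     * initialized as the first display. */
    if (glob("/sys/class/drm/card*/device/boot_vga", 0, NULL, &g) != 0)
        return 1;

    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (!f)
            continue;
        if (fgetc(f) == '1')
            printf("boot VGA: %s\n", g.gl_pathv[i]);
        fclose(f);
    }

    globfree(&g);
    return 0;
}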

I believe display server developers know quite well that device node
numbering is not reliable and can change on reboot. I would hope
they use device path instead, if they allow configuring the primary
GPU choice. A device path addresses devices based on e.g. which
PCIe slot they are plugged into. (For USB, even the device path is
not reliable.)
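
As an illustration (assuming a udev-managed system, where
/dev/dri/by-path is populated), the stable bus-based names and the
device nodes they currently resolve to can be listed like this:

#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *dir = "/dev/dri/by-path";
    DIR *d = opendir(dir);
    struct dirent *e;

    if (!d) {
        perror(dir);
        return 1;
    }

    while ((e = readdir(d)) != NULL) {
        char link[PATH_MAX], target[PATH_MAX];
        ssize_t len;

        if (e->d_name[0] == '.')
            continue;

        snprintf(link, sizeof link, "%s/%s", dir, e->d_name);
        len = readlink(link, target, sizeof target - 1);
        if (len < 0)
            continue;
        target[len] = '\0';

        /* e.g. "pci-0000:00:02.0-card -> ../card0" */
        printf("%s -> %s\n", e->d_name, target);
    }

    closedir(d);
    return 0;
}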

The new feedback additions to the dmabuf Wayland extension should
make applications' lives easier too, but even the old wl_drm Mesa
extension told applications which device the compositor is using.
That would be the default GPU choice for apps.

I'm not sure if I answered your question, but obviously it's a
complicated topic.


Thanks,
pq