[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Tue Apr 20 12:42:26 UTC 2021

Hi Marek,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo at gmail.com> wrote:

> *2. Explicit synchronization for window systems and modesetting*
>
> The producer is an application and the consumer is a compositor or a
> modesetting driver.
>
> *2.1. The Present request*
>

So the 'present request' is an ioctl, right? Not a userspace construct like
it is today? If so, how do we correlate the two?

The terminology is pretty X11-centric so I'll assume that's what you've
designed against, but Wayland and even X11 carry much more auxiliary
information attached to a present request than just 'this buffer, this
swapchain'. Wayland latches a lot of data on presentation, including
non-graphics data such as surface geometry (so we can have resizes which
don't suck), window state (e.g. fullscreen or not, also so we can have
resizes which don't suck), and these requests can also cascade through a
tree of subsurfaces (so we can have embeds which don't suck). X11 mostly
just carries timestamps, which is more tractable.

Given we don't want to move the entirety of Wayland into kernel-visible
objects, how do we synchronise the two streams so they aren't incoherent?
Taking a rough stab at it whilst assuming we do have
DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in
kernel space, which the producer would create and ?? export a FD from, that
the compositor would ?? import.

> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
>

We already have this in Wayland through dma_fence. I'm relaxed about
this becoming drm_syncobj or drm_newmappedysncobjthing, it's just a matter
of typing. X11 has patches to DRI3 to support dma_fence, but they never got
merged because it was far too invasive to a server which is no longer
maintained.

> - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>

Currently in Wayland the return fence (again a dma_fence) is generated by
the compositor and sent as an event when it's done, because we can't have
speculative/empty/future fences. drm_syncobj would make this possible, but
so far I've been hesitant because I don't see the benefit to it (more
below).

> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>

Same as today with dma_fence. Less true with drm_syncobj if we're using
timelines.

> - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
>

This is only a change if the producer is now allowed to submit a fence
before it's flushed the work which would eventually fulfill that fence.
Using dma_fence has so far isolated us from this.

> - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
>

'The consumer' is problematic, per below. I think the wording you want is
'if no references are held to the submitted present object'.

> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.
>
> Other window system requests can follow the same idea.
>

Which other window system requests did you have in mind? Again, moving the
entirety of Wayland's signaling into the kernel is a total non-starter.
Partly because it means our entire protocol would be subject to the
kernel's ABI rules, partly because the rules and interdependencies between
the requests are extremely complex, but mostly because the kernel is just a
useless proxy: it would be forced to do significant work to reason about
what those requests do and when they should happen, but wouldn't be able to
make those decisions itself so would have to just punt everything to
userspace. Unless we have eBPF compositors.

> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
>

An elaboration of how this differs from drm_syncobj would be really
helpful here. I can make some guesses based on the rest of the mail, but
I'm not sure how accurate they are.
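
For reference while comparing, the merged-fence semantics quoted above
(signalled only once every constituent fence has signalled) can be
modelled in a few lines. This is a toy Python model, not a kernel or
drm_syncobj API; `Fence` and `MergedFence` are hypothetical names made up
for illustration.

```python
class Fence:
    """Toy stand-in for a single fence; not a kernel object."""

    def __init__(self):
        self.signaled = False

    def signal(self):
        self.signaled = True


class MergedFence:
    """Signalled only when every constituent fence has signalled."""

    def __init__(self, *fences):
        self.fences = list(fences)

    @property
    def signaled(self):
        return all(f.signaled for f in self.fences)


# Hypothetical case: two consumers of the same presented buffer.
gpu_done, kms_done = Fence(), Fence()
return_fence = MergedFence(gpu_done, kms_done)

gpu_done.signal()
still_busy = not return_fence.signaled  # the second consumer is not done
kms_done.signal()
released = return_fence.signaled        # now the producer may reuse it
```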

> *2.2. Modesetting*
>
> Since a modesetting driver can also be the consumer, the present ioctl
> will contain a submit fence and a return fence too. One small problem with
> this is that userspace can hang the modesetting driver, but in theory, any
> later present ioctl can override the previous one, so the unsignalled
> presentation is never used.
>

This is also problematic. It's not just KMS, but media codecs too - V4L
doesn't yet have explicit fencing, but given the programming model of
codecs and how deeply they interoperate, it will.

Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine
you're playing a Steam game on your Chromebook which you're streaming via
Twitch or whatever. The full chain looks like:
* Steam game renders with GPU
* Xwayland in container receives dmabuf, forwards dmabuf to Wayland server
(does not directly consume)
* Wayland server (which is actually Chromium) receives dmabuf, forwards
dmabuf to Chromium UI process
* Chromium UI process forwards client dmabuf to KMS for direct scanout
* Chromium UI process _also_ forwards client dmabuf to GPU process
* Chromium GPU process composites Chromium UI + client dmabuf + webcam
frame from V4L to GPU composition job
* Chromium GPU process forwards GPU composition dmabuf (not client dmabuf)
to media codec for streaming

So, we don't have a 1:1 producer:consumer relationship. Even if we accept
it's 1:n, your Chromebook is about to burst into flames and we're dropping
frames to try to keep up. Some of the consumers are FIFO (the codec wants
to push things through in order), and some of them are mailbox (the display
wants to get the latest content, not from half a second ago before the
other player started jumping around and now you're dead). You can't reason
about any of these dependencies ahead of time from a producer PoV, because
userspace will be making these decisions frame by frame. Also someone's
started using the Vulkan present-timing extension because life wasn't
confusing enough already.
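
The FIFO/mailbox distinction can be sketched as two consumers fed the same
frames. This is a toy Python model, assuming frames are opaque labels;
`FifoConsumer` and `MailboxConsumer` are hypothetical names, not any real
compositor or codec interface.

```python
from collections import deque


class FifoConsumer:
    """Consumes every frame in submission order (e.g. a codec)."""

    def __init__(self):
        self.queue = deque()

    def submit(self, frame):
        self.queue.append(frame)

    def take(self):
        return self.queue.popleft() if self.queue else None


class MailboxConsumer:
    """Keeps only the newest frame (e.g. a display)."""

    def __init__(self):
        self.latest = None
        self.dropped = []

    def submit(self, frame):
        if self.latest is not None:
            self.dropped.append(self.latest)  # superseded before use
        self.latest = frame

    def take(self):
        frame, self.latest = self.latest, None
        return frame


codec, display = FifoConsumer(), MailboxConsumer()
for frame in ("f1", "f2", "f3"):  # three frames arrive before one refresh
    codec.submit(frame)
    display.submit(frame)

codec_frame = codec.take()      # oldest frame first
display_frame = display.take()  # newest frame only
```

In this model the display never reads f1 or f2 while the codec still has
to read all three, and which frames end up dropped is only known once the
consumer runs.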

As Christian and Daniel were getting at, there are also two 'levels' of
explicit synchronisation.

The first (let's call it 'blind') is plumbing a dma_fence through to be
passed with the dmabuf. When the client submits a buffer for presentation,
it submits a dma_fence as well. When the compositor is finished with it
(i.e. has flushed the last work which will source from that buffer), it
passes a dma_fence back to the client, or no fence if required (buffer was
never accessed, or all accesses are known to be fully retired e.g. the last
fence accessing it has already signaled). This is just a matter of typing,
and is supported by at least Weston. It implies no scheduling change over
implicit fencing in that the compositor can be held hostage by abusive
clients with a really long compute shader in their dependency chain: all
that's happening is that we're plumbing those synchronisation tokens
through userspace instead of having the kernel dig them up from dma_resv.
But we at least have a no-deadlock guarantee, because a dma_fence will
complete in bounded time.
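
As a rough illustration of the release side of 'blind', the sketch below
treats a fence as a pollable file descriptor, which is how a dma_fence
looks to userspace once exported as a sync_file (the fd polls readable
once the fence has signalled). A pipe stands in for the sync_file so the
example runs without a GPU; `make_fake_fence`, `fence_signaled` and
`release_fence_for` are hypothetical helpers, not Weston or kernel APIs.

```python
import os
import select


def make_fake_fence():
    """Pipe standing in for a sync_file: returns (fence_fd, signal_fn)."""
    rfd, wfd = os.pipe()
    return rfd, lambda: os.write(wfd, b"\0")


def fence_signaled(fd, timeout_ms=0):
    """True if the fence fd polls readable within timeout_ms."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return bool(poller.poll(timeout_ms))


def release_fence_for(last_access_fd):
    """Fence to hand back to the client, or None if nothing is pending."""
    if last_access_fd is None:          # buffer was never accessed
        return None
    if fence_signaled(last_access_fd):  # all accesses already retired
        return None
    return last_access_fd


composite_fd, composite_done = make_fake_fence()
pending = release_fence_for(composite_fd)  # work still reads the buffer
composite_done()
retired = release_fence_for(composite_fd)  # nothing left to wait for
```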

The second (let's call it 'smart') is ... much more than that. Not only
does the compositor accept and generate explicit synchronisation points for
the client, but those synchronisation points aren't dma_fences, but may be
wait-before-signal, or may be wait-never-signal. So in order to avoid a
terminal deadlock, the compositor has to sit on every synchronisation point
and check before it flushes any dependent work that it has signaled, or
will at least signal in bounded time. If that guarantee isn't there, you
have to punt and see if anything happens at your next repaint point. We
don't currently have this support in any compositor, and it's a lot more
work than blind.
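
The check-before-flush rule can be sketched as a repaint loop that only
latches a commit whose sync point has already signalled, and otherwise
keeps showing the previous content until a later repaint. This is a toy
Python model, assuming a sync point may signal late or never;
`SyncPoint`, `Surface` and `repaint` are hypothetical names, not taken
from any compositor.

```python
class SyncPoint:
    """Toy sync point that may signal late, or never."""

    def __init__(self):
        self.signaled = False


class Surface:
    def __init__(self, name, current):
        self.name = name
        self.current = current  # content that is safe to sample from now
        self.pending = None     # (content, SyncPoint) waiting to be latched

    def commit(self, content, sync_point):
        self.pending = (content, sync_point)


def repaint(surfaces):
    """Latch only commits whose sync point has signalled; punt the rest."""
    frame = {}
    for s in surfaces:
        if s.pending is not None and s.pending[1].signaled:
            s.current, s.pending = s.pending[0], None
        frame[s.name] = s.current
    return frame


game = Surface("game", "frame 1")
ready = SyncPoint()
game.commit("frame 2", ready)

first = repaint([game])   # not signalled yet: keep showing frame 1
ready.signaled = True
second = repaint([game])  # signalled: frame 2 is latched
```

Even this toy version has to hold both the current and the pending state
for every surface across repaints.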

Given the interdependencies I've described above for Wayland - say a resize
case, or when a surface commit triggers a cascade of subsurface commits -
GPU-side conditional rendering is not always possible. In those cases, you
_must_ do CPU-side waits and keep both sets of state around. Pain.

Typing all that out has convinced me that the current proposal is a net
loss in every case.

Complex rendering uses (game engine with a billion draw calls, a billion
BOs, complex sync dependencies, wait-before-signal and/or conditional
rendering/descriptor indexing) don't need the complexity of a present ioctl
and checking whether other processes have crashed or whatever. They already
have everything plumbed through for this themselves, and need to implement
so much infrastructure around it that they don't need much/any help from
the kernel. Just give them a sync primitive with almost zero guarantees
that they can map into CPU & GPU address space, and let them go wild with it.
drm_syncobj_plus_footgun. Good luck.

Simple presentation uses (desktop, browser, game) don't need the
hyperoptimisation of sync primitives. Frame times are relatively long, and
you can only have so many surfaces which aren't occluded. Either you have a
complex scene to composite, in which case the CPU overhead of something
like dma_fence is lower than the CPU overhead required to walk through a
single compositor repaint cycle anyway, or you have a completely trivial
scene to composite and you can absolutely eat the overhead of exporting and
scheduling like two fences in 10ms.

Complex presentation uses (out-streaming, media sources, deeper
presentation chains) make the trivial present ioctl so complex that its
benefits evaporate. Wait-before-signal pushes so much complexity into the
compositor that you have to eat a lot of CPU overhead there and lose your
ability to do pipelined draws because you have to hang around and see if
they'll ever complete. Cross-device usage means everyone just ends up
spinning on the CPU instead.

So, can we take a step back? What are the problems we're trying to solve?
If it's about optimising the game engine's internal rendering, how would
that benefit from a present ioctl instead of current synchronisation?

If it's about composition, how do we balance the complexity between the
kernel and userspace? What's the global benefit from throwing our hands in
the air and saying 'you deal with it' to all of userspace, given that
existing mailbox systems making frame-by-frame decisions already preclude
deep/speculative pipelining on the client side?

Given that userspace then loses all ability to reason about presentation if
wait-before-signal becomes a thing, do we end up with a global performance
loss by replacing the overhead of kernel dma_fence handling with userspace
spinning on a page? Even if we micro-optimise that by allowing userspace to
be notified on access, is the overhead of pagefault -> kernel signal
handler -> queue signalfd notification -> userspace event loop -> read page
& compare to expected value, actually better than dma_fence?
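
For concreteness, the last step of that chain (read the page and compare
to an expected value) amounts to something like the following. This is a
minimal Python sketch, assuming the sync primitive is a monotonically
increasing 64-bit counter in mapped memory; the anonymous mapping stands
in for memory shared with the GPU, and `signal_value` and `wait_value`
are hypothetical names.

```python
import mmap
import struct
import time

# Anonymous page standing in for a mapping shared between CPU and GPU.
page = mmap.mmap(-1, mmap.PAGESIZE)


def signal_value(value):
    """Producer side: publish a new counter value at offset 0."""
    struct.pack_into("<Q", page, 0, value)


def wait_value(expected, timeout_s):
    """Consumer side: spin until the counter reaches `expected`.

    Returns False on timeout, since nothing guarantees a signal arrives.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        (value,) = struct.unpack_from("<Q", page, 0)
        if value >= expected:
            return True
        if time.monotonic() >= deadline:
            return False


timed_out = not wait_value(1, 0.01)  # nobody has signalled yet
signal_value(1)
completed = wait_value(1, 0.01)
```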

Cheers,
Daniel