[RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
Jason Ekstrand
jason at jlekstrand.net
Tue Apr 20 15:45:53 UTC 2021
It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...
On Tue, Apr 20, 2021 at 7:43 AM Daniel Stone <daniel at fooishbar.org> wrote:
>
> Hi Marek,
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo at gmail.com> wrote:
>>
>> 2. Explicit synchronization for window systems and modesetting
>>
>> The producer is an application and the consumer is a compositor or a modesetting driver.
>>
>> 2.1. The Present request
>
>
> So the 'present request' is an ioctl, right? Not a userspace construct like it is today? If so, how do we correlate the two?
>
> The terminology is pretty X11-centric so I'll assume that's what you've designed against, but Wayland and even X11 carry much more auxiliary information attached to a present request than just 'this buffer, this swapchain'. Wayland latches a lot of data on presentation, including non-graphics data such as surface geometry (so we can have resizes which don't suck), window state (e.g. fullscreen or not, also so we can have resizes which don't suck), and these requests can also cascade through a tree of subsurfaces (so we can have embeds which don't suck). X11 mostly just carries timestamps, which is more tractable.
>
> Given we don't want to move the entirety of Wayland into kernel-visible objects, how do we synchronise the two streams so they aren't incoherent? Taking a rough stab at it whilst assuming we do have DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in kernel space, which the producer would create and ?? export a FD from, that the compositor would ?? import.
>
>> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
>
>
> We already have this in Wayland through dma_fence. I'm relaxed about this becoming drm_syncobj or drm_newmappedysncobjthing, it's just a matter of typing. X11 has patches to DRI3 to support dma_fence, but they never got merged because it was far too invasive to a server which is no longer maintained.
>
>>
>> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
>
>
> Currently in Wayland the return fence (again a dma_fence) is generated by the compositor and sent as an event when it's done, because we can't have speculative/empty/future fences. drm_syncobj would make this possible, but so far I've been hesitant because I don't see the benefit to it (more below).
>
>>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
>
>
> Same as today with dma_fence. Less true with drm_syncobj if we're using timelines.
>
>>
>> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
>
>
> This is only a change if the producer is now allowed to submit a fence before it's flushed the work which would eventually fulfill that fence. Using dma_fence has so far isolated us from this.
>
>>
>> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
>
>
> 'The consumer' is problematic, per below. I think the wording you want is 'if no references are held to the submitted present object'.
>
>>
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
>>
>> Other window system requests can follow the same idea.
>
>
> Which other window system requests did you have in mind? Again, moving the entirety of Wayland's signaling into the kernel is a total non-starter. Partly because it means our entire protocol would be subject to the kernel's ABI rules, partly because the rules and interdependencies between the requests are extremely complex, but mostly because the kernel is just a useless proxy: it would be forced to do significant work to reason about what those requests do and when they should happen, but wouldn't be able to make those decisions itself so would have to just punt everything to userspace. Unless we have eBPF compositors.
>
>>
>> Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
>
>
> An elaboration of how this differed from drm_syncobj would be really helpful here. I can make some guesses based on the rest of the mail, but I'm not sure how accurate they are.
>
>>
>> 2.2. Modesetting
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
>
>
> This is also problematic. It's not just KMS, but media codecs too - V4L doesn't yet have explicit fencing, but given the programming model of codecs and how deeply they interoperate, it will.
>
> Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine you're playing a Steam game on your Chromebook which you're streaming via Twitch or whatever. The full chain looks like:
> * Steam game renders with GPU
> * Xwayland in container receives dmabuf, forwards dmabuf to Wayland server (does not directly consume)
> * Wayland server (which is actually Chromium) receives dmabuf, forwards dmabuf to Chromium UI process
> * Chromium UI process forwards client dmabuf to KMS for direct scanout
> * Chromium UI process _also_ forwards client dmabuf to GPU process
> * Chromium GPU process composites Chromium UI + client dmabuf + webcam frame from V4L to GPU composition job
> * Chromium GPU process forwards GPU composition dmabuf (not client dmabuf) to media codec for streaming
>
> So, we don't have a 1:1 producer:consumer relationship. Even if we accept it's 1:n, your Chromebook is about to burst into flames and we're dropping frames to try to keep up. Some of the consumers are FIFO (the codec wants to push things through in order), and some of them are mailbox (the display wants to get the latest content, not from half a second ago before the other player started jumping around and now you're dead). You can't reason about any of these dependencies ahead of time from a producer PoV, because userspace will be making these decisions frame by frame. Also someone's started using the Vulkan present-timing extension because life wasn't confusing enough already.
>
> As Christian and Daniel were getting at, there are also two 'levels' of explicit synchronisation.
>
> The first (let's call it 'blind') is plumbing a dma_fence through to be passed with the dmabuf. When the client submits a buffer for presentation, it submits a dma_fence as well. When the compositor is finished with it (i.e. has flushed the last work which will source from that buffer), it passes a dma_fence back to the client, or no fence if required (buffer was never accessed, or all accesses are known to be fully retired e.g. the last fence accessing it has already signaled). This is just a matter of typing, and is supported by at least Weston. It implies no scheduling change over implicit fencing in that the compositor can be held hostage by abusive clients with a really long compute shader in their dependency chain: all that's happening is that we're plumbing those synchronisation tokens through userspace instead of having the kernel dig them up from dma_resv. But we at least have a no-deadlock guarantee, because a dma_fence will complete in bounded time.
>
> The second (let's call it 'smart') is ... much more than that. Not only does the compositor accept and generate explicit synchronisation points for the client, but those synchronisation points aren't dma_fences, but may be wait-before-signal, or may be wait-never-signal. So in order to avoid a terminal deadlock, the compositor has to sit on every synchronisation point and check before it flushes any dependent work that it has signaled, or will at least signal in bounded time. If that guarantee isn't there, you have to punt and see if anything happens at your next repaint point. We don't currently have this support in any compositor, and it's a lot more work than blind.
>
> Given the interdependencies I've described above for Wayland - say a resize case, or when a surface commit triggers a cascade of subsurface commits - GPU-side conditional rendering is not always possible. In those cases, you _must_ do CPU-side waits and keep both sets of state around. Pain.
>
> Typing all that out has convinced me that the current proposal is a net loss in every case.
>
> Complex rendering uses (game engine with a billion draw calls, a billion BOs, complex sync dependencies, wait-before-signal and/or conditional rendering/descriptor indexing) don't need the complexity of a present ioctl and checking whether other processes have crashed or whatever. They already have everything plumbed through for this themselves, and need to implement so much infrastructure around it that they don't need much/any help from the kernel. Just give them a sync primitive with almost zero guarantees that they can map into CPU & GPU address space, let them go wild with it. drm_syncobj_plus_footgun. Good luck.
>
> Simple presentation uses (desktop, browser, game) don't need the hyperoptimisation of sync primitives. Frame times are relatively long, and you can only have so many surfaces which aren't occluded. Either you have a complex scene to composite, in which case the CPU overhead of something like dma_fence is lower than the CPU overhead required to walk through a single compositor repaint cycle anyway, or you have a completely trivial scene to composite and you can absolutely eat the overhead of exporting and scheduling like two fences in 10ms.
>
> Complex presentation uses (out-streaming, media sources, deeper presentation chains) make the trivial present ioctl so complex that its benefits evaporate. Wait-before-signal pushes so much complexity into the compositor that you have to eat a lot of CPU overhead there and lose your ability to do pipelined draws because you have to hang around and see if they'll ever complete. Cross-device usage means everyone just ends up spinning on the CPU instead.
>
> So, can we take a step back? What are the problems we're trying to solve? If it's about optimising the game engine's internal rendering, how would that benefit from a present ioctl instead of current synchronisation?
IMO, there are two problems being solved here which are related in
very subtle and tricky ways. They're also, admittedly, driver
problems, not really winsys problems. Unfortunately, they may have
winsys implications.
The first is better/real timelines for Vulkan and compute. With
VK_KHR_timeline_semaphore, we introduced the timeline programming
model to Vulkan. This is a massively better programming model for
complex rendering apps which want to be doing all sorts of crazy. It
comes with all the fun toys including wait-before-signal and no
timeouts on any particular time points (a single command buffer may
still time out). Unfortunately, the current implementation involves a
lot of driver complexity, both in user space and kernel space. The
"ideal" implementation for timelines (which is what Win10 does) is to
have a trivial implementation where each timeline is a 64-bit integer
living somewhere, clients signal whatever value they want, and you
just throw the whole mess at the wall and hope the scheduler sorts it
out. I'm going to call these "memory fences" rather than "userspace
fences" because they could, in theory, be hidden entirely inside the
kernel.
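
To make the "memory fence" idea a bit more concrete, here is a rough
sketch in Python. It is purely illustrative: the MemoryFence class and
its methods are made-up names, the condition variable stands in for
whatever the scheduler would really do, and the strictly-increasing
rule is borrowed from Vulkan timeline semaphores rather than from any
kernel interface. The point is only that a timeline is a 64-bit
integer and that a wait may be queued before the matching signal
exists.

```python
import threading


class MemoryFence:
    """Toy timeline: a single 64-bit counter (hypothetical name)."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    @property
    def value(self):
        with self._cond:
            return self._value

    def signal(self, value):
        # Assumption: values only move forward and fit in 64 bits.
        with self._cond:
            if not (self._value < value < 1 << 64):
                raise ValueError("timeline value must increase")
            self._value = value
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        # Wait-before-signal: nothing requires that anyone has
        # promised to signal `value` yet. With timeout=None this can
        # block forever, which is exactly what dma-fence forbids.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._value >= value, timeout)
```

A waiter can call fence.wait(2) on a fresh fence before any thread
has called fence.signal(2); it returns True once the counter reaches
2, or False if a timeout was given and expired first.
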
We also want something like this for compute workloads. Not only
because Vulkan and level0 provide this as part of their core API but
because compute very much doesn't want dma-fence guarantees. You can,
in theory, have a compute kernel sitting there running for hours and
it should be ok assuming your scheduler can preempt and time-slice it
with other stuff. This means that we can't ever have a long-running
compute batch which triggers a dma-fence. We have to be able to
trigger SOMETHING at the ends of those batches. What do we use? TBD
but memory fences are the current proposal.
The second biting issue is that, in the current kernel implementation
of dma-fence and dma_resv, we've lumped internal synchronization for
memory management together with execution synchronization for
userspace dependency tracking. And we have no way to tell the
difference between the two internally. Even if user space is passing
around sync_files and trying to do explicit sync, once you get inside
the kernel, they're all dma-fences and it can't tell the difference.
If we move to a more userspace-controlled synchronization model with
wait-before-signal and no timeouts unless requested, regardless of the
implementation, it plays really badly with dma-fence. And, by "badly" I
mean the two are nearly incompatible. From a user space PoV, it means
it's tricky to provide the finite time dma-fence guarantee. From a
kernel PoV, it's way worse. Currently, the way dma-fence is
constructed, it's impossible to deadlock assuming everyone follows the
rules. The moment we allow user space to deadlock itself and allow
those deadlocks to leak into the kernel, we have a problem. Even if
we throw in some timeouts, we still have a scenario where user space
has one linearizable dependency graph for execution synchronization
and the kernel has a different linearizable dependency graph for
memory management and, when you smash them together, you may have
cycles in your graph.
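
A toy illustration of that last point, in Python. Everything here is
hypothetical: job_A and job_B are made-up names, and the two edge
lists are just one way the situation could arise (user space has B
wait on A, while memory management can only make room for A once B
has finished with a buffer). Each graph is acyclic on its own; the
union is not.

```python
def has_cycle(edges):
    """Depth-first cycle check over (waiter, signaler) pairs."""
    graph = {}
    for waiter, signaler in edges:
        graph.setdefault(waiter, []).append(signaler)
        graph.setdefault(signaler, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on stack, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for dep in graph[node]:
            if color[dep] == GREY:
                return True  # back edge: dep is still on the stack
            if color[dep] == WHITE and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Execution sync chosen by user space: B waits on A.
userspace_deps = [("job_B", "job_A")]
# Memory management inside the kernel: A waits on B.
kernel_deps = [("job_A", "job_B")]
```

has_cycle(userspace_deps) and has_cycle(kernel_deps) are both False,
but has_cycle(userspace_deps + kernel_deps) is True, and neither side
could have seen that by looking at its own graph.
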
So how do we sort this all out? Good question. It's a hard problem.
Probably the hardest problem here is the second one: the intermixing
of synchronization types. Solving that one is likely going to require
some user space re-plumbing because all the user space APIs we have
for explicit sync are built on dma-fence.
--Jason
> If it's about composition, how do we balance the complexity between the kernel and userspace? What's the global benefit from throwing our hands in the air and saying 'you deal with it' to all of userspace, given that existing mailbox systems making frame-by-frame decisions already preclude deep/speculative pipelining on the client side?
>
> Given that userspace then loses all ability to reason about presentation if wait-before-signal becomes a thing, do we end up with a global performance loss by replacing the overhead of kernel dma_fence handling with userspace spinning on a page? Even if we micro-optimise that by allowing userspace to be notified on access, is the overhead of pagefault -> kernel signal handler -> queue signalfd notification -> userspace event loop -> read page & compare to expected value, actually better than dma_fence?
>
> Cheers,
> Daniel