[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Jason Ekstrand jason at jlekstrand.net
Tue Apr 20 16:24:58 UTC 2021


On Tue, Apr 20, 2021 at 9:10 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>
> On Tue, Apr 20, 2021 at 1:59 PM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
> >
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem.
> >
> > Well, the path of inner peace begins with four words. “Not my fucking
> > problem.”
> >
> > But I'm not that concerned about the kernel; rather, about important
> > userspace processes like X, Wayland, SurfaceFlinger, etc.
> >
> > I mean attaching a page to a sync object and allowing waiting/signalling
> > from both the CPU and the GPU side is not so much of a problem.
> >
> > > You have to somehow handle that, e.g. perhaps with conditional
> > > rendering and just using the old frame in compositing if the new one
> > > doesn't show up in time.
> >
> > Nice idea, but how would you handle that at the OpenGL/Glamor/Vulkan level?
>
> For OpenGL we keep all the same guarantees, so if you get one of these
> you just block until the fence is signalled. Doing that properly means a
> submit thread to support drm_syncobj, like for Vulkan.
>
> For Vulkan we probably want to represent these as proper vk timeline
> objects, and the Vulkan way is to just let the application (well,
> compositor) deal with it here. If they import timelines from untrusted
> other parties, they need to handle the fallback of potentially being
> lied to. How they do that is "not Vulkan's fucking problem", because
> "with great power (well, performance) comes great responsibility" is
> the entire vk design paradigm.

The security aspects are currently an unsolved problem in Vulkan.  The
assumption is that everyone trusts everyone else to be careful with
the scissors.  It's a great model!

I think we can do something in Vulkan to allow apps to protect
themselves a bit, but it's tricky and non-obvious.
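
To make that a bit more concrete, here is a rough sketch (not a
worked-out proposal) of the kind of self-protection a compositor can
already do with core Vulkan 1.2 timeline semaphores: wait on the
imported timeline point with a timeout and fall back to the previous
frame if the client never delivers. The helper name and the 16 ms
budget are made up for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Sketch only: wait for the client's timeline point with a timeout
     * instead of trusting it unconditionally. */
    static bool client_frame_ready(VkDevice device, VkSemaphore client_sem,
                                   uint64_t point)
    {
        const VkSemaphoreWaitInfo wait_info = {
            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
            .semaphoreCount = 1,
            .pSemaphores = &client_sem,
            .pValues = &point,
        };

        /* ~16 ms budget; VK_TIMEOUT means "keep showing the old frame". */
        return vkWaitSemaphores(device, &wait_info,
                                16ull * 1000 * 1000) == VK_SUCCESS;
    }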

--Jason


> Glamor will just rely on GL providing a nice packaging of the harsh
> reality of GPUs, as usual.
>
> So I guess step 1 here for GL would be to provide some kind of
> import/export of timeline syncobj, including properly handling this
> "future/indefinite fences" aspect of them with a submit thread and
> everything.
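
For reference, the kernel side of that plumbing already exists; what's
missing is the GL-level import/export. A minimal sketch of what such a
submit thread would sit on top of, using today's libdrm syncobj API
(the fd/handle names and the timeout are made up for illustration):

    #include <stdint.h>
    #include <xf86drm.h>

    /* Sketch: import a timeline syncobj exported as an fd by another
     * process and wait on a point that may not have been submitted yet. */
    static int wait_imported_point(int drm_fd, int syncobj_fd, uint64_t point)
    {
        uint32_t handle;
        int ret = drmSyncobjFDToHandle(drm_fd, syncobj_fd, &handle);
        if (ret)
            return ret;

        /* WAIT_FOR_SUBMIT is what covers the "future fence" aspect:
         * the wait blocks until the point materializes instead of
         * failing because nothing has been submitted yet. */
        return drmSyncobjTimelineWait(drm_fd, &handle, &point, 1,
                                      INT64_MAX, /* no timeout */
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
    }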
>
> -Daniel
>
> >
> > Regards,
> > Christian.
> >
> > On 20.04.21 at 13:16, Daniel Vetter wrote:
> > > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> > >> Daniel, are you suggesting that we should skip any deadlock prevention in
> > >> the kernel, and just let userspace wait for and signal any fence it has
> > >> access to?
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem. The only criteria is that the kernel itself must
> > > never rely on these userspace fences, except for stuff like implementing
> > > optimized cpu waits. And in those we must always guarantee that the
> > > userspace process remains interruptible.
> > >
> > > It's a completely different world from dma_fence based kernel fences,
> > > whether those are implicit or explicit.
> > >
> > >> Do you have any concern with the deprecation/removal of BO fences in the
> > >> kernel assuming userspace is only using explicit fences? Any concern with
> > >> the submit and return fences for modesetting and other producer<->consumer
> > >> scenarios?
> > > Let me work on the full reply to your RFC first, because there are a lot
> > > of details and nuance here.
> > > -Daniel
> > >
> > >> Thanks,
> > >> Marek
> > >>
> > >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > >>
> > >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> > >>> <ckoenig.leichtzumerken at gmail.com> wrote:
> > >>>> On 19.04.21 at 17:48, Jason Ekstrand wrote:
> > >>>>> Not going to comment on everything on the first pass...
> > >>>>>
> > >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo at gmail.com> wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> This is our initial proposal for explicit fences everywhere and new
> > >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> > >>> graphics drivers work, and it can coexist with what we have now.
> > >>>>>>
> > >>>>>> 1. Introduction
> > >>>>>> (skip this if you are already sold on explicit fences)
> > >>>>>>
> > >>>>>> The current Linux graphics architecture was initially designed for
> > >>> GPUs with only one graphics queue where everything was executed in the
> > >>> submission order and per-BO fences were used for memory management and
> > >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > >>> queues were added on top, which required the introduction of implicit
> > >>> GPU-GPU synchronization between queues of different processes using per-BO
> > >>> fences. Recently, even parallel execution within one queue was enabled
> > >>> where a command buffer starts draws and compute shaders, but doesn't wait
> > >>> for them, enabling parallelism between back-to-back command buffers.
> > >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> > >>> was created to enable all those use cases, and it's the only reason why the
> > >>> scheduler exists.
> > >>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory
> > >>> management, and the tracking of per-BO fences increase CPU overhead and
> > >>> latency, and reduce parallelism. There is a desire to replace all of them
> > >>> with something much simpler. Below is how we could do it.
> > >>>>>>
> > >>>>>> 2. Explicit synchronization for window systems and modesetting
> > >>>>>>
> > >>>>>> The producer is an application and the consumer is a compositor or a
> > >>> modesetting driver.
> > >>>>>> 2.1. The Present request
> > >>>>>>
> > >>>>>> As part of the Present request, the producer will pass 2 fences (sync
> > >>> objects) to the consumer alongside the presented DMABUF BO:
> > >>>>>> - The submit fence: Initially unsignalled, it will be signalled when
> > >>> the producer has finished drawing into the presented buffer.
> > >>>>>> - The return fence: Initially unsignalled, it will be signalled when
> > >>> the consumer has finished using the presented buffer.
> > >>>>> I'm not sure syncobj is what we want.  In the Intel world we're trying
> > >>>>> to go even further to something we're calling "userspace fences" which
> > >>>>> are a timeline implemented as a single 64-bit value in some
> > >>>>> CPU-mappable BO.  The client writes a higher value into the BO to
> > >>>>> signal the timeline.
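
To make that concrete, a rough sketch of the idea, with all names
invented for illustration (the real thing would live in a CPU-mappable
BO shared between producer and consumer):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* The "fence" is just a monotonically increasing 64-bit counter. */
    struct userspace_fence {
        _Atomic uint64_t value; /* lives in a CPU-mappable BO */
    };

    /* Signalling timeline point N means storing a value >= N. */
    static void fence_signal(struct userspace_fence *f, uint64_t point)
    {
        atomic_store_explicit(&f->value, point, memory_order_release);
    }

    /* CPU-side check; an actual wait would use the kernel helpers
     * mentioned below so it can sleep instead of spinning. */
    static bool fence_check(struct userspace_fence *f, uint64_t point)
    {
        return atomic_load_explicit(&f->value, memory_order_acquire) >= point;
    }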
> > >>>> Well, that is exactly what our Windows guys have suggested as well, but
> > >>>> it strongly looks like this isn't sufficient.
> > >>>>
> > >>>> First of all you run into security problems when any application can
> > >>>> just write any value to that memory location. Just imagine an
> > >>>> application sets the counter to zero and X waits forever for some
> > >>>> rendering to finish.
> > >>> The thing is, with userspace fences the security boundary problem
> > >>> moves into userspace entirely. And it really doesn't matter whether
> > >>> the event you're waiting on doesn't complete because the other app
> > >>> crashed or was stupid or intentionally gave you a wrong fence point:
> > >>> You have to somehow handle that, e.g. perhaps with conditional
> > >>> rendering and just using the old frame in compositing if the new one
> > >>> doesn't show up in time. Or something like that. So trying to get the
> > >>> kernel involved but also not so much involved sounds like a bad design
> > >>> to me.
> > >>>
> > >>>> In addition to that, in such a model you can't determine which queue
> > >>>> is guilty in case of a hang, and you can't reset the synchronization
> > >>>> primitives in case of an error.
> > >>>>
> > >>>> Apart from that this is rather inefficient, e.g. we don't have any way
> > >>>> to prevent priority inversion when used as a synchronization mechanism
> > >>>> between different GPU queues.
> > >>> Yeah, but you can't have it both ways. Either all the scheduling and
> > >>> fence handling in the kernel is a problem, or you actually want to
> > >>> schedule in the kernel. hw definitely seems to be moving towards the
> > >>> more stupid spinlock-in-hw model (and direct submit from userspace and
> > >>> all that), priority inversions be damned. I'm really not sure we should
> > >>> fight that - if it's really that inefficient then maybe hw will add
> > >>> support for waiting on sync constructs in hardware, or at least be
> > >>> smarter about scheduling other stuff. E.g. on intel hw both the kernel
> > >>> scheduler and the fw scheduler know when you're spinning on a hw fence
> > >>> (whether in userspace or the kernel doesn't matter) and plug in
> > >>> something else. Add in a bit of hw support to watch cachelines, and
> > >>> you have something which can handle both directions efficiently.
> > >>>
> > >>> Imo, given where hw is going, we shouldn't try to be too clever here.
> > >>> The only thing we do need to provide for is being able to do cpu-side
> > >>> waits without spinning. And that should probably still be done in a
> > >>> fairly gpu-specific way.
> > >>> -Daniel
> > >>>
> > >>>> Christian.
> > >>>>
> > >>>>>     The kernel then provides some helpers for
> > >>>>> waiting on them reliably and without spinning.  I don't expect
> > >>>>> everyone to support these right away, but if we're going to re-plumb
> > >>>>> userspace for explicit synchronization, I'd like to make sure we take
> > >>>>> this into account so we only have to do it once.
> > >>>>>
> > >>>>>
> > >>>>>> Deadlock mitigation to recover from segfaults:
> > >>>>>> - The kernel knows which process is obliged to signal which fence.
> > >>> This information is part of the Present request and supplied by userspace.
> > >>>>> This isn't clear to me.  Yes, if we're using anything dma-fence based
> > >>>>> like syncobj, this is true.  But it doesn't seem totally true as a
> > >>>>> general statement.
> > >>>>>
> > >>>>>
> > >>>>>> - If the producer crashes, the kernel signals the submit fence, so
> > >>> that the consumer can make forward progress.
> > >>>>>> - If the consumer crashes, the kernel signals the return fence, so
> > >>> that the producer can reclaim the buffer.
> > >>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like
> > >>> GPU hangs.
> > >>>>> What do you mean by "all"?  All fences that were supposed to be
> > >>>>> signaled by the hung context?
> > >>>>>
> > >>>>>
> > >>>>>> Other window system requests can follow the same idea.
> > >>>>>>
> > >>>>>> Merged fences where one fence object contains multiple fences will be
> > >>> supported. A merged fence is signalled only when all of its fences are signalled.
> > >>> The consumer will have the option to redefine the unsignalled return fence
> > >>> to a merged fence.
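
In other words (conceptual sketch only, reusing the invented
fence_check() helper from the userspace-fence sketch above):

    /* A merged fence signals once every fence it contains has signalled. */
    struct merged_fence {
        struct userspace_fence **fences;
        uint64_t *points;
        unsigned count;
    };

    static bool merged_fence_check(const struct merged_fence *m)
    {
        for (unsigned i = 0; i < m->count; i++)
            if (!fence_check(m->fences[i], m->points[i]))
                return false;
        return true;
    }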
> > >>>>>> 2.2. Modesetting
> > >>>>>>
> > >>>>>> Since a modesetting driver can also be the consumer, the present
> > >>> ioctl will contain a submit fence and a return fence too. One small problem
> > >>> with this is that userspace can hang the modesetting driver, but in theory,
> > >>> any later present ioctl can override the previous one, so the unsignalled
> > >>> presentation is never used.
> > >>>>>>
> > >>>>>> 3. New memory management
> > >>>>>>
> > >>>>>> The per-BO fences will be removed and the kernel will not know which
> > >>> buffers are busy. This will reduce CPU overhead and latency. The kernel
> > >>> will not need per-BO fences with explicit synchronization, so we just need
> > >>> to remove their last user: buffer evictions. It also resolves the current
> > >>> OOM deadlock.
> > >>>>> Is this even really possible?  I'm no kernel MM expert (trying to
> > >>>>> learn some) but my understanding is that the use of per-BO dma-fence
> > >>>>> runs deep.  I would like to stop using it for implicit synchronization
> > >>>>> to be sure, but I'm not sure I believe the claim that we can get rid
> > >>>>> of it entirely.  Happy to see someone try, though.
> > >>>>>
> > >>>>>
> > >>>>>> 3.1. Evictions
> > >>>>>>
> > >>>>>> If the kernel wants to move a buffer, it will have to wait for
> > >>> everything to go idle, halt all userspace command submissions, move the
> > >>> buffer, and resume everything. This is not expected to happen when memory
> > >>> is not exhausted. Other more efficient ways of synchronization are also
> > >>> possible (e.g. sync only one process), but are not discussed here.
> > >>>>>> 3.2. Per-process VRAM usage quota
> > >>>>>>
> > >>>>>> Each process can optionally and periodically query its VRAM usage
> > >>> quota and change domains of its buffers to obey that quota. For example, a
> > >>> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> > >>> quota to 1 GB. The process can change the domains of the least important
> > >>> buffers to GTT to get the best outcome for itself. If the process doesn't
> > >>> do it, the kernel will choose which buffers to evict at random. (thanks to
> > >>> Christian Koenig for this idea)
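
Roughly, the intended division of responsibility would look like the
following. None of these calls exist today; every name is invented
purely to illustrate who decides what (the kernel sets the quota, the
process picks the victims):

    /* Hypothetical userspace loop, run periodically by the process. */
    static void obey_vram_quota(int device_fd)
    {
        uint64_t quota = query_vram_quota(device_fd);      /* invented */
        while (query_vram_usage(device_fd) > quota) {      /* invented */
            struct bo *victim = least_important_bo();      /* app policy */
            set_bo_domain(device_fd, victim, DOMAIN_GTT);  /* invented */
        }
    }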
> > >>>>> This is going to be difficult.  On Intel, we have some resources that
> > >>>>> have to be pinned to VRAM and can't be dynamically swapped out by the
> > >>>>> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> > >>>>> Vulkan, we'll be entirely dependent on the application to use the
> > >>>>> appropriate Vulkan memory budget APIs.
> > >>>>>
> > >>>>> --Jason
> > >>>>>
> > >>>>>
> > >>>>>> 3.3. Buffer destruction without per-BO fences
> > >>>>>>
> > >>>>>> When the buffer destroy ioctl is called, an optional fence list can
> > >>> be passed to the kernel to indicate when it's safe to deallocate the
> > >>> buffer. If the fence list is empty, the buffer will be deallocated
> > >>> immediately. Shared buffers will be handled by merging fence lists from all
> > >>> processes that destroy them. Mitigation of malicious behavior:
> > >>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
> > >>>>>> - If userspace sends fences that never signal, the kernel will have a
> > >>> timeout period and then will proceed to deallocate the buffer anyway.
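
A hypothetical uAPI sketch of what such a destroy request could carry;
the struct and field names are invented for illustration and are not
part of the proposal text:

    /* Hypothetical ioctl argument: destroy the BO once the listed fences
     * have signalled, or after a kernel timeout if they never do.
     * num_fences == 0 means "deallocate immediately". */
    struct drm_xxx_gem_destroy {
            __u32 handle;      /* BO to destroy */
            __u32 num_fences;
            __u64 fences;      /* user pointer to an array of fence points */
    };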
> > >>>>>> 3.4. Other notes on MM
> > >>>>>>
> > >>>>>> Overcommitment of GPU-accessible memory will cause an allocation
> > >>> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> > >>> might not be supported.
> > >>>>>> Kernel drivers could move to this new memory management today. Only
> > >>> buffer residency and evictions would stop using per-BO fences.
> > >>>>>>
> > >>>>>> 4. Deprecating implicit synchronization
> > >>>>>>
> > >>>>>> It can be phased out by introducing a new generation of hardware
> > >>> where the driver doesn't add support for it (like a driver fork would do),
> > >>> assuming userspace has all the changes for explicit synchronization. This
> > >>> could potentially create an isolated part of the kernel DRM where all
> > >>> drivers only support explicit synchronization.
> > >>>>>> Marek
> > >>>
> > >>> --
> > >>> Daniel Vetter
> > >>> Software Engineer, Intel Corporation
> > >>> http://blog.ffwll.ch
> > >>>
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

