[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Tue Apr 20 12:19:19 UTC 2021

Hi Daniel,

On 20.04.21 at 14:01, Daniel Vetter wrote:
> On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
>> Hi,
>>
>> This is our initial proposal for explicit fences everywhere and new memory
>> management that doesn't use BO fences. It's a redesign of how Linux
>> graphics drivers work, and it can coexist with what we have now.
>>
>>
>> *1. Introduction*
>> (skip this if you are already sold on explicit fences)
>>
>> The current Linux graphics architecture was initially designed for GPUs
>> with only one graphics queue where everything was executed in the
>> submission order and per-BO fences were used for memory management and
>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
>> queues were added on top, which required the introduction of implicit
>> GPU-GPU synchronization between queues of different processes using per-BO
>> fences. Recently, even parallel execution within one queue was enabled
>> where a command buffer starts draws and compute shaders, but doesn't wait
>> for them, enabling parallelism between back-to-back command buffers.
>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
>> was created to enable all those use cases, and it's the only reason why the
>> scheduler exists.
>>
>> The GPU scheduler, implicit synchronization, BO-fence-based memory
>> management, and the tracking of per-BO fences increase CPU overhead and
>> latency, and reduce parallelism. There is a desire to replace all of them
>> with something much simpler. Below is how we could do it.
> I get the feeling you're mixing up a lot of things here that have more
> nuance, so first some lingo.
>
> - There's kernel based synchronization, based on dma_fence. These come in
>    two major variants: Implicit synchronization, where the kernel attaches
>    the dma_fences to a dma-buf, and explicit synchronization, where the
>    dma_fence gets passed around as a stand-alone object, either a sync_file
>    or a drm_syncobj
>
> - Then there's userspace fence synchronization, where userspace issues any
>    fences directly and the kernel doesn't even know what's going on. This
>    is the only model that allows you to ditch the kernel overhead, and it's
>    also the model that vk uses.
>
>    I concur with Jason that this one is the future, it's the model hw
>    wants, compute wants and vk wants. Building an explicit fence world
>    which doesn't aim at this is imo wasted effort.
>
> Now you smash them into one thing by also changing the memory model, but I
> think that doesn't work:
>
> - Relying on gpu page faults across the board won't happen. I think right
>    now only amd's GFX10 or so has enough pagefault support to allow this,

It's even worse. GFX9 has enough support that it can work in theory.

Because of this Felix and his team are working on HMM support based on 
this generation.

On GFX10 some aspects of it are improved while others are totally broken 
again.

>    and even there I'm not really sure. Nothing else will anytime soon, at
>    least not as far as I know. So we need to support slightly more hw in
>    upstream than just that.  Any plan that's realistic needs to cope with
>    dma_fence for a really long time.
>
> - Pown^WPin All The Things! is probably not a general enough memory
>    management approach. We've kinda tried for years to move away from it.
>    Sure we can support it as an optimization in specific workloads, and it
>    will make stuff faster, but it's not going to be the default I think.
>
> - We live in a post xf86-video-$vendor world, and all these other
>    compositors rely on implicit sync. You're not going to be able to get
>    rid of them anytime soon. What's worse, all the various EGL/vk buffer
>    sharing things also rely on implicit sync, so you get to fix up tons of
>    applications on top. Any plan that's realistic needs to cope with
>    implicit and explicit sync at the same time.
>
> - Absolutely infuriating, but you can't use page-faulting together with any
>    dma_fence synchronization primitives, whether implicit or explicit. This
>    means until the entire ecosystem has moved forward (good luck with that) we
>    have to support dma_fence. The only sync model that works together with
>    page faults is userspace fence based sync.
>
> Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
> sync, at least last I checked. Currently this oversynchronizes badly
> because it's left to the kernel to guess what should be synchronized, and
> that gets things wrong. What you need there is explicit implicit
> synchronization:
>
> - on the cs side, userspace must explicitly set for which buffers the kernel
>    should engage in implicit synchronization. That's how it works on all
>    other drivers that support more explicit userspace like vk or gl drivers
>    that are internally all explicit. So essentially you only set the
>    implicit fence slot when you really want to, and only userspace knows
>    this. Implementing this without breaking the current logic probably
>    needs some flags.
>
> - the other side isn't there yet upstream, but Jason has patches.
>    Essentially you also need to sample your implicit sync points at the
>    right spot, to avoid oversync on later rendering by the producer.
>    Jason's patch solves this by adding an ioctl to dma-buf to get the
>    current set.
>
> - without any of these things, for pure explicit fencing userspace the
>    kernel will simply maintain a list of all current users of a buffer. For
>    memory management, this means eviction handling works roughly like you
>    describe below: we wait for everything before a buffer can be moved.
>
> This should get rid of the oversync issues, and since implicit sync is
> baked in everywhere right now, you'll have to deal with implicit sync for
> a very long time.
>
> Next up is reducing the memory manager overhead of all this, without
> changing the ecosystem.
>
> - hw option would be page faults, but until we have full explicit
>    userspace sync we can't use those. Which currently means compute only.
>    Note that for vulkan or maybe also gl this is quite nasty for userspace,
>    since as soon as you need to switch to dma_fence sync or implicit sync
>    (winsys buffer, or buffer sharing with any of the current set of
>    extensions) you have to flip your internal driver state around all sync
>    points over from userspace fencing to dma_fence kernel fencing. Can
>    still be all explicit using drm_syncobj ofc.
>
> - next up if your hw has preemption, you could use that, except preemption
>    takes a while longer, so from memory pov really should be done with
>    dma_fence. Plus it has all the same problems in that it requires
>    userspace fences.
>
> - now for making dma_fence O(1) in the fastpath you need the shared
>    dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think
>    radeonsi not yet. But maybe I missed that. Either way we need to do some
>    better kernel work so it can also be fast for shared buffers, if those
>    become a problem. On the GL side doing this will use a lot of the tricks
>    for residency/working set management you describe below, except the
>    kernel can still throw out an entire gpu job. This is essentially what
>    you describe with 3.1. Vulkan/compute already work like this.
>
> Now this gets the performance up, but it doesn't give us any road towards
> using page faults (outside of compute) and so retiring dma_fence for good.
> For that we need a few pieces:
>
> - Full new set of userspace winsys protocols and egl/vk extensions. Pray
>    it actually gets adopted, because neither AMD nor Intel has the
>    engineers to push these kinds of ecosystem/middleware issues forward on
>    their payrolls. Good pick is probably using drm_syncobj as the kernel
>    primitive for this. Still uses dma_fence underneath.
>
> - Some clever kernel tricks so that we can substitute dma_fence for
>    userspace fences within a drm_syncobj. drm_syncobj already has the
>    notion of waiting for a dma_fence to materialize. We can abuse that to
>    create an upgrade path from dma_fence based sync to userspace fence
>    syncing. Ofc none of this will be on the table if userspace hasn't
>    adopted explicit sync.
>
> With these two things I think we can have a reasonable upgrade path. None
> of this will be a break-the-world kind of change, though.

How about this:
1. We extend drm_syncobj so that it can contain a classic dma_fence as well
   as be used for user fence synchronization.

   We already discussed that briefly, and I think we should have a rough
   plan for this in our heads.

2. We allow attaching a drm_syncobj to a dma_resv for implicit sync.

   This requires that both the consumer and the producer side support
   user fence synchronization.

   We would still have quite a bunch of limitations; in particular, we would
   need to adjust all the kernel consumers of classic dma_resv objects. But
   I think it should be doable.
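
To make the two steps a bit more concrete, here is a rough Python model of
the idea. It is only a sketch under my own assumptions, not kernel code:
`KernelFence`, `UserFence`, `SyncObj` and all of their methods are made-up
names for illustration, and the user fence is modelled as "a memory value
reaches a target", which is just one possible definition.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class KernelFence:
    """Stand-in for a classic dma_fence (hypothetical model, not the kernel struct)."""
    signaled: bool = False


@dataclass
class UserFence:
    """Stand-in for a user fence: a memory value that must reach wait_value."""
    read_value: Callable[[], int]  # hypothetical accessor for the fence memory
    wait_value: int

    @property
    def signaled(self) -> bool:
        return self.read_value() >= self.wait_value


class SyncObj:
    """Toy container that can hold either kind of fence (step 1)."""

    def __init__(self) -> None:
        self.fence: Optional[Union[KernelFence, UserFence]] = None

    def replace(self, fence: Union[KernelFence, UserFence]) -> None:
        self.fence = fence

    def is_signaled(self) -> bool:
        # An empty container means "fence not materialized yet", not "done".
        return self.fence is not None and self.fence.signaled

    def usable_by_classic_consumer(self) -> bool:
        # Step 2's limitation: a consumer that only understands classic
        # fences cannot take a container holding a user fence.
        return isinstance(self.fence, KernelFence)
```

In this model a consumer that has not been adjusted would check
`usable_by_classic_consumer()` and refuse (or fall back) when it sees a
user fence, which is the limitation mentioned in step 2.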

Regards,
Christian.

>
> Bunch of comments below.
>
>> *2. Explicit synchronization for window systems and modesetting*
>>
>> The producer is an application and the consumer is a compositor or a
>> modesetting driver.
>>
>> *2.1. The Present request*
>>
>> As part of the Present request, the producer will pass 2 fences (sync
>> objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the
>> producer has finished drawing into the presented buffer.
>> - The return fence: Initially unsignalled, it will be signalled when the
>> consumer has finished using the presented buffer.
> Build this with syncobj timelines and it makes a lot more sense I think.
> We'll need that for having a proper upgrade path, both on the hw/driver
> side (being able to support stuff like preempt or gpu page faults) and the
> ecosystem side (so that we don't have to rev protocols twice, once going
> to explicit dma_fence sync and once more for userspace sync).
>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This
>> information is part of the Present request and supplied by userspace.
>> - If the producer crashes, the kernel signals the submit fence, so that the
>> consumer can make forward progress.
>> - If the consumer crashes, the kernel signals the return fence, so that the
>> producer can reclaim the buffer.
> So for kernel based sync imo simplest is to just reuse dma_fence, same
> rules apply.
>
> For userspace fencing the kernel simply doesn't care how stupid userspace
> is. Security checks at boundaries (e.g. client vs compositor) are also
> userspace's problem and can be handled by e.g. timeouts + conditional
> rendering on the compositor side. The timeout might be in the compat glue,
> e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I
> think in vulkan this is de facto already up to applications to deal with
> entirely if they deal with untrusted fences.
>
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
>> hangs.
> Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
> one, with userspace fencing the kernel isn't aware of any deadlocks, you
> fundamentally can't tell "has deadlocked" from "is still doing useful
> computations" because that amounts to solving the halting problem.
>
> Any programming model we come up with where both kernel and userspace are
> involved needs to come up with rules where at least non-evil userspace
> never deadlocks. And if you just allow both then it's pretty easy to come
> up with scenarios where both userspace and kernel alone are deadlock free,
> but their interactions result in hangs. That's why we've recently documented
> all the corner cases around indefinite dma_fences, and also why you currently
> can't use gpu page faults with anything that uses dma_fence for sync.
>
> That's why I think with userspace fencing the kernel simply should not be
> involved at all, aside from providing optimized/blocking cpu wait
> functionality.
>
>> Other window system requests can follow the same idea.
>>
>> Merged fences where one fence object contains multiple fences will be
>> supported. A merged fence is signalled only when all of its fences are signalled.
>> The consumer will have the option to redefine the unsignalled return fence
>> to a merged fence.
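
The merged fence rule quoted above can be written down in a few lines. This
is only a sketch of the stated semantics; `Fence` and `MergedFence` are
hypothetical names, not an existing kernel or Mesa interface.

```python
class Fence:
    """A single fence that starts unsignalled (hypothetical model)."""

    def __init__(self) -> None:
        self.signaled = False

    def signal(self) -> None:
        self.signaled = True


class MergedFence:
    """One fence object containing multiple fences."""

    def __init__(self, fences) -> None:
        self.fences = list(fences)

    @property
    def signaled(self) -> bool:
        # Signalled only when every contained fence is signalled.
        # Note: with this definition an empty merged fence counts as signalled.
        return all(f.signaled for f in self.fences)
```

A consumer that redefines the return fence to a merged fence would, in this
model, simply wrap the original return fence together with its own fences.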
>>
>> *2.2. Modesetting*
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will
>> contain a submit fence and a return fence too. One small problem with this
>> is that userspace can hang the modesetting driver, but in theory, any later
>> present ioctl can override the previous one, so the unsignalled
>> presentation is never used.
>>
>>
>> *3. New memory management*
>>
>> The per-BO fences will be removed and the kernel will not know which
>> buffers are busy. This will reduce CPU overhead and latency. The kernel
>> will not need per-BO fences with explicit synchronization, so we just need
>> to remove their last user: buffer evictions. It also resolves the current
>> OOM deadlock.
> What's "the current OOM deadlock"?
>
>> *3.1. Evictions*
>>
>> If the kernel wants to move a buffer, it will have to wait for everything
>> to go idle, halt all userspace command submissions, move the buffer, and
>> resume everything. This is not expected to happen when memory is not
>> exhausted. Other more efficient ways of synchronization are also possible
>> (e.g. sync only one process), but are not discussed here.
>>
>> *3.2. Per-process VRAM usage quota*
>>
>> Each process can optionally and periodically query its VRAM usage quota and
>> change domains of its buffers to obey that quota. For example, a process
>> allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
>> GB. The process can change the domains of the least important buffers to
>> GTT to get the best outcome for itself. If the process doesn't do it, the
>> kernel will choose which buffers to evict at random. (thanks to Christian
>> Koenig for this idea)
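
As an illustration of the quota idea quoted above, a process could pick the
buffers to move to GTT roughly like this. It is a sketch under assumptions:
the `(name, size, priority)` tuples and `pick_buffers_for_gtt` are invented
for the example, and "least important first" is just one possible policy.

```python
def pick_buffers_for_gtt(buffers, quota_bytes):
    """Return the names of buffers to move from VRAM to GTT.

    buffers: list of (name, size_bytes, priority) tuples, where a higher
    priority means the buffer is more important to keep in VRAM.
    """
    usage = sum(size for _, size, _ in buffers)
    moved = []
    # Walk from least to most important until we fit into the quota.
    for name, size, _ in sorted(buffers, key=lambda b: b[2]):
        if usage <= quota_bytes:
            break
        moved.append(name)
        usage -= size
    return moved
```

With 2 GB allocated and a quota of 1 GB, as in the example above, the
function keeps moving the least important buffers until usage drops to the
quota.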
>>
>> *3.3. Buffer destruction without per-BO fences*
>>
>> When the buffer destroy ioctl is called, an optional fence list can be
>> passed to the kernel to indicate when it's safe to deallocate the buffer.
>> If the fence list is empty, the buffer will be deallocated immediately.
>> Shared buffers will be handled by merging fence lists from all processes
>> that destroy them. Mitigation of malicious behavior:
>> - If userspace destroys a busy buffer, it will get a GPU page fault.
>> - If userspace sends fences that never signal, the kernel will have a
>> timeout period and then will proceed to deallocate the buffer anyway.
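
The destruction rules quoted above (optional fence list, merging for shared
buffers, timeout for fences that never signal) can be modelled like this.
Again only a sketch: `PendingDestroy` is a made-up name, fences are modelled
as callables returning True once signalled, and time is passed in
explicitly to keep the example deterministic.

```python
class PendingDestroy:
    """Tracks when a destroyed buffer may actually be deallocated."""

    def __init__(self, fences, now, timeout):
        self.fences = list(fences)      # callables returning True when signalled
        self.deadline = now + timeout   # after this, deallocate anyway

    def merge(self, more_fences):
        # Shared buffer: another process destroyed it with its own fence list.
        self.fences.extend(more_fences)

    def ready(self, now):
        # An empty fence list means immediate deallocation.
        all_signaled = all(f() for f in self.fences)
        return all_signaled or now >= self.deadline
```

A fence that never signals only delays deallocation until the deadline,
which matches the timeout mitigation described above.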
>>
>> *3.4. Other notes on MM*
>>
>> Overcommitment of GPU-accessible memory will cause an allocation failure or
>> invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
>> supported.
>>
>> Kernel drivers could move to this new memory management today. Only buffer
>> residency and evictions would stop using per-BO fences.
>>
>>
>>
>> *4. Deprecating implicit synchronization*
>>
>> It can be phased out by introducing a new generation of hardware where the
>> driver doesn't add support for it (like a driver fork would do), assuming
>> userspace has all the changes for explicit synchronization. This could
>> potentially create an isolated part of the kernel DRM where all drivers
>> only support explicit synchronization.
> 10-20 years I'd say before that's even an option.
> -Daniel
>
>> Marek