[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
Marek Olšák
maraeo at gmail.com
Wed Apr 28 05:19:16 UTC 2021
On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand, <jason at jlekstrand.net> wrote:
> On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák <maraeo at gmail.com> wrote:
> >
> > Jason, both memory-based signalling as well as interrupt-based
> > signalling to the CPU would be supported by amdgpu. External devices don't
> > need to support memory-based sync objects. The only limitation is that they
> > can't convert amdgpu sync objects to dma_fence.
>
> Sure. I'm not worried about the mechanism. We just need a word that
> means "the new fence thing" and I've been throwing "memory fence"
> around for that. Other mechanisms may work as well.
>
> > The sad thing is that "external -> amdgpu" dependencies are really
> "external <-> amdgpu" dependencies due to mutually-exclusive access
> required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
> only interop that would initially work with those buffers. Explicitly
> sync'd buffers also won't work if other drivers convert explicit fences to
> dma_fence. Thus, both implicit sync and explicit sync might not work with
> other drivers at all. The only interop that would initially work is
> explicit fences with memory-based waiting and signalling on the external
> device to keep the kernel out of the picture.
>
> Yup. This is where things get hard. That said, I'm not quite ready
> to give up on memory/interrupt fences just yet.
>
> One thought that came to mind which might help would be if we added an
> extremely strict concept of memory ownership. The idea would be that
> any given BO would be in one of two states at any given time:
>
> 1. legacy: dma_fences and implicit sync work as normal, but it cannot
> be resident in any "modern" (direct submission, ULLS, whatever you
> want to call it) context
>
> 2. modern: In this mode they should not be used by any legacy
> context. We can't strictly prevent this, unfortunately, but maybe we
> can say reading produces garbage and writes may be discarded. In this
> mode, they can be bound to modern contexts.
>
> In theory, when in "modern" mode, you could bind the same buffer in
> multiple modern contexts at a time. However, when that's the case, it
> makes ownership really tricky to track. Therefore, we might want some
> sort of dma-buf create flag for "always modern" vs. "switchable" and
> only allow binding to one modern context at a time when it's
> switchable.
>
> If we did this, we may be able to move any dma_fence shenanigans to
> the ownership transition points. We'd still need some sort of "wait
> for fence and transition" which has a timeout. However, then we'd be
> fairly well guaranteed that the application (not just Mesa!) has
> really and truly decided it's done with the buffer and we wouldn't (I
> hope!) end up with the accidental edges in the dependency graph.
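
To make the two-state idea above concrete, here is a minimal sketch of what
the ownership tracking and the "wait for fence and transition" point might
look like. Every name in it is hypothetical; nothing below is an existing
kernel or dma-buf API:

    /* Hypothetical sketch only; none of these names exist today. */
    #include <errno.h>
    #include <stdbool.h>

    enum bo_mode {
        BO_MODE_LEGACY,  /* dma_fence/implicit sync work; no modern contexts */
        BO_MODE_MODERN,  /* memory fences only; bindable to modern contexts */
    };

    struct bo_state {
        enum bo_mode mode;
        bool always_modern;        /* create flag: can never return to legacy */
        const void *modern_owner;  /* single owner while "switchable" */
    };

    /* Stands in for "wait for all dma_fences attached to the BO, with
     * a timeout". */
    bool wait_all_fences(struct bo_state *bo, long timeout_ms);

    /* The transition point: all dma_fence handling is confined here, so
     * modern contexts never have to deal with dma_fence at all. */
    int bo_make_modern(struct bo_state *bo, const void *ctx, long timeout_ms)
    {
        if (bo->mode == BO_MODE_MODERN) {
            /* A switchable BO may be bound to only one modern context. */
            if (!bo->always_modern && bo->modern_owner != ctx)
                return -EBUSY;
            return 0;
        }
        if (!wait_all_fences(bo, timeout_ms))
            return -ETIMEDOUT;
        bo->mode = BO_MODE_MODERN;
        bo->modern_owner = ctx;
        return 0;
    }

The legacy -> modern transition is the only place a timeout would be needed;
once in modern mode, everything is memory fences.
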
>
> Of course, I've not yet proven any of this correct so feel free to
> tell me why it won't work. :-) It was just one of those "about to go
> to bed and had a thunk" type thoughts.
>
We'd like to keep userspace outside of the Mesa drivers intact and working,
except for interop, where we don't have much choice. At the same time,
future hw may remove support for kernel queues, so we might not have much
choice there either, depending on what the hw interface will look like.

The idea is to have an ioctl for querying a timeline semaphore buffer
associated with a shared BO, and an ioctl for querying the next wait and
signal number (e.g. n and n+1) for that semaphore. Waiting for n would be
like mutex lock and signaling would be like mutex unlock. The next process
would use the same ioctl and get n+1 and n+2, etc. There is a potential
deadlock: one process can do lock A, then lock B, while another does lock B,
then lock A. That can be prevented by having the ioctl return the numbers
for multiple buffers at once, so both processes get all their numbers in one
atomic step and a circular wait cannot form. This solution needs
no changes to userspace outside of Mesa drivers, and we'll also keep the BO
wait ioctl for GPU-CPU sync.
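
As a sketch of what that could look like from userspace (the ioctl number,
struct layout, and names are all made up for illustration; this is not an
existing amdgpu interface):

    /* Hypothetical userspace sketch of the proposal above. */
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct drm_amdgpu_bo_sem {       /* one entry per shared BO */
        uint32_t bo_handle;          /* in:  the shared BO */
        uint32_t pad;
        uint64_t sem_va;             /* out: timeline semaphore buffer */
        uint64_t wait_value;         /* out: n   - wait for this ("lock") */
        uint64_t signal_value;       /* out: n+1 - signal when done ("unlock") */
    };

    struct drm_amdgpu_bo_sem_query {
        uint64_t sems_ptr;           /* in: array of struct drm_amdgpu_bo_sem */
        uint32_t num_bos;
        uint32_t pad;
    };

    /* Made-up ioctl number, for the sketch only. */
    #define DRM_IOCTL_AMDGPU_BO_SEM_QUERY \
        _IOWR('d', 0x60, struct drm_amdgpu_bo_sem_query)

    /* Querying the (wait, signal) numbers for all BOs of a submission
     * in a single ioctl is what prevents the lock A/lock B vs.
     * lock B/lock A deadlock: the kernel hands out all the numbers
     * atomically, so two processes can never each hold one semaphore
     * while waiting on the other. */
    int acquire_shared_bos(int fd, struct drm_amdgpu_bo_sem *sems,
                           uint32_t num_bos)
    {
        struct drm_amdgpu_bo_sem_query q = {
            .sems_ptr = (uintptr_t)sems,
            .num_bos = num_bos,
        };
        return ioctl(fd, DRM_IOCTL_AMDGPU_BO_SEM_QUERY, &q);
    }

    /* The GPU then waits for sems[i].wait_value on each semaphore
     * before touching the BO and writes sems[i].signal_value when its
     * last access completes; the next process gets n+1/n+2, and so on. */
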
Marek
> --Jason
>
> P.S. Daniel was 100% right when he said this discussion needs a glossary.
>
>
> > Marek
> >
> >
> > On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand <jason at jlekstrand.net> wrote:
> >>
> >> Trying to figure out which e-mail in this mess is the right one to reply to....
> >>
> >> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach at pengutronix.de> wrote:
> >> >
> >> > Hi,
> >> >
> >> > On Tuesday, 27.04.2021 at 09:26 -0400, Marek Olšák wrote:
> >> > > Ok. So that would only make the following use cases broken for now:
> >> > > - amd render -> external gpu
> >>
> >> Assuming said external GPU doesn't support memory fences. If we do
> >> amdgpu and i915 at the same time, that covers basically most of the
> >> external GPU use-cases. Of course, we'd want to convert nouveau as
> >> well for the rest.
> >>
> >> > > - amd video encode -> network device
> >> >
> >> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> >> > unhappy, as we have some cases where we are combining an AMD APU with an
> >> > FPGA-based graphics card. I can't go into the specifics of this use-
> >> > case too much but basically the AMD graphics is rendering content that
> >> > gets composited on top of a live video pipeline running through the
> >> > FPGA.
> >>
> >> I think it's worth taking a step back and asking what's being proposed here
> >> before we freak out too much. If we do go this route, it doesn't mean
> >> that your FPGA use-case can't work, it just means it won't work
> >> out of the box anymore. You'll have to separate execution and memory
> >> dependencies inside your FPGA driver. That's still not great but it's
> >> not as bad as you maybe made it sound.
> >>
> >> > > What about the case when we get a buffer from an external device and
> >> > > we're supposed to make it "busy" when we are using it, and the
> >> > > external device wants to wait until we stop using it? Is it something
> >> > > that can happen, thus turning "external -> amd" into "external <->
> >> > > amd"?
> >> >
> >> > Zero-copy texture sampling from a video input certainly appreciates
> >> > this very much. Trying to pass the render fence through the various
> >> > layers of userspace to be able to tell when the video input can reuse a
> >> > buffer is a great experience in yak shaving. Allowing the video input
> >> > to reuse the buffer as soon as the read dma_fence from the GPU is
> >> > signaled is much more straightforward.
> >>
> >> Oh, it's definitely worse than that. Every window system interaction
> >> is bi-directional. The X server has to wait on the client before
> >> compositing from it and the client has to wait on X before re-using
> >> that back-buffer. Of course, we can break that latter dependency by
> >> doing a full CPU wait but that's going to mean either more latency or
> >> reserving more back buffers. There's no good clean way to claim that
> >> any of this is one-directional.
> >>
> >> --Jason
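
To spell out the bi-directional dependency described above, a sketch from the
client's point of view (all names are placeholders, not a real windowing or
driver API):

    /* Placeholder types and functions, purely illustrative. */
    struct buffer;
    typedef struct fence fence_t;

    fence_t *render_to(struct buffer *buf);   /* client GPU work */
    void present(struct buffer *buf, fence_t *render_done);
    fence_t *get_release_fence(struct buffer *buf);
    void fence_wait(fence_t *f);

    void frame(struct buffer *back_buffer)
    {
        /* Direction 1: the X server waits on the client's render fence
         * before compositing from the buffer. */
        present(back_buffer, render_to(back_buffer));

        /* Direction 2: the client waits on the server before reusing
         * the back buffer. Replacing this with a full CPU wait costs
         * either latency or extra back buffers. */
        fence_wait(get_release_fence(back_buffer));
        render_to(back_buffer);    /* only now safe to reuse */
    }
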
>