[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Wed Apr 28 04:01:06 UTC 2021

On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák <maraeo at gmail.com> wrote:
>
> Jason, both memory-based signalling as well as interrupt-based signalling to the CPU would be supported by amdgpu. External devices don't need to support memory-based sync objects. The only limitation is that they can't convert amdgpu sync objects to dma_fence.

Sure.  I'm not worried about the mechanism.  We just need a word that
means "the new fence thing" and I've been throwing "memory fence"
around for that.  Other mechanisms may work as well.

> The sad thing is that "external -> amdgpu" dependencies are really "external <-> amdgpu" dependencies due to mutually-exclusive access required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop that would initially work with those buffers. Explicitly sync'd buffers also won't work if other drivers convert explicit fences to dma_fence. Thus, both implicit sync and explicit sync might not work with other drivers at all. The only interop that would initially work is explicit fences with memory-based waiting and signalling on the external device to keep the kernel out of the picture.

Yup.  This is where things get hard.  That said, I'm not quite ready
to give up on memory/interrupt fences just yet.

One thought that came to mind which might help would be if we added an
extremely strict concept of memory ownership.  The idea would be that
any given BO would be in one of two states at any given time:

 1. legacy: dma_fences and implicit sync work as normal, but the BO
cannot be resident in any "modern" (direct submission, ULLS, whatever
you want to call it) context

 2. modern: In this mode the BO should not be used by any legacy
context.  We can't strictly prevent this, unfortunately, but maybe we
can say reads produce garbage and writes may be discarded.  In this
mode, it can be bound to modern contexts.

In theory, when in "modern" mode, you could bind the same buffer in
multiple modern contexts at a time.  However, when that's the case, it
makes ownership really tricky to track.  Therefore, we might want some
sort of dma-buf create flag for "always modern" vs. "switchable" and
only allow binding to one modern context at a time when it's
switchable.
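
A minimal sketch of the two states and the binding rule above, in
Python purely for illustration.  Every name here (Mode, BufferObject,
always_modern, bind_modern, set_mode) is hypothetical and not an
existing kernel or Mesa interface.  It also assumes a switchable BO
has to be unbound from its modern context before it can go back to
legacy, which is my reading of the rule rather than something that has
been worked out.

```python
from enum import Enum


class Mode(Enum):
    LEGACY = "legacy"  # dma_fence + implicit sync; no modern contexts
    MODERN = "modern"  # bindable to modern contexts; no legacy use


class BufferObject:
    """Hypothetical model of a BO under the two-state ownership rule."""

    def __init__(self, always_modern=False):
        # always_modern stands in for the proposed dma-buf create flag.
        self.always_modern = always_modern
        self.mode = Mode.MODERN if always_modern else Mode.LEGACY
        self.modern_contexts = set()

    def bind_modern(self, ctx):
        if self.mode is not Mode.MODERN:
            raise PermissionError("legacy BO cannot be resident in a modern context")
        # Switchable BOs get a single owner so ownership stays trackable.
        if not self.always_modern and self.modern_contexts - {ctx}:
            raise PermissionError("switchable BO allows one modern context at a time")
        self.modern_contexts.add(ctx)

    def unbind_modern(self, ctx):
        self.modern_contexts.discard(ctx)

    def set_mode(self, mode):
        if self.always_modern:
            raise PermissionError("always-modern BO cannot change mode")
        # Assumption: going back to legacy requires unbinding first.
        if mode is Mode.LEGACY and self.modern_contexts:
            raise RuntimeError("BO is still bound to a modern context")
        self.mode = mode
```

With this model an "always modern" BO can be bound in several modern
contexts at once, while a "switchable" one has to be moved to modern
mode first and then has exactly one owner.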

If we did this, we may be able to move any dma_fence shenanigans to
the ownership transition points.  We'd still need some sort of "wait
for fence and transition" which has a timeout.  However, then we'd be
fairly well guaranteed that the application (not just Mesa!) has
really and truly decided it's done with the buffer and we wouldn't (I
hope!) end up with the accidental edges in the dependency graph.
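
To make the "wait for fence and transition" step concrete, here is a
rough Python sketch.  threading.Event is only a stand-in for a
dma_fence, the BO is a plain dict, and wait_and_transition is a
made-up name; none of this is a real interface.  The one point it
tries to show is that the mode flips only if every outstanding fence
signals before a single shared deadline, and otherwise the BO stays
where it was.

```python
import threading
import time


def wait_and_transition(bo, fences, new_mode, timeout_s):
    """Wait for all fences, then flip bo["mode"].

    Returns True if the transition happened, False on timeout (in
    which case bo is left untouched).  `fences` are threading.Event
    objects standing in for dma_fences.
    """
    deadline = time.monotonic() + timeout_s
    for fence in fences:
        remaining = max(0.0, deadline - time.monotonic())
        # Event.wait() returns False if the timeout expired first.
        if not fence.wait(remaining):
            return False
    bo["mode"] = new_mode
    return True


if __name__ == "__main__":
    done = threading.Event()
    done.set()  # a fence that has already signalled
    bo = {"mode": "legacy"}
    print(wait_and_transition(bo, [done], "modern", 0.1), bo)
```

The timeout is what keeps a stuck fence from blocking the transition
forever; what the caller does after a failed transition is left open
here.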

Of course, I've not yet proven any of this correct so feel free to
tell me why it won't work. :-)  It was just one of those "about to go
to bed and had a thunk" type thoughts.

--Jason

P.S.  Daniel was 100% right when he said this discussion needs a glossary.

> Marek
>
>
> On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand <jason at jlekstrand.net> wrote:
>>
>> Trying to figure out which e-mail in this mess is the right one to reply to....
>>
>> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach at pengutronix.de> wrote:
>> >
>> > Hi,
>> >
>> > On Tuesday, 27.04.2021 at 09:26 -0400, Marek Olšák wrote:
>> > > Ok. So that would only make the following use cases broken for now:
>> > > - amd render -> external gpu
>>
>> Assuming said external GPU doesn't support memory fences.  If we do
>> amdgpu and i915 at the same time, that covers basically most of the
>> external GPU use-cases.  Of course, we'd want to convert nouveau as
>> well for the rest.
>>
>> > > - amd video encode -> network device
>> >
>> > FWIW, "only" breaking amd render -> external gpu will make us pretty
>> > unhappy, as we have some cases where we are combining an AMD APU with an
>> > FPGA-based graphics card. I can't go into the specifics of this use
>> > case too much but basically the AMD graphics is rendering content that
>> > gets composited on top of a live video pipeline running through the
>> > FPGA.
>>
>> I think it's worth taking a step back and asking what's being proposed
>> here before we freak out too much.  If we do go this route, it doesn't
>> mean that your FPGA use-case can't work, it just means it won't work
>> out of the box anymore.  You'll have to separate execution and memory
>> dependencies inside your FPGA driver.  That's still not great but it's
>> not as bad as you maybe made it sound.
>>
>> > > What about the case when we get a buffer from an external device and
>> > > we're supposed to make it "busy" when we are using it, and the
>> > > external device wants to wait until we stop using it? Is it something
>> > > that can happen, thus turning "external -> amd" into "external <->
>> > > amd"?
>> >
>> > Zero-copy texture sampling from a video input certainly appreciates
>> > this very much. Trying to pass the render fence through the various
>> > layers of userspace to be able to tell when the video input can reuse a
>> > buffer is a great experience in yak shaving. Allowing the video input
>> > to reuse the buffer as soon as the read dma_fence from the GPU is
>> > signaled is much more straightforward.
>>
>> Oh, it's definitely worse than that.  Every window system interaction
>> is bi-directional.  The X server has to wait on the client before
>> compositing from it and the client has to wait on X before re-using
>> that back-buffer.  Of course, we can break that latter dependency by
>> doing a full CPU wait but that's going to mean either more latency or
>> reserving more back buffers.  There's no good clean way to claim that
>> any of this is one-directional.
>>
>> --Jason