[PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)

Wed May 26 16:52:59 UTC 2021

On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel at fooishbar.org> wrote:
> On Wed, 26 May 2021 at 14:44, Daniel Vetter <daniel at ffwll.ch> wrote:
> > On Wed, May 26, 2021 at 02:08:19PM +0100, Daniel Stone wrote:
> > > Are you saying that if a compositor imports a client-provided dmabuf
> > > as an EGLImage to use as a source texture for its rendering, and then
> > > provides it to VA-API or V4L2 to use as a media encode source (both
> > > purely read-only ops), that these will both serialise against each
> > > other? Like, my media decode job won't begin execution until the
> > > composition read has fully retired?
> > >
> > > If so, a) good lord that hurts, and b) what are shared fences actually ... for?
> >
> > Shared is shared, I just meant to say that we always add the shared fence.
> > So an explicit ioctl to add more shared fences is kinda pointless.
> >
> > So yeah on a good driver this will run in parallel. On a not-so-good
> > driver (which currently includes amdgpu and panfrost) this will serialize,
> > because those drivers don't have the concept of a non-exclusive fence for
> > such shared buffers (amdgpu does not sync internally, but will sync as
> > soon as it's cross-drm_file).
>
> When you say 'we always add the shared fence', add it to ... where?
> And which shared fence? (I'm going to use 'fence' below to refer to
> anything from literal sync_file to timeline-syncobj to userspace
> fence.)

In the current model, every time you submit anything to the gpu, we
create a dma_fence to track this work. This dma_fence is attached as a
shared fence to the dma_resv obj of every object in your working set.
Clarifications:
- you = either userspace or the kernel, anything really, including fun
stuff like writing PTEs, or clearing PTEs and then flushing TLBs
- working set = depends, but can be anything from "really just the
buffers the current gpu submission uses" to "everything bound into a
given gpu VM"

This is the fence I'm talking about here.

Since you can't escape this (not unless we do direct userspace submit
with userspace memory fences), and since the shared fences aren't split
into "relevant for implicit sync" and "not relevant for implicit sync",
there's really not much point in adding implicit read fences. For now
at least; we might want to change this eventually.
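
To make the rule above concrete, here is a minimal userspace sketch. It
is a toy model, not the kernel's dma_resv API: every toy_* name is made
up for illustration, and it assumes a "good driver" where readers only
sync against the exclusive slot while writers sync against everything.

```c
/* Toy userspace model of the rule described above. All toy_* names are
 * hypothetical; this is NOT the kernel's dma_resv API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_SHARED 8

struct toy_fence {
	bool signaled;
};

struct toy_resv {
	struct toy_fence *excl;                   /* exclusive (write) slot */
	struct toy_fence *shared[TOY_MAX_SHARED]; /* shared (read) slots */
	size_t nshared;
};

/* Models "every submission attaches its fence as a shared fence to
 * every object in the working set". */
static void toy_submit(struct toy_resv *working_set[], size_t n,
		       struct toy_fence *job)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_resv *r = working_set[i];

		assert(r->nshared < TOY_MAX_SHARED);
		r->shared[r->nshared++] = job;
	}
}

/* Number of unsignaled fences an implicitly synced access must wait
 * for: a reader only looks at the exclusive slot, a writer at all of
 * them. */
static size_t toy_wait_count(const struct toy_resv *r, bool write)
{
	size_t n = 0;

	if (r->excl && !r->excl->signaled)
		n++;
	if (write) {
		for (size_t i = 0; i < r->nshared; i++)
			if (!r->shared[i]->signaled)
				n++;
	}
	return n;
}
```

With one unsignaled fence in the exclusive slot and two submissions on
top, a reader in this model waits on one fence and a writer on three.
The two submissions never wait on each other, and they got their shared
fence without anyone asking for it.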

> I'll admit that I've typed out an argument twice for always export
> from excl+shared, and always import to excl, results in oversync. And
> I keep tying myself in knots trying to do it. It's arguably slightly
> contrived, but here's my third attempt ...
>
> Vulkan Wayland client, full-flying-car-sync Wayland protocol,
> Vulkan-based compositor. Part of the contract when the server exposes
> that protocol is that it guarantees to do explicit sync in both
> directions, so the client provides a fence at QueueSubmit time and the
> server provides one back when releasing the image for return to ANI.
> Neither side ever record fences into the dma_resv because they've
> opted out by being fully explicit-aware.
>
> Now add media encode out on the side because you're streaming. The
> compositor knows this is a transition between explicit and implicit
> worlds, so it imports the client's fence into the exclusive dma_resv
> slot, which makes sense: the media encode has to sync against the
> client work, but is indifferent to the parallel compositor work. The
> shared fence is exported back out so the compositor can union the
> encode-finished fence with its composition-finished fence to send back
> to the client with release/ANI.
>
> Now add a second media encode because you want a higher-quality local
> capture to upload to YouTube later on. The compositor can do the exact
> same import/export dance, and the two encodes can safely run in
> parallel. Which is good.

So the example which works is really clear ...

> Where it starts to become complex is: what if your compositor is fully
> explicit-aware but your clients aren't, so your compositor has more
> import/export points to record into the resv? What if you aren't
> actually a compositor but a full-blown media pipeline, where you have
> a bunch of threads all launching reads in parallel, to the extent
> where it's not practical to manage implicit/explicit transitions
> globally, but each thread has to more pessimistically import and
> export around each access?

... but the example where we oversync is hand-waving?

:-P

> I can make the relatively simple usecases work, but it really feels
> like in practice we'll end up with massive oversync in some fairly
> complex usecases, and we'll regret not having had it from the start,
> plus people will just rely on implicit sync for longer because it has
> better (more parallel) semantics in some usecases.

Things fall apart in implicit sync if you have more than one logical
writer into the same buffer. Trivial example is two images in one
buffer, but you could also do funky stuff like interleaved/tiled
rendering with _independent_ consumers. If the consumers are not
independent, then you can again just stuff the two writer fences into
the exclusive slot with the new ioctl (they'll get merged without
additional overhead into one fence array fence).

And the fundamental thing is: This is just not possible with implicit
sync. There's only one fence slot (even if that resolves to an array
of fences for all the producers), so anytime you do multiple
independent things in the same buffer you either:
- must split the buffers so there's again a clear&unique handoff at
each stage of the pipeline
- or use explicit sync
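
A small sketch of why the single slot forces this choice. Again this is
a toy userspace model with made-up toy_* names, not kernel code: the
slot stands in for the exclusive slot holding a merged set of writer
fences, and the "explicit" path is simply a consumer holding on to the
one fence it cares about.

```c
/* Toy model: one implicit fence slot vs. per-consumer explicit fences.
 * All toy_* names are hypothetical, this is not a kernel interface. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_MERGED 4

struct toy_fence {
	bool signaled;
};

/* Stand-in for the exclusive slot after several writer fences have
 * been merged into it. */
struct toy_excl_slot {
	const struct toy_fence *writers[TOY_MAX_MERGED];
	size_t nwriters;
};

/* Merge one more writer fence into the slot. */
static void toy_import_excl(struct toy_excl_slot *slot,
			    const struct toy_fence *f)
{
	assert(slot->nwriters < TOY_MAX_MERGED);
	slot->writers[slot->nwriters++] = f;
}

/* Implicit sync: the slot is ready only once every merged writer is. */
static bool toy_implicit_ready(const struct toy_excl_slot *slot)
{
	for (size_t i = 0; i < slot->nwriters; i++)
		if (!slot->writers[i]->signaled)
			return false;
	return true;
}

/* Explicit sync: the consumer waits only on the fence it was handed. */
static bool toy_explicit_ready(const struct toy_fence *f)
{
	return f->signaled;
}
```

If two writers render into different halves of the buffer and only the
first has finished, a consumer of the first half is ready under
toy_explicit_ready() but still blocked under toy_implicit_ready(). That
is fine when the consumers need both halves anyway, and oversync when
they are independent.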

So in your example, the options are:
- per-client buffers, which you then blend into a composite buffer to
collapse the N implicit fences from N buffers into a single implicit
fence for libva conversion. This single buffer then also allows you to
again fan out to M libva encoders, or whatever it is that you fancy
- explicit fencing and clients render into a single buffer with no
copying, and libva encodes from that single buffer (but again needs
explicit fences or it all comes crashing down)

There's really no option C where you somehow do multiple implicitly
fenced things into a single buffer and expect it to work out in
parallel.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch