[PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)
Daniel Stone
daniel at fooishbar.org
Wed May 26 18:01:07 UTC 2021
Hey,
On Wed, 26 May 2021 at 17:53, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel at fooishbar.org> wrote:
> > > Shared is shared, I just meant to say that we always add the shared fence.
> > > So an explicit ioctl to add more shared fences is kinda pointless.
> > >
> > > So yeah on a good driver this will run in parallel. On a not-so-good
> > > driver (which currently includes amdgpu and panfrost) this will serialize,
> > > because those drivers don't have the concept of a non-exclusive fence for
> > > such shared buffers (amdgpu does not sync internally, but will sync as
> > > soon as it's cross-drm_file).
> >
> > When you say 'we always add the shared fence', add it to ... where?
> > And which shared fence? (I'm going to use 'fence' below to refer to
> > anything from literal sync_file to timeline-syncobj to userspace
> > fence.)
>
> In the current model, every time you submit anything to the gpu, we
> create a dma_fence to track this work. This dma_fence is attached as a
> shared fence to the dma_resv obj of every object in your working set.
> Clarifications
> you = both userspace or kernel, anything really, including fun stuff
> like writing PTEs, or clearing PTEs and then flushing TLBs
> working set = depends, but can be anything from "really just the
> buffers the current gpu submission uses" to "everything bound into a
> given gpu VM"
>
> This is the fence I'm talking about here.
>
> Since you can't escape this (not unless we do direct userspace submit
> with userspace memory fences) and since there's no distinction of the
> shared fences into "relevant for implicit sync" and "not relevant for
> implicit sync" there's really not much point in adding implicit read
> fences. For now at least, we might want to change this eventually.
Yeah, I agree. My own clarification is that I'm talking about an
explicit-first world, where synchronisation is done primarily through
unknowable UMF (userspace memory fences), and falling back to implicit
sync is a painful and expensive operation that we only do when we need
to. So, definitely not on every CS (command submission aka execbuf aka
vkQueueSubmit aka glFlush).
> > I'll admit that I've twice typed out an argument that always exporting
> > from excl+shared, and always importing to excl, results in oversync. And
> > I keep tying myself in knots trying to do it. It's arguably slightly
> > contrived, but here's my third attempt ...
> >
> > Vulkan Wayland client, full-flying-car-sync Wayland protocol,
> > Vulkan-based compositor. Part of the contract when the server exposes
> > that protocol is that it guarantees to do explicit sync in both
> > directions, so the client provides a fence at QueueSubmit time and the
> > server provides one back when releasing the image for return to ANI
> > (AcquireNextImage).
> > Neither side ever records fences into the dma_resv because they've
> > opted out by being fully explicit-aware.
> >
> > Now add media encode out on the side because you're streaming. The
> > compositor knows this is a transition between explicit and implicit
> > worlds, so it imports the client's fence into the exclusive dma_resv
> > slot, which makes sense: the media encode has to sync against the
> > client work, but is indifferent to the parallel compositor work. The
> > shared fence is exported back out so the compositor can union the
> > encode-finished fence with its composition-finished fence to send back
> > to the client with release/ANI.
> >
> > Now add a second media encode because you want a higher-quality local
> > capture to upload to YouTube later on. The compositor can do the exact
> > same import/export dance, and the two encodes can safely run in
> > parallel. Which is good.
>
> So the example which works is really clear ...
>
> > Where it starts to become complex is: what if your compositor is fully
> > explicit-aware but your clients aren't, so your compositor has more
> > import/export points to record into the resv? What if you aren't
> > actually a compositor but a full-blown media pipeline, where you have
> > a bunch of threads all launching reads in parallel, to the extent
> > where it's not practical to manage implicit/explicit transitions
> > globally, but each thread has to more pessimistically import and
> > export around each access?
>
> ... but the example where we oversync is hand-waving?
>
> :-P
Hey, I said I tied myself into knots! Maybe it's because my brain is
too deeply baked into implicit sync, maybe it's because the problem
cases aren't actually problems. Who knows.

I think what it comes down to is that we make it workable for (at
least current-generation, before someone bakes it into Unity) Wayland
compositors to work well with these modal switches, but really
difficult for more complex and variable pipeline frameworks like
GStreamer or PipeWire to work with it.
> > I can make the relatively simple usecases work, but it really feels
> > like in practice we'll end up with massive oversync in some fairly
> > complex usecases, and we'll regret not having had it from the start,
> > plus people will just rely on implicit sync for longer because it has
> > better (more parallel) semantics in some usecases.
>
> Things fall apart in implicit sync if you have more than one logical
> writer into the same buffer. Trivial example is two images in one
> buffer, but you could also do funky stuff like interleaved/tiled
> rendering with _independent_ consumers. If the consumers are not
> independent, then you can again just stuff the two writer fences into
> the exclusive slot with the new ioctl (they'll get merged without
> additional overhead into one fence array fence).
>
> And the fundamental thing is: This is just not possible with implicit
> sync. There's only one fence slot (even if that resolves to an array
> of fences for all the producers), so anytime you do multiple
> independent things in the same buffer you either:
> - must split the buffers so there's again a clear&unique handoff at
> each stage of the pipeline
> - or use explicit sync
Yeah no argument, this doesn't work & can't work in implicit sync.

But what I'm talking about is having a single writer (serialised) and
multiple readers (in parallel). Readers add to the shared slot,
serialising their access against all earlier exclusive fences, and
writers add to the exclusive slot, serialising their access against
all earlier fences (both exclusive and shared).

So if import can only add to the exclusive slot, then we can end up
potentially serialising readers against each other. We want readers to
land in the shared slot to be able to parallelise against each other
and let writers serialise after them, no?
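
To make the slot rules above concrete, here is a rough sketch as a toy
Python model. It is not kernel code: `Resv`, `add_read` and `add_write`
are made-up names, fences are plain strings, and the only thing it tries
to show is which earlier fences each new access would have to wait on.

```python
from dataclasses import dataclass, field


@dataclass
class Resv:
    """Toy stand-in for a dma_resv: an exclusive slot plus shared fences.

    Hypothetical model for illustration only, not the kernel structure.
    """
    exclusive: list = field(default_factory=list)
    shared: list = field(default_factory=list)

    def add_read(self, fence):
        # A reader waits on the earlier exclusive (write) fences only,
        # then records its own fence in the shared slot.
        deps = list(self.exclusive)
        self.shared.append(fence)
        return deps

    def add_write(self, fence):
        # A writer waits on everything before it, exclusive and shared,
        # then becomes the new exclusive fence.
        deps = list(self.exclusive) + list(self.shared)
        self.exclusive = [fence]
        self.shared = []
        return deps


resv = Resv()
render_deps = resv.add_write("client-render")
encode_a_deps = resv.add_read("encode-a")
encode_b_deps = resv.add_read("encode-b")
next_frame_deps = resv.add_write("client-render-2")
```

In this model the two encodes each depend only on `client-render`, so
they are free to run in parallel, while the next write depends on both
of them.
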
> So in your example, options are
> - per-client buffers, which you then blend into a composite buffer to
> handle the N implicit fences from N buffers into a single implicit
> fence for libva conversion. This single buffer then also allows you to
> again fan out to M libva encoders, or whatever it is that you fancy
> - explicit fencing and clients render into a single buffer with no
> copying, and libva encodes from that single buffer (but again needs
> explicit fences or it all comes crashing down)
>
> There's really no option C where you somehow do multiple implicitly
> fenced things into a single buffer and expect it to work out in
> parallel.
All of my examples above are a single client buffer (GPU source which
places a fence into the exclusive slot for when the colour buffer
contents are fully realised), just working its way through multiple
stages and APIs. Like, your single Dota2 window ends up in a
Vulkan-based Wayland compositor, a pure VA-API encode stream to write
high-quality AV1 to disk, and also an EGL pipeline which overlays your
awesome logo and webcam stream before VA-API encoding to a
lower-quality H.264 stream for Twitch. This isn't a convoluted
example, it's literally what the non-geriatric millennials do all day.

It's a lot of potential boundaries between implicit & explicit world,
and if we've learned one thing from modifiers it's that we probably
shouldn't underthink the boundaries.

So:
1. Does every CS generate the appropriate resv entries (exclusive for
write, shared for read) for every access to every buffer? I think the
answer has to be no, because it's not necessarily viable in future.
2. If not every CS generates the appropriate resv entries, do we go
for the middle ground where we keep interactions with implicit sync
implicit (e.g. every client API accessing any externally-visible BO
populates the appropriate resv slot, but internal-only buffers get to
skip it), or do we surface them and make it explicit (e.g. the Wayland
explicit-sync protocol is a contract between client/compositor that
the client doesn't have to populate the resv slots, because the
compositor will ensure every access it makes is appropriately
synchronised)? I think the latter, because the halfway house sounds
really painful for questionable if any benefit, and makes it maybe
impossible for us to one day deprecate implicit.
3. If we do surface everything and make userspace handle the
implicit/explicit boundaries, do we make every explicit -> implicit
boundary (via the import ioctl) populate the exclusive slot or allow
it to choose? I think allow it to choose, because I don't understand
what the restriction buys us.
4. Can the combination of dynamic modifier negotiation and explicit
synchronisation let us deliver the EGLStreams promise before
EGLStreams can? :)
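
For question 3, a minimal sketch of what the restriction could cost,
again as a toy Python model rather than the real ioctl. The
`import_slot` parameter and the `deps_for_readers` helper are
hypothetical, and the model assumes each reader crosses the explicit ->
implicit boundary in turn, with an imported fence being merged into
whichever slot the policy picks.

```python
def deps_for_readers(readers, import_slot):
    """Return {reader: fences it waits on} for a single client buffer.

    import_slot is "exclusive" or "shared"; it is a hypothetical knob
    standing in for whatever choice an import interface might offer.
    """
    exclusive = ["client-render"]  # fence left behind by the producer
    shared = []
    waits = {}
    for reader in readers:
        # An implicit-sync reader only waits on the exclusive slot.
        waits[reader] = list(exclusive)
        if import_slot == "exclusive":
            # Import merges the reader's fence into the exclusive slot,
            # so every later reader ends up waiting on it too.
            exclusive.append(reader)
        else:
            # Import lands in the shared slot; later readers ignore it.
            shared.append(reader)
    return waits


readers = ["compositor", "av1-encode", "h264-encode"]
excl_only = deps_for_readers(readers, "exclusive")
choosable = deps_for_readers(readers, "shared")
```

Under these assumptions, `excl_only` has the H.264 encode waiting on
both the compositor and the AV1 encode, whereas `choosable` has all
three readers waiting on the client render alone.
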
Cheers,
Daniel