[Intel-gfx] [RFC v2] drm/i915: Android native sync support

Daniel Vetter daniel at ffwll.ch
Sun Jan 25 23:52:39 PST 2015


On Sat, Jan 24, 2015 at 04:08:32PM +0000, Chris Wilson wrote:
> On Sat, Jan 24, 2015 at 10:41:46AM +0100, Daniel Vetter wrote:
> > On Fri, Jan 23, 2015 at 6:30 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > > On Fri, Jan 23, 2015 at 04:53:48PM +0100, Daniel Vetter wrote:
> > >> Yeah that's kind of the big behaviour difference (at least as I see it)
> > >> between explicit sync and implicit sync:
> > >> - with implicit sync the kernel attaches sync points/requests to buffers
> > >>   and userspace just asks about idleness/busyness of buffers. Synchronization
> > >>   between different users is all handled behind userspace's back in the
> > >>   kernel.
> > >>
> > >> - explicit sync attaches sync points to individual bits of work and makes
> > >>   them explicit objects userspace can get at and pass around. Userspace
> > >>   uses these separate things to inquire about when something is
> > >>   done/idle/busy and has its own mapping between explicit sync objects and
> > >>   the different pieces of memory affected by each. Synchronization between
> > >>   different clients is handled explicitly by passing sync objects around
> > >>   each time some rendering is done.
> > >>
> > >> The bigger driver for explicit sync (besides "nvidia likes it sooooo much
> > >> that everyone uses it a lot") seems to be a) shitty gpu drivers without
> > >> proper bo managers (*cough*android*cough*) and svm, where there's simply
> > >> no buffer objects any more to attach sync information to.
> > >
> > > Actually, mesa would really like much finer granularity than at batch
> > > boundaries. Having a sync object for a batch boundary itself is very meh
> > > and not a substantive improvement on what is possible today, but being
> > > able to convert the implicit sync into an explicit fence object is
> > > interesting and lends a layer of abstraction that could make it more
> > > versatile. Most importantly, it allows me to defer the overhead of fence
> > > creation until I actually want to sleep on a completion. Also Jesse
> > > originally supported inserting fences inside a batch, which looked
> > > interesting if impractical.
> > 
> > If we want to allow the kernel to stall on fences (e.g. in the scheduler)
> > only the kernel should be allowed to create fences imo. At least
> > current fences assume that they _will_ signal eventually, and for i915
> > fences we have the hangcheck to ensure this is the case. In-batch
> > fences and lazy fence creation (beyond just delaying the fd allocation
> > to avoid too many fds flying around) are therefore a no-go.
> 
> Lazy fence creation (i.e. attaching a fence to a buffer) just means
> creating the fd for an existing request (which is derived from the
> fence). Or if the buffer is read or write idle, then you just create the
> fence as already-signaled. And yes, it is just to avoid the death by a
> thousand file descriptors and especially creating one every batch.

I think the problem will be platforms that want full explicit fencing (like
Android) but allow delayed creation of the fence fd from a GL sync object
(like the Android EGL extension allows).
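For reference, the delayed creation on the Android side goes through
EGL_ANDROID_native_fence_sync: the app creates the GL sync object first and
only pulls out a native fence fd when it actually needs one to pass around.
A minimal sketch, with error handling and the required flush before the dup
omitted:

#include <EGL/egl.h>
#include <EGL/eglext.h>

static int fence_fd_from_gl_work(EGLDisplay dpy)
{
	PFNEGLCREATESYNCKHRPROC create_sync =
		(PFNEGLCREATESYNCKHRPROC)eglGetProcAddress("eglCreateSyncKHR");
	PFNEGLDUPNATIVEFENCEFDANDROIDPROC dup_fence_fd =
		(PFNEGLDUPNATIVEFENCEFDANDROIDPROC)
			eglGetProcAddress("eglDupNativeFenceFDANDROID");

	/* Ties a sync object to the GL commands issued so far ... */
	EGLSyncKHR sync = create_sync(dpy, EGL_SYNC_NATIVE_FENCE_ANDROID, NULL);

	/* ... but the fence fd itself is only materialized on demand
	 * (a glFlush may be needed first so the fence command reaches
	 * the driver). */
	return dup_fence_fd(dpy, sync);
}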

I'm not sure yet how best to expose that, since just creating a fence from
the implicit request attached to the batch might upset the interface purists
with the mix of implicit and explicit fencing ;-) Hence I think for now we
should just do the eager fd creation at execbuf until people scream (well,
maybe not merge this patch until people scream ...).
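To make the eager variant concrete, a rough sketch of the userspace side if
execbuf handed back a fence fd directly. The flag name and the use of rsvd2
as the output slot are placeholders for illustration only, not the interface
this RFC actually proposes:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

#define LOCAL_EXEC_FENCE_OUT	(1 << 17)	/* made-up flag */

static int execbuf_with_out_fence(int drm_fd,
				  struct drm_i915_gem_exec_object2 *objs,
				  unsigned int count, uint32_t batch_len)
{
	struct drm_i915_gem_execbuffer2 execbuf;

	memset(&execbuf, 0, sizeof(execbuf));
	execbuf.buffers_ptr = (uintptr_t)objs;
	execbuf.buffer_count = count;
	execbuf.batch_len = batch_len;
	execbuf.flags = I915_EXEC_RENDER | LOCAL_EXEC_FENCE_OUT;

	if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf))
		return -1;

	/* Pretend the kernel returned the fence fd in the upper half of
	 * rsvd2 (again, just an assumption; returning anything this way
	 * needs an ioctl that copies the struct back to userspace). */
	return (int)(execbuf.rsvd2 >> 32);
}

The resulting fd could then be handed to the Android sync framework or
across processes like any other fd and waited on with poll().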

> > For that kind of fine-grained sync between gpu and cpu workloads the
> > solution thus far (at least what I've seen) is just busy-looping.
> > Usually those workloads have a few orders of magnitude more sync points
> > than frames we tend to render, so blocking isn't terribly efficient anyway.
> 
> Heck, I think a full-fledged fence fd per batch is still more overhead
> than I want.

One idea that crossed my mind is to expose the 2nd interrupt source to
userspace somehow (we have pipe_control/mi_flush_dw and
mi_user_interrupt). Then we could use that, maybe with some wakeup
filtering, to allow userspace to block a bit more efficiently.

But my gut feel still says that most likely a bit of busy-looping won't
hurt in such a case with very fine-grained synchronization. There should
be a full blocking kernel request nearby. And often the check is for "busy
or not" only anyway, and that can already be done with seqno writes from
batchbuffers to a per-ctx bo that userspace manages.
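Roughly, that "busy or not" check could look like the sketch below: the
batch stores a monotonically increasing seqno (e.g. via MI_STORE_DWORD_IMM)
into a per-ctx bo that userspace keeps mapped, and userspace just polls that
location. All names here are made up for illustration.

#include <stdbool.h>
#include <stdint.h>

static inline bool batch_section_done(const volatile uint32_t *seqno_map,
				      uint32_t wanted)
{
	/* seqno_map points into the userspace mapping of the per-ctx bo. */
	return (int32_t)(*seqno_map - wanted) >= 0;
}

static void busy_wait(const volatile uint32_t *seqno_map, uint32_t wanted)
{
	/* Fine-grained waits are expected to be short, and a full
	 * blocking kernel request is assumed to be nearby for the
	 * long stalls. */
	while (!batch_section_done(seqno_map, wanted))
		;
}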
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

