Fence, timeline and android sync points

Sat Aug 16 08:30:50 PDT 2014

On Sat, Aug 16, 2014 at 09:01:27AM +0200, Thomas Hellstrom wrote:
> On 08/15/2014 04:52 PM, Jerome Glisse wrote:
> > On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
> >> On 08/14/2014 09:15 PM, Jerome Glisse wrote:
> >>> On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
> >>>> On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse <j.glisse at gmail.com> wrote:
> >>>>> Sucks because you can not do weird synchronization like the one i depicted in another
> >>>>> mail in this thread, and for as long as cmdbuf_ioctl does not give you a fence|syncpt
> >>>>> you can not do such a thing cleanly in a non-hackish way.
> >>>> Actually i915 will soon do that.
> >>> So you will return fence|syncpoint with each cmdbuf_ioctl ?
> >>>
> >>>>> Sucks because you have a fence object per buffer object and thus overhead grows
> >>>>> with the number of objects. Not even mentioning the fence lifetime issue.
> >>>>>
> >>>>> Sucks because sub-buffer allocation is just one of many tricks that can not be
> >>>>> achieved properly and cleanly with implicit sync.
> >>>>>
> >>>>> ...
> >>>> Well I heard all those reasons and I'm well aware of them. The
> >>>> problem is that with current hardware the kernel needs to know for
> >>>> each buffer how long it needs to be kept around since hw just can't do
> >>>> page faulting. Yeah you can pin them but for an uma design that
> >>>> doesn't go down well with folks.
> >>> I am not thinking with fancy hw in mind; on the contrary, i thought about all
> >>> this with the crappiest hw i could think of in mind.
> >>>
> >>> Yes you can get rid of fences and not have to pin memory with current hw.
> >>> What matters for unpinning is to know that all hw blocks are done using the
> >>> memory. This is easily achievable with your beloved seqno. Have one seqno
> >>> per driver (one driver can have different blocks: 3d, video decoding, crtc,
> >>> ...). Each time a buffer is used as part of a command on one block, inc the
> >>> common seqno and tag the buffer with that number. Have each hw block write
> >>> the latest seqno that is done to a per-block location. Now to determine
> >>> if a buffer is done, compare the buffer seqno with the max of all the signaled
> >>> seqnos of all blocks.
> >>>
> >>> Cost: 1 uint32 per buffer and a simple if, without locking, to check the status of
> >>> a buffer.
> >> Hmm?
> >> The trivial and first use of fence objects in the Linux DRM was
> >> triggered by the fact that a 32-bit seqno wraps pretty quickly and a
> >> 32-bit solution just can't be made robust.
> >> Now a 64-bit seqno will probably be robust for the foreseeable future, but
> >> when it comes to implementing that on 32-bit hardware and comparing it to a
> >> simple fence object approach,
> > Using the same kind of arithmetic as is used for jiffies would do it, provided that
> > there is a check that we never let some object pass above a certain age.
> 
> But wouldn't the search-for-max scheme break if blocks complete out of
> order?

It seems I used the wrong word in my email; the pseudo code is correct, however:
it's not max, it's min. By knowing the minimum global seqno across all hw blocks,
you know that at the very least all hw blocks are past that point, which means
the order in which blocks complete is irrelevant. It also means that you might
delay the unbinding more than needed, though the pseudo code shows a way to
alleviate and minimize this delay.
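
To make the min scheme concrete, here is a minimal sketch in C. It is not the
pseudo code referred to above and not code from any driver; the names
(device_sync, block_done, NUM_BLOCKS, buffer_tag, ...) are made up for
illustration. It assumes each hw block writes back the latest global seqno it
has passed, and it uses the jiffies-style signed-difference compare mentioned
earlier in the thread, which only holds while every live seqno stays within
2^31 of the others.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 3  /* hypothetical: e.g. 3d, video decoding, crtc */

/* Hypothetical per-driver state: one shared counter, one slot per hw block. */
struct device_sync {
    uint32_t next_seqno;             /* common seqno, bumped on each use */
    uint32_t block_done[NUM_BLOCKS]; /* latest seqno each block reports done */
};

/*
 * Wraparound-safe "a is at or after b", the same signed-difference trick
 * used for jiffies. Valid only if a and b are less than 2^31 apart, hence
 * the need to never let an object get older than a certain age.
 */
static bool seqno_after_eq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Tag a buffer when it is used in a command: bump and return the seqno. */
static uint32_t buffer_tag(struct device_sync *dev)
{
    return ++dev->next_seqno;
}

/* Oldest completion point across all blocks: every block is past this. */
static uint32_t device_min_done(const struct device_sync *dev)
{
    uint32_t min = dev->block_done[0];

    for (int i = 1; i < NUM_BLOCKS; i++) {
        if (!seqno_after_eq(dev->block_done[i], min))
            min = dev->block_done[i];
    }
    return min;
}

/* A buffer can be unbound once the minimum has reached its tag. */
static bool buffer_is_idle(const struct device_sync *dev, uint32_t buf_seqno)
{
    return seqno_after_eq(device_min_done(dev), buf_seqno);
}
```

Note that in this sketch a block that sits idle never advances its slot and so
holds the minimum back, which is the extra unbinding delay mentioned above;
the sketch does nothing to reduce it.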

It's doable. I had it kind of working a couple of years ago when we were
reworking fences in the radeon driver, but I never got around to finishing
something clean, because implicit sync needs the reservation scheme, which
means you kind of need a fence per buffer (well, getting rid of that is where
things get complicated and painful).

Explicit sync is just so much nicer, but we made the decision long ago to go
with implicit sync, another thing I am deeply regretting now.

Cheers,
Jérôme

> 
> /Thomas
> 
>