[RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl

Boris Brezillon boris.brezillon at collabora.com
Thu Mar 11 17:24:58 UTC 2021


Hi Jason,

On Thu, 11 Mar 2021 10:58:46 -0600
Jason Ekstrand <jason at jlekstrand.net> wrote:

> Hi all,
> 
> Dropping in where I may or may not be wanted, so feel free to ignore. :-)

I'm glad you decided to chime in. :-)

 
> > > > 2/ Queued jobs might be executed out-of-order (unless they have
> > > >     explicit/implicit deps between them), and Vulkan asks that the out
> > > >     fence be signaled when all jobs are done. Timeline syncobjs are a
> > > >     good match for that use case. All we need to do is pass the same
> > > >     fence syncobj to all jobs being attached to a single QueueSubmit
> > > >     request, but a different point on the timeline. The syncobj
> > > >     timeline wait does the rest and guarantees that we've reached a
> > > >     given timeline point (IOW, all jobs before that point are done)
> > > >     before declaring the fence as signaled.
> > > >     One alternative would be to have dummy 'synchronization' jobs that
> > > >     don't actually execute anything on the GPU but declare a dependency
> > > >     on all other jobs that are part of the QueueSubmit request, and
> > > >     signal the out fence (the scheduler would do most of the work for
> > > >     us, all we have to do is support NULL job heads and signal the
> > > >     fence directly when that happens instead of queueing the job).  
> > >
> > > I have to admit to being rather hazy on the details of timeline
> > > syncobjs, but I thought there was a requirement that the timeline moves
> > > monotonically. I.e. if you have multiple jobs signalling the same
> > > syncobj just with different points, then AFAIU the API requires that the
> > > points are triggered in order.  
> >
> > I only started looking at the SYNCOBJ_TIMELINE API a few days ago, so I
> > might be wrong, but my understanding is that queuing fences (addition
> > of new points in the timeline) should happen in order, but signaling
> > can happen in any order. When checking for a signaled fence the
> > fence-chain logic starts from the last point (or from an explicit point
> > if you use the timeline wait flavor) and goes backward, stopping at the
> > first un-signaled node. If I'm correct, that means that fences that
> > are part of a chain can be signaled in any order.  
> 
> You don't even need a timeline for this.  Just have a single syncobj
> per-queue and make each submit wait on it and then signal it.
> Alternatively, you can just always hang on to the out-fence from the
> previous submit and make the next one wait on that.

That's what I have right now, but it forces the serialization of all
jobs that are pushed during a submit (and there can be more than one
per command buffer on panfrost :-/). Maybe I'm wrong, but I thought it'd
be better to not force this serialization if we can avoid it.
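
For the record, here is roughly what that single-syncobj scheme looks
like with the existing submit ioctl (just a sketch, not the actual code
from the series; error handling omitted, and queue_sync is a plain
binary syncobj created once per queue):

#include <stdint.h>
#include <xf86drm.h>
#include "panfrost_drm.h"   /* kernel uapi header */

/*
 * Every job waits on queue_sync (i.e. on the previous job's completion)
 * and then re-uses it as its out fence, which is exactly what forces
 * the full serialization mentioned above.
 */
static int submit_serialized(int fd, uint32_t queue_sync, uint64_t jc,
                             uint32_t *bo_handles, uint32_t bo_count)
{
        uint32_t in_syncs[] = { queue_sync };
        struct drm_panfrost_submit submit = {
                .jc = jc,
                .in_syncs = (uintptr_t)in_syncs,
                .in_sync_count = 1,
                .out_sync = queue_sync,
                .bo_handles = (uintptr_t)bo_handles,
                .bo_handle_count = bo_count,
        };

        return drmIoctl(fd, DRM_IOCTL_PANFROST_SUBMIT, &submit);
}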

> Timelines are overkill here, IMO.

Mind developing why you think this is overkill? After looking at the
kernel implementation I thought using timeline syncobjs would be
pretty cheap compared to the other options I considered.
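
To make the comparison a bit more concrete, this is roughly the scheme
I had in mind. The per-job (out_sync, out_sync_point) pair is the
hypothetical part the new ioctl would have to carry; the wait side is
existing libdrm API:

#include <stdint.h>
#include <xf86drm.h>

/* Hypothetical job descriptor: the only addition compared to the
 * current uapi is the (out_sync, out_sync_point) pair.
 */
struct job_desc {
        uint64_t jc;
        uint32_t out_sync;              /* timeline syncobj handle */
        uint64_t out_sync_point;        /* point signaled when the job is done */
};

/* All jobs of a QueueSubmit signal the same timeline, each at its own
 * point, so no artificial deps between the jobs are needed.
 */
static void tag_jobs(struct job_desc *jobs, unsigned int count,
                     uint32_t timeline, uint64_t *next_point)
{
        for (unsigned int i = 0; i < count; i++) {
                jobs[i].out_sync = timeline;
                jobs[i].out_sync_point = ++(*next_point);
        }
}

/* "All jobs of this submit are done" == the timeline reached the last
 * point handed out above.
 */
static int wait_submit(int fd, uint32_t timeline, uint64_t last_point)
{
        return drmSyncobjTimelineWait(fd, &timeline, &last_point, 1,
                                      INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}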

> 
> > Note that I also considered using a sync file, which has the ability to
> > merge fences, but that required 3 extra ioctls for each syncobj to merge
> > (for the export/merge/import round trip), and AFAICT, fences stay around
> > until the sync file is destroyed, which forces some garbage collection
> > if we want to keep the number of active fences low. One nice thing
> > about the fence-chain/syncobj-timeline logic is that signaled fences
> > are collected automatically when walking the chain.  
> 
> Yeah, that's the pain when working with sync files.  This is one of
> the reasons why our driver takes an arbitrary number of in/out
> syncobjs.
> 
> > > So I'm not sure that you've actually fixed this point - you either need
> > > to force an order (in which case the last job can signal the Vulkan
> > > fence)  
> >
> > That option requires adding deps that do not really exist on the last
> > jobs, so I'm not sure I like it.
> >  
> > > or you still need a dummy job to do the many-to-one dependency.  
> >
> > Yeah, that's what I've considered doing before realizing timeline
> > syncobjs could solve this problem (assuming I got it right :-/).
> >  
> > >
> > > Or I may have completely misunderstood timeline syncobjs - definitely a
> > > possibility :)  
> >
> > I wouldn't pretend I got it right either :-).
> >  
> > >  
> > > > 3/ The current implementation lacks information about BO access,
> > > >     so we serialize all jobs accessing the same set of BOs, even
> > > >     if those jobs might just be reading from them (which can
> > > >     happen concurrently). Other drivers pass an access type to the
> > > >     list of referenced BOs to address that. Another option would be
> > > >     to disable implicit deps (deps based on BOs) and force the driver
> > > >     to pass all deps explicitly (interestingly, some drivers have
> > > >     both the no-implicit-dep and r/w flags, probably to support
> > > >     sub-resource access, so we might want to add that one too).
> > > >     I don't see any userspace workaround to that problem, so that one
> > > >     alone would justify extending the existing ioctl or adding a new
> > > >     one.  
> > >
> > > Yeah - I think we need this. My only comment is that I think the
> > > read/write terminology may come back to bite. Better to use 'shared' and
> > > 'exclusive' - which better matches the dma_resv_xxx APIs anyway.
> > >
> > > Also the current code completely ignores PANFROST_BO_REF_READ. So either
> > > that should be defined as 0, or even better we support 3 modes:
> > >
> > >   * Exclusive ('write' access)
> > >   * Shared ('read' access)
> > >   * No fence - ensures the BO is mapped, but doesn't add any implicit
> > > fences.
> > >
> > > The last may make sense when doing explicit fences and e.g. doing
> > > front-buffer rendering with a display driver which does implicit fencing.  
> 
> This one's really annoying.  TBH, we've still not gotten it right on
> Intel, AFAICT.  That is roughly the set of modes you need but you'll
> have to watch out for window-system buffers.  RADV and ANV take
> slightly different approaches here and they each have their own
> problems.  On the balance, I'd look at what RADV is doing rather than
> ANV because ANV's results in some over-synchronization every time you
> vkWaitForFences on the WSI fence.  I've got a patch floating round
> somewhere that adds some new kernel API to make that case a bit better
> but it's a snarly mess.  Sorry for being cryptic but it's a 5-page
> e-mail if I type out all the annoying details. (-:

Ok, I'll have a look at the RADV driver.
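
And regarding the exclusive/shared/no-fence split suggested earlier in
the thread, I agree it maps better to the dma_resv semantics. Roughly
what I'd picture on the uapi side (names purely illustrative, none of
this is in the series yet):

#include <linux/types.h>

/* Illustrative only: three access modes for each referenced BO,
 * matching the shared/exclusive dma_resv terminology, plus a
 * "mapped but no implicit fence" mode for explicit-sync users.
 */
#define PANFROST_BO_REF_EXCLUSIVE       (1 << 0)  /* adds an exclusive fence */
#define PANFROST_BO_REF_SHARED          (1 << 1)  /* adds a shared fence */
#define PANFROST_BO_REF_NO_FENCE        (1 << 2)  /* mapped, no implicit fencing */

struct drm_panfrost_bo_ref {
        __u32 handle;
        __u32 flags;    /* exactly one of the PANFROST_BO_REF_* modes above */
};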

Thanks for your feedback.

Boris

