[RFC PATCH 00/10] drm/panthor: Add user submission

Matthew Brost matthew.brost at intel.com
Fri Nov 8 22:27:32 UTC 2024


On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
> Apologies for the late reply ...
> 

Also a late reply; I just read this.

> On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
> > Hi Boris,
> > 
> > Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> > > > > > > Please read up here on why that stuff isn't allowed:
> > > > > > > https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> > > > > > panthor doesn't yet have a shrinker, so all memory is pinned, which means
> > > > > > memory management easy mode.
> > > > > Ok, that at least makes things work for the moment.
> > > > Ah, perhaps this should have been spelt out more clearly ;)
> > > > 
> > > > The VM_BIND mechanism that's already in place jumps through some hoops
> > > > to ensure that memory is preallocated when the memory operations are
> > > > enqueued. So any memory required should have been allocated before any
> > > > sync object is returned. We're aware of the issue with memory
> > > > allocations on the signalling path and are trying to ensure that we
> > > > don't have that.
> > > > 
> > > > I'm hoping that we don't need a shrinker which deals with (active) GPU
> > > > memory with our design.
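
For anyone following along, the pattern Steven is describing looks
roughly like the sketch below. Names are hypothetical, not panthor's
actual code: everything a bind op might need is allocated in the
ioctl, and the async (signalling-path) side only consumes those
preallocations.

    #include <linux/gfp.h>
    #include <linux/list.h>
    #include <linux/mm.h>

    struct vm_bind_op {
            struct list_head prealloc_pts;  /* INIT_LIST_HEAD() at init */
    };

    /* ioctl path: allocating with GFP_KERNEL is fine here; the
     * caller frees any partial preallocation on error */
    static int vm_bind_prepare(struct vm_bind_op *op, unsigned int npts)
    {
            while (npts--) {
                    struct page *p = alloc_page(GFP_KERNEL);

                    if (!p)
                            return -ENOMEM;
                    list_add(&p->lru, &op->prealloc_pts);
            }
            return 0;
    }

    /* fence signalling path: only consume, never allocate */
    static struct page *vm_bind_get_pt(struct vm_bind_op *op)
    {
            struct page *p =
                    list_first_entry_or_null(&op->prealloc_pts,
                                             struct page, lru);

            if (p)
                    list_del(&p->lru);
            return p;
    }
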
> > > That's actually what we were planning to do: the panthor shrinker was
> > > going to rely on fences attached to GEM objects to know if it can
> > > reclaim the memory. This design relies on each job attaching its fence
> > > to the GEM mapped to the VM at the time the job is submitted, such that
> > > memory that's in-use or about-to-be-used doesn't vanish before the GPU
> > > is done.
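
The attach side of that design would be something like the sketch
below (hypothetical helpers, not actual panthor code), with the
shrinker then using dma_resv_test_signaled() to tell in-use BOs apart:

    #include <linux/dma-resv.h>
    #include <drm/drm_gem.h>

    /* At submit, attach the job's done-fence to every BO mapped in
     * the VM so the shrinker knows the memory is (about to be) in
     * use.  Resvs must be held, e.g. via drm_exec. */
    static int job_attach_fences(struct drm_gem_object **bos, int count,
                                 struct dma_fence *fence)
    {
            for (int i = 0; i < count; i++) {
                    int ret;

                    dma_resv_assert_held(bos[i]->resv);
                    ret = dma_resv_reserve_fences(bos[i]->resv, 1);
                    if (ret)
                            return ret;
                    dma_resv_add_fence(bos[i]->resv, fence,
                                       DMA_RESV_USAGE_BOOKKEEP);
            }
            return 0;
    }

    /* shrinker side: only reclaim once all attached fences signaled */
    static bool bo_can_reclaim(struct drm_gem_object *bo)
    {
            return dma_resv_test_signaled(bo->resv,
                                          DMA_RESV_USAGE_BOOKKEEP);
    }
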
> > 
> > Yeah, and exactly that doesn't work any more when you are using user queues,
> > because the kernel has no opportunity to attach a fence for each submission.
> > 
> > > > Memory which user space thinks the GPU might
> > > > need should be pinned before the GPU work is submitted. APIs which
> > > > require any form of 'paging in' of data would need to be implemented by
> > > > the GPU work completing and being resubmitted by user space after the
> > > > memory changes (i.e. there could be a DMA fence pending on the GPU work).
> > > Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> > > that means we can't really transparently swap out GPU memory, or we
> > > have to constantly pin/unpin around each job, which means even more
> > > ioctl()s than we have now. Another option would be to add the XGS fence
> > > to the BOs attached to the VM, assuming it's created before the job
> > > submission itself, but you're no longer reducing the number of user <->
> > > kernel round trips if you do that, because you now have to create an
> > > XGS job for each submission, so you basically get back to one ioctl()
> > > per submission.
> > 
> > For AMDGPU we are currently working on the following solution with memory
> > management and user queues:
> > 
> > 1. User queues are created through a kernel IOCTL; submissions work by
> > writing into a ring buffer and ringing a doorbell.
> > 
> > 2. Each queue can request the kernel to create fences for the currently
> > pushed work on a queue, which can then be attached to BOs, syncobjs,
> > syncfiles, etc.
> > 
> > 3. In addition to that we have an eviction/preemption fence attached to all
> > BOs, page tables, and whatever other resources we need.
> > 
> > 4. When these eviction fences are requested to signal, they first wait for
> > all submission fences, then suspend the user queues and block the creation
> > of new submission fences until the queues are restarted again.
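
For reference, step 4 maps pretty naturally onto dma_fence ops. A
rough sketch of how I read it, with hypothetical driver types (the
wait has to be punted to a worker since enable_signaling() can't
block):

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>

    struct user_queue {                     /* hypothetical */
            bool fences_blocked;
            struct dma_fence *last_submission_fence;
    };

    void user_queue_suspend(struct user_queue *q);  /* hypothetical */

    struct evict_fence {
            struct dma_fence base;
            struct work_struct suspend_work; /* INIT_WORK()ed at creation */
            struct user_queue *queue;
    };

    static bool evict_fence_enable_signaling(struct dma_fence *f)
    {
            struct evict_fence *ef =
                    container_of(f, struct evict_fence, base);

            /* Block creation of new submission fences (step 2). */
            WRITE_ONCE(ef->queue->fences_blocked, true);

            queue_work(system_unbound_wq, &ef->suspend_work);
            return true;
    }

    static void evict_fence_suspend_work(struct work_struct *w)
    {
            struct evict_fence *ef =
                    container_of(w, struct evict_fence, suspend_work);

            /* Wait for all current submission fences ... */
            dma_fence_wait(ef->queue->last_submission_fence, false);
            /* ... then preempt the HW queue and signal eviction. */
            user_queue_suspend(ef->queue);
            dma_fence_signal(&ef->base);
    }
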
> 
> Yup this works, at least when I play it out in my head.
> 

I just started experimenting with user submission in Xe last week and
ended up landing on a different PoC, blissfully unaware of future fences
and the Mesa submit thread. However, after Sima filled me in, I've
essentially landed on exactly what Christian is describing, but for Xe.
I haven't coded it yet, but I have the design in my head.

I also generally agree with Sima's comments about having a somewhat
generic preempt fence (Christian refers to this as an eviction fence).

Additionally, I'm thinking it might be beneficial for us to add a new
'preempt' dma-resv slot to track these, which would make it easier to
enforce that submission fences signal before preempt fences.
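
Concretely, I'm imagining something like the below (hypothetical, not
an upstream API). Because a query for a given dma-resv usage also
returns all "more important" (lower-numbered) usages, placing PREEMPT
last means any wait that includes preempt fences implicitly waits on
every submission fence too:

    /* hypothetical extension of enum dma_resv_usage */
    enum dma_resv_usage {
            DMA_RESV_USAGE_KERNEL,
            DMA_RESV_USAGE_WRITE,
            DMA_RESV_USAGE_READ,
            DMA_RESV_USAGE_BOOKKEEP,
            DMA_RESV_USAGE_PREEMPT, /* new: preempt/eviction fences */
    };

    /* drivers would then attach their preempt fence with: */
    dma_resv_add_fence(bo->resv, preempt_fence, DMA_RESV_USAGE_PREEMPT);
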

Depending on bandwidth, I may post an RFC to the list soon. I'll also
gauge our Mesa team's interest and bandwidth to begin the UMD work.

Matt

> Note that the completion fence is only deadlock-free if userspace is
> really, really careful. Which in practice means you need the very
> carefully constructed rules for e.g. vulkan winsys fences, otherwise you
> do indeed deadlock.
> 
> But if you keep that promise in mind, then it works, and step 2 is
> entirely optional, which means we can start userspace in a pure
> long-running compute mode where there are only eviction/preemption
> fences. And then if
> userspace needs a vulkan winsys fence, we can create that with step 2 as
> needed.
> 
> But the important part is that you need really strict rules on userspace
> for when step 2 is ok and won't result in deadlocks. And those rules are
> uapi, which is why I think doing this in panthor without the shrinker and
> eviction fences (i.e. steps 3&4 above) is a very bad mistake.
> 
> > This way you can still do your memory management inside the kernel (e.g.
> > move BOs from local to system memory) or even completely suspend and resume
> > applications without their interaction, but as Sima said it is just
> > horribly complicated to get right.
> > 
> > We have been working on this for like two years now, and it could still be
> > that we missed something, since it is not in production testing yet.
> 
> Ack.
> -Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

