[PATCH v3 0/8] drm: Introduce sparse GEM shmem
Boris Brezillon
boris.brezillon at collabora.com
Tue Apr 15 09:47:47 UTC 2025
On Mon, 14 Apr 2025 16:34:35 +0100
Steven Price <steven.price at arm.com> wrote:
> On 14/04/2025 13:47, Boris Brezillon wrote:
> > On Fri, 11 Apr 2025 16:39:02 +0200
> > Boris Brezillon <boris.brezillon at collabora.com> wrote:
> >
> >> On Fri, 11 Apr 2025 15:13:26 +0200
> >> Christian König <christian.koenig at amd.com> wrote:
> >>
> >>>>
> >>>>> Background is that you don't get a crash, nor error message, nor
> >>>>> anything indicating what is happening.
> >>>> The job times out at some point, but we might get stuck in the fault
> >>>> handler waiting for memory, which is pretty close to a deadlock, I
> >>>> suspect.
> >>>
> >>> I don't know those drivers that well, but at least for amdgpu the
> >>> problem would be that the timeout handling would need to grab some of
> >>> the locks the memory management is holding waiting for the timeout
> >>> handling to do something....
> >>>
> >>> So that basically closes the circle perfectly. With a bit of luck
> >>> you get a message after some time that the kernel is stuck, but
> >>> since those are all sleeping locks I strongly doubt it.
> >>>
> >>> As an immediate action, please provide patches which change those
> >>> GFP_KERNEL allocations into GFP_NOWAIT.
> >>
> >> Sure, I can do that.
> >
> > Hm, I might have been too quick to claim this was doable. In
> > practice, doing that might regress Lima and Panfrost in situations
> > where trying harder than GFP_NOWAIT would free up some memory. Not
> > saying it was right to use GFP_KERNEL in the first place, but some
> > expectations were set by this original mistake, so I'll probably need
> > Lima developers to vouch for this change after they've done some
> > testing on a system under high memory pressure, and I'd need to do the
> > same kind of testing for Panfrost and ask Steve if he's okay with that
> > too.
>
> It's a tricky one. The ideal would be to teach user space how to recover
> from the memory allocation failure (even on older GPUs it's still in
> theory possible to break up the work and do incremental rendering). But
> that ship sailed long ago so this would be a regression.
>
> I did consider GFP_ATOMIC as an option, but really we shouldn't be
> accessing "memory reserves" in such a situation.
>
> For background: In the "kbase" driver (Linux kernel driver for the
> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
> "JIT memory allocator". The idea here was that if you have multiple
> renderings in flight you can attempt to share the tiler heap memory
> between them. So in this case when we can't allocate more memory and we
> know there's another tiler heap which is going to be freed by a fragment
> job that's already running, we can block knowing the memory is going to
> become available.
>
> It was designed to do the same thing as CSF's incremental rendering -
> allow us to opportunistically allocate memory but not fail the rendering
> if it wasn't available.
>
> But it was a nightmare to have any confidence that it was deadlock free
> and the implementation was frankly terrible - which is ultimately why
> CSF GPUs have this ability in hardware to perform incremental rendering
> without failing the job. But effectively this approach requires
> allocating just enough memory for one complete tiler heap to ensure
> forward progress, while opportunistically allowing extra memory to give
> a performance boost.
>
> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
> some testing with memory pressure. It might be acceptable (and an
> occasional job failure is better than an occasional lock up). If it
> turns out it's too easy to trigger job failures like this then we'll
> need to rethink.
I thought about this incremental-rendering-on-JM thing over the past
few days, and I'd like to run one idea by you if you don't mind.
What if we tried to implement incremental rendering the way it's done
for CSF? That is, when we get a fault on a tiler heap object, we would
(rough code sketch further below):
1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
faulty AS), such that any primitive data that have been queued so far
get pushed to memory
2. use a different AS for an IR (Incremental Rendering) fragment job that
has been provided by the UMD at submission time, just like the
CSF backend of the UMD registers an exception handler for tiler OOMs
at the moment.
Make it so this IR fragment job chain is queued immediately after the
currently running fragment job (if any). This IR fragment job should
consume all the primitives written so far and push the result to a
framebuffer. There's a bit of patching to do on the final
fragment job chain, because the FB preload setup will differ if
IR took place, but that should be doable with simple WRITE_VALUE jobs
turning NULL jobs into FRAGMENT jobs (or the other way around)
such that they get skipped/activated depending on whether IR took
place or not.
3. collect pages covering consumed primitives and make the heap range
covering those consumed pages fault on further accesses (basically
iommu->unmap() the section we collected pages from). We probably
need to keep a couple of pages around the top/bottom tiler heap
pointers, because some partially written primitives might cross
a heap chunk boundary, but assuming our heap has more than 4 chunks
available (which we can ensure at tiler heap allocation time), we
should be good.
4. use one of the collected pages to satisfy the growing request, and
acknowledge the fault on the faulty AS. We can pick from the pool
of collected pages to satisfy new growing requests until it's
depleted again.
5. repeat steps 1-4 until all primitives are flushed.
6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
heap BO range, and collect all the pages that were previously
allocated to the heap such that the next allocation can be satisfied
immediately.
Of course, this only works if primitives are added to the list after
they have been fully written (that is, all data for the primitive has
been written, and the primitive can be consumed by the fragment job
without hazards).
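
To make the ordering a bit more concrete, here's a rough sketch of what
the JM tiler-heap OOM path could look like. All the panfrost_* helpers
and the panfrost_tiler_heap struct below are hypothetical - they only
exist to give names to steps 1-4 - so don't read this as existing or
proposed Panfrost API, just as an illustration of the flow:

/*
 * Hypothetical sketch of the JM tiler-heap OOM path described above.
 * None of these helpers exist in Panfrost today; they just name the
 * steps so the ordering is explicit.
 */
static int panfrost_tiler_heap_oom(struct panfrost_device *pfdev,
				   struct panfrost_tiler_heap *heap,
				   u32 faulty_as, u64 fault_addr)
{
	struct page *page;

	/* Step 1: FLUSH_MEM on the faulty AS so every primitive queued so
	 * far is visible to the IR fragment job. */
	panfrost_as_flush_mem(pfdev, faulty_as, heap->va, heap->size);

	/* Step 2: queue the UMD-provided IR fragment job chain on another
	 * AS, right after the currently running fragment job (if any), and
	 * wait for it to consume the primitives written so far. */
	panfrost_queue_ir_fragment(pfdev, heap->ir_chain);
	panfrost_wait_ir_fragment(pfdev, heap->ir_chain);

	/* Step 3: unmap the heap range whose primitives were consumed and
	 * collect the backing pages, keeping a few pages around the
	 * top/bottom heap pointers since partially written primitives may
	 * cross a chunk boundary. */
	panfrost_heap_reclaim_consumed(heap);

	/* Step 4: satisfy the growing request from the reclaimed pool and
	 * acknowledge the fault so the tiler can make progress again. */
	page = panfrost_heap_get_reclaimed_page(heap);
	if (!page)
		return -ENOMEM; /* shouldn't happen with >4 chunks */

	panfrost_mmu_map_page(pfdev, faulty_as, fault_addr, page);
	panfrost_mmu_ack_fault(pfdev, faulty_as);

	/* Steps 5-6 (repeat on further faults, then reset the whole heap
	 * mapping once rendering completes) are driven by subsequent
	 * faults and job completion, and are omitted here. */
	return 0;
}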
Just to be clear, this is a long-term plan to try to fix this on JM
HW, and given the changes it involves (the UMD needs to be taught about
IR on JM), we'll still need the GFP_NOWAIT fix for the time being.
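
For reference, the interim fix should boil down to something like the
snippet below in the fault path that allocates heap pages. The exact
call site and surrounding error handling differ between Lima and
Panfrost (mapping/page_offset here stand in for whatever the driver has
at hand), so treat this as illustrative only:

	/*
	 * Illustrative only: grab heap pages without blocking so the
	 * fault handler can never stall in reclaim waiting for memory
	 * that a stuck job might be holding. On failure the job gets
	 * faulted/failed instead of deadlocking.
	 */
	page = shmem_read_mapping_page_gfp(mapping, page_offset,
					   GFP_NOWAIT | __GFP_NOWARN);
	if (IS_ERR(page))
		return PTR_ERR(page); /* fail the job, don't wait */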