[PATCH v3 0/8] drm: Introduce sparse GEM shmem

Boris Brezillon boris.brezillon at collabora.com
Fri Apr 11 12:02:54 UTC 2025


On Fri, 11 Apr 2025 12:55:57 +0200
Christian König <christian.koenig at amd.com> wrote:

> Am 11.04.25 um 10:38 schrieb Boris Brezillon:
> > On Fri, 11 Apr 2025 10:04:07 +0200
> > Christian König <christian.koenig at amd.com> wrote:
> >  
> >> Am 10.04.25 um 20:41 schrieb Boris Brezillon:  
> >>> On Thu, 10 Apr 2025 14:01:03 -0400
> >>> Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:
> >>>    
> >>>>>>> In Panfrost and Lima, we don't have this concept of
> >>>>>>> "incremental rendering", so when we fail the allocation, we
> >>>>>>> just fail the GPU job with an unhandled GPU fault.        
> >>>>>> To be honest I think that this is enough to mark those two
> >>>>>> drivers as broken.  It's documented that this approach is a
> >>>>>> no-go for upstream drivers.
> >>>>>>
> >>>>>> How widely is that used?      
> >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a
> >>>>> similar mechanism was used in other drivers for tiler-based GPUs
> >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's
> >>>>> how tilers work: the amount of memory needed to store per-tile
> >>>>> primitives (and metadata) depends on what the geometry pipeline
> >>>>> feeds the tiler with, and that can't be predicted. If you
> >>>>> over-provision, that's memory the system won't be able to use
> >>>>> while rendering takes place, even though only a small portion
> >>>>> might actually be used by the GPU. If your allocation is too
> >>>>> small, it will either trigger a GPU fault (for HW not supporting
> >>>>> an "incremental rendering" mode) or under-perform (because
> >>>>> flushing primitives has a huge cost on tilers).      
> >>>> Yes and no.
> >>>>
> >>>> Although we can't allocate more memory for /this/ frame, we know
> >>>> the required size is probably roughly constant from one frame to
> >>>> the next. That
> >>>> gives a simple heuristic to manage the tiler heap efficiently
> >>>> without allocations - even fallible ones - in the fence signal
> >>>> path:
> >>>>
> >>>> * Start with a small fixed-size tiler heap
> >>>> * Try to render, let incremental rendering kick in when it's too
> >>>> small.
> >>>> * When cleaning up the job, check if we used incremental
> >>>> rendering.
> >>>> * If we did - double the size of the heap the next time we submit
> >>>> work.
> >>>>
> >>>> The tiler heap still grows dynamically - it just does so over the
> >>>> span of a couple frames. In practice that means a tiny hit to
> >>>> startup time as we dynamically figure out the right size,
> >>>> incurring extra flushing at the start, without needing any
> >>>> "grow-on-page-fault" heroics.
> >>>>
> >>>> This should solve the problem completely for CSF/panthor. So it's
> >>>> only hardware that architecturally cannot do incremental
> >>>> rendering (older Mali: panfrost/lima) where we need this mess.
> >>>>  
> >>> OTOH, if we need something
> >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use
> >>> the same thing for CSF, since CSF is arguably the sanest of all
> >>> the HW architectures listed above: allocation can fail/be
> >>> non-blocking, because there's a fallback to incremental rendering
> >>> when it fails.    
> >> Yeah that is a rather interesting point Alyssa noted here.
> >>
> >> So basically you could as well implement it like this:
> >> 1. Userspace makes a submission.
> >> 2. HW finds the buffer is not large enough, sets an error code and
> >> completes the submission.
> >> 3. Userspace detects the error, re-allocates the buffer with an
> >> increased size.
> >> 4. Userspace re-submits to incrementally complete the submission.
> >> 5. Eventually repeat until fully completed.
> >>
> >> That would work but is likely just not the most performant
> >> solution. So faulting in memory on demand is basically just an
> >> optimization and that is ok as far as I can see.  
> > Yeah, Alyssa's suggestion got me thinking too, and I think I can
> > come up with a plan where we try a non-blocking allocation first;
> > if it fails, we trigger incremental rendering and queue a blocking
> > heap-chunk allocation on a separate workqueue, so that the next
> > time the tiler heap hits an OOM, it has a chunk (or multiple
> > chunks) readily available if the blocking allocation completed in
> > the meantime. That's basically what Alyssa suggested, plus an
> > optimization for when the system is not under memory pressure, and
> > without userspace being involved (so no uAPI changes).
> 
> That sounds like it most likely won't work. In an OOM situation the
> blocking allocation would just add more pressure, when what the
> system actually needs is for your rendering to complete so that
> memory can be freed up.

Right. It could be deferred to the next job submission instead of being
queued immediately. My point is that userspace doesn't have to know
about it: the kernel knows when a tiler OOM happened, and can flag the
heap buffer as "probably needs more space when you get an opportunity
to allocate more". It wouldn't be any different from reporting the OOM
to userspace and letting userspace explicitly grow the buffer, and it
avoids introducing a compatibility gap between old and new Mesa.
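
To make that a bit more concrete, here's a rough sketch of the kind of
thing I have in mind. All the names (struct tiler_heap,
heap_add_chunk_nowait(), heap_grow(), trigger_incremental_render(), ...)
are made up for the sake of the example; this is not the actual
panthor/panfrost code:

#include <linux/types.h>

struct tiler_heap {
	size_t chunk_size;
	unsigned int nr_chunks;
	bool needs_growth;	/* set from the tiler OOM path */
};

/* Hypothetical helpers assumed to exist for the example; all return 0
 * on success. */
int heap_add_chunk_nowait(struct tiler_heap *heap);	/* GFP_NOWAIT-style alloc */
int heap_grow(struct tiler_heap *heap, unsigned int extra_chunks); /* GFP_KERNEL */
void trigger_incremental_render(struct tiler_heap *heap);

/* Tiler OOM path: runs in the fence signalling path, so it must not block. */
static void heap_handle_oom(struct tiler_heap *heap)
{
	if (!heap_add_chunk_nowait(heap))
		return;	/* got a chunk without blocking, keep rendering */

	/*
	 * No memory available right now: flush what we have with
	 * incremental rendering and remember the heap was too small.
	 */
	heap->needs_growth = true;
	trigger_incremental_render(heap);
}

/* Next job submission: blocking allocations are fine here. */
static int heap_prepare_submit(struct tiler_heap *heap)
{
	if (!heap->needs_growth)
		return 0;

	heap->needs_growth = false;

	/* Grow the heap (e.g. double it) so the next run doesn't hit OOM. */
	return heap_grow(heap, heap->nr_chunks);
}

The only thing the OOM path does under memory pressure is set a flag
and fall back to incremental rendering; the actual blocking growth
happens on the next submission, outside the fence signalling path.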

> 
> > I guess this leaves older GPUs that don't support incremental
> > rendering in a bad place though.  
> 
> Well what's the handling there currently? Just crash when you're OOM?

It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.

