[PATCH v3 0/8] drm: Introduce sparse GEM shmem

Christian König christian.koenig at amd.com
Fri Apr 11 12:50:19 UTC 2025


On 11.04.25 at 14:01, Simona Vetter wrote:
> On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
>> On Thu, 10 Apr 2025 14:01:03 -0400
>> Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:
>>
>>>>>> In Panfrost and Lima, we don't have this concept of "incremental
>>>>>> rendering", so when we fail the allocation, we just fail the GPU job
>>>>>> with an unhandled GPU fault.    
>>>>> To be honest I think that this is enough to mark those two drivers as
>>>>> broken.  It's documented that this approach is a no-go for upstream
>>>>> drivers.
>>>>>
>>>>> How widely is that used?  
>>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar
>>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
>>>> freedreno, powervr, ...), because ultimately that's how tilers work:
>>>> the amount of memory needed to store per-tile primitives (and metadata)
>>>> depends on what the geometry pipeline feeds the tiler with, and that
>>>> can't be predicted. If you over-provision, that's memory the system won't
>>>> be able to use while rendering takes place, even though only a small
>>>> portion might actually be used by the GPU. If your allocation is too
>>>> small, it will either trigger a GPU fault (for HW not supporting an
>>>> "incremental rendering" mode) or under-perform (because flushing
>>>> primitives has a huge cost on tilers).  
>>> Yes and no.
>>>
>>> Although we can't allocate more memory for /this/ frame, we know the
>>> required size is probably constant across its lifetime. That gives a
>>> simple heuristic to manage the tiler heap efficiently without
>>> allocations - even fallible ones - in the fence signal path:
>>>
>>> * Start with a small fixed size tiler heap
>>> * Try to render, let incremental rendering kick in when it's too small.
>>> * When cleaning up the job, check if we used incremental rendering.
>>> * If we did - double the size of the heap the next time we submit work.
>>>
>>> The tiler heap still grows dynamically - it just does so over the span
>>> of a couple frames. In practice that means a tiny hit to startup time as
>>> we dynamically figure out the right size, incurring extra flushing at
>>> the start, without needing any "grow-on-page-fault" heroics.
>>>
>>> This should solve the problem completely for CSF/panthor. So it's only
>>> hardware that architecturally cannot do incremental rendering (older
>>> Mali: panfrost/lima) where we need this mess.
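
For illustration, a minimal sketch of that heuristic. All names here are
made up; the point is just that the resize happens at job cleanup, never
in the fence signalling path:

    /* Hypothetical per-context tiler heap state. */
    struct tiler_heap {
            size_t size;            /* current heap size in bytes */
            size_t max_size;        /* hard cap to stop runaway growth */
    };

    /* Called from job cleanup, outside the fence signalling path. */
    static void tiler_heap_update(struct tiler_heap *heap,
                                  bool used_incremental)
    {
            if (used_incremental && heap->size < heap->max_size)
                    heap->size = min(heap->size * 2, heap->max_size);
    }

    /* The next submission then maps heap->size up front, so no fallible
     * allocation is needed while fences are pending.
     */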
>> OTOH, if we need something
>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
>> thing for CSF, since CSF is arguably the sanest of all the HW
>> architectures listed above: allocation can fail/be non-blocking,
>> because there's a fallback to incremental rendering when it fails.
> So here's a really horrible idea to sort this out for panfrost hardware,
> which doesn't have a tiler cache flush as a fallback. It's roughly three
> stages:
>
> 1. A pile of clever tricks to make the chances of running out of memory
> really low. Most of these also make sense for panthor platforms, just as a
> performance optimization.
>
> 2. A terrible way to handle the unavoidable VK_DEVICE_LOST, but one where
> the impact should be minimal. This is nasty, and we really want to avoid
> it for panthor.
>
> 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
>
> 1. Clever tricks
> ----------------
>
> This is a cascade of tricks we can pull in the gpu fault handler:
>
> 1a. Allocate with GFP_NOWAIT (i.e. no direct reclaim). We want this first
>   because it still triggers background reclaim, and that might be enough
>   to get us through and free some easy caches (like clean fs cache and
>   stuff like that which can just be dropped).
>
> 1b Userspace needs to come up with a good guesstimate of how much we'll
>   need. I'm hoping that between render target size and maybe counting the
>   total amount of vertices we can make a decent guesstimate for many
>   workloads. Note that the goal here is not to ensure success, but just
>   to get into the rough ballpark. The actual starting number should aim
>   fairly low, so that we avoid wasting memory, since this is memory
>   wasted on every context (that uses a feature which needs dynamic memory
>   allocation, which I guess for pan* is everything, but for apple it
>   would be more limited).
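
To make 1b concrete, here is a hypothetical userspace-side guesstimate;
every constant is a placeholder that would need per-GPU tuning, and the
real inputs would come from the render pass state:

    /* Hypothetical Mesa-side heuristic, C99, assumes <stdint.h>. */
    static uint64_t guess_heap_size(uint32_t rt_width, uint32_t rt_height,
                                    uint64_t total_vertices)
    {
            /* 32x32 tiles; per-tile header plus per-primitive records. */
            uint64_t tiles = ((uint64_t)(rt_width + 31) / 32) *
                             ((rt_height + 31) / 32);
            uint64_t guess = tiles * 512 + total_vertices * 64;

            /* Aim low on purpose: the kernel-side tricks cover the slack,
             * and over-provisioning is paid for by every context.
             */
            return guess < (1ull << 20) ? (1ull << 20) : guess;
    }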
>
> 1c The kernel then keeps an additional global memory pool. Note this would
>   not have the same semantics as mempool.h, which is aimed at GFP_NOIO
>   forward progress guarantees; it acts more as a preallocation pool. In
>   every CS ioctl we'll make sure the pool is filled, and we probably want
>   to size the pool relative to the context with the biggest dynamic memory
>   usage. This thing probably also needs a shrinker, so we can reclaim the
>   memory once no app with huge buffer needs is running on the gpu anymore.
>
>   Note that we're still not sizing this to guarantee success, but together
>   with the userspace heuristics it should be big enough to almost always
>   work out. And since it's a global reserve we can afford to waste a bit
>   more memory on it. We might also want to scale this pool by the total
>   memory available, like the watermarks core mm computes. We'll only hang
>   onto this memory while the gpu is in active use, so this should be fine.
>
>   Also the preallocation would need to happen without holding the memory
>   pool lock, so that we can use GFP_KERNEL.
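
A rough sketch of what such a preallocation pool could look like; the
struct and helper are hypothetical, not an existing drm API:

    /* Page stash topped up from the CS ioctl; not mempool.h semantics. */
    struct gpu_prealloc_pool {
            struct mutex lock;
            struct list_head pages;   /* spare pages for the fault handler */
            unsigned long count;
            unsigned long target;     /* sized after the hungriest context */
    };

    /* Called from the CS ioctl, where blocking is fine. Pages are
     * allocated outside the pool lock so we can use GFP_KERNEL.
     */
    static int pool_fill(struct gpu_prealloc_pool *pool)
    {
            while (READ_ONCE(pool->count) < pool->target) {
                    struct page *p = alloc_page(GFP_KERNEL);

                    if (!p)
                            return -ENOMEM;

                    mutex_lock(&pool->lock);
                    list_add(&p->lru, &pool->pages);
                    pool->count++;
                    mutex_unlock(&pool->lock);
            }
            return 0;
    }

A shrinker would then walk pool->pages to give the memory back once the
gpu goes idle.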
>
> Up to this point I think it's all tricks that panthor also wants to
> employ.
>
> 1d Next up is scratch dynamic memory. If we can assume that the memory
>   does not need to survive a batchbuffer (hopefully the case with a vulkan
>   render pass) we could steal such memory from other contexts. We could
>   even do that for contexts which are queued but not yet running on the
>   hardware (might need unloading them to be doable with fw renderers like
>   panthor/CSF), as long as we keep such stolen dynamic memory on a
>   separate free list. If we entirely freed it, or released it into the
>   memory pool, we'd make things worse for these other contexts; we need to
>   be able to guarantee that any context can always get all of its stolen
>   dynamic pages back before we start running it on the gpu.
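
The bookkeeping guarantee from 1d could look roughly like this; again,
all names are hypothetical:

    /* Stolen scratch pages stay on a separate per-victim list even while
     * lent out, so handing them back can never fail.
     */
    struct ctx_scratch {
            struct list_head pages;   /* scratch pages this context owns */
            struct list_head stolen;  /* pages currently lent to others */
    };

    /* Must complete before @ctx is scheduled on the gpu again. Since the
     * stolen pages were never freed into the general population, this
     * cannot fail.
     */
    static void ctx_reclaim_stolen(struct ctx_scratch *ctx)
    {
            list_splice_init(&ctx->stolen, &ctx->pages);
    }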
>
> Since the tracking for the memory pool in 1c and the stolen memory in 1d
> has a bunch of tricky corner cases, we probably want a drm helper which
> globally does that book-keeping for all drm/sched instances. That might
> even help if some NN accel thing also needs this on the same SoC.
>
> 1e Finally, if all else fails we can try GFP_ATOMIC. This will eat into
>   the reserve memory pools the core mm maintains. And hopefully we've
>   spent enough time between this step and the initial GFP_NOWAIT attempt
>   that background reclaim had a chance to free some memory, hence why all
>   the memory pool and memory stealing tricks should sit in between.
>
> The above will need quite some tuning to make sure it works as often as
> possible, without wasting undue amounts of memory. It's classic memory
> overcommit tuning.
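
Pulling 1a-1e together, the fault-handler cascade would look something
like this sketch (pool_take(), steal_scratch_page() and struct gpu_ctx
are made-up stand-ins for the pieces above):

    static struct page *fault_alloc_page(struct gpu_ctx *ctx)
    {
            struct page *p;

            /* 1a: no direct reclaim, but kswapd gets kicked. */
            p = alloc_page(GFP_NOWAIT);
            if (p)
                    return p;

            /* 1c: dip into the preallocated global pool. */
            p = pool_take(ctx->pool);
            if (p)
                    return p;

            /* 1d: steal scratch memory from idle contexts. */
            p = steal_scratch_page(ctx);
            if (p)
                    return p;

            /* 1e: last resort, eat into the core mm atomic reserves. */
            return alloc_page(GFP_ATOMIC);
    }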
>
> 2. Device Lost
> --------------
>
> At this point we're left with no other choice than to kill the context.
> And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> does), but it will probably not cope well with an entire storm of these
> just to get the first frame out.
>
> Here comes the horrible trick:
>
> We'll keep rendering the entire frame by just smashing one single reserve
> page (per context) into the pte every time there's a fault. It will result
> in total garbage, and we probably want to shoot the context the moment the
> VS stages have finished, but it allows us to collect an accurate estimate
> of how much memory we'd have needed. We need to pass that to the vulkan
> driver as part of the device lost processing, so that it can keep it as a
> lower bound for the userspace dynamic memory requirement guesstimate going
> forward. Together with the gpu driver memory pool (scaled to that
> requirement) and the core mm watermarks, that should hopefully allow us to
> avoid hitting a device lost again.
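
A sketch of that fault path; gpu_map_page() and the counter are
hypothetical, the point is only that every post-OOM fault is both
satisfied and counted:

    /* After the first unrecoverable OOM, map the same per-context
     * reserve page on every further fault and count them, so we end up
     * with an accurate estimate of the real memory requirement.
     */
    static int fault_after_oom(struct gpu_ctx *ctx, u64 gpu_va)
    {
            ctx->oom_page_count++;
            return gpu_map_page(ctx, gpu_va, ctx->reserve_page);
    }

The resulting count then feeds the device-lost report described above.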
>
> I think if the CS ioctl both takes the current guesstimate from userspace,
> and passes back whatever the current maximum known to the kernel is (we
> need that anyway for the steps in stage 1 to make sense I think), then
> that should also work for dead contexts where the CS ioctl returns -EIO or
> something like that.
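
In uapi terms this could be as simple as a pair of fields in the CS
ioctl struct; drm_xyz_cs is obviously a made-up name:

    struct drm_xyz_cs {
            /* in: userspace's current dynamic memory guesstimate, bytes */
            __u64 heap_hint;
            /* out: maximum dynamic memory usage the kernel has seen so
             * far, bytes; also valid when the ioctl returns -EIO for a
             * dead context
             */
            __u64 heap_high_watermark;
            /* ... the usual CS fields ... */
    };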

+1

It is a bit radical, but as far as I can see that is the way to go.

Additionally, I think a module parameter or something similar to exercise this situation, without having to push the whole kernel into an OOM situation, is a must-have.

Fault injection to exercise fallback paths is simply mandatory for stuff like that.
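
Something as simple as a module parameter short-circuiting the
allocation cascade would do; this sketch reuses the hypothetical
fault_alloc_page() from above, and the kernel's generic fault-injection
framework (CONFIG_FAULT_INJECTION) would be a natural fit as well:

    static bool fail_dynamic_alloc;
    module_param(fail_dynamic_alloc, bool, 0644);
    MODULE_PARM_DESC(fail_dynamic_alloc,
                     "Force dynamic GPU memory allocation failures to test device-lost handling");

    static struct page *fault_alloc_page_checked(struct gpu_ctx *ctx)
    {
            /* Pretend we hit OOM so the fallback paths get exercised. */
            if (unlikely(fail_dynamic_alloc))
                    return NULL;

            return fault_alloc_page(ctx);
    }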

Regards,
Christian.


>
> 3. Mesa quirks
> --------------
>
> A database of the fixed dynamic memory requirements we get from step 2
> (through bug reports), so that we can avoid that mess going forward.
>
> If there are too many of those, we'll probably need to improve the
> guesstimation in 1b, if it's systematically off by too much. We might even
> need to store that on-disk, like shader caches, so that you get a crash
> once and then it should work even without an explicit mesa quirk.
>
> Documentation
> -------------
>
> Assuming this is not too terrible an idea and we reach some consensus, I
> think it'd be good to bake this into a doc patch somewhere in the
> dma_fence documentation. Also including Alyssa's recipe for when you have
> hardware support for flushing partial rendering. And maybe generalized so
> it makes sense for the GS/streamout/... fun on apple hardware.
>
> And perhaps with a section added that points to actual code to help make
> this happen like the sparse bo support in this series.
>
> Cheers, Sima


