[PATCH v3 0/8] drm: Introduce sparse GEM shmem

Fri Apr 11 08:08:29 UTC 2025

Hi Christian,

On Thu, 10 Apr 2025 20:52:27 +0200
Christian König <christian.koenig at amd.com> wrote:

> Am 10.04.25 um 19:20 schrieb Boris Brezillon:
> > [SNIP]  
> >>>> Faulting is only legal when you have something like HMM, SVM or
> >>>> whatever you call it. And then you can just use a plain shmem
> >>>> object to provide you with backing pages.
> >>>>
> >>>> I mean we could in theory allow faulting on GEM objects as well,
> >>>> but we would need to take very strict precautions on that we
> >>>> currently don't have as far as I know.    
> >>> We only use this mechanism for very specific allocations: tiler
> >>> memory whose maximum size can't be guessed upfront because tile
> >>> binning is by nature unpredictible.
> >>>    
> >>>> So could you explain how this works in the first place?    
> >>> I can explain you how this works in Panthor, yes. You get an
> >>> initial amount of memory that the tiler can use, when it runs out
> >>> of memory, it will first ask the system for more memory, if the
> >>> allocation fails, it will fallback to what they call "incremental
> >>> rendering", where the already binned primitives are flushed to the
> >>> FB in order to free memory, and the rendering starts over from
> >>> there, with the memory that has been freed.
> >>>
> >>> In Panthor, this on-demand allocation scheme is something that
> >>> allows us to speed-up rendering when there's memory available, but
> >>> we can make progress when that's not the case, hence the failable
> >>> allocation scheme I'm proposing here.    
> >> Puh, that sounds like it can potentially work but is also very very
> >> fragile.
> >>
> >> You must have a code comment when you select the GFP flags how and
> >> why that works.  
> > +	/* We want non-blocking allocations, if we're OOM, we just
> > fail the job
> > +	 * to unblock things.
> > +	 */
> > +	ret = drm_gem_shmem_sparse_populate_range(&bo->base,
> > page_offset,
> > +						  NUM_FAULT_PAGES,
> > page_gfp,
> > +						  __GFP_NORETRY |
> > __GFP_NOWARN);
> >
> > That's what I have right now in the failable allocation path. The
> > tiler chunk allocation uses the same flags, but doesn't have a
> > comment explaining that a fallback exists when the allocation
> > fails.  
> 
> We agreed to use GFP_NOWAIT for such types of allocations to at least
> wake up kswapd on the low water mark.

Yeah, I wasn't sure about the flags to be honest. Hence the ping and
the very reason Sima asked me to Cc a few more people to look at those
changes.

> 
> IIRC even using __GFP_NORETRY here was illegal, but I need to double
> check the discussions again.

FWIW, it's been taken from i915 [1]

> 
> From the comment this at minimum needs an explanation what influence
> this has on the submission and that you can still guarantee fence
> forward progress.

Agreed.

> >>> And that's how it is today, the
> >>> alloc-on-fault mechanism is being used in at least 3 drivers, and
> >>> all I'm trying to do here is standardize it and try to document
> >>> the constraints that comes with this model, constraint that are
> >>> currently being ignored. Like the fact allocations in the fault
> >>> handler path shouldn't block so we're guaranteed to signal the job
> >>> fence in finite time, and we don't risk a deadlock between the
> >>> driver shrinker and the job triggering the fault.    
> >> Well on one hand it's good to that we document the pitfalls, but I
> >> clearly think that we should *not* spread that into common code.  
> > Ack on not encouraging people to use that; but having a clear path
> > describing how this should be done for HW that don't have other
> > options, with helpers and extensive doc is IMHO better than letting
> > new drivers reproduce the mistake that were done in the past.
> > Because, if you tell people "don't do that", but don't have an
> > alternative to propose, they'll end up doing it anyway.  
> 
> Just to be clear: We have already completely rejected code from going
> upstream because of that!

Apparently panfrost/panthor/lima didn't get enough scrutiny...

> 
> And we are not talking about anything small, but rather a full blown
> framework and every developed by a major player.
> 
> Additional to that both i915 and amdgpu had code which used this
> approach as well and we reverted back because it simply doesn't work
> reliable.

BTW, the sparse GEM-SHMEM code in itself doesn't necessarily imply
non-blocking allocation in the fault handler, though admittedly it's
been designed to allow it.

> 
> >> That would give the impression that this actually works and to be
> >> honest I should start to charge money to rejecting stuff like that.
> >> It would probably get me rich.
> >>  
> >>> I'm well aware of the implications of what I'm proposing here, but
> >>> ignoring the fact some drivers already violate the rules don't
> >>> make them disappear. So I'd rather work with you and others to
> >>> clearly state what the alloc-in-fault-path rules are, and enforce
> >>> them in some helper functions, than pretend these ugly things
> >>> don't exist. :D    
> >> Yeah, but this kind of says to others it's ok to do this which
> >> clearly isn't as far as I can see.  
> > Not really. At least not if we properly review any driver that use
> > these APIs, and clearly state in the doc that this is provided as a
> > last resort for HW that can't do without on-fault-allocation,
> > because they are designed to work this way.
> >  
> >> What we should do instead is to add this as not ok approaches to
> >> the documentation and move on.  
> > Well, that's what happened with panfrost, lima and panthor, and see
> > where we are now: 3 drivers that you consider broken (probably
> > rightfully), and more to come if we don't come up with a plan for HW
> > that have the same requirements (as I said, I wouldn't be surprised
> > if most tilers were relying on the same kind of
> > on-demand-allocation).  
> 
> Well we have HW features in major drivers which we simply don't use
> because of that.

Yeah, but in that case that's not really a HW feature we can switch
off/avoid using. It's at the core of how Mali GPUs work (at least
anything prior to gen10). And forcing all apps to fully allocate this
128MB buffer upfront (which might seem small for discrete GPUs, but
really isn't for systems where the memory can be as low as 1GB, and
shared with the rest of the system) isn't really an option. The only
alternative I see is finding a way to emulate incremental rendering on
older Mali GPUs, assuming that's even possible.

> 
> > As always, I appreciate your valuable inputs, and would be happy to
> > try anything you think might be more adapted than what is proposed
> > here, but saying "this is broken HW/driver, so let's ignore it" is
> > not really an option, at least if you don't want the bad design
> > pattern to spread.  
> 
> Yeah, I'm not sure what to do either. We *know* from experience that
> this approach will fail.

Can you, or someone who was involved in those attempts, sum-up the
problem that were attempted to be solved, and the walls you hit with
this non-blocking/failable allocation scheme (fell free to link me to
previous discussion so you don't have to type it again :-)).

FWIW, the problem I see with the current panfrost/panthor
on-fault-allocation scheme is the fact allocating GEMs with GFP_KERNEL
leads to unbounded waits, which might cause a job timeout. The dealock
between the panfrost shrinker and the job requesting memory doesn't
exist AFAICT, because panfrost can only reclaim objects that were
flagged as unused (MADVISE(DONTNEED)), and the allocation is done
without any of the locks that are taken by the panfrost shrinker held.
So yeah, jobs can fail because the system ran out of memory, the job
fence will be signalled with an error, and the system will just move
on. It's already what happens today, and userspace either ignores it
(because on older Mali GPUs, a fault on a job is just transient, and
doesn't prevent other jobs from executing) or knows how to recover from
it (on newer GPUs, the FW scheduling-context needs to be destroyed
re-created).

Just to be clear, I'm not saying this is good, I'm just describing what
is done today, so you get the whole picture.

Now, with panthor being in the process of adopting a transparent
reclaim mechanism (swapout on reclaim if the GPU context is idle,
swapin on next job), the deadlock between the panthor/gem-shmem
shrinker and the allocation in the fault path will probably surface, so
I do get why no-alloc-in-fault path, or at the very least,
non-blocking-alloc-in-fault-path is needed. And I really thought
non-blocking allocations would be allowed in that case. So please bare
with me, and tell me where my mistake is, because every time I think I
got it, I look at it again, and don't see the problem.

Regards,

Boris

[1]https://elixir.bootlin.com/linux/v6.13.7/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L96