[PATCH v3 0/8] drm: Introduce sparse GEM shmem

Christian König christian.koenig at amd.com
Thu Apr 10 16:43:29 UTC 2025


Hi Boris,

Am 10.04.25 um 17:53 schrieb Boris Brezillon:
> Hi Christian,
>
> On Thu, 10 Apr 2025 17:05:07 +0200
> Christian König <christian.koenig at amd.com> wrote:
>
>> Hi Boris,
>>
>> thanks for looping me in. Can you also send the full patch set to me? I don't see it on the mailing list (maybe it just hasn't arrived yet).
>>
>> Am 10.04.25 um 16:48 schrieb Boris Brezillon:
>>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
>>> Cc you on v4, but before that, I'd like to get your opinion on the
>>> drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
>>> desirable or if I should go back to the drawing board.
>>>
>>> On Fri,  4 Apr 2025 11:26:26 +0200
>>> Boris Brezillon <boris.brezillon at collabora.com> wrote:
>>>  
>>>> This patch series is a proposal for implementing sparse page allocations
>>>> for shmem objects. It was initially motivated by a kind of BO managed by
>>>> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
>>>> demand every time the GPU faults on a virtual address within the heap BO
>>>> range.  
>> Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible.
>>
>> So filling in GEM objects on demand, as you suggest here, is usually seen as illegal. All resources must be pre-pinned before submission.
> Unfortunately, that's already how it's done in lima, panfrost and
> panthor.
>
>> Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide the backing pages.
>>
>> I mean we could in theory allow faulting on GEM objects as well, but we would need very strict precautions that we currently don't have, as far as I know.
> We only use this mechanism for very specific allocations: tiler
> memory, whose maximum size can't be guessed upfront because tile
> binning is by nature unpredictable.
>
>> So could you explain how this works in the first place?
> I can explain how this works in Panthor, yes. You get an initial
> amount of memory that the tiler can use. When it runs out of memory,
> it first asks the system for more; if that allocation fails, it falls
> back to what they call "incremental rendering", where the already
> binned primitives are flushed to the FB in order to free memory, and
> rendering starts over from there with the memory that has been freed.
>
> In Panthor, this on-demand allocation scheme allows us to speed up
> rendering when memory is available, while still making progress when
> it isn't, hence the failable allocation scheme I'm proposing here.
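Just to make sure I understand the flow, here is a rough sketch of
what you describe. All names are made up on my side (tiler_heap,
heap_add_page(), start_incremental_render() are not from your series),
this only illustrates the decision tree:

    /* Hypothetical sketch of the tiler OOM flow described above. */
    static void tiler_oom_handler(struct tiler_heap *heap,
                                  struct gpu_job *job)
    {
        /* Must not block: we are on the fence signalling path. */
        struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

        if (page) {
            /* Grow the heap and let the tiler retry the write. */
            heap_add_page(heap, page);
            return;
        }

        /*
         * No memory available: flush the already binned primitives
         * to the FB ("incremental rendering") so their tiler memory
         * can be reused, then resume binning from there.
         */
        start_incremental_render(job);
    }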

Puh, that sounds like it can potentially work, but it is also very, very fragile.

You must have a code comment where you select the GFP flags, explaining how and why that works.
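Something along these lines; the exact flag choice is my guess, not
taken from your series:

    /*
     * Heap pages are allocated from the GPU fault handler, i.e. from
     * the DMA fence signalling path.
     *
     * GFP_NOWAIT: never sleep or enter direct reclaim, so the job
     * fence is guaranteed to signal in finite time and we can't
     * deadlock against a shrinker that (transitively) waits on that
     * fence.
     *
     * __GFP_NOWARN: allocation failure is an expected and handled
     * case (incremental rendering, or failing the job), so don't
     * warn about it.
     */
    page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);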

> In Panfrost and Lima, we don't have this concept of "incremental
> rendering", so when the allocation fails, we just fail the GPU job
> with an unhandled GPU fault.

To be honest, I think this is enough to mark those two drivers as broken. It's documented that this approach is a no-go for upstream drivers.

How widely is that used?

> And that's how it is today: the
> alloc-on-fault mechanism is used in at least 3 drivers, and all
> I'm trying to do here is standardize it and document the
> constraints that come with this model, constraints that are
> currently being ignored. Like the fact that allocations in the
> fault handler path must not block, so that we're guaranteed to
> signal the job fence in finite time and we don't risk a deadlock
> between the driver shrinker and the job triggering the fault.
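If those constraints stay, they should at least be machine-checkable
rather than just documented: dma_fence_begin_signalling() /
dma_fence_end_signalling() exist exactly for annotating fence
signalling critical sections, so lockdep complains about blocking
allocations in there. Rough sketch, grow_heap_nonblocking() is a
made-up helper:

    bool cookie;
    int ret;

    /*
     * Everything between the two annotations is treated by lockdep
     * as part of the fence signalling critical path, so a blocking
     * allocation (e.g. GFP_KERNEL) in here gets reported as a
     * potential deadlock against reclaim.
     */
    cookie = dma_fence_begin_signalling();
    ret = grow_heap_nonblocking(heap);   /* GFP_NOWAIT inside */
    dma_fence_end_signalling(cookie);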

Well, on one hand it's good that we document the pitfalls, but I clearly think that we should *not* spread that into common code.

That would give the impression that this actually works, and to be honest, I should start charging money for rejecting stuff like that. It would probably make me rich.

> I'm well aware of the implications of what I'm proposing here, but
> ignoring the fact that some drivers already violate the rules
> doesn't make them disappear. So I'd rather work with you and others
> to clearly state what the alloc-in-fault-path rules are, and enforce
> them in some helper functions, than pretend these ugly things don't
> exist. :D

Yeah, but this kind of tells others that it's OK to do this, which it clearly isn't, as far as I can see.

What we should do instead is document this as a not-OK approach and move on.

Regards,
Christian.

>
> Regards,
>
> Boris


