[PATCH v3 0/8] drm: Introduce sparse GEM shmem

Steven Price steven.price at arm.com
Wed Apr 16 15:16:05 UTC 2025


On 15/04/2025 10:47, Boris Brezillon wrote:
> On Mon, 14 Apr 2025 16:34:35 +0100
> Steven Price <steven.price at arm.com> wrote:
> 
>> On 14/04/2025 13:47, Boris Brezillon wrote:
>>> On Fri, 11 Apr 2025 16:39:02 +0200
>>> Boris Brezillon <boris.brezillon at collabora.com> wrote:
>>>   
>>>> On Fri, 11 Apr 2025 15:13:26 +0200
>>>> Christian König <christian.koenig at amd.com> wrote:
>>>>  
>>>>>>      
>>>>>>> Background is that you don't get a crash, nor error message, nor
>>>>>>> anything indicating what is happening.      
>>>>>> The job times out at some point, but we might get stuck in the fault
>>>>>> handler waiting for memory, which is pretty close to a deadlock, I
>>>>>> suspect.      
>>>>>
>>>>> I don't know those drivers that well, but at least for amdgpu the
>>>>> problem would be that the timeout handling would need to grab some of
>>>>> the locks which memory management is holding while waiting for the
>>>>> timeout handling to do something....
>>>>>
>>>>> So that basically closes the circle perfectly. With a bit of luck you
>>>>> get a message after some time that the kernel is stuck, but since
>>>>> those are all sleeping locks I strongly doubt it.
>>>>>
>>>>> As an immediate action, please provide patches which change those
>>>>> GFP_KERNEL allocations into GFP_NOWAIT.
>>>>
>>>> Sure, I can do that.  
>>>
>>> Hm, I might have been too quick to claim this was doable. In
>>> practice, doing that might regress Lima and Panfrost in situations
>>> where trying harder than GFP_NOWAIT would free up some memory. Not
>>> saying it was right to use GFP_KERNEL in the first place, but some
>>> expectations were set by this original mistake, so I'll probably need
>>> the Lima developers to vouch for this change after they've done some
>>> testing on a system under high memory pressure, and I'd need to do the
>>> same kind of testing for Panfrost and ask Steve if he's okay with that
>>> too.  
>>
>> It's a tricky one. The ideal would be to teach user space how to recover
>> from the memory allocation failure (even on older GPUs it's still in
>> theory possible to break up the work and do incremental rendering). But
>> that ship sailed long ago so this would be a regression.
>>
>> I did consider GFP_ATOMIC as an option, but really we shouldn't be
>> accessing "memory reserves" in such a situation.
>>
>> For background: In the "kbase" driver (Linux kernel driver for the
>> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
>> "JIT memory allocator". The idea here was that if you have multiple
>> renderings in flight you can attempt to share the tiler heap memory
>> between them. So in this case when we can't allocate more memory and we
>> know there's another tiler heap which is going to be freed by a fragment
>> job that's already running, we can block knowing the memory is going to
>> become available.
>>
>> It was designed to do the same thing as CSF's incremental rendering -
>> allow us to opportunistically allocate memory but not fail the rendering
>> if it wasn't available.
>>
>> But it was a nightmare to have any confidence that it was deadlock free,
>> and the implementation was frankly terrible - which is ultimately why
>> CSF GPUs have this ability in hardware to perform incremental rendering
>> without failing the job. But effectively this approach requires
>> allocating just enough memory for one complete tiler heap while ensuring
>> forward progress and opportunistically allowing extra memory to give a
>> performance boost.
>>
>> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
>> some testing with memory pressure. It might be acceptable (and an
>> occasional job failure is better than an occasional lock up). If it
>> turns out it's too easy to trigger job failures like this then we'll
>> need to rethink.
> 
> I thought about this incremental-rendering-on-JM thing during the past
> few days, and I'd like to run one idea by you if you don't mind.
> What if we tried to implement incremental rendering the way it's done
> for CSF, that is, when we get a fault on a tiler heap object we:

CSF adds the ability for the command stream to make decisions between
'jobs'. kbase has the concept of 'soft jobs' allowing the kernel to do
actions between jobs. I fear that to make this work we'd need something
similar - see below.

> 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
>    faulty AS), such that any primitive data that has been queued so far
>    gets pushed to memory

So I'm not entirely sure whether you are proposing doing this on the
back of an MMU fault or not. Doing a FLUSH_MEM while the tiler is waiting
for a memory fault to be resolved is unlikely to work (it won't flush
the tiler's internal caches). The tiler is a very complex beast and has
some elaborate caches. We don't support soft-stopping tiler jobs (they
just run to completion) because the H/W guys could never figure out how
to safely stop the tiler - so I very much doubt we can deal with the
half-completed state of the tiler.
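
For reference, issuing the flush itself is cheap - it's an AS command write
plus a wait for the AS to go idle, roughly what the internal
wait_ready()/write_cmd() helpers in panfrost_mmu.c already do. A rough
sketch (the helper name is made up; register names are from
panfrost_regs.h, and it needs linux/iopoll.h):

static int panfrost_as_flush_mem_sketch(struct panfrost_device *pfdev,
					u32 as_nr)
{
	u32 val;
	int ret;

	/* Wait for the AS to accept a new command. */
	ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
						val,
						!(val & AS_STATUS_AS_ACTIVE),
						10, 100000);
	if (ret)
		return ret;

	/* Flush the GPU caches for this address space. */
	gpu_write(pfdev, AS_COMMAND(as_nr), AS_COMMAND_FLUSH_MEM);

	/* Wait for the flush to complete. */
	return readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
						 val,
						 !(val & AS_STATUS_AS_ACTIVE),
						 10, 100000);
}

(A real version would also lock the heap VA range first, as the existing
flush path does.) The problem isn't the mechanics of the flush, it's that
none of it touches the tiler's internal state while it's stalled on the
fault.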

> 2. use a different AS for an IR (Incremental Rendering) fragment job that
>    has been provided by the UMD at submission time, just like the
>    CSF backend of the UMD registers an exception handler for tiler OOMs
>    at the moment.

It's an interesting idea - I worry about internal deadlocks within the
GPU. There's some magic which ties vertex processing and the tiler
together, and I'm not sure whether a hung tiler job could lead to hung
vertex tasks on the shader cores. But it's possible my worries are
unfounded.

>    Make it so this IR fragment job chain is queued immediately after the
>    currently running fragment job (if any). This IR fragment job should
>    consume all the primitives written so far and push the result to a
>    framebuffer. There's a bit of patching to do on the final
>    fragment job chain, because the FB preload setup will differ if
>    IR took place, but that should be doable with simple WRITE_VALUE jobs
>    turning NULL jobs into FRAGMENT jobs (or the other way around)
>    such that they get skipped/activated depending on whether IR took
>    place or not.

You're probably more of an expert than me on that part - it certainly
sounds plausible.

> 3. collect pages covering consumed primitives and make the heap range
>    covering those consumed pages fault on further accesses (basically
>    iommu->unmap() the section we collected pages from). We probably
>    need to keep a couple pages around the top/bottom tiler heap
>    pointers, because some partially written primitives might be crossing
>    a heap chunk boundary, but assuming our heap has more than 4 chunks
>    available (which we can ensure at tiler heap allocation time), we
>    should be good.

I'm not sure if it's sufficient to keep the last 'n' chunks. But at
least in theory it should be possible to identify chunks that are still
active.
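
To illustrate the mechanics of reclaiming consumed chunks - purely a
sketch, every helper name below is invented and the 2MB chunk size is just
an assumption:

#define HEAP_CHUNK_SZ	SZ_2M	/* assumed chunk size, illustration only */

static void heap_reclaim_consumed_sketch(struct panfrost_device *pfdev,
					 struct hypothetical_heap *heap,
					 u64 consumed_start, u64 consumed_end)
{
	u64 iova;

	/*
	 * Keep a couple of chunks of slack below the live heap pointers,
	 * since partially written primitives can straddle a chunk boundary.
	 */
	consumed_end = min_t(u64, consumed_end,
			     heap_live_bottom(heap) - 2 * HEAP_CHUNK_SZ);

	for (iova = ALIGN(consumed_start, HEAP_CHUNK_SZ);
	     iova + HEAP_CHUNK_SZ <= consumed_end;
	     iova += HEAP_CHUNK_SZ) {
		/* Make further GPU accesses to this range fault again. */
		heap_unmap_range(pfdev, heap, iova, HEAP_CHUNK_SZ);
		/*
		 * Stash the backing pages so the next growing fault can be
		 * satisfied without allocating new memory.
		 */
		heap_recycle_pages(heap, iova, HEAP_CHUNK_SZ);
	}
}

The mechanics are the easy bit though - the hard part is knowing which
chunks are genuinely consumed.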

> 4. use one of the collected pages to satisfy the growing request, and
>    acknowledge the fault on the faulty AS. We can pick from the pool
>    of collected pages to satisfy new growing requests until it's
>    depleted again.
> 
> 5. repeat steps 1-4 until all primitives are flushed.

Running the IR fragment jobs from part way through the tiler heap on
subsequent renderings could be tricky - I'm not sure if we have a way of
doing that?

> 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
>    heap BO range, and collect all the pages that were previously
>    allocated to the heap such that the next allocation can be satisfied
>    immediately
> 
> Of course, this only works if primitives are added to the list only
> when they are fully written (that is, all data for the primitive has
> been written, and the primitive can be consumed by the fragment job
> without hazards).

Yeah, the hardware certainly wasn't designed for this ;)

The H/W designers' approach is simple:

 1. Submit vertex/tiler job with small tiler heap

 2. Job fails with OUT_OF_MEMORY; capture the index of the last completed
    vertex, which is written to the job descriptor

 3. Submit fragment job to incrementally render.

 4. Submit new vertex/tiler job using the index captured in step 2.

 5. Loop back to 2.

The problem from an implementation point of view is that either you give
up on fences (bouncing back to user space between each step), or you need
to start writing job descriptors in the kernel.
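
To spell out where those round-trips sit, the loop the hardware expects
looks something like this (pure pseudocode, every name invented for
illustration):

static void incremental_render_loop_sketch(struct hypothetical_ctx *ctx)
{
	u32 restart_index = 0;
	int status;

	for (;;) {
		/* Steps 1/4: submit the vertex/tiler chain, resuming from
		 * the last completed vertex, and wait for it. */
		status = submit_and_wait_vertex_tiler(ctx, restart_index);

		if (status == STATUS_OUT_OF_MEMORY) {
			/* Step 2: the job descriptor now holds the index of
			 * the last vertex completed before the heap filled. */
			restart_index = read_restart_index(ctx);

			/* Step 3: incremental render of what's been binned
			 * so far, freeing up the tiler heap. */
			submit_and_wait_fragment(ctx, true /* incremental */);

			/* Step 5: loop back with the captured index. */
			continue;
		}

		/* The vertex/tiler chain fully completed: final fragment
		 * job. */
		submit_and_wait_fragment(ctx, false /* incremental */);
		break;
	}
}

Every iteration needs the CPU to inspect the descriptor and build or patch
the next jobs, which is exactly the part that either breaks the fence model
or pulls descriptor writing into the kernel.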

Clearly one option is to go the whole hog and have something looking
like the CSF firmware running in the kernel. It's not a popular option
though! ;)

> Just to be clear, this is a long term plan to try and fix this on JM
> HW, and given the changes it involves (UMD needs to be taught about IR
> on JM), we'll still need the GFP_NOWAIT fix for the time being.

Agreed, and I like the ingenuity of it - but having seen the H/W team
give up on soft-stopping the tiler I'm not convinced it will work
leaving the tiler waiting on an MMU fault.
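
For the interim fix I'd expect something fairly mechanical in the
heap-growing paths: allocate opportunistically and fail the job cleanly if
the pages aren't immediately available. Roughly (sketch only, against a
made-up helper rather than the real Lima/Panfrost code):

static int heap_grow_nowait_sketch(struct hypothetical_heap *heap,
				   unsigned int nr_pages)
{
	struct page *page;
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		/* No direct reclaim, and no warning spam when it fails. */
		page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
		if (!page) {
			/*
			 * Roll back and report failure; the caller should
			 * terminate the faulting job rather than block
			 * waiting for memory that may never arrive.
			 */
			heap_release_pages(heap, i);	/* made up */
			return -ENOMEM;
		}
		heap_add_page(heap, page);		/* made up */
	}

	return 0;
}

Then it's down to the memory-pressure testing to tell us how often jobs
actually start failing in practice.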

Thanks,
Steve


