[PATCH v3 0/8] drm: Introduce sparse GEM shmem
Boris Brezillon
boris.brezillon at collabora.com
Tue Apr 15 09:47:47 UTC 2025
On Mon, 14 Apr 2025 16:34:35 +0100
Steven Price <steven.price at arm.com> wrote:
> On 14/04/2025 13:47, Boris Brezillon wrote:
> > On Fri, 11 Apr 2025 16:39:02 +0200
> > Boris Brezillon <boris.brezillon at collabora.com> wrote:
> >
> >> On Fri, 11 Apr 2025 15:13:26 +0200
> >> Christian König <christian.koenig at amd.com> wrote:
> >>
> >>>>
> >>>>> Background is that you don't get a crash, nor error message, nor
> >>>>> anything indicating what is happening.
> >>>> The job times out at some point, but we might get stuck in the fault
> >>>> handler waiting for memory, which is pretty close to a deadlock, I
> >>>> suspect.
> >>>
> >>> I don't know those drivers that well, but at least for amdgpu the
> >>> problem would be that the timeout handling would need to grab some of
> >>> the locks the memory management is holding waiting for the timeout
> >>> handling to do something....
> >>>
> >>> So that basically closes the circle perfectly. With a bit of luck
> >>> you get a message after some time that the kernel is stuck, but
> >>> since those are all sleeping locks I strongly doubt it.
> >>>
> >>> As an immediate action, please provide patches which change those
> >>> GFP_KERNEL allocations into GFP_NOWAIT.
> >>
> >> Sure, I can do that.
> >
> > Hm, I might have been too quick to claim this was doable. In
> > practice, doing that might regress Lima and Panfrost in situations
> > where trying harder than GFP_NOWAIT would free up some memory. Not
> > saying it was right to use GFP_KERNEL in the first place, but some
> > expectations were set by this original mistake, so I'll probably need
> > Lima developers to vouch for this change after they've done some
> > testing on a system under high memory pressure, and I'd need to do the
> > same kind of testing for Panfrost and ask Steve if he's okay with that
> > too.
>
> It's a tricky one. The ideal would be to teach user space how to recover
> from the memory allocation failure (even on older GPUs it's still in
> theory possible to break up the work and do incremental rendering). But
> that ship sailed long ago so this would be a regression.
>
> I did consider GFP_ATOMIC as an option, but really we shouldn't be
> accessing "memory reserves" in such a situation.
>
> For background: In the "kbase" driver (Linux kernel driver for the
> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
> "JIT memory allocator". The idea here was that if you have multiple
> renderings in flight you can attempt to share the tiler heap memory
> between them. So in this case when we can't allocate more memory and we
> know there's another tiler heap which is going to be freed by a fragment
> job that's already running, we can block knowing the memory is going to
> become available.
>
> It was designed to do the same thing as CSF's incremental rendering -
> allow us to opportunistically allocate memory but not fail the rendering
> if it wasn't available.
>
> But it was a nightmare to have any confidence that it was deadlock free
> and the implementation was frankly terrible - which is ultimately why
> CSF GPUs have this ability in hardware to perform incremental rendering
> without failing the job. But effectively this approach requires
> allocating just enough memory for one complete tiler heap to ensure
> forward progress, while opportunistically allowing extra memory to give
> a performance boost.
>
> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
> some testing with memory pressure. It might be acceptable (and an
> occasional job failure is better than an occasional lock up). If it
> turns out it's too easy to trigger job failures like this then we'll
> need to rethink.
I thought about this incremental-rendering-on-JM thing over the past
few days, and I'd like to run one idea by you if you don't mind.
What if we tried to implement incremental rendering the way it's done
for CSF? That is, when we get a fault on a tiler heap object, we would
(rough code sketch further below):
1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
faulty AS), such that any primitive data that have been queued so far
get pushed to memory
2. use a different AS for an IR (Incremental Rendering) fragment job that
has been provided by the UMD at submission time, just like the
CSF backend of the UMD registers an exception handler for tiler OOMs
at the moment.
Make it so this IR fragment job chain is queued immediately after the
currently running fragment job (if any). This IR fragment job should
consume all the primitives written so far and push the result to a
framebuffer. There's a bit of patching to do on the final
fragment job chain, because the FB preload setup will differ if
IR took place, but that should be doable with simple WRITE_VALUE jobs
turning NULL jobs into FRAGMENT jobs (or the other way around)
such that they get skipped/activated depending on whether IR took
place or not.
3. collect pages covering consumed primitives and make the heap range
covering those consumed pages fault on further accesses (basically
iommu->unmap() the section we collected pages from). We probably
need to keep a couple of pages around the top/bottom tiler heap
pointers, because some partially written primitives might cross
a heap chunk boundary, but assuming our heap has more than 4 chunks
available (which we can ensure at tiler heap allocation time), we
should be good.
4. use one of the collected pages to satisfy the growing request, and
acknowledge the fault on the faulty AS. We can pick from the pool
of collected pages to satisfy new growing requests until it's
depleted again.
5. repeat steps 1-4 until all primitives are flushed.
6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
heap BO range, and collect all the pages that were previously
allocated to the heap such that the next allocation can be satisfied
immediately.
Of course, this only works if primitives are added to the list after
they have been fully written (that is, all data for the primitive has
been written, and the primitive can be consumed by the fragment job
without hazards).
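
To make the ordering a bit more concrete, here's a rough sketch of what
the JM tiler-heap OOM path could look like. All the panfrost_* helpers
and the panfrost_tiler_heap struct below are hypothetical - they only
exist to give names to steps 1-4 - so don't read this as existing or
proposed Panfrost API, just as an illustration of the flow:

/*
 * Hypothetical sketch of the JM tiler-heap OOM path described above.
 * None of these helpers exist in Panfrost today; they just name the
 * steps so the ordering is explicit.
 */
static int panfrost_tiler_heap_oom(struct panfrost_device *pfdev,
				   struct panfrost_tiler_heap *heap,
				   u32 faulty_as, u64 fault_addr)
{
	struct page *page;

	/* Step 1: FLUSH_MEM on the faulty AS so every primitive queued so
	 * far is visible to the IR fragment job. */
	panfrost_as_flush_mem(pfdev, faulty_as, heap->va, heap->size);

	/* Step 2: queue the UMD-provided IR fragment job chain on another
	 * AS, right after the currently running fragment job (if any), and
	 * wait for it to consume the primitives written so far. */
	panfrost_queue_ir_fragment(pfdev, heap->ir_chain);
	panfrost_wait_ir_fragment(pfdev, heap->ir_chain);

	/* Step 3: unmap the heap range whose primitives were consumed and
	 * collect the backing pages, keeping a few pages around the
	 * top/bottom heap pointers since partially written primitives may
	 * cross a chunk boundary. */
	panfrost_heap_reclaim_consumed(heap);

	/* Step 4: satisfy the growing request from the reclaimed pool and
	 * acknowledge the fault so the tiler can make progress again. */
	page = panfrost_heap_get_reclaimed_page(heap);
	if (!page)
		return -ENOMEM; /* shouldn't happen with >4 chunks */

	panfrost_mmu_map_page(pfdev, faulty_as, fault_addr, page);
	panfrost_mmu_ack_fault(pfdev, faulty_as);

	/* Steps 5-6 (repeat on further faults, then reset the whole heap
	 * mapping once rendering completes) are driven by subsequent
	 * faults and job completion, and are omitted here. */
	return 0;
}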
Just to be clear, this is a long-term plan to try to fix this on JM
HW, and given the changes it involves (the UMD needs to be taught about
IR on JM), we'll still need the GFP_NOWAIT fix for the time being.
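
For reference, the interim fix should boil down to something like the
snippet below in the fault path that allocates heap pages. The exact
call site and surrounding error handling differ between Lima and
Panfrost (mapping/page_offset here stand in for whatever the driver has
at hand), so treat this as illustrative only:

	/*
	 * Illustrative only: grab heap pages without blocking so the
	 * fault handler can never stall in reclaim waiting for memory
	 * that a stuck job might be holding. On failure the job gets
	 * faulted/failed instead of deadlocking.
	 */
	page = shmem_read_mapping_page_gfp(mapping, page_offset,
					   GFP_NOWAIT | __GFP_NOWARN);
	if (IS_ERR(page))
		return PTR_ERR(page); /* fail the job, don't wait */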