[PATCH] drm/doc: Start documenting aspects specific to tile-based renderers

Boris Brezillon boris.brezillon at collabora.com
Mon Apr 28 07:42:04 UTC 2025


Hi Alyssa,

On Wed, 23 Apr 2025 10:47:21 -0400
Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:

> Steven wanted non-Mali eyes, so with my Imaginapple hat on...
> 
> > +All lot of embedded GPUs are using tile-based rendering instead of immediate  
> 
> s/All lot of/Many/

Will change that.

> 
> > +- Complex geometry pipelines: if you throw geometry/tesselation/mesh shaders
> > +  it gets even trickier to guess the number of primitives from the number
> > +  of vertices passed to the vertex shader.  
> 
> Tessellation, yes. Geometry shaders, no. Geometry shaders must declare
> the maximum # of vertices they output, so by themselves geometry shaders
> don't make the problem much harder - unless you do an indirect draw with
> a GS, the emulated GS draw can still be direct.

Right, geometry shaders are simpler to cap, so I'll distinguish them from
tessellation shaders.

> 
> But I guess "even trickier" is accurate still...
> 
> > +For all these reasons, the tiler usually allocates memory dynamically, but
> > +DRM has not been designed with this use case in mind. Drivers will address
> > +these problems differently based on the functionality provided by their
> > +hardware, but all of them almost certainly have to deal with this somehow.
> > +
> > +The easy solution is to statically allocate a huge buffer to pick from when
> > +tiler memory is needed, and fail the rendering when this buffer is depleted.
> > +Some drivers try to be smarter to avoid reserving a lot of memory upfront.
> > +Instead, they start with an almost empty buffer and progressively populate it
> > +when the GPU faults on an address sitting in the tiler buffer range.  
> 
> This all seems very Mali-centric.

How surprising :-).

> Imaginapple has had partial renders
> since forever.

Yep, I've noticed that Imagination has partial renders too, but I didn't
know they had been present from the start.

Anyway, I'll make it clear that all sane HW should have a partial-render
fallback, but that this is sometimes not the case, and when it isn't, we
have to work around the HW limitation.

> 
> > +More and more hardware vendors don't bother providing hardware support for
> > +geometry/tesselation/mesh stages  
> 
> I wouldn't say that... Mali is the only relevant hardware that has *no*
> hardware support for any of geom/tess/mesh. All the desktop vendors +
> Qualcomm have full hardware support, Apple has hardware mesh on m3+,
> Broadcom has geom/tess, and I think Imagination has geom/tess on certain
> parts.

Okay. I'll name Arm/Mali specifically here then.

> 
> And I don't know of any vendors (except possibly Imagination) that
> removed hardware support, because it turns out having hardware support
> for core API features is a good thing actually. It doesn't need to look
> like "put the API in hardware" but some sort of hardware acceleration
> (like AMD's NGG) solves the problems in this doc and more.
> 
> So... just "Some hardware vendors omit hardware support for
> geometry/tessellation/mesh stages".

Sounds good.

> 
> > +This being said, the same "start small, grow on-demand" can be applied
> > +to avoid over-allocating memory upfront.  
> 
> [citation needed], if we overflow that buffer we're screwed and hit
> device_loss, and that's unacceptable in normal usage.

My bad, I went ahead and mentioned other geom stages here because Sima
seemed to think that the same tricks could be applied to those. I'll
just drop this section.

> 
> > +The problem with on-demand allocation is that suddenly, GPU accesses can
> > +fail on OOM, and the DRM components (drm_gpu_scheduler and drm_gem mostly)
> > +were not designed for that.  
> 
> It's not the common DRM scheduler that causes this problem, it
> fundamentally violates the kernel-wide dma_fence contract: signalling a
> dma-fence must not block on a fallible memory allocation, full stop.
> Nothing we do in DRM will change that contract (and it's not obvious to
> me that kbase is actually correct in all the corner cases).

Okay, I'll stop blaming DRM and hide behind the dma_fence contract here.

BTW, is there a piece of documentation explaining the rationale behind
this dma_fence contract, or is it just the usual informal knowledge shared
among DRM devs over IRC/email threads :-) ?

To be honest, I'm a bit unhappy with this "it's part of the dma_fence
contract" explanation, because I have a hard time remembering all the
details that led to this set of rules myself, so I suspect it's even
harder for newcomers to reason about. To me, that's one of the
reasons people fail to understand/tend to forget what the
problems/limitations are, and end up ignoring them (intentionally or
not).

FWIW, this is what I remember, but I'm sure there's more:

1. dma_fence must signal in finite time, so unbounded waits in the
   fence signalling path are not allowed, and that's exactly what
   GFP_KERNEL allocations can introduce (see the sketch after this list)
2. if you're blocked in your GPU fault handler, that means you can't
   process further faults happening on other contexts
3. GPU drivers actively participate in the memory reclaim process,
   which can deadlock if the allocation done in the fault handler ends
   up waiting, through reclaim, on the very GPU job fence whose
   signalling depends on that allocation being satisfied
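
To make rules 1 and 3 a bit more concrete, here's a minimal, untested
sketch of the split I have in mind. All the names (my_tiler_pool,
my_tiler_prefill, my_tiler_grow) are made up for the example, this is
not meant to match any existing driver:

struct my_tiler_pool {
	spinlock_t lock;
	struct list_head free_pages;	/* pages pre-allocated at submit time */
};

/*
 * Called at job submission, before the job's dma_fence is published to
 * other users: blocking GFP_KERNEL allocations are still fine here, and
 * we can simply fail the submission with -ENOMEM.
 */
static int my_tiler_prefill(struct my_tiler_pool *pool, unsigned int npages)
{
	while (npages--) {
		struct page *page = alloc_page(GFP_KERNEL);

		if (!page)
			return -ENOMEM;

		spin_lock(&pool->lock);
		list_add_tail(&page->lru, &pool->free_pages);
		spin_unlock(&pool->lock);
	}

	return 0;
}

/*
 * Called from the GPU fault handler, which sits in the fence signalling
 * path: no blocking allocation allowed, we only consume what was
 * pre-filled. If the pool runs dry, the fallback must still let the job
 * fence signal in finite time (partial render, or failing the job),
 * never a GFP_KERNEL allocation.
 */
static struct page *my_tiler_grow(struct my_tiler_pool *pool)
{
	struct page *page;

	spin_lock(&pool->lock);
	page = list_first_entry_or_null(&pool->free_pages, struct page, lru);
	if (page)
		list_del(&page->lru);
	spin_unlock(&pool->lock);

	return page;
}

The real thing obviously needs to decide how big the pre-fill should be
and how pages flow back to the pool, which is what the shared pool
paragraph further down is about.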

I'd really love it if someone (Sima, Alyssa and/or Christian?) could sum
it up, so I can put the outcome of this discussion in some kernel doc
entry (or maybe it'd be better if one of you submitted a patch for
that ;-)). If it's already documented somewhere, I'll just have to eat
my hat and accept your RTFM answer :-).

> 
> > +The second trick to try to avoid over-allocation, even with this
> > +sub-optimistic estimate, is to have a shared pool of memory that can be
> > +used by all GPU contexts when they need tiler/geometry memory. This
> > +implies returning chunks to this pool at some point, so other contexts
> > +can re-use those. Details about what this global memory pool implementation
> > +would look like is currently undefined, but it needs to be filled to
> > +guarantee that pre-allocation requests for on-demand buffers used by a
> > +GPU job can be satisfied in the fault handler path.  
> 
> How do we clean memory between contexts? This is a security issue.
> Either we need to pin physical pages to single processes, or we need to
> zero pages when returning pages to the shared pool. Zeroing on the
> CPU side is an option but the performance hit may be unacceptable
> depending how it's implemented. Alternatively we can require userspace to
> clean up after itself on the gpu (with a compute shader) but that's
> going to burn memory b/w in the happy path where we have lots of memory
> free.

I would say memset(0) on allocation (when recycling pages returned to
the pool) since that's already where you're taking the hit for regular
allocations anyway (allocating pages is not free).
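
Roughly something like this, purely illustrative again (my_pool_get_page()
is made up, and it reuses the my_tiler_pool structure from the earlier
sketch):

/*
 * Recycle a page from the shared pool; this runs in the pre-fill path
 * (outside the fence signalling path), which is also where fresh pages
 * would be allocated. Clearing at handout time means we never leak
 * another context's tiler data, and idle pages sitting in the pool cost
 * nothing.
 */
static struct page *my_pool_get_page(struct my_tiler_pool *pool)
{
	struct page *page;

	spin_lock(&pool->lock);
	page = list_first_entry_or_null(&pool->free_pages, struct page, lru);
	if (page)
		list_del(&page->lru);
	spin_unlock(&pool->lock);

	if (!page)
		page = alloc_page(GFP_KERNEL);

	if (page)
		clear_highpage(page);

	return page;
}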

> 
> > +For GL drivers, the UMD is in control of the context recreation, so
> > +it can easily record the next buffer size to use.  
> 
> I'm /really/ skeptical of this. Once we hit a device loss in GL, it's
> game over, and I'm skeptical of any plan that expects userspace to
> magically recover, especially as soon as side effects are introduced
> (including transform feedback which is already gles3.0 required).

I can drop that one. I know we've always done that in gallium/panfrost
(which predates CSF/panthor, BTW), but I trust you when you say it
doesn't comply with the GL spec.

Thanks a lot for chiming in BTW, that's truly appreciated.

Regards,

Boris

