[PATCH] drm/ttm: WIP limit the TTM pool to 32bit CPUs
Thomas Hellström
thomas.hellstrom at linux.intel.com
Mon Aug 11 15:16:10 UTC 2025
On Mon, 2025-08-11 at 13:51 +0200, Christian König wrote:
> On 07.08.25 18:47, Thomas Hellström wrote:
> > > Well it's surprising that even modern CPUs do stuff like that.
> > > That could explain some of the problems we had with uncached
> > > mappings on ARM and RISC-V.
> >
> > Yeah. I agree. Need to double-check with HW people whether that is
> > gone with Panther Lake. Don't have a confirmation yet on that.
>
> Going to ask around AMD internally as well.
>
> > If it wasn't for the writeback of speculative prefetches we
> > could've settled for only having TTM map the user-space mappings
> > without changing the kernel map, just like the i915 driver does for
> > older GPUs.
>
> We should probably come up with a CPU whitelist or blacklist where
> that is actually needed.
>
> > > > Second, IIRC vmf_insert_pfn_prot() on x86 will override the
> > > > given caching mode with the last caching mode set for the kernel
> > > > linear map, so if you try to set up a write-combined GPU mapping
> > > > without a previous call to set_pages_xxxxx() it will actually
> > > > end up cached. See track_pfn_insert().
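[Illustration: the pattern under discussion looks roughly like this; a
minimal sketch, not actual TTM code, and the helper name is made up.]

#include <linux/mm.h>

/*
 * We ask for a write-combined mapping here, but on x86
 * vmf_insert_pfn_prot() consults the PAT memtype tracking (see
 * track_pfn_insert()) and may replace the requested caching mode with
 * the one recorded for this pfn range -- typically write-back, unless
 * a prior set_pages_*() call changed it.
 */
static vm_fault_t bo_insert_wc_pfn(struct vm_fault *vmf, unsigned long pfn)
{
	pgprot_t prot = pgprot_writecombine(vmf->vma->vm_page_prot);

	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}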
> > >
> > > That is exactly the same incorrect assumption I made as well.
> > >
> > > It's not the linear mapping where that comes from but a separate
> > > page attribute table, see /sys/kernel/debug/x86/pat_memtype_list.
> > >
> > > Question is why the heck should we do this? I mean we keep an
> > > extra rb tree around to overwrite something the driver knows in
> > > the first place?
> > >
> > > That is basically just tons of extra overhead for nothing as far
> > > as I can see.
> >
> > IIRC it was the PAT people enforcing the x86 documentation's rule
> > that aliased mappings with conflicting caching attributes are not
> > allowed. But it has proven to work at least on those CPUs not
> > suffering from the clean cache-line writeback mentioned above.
>
> Makes sense. With the PAT handling, even accessing things through
> /dev/mem gives you the right caching.
>
> Do you have a list of Intel CPUs where it works?
AFAIK LNL is the only one so far where it doesn't work. I need to get
more information internally.
>
> > FWIW, if I understand the code correctly, i915 bypasses this by
> > setting up user-space mappings not with vmf_insert_pfn_prot() but
> > with apply_to_page_range(), mapping the whole bo.
>
> Yeah, that's probably not something we can do. Even filling in 2MiB
> of address space at a time caused performance problems for TTM.
Wasn't that because of repeated calls to vmf_insert_pfn_prot(),
repeating the caching checks and the page-table walk every time? I
think apply_to_page_range() should be pretty fast. Also, to avoid
regressions due to changing the number of prefaulted pages, we could
perhaps honor the MADV_RANDOM and MADV_SEQUENTIAL hints for UMDs to
use: the former faulting in a single page only, the latter faulting in
the whole bo. But also see below:
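Roughly what I have in mind, as a minimal sketch; this is not the
actual i915 or TTM code, and all names here are made up:

#include <linux/mm.h>
#include <linux/pgtable.h>

struct remap_data {
	struct mm_struct *mm;
	unsigned long pfn;	/* next pfn of the bo's backing store */
	pgprot_t prot;		/* desired caching, e.g. write-combined */
};

/* Called once per pte: no per-page caching check or PAT lookup. */
static int remap_pfn_fn(pte_t *pte, unsigned long addr, void *data)
{
	struct remap_data *r = data;

	set_pte_at(r->mm, addr, pte,
		   pte_mkspecial(pfn_pte(r->pfn++, r->prot)));
	return 0;
}

/* Map the whole bo with a single page-table walk, similar in spirit
 * to what i915 does with apply_to_page_range().
 */
static int map_whole_bo(struct vm_area_struct *vma, unsigned long first_pfn)
{
	struct remap_data r = {
		.mm = vma->vm_mm,
		.pfn = first_pfn,
		.prot = pgprot_writecombine(vma->vm_page_prot),
	};

	return apply_to_page_range(vma->vm_mm, vma->vm_start,
				   vma->vm_end - vma->vm_start,
				   remap_pfn_fn, &r);
}

/* Prefault policy honoring the madvise() hints: MADV_RANDOM sets
 * VM_RAND_READ and MADV_SEQUENTIAL sets VM_SEQ_READ on the vma.
 */
static unsigned long num_prefault_pages(struct vm_area_struct *vma,
					unsigned long bo_num_pages)
{
	if (vma->vm_flags & VM_RAND_READ)
		return 1;		/* fault in a single page */
	if (vma->vm_flags & VM_SEQ_READ)
		return bo_num_pages;	/* fault in the whole bo */
	return 16;			/* e.g. TTM_BO_VM_NUM_PREFAULT */
}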
>
> We should probably just drop overriding the attributes in
> vmf_insert_pfn_prot().
I think either solution will meet resistance from the arch people. We
should probably involve them in the discussion.
Thanks,
/Thomas
>
> Regards,
> Christian.
>
> >
> > /Thomas
> >
> >
> > >
> > > Thanks for taking a look,
> > > Christian.
> > >
> > > >
> > > > /Thomas
> > > >
> > > >
> > > > > ---
> > > > >  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
> > > > >  1 file changed, 11 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > index baf27c70a419..7487eac29398 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > @@ -38,7 +38,7 @@
> > > > >  #include <linux/highmem.h>
> > > > >  #include <linux/sched/mm.h>
> > > > >
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  #include <asm/set_memory.h>
> > > > >  #endif
> > > > >
> > > > > @@ -46,6 +46,7 @@
> > > > >  #include <drm/ttm/ttm_pool.h>
> > > > >  #include <drm/ttm/ttm_tt.h>
> > > > >  #include <drm/ttm/ttm_bo.h>
> > > > > +#include <drm/drm_cache.h>
> > > > >
> > > > >  #include "ttm_module.h"
> > > > >
> > > > > @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  	struct ttm_pool_dma *dma;
> > > > >  	void *vaddr;
> > > > >
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* We don't care that set_pages_wb is inefficient here. This is only
> > > > >  	 * used when we have to shrink and CPU overhead is irrelevant then.
> > > > >  	 */
> > > > > @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > > > >  /* Apply any cpu-caching deferred during page allocation */
> > > > >  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  {
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > >
> > > > >  	if (!num_pages)
> > > > > @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > > > >  	case ttm_uncached:
> > > > >  		return set_pages_array_uc(alloc->caching_divide, num_pages);
> > > > >  	}
> > > > > +
> > > > > +#elif defined(CONFIG_X86_64)
> > > > > +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > > > +
> > > > > +	drm_clflush_pages(alloc->caching_divide, num_pages);
> > > > >  #endif
> > > > >  	alloc->caching_divide = alloc->pages;
> > > > >  	return 0;
> > > > > @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> > > > >  	if (pool->use_dma_alloc)
> > > > >  		return &pool->caching[caching].orders[order];
> > > > >
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	switch (caching) {
> > > > >  	case ttm_write_combined:
> > > > >  		if (pool->nid != NUMA_NO_NODE)
> > > > > @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > > >  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
> > > > >  		return -EBUSY;
> > > > >
> > > > > -#ifdef CONFIG_X86
> > > > > +#ifdef CONFIG_X86_32
> > > > >  	/* Anything returned to the system needs to be cached. */
> > > > >  	if (tt->caching != ttm_cached)
> > > > >  		set_pages_array_wb(tt->pages, tt->num_pages);
> > > >
> > >
> >
>