[PATCH] drm/ttm: WIP limit the TTM pool to 32bit CPUs
Thomas Hellström
thomas.hellstrom at linux.intel.com
Thu Aug 7 16:47:49 UTC 2025
On Thu, 2025-08-07 at 11:53 +0200, Christian König wrote:
> On 06.08.25 19:43, Thomas Hellström wrote:
> > Hi, Christian
> >
> > On Wed, 2025-08-06 at 15:28 +0200, Christian König wrote:
> > > On some old x86 systems we had the problem that changing the
> > > caching flags of system memory requires changing the global
> > > MTRR/PAT tables.
> > >
> > > But on any modern x86 system (CPUs introduced roughly after 2004)
> > > we actually don't need that any more and can update the caching
> > > flags directly in the PTEs of the CPU mapping. It was just never
> > > disabled because of the fear of regressions.
> > >
> > > We have already used the PTE flags for encryption on x86 64bit for
> > > quite a while, and all other supported platforms (Sparc, PowerPC,
> > > ARM, MIPS, LoongArch) have never done anything different either.
> >
> > IIRC from my VMware days, changing the SEV encryption mode of a
> > page still requires changing all mappings, including kernel maps?
> > See __set_memory_enc_pgtable().
>
> IIRC both Intel and AMD sacrifice a bit in the page address for that,
> e.g. for encryption the most significant bit is used to indicate if a
> page is encrypted or not.
>
> I'm not aware that we need to change all kernel mappings for
> encryption, but could be that the hypervisor somehow depends on that.
>
> > >
> > > So disable the page pool completely for 64bit systems and just
> > > insert a clflush to be on the safe side, so that we never return
> > > memory with dirty cache lines.
> > >
> > > Testing on a Ryzen 5 and 7 shows that this has absolutely no
> > > performance impact, and off hand the AMD CI can't find a problem
> > > either.
> > >
> > > Let's see what the i915 and XE CI systems say to that.
> > >
> > > Signed-off-by: Christian König <christian.koenig at amd.com>
> >
> > I don't think we can do this. First, Lunar Lake can in some
> > situations, just like the old Athlons, write back clean cache
> > lines, which means writebacks of speculative prefetches may
> > overwrite GPU data.
>
> So a speculative prefetch caused by an access to an adjacent page
> could cause the cache line to be fetched and then written back
> without any change to it?
Exactly.
>
> Well, it's surprising that even modern CPUs do stuff like that. That
> could explain some of the problems we had with uncached mappings on
> ARM and RISC-V.
Yeah, I agree. I need to double-check with the HW people whether that
is gone with Panther Lake. No confirmation on that yet.
>
> > LNL makes heavy use of non-coherent GPU mappings for performance.
>
> That is even more surprising. At least on modern Ryzens that doesn't
> seem to have much of a performance impact any more.
>
> I mean, non-cached mappings were originally introduced to avoid the
> extra overhead of going over the front-side bus, but that design is
> long gone.
With LNL it's possible to set up GPU mappings that make the accesses
coherent with the CPU, but that's quite slow. A tradeoff in the HW
design. If it weren't for the writeback of speculative prefetches, we
could have settled for having TTM change only the user-space mappings
without touching the kernel map, just like the i915 driver does for
older GPUs.
>
> > Second, IIRC vm_insert_pfn_prot() on x86 will override the given
> > caching mode with the last caching mode set for the kernel linear
> > map, so if you try to set up a write-combined GPU mapping without a
> > previous call to set_pages_xxxxx it will actually end up cached.
> > See track_pfn_insert().
>
> That is exactly the same incorrect assumption I made as well.
>
> It's not the linear mapping where that comes from but a separate page
> attribute table, see /sys/kernel/debug/x86/pat_memtype_list.
>
> Question is why the heck should we do this? I mean we keep an extra
> rb tree around to overwrite something the driver knows in the first
> place?
>
> That is basically just tons of extra overhead for nothing as far as I
> can see.
IIRC it was the PAT people enforcing the x86 documentation, which says
that aliased mappings with conflicting caching attributes are not
allowed. But it has proven to work, at least on those CPUs not
suffering from the clean cache-line writeback mentioned above.
FWIW, if I understand the code correctly, i915 bypasses this by
setting up user-space mappings not with vm_insert_pfn_prot() but with
apply_to_page_range(), mapping the whole bo.
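Roughly, the scheme is this (kernel-flavoured pseudocode of my own, not
the actual i915 code; the helper struct and names are assumptions, only
the apply_to_page_range()/pte_fn_t shape is from the kernel API):

```c
/* Sketch: populate every PTE of a VMA from the bo's pages with a
 * per-mapping caching mode, leaving the kernel linear map untouched. */
struct remap_data {
	struct page **pages;	/* the bo's backing pages */
	unsigned long idx;
	pgprot_t prot;		/* e.g. pgprot_writecombine(vma->vm_page_prot) */
};

static int remap_page_fn(pte_t *pte, unsigned long addr, void *data)
{
	struct remap_data *r = data;

	set_pte_at(current->mm, addr, pte,
		   mk_pte(r->pages[r->idx++], r->prot));
	return 0;
}

static int remap_bo(struct vm_area_struct *vma, struct remap_data *r)
{
	return apply_to_page_range(vma->vm_mm, vma->vm_start,
				   vma->vm_end - vma->vm_start,
				   remap_page_fn, r);
}
```

The relevant property is that this path never consults the PAT memtype
tracking the way track_pfn_insert() does.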
/Thomas
>
> Thanks for taking a look,
> Christian.
>
> >
> > /Thomas
> >
> >
> > > ---
> > >  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
> > >  1 file changed, 11 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index baf27c70a419..7487eac29398 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -38,7 +38,7 @@
> > >  #include <linux/highmem.h>
> > >  #include <linux/sched/mm.h>
> > >
> > > -#ifdef CONFIG_X86
> > > +#ifdef CONFIG_X86_32
> > >  #include <asm/set_memory.h>
> > >  #endif
> > >
> > > @@ -46,6 +46,7 @@
> > >  #include <drm/ttm/ttm_pool.h>
> > >  #include <drm/ttm/ttm_tt.h>
> > >  #include <drm/ttm/ttm_bo.h>
> > > +#include <drm/drm_cache.h>
> > >
> > >  #include "ttm_module.h"
> > >
> > > @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > >  	struct ttm_pool_dma *dma;
> > >  	void *vaddr;
> > >
> > > -#ifdef CONFIG_X86
> > > +#ifdef CONFIG_X86_32
> > >  	/* We don't care that set_pages_wb is inefficient here. This is only
> > >  	 * used when we have to shrink and CPU overhead is irrelevant then.
> > >  	 */
> > > @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
> > >  /* Apply any cpu-caching deferred during page allocation */
> > >  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > >  {
> > > -#ifdef CONFIG_X86
> > > +#ifdef CONFIG_X86_32
> > >  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > >
> > >  	if (!num_pages)
> > > @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
> > >  	case ttm_uncached:
> > >  		return set_pages_array_uc(alloc->caching_divide, num_pages);
> > >  	}
> > > +
> > > +#elif defined(CONFIG_X86_64)
> > > +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
> > > +
> > > +	drm_clflush_pages(alloc->caching_divide, num_pages);
> > >  #endif
> > >  	alloc->caching_divide = alloc->pages;
> > >  	return 0;
> > > @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
> > >  	if (pool->use_dma_alloc)
> > >  		return &pool->caching[caching].orders[order];
> > >
> > > -#ifdef CONFIG_X86
> > > +#ifdef CONFIG_X86_32
> > >  	switch (caching) {
> > >  	case ttm_write_combined:
> > >  		if (pool->nid != NUMA_NO_NODE)
> > > @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > >  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
> > >  		return -EBUSY;
> > >
> > > -#ifdef CONFIG_X86
> > > +#ifdef CONFIG_X86_32
> > >  	/* Anything returned to the system needs to be cached. */
> > >  	if (tt->caching != ttm_cached)
> > >  		set_pages_array_wb(tt->pages, tt->num_pages);
> >
>
More information about the dri-devel mailing list