[PATCH] drm/ttm: WIP limit the TTM pool to 32bit CPUs

Christian König christian.koenig at amd.com
Thu Aug 7 09:53:44 UTC 2025


On 06.08.25 19:43, Thomas Hellström wrote:
> Hi, Christian
> 
> On Wed, 2025-08-06 at 15:28 +0200, Christian König wrote:
>> On some old x86 systems we had the problem that changing the caching
>> flags of system memory required changing the global MTRR/PAT tables.
>>
>> But on any modern x86 system (CPUs introduced roughly after 2004) we
>> actually don't need that any more and can update the caching flags
>> directly in the PTEs of the CPU mapping. It was just never disabled
>> because of the fear of regressions.
>>
>> We have already used the PTE flags for encryption on x86 64bit for
>> quite a while, and all other supported platforms (Sparc, PowerPC,
>> ARM, MIPS, LoongArch) have never done anything different either.
> 
> IIRC from my VMWARE days, changing SEV encryption mode of a page still
> requires changing all mappings including kernel maps?
> __set_memory_enc_pgtable()

IIRC both Intel and AMD sacrifice a bit in the physical page address for that, e.g. for encryption one of the most significant address bits (AMD's C-bit) indicates whether a page is encrypted or not.

I'm not aware that we need to change all kernel mappings for encryption, but it could be that the hypervisor somehow depends on that.
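
For reference, this is roughly how that bit ends up encoded on the kernel side (paraphrased from include/linux/mem_encrypt.h, details vary by kernel version):

/*
 * sme_me_mask has a single physical address bit set (the C-bit
 * position is enumerated via CPUID 0x8000001f), so marking a mapping
 * as encrypted is just OR-ing that bit into the page address in the
 * PTE, no global table involved:
 */
#define __sme_set(x)	((x) | sme_me_mask)
#define __sme_clr(x)	((x) & ~sme_me_mask)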

>>
>> So disable the page pool completely for 64bit systems and just
>> insert a clflush to be on the safe side so that we never return
>> memory with dirty cache lines.
>>
>> Testing on a Ryzen 5 and 7 shows that this has absolutely no
>> performance impact, and offhand the AMD CI can't find a problem
>> either.
>>
>> Let's see what the i915 and XE CI systems say to that.
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
> 
> I don't think we can do this. First, Lunar Lake can in some situations,
> just like the old Athlons, write back clean cache lines, which means
> writebacks of speculative prefetches may overwrite GPU data.

So a speculative prefetch caused by an access to an adjacent page could cause the cache line to be fetched and then written back without any change to it?

Well, it's surprising that even modern CPUs do stuff like that. That could explain some of the problems we had with uncached mappings on ARM and RISC-V.
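
If that is really what happens then the clflush in this patch can't fully close the window either, since the kernel linear map stays cacheable and a line can simply be prefetched again after the flush. A rough sketch of the hazard and of why the current code changes the linear map instead (hypothetical helper name, only to illustrate):

/*
 * Hypothetical timeline on a CPU that writes back clean lines:
 *
 *  1. CPU: clflush page P, hand it to the GPU with a WC/UC mapping
 *  2. CPU: speculatively prefetches a line of P through the still
 *     cacheable kernel linear map; the line stays clean
 *  3. GPU: writes that line of P through its non-coherent mapping
 *  4. CPU: evicts the clean line and writes it back anyway,
 *     clobbering the GPU's data
 */
#include <asm/set_memory.h>	/* set_pages_array_uc() */

static int make_pages_cpu_uncached(struct page **pages, int num_pages)
{
	/* Removing cacheability from the kernel linear map is what
	 * prevents step 2; IIRC the attribute change also flushes the
	 * affected cache lines, so nothing survives in the cache.
	 */
	return set_pages_array_uc(pages, num_pages);
}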

> LNL makes heavy use of non-coherent GPU mappings for performance.

That is even more surprising. At least on modern Ryzens that doesn't seem to have much of a performance impact any more.

I mean non-cached mappings were originally introduced to avoid the extra overhead of going over the front side bus, but that design is long gone.

> Second, IIRC vm_insert_pfn_prot() on X86 will override the given
> caching mode with the last caching mode set for the kernel linear map,
> so if you try to set up a write-combined GPU mapping without a previous
> call to set_pages_xxxxx it will actually end up cached, see
> track_pfn_insert().

That is exactly the same incorrect assumption I made as well.

It's not the linear mapping where that comes from but a separate page attribute table, see /sys/kernel/debug/x86/pat_memtype_list.

The question is: why the heck should we do this? I mean, we keep an extra rb tree around just to overwrite something the driver knows in the first place?

That is basically just tons of extra overhead for nothing as far as I can see.
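
For reference, this is roughly what that lookup does today (paraphrased from track_pfn_insert() in arch/x86/mm/pat/memtype.c, may differ between kernel versions):

void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, pfn_t pfn)
{
	enum page_cache_mode pcm;

	if (!pat_enabled())
		return;

	/* Look up what the PAT memtype rb tree has recorded for this
	 * physical address...
	 */
	pcm = lookup_memtype(pfn_t_to_phys(pfn));

	/* ...and overwrite the caching bits the caller asked for. */
	*prot = __pgprot((pgprot_val(*prot) & (~_PAGE_CACHE_MASK)) |
			 cachemode2protval(pcm));
}

So whatever caching mode the driver passes to vm_insert_pfn_prot() is silently replaced by the result of that rb tree lookup.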

Thanks for taking a look,
Christian.

> 
> /Thomas
> 
> 
>> ---
>>  drivers/gpu/drm/ttm/ttm_pool.c | 16 +++++++++++-----
>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>> index baf27c70a419..7487eac29398 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -38,7 +38,7 @@
>>  #include <linux/highmem.h>
>>  #include <linux/sched/mm.h>
>>
>> -#ifdef CONFIG_X86
>> +#ifdef CONFIG_X86_32
>>  #include <asm/set_memory.h>
>>  #endif
>>
>> @@ -46,6 +46,7 @@
>>  #include <drm/ttm/ttm_pool.h>
>>  #include <drm/ttm/ttm_tt.h>
>>  #include <drm/ttm/ttm_bo.h>
>> +#include <drm/drm_cache.h>
>>
>>  #include "ttm_module.h"
>>
>> @@ -192,7 +193,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>  	struct ttm_pool_dma *dma;
>>  	void *vaddr;
>>
>> -#ifdef CONFIG_X86
>> +#ifdef CONFIG_X86_32
>>  	/* We don't care that set_pages_wb is inefficient here. This is only
>>  	 * used when we have to shrink and CPU overhead is irrelevant then.
>>  	 */
>> @@ -218,7 +219,7 @@ static void ttm_pool_free_page(struct ttm_pool *pool, enum ttm_caching caching,
>>  /* Apply any cpu-caching deferred during page allocation */
>>  static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>  {
>> -#ifdef CONFIG_X86
>> +#ifdef CONFIG_X86_32
>>  	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>>
>>  	if (!num_pages)
>> @@ -232,6 +233,11 @@ static int ttm_pool_apply_caching(struct ttm_pool_alloc_state *alloc)
>>  	case ttm_uncached:
>>  		return set_pages_array_uc(alloc->caching_divide, num_pages);
>>  	}
>> +
>> +#elif defined(CONFIG_X86_64)
>> +	unsigned int num_pages = alloc->pages - alloc->caching_divide;
>> +
>> +	drm_clflush_pages(alloc->caching_divide, num_pages);
>>  #endif
>>  	alloc->caching_divide = alloc->pages;
>>  	return 0;
>> @@ -342,7 +348,7 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
>>  	if (pool->use_dma_alloc)
>>  		return &pool->caching[caching].orders[order];
>>
>> -#ifdef CONFIG_X86
>> +#ifdef CONFIG_X86_32
>>  	switch (caching) {
>>  	case ttm_write_combined:
>>  		if (pool->nid != NUMA_NO_NODE)
>> @@ -980,7 +986,7 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
>>  	    pool->use_dma_alloc || ttm_tt_is_backed_up(tt))
>>  		return -EBUSY;
>>
>> -#ifdef CONFIG_X86
>> +#ifdef CONFIG_X86_32
>>  	/* Anything returned to the system needs to be cached. */
>>  	if (tt->caching != ttm_cached)
>>  		set_pages_array_wb(tt->pages, tt->num_pages);
> 


