[PATCH] drm/ttm: Implement strict NUMA pool allocations
Bhardwaj, Rajneesh
rajneesh.bhardwaj at amd.com
Sat Mar 23 02:02:46 UTC 2024
On 3/22/2024 11:29 AM, Ruhl, Michael J wrote:
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces at lists.freedesktop.org> On Behalf Of
>> Rajneesh Bhardwaj
>> Sent: Friday, March 22, 2024 3:08 AM
>> To: amd-gfx at lists.freedesktop.org; dri-devel at lists.freedesktop.org
>> Cc: felix.kuehling at amd.com; alexander.deucher at amd.com;
>> christian.koenig at amd.com; Rajneesh Bhardwaj
>> <rajneesh.bhardwaj at amd.com>; Joe Greathouse
>> <joseph.greathouse at amd.com>
>> Subject: [PATCH] drm/ttm: Implement strict NUMA pool allocations
>>
>> This change gives TTM the flexibility to honor NUMA-localized
>> allocations, which can result in a significant performance improvement
>> on a multi-socket NUMA system. On GFXIP 9.4.3 based AMD APUs, this
>> change brings several benefits: not only a ~10% performance improvement
>> in certain benchmarks, but also more consistent and less sporadic
>> results, especially when NUMA balancing is not explicitly disabled. In
>> certain scenarios, workloads show run-to-run variability, e.g. HPL
>> would show a ~10x performance drop after running back to back 4-5
>> times and would recover on a subsequent run. This is seen with other
>> memory-intensive workloads too. When CPU caches were dropped, e.g. with
>> "sudo sysctl -w vm.drop_caches=1", the variability was reduced but
>> performance was still well below that of a good run.
>>
>> Use of the __GFP_THISNODE flag ensures that during memory allocation
>> the kernel prioritizes allocations from the local or closest NUMA node,
>> thereby reducing memory access latency. When memory is allocated with
>> __GFP_THISNODE, allocations will predominantly be satisfied from the
>> local node; consequently, the shrinkers may prioritize reclaiming
>> memory from caches associated with the local node to maintain memory
>> locality and minimize latency, thereby providing better shrinker
>> targeting.
>>
>> Reduced memory pressure on remote nodes can also indirectly influence
>> shrinker behavior by potentially reducing the frequency and intensity
>> of memory reclamation operations on remote nodes, which could improve
>> overall system performance.
>>
>> While this change could be beneficial in general, i.e. without the use
>> of a module parameter, in the absence of widespread testing limit it
>> to the AMD GFXIP 9.4.3 based TTM pool initializations only.
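For reference, with the interface added below, a per-partition pool that
is restricted to its own NUMA node would be set up roughly like this
(a minimal sketch; 'adev' and 'nid' are illustrative, not taken verbatim
from the patch):

	struct ttm_pool pool;

	/* The last argument enables the strict NUMA policy: page
	 * allocations for this pool then use __GFP_THISNODE and stay on
	 * 'nid' instead of falling back to remote nodes.
	 */
	ttm_pool_init(&pool, adev->dev, nid, false, false, true);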
>>
>>
>> Cc: Joe Greathouse <joseph.greathouse at amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 8 ++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 7 ++++++-
>> drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
>> drivers/gpu/drm/ttm/ttm_device.c | 2 +-
>> drivers/gpu/drm/ttm/ttm_pool.c | 7 ++++++-
>> include/drm/ttm/ttm_pool.h | 4 +++-
>> 7 files changed, 30 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 9c62552bec34..96532cfc6230 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -253,6 +253,7 @@ extern int amdgpu_user_partt_mode;
>> extern int amdgpu_agp;
>>
>> extern int amdgpu_wbrf;
>> +extern bool strict_numa_alloc;
>>
>> #define AMDGPU_VM_MAX_NUM_CTX 4096
>> #define AMDGPU_SG_THRESHOLD (256*1024*1024)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 80b9642f2bc4..a183a6b4493d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -781,6 +781,14 @@ int queue_preemption_timeout_ms = 9000;
>> module_param(queue_preemption_timeout_ms, int, 0644);
>> MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption
>> timeout in ms (1 = Minimum, 9000 = default)");
>>
>> +/**
>> + * DOC: strict_numa_alloc(bool)
>> + * Policy to force NUMA allocation requests from the proximity NUMA domain
>> only.
>> + */
>> +bool strict_numa_alloc;
>> +module_param(strict_numa_alloc, bool, 0444);
>> +MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests
>> to be satisfied from the closest node only (false = default)");
>> +
>> /**
>> * DOC: debug_evictions(bool)
>> * Enable extra debug messages to help determine the cause of evictions
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index b0ed10f4de60..a9f78f85e28c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1768,6 +1768,7 @@ static int amdgpu_ttm_reserve_tmr(struct
>> amdgpu_device *adev)
>>
>> static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>> {
>> + bool policy = true;
>> int i;
>>
>> if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
>> @@ -1779,11 +1780,15 @@ static int amdgpu_ttm_pools_init(struct
>> amdgpu_device *adev)
>> if (!adev->mman.ttm_pools)
>> return -ENOMEM;
>>
>> + /* Policy not only depends on the module param but also on the ASIC
>> + * setting use_strict_numa_alloc as well.
>> + */
>> for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>> ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>> adev->gmc.mem_partitions[i].numa.node,
>> - false, false);
>> + false, false, policy && strict_numa_alloc);
> why not just 'strict_numa_alloc'?
>
> Is 'policy' used somewhere else? Not sure this adds clarity...
Thanks Mike for the feedback. I could remove this; I just wasn't sure
about the initial reaction, so I kept the switch off by default to
minimize the impact outside of where we want this.
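If we drop it, the init loop would simply pass the module parameter
through, roughly (sketch only):

	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
			      adev->gmc.mem_partitions[i].numa.node,
			      false, false, strict_numa_alloc);
	}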
>
>> }
>> +
>> return 0;
>> }
>>
>> diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> index 2d9cae8cd984..6ff47aac570a 100644
>> --- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> +++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> @@ -87,7 +87,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct
>> kunit *test,
>> pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> - ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> + ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
> Is this really an acceptable interface (three non-named Booleans in a row)?
>
> I have no idea what "true, false, false" means.
>
> Would having a "flags" parameter and then
>
> USE_DMA_ALLOC BIT(0)
> USE_DMA32 BIT(1)
> USE_STRICT_NUMA BIT(2)
>
> Make this more readable?
I agree, I just followed the existing convention. If this patch gets
moving, I could send a follow-up patch to clean up and document this a
bit.
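For that follow-up, something along these lines is what I have in mind
(names are illustrative, not final):

	/* flags for ttm_pool_init() instead of a row of booleans */
	#define TTM_POOL_USE_DMA_ALLOC		BIT(0)
	#define TTM_POOL_USE_DMA32		BIT(1)
	#define TTM_POOL_USE_STRICT_NUMA	BIT(2)

	void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
			   int nid, unsigned int flags);

	/* e.g. the kunit call above would become:
	 *	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE,
	 *		      TTM_POOL_USE_DMA_ALLOC);
	 */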
>
> Or define your Booleans:
>
> USE_DMA_ALLOC true
> USE_DMA32 true
> USE_STRICT_NUMA true
>
> NO_DMA_ALLOC false
> NO_DMA32 false
> NO_STRICT_NUMA false
>
> So at a minimum, we might know what these parameters are?
>
> What is the relationship between this feature and the nid value?
>
> Is this value used for the allocations?
>
> If this is not NUMA_NO_NODE, would this do the same thing?
> (or is the STRICT flag the only way?)
>
> Just some thoughts,
>
> Mike
>
>> err = ttm_pool_alloc(pool, tt, &simple_ctx);
>> KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -152,7 +152,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
>> KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params-
>>> use_dma_alloc,
>> - false);
>> + false, false);
>>
>> KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
>> KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
>> @@ -219,7 +219,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct
>> kunit *test)
>> pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> - ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> + ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>>
>> err = ttm_pool_alloc(pool, tt, &simple_ctx);
>> KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit
>> *test)
>> pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> - ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> + ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>> ttm_pool_alloc(pool, tt, &simple_ctx);
>>
>> pt = &pool->caching[caching].orders[order];
>> @@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit
>> *test)
>> pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> - ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
>> + ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
>> ttm_pool_alloc(pool, tt, &simple_ctx);
>>
>> pt = &pool->caching[caching].orders[order];
>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c
>> b/drivers/gpu/drm/ttm/ttm_device.c
>> index f5187b384ae9..540e8a44f015 100644
>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>> @@ -215,7 +215,7 @@ int ttm_device_init(struct ttm_device *bdev, const
>> struct ttm_device_funcs *func
>>
>> ttm_sys_man_init(bdev);
>>
>> - ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>> use_dma32);
>> + ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>> use_dma32, false);
>>
>> bdev->vma_manager = vma_manager;
>> spin_lock_init(&bdev->lru_lock);
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
>> b/drivers/gpu/drm/ttm/ttm_pool.c
>> index dbc96984d331..73aafd06c361 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -447,6 +447,9 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct
>> ttm_tt *tt,
>> else
>> gfp_flags |= GFP_HIGHUSER;
>>
>> + if (pool->use_strict_numa_alloc)
>> + gfp_flags |= __GFP_THISNODE;
>> +
>> for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
>> num_pages;
>> order = min_t(unsigned int, order, __fls(num_pages))) {
>> @@ -555,7 +558,8 @@ EXPORT_SYMBOL(ttm_pool_free);
>> * Initialize the pool and its pool types.
>> */
>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> - int nid, bool use_dma_alloc, bool use_dma32)
>> + int nid, bool use_dma_alloc, bool use_dma32,
>> + bool use_strict_numa_alloc)
>> {
>> unsigned int i, j;
>>
>> @@ -565,6 +569,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct
>> device *dev,
>> pool->nid = nid;
>> pool->use_dma_alloc = use_dma_alloc;
>> pool->use_dma32 = use_dma32;
>> + pool->use_strict_numa_alloc = use_strict_numa_alloc;
>>
>> if (use_dma_alloc || nid != NUMA_NO_NODE) {
>> for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>> index 30a347e5aa11..6b7bdc952466 100644
>> --- a/include/drm/ttm/ttm_pool.h
>> +++ b/include/drm/ttm/ttm_pool.h
>> @@ -72,6 +72,7 @@ struct ttm_pool {
>>
>> bool use_dma_alloc;
>> bool use_dma32;
>> + bool use_strict_numa_alloc;
>>
>> struct {
>> struct ttm_pool_type orders[MAX_ORDER + 1];
>> @@ -83,7 +84,8 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt
>> *tt,
>> void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>>
>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> - int nid, bool use_dma_alloc, bool use_dma32);
>> + int nid, bool use_dma_alloc, bool use_dma32,
>> + bool use_strict_numa_alloc);
>> void ttm_pool_fini(struct ttm_pool *pool);
>>
>> int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>> --
>> 2.34.1