[Patch v2] drm/ttm: Schedule delayed_delete worker closer

Wed Nov 8 18:19:48 UTC 2023

On 2023-11-08 09:49, Christian König wrote:
> Am 08.11.23 um 13:58 schrieb Rajneesh Bhardwaj:
>> Try to allocate system memory on the NUMA node the device is closest to
>> and try to run delayed_delete workers on a CPU of this node as well.
>>
>> To optimize the memory clearing operation when a TTM BO gets freed by
>> the delayed_delete worker, scheduling it closer to a NUMA node where the
>> memory was initially allocated helps avoid the cases where the worker
>> gets randomly scheduled on the CPU cores that are across interconnect
>> boundaries such as xGMI, PCIe etc.
>>
>> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
>> APU platforms such as GFXIP9.4.3.
>>
>> Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
>
> Reviewed-by: Christian König <christian.koenig at amd.com>
>
> Going to push this to drm-misc-next.

Hold on. Rajneesh just pointed out a WARN regression from testing. I 
think the problem is that the bdev->wq is not unbound.

Regards,
   Felix

>
> Thanks,
> Christian.
>
>> ---
>>
>> Changes in v2:
>>   - Absorbed the feedback provided by Christian in the commit message 
>> and
>>     the comment.
>>
>>   drivers/gpu/drm/ttm/ttm_bo.c     | 8 +++++++-
>>   drivers/gpu/drm/ttm/ttm_device.c | 3 ++-
>>   2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
>> index 5757b9415e37..6f28a77a565b 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
>> @@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
>>               spin_unlock(&bo->bdev->lru_lock);
>>                 INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
>> -            queue_work(bdev->wq, &bo->delayed_delete);
>> +
>> +            /* Schedule the worker on the closest NUMA node. This
>> +             * improves performance since system memory might be
>> +             * cleared on free and that is best done on a CPU core
>> +             * close to it.
>> +             */
>> +            queue_work_node(bdev->pool.nid, bdev->wq, 
>> &bo->delayed_delete);
>>               return;
>>           }
>>   diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>> b/drivers/gpu/drm/ttm/ttm_device.c
>> index 43e27ab77f95..72b81a2ee6c7 100644
>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, 
>> struct ttm_device_funcs *funcs,
>>       bdev->funcs = funcs;
>>         ttm_sys_man_init(bdev);
>> -    ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, 
>> use_dma32);
>> +
>> +    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, 
>> use_dma32);
>>         bdev->vma_manager = vma_manager;
>>       spin_lock_init(&bdev->lru_lock);
>