[PATCH] drm/ttm: Schedule delayed_delete worker closer

Tue Nov 7 22:07:42 UTC 2023

On 2023-11-07 14:45, Rajneesh Bhardwaj wrote:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>

Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>


> ---
>   drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
>   drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
>   2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
>   			spin_unlock(&bo->bdev->lru_lock);
>   
>   			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -			queue_work(bdev->wq, &bo->delayed_delete);
> +			/* Schedule the worker on the closest NUMA node, if no
> +			 * CPUs are available, this falls back to any CPU core
> +			 * available system wide. This helps avoid the
> +			 * bottleneck to clear memory in cases where the worker
> +			 * is scheduled on a CPU which is remote to the node
> +			 * where the memory is getting freed.
> +			 */
> +
> +			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>   			return;
>   		}
>   
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>   	bdev->funcs = funcs;
>   
>   	ttm_sys_man_init(bdev);
> -	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>   
>   	bdev->vma_manager = vma_manager;
>   	spin_lock_init(&bdev->lru_lock);