[PATCH] drm/ttm: Schedule delayed_delete worker closer
Christian König
christian.koenig at amd.com
Wed Nov 8 08:36:50 UTC 2023
On 07.11.23 at 20:45, Rajneesh Bhardwaj wrote:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.
This needs more background, and it doesn't mention that we now try to
allocate the memory close to the device.
Something like this should work:
Try to allocate system memory on the NUMA node the device is closest to
and try to run delayed delete workers on a CPU of this node as well.
The reason for running the delayed delete worker on a NUMA node close
to that of the initial allocation is that the core memory management
might clear the memory on free, and that should probably be done on a
CPU close to it.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj at amd.com>
> ---
> drivers/gpu/drm/ttm/ttm_bo.c | 10 +++++++++-
> drivers/gpu/drm/ttm/ttm_device.c | 3 ++-
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
> spin_unlock(&bo->bdev->lru_lock);
>
> INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> - queue_work(bdev->wq, &bo->delayed_delete);
> + /* Schedule the worker on the closest NUMA node, if no
> + * CPUs are available, this falls back to any CPU core
> + * available system wide.
Mentioning that is superfluous: everybody can look at the
implementation, and it is obvious that a function which doesn't return
an error has a fallback.
> This helps avoid the
> + * bottleneck to clear memory in cases where the worker
> + * is scheduled on a CPU which is remote to the node
> + * where the memory is getting freed.
> + */
Rather write something like "This improves performance since system
memory might be cleared on free and that is best done on a CPU core
close to it."
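For readers unfamiliar with node-aware queueing, the behaviour under discussion can be modelled in plain userspace C roughly as below. This is only a sketch under stated assumptions, not the kernel implementation: struct model_cpu, model_select_cpu_near(), MODEL_NO_NODE and MODEL_CPU_ANY are hypothetical names made up for illustration, and the real selection logic in the workqueue code differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, loosely modelled on NUMA_NO_NODE and an
 * "unbound / any CPU" marker. They are not the kernel's definitions. */
#define MODEL_NO_NODE (-1)
#define MODEL_CPU_ANY (-1)

/* Hypothetical description of one CPU in the modelled system. */
struct model_cpu {
	int node;    /* NUMA node this CPU belongs to */
	bool online; /* whether work can currently run on it */
};

/*
 * Pick a CPU for work that should run close to 'node'.
 *
 * Returns the index of the first online CPU on that node. If no node
 * was requested, or the node has no online CPU, returns MODEL_CPU_ANY
 * so the caller may run the work on any CPU in the system.
 */
int model_select_cpu_near(const struct model_cpu *cpus, size_t ncpus, int node)
{
	assert(cpus != NULL || ncpus == 0);

	if (node == MODEL_NO_NODE)
		return MODEL_CPU_ANY;

	for (size_t i = 0; i < ncpus; i++) {
		if (cpus[i].online && cpus[i].node == node)
			return (int)i;
	}

	/* No usable CPU on the requested node: fall back to any CPU. */
	return MODEL_CPU_ANY;
}
```

In this model, passing the node the pool's memory was allocated on keeps the clearing work local where possible, while a device without a known node simply gets the old "any CPU" behaviour.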
Regards,
Christian.
> +
> + queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
> return;
> }
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
> bdev->funcs = funcs;
>
> ttm_sys_man_init(bdev);
> - ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> + ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>
> bdev->vma_manager = vma_manager;
> spin_lock_init(&bdev->lru_lock);