[PATCH 1/2] drm/xe_migrate: Switch from drm to dev managed actions
Matthew Auld
matthew.auld at intel.com
Fri Feb 28 10:21:59 UTC 2025
On 28/02/2025 06:52, Aradhya Bhatia wrote:
> Change the scope of the migrate subsystem to be dev managed instead of
> drm managed.
>
> The parent pci struct &device, that the xe struct &drm_device is a part
> of, gets removed when a hot unplug is triggered, which causes the
> underlying iommu group to get destroyed as well.
Nice find. But if that's the case then the migrate BO here is just one
of many where we will see this. Basically all system memory BOs can
suffer from this AFAICT, including userspace-owned ones. So I think we
need to rather solve this for all of them, and I don't think we can
really tie the lifetime of all BOs to devm, so we likely need a
different approach.
I think we might instead need to tear down all dma mappings when removing
the device, leaving the BOs intact. It looks like there is already a
helper for this, so maybe something roughly like this:
@@ -980,6 +980,8 @@ void xe_device_remove(struct xe_device *xe)
drm_dev_unplug(&xe->drm);
+ ttm_device_clear_dma_mappings(&xe->ttm);
+
xe_display_driver_remove(xe);
xe_heci_gsc_fini(xe);
But I don't think userptr will be covered by that, just all BOs, so we
likely need an extra step to also nuke all userptr dma mappings somewhere.
>
> The migrate subsystem, which handles the lifetime of the page-table tree
> (pt) BO, doesn't get a chance to put back the pt BO during the hot
> unplug, as not all the references to DRM have been put back yet.
> When all the references to DRM are indeed put back later, the migrate
> subsystem tries to put back the pt BO. Since the underlying iommu group
> has already been destroyed, a kernel NULL pointer dereference takes
> place while attempting to put back the pt BO.
>
> Signed-off-by: Aradhya Bhatia <aradhya.bhatia at intel.com>
> ---
> drivers/gpu/drm/xe/xe_migrate.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 278bc96cf593..4e23adfa208a 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -97,7 +97,7 @@ struct xe_exec_queue *xe_tile_migrate_exec_queue(struct xe_tile *tile)
> return tile->migrate->q;
> }
>
> -static void xe_migrate_fini(struct drm_device *dev, void *arg)
> +static void xe_migrate_fini(void *arg)
> {
> struct xe_migrate *m = arg;
>
> @@ -401,7 +401,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
> struct xe_vm *vm;
> int err;
>
> - m = drmm_kzalloc(&xe->drm, sizeof(*m), GFP_KERNEL);
> + m = devm_kzalloc(xe->drm.dev, sizeof(*m), GFP_KERNEL);
> if (!m)
> return ERR_PTR(-ENOMEM);
>
> @@ -455,7 +455,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
> might_lock(&m->job_mutex);
> fs_reclaim_release(GFP_KERNEL);
>
> - err = drmm_add_action_or_reset(&xe->drm, xe_migrate_fini, m);
> + err = devm_add_action_or_reset(xe->drm.dev, xe_migrate_fini, m);
> if (err)
> return ERR_PTR(err);
>