[PATCH v2] drm/xe/vf: Fix VM crash during VF driver release

Mon Jul 28 07:50:39 UTC 2025

On Mon, Jul 28, 2025 at 01:39:38AM -0600, K V P, Satyanarayana wrote:
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost at intel.com>
> > Sent: Monday, July 28, 2025 1:04 PM
> > To: K V P, Satyanarayana <satyanarayana.k.v.p at intel.com>
> > Cc: intel-xe at lists.freedesktop.org; Wajdeczko, Michal
> > <Michal.Wajdeczko at intel.com>; Auld, Matthew <matthew.auld at intel.com>;
> > Piorkowski, Piotr <piotr.piorkowski at intel.com>
> > Subject: Re: [PATCH v2] drm/xe/vf: Fix VM crash during VF driver release
> > 
> > On Mon, Jul 28, 2025 at 12:10:02PM +0530, Satyanarayana K V P wrote:
> > > The VF CCS save/restore series (patchwork #149108) has a dependency
> > > on the migration framework. A recent migration update in commit
> > > d65ff1ec8535 ("drm/xe: Split xe_migrate allocation from initialization")
> > > caused a VM crash during XE driver release for iGPU devices.
> > >
> > 
> > My strong view a revert, restructure to use single xe_migrate object
> > that initialized on driver load, but provides queues for the VF
> > migration needs.
> > 
> > Matt
> Hi Matt,
>  We had agreed to use multiple migration contexts earlier during the review of CCS save restore series.

Yes, I initially pushed back but eventually relented—that was my
mistake.

xe_migrate is a heavyweight kernel object and should be allocated in
exactly one place, not multiple. This series highlights how fragile it
becomes when that principle is violated.

More importantly, I was assured this series had been tested before I
merged it—it had not. That makes me reconsider letting my concerns
slide, and I don't believe a quick fix is appropriate here. Please
revert and properly address my original feedback.

Matt

> We can take this activity as an improvement (if needed in future), but do not think we need to revert 
> The whole series just for this new change (The series was approved with multiple contexts).
> 
> It is a new change which separated xe_migrate_alloc() and xe_migrate_init(). Now we are adopting the 
> CCS save restore series to the new changes. 
> 
> -Satya.
> > 
> > > Oops: general protection fault, probably for non-canonical address
> > 0x6b6b6b6b6b6b6b83: 0000 [#1] SMP NOPTI
> > > xe 0000:00:02.1: [drm:guc_ct_change_state [xe]] GT0: GuC CT
> > communication channel enabled
> > > CPU: 1 UID: 0 PID: 5998 Comm: bash Kdump: loaded Tainted: G     U  W
> > 6.16.0-xe+ #4 PREEMPT(voluntary)
> > > xe 0000:00:02.0: [drm:xe_sriov_pf_service_handshake_vf [xe]] PF: VF1
> > negotiated ABI version 1.0
> > > Tainted: [U]=USER, [W]=WARN
> > > Hardware name: Intel Corporation Panther Lake Client Platform/PTL-UH LP5
> > T3 RVP1, BIOS PTLPFWI1.R00.3214.D01.2505280450 05/28/2025
> > > RIP: 0010:xe_lrc_ring_head+0x12/0xb0 [xe]
> > > xe 0000:00:02.1: [drm:xe_gt_sriov_vf_connect [xe]] GT0: VF: using VF/PF
> > ABI 1.0
> > > Code: c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> > 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 83 ec 10 <f6> 47 18 02 74
> > 45 48 8b 47 38 48 8b 00 4c 8b 20 e8 69 db ff ff 80
> > > RSP: 0018:ffffc900043eb8d0 EFLAGS: 00010282
> > > RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX:
> > ffff88811a028c40
> > > RDX: 0000000000000000 RSI: ffff88811a028c68 RDI: 6b6b6b6b6b6b6b6b
> > > RBP: ffffc900043eb8f8 R08: ffffffffa1695dda R09: 0000000000000010
> > > xe 0000:00:02.1: [drm] GT0: reset done
> > > R10: 0000000000000000 R11: 0000000000000000 R12:
> > ffff88811a029c40
> > > R13: ffffc900043eb970 R14: ffff888119fc40c8 R15: ffff888119fc4548
> > > FS:  00007e03036bf740(0000) GS:ffff8884eb3b0000(0000)
> > knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00005af31328c000 CR3: 0000000114534004 CR4:
> > 0000000000f72ef0
> > > PKRU: 55555554
> > > Call Trace:
> > >  <TASK>
> > >  xe_sriov_vf_ccs_fini+0x1e/0x40 [xe]
> > >  devm_action_release+0x12/0x30
> > >  release_nodes+0x3a/0x120
> > >  devres_release_all+0x96/0xd0
> > >  device_unbind_cleanup+0x12/0x80
> > >  device_release_driver_internal+0x23a/0x280
> > >  device_release_driver+0x12/0x20
> > >  pci_stop_bus_device+0x69/0x90
> > >  pci_stop_and_remove_bus_device+0x12/0x30
> > >  pci_iov_remove_virtfn+0xbd/0x130
> > >  sriov_disable+0x42/0x100
> > >  pci_disable_sriov+0x34/0x50
> > >  xe_pci_sriov_configure+0xf71/0x1020 [xe]
> > >  ? sriov_numvfs_store+0x78/0x160
> > >  ? _parse_integer+0xe/0x20
> > >  ? _kstrtoull+0x3a/0xa0
> > >  sriov_numvfs_store+0x10c/0x160
> > >  dev_attr_store+0x14/0x40
> > >  sysfs_kf_write+0x4a/0x80
> > >  kernfs_fop_write_iter+0x166/0x220
> > >  vfs_write+0x284/0x550
> > >  ksys_write+0x6f/0xf0
> > >  __x64_sys_write+0x19/0x30
> > >  x64_sys_call+0x2bf/0x2660
> > >  do_syscall_64+0x93/0x1320
> > >  ? ksys_dup3+0xba/0x170
> > >  ? __x64_sys_dup2+0x2e/0x1f0
> > >  ? do_syscall_64+0x1a2/0x1320
> > >  ? __rseq_handle_notify_resume+0x447/0x5e0
> > >  ? trace_hardirqs_off+0x5a/0xd0
> > >  ? irqentry_exit_to_user_mode+0xdf/0x2f0
> > >  ? irqentry_exit+0x77/0xb0
> > >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > RIP: 0033:0x7e030351c574
> > > Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f
> > 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77
> > 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
> > > RSP: 002b:00007ffd34a10798 EFLAGS: 00000202 ORIG_RAX:
> > 0000000000000001
> > > RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007e030351c574
> > > RDX: 0000000000000002 RSI: 00005af318d7da30 RDI:
> > 0000000000000001
> > > RBP: 00007ffd34a107c0 R08: 0000000000000073 R09:
> > 0000000000000000
> > > R10: 00000000ffffffff R11: 0000000000000202 R12: 0000000000000002
> > > R13: 00005af318d7da30 R14: 00007e03036045c0 R15:
> > 00007e0303601ee0
> > >  </TASK>
> > >
> > > Update the VF CCS migration initialization sequence to align with the new
> > > migration framework changes, resolving the release-time crash.
> > >
> > > Fixes: f3009272ff2e ("drm/xe/vf: Create contexts for CCS read write")
> > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> > > Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > Cc: Piotr Piórkowski <piotr.piorkowski at intel.com>
> > >
> > > ---
> > > V1 -> V2:
> > > - Updated xe_migrate_init() input arguments and return type (Michal
> > Wajdeczko)
> > > - Fixed review comments.
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c           |  6 ++----
> > >  drivers/gpu/drm/xe/xe_migrate.c      | 22 +++++++++++-----------
> > >  drivers/gpu/drm/xe/xe_migrate.h      |  2 +-
> > >  drivers/gpu/drm/xe/xe_sriov_vf_ccs.c |  7 ++++++-
> > >  4 files changed, 20 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index c8eda36546d3..5a79c6e3208b 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -564,11 +564,9 @@ static int gt_init_with_all_forcewake(struct xe_gt
> > *gt)
> > >  	if (xe_gt_is_main_type(gt)) {
> > >  		struct xe_tile *tile = gt_to_tile(gt);
> > >
> > > -		tile->migrate = xe_migrate_init(tile);
> > > -		if (IS_ERR(tile->migrate)) {
> > > -			err = PTR_ERR(tile->migrate);
> > > +		err = xe_migrate_init(tile->migrate);
> > > +		if (err)
> > >  			goto err_force_wake;
> > > -		}
> > >  	}
> > >
> > >  	err = xe_uc_load_hw(&gt->uc);
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c
> > b/drivers/gpu/drm/xe/xe_migrate.c
> > > index 90065d7d29ff..7e295ca00763 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > @@ -396,15 +396,15 @@ struct xe_migrate *xe_migrate_alloc(struct xe_tile
> > *tile)
> > >
> > >  /**
> > >   * xe_migrate_init() - Initialize a migrate context
> > > - * @tile: Back-pointer to the tile we're initializing for.
> > > + * @m: The migration context
> > >   *
> > > - * Return: Pointer to a migrate context on success. Error pointer on error.
> > > + * Return: 0 if successful, negative error code on failure
> > >   */
> > > -struct xe_migrate *xe_migrate_init(struct xe_tile *tile)
> > > +int xe_migrate_init(struct xe_migrate *m)
> > >  {
> > > -	struct xe_device *xe = tile_to_xe(tile);
> > > +	struct xe_tile *tile = m->tile;
> > >  	struct xe_gt *primary_gt = tile->primary_gt;
> > > -	struct xe_migrate *m = tile->migrate;
> > > +	struct xe_device *xe = tile_to_xe(tile);
> > >  	struct xe_vm *vm;
> > >  	int err;
> > >
> > > @@ -412,14 +412,14 @@ struct xe_migrate *xe_migrate_init(struct xe_tile
> > *tile)
> > >  	vm = xe_vm_create(xe, XE_VM_FLAG_MIGRATION |
> > >  			  XE_VM_FLAG_SET_TILE_ID(tile));
> > >  	if (IS_ERR(vm))
> > > -		return ERR_CAST(vm);
> > > +		return PTR_ERR(vm);
> > >
> > >  	xe_vm_lock(vm, false);
> > >  	err = xe_migrate_prepare_vm(tile, m, vm);
> > >  	xe_vm_unlock(vm);
> > >  	if (err) {
> > >  		xe_vm_close_and_put(vm);
> > > -		return ERR_PTR(err);
> > > +		return err;
> > >  	}
> > >
> > >  	if (xe->info.has_usm) {
> > > @@ -430,7 +430,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile
> > *tile)
> > >  		u32 logical_mask =
> > xe_migrate_usm_logical_mask(primary_gt);
> > >
> > >  		if (!hwe || !logical_mask)
> > > -			return ERR_PTR(-EINVAL);
> > > +			return -EINVAL;
> > >
> > >  		/*
> > >  		 * XXX: Currently only reserving 1 (likely slow) BCS instance on
> > > @@ -450,7 +450,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile
> > *tile)
> > >  	}
> > >  	if (IS_ERR(m->q)) {
> > >  		xe_vm_close_and_put(vm);
> > > -		return ERR_CAST(m->q);
> > > +		return PTR_ERR(m->q);
> > >  	}
> > >
> > >  	mutex_init(&m->job_mutex);
> > > @@ -460,7 +460,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile
> > *tile)
> > >
> > >  	err = devm_add_action_or_reset(xe->drm.dev, xe_migrate_fini, m);
> > >  	if (err)
> > > -		return ERR_PTR(err);
> > > +		return err;
> > >
> > >  	if (IS_DGFX(xe)) {
> > >  		if (xe_migrate_needs_ccs_emit(xe))
> > > @@ -475,7 +475,7 @@ struct xe_migrate *xe_migrate_init(struct xe_tile
> > *tile)
> > >  			(unsigned long long)m->min_chunk_size);
> > >  	}
> > >
> > > -	return m;
> > > +	return err;
> > >  }
> > >
> > >  static u64 max_mem_transfer_per_pass(struct xe_device *xe)
> > > diff --git a/drivers/gpu/drm/xe/xe_migrate.h
> > b/drivers/gpu/drm/xe/xe_migrate.h
> > > index 3758f9615484..e81ea6b27fb5 100644
> > > --- a/drivers/gpu/drm/xe/xe_migrate.h
> > > +++ b/drivers/gpu/drm/xe/xe_migrate.h
> > > @@ -105,7 +105,7 @@ struct xe_migrate_pt_update {
> > >  };
> > >
> > >  struct xe_migrate *xe_migrate_alloc(struct xe_tile *tile);
> > > -struct xe_migrate *xe_migrate_init(struct xe_tile *tile);
> > > +int xe_migrate_init(struct xe_migrate *m);
> > >
> > >  struct dma_fence *xe_migrate_to_vram(struct xe_migrate *m,
> > >  				     unsigned long npages,
> > > diff --git a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
> > b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
> > > index af43e04179aa..bf9fa1238462 100644
> > > --- a/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
> > > +++ b/drivers/gpu/drm/xe/xe_sriov_vf_ccs.c
> > > @@ -270,11 +270,16 @@ int xe_sriov_vf_ccs_init(struct xe_device *xe)
> > >  		ctx = &tile->sriov.vf.ccs[ctx_id];
> > >  		ctx->ctx_id = ctx_id;
> > >
> > > -		migrate = xe_migrate_init(tile);
> > > +		migrate = xe_migrate_alloc(tile);
> > >  		if (IS_ERR(migrate)) {
> > >  			err = PTR_ERR(migrate);
> > >  			goto err_ret;
> > >  		}
> > > +
> > > +		err = xe_migrate_init(migrate);
> > > +		if (err)
> > > +			goto err_ret;
> > > +
> > >  		ctx->migrate = migrate;
> > >
> > >  		err = alloc_bb_pool(tile, ctx);
> > > --
> > > 2.43.0
> > >