[PATCH 1/2] drm/xe: Process deferred GGTT node removals on device unwind

Wed Jun 25 14:20:53 UTC 2025

Hey,

On 2025-06-13 00:09, Michal Wajdeczko wrote:
> While we are indirectly draining our dedicated workqueue ggtt->wq
> that we use to complete asynchronous removal of some GGTT nodes,
> this happens as part of the managed-drm unwinding (ggtt_fini_early),
> which could be later than the managed-device unwinding, where we could
> have already unmapped our MMIO/GSM mapping (mmio_fini).
> 
> This was recently observed during unsuccessful VF initialization:
> 
>  [ ] xe 0000:00:02.1: probe with driver xe failed with error -62
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e747340 __xe_bo_unpin_map_no_vm (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e747540 __xe_bo_unpin_map_no_vm (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e747240 __xe_bo_unpin_map_no_vm (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e747040 tiles_fini (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e746840 mmio_fini (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e747f40 xe_bo_pinned_fini (16 bytes)
>  [ ] xe 0000:00:02.1: DEVRES REL ffff88811e746b40 devm_drm_dev_init_release (16 bytes)
>  [ ] xe 0000:00:02.1: [drm:drm_managed_release] drmres release begin
>  [ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef81640 __fini_relay (8 bytes)
>  [ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80d40 guc_ct_fini (8 bytes)
>  [ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80040 __drmm_mutex_release (8 bytes)
>  [ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80140 ggtt_fini_early (8 bytes)
> 
> and this was leading to:
> 
>  [ ] BUG: unable to handle page fault for address: ffffc900058162a0
>  [ ] #PF: supervisor write access in kernel mode
>  [ ] #PF: error_code(0x0002) - not-present page
>  [ ] Oops: Oops: 0002 [#1] SMP NOPTI
>  [ ] Tainted: [W]=WARN
>  [ ] Workqueue: xe-ggtt-wq ggtt_node_remove_work_func [xe]
>  [ ] RIP: 0010:xe_ggtt_set_pte+0x6d/0x350 [xe]
>  [ ] Call Trace:
>  [ ]  <TASK>
>  [ ]  xe_ggtt_clear+0xb0/0x270 [xe]
>  [ ]  ggtt_node_remove+0xbb/0x120 [xe]
>  [ ]  ggtt_node_remove_work_func+0x30/0x50 [xe]
>  [ ]  process_one_work+0x22b/0x6f0
>  [ ]  worker_thread+0x1e8/0x3d
> 
> Add a managed-device action that will explicitly drain the workqueue
> with all pending node removals prior to releasing the MMIO/GSM mapping.
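
To make the ordering in the log above concrete, here is a toy userspace
model of it. This is only a sketch, not xe code: every name in it
(simulate_unwind, model_mmio_fini, model_dev_drain, ...) is hypothetical,
and it assumes only what the log shows, i.e. that device-managed actions
are released before drm-managed ones, and that each list is released in
reverse registration order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the xe driver. */
#define MAX_ACTIONS 8

typedef void (*action_fn)(void);

struct action_list {
	action_fn fn[MAX_ACTIONS];
	size_t count;
};

static bool mmio_mapped;      /* stands in for the MMIO/GSM mapping */
static int pending_removals;  /* node removals still queued on the wq */
static int faults;            /* removals that ran with MMIO unmapped */

static void add_action(struct action_list *l, action_fn fn)
{
	assert(l->count < MAX_ACTIONS);
	l->fn[l->count++] = fn;
}

/* Release in reverse registration order (assumed LIFO unwinding). */
static void release_all(struct action_list *l)
{
	while (l->count)
		l->fn[--l->count]();
}

/* Stand-in for draining the workqueue: each removal writes PTEs. */
static void drain_removals(void)
{
	while (pending_removals > 0) {
		if (!mmio_mapped)
			faults++;
		pending_removals--;
	}
}

static void model_mmio_fini(void)       { mmio_mapped = false; }
static void model_dev_drain(void)       { drain_removals(); }
static void model_ggtt_fini_early(void) { drain_removals(); }

/* Returns how many removals touched the mapping after it was gone. */
int simulate_unwind(bool with_device_drain)
{
	struct action_list dev = { 0 }, drm = { 0 };

	mmio_mapped = true;
	pending_removals = 2;
	faults = 0;

	add_action(&dev, model_mmio_fini);
	add_action(&drm, model_ggtt_fini_early);
	if (with_device_drain)
		add_action(&dev, model_dev_drain); /* runs before mmio_fini */

	release_all(&dev); /* device-managed unwinding comes first */
	release_all(&drm); /* drm-managed unwinding comes later */
	return faults;
}
```

In this model, simulate_unwind(false) reports both queued removals running
after the mapping is gone, while simulate_unwind(true) reports none,
because the drain action was registered after the mapping and is therefore
released before it.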

I'm curious, would it not make sense to call drain_workqueue() inside the
ggtt_fini_early() callback mentioned above, instead of adding a new devm
action?

Kind regards,
~Maarten