[PATCH] drm/xe: Fix oops in xe_gem_fault when running core_hotunplug test.
Matthew Auld
matthew.auld at intel.com
Tue Jul 15 15:54:24 UTC 2025
On 15/07/2025 16:20, Maarten Lankhorst wrote:
> I saw an oops in xe_gem_fault when running the xe-fast-feedback
> testlist against the realtime kernel without debug options enabled.
>
> The panic happens after core_hotunplug unbind-rebind finishes.
> Presumably what happens is that a process mmaps, the fault handler
> unlocks because of the FAULT_FLAG_RETRY_NOWAIT logic, and the process
> has no memory left, so ttm_bo_vm_dummy_page() returns VM_FAULT_NOPAGE
> since there was nothing left to populate. It then oopses on
> "mem_type_is_vram(tbo->resource->mem_type)" because tbo->resource
> is NULL.
>
> It's convoluted, but fits the data and explains the oops after
> the test exits.
Yeah, it looks like on unplug you can indeed end up with a NULL
placement according to xe_evict_flags (I think we also evict everything
on unplug with xe_bo_pci_dev_remove_all), where the purge placement
might result in pipeline gutting being called on the BO, AFAICT. If the
fault is then triggered on such a BO after the unplug, that explains
the NULL pointer dereference.
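For reference, the unplug path I have in mind looks roughly like this
(paraphrased from memory rather than quoted from xe_bo.c, so treat the
details as approximate):

    /* Sketch: eviction placement selection on device unplug. */
    static void xe_evict_flags(struct ttm_buffer_object *tbo,
                               struct ttm_placement *placement)
    {
            struct xe_device *xe = ttm_to_xe_device(tbo->bdev);

            if (drm_dev_is_unplugged(&xe->drm)) {
                    /*
                     * Purge placement: an empty placement list. With no
                     * placements left, ttm_bo_validate() takes the
                     * ttm_bo_pipeline_gutting() path, which drops the
                     * backing store and leaves tbo->resource == NULL.
                     */
                    *placement = purge_placement;
                    return;
            }
            /* ... normal placement selection otherwise ... */
    }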
>
> Signed-off-by: Maarten Lankhorst <dev at lankhorst.se>
Reviewed-by: Matthew Auld <matthew.auld at intel.com>
> ---
> drivers/gpu/drm/xe/xe_bo.c | 28 ++++++++++++++++------------
> 1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index c5a5154e1363b..68aefb6f23cde 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -1716,22 +1716,26 @@ static vm_fault_t xe_gem_fault(struct vm_fault *vmf)
> ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
> TTM_BO_VM_NUM_PREFAULT);
> drm_dev_exit(idx);
> +
> + if (ret == VM_FAULT_RETRY &&
> + !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> + goto out;
> +
> + /*
> + * ttm_bo_vm_reserve() already has dma_resv_lock.
> + */
> + if (ret == VM_FAULT_NOPAGE &&
> + mem_type_is_vram(tbo->resource->mem_type)) {
> + mutex_lock(&xe->mem_access.vram_userfault.lock);
> + if (list_empty(&bo->vram_userfault_link))
> + list_add(&bo->vram_userfault_link,
> + &xe->mem_access.vram_userfault.list);
> + mutex_unlock(&xe->mem_access.vram_userfault.lock);
> + }
> } else {
> ret = ttm_bo_vm_dummy_page(vmf, vmf->vma->vm_page_prot);
> }
>
> - if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
> - goto out;
> - /*
> - * ttm_bo_vm_reserve() already has dma_resv_lock.
> - */
> - if (ret == VM_FAULT_NOPAGE && mem_type_is_vram(tbo->resource->mem_type)) {
> - mutex_lock(&xe->mem_access.vram_userfault.lock);
> - if (list_empty(&bo->vram_userfault_link))
> - list_add(&bo->vram_userfault_link, &xe->mem_access.vram_userfault.list);
> - mutex_unlock(&xe->mem_access.vram_userfault.lock);
> - }
> -
> dma_resv_unlock(tbo->base.resv);
> out:
> if (needs_rpm)
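To spell out the oops for anyone following along, the pre-patch flow
(reconstructed from the removed lines above; the drm_dev_enter()
condition is from memory) was roughly:

    if (drm_dev_enter(ddev, &idx)) {
            ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
                                           TTM_BO_VM_NUM_PREFAULT);
            drm_dev_exit(idx);
    } else {
            /* Unplugged device: the BO may already be gutted here. */
            ret = ttm_bo_vm_dummy_page(vmf, vmf->vma->vm_page_prot);
    }

    if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
            goto out;
    /*
     * VM_FAULT_NOPAGE from the dummy-page path also lands here, and
     * after gutting tbo->resource is NULL, hence the oops.
     */
    if (ret == VM_FAULT_NOPAGE && mem_type_is_vram(tbo->resource->mem_type))

With the bookkeeping moved inside the drm_dev_enter() section, the
dummy-page path can no longer reach the tbo->resource dereference.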