[Intel-xe] [PATCH] drm/xe: Fix potential deadlock handling page faults

Mon Mar 20 11:31:11 UTC 2023

Hey,

On 2023-03-18 04:28, Matthew Brost wrote:
> Within a class, the GuC halts scheduling if the head of the queue can't
> be scheduled, so the whole queue blocks. This can lead to deadlock if
> BCS0-7 all have faults and another engine on BCS0-7 is at the head of the
> GuC scheduling queue, as the migration engine used to fix the fault will
> be blocked. To work around this, set the migration engine to the highest
> priority when servicing page faults.
>
> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> ---
>   drivers/gpu/drm/xe/xe_gt_pagefault.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> index 76ec40567a78..8fad6e60f826 100644
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> @@ -106,6 +106,7 @@ static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
>   static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   {
>   	struct xe_device *xe = gt_to_xe(gt);
> +	struct xe_engine *e = xe_gt_migrate_engine(gt);
>   	struct xe_vm *vm;
>   	struct xe_vma *vma = NULL;
>   	struct xe_bo *bo;
> @@ -185,6 +186,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   	if (ret)
>   		goto unlock_vm;
>   
> +	e->ops->set_priority(e, DRM_SCHED_PRIORITY_KERNEL);
>   	if (atomic) {
>   		if (xe_vma_is_userptr(vma)) {
>   			ret = -EACCES;
> @@ -204,7 +206,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   
>   	/* Bind VMA only to the GT that has faulted */
>   	trace_xe_vma_pf_bind(vma);
> -	fence = __xe_pt_bind_vma(gt, vma, xe_gt_migrate_engine(gt), NULL, 0,
> +	fence = __xe_pt_bind_vma(gt, vma, e, NULL, 0,
>   				 vma->gt_present & BIT(gt->info.id));
>   	if (IS_ERR(fence)) {
>   		ret = PTR_ERR(fence);
> @@ -218,6 +220,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
>   	 */
>   	dma_fence_wait(fence, false);
>   	dma_fence_put(fence);
> +	e->ops->set_priority(e, DRM_SCHED_PRIORITY_NORMAL);

Could we just keep the priority of the migrate engine at KERNEL? As
written, one pagefault can reset the priority to NORMAL while another
pagefault still depends on the KERNEL bump, unless you protect the
priority with a lock.
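
To make the concern concrete, here is a minimal userspace sketch, not driver
code. It assumes two pagefault handlers can overlap on the same migrate
engine. Every name in it (struct engine, naive_*, counted_*) is hypothetical
and only models the set_priority() calls in the patch. The counted variant is
one possible alternative to a plain lock, and in real code the counter update
itself would still need a lock or atomics.

```c
#include <assert.h>

/* Hypothetical stand-ins for the scheduler priorities, not the DRM enum. */
enum prio { PRIO_NORMAL, PRIO_KERNEL };

/* Hypothetical model of the shared migrate engine. */
struct engine {
	enum prio prio;
	int boost_users;	/* handlers that currently need the boost */
};

/* Models the patch: unconditional bump, unconditional restore. */
static void naive_begin(struct engine *e) { e->prio = PRIO_KERNEL; }
static void naive_end(struct engine *e)   { e->prio = PRIO_NORMAL; }

/* Counted variant: only the first user bumps, only the last one restores. */
static void counted_begin(struct engine *e)
{
	if (e->boost_users++ == 0)
		e->prio = PRIO_KERNEL;
}

static void counted_end(struct engine *e)
{
	assert(e->boost_users > 0);
	if (--e->boost_users == 0)
		e->prio = PRIO_NORMAL;
}

/*
 * Interleaving: fault A starts, fault B starts, fault A finishes.
 * Returns the priority that B, still waiting on its fence, runs with.
 */
static enum prio naive_interleaving(void)
{
	struct engine e = { PRIO_NORMAL, 0 };

	naive_begin(&e);	/* fault A bumps */
	naive_begin(&e);	/* fault B bumps */
	naive_end(&e);		/* fault A restores while B is in flight */
	return e.prio;
}

static enum prio counted_interleaving(void)
{
	struct engine e = { PRIO_NORMAL, 0 };

	counted_begin(&e);	/* fault A bumps */
	counted_begin(&e);	/* fault B only takes a reference */
	counted_end(&e);	/* fault A drops its reference, B still holds one */
	return e.prio;
}
```

In the naive interleaving the second fault ends up at NORMAL priority while
it is still in flight, which is the case the patch is meant to avoid. Keeping
the engine at KERNEL permanently sidesteps the bookkeeping entirely.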

~Maarten