[Intel-xe] [PATCH] drm/xe: Fix potential deadlock handling page faults

Matthew Brost matthew.brost at intel.com
Mon Mar 20 17:14:00 UTC 2023


On Mon, Mar 20, 2023 at 12:31:11PM +0100, Maarten Lankhorst wrote:
> Hey,
> 
> On 2023-03-18 04:28, Matthew Brost wrote:
> > Within a class, the GuC halts scheduling when the head of the queue
> > can't be scheduled, so the whole queue blocks. This can lead to
> > deadlock if BCS0-7 all have faults and another engine on BCS0-7 is at
> > the head of the GuC scheduling queue, as the migration engine used to
> > fix the fault will be blocked. To work around this, set the migration
> > engine to the highest priority when servicing page faults.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> >   drivers/gpu/drm/xe/xe_gt_pagefault.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index 76ec40567a78..8fad6e60f826 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -106,6 +106,7 @@ static struct xe_vma *lookup_vma(struct xe_vm *vm, u64 page_addr)
> >   static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> >   {
> >   	struct xe_device *xe = gt_to_xe(gt);
> > +	struct xe_engine *e = xe_gt_migrate_engine(gt);
> >   	struct xe_vm *vm;
> >   	struct xe_vma *vma = NULL;
> >   	struct xe_bo *bo;
> > @@ -185,6 +186,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> >   	if (ret)
> >   		goto unlock_vm;
> > +	e->ops->set_priority(e, DRM_SCHED_PRIORITY_KERNEL);
> >   	if (atomic) {
> >   		if (xe_vma_is_userptr(vma)) {
> >   			ret = -EACCES;
> > @@ -204,7 +206,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> >   	/* Bind VMA only to the GT that has faulted */
> >   	trace_xe_vma_pf_bind(vma);
> > -	fence = __xe_pt_bind_vma(gt, vma, xe_gt_migrate_engine(gt), NULL, 0,
> > +	fence = __xe_pt_bind_vma(gt, vma, e, NULL, 0,
> >   				 vma->gt_present & BIT(gt->info.id));
> >   	if (IS_ERR(fence)) {
> >   		ret = PTR_ERR(fence);
> > @@ -218,6 +220,7 @@ static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> >   	 */
> >   	dma_fence_wait(fence, false);
> >   	dma_fence_put(fence);
> > +	e->ops->set_priority(e, DRM_SCHED_PRIORITY_NORMAL);
> 
> Could we just keep the priority of the migrate engine at KERNEL? This change
> is prone to undo the same priority bump by another pagefault, unless you
> protect the priority with a lock.
> 

Ah, yes, if you have multiple VMs, multiple threads can be processing
faults at the same time; good catch. For USM devices, since the migrate
engine is a reserved physical engine, just setting it to kernel priority
permanently seems reasonable. Will change.

Matt

> ~Maarten
> 

