[PATCH v7 07/13] drm/xe/hw_engine_group: Add helper to wait for dma fence jobs

Thu Aug 8 03:05:50 UTC 2024

On Wed, Aug 07, 2024 at 06:23:36PM +0200, Francois Dugast wrote:
> This is a required feature for faulting long running jobs not to be
> submitted while dma fence jobs are running on the hw engine group.
> 
> v2: Switch to lockdep_assert_held_write in worker, get a proper reference
>     for the last fence (Matt Brost)
> 
> Signed-off-by: Francois Dugast <francois.dugast at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_hw_engine_group.c | 33 +++++++++++++++++++++++++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_hw_engine_group.c b/drivers/gpu/drm/xe/xe_hw_engine_group.c
> index 3f74ff577a4c..955451960a3d 100644
> --- a/drivers/gpu/drm/xe/xe_hw_engine_group.c
> +++ b/drivers/gpu/drm/xe/xe_hw_engine_group.c
> @@ -180,3 +180,36 @@ static void xe_hw_engine_group_suspend_faulting_lr_jobs(struct xe_hw_engine_grou
>  		q->ops->suspend_wait(q);
>  	}
>  }
> +
> +/**
> + * xe_hw_engine_group_wait_for_dma_fence_jobs() - Wait for dma fence jobs to complete
> + * @group: The hw engine group
> + *
> + * This function is not meant to be called directly from a user IOCTL as dma_fence_wait()
> + * is not interruptible.
> + *
> + * Return: 0 on success,
> + *	   -ETIME if waiting for one job failed
> + */
> +static int xe_hw_engine_group_wait_for_dma_fence_jobs(struct xe_hw_engine_group *group)
> +{
> +	long timeout;
> +	struct xe_exec_queue *q;
> +	struct dma_fence *fence;
> +
> +	lockdep_assert_held_write(&group->mode_sem);
> +
> +	list_for_each_entry(q, &group->exec_queue_list, hw_engine_group_link) {
> +		if (xe_vm_in_lr_mode(q->vm))
> +			continue;
> +
> +		fence = xe_exec_queue_last_fence_get_for_resume(q, q->vm);
> +		timeout = dma_fence_wait(fence, false);
> +		xe_exec_queue_last_fence_put_for_resume(q, q->vm);

Missed this earlier.

s/xe_exec_queue_last_fence_put_for_resume/dma_fence_put

xe_exec_queue_last_fence_get_for_resume takes a reference to a fence,
which can be dropped via dma_fence_put. I think this might be the source
of the CI failures [1] [2] too. But neither DG2 nor ADL should be
triggering this code path unless something else is going wrong. Can you
look into these CI failures too?
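
To illustrate the get/put pairing, here is a minimal userspace mock. It is
not kernel code: every mock_* name is hypothetical and only stands in for
the reference counting, under the assumption that the "get" helper returns
the fence with one extra reference that the caller must drop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct dma_fence: only a reference count. */
struct mock_fence {
	int refcount;
};

/* Hypothetical stand-in for an exec queue that holds its last fence. */
struct mock_queue {
	struct mock_fence *last_fence;
};

/* Mimics a "get" helper: returns the last fence with an extra reference. */
static struct mock_fence *mock_last_fence_get(struct mock_queue *q)
{
	q->last_fence->refcount++;
	return q->last_fence;
}

/* Mimics dma_fence_put(): drops one reference held by the caller. */
static void mock_fence_put(struct mock_fence *fence)
{
	assert(fence->refcount > 0);
	fence->refcount--;
}

/* Mimics a wait that always succeeds (assumption made for this sketch). */
static long mock_fence_wait(struct mock_fence *fence)
{
	(void)fence;
	return 0;
}

/* Loop pattern: get a reference, wait, then put that same reference. */
static int mock_wait_for_jobs(struct mock_queue *queues, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		struct mock_fence *fence = mock_last_fence_get(&queues[i]);
		long ret = mock_fence_wait(fence);

		mock_fence_put(fence);	/* balances the get above */

		if (ret < 0)
			return -1;
	}

	return 0;
}
```

In this sketch, each fence's reference count is the same after the loop as
before it, because every get is matched by a put on the returned fence.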

Matt

[1] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-136192v7/shard-dg2-433/igt@xe_module_load@reload.html
[2] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-136192v7/shard-adlp-1/igt@xe_module_load@unload.html

> +
> +		if (timeout < 0)
> +			return -ETIME;
> +	}
> +
> +	return 0;
> +}
> -- 
> 2.43.0
>