[PATCH 3/3] drm/xe: Stop accumulating LRC timestamp on job_free
Cavitt, Jonathan
jonathan.cavitt at intel.com
Mon Oct 28 20:29:29 UTC 2024
-----Original Message-----
From: De Marchi, Lucas <lucas.demarchi at intel.com>
Sent: Saturday, October 26, 2024 10:09 AM
To: intel-xe at lists.freedesktop.org
Cc: Cavitt, Jonathan <jonathan.cavitt at intel.com>; Nerlige Ramappa, Umesh <umesh.nerlige.ramappa at intel.com>; De Marchi, Lucas <lucas.demarchi at intel.com>
Subject: [PATCH 3/3] drm/xe: Stop accumulating LRC timestamp on job_free
>
> The exec queue timestamp is only really useful when it's being queried
> through the fdinfo. There's no need to update it so often, on every
> job_free. Tracing a simple app like vkcube while it runs shows an
> update rate of ~120 Hz.
>
> The update on job_free() is used to cover a gap: if an exec
> queue is created and destroyed rapidly, before a new query, the
> timestamp still needs to be accumulated and accounted for on the xef.
> Initial implementation in commit 6109f24f87d7 ("drm/xe: Add helper to
> accumulate exec queue runtime") couldn't do it on the exec_queue_fini
> since the xef could be gone at that point. However since commit
> ce8c161cbad4 ("drm/xe: Add ref counting for xe_file") the xef is
> refcounted and the exec queue has a reference.
>
> Improve the fix in commit 2149ded63079 ("drm/xe: Fix use after free when
> client stats are captured") by reducing the frequency in which the
> update is needed.
>
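For context, the lifetime argument here, as I understand it, is roughly the
sketch below (written from memory; the exact field and helper names may not
match the tree):

	/*
	 * Sketch only: the queue is assumed to hold a reference on its
	 * xe_file, so xef stays valid until the queue is freed, which is
	 * what makes accumulating ticks at fini time safe.
	 */
	static void example_exec_queue_bind_xef(struct xe_exec_queue *q,
						struct xe_file *xef)
	{
		q->xef = xe_file_get(xef);	/* taken at queue creation */
	}

	static void example_exec_queue_release_xef(struct xe_exec_queue *q)
	{
		/* xef is still valid here, so ticks can be folded into it */
		xe_exec_queue_update_run_ticks(q);
		xe_file_put(q->xef);	/* dropped only after accumulation */
	}
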
> Fixes: 2149ded63079 ("drm/xe: Fix use after free when client stats are captured")
> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> ---
>  drivers/gpu/drm/xe/xe_exec_queue.c | 6 ++++++
>  drivers/gpu/drm/xe/xe_guc_submit.c | 2 --
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> index b15ca84b2422..bc2fc917e0de 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -260,8 +260,14 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
>  {
>  	int i;
>  
> +	/*
> +	 * Before releasing our ref to lrc and xef, accumulate our run ticks
> +	 */
> +	xe_exec_queue_update_run_ticks(q);
I mean, if it works, it works. However,
1) I might be mistaken, but if I'm understanding correctly, xe_exec_queue_fini
is just as asynchronous as guc_exec_queue_free_job was, meaning we're still
liable to hit the same issues as before.
2) If this is designed to cover an fd close use case (as per a discussion we had),
shouldn't we be accumulating the usage in the code path that performs
the fd close? I don't know exactly where that lives, but I suspect it might be
xe_file_close or xe_file_destroy; see the sketch below for roughly what I mean.
I won't block on this, because perhaps I don't have the full picture.
Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
-Jonathan Cavitt
> +
>  	for (i = 0; i < q->width; ++i)
>  		xe_lrc_put(q->lrc[i]);
> +
>  	__xe_exec_queue_free(q);
>  }
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index e5d7c767a744..ebe4665d9159 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -747,8 +747,6 @@ static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
>  {
>  	struct xe_sched_job *job = to_xe_sched_job(drm_job);
>  
> -	xe_exec_queue_update_run_ticks(job->q);
> -
>  	trace_xe_sched_job_free(job);
>  	xe_sched_job_put(job);
>  }
> --
> 2.47.0
>
>