[PATCH 2/2] drm/xe: Record utilization before destroying the exec queue
Matthew Brost
matthew.brost at intel.com
Thu Jun 27 06:01:50 UTC 2024
On Tue, Jun 25, 2024 at 08:21:35PM -0700, Umesh Nerlige Ramappa wrote:
> On Tue, Jun 25, 2024 at 05:41:01PM +0000, Matthew Brost wrote:
> > On Tue, Jun 25, 2024 at 05:24:02PM +0000, Matthew Brost wrote:
> > > On Wed, Jun 26, 2024 at 12:58:12AM +0800, Umesh Nerlige Ramappa wrote:
> > > > Current code captures utilization at the exec queue level whenever a job
> > > > is completed and then accumulates it into the xe file stats whenever the
> > > > user queries for per client engine utilization. There is a case where
> > > > utilization may be lost if the exec queue is destroyed before the user
> > > > queries the utilization. To overcome that, record the utilization when
> > > > the exec queue is destroyed.
> > > >
> > > > To do so
> > > >
> > > > 1) Wait for release of all other references to the exec queue. The wait
> > > > uses the same timeout as the job scheduling timeout. On timeout, only
> > > > a debug message is printed out since this is just a best effort to
> > > > capture the utilization prior to destroying the queue.
> > > > 2) Before releasing the last reference in xe_exec_queue_destroy_ioctl(),
> > > > record the utilization in the xe file stats.
> > > >
> > > > Fixes: ce62827bc294 ("drm/xe: Do not access xe file when updating exec queue run_ticks")
> > > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_exec_queue.c | 11 +++++++++++
> > > > drivers/gpu/drm/xe/xe_exec_queue_types.h | 2 ++
> > > > 2 files changed, 13 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > index 4d90a16745d2..f1028eaf2d7f 100644
> > > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > > @@ -69,6 +69,7 @@ static struct xe_exec_queue *__xe_exec_queue_alloc(struct xe_device *xe,
> > > > q->ops = gt->exec_queue_ops;
> > > > INIT_LIST_HEAD(&q->lr.link);
> > > > INIT_LIST_HEAD(&q->multi_gt_link);
> > > > + init_waitqueue_head(&q->wq);
> > > >
> > > > q->sched_props.timeslice_us = hwe->eclass->sched_props.timeslice_us;
> > > > q->sched_props.preempt_timeout_us =
> > > > @@ -825,6 +826,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> > > > struct xe_file *xef = to_xe_file(file);
> > > > struct drm_xe_exec_queue_destroy *args = data;
> > > > struct xe_exec_queue *q;
> > > > + int ret;
> > > >
> > > > if (XE_IOCTL_DBG(xe, args->pad) ||
> > > > XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> > > > @@ -838,6 +840,15 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> > > >
> > > > xe_exec_queue_kill(q);
> > > >
> > > > + ret = wait_event_timeout(q->wq, kref_read(&q->refcount) == 1,
> > > > + (q->sched_props.job_timeout_ms/1000) * HZ);
> > > > + if (!ret)
> > > > + drm_dbg(&xe->drm, "Timedout waiting for exec queue run ticks update\n");
> > > > +
> > >
> > > I can't say I love adding a wait to the destroy IOCTL just to accurately
> > > report some stats. If we can't preempt this exec queue, the kill flow
> > > can take nearly 0.7s, which would be pretty bad to block on in the IOCTL.
> >
> > Oops, my timing calculation is wrong here as I forgot that we reduce
> > the preemption timeout to 1ms upon kill. But my point still stands: from
> > an arch perspective, blocking in an IOCTL to collect stats doesn't seem
> > right to me. Maybe others have a different opinion?
> >
> > > Even the common case isn't great either...
> > >
> > > Beyond that, this code relies on the assumption that
> > > kref_read(&q->refcount) == 1 means all jobs are done, which doesn't always
> > > hold. e.g. The wedging code (wedged_mode == 2) takes an extra ref to the
> > > exec queue, so this IOCTL will hang forever even we wedge the device.
> >
> > s/even we/if we/
> >
> > Matt
> >
> > >
> > > Could this be added to exec queue backend code which adjusts on the
> > > final destroy? e.g. Add it to __guc_exec_queue_fini_async via
> > > xe_exec_queue_fini? Or does that not work because we don't have the xef?
>
> That's how it was earlier, but the problem we ran into was that
> xe_file_close would schedule work to kill the queue and in parallel go ahead
> and close the xef object. By the time the queue backend wanted to update
> xef, xef was already freed -
> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1908
>
> Maybe if the queue backend somehow knows that the relevant xef is gone, it
> could handle the above bug, but we do not have a q->xef. It looks like that
> was intentionally removed a while ago. We were using q->vm->xef in the code
> that caused the above bug, but that's a hacky way to relate the queue to its
> xef.
>
> To resolve the bug above, I added code to localize the stats to the
> queue and transfer those stats to xef only in calls where we knew that the
> xef was valid - like when the user queries the stats. The only open issue was
> that when the queue is destroyed, we would lose the stats stored in the queue
> if we didn't transfer them. That's what is being addressed here.
>
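Right, and with that change the delta only flows into xef on the query
path. Roughly something like this, if I'm reading it right (the helper
name here is made up and the loop is paraphrased from memory, so treat
it as a sketch rather than the actual code):

static void accumulate_run_ticks(struct xe_file *xef)
{
	struct xe_exec_queue *q;
	unsigned long i;

	/* Query/fdinfo path - xef is known to be alive here */
	mutex_lock(&xef->exec_queue.lock);
	xa_for_each(&xef->exec_queue.xa, i, q)
		xef->run_ticks[q->class] += xe_exec_queue_delta_run_ticks(q);
	mutex_unlock(&xef->exec_queue.lock);
}

Which is fine while the file is open, but as you say it leaves the
destroy path uncovered.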
> > >
> > > Regardless of whether my suggestion works or not, surely we can do
> > > something to avoid waiting in this IOCTL. I suggest exploring another solution.
>
> Lucas had suggested using a ref to the xef object, but I think that requires
> bringing back the q->xef link. If you are okay with that, then I can add
> some code to make xef ref counted and solve the original issue -
> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1908. I can drop this
> series then.
>
That seems like a sane choice; in general I think we get into trouble
when we don't ref count objects.

So the ref counting scheme would roughly be:

- init the xef ref count in xe_file_open
- drop that ref in xe_file_close
- exec queue takes a ref to xef upon creation (user IOCTL created exec queues only)
- exec queue drops its ref to xef upon destroy (again user IOCTL created exec queues only)

Seems like a good design to me; a rough sketch of what I'm picturing is below.
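Something along these lines (all names here are placeholders rather than
what we'd necessarily land on, and the teardown details are hand-waved):

struct xe_file {
	/* ... existing fields ... */
	/** @refcount: placeholder ref count, dropped by file close and exec queues */
	struct kref refcount;
};

static void xe_file_destroy(struct kref *ref)
{
	struct xe_file *xef = container_of(ref, struct xe_file, refcount);

	/* Final teardown that currently happens unconditionally in xe_file_close() */
	kfree(xef);
}

static inline struct xe_file *xe_file_get(struct xe_file *xef)
{
	kref_get(&xef->refcount);
	return xef;
}

static inline void xe_file_put(struct xe_file *xef)
{
	kref_put(&xef->refcount, xe_file_destroy);
}

xe_file_open() would do kref_init(&xef->refcount), xe_file_close() would
become an xe_file_put(), and a user IOCTL created exec queue would take
xe_file_get() at create time and drop it from its final fini. With a
q->xef back-pointer held under that ref, the backend could flush
run_ticks into xef without racing with file close, and the wait in the
destroy IOCTL goes away.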
Matt
> Thanks,
> Umesh
>
> > >
> > > Matt
> > >
> > > > + mutex_lock(&xef->exec_queue.lock);
> > > > + xef->run_ticks[q->class] += xe_exec_queue_delta_run_ticks(q);
> > > > + mutex_unlock(&xef->exec_queue.lock);
> > > > +
> > > > trace_xe_exec_queue_close(q);
> > > > xe_exec_queue_put(q);
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > > > index 201588ec33c3..2ae4221d2f61 100644
> > > > --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > > > @@ -143,6 +143,8 @@ struct xe_exec_queue {
> > > > u64 old_run_ticks;
> > > > /** @run_ticks: hw engine class run time in ticks for this exec queue */
> > > > u64 run_ticks;
> > > > + /** @wq: wait queue to wait for cleanup */
> > > > + wait_queue_head_t wq;
> > > > /** @lrc: logical ring context for this exec queue */
> > > > struct xe_lrc *lrc[];
> > > > };
> > > > --
> > > > 2.34.1
> > > >