[PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime
Lucas De Marchi
lucas.demarchi at intel.com
Tue May 21 02:58:39 UTC 2024
On Tue, May 21, 2024 at 02:53:56AM GMT, Matthew Brost wrote:
>On Sat, May 18, 2024 at 09:37:20AM -0500, Lucas De Marchi wrote:
>> On Fri, May 17, 2024 at 03:40:22PM GMT, Matt Roper wrote:
>> > On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:
>> > > From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > >
>> > > Add a helper to accumulate a client's runtime across all of its
>> > > exec queues. This is called every time a sched job is finished.
>> > >
>> > > v2:
>> > > - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
>> > > runtime when a job is finished, since xe_sched_job_completed() is not a
>> > > notification that the job has finished.
>> > > - Stop trying to update runtime from xe_exec_queue_fini() - that is
>> > > redundant and may happen after xef is closed, leading to a
>> > > use-after-free
>> > > - Do not special case the first timestamp read: the default LRC sets
>> > > CTX_TIMESTAMP to zero, so even the first sample should be a valid
>> > > one.
>> > > - Handle the parallel submission case by multiplying the runtime by
>> > > width.
>> > > v3: Update comments
>> > >
>> > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> > > ---
>> > > drivers/gpu/drm/xe/xe_device_types.h | 3 +++
>> > > drivers/gpu/drm/xe/xe_exec_queue.c | 37 ++++++++++++++++++++++++++++
>> > > drivers/gpu/drm/xe/xe_exec_queue.h | 1 +
>> > > drivers/gpu/drm/xe/xe_execlist.c | 1 +
>> > > drivers/gpu/drm/xe/xe_guc_submit.c | 2 ++
>> > > 5 files changed, 44 insertions(+)
>> > >
>> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > > index 5c5e36de452a..bc97990fd032 100644
>> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > > @@ -555,6 +555,9 @@ struct xe_file {
>> > > struct mutex lock;
>> > > } exec_queue;
>> > >
>> > > + /** @runtime: hw engine class runtime in ticks for this drm client */
>> > > + u64 runtime[XE_ENGINE_CLASS_MAX];
>> > > +
>> > > /** @client: drm client */
>> > > struct xe_drm_client *client;
>> > > };
>> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > index 395de93579fa..fa6dc996eca8 100644
>> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > @@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
>> > > q->lrc[0].fence_ctx.next_seqno - 1;
>> > > }
>> > >
>> > > +/**
>> > > + * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
>> > > + * @q: The exec queue
>> > > + *
>> > > + * Update the timestamp saved by HW for this exec queue and accumulate
>> > > + * the runtime calculated from the delta since the last update. In the
>> > > + * multi-lrc case, only the first LRC is considered.
>> > > + */
>> > > +void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
>> > > +{
>> > > + struct xe_file *xef;
>> > > + struct xe_lrc *lrc;
>> > > + u32 old_ts, new_ts;
>> > > +
>> > > + /*
>> > > + * Jobs that are run during driver load may use an exec_queue, but are
>> > > + * not associated with a user xe file, so avoid accumulating busyness
>> > > + * for kernel specific work.
>> > > + */
>> > > + if (!q->vm || !q->vm->xef)
>> > > + return;
>> > > +
>> > > + xef = q->vm->xef;
>> > > +
>> > > + /*
>> > > + * Only sample the first LRC. For parallel submission, all of them are
>> > > + * scheduled together and we compensate for that below by multiplying
>> > > + * by width - this may introduce errors if that premise is not true
>> > > + * and they don't exit 100% aligned. On the other hand, looping through
>> > > + * the LRCs and reading them at different times could also introduce
>> > > + * errors.
>> >
>> > At the time we're executing this function, those LRCs aren't executing
>> > on the hardware anymore and their timestamps aren't continuing to move,
>>
>> not necessarily. Besides being called when execution ends, this function
>> is also called when userspace reads the fdinfo file (see the last patch),
>> which potentially reads the timestamps of LRCs that are still running.
>>
>> > right? I don't see where error could creep in from just looping over
>> > each of them?
>> >
>> > I guess parallel submission is mostly just used by media these days,
>> > where the batches submitted in parallel are nearly identical and
>> > expected to run the same amount of time, right? Do we have any
>>
>> what I got from Matt Brost is that they are always scheduled together
>> and will run together on the different instances of that engine class,
>> regardless of whether the batches are different. As you said, I'm not sure
>> there's even a use case for running different batches. +Matt Brost to confirm
>
>I believe the current usage involves running the same batch program, but
>it operates on a different set of data. Even if widely different batches
>were submitted across multiple engines, the GuC schedules all engines
>simultaneously and does not schedule anything else on the engine set
>until all engines are idle. So, I think the logic as it stands is fine.
that's sufficient for me :)
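To make the width compensation concrete, here is a minimal sketch of the
accumulation the quoted hunk is building toward. It is not the verbatim
patch; xe_lrc_update_timestamp() is assumed to be the helper added earlier
in the series, returning the current CTX_TIMESTAMP and reporting the
previous sample through its second argument:

  void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
  {
          struct xe_file *xef;
          struct xe_lrc *lrc;
          u32 old_ts, new_ts;

          /* Kernel-internal queues have no user xe file; skip them. */
          if (!q->vm || !q->vm->xef)
                  return;

          xef = q->vm->xef;

          /*
           * All LRCs of a parallel queue are scheduled together, so sample
           * only the first one and scale the delta by the queue width.
           */
          lrc = &q->lrc[0];
          new_ts = xe_lrc_update_timestamp(lrc, &old_ts);

          xef->runtime[q->class] += (new_ts - old_ts) * q->width;
  }
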
>If everyone is paranoid, feel free to sample every Logical Ring Context
>(LRC) in the exec queue.
nah, I don't think so. We could do that in the future if we find a need
for it. Thanks for checking.
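
On the fdinfo point above, a rough sketch of how a read could fold in
up-to-date timestamps for still-running LRCs, assuming the exec_queue
xarray/lock layout in struct xe_file and a hypothetical helper name (the
actual hook lives in the last patch of the series, not quoted here):

  static void client_update_runtime(struct xe_file *xef)
  {
          struct xe_exec_queue *q;
          unsigned long i;

          /*
           * Fold in the latest CTX_TIMESTAMP deltas before the per-class
           * runtime[] counters are reported to userspace.
           */
          mutex_lock(&xef->exec_queue.lock);
          xa_for_each(&xef->exec_queue.xa, i, q)
                  xe_exec_queue_update_runtime(q);
          mutex_unlock(&xef->exec_queue.lock);
  }
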
Lucas De Marchi