[PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime
Lucas De Marchi
lucas.demarchi at intel.com
Tue May 21 02:58:39 UTC 2024
On Tue, May 21, 2024 at 02:53:56AM GMT, Matthew Brost wrote:
>On Sat, May 18, 2024 at 09:37:20AM -0500, Lucas De Marchi wrote:
>> On Fri, May 17, 2024 at 03:40:22PM GMT, Matt Roper wrote:
>> > On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:
>> > > From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > >
>> > > Add a helper to accumulate a client's runtime across all of its
>> > > exec queues. This is called every time a sched job is finished.
>> > >
>> > > v2:
>> > > - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
>> > > runtime when a job is finished, since xe_sched_job_completed() is not a
>> > > notification that the job has finished.
>> > > - Stop trying to update runtime from xe_exec_queue_fini() - that is
>> > > redundant and may happen after xef is closed, leading to a
>> > > use-after-free
>> > > - Do not special case the first timestamp read: the default LRC sets
>> > > CTX_TIMESTAMP to zero, so even the first sample should be a valid
>> > > one.
>> > > - Handle the parallel submission case by multiplying the runtime by
>> > > width.
>> > > v3: Update comments
>> > >
>> > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> > > Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> > > ---
>> > > drivers/gpu/drm/xe/xe_device_types.h | 3 +++
>> > > drivers/gpu/drm/xe/xe_exec_queue.c | 37 ++++++++++++++++++++++++++++
>> > > drivers/gpu/drm/xe/xe_exec_queue.h | 1 +
>> > > drivers/gpu/drm/xe/xe_execlist.c | 1 +
>> > > drivers/gpu/drm/xe/xe_guc_submit.c | 2 ++
>> > > 5 files changed, 44 insertions(+)
>> > >
>> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>> > > index 5c5e36de452a..bc97990fd032 100644
>> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
>> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
>> > > @@ -555,6 +555,9 @@ struct xe_file {
>> > > struct mutex lock;
>> > > } exec_queue;
>> > >
>> > > + /** @runtime: hw engine class runtime in ticks for this drm client */
>> > > + u64 runtime[XE_ENGINE_CLASS_MAX];
>> > > +
>> > > /** @client: drm client */
>> > > struct xe_drm_client *client;
>> > > };
>> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > index 395de93579fa..fa6dc996eca8 100644
>> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>> > > @@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
>> > > q->lrc[0].fence_ctx.next_seqno - 1;
>> > > }
>> > >
>> > > +/**
>> > > + * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
>> > > + * @q: The exec queue
>> > > + *
>> > > + * Update the timestamp saved by HW for this exec queue and accumulate
>> > > + * the runtime calculated from the delta since the last update. In the
>> > > + * multi-lrc case, only the first LRC is considered.
>> > > + */
>> > > +void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
>> > > +{
>> > > + struct xe_file *xef;
>> > > + struct xe_lrc *lrc;
>> > > + u32 old_ts, new_ts;
>> > > +
>> > > + /*
>> > > + * Jobs that are run during driver load may use an exec_queue, but are
>> > > + * not associated with a user xe file, so avoid accumulating busyness
>> > > + * for kernel specific work.
>> > > + */
>> > > + if (!q->vm || !q->vm->xef)
>> > > + return;
>> > > +
>> > > + xef = q->vm->xef;
>> > > +
>> > > + /*
>> > > + * Only sample the first LRC. For parallel submission, all of them are
>> > > + * scheduled together and we compensate for that below by multiplying
>> > > + * by width - this may introduce errors if that premise is not true
>> > > + * and they don't exit 100% aligned. On the other hand, looping through
>> > > + * the LRCs and reading them at different times could also introduce
>> > > + * errors.
>> >
>> > At the time we're executing this function, those LRCs aren't executing
>> > on the hardware anymore and their timestamps aren't continuing to move,
>>
>> not necessarily. Besides being called when execution ends, this function
>> is also called when userspace reads the fdinfo file (see the last patch),
>> which potentially reads the timestamps of LRCs that are still running.
>>
>> > right? I don't see where error could creep in from just looping over
>> > each of them?
>> >
>> > I guess parallel submission is mostly just used by media these days,
>> > where the batches submitted in parallel are nearly identical and
>> > expected to run the same amount of time, right? Do we have any
>>
>> what I got from Matt Brost is that they are always scheduled together
>> and will run together on the different instances of that engine class,
>> regardless of whether the batches are different. As you said, I'm not sure
>> there's even a use case for running different batches. +Matt Brost to confirm
>
>I believe the current usage involves running the same batch program, but
>it operates on a different set of data. Even if widely different batches
>were submitted across multiple engines, the GuC schedules all engines
>simultaneously and does not schedule anything else on the engine set
>until all engines are idle. So, I think the logic as it stands is fine.
that's sufficient for me :)
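To make the width compensation concrete, here is a minimal sketch of the
accumulation the quoted hunk is building toward. It is not the verbatim
patch; xe_lrc_update_timestamp() is assumed to be the helper added earlier
in the series, returning the current CTX_TIMESTAMP and reporting the
previous sample through its second argument:

  void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
  {
          struct xe_file *xef;
          struct xe_lrc *lrc;
          u32 old_ts, new_ts;

          /* Kernel-internal queues have no user xe file; skip them. */
          if (!q->vm || !q->vm->xef)
                  return;

          xef = q->vm->xef;

          /*
           * All LRCs of a parallel queue are scheduled together, so sample
           * only the first one and scale the delta by the queue width.
           */
          lrc = &q->lrc[0];
          new_ts = xe_lrc_update_timestamp(lrc, &old_ts);

          xef->runtime[q->class] += (new_ts - old_ts) * q->width;
  }
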
>If everyone is paranoid, feel free to sample every Logical Ring Context
>(LRC) in the exec queue.
nah, I don't think so. We could do that in the future if we find a need
for it. Thanks for checking.
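
On the fdinfo point above, a rough sketch of how a read could fold in
up-to-date timestamps for still-running LRCs, assuming the exec_queue
xarray/lock layout in struct xe_file and a hypothetical helper name (the
actual hook lives in the last patch of the series, not quoted here):

  static void client_update_runtime(struct xe_file *xef)
  {
          struct xe_exec_queue *q;
          unsigned long i;

          /*
           * Fold in the latest CTX_TIMESTAMP deltas before the per-class
           * runtime[] counters are reported to userspace.
           */
          mutex_lock(&xef->exec_queue.lock);
          xa_for_each(&xef->exec_queue.xa, i, q)
                  xe_exec_queue_update_runtime(q);
          mutex_unlock(&xef->exec_queue.lock);
  }
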
Lucas De Marchi