[PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues
Cavitt, Jonathan
jonathan.cavitt at intel.com
Tue Nov 5 23:12:32 UTC 2024
-----Original Message-----
From: De Marchi, Lucas <lucas.demarchi at intel.com>
Sent: Tuesday, November 5, 2024 2:36 PM
To: Cavitt, Jonathan <jonathan.cavitt at intel.com>
Cc: intel-xe at lists.freedesktop.org
Subject: Re: [PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues
>
> On Mon, Nov 04, 2024 at 04:11:16PM +0000, Cavitt, Jonathan wrote:
> >-----Original Message-----
> >From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of Lucas De Marchi
> >Sent: Monday, November 4, 2024 6:38 AM
> >To: intel-xe at lists.freedesktop.org
> >Cc: De Marchi, Lucas <lucas.demarchi at intel.com>
> >Subject: [PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues
> >>
> >> Move the force_wake_get to the beginning of the function so the gpu
> >> timestamp can be closer to the sampled value for exec queues. This
> >> avoids additional delays waiting for force wake ack which can make the
> >> proportion between cycles/total_cycles fluctuate either to more or to
> >> less than the real value,
> >
> >s/either to more or to less than/around
>
> changed
>
> >
> >> depending if the delay happened in the first
> >> or second sample of a top-like application:
> >
> >Top-like application? You mean a spinning application?
>
> no, I meant an app like gputop/htop/nvtop/qmassa, which reads 2 samples
> like the ones below, calculates the delta and then shows the utilization.
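>
> Roughly, the math such a tool does with the two samples looks like
> this (a minimal sketch; names are illustrative and not taken from any
> particular tool):
>
>     #include <stdint.h>
>
>     /* cycles[i]: exec queue timestamp from sample i,
>      * total_cycles[i]: gpu timestamp from sample i
>      */
>     static double utilization(const uint64_t cycles[2],
>                               const uint64_t total_cycles[2])
>     {
>             uint64_t busy = cycles[1] - cycles[0];
>             uint64_t total = total_cycles[1] - total_cycles[0];
>
>             return total ? 100.0 * busy / total : 0.0;
>     }
>
> Any extra latency between reading the exec queue timestamp and the gpu
> timestamp in one sample skews one of the two deltas. With made-up
> numbers: if the queue really ran 50 of 100 total cycles between the
> samples quoted below, a 10-cycle delay at (A) makes the tool compute
> 50/90 ~= 56%, while the same delay at (B) gives 50/110 ~= 45%.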
>
> >
> >>
> >> sample 0:
> >> read_exec_queue_timestamp
> >> <<<< (A)
> >> read_gpu_timestamp
> >> sample 1:
> >> read_exec_queue_timestamp
> >> <<<<< (B)
> >> read_gpu_timestamp
> >>
> >> Additional delay in (A) will cause the utilization to be bigger
> >> than it should. Additional delay in (B) will cause the utilization to be
> >> smaller than it should.
> >
> >Maybe:
> >
> >"""
> >In the above case, utilization can be bigger or smaller than it should
> >be, depending on whether (A) or (B) receives additional delay, respectively.
> >"""
>
> changed
>
> >
> >>
> >> With this, an LNL system that was failing after ~60 iterations gets to run
> >> to 100 without a failure. This is still not perfect, and it's easy to
> >> introduce errors by just loading the CPU with `stress --cpu $(nproc)` -
> >> the same igt test in this case fails after 2 or 3 iterations.
> >
> >I'd say either specify the failing IGT test or cut this paragraph.
>
> I don't like to give the impression I think this is a perfect fix. I
> will reword it a little bit for the next version.
>
> >
> >>
> >> Although not perfect, this could be viewed as just extracting that ugly
> >> loop to "find the forcewake for any available engine" into a separate
> >> function, and at the same time improving the results a little bit.
> >
> >This seems like an aside and can probably be cut, though if
> >you think it's necessary to keep this section to make this
> >patch more likely to successfully land, then IMO you
> >should probably merge it with the first paragraph.
>
> I will merge this with the paragraph above it.
>
> >
> >Something like:
> >
> >"""
> >Move the force_wake_get to the beginning of the show_run_ticks
> >function so the gpu timestamp can be closer to the sampled value
> >for exec queues. This cleans the code by extracting the for-loop
> >that finds the forcewake of the first available engine into its own
> >function. Additionally, this helps avoid additional delays waiting
> >for force wake ack which can...
> >"""
> >
> >Note that by saying it "helps avoid" rather than "avoids", we go
> >from a full fix to one that reduces harm, meaning we don't have
> >to specify the imperfection of this fix later.
>
> I like concrete numbers. I like it when people actually test the patch
> and make that clear in the commit message.
>
> >
> >>
> >> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> >> ---
> >> drivers/gpu/drm/xe/xe_drm_client.c | 63 ++++++++++++++++++------------
> >> 1 file changed, 39 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
> >> index 24a0a7377abf2..874c2157baf49 100644
> >> --- a/drivers/gpu/drm/xe/xe_drm_client.c
> >> +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> >> @@ -269,6 +269,40 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
> >> }
> >> }
> >>
> >> +static bool force_wake_get_for_gpu_timestamp(struct xe_device *xe,
> >> + struct xe_hw_engine **phwe,
> >> + unsigned int *pfw_ref)
> >> +{
> >
> >This is amenable, though perhaps we should just call it
> >"force_wake_get_any_engine" or
> >"force_wake_get_first_engine"?
>
> I'd agree, but then I read the feedback from Umesh and decided to keep
> it as is. The reason is that this function may in the future use a cached
> hwe, and that hwe only makes sense for this specific use.
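>
> Purely as an illustration of that direction (nothing like this exists
> today; the field name is made up):
>
>     /* hypothetical: remember the engine used for the gpu timestamp */
>     if (xe->gpu_timestamp_hwe) {
>             hwe = xe->gpu_timestamp_hwe;
>     } else {
>             hwe = xe_gt_any_hw_engine(gt);
>             xe->gpu_timestamp_hwe = hwe;
>     }
>
> A name tied to the gpu timestamp use case leaves room for that without
> a later rename.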
>
> >
> >> + struct xe_gt *gt;
> >> + unsigned long gt_id;
> >> +
> >> + for_each_gt(gt, xe, gt_id) {
> >> + struct xe_hw_engine *hwe;
> >> + enum xe_force_wake_domains domain;
> >> + unsigned int fw_ref;
> >> + struct xe_force_wake *fw;
> >> +
> >> + hwe = xe_gt_any_hw_engine(gt);
> >> + if (!hwe)
> >> + continue;
> >> +
> >> + domain = xe_hw_engine_to_fw_domain(hwe);
> >> + fw = gt_to_fw(gt);
> >> +
> >> + fw_ref = xe_force_wake_get(fw, domain);
> >> + if (!xe_force_wake_ref_has_domain(fw_ref, domain)) {
> >> + xe_force_wake_put(fw, fw_ref);
> >> + return false;
> >> + }
> >> +
> >> + *phwe = hwe;
> >> + *pfw_ref = fw_ref;
> >> + return true;
> >> + }
> >> +
> >> + return false;
> >> +}
> >> +
> >> static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >> {
> >> unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { };
> >> @@ -282,6 +316,9 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >>
> >> xe_pm_runtime_get(xe);
> >>
> >> + if (!force_wake_get_for_gpu_timestamp(xe, &hwe, &fw_ref))
> >> + return;
> >> +
> >> /* Accumulate all the exec queues from this client */
> >> mutex_lock(&xef->exec_queue.lock);
> >> xa_for_each(&xef->exec_queue.xa, i, q) {
> >> @@ -295,27 +332,7 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >> }
> >> mutex_unlock(&xef->exec_queue.lock);
> >>
> >> - /* Get the total GPU cycles */
> >> - for_each_gt(gt, xe, gt_id) {
> >> - enum xe_force_wake_domains fw;
> >> -
> >> - hwe = xe_gt_any_hw_engine(gt);
> >> - if (!hwe)
> >> - continue;
> >> -
> >> - fw = xe_hw_engine_to_fw_domain(hwe);
> >> -
> >> - fw_ref = xe_force_wake_get(gt_to_fw(gt), fw);
> >> - if (!xe_force_wake_ref_has_domain(fw_ref, fw)) {
> >> - hwe = NULL;
> >> - xe_force_wake_put(gt_to_fw(gt), fw_ref);
> >> - break;
> >> - }
> >> -
> >> - gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
> >> - xe_force_wake_put(gt_to_fw(gt), fw_ref);
> >> - break;
> >> - }
> >> + gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
> >>
> >> /*
> >> * Wait for any exec queue going away: their cycles will get updated on
> >> @@ -324,11 +341,9 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >> wait_var_event(&xef->exec_queue.pending_removal,
> >> !atomic_read(&xef->exec_queue.pending_removal));
> >>
> >> + xe_force_wake_put(gt_to_fw(hwe->gt), fw_ref);
> >> xe_pm_runtime_put(xe);
> >>
> >> - if (unlikely(!hwe))
> >> - return;
> >> -
> >> for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
> >> const char *class_name;
> >>
> >> --
> >
> >I know I've written a lot of suggestions at this point, but none of them
> >are particularly blocking, so just clean up the commit message a bit and
> >this is:
>
> thanks
>
> Lucas De Marchi
>
> >
> >Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> >-Jonathan Cavitt
I read all responses and it all looks good to me. Above RB still stands.
-Jonathan Cavitt
> >
> >> 2.47.0
> >>
> >>
>