[PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues

Cavitt, Jonathan jonathan.cavitt at intel.com
Tue Nov 5 23:12:32 UTC 2024


-----Original Message-----
From: De Marchi, Lucas <lucas.demarchi at intel.com> 
Sent: Tuesday, November 5, 2024 2:36 PM
To: Cavitt, Jonathan <jonathan.cavitt at intel.com>
Cc: intel-xe at lists.freedesktop.org
Subject: Re: [PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues
> 
> On Mon, Nov 04, 2024 at 04:11:16PM +0000, Cavitt, Jonathan wrote:
> >-----Original Message-----
> >From: Intel-xe <intel-xe-bounces at lists.freedesktop.org> On Behalf Of Lucas De Marchi
> >Sent: Monday, November 4, 2024 6:38 AM
> >To: intel-xe at lists.freedesktop.org
> >Cc: De Marchi, Lucas <lucas.demarchi at intel.com>
> >Subject: [PATCH v3 5/5] drm/xe: Sample gpu timestamp closer to exec queues
> >>
> >> Move the force_wake_get to the beginning of the function so the gpu
> >> timestamp can be closer to the sampled value for exec queues. This
> >> avoids additional delays waiting for force wake ack which can make the
> >> proportion between cycles/total_cycles fluctuate either to more or to
> >> less than the real value,
> >
> >s/either to more or to less than/around
> 
> changed
> 
> >
> >> depending if the delay happened in the first
> >> or second sample of a top-like application:
> >
> >Top-like application?  You mean a spinning application?
> 
> no, I meant an app like gputop/htop/nvtop/qmassa that reads 2 samples,
> like below, calculates the delta, and then shows the utilization.
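(As a rough sketch, the delta computation such a tool does looks like the
snippet below. The values come from the drm-cycles-<class> and
drm-total-cycles-<class> lines of the client's fdinfo; the parsing is
omitted and the struct/function names are made up for illustration.)

	/* Two fdinfo samples of one engine class for one client. */
	struct util_sample {
		unsigned long long cycles; /* drm-cycles-<class> */
		unsigned long long total;  /* drm-total-cycles-<class> */
	};

	/* Utilization over the sampling interval, in percent. */
	static double busy_percent(const struct util_sample *s0,
				   const struct util_sample *s1)
	{
		unsigned long long dcycles = s1->cycles - s0->cycles;
		unsigned long long dtotal = s1->total - s0->total;

		return dtotal ? 100.0 * dcycles / dtotal : 0.0;
	}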
> 
> >
> >>
> >> 	sample 0:
> >> 		read_exec_queue_timestamp
> >> 					<<<< (A)
> >> 		read_gpu_timestamp
> >> 	sample 1:
> >> 		read_exec_queue_timestamp
> >> 					<<<<< (B)
> >> 		read_gpu_timestamp
> >>
> >> Additional delay in (A) will cause the utilization to be bigger
> >> than it should. Additional delay in (B) will cause the utilization to be
> >> smaller than it should.
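(Putting made-up numbers on that: if drm-cycles for a class goes
1000 -> 1500 while drm-total-cycles goes 10000 -> 11000, the tool reports
500/1000 = 50%. A delay in (A) that pushes the first GPU timestamp read
out to 10200 shrinks the denominator to 800, so ~62% gets reported; the
same delay in (B) pushes the second read out to 11200, growing the
denominator to 1200, so ~42% gets reported.)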
> >
> >Maybe:
> >
> >"""
> >In the above case, utilization can be bigger or smaller than it should
> >be, depending on if (A) or (B) receives additional delay, respectively.
> >"""
> 
> changed
> 
> >
> >>
> >> With this, an LNL system that was failing after ~60 iterations gets to run
> >> to 100 without a failure. This is still not perfect, and it's easy to
> >> introduce errors by just loading the CPU with `stress --cpu $(nproc)` -
> >> the same igt test in this case fails after 2 or 3 iterations.
> >
> >I'd say either specify the failing IGT test or cut this paragraph.
> 
> I don't like to give the impression I think this is a perfect fix. I
> will reword it a little bit for the next version.
> 
> >
> >>
> >> Although not perfect, this could be viewed as just extracting that ugly
> >> loop to "find the forcewake for any available engine" into a separate
> >> function, and at the same time improving the results a little bit.
> >
> >This seems like an aside and can probably be cut, though if
> >you think it's necessary to keep this section to make this
> >patch more likely to successfully land, then IMO you
> >should probably merge it with the first paragraph.
> 
> I will merge this with the paragraph above it.
> 
> >
> >Something like:
> >
> >"""
> >Move the force_wake_get to the beginning of the show_run_ticks
> >function so the gpu timestamp can be closer to the sampled value
> >for exec queues.  This cleans the code by extracting the for-loop
> >that finds the forcewake of the first available engine into its own
> >function.  Additionally, this helps avoid additional delays waiting
> >for force wake ack which can...
> >"""
> >
> >Note that by saying it "helps avoid" rather than "avoids", we go
> >from a full fix to one that reduces harm, meaning we don't have
> >to specify the imperfection of this fix later.
> 
> I like concrete numbers. I like it when people actually test the patch
> and make that clear in the commit message.
> 
> >
> >>
> >> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> >> ---
> >>  drivers/gpu/drm/xe/xe_drm_client.c | 63 ++++++++++++++++++------------
> >>  1 file changed, 39 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
> >> index 24a0a7377abf2..874c2157baf49 100644
> >> --- a/drivers/gpu/drm/xe/xe_drm_client.c
> >> +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> >> @@ -269,6 +269,40 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
> >>  	}
> >>  }
> >>
> >> +static bool force_wake_get_for_gpu_timestamp(struct xe_device *xe,
> >> +					     struct xe_hw_engine **phwe,
> >> +					     unsigned int *pfw_ref)
> >> +{
> >
> >This is amenable, though perhaps we should just call it
> >"force_wake_get_any_engine" or
> >"force_wake_get_first_engine"?
> 
> I'd agree, but then I read the feedback from Umesh and decided to keep
> it as is. The reason is that this function may in the future use a cached
> hwe, and that hwe only makes sense for this specific use.
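(A rough sketch of what that future caching could look like; the
xe->gpu_timestamp_hwe field and this exact flow are hypothetical and not
part of this series:)

	static bool force_wake_get_for_gpu_timestamp(struct xe_device *xe,
						     struct xe_hw_engine **phwe,
						     unsigned int *pfw_ref)
	{
		/* Hypothetical cached engine from a previous lookup. */
		struct xe_hw_engine *hwe = READ_ONCE(xe->gpu_timestamp_hwe);
		enum xe_force_wake_domains domain;
		unsigned int fw_ref;

		if (!hwe) {
			/* ...fall back to the per-GT search from the patch
			 * and cache the engine it finds... */
			return false;
		}

		domain = xe_hw_engine_to_fw_domain(hwe);
		fw_ref = xe_force_wake_get(gt_to_fw(hwe->gt), domain);
		if (!xe_force_wake_ref_has_domain(fw_ref, domain)) {
			xe_force_wake_put(gt_to_fw(hwe->gt), fw_ref);
			return false;
		}

		*phwe = hwe;
		*pfw_ref = fw_ref;
		return true;
	}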
> 
> >
> >> +	struct xe_gt *gt;
> >> +	unsigned long gt_id;
> >> +
> >> +	for_each_gt(gt, xe, gt_id) {
> >> +		struct xe_hw_engine *hwe;
> >> +		enum xe_force_wake_domains domain;
> >> +		unsigned int fw_ref;
> >> +		struct xe_force_wake *fw;
> >> +
> >> +		hwe = xe_gt_any_hw_engine(gt);
> >> +		if (!hwe)
> >> +			continue;
> >> +
> >> +		domain = xe_hw_engine_to_fw_domain(hwe);
> >> +		fw = gt_to_fw(gt);
> >> +
> >> +		fw_ref = xe_force_wake_get(fw, domain);
> >> +		if (!xe_force_wake_ref_has_domain(fw_ref, domain)) {
> >> +			xe_force_wake_put(fw, fw_ref);
> >> +			return false;
> >> +		}
> >> +
> >> +		*phwe = hwe;
> >> +		*pfw_ref = fw_ref;
> >> +		return true;
> >> +	}
> >> +
> >> +	return false;
> >> +}
> >> +
> >>  static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >>  {
> >>  	unsigned long class, i, gt_id, capacity[XE_ENGINE_CLASS_MAX] = { };
> >> @@ -282,6 +316,9 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >>
> >>  	xe_pm_runtime_get(xe);
> >>
> >> +	if (!force_wake_get_for_gpu_timestamp(xe, &hwe, &fw_ref))
> >> +		return;
> >> +
> >>  	/* Accumulate all the exec queues from this client */
> >>  	mutex_lock(&xef->exec_queue.lock);
> >>  	xa_for_each(&xef->exec_queue.xa, i, q) {
> >> @@ -295,27 +332,7 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >>  	}
> >>  	mutex_unlock(&xef->exec_queue.lock);
> >>
> >> -	/* Get the total GPU cycles */
> >> -	for_each_gt(gt, xe, gt_id) {
> >> -		enum xe_force_wake_domains fw;
> >> -
> >> -		hwe = xe_gt_any_hw_engine(gt);
> >> -		if (!hwe)
> >> -			continue;
> >> -
> >> -		fw = xe_hw_engine_to_fw_domain(hwe);
> >> -
> >> -		fw_ref = xe_force_wake_get(gt_to_fw(gt), fw);
> >> -		if (!xe_force_wake_ref_has_domain(fw_ref, fw)) {
> >> -			hwe = NULL;
> >> -			xe_force_wake_put(gt_to_fw(gt), fw_ref);
> >> -			break;
> >> -		}
> >> -
> >> -		gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
> >> -		xe_force_wake_put(gt_to_fw(gt), fw_ref);
> >> -		break;
> >> -	}
> >> +	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
> >>
> >>  	/*
> >>  	 * Wait for any exec queue going away: their cycles will get updated on
> >> @@ -324,11 +341,9 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> >>  	wait_var_event(&xef->exec_queue.pending_removal,
> >>  		       !atomic_read(&xef->exec_queue.pending_removal));
> >>
> >> +	xe_force_wake_put(gt_to_fw(hwe->gt), fw_ref);
> >>  	xe_pm_runtime_put(xe);
> >>
> >> -	if (unlikely(!hwe))
> >> -		return;
> >> -
> >>  	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
> >>  		const char *class_name;
> >>
> >> --
> >
> >I know I've written a lot of suggestions at this point, but none of them
> >are particularly blocking, so just clean up the commit message a bit and
> >this is
> 
> thanks
> 
> Lucas De Marchi
> 
> >
> >Reviewed-by: Jonathan Cavitt <jonathan.cavitt at intel.com>
> >-Jonathan Cavitt

I read all the responses and they all look good to me.  The above RB still stands.
-Jonathan Cavitt

> >
> >> 2.47.0
> >>
> >>
> 

