[PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps

Lucas De Marchi lucas.demarchi at intel.com
Thu Jan 9 20:07:30 UTC 2025


On Wed, Jan 08, 2025 at 04:57:40PM -0600, Lucas De Marchi wrote:
>On Mon, Jan 06, 2025 at 02:44:27PM -0800, Umesh Nerlige Ramappa wrote:
>>On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:
>>>This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
>>>queue file lock usage."). While it's desirable to have the mutex
>>>protect only the reference to the exec queue, taking and dropping the
>>>mutex for each queue and only later reading the GPU timestamp doesn't
>>>produce a correct result: it opens multiple windows for the task to be
>>>scheduled out, wreaking havoc on the deltas reported to userspace.
>>>
>>>Also, to better correlate the exec queue timestamps with the GPU
>>>timestamp, disable preemption so they can be updated without the task
>>>being scheduled out. We leave interrupts enabled, since the disturbance
>>>they cause should be too small to matter for the deltas reported to
>>>userspace.
>>
>>Assuming
>>
>>- timestamp from exec queues = xe_exec_queue_update_run_ticks()
>>- GPU timestamp = xe_hw_engine_read_timestamp()
>>
>>shouldn't you also move the xe_hw_engine_read_timestamp() within the 
>>preempt_disable/preempt_enable section?
>
>this is what I thought I had done, but it seems I messed up.
>
>>
>>Something like this ..
>>
>>	preempt_disable();
>>
>>	xa_for_each(&xef->exec_queue.xa, i, q)
>>		xe_exec_queue_update_run_ticks(q);
>>
>>	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>
>I'd move the gpu_timestamp read to before the exec queue loop, so we
>don't call it with the exec queue mutex still held. AFAIR this is what
>I was doing on the test machine, but I sent the wrong version of the
>patch. Let me double-check locally here and resend.

Just re-tested that and submitted a v2. With and without the pending IGT
patch series, I can run the test 1000 times while running CPU stress
without a single failure.
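
For reference, the v2 ordering should look roughly like the sketch
below. This is only an illustration of the intent, not necessarily the
exact v2 hunk: the GT timestamp read moves inside the preempt-disabled
section, ahead of the exec queue loop. Note that mutex_lock() may sleep,
so the mutex has to be taken before preemption is disabled:

	mutex_lock(&xef->exec_queue.lock);
	preempt_disable();

	/* Sample the GT timestamp first, then the exec queue run ticks */
	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);

	preempt_enable();
	mutex_unlock(&xef->exec_queue.lock);

With both samples taken in a single window where the task cannot be
scheduled out, the utilization derived from the deltas in fdinfo stays
consistent even under CPU stress.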

Lucas De Marchi

>
>thanks
>Lucas De Marchi
>
>>
>>	preempt_enable();
>>
>>Thanks,
>>Umesh
>>
>>>
>>>Test scenario:
>>>
>>>	* IGT's `xe_drm_fdinfo --r utilization-single-full-load`
>>>	* Platform: LNL, where CI occasionally reports failures
>>>	* `stress -c $(nproc)` running in parallel to disturb the
>>>	  system
>>>
>>>This moves the first failure from "after ~150 executions" to "never
>>>occurs in 1000 attempts".
>>>
>>>Cc: stable at vger.kernel.org # v6.11+
>>>Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512
>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>---
>>>drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------
>>>1 file changed, 3 insertions(+), 6 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>>>index 298a587da7f17..e307b4d6bab5a 100644
>>>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>>>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>>>@@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
>>>
>>>	/* Accumulate all the exec queues from this client */
>>>	mutex_lock(&xef->exec_queue.lock);
>>>-	xa_for_each(&xef->exec_queue.xa, i, q) {
>>>-		xe_exec_queue_get(q);
>>>-		mutex_unlock(&xef->exec_queue.lock);
>>>+	preempt_disable();
>>>
>>>+	xa_for_each(&xef->exec_queue.xa, i, q)
>>>		xe_exec_queue_update_run_ticks(q);
>>>
>>>-		mutex_lock(&xef->exec_queue.lock);
>>>-		xe_exec_queue_put(q);
>>>-	}
>>>+	preempt_enable();
>>>	mutex_unlock(&xef->exec_queue.lock);
>>>
>>>	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>>-- 
>>>2.47.0
>>>

