[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Lucas De Marchi
lucas.demarchi at intel.com
Tue Oct 29 19:12:22 UTC 2024
On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>....
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>>**** END ****
>>>
>>>which makes me think it's probably related to the kill being async as
>>>you mentioned.
>>>
>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>queues that are going away.
>>
>>Hmm, maybe.
>
>Doing that, it passes for me 62/100 runs of all the
>xe_drm_fdinfo@utilization-* tests.
>
>The failure on run 63 is different and I think it's another bug or
So, the other failure that I forgot to paste:
Starting subtest: utilization-all-full-load
(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
Stack trace:
#0 ../lib/igt_core.c:2051 __igt_fail_assert()
#1 ../tests/intel/xe_drm_fdinfo.c:520 check_results()
#2 ../tests/intel/xe_drm_fdinfo.c:464 __igt_unique____real_main806()
#3 ../tests/intel/xe_drm_fdinfo.c:806 main()
#4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
#5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
#6 [_start+0x25]
Subtest utilization-all-full-load failed.
**** DEBUG ****
(xe_drm_fdinfo:14864) DEBUG: rcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: bcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: ccs: spinner started
(xe_drm_fdinfo:14864) DEBUG: vcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: vecs: spinner started
(xe_drm_fdinfo:14864) DEBUG: rcs: spinner ended (timestamp=15218479)
(xe_drm_fdinfo:14864) DEBUG: bcs: spinner ended (timestamp=15194339)
(xe_drm_fdinfo:14864) DEBUG: vcs: spinner ended (timestamp=4837648)
(xe_drm_fdinfo:14864) DEBUG: vecs: spinner ended (timestamp=4816316)
(xe_drm_fdinfo:14864) DEBUG: ccs: spinner ended (timestamp=4859494)
(xe_drm_fdinfo:14864) DEBUG: rcs: sample 1: cycles 40481368, total_cycles 31104224861
(xe_drm_fdinfo:14864) DEBUG: rcs: sample 2: cycles 55700053, total_cycles 31109049238
(xe_drm_fdinfo:14864) DEBUG: rcs: percent: 315.453826
(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
(xe_drm_fdinfo:14864) igt_core-INFO: Stack trace:
(xe_drm_fdinfo:14864) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
From the timestamps read by the GPU: rcs timestamp=15218479 and bcs
timestamp=15194339, which is indeed much higher than the total_cycles
delta available: 31109049238 - 31104224861 = 4824377. That delta is
reasonably close to the spinner timestamps of the other engines, and the
rcs cycles delta matches the failure: 55700053 - 40481368 = 15218685
busy ticks over only 4824377 total ticks is the 315.45% reported above.
My hypothesis is something like this:

sample 1:
	accumulate_exec_queue (t = 0)
	<<<<<<<< preemption
	read_total_gpu_cycles (t = 200)

sample 2:
	accumulate_exec_queue (t = 300)
	read_total_gpu_cycles (t = 300)

which makes the cycles delta 300 - 0 = 300 while the total_cycles delta
is only 300 - 200 = 100, i.e. well above 100% utilization.
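
With those numbers, the math the test does looks like this (plain C just
to do the arithmetic; the variable names are mine, not the driver's):

#include <stdio.h>

int main(void)
{
	/* sample 1: per-queue ticks accumulated at t=0, GPU timestamp read at t=200 */
	unsigned long long ticks1 = 0, ts1 = 200;
	/* sample 2: both reads happen together at t=300 */
	unsigned long long ticks2 = 300, ts2 = 300;

	/* same math the fdinfo consumer does: busy delta / elapsed delta */
	double percent = 100.0 * (double)(ticks2 - ticks1) / (double)(ts2 - ts1);

	printf("%.2f%%\n", percent);	/* prints 300.00%, well above 100% */
	return 0;
}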
One easy thing that helps: move the forcewake lookup/get to the
beginning, so the per-queue accumulation and the GPU timestamp read
happen back to back. I don't think it will be 100% bulletproof, but it
improved the run on a misbehaving LNL to a 100/100 pass. Maybe I was
just lucky in this run.
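
Roughly what I have in mind, as a sketch only. The helpers marked
"placeholder" don't exist in the driver; the rest mirrors the current
show_run_ticks():

/*
 * Rough sketch of the reordering idea for show_run_ticks(): pay the
 * engine lookup + forcewake cost up front so that the per-queue
 * accumulation and the global timestamp read happen back to back.
 */
static void show_run_ticks_reordered(struct drm_printer *p, struct drm_file *file)
{
	struct xe_file *xef = file->driver_priv;
	struct xe_device *xe = xef->xe;
	struct xe_hw_engine *hwe;
	struct xe_exec_queue *q;
	unsigned long i;
	u64 gpu_timestamp;

	xe_pm_runtime_get(xe);

	/* 1) take forcewake before sampling anything */
	hwe = find_engine_for_timestamp(xe);		/* placeholder */
	if (!hwe || force_wake_get_for(hwe))		/* placeholder */
		goto put_pm;

	/* 2) accumulate per-queue ticks ... */
	mutex_lock(&xef->exec_queue.lock);
	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);
	mutex_unlock(&xef->exec_queue.lock);

	/* 3) ... and read the global timestamp right after */
	gpu_timestamp = read_engine_timestamp(hwe);	/* placeholder */
	(void)gpu_timestamp;	/* reported as drm-total-cycles-* in the real code */

	force_wake_put_for(hwe);			/* placeholder */
put_pm:
	xe_pm_runtime_put(xe);
	/* printing of the drm-cycles-* / drm-total-cycles-* keys elided */
}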
Other than that we may need to resort to keeping a copy of the last
timestamp reported, redoing the read if a nonsensical value comes out,
or adding some locking.
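
For the "copy of the last stamp" option, something as simple as this
sanity check may be enough (illustrative only, names made up):

/*
 * Remember the last values reported through fdinfo and reject a sample
 * whose busy delta exceeds the elapsed total, which can only be the
 * result of a racy read; re-read instead of reporting it.
 */
#include <stdbool.h>
#include <stdint.h>

struct last_sample {
	uint64_t cycles;	/* last per-class busy ticks reported */
	uint64_t total;		/* last total GPU timestamp reported */
};

static bool sample_is_sane(const struct last_sample *prev,
			   uint64_t cycles, uint64_t total)
{
	/* busy ticks can never advance faster than the GPU timestamp */
	return cycles - prev->cycles <= total - prev->total;
}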
Lucas De Marchi
>race. This is what I'm testing with currently:
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index a3e777ad281e3..eaee19efeadce 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -614,6 +614,11 @@ struct xe_file {
> 		 * does things while being held.
> 		 */
> 		struct mutex lock;
>+		/**
>+		 * @exec_queue.pending_removal: items pending to be removed to
>+		 * synchronize GPU state update with ongoing query.
>+		 */
>+		atomic_t pending_removal;
> 	} exec_queue;
> 
> 	/** @run_ticks: hw engine class run time in ticks for this drm client */
>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>index a9b0d640b2581..5f6347d12eec5 100644
>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> 	if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
> 		goto fail_gpu_timestamp;
> 
>+	/*
>+	 * Wait for any exec queue going away: their cycles will get updated on
>+	 * context switch out, so wait for that to happen
>+	 */
>+	wait_var_event(&xef->exec_queue.pending_removal,
>+		       !atomic_read(&xef->exec_queue.pending_removal));
>+
> 	xe_pm_runtime_put(xe);
> 
> 	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>index fd0f3b3c9101d..58dd35beb15ad 100644
>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
> 
> 	/*
> 	 * Before releasing our ref to lrc and xef, accumulate our run ticks
>+	 * and wakeup any waiters.
> 	 */
> 	xe_exec_queue_update_run_ticks(q);
>+	if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
>+		wake_up_var(&q->xef->exec_queue.pending_removal);
> 
> 	for (i = 0; i < q->width; ++i)
> 		xe_lrc_put(q->lrc[i]);
>@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> 		return -EINVAL;
> 
>+	atomic_inc(&xef->exec_queue.pending_removal);
> 	mutex_lock(&xef->exec_queue.lock);
> 	q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
> 	mutex_unlock(&xef->exec_queue.lock);
>
>
>The idea is that any process reading the fdinfo needs to wait for
>contexts that are going away via kill.
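
Distilled out of the driver structs, that's the usual
wait_var_event()/wake_up_var() pairing (sketch only; in the real diff
the counter lives in struct xe_file and the accounting happens in
xe_exec_queue_fini()):

#include <linux/atomic.h>
#include <linux/wait_bit.h>

static atomic_t pending_removal = ATOMIC_INIT(0);

/* destroy ioctl: mark one more queue as "going away" */
static void queue_destroy_start(void)
{
	atomic_inc(&pending_removal);
}

/* queue fini (async, after the final context switch out): account and wake */
static void queue_destroy_finish(void)
{
	/* run ticks are accumulated here in the real code */
	if (atomic_dec_and_test(&pending_removal))
		wake_up_var(&pending_removal);
}

/* fdinfo read: don't sample until all in-flight removals have accounted */
static void fdinfo_read(void)
{
	wait_var_event(&pending_removal, !atomic_read(&pending_removal));
	/* ... now read per-class cycles and the GPU timestamp ... */
}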
>
>
>>
>>I was of the opinion that we should solve it in Xe by adding an
>>update call in an additional place (like you are doing), but after
>>digging into it a bit, I am not sure if we should resolve this
>>specific issue. Instead, we should alter the test to not check for
>>accuracy when the queue is destroyed before taking the second
>>sample. We know that the ticks will get updated at some reasonable
>>point in future and the user will see it in subsequent fdinfo
>>queries anyways. If that "reasonable point in future" is
>>unacceptably large, then I think the problem is outside the PCEU
>>domain.
>>
>>Note that the original reason we added the test was to catch the ref
>>count issue with xef object (which is now fixed).
>>
>>>
>>>Another thought I had was to use the wabb, but afaics we can only
>>>execute something on context restore, not on context save.
>>
>>I am curious what you want to run in context save though and how
>>it's any different from what's happening now - CTX_TIMESTAMP is
>>being updated on save.
>
>I was thinking about letting the GPU use MI_MATH to keep calculating
>the delta... but yeah, it wouldn't help in this particular case.
>
>
>>
>>>
>>>>
>>>>If the ftrace is getting filled up, we could throttle that.
>>>
>>>Oh no, that is definitely not what I want. If we enable the tracepoint, we
>>>want to see it, not artificially drop the events.
>>>
>>>Initially (and to get a good measure of function runtime), I was
>>>actually using retsnoop rather than using the previously non-existent
>>>tracepoint:
>>>
>>> retsnoop -e xe_lrc_update_timestamp -e xe_lrc_create -e xe_lrc_destroy -S -A -C args.fmt-max-arg-width=0
>>
>>Didn't know that. That's ^ useful.
>
>life saver - I keep forgetting options for the other tools to do similar
>stuff, but this one is so simple and effective.
>
>Lucas De Marchi
>
>>
>>Thanks,
>>Umesh
>>>
>>>Lucas De Marchi
>>>
>>>>
>>>>Thanks,
>>>>Umesh
>>>>
>>>>> trace_xe_exec_queue_close(q);
>>>>> xe_exec_queue_put(q);
>>>>>
>>>>>--
>>>>>2.47.0
>>>>>