[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Lucas De Marchi
lucas.demarchi at intel.com
Tue Oct 29 19:12:22 UTC 2024
On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>....
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>>**** END ****
>>>
>>>which makes me think it's probably related to the kill being async as
>>>you mentioned.
>>>
>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>queues that are going away.
>>
>>Hmm, maybe.
>
>Doing that, it passes for me 62/100 runs of all the
>xe_drm_fdinfo@utilization-* tests.
>
>The failure on run 63 is different and I think it's another bug or
So, the other failure that I forgot to paste:
Starting subtest: utilization-all-full-load
(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
Stack trace:
#0 ../lib/igt_core.c:2051 __igt_fail_assert()
#1 ../tests/intel/xe_drm_fdinfo.c:520 check_results()
#2 ../tests/intel/xe_drm_fdinfo.c:464 __igt_unique____real_main806()
#3 ../tests/intel/xe_drm_fdinfo.c:806 main()
#4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
#5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
#6 [_start+0x25]
Subtest utilization-all-full-load failed.
**** DEBUG ****
(xe_drm_fdinfo:14864) DEBUG: rcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: bcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: ccs: spinner started
(xe_drm_fdinfo:14864) DEBUG: vcs: spinner started
(xe_drm_fdinfo:14864) DEBUG: vecs: spinner started
(xe_drm_fdinfo:14864) DEBUG: rcs: spinner ended (timestamp=15218479)
(xe_drm_fdinfo:14864) DEBUG: bcs: spinner ended (timestamp=15194339)
(xe_drm_fdinfo:14864) DEBUG: vcs: spinner ended (timestamp=4837648)
(xe_drm_fdinfo:14864) DEBUG: vecs: spinner ended (timestamp=4816316)
(xe_drm_fdinfo:14864) DEBUG: ccs: spinner ended (timestamp=4859494)
(xe_drm_fdinfo:14864) DEBUG: rcs: sample 1: cycles 40481368, total_cycles 31104224861
(xe_drm_fdinfo:14864) DEBUG: rcs: sample 2: cycles 55700053, total_cycles 31109049238
(xe_drm_fdinfo:14864) DEBUG: rcs: percent: 315.453826
(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
(xe_drm_fdinfo:14864) igt_core-INFO: Stack trace:
(xe_drm_fdinfo:14864) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
From the timestamps read by the GPU: rcs timestamp=15218479 and bcs
timestamp=15194339, which is indeed much higher than the total_cycles
delta available: 31109049238 - 31104224861 = 4824377. That delta is
reasonably close to the spinner timestamps of the other engines, and the
rcs cycles delta matches the failure: 55700053 - 40481368 = 15218685
busy ticks over only 4824377 total ticks is the 315.45% reported above.
My hypothesis is something like this:

sample 1:
	accumulate_exec_queue (t = 0)
	<<<<<<<< preemption
	read_total_gpu_cycles (t = 200)

sample 2:
	accumulate_exec_queue (t = 300)
	read_total_gpu_cycles (t = 300)

which makes the cycles delta 300 - 0 = 300 while the total_cycles delta
is only 300 - 200 = 100, i.e. well above 100% utilization.
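
With those numbers, the math the test does looks like this (plain C just
to do the arithmetic; the variable names are mine, not the driver's):

#include <stdio.h>

int main(void)
{
	/* sample 1: per-queue ticks accumulated at t=0, GPU timestamp read at t=200 */
	unsigned long long ticks1 = 0, ts1 = 200;
	/* sample 2: both reads happen together at t=300 */
	unsigned long long ticks2 = 300, ts2 = 300;

	/* same math the fdinfo consumer does: busy delta / elapsed delta */
	double percent = 100.0 * (double)(ticks2 - ticks1) / (double)(ts2 - ts1);

	printf("%.2f%%\n", percent);	/* prints 300.00%, well above 100% */
	return 0;
}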
One easy thing that helps: move the forcewake lookup/get to the
beginning, so the per-queue accumulation and the GPU timestamp read
happen back to back. I don't think it will be 100% bulletproof, but it
improved the run on a misbehaving LNL to a 100/100 pass. Maybe I was
just lucky in this run.
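
Roughly what I have in mind, as a sketch only. The helpers marked
"placeholder" don't exist in the driver; the rest mirrors the current
show_run_ticks():

/*
 * Rough sketch of the reordering idea for show_run_ticks(): pay the
 * engine lookup + forcewake cost up front so that the per-queue
 * accumulation and the global timestamp read happen back to back.
 */
static void show_run_ticks_reordered(struct drm_printer *p, struct drm_file *file)
{
	struct xe_file *xef = file->driver_priv;
	struct xe_device *xe = xef->xe;
	struct xe_hw_engine *hwe;
	struct xe_exec_queue *q;
	unsigned long i;
	u64 gpu_timestamp;

	xe_pm_runtime_get(xe);

	/* 1) take forcewake before sampling anything */
	hwe = find_engine_for_timestamp(xe);		/* placeholder */
	if (!hwe || force_wake_get_for(hwe))		/* placeholder */
		goto put_pm;

	/* 2) accumulate per-queue ticks ... */
	mutex_lock(&xef->exec_queue.lock);
	xa_for_each(&xef->exec_queue.xa, i, q)
		xe_exec_queue_update_run_ticks(q);
	mutex_unlock(&xef->exec_queue.lock);

	/* 3) ... and read the global timestamp right after */
	gpu_timestamp = read_engine_timestamp(hwe);	/* placeholder */
	(void)gpu_timestamp;	/* reported as drm-total-cycles-* in the real code */

	force_wake_put_for(hwe);			/* placeholder */
put_pm:
	xe_pm_runtime_put(xe);
	/* printing of the drm-cycles-* / drm-total-cycles-* keys elided */
}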
Other than that we may need to resort to keeping a copy of the last
timestamp reported, redoing the read if a nonsensical value comes out,
or adding some locking.
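
For the "copy of the last stamp" option, something as simple as this
sanity check may be enough (illustrative only, names made up):

/*
 * Remember the last values reported through fdinfo and reject a sample
 * whose busy delta exceeds the elapsed total, which can only be the
 * result of a racy read; re-read instead of reporting it.
 */
#include <stdbool.h>
#include <stdint.h>

struct last_sample {
	uint64_t cycles;	/* last per-class busy ticks reported */
	uint64_t total;		/* last total GPU timestamp reported */
};

static bool sample_is_sane(const struct last_sample *prev,
			   uint64_t cycles, uint64_t total)
{
	/* busy ticks can never advance faster than the GPU timestamp */
	return cycles - prev->cycles <= total - prev->total;
}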
Lucas De Marchi
>race. This is what I'm testing with currently:
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index a3e777ad281e3..eaee19efeadce 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -614,6 +614,11 @@ struct xe_file {
> 		 * does things while being held.
> 		 */
> 		struct mutex lock;
>+		/**
>+		 * @exec_queue.pending_removal: items pending to be removed to
>+		 * synchronize GPU state update with ongoing query.
>+		 */
>+		atomic_t pending_removal;
> 	} exec_queue;
> 
> 	/** @run_ticks: hw engine class run time in ticks for this drm client */
>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>index a9b0d640b2581..5f6347d12eec5 100644
>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> 	if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
> 		goto fail_gpu_timestamp;
> 
>+	/*
>+	 * Wait for any exec queue going away: their cycles will get updated on
>+	 * context switch out, so wait for that to happen
>+	 */
>+	wait_var_event(&xef->exec_queue.pending_removal,
>+		       !atomic_read(&xef->exec_queue.pending_removal));
>+
> 	xe_pm_runtime_put(xe);
> 
> 	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>index fd0f3b3c9101d..58dd35beb15ad 100644
>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
> 
> 	/*
> 	 * Before releasing our ref to lrc and xef, accumulate our run ticks
>+	 * and wakeup any waiters.
> 	 */
> 	xe_exec_queue_update_run_ticks(q);
>+	if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
>+		wake_up_var(&q->xef->exec_queue.pending_removal);
> 
> 	for (i = 0; i < q->width; ++i)
> 		xe_lrc_put(q->lrc[i]);
>@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> 		return -EINVAL;
> 
>+	atomic_inc(&xef->exec_queue.pending_removal);
> 	mutex_lock(&xef->exec_queue.lock);
> 	q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
> 	mutex_unlock(&xef->exec_queue.lock);
>
>
>The idea is that any process reading the fdinfo needs to wait for
>contexts that are going away via kill.
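
Distilled out of the driver structs, that's the usual
wait_var_event()/wake_up_var() pairing (sketch only; in the real diff
the counter lives in struct xe_file and the accounting happens in
xe_exec_queue_fini()):

#include <linux/atomic.h>
#include <linux/wait_bit.h>

static atomic_t pending_removal = ATOMIC_INIT(0);

/* destroy ioctl: mark one more queue as "going away" */
static void queue_destroy_start(void)
{
	atomic_inc(&pending_removal);
}

/* queue fini (async, after the final context switch out): account and wake */
static void queue_destroy_finish(void)
{
	/* run ticks are accumulated here in the real code */
	if (atomic_dec_and_test(&pending_removal))
		wake_up_var(&pending_removal);
}

/* fdinfo read: don't sample until all in-flight removals have accounted */
static void fdinfo_read(void)
{
	wait_var_event(&pending_removal, !atomic_read(&pending_removal));
	/* ... now read per-class cycles and the GPU timestamp ... */
}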
>
>
>>
>>I was of the opinion that we should solve it in Xe by adding an
>>update call in an additional place (like you are doing), but after
>>digging into it a bit, I am not sure if we should resolve this
>>specific issue. Instead, we should alter the test to not check for
>>accuracy when the queue is destroyed before taking the second
>>sample. We know that the ticks will get updated at some reasonable
>>point in future and the user will see it in subsequent fdinfo
>>queries anyways. If that "reasonable point in future" is
>>unacceptably large, then I think the problem is outside the PCEU
>>domain.
>>
>>Note that the original reason we added the test was to catch the ref
>>count issue with xef object (which is now fixed).
>>
>>>
>>>Another thought I had was to use the wabb, but afaics we can only
>>>execute something on context restore, not on context save.
>>
>>I am curious what you want to run in context save though and how
>>it's any different from what's happening now - CTX_TIMESTAMP is
>>being updated on save.
>
>I was thinking about letting the GPU use MI_MATH to keep calculating
>the delta... but yeah, it wouldn't help in this particular case.
>
>
>>
>>>
>>>>
>>>>If the ftrace is getting filled up, we could throttle that.
>>>
>>>Oh no, that is definitely not what I want. If we enable the tracepoint, we
>>>want to see it, not artificially drop the events.
>>>
>>>Initially (and to get a good measure of function runtime), I was
>>>actually using retsnoop rather than using the previously non-existent
>>>tracepoint:
>>>
>>> retsnoop -e xe_lrc_update_timestamp -e xe_lrc_create -e xe_lrc_destroy -S -A -C args.fmt-max-arg-width=0
>>
>>Didn't know that. That's ^ useful.
>
>life saver - I keep forgetting options for the other tools to do similar
>stuff, but this one is so simple and effective.
>
>Lucas De Marchi
>
>>
>>Thanks,
>>Umesh
>>>
>>>Lucas De Marchi
>>>
>>>>
>>>>Thanks,
>>>>Umesh
>>>>
>>>>> trace_xe_exec_queue_close(q);
>>>>> xe_exec_queue_put(q);
>>>>>
>>>>>--
>>>>>2.47.0
>>>>>