[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Tue Oct 29 19:03:14 UTC 2024
On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>On Tue, Oct 29, 2024 at 10:27:28AM -0700, Umesh Nerlige Ramappa wrote:
>>On Mon, Oct 28, 2024 at 05:32:03PM -0500, Lucas De Marchi wrote:
>>>On Mon, Oct 28, 2024 at 01:33:09PM -0700, Umesh Nerlige Ramappa wrote:
>>>>On Sat, Oct 26, 2024 at 12:08:47PM -0500, Lucas De Marchi wrote:
>>>>>When the exec queue is destroyed, there's a race between a query to the
>>>>>fdinfo and the exec queue value being updated: after the destroy ioctl,
>>>>>if the fdinfo is queried before a call to guc_exec_queue_free_job(),
>>>>>the wrong utilization is reported: it's not accumulated on the query
>>>>>since the queue was removed from the array, and the value wasn't updated
>>>>>yet by the free_job().
>>>>>
>>>>>Explicitly accumulate the engine utilization so the right value is
>>>>>visible after the ioctl return.
>>>>>
>>>>>Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
>>>>>Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>---
>>>>>drivers/gpu/drm/xe/xe_exec_queue.c | 8 ++++++++
>>>>>1 file changed, 8 insertions(+)
>>>>>
>>>>>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>index d098d2dd1b2d..b15ca84b2422 100644
>>>>>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>@@ -829,6 +829,14 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
>>>>>
>>>>> xe_exec_queue_kill(q);
>>>>>
>>>>>+ /*
>>>>>+ * After killing and destroying the exec queue, make sure userspace has
>>>>>+ * an updated view of the run ticks, regardless if this was the last
>>>>>+ * ref: since the exec queue is removed from xef->exec_queue.xa, a
>>>>>+ * query to fdinfo after this returns could not account for this load.
>>>>>+ */
>>>>>+ xe_exec_queue_update_run_ticks(q);
>>>>>+
>>>>
>>>>At this point we may or may not have the updated LRC timestamp.
>>>>
>>>>fwiu, xe_exec_queue_kill() is an async call. It will queue a
>>>>work item that disables guc scheduling on the context. Once guc
>>>>notifies KMD that scheduling is disabled on this context, KMD
>>>>knows for sure that the context has switched out and the lrc
>>>>timestamp is updated for this
>>>
>>>ok. Do you know what that notification is? Is it the cleanup that we
>>>process in __guc_exec_queue_process_msg_cleanup()? (which is where I'm
>>>moving the io read to in the other patch)
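>>>
>>>Something along these lines is what I have in mind (untested sketch;
>>>I'm assuming the cleanup handler still receives the queue via
>>>msg->private_data, which may not match the actual code):
>>>
>>>static void __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
>>>{
>>>	struct xe_exec_queue *q = msg->private_data;
>>>
>>>	/*
>>>	 * By the time the cleanup message is processed, GuC has confirmed
>>>	 * scheduling is disabled, so the context has switched out and the
>>>	 * LRC timestamp is final: accumulate the run ticks here instead of
>>>	 * on every guc_exec_queue_free_job().
>>>	 */
>>>	xe_exec_queue_update_run_ticks(q);
>>>
>>>	/* ... the existing cleanup (deregister/fini) stays unchanged ... */
>>>}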
>>>
>>>>context. It may work well for contexts that switch frequently
>>>>and may not work for contexts that seldom switch or never
>>>>destroy their exec queue.
>>>
>>>Why does it matter? It's only for contexts going away - for contexts
>>>that are scheduled out, the sampling on fdinfo read covers it. Note that
>>>this call is not replacing the other calls: if the exec queue is not
>>>destroyed, the other 2 places should still cover it.
>>>
>>>The problem when the destroy ioctl is called is that q is removed from
>>>xef->exec_queue.xa, so the additional call on fdinfo read stops working.
>>>It then relies on the last job completing and being released before the
>>>fdinfo is read. As CI shows, this is racy and fails in ~10% of the
>>>cases.
>>>
>>>On a remote system I saw it failing ~20% of the time.
>>>
>>>>
>>>>I still believe calling it from job free is the right thing to
>>>>do. As for the ~120 Hz updates, these are just memory updates,
>>>>so I'm not sure it's a huge performance impact.
>>>
>>>it seems now you are commenting on the change in the other patch?
>>
>>sorry, I will try to be more clear here.
>>
>>patch 3/3:
>>
>>I missed the fact that the updates could be in VRAM, so moving the
>>update out of job_free makes sense.
>>
>>Also, since the fdinfo query is expected to be frequent enough, adding
>>the update to exec_queue_fini makes sense too.
>>
>>patch 2/3 (everything below)
>>
>>It's just the update in destroy_ioctl that I am not convinced about,
>>since the kill is async. Also, you can already see another instance of
>>the failure below.
>>
>>>
>>>I think it's pretty pointless to do the io access when nobody cares
>>>about that number. The user needs to be reading the fdinfo, probably
>>>through gputop or the like, for the number to be meaningful. So why do
>>>we care about sampling it at a high frequency when nobody is looking?
>>>For real, non-CI use-cases it should be as good as before. The
>>>difference is that instead of getting 120 updates of a few cycles each,
>>>we get just one with the number of cycles used in the last sampling
>>>period.
>>>
>>>Looking at the traces, we do see a few low-millisecond execution times
>>>for xe_lrc_update_timestamp(). Mind you, there's another race, updating
>>>a u32 in xef in a non-atomic way, and doing it that frequently just
>>>makes it more likely to hit. That still needs a separate fix.
>>>
>>>Goal of this series is to make this work:
>>>
>>> $ for ((i=0; i < 100; i++)); do echo $i/100 >&2; if ! ./build/tests/xe_drm_fdinfo; then break; fi; done
>>>
>>>I was happy when sending this that it passed. But double-checking on
>>>another host, the issue is still there and I get this after 16
>>>iterations:
>>>
>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>....
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>>**** END ****
>>>
>>>which makes me think it's probably related to the kill being async as
>>>you mentioned.
>>>
>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>queues that are going away.
>>
>>Hmm, maybe.
>
>Doing that, it passes 62/100 iterations for me, running all the
>xe_drm_fdinfo@utilization-* tests.
>
>The failure on run 63 is different and I think it's another bug or
>race. This is what I'm testing with currently:
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index a3e777ad281e3..eaee19efeadce 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -614,6 +614,11 @@ struct xe_file {
> * does things while being held.
> */
> struct mutex lock;
>+ /**
>+ * @exec_queue.pending_removal: items pending to be removed to
>+ * synchronize GPU state update with ongoing query.
>+ */
>+ atomic_t pending_removal;
> } exec_queue;
> /** @run_ticks: hw engine class run time in ticks for this drm client */
>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>index a9b0d640b2581..5f6347d12eec5 100644
>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
> goto fail_gpu_timestamp;
>+ /*
>+ * Wait for any exec queue going away: their cycles will get updated on
>+ * context switch out, so wait for that to happen
>+ */
>+ wait_var_event(&xef->exec_queue.pending_removal,
>+ !atomic_read(&xef->exec_queue.pending_removal));
>+
> xe_pm_runtime_put(xe);
> for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>index fd0f3b3c9101d..58dd35beb15ad 100644
>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
> /*
> * Before releasing our ref to lrc and xef, accumulate our run ticks
>+ * and wakeup any waiters.
> */
> xe_exec_queue_update_run_ticks(q);
>+ if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
>+ wake_up_var(&q->xef->exec_queue.pending_removal);
> for (i = 0; i < q->width; ++i)
> xe_lrc_put(q->lrc[i]);
>@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> return -EINVAL;
>+ atomic_inc(&xef->exec_queue.pending_removal);
> mutex_lock(&xef->exec_queue.lock);
> q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
> mutex_unlock(&xef->exec_queue.lock);
>
>
>Idea is that any process reading the fdinfo needs to wait on contexts
>going away via kill.
>
Yep. That should work for the synchronization.
Once you kick off the queue destruction, it could be a short while
before the context is actually stopped by GuC, so run ticks will keep
ticking until then, but you may have already sampled the gt_timestamp by
that point. Maybe sample the gt timestamp after the wait is over... just
thinking out loud.
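
Roughly, I mean reordering the snippet from your diff in show_run_ticks()
like this (untested, just to illustrate the ordering; the helpers are the
ones from your patch):

	/*
	 * Wait for exec queues going away first, so their run ticks have
	 * been accumulated by xe_exec_queue_fini() before we snapshot time.
	 */
	wait_var_event(&xef->exec_queue.pending_removal,
		       !atomic_read(&xef->exec_queue.pending_removal));

	/*
	 * Only now sample the GPU timestamp, so it is not older than the
	 * run ticks it is compared against.
	 */
	if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
		goto fail_gpu_timestamp;

	xe_pm_runtime_put(xe);
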
Thanks,
Umesh