[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Tue Oct 29 19:03:14 UTC 2024
On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>On Tue, Oct 29, 2024 at 10:27:28AM -0700, Umesh Nerlige Ramappa wrote:
>>On Mon, Oct 28, 2024 at 05:32:03PM -0500, Lucas De Marchi wrote:
>>>On Mon, Oct 28, 2024 at 01:33:09PM -0700, Umesh Nerlige Ramappa wrote:
>>>>On Sat, Oct 26, 2024 at 12:08:47PM -0500, Lucas De Marchi wrote:
>>>>>When the exec queue is destroyed, there's a race between a query to the
>>>>>fdinfo and the exec queue value being updated: after the destroy ioctl,
>>>>>if the fdinfo is queried before a call to guc_exec_queue_free_job(),
>>>>>the wrong utilization is reported: it's not accumulated on the query
>>>>>since the queue was removed from the array, and the value wasn't updated
>>>>>yet by the free_job().
>>>>>
>>>>>Explicitly accumulate the engine utilization so the right value is
>>>>>visible after the ioctl return.
>>>>>
>>>>>Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
>>>>>Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>---
>>>>>drivers/gpu/drm/xe/xe_exec_queue.c | 8 ++++++++
>>>>>1 file changed, 8 insertions(+)
>>>>>
>>>>>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>index d098d2dd1b2d..b15ca84b2422 100644
>>>>>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>>@@ -829,6 +829,14 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
>>>>>
>>>>> xe_exec_queue_kill(q);
>>>>>
>>>>>+ /*
>>>>>+ * After killing and destroying the exec queue, make sure userspace has
>>>>>+ * an updated view of the run ticks, regardless if this was the last
>>>>>+ * ref: since the exec queue is removed from xef->exec_queue.xa, a
>>>>>+ * query to fdinfo after this returns could not account for this load.
>>>>>+ */
>>>>>+ xe_exec_queue_update_run_ticks(q);
>>>>>+
>>>>
>>>>At this point we may or may not have the updated LRC timestamp.
>>>>
>>>>fwiu, xe_exec_queue_kill() is an async call. It will queue a
>>>>work item that disables guc scheduling on the context. Once guc
>>>>notifies KMD that scheduling is disabled on this context, KMD
>>>>knows for sure that the context has switched out and the lrc
>>>>timestamp is updated for this
>>>
>>>ok. Do you know what that notification is? Is it the cleanup that we
>>>process in __guc_exec_queue_process_msg_cleanup()? (which is where I'm
>>>moving the io read to in the other patch)
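>>>
>>>Something along these lines is what I have in mind (untested sketch;
>>>I'm assuming the cleanup handler still receives the queue via
>>>msg->private_data, which may not match the actual code):
>>>
>>>static void __guc_exec_queue_process_msg_cleanup(struct xe_sched_msg *msg)
>>>{
>>>	struct xe_exec_queue *q = msg->private_data;
>>>
>>>	/*
>>>	 * By the time the cleanup message is processed, GuC has confirmed
>>>	 * scheduling is disabled, so the context has switched out and the
>>>	 * LRC timestamp is final: accumulate the run ticks here instead of
>>>	 * on every guc_exec_queue_free_job().
>>>	 */
>>>	xe_exec_queue_update_run_ticks(q);
>>>
>>>	/* ... the existing cleanup (deregister/fini) stays unchanged ... */
>>>}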
>>>
>>>>context. It may work well for contexts that switch frequently
>>>>and may not work for contexts that seldom switch or never
>>>>destroy their exec queue.
>>>
>>>Why does it matter? It's only for contexts going away - for contexts
>>>that are scheduled out, the sampling on fdinfo read covers it. Note that
>>>this call is not replacing the other calls: if the exec queue is not
>>>destroyed, the other 2 places should still cover it.
>>>
>>>The problem when the destroy ioctl is called is that q is removed from
>>>xef->exec_queue.xa, so the additional call on fdinfo read stops working.
>>>It then relies on the last job completing and being released before the
>>>fdinfo is read. As CI shows, this is racy and fails in ~10% of the
>>>cases.
>>>
>>>On a remote system I saw it failing ~20% of the time.
>>>
>>>>
>>>>I still believe calling it from job free is the right thing to
>>>>do. As for the ~120 Hz updates, these are just memory updates,
>>>>so I'm not sure it's a huge performance impact.
>>>
>>>it seems now you are commenting on the change in the other patch?
>>
>>sorry, I will try to be more clear here.
>>
>>patch 3/3:
>>
>>I missed the fact that the updates could be in VRAM, so moving the
>>update out of job_free makes sense.
>>
>>Also, since the fdinfo query is expected to be frequent enough, adding
>>the update to exec_queue_fini makes sense too.
>>
>>patch 2/3 (everything below)
>>
>>It's just the update in destroy_ioctl that I am not convinced about,
>>since the kill is async. Also, you can already see another instance of
>>the failure below.
>>
>>>
>>>I think it's pretty pointless to do the io access when nobody cares
>>>about that number. The user needs to be reading the fdinfo, probably
>>>through gputop or the like, for the number to be meaningful. So why do
>>>we care about sampling it at a high frequency when nobody is looking?
>>>For real, non-CI use-cases it should be as good as before. The
>>>difference is that instead of getting 120 updates of a few cycles each,
>>>we get just one with the number of cycles used in the last sampling
>>>period.
>>>
>>>Looking at the traces, we do see a few low-millisecond execution times
>>>for xe_lrc_update_timestamp(). Mind you, there's another race, updating
>>>a u32 in xef in a non-atomic way, and doing it that frequently just
>>>makes it more likely to hit. That still needs a separate fix.
>>>
>>>Goal of this series is to make this work:
>>>
>>> $ for ((i=0; i < 100; i++)); do echo $i/100 >&2; if ! ./build/tests/xe_drm_fdinfo; then break; fi; done
>>>
>>>I was happy when sending this that it passed. But double-checking on
>>>another host, the issue is still there and I get this after 16
>>>iterations:
>>>
>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>....
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>>**** END ****
>>>
>>>which makes me think it's probably related to the kill being async as
>>>you mentioned.
>>>
>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>queues that are going away.
>>
>>Hmm, maybe.
>
>Doing that, it passes 62/100 iterations for me, running all the
>xe_drm_fdinfo@utilization-* tests.
>
>The failure on run 63 is different and I think it's another bug or
>race. This is what I'm testing with currently:
>
>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>index a3e777ad281e3..eaee19efeadce 100644
>--- a/drivers/gpu/drm/xe/xe_device_types.h
>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>@@ -614,6 +614,11 @@ struct xe_file {
> * does things while being held.
> */
> struct mutex lock;
>+ /**
>+ * @exec_queue.pending_removal: items pending to be removed to
>+ * synchronize GPU state update with ongoing query.
>+ */
>+ atomic_t pending_removal;
> } exec_queue;
> /** @run_ticks: hw engine class run time in ticks for this drm client */
>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>index a9b0d640b2581..5f6347d12eec5 100644
>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
> if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
> goto fail_gpu_timestamp;
>+ /*
>+ * Wait for any exec queue going away: their cycles will get updated on
>+ * context switch out, so wait for that to happen
>+ */
>+ wait_var_event(&xef->exec_queue.pending_removal,
>+ !atomic_read(&xef->exec_queue.pending_removal));
>+
> xe_pm_runtime_put(xe);
> for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>index fd0f3b3c9101d..58dd35beb15ad 100644
>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
> /*
> * Before releasing our ref to lrc and xef, accumulate our run ticks
>+ * and wakeup any waiters.
> */
> xe_exec_queue_update_run_ticks(q);
>+ if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
>+ wake_up_var(&q->xef->exec_queue.pending_removal);
> for (i = 0; i < q->width; ++i)
> xe_lrc_put(q->lrc[i]);
>@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
> XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
> return -EINVAL;
>+ atomic_inc(&xef->exec_queue.pending_removal);
> mutex_lock(&xef->exec_queue.lock);
> q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
> mutex_unlock(&xef->exec_queue.lock);
>
>
>Idea is that any process reading the fdinfo needs to wait on contexts
>going away via kill.
>
Yep. That should work for the synchronization.
Once you kick off the queue destruction, it could be a short while
before the context is actually stopped by GuC, so run ticks will keep
ticking until then, but you may have already sampled the gt_timestamp by
that point. Maybe sample the gt timestamp after the wait is over... just
thinking out loud.
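
Roughly, I mean reordering the snippet from your diff in show_run_ticks()
like this (untested, just to illustrate the ordering; the helpers are the
ones from your patch):

	/*
	 * Wait for exec queues going away first, so their run ticks have
	 * been accumulated by xe_exec_queue_fini() before we snapshot time.
	 */
	wait_var_event(&xef->exec_queue.pending_removal,
		       !atomic_read(&xef->exec_queue.pending_removal));

	/*
	 * Only now sample the GPU timestamp, so it is not older than the
	 * run ticks it is compared against.
	 */
	if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
		goto fail_gpu_timestamp;

	xe_pm_runtime_put(xe);
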
Thanks,
Umesh