[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Lucas De Marchi
lucas.demarchi at intel.com
Tue Oct 29 17:58:35 UTC 2024
On Tue, Oct 29, 2024 at 10:27:28AM -0700, Umesh Nerlige Ramappa wrote:
>On Mon, Oct 28, 2024 at 05:32:03PM -0500, Lucas De Marchi wrote:
>>On Mon, Oct 28, 2024 at 01:33:09PM -0700, Umesh Nerlige Ramappa wrote:
>>>On Sat, Oct 26, 2024 at 12:08:47PM -0500, Lucas De Marchi wrote:
>>>>When the exec queue is destroyed, there's a race between a query to the
>>>>fdinfo and the exec queue value being updated: after the destroy ioctl,
>>>>if the fdinfo is queried before a call to guc_exec_queue_free_job(),
>>>>the wrong utilization is reported: it's not accumulated on the query
>>>>since the queue was removed from the array, and the value wasn't updated
>>>>yet by the free_job().
>>>>
>>>>Explicitly accumulate the engine utilization so the right value is
>>>>visible after the ioctl return.
>>>>
>>>>Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667
>>>>Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>---
>>>>drivers/gpu/drm/xe/xe_exec_queue.c | 8 ++++++++
>>>>1 file changed, 8 insertions(+)
>>>>
>>>>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>index d098d2dd1b2d..b15ca84b2422 100644
>>>>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>>>>@@ -829,6 +829,14 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
>>>>
>>>> xe_exec_queue_kill(q);
>>>>
>>>>+ /*
>>>>+ * After killing and destroying the exec queue, make sure userspace has
>>>>+ * an updated view of the run ticks, regardless if this was the last
>>>>+ * ref: since the exec queue is removed from xef->exec_queue.xa, a
>>>>+ * query to fdinfo after this returns could not account for this load.
>>>>+ */
>>>>+ xe_exec_queue_update_run_ticks(q);
>>>>+
>>>
>>>At this point we may/may-not have the updated LRC timestamp.
>>>
>>>fwiu, xe_exec_queue_kill() is an async call. It will queue a work
>>>that will disable guc scheduling on the context. Once guc notifies
>>>KMD that scheduling is disabled on this context, KMD knows for
>>>sure that the context has switched out and the lrc timestamp is
>>>updated for this
>>
>>ok. Do you know what that notification is? Is it the cleanup that we
>>process at __guc_exec_queue_process_msg_cleanup()? (which is where I'm
>>moving the io read to in the other patch)
>>
>>>context. It may work well for contexts that switch frequently and
>>>may not work for contexts that seldom switch or never destroy
>>>their exec queue.
>>
>>why does it matter? It's only for contexts going away - for contexts that
>>are scheduled out, the sampling on fdinfo read covers it. Note that this
>>call is not replacing other calls. If the exec queue is not destroyed,
>>the other 2 places should still cover it.
>>
>>The problem when the destroy ioctl is called is that q is removed from
>>xef->exec_queue.xa, so the additional call on fdinfo read stops
>>working. It then relies on the last job completing and being released before
>>a read on the fdinfo. As CI shows, this is racy and fails in ~10% of
>>the cases.
>>
>>On a remote system, I saw it failing ~20% of the time
>>
>>>
>>>I still believe calling it from job free is the right thing to do.
>>>As for the ~120 Hz updates, these are just memory updates, so not
>>>sure if it's a huge performance impact.
>>
>>it seems now you are commenting on the change in the other patch?
>
>sorry, I will try to be more clear here.
>
>patch 3/3:
>
>I missed the fact that the updates could be in VRAM, so moving the
>update out of job_free makes sense.
>
>Also since fdinfo query is expected to be frequent enough, adding the
>update to exec_queue_fini makes sense too.
>
>patch 2/3 (everything below)
>
>it's just the update in destroy_ioctl that I am not convinced about
>since the kill is async. Also you see another instance of failure
>already below.
>
>>
>>I think it's pretty pointless to do the io access when nobody cares
>>about that number. The user needs to be reading the fdinfo, probably
>>through gputop or the like for the number to be meaningful. So why do we
>>care about sampling it at a high frequency when nobody is looking? For
>>real, non-CI use-cases it should be as good as before. The difference
>>is that instead of getting 120 updates of few cycles, we are getting
>>just 1 with the amount of cycles that got used in the last sampling
>>period.
>>
>>Looking at the traces, we do get a few low-msec execution times for
>>xe_lrc_update_timestamp(). Mind you, there's another race with updating the
>>u32 in xef in a non-atomic way, and doing it that frequently just makes
>>it more likely to happen. That still needs a separate fix.
>>
>>Goal of this series is to make this work:
>>
>> $ for ((i=0; i < 100; i++)); do echo $i/100 >&2; if ! ./build/tests/xe_drm_fdinfo; then break; fi; done
>>
>>I was happy that it passed when I sent this. But just double-checking
>>on another host, the issue is still there and I get this after 16
>>iterations:
>>
>>Starting subtest: utilization-single-full-load-destroy-queue
>>....
>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>**** END ****
>>
>>which makes me think it's probably related to the kill being async as
>>you mentioned.
>>
>>I wonder if we should synchronize the call in the fdinfo read with the
>>queues that are going away.
>
>Hmm, maybe.
Doing that, it passes 62/100 for me when running all
xe_drm_fdinfo@utilization-* tests.
The failure on run 63 is different and I think it's another bug or
race. This is what I'm testing with currently:
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index a3e777ad281e3..eaee19efeadce 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -614,6 +614,11 @@ struct xe_file {
 		 * does things while being held.
 		 */
 		struct mutex lock;
+		/**
+		 * @exec_queue.pending_removal: items pending to be removed to
+		 * synchronize GPU state update with ongoing query.
+		 */
+		atomic_t pending_removal;
 	} exec_queue;
 	/** @run_ticks: hw engine class run time in ticks for this drm client */
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index a9b0d640b2581..5f6347d12eec5 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
 	if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
 		goto fail_gpu_timestamp;
+	/*
+	 * Wait for any exec queue going away: their cycles will get updated on
+	 * context switch out, so wait for that to happen
+	 */
+	wait_var_event(&xef->exec_queue.pending_removal,
+		       !atomic_read(&xef->exec_queue.pending_removal));
+
 	xe_pm_runtime_put(xe);
 	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index fd0f3b3c9101d..58dd35beb15ad 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
 	/*
 	 * Before releasing our ref to lrc and xef, accumulate our run ticks
+	 * and wakeup any waiters.
 	 */
 	xe_exec_queue_update_run_ticks(q);
+	if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
+		wake_up_var(&q->xef->exec_queue.pending_removal);
 	for (i = 0; i < q->width; ++i)
 		xe_lrc_put(q->lrc[i]);
@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
 	    XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
 		return -EINVAL;
+	atomic_inc(&xef->exec_queue.pending_removal);
 	mutex_lock(&xef->exec_queue.lock);
 	q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
 	mutex_unlock(&xef->exec_queue.lock);

The idea is that any process reading the fdinfo needs to wait for contexts
going away via kill.
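
In other words, the pairing looks roughly like this (a simplified,
self-contained sketch with made-up helper names, not the actual driver
code):

#include <linux/atomic.h>
#include <linux/wait_bit.h>

static atomic_t pending_removal = ATOMIC_INIT(0);

/* destroy ioctl: account for one queue whose (async) kill is in flight */
static void removal_begin(void)
{
	atomic_inc(&pending_removal);
}

/*
 * exec_queue_fini: run ticks were just accumulated, so drop the pending
 * count and wake any fdinfo reader waiting for it to reach zero.
 */
static void removal_end(void)
{
	if (atomic_dec_and_test(&pending_removal))
		wake_up_var(&pending_removal);
}

/* fdinfo read: don't sample run ticks while a killed queue may still update them */
static void wait_pending_removals(void)
{
	wait_var_event(&pending_removal, !atomic_read(&pending_removal));
}
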
>
>I was of the opinion that we should solve it in Xe by adding an update
>call in an additional place (like you are doing), but after digging
>into it a bit, I am not sure if we should resolve this specific issue.
>Instead, we should alter the test to not check for accuracy when the
>queue is destroyed before taking the second sample. We know that the
>ticks will get updated at some reasonable point in future and the user
>will see it in subsequent fdinfo queries anyways. If that "reasonable
>point in future" is unacceptably large, then I think the problem is
>outside the PCEU domain.
>
>Note that the original reason we added the test was to catch the ref
>count issue with xef object (which is now fixed).
>
>>
>>Another thought I had was to use the wabb, but afaics we can only
>>execute something on context restore, not on context save.
>
>I am curious what you want to run in context save though and how it's
>any different from what's happening now - CTX_TIMESTAMP is being
>updated on save.
I was thinking about letting the gpu use MI_MATH to keep calculating
the delta.... but yeah, it wouldn't help in this particular case.
>
>>
>>>
>>>If the ftrace is getting filled up, we could throttle that.
>>
>>oh no, that is definitely not what I want. If we enable the tracepoint, we
>>want to see it, not artificially drop the events.
>>
>>Initially (and to get a good measure of function runtime), I was
>>actually using retsnoop rather than using the previously non-existent
>>tracepoint:
>>
>> retsnoop -e xe_lrc_update_timestamp -e xe_lrc_create -e xe_lrc_destroy -S -A -C args.fmt-max-arg-width=0
>
>Didn't know that. That's ^ useful.
life saver - I keep forgetting options for the other tools to do similar
stuff, but this one is so simple and effective.
Lucas De Marchi
>
>Thanks,
>Umesh
>>
>>Lucas De Marchi
>>
>>>
>>>Thanks,
>>>Umesh
>>>
>>>> trace_xe_exec_queue_close(q);
>>>> xe_exec_queue_put(q);
>>>>
>>>>--
>>>>2.47.0
>>>>