[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy

Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Tue Oct 29 19:31:54 UTC 2024


On Tue, Oct 29, 2024 at 02:12:22PM -0500, Lucas De Marchi wrote:
>On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>>....
>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #1 [check_results+0x204]
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>>(xe_drm_fdinfo:5997) igt_core-INFO:   #6 [_start+0x25]
>>>>****  END  ****
>>>>
>>>>which makes me think it's probably related to the kill being async as
>>>>you mentioned.
>>>>
>>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>>queues that are going away.
>>>
>>>Hmm, maybe.
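
Expanding on that a bit - the first failure's numbers are consistent with
the destroyed queue's ticks simply not having been accumulated yet when
the second sample was taken:

	run ticks delta:    9637999 - 9637999         = 0
	GT timestamp delta: 19277703269 - 19272820538 = 4882731
	0 / 4882731                                   -> 0%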
>>
>>doing that, it passes for me 62/100 runs of all the
>>xe_drm_fdinfo@utilization-* tests.
>>
>>The failure on run 63 is different and I think it's another bug or
>
>
>so the other failure, which I forgot to paste:
>
>	Starting subtest: utilization-all-full-load
>	(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
>	(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
>	(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
>	Stack trace:
>	  #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>	  #1 ../tests/intel/xe_drm_fdinfo.c:520 check_results()
>	  #2 ../tests/intel/xe_drm_fdinfo.c:464 __igt_unique____real_main806()
>	  #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>	  #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>	  #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>	  #6 [_start+0x25]
>	Subtest utilization-all-full-load failed.
>	**** DEBUG ****
>	(xe_drm_fdinfo:14864) DEBUG: rcs: spinner started
>	(xe_drm_fdinfo:14864) DEBUG: bcs: spinner started
>	(xe_drm_fdinfo:14864) DEBUG: ccs: spinner started
>	(xe_drm_fdinfo:14864) DEBUG: vcs: spinner started
>	(xe_drm_fdinfo:14864) DEBUG: vecs: spinner started
>	(xe_drm_fdinfo:14864) DEBUG: rcs: spinner ended (timestamp=15218479)
>	(xe_drm_fdinfo:14864) DEBUG: bcs: spinner ended (timestamp=15194339)
>	(xe_drm_fdinfo:14864) DEBUG: vcs: spinner ended (timestamp=4837648)
>	(xe_drm_fdinfo:14864) DEBUG: vecs: spinner ended (timestamp=4816316)
>	(xe_drm_fdinfo:14864) DEBUG: ccs: spinner ended (timestamp=4859494)
>	(xe_drm_fdinfo:14864) DEBUG: rcs: sample 1: cycles 40481368, total_cycles 31104224861
>	(xe_drm_fdinfo:14864) DEBUG: rcs: sample 2: cycles 55700053, total_cycles 31109049238
>	(xe_drm_fdinfo:14864) DEBUG: rcs: percent: 315.453826
>	(xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
>	(xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
>	(xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
>	(xe_drm_fdinfo:14864) igt_core-INFO: Stack trace:
>	(xe_drm_fdinfo:14864) igt_core-INFO:   #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>
>
>From the timestamps read by the GPU:
>rcs timestamp=15218479 and bcs timestamp=15194339... which are indeed much
>higher than the total_cycles delta available:
>31109049238 - 31104224861 = 4824377, which is reasonably close to the
>timestamps for the other engines.
>
>my hypothesis is something like this:
>
>sample1:
>	accumulate_exec_queue (t = 0)
>				<<<<<<<< preemption
>	read_total_gpu_cycles (t = 200)
>
>sample2:
>	accumulate_exec_queue (t = 300)
>	read_total_gpu_cycles  (t = 300)
>

It could just as well be the second sample; see my previous email on why
run ticks can be larger than the GT timestamp delta.
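
Spelling out the rcs numbers from the log above:

	run ticks delta:    55700053 - 40481368       = 15218685
	GT timestamp delta: 31109049238 - 31104224861 =  4824377
	15218685 / 4824377 = ~3.15                    -> 315.45%

so the run ticks advanced ~3x faster than wall clock; whichever sample
picked up the extra ticks, the pair is inconsistent.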

For the sake of narrowing it down, you could capture the value of the
CTX_TIMESTAMP mmio register before killing the exec queue in destroy. It
should be ticking while the context is active, and once the context stops
it will hold the updated value. That way we know how long it took to stop.
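
Something along these lines in the destroy/kill path would do for that
experiment. This is only a debug sketch: I'm assuming a
RING_CTX_TIMESTAMP(base) register definition at the usual 0x3a8 offset
and whichever xe_mmio_read32() flavor your tree has, and the read needs
the engine's forcewake held to mean anything:

	/* Debug only: snapshot CTX_TIMESTAMP before the async kill starts */
	static void dbg_ctx_timestamp(struct xe_exec_queue *q)
	{
		struct xe_hw_engine *hwe = q->hwe;
		u32 ts;

		/* assumes RING_CTX_TIMESTAMP(base) == XE_REG((base) + 0x3a8) */
		ts = xe_mmio_read32(q->gt, RING_CTX_TIMESTAMP(hwe->mmio_base));
		drm_dbg(&gt_to_xe(q->gt)->drm,
			"%s: CTX_TIMESTAMP before kill: %u\n", hwe->name, ts);
	}

Comparing that value against what xe_exec_queue_update_run_ticks() later
accumulates would tell us how much time passes between the kill and the
final context save.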

Thanks,
Umesh

>
>which makes cycles = 300, total_cycles = 200.
>
>One easy thing that helps: move the forcewake domain lookup/get to the
>beginning. I don't think it will be 100% bulletproof, but it improved
>the execution on a misbehaving LNL to a 100/100 pass. Maybe I was lucky
>in this run.
>
>Other than that, we may need to resort to keeping a copy of the last
>timestamp reported and redoing the read if nonsense comes out, or adding
>some locking.
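
If we went the "keep a copy and redo" route, I imagine something roughly
like this (pure sketch - the prev_* fields and the read_samples() helper
are made up for illustration):

	static void sample_busyness(struct xe_file *xef, u64 *ticks, u64 *gpu_ts)
	{
		int retries = 2;

		do {
			read_samples(xef, ticks, gpu_ts);	/* hypothetical */
			/* plausible only if ticks advanced no faster than wall clock */
			if (*ticks - xef->prev_ticks <= *gpu_ts - xef->prev_gpu_ts)
				break;
		} while (--retries);

		xef->prev_ticks = *ticks;
		xef->prev_gpu_ts = *gpu_ts;
	}

but the wait_var_event() approach below looks simpler.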
>
>Lucas De Marchi
>
>>race. This is what I'm testing with currently:
>>
>>diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
>>index a3e777ad281e3..eaee19efeadce 100644
>>--- a/drivers/gpu/drm/xe/xe_device_types.h
>>+++ b/drivers/gpu/drm/xe/xe_device_types.h
>>@@ -614,6 +614,11 @@ struct xe_file {
>>                * does things while being held.
>>                */
>>               struct mutex lock;
>>+               /**
>>+                * @exec_queue.pending_removal: number of exec queues pending
>>+                * removal; used to synchronize GPU state updates with queries.
>>+                */
>>+               atomic_t pending_removal;
>>       } exec_queue;
>>       /** @run_ticks: hw engine class run time in ticks for this drm client */
>>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>>index a9b0d640b2581..5f6347d12eec5 100644
>>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>>@@ -327,6 +327,13 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
>>       if (!read_total_gpu_timestamp(xe, &gpu_timestamp))
>>               goto fail_gpu_timestamp;
>>+       /*
>>+        * Wait for any exec queues that are going away: their cycles only get
>>+        * updated on context switch out, so wait for that to happen.
>>+        */
>>+        wait_var_event(&xef->exec_queue.pending_removal,
>>+                       !atomic_read(&xef->exec_queue.pending_removal));
>>+
>>       xe_pm_runtime_put(xe);
>>       for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
>>diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
>>index fd0f3b3c9101d..58dd35beb15ad 100644
>>--- a/drivers/gpu/drm/xe/xe_exec_queue.c
>>+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
>>@@ -262,8 +262,11 @@ void xe_exec_queue_fini(struct xe_exec_queue *q)
>>       /*
>>        * Before releasing our ref to lrc and xef, accumulate our run ticks
>>+        * and wake up any waiters.
>>        */
>>       xe_exec_queue_update_run_ticks(q);
>>+       if (q->xef && atomic_dec_and_test(&q->xef->exec_queue.pending_removal))
>>+               wake_up_var(&q->xef->exec_queue.pending_removal);
>>       for (i = 0; i < q->width; ++i)
>>               xe_lrc_put(q->lrc[i]);
>>@@ -824,6 +827,7 @@ int xe_exec_queue_destroy_ioctl(struct drm_device *dev, void *data,
>>           XE_IOCTL_DBG(xe, args->reserved[0] || args->reserved[1]))
>>               return -EINVAL;
>>+       atomic_inc(&xef->exec_queue.pending_removal);
>>       mutex_lock(&xef->exec_queue.lock);
>>       q = xa_erase(&xef->exec_queue.xa, args->exec_queue_id);
>>       mutex_unlock(&xef->exec_queue.lock);
>>
>>
>>The idea is that any process reading the fdinfo needs to wait for the
>>contexts that are going away via kill.
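
FWIW, stripped of the xe specifics, the pending_removal handshake in the
patch boils down to this standard kernel idiom (standalone sketch, names
made up):

	#include <linux/atomic.h>
	#include <linux/wait_bit.h>

	static atomic_t pending_removal = ATOMIC_INIT(0);

	/* destroy ioctl: account one in-flight async teardown */
	static void removal_begin(void)
	{
		atomic_inc(&pending_removal);
	}

	/* async fini path: ticks accumulated, drop the count, wake readers */
	static void removal_end(void)
	{
		if (atomic_dec_and_test(&pending_removal))
			wake_up_var(&pending_removal);
	}

	/* fdinfo reader: block until all pending teardowns have accumulated */
	static void wait_for_removals(void)
	{
		wait_var_event(&pending_removal, !atomic_read(&pending_removal));
	}

That looks reasonable to me, as long as the fini path is guaranteed to
run (otherwise the reader blocks forever).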
>>
>>
>>>
>>>I was of the opinion that we should solve it in Xe by adding an 
>>>update call in an additional place (like you are doing), but after 
>>>digging into it a bit, I am not sure if we should resolve this 
>>>specific issue. Instead, we should alter the test to not check for 
>>>accuracy when the queue is destroyed before taking the second 
>>>sample. We know that the ticks will get updated at some reasonable
>>>point in the future and the user will see it in subsequent fdinfo
>>>queries anyway. If that "reasonable point in the future" is
>>>unacceptably far out, then I think the problem is outside the PCEU
>>>domain.
>>>
>>>Note that the original reason we added the test was to catch the
>>>refcount issue with the xef object (which is now fixed).
>>>
>>>>
>>>>Another thought I had was to use the wabb, but afaics we can only
>>>>execute something on context restore, not on context save.
>>>
>>>I am curious what you want to run in context save though and how 
>>>it's any different from what's happening now - CTX_TIMESTAMP is 
>>>being updated on save.
>>
>>I was thinking about letting the GPU use MI_MATH to keep calculating
>>the delta... but yeah, it wouldn't help in this particular case.
>>
>>
>>>
>>>>
>>>>>
>>>>>If the ftrace buffer is getting filled up, we could throttle that.
>>>>
>>>>oh no, that is definitely not what I want. If we enable the tracepoint, we
>>>>want to see it, not artificially drop the events.
>>>>
>>>>Initially (and to get a good measure of function runtime), I was
>>>>actually using retsnoop rather than the tracepoint, which didn't exist
>>>>yet:
>>>>
>>>>	retsnoop -e xe_lrc_update_timestamp -e xe_lrc_create -e xe_lrc_destroy  -S -A -C args.fmt-max-arg-width=0
>>>
>>>Didn't know that. That's ^ useful.
>>
>>lifesaver - I keep forgetting the options for the other tools that do
>>similar stuff, but this one is so simple and effective.
>>
>>Lucas De Marchi
>>
>>>
>>>Thanks,
>>>Umesh
>>>>
>>>>Lucas De Marchi
>>>>
>>>>>
>>>>>Thanks,
>>>>>Umesh
>>>>>
>>>>>>	trace_xe_exec_queue_close(q);
>>>>>>	xe_exec_queue_put(q);
>>>>>>
>>>>>>-- 
>>>>>>2.47.0
>>>>>>

