[PATCH 2/3] drm/xe: Accumulate exec queue timestamp on destroy
Lucas De Marchi
lucas.demarchi at intel.com
Tue Oct 29 19:53:14 UTC 2024
On Tue, Oct 29, 2024 at 12:31:54PM -0700, Umesh Nerlige Ramappa wrote:
>On Tue, Oct 29, 2024 at 02:12:22PM -0500, Lucas De Marchi wrote:
>>On Tue, Oct 29, 2024 at 12:58:35PM -0500, Lucas De Marchi wrote:
>>>>>Starting subtest: utilization-single-full-load-destroy-queue
>>>>>....
>>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: spinner ended (timestamp=4818209)
>>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 1: cycles 9637999, total_cycles 19272820538
>>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: sample 2: cycles 9637999, total_cycles 19277703269
>>>>>(xe_drm_fdinfo:5997) DEBUG: vcs: percent: 0.000000
>>>>>(xe_drm_fdinfo:5997) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:527:
>>>>>(xe_drm_fdinfo:5997) CRITICAL: Failed assertion: 95.0 < percent
>>>>>(xe_drm_fdinfo:5997) CRITICAL: error: 95.000000 >= 0.000000
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: Stack trace:
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #1 [check_results+0x204]
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #2 ../tests/intel/xe_drm_fdinfo.c:860 __igt_unique____real_main806()
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>>>>>(xe_drm_fdinfo:5997) igt_core-INFO: #6 [_start+0x25]
>>>>>**** END ****
>>>>>
>>>>>which makes me think it's probably related to the kill being
>>>>>async, as you mentioned.
>>>>>
>>>>>I wonder if we should synchronize the call in the fdinfo read with the
>>>>>queues that are going away.
>>>>
>>>>Hmm, maybe.
>>>
>>>doing that, it passes for me in 62/100 runs of all the
>>>xe_drm_fdinfo@utilization-* tests.
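
For context, the synchronization being tried there could look roughly
like the sketch below. This is a minimal illustration only, not the
actual xe code: the run_ticks_lock, the run_ticks[] array, the
q->client backpointer and accumulate_run_ticks() are all assumed names.
The point is just that the final accumulation on exec queue destroy and
the fdinfo read serialize on the same lock, so a reader cannot race
with a queue that is going away.

	/*
	 * Minimal sketch of the idea -- NOT the actual xe code.  All
	 * names below (run_ticks_lock, run_ticks[], q->client,
	 * accumulate_run_ticks()) are assumptions for illustration.
	 */
	static void exec_queue_destroy_accumulate(struct xe_exec_queue *q)
	{
		struct xe_drm_client *client = q->client;

		mutex_lock(&client->run_ticks_lock);
		client->run_ticks[q->class] += accumulate_run_ticks(q);
		mutex_unlock(&client->run_ticks_lock);
	}

	static u64 client_run_ticks(struct xe_drm_client *client, int class)
	{
		u64 ticks;

		mutex_lock(&client->run_ticks_lock);
		ticks = client->run_ticks[class];	/* queues already destroyed */
		/* ... a walk over the still-live queues would also go here ... */
		mutex_unlock(&client->run_ticks_lock);

		return ticks;
	}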
>>>
>>>The failure on run 63 is different and I think it's another bug or
>>
>>
>>so here is the other failure, which I forgot to paste:
>>
>> Starting subtest: utilization-all-full-load
>> (xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
>> (xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
>> (xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
>> Stack trace:
>> #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>> #1 ../tests/intel/xe_drm_fdinfo.c:520 check_results()
>> #2 ../tests/intel/xe_drm_fdinfo.c:464 __igt_unique____real_main806()
>> #3 ../tests/intel/xe_drm_fdinfo.c:806 main()
>> #4 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main()
>> #5 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>> #6 [_start+0x25]
>> Subtest utilization-all-full-load failed.
>> **** DEBUG ****
>> (xe_drm_fdinfo:14864) DEBUG: rcs: spinner started
>> (xe_drm_fdinfo:14864) DEBUG: bcs: spinner started
>> (xe_drm_fdinfo:14864) DEBUG: ccs: spinner started
>> (xe_drm_fdinfo:14864) DEBUG: vcs: spinner started
>> (xe_drm_fdinfo:14864) DEBUG: vecs: spinner started
>> (xe_drm_fdinfo:14864) DEBUG: rcs: spinner ended (timestamp=15218479)
>> (xe_drm_fdinfo:14864) DEBUG: bcs: spinner ended (timestamp=15194339)
>> (xe_drm_fdinfo:14864) DEBUG: vcs: spinner ended (timestamp=4837648)
>> (xe_drm_fdinfo:14864) DEBUG: vecs: spinner ended (timestamp=4816316)
>> (xe_drm_fdinfo:14864) DEBUG: ccs: spinner ended (timestamp=4859494)
>> (xe_drm_fdinfo:14864) DEBUG: rcs: sample 1: cycles 40481368, total_cycles 31104224861
>> (xe_drm_fdinfo:14864) DEBUG: rcs: sample 2: cycles 55700053, total_cycles 31109049238
>> (xe_drm_fdinfo:14864) DEBUG: rcs: percent: 315.453826
>> (xe_drm_fdinfo:14864) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528:
>> (xe_drm_fdinfo:14864) CRITICAL: Failed assertion: percent < 105.0
>> (xe_drm_fdinfo:14864) CRITICAL: error: 315.453826 >= 105.000000
>> (xe_drm_fdinfo:14864) igt_core-INFO: Stack trace:
>> (xe_drm_fdinfo:14864) igt_core-INFO: #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>>
>>
>>From the timestamps read by the GPU: rcs timestamp=15218479 and bcs
>>timestamp=15194339... both are indeed much higher than the
>>total_cycles delta available:
>>31109049238 - 31104224861 = 4824377, which is reasonably close to the
>>timestamps for the other engines.
>>
>>my hypothesis is something like this:
>>
>>sample1:
>> accumulate_exec_queue (t = 0)
>> <<<<<<<< preemption
>> read_total_gpu_cycles (t = 200)
>>
>>sample2:
>> accumulate_exec_queue (t = 300)
>> read_total_gpu_cycles (t = 300)
>>
>>i.e. the cycles delta (300 - 0 = 300) ends up much larger than the
>>total_cycles delta (300 - 200 = 100), which would make the computed
>>utilization go way above 100%.
>>
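
To make the arithmetic behind that hypothesis concrete, here is a small
standalone illustration (plain userspace C, not driver code): if
sample 1 reads total_cycles later than it accumulates cycles, the next
delta computation blows past 100%.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* sample 1: cycles accumulated at t = 0, but the
		 * total_cycles read is delayed to t = 200, e.g. by
		 * preemption in between */
		uint64_t cycles1 = 0, total1 = 200;

		/* sample 2: both values read at t = 300 */
		uint64_t cycles2 = 300, total2 = 300;

		double percent = 100.0 * (cycles2 - cycles1) / (total2 - total1);

		/* prints 300.000000 -- same flavor as the 315% above */
		printf("percent: %f\n", percent);
		return 0;
	}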
>
>It could just as well be the second sample; see my previous email on
>why run ticks can be larger than the GT timestamp delta.
>
>For the sake of narrowing it down, you could capture the value of the
>CTX_TIMESTAMP MMIO register before killing the exec queue in destroy.
>It should be ticking while the context is active. Once the context
>stops, it would have an updated value. That way we know how long it
>took to stop.
In this test there is no kill involved... the failure happens in
utilization-all-full-load.
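
For the destroy-queue subtests where a kill is involved, the experiment
suggested above could look roughly like this sketch.
xe_exec_queue_kill() is the real xe entry point, but
read_ctx_timestamp_mmio() and wait_for_queue_stopped() are assumed
helper names for illustration, not the actual xe API.

	static void debug_kill_latency(struct xe_exec_queue *q)
	{
		u64 before, after;

		/* still ticking here: the context is active */
		before = read_ctx_timestamp_mmio(q);

		xe_exec_queue_kill(q);		/* teardown is asynchronous */
		wait_for_queue_stopped(q);	/* assumed synchronization point */

		/* stable now: the context has stopped */
		after = read_ctx_timestamp_mmio(q);

		pr_debug("CTX_TIMESTAMP before=%llu after=%llu delta=%llu\n",
			 before, after, after - before);
	}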
Lucas De Marchi