[PATCH i-g-t] tests/intel/xe_drm_fdinfo: Extend mercy to the upper end
Lucas De Marchi
lucas.demarchi at intel.com
Mon Aug 26 22:10:27 UTC 2024
On Sat, Aug 24, 2024 at 02:24:37PM GMT, Lucas De Marchi wrote:
>When we are processing the fdinfo of each client, the gpu time is read
>first, and then later all the exec queues are accumulated. It's thus
>possible that the total gpu time is smaller than the time reported in
>the exec queues. A preemption in the middle of second sample would
>exaggerate the problem:
>                              total_cycles     cycles
>  s1: read gpu time                |
>  s1: read exec queues times       |             *
>                                   |             *
>  ..                               |             *
>                                   |             *
>  s2: read gpu time                |             *
>      -> preempted                               *
>      ...                                        *
>  s2: read exec queues times                     *
tldr: it seems the problem is between reading the total_cycles and
cycles for the *1st* sample. See below.
Considering one case in which it failed:
(xe_drm_fdinfo:9093) DEBUG: rcs: sample 1: cycles 19487844, total_cycles 20099857555
(xe_drm_fdinfo:9093) DEBUG: rcs: sample 2: cycles 29302671, total_cycles 20109533601
(xe_drm_fdinfo:9093) DEBUG: rcs: percent: 101.000000
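For reference, the check boils down to comparing the two deltas, roughly
like below. This is a paraphrase of what check_results() ends up doing,
not the exact test code, and it ignores details like scaling for
parallel submission:

/*
 * Sketch: both values come from one fdinfo read per sample; cycles is
 * the run time accumulated over the client's exec queues for a class
 * (drm-cycles-<class>) and total_cycles is the GT timestamp
 * (drm-total-cycles-<class>), both in refclock ticks.
 */
static double busy_percent(const struct pceu_cycles *s1,
                           const struct pceu_cycles *s2)
{
        uint64_t delta_cycles = s2->cycles - s1->cycles;
        uint64_t delta_total = s2->total_cycles - s1->total_cycles;

        return delta_cycles * 100.0 / delta_total;
}

Plugging in the numbers above: 29302671 - 19487844 == 9814827 exec queue
ticks vs 20109533601 - 20099857555 == 9676046 total ticks, i.e. the exec
queue delta is ~1.4% larger than the window it's compared against, hence
the >100% result.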
Looking at the kernel code, it seems like I misremembered it.
Merging kernel and igt steps, the timeline is more or less like below:
    start and wait xe_spin for each class
    sample1: read exec queues times
        *
    sample1: read gpu time
    stop and wait xe_spin for each class
    sample2: read exec queues times
    sample2: read gpu time
So it seems the [*] is actually the critical point. Looking at a trace
from a run in which it failed:
xe_drm_fdinfo-9657 [000] 2056.317942759: bputs: xe_drm_client_fdinfo: XXX exec_queue time {
[a] xe_drm_fdinfo-9657 [000] 2056.317954843: bputs: xe_drm_client_fdinfo: } XXX exec_queue time
xe_drm_fdinfo-9657 [000] 2056.317955038: bputs: xe_drm_client_fdinfo: XXX gpu time {
[b] xe_drm_fdinfo-9657 [000] 2056.317968871: bputs: xe_drm_client_fdinfo: } XXX gpu time
xe_drm_fdinfo-9657 [001] 2056.821949987: bputs: xe_drm_client_fdinfo: XXX exec_queue time {
[c] xe_drm_fdinfo-9657 [001] 2056.821964764: bputs: xe_drm_client_fdinfo: } XXX exec_queue time
xe_drm_fdinfo-9657 [001] 2056.821964999: bputs: xe_drm_client_fdinfo: XXX gpu time {
[d] xe_drm_fdinfo-9657 [001] 2056.821982168: bputs: xe_drm_client_fdinfo: } XXX gpu time
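(Those markers are bputs events, i.e. constant-string trace_printk()
calls. Below is a minimal sketch of the instrumentation they imply,
assuming the calls were added around the two read paths in
xe_drm_client_fdinfo(); this is my reconstruction, not the actual kernel
code.)

/* sketch of the debug annotations implied by the bputs events above */
trace_printk("XXX exec_queue time {\n");
/* walk the client's exec queues and accumulate drm-cycles-<class> */
trace_printk("} XXX exec_queue time\n");

trace_printk("XXX gpu time {\n");
/* sample the GT timestamp reported as drm-total-cycles-<class> */
trace_printk("} XXX gpu time\n");

[b] - [a] is then the gap between latching the exec queue times and
latching the gpu time within a single fdinfo read.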
Considering the deltas from the CPU perspective...
[b] - [a] == 14028nsec
[d] - [c] == 17404nsec
Considering the cycles vs total_cycles deltas above, and a ref clock of
19.2MHz for this platform, the error is on the order of 2685nsec. Worth
noting that it would be worse if [b] - [a] were higher, since the
exec_queue keeps ticking while we still haven't collected the first
gpu_time. In the second sample, in theory the exec_queue clock is
already stopped, so any error there counts towards the other side,
making the result < 100%. That side is also much more likely, since it
depends on when we start reading the second sample.
With a little more debugging and making the gpu write the timestamp on
each loop (by adding write_timestamp=true to the spinner):
DEBUG: rcs: spinner ended timestamp=9817720
DEBUG: bcs: spinner ended timestamp=9755785
DEBUG: vecs: spinner ended timestamp=9653024
DEBUG: ccs: spinner ended timestamp=9722776
DEBUG: gpu delta: 0
DEBUG: rcs: sample 1: cycles 19513436, total_cycles 130596422980
DEBUG: rcs: sample 2: cycles 29331470, total_cycles 130606141449
DEBUG: rcs: percent: 101.000000
CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:488:
CRITICAL: Failed assertion: percent >= 95 && percent <= 100
29331470 - 19513436 == 9818034, which is reasonably close to what the
gpu wrote up to its last loop, 9817720. It wouldn't make any sense for
the total gpu time to be delta_total_cycles == 9718469, since that is
less than what the gpu wrote while running the spinner.
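For completeness, the extra debug is essentially this on the IGT side.
A sketch only: that the spinner exposes the GPU-written value as
spin->timestamp, and the local names used here, are my assumptions:

struct xe_spin_opts spin_opts = {
        .addr = spin_addr, /* placeholder for the test's spinner address */
        /* have the spinner store the context timestamp on every loop */
        .write_timestamp = true,
};

/* ... start spinner, take both fdinfo samples, stop spinner ... */

/* last value the GPU wrote; same units as drm-cycles, judging by how
 * closely it matches the fdinfo delta above */
igt_debug("%s: spinner ended timestamp=%u\n",
          engine_map[class], spin->timestamp);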
With that, I think it makes sense to consider the error on both ends.
We could invert the reads on the kernel side, reading the gpu time first
and the exec queue times later, but that would still be subject to
errors.

One improvement we can make is to start writing the timestamp from the
spinner and checking that the time reported for each exec queue via
fdinfo satisfies:

    time_fdinfo ~ timestamp +- error
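Something along these lines, purely to illustrate the shape of such a
check; the helper name and tolerance handling are made up:

/* hypothetical: compare the fdinfo delta for a class against the
 * timestamp the spinner wrote from the GPU, within a tolerance, instead
 * of trusting the CPU-side timing of the two fdinfo reads
 */
static void assert_close_to_gpu_timestamp(uint64_t delta_cycles,
                                          uint32_t gpu_ts, double tol)
{
        double ratio = (double)delta_cycles / gpu_ts;

        igt_assert_f(ratio > 1.0 - tol && ratio < 1.0 + tol,
                     "fdinfo delta %lu vs gpu timestamp %u\n",
                     (unsigned long)delta_cycles, gpu_ts);
}

With the spinner numbers above, 9818034 vs 9817720 is within 0.01%,
while the CPU-timed window can easily be off by much more than that.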
Lucas De Marchi
>
>In this situation, using the line numbers:
>
>   total_cycles == 6
>   cycles == 8
>
>In a more realistic situation, as reported in CI:
>
> (xe_drm_fdinfo:1072) DEBUG: rcs: sample 1: cycles 29223333, total_cycles 5801623069
> (xe_drm_fdinfo:1072) DEBUG: rcs: sample 2: cycles 38974256, total_cycles 5811276365
> (xe_drm_fdinfo:1072) DEBUG: rcs: percent: 101.000000
>
>Extend the same mercy to the upper end as we did to the lower end.
>
>Long term we may move some of these tests to just check cycles by using
>a timed spinner and stop relying on the CPU sleep time and timing on the
>fdinfo file read. See commit 01214d73c209 ("lib/xe_spin: fixed duration
>xe_spin capability").
>
>Also, this matches the tolerance applied on the i915 side in
>tests/intel/drm_fdinfo.c:__assert_within_epsilon().
>
>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>---
> tests/intel/xe_drm_fdinfo.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/tests/intel/xe_drm_fdinfo.c b/tests/intel/xe_drm_fdinfo.c
>index 4696c6495..e3a99a2dc 100644
>--- a/tests/intel/xe_drm_fdinfo.c
>+++ b/tests/intel/xe_drm_fdinfo.c
>@@ -484,7 +484,7 @@ check_results(struct pceu_cycles *s1, struct pceu_cycles *s2,
> igt_debug("%s: percent: %f\n", engine_map[class], percent);
>
> if (flags & TEST_BUSY)
>- igt_assert(percent >= 95 && percent <= 100);
>+ igt_assert(percent >= 95 && percent <= 105);
> else
> igt_assert(!percent);
> }
>--
>2.43.0
>