[PATCH i-g-t v2 05/10] tests/intel/xe_drm_fdinfo: Add single engine tests
Lucas De Marchi
lucas.demarchi at intel.com
Thu Jul 11 21:34:53 UTC 2024
On Thu, Jul 11, 2024 at 12:13:22PM GMT, Umesh Nerlige Ramappa wrote:
>On Thu, Jul 11, 2024 at 08:20:17AM -0500, Lucas De Marchi wrote:
>>On Tue, Jul 02, 2024 at 05:25:27PM GMT, Umesh Nerlige Ramappa wrote:
>>>Add simple tests that submit work to one engine and measure utilization
>>>per class.
>>>
>>>v2:
>>>- Drop measured_usleep since return value is not used
>>>- s/parallel engines/parallel submission/ in comment
>>>- Use NSEC_PER_SEC for batch_duration_ns
>>>- Percent should not be > 100
>>
>>but as we chatted later in the review, it's actually possible to have it
>>greater than 100, so I think we need to leave some room
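>>
>>e.g. (just a sketch, keeping the patch's 95 lower bound and the 105
>>upper bound you had in v1 - the exact margin is a judgment call):
>>
>>	igt_assert(percent >= 95 && percent <= 105);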
>>
>>>- Check utilization for both clients for isolation case
>>>
>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>---
>>>tests/intel/xe_drm_fdinfo.c | 97 +++++++++++++++++++++++++++++++++++++
>>>1 file changed, 97 insertions(+)
>>>
>>>diff --git a/tests/intel/xe_drm_fdinfo.c b/tests/intel/xe_drm_fdinfo.c
>>>index 037f25e53..410c885e7 100644
>>>--- a/tests/intel/xe_drm_fdinfo.c
>>>+++ b/tests/intel/xe_drm_fdinfo.c
>>>@@ -28,6 +28,15 @@
>>>* SUBTEST: basic-engine-utilization
>>>* Description: Check if basic fdinfo content is present for engine utilization
>>>*
>>>+ * SUBTEST: drm-idle
>>>+ * Description: Check that engines show no load when idle
>>>+ *
>>>+ * SUBTEST: drm-busy-idle
>>>+ * Description: Check that engines show load when idle after busy
>>>+ *
>>>+ * SUBTEST: drm-busy-idle-isolation
>>>+ * Description: Check that engine load does not spill over to other drm clients
>>>+ *
>>>* SUBTEST: drm-total-resident
>>>* Description: Create and compare total and resident memory consumption by client
>>>*
>>>@@ -42,11 +51,18 @@ IGT_TEST_DESCRIPTION("Read and verify drm client memory consumption and engine u
>>>
>>>#define BO_SIZE (65536)
>>>
>>>+/* flag masks */
>>>+#define TEST_BUSY (1 << 0)
>>>+#define TEST_TRAILING_IDLE (1 << 1)
>>>+#define TEST_ISOLATION (1 << 2)
>>>+
>>>struct pceu_cycles {
>>> uint64_t cycles;
>>> uint64_t total_cycles;
>>>};
>>>
>>>+const unsigned long batch_duration_ns = (1 * NSEC_PER_SEC) / 2;
>>>+
>>>static const char *engine_map[] = {
>>> "rcs",
>>> "bcs",
>>>@@ -464,8 +480,77 @@ spin_ctx_destroy(int fd, struct spin_ctx *ctx)
>>> free(ctx);
>>>}
>>>
>>>+static void
>>>+check_results(struct pceu_cycles *s1, struct pceu_cycles *s2,
>>>+ int class, int width, unsigned int flags)
>>>+{
>>>+ double percent;
>>>+
>>>+ igt_debug("%s: sample 1: cycles %lu, total_cycles %lu\n",
>>>+ engine_map[class], s1[class].cycles, s1[class].total_cycles);
>>>+ igt_debug("%s: sample 2: cycles %lu, total_cycles %lu\n",
>>>+ engine_map[class], s2[class].cycles, s2[class].total_cycles);
>>>+
>>>+ percent = ((s2[class].cycles - s1[class].cycles) * 100) /
>>>+ ((s2[class].total_cycles + 1) - s1[class].total_cycles);
>>>+
>>>+ /* for parallel submission scale the busyness with width */
>>>+ percent = percent / width;
>>>+
>>>+ igt_debug("%s: percent: %f\n", engine_map[class], percent);
>>>+
>>>+ if (flags & TEST_BUSY)
>>>+ igt_assert(percent >= 95 && percent <= 100);
>>
>><= 105 like you had before seems good, otherwise we may have too much
>>noise in CI. What are the typical numbers you're seeing?
>
>I see 101 whenever it fails. Also, I see it only when I run batches on
>all classes (all_busy_all/most_busy_all etc.). The difference between
>the numerator and denominator deltas is close to 5ms though. I'm still
>not convinced we should see more than 100, because we are ending the
>spinners (and waiting for fences and all) before taking sample 2.
>Still debugging this one a little more.
>
>>
>>>+ else
>>>+ igt_assert(!percent);
>>>+}
>>>+
>>>+static void
>>>+single(int fd, struct drm_xe_engine_class_instance *hwe, int width, int count,
>>>+ unsigned int flags)
>>>+{
>>>+ struct pceu_cycles pceu1[2][DRM_XE_ENGINE_CLASS_COMPUTE + 1];
>>>+ struct pceu_cycles pceu2[2][DRM_XE_ENGINE_CLASS_COMPUTE + 1];
>>
>>nit: either name it pceu_start/pceu_end:
>>
>>pceu_start[0] is the pceu for client0 on start
>>pceu_start[1] is the pceu for client1 on start
>>pceu_end[0] is the pceu for client0 on end
>>pceu_end[1] is the pceu for client1 on end
>>
>>or pceu_client0/pceu_client1 if grouping by client
>>
>>pceu_client0[0] is the pceu for client0 on start
>>pceu_client0[1] is the pceu for client0 on end
>>pceu_client1[0] is the pceu for client1 on start
>>pceu_client1[1] is the pceu for client1 on end
>>
>>... this makes it less prone to typos.
>>
>>But see below as I think 2 samples are not sufficient.
>>
>>>+ struct spin_ctx *ctx = NULL;
>>>+ uint32_t vm;
>>>+ int new_fd;
>>>+
>>>+ if (flags & TEST_ISOLATION)
>>>+ new_fd = drm_reopen_driver(fd);
>>>+
>>>+ vm = xe_vm_create(fd, 0, 0);
>>>+ if (flags & TEST_BUSY) {
>>>+ ctx = spin_ctx_init(fd, hwe, vm, width, count);
>>>+ spin_sync_start(fd, ctx);
>>>+ }
>>>+
>>>+ read_engine_cycles(fd, pceu1[0]);
>>>+ if (flags & TEST_ISOLATION)
>>>+ read_engine_cycles(new_fd, pceu1[1]);
>>>+
>>>+ usleep(batch_duration_ns / 1000);
>>
>>NSEC_PER_USEC
>>
>>>+ if (flags & TEST_TRAILING_IDLE)
>>>+ spin_sync_end(fd, ctx);
>>>+
>>>+ read_engine_cycles(fd, pceu2[0]);
>>
>>... and here we have another source of noise.... when doing
>>TEST_TRAILING_IDLE, should we read the first sample after the spin
>>ended, sleep, and read the second sample?
>
>That would always make utilization = 0% (numerator delta is 0). We
>should take sample 1 when the spinner has started and sample 2 when it
>has ended (for PCEU).
>
>TEST_TRAILING_IDLE is only valid when TEST_BUSY is also specified.
>Maybe I should enforce that in the test (see the sketch after the
>flows below). This is the flow for the common flag configs:
>
>1) flag = TEST_BUSY | TEST_TRAILING_IDLE
>- wait till spinner starts
>- take sample 1
>- usleep
>- end the batch
>- take sample 2.
>- check_results (expected 100% util)
I think the way the test is written, the expectation is wrong.
Reductio ad absurdum:

1) flag = TEST_BUSY | TEST_TRAILING_IDLE
- wait till spinner starts
- take sample 1
- usleep
- end the batch
  <--- long preempt here, emulated with usleep(500000)
- take sample 2
- check_results

So we'd still expect the same ballpark number of delta ticks, but
utilization overall would go down, as now you are ending the batch and
sleeping a little longer before taking the second sample. In the
example above I'd expect utilization to be 50%.
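
Rough numbers for that scenario (just arithmetic to make the 50%
concrete, assuming the patch's 500ms batch_duration_ns plus the extra
500ms sleep):

	uint64_t busy_ns  = 500000000;               /* time the spinner ran     */
	uint64_t total_ns = 500000000 + 500000000;   /* span between the samples */
	double percent = busy_ns * 100.0 / total_ns; /* 50.0 */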
>
>2) flag = TEST_BUSY
>- wait till spinner starts
>- take sample 1
>- usleep
>- take sample 2.
>- end the batch
>- check_results (expected 100% util)
>
>3) flag = 0 means truly IDLE since there is no spinner.
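>
>Something like this at the top of single() would enforce it (untested
>sketch):
>
>	/* TEST_TRAILING_IDLE only makes sense on top of TEST_BUSY */
>	if (flags & TEST_TRAILING_IDLE)
>		igt_assert(flags & TEST_BUSY);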
>
>>
>>but then how are we testing both at the same time here? I can't see
>>any igt_assert() for the trailing idle test, and it will actually
>>impact the busy test since it will wait for the spin to end... Am I
>>missing anything here?
>
>Case (2) above does not exist for PCEU tests because CTX_TIMESTAMP is
>only updated when the context switches out, which means we must end
>the spinner before taking the second sample (or preempt it, but ending
>is simpler). This series only implements tests for cases (1) and (3).
True, we are only reading the value from the context image, but does
the hardware update it only on a context switch? What I had understood
was that the value is updated while the context is running. Otherwise I
think it's utterly broken:
Considering:

  s* are the samples
  c* are the context timestamps
  g* are the gpu timestamps
  -  not running
  |  running

     g0  g1                   g2             g3 g4
     c0  c1                   c2             c3 c4
ctx0 ----||||||||||||||||||||||||||||||||||||-------
         ^                    ^                 ^
         s0                   s1                s3

For simplicity, assuming g0 == c0 == 0:

in s0: g1 == 3
       c1 == 0

in s1: g2 == 25
       c2 == 0
       -> utilization returns 0%

in s3: g4 == 40
       c4 == c3 == 35
       -> utilization returns 35 / (40 - 25) == 233%
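
In other words, plugging the numbers from the s1..s3 window above into
the percent formula check_results() uses (plain arithmetic, not real
counter reads):

	uint64_t c2 = 0, g2 = 25;   /* sample s1: context still running */
	uint64_t c4 = 35, g4 = 40;  /* sample s3: after the switch-out  */
	double percent = (double)(c4 - c2) * 100 / (g4 - g2); /* 233.3 */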
how can we get this right if the value is not updated for running
contexts?
Lucas De Marchi
>
>Case (2) is applicable to a completely different feature which is
>still WIP: per-engine utilization via the kernel perf framework.
>
>>I think we need to add a third sample when testing trailing idle
>>and ensure:
>>
>>considering the indexing as sample[client][idx]
>>
>> sample[0][1] - sample[0][0] == 100% if BUSY else 0%
>> sample[0][2] - sample[0][1] == 0% -> idle on trailing test
>>
>>and for isolation:
>>
>> sample[1][1] - sample[1][0] == 0%
>> sample[1][2] - sample[1][1] == 0%
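>>
>>roughly this flow for client 0 (untested sketch, reusing the helpers
>>from this patch, with sample[client][idx] indexed as above and each
>>entry being a per-class array like pceu1/pceu2):
>>
>>	read_engine_cycles(fd, sample[0][0]);	/* s0: spinner running */
>>	usleep(batch_duration_ns / NSEC_PER_USEC);
>>	spin_sync_end(fd, ctx);
>>	read_engine_cycles(fd, sample[0][1]);	/* s1: end of busy window */
>>	usleep(batch_duration_ns / NSEC_PER_USEC);
>>	read_engine_cycles(fd, sample[0][2]);	/* s2: trailing idle window */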
>>
>>>+ if (flags & TEST_ISOLATION)
>>>+ read_engine_cycles(new_fd, pceu2[1]);
>>>+
>>>+ check_results(pceu1[0], pceu2[0], hwe->engine_class, width, flags);
>
>This check_results will check for TEST_BUSY case.
>
>>>+
>>>+ if (flags & TEST_ISOLATION) {
>>>+ check_results(pceu1[1], pceu2[1], hwe->engine_class, width, 0);
>
>This check_results will check for idle case (flags = 0).
>
>>>+ close(new_fd);
>>>+ }
>>>+
>>>+ spin_sync_end(fd, ctx);
>>>+ spin_ctx_destroy(fd, ctx);
>>
>>if !BUSY you are not even calling spin_ctx_init()
>
>:) yeah, sorry about the readability here. spin_sync_end and
>spin_ctx_destroy bail out if the ctx is NULL, which is the case for
>!BUSY. I can add an explicit check to not call these instead.
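>
>i.e. both helpers already start with (paraphrasing the existing code,
>not a new change):
>
>	if (!ctx)
>		return;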
>
>Thanks,
>Umesh
>>
>>Lucas De Marchi
>>
>>>+ xe_vm_destroy(fd, vm);
>>>+}
>>>+
>>>igt_main
>>>{
>>>+ struct drm_xe_engine_class_instance *hwe;
>>> int xe;
>>>
>>> igt_fixture {
>>>@@ -484,6 +569,18 @@ igt_main
>>> igt_subtest("basic-engine-utilization")
>>> basic_engine_utilization(xe);
>>>
>>>+ igt_subtest("drm-idle")
>>>+ xe_for_each_engine(xe, hwe)
>>>+ single(xe, hwe, 1, 1, 0);
>>>+
>>>+ igt_subtest("drm-busy-idle")
>>>+ xe_for_each_engine(xe, hwe)
>>>+ single(xe, hwe, 1, 1, TEST_BUSY | TEST_TRAILING_IDLE);
>>>+
>>>+ igt_subtest("drm-busy-idle-isolation")
>>>+ xe_for_each_engine(xe, hwe)
>>>+ single(xe, hwe, 1, 1, TEST_BUSY | TEST_TRAILING_IDLE | TEST_ISOLATION);
>>>+
>>> igt_describe("Create and compare total and resident memory consumption by client");
>>> igt_subtest("drm-total-resident")
>>> test_total_resident(xe);
>>>--
>>>2.38.1
>>>