[PATCH i-g-t v2 05/10] tests/intel/xe_drm_fdinfo: Add single engine tests

Umesh Nerlige Ramappa umesh.nerlige.ramappa at intel.com
Thu Jul 11 22:45:52 UTC 2024


On Thu, Jul 11, 2024 at 04:34:53PM -0500, Lucas De Marchi wrote:
>On Thu, Jul 11, 2024 at 12:13:22PM GMT, Umesh Nerlige Ramappa wrote:
>>On Thu, Jul 11, 2024 at 08:20:17AM -0500, Lucas De Marchi wrote:
>>>On Tue, Jul 02, 2024 at 05:25:27PM GMT, Umesh Nerlige Ramappa wrote:
>>>>Add simple tests that submit work to one engine and measure utilization
>>>>per class.
>>>>
>>>>v2:
>>>>- Drop measured_usleep since return value is not used
>>>>- s/parallel engines/parallel submission/ in comment
>>>>- Use NSEC_PER_SEC for batch_duration_ns
>>>>- Percent should not be > 100
>>>
>>>but as we chatted later in the review, it's actually possible to have it
>>>greater than 100, so I think we need to leave some room
>>>
>>>>- Check utilization for both clients for isolation case
>>>>
>>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>>---
>>>>tests/intel/xe_drm_fdinfo.c | 97 +++++++++++++++++++++++++++++++++++++
>>>>1 file changed, 97 insertions(+)
>>>>
>>>>diff --git a/tests/intel/xe_drm_fdinfo.c b/tests/intel/xe_drm_fdinfo.c
>>>>index 037f25e53..410c885e7 100644
>>>>--- a/tests/intel/xe_drm_fdinfo.c
>>>>+++ b/tests/intel/xe_drm_fdinfo.c
>>>>@@ -28,6 +28,15 @@
>>>>* SUBTEST: basic-engine-utilization
>>>>* Description: Check if basic fdinfo content is present for engine utilization
>>>>*
>>>>+ * SUBTEST: drm-idle
>>>>+ * Description: Check that engines show no load when idle
>>>>+ *
>>>>+ * SUBTEST: drm-busy-idle
>>>>+ * Description: Check that engines show load when idle after busy
>>>>+ *
>>>>+ * SUBTEST: drm-busy-idle-isolation
>>>>+ * Description: Check that engine load does not spill over to other drm clients
>>>>+ *
>>>>* SUBTEST: drm-total-resident
>>>>* Description: Create and compare total and resident memory consumption by client
>>>>*
>>>>@@ -42,11 +51,18 @@ IGT_TEST_DESCRIPTION("Read and verify drm client memory consumption and engine u
>>>>
>>>>#define BO_SIZE (65536)
>>>>
>>>>+/* flag masks */
>>>>+#define TEST_BUSY		(1 << 0)
>>>>+#define TEST_TRAILING_IDLE	(1 << 1)
>>>>+#define TEST_ISOLATION		(1 << 2)
>>>>+
>>>>struct pceu_cycles {
>>>>	uint64_t cycles;
>>>>	uint64_t total_cycles;
>>>>};
>>>>
>>>>+const unsigned long batch_duration_ns = (1 * NSEC_PER_SEC) / 2;
>>>>+
>>>>static const char *engine_map[] = {
>>>>	"rcs",
>>>>	"bcs",
>>>>@@ -464,8 +480,77 @@ spin_ctx_destroy(int fd, struct spin_ctx *ctx)
>>>>	free(ctx);
>>>>}
>>>>
>>>>+static void
>>>>+check_results(struct pceu_cycles *s1, struct pceu_cycles *s2,
>>>>+	      int class, int width, unsigned int flags)
>>>>+{
>>>>+	double percent;
>>>>+
>>>>+	igt_debug("%s: sample 1: cycles %lu, total_cycles %lu\n",
>>>>+		  engine_map[class], s1[class].cycles, s1[class].total_cycles);
>>>>+	igt_debug("%s: sample 2: cycles %lu, total_cycles %lu\n",
>>>>+		  engine_map[class], s2[class].cycles, s2[class].total_cycles);
>>>>+
>>>>+	percent = ((s2[class].cycles - s1[class].cycles) * 100) /
>>>>+		  ((s2[class].total_cycles + 1) - s1[class].total_cycles);
>>>>+
>>>>+	/* for parallel submission scale the busyness with width */
>>>>+	percent = percent / width;
>>>>+
>>>>+	igt_debug("%s: percent: %f\n", engine_map[class], percent);
>>>>+
>>>>+	if (flags & TEST_BUSY)
>>>>+		igt_assert(percent >= 95 && percent <= 100);
>>>
>>><= 105 like you had before seems good, otherwise we may have too much
>>>noise in CI.  What are the typical numbers you're seeing?
>>
>>I see 101 whenever it fails, and only when I run batches on all classes
>>(all_busy_all/most_busy_all etc.). The difference between the numerator
>>and denominator deltas is close to 5ms though. I'm still not convinced we
>>should even see 100, because we end the spinners (and wait for fences and
>>all) before taking sample 2. Still debugging this one a little more.
>>
>>>
>>>>+	else
>>>>+		igt_assert(!percent);
>>>>+}
>>>>+
>>>>+static void
>>>>+single(int fd, struct drm_xe_engine_class_instance *hwe, int width, int count,
>>>>+       unsigned int flags)
>>>>+{
>>>>+	struct pceu_cycles pceu1[2][DRM_XE_ENGINE_CLASS_COMPUTE + 1];
>>>>+	struct pceu_cycles pceu2[2][DRM_XE_ENGINE_CLASS_COMPUTE + 1];
>>>
>>>nit: either name it pceu_start/pceu_end:
>>>
>>>pceu_start[0] is the pceu for client0 on start
>>>pceu_start[1] is the pceu for client1 on start
>>>pceu_end[0] is the pceu for client0 on end
>>>pceu_end[1] is the pceu for client1 on end
>>>
>>>or pceu_client0/pceu_client1 if grouping by client
>>>
>>>pceu_client0[0] is the pceu for client0 on start
>>>pceu_client0[1] is the pceu for client0 on end
>>>pceu_client1[1] is the pceu for client1 on start
>>>pceu_client1[2] is the pceu for client1 on end
>>>
>>>... this makes it less prone to typos.
>>>
>>>But see below as I think 2 samples are not sufficient.
>>>
>>>>+	struct spin_ctx *ctx = NULL;
>>>>+	uint32_t vm;
>>>>+	int new_fd;
>>>>+
>>>>+	if (flags & TEST_ISOLATION)
>>>>+		new_fd = drm_reopen_driver(fd);
>>>>+
>>>>+	vm = xe_vm_create(fd, 0, 0);
>>>>+	if (flags & TEST_BUSY) {
>>>>+		ctx = spin_ctx_init(fd, hwe, vm, width, count);
>>>>+		spin_sync_start(fd, ctx);
>>>>+	}
>>>>+
>>>>+	read_engine_cycles(fd, pceu1[0]);
>>>>+	if (flags & TEST_ISOLATION)
>>>>+		read_engine_cycles(new_fd, pceu1[1]);
>>>>+
>>>>+	usleep(batch_duration_ns / 1000);
>>>
>>>NSEC_PER_USEC
>>>
>>>>+	if (flags & TEST_TRAILING_IDLE)
>>>>+		spin_sync_end(fd, ctx);
>>>>+
>>>>+	read_engine_cycles(fd, pceu2[0]);
>>>
>>>... and here we have another source of noise.... when doing
>>>TEST_TRAILING_IDLE, should we read the first sample after the spin
>>>ended, sleep, and read the second sample?
>>
>>That would always make utilization = 0% (numerator delta is 0). We 
>>should take sample 1 when the spinner has started and sample 2 when 
>>it has ended (for PCEU).
>>
>>TEST_TRAILING_IDLE is only valid when TEST_BUSY is also specified.  
>>Maybe I should enforce that in the test. This is the flow for common 
>>flag configs:
>>
>>1) flag = TEST_BUSY | TEST_TRAILING_IDLE
>>- wait till spinner starts
>>- take sample 1
>>- usleep
>>- end the batch
>>- take sample 2.
>>- check_results (expected 100% util)
>
>I think the way the test is written, the expectation is wrong.
>By absurd:
>
>1) flag = TEST_BUSY | TEST_TRAILING_IDLE
>- wait till spinner starts
>- take sample 1
>- usleep
>- end the batch
>		<--- long preempt here emulated with
>		     usleep(500000)
>- take sample 2.
>- check_results
>
>So we'd still expect the same ballpark number of delta ticks, but
>utilization overall would go down as now you are ending the batch
>and sleeping a little more before taking another sample. In the example
>above I'd expect utilization to be 50%

Indeed. If preempted after ending the batch, the numbers will be off, but 
only for this test. In other words, the test will not be able to verify 
this case accurately. In reality, the user will see the correct 
utilization of 50%.
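
To put rough numbers on it (assuming the 500ms batch_duration_ns from the 
patch and the ~500ms delay you emulate between ending the batch and sample 
2): the cycles delta would still be ~500ms worth of ticks, but the 
total_cycles delta would be ~1s worth, so check_results would compute 
roughly 500/1000 = ~50%, even though the spinner itself ran flat out while 
it was active.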

>
>>
>>2) flag = TEST_BUSY
>>- wait till spinner starts
>>- take sample 1
>>- usleep
>>- take sample 2.
>>- end the batch
>>- check_results (expected 100% util)
>>
>>3) flag = 0 means truly IDLE since there is no spinner.
>>
>>>
>>>but then how are we testing all at the same time here? I can't see any
>>>igt_assert() for the trailing idle test and it will actually impact the
>>>busy test since it will wait the spin to end.... Am I missing anything
>>>here?
>>
>>Case (2) above does not exist for PCEU tests because CTX_TIMESTAMP is
>>only updated when the context switches out, which means we must end the
>>spinner before taking the second sample (or preempt it, but ending is
>>simpler). This series only implements tests for cases (1) and (3).
>
>true, we are only reading the value from the context image, but does the
>hardware update it only on context switch? What I had understood was
>that the value was updated while the context was running. Otherwise I
>think it's utterly broken:

CTX_TIMESTAMP in the context image is only updated on a context switch.  
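
That is why the test (roughly, simplified from single() in the patch, and 
using the NSEC_PER_USEC conversion you suggested) has to end the spinner 
before taking the second sample:

	spin_sync_start(fd, ctx);      /* wait until the spinner is running */
	read_engine_cycles(fd, pceu1); /* sample 1 */
	usleep(batch_duration_ns / NSEC_PER_USEC);
	spin_sync_end(fd, ctx);        /* switch out -> CTX_TIMESTAMP written back */
	read_engine_cycles(fd, pceu2); /* sample 2 now reflects the busy period */

If the second sample were taken while the spinner is still running, the 
cycles delta would stay 0 even though the engine is busy.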

>
>Considering
>	s* are the samples
>	c* are the context timestamps
>	g* are the gpu timestamps
>
>	- not running
>	| running
>
>
>         g0 g1                    g2        g3   g4
>         c0 c1                    c2        c3   c4
>ctx0 ----||||||||||||||||||||||||||||||||||||-------
>            ^                     ^              ^
>            s0                    s1             s3
>
>For simplicity, assuming g0 == c0 == 0
>
>in s0: g1 == 3
>       c1 == 0
>
>in s1: g2 == 25
>       c2 == 0
>
>-> utilization returns 0%
>
>in s3: g4 == 40
>       c4 == c3 == 35
>
>
>-> utilization returns 35 / (40 - 25) == 233%

Well... I think that's the issue behind the > 100% utilization that I see 
in the tests! If sample 1 is delayed, the calculations are off.

>
>how can we get this right if the value is not updated for running
>contexts?

I don't think we can. This is an inherent limitation of the HW counter 
that we are exposing. It is updated only when context switches happen 
and can be reliable only for frequently switching contexts. Tools would 
have to use some sort of averaging algorithm over a period of time to 
make sense of the data.

I think this sort of sampling affects any counter we export that is 
updated infrequently: you will run into intermediate values that are 
larger than 100%, and the results may need to be averaged out.
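
For illustration only (not something I am proposing for the test, and the 
names below are made up), a tool could keep a moving average over periodic 
fdinfo samples of the per-class cycles/total_cycles pair:

	struct pceu_sample {
		uint64_t cycles;	/* per-class cycles from fdinfo */
		uint64_t total_cycles;	/* per-class total_cycles from fdinfo */
	};

	static double util_update(double avg, const struct pceu_sample *prev,
				  const struct pceu_sample *cur)
	{
		uint64_t d_cycles = cur->cycles - prev->cycles;
		uint64_t d_total = cur->total_cycles - prev->total_cycles;
		double sample;

		if (!d_total)
			return avg;

		sample = (double)d_cycles * 100.0 / d_total;

		/*
		 * A single sample can exceed 100% when a long-running context
		 * switches out and its accumulated runtime lands in one
		 * sampling period; the moving average smooths that out.
		 */
		return 0.75 * avg + 0.25 * sample;
	}

With a short enough sampling period the occasional >100% reading averages 
back down quickly.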

That said, for this specific case we should try to support getting c2 as 
well, since that would be required for long-running contexts anyway. GuC 
had some level of support for this, but with some limitations. I would 
have to dig that up and check with the GuC team whether we can use/extend 
that.

I would probably not merge the IGTs, but I would appreciate it if they 
could still be reviewed. I would still use the same tests once the above 
support is added.

Thanks,
Umesh

>
>Lucas De Marchi
>
>>
>>Case (2) applies to a completely different feature which is still WIP:
>>per-engine utilization via the kernel perf framework.
>>
>>>I think we need to add a third sample when testing trailing idle
>>>and ensure:
>>>
>>>considering the indexing as sample[client][idx]
>>>
>>>	sample[0][1] - sample[0][0] == 100% if BUSY else 0%
>>>	sample[0][2] - sample[0][1] == 0%  -> idle on trailing test
>>>
>>>and for isolation:
>>>
>>>	sample[1][1] - sample[0][0] == 0%
>>>	sample[1][2] - sample[1][1] == 0%
>>>
>>>>+	if (flags & TEST_ISOLATION)
>>>>+		read_engine_cycles(new_fd, pceu2[1]);
>>>>+
>>>>+	check_results(pceu1[0], pceu2[0], hwe->engine_class, width, flags);
>>
>>This check_results will check for TEST_BUSY case.
>>
>>>>+
>>>>+	if (flags & TEST_ISOLATION) {
>>>>+		check_results(pceu1[1], pceu2[1], hwe->engine_class, width, 0);
>>
>>This check_results will check for idle case (flags = 0).
>>
>>>>+		close(new_fd);
>>>>+	}
>>>>+
>>>>+	spin_sync_end(fd, ctx);
>>>>+	spin_ctx_destroy(fd, ctx);
>>>
>>>if !BUSY you are not even calling spin_ctx_init()
>>
>>:) yeah, sorry about the readability here. spin_sync_end and
>>spin_ctx_destroy bail out if the ctx is NULL, which is the case for
>>!BUSY. I can add an explicit check to avoid calling them instead.
>>
>>Thanks,
>>Umesh
>>>
>>>Lucas De Marchi
>>>
>>>>+	xe_vm_destroy(fd, vm);
>>>>+}
>>>>+
>>>>igt_main
>>>>{
>>>>+	struct drm_xe_engine_class_instance *hwe;
>>>>	int xe;
>>>>
>>>>	igt_fixture {
>>>>@@ -484,6 +569,18 @@ igt_main
>>>>	igt_subtest("basic-engine-utilization")
>>>>		basic_engine_utilization(xe);
>>>>
>>>>+	igt_subtest("drm-idle")
>>>>+		xe_for_each_engine(xe, hwe)
>>>>+			single(xe, hwe, 1, 1, 0);
>>>>+
>>>>+	igt_subtest("drm-busy-idle")
>>>>+		xe_for_each_engine(xe, hwe)
>>>>+			single(xe, hwe, 1, 1, TEST_BUSY | TEST_TRAILING_IDLE);
>>>>+
>>>>+	igt_subtest("drm-busy-idle-isolation")
>>>>+		xe_for_each_engine(xe, hwe)
>>>>+			single(xe, hwe, 1, 1, TEST_BUSY | TEST_TRAILING_IDLE | TEST_ISOLATION);
>>>>+
>>>>	igt_describe("Create and compare total and resident memory consumption by client");
>>>>	igt_subtest("drm-total-resident")
>>>>		test_total_resident(xe);
>>>>-- 
>>>>2.38.1
>>>>

