[igt-dev] [PATCH i-g-t] test/perf: Add test to verify OA TLB invalidation

Tue Mar 10 17:57:41 UTC 2020

On Tue, Mar 10, 2020 at 11:02:00AM +0200, Lionel Landwerlin wrote:
>On 10/03/2020 05:07, Umesh Nerlige Ramappa wrote:
>>Run 2 polling tests back to back and compare the number of OA reports
>>captured. Make sure the number of reports are almost same.
>>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>---
>>  tests/perf.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 97 insertions(+)
>>
>>diff --git a/tests/perf.c b/tests/perf.c
>>index 5e818030..2394adc4 100644
>>--- a/tests/perf.c
>>+++ b/tests/perf.c
>>@@ -2265,6 +2265,97 @@ test_polling(void)
>>  	__perf_close(stream_fd);
>>  }
>>+static int
>>+num_valid_reports_captured(struct drm_i915_perf_open_param *param)
>>+{
>>+	uint8_t buf[1024 * 1024];
>>+	int64_t tick_ns = 1000000000 / sysconf(_SC_CLK_TCK);
>>+	int64_t test_duration_ns = tick_ns * 5 * 100; /* 5 seconds */
>>+	int64_t start, end;
>>+	int num_reports = 0;
>>+
>>+	stream_fd = __perf_open(drm_fd, param, true);
>>+
>>+	igt_debug("tick length = %dns, test duration = %"PRIu64"ns\n",
>>+		  (int)tick_ns, test_duration_ns);
>>+
>>+	start = get_time();
>>+	do_ioctl(stream_fd, I915_PERF_IOCTL_ENABLE, 0);
>>+	for (/* nop */; ((end = get_time()) - start) < test_duration_ns; /* nop */) {
>>+		struct pollfd pollfd = { .fd = stream_fd, .events = POLLIN };
>>+		struct drm_i915_perf_record_header *header;
>>+		int ret;
>>+
>>+		while ((ret = poll(&pollfd, 1, -1)) < 0 &&
>>+		       errno == EINTR)
>>+			;
>>+		igt_assert_eq(ret, 1);
>>+		igt_assert(pollfd.revents & POLLIN);
>
>
>I guess you can drop the poll() if you drop the 
>I915_PERF_FLAG_FD_NONBLOCK below.
>
>That way your reads are blocking.
>
>
>Or if you want more accuracy, you can compute the amount of poll() 
>timeout based of test_duration_ns and exit the loop earlier.
>

In the failure case, I actually see zero valid reports in 5 seconds, so 
the blocking read would block endlessly. Poll will instead return 
indicating that reports are available (even if reports are invalid).

I think adding a poll timeout based on the test duration is a good idea 
in case the poll behavior changes in future.

>
>>+
>>+		while ((ret = read(stream_fd, buf, sizeof(buf))) < 0 &&
>>+		       errno == EINTR)
>>+			;
>>+
>>+		/* poll checks if the tail has advanced on the OA buffer, but
>>+		 * does not check if the reports are valid. On read, the driver
>>+		 * checks if the reports are valid or not. if none of the
>>+		 * reports are valid, it returns EAGAIN. EAGAIN should also
>>+		 * suffice to show that the TLB invalidation failed, but we will
>>+		 * try for a more concrete check. Ignore read errors here.
>>+		 */
>>+		if (ret < 0)
>>+			continue;
>>+
>>+		for (int offset = 0; offset < ret; offset += header->size) {
>>+			header = (void *)(buf + offset);
>>+
>>+			if (header->type == DRM_I915_PERF_RECORD_SAMPLE) {
>>+				uint32_t *report = (void *)(header + 1);
>>+
>>+				if ((report[0] >> OAREPORT_REASON_SHIFT) &
>>+				    OAREPORT_REASON_TIMER)
>>+					num_reports++;
>>+			}
>>+		}
>>+	}
>>+	__perf_close(stream_fd);
>>+
>>+	return num_reports;
>>+}
>>+
>>+static void
>>+gen12_test_oa_tlb_invalidate(void)
>>+{
>>+	int oa_exponent = max_oa_exponent_for_period_lte(30000000);
>>+	uint64_t properties[] = {
>>+		DRM_I915_PERF_PROP_SAMPLE_OA, true,
>>+
>>+		DRM_I915_PERF_PROP_OA_METRICS_SET, test_set->perf_oa_metrics_set,
>>+		DRM_I915_PERF_PROP_OA_FORMAT, test_set->perf_oa_format,
>>+		DRM_I915_PERF_PROP_OA_EXPONENT, oa_exponent,
>>+	};
>>+	struct drm_i915_perf_open_param param = {
>>+		.flags = I915_PERF_FLAG_FD_CLOEXEC |
>>+			I915_PERF_FLAG_DISABLED |
>>+			I915_PERF_FLAG_FD_NONBLOCK,
>>+		.num_properties = sizeof(properties) / 16,
>>+		.properties_ptr = to_user_pointer(properties),
>>+	};
>>+	int num_reports1, num_reports2;
>>+	
>>+	/* Capture reports for 5 seconds twice and then make sure you get around
>>+	 * the same number of reports. In the case of failure, the number of
>>+	 * reports will vary largely since the beginning of the OA buffer
>>+	 * will have invalid entries.
>>+	 */
>
>
>I thought you were also noticing corruption in the data.

I don't see corruption of OA reports. Just missing reports in the 
beginning of OA buffer. All reports are zeroed out.

When trying the interrupt patches, I saw a random reboot with 
blocking-with-interrupt (or any of the other subtests that ran multiple 
cases). My theory was that OA is writing reports to oa buffer memory 
from previous iteration. That failure/reboot does not occur anymore with 
this patch (invalidating tlb).

Thanks,
Umesh

>
>Seems like it would be an easier characterization of the failure than 
>the number of reports (since the number can vary for a other reasons).
>
>What do you think?
>
>
>-Lionel
>
>
>>+	num_reports1 = num_valid_reports_captured(&param);
>>+	num_reports2 = num_valid_reports_captured(&param);
>>+
>>+	igt_debug("num_reports1 = %d, num_reports2 = %d\n", num_reports1, num_reports2);
>>+	igt_assert(num_reports2 > 0.95 * num_reports1);
>>+}
>>+
>>+
>>  static void
>>  test_buffer_fill(void)
>>  {
>>@@ -4622,6 +4713,12 @@ igt_main
>>  		gen8_test_single_ctx_render_target_writes_a_counter();
>>  	}
>>+	igt_describe("Test OA TLB invalidate");
>>+	igt_subtest("gen12-oa-tlb-invalidate") {
>>+		igt_require(intel_gen(devid) >= 12);
>>+		gen12_test_oa_tlb_invalidate();
>>+	}
>>+
>>  	igt_describe("Measure performance for a specific context using OAR in Gen 12");
>>  	igt_subtest("gen12-unprivileged-single-ctx-counters") {
>>  		igt_require(intel_gen(devid) >= 12);
>
>