[igt-dev] [PATCH i-g-t] i915/hangman: Fix gt hang/error tests
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Fri Jul 29 18:35:03 UTC 2022
On Thu, Jul 28, 2022 at 08:14:44AM -0700, Umesh Nerlige Ramappa wrote:
>On Thu, Jul 28, 2022 at 04:59:53PM +0200, Kamil Konieczny wrote:
>>Hi Umesh,
>>
>>On 2022-07-28 at 01:15:59 +0000, Umesh Nerlige Ramappa wrote:
>>>gt-engine-hang and gt-engine-error tests were still using reset=2
>>>setting so that ended up just doing an engine reset. Fix the tests so
>>>that they actually do the gt specific reset and set the expectations
>>>correctly.
>>>
>>>In some rare failures, some of the background spinners show up as
>>>innocent and keep spinning after the hang recovery. This can only happen
>>>if the spinners did not start for some reason. Check that the spinners
>>>actually started before submitting a hanging batch.
>>>
>>>For engine specific hang, only one context is marked guilty, but for a
>>>gt hang all contexts are marked guilty. Check for different expected
>>>behavior for engine vs. gt reset.
>>>
>>>v2:
>>>- gt-reset resets all contexts on all engines. The execlist implementation
>>> of gt-engine-* tests expected that a preemptible background spinner
>>> running on the target engine should be marked innocent. While i915 can
>>> mark such a context for execlist mode, GuC scheduling does not guarantee
>>> that the background spinner can be marked as innocent. Since the state
>>> of the background spinner depends on the scheduling backend, do no
>>> validate the state of the background spinner for the target engine.
>>>
>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>---
>>> tests/i915/i915_hangman.c | 84 ++++++++++++++++++++++++---------------
>>> 1 file changed, 53 insertions(+), 31 deletions(-)
>>>
>>>diff --git a/tests/i915/i915_hangman.c b/tests/i915/i915_hangman.c
>>>index c7d69fdd..d7b173ab 100644
>>>--- a/tests/i915/i915_hangman.c
>>>+++ b/tests/i915/i915_hangman.c
>>>@@ -295,9 +295,15 @@ static void context_unban(int fd, unsigned ctx)
>>> gem_context_set_param(fd, ¶m);
>>> }
>>>
>>>+enum reset_type {
>>>+ GT_RESET = 1,
>>>+ ENGINE_RESET = 2,
>>>+};
>>>+
>>> static void
>>> test_engine_hang(const intel_ctx_t *ctx,
>>>- const struct intel_execution_engine2 *e, unsigned int flags)
>>>+ const struct intel_execution_engine2 *e, unsigned int flags,
>>>+ enum reset_type reset)
>>> {
>>> const struct intel_execution_engine2 *other;
>>> const intel_ctx_t *local_ctx[GEM_MAX_ENGINES];
>>>@@ -309,6 +315,7 @@ test_engine_hang(const intel_ctx_t *ctx,
>>> igt_skip_on(flags & IGT_SPIN_INVALID_CS &&
>>> gem_engine_has_cmdparser(device, &ctx->cfg, e->flags));
>>>
>>>+ igt_assert(reset == GT_RESET || reset == ENGINE_RESET);
>>> /*
>>> * Fill all engines with background load.
>>> * This verifies that independent engines are unaffected and gives
>>>@@ -324,7 +331,10 @@ test_engine_hang(const intel_ctx_t *ctx,
>>> .ahnd = ahndN,
>>> .ctx = local_ctx[num_ctx],
>>> .engine = other->flags,
>>>- .flags = IGT_SPIN_FENCE_OUT);
>>>+ .flags = IGT_SPIN_FENCE_OUT |
>>>+ IGT_SPIN_POLL_RUN);
>>>+ igt_spin_busywait_until_started(spin);
>>>+
>>> num_ctx++;
>>>
>>> igt_list_move(&spin->link, &list);
>>>@@ -344,14 +354,43 @@ test_engine_hang(const intel_ctx_t *ctx,
>>> igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);
>>> igt_spin_free(device, spin);
>>>
>>>- /* But no other engines/clients should be affected */
>>>- igt_list_for_each_entry_safe(spin, next, &list, link) {
>>>+ /*
>>>+ * engine-engine-hang: Other engines/clients should not be affected for
>>>+ * engine reset, so innocent contexts complete successfully once the
>>>+ * spinner is ended.
>>>+ *
>>>+ * gt-engine-hang: All engines/clients are guilty and complete with a
>>>+ * -EIO fence status, however the background task that was submitted to
>>>+ * the target engine is innocent and is expected to complete
>>>+ * successfully.
>>>+ */
>>
>>imho we should keep test for execlist behaviour. It is not clear
>>why background task survives ? You stated "All ... are guilty" ?
>>Isn't this a race between completing and resetting GT ?
>
>With execlist mode, kmd has control over preempting the background
>task and saving it's state. With GuC KMD has no such control to save
>the state of select spinners.
>
>All execlist behavior in this test is retained, except for one - the
>check for the state of the background spinner on the target engine.
>The state assumed by execlist backend cannot be guaranteed by GuC
>backend.
>
>>
>>>+ igt_list_for_each_entry_safe_reverse(spin, next, &list, link) {
>>>+ bool innocent = reset == ENGINE_RESET;
>>>+ int expect = innocent ? 1 : -EIO;
>>>+
>>> ahndN = spin->opts.ahnd;
>>>- igt_assert(sync_fence_wait(spin->out_fence, 0) == -ETIME);
>>>- igt_spin_end(spin);
>>>+ if (innocent) {
>>>+ igt_assert_eq(sync_fence_wait(spin->out_fence, 0), -ETIME);
>>>+ igt_spin_end(spin);
>>>+ }
>>>
>>>- igt_assert(sync_fence_wait(spin->out_fence, 500) == 0);
>>>- igt_assert_eq(sync_fence_status(spin->out_fence), 1);
>>>+ if (spin->opts.engine == e->flags) {
>
>I think the above condition should also check for GT_RESET. I need to
>add that.
>
>>>+ /*
>>>+ * gt-reset resets all contexts on all engines. The execlist
>>>+ * implementation of gt-engine-* tests expected that a
>>>+ * preemptible background spinner running on the target engine
>>>+ * should be marked innocent. While i915 can mark such a
>>>+ * context for execlist mode, GuC scheduling does not guarantee
>>>+ * that the background spinner can be marked as innocent. Since
>>>+ * the state of the background spinner depends on the scheduling
>>>+ * backend, do no validate the state of the background spinner
>>>+ * for the target engine.
>>>+ */
>
>In relation to my earlier comment, this is the only place where the
>asserts are omitted. This is because of the caveat mentioned in the
>comment above.
>
>>>+ igt_spin_end(spin);
>>>+ } else {
>>>+ igt_assert_eq(sync_fence_wait(spin->out_fence, 500), 0);
>>>+ igt_assert_eq(sync_fence_status(spin->out_fence), expect);
>>>+ }
>>> igt_spin_free(device, spin);
>>> put_ahnd(ahndN);
>>> }
>>>@@ -439,6 +478,10 @@ static void do_tests(const char *name, const char *prefix,
>>> {
>>> const struct intel_execution_engine2 *e;
>>> char buff[256];
>>>+ enum reset_type reset = ENGINE_RESET;
>>>+
>>>+ if (!strncmp(prefix, "gt", 2))
>>>+ reset = GT_RESET;
>>>
>>> snprintf(buff, sizeof(buff), "Per engine error capture (%s reset)", name);
>>> igt_describe(buff);
>>>@@ -454,20 +497,9 @@ static void do_tests(const char *name, const char *prefix,
>>> igt_describe(buff);
>>> snprintf(buff, sizeof(buff), "%s-engine-hang", prefix);
>>> igt_subtest_with_dynamic(buff) {
>>>- int has_gpu_reset = 0;
>>>- struct drm_i915_getparam gp = {
>>>- .param = I915_PARAM_HAS_GPU_RESET,
>>>- .value = &has_gpu_reset,
>>>- };
>>>-
>>>- igt_require(gem_scheduler_has_preemption(device));
>>>- igt_params_set(device, "reset", "%u", -1);
>>>- ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>>>- igt_require(has_gpu_reset > 1);
>>>-
>>> for_each_ctx_engine(device, ctx, e) {
>>> igt_dynamic_f("%s", e->name)
>>>- test_engine_hang(ctx, e, 0);
>>>+ test_engine_hang(ctx, e, 0, reset);
>>> }
>>> }
>>>
>>>@@ -475,19 +507,9 @@ static void do_tests(const char *name, const char *prefix,
>>> igt_describe(buff);
>>> snprintf(buff, sizeof(buff), "%s-engine-error", prefix);
>>> igt_subtest_with_dynamic(buff) {
>>>- int has_gpu_reset = 0;
>>>- struct drm_i915_getparam gp = {
>>>- .param = I915_PARAM_HAS_GPU_RESET,
>>>- .value = &has_gpu_reset,
>>>- };
>>>-
>>>- igt_params_set(device, "reset", "%u", -1);
>>>- ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
>>>- igt_require(has_gpu_reset > 1);
>>>-
>>
>>imho we need this code in test_engine_hang for execlist
>>subbmission for gt-engine-hang case.
>
>for gt-* tests, the fixture in igt_main has
>
>hang = igt_allow_hang(device, ctx->id, HANG_ALLOW_CAPTURE);
>
>That should suffice.
ok, I see what you are saying here - the checks for has_gpu_reset and
gem_scheduler_has_preemption are still needed for the gt-* tests in
exelist mode, although I think that code can be placed in the igt_main.
I will try it out and post another revision.
Thanks,
Umesh
>
>Regards,
>Umesh
>
>
>>
>>Regards,
>>Kamil
>>
>>> for_each_ctx_engine(device, ctx, e) {
>>> igt_dynamic_f("%s", e->name)
>>>- test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS);
>>>+ test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS, reset);
>>> }
>>> }
>>> }
>>>--
>>>2.25.1
>>>
More information about the igt-dev
mailing list