[igt-dev] [PATCH i-g-t] i915/hangman: Fix gt hang/error tests
Kamil Konieczny
kamil.konieczny at linux.intel.com
Thu Aug 4 16:57:32 UTC 2022
Hi Umesh,
On 2022-07-29 at 12:27:41 -0700, Nerlige Ramappa, Umesh wrote:
> From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>
> gt-engine-hang and gt-engine-error tests were still using reset=2
> setting so that ended up just doing an engine reset. Fix the tests so
> that they actually do the gt specific reset and set the expectations
> correctly.
>
> In some rare failures, some of the background spinners show up as
> innocent and keep spinning after the hang recovery. This can only happen
> if the spinners did not start for some reason. Check that the spinners
> actually started before submitting a hanging batch.
imho this is a fix, it is worth to split into separate patch.
>
> For engine specific hang, only one context is marked guilty, but for a
> gt hang all contexts are marked guilty. Check for different expected
> behavior for engine vs. gt reset.
This is true only for GuC submission on new platforms.
When execlist is used, these tests are the same as engine-engine-*
as you noted at beginning.
>
> v2:
> - gt-reset resets all contexts on all engines. The execlist implementation
> of gt-engine-* tests expected that a preemptible background spinner
> running on the target engine should be marked innocent. While i915 can
> mark such a context for execlist mode, GuC scheduling does not guarantee
> that the background spinner can be marked as innocent. Since the state
> of the background spinner depends on the scheduling backend, do no
> validate the state of the background spinner for the target engine.
>
> v3:
> - omit checking background spinner state only for gt-reset case
>
> v4: (Kamil)
> - check has_gpu_reset and gem_scheduler_has_preemption for the
> gt-engine-hang and gt-engine-error tests.
>
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> ---
> tests/i915/i915_hangman.c | 75 +++++++++++++++++++++++++++++++--------
> 1 file changed, 60 insertions(+), 15 deletions(-)
>
> diff --git a/tests/i915/i915_hangman.c b/tests/i915/i915_hangman.c
> index c7d69fdd..bdf47bc0 100644
> --- a/tests/i915/i915_hangman.c
> +++ b/tests/i915/i915_hangman.c
> @@ -295,9 +295,15 @@ static void context_unban(int fd, unsigned ctx)
> gem_context_set_param(fd, ¶m);
> }
>
> +enum reset_type {
> + GT_RESET = 1,
> + ENGINE_RESET = 2,
> +};
> +
> static void
> test_engine_hang(const intel_ctx_t *ctx,
> - const struct intel_execution_engine2 *e, unsigned int flags)
> + const struct intel_execution_engine2 *e, unsigned int flags,
> + enum reset_type reset)
Using bool as param are no good, imho we should split this into
two functions, see below after your comment on
egine-engine-hang.
> {
> const struct intel_execution_engine2 *other;
> const intel_ctx_t *local_ctx[GEM_MAX_ENGINES];
> @@ -309,6 +315,7 @@ test_engine_hang(const intel_ctx_t *ctx,
> igt_skip_on(flags & IGT_SPIN_INVALID_CS &&
> gem_engine_has_cmdparser(device, &ctx->cfg, e->flags));
>
> + igt_assert(reset == GT_RESET || reset == ENGINE_RESET);
> /*
> * Fill all engines with background load.
> * This verifies that independent engines are unaffected and gives
> @@ -324,7 +331,10 @@ test_engine_hang(const intel_ctx_t *ctx,
> .ahnd = ahndN,
> .ctx = local_ctx[num_ctx],
> .engine = other->flags,
> - .flags = IGT_SPIN_FENCE_OUT);
> + .flags = IGT_SPIN_FENCE_OUT |
> + IGT_SPIN_POLL_RUN);
> + igt_spin_busywait_until_started(spin);
> +
This is a fix, please split it to separate patch.
> num_ctx++;
>
> igt_list_move(&spin->link, &list);
> @@ -344,14 +354,44 @@ test_engine_hang(const intel_ctx_t *ctx,
> igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);
> igt_spin_free(device, spin);
>
> - /* But no other engines/clients should be affected */
> - igt_list_for_each_entry_safe(spin, next, &list, link) {
> + /*
> + * engine-engine-hang: Other engines/clients should not be affected for
> + * engine reset, so innocent contexts complete successfully once the
> + * spinner is ended.
> + *
> + * gt-engine-hang: All engines/clients are guilty and complete with a
> + * -EIO fence status, however the background task that was submitted to
> + * the target engine is innocent and is expected to complete
> + * successfully.
> + */
>From this description maybe it it worth to split this function
into two separate functions : gt_engine_hang and engine_engine_hang.
Latter one will be old and in former one we could add different
bahaviour. I will add Janusz to Cc for this topic.
> + igt_list_for_each_entry_safe_reverse(spin, next, &list, link) {
Why do you change this into _reverse ?
> + bool innocent = reset == ENGINE_RESET;
When we split functionality then we can remove this check.
> + int expect = innocent ? 1 : -EIO;
> +
> ahndN = spin->opts.ahnd;
> - igt_assert(sync_fence_wait(spin->out_fence, 0) == -ETIME);
> - igt_spin_end(spin);
> + if (innocent) {
> + igt_assert_eq(sync_fence_wait(spin->out_fence, 0), -ETIME);
> + igt_spin_end(spin);
> + }
>
> - igt_assert(sync_fence_wait(spin->out_fence, 500) == 0);
> - igt_assert_eq(sync_fence_status(spin->out_fence), 1);
> + if (reset == GT_RESET && spin->opts.engine == e->flags) {
--------------------------------------- ^
Here you will check execlist or GuC ? This shows we need to split
functionality to avoid logic errors and further simplify our
checks.
> + /*
> + * gt-reset resets everything. The execlist
> + * implementation of gt-engine-* tests expect that a
> + * preemptible background spinner running on the target
> + * engine should be marked innocent. While i915 can mark
> + * such a context innocent for execlist mode, GuC
> + * scheduling does not guarantee that the background
> + * spinner can be marked as innocent. Since the state of
> + * the background spinner depends on the scheduling
> + * backend, do no validate the state of the background
> + * spinner for the target engine when doing a gt-reset.
> + */
> + igt_spin_end(spin);
> + } else {
> + igt_assert_eq(sync_fence_wait(spin->out_fence, 500), 0);
> + igt_assert_eq(sync_fence_status(spin->out_fence), expect);
> + }
> igt_spin_free(device, spin);
> put_ahnd(ahndN);
> }
> @@ -439,6 +479,10 @@ static void do_tests(const char *name, const char *prefix,
> {
> const struct intel_execution_engine2 *e;
> char buff[256];
> + enum reset_type reset = ENGINE_RESET;
> +
> + if (!strncmp(prefix, "gt", 2))
> + reset = GT_RESET;
>
> snprintf(buff, sizeof(buff), "Per engine error capture (%s reset)", name);
> igt_describe(buff);
> @@ -461,13 +505,13 @@ static void do_tests(const char *name, const char *prefix,
> };
>
> igt_require(gem_scheduler_has_preemption(device));
> - igt_params_set(device, "reset", "%u", -1);
> ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
> - igt_require(has_gpu_reset > 1);
> + igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
> + (reset == ENGINE_RESET && has_gpu_reset > 1));
Move above checks into separate function (avoid code duplication),
so here:
check_reset(reset);
and
check_reset(enum reset_type reset)
{
...put data declarations here...
ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
(reset == ENGINE_RESET && has_gpu_reset > 1));
}
>
> for_each_ctx_engine(device, ctx, e) {
> igt_dynamic_f("%s", e->name)
> - test_engine_hang(ctx, e, 0);
> + test_engine_hang(ctx, e, 0, reset);
It is better to make this into:
gt_reset ? gt_engine_hang(ctx, e, 0) : engine_engine_hang(ctx, e, 0);
and before that set:
bool gt_reset = reset == GT_RESET && gem_using_guc_submission(device) &&
intel_gen(intel_get_drm_devid(i915)) >= 12;
imho we want to check both old execlist and new GuC behaviour.
> }
> }
>
> @@ -481,13 +525,14 @@ static void do_tests(const char *name, const char *prefix,
> .value = &has_gpu_reset,
> };
>
> - igt_params_set(device, "reset", "%u", -1);
> - ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
> - igt_require(has_gpu_reset > 1);
> + igt_require(gem_scheduler_has_preemption(device));
Why is it needed here ?
> + ioctl(device, DRM_IOCTL_I915_GETPARAM, &gp);
-- ^
Use tabs, not spaces here.
> + igt_require((reset == GT_RESET && has_gpu_reset == 1) ||
> + (reset == ENGINE_RESET && has_gpu_reset > 1));
Instead of above use check_reset(reset) here.
>
> for_each_ctx_engine(device, ctx, e) {
> igt_dynamic_f("%s", e->name)
> - test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS);
> + test_engine_hang(ctx, e, IGT_SPIN_INVALID_CS, reset);
Same here, call gt_engine_hang or engine_engine_hang instead
(see above). What happens when there are no preemption ?
> }
> }
> }
> --
> 2.36.1
>
Regards,
Kamil
More information about the igt-dev
mailing list