Can you help me debug an issue?

Peter Senna Tschudin peter.senna at gmail.com
Sat Mar 23 16:23:06 UTC 2024


On Sat, Mar 23, 2024 at 2:05 PM Peter Senna Tschudin
<peter.senna at gmail.com> wrote:
>
> Dear List,
>
> I found a commit that introduced a regression that broke the tests
> gem_exec_capture at many-4k-incremental and
> gem_exec_capture at many-4k-zero. Reverting 93c5ec210 fixes the issue,
> but it does not tell what the problem is.
>
> The problem is that `e` gets corrupted when `many()` calls the macro
> `find_first_available_engine`, and the corruption happens at the line
> `saved = configure_hangs(fd, e, ctx->id);`. By corrupted, I mean that
> the field name gets empty and the field class gets a large number.
>
> After `e` gets corrupted, the call to __captureN() will fail because
> it expects 'e' to be valid. A simple fix is to add `e =
> &saved_engine.engine;` before the call to __captureN().
>
> I have been trying to understand why `e` gets corrupted for a few
> hours, and I ran out of ideas. To make the code more gdb-friendly, I
> have unfolded the macro find_first_available_engine, but that did not
> help me find the reason for the `e` corruption. Here is how I have
> unfolded the macros:
>
> -       find_first_available_engine(fd, ctx, e, saved_engine);
> +       ctx = intel_ctx_create_all_physical(fd);
> +       igt_assert(ctx);
> +       for (struct intel_engine_data i =
> intel_engine_list_for_ctx_cfg(fd, &(ctx)->cfg);
> +            (e = intel_get_current_engine(&i));
> +            intel_next_engine(&i)) {
> +               if ((gem_class_can_store_dword(fd, e->class)))
> +                       break;
> +                }
> +       igt_assert(e);
> +       printf("e->name: %s\n", e->name);
> +       saved_engine = configure_hangs(fd, e, ctx->id);
> +       printf("e->name: %s\n", e->name);
>
> Reverting 93c5ec210 stops the corruption from happening, and I am
> trying to understand why. Can you help me debug this further?

I believe that the problem is caused by trying to access the contents
of the variable i after the scope of the variable has ended. I believe
that the variable i is limited in scope to the lifetime of the for
loop inside the macro for_each_ctx_cfg_engine().

for_each_ctx_cfg_engine() iterates through elements of a structure
struct intel_engine_data. This structure contains an array of struct
intel_execution_engine2 and a pointer to struct
intel_execution_engine2. The pointer will save the address of one of
the array elements. My take is that due to the block scope of i,
attempting to access elements of the array engines after the loop has
ended is resulting in invalid memory access.

Based on my analysis, the root cause seems to be the attempt to access
contents of i outside the loop's scope. If this theory holds, the
block scope of variable i would invalidate access to engines and i
after the loop. I am considering that reverting 93c5ec210 and having
the issue fixed is a coincidence rather than a solution.

Can you validate this theory and suggest how to fix the problem?


-- 
                         Peter


More information about the igt-dev mailing list