Can you help me debug an issue?

Sun Mar 24 16:54:24 UTC 2024

On Sat, Mar 23, 2024 at 5:23 PM Peter Senna Tschudin
<peter.senna at gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 2:05 PM Peter Senna Tschudin
> <peter.senna at gmail.com> wrote:
> >
> > Dear List,
> >
> > I found a commit that introduced a regression that broke the tests
> > gem_exec_capture at many-4k-incremental and
> > gem_exec_capture at many-4k-zero. Reverting 93c5ec210 fixes the issue,
> > but it does not tell what the problem is.
> >
> > The problem is that `e` gets corrupted when `many()` calls the macro
> > `find_first_available_engine`, and the corruption happens at the line
> > `saved = configure_hangs(fd, e, ctx->id);`. By corrupted, I mean that
> > the field name gets empty and the field class gets a large number.
> >
> > After `e` gets corrupted, the call to __captureN() will fail because
> > it expects 'e' to be valid. A simple fix is to add `e =
> > &saved_engine.engine;` before the call to __captureN().
> >
> > I have been trying to understand why `e` gets corrupted for a few
> > hours, and I ran out of ideas. To make the code more gdb-friendly, I
> > have unfolded the macro find_first_available_engine, but that did not
> > help me find the reason for the `e` corruption. Here is how I have
> > unfolded the macros:
> >
> > -       find_first_available_engine(fd, ctx, e, saved_engine);
> > +       ctx = intel_ctx_create_all_physical(fd);
> > +       igt_assert(ctx);
> > +       for (struct intel_engine_data i =
> > intel_engine_list_for_ctx_cfg(fd, &(ctx)->cfg);
> > +            (e = intel_get_current_engine(&i));
> > +            intel_next_engine(&i)) {
> > +               if ((gem_class_can_store_dword(fd, e->class)))
> > +                       break;
> > +                }
> > +       igt_assert(e);
> > +       printf("e->name: %s\n", e->name);
> > +       saved_engine = configure_hangs(fd, e, ctx->id);
> > +       printf("e->name: %s\n", e->name);
> >
> > Reverting 93c5ec210 stops the corruption from happening, and I am
> > trying to understand why. Can you help me debug this further?
>
> I believe that the problem is caused by trying to access the contents
> of the variable i after the scope of the variable has ended. I believe
> that the variable i is limited in scope to the lifetime of the for
> loop inside the macro for_each_ctx_cfg_engine().
>
> for_each_ctx_cfg_engine() iterates through elements of a structure
> struct intel_engine_data. This structure contains an array of struct
> intel_execution_engine2 and a pointer to struct
> intel_execution_engine2. The pointer will save the address of one of
> the array elements. My take is that due to the block scope of i,
> attempting to access elements of the array engines after the loop has
> ended is resulting in invalid memory access.
>
> Based on my analysis, the root cause seems to be the attempt to access
> contents of i outside the loop's scope. If this theory holds, the
> block scope of variable i would invalidate access to engines and i
> after the loop. I am considering that reverting 93c5ec210 and having
> the issue fixed is a coincidence rather than a solution.
>
> Can you validate this theory and suggest how to fix the problem?

This is a simple patch that fixes the issue but introduces memory
leaks(by not freeing `e` after use). I tried to craft a proper 'fix'
for the issue, but I did not find a good solution, and I will
appreciate feedback on how you want me to fix it.

-- // --

diff --git a/lib/i915/gem_engine_topology.c b/lib/i915/gem_engine_topology.c
index afb576afb..264dd696c 100644
--- a/lib/i915/gem_engine_topology.c
+++ b/lib/i915/gem_engine_topology.c
@@ -189,12 +189,26 @@ static int __query_engine_list(int fd, struct
intel_engine_data *ed)
 struct intel_execution_engine2 *
 intel_get_current_engine(struct intel_engine_data *ed)
 {
+       struct intel_execution_engine2 *ret = NULL;
+
        if (ed->n >= ed->nengines)
                ed->current_engine = NULL;
        else if (!ed->n)
                ed->current_engine = &ed->engines[0];

-       return ed->current_engine;
+       // This function is called from the macro for_each_ctx_cfg_engine()
+       // which defines *ed inside a for loop. Not memcping ed->current_engine
+       // will lead to a memory access violation when trying to dereference
+       // ed->current_engine after the end of the for loop.
+       if (ed->current_engine) {
+               ret = malloc(sizeof(*ret));
+               if (ret)
+                       memcpy(ret, ed->current_engine, sizeof(*ret));
+       }
+
+       // One can always buy more RAM... This patch does not free the memory
+       // on the call sites...
+       return ret;
 }

 void intel_next_engine(struct intel_engine_data *ed)
-- // --
-- 
                         Peter