Can you help me debug an issue?
Peter Senna Tschudin
peter.senna at gmail.com
Sun Mar 24 16:54:24 UTC 2024
On Sat, Mar 23, 2024 at 5:23 PM Peter Senna Tschudin
<peter.senna at gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 2:05 PM Peter Senna Tschudin
> <peter.senna at gmail.com> wrote:
> >
> > Dear List,
> >
> > I found a commit that introduced a regression that broke the tests
> > gem_exec_capture at many-4k-incremental and
> > gem_exec_capture at many-4k-zero. Reverting 93c5ec210 fixes the issue,
> > but it does not tell what the problem is.
> >
> > The problem is that `e` gets corrupted when `many()` calls the macro
> > `find_first_available_engine`, and the corruption happens at the line
> > `saved = configure_hangs(fd, e, ctx->id);`. By corrupted, I mean that
> > the field name gets empty and the field class gets a large number.
> >
> > After `e` gets corrupted, the call to __captureN() will fail because
> > it expects 'e' to be valid. A simple fix is to add `e =
> > &saved_engine.engine;` before the call to __captureN().
> >
> > I have been trying to understand why `e` gets corrupted for a few
> > hours, and I ran out of ideas. To make the code more gdb-friendly, I
> > have unfolded the macro find_first_available_engine, but that did not
> > help me find the reason for the `e` corruption. Here is how I have
> > unfolded the macros:
> >
> > - find_first_available_engine(fd, ctx, e, saved_engine);
> > + ctx = intel_ctx_create_all_physical(fd);
> > + igt_assert(ctx);
> > + for (struct intel_engine_data i =
> > intel_engine_list_for_ctx_cfg(fd, &(ctx)->cfg);
> > + (e = intel_get_current_engine(&i));
> > + intel_next_engine(&i)) {
> > + if ((gem_class_can_store_dword(fd, e->class)))
> > + break;
> > + }
> > + igt_assert(e);
> > + printf("e->name: %s\n", e->name);
> > + saved_engine = configure_hangs(fd, e, ctx->id);
> > + printf("e->name: %s\n", e->name);
> >
> > Reverting 93c5ec210 stops the corruption from happening, and I am
> > trying to understand why. Can you help me debug this further?
>
> I believe that the problem is caused by trying to access the contents
> of the variable i after the scope of the variable has ended. I believe
> that the variable i is limited in scope to the lifetime of the for
> loop inside the macro for_each_ctx_cfg_engine().
>
> for_each_ctx_cfg_engine() iterates through elements of a structure
> struct intel_engine_data. This structure contains an array of struct
> intel_execution_engine2 and a pointer to struct
> intel_execution_engine2. The pointer will save the address of one of
> the array elements. My take is that due to the block scope of i,
> attempting to access elements of the array engines after the loop has
> ended is resulting in invalid memory access.
>
> Based on my analysis, the root cause seems to be the attempt to access
> contents of i outside the loop's scope. If this theory holds, the
> block scope of variable i would invalidate access to engines and i
> after the loop. I am considering that reverting 93c5ec210 and having
> the issue fixed is a coincidence rather than a solution.
>
> Can you validate this theory and suggest how to fix the problem?
This is a simple patch that fixes the issue but introduces memory
leaks(by not freeing `e` after use). I tried to craft a proper 'fix'
for the issue, but I did not find a good solution, and I will
appreciate feedback on how you want me to fix it.
-- // --
diff --git a/lib/i915/gem_engine_topology.c b/lib/i915/gem_engine_topology.c
index afb576afb..264dd696c 100644
--- a/lib/i915/gem_engine_topology.c
+++ b/lib/i915/gem_engine_topology.c
@@ -189,12 +189,26 @@ static int __query_engine_list(int fd, struct
intel_engine_data *ed)
struct intel_execution_engine2 *
intel_get_current_engine(struct intel_engine_data *ed)
{
+ struct intel_execution_engine2 *ret = NULL;
+
if (ed->n >= ed->nengines)
ed->current_engine = NULL;
else if (!ed->n)
ed->current_engine = &ed->engines[0];
- return ed->current_engine;
+ // This function is called from the macro for_each_ctx_cfg_engine()
+ // which defines *ed inside a for loop. Not memcping ed->current_engine
+ // will lead to a memory access violation when trying to dereference
+ // ed->current_engine after the end of the for loop.
+ if (ed->current_engine) {
+ ret = malloc(sizeof(*ret));
+ if (ret)
+ memcpy(ret, ed->current_engine, sizeof(*ret));
+ }
+
+ // One can always buy more RAM... This patch does not free the memory
+ // on the call sites...
+ return ret;
}
void intel_next_engine(struct intel_engine_data *ed)
-- // --
--
Peter
More information about the igt-dev
mailing list