[Intel-gfx] [PATCH] drm/i915/guc: Fix error capture for virtual engines
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Thu Apr 27 17:00:08 UTC 2023
On Fri, 2023-04-14 at 17:27 -0700, John.C.Harrison at Intel.com wrote:
> From: John Harrison <John.C.Harrison at Intel.com>
>
> GuC based register dumps in error capture logs were basically broken
> for virtual engines. This can be seen in igt at gem_exec_balancer@hang:
> [IGT] gem_exec_balancer: starting subtest hang
> [drm] GPU HANG: ecode 12:4:e1524110, in gem_exec_balanc [6388]
> [drm] GT0: GUC: No register capture node found for 0x1005 / 0xFEDC311D
> [drm] GPU HANG: ecode 12:4:00000000, in gem_exec_balanc [6388]
> [IGT] gem_exec_balancer: exiting, ret=0
>
> The test causes a hang on both engines of a virtual engine context.
> The engine instance zero hang gets a valid error capture but the
> non-instance-zero hang does not.
>
> Fix that by scanning through the list of pending register captures
> when a hang notification for a virtual engine is received. That way,
> the hang can be assigned to the correct physical engine prior to
> starting the error capture process. So later on, when the error capture
> handler tries to find the engine register list, it looks for one on
> the correct engine.
>
> Also, sneak in a missing blank line before a comment in the node
> search code.
>
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
LGTM - thanks for fixing this! :D
Reviewed-by: Alan Previn <alan.previn.teres.alexis at intel.com>
A side conversation - potentially requring an unrelated future patch,...
i notice that the array "error->reset_engine_count[]" (which is being
used for error state reporting and as some verification in selftests) seem
to have a size indicating of engine-instance-count but the reading/wrting
of members of this array keep using the engine->uabi_class as index...
meaning its being used to track engine class reset counts? Maybe this is
an issue or i am misundertanding that. Either way, that issue is unrelated
to the intent of this patch - i just wanted to get that highlighted for
future action if needed. I can take that onus if its in fact an issue.
More information about the dri-devel
mailing list