[Intel-gfx] [PATCH] drm/i915/guc: Fix missing ecodes
John Harrison
john.c.harrison at intel.com
Sat Jan 28 02:28:11 UTC 2023
On 1/26/2023 11:17, Teres Alexis, Alan Previn wrote:
> Firstly, thanks for catching this miss.
> Since I only have one trivial nit and one non-blocker ask, and the
> non-blocker ask will not impact the patch intent as it merely tweaks
> an existing debug message, I believe we have an rb:
>
> Reviewed-by: Alan Previn <alan.previn.teres.alexis at intel.com>
>
> On Tue, 2023-01-24 at 16:49 -0800, Harrison, John C wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> Error captures are tagged with an 'ecode'. This is a pseudo-unique magic
>> number that is meant to distinguish similar-seeming bugs with
>> different underlying signatures. It is a combination of two ring state
>> registers. Unfortunately, the register state being used is only valid
>> in execlist mode. In GuC mode, the register state exists in a separate
>> list of arbitrary register address/value pairs rather than the named
>> entry structure. So, search through that list to find the two exciting
>> registers and copy them over to the structure's named members.
>>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> Fixes: a6f0f9cf330a ("drm/i915/guc: Plumb GuC-capture into gpu_coredump")
>> Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>> Cc: Jani Nikula <jani.nikula at linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>> Cc: Matt Roper <matthew.d.roper at intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty at intel.com>
>> Cc: Michael Cheng <michael.cheng at intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Bruce Chang <yu.bruce.chang at intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>> Cc: Matthew Auld <matthew.auld at intel.com>
>> ---
>> .../gpu/drm/i915/gt/uc/intel_guc_capture.c | 22 +++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> index 1c1b85073b4bd..4e0b06ceed96d 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
>> @@ -1571,6 +1571,27 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
>>
>> #endif //CONFIG_DRM_I915_CAPTURE_ERROR
>>
>> +static void guc_capture_find_ecode(struct intel_engine_coredump *ee)
>> +{
>> + struct gcap_reg_list_info *reginfo;
>> + struct guc_mmio_reg *regs;
>> + i915_reg_t reg_ipehr = RING_IPEHR(0);
>> + i915_reg_t reg_instdone = RING_INSTDONE(0);
>> + int i;
>> +
>> + if (!ee->guc_capture_node)
>> + return;
>> +
>> + reginfo = ee->guc_capture_node->reginfo + GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE;
>> + regs = reginfo->regs;
>> + for (i = 0; i < reginfo->num_regs; i++) {
>> + if (regs[i].offset == reg_ipehr.reg)
>> + ee->ipehr = regs[i].value;
>> + if (regs[i].offset == reg_instdone.reg)
> nit: "else if"?
>> + ee->instdone.instdone = regs[i].value;
>> + }
>> +}
>> +
>> void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
>> {
>> if (!ee || !ee->guc_capture_node)
>> @@ -1612,6 +1633,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,
>> list_del(&n->link);
>> ee->guc_capture_node = n;
>> ee->capture = guc->capture;
>> + guc_capture_find_ecode(ee);
>> return;
>> }
>> }
> alan: only one non-blocker request:
> while we are here, could we update the debug message when we can't find a matching captured node?
> Current code:
> drm_dbg(&i915->drm, "GuC capture can't match ee to node\n");
> New suggestion:
> drm_dbg(&i915->drm, "GuC capture can't find node for ee-ctx: lrca = 0x%08x | gucid = 0x%08x\n",
> ce->lrc.lrca, ce->guc_id.id);
Regarding the search test, some of its terms look incorrect. The 'if'
itself is also not the easiest to read, with some terms split across
multiple lines and other lines holding multiple terms. Breaking it down:
(n->eng_inst == GUC_ID_TO_ENGINE_INSTANCE(ee->engine->guc_id) &&
n->eng_class == GUC_ID_TO_ENGINE_CLASS(ee->engine->guc_id) &&
n->guc_id &&
Why does the GuC id have to be non-zero? Zero is a valid id. And even if
it isn't, comparing to ce->guc_id.id is sufficient to filter out
anything bad.
n->guc_id == ce->guc_id.id &&
(n->lrca & CTX_GTT_ADDRESS_MASK) &&
Again, address zero is not invalid but the next test makes this one
redundant anyway.
(n->lrca & CTX_GTT_ADDRESS_MASK) == (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) {
Any objection to dropping the !zero tests and reformatting the whole thing?
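For illustration only, the simplified test could look something like the
standalone sketch below, with one term per line and no non-zero checks.
The struct types, the helper name node_matches() and the mask value are
stand-ins made up for this sketch, not the driver's actual definitions;
the real code would compare against ee->engine->guc_id and ce directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the real mask comes from the i915 headers. */
#define CTX_GTT_ADDRESS_MASK 0xfffff000u

/* Hypothetical, cut-down stand-in for a captured GuC node. */
struct capture_node {
	uint32_t eng_class;
	uint32_t eng_inst;
	uint32_t guc_id;
	uint32_t lrca;
};

/* Hypothetical stand-in for the engine/context identifiers being matched. */
struct context_ids {
	uint32_t eng_class;
	uint32_t eng_inst;
	uint32_t guc_id;
	uint32_t lrca;
};

/*
 * One comparison per line. There are no separate "!= 0" tests: an id or
 * address of zero is accepted as long as it equals the expected value,
 * and any mismatching value is rejected by the equality tests alone.
 */
static bool node_matches(const struct capture_node *n,
			 const struct context_ids *c)
{
	return n->eng_inst == c->eng_inst &&
	       n->eng_class == c->eng_class &&
	       n->guc_id == c->guc_id &&
	       (n->lrca & CTX_GTT_ADDRESS_MASK) ==
	       (c->lrca & CTX_GTT_ADDRESS_MASK);
}
```

With this shape, a node whose guc_id is zero still matches a context
whose guc_id is zero, which is the behaviour argued for above.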
John.
>
>
>