[PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
Nirmoy Das
nirmoy.das at linux.intel.com
Wed Dec 6 10:35:04 UTC 2023
Hi John,
On 12/5/2023 8:50 PM, John Harrison wrote:
> On 12/5/2023 02:39, Nirmoy Das wrote:
>> Hi John,
>>
>> On 12/5/2023 10:10 AM, John Harrison wrote:
>>> On 12/5/2023 00:52, Nirmoy Das wrote:
>>>> gen8_engine_reset_prepare() can fail when HW fails to set
>>>> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
>>>> error as driver will retry.
>>>>
>>>> Convert the log to a trace log for debugging without triggering
>>>> unnecessary concerns in CI or for end-users during non-fatal
>>>> scenarios.
>>> I strongly disagree with this change. The hardware spec for the
>>> RESET_CTL and GDRST registers are that they will self clear within a
>>> matter of microseconds. If something is so badly wrong with the
>>> hardware that it can't even manage to reset
>>
>>
>> This message is for reset readiness poll timeout not that the reset
>> is failed which doesn't sound so serious if the subsequent attempt
>> managed reset the engine.
> Not sure what the distinction is. The reset procedure is poke
> RESET_CTL wait for it to clear, poke GDRST and wait for it to clear.
> Just because step one is failing rather than step 2 does not mean that
> the reset as a whole has not failed.
>
> Note that the purpose of RESET_CTL is to pause a bunch of stuff like
> the command streamers to prevent them from issuing new memory requests
> while the reset is in progress. If it fails, it likely means that a CS
> is refusing to stop. Most probably because it can't reach a stopping
> point because it is stuck waiting on a lost memory request or similar.
> And the point of stopping further memory requests during reset is that
> if the memory channel gets out of sync (because only the GT side is
> reset during a GT reset) then that can result in total system failure.
> As in potentially even the CPU can no longer get to memory if it is an
> integrated platform. So yes, it can be quite a serious failure indeed.
>
Thanks bspec didn't explain those details. My intention was to
acknowledge that engine reset is a complicated process which why the
driver retries and don't spook CI/user if subsequent reset works but I
get your objection on this.
>>
>> I couldn't get enough details when this can happen that HW takes very
>> long time to set the readiness bit.
> Is it simply 'taking a long time' or is never clearing at all? If it
> is just that the timeout is too short then the proper fix would be to
> increase the timeout. But if it is taking seconds or longer or just
> never succeeding at all, then something is very bad.
I tried with 10x timeout without any help so I think the CS is stuck
though re-try works. I will try to get more details from HW team on this
issue.
>
>>
>>
>>> then that is something that very much warrants more than a
>>> completely silent trace event. It most certainly should be flagged
>>> as a failure in CI.
>>>
>>> Just because the driver will retry does not mean that this is not a
>>> serious error. And if the first attempt failed, why would a
>>> subsequent attempt succeed?
>>
>> The patch is not ignoring the failure. If the subsequent attempt
>> fails then driver load will fail or it will be wedged if that happens
>> after driver load.
> One thing I really hate about our driver is the total lack of
> information when something goes wrong during load. The driver wedges
> in total silence. There are many error paths that have no reporting at
> all. Which means you are left with a totally useless bug report.
>
>
>>
>>
>>> Escalating to FLR may have more success, but that is not something
>>> that i915 currently does.
>>
>> Do we still need to do FLR if a subsequent engine reset failure ?
> Assuming that we are talking about modern(ish) platforms, an engine
> reset failure would be hit by GuC rather than i915, but that would be
> escalated to an i915 based full GT reset. Generally speaking though,
> if the engine reset fails the GT reset isn't going to do much better.
> It would fix a dead GuC problem but it can't help with memory related
> issues. If the full GT reset fails then we are out of escalation
> routes as there is no FLR path at present (I think we have that at
> driver unload on MTL but not for general reset?). The FLR resets a lot
> more than just the GT, so it does have a chance to fix some issues
> that a GT reset can't. After driver-level FLR, there is PCI level FLR.
> Not sure if that involves a full power down and restart, but if not
> then that would be the last escalation possible. A power cycle really
> should fix any issues, if it doesn't then it's time to return the
> system as being totally dead!
>
> My recollection is that the vast majority of engine reset failures
> I've looked at have been completely catastrophic and the system only
> recovered after a reboot. I.e. after the card was power cycled. Such
> issues were generally caused by bad memory. Once the path to memory
> has died, there really is not much of the GPU that can do anything at
> all and there isn't much that can be done to recover it.
Thanks,
Nirmoy
>
> John.
>
>
>>
>>
>> Regards,
>>
>> Nirmoy
>>
>>>
>>> John.
>>>
>>>
>>>>
>>>> v2: Improve commit message(Tvrtko)
>>>>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>> Cc: Andi Shyti <andi.shyti at linux.intel.com>
>>>> Cc: Andrzej Hajda <andrzej.hajda at intel.com>
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
>>>> Signed-off-by: Nirmoy Das <nirmoy.das at intel.com>
>>>> Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>> Reviewed-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> index d5ed904f355d..e6fbc6202c80 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct
>>>> intel_engine_cs *engine)
>>>> ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>>>> 700, 0, NULL);
>>>> if (ret)
>>>> - gt_err(engine->gt,
>>>> - "%s reset request timed out: {request: %08x,
>>>> RESET_CTL: %08x}\n",
>>>> - engine->name, request,
>>>> - intel_uncore_read_fw(uncore, reg));
>>>> + GT_TRACE(engine->gt,
>>>> + "%s reset request timed out: {request: %08x,
>>>> RESET_CTL: %08x}\n",
>>>> + engine->name, request,
>>>> + intel_uncore_read_fw(uncore, reg));
>>>> return ret;
>>>> }
>>>
>
More information about the dri-devel
mailing list