[PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace

Wed Dec 6 10:35:04 UTC 2023

Hi John,

On 12/5/2023 8:50 PM, John Harrison wrote:
> On 12/5/2023 02:39, Nirmoy Das wrote:
>> Hi John,
>>
>> On 12/5/2023 10:10 AM, John Harrison wrote:
>>> On 12/5/2023 00:52, Nirmoy Das wrote:
>>>> gen8_engine_reset_prepare() can fail when HW fails to set
>>>> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
>>>> error as driver will retry.
>>>>
>>>> Convert the log to a trace log for debugging without triggering
>>>> unnecessary concerns in CI or for end-users during non-fatal 
>>>> scenarios.
>>> I strongly disagree with this change. The hardware spec for the 
>>> RESET_CTL and GDRST registers are that they will self clear within a 
>>> matter of microseconds. If something is so badly wrong with the 
>>> hardware that it can't even manage to reset
>>
>>
>> This message is for reset readiness  poll timeout not that the reset 
>> is failed which doesn't sound so serious if the subsequent attempt 
>> managed reset the engine.
> Not sure what the distinction is. The reset procedure is poke 
> RESET_CTL wait for it to clear, poke GDRST and wait for it to clear. 
> Just because step one is failing rather than step 2 does not mean that 
> the reset as a whole has not failed.
>
> Note that the purpose of RESET_CTL is to pause a bunch of stuff like 
> the command streamers to prevent them from issuing new memory requests 
> while the reset is in progress. If it fails, it likely means that a CS 
> is refusing to stop. Most probably because it can't reach a stopping 
> point because it is stuck waiting on a lost memory request or similar. 
> And the point of stopping further memory requests during reset is that 
> if the memory channel gets out of sync (because only the GT side is 
> reset during a GT reset) then that can result in total system failure. 
> As in potentially even the CPU can no longer get to memory if it is an 
> integrated platform. So yes, it can be quite a serious failure indeed.
>

Thanks bspec didn't explain those details. My intention was to 
acknowledge that engine reset is a complicated process which why the 
driver retries  and don't spook CI/user if subsequent reset works but I 
get your objection on this.

>>
>> I couldn't get enough details when this can happen that HW takes very 
>> long time to set the readiness bit.
> Is it simply 'taking a long time' or is never clearing at all? If it 
> is just that the timeout is too short then the proper fix would be to 
> increase the timeout. But if it is taking seconds or longer or just 
> never succeeding at all, then something is very bad.

I tried with 10x timeout without any help so I think the CS is stuck 
though re-try works. I will try to get more details from HW team on this 
issue.

>
>>
>>
>>> then that is something that very much warrants more than a 
>>> completely silent trace event. It most certainly should be flagged 
>>> as a failure in CI.
>>>
>>> Just because the driver will retry does not mean that this is not a 
>>> serious error. And if the first attempt failed, why would a 
>>> subsequent attempt succeed?
>>
>> The patch is not ignoring the failure. If the subsequent attempt 
>> fails then driver load will fail or it will be wedged if that happens 
>> after driver load.
> One thing I really hate about our driver is the total lack of 
> information when something goes wrong during load. The driver wedges 
> in total silence. There are many error paths that have no reporting at 
> all. Which means you are left with a totally useless bug report.
>
>
>>
>>
>>> Escalating to FLR may have more success, but that is not something 
>>> that i915 currently does.
>>
>> Do we still need to do FLR if a subsequent engine reset failure ?
> Assuming that we are talking about modern(ish) platforms, an engine 
> reset failure would be hit by GuC rather than i915, but that would be 
> escalated to an i915 based full GT reset. Generally speaking though, 
> if the engine reset fails the GT reset isn't going to do much better. 
> It would fix a dead GuC problem but it can't help with memory related 
> issues. If the full GT reset fails then we are out of escalation 
> routes as there is no FLR path at present (I think we have that at 
> driver unload on MTL but not for general reset?). The FLR resets a lot 
> more than just the GT, so it does have a chance to fix some issues 
> that a GT reset can't. After driver-level FLR, there is PCI level FLR. 
> Not sure if that involves a full power down and restart, but if not 
> then that would be the last escalation possible. A power cycle really 
> should fix any issues, if it doesn't then it's time to return the 
> system as being totally dead!
>
> My recollection is that the vast majority of engine reset failures 
> I've looked at have been completely catastrophic and the system only 
> recovered after a reboot. I.e. after the card was power cycled. Such 
> issues were generally caused by bad memory. Once the path to memory 
> has died, there really is not much of the GPU that can do anything at 
> all and there isn't much that can be done to recover it.

Thanks,

Nirmoy

>
> John.
>
>
>>
>>
>> Regards,
>>
>> Nirmoy
>>
>>>
>>> John.
>>>
>>>
>>>>
>>>> v2: Improve commit message(Tvrtko)
>>>>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>> Cc: Andi Shyti <andi.shyti at linux.intel.com>
>>>> Cc: Andrzej Hajda <andrzej.hajda at intel.com>
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
>>>> Signed-off-by: Nirmoy Das <nirmoy.das at intel.com>
>>>> Reviewed-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>> Reviewed-by: Andrzej Hajda <andrzej.hajda at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> index d5ed904f355d..e6fbc6202c80 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct 
>>>> intel_engine_cs *engine)
>>>>       ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>>>>                          700, 0, NULL);
>>>>       if (ret)
>>>> -        gt_err(engine->gt,
>>>> -               "%s reset request timed out: {request: %08x, 
>>>> RESET_CTL: %08x}\n",
>>>> -               engine->name, request,
>>>> -               intel_uncore_read_fw(uncore, reg));
>>>> +        GT_TRACE(engine->gt,
>>>> +             "%s reset request timed out: {request: %08x, 
>>>> RESET_CTL: %08x}\n",
>>>> +             engine->name, request,
>>>> +             intel_uncore_read_fw(uncore, reg));
>>>>         return ret;
>>>>   }
>>>
>