[Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

Wed Oct 4 21:15:16 UTC 2023

On 10/4/2023 13:09, Andi Shyti wrote:
> Hi John,
>
>>>>>>> The MCR steering semaphore is a shared lock entry between i915
>>>>>>> and various firmware components.
>>>>>>>
>>>>>>> Getting the lock might sinchronize on some shared resources.
>>>>>>> Sometimes though, it might happen that the firmware forgets to
>>>>>>> unlock causing unnecessary noise in the driver which keeps doing
>>>>>>> what was supposed to do, ignoring the problem.
>>>>>>>
>>>>>>> Do not consider this failure as an error, but just print a debug
>>>>>>> message stating that the MCR locking has been skipped.
>>>>>>>
>>>>>>> On the driver side we still have spinlocks that make sure that
>>>>>>> the access to the resources is serialized.
>>>>>>>
>>>>>>> Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>>>> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++----
>>>>>>>      1 file changed, 2 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>>>> index 326c2ed1d99b..51eb693df39b 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>>>> @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
>>>>>>>      	 * would indicate some hardware/firmware is misbehaving and not
>>>>>>>      	 * releasing it properly.
>>>>>>>      	 */
>>>>>>> -	if (err == -ETIMEDOUT) {
>>>>>>> -		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
>>>>>>> -		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
>>>>>>> -	}
>>>>>>> +	if (err == -ETIMEDOUT)
>>>>>>> +		gt_dbg(gt, "hardware MCR steering semaphore timed out");
>>>>>>>      }
>>>>>>>      /**
>>>>>> Are we sure this does not warrant a level higher than dbg, such as
>>>>>> notice/warn?
>>>>> We might make it a warn, but this doesn't change much the economy
>>>>> of the driver as we will keep doing what we were supposed to do.
>>>>>
>>>>>> Because how can we be sure the two entities will not stomp on
>>>>>> each other toes if we failed to obtain lock?
>>>>> So far, in all the research I've done, no one looks like using
>>>>> MCR lock, but yet someone is stuck in it.
>>>> If someone has the lock then that someone thinks they are using it. Just
>>>> because you can't see what someone piece of IFWI is doing doesn't mean it
>>>> isn't doing it. And if it is a genuinely missing unlock then it needs to be
>>>> tracked down and fixed with an IFWI update otherwise the system is going to
>>>> be unstable from that point on.
>>> But I'm not changing here the behavior of the driver. The driver
>>> will keep doing what was doing before.
>> And what it is doing is dangerous and can lead to unpredictable results
>> because a critical section resource access is no longer wrapped with a
>> critical section lock. Hence there is an error message to say Here Be
>> Dragons.
> hehe!
>
> What are you suggesting, then? Should we reset the GT after
> hitting the MCR lock issue?
No. I'm suggesting that you don't hide the issue by removing the error 
message. It is there for a reason. Just because that reason is being hit 
doesn't mean you should remove the message.

>
> We could, but I rather prefer to work with what's available.
>
>>> Because this most probably won't be noticed by the user, then I
>>> don't see why it should shout out loud that the system is
>>> unusable when most probably it is.
>> Just because a race condition is hard to hit doesn't mean it won't be hit.
> We are hitting it, constantly, but it's not producing any effect.
> Even when raising the MCR wait timeout. Which means that most
> probably someone is having a nap on the lock.
No. You are hitting the lock contention problem constantly and so far 
are not seeing any *visible* effect.

The point is that there is still a potential race condition which you 
haven't hit yet which could lead to data corruption, page faults, 
crashes, etc. (because a TLB invalidation access went to the wrong 
target) or the CPU/GPU melting itself of the board (because a power 
management access went to the wrong target).

>
>> The point of shouting out loud is that we know for a fact a problem has
>> occurred. That problem might lead to nothing or it might lead to subtle data
>> corruption, gross crashes or anything in between.
> yes, agree... the message still stays, though, with a bit of a
> lower catastrophy.
>
>>> BTW, at my understanding this is not an IFWI problem. We checked
>>> with different version of IFWI and the issue is the same.
>> Which implies it is a driver bug after all? In which case you absolutely
>> should not be papering over it in the driver.
> This section is serialized by a spinlock and besides I haven't
> found any place else except for the TLB invalidation and the
> resume where we can incur a locking situation.
There is a bug somewhere. Either it is IFWI or it is KMD. You can't say 
"I can't find a problem therefore there is no problem".

John.

>
> Thanks a lot for your inputs and discussion!
> Andi