[Intel-gfx] [RFC PATCH] drm/i915/gt: Do not treat MCR locking timeouts as errors

John Harrison john.c.harrison at intel.com
Wed Oct 4 19:50:02 UTC 2023


On 10/4/2023 12:35, Andi Shyti wrote:
> Hi John,
>
>>>>> The MCR steering semaphore is a shared lock entry between i915
>>>>> and various firmware components.
>>>>>
>>>>> Getting the lock might sinchronize on some shared resources.
>>>>> Sometimes though, it might happen that the firmware forgets to
>>>>> unlock causing unnecessary noise in the driver which keeps doing
>>>>> what was supposed to do, ignoring the problem.
>>>>>
>>>>> Do not consider this failure as an error, but just print a debug
>>>>> message stating that the MCR locking has been skipped.
>>>>>
>>>>> On the driver side we still have spinlocks that make sure that
>>>>> the access to the resources is serialized.
>>>>>
>>>>> Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>> Cc: Jonathan Cavitt <jonathan.cavitt at intel.com>
>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 6 ++----
>>>>>     1 file changed, 2 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>> index 326c2ed1d99b..51eb693df39b 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
>>>>> @@ -395,10 +395,8 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
>>>>>     	 * would indicate some hardware/firmware is misbehaving and not
>>>>>     	 * releasing it properly.
>>>>>     	 */
>>>>> -	if (err == -ETIMEDOUT) {
>>>>> -		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
>>>>> -		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
>>>>> -	}
>>>>> +	if (err == -ETIMEDOUT)
>>>>> +		gt_dbg(gt, "hardware MCR steering semaphore timed out");
>>>>>     }
>>>>>     /**
>>>> Are we sure this does not warrant a level higher than dbg, such as
>>>> notice/warn?
>>> We might make it a warn, but this doesn't change much the economy
>>> of the driver as we will keep doing what we were supposed to do.
>>>
>>>> Because how can we be sure the two entities will not stomp on
>>>> each other toes if we failed to obtain lock?
>>> So far, in all the research I've done, no one looks like using
>>> MCR lock, but yet someone is stuck in it.
>> If someone has the lock then that someone thinks they are using it. Just
>> because you can't see what someone piece of IFWI is doing doesn't mean it
>> isn't doing it. And if it is a genuinely missing unlock then it needs to be
>> tracked down and fixed with an IFWI update otherwise the system is going to
>> be unstable from that point on.
> But I'm not changing here the behavior of the driver. The driver
> will keep doing what was doing before.
And what it is doing is dangerous and can lead to unpredictable results 
because a critical section resource access is no longer wrapped with a 
critical section lock. Hence there is an error message to say Here Be 
Dragons.


>
> Because this most probably won't be noticed by the user, then I
> don't see why it should shout out loud that the system is
> unusable when most probably it is.
Just because a race condition is hard to hit doesn't mean it won't be hit.

The point of shouting out loud is that we know for a fact a problem has 
occurred. That problem might lead to nothing or it might lead to subtle 
data corruption, gross crashes or anything in between.

>
> BTW, at my understanding this is not an IFWI problem. We checked
> with different version of IFWI and the issue is the same.
Which implies it is a driver bug after all? In which case you absolutely 
should not be papering over it in the driver.

>
> Besides we received reports also from systems that are not using
> IFWI at all.
There is no system that does not use IFWI. Integrated or discrete, all 
systems have an IFWI. It's just part of the main BIOS on an integrated 
platform.

John.

>
>>>> (How can we be sure about
>>>> "forgot to unlock" vs "in prolonged active use"?
>>> There is a patch from Jonathan that is testing a different
>>> timeout.
>>>
>>>> Or if we can be sure, can
>>>> we force unlock and take the lock for the driver explicitly?)
>>> I sent a patch with this solution and Matt turned it down.
>> Presumably because both forcing a lock and ignoring a failed lock are Bad
>> Things to be doing!
>> Just because some entity out of our control isn't playing friendly doesn't
>> mean we can ignore the problem. The lock is there for a reason. If someone
>> else owns the lock then any steered access by i915 is potentially going to
>> be routed to the wrong register as the other entity re-directs the steering
>> behind out back. That is why there is an error message being printed.
>> Because things are quite possibly going to fail in some unknown manner.
> Yes, agree. This has been already discussed here[*] where I sent
> such RFC for the sole purpose of receiving some opinions and
> check how CI would behave.
>
> BTW, we are already doing this during the GT resume[**]... it's a
> bit of a different context, but it still forces the release of
> the lock.
>
> This patch, anyway, is not doing this.
>
> Thanks a lot,
> Andi
>
> [*] https://patchwork.freedesktop.org/series/124402/
> [**] 37280ef5c1c4 ("drm/i915: Clean steer semaphore on resume")



More information about the dri-devel mailing list