[PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR

Tue Jun 7 09:20:51 UTC 2022

On 06/06/2022 16:21, Matt Roper wrote:
> On Mon, Jun 06, 2022 at 12:55:20PM +0100, Tvrtko Ursulin wrote:
>>
>> On 27/05/2022 19:42, Matt Roper wrote:
>>> On Thu, May 26, 2022 at 11:18:17AM +0100, Tvrtko Ursulin wrote:
>>>> On 25/05/2022 19:05, Matt Roper wrote:
>>>>> On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 24/05/2022 18:51, Matt Roper wrote:
>>>>>>> On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>>
>>>>>>>> Catch and log any garbage in the register, including no tiles marked, or
>>>>>>>> multiple tiles marked.
>>>>>>>>
>>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>>>>> ---
>>>>>>>> We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
>>>>>>>> during glmark and more badness. So I thought lets log all possible failure
>>>>>>>> modes from here and also use per device logging.
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>>>>>>      drivers/gpu/drm/i915/i915_reg.h |  1 +
>>>>>>>>      2 files changed, 23 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> index 73cebc6aa650..79853d3fc1ed 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>>>>>> @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>>>>>>      	u32 gu_misc_iir;
>>>>>>>>      	if (!intel_irqs_enabled(i915))
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +		goto none;
>>>>>>>>      	master_tile_ctl = dg1_master_intr_disable(regs);
>>>>>>>> -	if (!master_tile_ctl) {
>>>>>>>> -		dg1_master_intr_enable(regs);
>>>>>>>> -		return IRQ_NONE;
>>>>>>>> +	if (!master_tile_ctl)
>>>>>>>> +		goto enable_none;
>>>>>>>> +
>>>>>>>> +	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>>>>>> +		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>>>>>> +			 master_tile_ctl);
>>>>>>>
>>>>>>> I know we have a bunch of them already, but shouldn't we be avoiding
>>>>>>> printk-based stuff like this inside interrupt handlers?  Should we be
>>>>>>> migrating all these error messages over to trace_printk or something
>>>>>>> similar that's safer to use?
>>>>>>
>>>>>> Not sure - I kind of think some really unexpected and worrying situations
>>>>>> should be loud and on by default. Risk is then spam if not ratelimited.
>>>>>> Maybe we should instead ratelimit most errors/warnings coming for irq
>>>>>> handlers?
>>>>>
>>>>> It's not the risk of spam that's the problem, but rather that
>>>>> printk-based stuff eventually calls into the console code to flush its
>>>>> buffers.  That's way more overhead than you want in an interrupt handler
>>>>> so it's bad on its own, but if you're using something slow like a serial
>>>>> console, it becomes even more of a problem.
>>>>
>>>> Is it a problem for messages which we never expect to see?
>>>
>>> Kind of.  While not as catastrophic, it's the same argument for why we
>>> don't use BUG() anymore...when the impossible does manage to happen
>>> there's unnecessary collateral damage on things outside of graphics.  If
>>> we're adding huge delays inside an interrupt handler (while other
>>> interrupts are disabled) that impacts the system-wide usability, not
>>> just our own driver.
>>>
>>> I'd also argue that these messages actually are semi-expected.  Random
>>> bits being set shouldn't happen, but in the world of dgpu's, we do
>>> occasionally see cases where the PCI link itself goes down for reasons
>>> outside our control and then all registers read back as 0xFFFFFFFF,
>>> which will probably trigger error messages here (as well as a bunch of
>>> other places).
>>
>> Could you expand a bit on what is semi-expected and when? I mean the
>> circumstances of PCI link going down. We certainly don't have any code to
>> survive that.
> 
> Yeah, I'm referring to the "Lost access to MMIO BAR" errors; in the past
> most of them have ultimately been tracked down to bugs in early
> firmware, so flashing an updated IFWI/BIOS onto the device usually
> solved the problems.  Generally those buggy firmwares are an internal
> problem that never make it into the wild, but I think we have also seen
> cases where they get triggered by physical/electrical problems on a
> specific part; that can potentially happen to anyone who's unlucky
> enough to get a defective/damaged unit.
> 
> Basically "hardware returns all F's" happens because the CPU initiates
> an MMIO transaction with the hardware, the hardware fails to produce any
> response (possibly due to failing hardware, possibly due to
> firmware/BIOS bugs), so 0xFFFFFFFF gets returned as an autocompletion to
> prevent the CPU core from hanging.
> 
> It looks like we still have a few open here:
> https://gitlab.freedesktop.org/search?search=%22Lost+access+to+MMIO+BAR%22&group_id=2642&project_id=4519&scope=issues&search_code=false&snippets=false&repository_ref=&nav_source=navbar
> 
> and there are some features on specific platforms we haven't turned on
> yet because they also trigger these failures (which is still under
> debug).
> 
> We don't/can't really do much to handle these problems in i915 today
> except printing the 'lost access' error so that we know to ignore
> whatever kinds of bogus errors we get after that point (usually lots of
> messages about forcewake failing to clear, engine/GuC reset failing to
> complete, etc.).  But aside from i915 being broken, the rest of the
> platform should generally continue to work, so you can still access the
> machine over the network, save logs to disk, etc.

Interesting, I missed the addition of 29b6f88d60dd ("drm/i915: Try to 
detect sudden loss of MMIO access"), thanks!

In case of my "Garbage in master_tile_ctl" or "Unexpected irq from 
tile.." messages, in case of lost PCI link they should happen only once. 
I don't think hardware will keep raising interrupts if driver cannot 
talk to it. But it does seem prudent to go with the rate-limiting 
flavour just in case.

Regards,

Tvrtko