[Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu May 26 10:29:49 UTC 2022


On 25/05/2022 19:14, Lucas De Marchi wrote:
> On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>>
>> On 24/05/2022 18:51, Matt Roper wrote:
>>> On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>
>>>> Catch and log any garbage in the register, including no tiles 
>>>> marked, or
>>>> multiple tiles marked.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>> ---
>>>> We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 
>>>> 0xF9D2C008)
>>>> during glmark and more badness. So I thought lets log all possible 
>>>> failure
>>>> modes from here and also use per device logging.
>>>> ---
>>>>  drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>>  drivers/gpu/drm/i915/i915_reg.h |  1 +
>>>>  2 files changed, 23 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c 
>>>> b/drivers/gpu/drm/i915/i915_irq.c
>>>> index 73cebc6aa650..79853d3fc1ed 100644
>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>> @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, 
>>>> void *arg)
>>>>      u32 gu_misc_iir;
>>>>      if (!intel_irqs_enabled(i915))
>>>> -        return IRQ_NONE;
>>>> +        goto none;
>>>>      master_tile_ctl = dg1_master_intr_disable(regs);
>>>> -    if (!master_tile_ctl) {
>>>> -        dg1_master_intr_enable(regs);
>>>> -        return IRQ_NONE;
>>>> +    if (!master_tile_ctl)
>>>> +        goto enable_none;
>>>> +
>>>> +    if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>> +        drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>> +             master_tile_ctl);
>>>
>>> I know we have a bunch of them already, but shouldn't we be avoiding
>>> printk-based stuff like this inside interrupt handlers?  Should we be
>>> migrating all these error messages over to trace_printk or something
>>> similar that's safer to use?
>>
>> Not sure - I kind of think some really unexpected and worrying 
>> situations should be loud and on by default. Risk is then spam if not 
>> ratelimited. Maybe we should instead ratelimit most errors/warnings 
>> coming for irq handlers?
>>
>> In this particular case at least DRM_ERROR with no device info is the 
>> odd one out in the entire file so I'd suggest changing at least that, 
>> if the rest of my changes is of questionable benefit.
> 
> I'd rather remove the printk's from irq rather than adding more. At the 
> very
> least, they should be the _once variant or ratelimited. One of the few
> cases to even deserve a unlikely(), even to document this shouldn't
> really be happening.

I would support ratelimited for all the unexpected bits set, no bits 
set, or similar conditions. I wouldn't remove such printks to 
micro-optimize things. That would potentially lose important feedback in 
cases when we hit truly unexpected situations.

But annotating them as unlikely would be a good thing.

> Our irq handlers (particularly on dgfx and multi-gt) are already too
> long running... I don't like making them any onger or slower.

How come? I mean which interrupts are a problem there? Isn't GuC 
supposed to be taking on that load on itself, isn't that one of the main 
selling points?

Regards,

Tvrtko


More information about the Intel-gfx mailing list