[Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu May 26 10:29:49 UTC 2022
On 25/05/2022 19:14, Lucas De Marchi wrote:
> On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>>
>> On 24/05/2022 18:51, Matt Roper wrote:
>>> On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>
>>>> Catch and log any garbage in the register, including no tiles
>>>> marked, or
>>>> multiple tiles marked.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>> ---
>>>> We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value
>>>> 0xF9D2C008)
>>>> during glmark and more badness. So I thought lets log all possible
>>>> failure
>>>> modes from here and also use per device logging.
>>>> ---
>>>> drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>> drivers/gpu/drm/i915/i915_reg.h | 1 +
>>>> 2 files changed, 23 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c
>>>> b/drivers/gpu/drm/i915/i915_irq.c
>>>> index 73cebc6aa650..79853d3fc1ed 100644
>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>> @@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq,
>>>> void *arg)
>>>> u32 gu_misc_iir;
>>>> if (!intel_irqs_enabled(i915))
>>>> - return IRQ_NONE;
>>>> + goto none;
>>>> master_tile_ctl = dg1_master_intr_disable(regs);
>>>> - if (!master_tile_ctl) {
>>>> - dg1_master_intr_enable(regs);
>>>> - return IRQ_NONE;
>>>> + if (!master_tile_ctl)
>>>> + goto enable_none;
>>>> +
>>>> + if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>> + drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>> + master_tile_ctl);
>>>
>>> I know we have a bunch of them already, but shouldn't we be avoiding
>>> printk-based stuff like this inside interrupt handlers? Should we be
>>> migrating all these error messages over to trace_printk or something
>>> similar that's safer to use?
>>
>> Not sure - I kind of think some really unexpected and worrying
>> situations should be loud and on by default. Risk is then spam if not
>> ratelimited. Maybe we should instead ratelimit most errors/warnings
>> coming for irq handlers?
>>
>> In this particular case at least DRM_ERROR with no device info is the
>> odd one out in the entire file so I'd suggest changing at least that,
>> if the rest of my changes is of questionable benefit.
>
> I'd rather remove the printk's from irq rather than adding more. At the
> very
> least, they should be the _once variant or ratelimited. One of the few
> cases to even deserve a unlikely(), even to document this shouldn't
> really be happening.
I would support ratelimited for all the unexpected bits set, no bits
set, or similar conditions. I wouldn't remove such printks to
micro-optimize things. That would potentially lose important feedback in
cases when we hit truly unexpected situations.
But annotating them as unlikely would be a good thing.
> Our irq handlers (particularly on dgfx and multi-gt) are already too
> long running... I don't like making them any onger or slower.
How come? I mean which interrupts are a problem there? Isn't GuC
supposed to be taking on that load on itself, isn't that one of the main
selling points?
Regards,
Tvrtko
More information about the Intel-gfx
mailing list