[Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR
Lucas De Marchi
lucas.demarchi at intel.com
Wed May 25 18:14:17 UTC 2022
On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>
>On 24/05/2022 18:51, Matt Roper wrote:
>>On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>
>>>Catch and log any garbage in the register, including no tiles marked, or
>>>multiple tiles marked.
>>>
>>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>Cc: Matt Roper <matthew.d.roper at intel.com>
>>>---
>>>We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
>>>during glmark and more badness. So I thought lets log all possible failure
>>>modes from here and also use per device logging.
>>>---
>>> drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>> drivers/gpu/drm/i915/i915_reg.h | 1 +
>>> 2 files changed, 23 insertions(+), 11 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>index 73cebc6aa650..79853d3fc1ed 100644
>>>--- a/drivers/gpu/drm/i915/i915_irq.c
>>>+++ b/drivers/gpu/drm/i915/i915_irq.c
>>>@@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>> u32 gu_misc_iir;
>>> if (!intel_irqs_enabled(i915))
>>>- return IRQ_NONE;
>>>+ goto none;
>>> master_tile_ctl = dg1_master_intr_disable(regs);
>>>- if (!master_tile_ctl) {
>>>- dg1_master_intr_enable(regs);
>>>- return IRQ_NONE;
>>>+ if (!master_tile_ctl)
>>>+ goto enable_none;
>>>+
>>>+ if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>+ drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>+ master_tile_ctl);
>>
>>I know we have a bunch of them already, but shouldn't we be avoiding
>>printk-based stuff like this inside interrupt handlers? Should we be
>>migrating all these error messages over to trace_printk or something
>>similar that's safer to use?
>
>Not sure - I kind of think some really unexpected and worrying
>situations should be loud and on by default. Risk is then spam if not
>ratelimited. Maybe we should instead ratelimit most errors/warnings
>coming for irq handlers?
>
>In this particular case at least DRM_ERROR with no device info is the
>odd one out in the entire file so I'd suggest changing at least that,
>if the rest of my changes is of questionable benefit.
I'd rather remove the printk's from irq rather than adding more. At the very
least, they should be the _once variant or ratelimited. One of the few
cases to even deserve a unlikely(), even to document this shouldn't
really be happening.
Our irq handlers (particularly on dgfx and multi-gt) are already too
long running... I don't like making them any onger or slower.
Lucas De Marchi
More information about the Intel-gfx
mailing list