[Intel-gfx] [PATCH] drm/i915/dg2: Catch and log more unexpected values in DG1_MSTR_TILE_INTR

Lucas De Marchi lucas.demarchi at intel.com
Wed May 25 18:14:17 UTC 2022


On Wed, May 25, 2022 at 05:03:13PM +0100, Tvrtko Ursulin wrote:
>
>On 24/05/2022 18:51, Matt Roper wrote:
>>On Tue, May 24, 2022 at 10:43:39AM +0100, Tvrtko Ursulin wrote:
>>>From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>
>>>Catch and log any garbage in the register, including no tiles marked, or
>>>multiple tiles marked.
>>>
>>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>Cc: Matt Roper <matthew.d.roper at intel.com>
>>>---
>>>We caught garbage in DG1_MSTR_TILE_INTR with DG2 (actual value 0xF9D2C008)
>>>during glmark and more badness. So I thought lets log all possible failure
>>>modes from here and also use per device logging.
>>>---
>>>  drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++-----------
>>>  drivers/gpu/drm/i915/i915_reg.h |  1 +
>>>  2 files changed, 23 insertions(+), 11 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>index 73cebc6aa650..79853d3fc1ed 100644
>>>--- a/drivers/gpu/drm/i915/i915_irq.c
>>>+++ b/drivers/gpu/drm/i915/i915_irq.c
>>>@@ -2778,24 +2778,30 @@ static irqreturn_t dg1_irq_handler(int irq, void *arg)
>>>  	u32 gu_misc_iir;
>>>  	if (!intel_irqs_enabled(i915))
>>>-		return IRQ_NONE;
>>>+		goto none;
>>>  	master_tile_ctl = dg1_master_intr_disable(regs);
>>>-	if (!master_tile_ctl) {
>>>-		dg1_master_intr_enable(regs);
>>>-		return IRQ_NONE;
>>>+	if (!master_tile_ctl)
>>>+		goto enable_none;
>>>+
>>>+	if (master_tile_ctl & ~(DG1_MSTR_IRQ | DG1_MSTR_TILE_MASK)) {
>>>+		drm_warn(&i915->drm, "Garbage in master_tile_ctl: 0x%08x!\n",
>>>+			 master_tile_ctl);
>>
>>I know we have a bunch of them already, but shouldn't we be avoiding
>>printk-based stuff like this inside interrupt handlers?  Should we be
>>migrating all these error messages over to trace_printk or something
>>similar that's safer to use?
>
>Not sure - I kind of think some really unexpected and worrying 
>situations should be loud and on by default. Risk is then spam if not 
>ratelimited. Maybe we should instead ratelimit most errors/warnings 
>coming for irq handlers?
>
>In this particular case at least DRM_ERROR with no device info is the 
>odd one out in the entire file so I'd suggest changing at least that, 
>if the rest of my changes is of questionable benefit.

I'd rather remove the printk's from irq rather than adding more. At the very
least, they should be the _once variant or ratelimited. One of the few
cases to even deserve a unlikely(), even to document this shouldn't
really be happening.

Our irq handlers (particularly on dgfx and multi-gt) are already too
long running... I don't like making them any onger or slower.


Lucas De Marchi


More information about the Intel-gfx mailing list