[Intel-gfx] [PATCH 11/27] drm/i915/selftests: Fix memory corruption in live_lrc_isolation

Matthew Brost matthew.brost at intel.com
Wed Aug 25 20:03:21 UTC 2021


On Tue, Aug 24, 2021 at 05:07:13PM -0700, Daniele Ceraolo Spurio wrote:
> 
> 
> On 8/18/2021 11:16 PM, Matthew Brost wrote:
> > GuC submission has exposed an existing memory corruption in
> > live_lrc_isolation. We believe that some writes to the watchdog offsets
> > in the LRC (0x178 & 0x17c) can result in trashing of portions of the
> > address space. With GuC submission there are additional objects which
> > can move the context redzone into the space that is trashed. To
> > workaround this avoid poisoning the watchdog.
> 
> This is kind of a worrying explanation, as it implies an HW issue. AFAICS we
> no longer increase the context size with GuC submission, so the redzone
> should be in the same place relative to the base address of the context;
> although it is true that we have more objects in memory due to support the
> GuC, hitting the redzone consistently feels too much like a coincidence.
> When we write the watchdog regs there is a risk we're triggering a watchdog
> interrupt, which will cause the GuC to handle that; on a media reset, the
> GuC overwrites the context with the golden context in the ADS, are we sure
> that's not what is causing this problem?
> Looking in the ADS we set the context memcpy size to:
> 
> real_size = intel_engine_context_size(gt, engine_class);
> 
> but then we only initialize real_size - SKIP_SIZE(gt->i915), which IMO could
> be the real cause of the bug as the GuC memcpy starts at SKIP_SIZE().
> 

Good analysis Daniele. This definitely seems to be the issue as the
below patch appears to have fixed the failing selftest:

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
index 9f5f43a16182..c19ce71c9de9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c
@@ -358,6 +358,11 @@ static int guc_prep_golden_context(struct intel_guc *guc,
        u8 engine_class, guc_class;
        struct guc_gt_system_info *info, local_info;

+       /* Skip execlist and PPGTT registers + HWSP */
+       const u32 lr_hw_context_size = 80 * sizeof(u32);
+       const u32 skip_size = LRC_PPHWSP_SZ * PAGE_SIZE +
+               lr_hw_context_size;
+
        /*
         * Reserve the memory for the golden contexts and point GuC at it but
         * leave it empty for now. The context data will be filled in later
@@ -396,7 +401,7 @@ static int guc_prep_golden_context(struct intel_guc *guc,
                if (!blob)
                        continue;

-               blob->ads.eng_state_size[guc_class] = real_size;
+               blob->ads.eng_state_size[guc_class] = real_size - skip_size;
                blob->ads.golden_context_lrca[guc_class] = addr_ggtt;
                addr_ggtt += alloc_size;
        }
@@ -476,7 +481,8 @@ static void guc_init_golden_context(struct intel_guc *guc)
                        continue;
                }

-               GEM_BUG_ON(blob->ads.eng_state_size[guc_class] != real_size);
+               GEM_BUG_ON(blob->ads.eng_state_size[guc_class] !=
+                          real_size - skip_size);
                GEM_BUG_ON(blob->ads.golden_context_lrca[guc_class] != addr_ggtt);
                addr_ggtt += alloc_size;

This being said, IMO this actually a bug in the GuC firmware as it
basically is doing:

memcpy(some_guc_dest, blob->ads.golden_context_lrca +
       guc_calculated_skip_size,
       blob->ads.eng_state_size);

IMO if the GuC is applying an internally calculated offset to
blob->ads.golden_context_lrca it should substract that calculated size
from blob->ads.eng_state_size.

e.g. the GuC should be doing:

memcpy(some_guc_dest, blob->ads.golden_context_lrca +
       guc_calculated_skip_size,
       blob->ads.eng_state_size - guc_calculated_skip_size);

We can bring this up with the GuC firmware team today, but in the
meantime I'll include the above patch in the respin of this series as a
workaround.

Matt 	

> Daniele
> 
> > 
> > v2:
> >   (Daniel Vetter)
> >    - Add VLK ref in code to workaround
> > 
> > Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/selftest_lrc.c | 29 +++++++++++++++++++++++++-
> >   1 file changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > index b0977a3b699b..cdc6ae48a1e1 100644
> > --- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
> > @@ -1074,6 +1074,32 @@ record_registers(struct intel_context *ce,
> >   	goto err_after;
> >   }
> > +static u32 safe_offset(u32 offset, u32 reg)
> > +{
> > +	/* XXX skip testing of watchdog - VLK-22772 */
> > +	if (offset == 0x178 || offset == 0x17c)
> > +		reg = 0;
> > +
> > +	return reg;
> > +}
> > +
> > +static int get_offset_mask(struct intel_engine_cs *engine)
> > +{
> > +	if (GRAPHICS_VER(engine->i915) < 12)
> > +		return 0xfff;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +	case RENDER_CLASS:
> > +		return 0x07ff;
> > +	case COPY_ENGINE_CLASS:
> > +		return 0x0fff;
> > +	case VIDEO_DECODE_CLASS:
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		return 0x3fff;
> > +	}
> > +}
> > +
> >   static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >   {
> >   	struct i915_vma *batch;
> > @@ -1117,7 +1143,8 @@ static struct i915_vma *load_context(struct intel_context *ce, u32 poison)
> >   		len = (len + 1) / 2;
> >   		*cs++ = MI_LOAD_REGISTER_IMM(len);
> >   		while (len--) {
> > -			*cs++ = hw[dw];
> > +			*cs++ = safe_offset(hw[dw] & get_offset_mask(ce->engine),
> > +					    hw[dw]);
> >   			*cs++ = poison;
> >   			dw += 2;
> >   		}
> 


More information about the dri-devel mailing list