[Intel-xe] [PATCH] Revert "drm/xe: Allocate regset space for Wa_1607983814 unconditionally"

Matt Roper matthew.d.roper at intel.com
Wed Mar 22 18:15:57 UTC 2023


On Wed, Mar 22, 2023 at 10:51:04AM -0700, Souza, Jose wrote:
> On Wed, 2023-03-22 at 16:57 +0000, Matthew Auld wrote:
> > This reverts commit d51afeaae0479c964e720f3855834a82914112f6.
> > 
> > See if this helps the ADL-P in CI.
> > 
> > Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_guc_ads.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ads.c b/drivers/gpu/drm/xe/xe_guc_ads.c
> > index 304a9501b447..fd9911ffeae4 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ads.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ads.c
> > @@ -224,7 +224,10 @@ static size_t calculate_regset_size(struct xe_gt *gt)
> >  		xa_for_each(&hwe->reg_sr.xa, sr_idx, sr_entry)
> >  			count++;
> >  
> > -	count += (ADS_REGSET_EXTRA_MAX + LNCFCMOCS_REG_COUNT) * XE_NUM_HW_ENGINES;
> > +	count += ADS_REGSET_EXTRA_MAX * XE_NUM_HW_ENGINES;
> > +
> > +	if (needs_wa_1607983814(gt_to_xe(gt)))
> > +		count += LNCFCMOCS_REG_COUNT;
> 
> needs_wa_1607983814() also applies to ADL-P, so it should not flip adlp-p results.
> 
> I guess the problem is the calculation:
> (ADS_REGSET_EXTRA_MAX * XE_NUM_HW_ENGINES) + LNCFCMOCS_REG_COUNT != (ADS_REGSET_EXTRA_MAX + LNCFCMOCS_REG_COUNT) * XE_NUM_HW_ENGINES.
> 
> Did not checked what is correct for the workaround implementation.

This:

   (ADS_REGSET_EXTRA_MAX * XE_NUM_HW_ENGINES) + LNCFCMOCS_REG_COUNT

is a proper regset allocation for all the registers that we need the GuC
to save/restore today (it's actually still a slightly larger allocation
than we technically need).

However we found that using this more appropriate regset allocation
(rather than grossly overallocating it as we used to), increases the
frequency with which the initial LRC submission to the hardware fails on
some platforms.  The cause of the LRC submission failure is still
unknown; it's not actually related to regset at all as far as we can
see, but this change happens to change timing or cache access or
something in a way that makes the pre-existing problem much easier to
hit on certain platforms.

We're still searching for the real, underlying problem.  But in the
meantime we backed out the regset size change so that there'd be less
impact to other Xe development in the meantime.  So the revert here is
technically a "correct" change for us to make and will be what we want
in the end, but for now there's a good chance that it will increase the
rate of driver load failure on some platforms.


Matt

> 
> >  
> >  	return count * sizeof(struct guc_mmio_reg);
> >  }
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation


More information about the Intel-xe mailing list