[PATCH] drm/xe: Log unreliable MMIO reads during forcewake

Mon Oct 14 21:08:09 UTC 2024

> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Friday, October 11, 2024 10:09 PM
> To: Lin, Shuicheng <shuicheng.lin at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Roper, Matthew D
> <matthew.d.roper at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>; Zuo,
> Alex <alex.zuo at intel.com>
> Subject: Re: [PATCH] drm/xe: Log unreliable MMIO reads during forcewake
> 
> On Sat, Oct 12, 2024 at 03:34:45AM +0000, Shuicheng Lin wrote:
> > In some cases, when the driver attempts to read an MMIO register, the
> > hardware may return 0xFFFFFFFF. The current force wake path code
> > treats this as a valid response, as it only checks the BIT.
> > However, 0xFFFFFFFF should be considered an invalid value, indicating
> > a potential issue. To address this, we should add a log entry to
> > highlight this condition.
> >
> > Suggested-by: Alex Zuo <alex.zuo at intel.com>
> > Signed-off-by: Shuicheng Lin <shuicheng.lin at intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_force_wake.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
> > b/drivers/gpu/drm/xe/xe_force_wake.c
> > index a64c14757c84..46f36d05293a 100644
> > --- a/drivers/gpu/drm/xe/xe_force_wake.c
> > +++ b/drivers/gpu/drm/xe/xe_force_wake.c
> > @@ -114,6 +114,10 @@ static int __domain_wait(struct xe_gt *gt, struct
> xe_force_wake_domain *domain,
> >  	ret = xe_mmio_wait32(&gt->mmio, domain->reg_ack, domain->val,
> wake ? domain->val : 0,
> >  			     XE_FORCE_WAKE_ACK_TIMEOUT_MS *
> USEC_PER_MSEC,
> >  			     &value, true);
> > +	if (value == ~0)
> > +		xe_gt_notice(gt,
> > +			     "Force wake domain %d: %s. MMIO unreliable
> (forcewake register returns 0xFFFFFFFF)!\n",
> > +			     domain->id, str_wake_sleep(wake));
> 
> Set the ret value (-EIO) to kick the error to upper layers?
> 
> >  	if (ret)
> 
> Then...
> 
> if (ret)
> 	...
> else if (value == ~0)
> 	...
> 	ret = -EIO;
> 
> Matt

On my test system, the hardware appears to recover on its own after returning 0xFFFFFFFF, which makes me hesitant to classify this as a failure.
Currently it is logged at notice level, only to highlight a potential issue.
If we do classify it as a failure, should the message also be raised to an error-level log? Thanks!
Also, (ret) and (value == ~0) can both be true at the same time, so the code would look like this:
if (ret)
   ... 
if (value == ~0) {
   ...
   ret = -EIO;
} 
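
To make the ordering above concrete, here is a minimal user-space sketch, not the driver code. fw_check_ack() is a hypothetical stand-in for the tail of __domain_wait(), and fprintf() stands in for xe_gt_notice(). It assumes the all-ones check overrides whatever the wait returned, which is still an open question in this thread.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: wait_ret is what the register wait returned,
 * value is the last value read back from the ack register.
 */
static int fw_check_ack(int wait_ret, uint32_t value)
{
	int ret = wait_ret;

	/* Existing path: the wait itself failed (for example, timed out). */
	if (ret)
		fprintf(stderr, "failed to ack (%d), reg = %#x\n",
			ret, (unsigned int)value);

	/*
	 * Independent check, deliberately not an "else if": an all-ones
	 * read can show up together with a failed wait.
	 */
	if (value == UINT32_MAX) {
		fprintf(stderr, "MMIO unreliable (register reads 0xFFFFFFFF)\n");
		ret = -EIO;
	}

	return ret;
}
```

With this shape, a failed wait that also reads back all ones produces both messages and returns -EIO, while a failed wait with any other value keeps its original error code.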

Shuicheng

> 
> >  		xe_gt_notice(gt, "Force wake domain %d failed to ack %s
> (%pe) reg[%#x] = %#x\n",
> >  			     domain->id, str_wake_sleep(wake), ERR_PTR(ret),
> > --
> > 2.25.1
> >