[PATCH] drm/xe: Log unreliable MMIO reads during forcewake
Lin, Shuicheng
shuicheng.lin at intel.com
Mon Oct 14 21:08:09 UTC 2024
> -----Original Message-----
> From: Brost, Matthew <matthew.brost at intel.com>
> Sent: Friday, October 11, 2024 10:09 PM
> To: Lin, Shuicheng <shuicheng.lin at intel.com>
> Cc: intel-xe at lists.freedesktop.org; Roper, Matthew D
> <matthew.d.roper at intel.com>; Vivi, Rodrigo <rodrigo.vivi at intel.com>; Zuo,
> Alex <alex.zuo at intel.com>
> Subject: Re: [PATCH] drm/xe: Log unreliable MMIO reads during forcewake
>
> On Sat, Oct 12, 2024 at 03:34:45AM +0000, Shuicheng Lin wrote:
> > In some cases, when the driver attempts to read an MMIO register, the
> > hardware may return 0xFFFFFFFF. The current force wake path code
> > treats this as a valid response, as it only checks the BIT.
> > However, 0xFFFFFFFF should be considered an invalid value, indicating
> > a potential issue. To address this, we should add a log entry to
> > highlight this condition.
> >
> > Suggested-by: Alex Zuo <alex.zuo at intel.com>
> > Signed-off-by: Shuicheng Lin <shuicheng.lin at intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_force_wake.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_force_wake.c
> > b/drivers/gpu/drm/xe/xe_force_wake.c
> > index a64c14757c84..46f36d05293a 100644
> > --- a/drivers/gpu/drm/xe/xe_force_wake.c
> > +++ b/drivers/gpu/drm/xe/xe_force_wake.c
> > @@ -114,6 +114,10 @@ static int __domain_wait(struct xe_gt *gt, struct
> xe_force_wake_domain *domain,
> > ret = xe_mmio_wait32(&gt->mmio, domain->reg_ack, domain->val,
> wake ? domain->val : 0,
> > XE_FORCE_WAKE_ACK_TIMEOUT_MS *
> USEC_PER_MSEC,
> > &value, true);
> > + if (value == ~0)
> > + xe_gt_notice(gt,
> > + "Force wake domain %d: %s. MMIO unreliable
> (forcewake register returns 0xFFFFFFFF)!\n",
> > + domain->id, str_wake_sleep(wake));
>
> Set the ret value (-EIO) to kick the error to upper layers?
>
> > if (ret)
>
> Then...
>
> if (ret)
> ...
> else if (value == ~0)
> ...
> ret = -EIO;
>
> Matt
In my test system, the hardware appears to auto-recover after returning 0xFFFFFFFF, which makes me hesitant to classify this as a failure.
Currently, it's logged as an informational message to highlight a potential issue.
If we classify this as a failure, should we also update it to an error log? Thanks!
Also, (ret) and (value == ~0) may both be true at the same time, so the code would look like this:
	if (ret)
		...
	if (value == ~0) {
		...
		ret = -EIO;
	}
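
To make the two cases concrete, here is a minimal standalone sketch, not driver code. The helpers check_ack() and domain_wait_model() are hypothetical names made up for illustration, and the sketch assumes the ack wait succeeds when (value & mask) == expected. Under that assumption, an all-ones read passes the wake check but fails the sleep check, so both conditions can hold together.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the ack wait: assumed to succeed when the
 * masked register value equals the expected value, else time out.
 */
static int check_ack(uint32_t value, uint32_t mask, uint32_t expected)
{
	return ((value & mask) == expected) ? 0 : -ETIMEDOUT;
}

/*
 * Hypothetical model of the proposed handling: two independent checks,
 * so an all-ones read is reported and turned into -EIO whether or not
 * the ack check itself failed.
 */
static int domain_wait_model(uint32_t value, uint32_t mask, bool wake)
{
	int ret = check_ack(value, mask, wake ? mask : 0);

	if (ret)
		fprintf(stderr, "failed to ack %s (%d) value = %#x\n",
			wake ? "wake" : "sleep", ret, (unsigned int)value);

	if (value == UINT32_MAX) {
		fprintf(stderr, "MMIO unreliable (register reads 0xFFFFFFFF)\n");
		ret = -EIO;
	}

	return ret;
}
```

With a mask of 0x1, domain_wait_model(0xFFFFFFFF, 0x1, true) passes the bit check yet returns -EIO, while domain_wait_model(0xFFFFFFFF, 0x1, false) hits both branches and also ends with -EIO.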
Shuicheng
>
> > xe_gt_notice(gt, "Force wake domain %d failed to ack %s
> (%pe) reg[%#x] = %#x\n",
> > domain->id, str_wake_sleep(wake), ERR_PTR(ret),
> > --
> > 2.25.1
> >