[Intel-gfx] [PATCH] drm/i915: Fix skl_pcode_try_request function

Govindapillai, Vinod vinod.govindapillai at intel.com
Wed Apr 6 10:15:36 UTC 2022


On Wed, 2022-04-06 at 12:58 +0300, Lisovskiy, Stanislav wrote:
> On Wed, Apr 06, 2022 at 12:19:19PM +0300, Govindapillai, Vinod wrote:
> > Hi,
> > 
> > On Wed, 2022-04-06 at 10:48 +0300, Lisovskiy, Stanislav wrote:
> > > On Wed, Apr 06, 2022 at 12:51:29AM +0300, Govindapillai, Vinod wrote:
> > > > Hi Stan
> > > > 
> > > > Nice Find! Couple of clarifications, please check inline...
> > > > 
> > > > On Tue, 2022-04-05 at 13:41 +0300, Stanislav Lisovskiy wrote:
> > > > > Currently skl_pcode_try_request function doesn't
> > > > > properly handle return value it gets from
> > > > > snb_pcode_rw, but treats status != 0 as success,
> > > > > returning true, which basically doesn't allow
> > > > > to use retry/timeout mechanisms if PCode happens
> > > > > to be busy and returns EGAIN or some other status
> > > > > code not equal to 0.
> > > > > 
> > > > > We saw this on real hw and also tried simulating this
> > > > > by always returning -EAGAIN from snb_pcode_rw for 6 times, which
> > > > > currently will just result in false success, while it should
> > > > > have tried until timeout is reached:
> > > > > 
> > > > > [   22.357729] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > > [   22.357831] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > > [   22.357892] i915 0000:00:02.0: [drm:skl_pcode_request [i915]] Success, exiting
> > > > > [   22.357936] i915 0000:00:02.0: [drm] ERROR Failed to inform PCU about cdclk change (err
> > > > > -11,
> > > > > freq 307200)
> > > > > 
> > > > > We see en error because higher level api, still notices that status was wrong,
> > > > > however we still did try only once.
> > > > > 
> > > > > We fix it by requiring _both_ the status to be 0 and
> > > > > request/reply match for success(true) and function
> > > > > should return failure(false) if either status turns
> > > > > out to be EAGAIN, EBUSY or whatever or reply/request
> > > > > masks do not match.
> > > > > 
> > > > > So now we see this in the logs:
> > > > > 
> > > > > [   22.318667] i915 0000:00:02.0: [drm:intel_cdclk_dump_config [i915]] Changing CDCLK to
> > > > > 307200 kHz, VCO 614400 kHz, ref 38400 kHz, bypass 19200 kHz, voltage level 0
> > > > > [   22.318782] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 1
> > > > > [   22.318849] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 2
> > > > > [   22.319006] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 3
> > > > > [   22.319091] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 4
> > > > > [   22.319158] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 5
> > > > > [   22.319224] i915 0000:00:02.0: [drm:__snb_pcode_rw [i915]] Returning EAGAIN retry 6
> > > > > 
> > > > > Signed-off-by: Stanislav Lisovskiy <stanislav.lisovskiy at intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/i915/intel_pcode.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/intel_pcode.c b/drivers/gpu/drm/i915/intel_pcode.c
> > > > > index 391a37492ce5..fb6c43e8a02f 100644
> > > > > --- a/drivers/gpu/drm/i915/intel_pcode.c
> > > > > +++ b/drivers/gpu/drm/i915/intel_pcode.c
> > > > > @@ -136,7 +136,7 @@ static bool skl_pcode_try_request(struct drm_i915_private *i915, u32
> > > > > mbox,
> > > > >  {
> > > > >       *status = __snb_pcode_rw(i915, mbox, &request, NULL, 500, 0, true);
> > > > > 
> > > > > -     return *status || ((request & reply_mask) == reply);
> > > > > +     return (*status == 0) && ((request & reply_mask) == reply);
> > > > 
> > > > Here I wonder whether you need to check what sort of __snb_pcode_rw return values need the
> > > > retry!
> > > > Isn't only ETIMEDOUT need the retry? Other return error codes can be probably be ignored
> > > > from
> > > > retry?
> > > 
> > > Hi Vinod! Thanks for comments.
> > > 
> > > Well theoretically yes, but in practice I think we would prefer to retry in almost all of the
> > > cases.
> > > There are also multiple error codes when you need to retry, such as "EAGAIN", "EBUSY" and
> > > probably
> > > some others. Thats is probably why original code also doesn't make a difference.
> > 
> > Ack
> > 
> > > > And should the, "return ret ? ret : status;" in "skl_pcode_request" be change to "return ret
> > > > ?
> > > > status : ret;" to reflect the correct error code to calling functions after this change?
> > > 
> > > I think the logic here is such that ret value is somewhat more important than the status, so
> > > if ret is something not zero - we always prefer returning ret, for the enduser to know what
> > > was the ret ERROR code.
> > > If ret is 0, then we can "afford" to let the enduser know, what was actually the status.
> > > 
> > > To me it actually sounds a bit wrong, I think we should get status pointer, like "&status"
> > > and modify it, so that calling site _always_ knows both status and ret, mixing those two
> > > is a dangerous strategy which exactly caused some coder confusion and probably the bug, that
> > > this patch is fixing.
> > > 
> > > Stan
> > 
> > Agreed, the original code was indeed bit complicated!
> > 
> > But what I meant was, after your patch, "ret" will be either "0" or "ETIMEDOUT".
> > If ret = 0, then "status" would had been 0 too based on your change in this patch.
> > If ret != 0, then "status" might have values other than ETIMEDOUT.
> > 
> > So,
> > "return ret ? status : ret;" might be better instead of the original "return ret ? ret :
> > status;"
> > especially after your patch.
> > 
> > Anyway, not sure if the calling functions care about the return value much other than just
> > logging.
> > It was indeed a quite good fix!
> 
> Yes, your proposal actually makes more sense than current code anyway, however I'm still thinking
> if we should may be propagate &status to call site always, 

Yes! my bad :) actually "return ret ? status : ret" is indeed "return status"!


> need to think a bit.
> Would be nice to get Ville's perspective on this.
> 
> > Reviewd-By: Vinod Govindapillai <vinod.govindapillai at intel.com>
> 
> Thanks,
> 
> Stan
> 
> > > > >  }
> > > > > 
> > > > >  /**


More information about the Intel-gfx mailing list