[Intel-gfx] [PATCH] drm/i915/display: Reset message bus after each read/write operation

Fri Oct 6 11:57:05 UTC 2023

Quoting Kahola, Mika (2023-10-06 03:49:15-03:00)
>> -----Original Message-----
>> From: Vivi, Rodrigo <rodrigo.vivi at intel.com>
>> Sent: Thursday, October 5, 2023 7:10 PM
>> To: Sousa, Gustavo <gustavo.sousa at intel.com>
>> Cc: Kahola, Mika <mika.kahola at intel.com>; intel-gfx at lists.freedesktop.org
>> Subject: Re: [Intel-gfx] [PATCH] drm/i915/display: Reset message bus after each read/write operation
>> 
>> On Thu, Oct 05, 2023 at 12:40:35PM -0300, Gustavo Sousa wrote:
>> > Quoting Rodrigo Vivi (2023-10-05 12:13:34-03:00)
>> > >On Thu, Oct 05, 2023 at 03:05:31AM -0400, Kahola, Mika wrote:
>> > >> > -----Original Message-----
>> > >> > From: Vivi, Rodrigo <rodrigo.vivi at intel.com>
>> > >> > Sent: Wednesday, October 4, 2023 3:56 PM
>> > >> > To: Kahola, Mika <mika.kahola at intel.com>
>> > >> > Cc: intel-gfx at lists.freedesktop.org
>> > >> > Subject: Re: [Intel-gfx] [PATCH] drm/i915/display: Reset message
>> > >> > bus after each read/write operation
>> > >> >
>> > >> > On Wed, Oct 04, 2023 at 01:25:04PM +0300, Mika Kahola wrote:
>> > >> > > Every now and then we receive the following errors when running,
>> > >> > > for example, the IGT test kms_flip.
>> > >> > >
>> > >> > > [drm] *ERROR* PHY G Read 0d80 failed after 3 retries.
>> > >> > > [drm] *ERROR* PHY G Write 0d81 failed after 3 retries.
>> > >> > >
>> > >> > > Since the error is sporadic in nature, the patch proposes to
>> > >> > > reset the message bus after every read or write operation,
>> > >> > > whether or not it succeeded. However, testing revealed that this
>> > >> > > alone is not sufficient; an additional delay of 200us to 300us
>> > >> > > is also introduced. This delay is an experimental value and has
>> > >> > > no specification to back it up.
>> > >> >
>> > >> > have you tried the delays without the bus_reset?
>> > >> Yes, we have bumped up the delay, first from 0x100 to 0x200 and
>> > >> then, as per the BSpec change, to 0xa000, and I have also tried 0xf000.
>> > >> Increasing the timeout reduces the frequency of this error but doesn't solve the issue.
>> > >
>> > >what exactly is this BSpec's 0xa000? where can I see it? So maybe you
>> > >can update the message above removing the 'no specification to back it up'.
>> >
>> > (Resending this because I got a delivery failure notification)
>> >
>> > I think we are confusing "delay" with the "timeout parameter" of the msgbus.
>> >
>> > The PHY has a register to control the timeout parameter of msgbus
>> > transactions (BSpec 65156). Its default value is 0x100. With commit
>> > e028d7a4235d
>> > ("drm/i915/cx0: Check and increase msgbus timeout threshold"), we had
>> > integrated a workaround that bumped the timeout value to 0x200 in case
>> > timeouts were observed. Later on, there was a BSpec update with the
>> > formal timeout value to be programmed to 0xa000, which was
>> > incorporated with commit e35628968032
>> > ("drm/i915/cx0: Add step for programming msgbus timer").
>> >
>> > I *believe* what Rodrigo asked about was the usleep_range() calls
>> > added with this patch: whether we tried keeping only the usleep_range() without the bus reset.
>> 
>> yes, that was my original question.
>
>I have no good explanation why usleep_range() is needed. Without it, the kms_flip test eventually
>throws these read/write failures. As these are a bit sporadic in nature, it takes some time to catch
>these errors.

I think the question is whether the bus reset is really necessary. Maybe only
the usleep_range() hack would be "enough" to mitigate the issue?
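
To make that comparison concrete, here is a minimal userspace sketch of the
retry policy under discussion. It is not i915 code: retry_policy, retry_stats,
xfer_with_retries() and the fake PHY are hypothetical names made up for
illustration, and the bus reset and the sleep are only counted, not performed.
It merely shows the two knobs (reset after every attempt, delay after a failed
attempt) being toggled independently, so that a delay-only run can be compared
against the patch as posted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy knobs; these do not exist in i915. */
struct retry_policy {
	bool reset_after_each_try;	/* models intel_cx0_bus_reset() after every attempt */
	unsigned int delay_us;		/* models usleep_range() after a failed attempt, 0 = none */
};

/* Bookkeeping so the effect of a policy can be inspected afterwards. */
struct retry_stats {
	int tries;
	int resets;
	unsigned int slept_us;
};

/* One transaction attempt: returns >= 0 on success, negative on failure. */
typedef int (*xfer_once_fn)(void *ctx);

static int xfer_with_retries(xfer_once_fn once, void *ctx,
			     const struct retry_policy *policy,
			     int max_tries, struct retry_stats *stats)
{
	int status = -1;
	int i;

	assert(max_tries > 0);

	for (i = 0; i < max_tries; i++) {
		status = once(ctx);
		stats->tries++;

		/* A real driver would reset the message bus here. */
		if (policy->reset_after_each_try)
			stats->resets++;

		if (status >= 0)
			return status;

		/* A real driver would sleep here before the next attempt. */
		stats->slept_us += policy->delay_us;
	}

	return status;
}

/* Fake PHY: fails the first fail_first attempts, then returns value. */
struct fake_phy {
	int fail_first;
	int value;
};

static int fake_read_once(void *ctx)
{
	struct fake_phy *phy = ctx;

	if (phy->fail_first > 0) {
		phy->fail_first--;
		return -1;
	}

	return phy->value;
}
```

With a fake PHY that fails twice, a delay-only policy
({ .reset_after_each_try = false, .delay_us = 200 }) and three tries returns
the value on the third attempt with no resets counted. Whether the real
hardware behaves the same way is exactly what would need testing.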

--
Gustavo Sousa

>
>The patch is a hack, and my idea was to put the message bus into its reset state after each read/write operation.
>Unfortunately, this alone is not enough to pass kms_flip without these dmesg errors on read/write.
>However, the kms_flip test itself, which triggers these, passes without issues.
>  
>And I forgot to mention that these errors show up (at least more frequently) when 2x 4k monitors are
>connected. They may not be visible with only one monitor connected. I haven't tested such a system
>that much.
>
>-Mika-
>
>> 
>> >
>> > --
>> > Gustavo Sousa
>> >
>> > >
>> > >Oh, and my english is bad, but it looks to me that 'empirical' might
>> > >sound better than 'experimental' for this case, since you really did
>> > >a lot of experiments before coming to this final conclusion.
>> > >
>> > >>
>> > >> > have you talked to hw architects about this?
>> > >> Yes, HW guys requested traces which I provided but based on these
>> > >> the sequence we use in i915 is correct.
>> > >>
>> > >> >
>> > >> > I wonder if we should add the delay inside the bus_reset itself?
>> > >> > although the bit 15 clear check should be enough by itself and it
>> > >> > doesn't look like it is a hw/fw reset involved to justify the extra delay.
>> > >> That should be enough. To me, it looks like when we read/write to
>> > >> the bus too fast, the hw cannot handle that and we need to reset and let things settle down before trying again.
>> > >>
>> > >> >
>> > >> > well, at least some /* FIXME: */ or /* XXX: */ comments is
>> > >> > desired along with the messages if we are going with this hack without understanding why...
>> > >> True, I will add these to the patch.
>> > >>
>> > >> Thanks for review!
>> > >>
>> > >> -Mika-
>> > >> >
>> > >> > >
>> > >> > > Signed-off-by: Mika Kahola <mika.kahola at intel.com>
>> > >> > > ---
>> > >> > >  drivers/gpu/drm/i915/display/intel_cx0_phy.c | 6 ++++++
>> > >> > >  1 file changed, 6 insertions(+)
>> > >> > >
>> > >> > > diff --git a/drivers/gpu/drm/i915/display/intel_cx0_phy.c b/drivers/gpu/drm/i915/display/intel_cx0_phy.c
>> > >> > > index abd607b564f1..a71b8a29d6b0 100644
>> > >> > > --- a/drivers/gpu/drm/i915/display/intel_cx0_phy.c
>> > >> > > +++ b/drivers/gpu/drm/i915/display/intel_cx0_phy.c
>> > >> > > @@ -220,9 +220,12 @@ static u8 __intel_cx0_read(struct drm_i915_private *i915, enum port port,
>> > >> > >          /* 3 tries is assumed to be enough to read successfully */
>> > >> > >          for (i = 0; i < 3; i++) {
>> > >> > >                  status = __intel_cx0_read_once(i915, port, lane, addr);
>> > >> > > +                intel_cx0_bus_reset(i915, port, lane);
>> > >> > >
>> > >> > >                  if (status >= 0)
>> > >> > >                          return status;
>> > >> > > +
>> > >> > > +                usleep_range(200, 300);
>> > >> > >          }
>> > >> > >
>> > >> > >          drm_err_once(&i915->drm, "PHY %c Read %04x failed after %d retries.\n",
>> > >> > > @@ -299,9 +302,12 @@ static void __intel_cx0_write(struct drm_i915_private *i915, enum port port,
>> > >> > >          /* 3 tries is assumed to be enough to write successfully */
>> > >> > >          for (i = 0; i < 3; i++) {
>> > >> > >                  status = __intel_cx0_write_once(i915, port, lane, addr, data, committed);
>> > >> > > +                intel_cx0_bus_reset(i915, port, lane);
>> > >> > >
>> > >> > >                  if (status == 0)
>> > >> > >                          return;
>> > >> > > +
>> > >> > > +                usleep_range(200, 300);
>> > >> > >          }
>> > >> > >
>> > >> > >          drm_err_once(&i915->drm,
>> > >> > > --
>> > >> > > 2.34.1
>> > >> > >