[Intel-gfx] [RFC PATCH 1/2] drm/dp: Make number of AUX retries configurable by display drivers.

Sat Feb 13 02:01:16 UTC 2021

(adding danvet to here, as I'd imagine they might be interested in seeing some
of this)

Thank you for the descriptive write up. I think we can solve some of the
problems you described here, however the patches that you submitted definitely
won't work as-is. In patch 2, by reverting Intel back to using only 7 retries
you technically break whatever monitor originally prompted us to bump the retry
count up to 32 in the first place. And we have to try our hardest to avoid
breaking people's displays when they were already working.

Also, I'll definitely point out though that a few of the issues you've pointed
out actually exist as workarounds for bad DisplayPort devices (which is a big
reason we have these helpers), but I think we might be able to fix some of the
issues you've mentioned by coming up with better workarounds. More details below

On Thu, 2021-02-11 at 06:56 +0000, Almahallawy, Khaled wrote:
> On Wed, 2021-02-10 at 13:03 -0500, Lyude Paul wrote:
> > On Wed, 2021-02-10 at 00:33 -0800, Khaled Almahallawy wrote:
> > > The number of AUX retries specified in the DP specs is 7.
> > > Currently, to make
> > > Dell 4k monitors happier, the number of retries are 32.
> > > i915 also retries 5 times (intel_dp_aux_xfer) which means in the
> > > case of AUX
> > > timeout we actually retries 32 * 5 = 160 times.
> > 
> > Is there any good reason for i915 to actually be doing retries
> > itself? It seems
> > like an easier solution here might just to be to fix i915 so it
> > doesn't retry,
> > and just rely on DRM to do the retries as appropriate.
> > 
> > That being said though, I'm open to this if there is a legitimate
> > reason for
> > i915 to be handling retries on its own
> 
> i915 or others may benefit from controlling AUX retries to optimize and
> minimize timing spent on the aux transactions.
>  
> drm_dpcd_access handles multiple aux retries cases the same way (retry
> 32 times at worst case). The 4 cases are:
> 1- *AUX receive or IO error*. I believe it is up to specific
> implementation/HW once it detects a receive error to retry based on
> their internal understanding using the timeout appropriate to the HW
> capabilities.

This makes sense so far

>  
> 2- *AUX Timeout* As the driver follows the specs for the recommended
> timeout timer as defined in section  (2.11.2 AUX Transaction
> Response/Reply Timeouts), the driver has more intelligence to know how
> many retries it needs. This is more useful in case of DP-ALT if some
> issues happen in Type-C stack that cause AUX timeout (e.g. USB3 dock
> detected as high speed (USB2) causing SBU/AUX lines to be disabled).
> Also the disconnect of MST hub/docks is dependent how fast a userspace
> disconnect all MST connectors.  Not all user spaces do it as fast like
> Chrome (ubuntu is an example): 
> https://chromium-review.googlesource.com/c/chromium/src/+/2512550/  

So - I'm not exactly following how this portion of the specification is relevant
to the issue that you're bringing up here. If I understand section 2.11.2
correctly, a "response" here is defined as a transmission from the DPRX which
informs us on the current state of the transaction that we've started. This
would be any one of the aux response statuses you've mentioned in this email:
AUX_NACK, AUX_ACK, or AUX_DEFER. It doesn't actually have anything to do with
the number of retries we have to do, because (ignoring the fact we retry on
AUX_NACKS in DRM for a moment here) that reply could be an AUX_DEFER which would
indicate that we have to resend the transaction and try again. I think this is
the correct understanding of section 2.11.2, because I definitely don't think
real world DP devices are able to actually complete a full AUX transaction
within a timespan of 300-400µs reliably for the most part.

The second thing to mention here is that regardless of the first point, I don't
think your point about MST displays needing to be removed by userspace is
correct. It is true that userspace can actually hold onto references to removed
MST connectors after they've been removed, but the main purpose of this being
possible is in order to accommodate legacy clients that wouldn't be able to
handle a connector disappearing under them cleanly. Once the MST topology which
a connector corresponds to is removed, the MST connector is removed ASAP from
the kernel's cache of the respective DisplayPort's MST topology along with all
of the connectors below it. At the same time, the respective DRM connectors are
also unregistered from userspace.

DRM explicitly doesn't allow any kind of client to enable new displays on
connectors that are unregistered, and will reject any modesets involving them
with the exception of modesets which only disable displays on those unregistered
connectors (this last part is mainly just to be nice to legacy clients). So it
doesn't really matter how quickly userspace disables the display configuration
on those connectors, from the kernel's perspective they're already gone and a
new MST topology can be connected to the system. The expectation is that even if
userspace doesn't turn those connectors off, the only possible move is to move
the CRTCs on them to new connectors or disable them outright.

Anyway-if there -is- actually a problem caused by userspace not disabling
displays on those connectors, that's definitely not intentional and not how
things are supposed to work. But this part of the MST helpers in DRM is pretty
solid at this point, and most times when people point out this oddity with MST
it ends up being a bug on userspace's end. Feel free to prove me wrong though!

> 
> 3- *AUX_NACK* DP spec mentions retry 3 times for NACK(2.16.1.3 GTC Lock
> Acquisition). However, sometimes we really don’t need any retry for
> NACK, if DPRX replied AUX_NACK for DPCD that it doesn’t support (e.g.
> reading LTTPR Capability and ID Field at DPCD F0000h ~ F0007).

Ah, yes, -this-. Originally we did actually just abort any transaction on
AUX_NACK, which is honestly the correct thing that we should be doing. But it
looks like I actually changed this at one point after finding some DP devices
that would send AUX_NACK instead of AUX_DEFER when they weren't yet ready to
receive messages. Luckily it seems I wrote up a pretty long explanation around
this when I changed this behavior back in:

82922da39190 ("drm/dp_helper: Retry aux transactions on all errors")

This was almost five years ago though when I was quite new to working on DRM, so
reading through this commit I already think I have some much better ideas for
how we can handle issues with DisplayPort devices like this. For starters, this
isn't the first workaround regarding a DisplayPort device or it's connected
source device waking up. There's also:

f808f63372cc ("drm/dp_helper: Perform throw-away read before actual read in
drm_dp_dpcd_read()")

Which was originally added as a workaround in the Intel driver, and then got
moved into the DRM helpers by me. The important thing about these two
workarounds that sticks out to me is that they're both issues with DP sinks/hubs
that only happen when either the source is first connected to a sink or hub, or
when the sink/hub is waking up from a low power state like D3. So it seems like
in both of these instances, after we manage one "successful" transaction (where
we define "successful" as both an AUX_ACK, -and- the monitor giving us a
sensible reply instead of random garbage) then things start to become normal and
match up with the spec.

The commonality between these two workarounds makes me think that we could solve
the AUX_NACK problem here (-and- this junky throwaway read) if we just kept
track of whether or not we've managed a "successful" transaction at least once,
after which point we can just immediately abort on any AUX_NACK we receive like
we used to. Which would solve the issue you're mentioning here with our handling
of AUX_NACKS, without breaking the monitors that actually need those
workarounds. First, we would we add a field to drm_dp_aux called "ready". Then,
we would want AUX transactions to go like this:

1. Whenever either of the following events occur:
   1a. A new DP device being connected to the system
   1b. Bringing an already-connected DP device out of a low power state through
       DPCD register 00600h or some other mechanism
   We set the "ready" field to False
2. Then, when the driver attempts an AUX channel transaction. We start to
   attempt a single DPCD register read from 00000h and retry until that
   transaction has completed with AUX_ACK. Take note that we will retry this
   transaction a total of 32 times like usual, but we will keep retrying even in
   the face of an AUX_NACK.
3. If the aforementioned transaction never completes with AUX_ACK, we consider
   the driver's DPCD transaction to have failed and return the appropriate
   return code.
   However-if the transaction does complete with AUX_ACK, we set the "ready"
   field to true.
4. We then attempt the original AUX transaction that the driver requested.
5. For the transaction in 4, and any subsequent transactions, as long as "ready"
   is set to True we go with the same 32 retries on AUX_DEFERs, but abort the
   transaction immediately on an AUX_NACK.

> 4- *AUX_DEFER* The specs stated we may retry 7 times on AUX_DEFER
> (3.5.1.2.3 LANEx_CHANNEL_EQ_DONE Sequence) and may terminate LT. Also
> with the increased usage of USB4, TBT/Type-C Docks/Displays, and active
> cables, the use of LTTPR becomes common which adds more challenge
> especially if we have multiple LTTPRs and all operate in non-LTTPR
> mode. In this case all LTTPRs will reply AUX_DEFER in 300us if it did
> not receive any aux response from other LTTPRs and DPRX. The SCR states
> we need to retry 7 times and not more than 50ms (DP v2.0 SCR on 8b/10b
> DPTX and LTTPR Requirements Update to Section 3.6)

I'm not sure where you're getting 7 retries and not more then 50ms from. I
currently have a copy of the DP v2.0 standard, but I'm not sure what SCR is. Is
this some sort of update to Section 3.6 from the DP v2.0 standard? Because I see
some mentions of 50ms response times in my copy of the 2.0 standard regarding
LTTPR, but they don't at all look related to what you're talking about here. (If
it is some update I don't have access to, I'll poke the X.org VESA contacts and
see if they can get me access to this).

Regardless though, I'm still not sure I understand the issue here. If we're
retrying a transaction, it's because the transaction didn't succeed - which in
turn means that something went wrong on the DPRX's end. In the event of
something going wrong with the DPRX for long enough that we end up exceeding
that 50ms timeout, wouldn't that already mean that the link training process is
already failing and needs to be aborted? Or are you saying that we would receive
AUX_NACKs in such an event, which could cause us to keep retrying the
transaction for longer then 50ms, resulting in the DPRX ending the link training
prematurely? If it's the former, hopefully the solution I suggested for your
third point would end up fixing this anyway (but I'm always open to discussion
if that solution isn't enough).

> 
> In addition I believe this function is not correct in treating
> AUX_DEFER and AUX_NACK as -EIO. Especially for AUX_DEFER, it is a valid
> 1 byte response that can provide a valuable debugging information
> especially in the case of on-board LTTPR where there is no way to
> capture this AUX response between the SoC and LTTPR unless you solder
> two wires on AUX_P and AUX_N lines and use i2c/aux analyzer to decode.
> At least we should provide the same debug information as we do in
> drm_dp_i2c_do_msg by printing AUX_DEFER and AUX_NACK.
> 
> Thank you for your feedback and review.

Is the debugging output for DPCD transactions (e.g. setting bit 0x100 for
drm.debug on the kernel commandline) not sufficient enough for this kind of
debugging? I'm fine with us being more specific with our return codes in that
case, I just wonder if they would conflict with any of the error codes some of
the DRM DP helpers already return.

Anyway-let me know if there's anything in my responses I need to clarify, or
anything I missed here. Also, sorry for the very long response! There was a lot
of context I had to dump here for this to make sense.

> --Khaled
> > 
> > > So making the number of aux retires a variable to allow for fine
> > > tuning and
> > > optimization of aux timing.
> > > 
> > > Signed-off-by: Khaled Almahallawy <khaled.almahallawy at intel.com>
> > > ---
> > >  drivers/gpu/drm/drm_dp_helper.c | 10 +++-------
> > >  include/drm/drm_dp_helper.h     |  1 +
> > >  2 files changed, 4 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_dp_helper.c
> > > b/drivers/gpu/drm/drm_dp_helper.c
> > > index eedbb48815b7..8fdf57b4a06c 100644
> > > --- a/drivers/gpu/drm/drm_dp_helper.c
> > > +++ b/drivers/gpu/drm/drm_dp_helper.c
> > > @@ -249,13 +249,7 @@ static int drm_dp_dpcd_access(struct
> > > drm_dp_aux *aux, u8
> > > request,
> > >  
> > >         mutex_lock(&aux->hw_mutex);
> > >  
> > > -       /*
> > > -        * The specification doesn't give any recommendation on how
> > > often to
> > > -        * retry native transactions. We used to retry 7 times like
> > > for
> > > -        * aux i2c transactions but real world devices this wasn't
> > > -        * sufficient, bump to 32 which makes Dell 4k monitors
> > > happier.
> > > -        */
> > > -       for (retry = 0; retry < 32; retry++) {
> > > +       for (retry = 0; retry < aux->num_retries; retry++) {
> > >                 if (ret != 0 && ret != -ETIMEDOUT) {
> > >                         usleep_range(AUX_RETRY_INTERVAL,
> > >                                      AUX_RETRY_INTERVAL + 100);
> > > @@ -1744,6 +1738,8 @@ void drm_dp_aux_init(struct drm_dp_aux *aux)
> > >         aux->ddc.retries = 3;
> > >  
> > >         aux->ddc.lock_ops = &drm_dp_i2c_lock_ops;
> > > +       /*Still making the Dell 4k monitors happier*/
> > > +       aux->num_retries = 32;
> > >  }
> > >  EXPORT_SYMBOL(drm_dp_aux_init);
> > >  
> > > diff --git a/include/drm/drm_dp_helper.h
> > > b/include/drm/drm_dp_helper.h
> > > index edffd1dcca3e..16cbfc8f5e66 100644
> > > --- a/include/drm/drm_dp_helper.h
> > > +++ b/include/drm/drm_dp_helper.h
> > > @@ -1876,6 +1876,7 @@ struct drm_dp_aux {
> > >         struct mutex hw_mutex;
> > >         struct work_struct crc_work;
> > >         u8 crc_count;
> > > +       int num_retries;
> > >         ssize_t (*transfer)(struct drm_dp_aux *aux,
> > >                             struct drm_dp_aux_msg *msg);
> > >         /**
-- 
Sincerely,
Lyude Paul (she/her)
Software Engineer at Red Hat

Note: I deal with a lot of emails and have a lot of bugs on my plate. If you've
asked me a question, are waiting for a review/merge on a patch, etc. and I
haven't responded in a while, please feel free to send me another email to check
on my status. I don't bite!