[Intel-gfx] [PATCH i-g-t 1/2] tests/chamelium: Skip suspend/resume test with unreliable hotplug event

Wed Jul 19 15:47:28 UTC 2017

On Wed, 2017-07-19 at 11:31 +0300, Paul Kocialkowski wrote:
> On Tue, 2017-07-18 at 22:21 +0100, Chris Wilson wrote:
> > Quoting Paul Kocialkowski (2017-07-18 16:16:26)
> > > It may occur that a hotplug uevent is detected at resume, even though
> > > it does not indicate that an actual hotplug happened. This is the case
> > > when link training fails on any other connector.
> > > 
> > > There is currently no way to distinguish what connector caused a
> > > hotplug uevent, nor what the reason for that uevent really is. This
> > > makes it impossible to find out whether the test actually passed or
> > > not.
> > 
> > And you may get more than one and then this skips even though the test
> > passed. Looks like the patch is overcompensating. What you can do is
> > repeat the test a few times, and then look at all the different errors
> > you get. If the connector remains (no MST disappearance) once it goes
> > bad, it should remain bad and so not generate any new uevent. Or you
> > only repeat the test whilst link_status[old] != link_status[new].
> 
> I am not sure it is really desirable to repeat the test until we are
> fairly certain it succeeds. This involves suspend/resume, that is
> already long enough as it is.
> 
> Also, a uevent will be generated every time link training fails,
> regardless of whether it was already failing before (I just tested that
> to make sure). In my case, it's due to a DP-VGA bridge that will
> consistently fail link training in the first seconds after resume.
> 
> So this is actually even worse than I thought, because there is no way
> to find out that this is why a uevent was generated if the link status
> was already bad before.
> 
> So I don't see how we can manage with the information currently at our
> disposal.
> 
> My main point here is that we need more information about what's going
> on than simply "HOTPLUG=1". These patches demonstrate that working
> around the lack of information is a pain for testing purposes and can
> only lead to semi-working hackish workarounds.
> 
> Do you agree that this is what the problem really is?
Yes, I agree we need more debugging information for when hotplugs fail.
That said, the fact that i915 unconditionally sends hotplugs on resume
(this appears to be a hack they added to avoid missing hotplug events
between suspend/resume) is really what's causing this problem
specifically.

We really need the debugging stuff Martin and I suggested for the
kernel, and also more DRM helpers to actually do EDID checks and that
sort of stuff so that we don't have to deal with dirty hacks like this
:\.
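
To make the missing information concrete, here is a rough sketch of what a
test could do if the uevent said which connector it was about. It assumes a
hypothetical CONNECTOR=<id> key next to HOTPLUG=1; that key is purely an
assumption for illustration, as are the helper names uevent_get() and
classify_hotplug(). With only HOTPLUG=1 in the payload, the best the test
can report is "unknown", which is exactly the problem described above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up `key` in a uevent payload, i.e. a sequence of NUL-terminated
 * "KEY=value" strings. Returns a pointer to the value, or NULL if absent.
 */
static const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t elen;

		if (!end)
			break; /* unterminated trailing entry, ignore it */

		elen = (size_t)(end - entry);
		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;

		off += elen + 1;
	}

	return NULL;
}

enum hotplug_verdict {
	NOT_HOTPLUG,     /* no HOTPLUG=1 in the payload */
	HOTPLUG_OURS,    /* hotplug for the connector under test */
	HOTPLUG_OTHER,   /* hotplug for some other connector */
	HOTPLUG_UNKNOWN, /* HOTPLUG=1 only: cannot tell who caused it */
};

/* CONNECTOR is a hypothetical key, assumed here for illustration only. */
static enum hotplug_verdict classify_hotplug(const char *buf, size_t len,
					     long connector_id)
{
	const char *hp = uevent_get(buf, len, "HOTPLUG");
	const char *conn;

	if (!hp || strcmp(hp, "1") != 0)
		return NOT_HOTPLUG;

	conn = uevent_get(buf, len, "CONNECTOR");
	if (!conn)
		return HOTPLUG_UNKNOWN;

	return strtol(conn, NULL, 10) == connector_id ?
		HOTPLUG_OURS : HOTPLUG_OTHER;
}
```

A suspend/resume test could then ignore HOTPLUG_OTHER, act on HOTPLUG_OURS,
and only fall back to skipping on HOTPLUG_UNKNOWN.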