[Intel-gfx] [PATCH v4 i-g-t] tests/kms: increase max threshold time for edid read

Mon Aug 14 16:05:45 UTC 2017

On Mon, 14 Aug 2017, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Mon, Aug 14, 2017 at 5:30 PM, Jani Nikula
> <jani.nikula at linux.intel.com> wrote:
>> On Mon, 14 Aug 2017, Daniel Vetter <daniel at ffwll.ch> wrote:
>>> On Mon, Aug 14, 2017 at 11:43:34AM +0300, Jani Nikula wrote:
>>>> On Fri, 11 Aug 2017, Clint Taylor <clinton.a.taylor at intel.com> wrote:
>>>> > On 08/11/2017 12:49 AM, Lofstedt, Marta wrote:
>>>> >> We can't have this lspcon as an argument to the test, it will not work with automated testing using piglit as we do for CI.
>>>> >> Theoretically we could add a "has_lspcon" to debugfs, but I believe that this test has started to be over complicated.
>>>> > The test would need to know if an LSPcon is connected on a port by port
>>>> > basis. This is easy if the LSPcon driver is loaded but in the case of
>>>> > USB_C->HDMI is gets a little more complicated (not impossible) to figure
>>>> > out. Even if we know exactly what device is connected failures can still
>>>> > occur if the TCON/Monitor clock stretches the EDID read.
>>>> >
>>>> >> The intention of the test is to do a sanity check so that we don't drift off here, so I actually prefer the V1.
>>>> > Unfortunately with the timing differences (3ms to 96ms) based on the
>>>> > monitor type connected and EDID size there is no way for a one size fits
>>>> > all sanity check to be reliable. If the original point of this test was
>>>> > to figure out if probe caused more than 1 EDID read, maybe we should
>>>> > actually count EDID reads and not infer it by measuring time.
>>>>
>>>> I'll take a step back here and try to look at the big picture.
>>>>
>>>> I think the point of the test originally was to catch regressions in the
>>>> EDID code that would slow down EDID reading. That's a valid thing to
>>>> test per se, but unfortunately it's not easy to do that accurately. The
>>>> basic alternatives are, from most accurate to least accurate:
>>>>
>>>> 1) Start off with reference timing, and make relative comparisons
>>>> against that. This is not unlike performance testing. The problem is, to
>>>> not compare apples and oranges, you'll need to take into account
>>>> platform, connector type, adapters, cables, hubs, EDID size, and the
>>>> display. Change one, and you can't make the comparison anymore. You end
>>>> up tracking configurations and timings, a lot.
>>>>
>>>> 2) Try to make a reasonable estimate of the absolute maximum based on
>>>> some subset of the configuration (such as connector type and
>>>> adapters). If you stay below, regardless of the timing changes, you
>>>> assume it's fine, and otherwise you consider it a regression. So you try
>>>> to keep the limits tight to catch regressions, but not flag normal
>>>> behaviour too much. You end up accumulating heuristics for the
>>>> configuration and timing, because, let's face it, there is a lot of
>>>> variance in what's acceptable. (This is v2+ of the patch at hand.)
>>>>
>>>> 3) Try to make a reasonable estimate of the absolute maximum independent
>>>> of the configuration. You'll never catch regressions in the fast
>>>> configurations, because your limits are based on the worst case. You'll
>>>> only catch the really pathological problems. Note that the current code
>>>> uses mean, so it evens out connector type specific problems; we might
>>>> also not catch pathological problems in, say, DP if the HDMI is
>>>> fast. (This is the original and v1 of the patch.)
>>>>
>>>> I think 1) is more trouble than it's worth. 3) is simplistic but
>>>> easy. The question is, do we get enough benefit from 2) to offset the
>>>> complications?
>>>
>>> Why is 2) hard? EDID read takes 22ms at wire-speed, per block. Add a bit
>>> of fudge, gives us 10ms + 25ms*num_edid_blocks. Imo everything else is a
>>> bug in our kernel driver. I have no idea why lspcon matters or not,
>>> because it really shouldn't.
>>
>> 2) is hard partly for the same reason 1) is hard. All of the things I
>> mention impact the overall speed. You only mention wire speed, but
>> e.g. DP i2c-over-aux can reply with a bunch of defers before anything
>> happens. The EDID reads are chuncked to a number of i2c-over-aux
>> transactions, *each* of which may defer. For a number of valid reasons.
>> See the commentary in drm_dp_i2c_do_msg(); we've basically punted on
>> accurate timeouts and retry counts, and set them high enough.
>
> i2c over dp aux can defer becuase dp aux runs at 2mbit, and i2c at
> 100kbit. The defers are to handle the mismatch. If you look actual
> transfer rates, then the i2c transactions still completes at wire
> speed of the i2c bus.
>
> And yes we've had bugs in this area where i2c over dp_aux was suddenly
> 2x slower, so making allowances for bugs here isn't a good idea.
>
>> I don't think even the driver is capable of taking all of the variables
>> into account, let alone igt.
>
> So the _one_ variable we haven't taken into account, which is the one
> that spurred Clint here into this endeavor, is EDID size. There's
> simply no way to transfer an 3block edid in 50ms. Afaik that's the
> only bug. I've seen piles of displays, QA has checked for the 50ms
> limit for ages, as long as your edid is only 256 bytes it all works.
> Minor kernel bugs.

Agreed on EDID size being the thing with the largest impact that we're
not taking into account. And it's the low hanging fruit, too.

We can label the rest under hypothetical, at least until it's not... but
that doesn't conflict with EDID size needing to be fixed first. It's not
productive to argue on the rest.

BR,
Jani.

>
> There is provisions in the i2c spec for clock stretching, but I've
> never seen a screen that actually needs that.
>
>> Bottom line, you can't set generic igt limits based on your wire speed
>> assumptions.
>
> Which bug breaks this assumption? All the stuff about lspcon is
> because Clint assumed the wrong wire speed.
>
> All I'm proposing is that we apply the one-liner that takes screens
> with 3/4 block edids into account, because that's a new thing. Afaik
> there's nothing else.
> -Daniel

-- 
Jani Nikula, Intel Open Source Technology Center