[Intel-gfx] [PATCH v4 i-g-t] tests/kms: increase max threshold time for edid read

Jani Nikula jani.nikula at linux.intel.com
Mon Aug 14 15:30:24 UTC 2017


On Mon, 14 Aug 2017, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Mon, Aug 14, 2017 at 11:43:34AM +0300, Jani Nikula wrote:
>> On Fri, 11 Aug 2017, Clint Taylor <clinton.a.taylor at intel.com> wrote:
>> > On 08/11/2017 12:49 AM, Lofstedt, Marta wrote:
>> >> We can't have this lspcon argument to the test; it will not work with the automated piglit testing we use for CI.
>> >> Theoretically we could add a "has_lspcon" entry to debugfs, but I believe this test has become overcomplicated.
>> > The test would need to know if an LSPcon is connected on a
>> > port-by-port basis. This is easy if the LSPcon driver is loaded, but
>> > in the USB_C->HDMI case it gets a little more complicated (not
>> > impossible) to figure out. Even if we know exactly what device is
>> > connected, failures can still occur if the TCON/monitor
>> > clock-stretches the EDID read.
>> >
>> >> The intention of the test is to do a sanity check so that we don't drift off here; I actually prefer the v1.
>> > Unfortunately, with timing varying from 3 ms to 96 ms depending on
>> > the connected monitor type and EDID size, there is no way for a
>> > one-size-fits-all sanity check to be reliable. If the original point
>> > of this test was to figure out whether probe caused more than one
>> > EDID read, maybe we should actually count EDID reads rather than
>> > inferring them from elapsed time.
>> 
>> I'll take a step back here and try to look at the big picture.
>> 
>> I think the point of the test originally was to catch regressions in the
>> EDID code that would slow down EDID reading. That's a valid thing to
>> test per se, but unfortunately it's not easy to do that accurately. The
>> basic alternatives are, from most accurate to least accurate:
>> 
>> 1) Start off with a reference timing, and make relative comparisons
>> against that. This is not unlike performance testing. The problem is,
>> to avoid comparing apples and oranges, you'll need to take into
>> account platform, connector type, adapters, cables, hubs, EDID size,
>> and the display. Change any one of them, and you can't make the
>> comparison anymore. You end up tracking a lot of configurations and
>> timings.
>> 
>> 2) Try to make a reasonable estimate of the absolute maximum based on
>> some subset of the configuration (such as connector type and
>> adapters). If you stay below the limit, regardless of timing changes,
>> you assume it's fine; otherwise you consider it a regression. So you
>> try to keep the limits tight enough to catch regressions without
>> flagging normal behaviour too often. You end up accumulating
>> heuristics for configuration and timing, because, let's face it,
>> there is a lot of variance in what's acceptable. (This is v2+ of the
>> patch at hand.)
>> 
>> 3) Try to make a reasonable estimate of the absolute maximum
>> independent of the configuration. You'll never catch regressions in
>> the fast configurations, because your limits are based on the worst
>> case; you'll only catch the really pathological problems. Note that
>> the current code uses the mean, so it evens out connector-type
>> specific problems; we might also fail to catch pathological problems
>> in, say, DP if the HDMI is fast. (This is the original and v1 of the
>> patch.)
>> 
>> I think 1) is more trouble than it's worth. 3) is simplistic but
>> easy. The question is, do we get enough benefit from 2) to offset the
>> complications?
>
> Why is 2) hard? An EDID read takes 22ms at wire speed, per block. Add
> a bit of fudge, and that gives us 10ms + 25ms*num_edid_blocks. Imo
> everything else is a bug in our kernel driver. I have no idea why
> lspcon would matter here, because it really shouldn't.
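
To make that concrete, the proposed formula would allow 10 + 25*1 =
35ms for a plain 1-block EDID, and 10 + 25*2 = 60ms for a common
2-block (256-byte) EDID.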

2) is hard partly for the same reason 1) is hard. All of the things I
mention impact the overall speed. You only mention wire speed, but
e.g. DP i2c-over-aux can reply with a bunch of defers before anything
happens. The EDID reads are chunked into a number of i2c-over-aux
transactions, *each* of which may defer, for a number of valid
reasons. See the commentary in drm_dp_i2c_do_msg(); we've basically
punted on accurate timeouts and retry counts, and just set them high
enough.
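
To put rough numbers on this, here's a back-of-the-envelope model of
one chunked EDID block read. The 16-byte chunk matches the AUX payload
limit, but the per-transfer cost and the defer behaviour below are
made-up assumptions for illustration, not what the kernel or any
particular sink actually does:

/* Toy model: one 128-byte EDID block read over DP i2c-over-aux. */
#include <stdio.h>

#define EDID_BLOCK_SIZE	128
#define AUX_CHUNK_SIZE	16	/* max data bytes per AUX transaction */
#define AUX_XFER_US	500	/* assumed cost of one AUX round trip */
#define DEFER_WAIT_US	500	/* assumed wait before retrying a defer */

static unsigned int edid_block_us(unsigned int defers_per_chunk)
{
	unsigned int chunks = EDID_BLOCK_SIZE / AUX_CHUNK_SIZE;

	/* Each defer costs a wasted transaction plus a retry delay. */
	return chunks * (AUX_XFER_US +
			 defers_per_chunk * (DEFER_WAIT_US + AUX_XFER_US));
}

int main(void)
{
	unsigned int defers;

	for (defers = 0; defers <= 8; defers++)
		printf("%u defers/chunk -> %u ms/block\n",
		       defers, edid_block_us(defers) / 1000);

	return 0;
}

With zero defers that's 4ms per block; a few defers per chunk already
blow through your 25ms/block budget, before adapters, hubs, or a slow
TCON even enter the picture.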

I don't think even the driver is capable of taking all of the variables
into account, let alone igt.

Bottom line, you can't set generic igt limits based on wire-speed
assumptions.


BR,
Jani.


-- 
Jani Nikula, Intel Open Source Technology Center

