[igt-dev] [PATCH i-g-t v2] intel-ci: add a pre-merge blacklist to reduce the testing queue

Fri Feb 21 10:43:33 UTC 2020

Quoting Martin Peres (2020-02-21 09:00:47)
> When arriving at the office on Monday morning, the reported queue
> size was ~100 hours. This defeats the point of pre-merge testing and
> vastly exceeds our target of ~6 hours.
> 
> We have a lot of work left to reduce testing time, but this patch
> reduces the reported run time by 15-30% depending on the platform:
> 
>  - shard-skl: 23.9 -> 18.2 minutes (18.5%)
>  - shard-kbl: 21.2 -> 16.2 minutes (20%)
>  - shard-apl: 25.9 -> 18.5 minutes (24.3%)
>  - shard-glk: 24.7 -> 17.6 minutes (24.8%)
>  - shard-icl: 25.1 -> 16.7 minutes (28.7%)
>  - shard-tgl: 28.2 -> 19.6 minutes (26.4%)
> 
> The reported runtime is much lower than the actual time
> because of:
> 
>  - Unaccounted time spent outside of the IGT subtests (exec(), fixtures)
>  - Unaccounted time spent in suspend (monotonic clock, 20s / suspend)
>  - Boot time / extra reboots between shards to work around kernel failures
>  - Intel GFX CI shard scheduling overhead
>  - More?
> 
> Tomi and Petri are working on reducing these overheads by detecting the
> bad conditions and rebooting the machine only at this point rather than
> between every single shard, and increasing the size of the shard test
> lists to reduce the per-shard CI overhead.
> 
> Because of this, the actual savings are way smaller in percentage
> but still compound over the tens of executions we do per week:
> 
>  - shard-skl: ~58 -> ~52 minutes
>  - shard-kbl: ~50 -> ~45 minutes
>  - shard-apl: ~53 -> ~46 minutes
>  - shard-glk: ~38 -> ~31 minutes
>  - shard-icl: ~47 -> ~39 minutes
>  - shard-tgl: ~60 -> ~51 minutes
> 
> More work needed, but we'll get there :)
> 
> v2:
>  - Avoid using | in the regular expressions (Petri Latvala)
>  - Update the description for igt@gem_pwrite@big-.* (Chris Wilson)
>  - Drop igt@sw_sync@sync_expired_merge (fixed by Chris Wilson)
>  - Drop igt@gem_eio@kms (fixed by Chris Wilson)
>  - Drop igt@perf@gen12-mi-rpc as it is a serious kernel bug (Chris Wilson)
>  - Add links to the issues tracking this for all blacklisted items
> 
> NOTICE: The above numbers have not been updated for v2, since
>         blacklisting a test or dramatically improving its runtime
>         yields the same results, and only igt@perf@gen12-mi-rpc is
>         back to being slow.
> 
> Signed-off-by: Martin Peres <martin.peres at linux.intel.com>

Acked-by: Chris Wilson <chris at chris-wilson.co.uk>

I dream of a day where the test lists are autogenerated based on
historical information on how effective each one is at rejecting
patches, tuned for a particular test runtime. And with feedback from
bugs reported after the fact (along with the new testcases we need to
capture new code and user reported bugs). [Oh and fuzzing to generate
new tests.]
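
As a rough illustration of what such a budget-tuned selection could look
like (a sketch only, not an existing IGT or CI tool): the code below
greedily picks the tests with the best "past rejections per second of
runtime" until a runtime budget is used up. The TestStats record, the
pick_tests() helper, the test names and all numbers are hypothetical.
It also assumes every rejection counts independently, which ignores
that several tests may catch the same bad patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestStats:
    name: str         # test identifier (hypothetical names below)
    runtime_s: float  # average runtime in seconds (made-up data)
    rejections: int   # past bad patches this test caught (made-up data)


def pick_tests(history, budget_s):
    """Greedily select tests by rejections per second within a budget.

    Tests that never rejected anything are skipped. Returns the chosen
    tests and the total runtime they consume.
    """
    ranked = sorted(
        (t for t in history if t.rejections > 0 and t.runtime_s > 0),
        key=lambda t: t.rejections / t.runtime_s,
        reverse=True,
    )
    chosen, used = [], 0.0
    for t in ranked:
        # Take the test only if it still fits in the remaining budget.
        if used + t.runtime_s <= budget_s:
            chosen.append(t)
            used += t.runtime_s
    return chosen, used


history = [
    TestStats("igt@example@fast-and-useful", 5.0, 10),
    TestStats("igt@example@slow-but-useful", 120.0, 12),
    TestStats("igt@example@slow-and-quiet", 300.0, 1),
    TestStats("igt@example@never-fails", 2.0, 0),
]

chosen, used = pick_tests(history, budget_s=200.0)
for t in chosen:
    print(f"{t.name}: {t.runtime_s:.0f}s, {t.rejections} rejections")
print(f"total runtime: {used:.0f}s")
```

A greedy ratio ranking is only a heuristic; feeding back bugs reported
after merge would mean adjusting the rejection counts over time.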

Imagine if we can do 95% patch^W bug rejection within 10min and 99.9%
rejection within 1hour. Then we might have enough free time for the
extended tests on CI_DRM.
-Chris