[Intel-gfx] Making IGT runnable by CI and developers

Daniel Vetter daniel at ffwll.ch
Fri Jul 21 09:47:42 UTC 2017


On Fri, Jul 21, 2017 at 11:39 AM, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Thu, Jul 20, 2017 at 6:23 PM, Martin Peres
> <martin.peres at linux.intel.com> wrote:
>> Hi everyone,
>>
>> As some of you may already know, we have made great strides in making our CI
>> system usable, especially in the last 6 months when everything started
>> clicking together.
>>
>> The CI team is no longer overwhelmed with fires and bug reports, so we
>> started working on increasing the coverage from just fast-feedback, to a
>> bigger set of IGT tests.
>>
>> As some of you may know, running IGT has been a challenge that few manage to
>> overcome. Not only is the execution time counted in machine months, but it
>> can also lead to disk corruption, which does not encourage developers to run
>> it either. One test takes 21 days on its own, and it is itself a subset of
>> another test which we have never run, for obvious reasons.
>>
>> I would thus like to get the CI team and developers to work together to
>> decrease sharply the execution time of IGT, and get these tests run multiple
>> times per day!
>>
>> There are three usages that the CI team envision (up for debate):
>>  - Basic acceptance testing: Meant for developers and CI to check quickly if
>> a patch series is not completely breaking the world (< 10 minutes, timeout
>> per test of 30s)
>>  - Full run: Meant to be run overnight by developers and users (< 6 hours)
>>  - Stress tests: They can be in the test suite as a way to catch rare
>> issues, but they cannot be part of the default run mode. They likely should
>> be run on a case-by-case basis, on demand of a developer. Each test could be
>> allowed to take up to 1h.
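To make the 30s-per-test BAT cap concrete, here is a rough sketch of how a runner could enforce it, assuming each test is launched as a subprocess (igt_runner/piglit would do this for real; `sleep` just stands in for a test binary here):

```python
# Sketch of a per-test timeout, as proposed for the BAT set.
import subprocess

BAT_TIMEOUT = 30  # seconds per test in the fast-feedback set

def run_test(argv, timeout=BAT_TIMEOUT):
    """Run one test binary, returning 'pass', 'fail' or 'timeout'."""
    try:
        result = subprocess.run(argv, timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child is killed once the budget is exhausted.
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"

print(run_test(["sleep", "5"], timeout=1))  # -> timeout
```

A test that overruns its budget is reported as such instead of stalling the whole run, which is what guarantees the < 10 minute BAT turnaround.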
>>
>> There are multiple ways of getting to this situation (up for debate):
>>
>>  1) All the tests exposed by default are fast and meant to be run:
>>   - Fast-feedback is provided by a testlist, for BAT
>>   - Stress tests run via a special command, kept for on-demand testing
>>
>>  2) Tests are all tagged with information about their exec time:
>>   - igt@basic@.*: Meant for BAT
>>   - igt@complete@.*: Meant for FULL
>>   - igt@stress@.*: The stress tests
>>
>>  3) Testlists all the way:
>>   - fast-feedback: for BAT
>>   - all: the tests that people are expected to run (CI will run them)
>>   - Stress tests will not be part of any testlist.
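Option 2 (tagging) amounts to selecting tests by a name prefix. A minimal sketch, with invented test names (real IGT names follow the igt@<test>@<subtest> pattern):

```python
# Sketch of tag-based selection: BAT runs only basic@, FULL runs
# basic@ + complete@, and stress@ is never part of a default run.
import re

tests = [
    "igt@basic@gem_exec_nop",       # BAT
    "igt@complete@gem_concurrent",  # FULL
    "igt@stress@gem_exhaustion",    # on-demand only
]

def select(tests, mode):
    patterns = {
        "bat": r"^igt@basic@",
        "full": r"^igt@(basic|complete)@",
    }
    return [t for t in tests if re.match(patterns[mode], t)]

print(select(tests, "full"))
```

The point of the scheme is that the run mode is derived from the test name itself, so no separately maintained list can drift out of date.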
>>
>> Whichever option is accepted, the CI team is mandating global
>> timeouts for both BAT and FULL testing, in order to guarantee throughput.
>> This will require the team as a whole to agree on time quotas per
>> subsystem, and to enforce them.
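The quota idea could look roughly like this; the subsystems, budgets and runtime estimates below are invented numbers for illustration, not real IGT data:

```python
# Sketch of per-subsystem time quotas for the FULL run: each subsystem
# gets a slice of the global budget, and a submission is flagged when
# its estimated runtime exceeds the quota.
FULL_BUDGET_H = 6.0
QUOTAS_H = {"gem": 2.5, "kms": 2.0, "prime": 0.5, "misc": 1.0}
assert sum(QUOTAS_H.values()) <= FULL_BUDGET_H  # quotas fit the budget

def within_quota(subsystem, estimated_runtimes_h):
    """True if the subsystem's tests fit inside its time slice."""
    return sum(estimated_runtimes_h) <= QUOTAS_H[subsystem]

print(within_quota("gem", [0.5, 0.75, 0.75]))  # 2.0h <= 2.5h -> True
print(within_quota("prime", [0.25, 0.5]))      # 0.75h > 0.5h -> False
```

Enforcement at this level is what turns "please keep tests fast" into a property the CI farm can actually rely on.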
>>
>> Can we try to get some healthy debate and reach a consensus on this? Our CI
>> efforts are being limited by this issue right now, and we will be doing
>> whatever we can until the test suite becomes saner and runnable, but this
>> may be unfair to some developers.
>>
>> Looking forward to some constructive feedback and intelligent discussions!
>> Martin
>
> Imo the critical bit for the full run (which should regression test
> all features while being fast enough that we can use it for pre-merge
> testing) must be the default set. Default here means what you get
> without any special cmdline options (to either the test or piglit),
> and without any special testlist that is separately maintained.
> Default also means that it will be included by default if you do a new
> testcase. There's two reasons for that:
>
> - Maintaining a separate test list is a pain. Also, it encourages
> adding tons of tests that no one runs.
>
> - If tests aren't run by default we can't test them pre-merging before
> they land in igt and wreak havoc.
>
> Second, we must have a reasonable runtime, and reasonable runtime here
> means a few hours of machine time for everything, total. There's two
> reasons for that:
> - Only pre-merge is early enough to catch regressions. We can lament
> all day long, but fact is that post-merge regressions don't get fixed
> or handled in a timely manner, except when they're really serious.
> This means any testing strategy that depends upon lots of post-merge
> testing, or expects such post-merge testing to work, is bound to fail.
> Either we can test everything pre-merge, or there's no regression
> testing at all.
>
> - We can't mix multiple patch series together, because autobisecting
> is too unreliable. I've been promised an autobisector for 3 years by
> about 4 different teams now; making that happen in a reliable way is
> _really_ hard. Blocking CI on this is not reasonable.

I forgot one: Randomized test running is also not an acceptable
solution for pre-merge testing, because it's guaranteed to make the
results more noisy. It also guarantees that we'll sometimes miss an
important regression test.

> Also, the testsuite really should be fast enough that developers can
> run it locally on their machines in a work day. Current plan is that
> we can only test on HSW for now, until more budget appears (again, we
> can lament about this, but it's not going to change), which means
> developers _must_ be able to run stuff on e.g. SKL in a reasonable
> amount of time.
>
> Right now the gem/prime tests have a runtime of around 24 days, and
> an estimated 10 months with the stress tests included. I think the
> actual machine time we'll have available in the near future, on this
> HSW farm is going to allow 2-3h for gem tests. That's the time budget
> for this default set of regression tests.
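For scale, the compression factor those numbers imply (arithmetic and rounding are mine):

```python
# ~24 machine-days of gem/prime tests today vs. a 2-3h budget.
current_h = 24 * 24   # 24 days expressed in hours
budget_h = 2.5        # midpoint of the 2-3h target
print(f"required speedup: ~{current_h / budget_h:.0f}x")  # ~230x
```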
>
> Wrt actually implementing it: I don't care, as long as it fulfills the
> above. So tagging, per-test cmdline options, outright deleting all
> the tests we can't run anyway, disabling them in the build system or
> whatever else is all fine with me, as long as the default set doesn't
> require any special action. For tags this would mean that untagged
> tests are _all_ included.

Just to make it clear: These are the hard constraints we have.
Demanding that we have more machines to run more tests, demanding that
we have better post-merge regression tracking, or anything else, isn't a
constructive approach here. The challenge is to engineer a test suite
that fits within the hard constraints, and refusing to do that is
simply not proper software engineering.

And the current gem/prime regression test suite we do have entirely
fails that reality check.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

