[Intel-gfx] Making IGT runnable by CI and developers

Daniel Vetter daniel at ffwll.ch
Fri Jul 21 15:45:16 UTC 2017


On Fri, Jul 21, 2017 at 12:56 PM, Tvrtko Ursulin
<tvrtko.ursulin at linux.intel.com> wrote:
>
> On 20/07/2017 17:23, Martin Peres wrote:
>>
>> Hi everyone,
>>
>> As some of you may already know, we have made great strides in making our
>> CI system usable, especially in the last 6 months when everything started
>> clicking together.
>>
>> The CI team is no longer overwhelmed with fires and bug reports, so we
>> started working on increasing the coverage from just fast-feedback, to a
>> bigger set of IGT tests.
>>
>> As some of you may know, running IGT has been a challenge that few manage
>> to overcome. Not only is the execution time counted in machine months, but
>> it can also lead to disk corruption, which does not encourage developers to
>> run it either. One test takes 21 days, on its own, and it is a subset of
>> another test which we never ran for obvious reasons.
>>
>> I would thus like to get the CI team and developers to work together to
>> decrease sharply the execution time of IGT, and get these tests run multiple
>> times per day!
>>
>> There are three usages that the CI team envisions (up for debate):
>>   - Basic acceptance testing: Meant for developers and CI to check quickly
>> if a patch series is not completely breaking the world (< 10 minutes,
>> timeout per test of 30s)
>>   - Full run: Meant to be run overnight by developers and users (< 6
>> hours)
>
>
> We could start by splitting this budget to logical components/teams.
>
> So far we have been talking about GEM and KMS, but I was just thinking that
> we may want to have separate units at this level for the likes of power
> management, DRM (core), external stuff like sw fences? TBD I guess.
>
> Assuming GEM/KMS split only, the fair thing seems to be to split the time budget
> 50-50 and let the respective teams start working.

Yes, KMS is also not perfect, but there it's maybe a factor of 2x that
it's taking too long. GEM is 50x or worse. Also note KMS includes
everything, so core drm, PM tests. 2x is something that can be fixed as we
go, which is good, since it means we should be able to pre-merge test
any changes to igt before pushing. GEM is not even close.

> I assume this is x hours on the slowest machine?
>
> Teams would also need easy access to up-to-date test run times.

Right now you can't have that for GEM, because it takes 24d. That
means 1 run of GEM takes away 50 runs of everything else (need to
check, it might be worse). There's simply no way we can even hand out
that data without blocking pre-merge CI for everyone else.

We might be able to schedule the occasional manual run over the w/e,
but that's about it.
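
For scale, the numbers above work out roughly as follows. This is only a
back-of-the-envelope sketch: the 24 days and the factor of 50 are the figures
from this mail (the latter still needs checking), the "implied" length of a
non-GEM run is merely derived from those two, and the variable names are made
up for illustration.

```python
# Back-of-the-envelope arithmetic; figures come from the mail above,
# variable names are illustrative only.
GEM_RUN_DAYS = 24       # one full GEM run, as stated above
RUNS_DISPLACED = 50     # "50 runs of everything else" (unverified estimate)

gem_run_hours = GEM_RUN_DAYS * 24
# Derived value, not a measurement: what one non-GEM run would have to
# take for the 50x ratio to hold.
implied_other_run_hours = gem_run_hours / RUNS_DISPLACED

print(f"One GEM run: {gem_run_hours} hours")
print(f"Implied length of one non-GEM run: {implied_other_run_hours:.1f} hours")
```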

>>   - Stress tests: They can be in the test suite as a way to catch rare
>> issues, but they cannot be part of the default run mode. They likely should
>> be run on a case-by-case basis, on demand of a developer. Each test could be
>> allowed to take up to 1h.
>>
>> There are multiple ways of getting to this situation (up for debate):
>>
>>   1) All the tests exposed by default are fast and meant to be run:
>>    - Fast-feedback is provided by a testlist, for BAT
>>    - Stress tests run using a special command, kept for on-demand testing
>>
>>   2) Tests are all tagged with information about their exec time:
>>    - igt@basic@.*: Meant for BAT
>>    - igt@complete@.*: Meant for FULL
>>    - igt@stress@.*: The stress tests
>>
>>   3) Testlists all the way:
>>    - fast-feedback: for BAT
>>    - all: the tests that people are expected to run (CI will run them)
>>    - Stress tests will not be part of any testlist.
>
>
> I have a historical fondness for tagging and have just sent a v2 of my
> tagging RFC. There would be some work involved to convert all tests to
> support --list-subtest, but once there it sounds flexible and easy to use to
> me.
>
> How well this would fit with the CI systems I don't have a good visibility
> to. So ultimately I don't care that much what gets picked unless it ends up
> being very cumbersome or work intensive for either side.
>
> To re-iterate:
>
>  * if we get a clear time allocation for GEM for example

2h as a start or goal, maybe 3h where we can start to run it in
pre-merge. On a fast HSW. Yes this is really tough, but I think by the
time the GEM testsuite is getting closer to that number KMS is a lot
faster. At least I plan to invest a pile of my own time into
optimizing stuff.

>  * URL showing us how we stand relative to that dynamically
>  * method of adding/removing tests to the default/full/extended (whatever
> people want to call it) test run
>
> Then I think this is enough for us to start working towards the common goal.
>
>> Whatever decision is being accepted, the CI team is mandating global
>> timeouts for both BAT and FULL testing, in order to guarantee throughput.
>> This will require the team as a whole to agree on time quotas per
>> sub-systems, and enforce them.
>
>
> Is the current CI capable of adding together total per sub-system runtimes,
> and based on what does it do that? I am wondering about tests which do not
> prefix with gem_ or kms_ here.

We have run-data for everything. Jani S. has the full spreadsheet
somewhere from an outdated run (note that that one has a lot of issues
because gpu reset killed boxes back then, now it's just a bit too
slow). Tomi can generate a new list, but if you want GEM data it's 3
full days of real time across the entire HSW farm, and I really don't
think that's a terribly good use of these machines.
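
On the question of tests without a gem_ or kms_ prefix, summing run-data per
sub-system could look roughly like the sketch below. It is not what CI
actually does: the test names and runtimes are invented, the rule "anything
that is not gem_* counts as KMS" is only my reading of "KMS includes
everything" above, and the KMS budget is a placeholder.

```python
from collections import defaultdict


def subsystem(test_name: str) -> str:
    # Assumption: everything that is not gem_* is charged to KMS,
    # following "KMS includes everything" earlier in this mail.
    return "GEM" if test_name.startswith("gem_") else "KMS"


def total_hours(runtimes_s: dict[str, float]) -> dict[str, float]:
    """Sum per-test runtimes (seconds) into hours per sub-system."""
    totals = defaultdict(float)
    for name, seconds in runtimes_s.items():
        totals[subsystem(name)] += seconds / 3600.0
    return dict(totals)


# Made-up test names and runtimes in seconds, NOT real CI data.
sample = {
    "gem_example_a": 7200.0,
    "gem_example_b": 3600.0,
    "kms_example": 1800.0,
    "pm_example": 900.0,
}

# GEM goal taken from this mail (2h); the KMS value is a placeholder.
BUDGET_HOURS = {"GEM": 2.0, "KMS": 3.0}

for name, hours in sorted(total_hours(sample).items()):
    status = "ok" if hours <= BUDGET_HOURS[name] else "over budget"
    print(f"{name}: {hours:.2f}h ({status})")
```

With a real per-test runtime dump in place of the sample dict, the same loop
would show how far each sub-system is from its quota.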

Personally I'd say once you are at a point where you can run the
entirety of GEM on your own local box in less than 8h, that's the point
where we can at least make daily runs on the CI farm. Before that it's
simply a waste of machine-time that we don't have (insert lament about
budget freeze, but that's simply the reality for the next few months
at least).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

