[PATCH i-g-t v8] igt-runner fact checking
Peter Senna Tschudin
peter.senna at linux.intel.com
Mon Nov 25 10:21:08 UTC 2024
On 25.11.2024 10:49, Zbigniew Kempczyński wrote:
> On Thu, Nov 21, 2024 at 03:22:30PM +0100, Peter Senna Tschudin wrote:
>> When using igt-runner, collect facts before each test and after the
>> last test, and report when facts change. The facts are:
>> - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
>> - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
>> - Kernel taints: kernel.is_tainted.taint_warn: true
>> - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
>>
>> This change imposes little execution overhead and adds just a few
>> lines of logging. The facts will be printed on normal igt-runner
>> output. Here is a real example from our CI showing
>> hotreplug-lateclose changing the DRM card number and tainting the
>> kernel on the abort path:
>>
>> [245.316207] [056/121] (816s left) core_hotunplug (hotreplug-lateclose)
>> [245.383596] Starting subtest: hotreplug-lateclose
>> [249.843361] Aborting: Lockdep not active
>> [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
>> [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
>> [249.859075] Closing watchdogs
>
> <cut>
>
> Regardless of the implementation, I wondered a bit about igt_runner being
> used by others. Instead of turning on fact gathering by default, I would
> add a separate option to enable it; -f/--facts seems appropriate. This
> would let us enable or disable it from the CI side: if we noticed problems,
> we could just disable it in CI instead of reverting the code.
Thank you for the input. Can you expand on the cost for others? It is just a few extra lines of logging.
igt-facts primarily looks for sub-tests that change the environment in ways that cause issues downstream. Here is an example: imagine test B runs after test A, with 100 tests in between. If test A has tainted the kernel or changed the set of loaded modules, it can cause B to fail. The value is in identifying test A as the offender. Can you expand on how this is a problem for others?
The secondary goal is to report when weird stuff happens, such as a PCI GPU disappearing. Since the goal of the facts is to detect events that are not expected to happen, having options to choose facts or to disable them makes no sense.
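To illustrate the idea of pinning the change on test A, here is a minimal Python sketch of comparing two fact snapshots, one taken before a test and one after. It is not the igt-runner implementation: the function name `diff_facts` and the dictionary representation are hypothetical, and the "deleted" label is an assumption (the log above only shows "changed" and "new").

```python
def diff_facts(before, after):
    """Return log-style lines describing how facts differ between snapshots.

    Both arguments are plain dicts mapping a fact name to its value.
    This is an illustrative sketch, not the igt-runner code.
    """
    lines = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            lines.append(f"new: {key}: {after[key]}")
        elif key not in after:
            # "deleted" is an assumed label; it does not appear in the log above.
            lines.append(f"deleted: {key}: {before[key]}")
        elif before[key] != after[key]:
            lines.append(f"changed: {key}: {before[key]} -> {after[key]}")
    return lines


# Snapshots modelled on the hotreplug-lateclose example from the CI log.
before = {"hardware.pci.drm_card_at_addr.0000:00:02.0": "card0"}
after = {
    "hardware.pci.drm_card_at_addr.0000:00:02.0": "card1",
    "kernel.is_tainted.taint_die": "true",
}

for line in diff_facts(before, after):
    print(line)
```

Because the comparison runs around every test, the first test whose diff is non-empty is the one that changed the environment, no matter how many tests later a failure shows up.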
You mention folks who have non-PCI GPUs. Covering their use case can be added later if they want it. Let's wait for them to express interest instead of trying to come up with something perfect. Please :-)
Just to give you an idea of how pervasive the problem of tests changing the environment is: 49% of all test-lists from IGTPW_12121 had at least one kernel taint. As a rough approximation, we can estimate that each sub-test had a 25% chance of running in a tainted kernel.
For me, a 25% chance of any sub-test running in a tainted kernel is a problem for everyone. Can you expand on why this would be different for others?
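One way to arrive at a figure near 25% is sketched below. The derivation is an assumption, not something stated in this thread: it supposes that a taint persists until the end of the test-list and that the tainting test sits, on average, in the middle of the list, so about half of the sub-tests in a tainted list run after the taint. The function name and parameters are hypothetical.

```python
def tainted_fraction(p_list_tainted, mean_taint_position=0.5):
    """Estimate the chance that a sub-test runs in a tainted kernel.

    p_list_tainted: fraction of test-lists with at least one taint.
    mean_taint_position: assumed average relative position (0..1) of the
        test that first taints the kernel; 0.5 means mid-list.
    """
    # Sub-tests after the tainting test make up (1 - position) of the list.
    return p_list_tainted * (1.0 - mean_taint_position)


estimate = tainted_fraction(0.49)
print(f"{estimate:.3f}")  # prints 0.245, i.e. roughly 25%
```

If taints tend to happen earlier in a list than the midpoint, the real figure would be higher than this estimate.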
In short, my take is:
- overhead is low: just a few lines of code
- the value is there for everyone
- if we need to adapt facts for others, I will be happy to do it when they express interest
Can I get your reviewed-by? Please :-)
Thanks