[PATCH i-g-t v10] igt-runner fact checking
Peter Senna Tschudin
peter.senna at linux.intel.com
Tue Dec 10 08:38:59 UTC 2024
Very good morning Janusz,
Thank you for your detailed question.
On 09.12.2024 14:50, Janusz Krzysztofik wrote:
> On Monday, 9 December 2024 12:06:33 CET Peter Senna Tschudin wrote:
>> Hi Janusz,
>>
>> On 09.12.2024 10:17, Janusz Krzysztofik wrote:
>>> On Friday, 6 December 2024 06:45:31 CET Peter Senna Tschudin wrote:
>>>> Hi Janusz,
>>>>
>>>> Thank you for your detailed comments. I appreciate the opportunity
>>>> to clarify and address your concerns.
>>>>
>>>> On 05.12.2024 15:05, Janusz Krzysztofik wrote:
>>>>> Hi Peter,
>>>>>
>>>>> On Wednesday, 4 December 2024 19:44:53 CET Peter Senna Tschudin wrote:
>>>>>> When using igt-runner, collect facts before each test and after the
>>>>>> last test, and report when facts change. The facts are:
>>>>>> - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
>>>>>> - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
>>>>>> - Kernel taints: kernel.is_tainted.taint_warn: true
>>>>>> - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
>>>>>>
>>>>>> This change imposes little execution overhead and adds just a few
>>>>>> lines of logging. The facts will be printed on normal igt-runner
>>>>>> output. Here is a real example from our CI shwoing
>>>>>> hotreplug-lateclose changing the DRM card number
>>>>>
>>>>> Since you give that as an example of how helpful your facts can be, and follow
>>>>> that with a kernel taint example, that may indicate you think, and users of
>>>>> your facts may then be mislead having that read, that the taint was related to
>>>>> the change of card number, while both had nothing to do with each other.
>>>>
>>>> Let’s take a step back to define the purpose and scope of igt-facts:
>>>> - Definition of a fact from the dictionary: A fact is an objectively verifiable
>>>> piece of information.
>>>> - Purpose of igt-facts: Track which tests cause changes to the facts.
>>>>
>>>> The operation is straightforward: facts are collected before and after each test,
>>>> and any differences are logged. Here’s an example showing a fact change and a new
>>>> fact after running hotreplug-lateclose:
>>>>
>>>> [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
>>>> [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
>>>>
>>>> This output highlights the facts without implying causation between them. The
>>>> tool(and my commit message) neither explains relationships between facts nor
>>>> misleads users into assuming causation.
>>>
>>> For me your commit message does.
>>
>> Can you please provide some external support for your claims?
>>
>> I am very careful while writing, and I am aware of my own limitations.
>> After reading what chatGPT wrote about my text, I felt comfortable
>> about not misleading anyone. Please mind the wording of my
>> question: I did not hint a desired outcome for chatGPT.
>>
>> My Question:
>> -- // --
>> Can you please indicate any potential causality relationship
>> that the following text may offer?
>>
>> Here is a real example from our CI shwoing hotreplug-lateclose
>> changing the DRM card number and tainting the kernel on the
>> abort path
>> -- // --
>>
>> chatGPT answer:
>> --//--
>> The text suggests a potential causality relationship between the
>> "hotreplug-lateclose" operation and its effects on the system:
>>
>> 1. Causal Event: The "hotreplug-lateclose" operation.
>> This likely refers to a process in the CI(Continuous Integration)
>> environment related to hotplugging devices and managing late-close
>> scenarios. Hotplugging involves dynamically adding or removing
>> hardware components without shutting down the system.
>>
>> 2. Effects:
>>
>> * Changing the DRM card number: This implies that the operation
>> affects the enumeration or reordering of DRM (Direct Rendering
>> Manager) devices, causing the card number to be reassigned.
>>
>> * Kernel Tainting on the abort path: Kernel tainting typically
>> indicates that the kernel has entered a state that is not
>> guaranteed to be stable or reliable. This could occur due
>> to improper handling of errors or unexpected conditions
>> during the "abort path" (likely a failure or cleanup process
>> in the operation).
>>
>> Suggested Causal Chain:
>> 1. The "hotreplug-lateclose" operation executes in the CI.
>> 2. This causes a change in DRM card enumeration (potentially due
>> to how devices are initialized or re-initialized).
>> 3. On the abort path of this operation, a condition arises that
>> taints the kernel (possibly due to resource cleanup issues,
>> unhandled errors, or driver misbehavior).
>>
>> This text highlights a possible link between hotreplug-lateclose
>> handling and instabilities in DRM device management as well as
>> kernel state integrity.
>> --//--
>>
>>
>>
>>>
>>> Can you please provide a full list of "facts" your code is supposed to handle?
>>
>> This is in the commit message already, at the very begining.
>>
>>> Can you please explain why you selected just those "facts", not others?
>>
>> It was either what was missing, such as a convenient way to learn when
>> something strange happend as a gpu disappearing from the PCI bus, or
>> something that we believe may cause errors downstream such as a taint,
>> and the list of loaded modules.
>>
>> For the drm card number association, we belive that there may be a caching
>> issue: we are trying to figure it out if the drm-reopen cache handles the
>> change of drm number association well. We have weak information pointing
>> to a probable problem akin to missing cache invalidation.
[...]
> *What value do your facts add to CI reporting then?*
It seems that the focus of your feedback has shifted from 'Peter is
misleading people' to questioning the value of igt-runner. I believe
it's important to address this change in perspective to ensure
clarity and productive discussion.
To address your new point, let me draw an analogy: asking "what is
the value of igt-runner?" is similar to questioning the use of a car
to reach the supermarket when walking is an option. While walking may
suffice, the car adds efficiency and convenience, especially for
larger or more complex "tasks".
In our CI setup, a typical test list includes between 100 and
200 tests. As you rightly pointed out, determining which test changed
a specific fact currently requires significant time and effort. When
multiple tests modify different facts, the process is not practical.
igt-facts adds value by:
- Clearly identifying which tests change which facts.
- Enabling us to know which facts were present and absent before the
first test runs.
- Reporting only changes, making the information concise and
efficient to review.
Moreover, for most developers, the igt-facts output will not be
obtrusive. In our CI runs, the output is saved only to the
igt_runner?? file and does not appear in dmesg or the results.json
file.
And even so, in the few hundred runs that I checked, igt-facts
typically added between 3 and 16 lines of log to a file that has
about 1000 lines.
I hope this explanation helps clarify the purpose and value of
igt-facts.
Thank you,
Peter
More information about the igt-dev
mailing list