[PATCH i-g-t v10] igt-runner fact checking

Peter Senna Tschudin peter.senna at linux.intel.com
Mon Dec 9 11:06:33 UTC 2024


Hi Janusz,

On 09.12.2024 10:17, Janusz Krzysztofik wrote:
> On Friday, 6 December 2024 06:45:31 CET Peter Senna Tschudin wrote:
>> Hi Janusz,
>>
>> Thank you for your detailed comments. I appreciate the opportunity
>> to clarify and address your concerns.
>>
>> On 05.12.2024 15:05, Janusz Krzysztofik wrote:
>>> Hi Peter,
>>>
>>> On Wednesday, 4 December 2024 19:44:53 CET Peter Senna Tschudin wrote:
>>>> When using igt-runner, collect facts before each test and after the
>>>> last test, and report when facts change. The facts are:
>>>>  - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
>>>>  - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
>>>>  - Kernel taints: kernel.is_tainted.taint_warn: true
>>>>  - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
>>>>
>>>> This change imposes little execution overhead and adds just a few
>>>> lines of logging. The facts will be printed on normal igt-runner
>>>> output. Here is a real example from our CI shwoing
>>>> hotreplug-lateclose changing the DRM card number 
>>>
>>> Since you give that as an example of how helpful your facts can be, and follow 
>>> that with a kernel taint example, that may indicate you think, and users of 
>>> your facts may then be mislead having that read, that the taint was related to 
>>> the change of card number, while both had nothing to do with each other.
>>
>> Let’s take a step back to define the purpose and scope of igt-facts:
>>  - Definition of a fact from the dictionary: A fact is an objectively verifiable
>>    piece of information.
>>  - Purpose of igt-facts: Track which tests cause changes to the facts.
>>
>> The operation is straightforward: facts are collected before and after each test,
>> and any differences are logged. Here’s an example showing a fact change and a new
>> fact after running hotreplug-lateclose:
>>
>>  [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
>>  [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
>>
>> This output highlights the facts without implying causation between them. The
>> tool(and my commit message) neither explains relationships between facts nor
>> misleads users into assuming causation.
> 
> For me your commit message does.

Can you please provide some external support for your claims?

I am very careful while writing, and I am aware of my own limitations.
After reading what chatGPT wrote about my text, I felt comfortable
about not misleading anyone. Please mind the wording of my
question: I did not hint a desired outcome for chatGPT.

My Question:
-- // --
Can you please indicate any potential causality relationship
that the following text may offer?

Here is a real example from our CI shwoing hotreplug-lateclose
changing the DRM card number and tainting the kernel on the
abort path
-- // --

chatGPT answer:
--//--
The text suggests a potential causality relationship between the
"hotreplug-lateclose" operation and its effects on the system:

1. Causal Event: The "hotreplug-lateclose" operation.
This likely refers to a process in the CI(Continuous Integration)
environment related to hotplugging devices and managing late-close
scenarios. Hotplugging involves dynamically adding or removing
hardware components without shutting down the system.

2. Effects:

* Changing the DRM card number: This implies that the operation
affects the enumeration or reordering of DRM (Direct Rendering
Manager) devices, causing the card number to be reassigned.

* Kernel Tainting on the abort path: Kernel tainting typically
indicates that the kernel has entered a state that is not
guaranteed to be stable or reliable. This could occur due
to improper handling of errors or unexpected conditions
during the "abort path" (likely a failure or cleanup process
in the operation).

Suggested Causal Chain:
1. The "hotreplug-lateclose" operation executes in the CI.
2. This causes a change in DRM card enumeration (potentially due
to how devices are initialized or re-initialized).
3. On the abort path of this operation, a condition arises that
taints the kernel (possibly due to resource cleanup issues,
unhandled errors, or driver misbehavior).

This text highlights a possible link between hotreplug-lateclose
handling and instabilities in DRM device management as well as
kernel state integrity.
--//--



> 
> Can you please provide a full list of "facts" your code is supposed to handle?

This is in the commit message already, at the very begining.
  
> Can you please explain why you selected just those "facts", not others?

It was either what was missing, such as a convenient way to learn when
something strange happend as a gpu disappearing from the PCI bus, or
something that we believe may cause errors downstream such as a taint,
and the list of loaded modules.

For the drm card number association, we belive that there may be a caching
issue: we are trying to figure it out if the drm-reopen cache handles the
change of drm number association well. We have weak information pointing
to a probable problem akin to missing cache invalidation.


Thanks!

[...]


More information about the igt-dev mailing list