[PATCH i-g-t v10] igt-runner fact checking

Janusz Krzysztofik janusz.krzysztofik at linux.intel.com
Tue Dec 10 09:50:16 UTC 2024


Hi Peter,

On Tuesday, 10 December 2024 09:38:59 CET Peter Senna Tschudin wrote:
> Very good morning Janusz,
> 
> Thank you for your detailed question.
> 
> On 09.12.2024 14:50, Janusz Krzysztofik wrote:
> > On Monday, 9 December 2024 12:06:33 CET Peter Senna Tschudin wrote:
> >> Hi Janusz,
> >>
> >> On 09.12.2024 10:17, Janusz Krzysztofik wrote:
> >>> On Friday, 6 December 2024 06:45:31 CET Peter Senna Tschudin wrote:
> >>>> Hi Janusz,
> >>>>
> >>>> Thank you for your detailed comments. I appreciate the opportunity
> >>>> to clarify and address your concerns.
> >>>>
> >>>> On 05.12.2024 15:05, Janusz Krzysztofik wrote:
> >>>>> Hi Peter,
> >>>>>
> >>>>> On Wednesday, 4 December 2024 19:44:53 CET Peter Senna Tschudin wrote:
> >>>>>> When using igt-runner, collect facts before each test and after the
> >>>>>> last test, and report when facts change. The facts are:
> >>>>>>  - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
> >>>>>>  - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
> >>>>>>  - Kernel taints: kernel.is_tainted.taint_warn: true
> >>>>>>  - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
> >>>>>>
> >>>>>> This change imposes little execution overhead and adds just a few
> >>>>>> lines of logging. The facts will be printed on normal igt-runner
> >>>>>> output. Here is a real example from our CI shwoing
> >>>>>> hotreplug-lateclose changing the DRM card number 
> >>>>>
> >>>>> Since you give that as an example of how helpful your facts can be, and follow 
> >>>>> that with a kernel taint example, that may indicate you think, and users of 
> >>>>> your facts may then be mislead having that read, that the taint was related to 
> >>>>> the change of card number, while both had nothing to do with each other.
> >>>>
> >>>> Let’s take a step back to define the purpose and scope of igt-facts:
> >>>>  - Definition of a fact from the dictionary: A fact is an objectively verifiable
> >>>>    piece of information.
> >>>>  - Purpose of igt-facts: Track which tests cause changes to the facts.
> >>>>
> >>>> The operation is straightforward: facts are collected before and after each test,
> >>>> and any differences are logged. Here’s an example showing a fact change and a new
> >>>> fact after running hotreplug-lateclose:
> >>>>
> >>>>  [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
> >>>>  [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
> >>>>
> >>>> This output highlights the facts without implying causation between them. The
> >>>> tool(and my commit message) neither explains relationships between facts nor
> >>>> misleads users into assuming causation.
> >>>
> >>> For me your commit message does.
> >>
> >> Can you please provide some external support for your claims?
> >>
> >> I am very careful while writing, and I am aware of my own limitations.
> >> After reading what chatGPT wrote about my text, I felt comfortable
> >> about not misleading anyone. Please mind the wording of my
> >> question: I did not hint a desired outcome for chatGPT.
> >>
> >> My Question:
> >> -- // --
> >> Can you please indicate any potential causality relationship
> >> that the following text may offer?
> >>
> >> Here is a real example from our CI shwoing hotreplug-lateclose
> >> changing the DRM card number and tainting the kernel on the
> >> abort path
> >> -- // --
> >>
> >> chatGPT answer:
> >> --//--
> >> The text suggests a potential causality relationship between the
> >> "hotreplug-lateclose" operation and its effects on the system:
> >>
> >> 1. Causal Event: The "hotreplug-lateclose" operation.
> >> This likely refers to a process in the CI(Continuous Integration)
> >> environment related to hotplugging devices and managing late-close
> >> scenarios. Hotplugging involves dynamically adding or removing
> >> hardware components without shutting down the system.
> >>
> >> 2. Effects:
> >>
> >> * Changing the DRM card number: This implies that the operation
> >> affects the enumeration or reordering of DRM (Direct Rendering
> >> Manager) devices, causing the card number to be reassigned.
> >>
> >> * Kernel Tainting on the abort path: Kernel tainting typically
> >> indicates that the kernel has entered a state that is not
> >> guaranteed to be stable or reliable. This could occur due
> >> to improper handling of errors or unexpected conditions
> >> during the "abort path" (likely a failure or cleanup process
> >> in the operation).
> >>
> >> Suggested Causal Chain:
> >> 1. The "hotreplug-lateclose" operation executes in the CI.
> >> 2. This causes a change in DRM card enumeration (potentially due
> >> to how devices are initialized or re-initialized).
> >> 3. On the abort path of this operation, a condition arises that
> >> taints the kernel (possibly due to resource cleanup issues,
> >> unhandled errors, or driver misbehavior).
> >>
> >> This text highlights a possible link between hotreplug-lateclose
> >> handling and instabilities in DRM device management as well as
> >> kernel state integrity.
> >> --//--
> >>
> >>
> >>
> >>>
> >>> Can you please provide a full list of "facts" your code is supposed to handle?
> >>
> >> This is in the commit message already, at the very begining.
> >>   
> >>> Can you please explain why you selected just those "facts", not others?
> >>
> >> It was either what was missing, such as a convenient way to learn when
> >> something strange happend as a gpu disappearing from the PCI bus, or
> >> something that we believe may cause errors downstream such as a taint,
> >> and the list of loaded modules.
> >>
> >> For the drm card number association, we belive that there may be a caching
> >> issue: we are trying to figure it out if the drm-reopen cache handles the
> >> change of drm number association well. We have weak information pointing
> >> to a probable problem akin to missing cache invalidation.
> 
> [...]
>  > *What value do your facts add to CI reporting then?*
> 
> It seems that the focus of your feedback has shifted from 'Peter is
> misleading people' to questioning the value of igt-runner. I believe
> it's important to address this change in perspective to ensure
> clarity and productive discussion.
> 
> To address your new point, let me draw an analogy: asking "what is
> the value of igt-runner?" is similar to questioning the use of a car
> to reach the supermarket when walking is an option. While walking may
> suffice, the car adds efficiency and convenience, especially for
> larger or more complex "tasks".
> 
> In our CI setup, a typical test list includes between 100 and
> 200 tests. As you rightly pointed out, determining which test changed
> a specific fact currently requires significant time and effort. When
> multiple tests modify different facts, the process is not practical.
> 
> igt-facts adds value by:
> - Clearly identifying which tests change which facts.
> - Enabling us to know which facts were present and absent before the
>   first test runs.
> - Reporting only changes, making the information concise and
>   efficient to review.
> 
> Moreover, for most developers, the igt-facts output will not be
> obtrusive. In our CI runs, the output is saved only to the
> igt_runner?? file and does not appear in dmesg or the results.json
> file.
> 
> And even so, in the few hundred runs that I checked, igt-facts
> typically added between 3 and 16 lines of log to a file that has
> about 1000 lines.
> 
> I hope this explanation helps clarify the purpose and value of
> igt-facts.

Unfortunately not.  I still can't see how your 4 facts reported by igt_runner 
could be useful to me.  But let's hear from those who share your points.

Thanks,
Janusz

> 
> Thank you,
> 
> Peter
> 






More information about the igt-dev mailing list