[PATCH i-g-t v10] igt-runner fact checking

Tue Dec 10 13:41:56 UTC 2024

On Tue, 2024-12-10 at 10:50 +0100, Janusz Krzysztofik wrote:
> Hi Peter,
> 
> On Tuesday, 10 December 2024 09:38:59 CET Peter Senna Tschudin wrote:
> > Very good morning Janusz,
> > 
> > Thank you for your detailed question.
> > 
> > On 09.12.2024 14:50, Janusz Krzysztofik wrote:
> > > On Monday, 9 December 2024 12:06:33 CET Peter Senna Tschudin wrote:
> > > > Hi Janusz,
> > > > 
> > > > On 09.12.2024 10:17, Janusz Krzysztofik wrote:
> > > > > On Friday, 6 December 2024 06:45:31 CET Peter Senna Tschudin wrote:
> > > > > > Hi Janusz,
> > > > > > 
> > > > > > Thank you for your detailed comments. I appreciate the opportunity
> > > > > > to clarify and address your concerns.
> > > > > > 
> > > > > > On 05.12.2024 15:05, Janusz Krzysztofik wrote:
> > > > > > > Hi Peter,
> > > > > > > 
> > > > > > > On Wednesday, 4 December 2024 19:44:53 CET Peter Senna Tschudin wrote:
> > > > > > > > When using igt-runner, collect facts before each test and after the
> > > > > > > > last test, and report when facts change. The facts are:
> > > > > > > >  - GPUs on PCI bus: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:e20b Intel Battlemage (Gen20)
> > > > > > > >  - Associations between PCI GPU and DRM card: hardware.pci.drm_card_at_addr.0000:03:00.0: card1
> > > > > > > >  - Kernel taints: kernel.is_tainted.taint_warn: true
> > > > > > > >  - GPU kernel modules loaded: kernel.kmod_is_loaded.i915: true
> > > > > > > > 
> > > > > > > > This change imposes little execution overhead and adds just a few
> > > > > > > > lines of logging. The facts will be printed on normal igt-runner
> > > > > > > > output. Here is a real example from our CI shwoing
> > > > > > > > hotreplug-lateclose changing the DRM card number 
> > > > > > > 
> > > > > > > Since you give that as an example of how helpful your facts can be, and follow 
> > > > > > > that with a kernel taint example, that may indicate you think, and users of 
> > > > > > > your facts may then be mislead having that read, that the taint was related to 
> > > > > > > the change of card number, while both had nothing to do with each other.
> > > > > > 
> > > > > > Let’s take a step back to define the purpose and scope of igt-facts:
> > > > > >  - Definition of a fact from the dictionary: A fact is an objectively verifiable
> > > > > >    piece of information.
> > > > > >  - Purpose of igt-facts: Track which tests cause changes to the facts.
> > > > > > 
> > > > > > The operation is straightforward: facts are collected before and after each test,
> > > > > > and any differences are logged. Here’s an example showing a fact change and a new
> > > > > > fact after running hotreplug-lateclose:
> > > > > > 
> > > > > >  [249.858249] [FACT core_hotunplug (hotreplug-lateclose)] changed: hardware.pci.drm_card_at_addr.0000:00:02.0: card0 -> card1
> > > > > >  [249.858392] [FACT core_hotunplug (hotreplug-lateclose)] new: kernel.is_tainted.taint_die: true
> > > > > > 
> > > > > > This output highlights the facts without implying causation between them. The
> > > > > > tool(and my commit message) neither explains relationships between facts nor
> > > > > > misleads users into assuming causation.
> > > > > 
> > > > > For me your commit message does.
> > > > 
> > > > Can you please provide some external support for your claims?
> > > > 
> > > > I am very careful while writing, and I am aware of my own limitations.
> > > > After reading what chatGPT wrote about my text, I felt comfortable
> > > > about not misleading anyone. Please mind the wording of my
> > > > question: I did not hint a desired outcome for chatGPT.
> > > > 
> > > > My Question:
> > > > -- // --
> > > > Can you please indicate any potential causality relationship
> > > > that the following text may offer?
> > > > 
> > > > Here is a real example from our CI shwoing hotreplug-lateclose
> > > > changing the DRM card number and tainting the kernel on the
> > > > abort path
> > > > -- // --
> > > > 
> > > > chatGPT answer:
> > > > --//--
> > > > The text suggests a potential causality relationship between the
> > > > "hotreplug-lateclose" operation and its effects on the system:
> > > > 
> > > > 1. Causal Event: The "hotreplug-lateclose" operation.
> > > > This likely refers to a process in the CI(Continuous Integration)
> > > > environment related to hotplugging devices and managing late-close
> > > > scenarios. Hotplugging involves dynamically adding or removing
> > > > hardware components without shutting down the system.
> > > > 
> > > > 2. Effects:
> > > > 
> > > > * Changing the DRM card number: This implies that the operation
> > > > affects the enumeration or reordering of DRM (Direct Rendering
> > > > Manager) devices, causing the card number to be reassigned.
> > > > 
> > > > * Kernel Tainting on the abort path: Kernel tainting typically
> > > > indicates that the kernel has entered a state that is not
> > > > guaranteed to be stable or reliable. This could occur due
> > > > to improper handling of errors or unexpected conditions
> > > > during the "abort path" (likely a failure or cleanup process
> > > > in the operation).
> > > > 
> > > > Suggested Causal Chain:
> > > > 1. The "hotreplug-lateclose" operation executes in the CI.
> > > > 2. This causes a change in DRM card enumeration (potentially due
> > > > to how devices are initialized or re-initialized).
> > > > 3. On the abort path of this operation, a condition arises that
> > > > taints the kernel (possibly due to resource cleanup issues,
> > > > unhandled errors, or driver misbehavior).
> > > > 
> > > > This text highlights a possible link between hotreplug-lateclose
> > > > handling and instabilities in DRM device management as well as
> > > > kernel state integrity.
> > > > --//--

(Really not a fan of posting AI slop as an argument for how well other
people can understand your position, especially if you ask it to
perform any sort of reasoning...)

> > > > 
> > > > 
> > > > > 
> > > > > Can you please provide a full list of "facts" your code is supposed to handle?
> > > > 
> > > > This is in the commit message already, at the very begining.
> > > >   
> > > > > Can you please explain why you selected just those "facts", not others?
> > > > 
> > > > It was either what was missing, such as a convenient way to learn when
> > > > something strange happend as a gpu disappearing from the PCI bus, or
> > > > something that we believe may cause errors downstream such as a taint,
> > > > and the list of loaded modules.
> > > > 
> > > > For the drm card number association, we belive that there may be a caching
> > > > issue: we are trying to figure it out if the drm-reopen cache handles the
> > > > change of drm number association well. We have weak information pointing
> > > > to a probable problem akin to missing cache invalidation.
> > 
> > [...]
> >  > *What value do your facts add to CI reporting then?*
> > 
> > It seems that the focus of your feedback has shifted from 'Peter is
> > misleading people' to questioning the value of igt-runner. I believe
> > it's important to address this change in perspective to ensure
> > clarity and productive discussion.
> > 
> > To address your new point, let me draw an analogy: asking "what is
> > the value of igt-runner?" is similar to questioning the use of a car
> > to reach the supermarket when walking is an option. While walking may
> > suffice, the car adds efficiency and convenience, especially for
> > larger or more complex "tasks".
> > 
> > In our CI setup, a typical test list includes between 100 and
> > 200 tests. As you rightly pointed out, determining which test changed
> > a specific fact currently requires significant time and effort. When
> > multiple tests modify different facts, the process is not practical.
> > 
> > igt-facts adds value by:
> > - Clearly identifying which tests change which facts.
> > - Enabling us to know which facts were present and absent before the
> >   first test runs.
> > - Reporting only changes, making the information concise and
> >   efficient to review.
> > 
> > Moreover, for most developers, the igt-facts output will not be
> > obtrusive. In our CI runs, the output is saved only to the
> > igt_runner?? file and does not appear in dmesg or the results.json
> > file.
> > 
> > And even so, in the few hundred runs that I checked, igt-facts
> > typically added between 3 and 16 lines of log to a file that has
> > about 1000 lines.
> > 
> > I hope this explanation helps clarify the purpose and value of
> > igt-facts.
> 
> Unfortunately not.  I still can't see how your 4 facts reported by igt_runner 
> could be useful to me.  But let's hear from those who share your points.

Here's 3 reasons why they seem useful to me:

- CI itself performs some of these checks in the pipelines (loaded
modules, kernel taints before testing starts, likely more in the
future), so this could simplify fetching this data if we just parsed
the IGT lsfacts output.

- Simplicity. I can take a quick look into the runner logs, see a bunch
of successful tests, notice one of the tests left the system tainted
and the GPU dropped off the bus - uh-huh, at least I can quickly filter
down the interesting logs to what happened in tests near the reported
fact changes. This is nothing revolutionary that I could not divine
from dmesgs/outs/errs one way or another, but it sure is easy and
convenient, especially for people who are not kernel developers but
have to review and debug this stuff anyways.

- It also makes it easier to notice passing tests that leave the system
in an unclean state - for example, if a display drops off after a
successful test and what follows does not immediately use displays, so
some runs appear healthy, others less so (=> reasoning for noise runs).

So yes, I'd like to see this framework in IGTs. It's not strictly
necessary, but it would make our lives a little easier.

Thanks, Ryszard

> Thanks,
> Janusz
> 
> > 
> > Thank you,
> > 
> > Peter
> > 
> 
> 
> 
>