Add env info to igt_runner (was: Re: [PATCH i-g-t 4/4] lib/igt_device_scan: Fix scan vs bind/unbind/reload)

Knop, Ryszard ryszard.knop at intel.com
Fri Dec 20 21:15:02 UTC 2024


On Fri, 2024-12-20 at 11:37 -0600, Lucas De Marchi wrote:
> On Fri, Dec 20, 2024 at 06:05:22PM +0100, Kamil Konieczny wrote:
> > Hi Lucas,
> > On 2024-12-19 at 11:24:09 -0600, Lucas De Marchi wrote:
> > > hijacking the thread and adding some people to Cc for the igt_runner question.
> > > Previously In-Reply-To: <rnw3q6mhthnwyvowvszr2gllyjtbb2mozk4em272xlmkvm7pyl at szbhtg3sd7d7>
> > > 
> > > On Thu, Dec 19, 2024 at 10:35:00AM -0600, Lucas De Marchi wrote:
> > > > On Wed, Dec 18, 2024 at 07:34:19AM +0100, Zbigniew Kempczyński wrote:
> > > > > On Tue, Dec 17, 2024 at 09:13:24PM -0800, Lucas De Marchi wrote:
> > > > > > There's no guarantee a card will end up with the same device node when
> > > > > > modules are loaded/unloaded and drivers bound/unbound. There's a
> > > > > > fundamental issue with the way igt handles this, and it's also
> > > > > > puzzling: from the logs it looks like the device vanished from the
> > > > > > bus, when in reality it's just the SW state being out of sync with
> > > > > > what the kernel is exporting.
> > > > > > 
> > > > > > Re-scanning when trying to match a device is not expensive compared to
> > > > > > what most tests are doing, so simply force it to occur whenever trying
> > > > > > to match a card.
> > > > > 
> > > > > I should also comment on the above. It is generally true, but I've
> > > > > noticed that getting attributes can be expensive - it may take up to a
> > > > > few seconds. That's why I've added some attributes we don't fetch from
> > > > > udev (see is_on_blacklist()). If I'm not mistaken, fetching 'config'
> > > > > was the reason we limited the attributes we fetch.
> > > > 
> > > > why would we get all attributes and exclude some?  Shouldn't we get only
> > > > the attributes we actually use? AFAIK this logic is basically used by
> > > > --device/IGT_DEVICE, right? What filters do we normally use?
> > > > 
> > > > I usually pass the pci slot (because I know that won't change
> > > > dynamically and cause surprises). Apparently CI passes vendor/devid:
> > > > 
> > > > 	export IGT_DEVICE=pci:vendor=$1,device=$2
> > > > 
> > > > (but it seems to vary depending on pipeline)
> > > > 
> > > > Some devs pass the device node directly too as in a lot of places
> > > > there's only ever card0 possible.
> > > 
> > > 
> > > Could we dump the env and args somewhere so we know how igt_runner or
> > > individual tests are being called without looking at the CI pipeline
> > > sources? I was thinking about either having that info in the stdout
> > > output of igt_runner or in the json. Another possibility would be in
> > > dmesg, but I'm not sure it's a good option. Thoughts?
> > 
> > Not only that, also parameters used to start igt_runner,
> > what was in .igtrc file (if any), current wall time,
> 
> does CI actually have an .igtrc? We can add, but I'd prioritize
> things that are used and that we don't have annotated anywhere (with
> easy/public access).

Yes, but only on hosts working with Chamelium (for display output
mappings). It should(tm) not be present on other DUTs.


> > the testlist prepared to run, free memory and free disk space,
> > and the metadata file for igt_resume, which enables re-executing
> > the run with the prepared testlist.
> 
> talking about igt_resume, there may be some issues doing this at
> the igt_runner level: the env may not match from one test to another if
> it changed between when the run started and when it finished.
> 
> 1) it may have gone through igt_resume after a reboot (hopefully in the
>     same machine)
> 
> 2) for shards we slice the testlist and give it to different machines.
>     Ideally they have the same env, config, etc, but that also is not
>     guaranteed.
> 
>     Checking random tests in https://intel-gfx-ci.01.org/tree/intel-xe/shards-all.html?
> 
>     https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@kms_async_flips@async-flip-with-page-flip-events-atomic@pipe-a-dp-2-4-rc-ccs.html
>     https://intel-gfx-ci.01.org/tree/intel-xe/xe-2404-26e6464dff2b3fe53049bd3b6e426cec43beb165/shard-bmg-1/igt@core_auth@basic-auth.html
> 
>     Do I understand it right that we simply have multiple resultsXX.json
>     for different runs so it's fine to create it at the global level?
>     What about a resume (1)?

There's also the case where we collate results across multiple machines
into a single larger JSON file that is then used for vis generation on
01.org (a single JSON will always cover the same kernel/IGT/scenario
combo, but possibly on different hosts). A per-test env collection
would be ideal, but it can also take a lot of extra space if you're
just collecting all the envs, so be careful with this.

In addition, envs may contain CI access tokens and internal data that
we might not exactly want to publish - if you are going to implement
this, add a configurable env key allowlist + blocklist.

You can store all env keys not in the blocklist, but if a key is not in
the allowlist, replace its value with "[ REDACTED ]" or something like
that. This way we can explicitly strip out security-sensitive vars and
useless values ($LS_COLORS, as we all know, is very relevant for CI
tests :), while people can still see which vars are available that we
have not yet covered one way or another, and request extending the
allowlist.
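A minimal sketch of that scheme in Python (the key names below are
purely illustrative, not actual igt_runner options or CI variables):

```python
# Hypothetical env filtering: blocklisted keys are dropped entirely,
# keys outside the allowlist keep their name but have the value
# replaced, so readers can still see which vars were set.
ALLOWLIST = {"IGT_DEVICE", "IGT_FORCE_DRIVER", "PATH"}
BLOCKLIST = {"CI_JOB_TOKEN", "LS_COLORS"}

def filter_env(env):
    out = {}
    for key, value in env.items():
        if key in BLOCKLIST:
            continue  # security-sensitive or useless: drop entirely
        # keep the key; redact the value unless explicitly allowed
        out[key] = value if key in ALLOWLIST else "[ REDACTED ]"
    return out
```

The resulting dict could then be dumped as the "environment" object in
the results JSON.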

Thanks, Ryszard

> > 
> > Also kernel config from /boot? Or should it be in the shard
> > run info (to avoid duplication)?
> 
> may be too much as we could extract it from the kernel used by CI
> since we have CONFIG_IKCONFIG=y. If we want igt_runner to collect this
> info and save in the results, then it should probably grab it from
> /proc/config.gz to make sure it's guaranteed to be in sync with the
> actual kernel being used.  Quick check on what we'd need:
> 
> $ # simulate grabbing the /proc/config.gz and piping it through base64
> $ # to be able to add in the json
> $ ./scripts/extract-ikconfig build64/arch/x86/boot/compressed/vmlinux | gzip | base64 > config.gz.b64
> $ ls -lh config.gz.b64
> -rw-rw-r-- 1 lucas lucas 70K Dec 20 09:30 config.gz.b64
> 
> 
> Humn... I would concentrate on things that aren't currently available
> anywhere.
> 
> > 
> > Maybe some other info, either igt_facts or lspci output?
> 
> for lspci output it seems there's already a TODO comment that nobody
> ever tackled :). And "options" may be a reference to what we are talking
> about here wrt env and args:
> 
> $ git grep -A2 -B2 lspci  runner/
> runner/resultgen.c-      * Result fields that are TODO:
> runner/resultgen.c-      *
> runner/resultgen.c:      * - lspci
> runner/resultgen.c-      * - options
> runner/resultgen.c-      */
> 
> Lucas De Marchi
> 
> > Should we ask also display team and our CI?
> > 
> > +cc Jari from display
> > 
> > Regards,
> > Kamil
> > 
> > > 
> > > My preferred option would be to have e.g.:
> > > 
> > > {
> > >   "__type__": "TestrunResult",
> > >   "results_version": 10,
> > >   "name": "xe-2403-995cd30a4e222b6a7b4b40c36219e4937fd7109e\/bat-bmg-1\/0",
> > >   "uname": "Linux bat-bmg-1 6.13.0-rc3-xe+ #1 SMP PREEMPT_DYNAMIC Thu Dec 19 14:40:51 UTC 2024 x86_64",
> > >   "time_elapsed": {
> > >     "__type__": "TimeAttribute",
> > >     "start": 1734621126.8734231,
> > >     "end": 1734621288.5994539
> > >   },
> > >   "environment": {
> > >     "IGT_DEVICE": ...
> > >     <any IGT_* env var>
> > >   },
> > >   "argv": [ ... ]
> > > 
> > > 
> > > Lucas De Marchi
