[PATCH i-g-t] RFC: igt-runner fact checking
Peter Senna Tschudin
peter.senna at linux.intel.com
Wed Oct 30 07:15:04 UTC 2024
Dear Janusz,
Thank you for the review!
I believe that I can clarify the following points. The igt_runner_facts are:
- primarily a logging feature: to measure and report changes to the environment
- generic: I build the argument based on a real example, but I am making igt_runner_facts generic
- time based: knowing when a fact changes matter to help us debug issues
- not test requirements. The line is blurry, but we are after what can unexpectedly change on the environment. Think of the disappearing gpu problem.
On 28.10.2024 13:53, Krzysztofik, Janusz wrote:
> On Monday, 28 October 2024 08:08:52 CET Peter Senna Tschudin wrote:
>> WHAT: THIS IS AN RFC
>>
>> These functions collect facts, such as "Is the i915 kmod loaded?" when
>> igt-runner starts and then recheck the facts after every test is run.
>> Currently the code will abort on fact changes, but this can obviously
>> be altered to suit our needs.
>
> I think it would be more helpful if we tried to restore a clean environment,
> not only check if it hadn't changed.
That is not the point. We currently work on the assumption that every test is a well behaved citizen. But because we do not measure the cleanness of the environment, we do not actually know how clean the system is.
>
>>
>> TODO
>>
>> I have only implemented facts for checking if gpu kmods are loaded.
>
> Tests can do their job regardless of any GPU modules loaded or not, I
> believe.
No, and when a test causes trouble downstream it is very hard to debug because we have no idea about what changed. Ask me how I know is time consuming :-)
>
>> I
>> want facts for:
>> - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>> model, codename
>
> What test scenarios you think we may expect to affect hardware configuration?
I know of two: pre production GPUs may vanish from the PCI bus, and require a reboot to fix. Some tests may make the GPU unresponsive in ways that are recoverable without a reboot.
But you are walking an orthogonal path here. The proposal is to have a mechanism to measure how clean the environment is, and to log changes to help developers debug issues.
>
>> - DRM Cards: association between card number and pci address
>
> That may change, and tests should still work, unless a DRM card number is
> specified in device filter, but that should be avoided, at least in CI. Tests
> can rescan for actual DRM device numbers.
>
>>
>> In the future I want to divide facts into two groups: static and dynamic.
>> Static facts are these I talk about here, facts that are expected to be
>> relevant to all tests such as which GPUs are present at the PCI bus.
>>
>> Dynamic facts are those defined by the tests themselves. The idea is to
>> create a convenient mechanism for tests to define facts, and to have
>> igt-runner to "dynamically" add these before start running the tests.
>
> Why igt_runner? Now individual tests decide if the environment, including
> available GPU devices, meets their specific requirements, and those
> requirements may be different for each test. Each test knows best its
> requirements and can check if met. Why are you going to check all
> conditions, collected from all tests that are going to be executed, before
> execution of each test, regardless of its actual requirements?
I only care about igt_runner because the problems I am trying to detect only manifest when at least two tests run. Test 1 causes a problem that prevents test 2 from running successfully. Sometimes there are a hundred tests between test 1 and test 2...
I do not know where you took the idea that igt_runner facts will "check all conditions, collected from all tests that are going to be executed, before execution of each test". It will not.
igt_runner facts will check facts, such as the ones I explicitly describe on the commit message.
>
>>
>> I also want to have a section on results.json to include the facts.
>>
>> WHY
>>
>> Currently igt-runner expects each test to clean up after themselves to
>> not change the "state" of the machine. However, tests do change the
>> "state" of the machine causing negative side effects downstream.
>
> And that's what can be wrong, I believe, and something to be fixed. Tests
> should restore a clean environment. Tests know best what they change, should
> then know best what needs to be restored, and should do that. If they don't
> then we should fix them.
Yes! Having the facts measured and reported will help us identify others that need fixing.
>
>> As an example, at the time of writing, "igt at core_hotunplug@hotrebind"
>> will load gpu kmods and will not unload them after it is done. Then
>> trying to run "igt at xe_module_load@reload-no-display" will fail because
>> the i915 driver is loaded.
>
> Ok, now I can see what issue you are trying to resolve, but you are missing
> the point, I believe. That's an issue of core_hotunplug unexpectedly
> exercising i915 instead of xe driver in the first place, and that's what we
> should try to fix.
I am well on track, thanks.
The first step to improve an issue is to measure the problem. Currently we have tests tainting the environment and it is a lot of work to detect it. My proposal is to measure changes to the environment and log them explicitly to help us debug issues.
>
> Since core_hotunplug test uses DRIVER_ANY, that can happen if no GPU modules
> are loaded. If there was a driver already bound to a GPU device then the test
> would use that device, if not then it loads all available GPU drivers first,
> including i915 and xe. As soon as i915 is bound to the GPU, xe driver is
> blocked, hardware agnostic tests start exercising i915 instead of xe, and xe
> specific tests either skip, or maybe fail on unexpected and unrecognised
> conditions, unforeseen then impossible to address correctly. As soon as we
> have root cause of that issue resolved, other tests should work as expected
> again.
>
> If you want i915 not interfere with xe in CI.xe* runs then you may try one of:
> - xe in front of i915 in lib/drmtest.c:modules[]:973 (shouldn't hurt i915 CI
> runs since xe is not built for them),
> - # CONFIG_DRM_I915 not set in the CI.xe* kernel .config (already different
> from i915),
> - module_blacklist=i915 via the CI.xe* kernel command line, or
> - IGT_FORCE_DRIVER=xe via the CI.xe* environment variables.
>From my point of view this is unimportant in the context here. I am suggesting a generic logging mechanism to make it explicit when a test misbehaves. You are suggesting how to fix one specific problem.
>
>>
>> HOW
>>
>> Notice that when creating a new fact we provide a name for the fact
>> and a function that can get the value. Also notice that we cannot
>> change a fact after it is initially set. The rationale is that the
>> facts will be set before the first test is run, then they will be
>> verified after each test.
>
>
>>
>> EXAMPLE
>>
>> $ cat test-list
>> igt at core_hotunplug@hotrebind
>> igt at xe_module_load@reload-no-display
>>
>> $ sudo IGT_TEST_ROOT='/home/gta/UPSTREAM/igt-gpu-tools/build/tests/' \
>> ./build/runner/igt_runner --per-test-timeout 400 \
>> --test-list /home/gta/igt/test-list /home/gta/igt/30
>>
>> Number of facts: 3
>> Fact kernel.kmod_is_loaded.amdgpu: false
>> Fact kernel.kmod_is_loaded.i915: false
>> Fact kernel.kmod_is_loaded.xe: false
>> [117.461584] [1/2] core_hotunplug (hotrebind)
>> [130.057183] Facts changed by last test:
>> kernel.kmod_is_loaded.amdgpu changed from false to true.
>> kernel.kmod_is_loaded.i915 changed from false to true.
>> kernel.kmod_is_loaded.xe changed from false to true.
>>
>> (igt_runner:2246) CRITICAL: Test assertion failure function execute, file
> ../runner/executor.c:2717:
>> (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>> (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>> Stack trace:
>> #0 ../lib/igt_core.c:2051 __igt_fail_assert()
>> #1 ../runner/executor.c:1020 execute()
>> #2 ../runner/runner.c:40 main()
>> #3 ../sysdeps/nptl/libc_start_call_main.h:58 __libc_start_call_main()
>> #4 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34()
>> #5 [_start+0x25]
>> Test (null) failed.
>> **** DEBUG ****
>> (igt_runner:2246) INFO: Number of facts: 3
>> (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.amdgpu: false
>> (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.i915: false
>> (igt_runner:2246) INFO: Fact kernel.kmod_is_loaded.xe: false
>> (igt_runner:2246) CRITICAL: Test assertion failure function execute, file
> ../runner/executor.c:2717:
>> (igt_runner:2246) CRITICAL: Failed assertion: fact_changes == NULL
>> (igt_runner:2246) CRITICAL: Last errno: 9, Bad file descriptor
>> (igt_runner:2246) igt_core-INFO: Stack trace:
>> (igt_runner:2246) igt_core-INFO: #0 ../lib/igt_core.c:2051
> __igt_fail_assert()
>> (igt_runner:2246) igt_core-INFO: #1 ../runner/executor.c:1020 execute()
>> (igt_runner:2246) igt_core-INFO: #2 ../runner/runner.c:40 main()
>> (igt_runner:2246) igt_core-INFO: #3 ../sysdeps/nptl/
> libc_start_call_main.h:58 __libc_start_call_main()
>> (igt_runner:2246) igt_core-INFO: #4 ../csu/libc-start.c:128
> __libc_start_main@@GLIBC_2.34()
>> (igt_runner:2246) igt_core-INFO: #5 [_start+0x25]
>> **** END ****
>> FAIL (-1.000s)
>
> From CI perspective, what failed? A test executed just before the failure,
> potentially already reported as successful? A test about to be run next?
> In fact, the core_hotunplug test hasn't failed, only some modules were loaded,
> and unwanted i915 among them and bound to the device, because the CI
> environment was not protected from that unwanted scenario to happen.
>
> Besides, you would have to negotiate with CI and Bug Filing, and also with
> developers resolving issues, if and how they would handle such "failures",
> reported for (null) tests.
I mention on the first paragraph that aborting was an arbitrary decision. I am convinced now that aborting is not a good idea, and the facts will only add logging about changes to the environment.
Thanks a million!
>
> Thanks,
> Janusz
>
>>
>> Signed-off-by: Peter Senna Tschudin <peter.senna at linux.intel.com>
>> ---
>> runner/executor.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 191 insertions(+)
>>
>> diff --git a/runner/executor.c b/runner/executor.c
>> index ac73e1dde..f151898ef 100644
>> --- a/runner/executor.c
>> +++ b/runner/executor.c
>> @@ -36,10 +36,185 @@
>> #include "output_strings.h"
>> #include "runnercomms.h"
>>
>> +#include "igt_kmod.h"
>> +
>> #define KMSG_HEADER "[IGT] "
>> #define KMSG_WARN 4
>> #define GRACEFUL_EXITCODE -SIGHUP
>>
>> +
>> +/* Fact checking functions.
>> + *
>> + * WHAT: THIS IS AN RFC
>> + *
>> + * These functions collect facts, such as "Is the i915 kmod loaded?" when
>> + * igt-runner starts and then recheck the facts after every test is run.
>> + * Currently the code will abort on fact changes, but this can obviously
>> + * be altered to suit our needs.
>> + *
>> + * TODO
>> + *
>> + * I have only implemented facts for checking if gpu kmods are loaded. I
>> + * want facts for:
>> + * - GPUs on the PCI bus: pci address and card info: pciid, vendor,
>> + * model, codename
>> + * - DRM Cards: association between card number and pci address
>> + *
>> + * In the future I want to divide facts into two groups: static and
> dynamic.
>> + * Static facts are these I talk about here, facts that are expected to be
>> + * relevant to all tests such as which GPUs are present at the PCI bus.
>> + *
>> + * Dynamic facts are those defined by the tests themselves. The idea is to
>> + * create a convenient mechanism for tests to define facts, and to have
>> + * igt-runner to "dynamically" add these before start running the tests.
>> + *
>> + * I also want to have a section on results.json to include the facts.
>> + *
>> + * WHY
>> + *
>> + * Currently igt-runner expects each test to clean up after themselves to
>> + * not change the "state" of the machine. However, tests do change the
>> + * "state" of the machine causing negative side effects downstream.
>> + * As an example, at the time of writing, "igt at core_hotunplug@hotrebind"
>> + * will load gpu kmods and will not unload them after it is done. Then
>> + * trying to run "igt at xe_module_load@reload-no-display" will fail because
>> + * the i915 driver is loaded.
>> + *
>> + * HOW
>> + *
>> + * Notice that when creating a new fact we provide a name for the fact
>> + * and a function that can get the value. Also notice that we cannot
>> + * change a fact after it is initially set. The rationale is that the
>> + * facts will be set before the first test is run, then they will be
>> + * verified after each test.
>> + *
>> + */
>> +
>> +/* Type for the fact checking function. It returns the value. Returns
>> + * NULL on failure.
>> + */
>> +typedef const char* (*fact_checker)(const char *name);
>> +
>> +/* Struct for each fact with a name, a pointer to the checker function,
>> + * a bool value, and a string value.
>> + */
>> +typedef struct {
>> + const char *name;
>> + const char *initial_value; /* Supposed to be set only once */
>> + fact_checker checker;
>> +} igt_runner_fact;
>> +
>> +/* Struct for the fact list */
>> +typedef struct {
>> + igt_runner_fact *facts;
>> + int num_facts;
>> +} igt_runner_fact_list;
>> +
>> +/* Add a new fact if it doesn't exist. Return false on failure such as when
>> + * the fact already exist.
>> + */
>> +static bool set_fact(igt_runner_fact_list *list, const char *name,
> fact_checker checker)
>> +{
>> + /* Check if the fact exist */
>> + for (int i = 0; i < list->num_facts; i++) {
>> + if (strcmp(list->facts[i].name, name) == 0)
>> + return false;
>> + }
>> +
>> + /* Add the fact */
>> + list->num_facts++;
>> + list->facts = realloc(list->facts, list->num_facts *
> sizeof(igt_runner_fact));
>> + list->facts[list->num_facts - 1].name = name;
>> + list->facts[list->num_facts - 1].checker = checker;
>> + list->facts[list->num_facts - 1].initial_value = checker(name);
>> +
>> + /* Check if initial_value is NULL */
>> + if (list->facts[list->num_facts - 1].initial_value == NULL)
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> +/* Get a fact by name. Return NULL if fact is not found. */
>> +static igt_runner_fact *get_fact(igt_runner_fact_list *list, const char
> *name)
>> +{
>> + for (int i = 0; i < list->num_facts; i++) {
>> + if (strcmp(list->facts[i].name, name) == 0)
>> + return &list->facts[i];
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +/* print_all_facts: pretty print all facts */
>> +static void print_all_facts(igt_runner_fact_list *list)
>> +{
>> + igt_info("Number of facts: %d\n", list->num_facts);
>> + for (int i = 0; i < list->num_facts; i++)
>> + igt_info("Fact %s: %s\n", list->facts[i].name, list-
>> facts[i].initial_value);
>> +}
>> +
>> +/* get_fact_changes() check if there are fact changes compared to *list.
>> + * Returns:
>> + * - NULL if there are no changes
>> + * - A string with the changes if there are any
>> + *
>> + * Free the returned string with g_free()
>> + */
>> +static char *get_fact_changes(igt_runner_fact_list *list)
>> +{
>> + char *changes = NULL;
>> + const char *new_value = NULL;
>> +
>> + for (int i = 0; i < list->num_facts; i++) {
>> + new_value = list->facts[i].checker(list->facts[i].name);
>> + if (list->facts[i].initial_value != new_value) {
>> + if (changes == NULL) {
>> + changes = malloc(1);
>> + changes[0] = '\0';
>> + }
>> +
>> + changes = g_strdup_printf("%s %s changed from %s
> to %s.\n",
>> + changes,
>> + list-
>> facts[i].name,
>> + list-
>> facts[i].initial_value,
>> + new_value);
>> + }
>> + }
>> +
>> + return changes;
>> +}
>> +
>> +/* Check if the kernel module at the end of name is loaded */
>> +static const char *fact_check_kmod_loaded(const char *name)
>> +{
>> + const char *module = NULL;
>> +
>> + /* Name will be something like "kernel.kmod_is_loaded.i915".
>> + * Save the string after the last dot on module.
>> + */
>> + module = strrchr(name, '.');
>> + if (module == NULL)
>> + return NULL;
>> +
>> + module++; /* Skip the dot */
>> +
>> + /* Call igt_kmod_is_loaded and convert bool to string */
>> + return igt_kmod_is_loaded(module) ? "true" : "false";
>> +}
>> +
>> +/* Function to init all the facts */
>> +static void init_facts(igt_runner_fact_list *list)
>> +{
>> + /* Init the list */
>> + list->num_facts = 0;
>> + list->facts = NULL;
>> +
>> + set_fact(list, "kernel.kmod_is_loaded.amdgpu",
> fact_check_kmod_loaded);
>> + set_fact(list, "kernel.kmod_is_loaded.i915",
> fact_check_kmod_loaded);
>> + set_fact(list, "kernel.kmod_is_loaded.xe", fact_check_kmod_loaded);
>> +}
>> +
>> static struct {
>> int *fds;
>> size_t num_dogs;
>> @@ -2306,6 +2481,13 @@ bool execute(struct execute_state *state,
>> sigset_t sigmask;
>> double time_spent = 0.0;
>> bool status = true;
>> + igt_runner_fact_list facts;
>> +
>> + /* Init the facts */
>> + init_facts(&facts);
>> +
>> + /* Print facts */
>> + print_all_facts(&facts);
>>
>> if (state->dry) {
>> outf("Dry run, not executing. Invoke igt_resume if you
> want to execute.\n");
>> @@ -2437,6 +2619,7 @@ bool execute(struct execute_state *state,
>> char *job_name;
>> int result;
>> bool already_written = false;
>> + char *fact_changes = NULL;
>>
>> if (should_die_because_signal(sigfd)) {
>> status = false;
>> @@ -2525,6 +2708,14 @@ bool execute(struct execute_state *state,
>> state->time_left = time_left;
>> return execute(state, settings, job_list);
>> }
>> +
>> + /* Abort if facts changed */
>> + fact_changes = get_fact_changes(&facts);
>> + if (fact_changes) {
>> + /* Print fact_changes */
>> + errf("Facts changed by last test:\n%s\n",
> fact_changes);
>> + igt_assert(fact_changes == NULL);
>> + }
>> }
>>
>> if ((timefd = openat(resdirfd, "endtime.txt", O_CREAT | O_WRONLY |
> O_EXCL, 0666)) >= 0) {
>>
>
More information about the igt-dev
mailing list