[PATCH i-g-t v5] runner/executor: Detect when child process is killed by a signal
Peter Senna Tschudin
peter.senna at linux.intel.com
Wed Sep 11 07:06:12 UTC 2024
Hi Kamil,
Thank you for your review.
On 10.09.2024 17:58, Kamil Konieczny wrote:
> Hi Peter,
> On 2024-09-03 at 14:05:05 +0200, Peter Senna Tschudin wrote:
>> Make igt-runner aware about tests being killed by signals. Before this
>> patch, manually killing a test process would result in igt-runner silently
>> marking the test as incomplete.
>>
>> Now igt-runner aborts the run verbosely. As an example the following was
>> extracted from results.json:
>> This test caused an abort condition: Test terminated by a signal Killed (-9).
>>
>> v5: do not use sigdescr_np() as it seems to be a fairly new lib function that does
>> not compile on older Ubuntu
>> v4: improve abort code path to not interfere with igt-runner timeouts
>> v3: do not interfere with igt-runner killing tests due to timeout and diskspace
>> v2: fix race condition
>>
>> Cc: Petri Latvala <adrinael at adrinael.net>
>> Cc: Kamil Konieczny <kamil.konieczny at linux.intel.com>
>> Signed-off-by: Peter Senna Tschudin <peter.senna at intel.com>
>
> Your signed-off-by is different from your e-mail:
> Peter Senna Tschudin <peter.senna at linux.intel.com>
That was a feature, not a bug.
>
> checkpatch.pl script warns:
>
> 100: WARNING:FROM_SIGN_OFF_MISMATCH: From:/Signed-off-by: email address mismatch: 'From: Peter Senna Tschudin <peter.senna at linux.intel.com>' != 'Signed-off-by: Peter Senna Tschudin <peter.senna at intel.com>'
Will fix and not repeat.
>
>> ---
>> runner/executor.c | 38 +++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 37 insertions(+), 1 deletion(-)
>>
>> diff --git a/runner/executor.c b/runner/executor.c
>> index ac73e1dde..990d932f3 100644
>> --- a/runner/executor.c
>> +++ b/runner/executor.c
>> @@ -888,6 +888,8 @@ static int monitor_output(pid_t child,
>> const int interval_length = 1;
>> int wd_timeout;
>> int killed = 0; /* 0 if not killed, signal number otherwise */
>> + bool child_reaped = false;
>> + bool child_killed_by_signal = false;
>> struct timespec time_beg, time_now, time_last_activity, time_last_subtest, time_killed;
>> unsigned long taints = 0;
>> bool aborting = false;
>> @@ -960,6 +962,25 @@ static int monitor_output(pid_t child,
>>
>> igt_gettime(&time_now);
>>
>> + /* Testing for !killed to prevent aborting too early after igt-runner
>> + * decides to kill a process.
>> + */
>> + if (!killed && (child == waitpid(child, &status, WNOHANG))) {
>> + child_reaped = true;
>> + if (WIFSIGNALED(status)) {
>> + child_killed_by_signal = true;
>> + killed = WTERMSIG(status);
>> +
>> + /*
>> + * Do not abort just yet, because igt-runner can kill the test
>> + * due to a timeout for example. Aborting here prevents
>> + * igt-runner from reporting a timeout. The code that aborts
>> + * the run after the test was killed is at the end of the
>> + * while() loop.
>> + */
>> + }
>> + }
>> +
>
> Why is it here? Imho better place it at second hunk.
Please notice that the problem is that igt-runner does not check if a child ended due to a signal, but it notices when a child ends. The problem with your suggestion is that the loop _may_ decide to exercise the old behavior and simply mark the test as incomplete.
My proposal is to always check for child termination due to a signal, and hence unconditional detection before all other logic inside the loop. However there are corner cases, such as excessive disk space consumption by a test, in which igt-runner will signal a test, and will check for the child being signaled. To not interfere with these corner cases, the detection of a signal and taking the action of aborting the run are at different places.
>
>> /* TODO: Refactor these handlers to their own functions */
>> if (outfd >= 0 && FD_ISSET(outfd, &set)) {
>> char *newline;
>> @@ -1241,7 +1262,11 @@ static int monitor_output(pid_t child,
>> errf("Error reading from signalfd: %m\n");
>> continue;
>> } else if (siginfo.ssi_signo == SIGCHLD) {
>> - if (child != waitpid(child, &status, WNOHANG)) {
>
> This code here do what you wrote at first hunk:
> if (child != waitpid(child, &status, WNOHANG)) {
> // fatal err, set status=9999
> } else { // all 'else if' checks 'child == '
> }
>
> The only difference is that your code calls waitpid() unconditionally
> before loop. Could you write some testing code in one simple test
Exactly. Calling waitpid() unconditionally, and early enough to catch all signals.
> and make checks on trybot? e.g. to not rely that your code will
> catch such scenarios if they are rarly seen but to test them in
> predictable way. You could also add here some debugs.
I did not understand what you want to test. The "condition" here is igt-runner itself killing a test due to excessive disk space consumption for example.
Please let me know how to proceed.
Thank you!
[...]
More information about the igt-dev
mailing list