[PATCH i-g-t v6] runner/executor: Abort when child process is killed by a signal
Kamil Konieczny
kamil.konieczny at linux.intel.com
Wed Oct 9 14:46:36 UTC 2024
Hi Peter,
On 2024-10-03 at 15:27:37 +0200, Peter Senna Tschudin wrote:
> Manually killing a test process results in igt-runner silently marking the
> test as incomplete. Change the behavior to abort verbosely when a test is
> killed.
>
> In order for the new behavior to work, child termination is probed on
> every iteration of the while loop inside monitor_output(). Add the
> bool test_timed_out to track when igt_runner is intentionally terminating
> a test, and do not interfere with it.
>
imho you could have that behaviour also in place checking for a signal,
you do not need separate waitpid() call.
> Tested by:
> - using the --per-test-timeout flag and checking that results.json labels
> the test as a timeout.
> - manually killing a test process and checking that results.json labels
> the test as an abort with a message stating the test was killed.
>
> v6: - clear code comments
> - move child termination probing closer to where sigfd is handled
> - add a bool test_timed_out to ensure we will not interfere with
> igt_runner intentionally killing tests.
> - move the aborting code to outside the while loop
>
> v5: do not use sigdescr_np() as it seems to be a fairly new lib function
> that does not compile on older Ubuntu
> v4: improve abort code path to not interfere with igt-runner timeouts
> v3: do not interfere with igt-runner killing tests due to timeout and
> diskspace
> v2: fix race condition
>
> Cc: Petri Latvala <adrinael at adrinael.net>
> Cc: Kamil Konieczny <kamil.konieczny at linux.intel.com>
> Signed-off-by: Peter Senna Tschudin <peter.senna at linux.intel.com>
> ---
> runner/executor.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/runner/executor.c b/runner/executor.c
> index ac73e1dde..69b4ed939 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -888,12 +888,15 @@ static int monitor_output(pid_t child,
> const int interval_length = 1;
> int wd_timeout;
> int killed = 0; /* 0 if not killed, signal number otherwise */
> + bool child_reaped = false;
> + bool child_killed_by_signal = false;
> struct timespec time_beg, time_now, time_last_activity, time_last_subtest, time_killed;
> unsigned long taints = 0;
> bool aborting = false;
> size_t disk_usage = 0;
> bool socket_comms_used = false; /* whether the test actually uses comms */
> bool results_received = false; /* whether we already have test results that might need overriding if we detect an abort condition */
> + bool test_timed_out = false;
>
> igt_gettime(&time_beg);
> time_last_activity = time_last_subtest = time_killed = time_beg;
> @@ -1233,6 +1236,14 @@ static int monitor_output(pid_t child,
> }
> }
>
> + /* Always check for abort conditions */
> + if (child == waitpid(child, &status, WNOHANG)) {
> + child_reaped = true;
> + if (WIFSIGNALED(status)) {
> + child_killed_by_signal = true;
> + killed = WTERMSIG(status);
> + }
> + }
imho this waitpid() above makes a child logic based on sigfd superflous
or at least racy, I am not sure it is correct.
Petri do you remember if igt had a problems with runner waiting
for already killed test? Or that timeout triggered after a test
was killed long ago?
> if (sigfd >= 0 && FD_ISSET(sigfd, &set)) {
> double time;
>
> @@ -1241,7 +1252,12 @@ static int monitor_output(pid_t child,
> errf("Error reading from signalfd: %m\n");
> continue;
> } else if (siginfo.ssi_signo == SIGCHLD) {
> - if (child != waitpid(child, &status, WNOHANG)) {
> + if (!child_reaped) {
> + /* Was child killed since we last checked? */
> + if (child == waitpid(child, &status, WNOHANG))
> + child_reaped = true;
> + }
> + if (!child_reaped) {
> errf("Failed to reap child\n");
> status = 9999;
> } else if (WIFEXITED(status)) {
Here code is checking status and one of if-else checks for signal,
you could use it for the same purpose without hunk /* Always check for...*/
so it will be just enough to add here:
} else if (WIFSIGNALED(status)) {
status = -WTERMSIG(status);
child_killed_by_signal = true;
}
and make a change based on this. Or just check status (avoid new var)
for being less than zero?
In summary, I am not convinced your change go in good direction.
Instead of manually observing a bahaviour it would be better to
write new test for igt_runner which will demonstrate why old runner
cannot cope with incoming signals, that will prove that your patch
solves a problem.
Regards,
Kamil
> @@ -1303,7 +1319,6 @@ static int monitor_output(pid_t child,
> fdatasync(outputs[_F_JOURNAL]);
> }
> }
> -
> aborting = true;
> killed = SIGQUIT;
> if (!kill_child(killed, child))
> @@ -1447,6 +1462,7 @@ static int monitor_output(pid_t child,
> disk_usage);
>
> if (timeout_reason) {
> + test_timed_out = true;
> if (killed == SIGKILL) {
> /* Nothing that can be done, really. Let's tell the caller we want to abort. */
>
> @@ -1485,6 +1501,15 @@ static int monitor_output(pid_t child,
> }
> }
>
> + if (!test_timed_out && child_killed_by_signal) {
> + sprintf(buf, "Test terminated by a signal %s (%d).\n",
> + strsignal(killed), -killed);
> + errf("%s", buf);
> +
> + *abortreason = strdup(buf);
> + aborting = true;
> + }
> +
> dump_dmesg(kmsgfd, outputs[_F_DMESG]);
> if (settings->sync)
> fdatasync(outputs[_F_DMESG]);
> --
> 2.34.1
>
More information about the igt-dev
mailing list