[PATCH i-g-t v6] runner/executor: Abort when child process is killed by a signal

Wed Oct 9 14:46:36 UTC 2024

Hi Peter,
On 2024-10-03 at 15:27:37 +0200, Peter Senna Tschudin wrote:
> Manually killing a test process results in igt-runner silently marking the
> test as incomplete. Change the behavior to abort verbosely when a test is
> killed.
> 
> In order for the new behavior to work, child termination is probed on
> every iteration of the while loop inside monitor_output(). Add the
> bool test_timed_out to track when igt_runner is intentionally terminating
> a test, and do not interfere with it.
> 

imho you could have that behaviour also in place checking for a signal,
you do not need separate waitpid() call.

> Tested by:
>  - using the --per-test-timeout flag and checking that results.json labels
>    the test as a timeout.
>  - manually killing a test process and checking that results.json labels
>    the test as an abort with a message stating the test was killed.
> 
>  v6: - clear code comments
>      - move child termination probing closer to where sigfd is handled
>      - add a bool test_timed_out to ensure we will not interfere with
>        igt_runner intentionally killing tests.
>      - move the aborting code to outside the while loop
> 
>  v5: do not use sigdescr_np() as it seems to be a fairly new lib function
>      that does not compile on older Ubuntu
>  v4: improve abort code path to not interfere with igt-runner timeouts
>  v3: do not interfere with igt-runner killing tests due to timeout and
>      diskspace
>  v2: fix race condition
> 
> Cc: Petri Latvala <adrinael at adrinael.net>
> Cc: Kamil Konieczny <kamil.konieczny at linux.intel.com>
> Signed-off-by: Peter Senna Tschudin <peter.senna at linux.intel.com>
> ---
>  runner/executor.c | 29 +++++++++++++++++++++++++++--
>  1 file changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index ac73e1dde..69b4ed939 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -888,12 +888,15 @@ static int monitor_output(pid_t child,
>  	const int interval_length = 1;
>  	int wd_timeout;
>  	int killed = 0; /* 0 if not killed, signal number otherwise */
> +	bool child_reaped = false;
> +	bool child_killed_by_signal = false;
>  	struct timespec time_beg, time_now, time_last_activity, time_last_subtest, time_killed;
>  	unsigned long taints = 0;
>  	bool aborting = false;
>  	size_t disk_usage = 0;
>  	bool socket_comms_used = false; /* whether the test actually uses comms */
>  	bool results_received = false; /* whether we already have test results that might need overriding if we detect an abort condition */
> +	bool test_timed_out = false;
>  
>  	igt_gettime(&time_beg);
>  	time_last_activity = time_last_subtest = time_killed = time_beg;
> @@ -1233,6 +1236,14 @@ static int monitor_output(pid_t child,
>  			}
>  		}
>  
> +		/* Always check for abort conditions */
> +		if (child == waitpid(child, &status, WNOHANG)) {
> +			child_reaped = true;
> +			if (WIFSIGNALED(status)) {
> +				child_killed_by_signal = true;
> +				killed = WTERMSIG(status);
> +			}
> +		}

imho this waitpid() above makes a child logic based on sigfd superflous
or at least racy, I am not sure it is correct.

Petri do you remember if igt had a problems with runner waiting
for already killed test? Or that timeout triggered after a test
was killed long ago?

>  		if (sigfd >= 0 && FD_ISSET(sigfd, &set)) {
>  			double time;
>  
> @@ -1241,7 +1252,12 @@ static int monitor_output(pid_t child,
>  				errf("Error reading from signalfd: %m\n");
>  				continue;
>  			} else if (siginfo.ssi_signo == SIGCHLD) {
> -				if (child != waitpid(child, &status, WNOHANG)) {
> +				if (!child_reaped) {
> +					/* Was child killed since we last checked? */
> +					if (child == waitpid(child, &status, WNOHANG))
> +						child_reaped = true;
> +				}
> +				if (!child_reaped) {
>  					errf("Failed to reap child\n");
>  					status = 9999;
>  				} else if (WIFEXITED(status)) {

Here code is checking status and one of if-else checks for signal,
you could use it for the same purpose without hunk /* Always check for...*/
so it will be just enough to add here:

	} else if (WIFSIGNALED(status)) {
		status = -WTERMSIG(status);
		child_killed_by_signal = true;
	}

and make a change based on this. Or just check status (avoid new var)
for being less than zero?

In summary, I am not convinced your change go in good direction.
Instead of manually observing a bahaviour it would be better to
write new test for igt_runner which will demonstrate why old runner
cannot cope with incoming signals, that will prove that your patch
solves a problem.

Regards,
Kamil

> @@ -1303,7 +1319,6 @@ static int monitor_output(pid_t child,
>  							fdatasync(outputs[_F_JOURNAL]);
>  					}
>  				}
> -
>  				aborting = true;
>  				killed = SIGQUIT;
>  				if (!kill_child(killed, child))
> @@ -1447,6 +1462,7 @@ static int monitor_output(pid_t child,
>  						 disk_usage);
>  
>  		if (timeout_reason) {
> +			test_timed_out = true;
>  			if (killed == SIGKILL) {
>  				/* Nothing that can be done, really. Let's tell the caller we want to abort. */
>  
> @@ -1485,6 +1501,15 @@ static int monitor_output(pid_t child,
>  		}
>  	}
>  
> +	if (!test_timed_out && child_killed_by_signal) {
> +		sprintf(buf, "Test terminated by a signal %s (%d).\n",
> +				strsignal(killed), -killed);
> +		errf("%s", buf);
> +
> +		*abortreason = strdup(buf);
> +		aborting = true;
> +	}
> +
>  	dump_dmesg(kmsgfd, outputs[_F_DMESG]);
>  	if (settings->sync)
>  		fdatasync(outputs[_F_DMESG]);
> -- 
> 2.34.1
>