[PATCH i-g-t v2] runner: Relax timeout reduction on soft lockup

Kamil Konieczny kamil.konieczny at linux.intel.com
Tue Jul 8 18:15:00 UTC 2025


Hi Janusz,
On 2025-07-08 at 15:04:15 +0200, Janusz Krzysztofik wrote:
> In case of soft lockups, it might be helpful from a root cause analysis
> perspective to see whether the test was still able to complete despite
> triggering the soft lockup warning, or whether the soft lockup seems
> unrecoverable without killing the test. For that to be possible, igt_runner
> should not kill the test too promptly if a soft-lockup-related kernel
> taint is detected.
> 
> On kernel taints, igt_runner currently decreases the per-test and inactivity
> timeouts by a factor of 10. Let it check whether the taint is caused by a
> soft lockup and, in those cases, decrease the timeouts only by a factor
> of 2.
> 
> v2: Define symbols for taint bits and use them (Kamil)
> 
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>

LGTM
Reviewed-by: Kamil Konieczny <kamil.konieczny at linux.intel.com>

> ---
>  lib/igt_taints.c  |  8 ++++----
>  lib/igt_taints.h  |  6 ++++++
>  runner/executor.c | 14 ++++++++++----
>  3 files changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/igt_taints.c b/lib/igt_taints.c
> index 6b36d11cba..1d238fd2af 100644
> --- a/lib/igt_taints.c
> +++ b/lib/igt_taints.c
> @@ -13,10 +13,10 @@ static const struct {
>  	int bad;
>  	const char *explanation;
>  } abort_taints[] = {
> -  { 4, 1, "TAINT_MACHINE_CHECK: Processor reported a Machine Check Exception."},
> -  { 5, 1, "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags." },
> -  { 7, 1, "TAINT_DIE: Kernel has died - BUG/OOPS." },
> -  { 9, 1, "TAINT_WARN: WARN_ON has happened." },
> +  { TAINT_MACHINE_CHECK, 1, "TAINT_MACHINE_CHECK: Processor reported a Machine Check Exception."},
> +  { TAINT_BAD_PAGE,	 1, "TAINT_BAD_PAGE: Bad page reference or an unexpected page flags." },
> +  { TAINT_DIE,		 1, "TAINT_DIE: Kernel has died - BUG/OOPS." },
> +  { TAINT_WARN,		 1, "TAINT_WARN: WARN_ON has happened." },
>    { -1 }
>  };
>  
> diff --git a/lib/igt_taints.h b/lib/igt_taints.h
> index be4195c5aa..50c4cf16f8 100644
> --- a/lib/igt_taints.h
> +++ b/lib/igt_taints.h
> @@ -6,6 +6,12 @@
>  #ifndef __IGT_TAINTS_H__
>  #define __IGT_TAINTS_H__
>  
> +#define	TAINT_MACHINE_CHECK	 4
> +#define	TAINT_BAD_PAGE		 5
> +#define	TAINT_DIE		 7
> +#define	TAINT_WARN		 9
> +#define	TAINT_SOFT_LOCKUP	14
> +
>  unsigned long igt_kernel_tainted(unsigned long *taints);
>  const char *igt_explain_taints(unsigned long *taints);
>  
> diff --git a/runner/executor.c b/runner/executor.c
> index 13180a0a46..847abe481a 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -871,10 +871,15 @@ static const char *need_to_timeout(struct settings *settings,
>  	if (settings->abort_mask & ABORT_TAINT &&
>  	    is_tainted(taints)) {
>  		/* list of timeouts that may postpone immediate kill on taint */
> -		if (settings->per_test_timeout || settings->inactivity_timeout)
> -			decrease = 10;
> -		else
> +		if (settings->per_test_timeout || settings->inactivity_timeout) {
> +			if (is_tainted(taints) == (1 << TAINT_WARN) &&
> +			    taints & (1 << TAINT_SOFT_LOCKUP))
> +				decrease = 2;
> +			else
> +				decrease = 10;
> +		} else {
>  			return "Killing the test because the kernel is tainted.\n";
> +		}
>  	}
>  
>  	if (settings->per_test_timeout != 0 &&
> @@ -1526,8 +1531,9 @@ static int monitor_output(pid_t child,
>  			sigfd = -1; /* we are dying, no signal handling for now */
>  		}
>  
> +		igt_kernel_tainted(&taints);
>  		timeout_reason = need_to_timeout(settings, killed,
> -						 igt_kernel_tainted(&taints),
> +						 taints,
>  						 igt_time_elapsed(&time_last_activity, &time_now),
>  						 igt_time_elapsed(&time_last_subtest, &time_now),
>  						 igt_time_elapsed(&time_killed, &time_now),
> -- 
> 2.50.0
> 


More information about the Intel-xe mailing list