[igt-dev] [i-g-t, v3, 1/2] runner: Add support for aborting on network failure
Tomi Sarvela
tomi.p.sarvela at intel.com
Tue Sep 10 07:31:58 UTC 2019
On 9/9/19 2:27 PM, Petri Latvala wrote:
> On Mon, Sep 09, 2019 at 01:53:44PM +0300, Arkadiusz Hiler wrote:
>> On Mon, May 20, 2019 at 01:16:22PM +0300, Petri Latvala wrote:
>>> If the network goes down while testing, CI tends to interpret that as
>>> the device being down, cutting its power after a while. This causes an
>>> incomplete to an innocent test, increasing noise in the results.
>>>
>>> A new flag to --abort-on-monitored-error, "ping", uses liboping to
>>> ping a host configured in .igtrc with one ping after each test
>>> execution and aborts the run if there is no reply in a hardcoded
>>> amount of time.
>>>
>>> v2:
>>> - Use a higher timeout
>>> - Allow hostname configuration from environment
>>> v3:
>>> - Use runner_c_args for holding c args for runner
>>> - Handle runner's meson options in runner/meson.build
>>> - Instead of one ping with 20 second timeout, ping with 1 second timeout
>>> for a duration of 20 seconds
>>>
>>> Signed-off-by: Petri Latvala <petri.latvala at intel.com>
>>> Cc: Arkadiusz Hiler <arkadiusz.hiler at intel.com>
>>> Cc: Martin Peres <martin.peres at linux.intel.com>
>>> Cc: Tomi Sarvela <tomi.p.sarvela at intel.com>
>>> Cc: Daniel Vetter <daniel at ffwll.ch>
>>> ---
>>> meson_options.txt | 6 ++
>>> runner/executor.c | 150 +++++++++++++++++++++++++++++++++++++++++++++
>>> runner/meson.build | 14 ++++-
>>> runner/settings.c | 4 ++
>>> runner/settings.h | 5 +-
>>> 5 files changed, 176 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/meson_options.txt b/meson_options.txt
>>> index 888efe56..935f883e 100644
>>> --- a/meson_options.txt
>>> +++ b/meson_options.txt
>>> @@ -58,6 +58,12 @@ option('build_runner',
>>> choices : ['auto', 'true', 'false'],
>>> description : 'Build test runner')
>>>
>>> +option('with_oping',
>>> + type : 'combo',
>>> + value : 'auto',
>>> + choices : ['auto', 'true', 'false'],
>>> + description : 'Build igt_runner with liboping')
>>> +
>>> option('use_rpath',
>>> type : 'boolean',
>>> value : false,
>>> diff --git a/runner/executor.c b/runner/executor.c
>>> index 7e5fbe8f..c07e53fa 100644
>>> --- a/runner/executor.c
>>> +++ b/runner/executor.c
>>> @@ -1,6 +1,10 @@
>>> #include <errno.h>
>>> #include <fcntl.h>
>>> +#include <glib.h>
>>> #include <linux/watchdog.h>
>>> +#if HAVE_OPING
>>> +#include <oping.h>
>>> +#endif
>>> #include <signal.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> @@ -16,6 +20,7 @@
>>> #include <time.h>
>>> #include <unistd.h>
>>>
>>> +#include "igt_aux.h"
>>> #include "igt_core.h"
>>> #include "executor.h"
>>> #include "output_strings.h"
>>> @@ -108,6 +113,147 @@ static void ping_watchdogs(void)
>>> }
>>> }
>>>
>>> +#if HAVE_OPING
>>> +static pingobj_t *pingobj = NULL;
>>> +
>>> +static bool load_ping_config_from_file(void)
>>> +{
>>> + char *key_file_env = NULL;
>>> + char *key_file_loc = NULL;
>>> + GError *error = NULL;
>>> + GKeyFile *key_file = NULL;
>>> + const char *ping_hostname;
>>> + int ret;
>>> +
>>> + /* Determine igt config path */
>>> + key_file_env = getenv("IGT_CONFIG_PATH");
>>> + if (key_file_env) {
>>> + key_file_loc = strdup(key_file_env);
>>> + } else {
>>> + asprintf(&key_file_loc, "%s/.igtrc", g_get_home_dir());
>>> + }
>>> +
>>> + /* Load igt config file */
>>> + key_file = g_key_file_new();
>>> + ret = g_key_file_load_from_file(key_file, key_file_loc,
>>> + G_KEY_FILE_NONE, &error);
>>> + free(key_file_loc);
>>> +
>>> + if (!ret) {
>>> + g_error_free(error);
>>> + g_key_file_free(key_file);
>>> + return false;
>>> + }
>>> +
>>> + g_clear_error(&error);
>>
>> put that in
>>
>> GKeyFile *igt_open_igtrc();
>>
>> and use it in igt_core, so we have only one implementation
>>
>>> +
>>> + ping_hostname =
>>> + g_key_file_get_string(key_file, "DUT",
>>> + "PingHostName", &error);
>>> +
>>> + g_clear_error(&error);
>>> + g_key_file_free(key_file);
>>> +
>>> + if (!ping_hostname)
>>> + return false;
>>> +
>>> + if (ping_host_add(pingobj, ping_hostname)) {
>>> + fprintf(stderr,
>>> + "abort on ping: Cannot use hostname from config file\n");
>>> + return false;
>>> + }
>>> +
>>> + return true;
>>> +}
>>> +
>>> +static bool load_ping_config_from_env(void)
>>> +{
>>> + const char *ping_hostname;
>>> +
>>> + ping_hostname = getenv("IGT_PING_HOSTNAME");
>>> + if (!ping_hostname)
>>> + return false;
>>> +
>>> + if (ping_host_add(pingobj, ping_hostname)) {
>>> + fprintf(stderr,
>>> + "abort on ping: Cannot use hostname from environment\n");
>>> + return false;
>>> + }
>>> +
>>> + return true;
>>> +}
>>> +
>>> +static bool can_ping(void)
>>> +{
>>> + /*
>>> + * On some hosts, getting network back up after suspend takes
>>> + * upwards of 10 seconds. 20 seconds should be enough to see
>>> + * if network comes back at all, and hopefully not too long to
>>> + * make external monitoring freak out.
>>> + */
>>> + igt_until_timeout(20) {
>>
>> make 20 a #define
>>
>>> + pingobj_iter_t *iter;
>>> +
>>> + ping_send(pingobj);
>>> +
>>> + for (iter = ping_iterator_get(pingobj);
>>> + iter != NULL;
>>> + iter = ping_iterator_next(iter)) {
>>> + double latency;
>>> + size_t len = sizeof(latency);
>>> +
>>> + ping_iterator_get_info(iter,
>>> + PING_INFO_LATENCY,
>>> + &latency,
>>> + &len);
>>> + if (latency >= 0.0)
>>> + return true;
>>> + }
>>> + }
>>> +
>>> + return false;
>>> +}
>>> +
>>> +#endif
>>> +
>>> +static void ping_config(void)
>>> +{
>>> +#if HAVE_OPING
>>> + double timeout = 1.0;
>>
>> single_attempt_timeout
>>
>>> +
>>> + if (pingobj)
>>> + return;
>>> +
>>> + pingobj = ping_construct();
>>> +
>>> + /* Try env first, then config file */
>>> + if (!load_ping_config_from_env() && !load_ping_config_from_file()) {
>>> + fprintf(stderr,
>>> + "abort on ping: No host to ping configured\n");
>>> + ping_destroy(pingobj);
>>> + pingobj = NULL;
>>> + return;
>>> + }
>>> +
>>> + ping_setopt(pingobj, PING_OPT_TIMEOUT, &timeout);
>>> +#endif
>>> +}
>>> +
>>> +static char *handle_ping(void)
>>> +{
>>> +#if HAVE_OPING
>>> + if (pingobj && !can_ping()) {
>>> + char *reason;
>>> +
>>> + asprintf(&reason,
>>> + "Ping host did not respond to ping, network down");
>>> + return reason;
>>> + }
>>> +#endif
>>> +
>>> + return NULL;
>>> +}
>>> +
>>> static char *handle_lockdep(void)
>>> {
>>> const char *header = "Lockdep not active\n\n/proc/lockdep_stats contents:\n";
>>> @@ -197,6 +343,7 @@ static const struct {
>>> } abort_handlers[] = {
>>> { ABORT_LOCKDEP, handle_lockdep },
>>> { ABORT_TAINT, handle_taint },
>>> + { ABORT_PING, handle_ping },
>>> { 0, 0 },
>>> };
>>
>> So the handlers are used need_to_abort():
>> * when we start execution
>> * after each tests
>>
>> What worries me here are the suspend tests. You have mentioned that
>> getting network up can take up to 17s, so potentially we are adding
>> 17s * numer_of_executed_suspend_tests to the total execution time.
>>
>> Have you checked the actual impact?
>
> No, let's take a look at that after I send v4.
>
> Even with that extra overhead, an argument can be made that getting
> incompletes without clear indicators why they happen is worse.
>
>
>> Also this does not cover any network issues that happen when executing
>> longer tests - we still can get killed by any external watchdogs.
>
> Yeah, and we don't get easily machine-parseable timing data on how
> long those running tests were fine for. Note to self: Make runner dump
> periodical still-alive marks to journal. Unrelated to this patch
> though...
>
> Needs to be said that this patch won't make much of a difference if
> you only look at the passrates or amount of incompletes. What this
> does is allowing us to close that one silly semi-catchall cibuglog
> filter for incompletes without good data, and thus forcing proper
> issue binning.
>
>>
>> What is the maximum ping timeout on those external watchdogs? Maybe we
>> should get strict on tests' time budget for BAT/shards first...
>
> Don't know actually. Tomi?
SSH timeout (TCP timeout) is first, ~2 minutes. To make sure that the
external watchdog doesn't mess up live system, it only activates when
the target is offline 6 minutes (pinged several times during this time).
Tomi
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
More information about the igt-dev
mailing list