[Intel-gfx] [PATCH igt] lib: Check and report if a subtest triggers a new kernel taint

Chris Wilson chris at chris-wilson.co.uk
Wed Nov 29 13:23:11 UTC 2017


Quoting Szwichtenberg, Radoslaw (2017-11-29 13:14:52)
> On Wed, 2017-11-29 at 12:40 +0000, Chris Wilson wrote:
> > Quoting Chris Wilson (2017-11-29 12:30:23)
> > > Checking for a tainted kernel is a convenient way to see if the test
> > > generated a critical error such as an oops or a machine check.
> > > 
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > > Cc: Radoslaw Szwichtenberg <radoslaw.szwichtenberg at intel.com>
> > > ---
> > > diff --git a/lib/igt_kernel_taint.c b/lib/igt_kernel_taint.c
> > > new file mode 100644
> > > index 00000000..86d9cd20
> > > --- /dev/null
> > > +++ b/lib/igt_kernel_taint.c
> > > @@ -0,0 +1,95 @@
> > > +/*
> > > + * Copyright 2017 Intel Corporation
> > > + *
> > > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice (including the next
> > > + * paragraph) shall be included in all copies or substantial portions of the
> > > + * Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > > + * IN THE SOFTWARE.
> > > + */
> > > +
> > > +#include <unistd.h>
> > > +#include <fcntl.h>
> > > +
> > > +#include "igt.h"
> > > +#include "igt_kernel_taint.h"
> > > +#include "igt_sysfs.h"
> > > +
> > > +#define BIT(x) (1ul << (x))
> > > +
> > > +static const struct kernel_taint {
> > > +       const char *msg;
> > > +       unsigned int flags;
> > > +} taints[] = {
> > > +       { "Non-GPL module loaded" },
> > > +       { "Forced module load" },
> > > +       { "Unsafe SMP processor" },
> > > +       { "Forced module unload" },
> > > +       { "Machine Check Exception", TAINT_WARN },
> > > +       { "Bad page detected", TAINT_ERROR },
> > > +       { "Tainted by user request", TAINT_WARN },
> > 
> > Since unsafe modparams trigger this taint and we still use them
> > extensively, we should probably ignore it.
> > 
> > > +       { "System is on fire", TAINT_ERROR },
> > > +       { "ACPI DSDT has been overridden by user" },
> > > +       { "OOPS", TAINT_ERROR },
> > > +       { "Staging driver loaded; are you mad?" },
> > > +       { "Severe firmware bug workaround active", TAINT_WARN },
> > > +       { "Out-of-tree module loaded" },
> > > +       { "Unsigned module loaded" },
> > > +       { "Soft-lockup detected", TAINT_WARN },
> > > +       { "Kernel has been live patched" },
> > > +};
> > > +
> > > +unsigned long igt_read_kernel_taint(void)
> > 
> > One thing I haven't checked is whether we can clear the kernel taints.
> > At the moment, once we see an oops, we never report a second test
> > generating another oops.
> > -Chris
> 
> I guess clearing kernel taints is not needed when you hit an oops; you
> should probably stop executing tests and reboot the machine, right?
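For reference, a minimal sketch of the before/after comparison the patch is after, assuming the mask is read from /proc/sys/kernel/tainted (a decimal bitmask, one bit per taint). parse_taint(), read_taint(), new_taints() and TAINT_USER_BIT are illustrative names, not the patch's igt_read_kernel_taint() API. As far as I know, the kernel only lets that sysctl add taint bits, never clear them, so the best a test can do is compare against a baseline taken before it runs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Bit 6 of the taint mask ("Tainted by user request", 'U'). Unsafe
 * module parameters set it, so callers may want to mask it out. */
#define TAINT_USER_BIT 6

/* Parse the decimal bitmask found in /proc/sys/kernel/tainted. */
static unsigned long parse_taint(const char *buf)
{
	assert(buf);
	return strtoul(buf, NULL, 10);
}

/* Sample the current taint mask; an unreadable file counts as no taint. */
static unsigned long read_taint(void)
{
	char buf[32] = "";
	FILE *f = fopen("/proc/sys/kernel/tainted", "r");

	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_taint(buf);
}

/* Taint bits present in @after but not in @before, minus @ignore.
 * A bit already set in @before can never show up here again, which is
 * why a second oops goes unreported once the first one has happened. */
static unsigned long new_taints(unsigned long before, unsigned long after,
				unsigned long ignore)
{
	return after & ~before & ~ignore;
}
```

Around each subtest, a runner would take `before = read_taint()`, run the subtest, then report `new_taints(before, read_taint(), 1ul << TAINT_USER_BIT)` if it is non-zero.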

An oops in the driver tends to stop igt pretty hard. A good rule of
thumb is indeed to abandon all hope and reboot. I'm thinking that, with
this sort of early-warning detection in place, we can use the
kernel_taint when we do detect a persistent error, e.g. abandon the run
if a flip times out, or if we fail to park or reset the GPU. The aim is
to make that catastrophic error stand out and not pollute other test
results.
-Chris
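One way to read the "abandon the run" idea, sketched in C under stated assumptions: the runner samples the taint mask after each subtest and stops as soon as a subtest raises a new taint from a fatal set. FATAL_TAINTS, run_until_fatal_taint() and the fake taint source are hypothetical, not igt API, and which bits count as fatal is a policy assumption. The bit numbers follow the kernel's tainted-kernels documentation (4 machine check, 5 bad page, 7 oops/BUG, 9 warning). Errors that raise no taint, such as a flip timeout, would still need igt to flag them itself.

```c
#include <assert.h>
#include <stddef.h>

/* Taint bit numbers as documented for the kernel's tainted mask. */
#define TAINT_MACHINE_CHECK_BIT 4	/* 'M' */
#define TAINT_BAD_PAGE_BIT      5	/* 'B' */
#define TAINT_DIE_BIT           7	/* 'D': an OOPS or BUG happened */

/* Assumed policy: any of these means the machine needs a reboot. */
#define FATAL_TAINTS ((1ul << TAINT_MACHINE_CHECK_BIT) | \
		      (1ul << TAINT_BAD_PAGE_BIT) | \
		      (1ul << TAINT_DIE_BIT))

/* Hypothetical hooks: one subtest, and a sampler of the taint mask. */
typedef void (*subtest_fn)(void);
typedef unsigned long (*taint_fn)(void);

/* Run @count subtests in order, abandoning the run right after the first
 * one that raises a new fatal taint. Returns how many subtests ran. */
static size_t run_until_fatal_taint(const subtest_fn *tests, size_t count,
				    taint_fn read_taint)
{
	unsigned long seen;
	size_t i;

	assert(read_taint);
	seen = read_taint();
	for (i = 0; i < count; i++) {
		unsigned long now;

		tests[i]();
		now = read_taint();
		if (now & ~seen & FATAL_TAINTS)
			return i + 1;	/* stop: later results would be polluted */
		seen = now;
	}
	return count;
}

/* Fake taint source so the policy can be exercised without a kernel. */
static unsigned long fake_taint;
static unsigned long fake_read_taint(void) { return fake_taint; }
static void passing_subtest(void) { }
static void warning_subtest(void) { fake_taint |= 1ul << 9; }	/* 'W' */
static void oopsing_subtest(void) { fake_taint |= 1ul << TAINT_DIE_BIT; }
```

With the fake source, a warning alone does not stop the run, while an oops does. Running an oopsing subtest a second time shows the limitation raised earlier in the thread: once bit 7 is in the baseline, the second oops does not register.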

