[igt-dev] [PATCH i-g-t] runner: Also consider TAINT_MACHINE_CHECK as abortable taint

Wed Jun 5 16:54:11 UTC 2019

On Wed, Jun 05, 2019 at 03:53:54PM +0300, Petri Latvala wrote:
> On Wed, Jun 05, 2019 at 02:36:56PM +0200, Daniel Vetter wrote:
> > On Wed, Jun 05, 2019 at 03:16:07PM +0300, Petri Latvala wrote:
> > > Signed-off-by: Petri Latvala <petri.latvala at intel.com>
> > > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > 
> > I've seen lots of machines where these happen as a normal side-effect of
> > thermal throttling. For some value of "normal".
> > 
> > Do we really want to reboot on these? It could be like the network thing I
> > recently disabled, and then everyone started screaming because our
> > machines were constantly rebooting due to network cards/drivers
> > temporarily having a bad time (but usually recovering).
> 
> 
> I've seen some MCE log messages in dmesg, quite often on one of the
> BXTs for example. How often those MCE triggers caused a taint is another
> question.
> 
> Reading the mce code, it seems to be thermal _failure_ that causes a
> taint. And all of these add_taint() calls also use
> LOCKDEP_NOW_UNRELIABLE so we're already deep under the bus if we get
> that taint.

Hm, ok, if lockdep is gone then we reboot anyway. I guess ack from me then.

Since the goal is to throw these machines out, shouldn't we have a special
error for these? Instead of just "aborted", I mean.
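
Something along these lines is what I have in mind, as a rough sketch only.
The bit positions are the ones documented for /proc/sys/kernel/tainted; the
ABORT_TAINTS set, the function names and the reason strings are made up for
illustration and are not taken from the runner code.

```c
#include <stdio.h>

/* Bit positions as documented for /proc/sys/kernel/tainted. */
#define TAINT_MACHINE_CHECK (1UL << 4)
#define TAINT_BAD_PAGE      (1UL << 5)
#define TAINT_DIE           (1UL << 7)

/* Assumed abort set for this sketch, not copied from the runner. */
#define ABORT_TAINTS (TAINT_MACHINE_CHECK | TAINT_BAD_PAGE | TAINT_DIE)

/* Return only the taint bits that should stop the test run. */
unsigned long abortable_taints(unsigned long taints)
{
	return taints & ABORT_TAINTS;
}

/*
 * Hypothetical reason strings, so that a machine check can be reported
 * separately from other aborts. Returns NULL if nothing is abortable.
 */
const char *abort_reason(unsigned long taints)
{
	if (taints & TAINT_MACHINE_CHECK)
		return "machine check, suspect hardware";
	if (taints & TAINT_BAD_PAGE)
		return "bad page state";
	if (taints & TAINT_DIE)
		return "kernel oops";
	return NULL;
}

/* Read the current taint mask; returns 0 on success, -1 on failure. */
int read_taints(unsigned long *taints)
{
	FILE *f = fopen("/proc/sys/kernel/tainted", "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lu", taints) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

The machine check case is tested first so that a box with several taints
set still gets flagged as suspect hardware rather than a plain abort.
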
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch