[Piglit] [PATCH][RFC] Add dmesg option for reboot policy

Mon Nov 23 04:48:35 PST 2015

On 02/11/15 16:57, yann.argotti at linux.intel.com wrote:
> This adds a policy which advises when user should reboot system to avoid
> noisy test results due to system becoming unstable, for instance, and
> therefore continues testing successfully.
> To do this, a new Dmesg class is proposed which is not filtering dmesg and
> monitors whether or not one of the following event occurs:
> - gpu reset failed (not just gpu reset happened, that happens way too
> often and many tests even provoke hangs intentionally)
> - gpu crash,
> - Oops:
> - BUG
> - lockdep splat that causes the locking validator to get disabled
> If one of these issues happen, piglit test execution is stopped
> -terminating test thread pool- and exit with code 3 to inform that reboot
> is advised.
> Then test execution resume, after rebooting system or not, is done like
> usually with command line parameter "resume".
>
> To call it, use command line parameter: --dmesg monitored

Hello Yann,

The rationale behind this patch is very sound and we need something like 
this. Here are however a list of nitpicks:

  - Please send patches with git send-email, otherwise, it makes it 
impossible for us to comment inline which is the usual process for patch 
review. Please re-send :)

- varaiable -> variable; double space after "when a reboot may be required"

  - I am not a big fan of changing the semantic of arguments that have 
been there forever. Can you think of a case where the user would not 
want the test to abort if we reach a state where we cannot trust the 
result? I am including Dylan on this. Also, if we are to keep these 
modes, can we rename the "simple" mode to "warning" and "monitored" to 
"abort"? This would make more sense and clearly state the goal of the 
modes.

  - I see a potential issue with the dmesg-monitoring class, it can 
generate false positives. Indeed, it completely ignores the fact that 
there may be more than one GPU installed and that it may come from a 
different vendor. It would be good to know which gpu the test got run on 
and only track this gpu for bugs. This is however a complex task and 
would require env_dump to be done properly. I guess we could pass an 
argument to env_dump to track the gpu-vendor-dependent codepath and 
leave the general cases (Oops, BUG and lockdep) up to python. Anyway, 
better safe than sorry in the mean time!

  - I am not a big fan of using dmesg to get information out of the 
kernel because the output format may change. How about listening for 
udev events coming from the gpu or reading i915_error_state/i915_wedged ?

That's it for the first round :)

Thanks for the proposal!
Martin