[Piglit] [PATCH] Add dmesg option for reboot policy

Wed Nov 25 07:31:59 PST 2015

On 25 November 2015 at 12:42, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Tue, Nov 24, 2015 at 02:10:34PM +0000, Emil Velikov wrote:
>> Hi Yan,
>>
>> The plan of having such a module is pretty sound.
>>
>> That said I think that the actual policy/implementation could use some tweaks.
>>
>> On 24 November 2015 at 12:14,  <yann.argotti at linux.intel.com> wrote:
>> > From: Yann Argotti <yann.argotti at linux.intel.com>
>> > Date: Tue, 24 Nov 2015 12:16:34 +0100
>> >
>> >  This adds a policy which advises when the user should reboot the system to
>> >  avoid noisy test results (due to the system becoming unstable, for
>> >  instance) and can therefore continue testing successfully. To do this, a
>> >  new Dmesg class is proposed which does not filter dmesg and monitors
>> >  whether one of the following events occurs:
>> >   - gpu reset failed (not just gpu reset happened, that happens way too
>> >     often and many tests even provoke hangs intentionally)
>> >   - gpu crash
>> >   - Oops:
>> >   - BUG
>> >   - lockdep splat that causes the locking validator to get disabled
>> >  If one of these issues happens, piglit test execution is stopped
>> >  (terminating the test thread pool) and piglit exits with code 3 to inform
>> >  that a reboot is advised. Test execution is then resumed as usual, whether
>> >  or not the system was rebooted, with the command line parameter "resume".
>> >
>> Shouldn't one check for the above issues and trigger only when the GPU
>> reset was not successful?
>> Otherwise the idea of robustness, webgl and friends goes down the drain.
>
> This is exactly the idea. i915.ko prints different garbage into dmesg when
> the gpu reset failed compared to when it succeeded. We can use that to
> make a sensible decision for when to reboot.
>
The initial comment left the impression that i915 prints two types of messages
 - gpu crash/lockup
 - gpu recovery failure

If so, shouldn't one check for both?

There is also the "something in the kernel went Oops/BUG, let's assume
it's the GPU" case, which doesn't sound great.
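
To make the distinction concrete, here is a rough Python sketch of the
kind of classification I have in mind. It is only an illustration: the
regular expressions are placeholders and not the actual strings i915
prints, and the names (classify, reboot_advised, reboot_on_any_hang) are
made up for this example rather than taken from the patch. Only the exit
code 3 comes from the patch description.

```python
import re

# Placeholder patterns: these are NOT the real i915 messages and would
# need to be checked against an actual dmesg before being used.
RESET_FAILED = re.compile(r"reset.*failed", re.IGNORECASE)
GPU_HANG = re.compile(r"gpu (hang|crash|lockup)", re.IGNORECASE)
KERNEL_FAULT = re.compile(r"\b(Oops|BUG):")

REBOOT_EXIT_CODE = 3  # exit code proposed in the patch description


def classify(line):
    """Return a coarse category for one dmesg line, or None."""
    if RESET_FAILED.search(line):
        return "reset-failed"
    if GPU_HANG.search(line):
        return "gpu-hang"
    if KERNEL_FAULT.search(line):
        return "kernel-fault"
    return None


def reboot_advised(lines, enabled=False, reboot_on_any_hang=False):
    """Decide whether a reboot should be advised for these dmesg lines.

    The policy is off unless explicitly enabled. A failed reset always
    triggers; a plain hang triggers only if reboot_on_any_hang is set.
    Kernel faults are classified but deliberately not treated as GPU
    problems here.
    """
    if not enabled:
        return False
    for line in lines:
        kind = classify(line)
        if kind == "reset-failed":
            return True
        if kind == "gpu-hang" and reboot_on_any_hang:
            return True
    return False
```

Keeping the hang and the failed-recovery cases separate lets the caller
pick the stricter behaviour (reboot on any hang) without assuming that
every Oops/BUG is the GPU's fault.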

> And I'd expect mesa testing to unconditionally reboot even for a
> successful reset, since we have a track record of slightly screwing up
> reset handling for some obscure features (like mis-setting some wa bits).
> So for testing mesa with piglit (where we never expect a gpu hang to
> happen) rebooting always is likely the right approach.
>
Btw, I've had about one gpu reset a month since I started doing the mesa
releases. Afaict all of them were successful, thus I never really
bothered reporting them :-) I could be extra 'lucky', so I'd second
Ilia's suggestion - let's keep this feature disabled by default for now.

> igt then has piles of testcases that intentionally hang the gpu, to
> validate all the reset logic. So in total there's no gap, at least for
> intel.
With my mesa experience (above), I believe you meant "in theory" there
is no gap ;-) Everything in this world has plenty of those - then
again I've been pretty happy with the way things run on my system.

Thanks for the ongoing work on igt and kernel testing in particular.
It's great when the rare GPU lockup doesn't bring down your whole
system (like nouveau sadly does).

Emil