[Intel-gfx] [PATCH igt] igt/drv_hangman: Use manual error-state generation

Thu Oct 20 13:14:30 UTC 2016

On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote:
> On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > > For the basic error state, we only desire that an error state be created
> > > following a hang. For that purpose, we do not need a real hang (slow
> > > 6-12s) but can inject one instead (fast <1s).
> > > 
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > 
> > Should we instead speed up hangcheck? I think there's lots of value in
> > making sure not just error dumping, but also hang detection works somewhat
> > in BAT. Since if it doesn't, any attempt at a full run will lead to pretty
> > serious disasters. And I have this dream that BAT is the gating thing
> > deciding whether a patch series deserves a complete pre-merge run ;-)
> 
> We have full-hang detection in BAT elsewhere as well. This particular
> test was only asking the question "do we generate an error state", hence
> why I felt it was safe to just do that and skip a simulated hang.

Hm, is it worth it then in BAT? Or does the other test not check whether
the error capture part was mildly successful? Might be worth it to just
combine them (in BAT) for even more time saved. Either way ack on this.

> > But since this is a controlled environment we could make hangcheck
> > super-fast at timing out with some debugfs knob. Would probably also help
> > a lot with speeding up the gazillion of testcases in gem_reset_stats.
> 
> I have considered i915.hangcheck_interval_ms many a time. It is not just
> the interval but the hangcheck score threshold to consider. If we can
> trust our activity detection, we would be safe with a hangcheck every
> jiffy (at some overhead mind you), but we would declare a dos too soon.

Yeah, we'd still need to tune hangcheck in normal time. But for regression
testing I think a linear speed-up would do (i.e. just scaling the hangcheck
timeout, but leaving all the cadence and accumulation logic intact). And
ofc we'd also scale the dos loads with the same linear factor. Maybe a
--normal-time option in gem_reset_stats or similar could then be used for
manual testing of changes, or re-tuning.
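
To make the linear speed-up idea concrete, here is a minimal sketch in C. Nothing in it is taken from i915 or igt: the struct, the function names and the numbers are all hypothetical, and the "one score point per tick" model is a deliberate simplification of the real accumulation logic. It only shows that dividing the time base (and the test load) by the same factor leaves the threshold untouched.

```c
#include <stddef.h>

/* Hypothetical parameters; not the real i915 hangcheck state. */
struct hangcheck_params {
	unsigned int interval_ms;     /* period between hangcheck ticks */
	unsigned int score_threshold; /* accumulated score that declares a hang */
};

/*
 * Scale only the time base by 'speedup'. The score threshold (the
 * cadence and accumulation logic) is left exactly as it was.
 */
static struct hangcheck_params
scale_hangcheck(struct hangcheck_params normal, unsigned int speedup)
{
	struct hangcheck_params fast = normal;

	if (speedup == 0)
		speedup = 1; /* treat 0 as "no speed-up" */

	fast.interval_ms = normal.interval_ms / speedup;
	if (fast.interval_ms == 0)
		fast.interval_ms = 1; /* never tick faster than 1ms in this sketch */

	return fast;
}

/* Scale a test's load duration by the same linear factor. */
static unsigned int scale_load_ms(unsigned int load_ms, unsigned int speedup)
{
	if (speedup == 0)
		speedup = 1;
	return load_ms / speedup;
}

/*
 * Simplified detection time: assumes a stuck engine gains one score
 * point per tick, so a hang is declared after 'score_threshold' ticks.
 */
static unsigned int detect_ms(struct hangcheck_params p)
{
	return p.interval_ms * p.score_threshold;
}
```

With a made-up normal setting of 1500ms per tick and a threshold of 4, a speed-up of 10 takes the modelled detection time from 6000ms to 600ms, and an 8000ms load becomes 800ms. A --normal-time style option would then simply pass a speed-up of 1.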
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch