[Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

Daniel Vetter daniel at ffwll.ch
Tue Aug 8 09:30:51 UTC 2017


On Tue, Aug 8, 2017 at 11:25 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-08-08 10:01:59)
>> On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > Quoting Daniel Vetter (2017-08-07 17:26:56)
>> >> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
>> >> > Quoting Daniel Vetter (2017-08-04 17:07:22)
>> >> > > We now have full (or a lot at least) igt running in beta CI, and snb
>> >> > > blt hangs are really unhappy:
>> >> > >
>> >> > > - drv_hangman at error-state-capture-blt and gem_exec_capture at capture-blt
>> >> > >   reliably result in insta-machine death when we try to reset the gpu,
>> >> > >   both on the CI snb and the one I have here.
>> >> > >
>> >> > > - Other testcases also randomly (and sometimes rather rarely) die on
>> >> > >   snb.
>> >> > >
>> >> > > We can't use the endless batch because that results in a reset failure
>> >> > > and wedged gpu, so also not really better.
>> >> >
>> >> > It shouldn't be the recursion, but the invalid instruction we use to try
>> >> > and trigger the hang quicker (otherwise hangcheck may see the advancing
>> >> > ACTHD and give us longer to escape the loop).
>> >> >
>> >> > In gem_exec_capture we shouldn't even need that invalid instruction, we
>> >> > just need the busy batch as we pull the trigger ourselves, and if that
>> >> > fails to reset a simple recursive batch we have some issues to resolve.
>> >>
>> >> Endless loop for hanging results in a reset failure on blt as described in
>> >> the commit message. We end up with a permanent and unrecoverable -EIO,
>> >> which is as deadly to CI as outright killing the machine.
>> >
>> > No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
>> > instruction as reported, after fixes:
>>
>> Well my gt2 disagreed, but I guess we can push your patches to igt and
>> then ask CI whether we need more.
>
> Fine, dug out the snb-gt2,
>
> [ickle at huronriver tests]$ sudo ./drv_hangman
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest error-state-sysfs-entry: SUCCESS (0.000s)
> Subtest error-state-basic: SUCCESS (0.004s)
> Subtest error-state-capture-render: SUCCESS (13.711s)
> Subtest error-state-capture-bsd: SUCCESS (8.006s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd1: SKIP (0.000s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd2: SKIP (0.000s)
> Subtest error-state-capture-blt: SUCCESS (6.049s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-vebox: SKIP (0.000s)
> Test requirement not met in function hangcheck_unterminated, file drv_hangman.c:213:
> Test requirement: gem_uses_full_ppgtt(device)
> Subtest hangcheck-unterminated: SKIP (0.000s)
> [ickle at huronriver tests]$ sudo ./gem_exec_capture
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest capture-render: SUCCESS (0.009s)
> Test requirement not met in function __real_main175, file gem_exec_capture.c:202:
> Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
> Subtest capture-bsd: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd1: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd2: SKIP (0.000s)
> Subtest capture-blt: SUCCESS (0.005s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-vebox: SKIP (0.000s)
>
> Seems solid to me.

OK, I'll retest once your patches have landed. It could very well be that
I screwed up something with my looping batch.
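
For reference, the kind of looping (recursive) batch discussed above can be sketched roughly as below. This is only an illustration, not the code from igt or from the patches in this thread: build_looping_batch() and its batch_gpu_addr parameter are hypothetical names, and a real test would let the kernel fill in the address through a relocation instead of passing it in. The opcode values follow the i915 MI_INSTR() encoding, and the two-dword MI_BATCH_BUFFER_START form is the gen6 (snb) layout; gen8+ uses a longer form with a 64-bit address.

```c
#include <stddef.h>
#include <stdint.h>

/* MI command encoding as used by i915: opcode in bits 28:23. */
#define MI_INSTR(opcode, flags)   (((uint32_t)(opcode) << 23) | (flags))
#define MI_NOOP                   MI_INSTR(0x00, 0)
#define MI_BATCH_BUFFER_END       MI_INSTR(0x0a, 0)
#define MI_BATCH_BUFFER_START     MI_INSTR(0x31, 0)
#define MI_BATCH_NON_SECURE_I965  (1u << 8)

/*
 * Hypothetical helper: fill 'batch' (at least 4 dwords) with a batch
 * that jumps back to its own start, so the ring stays busy until the
 * GPU is reset. 'batch_gpu_addr' is assumed to be the GPU address of
 * the batch itself. Returns the number of dwords written.
 */
static size_t build_looping_batch(uint32_t *batch, uint32_t batch_gpu_addr)
{
	size_t n = 0;

	/* Jump to the start of this same buffer: the endless loop. */
	batch[n++] = MI_BATCH_BUFFER_START | MI_BATCH_NON_SECURE_I965;
	batch[n++] = batch_gpu_addr;

	/* Never reached while the loop is intact; kept as a terminator. */
	batch[n++] = MI_BATCH_BUFFER_END;
	batch[n++] = MI_NOOP;

	return n;
}
```

Note that this sketch deliberately leaves out the invalid instruction Chris mentioned, so it only covers the plain busy-loop case.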
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

