[Intel-gfx] [RFC] How to assign blame when multiple rings are hung

Chris Wilson chris at chris-wilson.co.uk
Tue Jan 28 13:39:40 CET 2014


On Tue, Jan 28, 2014 at 01:16:34PM +0200, Mika Kuoppala wrote:
> Hi,
> 
> I am working on a patchset [1] which originally aimed to fix
> how we find the guilty batches with ppgtt.
> 
> But during the review it became clear that I don't have a clear
> idea of what the behaviour should be when multiple rings encounter
> a problematic batch at the same time.
> 
> The following i-g-t patch adds a test which asserts that
> both contexts get blamed for having a (problematic) batch active
> during the hang.
> 
> The patch set [1] will fail with this test case, as it will
> blame only the first context that injected the hang.
> We would need to change the test as follows for it to pass:
> -       assert_reset_status(fd[1], 0, RS_BATCH_ACTIVE);
> +       assert_reset_status(fd[1], 0, RS_BATCH_PENDING);
> 
> I lean towards both contexts getting their batch_active count
> increased, as other rings might gain contexts and we can
> already reset individual rings instead of the whole GPU.
> 
> But we need to pick one, so that's why the RFC.
> Thoughts?
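
For reference, assert_reset_status() in the test presumably boils down to
reading the per-context reset stats and comparing the batch_active /
batch_pending counters. A minimal sketch of that query (not the actual
i-g-t helper; error handling omitted, and assuming the uapi header is
reachable as <drm/i915_drm.h>) could look like:

  #include <string.h>
  #include <sys/ioctl.h>
  #include <drm/i915_drm.h>

  /* Read the reset stats for one context on one fd (ctx_id 0 is the
   * default context).  batch_active counts batches blamed for a hang on
   * this context; batch_pending counts batches lost while merely queued
   * behind someone else's hang. */
  static int get_reset_stats(int fd, __u32 ctx_id,
                             struct drm_i915_reset_stats *stats)
  {
          memset(stats, 0, sizeof(*stats));
          stats->ctx_id = ctx_id;
          return ioctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, stats);
  }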

Assuming idealised code, both get blamed today. Which gets blamed first
is decided at random (whichever accumulates hangscore quickest); that
triggers either a full GPU reset and a replay of the unaffected batches,
or a ring reset (in which case we should not touch the other contexts on
the other rings). Then, once the GPU is running again, it will hang on
the other ring, we will detect that, and the blame game starts all over
again. We do have a fairness issue whereby a sequence of bad batches on
one ring may prevent us from detecting a hang on the other - but if we
have replay working, then we carry over the hangscore as well and so the
blame should be fairly apportioned.
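
Concretely, and purely as a sketch of what the test could then expect
(not what it currently asserts): after both rings have hung and the
resets have run their course, both contexts should report batch_active
incremented.  A fragment of such a check, reusing the get_reset_stats()
helper sketched above inside the test body:

  #include <assert.h>

  struct drm_i915_reset_stats s0, s1;

  /* fd[0]/fd[1] as in the quoted test: one hanging batch on each ring,
   * each submitted against the default context of its fd. */
  get_reset_stats(fd[0], 0, &s0);
  get_reset_stats(fd[1], 0, &s1);
  assert(s0.batch_active > 0);
  assert(s1.batch_active > 0);  /* both guilty, per the argument above */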
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


