[Intel-gfx] [RFC] How to assign blame when multiple rings are hung

Tue Jan 28 12:16:34 CET 2014

Hi,

I am working on a patchset [1] which originally aimed to fix
how we identify the guilty batches with ppgtt.

But during the review it became apparent that I don't have a clear
idea of what the behaviour should be when multiple rings encounter
a problematic batch at the same time.

The following i-g-t patch adds a test which asserts that
both contexts get the blame for having a (problematic) batch
active during the hang.

The patch set [1] will fail this test case, as it blames
only the first context that injected the hang.
We would need to change the test as follows for it to pass:
-       assert_reset_status(fd[1], 0, RS_BATCH_ACTIVE);
+       assert_reset_status(fd[1], 0, RS_BATCH_PENDING);

I lean towards both contexts getting their batch_active count
increased, as other rings might gain contexts and we could
already reset individual rings instead of the whole GPU.
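
To make the two options concrete, here is a toy model of the
accounting. It is only a sketch and not code from [1] or from i915:
the types, the policy enum and assign_blame() are all made up for
illustration, and it assumes the rings array is ordered by when
each hang was injected.

```c
/* Toy model only: none of these types or helpers exist in i915 or i-g-t. */
#include <stdbool.h>
#include <stddef.h>

/* The two candidate behaviours discussed in this RFC (names are hypothetical). */
enum blame_policy {
	BLAME_FIRST_ONLY,	/* only the first hung context is guilty */
	BLAME_ALL_HUNG,		/* every context with a hung batch is guilty */
};

/* Per-context counters, modelled after batch_active / batch_pending. */
struct ctx_stats {
	unsigned int batch_active;
	unsigned int batch_pending;
};

struct ring_state {
	bool hung;		/* the batch executing on this ring is problematic */
	struct ctx_stats *ctx;	/* context owning that batch, NULL if ring is idle */
};

/*
 * Walk the rings at reset time and bump the counters.
 * Assumption: rings[] is in hang-injection order, so "first" means
 * the lowest index with a hung batch.
 */
static void assign_blame(struct ring_state *rings, size_t n,
			 enum blame_policy policy)
{
	bool blamed = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!rings[i].ctx)
			continue;

		if (rings[i].hung && (policy == BLAME_ALL_HUNG || !blamed)) {
			rings[i].ctx->batch_active++;
			blamed = true;
		} else {
			/* Had work in flight, but is not held responsible. */
			rings[i].ctx->batch_pending++;
		}
	}
}
```

With two hung rings, BLAME_FIRST_ONLY leaves the second context
with only batch_pending increased, which matches the
RS_BATCH_PENDING variant of the assert above. BLAME_ALL_HUNG
increases batch_active for both, which is what the new subtest
expects.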

But we need to pick one, so that's why this is an RFC.
Thoughts?

--
[1]: https://github.com/mkuoppal/linux/commits/one_guilty

Mika Kuoppala (1):
  tests/gem_reset_stats: add subtest hang-render-and-<ring>

 tests/gem_reset_stats.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

-- 
1.7.9.5