[Intel-gfx] [PATCH 0/4] [RFC] use HW watchdog timer
Daniel Vetter
daniel at ffwll.ch
Mon Jul 16 22:16:22 CEST 2012
On Mon, Jul 16, 2012 at 11:51:55AM -0700, Ben Widawsky wrote:
> This was my pet project for the last few days, but I have to take a
> break from working on it for now to do some real work ;-). The patches
> compile, and pass a basic test, but that's about it. There is still
> quite a bit of work left to make this useful. The easiest thing would be
> to tie this into error state.
>
> The idea is pretty simple. It uses the HW watchdog to set a timeout per
> batchbuffer, instead of a global software watchdog.
>
> Pros:
> * Potential for per batch, or ring watchdog values. I believe when/if we
> get to GPGPU workloads, this is particularly interesting.
> * Batch granularity hang detection. This mostly just makes hang
> detection and recovery a bit easier IMO.
>
> Cons:
> * Blit ring doesn't have an interrupt. This means we still need the
> software watchdog, and it makes hang detection more complex. I've been
> led to believe future HW *may* have this interrupt.
> * Semaphores
>
> I'm looking for feedback, mainly for Daniel, and Chris if this is worth
> pursuing further when I have more time. The idea would be to eventually
> use this to implement much of the ARB robustness requirements instead of
> doing a bunch of request list processing.
>
> Ben Widawsky (4):
> drm/i915: Use HW watchdog for each batch
> drm/i915: Turn on watchdog interrupts
> drm/i915: Add a breadcrumb
> drm/i915: Display the failing seqno
High-level drive-by review:
I think it's very important to separate hangs in the batch itself from
hangs in the ringbuffers, e.g. semaphore deadlocks. We should only blame
the former on the userspace-submitted batchbuffer. So on that ground I
think the following speudo-code check in the hangcheck code would be
simpler to implement and more robust:
/* Check whether we're outside of the ring. At worst this check presumes
* that the hang is in the ring, never the other way round. */
if (ACTHEAD != RING_HEAD)
/* Check whether ACTHEAD lies in any active ring. We can't check
* the object's gpu_domain, that might have been changed again
* already. */
for_each_active_buffer(buffer)
if (buffer->gtt_offset + buffer->size < ACTHEAD &&
buffer->gtt_offset >= ACTHEAD)
goto found;
Otoh I don't think your patches here are a completely lost cause. This
render watchdog could be used rather well for a per-patch short timeout to
kill one specific batchbuffer I guess. But right now I don't see an
immediate use-case ...
Yours, Daniel
--
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48
More information about the Intel-gfx
mailing list