[Intel-gfx] [PATCH] drm/i915: Advance seqno upon reseting the GPU following a hang

Fri May 10 17:02:03 CEST 2013

On Wed, May 8, 2013 at 4:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Wed, May 08, 2013 at 04:02:00PM +0200, Daniel Vetter wrote:
>> On Wed, May 08, 2013 at 02:29:30PM +0100, Chris Wilson wrote:
>> > There is an unlikely corner case whereby a lockless wait may not notice
>> > a GPU hang and reset, and so continue to wait for the device to advance
>> > beyond the chosen seqno. This of course may never happen as the waiter
>> > may be the only user. Instead, we can explicitly advance the device
>> > seqno to match the requests that are forcibly retired following the
>> > hang.
>> >
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>
>> This race is why the reset counter must always increase and can't just
>> flip-flop between the reset-in-progress and everything-works states.
>>
>> Now if we want to unwedge on resume we need to reconsider this, but imo it
>> would be easier to simply remember the reset counter before we wedge the
>> gpu and restore that one (incremented as if the gpu reset worked). We
>> already assume that wedged will never collide with a real reset counter,
>> so this should work.
>
> Agree that this is an unwedge-upon-resume issue, but my argument here is
> that this leaves the hardware state consistent with what we forcibly
> reset it to. From that perspective your suggestion is papering over this
> here bug and this is the neat solution.

Yeah, for the reset case I agree that just continuing in the sequence
would be more resilient. I'm still a bit unsure though what to do
across suspend/resume (where we currently force-reset the sequence
numbers, too). Maybe we need the poke-y stick there, too (in the form
of kicking waiters and incrementing the reset counter).
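
To make the two mechanisms discussed above concrete (a reset counter
that only ever increases, plus advancing the seqno to cover the
forcibly retired requests), here is a minimal userspace sketch in C.
It is an illustration under stated assumptions, not the i915 code:
struct gpu, begin_reset(), finish_reset() and check_wait() are
hypothetical names, and the odd-means-in-progress encoding is assumed
for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device state; not the real i915 structures. */
struct gpu {
    atomic_uint reset_counter; /* only ever increases; odd = reset in progress (assumed encoding) */
    atomic_uint seqno;         /* last seqno the device is considered to have completed */
};

enum wait_result { WAIT_DONE, WAIT_RESET, WAIT_PENDING };

/* Wrap-safe comparison: true once 'current' has reached or passed 'target'. */
static bool seqno_passed(uint32_t current, uint32_t target)
{
    return (int32_t)(current - target) >= 0;
}

/* Mark the start of a reset: the counter goes from even to odd. */
static void begin_reset(struct gpu *g)
{
    atomic_fetch_add(&g->reset_counter, 1);
}

/*
 * Finish a reset: advance the seqno to the last forcibly retired
 * request, then bump the counter again (odd to even). The counter never
 * returns to an earlier value, so a waiter that slept through the whole
 * reset still sees a value different from the one it sampled.
 */
static void finish_reset(struct gpu *g, uint32_t last_retired)
{
    atomic_store(&g->seqno, last_retired);
    atomic_fetch_add(&g->reset_counter, 1);
}

/*
 * One polling step of a lockless waiter. 'sampled' is the reset counter
 * value the waiter read before it started waiting.
 */
static enum wait_result check_wait(struct gpu *g, uint32_t target,
                                   unsigned int sampled)
{
    if (seqno_passed(atomic_load(&g->seqno), target))
        return WAIT_DONE;
    if (atomic_load(&g->reset_counter) != sampled)
        return WAIT_RESET;
    return WAIT_PENDING;
}
```

With a counter that merely flip-flopped between two states, a waiter
that missed the in-progress window would read back the value it sampled
and keep waiting. Advancing the seqno in finish_reset() covers the same
waiter from the other side, because its target is reported as passed.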
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch