[Intel-gfx] [PATCH] drm/i915: Don't unwedge if reset is disabled

Daniele Ceraolo Spurio daniele.ceraolospurio at intel.com
Mon Sep 9 16:27:47 UTC 2019



On 9/7/19 1:39 AM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-09-06 23:28:05)
>>
>>
>> On 9/5/19 2:09 AM, Janusz Krzysztofik wrote:
>>> When trying to reset a device with reset capability disabled or not
>>> supported while rings are full of requests, it has been observed when
>>> running in execlists submission mode that command stream buffer tail
>>> tends to be incremented by apparently still running GPU regardless of
>>> all requests being already cancelled and command stream buffer pointers
>>> reset.  As a result, kernel panic on NULL pointer dereference occurs
>>> when a trace_ports() helper is called with command stream buffer tail
>>> incremented but request pointers being NULL during final
>>> __intel_gt_set_wedged() operation called from intel_gt_reset().
>>>
>>> Skip actual reset procedure if reset is disabled or not supported.
>>
>> This last sentence is a bit confusing. You're not skipping the reset
>> procedure, you're skipping the attempt of unwedging and resetting again
>> after a reset & wedge already happened.
> 
> Loss of email over the last week, so jumping in at the end. My gut
> response is that this is still just papering over the bug, as what you
> say above makes no sense.
> -Chris
> 

The issue here is that if we don't reset the HW when we wedge, whatever 
was running on the engines might complete at any point after that, which 
generates an unexpected post-wedge CSB event that we don't handle 
gracefully when we unwedge. The CSB event might arrive at any time (even 
after the unwedge) or cause weird behavior on the first re-submission, 
so trying to handle it is not worth the effort IMO since having reset 
disabled is a debug-only use-case.
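
To illustrate the sequence, here is a toy model. It is not i915 code: every
name in it is hypothetical, and it only mimics the ordering problem described
above (wedge without HW reset, then a late completion event that no longer
matches the driver's bookkeeping).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in i915. */
struct toy_request {
	int seqno;
};

struct toy_engine {
	bool reset_supported;         /* false models reset being disabled */
	bool hw_busy;                 /* HW is still executing something */
	struct toy_request *inflight; /* what the driver believes is on the HW */
};

/*
 * Wedge: drop the driver's bookkeeping. The HW is only stopped if a
 * reset is actually possible.
 */
static void toy_wedge(struct toy_engine *e)
{
	e->inflight = NULL;
	if (e->reset_supported)
		e->hw_busy = false;
}

/* HW finishes its work; returns true if that raised a completion event. */
static bool toy_hw_complete(struct toy_engine *e)
{
	if (!e->hw_busy)
		return false;
	e->hw_busy = false;
	return true;
}

/*
 * Event handler: returns true if the event matches a tracked request,
 * false if it is unexpected (where this toy returns false, a handler
 * that trusts the pointer would dereference NULL).
 */
static bool toy_process_event(const struct toy_engine *e)
{
	return e->inflight != NULL;
}

/* Returns true if an unexpected event shows up after the wedge. */
static bool toy_stale_event_after_wedge(bool reset_supported)
{
	struct toy_request rq = { .seqno = 1 };
	struct toy_engine e = {
		.reset_supported = reset_supported,
		.hw_busy = true,
		.inflight = &rq,
	};

	toy_wedge(&e);
	/* An unwedge could happen here; it does not stop the HW either. */
	if (!toy_hw_complete(&e))
		return false;
	return !toy_process_event(&e);
}
```

In this model toy_stale_event_after_wedge(false) reports a stale event, while
toy_stale_event_after_wedge(true) does not, because the reset stops the HW
before it can complete anything.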

Daniele

