[Intel-gfx] [PATCH] drm/i915: Don't unwedge if reset is disabled

Fri Sep 13 14:20:49 UTC 2019

On Monday, September 9, 2019 11:48:42 PM CEST Chris Wilson wrote:
> Quoting Chris Wilson (2019-09-07 09:39:52)
> > Quoting Daniele Ceraolo Spurio (2019-09-06 23:28:05)
> > > 
> > > 
> > > On 9/5/19 2:09 AM, Janusz Krzysztofik wrote:
> > > > When trying to reset a device with reset capability disabled or not
> > > > supported while rings are full of requests, it has been observed when
> > > > running in execlists submission mode that command stream buffer tail
> > > > tends to be incremented by apparently still running GPU regardless of
> > > > all requests being already cancelled and command stream buffer 
pointers
> > > > reset.  As a result, kernel panic on NULL pointer dereference occurs
> > > > when a trace_ports() helper is called with command stream buffer tail
> > > > incremented but request pointers being NULL during final
> > > > __intel_gt_set_wedged() operation called from intel_gt_reset().
> > > > 
> > > > Skip actual reset procedure if reset is disabled or not supported.
> > > 
> > > This last sentence is a bit confusing. You're not skipping the reset 
> > > procedure, you're skipping the attempt of unwedging and resetting again 
> > > after a reset & wedge already happened.
> > 
> > Loss of email over the last week, so jumping in at the end. My gut
> > response is that this is still just papering over the bug, as what you
> > say above makes no sense.
> 
> So my gut response was to the run on sentence, when all you needed to
> say that without a successful reset prior to calling
> reset_default_submission, the engine may still generate CS events out of
> the blue. And I think the patch should be written to require the
> successful reset.

You are right, successful reset seems the only safe protection.

But anyway, while digging deeper waiting for your clarification of that gut 
respone ;-) , I've discovered that symptoms from which the issue can be 
predicted may be sometimes observed during reset_prepere() as failing 
intel_engine_stop_cs().  Checking for that failure alone may be too weak as it 
can probably happen to succeed regardless of the uncertain hardware status, 
but anyway, what do you think about modifying reset_prepare() so it may fail 
with an error propagated from functions it calls, then calling reset_prepare() 
at the beginning of intel_gt_reset() and skiping over 
__intel_gt_unset_wedgede() and further steps (do_reset(), ..., reset_finish()) 
if reset_prepare() fails?  Wouldn't that be a useful additional layer of 
protection?

If you think the idea is worth of being considered, please have a look at my 
first attempt sent to trybot already before your explanation arrived:
https://patchwork.freedesktop.org/patch/329840/?series=66447&rev=1
(don't complain on its commit message making no sense, please ;-) ).

Thanks,
Janusz

> -Chris
>