[Intel-gfx] [PATCH] drm/i915: Treat i915_reset_engine() as guilty until proven innocent

Fri Apr 6 21:44:34 UTC 2018

On 4/6/2018 2:30 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-06 22:23:21)
>> And I thought we believed in presumption of innocence...
>>
>> On 4/6/2018 2:00 PM, Chris Wilson wrote:
>>> If we are resetting just one engine, we know it has stalled. So we can
>>> pass the stalled parameter directly to i915_gem_reset_engine(), which
>>> alleviates the necessity to poke at the generic engine->hangcheck.stalled
>>> magic variable, leaving that under control of hangcheck as its name
>>> implies. Other than simplifying by removing the indirect parameter along
>>> this path, this allows us to introduce new reset mechanisms that run
>>> independently of hangcheck.
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>> Cc: Jeff McGee <jeff.mcgee at intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_drv.c               |  2 +-
>>>    drivers/gpu/drm/i915/i915_drv.h               |  3 +-
>>>    drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
>>>    .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
>>>    4 files changed, 20 insertions(+), 30 deletions(-)
>>>
>> ...
>>> @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
>>>                                break;
>>>                        }
>>>    
>>> -                     engine->hangcheck.stalled = false;
>>>                        count++;
>>>    
>>>                        if (rq) {
>>>
>>
>> Are the ones in igt_handle_error() still needed?
>>     hangcheck.stalled = true;
>>     hangcheck.seqno = intel_engine_get_seqno(engine);
>>
>> Because igt_handle_error is sending a real request.
> 
>> (I think the only ones remaining in the selftest should be in
>> fake_hangcheck).
> 
> Right, fake_hangcheck definitely still needs it to behave like
> hangcheck.
> 
> i915_handle_error() is still "odd". At the moment, yes we still need to
> be poking where we shouldn't. If i915_handle_error() uses the
> engine_mask to do per-engine resets, no, we don't need the magic
> hangcheck.stalled. But, if it falls back to the full device level, we
> loose the guilty reset.  So we do get a difference in behaviour, that
> really hasn't been noticed before as the only real caller is from
> hangcheck. (i915_wedged, I dare anyone to say what they expect ;)
> 

Not the first time I forget about the full reset path.
Thanks for explaining it.

> I think the answer will be to pass engine_mask to i915_reset. But I
> haven't fleshed that out yet. I think it means we do away with
> hangcheck.seqno as well, so bonus?