[Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

Wed Oct 27 20:34:23 UTC 2021

On 10/26/2021 23:36, Thomas Hellström wrote:
> Hi, John,
>
> On 10/26/21 21:55, John Harrison wrote:
>> On 10/21/2021 23:23, Thomas Hellström wrote:
>>> On 10/21/21 22:37, Matthew Brost wrote:
>>>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>>>> Hi, Matthew,
>>>>>
>>>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>>>> The hangcheck selftest blocks per engine resets by setting magic 
>>>>>> bits
>>>>>> in
>>>>>> the reset flags. This is incorrect for GuC submission because if the
>>>>>> GuC
>>>>>> fails to reset an engine we would like to do a full GT reset. Do no
>>>>>> set
>>>>>> these magic bits when using GuC submission.
>>>>>>
>>>>>> Side note this lockless algorithm with magic bits to block resets
>>>>>> really
>>>>>> should be ripped out.
>>>>>>
>>>>> Lockless algorithm aside, from a quick look at the code in
>>>>> intel_reset.c it appears to me like the interface that falls back 
>>>>> to a
>>>>> full GT reset is intel_gt_handle_error() whereas intel_engine_reset()
>>>>> is explicitly intended to not do that, so is there a discrepancy
>>>>> between GuC and non-GuC here?
>>>>>
>>>> With GuC submission when an engine reset fails, we get an engine reset
>>>> failure notification which triggers a full GT reset
>>>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c). That
>>>> reset is blocking by setting these magic bits. Clearing the bits in 
>>>> this
>>>> function doesn't seem to unblock that reset either, the driver 
>>>> tries to
>>>> unload with a worker blocked, and results in the blow up. Something 
>>>> with
>>>> this lockless algorithm could be wrong as clear of the bit should
>>>> unlblock the reset but it is doesn't. We can look into that but in the
>>>> meantime we need to fix this test to be able to fail gracefully and 
>>>> not
>>>> crash CI.
>>>
>>> Yeah, for that lockless algorithm if needed, we might want to use a 
>>> ww_mutex per engine or something,
>>> but point was that AFAICT at least one of the tests that set those 
>>> flags explicitly tested the functionality that no other engines than 
>>> the intended one was reset when the intel_engine_reset() function 
>>> was used, and then if GuC submission doesn't honor that, wouldn't a 
>>> better approach be to make a code comment around 
>>> intel_engine_reset() to explain the differences and disable that 
>>> particular test for GuC?. Also wouldn't we for example we see a 
>>> duplicated full GT reset with GuC if intel_engine_reset() fails as 
>>> part of the intel_gt_handle_error() function?
>> Re-reading this thread, I think there is a misunderstanding.
>>
>> The selftests themselves have already been updated to support GuC 
>> based engine resets. That is done by submitting a hanging context and 
>> letting the GuC detect the hang and issue a reset. There is no 
>> mechanism available for i915 to directly issue or request an engine 
>> based reset (because i915 does not know what is running on any given 
>> engine at any given time, being disconnected from the scheduler).
>>
>> So the tests are already correctly testing per engine resets and do 
>> not go anywhere near either intel_engine_reset() or 
>> intel_gt_handle_error() when GuC submission is used. The problem is 
>> what happens if the engine reset fails (which supposedly can only 
>> happen with broken hardware). In that scenario, there is an 
>> asynchronous message from GuC to i915 to notify us of the failure. 
>> The KMD receives that notification and then (eventually) calls 
>> intel_gt_handle_error() to issue a full GT reset. However, that is 
>> blocked because the selftest is not expecting it and has vetoed the 
>> possibility.
>
> This is where my understanding of the discussion differs. According to 
> Matthew, the selftest actually proceeds to clear the bits, but the 
> worker that calls into intel_gt_handle_error() never wakes up. (and 
> that's probably due to clear_bit() being used instead of 
> clear_and_wake_up_bit()).
Hmm, missed that point. Yeah, sounds like the missing wake_up suffix is 
what is causing the deadlock. I can't see any other reason why the reset 
handler would not proceed once the flags are cleared. And it looks like 
the selftest should timeout out waiting for the request and continue on 
to clear the bits just fine.

>
> And my problem with this particular patch is that it adds even more 
> "if (!guc_submission)" which is already sprinkled all over the place 
> in the selftests to the point that it becomes difficult to see what 
> (if anything) the tests are really testing.
I agree with this. Fixing the problem at source seems like a better 
solution than hacking lots of different bits in different tests.

> For example igt_reset_nop_engine() from a cursory look looks like it's 
> doing something but inside the engine loop it becomes clear that the 
> test doesn't do *anything* except iterate over engines. Same for 
> igt_reset_engines() in the !TEST_ACTIVE case and for 
> igt_reset_idle_engine(). For some other tests the reset_count checks 
> are gone, leaving only a test that we actually do a reset.
The nop_engine test is meant to be a no-op. It is testing what happens 
when you reset an idle engine. That is not something we can do with GuC 
based engine resets - there are no debug hooks into GuC specifically to 
trigger an engine reset when there is no hang. The same applies to the 
!TEST_ACTIVE case (which is igt_reset_idle_engine). Hence those are 
skipped for GuC.

The reset_count tests are still present. They are looking for global 
resets occurring when they should not. It is the reset_engine_count 
check that is disabled for GuC. Again, because GuC is handling the reset 
not i915. GuC tells us when a context is reset, it does not report any 
specifics about engines. So this is not a count we can maintain when 
using GuC submission.

>
> So if possible, as previously mentioned, I think a solution without 
> adding more of this in the selftests is preferrable. To me the best 
> option is probably be the one you suggest in your previous email: 
> Don't wait on the I915_RESET_ENGINE bits with GuC in 
> intel_gt_handle_error(), (or perhaps extract what's left in a separate 
> function called from the GuC handler).
I don't like the idea of splitting intel_gt_handle_error() apart. It is 
meant to be the top level reset handler that manages all resets - per 
engine and full GT (and potentially FLR in the future). Certainly, none 
of it should be pulled out into GuC specific code.

One less than ideal aspect of the current reset support is that 
'intel_has_reset_engine()' is ambiguous. With GuC submission, we have 
per engine reset. It's just not managed by i915. So code that wants to 
know if engines can be reset individual wants a true response. But code 
such as intel_gt_handle_error() needs to know *who* is doing that reset. 
Hence it needs the GuC specific check on top to differentiate. Ideally, 
there would be separate intel_has_reset_engine() and 
intel_does_i915_do_engine_reset() queries. Then the reset code itself 
would not need to query GuC submission. However, that seems like 
overkill to remove a couple of GuC tests. Maybe it could be done as 
future work, but it is certainly not in the scope of making this 
selftest not hang the system.

John.

>
> Thanks,
>
> Thomas
>
>
>