[Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

Wed Oct 27 20:47:20 UTC 2021

On 10/27/21 22:34, John Harrison wrote:
> On 10/26/2021 23:36, Thomas Hellström wrote:
>> Hi, John,
>>
>> On 10/26/21 21:55, John Harrison wrote:
>>> On 10/21/2021 23:23, Thomas Hellström wrote:
>>>> On 10/21/21 22:37, Matthew Brost wrote:
>>>>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>>>>> Hi, Matthew,
>>>>>>
>>>>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>>>>> The hangcheck selftest blocks per engine resets by setting magic 
>>>>>>> bits
>>>>>>> in
>>>>>>> the reset flags. This is incorrect for GuC submission because if 
>>>>>>> the
>>>>>>> GuC
>>>>>>> fails to reset an engine we would like to do a full GT reset. Do no
>>>>>>> set
>>>>>>> these magic bits when using GuC submission.
>>>>>>>
>>>>>>> Side note this lockless algorithm with magic bits to block resets
>>>>>>> really
>>>>>>> should be ripped out.
>>>>>>>
>>>>>> Lockless algorithm aside, from a quick look at the code in
>>>>>> intel_reset.c it appears to me like the interface that falls back 
>>>>>> to a
>>>>>> full GT reset is intel_gt_handle_error() whereas 
>>>>>> intel_engine_reset()
>>>>>> is explicitly intended to not do that, so is there a discrepancy
>>>>>> between GuC and non-GuC here?
>>>>>>
>>>>> With GuC submission when an engine reset fails, we get an engine 
>>>>> reset
>>>>> failure notification which triggers a full GT reset
>>>>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c). 
>>>>> That
>>>>> reset is blocking by setting these magic bits. Clearing the bits 
>>>>> in this
>>>>> function doesn't seem to unblock that reset either, the driver 
>>>>> tries to
>>>>> unload with a worker blocked, and results in the blow up. 
>>>>> Something with
>>>>> this lockless algorithm could be wrong as clear of the bit should
>>>>> unlblock the reset but it is doesn't. We can look into that but in 
>>>>> the
>>>>> meantime we need to fix this test to be able to fail gracefully 
>>>>> and not
>>>>> crash CI.
>>>>
>>>> Yeah, for that lockless algorithm if needed, we might want to use a 
>>>> ww_mutex per engine or something,
>>>> but point was that AFAICT at least one of the tests that set those 
>>>> flags explicitly tested the functionality that no other engines 
>>>> than the intended one was reset when the intel_engine_reset() 
>>>> function was used, and then if GuC submission doesn't honor that, 
>>>> wouldn't a better approach be to make a code comment around 
>>>> intel_engine_reset() to explain the differences and disable that 
>>>> particular test for GuC?. Also wouldn't we for example we see a 
>>>> duplicated full GT reset with GuC if intel_engine_reset() fails as 
>>>> part of the intel_gt_handle_error() function?
>>> Re-reading this thread, I think there is a misunderstanding.
>>>
>>> The selftests themselves have already been updated to support GuC 
>>> based engine resets. That is done by submitting a hanging context 
>>> and letting the GuC detect the hang and issue a reset. There is no 
>>> mechanism available for i915 to directly issue or request an engine 
>>> based reset (because i915 does not know what is running on any given 
>>> engine at any given time, being disconnected from the scheduler).
>>>
>>> So the tests are already correctly testing per engine resets and do 
>>> not go anywhere near either intel_engine_reset() or 
>>> intel_gt_handle_error() when GuC submission is used. The problem is 
>>> what happens if the engine reset fails (which supposedly can only 
>>> happen with broken hardware). In that scenario, there is an 
>>> asynchronous message from GuC to i915 to notify us of the failure. 
>>> The KMD receives that notification and then (eventually) calls 
>>> intel_gt_handle_error() to issue a full GT reset. However, that is 
>>> blocked because the selftest is not expecting it and has vetoed the 
>>> possibility.
>>
>> This is where my understanding of the discussion differs. According 
>> to Matthew, the selftest actually proceeds to clear the bits, but the 
>> worker that calls into intel_gt_handle_error() never wakes up. (and 
>> that's probably due to clear_bit() being used instead of 
>> clear_and_wake_up_bit()).
> Hmm, missed that point. Yeah, sounds like the missing wake_up suffix 
> is what is causing the deadlock. I can't see any other reason why the 
> reset handler would not proceed once the flags are cleared. And it 
> looks like the selftest should timeout out waiting for the request and 
> continue on to clear the bits just fine.
>
>
>>
>> And my problem with this particular patch is that it adds even more 
>> "if (!guc_submission)" which is already sprinkled all over the place 
>> in the selftests to the point that it becomes difficult to see what 
>> (if anything) the tests are really testing.
> I agree with this. Fixing the problem at source seems like a better 
> solution than hacking lots of different bits in different tests.
>
OK, so if we can fix this in intel_gt_handle_error() that'd be great.

Thanks,

Thomas