[Intel-gfx] [PATCH] drm/i915/selftests: Allow engine reset failure to do a GT reset in hangcheck selftest

Tue Oct 26 19:55:24 UTC 2021

On 10/21/2021 23:23, Thomas Hellström wrote:
> On 10/21/21 22:37, Matthew Brost wrote:
>> On Thu, Oct 21, 2021 at 08:15:49AM +0200, Thomas Hellström wrote:
>>> Hi, Matthew,
>>>
>>> On Mon, 2021-10-11 at 16:47 -0700, Matthew Brost wrote:
>>>> The hangcheck selftest blocks per engine resets by setting magic bits
>>>> in
>>>> the reset flags. This is incorrect for GuC submission because if the
>>>> GuC
>>>> fails to reset an engine we would like to do a full GT reset. Do not
>>>> set
>>>> these magic bits when using GuC submission.
>>>>
>>>> Side note this lockless algorithm with magic bits to block resets
>>>> really
>>>> should be ripped out.
>>>>
>>> Lockless algorithm aside, from a quick look at the code in
>>> intel_reset.c it appears to me like the interface that falls back to a
>>> full GT reset is intel_gt_handle_error() whereas intel_engine_reset()
>>> is explicitly intended to not do that, so is there a discrepancy
>>> between GuC and non-GuC here?
>>>
>> With GuC submission when an engine reset fails, we get an engine reset
>> failure notification which triggers a full GT reset
>> (intel_guc_engine_failure_process_msg in intel_guc_submission.c). That
>> reset is blocked by setting these magic bits. Clearing the bits in this
>> function doesn't seem to unblock that reset either, the driver tries to
>> unload with a worker blocked, and results in the blow up. Something with
>> this lockless algorithm could be wrong as clearing the bit should
>> unblock the reset but it doesn't. We can look into that but in the
>> meantime we need to fix this test to be able to fail gracefully and not
>> crash CI.
>
> Yeah, for that lockless algorithm if needed, we might want to use a 
> ww_mutex per engine or something,
> but point was that AFAICT at least one of the tests that set those 
> flags explicitly tested the functionality that no other engines than 
> the intended one was reset when the intel_engine_reset() function was 
> used, and then if GuC submission doesn't honor that, wouldn't a better 
> approach be to make a code comment around intel_engine_reset() to 
> explain the differences and disable that particular test for GuC? 
> Also, wouldn't we, for example, see a duplicated full GT reset with 
> GuC if intel_engine_reset() fails as part of the 
> intel_gt_handle_error() function?
Re-reading this thread, I think there is a misunderstanding.

The selftests themselves have already been updated to support GuC based 
engine resets. That is done by submitting a hanging context and letting 
the GuC detect the hang and issue a reset. There is no mechanism 
available for i915 to directly issue or request an engine based reset 
(because i915 does not know what is running on any given engine at any 
given time, being disconnected from the scheduler).

So the tests are already correctly testing per engine resets and do not 
go anywhere near either intel_engine_reset() or intel_gt_handle_error() 
when GuC submission is used. The problem is what happens if the engine 
reset fails (which supposedly can only happen with broken hardware). In 
that scenario, there is an asynchronous message from GuC to i915 to 
notify us of the failure. The KMD receives that notification and then 
(eventually) calls intel_gt_handle_error() to issue a full GT reset. 
However, that is blocked because the selftest is not expecting it and 
has vetoed the possibility. A fix is required to allow that full GT 
reset to proceed and recover the hardware. At that point, the selftest 
should indeed fail because the reset was larger than expected. That 
should be handled by the fact the selftest issued work to other engines 
besides the target and expects those requests to complete successfully. 
In the case of the escalated GT reset, all those requests will be killed 
off as well. Thus the test will correctly fail.
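
To make the veto concrete, here is a minimal userspace sketch of the
flag handling, not the driver code. It assumes a full GT reset has to
claim every per-engine bit before it may proceed, so a bit left set by
the selftest keeps it out. All model_* names, the bit offset and the
engine count are hypothetical, and where the real worker would sleep the
model simply reports that it is blocked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this model; not the i915 definitions. */
#define MODEL_NUM_ENGINES  4u
#define MODEL_RESET_ENGINE 2u /* first per-engine bit, arbitrary here */

struct model_gt {
	atomic_ulong reset_flags;
};

static void model_gt_init(struct model_gt *gt)
{
	atomic_init(&gt->reset_flags, 0ul);
}

static unsigned long model_engine_bit(unsigned int id)
{
	assert(id < MODEL_NUM_ENGINES);
	return 1ul << (MODEL_RESET_ENGINE + id);
}

/* Atomically set a bit; returns true if it was already set. */
static bool model_test_and_set(struct model_gt *gt, unsigned long mask)
{
	return (atomic_fetch_or(&gt->reset_flags, mask) & mask) != 0;
}

static void model_clear(struct model_gt *gt, unsigned long mask)
{
	atomic_fetch_and(&gt->reset_flags, ~mask);
}

/* What the selftest does before its reset loop (with the patch applied). */
static void model_selftest_begin(struct model_gt *gt, unsigned int id,
				 bool using_guc)
{
	if (!using_guc)
		model_test_and_set(gt, model_engine_bit(id));
}

/* What the selftest does after its reset loop (with the patch applied). */
static void model_selftest_end(struct model_gt *gt, unsigned int id,
			       bool using_guc)
{
	if (!using_guc)
		model_clear(gt, model_engine_bit(id));
}

/*
 * Assumption of this model: the full GT reset must own every per-engine
 * bit. Returns false if any bit is already held, after releasing the
 * bits it managed to claim.
 */
static bool model_try_full_gt_reset(struct model_gt *gt)
{
	unsigned int id;

	for (id = 0; id < MODEL_NUM_ENGINES; id++) {
		if (model_test_and_set(gt, model_engine_bit(id))) {
			while (id--)
				model_clear(gt, model_engine_bit(id));
			return false;
		}
	}

	/* The actual hardware reset would happen here. */

	for (id = 0; id < MODEL_NUM_ENGINES; id++)
		model_clear(gt, model_engine_bit(id));
	return true;
}
```

In this model, calling model_selftest_begin(&gt, 1, false) and then
model_try_full_gt_reset(&gt) returns false, which corresponds to the
blocked worker. With using_guc set to true the bit is never taken and
the full reset goes through.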

John.

>
> I guess we could live with the hangcheck test being disabled for guc 
> submission until this is sorted out?
>
> /Thomas
>
>
>>
>> Matt
>>
>>> /Thomas
>>>
>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 12 ++++++++----
>>>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
>>>> b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
>>>> index 7e2d99dd012d..90a03c60c80c 100644
>>>> --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
>>>> +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
>>>> @@ -734,7 +734,8 @@ static int __igt_reset_engine(struct intel_gt
>>>> *gt, bool active)
>>>>                  reset_engine_count = i915_reset_engine_count(global,
>>>> engine);
>>>>                    st_engine_heartbeat_disable(engine);
>>>> -               set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>> +               if (!using_guc)
>>>> +                       set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>>                  count = 0;
>>>>                  do {
>>>>                          struct i915_request *rq = NULL;
>>>> @@ -824,7 +825,8 @@ static int __igt_reset_engine(struct intel_gt
>>>> *gt, bool active)
>>>>                          if (err)
>>>>                                  break;
>>>>                  } while (time_before(jiffies, end_time));
>>>> -               clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>> +               if (!using_guc)
>>>> +                       clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>>                  st_engine_heartbeat_enable(engine);
>>>>                  pr_info("%s: Completed %lu %s resets\n",
>>>>                          engine->name, count, active ? "active" :
>>>> "idle");
>>>> @@ -1042,7 +1044,8 @@ static int __igt_reset_engines(struct intel_gt
>>>> *gt,
>>>>                  yield(); /* start all threads before we begin */
>>>>                  st_engine_heartbeat_disable_no_pm(engine);
>>>> -               set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>> +               if (!using_guc)
>>>> +                       set_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>>                  do {
>>>>                          struct i915_request *rq = NULL;
>>>>                          struct intel_selftest_saved_policy saved;
>>>> @@ -1165,7 +1168,8 @@ static int __igt_reset_engines(struct intel_gt
>>>> *gt,
>>>>                          if (err)
>>>>                                  break;
>>>>                  } while (time_before(jiffies, end_time));
>>>> -               clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>> +               if (!using_guc)
>>>> +                       clear_bit(I915_RESET_ENGINE + id, &gt->reset.flags);
>>>>                  st_engine_heartbeat_enable_no_pm(engine);
>>>>                    pr_info("i915_reset_engine(%s:%s): %lu resets\n",
>>>