[Intel-gfx] [PATCH] drm/i915/gt: Reset twice

Thu Dec 22 09:28:34 UTC 2022


On 12/15/22 10:07 PM, Rodrigo Vivi wrote:
> On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
>> Hi Rodrigo,
>>
>> On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
>>> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
>>>> Hi Rodrigo,
>>>>
>>>> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
>>>>> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>
>>>>>> After applying an engine reset, on some platforms like
>>>>>> Jasperlake, we
>>>>>> occasionally detect that the engine state is not cleared until
>>>>>> shortly
>>>>>> after the resume. As we try to resume the engine with volatile
>>>>>> internal
>>>>>> state, the first request fails with a spurious CS event (it looks
>>>>>> like
>>>>>> it reports a lite-restore to the hung context, instead of the
>>>>>> expected
>>>>>> idle->active context switch).
>>>>>>
>>>>>> Signed-off-by: Chris Wilson <hris at chris-wilson.co.uk>
>>>>>
>>>>> There's a typo in the signature email I'm afraid...
>>>>
>>>> oh yes, I forgot the 'C' :)
>>>
>>> you forgot?
>>> who signed it off?
>>
>> Chris, but as I was copy/pasting SoB's I might have
>> unintentionally removed the 'c'.
>>
>>>>> Other than that, have we checked the possibility of using the
>>>>> driver-initiated-flr bit
>>>>> instead of this second loop? That should be the right way to
>>>>> guarantee everything is
>>>>> cleared on gen11+...
>>>>
>>>> maybe I am misinterpreting it, but is FLR the same as resetting
>>>> hardware domains individually?
>>>
>>> No, it is bigger than that... almost the PCI FLR with some exceptions:
>>>
>>> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html
>>
>> yes, exactly... I would use FLR feedback if I was performing an
>> FLR reset. But here I'm not doing that, here I'm simply gating
>> off some power domains. It happens that those power domains turn
>> on and off engines making them reset.
> 
> is this issue only seeing when this reset is called from the
> sanitize functions at probe and resumes?
> Or from any kind of gt reset?
> 
> I don't remember seeing any reference link to the bug in the patch,
> hence I'm assuming this is happening in any kind of gt reset that
> ends up in this function.
> 
>>
>> FLR doesn't have anything to do here, also because if you want to
>> reset a single engine you go through this function, instead of
>> resetting the whole GPU with whatever is annexed.
> 
> yeap. That might be to extreme depending on the case. But all that
> I asked was if we were considering this option since this is the
> recommended way of reseting our engines nowadays.
> 
>>
>> This patch is not fixing the "reset" concept of i915, but it's
>> fixing a missing feedback that happens in one single platform
>> when trying to gate on/off a domain.
> 
> But it is changing the reset concept and timeouts for all the reset
> cases in all the platforms.
> 
>>
>> Maybe I am completely off track, but I don't see connection with
>> FLR here.
> 
> The point is that if a reset is needed, for any reason,
> the recommended way for Jasperlake, and any other newer platforms,
> is to use the FLR rather than the engine reset. But we are using
> the engine reset, and now twice, rather then attempt the recommended
> way.
> 
>>
>> (besides FLR might not be present in all the platforms)
> 
> This issue is also not present in all the platforms and you are still
> increasing the loops and delay for all the platforms.
> 
>>
>> Thanks a lot for your inputs,
> 
> have we looked to the Jasperlake workarounds to see if we are missing
> anything there that could help us in this case instead of this extreme
> approach of randomly increasing timeouts and attempts for all the
> platforms?
> 
Hi, Rodrigo
JSL_WA (Bspec: 33451) doesn't seem to have a WA similar to this issue. 
(Please correct me if I can't find it.)

>> Andi
>>
>>>> How am I supposed to use driver_initiated_flr() in this context?
>>>
>>> Some drivers are not even using this gt reset anymore and going
>>> directly with the driver-initiated FLR in case that guc reset failed.
>>>
>>> I believe we can still keep the gt reset in our case as we currently
>>> have, but instead of keep retrying it until it succeeds we probably
>>> should go to the next level and do the driver initiated FLR when the GT
>>> reset failed.
>>>
>>>>
>>>> Thanks,
>>>> Andi
>>>>
>>>>>> Cc: stable at vger.kernel.org
>>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>>> Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_reset.c | 34
>>>>>> ++++++++++++++++++++++-----
>>>>>>   1 file changed, 28 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> index ffde89c5835a4..88dfc0c5316ff 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
>>>>>> intel_engine_mask_t engine_mask,
>>>>>>   static int gen6_hw_domain_reset(struct intel_gt *gt, u32
>>>>>> hw_domain_mask)
>>>>>>   {
>>>>>>          struct intel_uncore *uncore = gt->uncore;
>>>>>> +       int loops = 2;
>>>>>>          int err;
>>>>>>   
>>>>>>          /*
>>>>>> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
>>>>>> intel_gt *gt, u32 hw_domain_mask)
>>>>>>           * for fifo space for the write or forcewake the chip for
>>>>>>           * the read
>>>>>>           */
>>>>>> -       intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>> +       do {
>>>>>> +               intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>>   
>>>>>> -       /* Wait for the device to ack the reset requests */
>>>>>> -       err = __intel_wait_for_register_fw(uncore,
>>>>>> -                                          GEN6_GDRST,
>>>>>> hw_domain_mask, 0,
>>>>>> -                                          500, 0,
>>>>>> -                                          NULL);
>>>>>> +               /*
>>>>>> +                * Wait for the device to ack the reset requests.
>>>>>> +                *
>>>>>> +                * On some platforms, e.g. Jasperlake, we see see
>>>>>> that the
>>>>>> +                * engine register state is not cleared until
>>>>>> shortly after
>>>>>> +                * GDRST reports completion, causing a failure as
>>>>>> we try
>>>>>> +                * to immediately resume while the internal state
>>>>>> is still
>>>>>> +                * in flux. If we immediately repeat the reset,
>>>>>> the second
>>>>>> +                * reset appears to serialise with the first, and
>>>>>> since
>>>>>> +                * it is a no-op, the registers should retain
>>>>>> their reset
>>>>>> +                * value. However, there is still a concern that
>>>>>> upon
>>>>>> +                * leaving the second reset, the internal engine
>>>>>> state
>>>>>> +                * is still in flux and not ready for resuming.
>>>>>> +                */
>>>>>> +               err = __intel_wait_for_register_fw(uncore,
>>>>>> GEN6_GDRST,
>>>>>> +
>>>>>> hw_domain_mask, 0,
>>>>>> +                                                  2000, 0,
>>>>>> +                                                  NULL);
Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it 
tries to reset it once more. How was this value of 2000 calculated?
>>>>>> +       } while (err == 0 && --loops);
>>>>>>          if (err)
>>>>>>                  GT_TRACE(gt,
>>>>>>                           "Wait for 0x%08x engines reset
>>>>>> failed\n",
>>>>>>                           hw_domain_mask);
Did GT_TRACE report an error in a situation where the problem was reported?
>>>>>>   
>>>>>> +       /*
>>>>>> +        * As we have observed that the engine state is still
>>>>>> volatile
>>>>>> +        * after GDRST is acked, impose a small delay to let
>>>>>> everything settle.
>>>>>> +        */
>>>>>> +       udelay(50);
udelay(50) affects all platforms that can call gen6_hw_domain_reset(), 
is that intended?

Br,

G.G.
>>>>>> +
>>>>>>          return err;
>>>>>>   }
>>>>>>   
>>>>>> -- 
>>>>>> 2.38.1
>>>>>>
>>>