[Intel-gfx] [PATCH] drm/i915/gt: Reset twice
Gwan-gyeong Mun
gwan-gyeong.mun at intel.com
Thu Dec 22 09:28:34 UTC 2022
On 12/15/22 10:07 PM, Rodrigo Vivi wrote:
> On Wed, Dec 14, 2022 at 11:37:19PM +0100, Andi Shyti wrote:
>> Hi Rodrigo,
>>
>> On Tue, Dec 13, 2022 at 01:18:48PM +0000, Vivi, Rodrigo wrote:
>>> On Tue, 2022-12-13 at 00:08 +0100, Andi Shyti wrote:
>>>> Hi Rodrigo,
>>>>
>>>> On Mon, Dec 12, 2022 at 11:55:10AM -0500, Rodrigo Vivi wrote:
>>>>> On Mon, Dec 12, 2022 at 05:13:38PM +0100, Andi Shyti wrote:
>>>>>> From: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>
>>>>>> After applying an engine reset, on some platforms like
>>>>>> Jasperlake, we
>>>>>> occasionally detect that the engine state is not cleared until
>>>>>> shortly
>>>>>> after the resume. As we try to resume the engine with volatile
>>>>>> internal
>>>>>> state, the first request fails with a spurious CS event (it looks
>>>>>> like
>>>>>> it reports a lite-restore to the hung context, instead of the
>>>>>> expected
>>>>>> idle->active context switch).
>>>>>>
>>>>>> Signed-off-by: Chris Wilson <hris at chris-wilson.co.uk>
>>>>>
>>>>> There's a typo in the signature email I'm afraid...
>>>>
>>>> oh yes, I forgot the 'C' :)
>>>
>>> you forgot?
>>> who signed it off?
>>
>> Chris, but as I was copy/pasting SoB's I might have
>> unintentionally removed the 'c'.
>>
>>>>> Other than that, have we checked the possibility of using the
>>>>> driver-initiated-flr bit
>>>>> instead of this second loop? That should be the right way to
>>>>> guarantee everything is
>>>>> cleared on gen11+...
>>>>
>>>> maybe I am misinterpreting it, but is FLR the same as resetting
>>>> hardware domains individually?
>>>
>>> No, it is bigger than that... almost the PCI FLR with some exceptions:
>>>
>>> https://lists.freedesktop.org/archives/intel-gfx/2022-December/313956.html
>>
>> yes, exactly... I would use FLR feedback if I was performing an
>> FLR reset. But here I'm not doing that, here I'm simply gating
>> off some power domains. It happens that those power domains turn
>> on and off engines making them reset.
>
> is this issue only seeing when this reset is called from the
> sanitize functions at probe and resumes?
> Or from any kind of gt reset?
>
> I don't remember seeing any reference link to the bug in the patch,
> hence I'm assuming this is happening in any kind of gt reset that
> ends up in this function.
>
>>
>> FLR doesn't have anything to do here, also because if you want to
>> reset a single engine you go through this function, instead of
>> resetting the whole GPU with whatever is annexed.
>
> yeap. That might be to extreme depending on the case. But all that
> I asked was if we were considering this option since this is the
> recommended way of reseting our engines nowadays.
>
>>
>> This patch is not fixing the "reset" concept of i915, but it's
>> fixing a missing feedback that happens in one single platform
>> when trying to gate on/off a domain.
>
> But it is changing the reset concept and timeouts for all the reset
> cases in all the platforms.
>
>>
>> Maybe I am completely off track, but I don't see connection with
>> FLR here.
>
> The point is that if a reset is needed, for any reason,
> the recommended way for Jasperlake, and any other newer platforms,
> is to use the FLR rather than the engine reset. But we are using
> the engine reset, and now twice, rather then attempt the recommended
> way.
>
>>
>> (besides FLR might not be present in all the platforms)
>
> This issue is also not present in all the platforms and you are still
> increasing the loops and delay for all the platforms.
>
>>
>> Thanks a lot for your inputs,
>
> have we looked to the Jasperlake workarounds to see if we are missing
> anything there that could help us in this case instead of this extreme
> approach of randomly increasing timeouts and attempts for all the
> platforms?
>
Hi, Rodrigo
JSL_WA (Bspec: 33451) doesn't seem to have a WA similar to this issue.
(Please correct me if I can't find it.)
>> Andi
>>
>>>> How am I supposed to use driver_initiated_flr() in this context?
>>>
>>> Some drivers are not even using this gt reset anymore and going
>>> directly with the driver-initiated FLR in case that guc reset failed.
>>>
>>> I believe we can still keep the gt reset in our case as we currently
>>> have, but instead of keep retrying it until it succeeds we probably
>>> should go to the next level and do the driver initiated FLR when the GT
>>> reset failed.
>>>
>>>>
>>>> Thanks,
>>>> Andi
>>>>
>>>>>> Cc: stable at vger.kernel.org
>>>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>>>> Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/intel_reset.c | 34
>>>>>> ++++++++++++++++++++++-----
>>>>>> 1 file changed, 28 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> index ffde89c5835a4..88dfc0c5316ff 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> @@ -268,6 +268,7 @@ static int ilk_do_reset(struct intel_gt *gt,
>>>>>> intel_engine_mask_t engine_mask,
>>>>>> static int gen6_hw_domain_reset(struct intel_gt *gt, u32
>>>>>> hw_domain_mask)
>>>>>> {
>>>>>> struct intel_uncore *uncore = gt->uncore;
>>>>>> + int loops = 2;
>>>>>> int err;
>>>>>>
>>>>>> /*
>>>>>> @@ -275,18 +276,39 @@ static int gen6_hw_domain_reset(struct
>>>>>> intel_gt *gt, u32 hw_domain_mask)
>>>>>> * for fifo space for the write or forcewake the chip for
>>>>>> * the read
>>>>>> */
>>>>>> - intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>> + do {
>>>>>> + intel_uncore_write_fw(uncore, GEN6_GDRST,
>>>>>> hw_domain_mask);
>>>>>>
>>>>>> - /* Wait for the device to ack the reset requests */
>>>>>> - err = __intel_wait_for_register_fw(uncore,
>>>>>> - GEN6_GDRST,
>>>>>> hw_domain_mask, 0,
>>>>>> - 500, 0,
>>>>>> - NULL);
>>>>>> + /*
>>>>>> + * Wait for the device to ack the reset requests.
>>>>>> + *
>>>>>> + * On some platforms, e.g. Jasperlake, we see see
>>>>>> that the
>>>>>> + * engine register state is not cleared until
>>>>>> shortly after
>>>>>> + * GDRST reports completion, causing a failure as
>>>>>> we try
>>>>>> + * to immediately resume while the internal state
>>>>>> is still
>>>>>> + * in flux. If we immediately repeat the reset,
>>>>>> the second
>>>>>> + * reset appears to serialise with the first, and
>>>>>> since
>>>>>> + * it is a no-op, the registers should retain
>>>>>> their reset
>>>>>> + * value. However, there is still a concern that
>>>>>> upon
>>>>>> + * leaving the second reset, the internal engine
>>>>>> state
>>>>>> + * is still in flux and not ready for resuming.
>>>>>> + */
>>>>>> + err = __intel_wait_for_register_fw(uncore,
>>>>>> GEN6_GDRST,
>>>>>> +
>>>>>> hw_domain_mask, 0,
>>>>>> + 2000, 0,
>>>>>> + NULL);
Andi, fast_timeout_us is increased from 500 to 2000, and if it fails, it
tries to reset it once more. How was this value of 2000 calculated?
>>>>>> + } while (err == 0 && --loops);
>>>>>> if (err)
>>>>>> GT_TRACE(gt,
>>>>>> "Wait for 0x%08x engines reset
>>>>>> failed\n",
>>>>>> hw_domain_mask);
Did GT_TRACE report an error in a situation where the problem was reported?
>>>>>>
>>>>>> + /*
>>>>>> + * As we have observed that the engine state is still
>>>>>> volatile
>>>>>> + * after GDRST is acked, impose a small delay to let
>>>>>> everything settle.
>>>>>> + */
>>>>>> + udelay(50);
udelay(50) affects all platforms that can call gen6_hw_domain_reset(),
is that intended?
Br,
G.G.
>>>>>> +
>>>>>> return err;
>>>>>> }
>>>>>>
>>>>>> --
>>>>>> 2.38.1
>>>>>>
>>>
More information about the Intel-gfx
mailing list