[Intel-gfx] [PATCH] drm/i915: fix SFC reset flow
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Sep 18 13:42:37 UTC 2019
On 17/09/2019 19:29, Daniele Ceraolo Spurio wrote:
>
>
> On 9/17/2019 3:22 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2019 22:41, Daniele Ceraolo Spurio wrote:
>>> Our assumption that we can ask the HW to lock the SFC even if not
>>> currently in use does not match the HW commitment. The expectation from
>>> the HW is that SW will not try to lock the SFC if the engine is not
>>> using it; if we do, the behavior is undefined. On ICL the HW
>>> ends up returning the ack and ignoring our lock request, but this is
>>> not guaranteed and we shouldn't expect it going forward.
>>>
>>> Reported-by: Owen Zhang <owen.zhang at intel.com>
>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/gt/intel_reset.c | 25 +++++++++++++++++--------
>>> 1 file changed, 17 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> index 8327220ac558..900958804bd5 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> @@ -352,13 +352,15 @@ static u32 gen11_lock_sfc(struct intel_engine_cs *engine)
>>> }
>>> /*
>>> - * Tell the engine that a software reset is going to happen. The engine
>>> - * will then try to force lock the SFC (if currently locked, it will
>>> - * remain so until we tell the engine it is safe to unlock; if currently
>>> - * unlocked, it will ignore this and all new lock requests). If SFC
>>> - * ends up being locked to the engine we want to reset, we have to reset
>>> - * it as well (we will unlock it once the reset sequence is completed).
>>> + * If the engine is using a SFC, tell the engine that a software reset
>>> + * is going to happen. The engine will then try to force lock the SFC.
>>> + * If SFC ends up being locked to the engine we want to reset, we have
>>> + * to reset it as well (we will unlock it once the reset sequence is
>>> + * completed).
>>> */
>>> + if (!(intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit))
>>> + return 0;
>>> +
>>> rmw_set_fw(uncore, sfc_forced_lock, sfc_forced_lock_bit);
>>> if (__intel_wait_for_register_fw(uncore,
>>> @@ -366,10 +368,13 @@ static u32 gen11_lock_sfc(struct intel_engine_cs *engine)
>>> sfc_forced_lock_ack_bit,
>>> sfc_forced_lock_ack_bit,
>>> 1000, 0, NULL)) {
>>> - DRM_DEBUG_DRIVER("Wait for SFC forced lock ack failed\n");
>>> + /* did we race the unlock? */
>>
>> How do we race here? Are we not in complete control of the engine at
>> this point so the status of this engine using SFC or not should be
>> static, no?
>
> The hang detection might be due to a long non-preemptable batch, in
> which case there is in theory a chance for the batch to release the SFC
> while we try to lock it. The chance is incredibly small though, so am I
> being too paranoid?
I get it now, it is a legitimate race.
>
>>
>>> + if (intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit)
>>> + DRM_ERROR("Wait for SFC forced lock ack failed\n");
>>> return 0;
>>> }
>>> + /* The HW could return the ack even if the sfc is not in use */
>>
>> But the function checked whether SFC wasn't in use and bailed out
>> early - so is this comment relevant? (I understand it is true against
>> the specs just wondering about our exact code.)
>>
>
> Same rationale as above: if the engine released the SFC while we were
> locking it, the locking might have been rejected, but on ICL we still
> get the ack.
>
>>> if (intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit)
>>> return sfc_reset_bit;
>>> @@ -382,6 +387,7 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine)
>>> u8 vdbox_sfc_access = RUNTIME_INFO(engine->i915)->vdbox_sfc_access;
>>> i915_reg_t sfc_forced_lock;
>>> u32 sfc_forced_lock_bit;
>>> + u32 lock;
>>> switch (engine->class) {
>>> case VIDEO_DECODE_CLASS:
>>> @@ -401,7 +407,10 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine)
>>> return;
>>> }
>>> - rmw_clear_fw(uncore, sfc_forced_lock, sfc_forced_lock_bit);
>>> + lock = intel_uncore_read_fw(uncore, sfc_forced_lock);
>>> + if (lock & sfc_forced_lock_bit)
>>> + intel_uncore_write_fw(uncore, sfc_forced_lock,
>>> + lock & ~sfc_forced_lock_bit);
>>
>> Here we can't rely on the return code from gen11_lock_sfc and have to
>> read the register ourselves? I guess it depends on my question about
>> the race comment.
>>
>> In addition to this I now see that gen11_reset_engines does not use
>> the return value from gen11_lock_sfc when deciding which engines it
>> needs to unlock. Should we change that as well?
>
> Paranoia here as well: in case something went wrong with the locking, I'd
> like to be sure the unlocking can still be performed independently so we
> can recover. E.g. the locking might have succeeded after we hit the
> timeout in gen11_lock_sfc, in which case the return from that function
> won't reflect the status of the HW.
Put in a comment here explaining the story and with that:
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Another option is to cross-check software vs hardware locked status at
this point and log a mismatch, just so we get data on how often this
happens in practice. This is probably best as a follow up.
Regards,
Tvrtko
>
> Thanks,
> Daniele
>
>>
>>
>>> }
>>> static int gen11_reset_engines(struct intel_gt *gt,
>>>
>>
>> Regards,
>>
>> Tvrtko
>
>