[Intel-gfx] [PATCH] drm/i915: fix SFC reset flow
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Sep 18 13:42:37 UTC 2019
On 17/09/2019 19:29, Daniele Ceraolo Spurio wrote:
>
>
> On 9/17/2019 3:22 AM, Tvrtko Ursulin wrote:
>>
>> On 16/09/2019 22:41, Daniele Ceraolo Spurio wrote:
>>> Our assumption that we can ask the HW to lock the SFC even if not
>>> currently in use does not match the HW commitment. The expectation from
>>> the HW is that SW will not try to lock the SFC if the engine is not
>>> using it; if we do, the behavior is undefined. On ICL the HW
>>> ends up returning the ack and ignoring our lock request, but this is
>>> not guaranteed and we shouldn't expect it going forward.
>>>
>>> Reported-by: Owen Zhang <owen.zhang at intel.com>
>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/gt/intel_reset.c | 25 +++++++++++++++++--------
>>> 1 file changed, 17 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> index 8327220ac558..900958804bd5 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> @@ -352,13 +352,15 @@ static u32 gen11_lock_sfc(struct intel_engine_cs *engine)
>>> }
>>> /*
>>> - * Tell the engine that a software reset is going to happen. The engine
>>> - * will then try to force lock the SFC (if currently locked, it will
>>> - * remain so until we tell the engine it is safe to unlock; if currently
>>> - * unlocked, it will ignore this and all new lock requests). If SFC
>>> - * ends up being locked to the engine we want to reset, we have to reset
>>> - * it as well (we will unlock it once the reset sequence is completed).
>>> + * If the engine is using a SFC, tell the engine that a software reset
>>> + * is going to happen. The engine will then try to force lock the SFC.
>>> + * If SFC ends up being locked to the engine we want to reset, we have
>>> + * to reset it as well (we will unlock it once the reset sequence is
>>> + * completed).
>>> */
>>> + if (!(intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit))
>>> + return 0;
>>> +
>>> rmw_set_fw(uncore, sfc_forced_lock, sfc_forced_lock_bit);
>>> if (__intel_wait_for_register_fw(uncore,
>>> @@ -366,10 +368,13 @@ static u32 gen11_lock_sfc(struct intel_engine_cs *engine)
>>> sfc_forced_lock_ack_bit,
>>> sfc_forced_lock_ack_bit,
>>> 1000, 0, NULL)) {
>>> - DRM_DEBUG_DRIVER("Wait for SFC forced lock ack failed\n");
>>> + /* did we race the unlock? */
>>
>> How do we race here? Are we not in complete control of the engine at
>> this point so the status of this engine using SFC or not should be
>> static, no?
>
> The hang detection might be due to a long non-preemptable batch, in
> which case there is in theory a chance for the batch to release the SFC
> while we try to lock it. The chance is incredibly small though, so am I
> being too paranoid?
I get it now, it is a legitimate race.
>
>>
>>> + if (intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit)
>>> + DRM_ERROR("Wait for SFC forced lock ack failed\n");
>>> return 0;
>>> }
>>> + /* The HW could return the ack even if the sfc is not in use */
>>
>> But the function checked whether SFC wasn't in use and bailed out
>> early - so is this comment relevant? (I understand it is true against
>> the specs just wondering about our exact code.)
>>
>
> Same rationale as above: if the engine released the SFC while we were
> locking it, the locking might have been rejected, but on ICL we still
> get the ack.
>
>>> if (intel_uncore_read_fw(uncore, sfc_usage) & sfc_usage_bit)
>>> return sfc_reset_bit;
>>> @@ -382,6 +387,7 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine)
>>> u8 vdbox_sfc_access = RUNTIME_INFO(engine->i915)->vdbox_sfc_access;
>>> i915_reg_t sfc_forced_lock;
>>> u32 sfc_forced_lock_bit;
>>> + u32 lock;
>>> switch (engine->class) {
>>> case VIDEO_DECODE_CLASS:
>>> @@ -401,7 +407,10 @@ static void gen11_unlock_sfc(struct intel_engine_cs *engine)
>>> return;
>>> }
>>> - rmw_clear_fw(uncore, sfc_forced_lock, sfc_forced_lock_bit);
>>> + lock = intel_uncore_read_fw(uncore, sfc_forced_lock);
>>> + if (lock & sfc_forced_lock_bit)
>>> + intel_uncore_write_fw(uncore, sfc_forced_lock,
>>> + lock & ~sfc_forced_lock_bit);
>>
>> Here we can't rely on the return code from gen11_lock_sfc and have to
>> read the register ourselves? I guess it depends on my question about
>> the race comment.
>>
>> In addition to this I now see that gen11_reset_engines does not use
>> the return value from gen11_lock_sfc when deciding which engines it
>> needs to unlock. Should we change that as well?
>
> Paranoia here as well: in case something went wrong with the locking, I'd
> like to be sure the unlocking can still be performed independently so we
> can recover. E.g. the locking might have succeeded after we hit the
> timeout in gen11_lock_sfc, in which case the return from that function
> won't reflect the status of the HW.
Put in a comment here explaining the story and with that:
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
Another option is to cross-check software vs hardware locked status at
this point and log a mismatch, just so we get data on how often this
happens in practice. This is probably best as a follow up.
Regards,
Tvrtko
>
> Thanks,
> Daniele
>
>>
>>
>>> }
>>> static int gen11_reset_engines(struct intel_gt *gt,
>>>
>>
>> Regards,
>>
>> Tvrtko
>
>