[Intel-gfx] [PATCH] drm/i915/huc: bump timeout for delayed load and reduce print verbosity

Thu Oct 6 20:24:49 UTC 2022

On 10/6/2022 13:16, Ceraolo Spurio, Daniele wrote:
> On 10/6/2022 1:09 PM, John Harrison wrote:
>> On 10/6/2022 10:20, Daniele Ceraolo Spurio wrote:
>>> We're observing sporadic HuC delayed load timeouts in CI, due to 
>>> mei_pxp
>>> binding completing later than we expected. HuC is still still loaded
>> still still
>>
>>> when the bind occurs, but in the meantime i915 has started allowing
>>> submission to the VCS engines even if HuC is not there.
>>> In most of the cases I've observed, the timeout was due to the
>>> init/resume of another driver between i915 and mei hitting errors and
>>> thus adding an extra delay, but HuC was still loaded before userspace
>>> could submit, because the whole resume process time was increased by 
>>> the
>>> delays.
>>>
>>> Given that there is no upper bound to the delay that can be introduced
>>> by other drivers, I've reached the following compromise with the media
>>> team:
>>>
>>> 1) i915 is going to bump the timeout to 5s, to reduce the probability
>>> of reaching it. We still expect HuC to be loaded before userspace
>>> starts submitting, so increasing the timeout should have no impact on
>>> normal operations, but in case something weird happens we don't want to
>>> stall video submissions for too long.
>>>
>>> 2) The media driver will cope with the failing submissions that manage
>>> to go through between i915 init/resume complete and HuC loading, if any
>>> ever happen. This could cause a small corruption of video playback
>>> immediately after a resume (we should be safe on boot because the media
>>> driver polls the HUC_STATUS ioctl before starting submissions).
>>>
>>> Since we're accepting the timeout as a valid outcome, I'm also reducing
>>> the print verbosity from error to notice.
>>>
>>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
>>> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a 
>>> fence")
>>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>> Cc: Tony Ye <tony.ye at intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_huc.c | 10 ++++++----
>>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> index 4d1cc383b681..73a6a2fae637 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>>> @@ -52,10 +52,12 @@
>>>    * guaranteed for this to happen during boot, so the big timeout 
>>> is a safety net
>>>    * that we never expect to need.
>>>    * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs 
>>> to be resumed
>>> - * and/or reset, this can take longer.
>>> + * and/or reset, this can take longer. Note that the kernel might 
>>> schedule
>>> + * other work between the i915 init/resume and the MEI one, which 
>>> can add to
>>> + * the delay.
>>>    */
>>>   #define GSC_INIT_TIMEOUT_MS 10000
>>> -#define PXP_INIT_TIMEOUT_MS 2000
>>> +#define PXP_INIT_TIMEOUT_MS 5000
>> If we already have the GSC timeout at 10s, why not just use 10s for 
>> PXP as well?
>
> They're different type of operations: mei_gsc is a full on aux driver, 
> so it is loaded only once during boot; mei_pxp is a component and it 
> is bound on init and then unbound/re-bound on suspend/resume. On 
> resume we don't want timeouts excessively big.
Okay.

>
>>
>>>     static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>>>                    enum i915_sw_fence_notify state)
>>> @@ -104,8 +106,8 @@ static enum hrtimer_restart 
>>> huc_delayed_load_timer_callback(struct hrtimer *hrti
>>>       struct intel_huc *huc = container_of(hrtimer, struct 
>>> intel_huc, delayed_load.timer);
>>>         if (!intel_huc_is_authenticated(huc)) {
>>> -        drm_err(&huc_to_gt(huc)->i915->drm,
>>> -            "timed out waiting for GSC init to load HuC\n");
>>> +        drm_notice(&huc_to_gt(huc)->i915->drm,
>>> +               "timed out waiting for GSC init to load HuC\n");
>> If the failure is that the MEI PXP module hasn't loaded yet, why is 
>> the error message 'waiting for GSC init'? Or can we not distinguish 
>> between GSC load timeout and PXP load timeout? In which case should 
>> the message refer to 'GSC/PXP'?
>
> I wanted to keep things simple and have a unified message for both 
> scenarios as what we care about from an i915 POV is that something 
> went wrong on the mei side. I can split it up.
Or just change it to 'waiting for GSC/PXP init'? If we don't care about 
the differentiation then no need to make the code more complex. But it 
is worth having a message that reflects all the major causes of the issue.

John.

>
> Daniele
>
>>
>> John.
>>
>>>             __gsc_init_error(huc);
>>>       }
>>
>