[Intel-gfx] [PATCH v2] drm/i915/huc: bump timeout for delayed load and reduce print verbosity
Ceraolo Spurio, Daniele
daniele.ceraolospurio at intel.com
Mon Oct 10 22:57:21 UTC 2022
On 10/10/2022 3:50 PM, John Harrison wrote:
> On 10/10/2022 11:48, Daniele Ceraolo Spurio wrote:
>> We're observing sporadic HuC delayed load timeouts in CI, due to mei_pxp
>> binding completing later than we expected. HuC is still loaded when the
>> bind occurs, but in the meantime i915 has started allowing submission to
>> the VCS engines even if HuC is not there.
>> In most of the cases I've observed, the timeout was due to the
>> init/resume of another driver between i915 and mei hitting errors and
>> thus adding an extra delay, but HuC was still loaded before userspace
>> could submit, because the whole resume process time was increased by the
>> delays.
>>
>> Given that there is no upper bound to the delay that can be introduced
>> by other drivers, I've reached the following compromise with the media
>> team:
>>
>> 1) i915 is going to bump the timeout to 5s, to reduce the probability
>> of reaching it. We still expect HuC to be loaded before userspace
>> starts submitting, so increasing the timeout should have no impact on
>> normal operations, but in case something weird happens we don't want to
>> stall video submissions for too long.
>>
>> 2) The media driver will cope with the failing submissions that manage
>> to go through between i915 init/resume complete and HuC loading, if any
>> ever happen. This could cause a small corruption of video playback
>> immediately after a resume (we should be safe on boot because the media
>> driver polls the HUC_STATUS ioctl before starting submissions).
>>
>> Since we're accepting the timeout as a valid outcome, I'm also reducing
>> the print verbosity from error to notice.
>>
>> v2: use separate prints for MEI GSC and MEI PXP init timeouts (John)
>>
>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7033
>> Fixes: 27536e03271d ("drm/i915/huc: track delayed HuC load with a
>> fence")
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>> Cc: Tony Ye <tony.ye at intel.com>
>> Cc: John Harrison <john.c.harrison at intel.com>
>> ---
>> drivers/gpu/drm/i915/gt/uc/intel_huc.c | 14 ++++++++++----
>> 1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> index 4d1cc383b681..41c032ab34b3 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_huc.c
>> @@ -52,10 +52,12 @@
>> * guaranteed for this to happen during boot, so the big timeout is
>> a safety net
>> * that we never expect to need.
>> * MEI-PXP + HuC load usually takes ~300ms, but if the GSC needs to
>> be resumed
>> - * and/or reset, this can take longer.
>> + * and/or reset, this can take longer. Note that the kernel might
>> schedule
>> + * other work between the i915 init/resume and the MEI one, which
>> can add to
>> + * the delay.
>> */
>> #define GSC_INIT_TIMEOUT_MS 10000
>> -#define PXP_INIT_TIMEOUT_MS 2000
>> +#define PXP_INIT_TIMEOUT_MS 5000
>> static int sw_fence_dummy_notify(struct i915_sw_fence *sf,
>> enum i915_sw_fence_notify state)
>> @@ -104,8 +106,12 @@ static enum hrtimer_restart
>> huc_delayed_load_timer_callback(struct hrtimer *hrti
>> struct intel_huc *huc = container_of(hrtimer, struct intel_huc,
>> delayed_load.timer);
>> if (!intel_huc_is_authenticated(huc)) {
>> - drm_err(&huc_to_gt(huc)->i915->drm,
>> - "timed out waiting for GSC init to load HuC\n");
>> + if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_GSC)
>> + drm_notice(&huc_to_gt(huc)->i915->drm,
>> + "timed out waiting for MEI GSC init to load HuC\n");
>> + else if (huc->delayed_load.status == INTEL_HUC_WAITING_ON_PXP)
>> + drm_notice(&huc_to_gt(huc)->i915->drm,
>> + "timed out waiting for MEI PXP init to load HuC\n");
> Hmm. It feels wrong to assume that the status can only be GSC or PXP.
> Granted, it certainly should be one of those two values if we get
> here. But it just seems like bad practice to have a partial if/elsif
> ladder for an enum value instead of a switch with a default path. What
> I meant originally was just have a single print that says 'timed out
> waiting for MEI GSC/PXP to load...', maybe with the status value being
> printed. I don't think it is critically important to differentiate
> between the two, I just meant that it was wrong to explicitly state
> GSC when it might not be.
It is guaranteed that the state is one of those 2 in this callback. The
only other option is an error state, which we set from the
__gsc_init_error() below. I can make it an unconditional else, or add an
else MISSING_CASE(status) if you think that works better, but given the
errors we've seen I'd prefer to keep the prints separate for extra clarity.
Daniele
>
> John.
>
>> __gsc_init_error(huc);
>> }
>
More information about the Intel-gfx
mailing list