[Intel-gfx] [PATCH] drm/i915/guc: Propagate the fw xfer timeout

Daniele Ceraolo Spurio daniele.ceraolospurio at intel.com
Thu Oct 18 18:18:53 UTC 2018



On 18/10/18 02:13, Chris Wilson wrote:
> Quoting Michal Wajdeczko (2018-10-18 00:22:43)
>> On Thu, 18 Oct 2018 01:09:19 +0200, Daniele Ceraolo Spurio
>> <daniele.ceraolospurio at intel.com> wrote:
>>
>>>
>>>
>>> On 17/10/18 13:29, Chris Wilson wrote:
>>>> Propagate the timeout on transferring the fw back to the caller where it
>>>> may act upon it, usually by restarting the xfer before failing.
>>>>
>>>
>>> Did you see any case where we failed the xfer and didn't get a timeout
>>> out of guc_wait_ucode? that'd be quite weird
> 
> Yes the logs show the xfer error but not the wait error. So we ended up
> returning 0.
> 

The guc status register is correctly cleared by HW on guc reset (just 
checked that), so if the wait_ucode succeeded in matching a "ready" 
value in there it means that the guc FW had loaded correctly. Maybe it 
managed to complete the xfer just after the timeout elapsed or the DMA 
got confused and reported the wrong status. Still weird, but your 
changes should help make the error more visible.

>>> Anyway, definitely better and cleaner than before:
>>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>>
>>> small nitpick below
>>>
>>>> Testcase: igt/drv_selftest/live_hangcheck
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/intel_guc_fw.c | 23 ++++++-----------------
>>>>    1 file changed, 6 insertions(+), 17 deletions(-)
>>>>   diff --git a/drivers/gpu/drm/i915/intel_guc_fw.c
>>>> b/drivers/gpu/drm/i915/intel_guc_fw.c
>>>> index a9e6fcce467c..b68a05892dab 100644
>>>> --- a/drivers/gpu/drm/i915/intel_guc_fw.c
>>>> +++ b/drivers/gpu/drm/i915/intel_guc_fw.c
>>>> @@ -125,22 +125,17 @@ static void guc_prepare_xfer(struct intel_guc
>>>> *guc)
>>>>    }
>>>>      /* Copy RSA signature from the fw image to HW for verification */
>>>> -static int guc_xfer_rsa(struct intel_guc *guc, struct i915_vma *vma)
>>>> +static void guc_xfer_rsa(struct intel_guc *guc, struct i915_vma *vma)
>>>>    {
>>>>       struct drm_i915_private *dev_priv = guc_to_i915(guc);
>>>> -    struct intel_uc_fw *guc_fw = &guc->fw;
>>>> -    struct sg_table *sg = vma->pages;
>>>>       u32 rsa[UOS_RSA_SCRATCH_COUNT];
>>>>       int i;
>>>>    -  if (sg_pcopy_to_buffer(sg->sgl, sg->nents, rsa, sizeof(rsa),
>>>> -                           guc_fw->rsa_offset) != sizeof(rsa))
>>>> -            return -EINVAL;
>>>> +    sg_pcopy_to_buffer(vma->pages->sgl, vma->pages->nents,
>>>> +                       rsa, sizeof(rsa), guc->fw.rsa_offset);
>>>>       for (i = 0; i < UOS_RSA_SCRATCH_COUNT; i++)
>>>>               I915_WRITE(UOS_RSA_SCRATCH(i), rsa[i]);
>>>> -
>>>> -    return 0;
>>>>    }
>>>>      /*
>>>> @@ -251,17 +246,11 @@ static int guc_fw_xfer(struct intel_uc_fw
>>>> *guc_fw, struct i915_vma *vma)
>>>>        * by the DMA engine in one operation, whereas the RSA signature is
>>>>        * loaded via MMIO.
>>>>        */
>>>> -    ret = guc_xfer_rsa(guc, vma);
>>>> -    if (ret)
>>>> -            DRM_WARN("GuC firmware signature xfer error %d\n", ret);
>>>> +    guc_xfer_rsa(guc, vma);
>>>>       ret = guc_xfer_ucode(guc, vma);
>>>> -    if (ret)
>>>> -            DRM_WARN("GuC firmware code xfer error %d\n", ret);
>>>> -
>>>> -    ret = guc_wait_ucode(guc);
>>>> -    if (ret)
>>>> -            DRM_ERROR("GuC firmware xfer error %d\n", ret);
>>>
>>> With the logs gone we don't have any more error message to understand
>>> where exactly we hit issue (xfer vs wait_ucode). We do have debug logs
>>> printing the status that would give the info, but might be worth
>>> retaining at least 1 error-level log.
>>
>> +1 for leaving error-level logs (here or at helper functions)
>> also, please let CI re-run this patch with GuC enabled ;)
> 
> I disagree since the errors are noise as the error is propagated and
> handled by the caller -- so not strictly errors, but maybe
> DRM_DEBUG_DRIVER().
> -Chris
> 

True, we don't want an error message if the caller decides that a 
failure can be expected and retries. we already have DRM_DEBUG_DRIVER 
printing status inside the GuC wait and GuC xfer functions, those should 
be enough.

Daniele


More information about the Intel-gfx mailing list