[Intel-gfx] [PATCH] drm/i915/guc: Propagate the fw xfer timeout

Thu Oct 18 18:27:20 UTC 2018

On Thu, 18 Oct 2018 20:18:53 +0200, Daniele Ceraolo Spurio  
<daniele.ceraolospurio at intel.com> wrote:

>
>
> On 18/10/18 02:13, Chris Wilson wrote:
>> Quoting Michal Wajdeczko (2018-10-18 00:22:43)
>>> On Thu, 18 Oct 2018 01:09:19 +0200, Daniele Ceraolo Spurio
>>> <daniele.ceraolospurio at intel.com> wrote:
>>>
>>>>
>>>>
>>>> On 17/10/18 13:29, Chris Wilson wrote:
>>>>> Propagate the timeout on transferring the fw back to the caller  
>>>>> where it
>>>>> may act upon it, usually by restarting the xfer before failing.
>>>>>
>>>>
>>>> Did you see any case where we failed the xfer and didn't get a timeout
>>>> out of guc_wait_ucode? that'd be quite weird
>>  Yes the logs show the xfer error but not the wait error. So we ended up
>> returning 0.
>>
>
> The guc status register is correctly cleared by HW on guc reset (just  
> checked that), so if the wait_ucode succeeded in matching a "ready"  
> value in there it means that the guc FW had loaded correctly. Maybe it  
> managed to complete the xfer just after the timeout elapsed or the DMA  
> got confused and reported the wrong status. Still weird, but your  
> changes should help make the error more visible.

maybe first timeout in DMA transfer was mitigated by additional wait in
guc_wait_ucode() - so maybe this error propagation is not that important ?

>
>>>> Anyway, definitely better and cleaner than before:
>>>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>>>
>>>> small nitpick below
>>>>
>>>>> Testcase: igt/drv_selftest/live_hangcheck
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/intel_guc_fw.c | 23 ++++++-----------------
>>>>>    1 file changed, 6 insertions(+), 17 deletions(-)
>>>>>   diff --git a/drivers/gpu/drm/i915/intel_guc_fw.c
>>>>> b/drivers/gpu/drm/i915/intel_guc_fw.c
>>>>> index a9e6fcce467c..b68a05892dab 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_guc_fw.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_guc_fw.c
>>>>> @@ -125,22 +125,17 @@ static void guc_prepare_xfer(struct intel_guc
>>>>> *guc)
>>>>>    }
>>>>>      /* Copy RSA signature from the fw image to HW for verification  
>>>>> */
>>>>> -static int guc_xfer_rsa(struct intel_guc *guc, struct i915_vma *vma)
>>>>> +static void guc_xfer_rsa(struct intel_guc *guc, struct i915_vma  
>>>>> *vma)
>>>>>    {
>>>>>       struct drm_i915_private *dev_priv = guc_to_i915(guc);
>>>>> -    struct intel_uc_fw *guc_fw = &guc->fw;
>>>>> -    struct sg_table *sg = vma->pages;
>>>>>       u32 rsa[UOS_RSA_SCRATCH_COUNT];
>>>>>       int i;
>>>>>    -  if (sg_pcopy_to_buffer(sg->sgl, sg->nents, rsa, sizeof(rsa),
>>>>> -                           guc_fw->rsa_offset) != sizeof(rsa))
>>>>> -            return -EINVAL;
>>>>> +    sg_pcopy_to_buffer(vma->pages->sgl, vma->pages->nents,
>>>>> +                       rsa, sizeof(rsa), guc->fw.rsa_offset);
>>>>>       for (i = 0; i < UOS_RSA_SCRATCH_COUNT; i++)
>>>>>               I915_WRITE(UOS_RSA_SCRATCH(i), rsa[i]);
>>>>> -
>>>>> -    return 0;
>>>>>    }
>>>>>      /*
>>>>> @@ -251,17 +246,11 @@ static int guc_fw_xfer(struct intel_uc_fw
>>>>> *guc_fw, struct i915_vma *vma)
>>>>>        * by the DMA engine in one operation, whereas the RSA  
>>>>> signature is
>>>>>        * loaded via MMIO.
>>>>>        */
>>>>> -    ret = guc_xfer_rsa(guc, vma);
>>>>> -    if (ret)
>>>>> -            DRM_WARN("GuC firmware signature xfer error %d\n", ret);
>>>>> +    guc_xfer_rsa(guc, vma);
>>>>>       ret = guc_xfer_ucode(guc, vma);
>>>>> -    if (ret)
>>>>> -            DRM_WARN("GuC firmware code xfer error %d\n", ret);
>>>>> -
>>>>> -    ret = guc_wait_ucode(guc);
>>>>> -    if (ret)
>>>>> -            DRM_ERROR("GuC firmware xfer error %d\n", ret);
>>>>
>>>> With the logs gone we don't have any more error message to understand
>>>> where exactly we hit issue (xfer vs wait_ucode). We do have debug logs
>>>> printing the status that would give the info, but might be worth
>>>> retaining at least 1 error-level log.
>>>
>>> +1 for leaving error-level logs (here or at helper functions)
>>> also, please let CI re-run this patch with GuC enabled ;)
>>  I disagree since the errors are noise as the error is propagated and
>> handled by the caller -- so not strictly errors, but maybe
>> DRM_DEBUG_DRIVER().
>> -Chris
>>
>
> True, we don't want an error message if the caller decides that a  
> failure can be expected and retries. we already have DRM_DEBUG_DRIVER  
> printing status inside the GuC wait and GuC xfer functions, those should  
> be enough.
>
> Daniele