[Intel-gfx] [PATCH 2/3] drm/i915/guc: downgrade some DRM_ERROR() messages to DRM_WARN()

Dave Gordon david.s.gordon at intel.com
Tue Jul 12 12:29:54 UTC 2016


On 12/07/16 10:20, Tvrtko Ursulin wrote:
>
> On 11/07/16 19:01, Dave Gordon wrote:
>> Where we're going to continue regardless of the problem, rather than
>> fail, then the message should be a WARNing rather than an ERROR.
>>
>> Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_guc_submission.c | 18 +++++++-----------
>>   1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c
>> b/drivers/gpu/drm/i915/i915_guc_submission.c
>> index 2112e02..e299b64 100644
>> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>> @@ -114,10 +114,8 @@ static int host2guc_action(struct intel_guc *guc,
>> u32 *data, u32 len)
>>           if (ret != -ETIMEDOUT)
>>               ret = -EIO;
>>
>> -        DRM_ERROR("GUC: host2guc action 0x%X failed. ret=%d "
>> -                "status=0x%08X response=0x%08X\n",
>> -                data[0], ret, status,
>> -                I915_READ(SOFT_SCRATCH(15)));
>> +        DRM_WARN("Action 0x%X failed; ret=%d status=0x%08X
>> response=0x%08X\n",
>> +             data[0], ret, status, I915_READ(SOFT_SCRATCH(15)));
>
> Hm, this does propagate the error code to the callers some which will
> act and log the failure. Majority won't though - like suspend/resume
> etc. In those cases it feels more like an error than a warning.

It's definitely something that shouldn't happen, so we need to log it; 
and it has to be done at this level because we don't pass enough 
information back to leave it to the caller. But OTOH this layer doesn't 
have enough information to determine just how serious a failure is.

So as a compromise the idea is to log a WARNING here, and then the 
caller can choose to:
1. pass the failure up (until we reach a layer with more context)
2. quietly disregard the failure and continue anyway
3. report an ERROR and fail/abort the process.

That way we should get all the useful information about the root cause 
of something that ends up as an ERROR, while neither ignoring nor being 
too verbose about failures from which we may ultimately recover.

For example, one of the callers (the doorbell h/w initialisation code) 
considers some failures as interesting but not critical (DEBUG level) 
but other instances of the exact same operation are fatal (ERROR).

>>           dev_priv->guc.action_fail += 1;
>>           dev_priv->guc.action_err = ret;
>> @@ -553,8 +551,8 @@ static int guc_ring_doorbell(struct
>> i915_guc_client *gc)
>>           if (db_ret.db_status == GUC_DOORBELL_DISABLED)
>>               break;
>>
>> -        DRM_ERROR("Cookie mismatch. Expected %d, returned %d\n",
>> -              db_cmp.cookie, db_ret.cookie);
>> +        DRM_WARN("Cookie mismatch. Expected %d, found %d\n",
>> +             db_cmp.cookie, db_ret.cookie);
>
> This one is interesting, error is propagated out a bit but then ignored
> in actual command submission.
>
> If the above message means command will not be submitted error is
> probably more appropriate. Or perhaps we cannot tell if the command was
> submitted or not in this case?

Note that this is inside a retry loop. It shouldn't ever happen, but if 
it does we'll report it and try one more time. If it keeps happening 
(which would require active interference by some other party) we won't 
be able to ring the doorbell and the failure will be propagated back out 
of the GuC submission code. OTOH the caller then ignores it because 
"submission is not allowed to fail" (!)

And yes, it is then undefined as to whether the command has been 
submitted or not. If it hasn't we'll expect a GPU hang later.

.Dave.

>>           /* update the cookie to newly read cookie from GuC */
>>           db_cmp.cookie = db_ret.cookie;
>> @@ -726,8 +724,8 @@ static void guc_init_doorbell_hw(struct intel_guc
>> *guc)
>>       /* Restore to original value */
>>       err = guc_update_doorbell_id(guc, client, db_id);
>>       if (err)
>> -        DRM_ERROR("Failed to restore doorbell to %d, err %d\n",
>> -            db_id, err);
>> +        DRM_WARN("Failed to restore doorbell to %d, err %d\n",
>> +             db_id, err);
>>
>>       for (i = 0; i < GUC_MAX_DOORBELLS; ++i) {
>>           i915_reg_t drbreg = GEN8_DRBREGL(i);
>> @@ -819,8 +817,6 @@ static void guc_init_doorbell_hw(struct intel_guc
>> *guc)
>>       return client;
>>
>>   err:
>> -    DRM_ERROR("FAILED to create priority %u GuC client!\n", priority);
>> -
>>       guc_client_free(dev_priv, client);
>>       return NULL;
>>   }
>> @@ -998,7 +994,7 @@ int i915_guc_submission_enable(struct
>> drm_i915_private *dev_priv)
>>                     GUC_CTX_PRIORITY_KMD_NORMAL,
>>                     dev_priv->kernel_context);
>>       if (!client) {
>> -        DRM_ERROR("Failed to create execbuf guc_client\n");
>> +        DRM_ERROR("Failed to create normal GuC client!\n");
>>           return -ENOMEM;
>>       }
>>
>>
>
> Regards,
> Tvrtko



More information about the Intel-gfx mailing list