[Intel-gfx] [PATCH 1/5] drm/i915/guc: Don't GEM_BUG_ON on corrupted G2H CTB

Thu Jan 16 19:24:08 UTC 2020

On 1/16/20 11:13 AM, Michal Wajdeczko wrote:
> On Thu, 16 Jan 2020 19:46:35 +0100, Daniele Ceraolo Spurio 
> <daniele.ceraolospurio at intel.com> wrote:
> 
>>
>>
>> On 1/15/20 6:08 AM, Michal Wajdeczko wrote:
>>> We should never BUG_ON on any corruption in CTB descriptor as
>>> data there can be also modified by the GuC. Instead we can
>>> use flag "is_in_error" to indicate that we will not process
>>> any further messages over this CTB (until reset). While here
>>> move descriptor error reporting to the function that actually
>>> touches that descriptor.
>>>  Note that unexpected content of the specific CT messages, that
>>> still complies with generic CT message format, shall not trigger
>>> disabling whole CTB, as that might just indicate new unsupported
>>> message types.
>>>  Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 42 ++++++++++++++---------
>>>   1 file changed, 26 insertions(+), 16 deletions(-)
>>>  diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> index a55c336cc5ef..0d3556a820a3 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
>>> @@ -578,19 +578,29 @@ static inline bool ct_header_is_response(u32 
>>> header)
>>>   static int ctb_read(struct intel_guc_ct_buffer *ctb, u32 *data)
>>>   {
>>>       struct guc_ct_buffer_desc *desc = ctb->desc;
>>> -    u32 head = desc->head / 4;    /* in dwords */
>>> -    u32 tail = desc->tail / 4;    /* in dwords */
>>> -    u32 size = desc->size / 4;    /* in dwords */
>>> +    u32 head = desc->head;
>>> +    u32 tail = desc->tail;
>>> +    u32 size = desc->size;
>>>       u32 *cmds = ctb->cmds;
>>> -    s32 available;            /* in dwords */
>>> +    s32 available;
>>>       unsigned int len;
>>>       unsigned int i;
>>>   -    GEM_BUG_ON(desc->size % 4);
>>> -    GEM_BUG_ON(desc->head % 4);
>>> -    GEM_BUG_ON(desc->tail % 4);
>>> -    GEM_BUG_ON(tail >= size);
>>> -    GEM_BUG_ON(head >= size);
>>> +    if (unlikely(desc->is_in_error))
>>> +        return -EPIPE;
>>
>> How do we recover from this situation? before we marked the buffer as 
>> in_error but didn't stop processing of G2H, but with this return here 
>> we do. Do we need to reset the CTB desc to recover?
> 
> before we should hit BUG_ON followed by PANIC (since we read in irq)
> now (or soon) we should be able to detect stalled CTB and then wedge
> we can't reset CTB alone as IIRC GuC keeps its own head/tail copies
> 

Ok, this is definitely better than a panic. Anyway AFAICS the only G2H 
message handle at the moment is the log flush, which is only enabled 
when we're using rolling debug logs, so there is basically 0 chance of 
hitting this in the wild. We do need to get a recovery method sorted out 
though before we start relying on having more messages. Maybe 
re-registering the buffers with GuC could work?

>>
>>> +
>>> +    if (unlikely(!IS_ALIGNED(head, 4) ||
>>> +             !IS_ALIGNED(tail, 4) ||
>>> +             !IS_ALIGNED(size, 4) ||
>>> +             (tail >= size) || (head >= size))) {
>>> +        DRM_ERROR("CT: Invalid data in descriptor\n");
>>
>> nit: this log is redundant since we have a better message after the 
>> jump which includes the values
> 
> yeah, looking again and agree that's redundant, will remove
> 
> Initially this "better message" was here, then it was reduced after copying
> it after jump to allow below error also to have desc details:
> 
>      DRM_ERROR("CT: incomplete message %*ph %*ph %*ph\n",
> 

With the logs fixed:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>

Daniele

>>
>> Daniele
>>
>>> +        goto corrupted;
>>> +    }
>>> +
>>> +    /* later calculations will be done in dwords */
>>> +    head /= 4;
>>> +    tail /= 4;
>>> +    size /= 4;
>>>         /* tail == head condition indicates empty */
>>>       available = tail - head;
>>> @@ -615,7 +625,7 @@ static int ctb_read(struct intel_guc_ct_buffer 
>>> *ctb, u32 *data)
>>>                      size - head : available - 1), &cmds[head],
>>>                 4 * (head + available - 1 > size ?
>>>                      available - 1 - size + head : 0), &cmds[0]);
>>> -        return -EPROTO;
>>> +        goto corrupted;
>>>       }
>>>         for (i = 1; i < len; i++) {
>>> @@ -626,6 +636,12 @@ static int ctb_read(struct intel_guc_ct_buffer 
>>> *ctb, u32 *data)
>>>         desc->head = head * 4;
>>>       return 0;
>>> +
>>> +corrupted:
>>> +    DRM_ERROR("CT: Corrupted descriptor addr=%#x head=%u tail=%u 
>>> size=%u\n",
>>> +          desc->addr, desc->head, desc->tail, desc->size);
>>> +    desc->is_in_error = 1;
>>> +    return -EPIPE;
>>>   }
>>>     /**
>>> @@ -836,10 +852,4 @@ void intel_guc_ct_event_handler(struct 
>>> intel_guc_ct *ct)
>>>           else
>>>               err = ct_handle_request(ct, msg);
>>>       } while (!err);
>>> -
>>> -    if (GEM_WARN_ON(err == -EPROTO)) {
>>> -        CT_ERROR(ct, "Corrupted message: %#x\n", msg[0]);
>>> -        ctb->desc->is_in_error = 1;
>>> -    }
>>>   }
>>> -