[Intel-gfx] [PATCH] drm/i915/gt: Be defensive in the face of false CS events

Fri Jul 10 12:49:43 UTC 2020

On 10/07/2020 13:35, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-10 13:30:09)
>>
>> On 10/07/2020 13:16, Chris Wilson wrote:
>>> If the HW throws a curve ball and reports either en event before it is
>>> possible, or just a completely impossible event, we have to grin and
>>> bear it. The first few events, we will likely not notice as we would be
>>> expecting some event, but as soon as we stop expecting an event and yet
>>> they still keep coming, then we enter into undefined state territory.
>>> In which case, bail out, stop processing the events, and reset the
>>> engine and our set of queued requests to recover.
>>>
>>> The sporadic hangs and warnings will continue to plague CI, but at least
>>> system stability should not be compromised.
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index fbcfeaed6441..c86324d2d2bb 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -2567,6 +2567,7 @@ static void process_csb(struct intel_engine_cs *engine)
>>>        tail = READ_ONCE(*execlists->csb_write);
>>>        if (unlikely(head == tail))
>>>                return;
>>> +     execlists->csb_head = tail;
>>
>> This deserves a comment...
>>
>>>    
>>>        /*
>>>         * Hopefully paired with a wmb() in HW!
>>> @@ -2613,6 +2614,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>                if (promote) {
>>>                        struct i915_request * const *old = execlists->active;
>>>    
>>> +                     if (GEM_WARN_ON(!*execlists->pending))
>>> +                             break;
>>> +
>>
>> ... but why not continue? You think nothing good can come out of trying
>> further and break simply expedites the hang? We have to be confident we
>> can cope with any random i915 state caused by skipping maybe valid entries.
> 
> We are already past the point of no return as the events coming from HW
> do not correspond to our events; continuing on cannot recover, we will
> already have made mistakes.

Yeah, I am just worried if between first error and reset, the fact we 
skipped possible valid entries, could cause hitting some other bug on or 
null ptr deref. I don't have anything concrete.. so maybe just FUD.

>> Conclusion will define what kind of comment to put above. "Assume we
>> always consume all CSB entries, or things are really bad and we mark all
>> as invalid upon finding first bad entry"?
> 
> It's dead, Jim.
> 
> We escape out, reset the engine/GPU, consign the port tracking to the bin,
> and reload with the next set of requests.

With a comment at the "execlists->csb_head = tail;" site explaining the 
plan for handling seriously unexpected HW events:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko