[Intel-gfx] [PATCH 6/8] drm/i915: Use safe list iterators

Fri Oct 9 05:00:49 PDT 2015

On 09/10/2015 11:38, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 11:27:48AM +0100, Tomas Elf wrote:
>> On 09/10/2015 08:41, Chris Wilson wrote:
>>> On Thu, Oct 08, 2015 at 07:31:38PM +0100, Tomas Elf wrote:
>>>> Error state capture is dependent on i915_gem_active_request() and
>>>> i915_gem_obj_is_pinned(). Since there is no synchronization between error state
>>>> capture and the driver state we at least need to use safe list iterators in
>>>> these functions to alleviate the problem of request/vma lists changing during
>>>> error state capture.
>>>
>>> Does not make it safe.
>>> -Chris
>>>
>>
>> I know it doesn't make it safe, I didn't say this patch makes it safe.
>>
>> Maybe I was unclear but I chose the word "alleviate" rather than
>> "solve" to indicate that there are still problems but this change
>> reduces the problem scope and makes crashes less likely to occur. I
>> also used the formulation "at least" to indicate that we're not
>> solving everything but we can do certain things to improve things
>> somewhat.
>>
>> The problems I've been seeing has been that the list state changes
>> during iteration and that the error capture tries to read elements
>> that are no longer part of the list - not that elements that the
>> error capture code is dereferencing are deallocated by the driver or
>> whatever. Using a safe iterator helps with that very particular
>> problem. Or maybe I guess I've just been incredibly lucky for the
>> last 2 months when I've been running this code as I've been able to
>> get 12+ hours of stability during my tests instead of less than one
>> hour in between crashes that was the case before I introduced these
>> changes.
>
> You have been incredibily lucky (probably due to how the requests are
> being cached now), but that the requests can be modified whilst error
> capture runs and oops is well known. Just pretending the list iterator
> is safe does nothing.
> -Chris
>

Does nothing except consistently extend meantime between failures from 
less than one hour to more than 12 hours during my TDR tests. That's a 
lot of luck right there. On a consistent and predictable basis. Also, 
I'm not trying to pretend, I thought I communicated clearly that this is 
not a solution but rather an improvement that makes certain types of 
tests actually possible to run, tests that are needed for the TDR 
validation.

But if you have another solution then let's go with that.

Thanks,
Tomas