[Intel-gfx] [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout
Thomas Hellström
thomas.hellstrom at linux.intel.com
Thu Sep 23 13:19:37 UTC 2021
On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>
> On 23/09/2021 12:47, Thomas Hellström wrote:
>> Hi, Tvrtko,
>>
>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>
>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>> With GuC submission on DG1, the execution of the requests times out
>>>> for the gem_exec_suspend igt test case after executing around 800-900
>>>> of 1000 submitted requests.
>>>>
>>>> Given the time we allow elsewhere for fences to signal (in the
>>>> order of
>>>> seconds), increase the timeout before we mark the gt wedged and
>>>> proceed.
>>>
>>> I suspect it is not about requests not retiring in time but about
>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although
>>> I don't know which G2H message the code is waiting for at suspend
>>> time, so perhaps that's something to run past the GuC experts.
>>
>> So what's happening here is that the test submits 1000 requests,
>> each writing a value to an object, and then that object content is
>> checked after resume. With GuC it turns out that only 800-900 or so
>> values are actually written before we time out, and the test
>> (basic-S3) fails, but not on every run.
>
> Yes, and that did not make sense to me. It is a single context even, so
> I could not come up with an explanation for why the GuC would be slower.
>
> Unless it somehow manages to not even update the ring tail in time, and
> the requests are still only stuck in the software queue? Perhaps you can
> see that from the context tail and head when it happens.
>
>> This is a bit interesting in itself, because I never saw the hang-S3
>> test fail, which from what I can tell is basically an identical test
>> but with a spinner submitted after the 1000th request. It could be that
>> the suspend backup code ends up waiting for something before we end
>> up in intel_gt_wait_for_idle, giving more requests time to execute.
>
> No idea, I don't know the suspend paths that well. For instance before
> looking at the code I thought we would preempt what's executing and
> not wait for everything that has been submitted to finish. :)
>
>>> Anyway, if that turns out to be correct, then perhaps it would be
>>> better to split the two timeouts (since the required GuC timeout is
>>> perhaps fundamentally independent), so it's clear who needs how much
>>> time. Adding Matt and John to comment.
>>
>> You mean we should have separate timeouts depending on whether we're
>> using GuC or execlists submission?
>
> No, I don't know yet. First I think we need to figure out what exactly
> is happening.
Well then, TBH, I will need to file a separate Jira about that. There
might be various things going on here, like switching between the migrate
context for eviction of unrelated LMEM buffers and the context used by
gem_exec_suspend. The gem_exec_suspend failures are blocking DG1 BAT, so
it's pretty urgent to get this series merged. If you insist, I can leave
this patch out for now, but I'd rather commit it as-is and file a Jira
instead.
/Thomas