[Intel-gfx] [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Sep 23 12:59:59 UTC 2021


On 23/09/2021 12:47, Thomas Hellström wrote:
> Hi, Tvrtko,
> 
> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>
>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>> With GuC submission on DG1, the execution of the requests times out
>>> for the gem_exec_suspend igt test case after executing around 800-900
>>> of 1000 submitted requests.
>>>
>>> Given the time we allow elsewhere for fences to signal (in the order of
>>> seconds), increase the timeout before we mark the gt wedged and proceed.
>>
>> I suspect it is not about requests not retiring in time but about the
>> intel_guc_wait_for_idle part of intel_gt_wait_for_idle. That said, I
>> don't know which G2H message the code is waiting for at suspend time,
>> so perhaps this is something to run past the GuC experts.
> 
> So what's happening here is that the test submits 1000 requests, each
> writing a value to an object, and then that object content is checked 
> after resume. With GuC it turns out that only 800-900 or so values are 
> actually written before we time out, and the test (basic-S3) fails, but 
> not on every run.

Yes, and that did not make sense to me. It is even a single context, so I
could not come up with an explanation for why GuC would be slower.

Unless it somehow manages to not even update the ring tail in time and 
requests are still only stuck in the software queue? Perhaps you can see 
that from context tail and head when it happens.
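
One way to make that check concrete is sketched below. It is a standalone
userspace model, not i915 code: struct ring_snapshot, ring_pending_bytes()
and stuck_in_sw_queue() are hypothetical names, and the snapshot values
would have to come from whatever debug output shows the context's head and
tail. The idea is that if requests are outstanding but tail equals head,
the hardware has consumed everything it was handed, so the remaining work
must still be sitting in the software queue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of a context's ring state; not the i915 layout. */
struct ring_snapshot {
	uint32_t head;	/* offset the hardware has consumed up to */
	uint32_t tail;	/* offset last handed to the hardware */
	uint32_t size;	/* ring size in bytes, must be a power of two */
};

/* Bytes handed to the hardware but not yet consumed, handling wraparound. */
static uint32_t ring_pending_bytes(const struct ring_snapshot *r)
{
	/* The mask below is only valid for power-of-two ring sizes. */
	assert(r->size != 0 && (r->size & (r->size - 1)) == 0);

	return (r->tail - r->head) & (r->size - 1);
}

/*
 * Returns 1 if requests are still outstanding although nothing is pending
 * in the ring, i.e. the remaining work never reached the ring tail.
 */
static int stuck_in_sw_queue(const struct ring_snapshot *r,
			     unsigned int outstanding_requests)
{
	return outstanding_requests > 0 && ring_pending_bytes(r) == 0;
}
```

If stuck_in_sw_queue() were to return 1 at the point of the timeout, that
would point at submission rather than execution being the slow part.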

> This is a bit interesting in itself, because I never saw the hang-S3 
> test fail, which from what I can tell basically is an identical test but 
> with a spinner submitted after the 1000th request. Could be that the 
> suspend backup code ends up waiting for something before we end up in 
> intel_gt_wait_for_idle, giving more requests time to execute.

No idea, I don't know the suspend paths that well. For instance before 
looking at the code I thought we would preempt what's executing and not 
wait for everything that has been submitted to finish. :)

>> Anyway, if that turns out to be correct then perhaps it would be
>> better to split the two timeouts (for instance, if the required GuC
>> timeout is fundamentally independent) so it's clear who needs how
>> much time. Adding Matt and John to comment.
> 
> You mean we have separate timeouts depending on whether we're using GuC 
> or execlists submission?

No, I don't know yet. First I think we need to figure out what exactly 
is happening.

>> To be clear, as the timeout is AFAIK an arbitrary value, I don't have
>> fundamental objections here. I just think it would be good to have an
>> accurate story in the commit message.
> 
> Ok. yes. I wonder whether we actually should increase this timeout even 
> more since now the watchdog times out after 10+ seconds? I guess those 
> long-running requests could be executing also at suspend time.

We probably should not just increase it hugely. The watchdog is a
separate story, since it applies to unsubmitted and unready requests, and
suspend can happen fine with those around, I think.
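
For scale, here is a rough sketch of what these timeouts amount to in
wall-clock time. The new value HZ / 2 comes from the patch; that
I915_GEM_IDLE_TIMEOUT was HZ / 5 is an assumption here and should be
checked against i915_drv.h. The helper names are hypothetical and this is
plain userspace arithmetic, not kernel code.

```c
/* Convert a tick (jiffies) count to milliseconds for a given tick rate. */
static unsigned int ticks_to_msecs(unsigned int ticks, unsigned int hz)
{
	return ticks * 1000u / hz;
}

/* Timeout introduced by the patch: I915_GT_SUSPEND_IDLE_TIMEOUT = HZ / 2. */
static unsigned int suspend_idle_timeout_ticks(unsigned int hz)
{
	return hz / 2;
}

/* ASSUMPTION: the previous I915_GEM_IDLE_TIMEOUT was HZ / 5. */
static unsigned int gem_idle_timeout_ticks(unsigned int hz)
{
	return hz / 5;
}
```

For the usual CONFIG_HZ choices (100, 250, 300, 1000) this works out to
500 ms for the new timeout against 200 ms for the assumed old one, so it
stays well short of the 10+ second watchdog.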

Regards,

Tvrtko

> /Thomas
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> index dea8e2479897..f84f2bfe2de0 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> @@ -19,6 +19,8 @@
>>>   #include "intel_rps.h"
>>>   #include "intel_wakeref.h"
>>> +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
>>> +
>>>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>>>   {
>>>       int count = atomic_read(&gt->user_wakeref);
>>> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>>>       if (!intel_gt_pm_is_awake(gt))
>>>           return;
>>> -    if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
>>> +    if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
>>>           /*
>>>            * Forcibly cancel outstanding work and leave
>>>            * the gpu quiet.
>>>

