[Intel-gfx] [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout

Thu Sep 23 11:47:46 UTC 2021

Hi, Tvrtko,

On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>
> On 22/09/2021 07:25, Thomas Hellström wrote:
>> With GuC submission on DG1, the execution of the requests times out
>> for the gem_exec_suspend igt test case after executing around 800-900
>> of 1000 submitted requests.
>>
>> Given the time we allow elsewhere for fences to signal (in the order of
>> seconds), increase the timeout before we mark the gt wedged and proceed.
>
> I suspect it is not about requests not retiring in time but about the 
> intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although I 
> don't know which G2H message is the code waiting for at suspend time 
> so perhaps something to run past the GuC experts.

So what's happening here is that the test submits 1000 requests, each
writing a value to an object, and then that object's content is checked
after resume. With GuC it turns out that only 800-900 or so values are
actually written before we time out, and the test (basic-S3) fails, but
not on every run.

This is a bit interesting in itself, because I never saw the hang-S3
test fail, which from what I can tell is basically an identical test but
with a spinner submitted after the 1000th request. It could be that the
suspend backup code ends up waiting for something before we reach
intel_gt_wait_for_idle, giving more requests time to execute.

>
> Anyway, if that turns out to be correct then perhaps it would be 
> better to split the two timeouts (like if required GuC timeout is 
> perhaps fundamentally independent) so it's clear who needs how much 
> time. Adding Matt and John to comment.

You mean we should have separate timeouts depending on whether we're
using GuC or execlists submission?
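
A minimal sketch of what separate timeouts per submission backend could
look like. Everything here is hypothetical: the enum, the function name
and the two constants do not exist in i915, HZ is hard-coded so the
snippet builds outside the kernel, and the GuC value is a placeholder
rather than a measured requirement.

```c
/* Assumed tick rate for this standalone sketch; in the kernel HZ is a build-time config. */
#define HZ 250

/* Hypothetical: i915 has no enum with this name. */
enum submission_backend {
	SUBMISSION_EXECLISTS,
	SUBMISSION_GUC,
};

/* Hypothetical constants, both expressed in jiffies. */
#define SKETCH_EXECLISTS_SUSPEND_TIMEOUT (HZ / 2) /* value taken from the patch */
#define SKETCH_GUC_SUSPEND_TIMEOUT       (2 * HZ) /* placeholder, not measured */

/* Return the suspend idle timeout, in jiffies, for the given backend. */
long suspend_idle_timeout(enum submission_backend backend)
{
	switch (backend) {
	case SUBMISSION_GUC:
		return SKETCH_GUC_SUSPEND_TIMEOUT;
	case SUBMISSION_EXECLISTS:
	default:
		return SKETCH_EXECLISTS_SUSPEND_TIMEOUT;
	}
}
```

The caller in wait_for_suspend() would then pass the returned value to
intel_gt_wait_for_idle() instead of a single shared constant.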

>
> To be clear, as the timeout is AFAIK an arbitrary value, I don't have
> fundamental objections here. I just think it would be good to have
> an accurate story in the commit message.

OK, yes. I wonder whether we should actually increase this timeout even
more, since the watchdog now times out after 10+ seconds. I guess those
long-running requests could also be executing at suspend time.
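
To put the two numbers side by side, here is a small standalone sketch
of the arithmetic. The helper names are hypothetical, the tick rate is
passed in as a parameter because HZ is a kernel build option, and the
10-second figure is simply the lower bound mentioned above.

```c
/* Hypothetical helper: convert a jiffies count to milliseconds for a given tick rate. */
long sketch_jiffies_to_ms(long jiffies, long hz)
{
	return jiffies * 1000 / hz;
}

/* The suspend timeout proposed in the patch is HZ / 2 jiffies. */
long sketch_suspend_timeout_ms(long hz)
{
	return sketch_jiffies_to_ms(hz / 2, hz);
}

/* Watchdog timeout, assuming the 10-second lower bound (10 * HZ jiffies). */
long sketch_watchdog_ms(long hz)
{
	return sketch_jiffies_to_ms(10 * hz, hz);
}
```

For the common tick rates (100, 250, 1000) the suspend timeout comes out
at 500 ms, a factor of 20 below a 10-second watchdog.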

/Thomas

>
> Regards,
>
> Tvrtko
>
>>
>> Signed-off-by: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index dea8e2479897..f84f2bfe2de0 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -19,6 +19,8 @@
>>   #include "intel_rps.h"
>>   #include "intel_wakeref.h"
>>   +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
>> +
>>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>>   {
>>       int count = atomic_read(&gt->user_wakeref);
>> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>>       if (!intel_gt_pm_is_awake(gt))
>>           return;
>>   -    if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
>>   +    if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
>>           /*
>>            * Forcibly cancel outstanding work and leave
>>            * the gpu quiet.
>>