[PATCH] drm/amdgpu: disable job timeout on GPU reset disabled

Christian König christian.koenig at amd.com
Tue Mar 20 14:21:59 UTC 2018


That's a good point as well, maybe we should have separate timeouts for 
gfx and compute?

Something like 5 seconds for gfx and 1 minute (or even longer) for compute?
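
To make the idea concrete, here is a rough sketch of per-ring-type defaults. All names in it (sketch_ring_type, sketch_job_timeout_ms, the SKETCH_* constants) are hypothetical and are not existing amdgpu code. The 5 s and 60 s values are just the numbers floated above, and the override parameter is an assumption meant to cover Alex's point about being able to override the compute timeout.

```c
/* Hypothetical ring classes for illustration; not the driver's real ring types. */
enum sketch_ring_type {
	SKETCH_RING_GFX,
	SKETCH_RING_COMPUTE,
};

/* Assumed defaults in milliseconds, taken from the numbers suggested above. */
#define SKETCH_GFX_TIMEOUT_MS      5000
#define SKETCH_COMPUTE_TIMEOUT_MS  60000

/*
 * Pick the job timeout for a ring.
 * A positive user_override_ms (hypothetical parameter) wins for every
 * ring type; otherwise compute gets the longer default.
 */
static int sketch_job_timeout_ms(enum sketch_ring_type type, int user_override_ms)
{
	if (user_override_ms > 0)
		return user_override_ms;

	return type == SKETCH_RING_COMPUTE ? SKETCH_COMPUTE_TIMEOUT_MS
					   : SKETCH_GFX_TIMEOUT_MS;
}
```

With something along these lines a long-running dgemm dispatch would stay under the compute default, while gfx hangs would still be reported quickly.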

Anyway I agree that we can worry about that later on, patch is 
Reviewed-by: Christian König <christian.koenig at amd.com> for now.

Regards,
Christian.

On 20.03.2018 at 15:16, Deucher, Alexander wrote:
>
> My concern was that compute will always have the timeout disabled with 
> no way to override it even if you enable GPU reset.  I guess we can 
> address that down the road.
>
>
> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>
> ------------------------------------------------------------------------
> *From:* Koenig, Christian
> *Sent:* Tuesday, March 20, 2018 6:14:29 AM
> *To:* Quan, Evan; amd-gfx at lists.freedesktop.org
> *Cc:* Deucher, Alexander
> *Subject:* Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset disabled
> Hi Evan,
>
> that one is perfect if you ask me. I am just reading up on the history of
> that patch. Alex, what was your concern with it?
>
> Regarding printing this as error, that's a really good point as well. We
> should probably reduce it to a warning or even info severity.
>
> Regards,
> Christian.
>
> On 20.03.2018 at 03:11, Quan, Evan wrote:
> > Hi Christian,
> >
> > The messages printed on timeout are errors, not just warnings, although
> > we did not see any real problem (for the dgemm special case). That's why
> > we call them confusing.
> > I suppose you want a fix like my previous patch (see attachment).
> >
> > Regards,
> > Evan
> >> -----Original Message-----
> >> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> >> Sent: Monday, March 19, 2018 5:42 PM
> >> To: Quan, Evan <Evan.Quan at amd.com>; amd-gfx at lists.freedesktop.org
> >> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>
> >> Subject: Re: [PATCH] drm/amdgpu: disable job timeout on GPU reset
> >> disabled
> >>
> >> On 19.03.2018 at 07:08, Evan Quan wrote:
> >>> Under some heavy compute workloads (the dgemm test), it takes the
> >>> ASIC more than 10 seconds to finish a single dispatched job, which
> >>> triggers the timeout. This is quite confusing, although it does not
> >>> seem to cause any real problems.
> >>> As a quick workaround, we choose to disable the timeout when GPU reset
> >>> is disabled.
> >> NAK, I enabled those warnings intentionally even when GPU recovery is
> >> disabled, to have a hint in the logs about what goes wrong.
> >>
> >> Please only increase the timeout for the compute queue and/or add a
> >> separate timeout for them.
> >>
> >> Regards,
> >> Christian.
> >>
> >>
> >>> Change-Id: I3a95d856ba4993094dc7b6269649e470c5b053d2
> >>> Signed-off-by: Evan Quan <evan.quan at amd.com>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++++
> >>>    1 file changed, 7 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 8bd9c3f..9d6a775 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -861,6 +861,13 @@ static void amdgpu_device_check_arguments(struct amdgpu_device *adev)
> >>>              amdgpu_lockup_timeout = 10000;
> >>>      }
> >>>
> >>> +   /*
> >>> +    * Disable timeout when GPU reset is disabled to avoid confusing
> >>> +    * timeout messages in the kernel log.
> >>> +    */
> >>> +   if (amdgpu_gpu_recovery == 0 || amdgpu_gpu_recovery == -1)
> >>> +           amdgpu_lockup_timeout = INT_MAX;
> >>> +
> >>>      adev->firmware.load_type = amdgpu_ucode_get_load_type(adev, amdgpu_fw_load_type);
> >>>    }
> >>>
>
