[PATCH] drm/amdgpu: extend the default timeout for kernel compute queues
Xu, Feifei
Feifei.Xu at amd.com
Fri Apr 21 09:28:31 UTC 2023
[AMD Official Use Only - General]
Some Vulkan stress tests might not be possible to rewrite using ROCm.
On second thought, 120s might be too risky, because the softlockup timeout is also set to 120s.
To support stress tests like the Vulkan one I recently saw on stressbench, we could shorten the 120s to a more reasonable value such as 100s, which would still fix the software hang.
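
As a rough illustration of the 100s suggestion, here is a minimal standalone sketch of the default selection, not driver code. The macro names, the function name and the `virt_mode` flag are hypothetical; `virt_mode` stands in for the SR-IOV/passthrough branch, whose exact condition is not visible in the quoted diff context, and the 120s limit is simply the figure mentioned in this thread.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not from amdgpu. */
#define WATCHDOG_LIMIT_MS	120000u	/* 120s limit cited in this thread */
#define PROPOSED_COMPUTE_MS	100000u	/* 100s value suggested above */

/* Compile-time check that the proposed default stays below the limit. */
_Static_assert(PROPOSED_COMPUTE_MS < WATCHDOG_LIMIT_MS,
	       "compute timeout must stay below the watchdog limit");

/*
 * Mirrors the shape of the defaults in the quoted patch:
 *   - virt_mode models the SR-IOV/passthrough branch (assumed flag),
 *   - pp_one_vf models the result of amdgpu_sriov_is_pp_one_vf().
 * Returns the default compute timeout in milliseconds.
 */
static unsigned int default_compute_timeout_ms(bool virt_mode, bool pp_one_vf)
{
	if (virt_mode)
		return pp_one_vf ? 60000u : 10000u;

	return PROPOSED_COMPUTE_MS;
}
```

With these assumptions, `default_compute_timeout_ms(false, false)` returns 100000, which leaves 20s of margin before the 120s limit.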
-----Original Message-----
From: Alex Deucher <alexdeucher at gmail.com>
Sent: Thursday, April 20, 2023 8:57 PM
To: Xu, Feifei <Feifei.Xu at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: Re: [PATCH] drm/amdgpu: extend the default timeout for kernel compute queues
On Thu, Apr 20, 2023 at 5:19 AM Feifei Xu <Feifei.Xu at amd.com> wrote:
>
> Extend to 120s. The default timeout value should also extend if
> compute shader execution time extended. Otherwise some stress test
> will trigger compute ring timeout in software.
I think that's probably too long. 2 minutes is a long time to have a hung system. I think we should rework the tests or use ROCm for long-running test cases.
Alex
>
> Signed-off-by: Feifei Xu <Feifei.Xu at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e536886f6d42..1f98b4b0a549 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3475,7 +3475,7 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>
> /*
> * By default timeout for non compute jobs is 10000
> - * and 60000 for compute jobs.
> + * and 120000 for compute jobs.
> * In SR-IOV or passthrough mode, timeout for compute
> * jobs are 60000 by default.
> */
> @@ -3485,7 +3485,7 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
> adev->compute_timeout = amdgpu_sriov_is_pp_one_vf(adev) ?
> msecs_to_jiffies(60000) : msecs_to_jiffies(10000);
> else
> - adev->compute_timeout = msecs_to_jiffies(60000);
> + adev->compute_timeout = msecs_to_jiffies(120000);
>
> if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH)) {
> while ((timeout_setting = strsep(&input, ",")) &&
> --
> 2.34.1
>