[PATCH] drm/amdgpu: extend the default timeout for kernel compute queues

Xu, Feifei Feifei.Xu at amd.com
Fri Apr 21 09:28:31 UTC 2023


Some Vulkan stress tests may not be possible to rewrite using ROCm.
On second thought, 120s may be too risky, because the softlockup timeout is also set to 120s.

To support stress tests such as the Vulkan stress test I recently saw on stressbench, we could shorten the 120s to a reasonable value such as 100s, which would also fix the software hang.

-----Original Message-----
From: Alex Deucher <alexdeucher at gmail.com> 
Sent: Thursday, April 20, 2023 8:57 PM
To: Xu, Feifei <Feifei.Xu at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: Re: [PATCH] drm/amdgpu: extend the default timeout for kernel compute queues

On Thu, Apr 20, 2023 at 5:19 AM Feifei Xu <Feifei.Xu at amd.com> wrote:
>
> Extend to 120s. The default timeout value should also extend if 
> compute shader execution time extended. Otherwise some stress test 
> will trigger compute ring timeout in software.

I think that's probably too long.  2 minutes is a long time to have a hung system.  I think we should rework the tests or use ROCm for long running test cases.

Alex

>
> Signed-off-by: Feifei Xu <Feifei.Xu at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e536886f6d42..1f98b4b0a549 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3475,7 +3475,7 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>
>         /*
>          * By default timeout for non compute jobs is 10000
> -        * and 60000 for compute jobs.
> +        * and 120000 for compute jobs.
>          * In SR-IOV or passthrough mode, timeout for compute
>          * jobs are 60000 by default.
>          */
> @@ -3485,7 +3485,7 @@ static int amdgpu_device_get_job_timeout_settings(struct amdgpu_device *adev)
>                 adev->compute_timeout = amdgpu_sriov_is_pp_one_vf(adev) ?
>                                         msecs_to_jiffies(60000) : msecs_to_jiffies(10000);
>         else
> -               adev->compute_timeout =  msecs_to_jiffies(60000);
> +               adev->compute_timeout =  msecs_to_jiffies(120000);
>
>         if (strnlen(input, AMDGPU_MAX_TIMEOUT_PARAM_LENGTH)) {
>                 while ((timeout_setting = strsep(&input, ",")) &&
> --
> 2.34.1
>
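
For reference, the default selection in the quoted hunk can be sketched as a small userspace C fragment. This is only an illustration under stated assumptions, not the driver code: struct timeout_env, default_compute_timeout_ms() and below_watchdog() are hypothetical names, the "virtualized" condition is inferred from the code comment in the hunk, and the watchdog threshold is passed in as a parameter rather than read from the kernel.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the driver state consulted by
 * amdgpu_device_get_job_timeout_settings(); not a real amdgpu type.
 */
struct timeout_env {
	bool virtualized;	/* SR-IOV or passthrough, per the hunk's comment */
	bool pp_one_vf;		/* mirrors amdgpu_sriov_is_pp_one_vf() */
};

/*
 * Default compute timeout in milliseconds. bare_metal_ms is the value
 * under discussion in this thread: 60000 today, 120000 in the patch,
 * 100000 as suggested in the reply.
 */
static unsigned int default_compute_timeout_ms(const struct timeout_env *env,
						unsigned int bare_metal_ms)
{
	if (env->virtualized)
		return env->pp_one_vf ? 60000u : 10000u;
	return bare_metal_ms;
}

/*
 * True if the job timeout expires strictly before the watchdog
 * threshold (both in milliseconds), so the GPU timeout handler runs first.
 */
static bool below_watchdog(unsigned int timeout_ms, unsigned int watchdog_ms)
{
	return timeout_ms < watchdog_ms;
}
```

With a 120s watchdog threshold, below_watchdog() holds for a 100000 ms default but not for 120000 ms, which is the concern raised at the top of this reply.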
