[PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

Liu, Monk Monk.Liu at amd.com
Mon Dec 18 07:54:39 UTC 2017


lockup_timeout = 0 doesn't indicate that GPU reset isn't ready; kernel/amdgpu never tells you that. Instead it means there is no timeout for jobs,
so there is no warning and no GPU recovery triggered by a timeout event. But that doesn't mean GPU recovery cannot be triggered; e.g. for SRIOV we can
trigger GPU recovery from the hypervisor.

Your patch shouldn't and cannot break existing logic; that's a very simple rule ...
If you insist on your change, at least make sure it doesn't change any SRIOV logic. That's not hard for you: just add an "if (!amdgpu_sriov_vf(adev))" check
ahead of your patch, although I don't encourage such ugly workarounds...
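
A minimal sketch of the check suggested above, written as a stand-alone userspace model rather than real driver code. Only the names lockup_timeout and amdgpu_sriov_vf come from this thread; struct mock_adev, its field, and skip_gpu_recover() are hypothetical stand-ins, and amdgpu_sriov_vf() is re-implemented here as a plain function purely for illustration:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's device structure. */
struct mock_adev {
    bool is_sriov_vf;   /* assumed flag: running as an SR-IOV virtual function */
};

/* Models the amdgpu.lockup_timeout module parameter (0 = no job timeout). */
static int lockup_timeout = 0;

/* Illustrative re-implementation; not the driver's own definition. */
static bool amdgpu_sriov_vf(const struct mock_adev *adev)
{
    return adev->is_sriov_vf;
}

/*
 * Hypothetical helper: returns true when GPU recovery would be skipped.
 * lockup_timeout == 0 only disables recovery on bare metal; an SR-IOV
 * VF still recovers, because the hypervisor can request it.
 */
static bool skip_gpu_recover(const struct mock_adev *adev)
{
    if (!amdgpu_sriov_vf(adev) && lockup_timeout == 0)
        return true;

    return false;
}
```

With this ordering the SRIOV path never looks at lockup_timeout, so the existing hypervisor-triggered recovery would be unaffected by the new early return.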


-----Original Message-----
From: Marek Olšák [mailto:maraeo at gmail.com] 
Sent: Tuesday, December 12, 2017 11:02 PM
To: Liu, Monk <Monk.Liu at amd.com>
Cc: amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

On Tue, Dec 12, 2017 at 4:18 AM, Liu, Monk <Monk.Liu at amd.com> wrote:
> NAK, your change breaks SRIOV logic:
>
> Without lockup_timeout set, this gpu_recover() won't get called at all,
> unless your IB triggered an invalid instruction and that IRQ invoked
> amdgpu_gpu_recover(). In that case you should disable the logic in
> that IRQ instead of changing gpu_recover() itself, because for SRIOV
> we need gpu_recover() even when lockup_timeout is zero

The default value of 0 indicates that GPU reset isn't ready to be enabled by default. That's what it means. Once the GPU reset works, the default should be non-zero (e.g. 10000) and
amdgpu.lockup_timeout=0 should be used to disable all GPU resets in order to be able to do scandumps and debug GPU hangs.

Marek

