[PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

Marek Olšák maraeo at gmail.com
Tue Dec 12 19:57:41 UTC 2017

On Tue, Dec 12, 2017 at 5:36 PM, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
> Am 12.12.2017 um 15:57 schrieb Marek Olšák:
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>> Am 11.12.2017 um 22:29 schrieb Marek Olšák:
>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>> ---
>>>> Is this really correct? I have no easy way to test it.
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only print out an error and not touch the GPU at all
>>> (on bare metal systems).
>>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>>> default.
>>> Andrey could you take care of this when you have time?
>> I don't understand this.
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
> lockup_timeout=0 never disabled the GPU reset; it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

> You could still manually trigger a reset, and invalid commands, invalid
> register writes, and requests from the SRIOV hypervisor could also trigger one.

That's OK. Manual resets should always be allowed.

> And as Monk explained GPU resets are mandatory for SRIOV, you can't disable
> them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway?

Let's just say lockup_timeout=0 has undefined behavior with SRIOV.

> In addition to that, we probably want the error message that something timed
> out, but without touching the hardware in any way.

Yes that is a fair point.
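A minimal C sketch of the policy this thread converges on, assuming hypothetical names (`lockup_timeout_ms`, `gpu_recovery`, `job_timed_out`, `handle_hang`) that are illustrative stand-ins and not actual amdgpu driver code. The timeout only decides when a job counts as hung. A separate recovery switch decides whether a hang on bare metal is merely reported or actually resets the hardware. Manual resets and SRIOV resets are exempt from that switch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the module parameters discussed above.
 * lockup_timeout_ms: 0 means "never declare a job timed out".
 * gpu_recovery: the proposed switch; false means "report, don't reset". */
static unsigned int lockup_timeout_ms = 10000;
static bool gpu_recovery = false;

enum hang_source { HANG_TIMEOUT, HANG_MANUAL, HANG_SRIOV_REQUEST };

/* The timeout only decides whether a job counts as hung.
 * Setting it to 0 suppresses timeout detection, but says nothing
 * about other reset triggers. */
static bool job_timed_out(unsigned int elapsed_ms)
{
    return lockup_timeout_ms != 0 && elapsed_ms >= lockup_timeout_ms;
}

/* Returns true if the (hypothetical) driver would reset the hardware. */
static bool handle_hang(enum hang_source src, bool is_sriov)
{
    /* Manual resets are always allowed; under SRIOV, resets are mandatory. */
    if (src == HANG_MANUAL || is_sriov)
        return true;

    if (!gpu_recovery) {
        /* Report the hang, but leave the hardware alone. */
        fprintf(stderr, "GPU hang detected; recovery disabled, hardware left untouched\n");
        return false;
    }
    return true;
}
```

With this split, a non-zero timeout can be the default while `gpu_recovery = false` still yields an error message without touching the GPU on bare metal.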


More information about the amd-gfx mailing list