[PATCH] amdgpu: disable GPU reset if amdgpu.lockup_timeout=0

Marek Olšák maraeo at gmail.com
Tue Dec 12 19:57:41 UTC 2017


On Tue, Dec 12, 2017 at 5:36 PM, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
> On 12.12.2017 at 15:57, Marek Olšák wrote:
>>
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>
>>> On 11.12.2017 at 22:29, Marek Olšák wrote:
>>>>
>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>
>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>> ---
>>>>
>>>> Is this really correct? I have no easy way to test it.
>>>
>>>
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>>
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>>
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only print an error and not touch the GPU at all
>>> (on bare metal systems).
>>>
>>> Then we finally set amdgpu_lockup_timeout to a non-zero value by
>>> default.
>>>
>>> Andrey could you take care of this when you have time?
>>
>> I don't understand this.
>>
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
>
>
> lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

>
> You could still trigger a reset manually, and invalid commands, invalid
> register writes and requests from the SRIOV hypervisor could trigger one
> as well.

That's OK. Manual resets should always be allowed.

>
> And as Monk explained, GPU resets are mandatory for SRIOV; you can't
> disable them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway?

Let's just say lockup_timeout=0 has undefined behavior with SRIOV.

>
> In addition to that, we probably want the error message that something
> timed out, but without touching the hardware in any way.

Yes, that is a fair point.

Marek
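
For reference, the policy Christian proposes above can be modelled in a few
lines of user-space C. This is only a sketch under stated assumptions, not
driver code: struct gpu_params, enum timeout_action and
decide_timeout_action() are hypothetical names invented for illustration,
and treating lockup_timeout=0 under SRIOV the same as on bare metal is an
assumption, since the thread leaves that case undefined.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the module parameters discussed in the thread. */
struct gpu_params {
    int  lockup_timeout_ms; /* 0 disables the job timeout entirely */
    bool gpu_recovery;      /* proposed switch: false means report only */
    bool sriov;             /* resets are mandatory under SRIOV */
};

/* What the (hypothetical) timeout path would do for a hung job. */
enum timeout_action {
    ACTION_NONE,        /* no timeout armed, so nothing is detected */
    ACTION_REPORT_ONLY, /* print an error, do not touch the hardware */
    ACTION_RESET        /* print an error and reset the GPU */
};

static enum timeout_action decide_timeout_action(const struct gpu_params *p)
{
    /* Assumption: with the timeout disabled no hang is ever reported.
     * The thread calls this combination undefined for SRIOV. */
    if (p->lockup_timeout_ms == 0)
        return ACTION_NONE;

    /* SRIOV cannot opt out of resets, whatever gpu_recovery says. */
    if (p->sriov || p->gpu_recovery)
        return ACTION_RESET;

    /* Bare metal with recovery disabled: keep the error message only. */
    return ACTION_REPORT_ONLY;
}
```

With a non-zero timeout and recovery disabled on bare metal, the model
returns ACTION_REPORT_ONLY, which is the "error message but no hardware
access" behavior both sides agree on. Manual resets are outside this model
and would remain allowed regardless of these settings.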

