[PATCH 4/4] drm/amdgpu: prevent gpu access during reset recovery

Fri May 24 07:49:01 UTC 2024

Am 23.05.24 um 17:35 schrieb Li, Yunxiang (Teddy):
> [Public]
>
>>> Here is taking a different lock than the reset_domain->sem. It is a seperate reset_domain->gpu_sem that is only locked when we will actuall do reset, it is not taken in the skip_hw_reset path.
>> Exactly that is what you should *not* do. Please don't add any new lock to the code. This is already complicated enough.
>>
>> If you think that adding wrappers for reset lock makes sense then we can probably do that, bot don't add any lock for hw access.
> The two lock protects very different things though. The first case is we need to block two resets running in parallel,

No, that's not correct. Two parallel resets are prevent by using a 
worker queue for the resets.

The reset lock is here exactly to provide external thread the 
opportunity to make their operation mutual exclusive with the reset.

>   this does not only cover GPU reset itself but also any teardown that happens before GPU reset. The second case is we need to ensure exclusive access to the GPU between GPU reset and GPU init, concurrent access is fine before GPU is reset.

If that is true you could in theory lower the locked area of the 
existing lock, but adding a new one is strict no-go from my side.

Regards,
Christian.

>
> Theoretically, the second case happens within the first case, so locking the first case would protect against both. But with the current implementation this is infeasible, all the generic functions called between amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special versions so the reset thread does not dead lock itself. It is much simpler to have a second, much narrower lock that only covers GPU reset<->GPU init because all the accesses there are very low level anyway.
>
> Teddy