[PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
StDenis, Tom
Tom.StDenis at amd.com
Mon Jan 7 16:53:41 UTC 2019
On 2019-01-07 11:51 a.m., Grodzovsky, Andrey wrote:
>
>
> On 01/07/2019 11:36 AM, StDenis, Tom wrote:
>> On 2019-01-07 11:33 a.m., Grodzovsky, Andrey wrote:
>>>
>>> On 01/07/2019 11:16 AM, Liu, Shaoyun wrote:
>>>> I think it's reasonable to use the hive specific lock for hive specific functions.
>>>> The changes is acked-by Shaoyun.liu < Shaoyun.liu at amd.com>
>>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of StDenis, Tom
>>>> Sent: Monday, January 7, 2019 10:16 AM
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Cc: StDenis, Tom <Tom.StDenis at amd.com>
>>>> Subject: [PATCH] add missing mutex lock to amdgpu_get_xgmi_hive() (v2)
>>>>
>>>> v2: Move locks around in other functions so that this function can stand on its own. Also only hold the hive specific lock for add/remove device instead of the driver global lock so you can't add/remove devices in parallel from one hive.
>>>>
>>>> Signed-off-by: Tom St Denis <tom.stdenis at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 36 ++++++++++++++--------
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 2 +-
>>>> 3 files changed, 25 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 39d5d058b2c7..13d8e2ad2f7a 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -3525,7 +3525,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>> * by different nodes. No point also since the one node already executing
>>>> * reset will also reset all the other nodes in the hive.
>>>> */
>>>> - hive = amdgpu_get_xgmi_hive(adev);
>>>> + hive = amdgpu_get_xgmi_hive(adev, 0);
>>>> if (hive && adev->gmc.xgmi.num_physical_nodes > 1 &&
>>>> !mutex_trylock(&hive->hive_lock))
>>>> return 0;
>>> Let's say i have device 0 in hive A and it just got a gpu reset and at
>>> the same time device 1 is being added to same have though
>>> amdgpu_xgmi_add_device, hive->hive_lock is acquired by this new device
>>> being added and so gpu reset for device 0 bails out on
>>> '!mutex_trylock(&hive->hive_lock))' without completing the reset.
>>> Also in general i feel a bit uncomfortable about the confusing
>>> acquisition scheme in the function and the fact that you take the
>>> hive->hive_lock inside amdgpu_get_xgmi_hive but release is still outside
>>> of the function.
>> Is adding a device while resetting a device even a valid operation
>> anyways?
>
> In theory it's valid if you have hot pluggable devices
>>
>> I think this means more so that the reset logic is broken. Instead
>> there should be a per-hive reset lock that is taken and that is tested
>> instead.
>>
>> Tom
>
> The hive->hive_lock was added exactly for this purpose and used only for
> that purpose. Maybe the naming i gave it wasn't reflective of it's
> purpose :)
But the add/remove should use per-hive locks not the global lock... :-)
(I'm honestly not trying to bike shed I just thought the get_hive
function looked wrong :-)).
Tom
More information about the amd-gfx
mailing list