[PATCH] drm/amdgpu: Fix the dead lock issue.
zhoucm1
zhoucm1 at amd.com
Tue Sep 11 05:41:56 UTC 2018
On 2018年09月11日 11:37, zhoucm1 wrote:
>
>
> On 2018年09月11日 11:32, Deng, Emily wrote:
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>> zhoucm1
>>> Sent: Tuesday, September 11, 2018 11:28 AM
>>> To: Deng, Emily <Emily.Deng at amd.com>; Zhou, David(ChunMing)
>>> <David1.Zhou at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.
>>>
>>>
>>>
>>> On 2018年09月11日 11:23, Deng, Emily wrote:
>>>>> -----Original Message-----
>>>>> From: Zhou, David(ChunMing)
>>>>> Sent: Tuesday, September 11, 2018 11:03 AM
>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.
>>>>>
>>>>>
>>>>>
>>>>> On 2018年09月11日 10:51, Emily Deng wrote:
>>>>>> It will ramdomly have the dead lock issue when test TDR:
>>>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.
>>>>>> amdgpu_bo_create locked the bo's resv lock 3.
>>>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4.
>>>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
>>>>>> lock.
>>>>>>
>>>>>> v2:
>>>>>> Make a local copy of the list
>>>>>>
>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21
>>>>> ++++++++++++++++++++-
>>>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 2a21267..8c81404 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -3105,6 +3105,9 @@ static int
>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>>>> long r = 1;
>>>>>> int i = 0;
>>>>>> long tmo;
>>>>>> + struct list_head local_shadow_list;
>>>>>> +
>>>>>> + INIT_LIST_HEAD(&local_shadow_list);
>>>>>>
>>>>>> if (amdgpu_sriov_runtime(adev))
>>>>>> tmo = msecs_to_jiffies(8000);
>>>>>> @@ -3112,8 +3115,19 @@ static int
>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>>>> tmo = msecs_to_jiffies(100);
>>>>>>
>>>>>> DRM_INFO("recover vram bo from shadow start\n");
>>>>>> +
>>>>>> + mutex_lock(&adev->shadow_list_lock);
>>>>>> + list_splice_init(&adev->shadow_list, &local_shadow_list);
>>>>>> + mutex_unlock(&adev->shadow_list_lock);
>>>>>> +
>>>>>> +
>>>>>> mutex_lock(&adev->shadow_list_lock);
>>>>> local_shadow_list is a local variable, I think it doesn't need lock
>>>>> at all, no one change it. Otherwise looks good to me.
>>>> The bo->shadow_list which now is in local_shadow_list maybe destroy in
>>>> case that it already in amdgpu_bo_destroy, then it will change
>>> local_shadow_list, so need lock the shadow_list_lock.
>>> Ah, sorry for noise, I forget you don't reference these BOs.
>> Yes, I don't reference these Bos, as I found even reference these
>> Bos, it still couldn't avoid the case that another process is already
>> in amdgpu_bo_destroy.
> ??? that shouldn't happen, the reference is belonged to list. But back
> to here, we don't need reference them.
> And since no shadow BO is added to local after splice, we'd better to
> use list_next_entry to iterate the local shadow list instead of
> list_for_each_entry_safe.
>
> Thanks,
> David Zhou
>>> Thanks,
>>> David Zhou
>>>> Best wishes
>>>> Emily Deng
>>>>> Thanks,
>>>>> David Zhou
>>>>>> - list_for_each_entry_safe(bo, tmp, &adev->shadow_list,
>>>>>> shadow_list) {
>>>>>> + list_for_each_entry_safe(bo, tmp, &local_shadow_list,
>>>>>> shadow_list) {
because shadow list doesn't take bo reference, we should give a
amdgpu_bo_ref(bo) with attached patch before unlock.
You can have a try.
Thanks,
David Zhou
>>>>>> + mutex_unlock(&adev->shadow_list_lock);
>>>>>> +
>>>>>> + if (!bo)
>>>>>> + continue;
>>>>>> +
>>>>>> next = NULL;
>>>>>> amdgpu_device_recover_vram_from_shadow(adev, ring, bo,
>>>>> &next);
>>>>>> if (fence) {
>>>>>> @@ -3132,9 +3146,14 @@ static int
>>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>>>>
>>>>>> dma_fence_put(fence);
>>>>>> fence = next;
>>>>>> + mutex_lock(&adev->shadow_list_lock);
>>>>>> }
>>>>>> mutex_unlock(&adev->shadow_list_lock);
>>>>>>
>>>>>> + mutex_lock(&adev->shadow_list_lock);
>>>>>> + list_splice_init(&local_shadow_list, &adev->shadow_list);
>>>>>> + mutex_unlock(&adev->shadow_list_lock);
>>>>>> +
>>>>>> if (fence) {
>>>>>> r = dma_fence_wait_timeout(fence, false, tmo);
>>>>>> if (r == 0)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-amdgpu-changing-of-shadow-bo-reference-should-ta.patch
Type: text/x-patch
Size: 1652 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20180911/5ab9b3e5/attachment.bin>
More information about the amd-gfx
mailing list