[PATCH v2] drm/amdgpu: Fix the dead lock issue.

Deng, Emily Emily.Deng at amd.com
Mon Sep 10 10:33:28 UTC 2018


>-----Original Message-----
>From: Koenig, Christian
>Sent: Monday, September 10, 2018 6:02 PM
>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>
>Am 10.09.2018 um 11:55 schrieb Deng, Emily:
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>> Christian König
>>> Sent: Monday, September 10, 2018 5:49 PM
>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>>>
>>> Am 10.09.2018 um 11:47 schrieb Deng, Emily:
>>>>> -----Original Message-----
>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>> Sent: Monday, September 10, 2018 5:41 PM
>>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>>>>>
>>>>> Am 10.09.2018 um 11:34 schrieb Emily Deng:
>>>>>> It will ramdomly have the dead lock issue when test TDR:
>>>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.
>>>>>> amdgpu_bo_create locked the bo's resv lock 3.
>>>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4.
>>>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's
>>>>>> resv lock.
>>>>>>
>>>>>> v2:
>>>>>>       Make a local copy of the list
>>>>>>
>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++-
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>>>>>>     2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index acfc63e..2b9f597 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -3006,6 +3006,9 @@ static int
>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>>>>     	long r = 1;
>>>>>>     	int i = 0;
>>>>>>     	long tmo;
>>>>>> +	struct list_head local_shadow_list;
>>>>>> +
>>>>>> +	INIT_LIST_HEAD(&local_shadow_list);
>>>>>>
>>>>>>     	if (amdgpu_sriov_runtime(adev))
>>>>>>     		tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15
>@@ static
>>>>>> int
>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>>>>     		tmo = msecs_to_jiffies(100);
>>>>>>
>>>>>>     	DRM_INFO("recover vram bo from shadow start\n");
>>>>>> +
>>>>>>     	mutex_lock(&adev->shadow_list_lock);
>>>>>>     	list_for_each_entry_safe(bo, tmp, &adev->shadow_list,
>>>>>> shadow_list) {
>>>>>> +		amdgpu_bo_ref(bo);
>>>>>> +		list_add_tail(&bo->copy_shadow_list, &local_shadow_list);
>>>>>> +	}
>>>>> Please don't add an extra copy_shadow_list field to amdgpu_bo.
>>>> If don't use an extra variable, the local shadow list will change
>>>> according the adev->shadow_list, both for adding or deleting, it is
>>>> not we
>>> want.
>>>
>>> That is not correct, see amdgpu_bo_destroy:
>>>>          if (!list_empty(&bo->shadow_list)) {
>>>>                  mutex_lock(&adev->shadow_list_lock);
>>>>                  list_del_init(&bo->shadow_list);
>>>>                  mutex_unlock(&adev->shadow_list_lock);
>>>>          }
>>> The BO is only removed from the list when it is destroyed, since we
>>> grabbed a local reference it can't be destroyed. So we are safe here.
>> Sorry I am not meaning the delete, what about the adding.
>
>That will still go to adev->shadow_list and not affect our local list in any way.

>We are not interested in any newly allocated shadow BOs, so that should be
>unproblematic.
Understand, will send a patch later.

Best wishes
Emily Deng
>Regards,
>Christian.
>
>>> Regards,
>>> Christian.
>>>
>>>>> Instead just use bo->shadow list for this. When you hold a
>>>>> reference to the BO it should not be removed from the shadow list.
>>>>>
>>>>> Additional to that you can just use list_splice_init() to move the
>>>>> whole shadow list to your local list.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> +	mutex_unlock(&adev->shadow_list_lock);
>>>>>> +
>>>>>> +	list_for_each_entry_safe(bo, tmp, &local_shadow_list,
>>>>>> +copy_shadow_list) {
>>>>>>     		next = NULL;
>>>>>>     		amdgpu_device_recover_vram_from_shadow(adev,
>ring, bo,
>>>>> &next);
>>>>>>     		if (fence) {
>>>>>> @@ -3033,8 +3043,8 @@ static int
>>>>> amdgpu_device_handle_vram_lost(struct
>>>>>> amdgpu_device *adev)
>>>>>>
>>>>>>     		dma_fence_put(fence);
>>>>>>     		fence = next;
>>>>>> +		amdgpu_bo_unref(&bo);
>>>>>>     	}
>>>>>> -	mutex_unlock(&adev->shadow_list_lock);
>>>>>>
>>>>>>     	if (fence) {
>>>>>>     		r = dma_fence_wait_timeout(fence, false, tmo); diff --
>git
>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>>>>> index 907fdf4..cfee16c 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>>>>> @@ -103,6 +103,7 @@ struct amdgpu_bo {
>>>>>>     		struct list_head	mn_list;
>>>>>>     		struct list_head	shadow_list;
>>>>>>     	};
>>>>>> +        struct list_head                copy_shadow_list;
>>>>>>
>>>>>>     	struct kgd_mem                  *kfd_bo;
>>>>>>     };
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



More information about the amd-gfx mailing list