[PATCH v2] drm/amdgpu: Fix the dead lock issue.

Christian König christian.koenig at amd.com
Mon Sep 10 09:49:25 UTC 2018


Am 10.09.2018 um 11:47 schrieb Deng, Emily:
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Monday, September 10, 2018 5:41 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>>
>> Am 10.09.2018 um 11:34 schrieb Emily Deng:
>>> It will ramdomly have the dead lock issue when test TDR:
>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.
>>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow
>>> is waiting for the shadow_list_lock 4.
>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
>>> lock.
>>>
>>> v2:
>>>      Make a local copy of the list
>>>
>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>>>    2 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index acfc63e..2b9f597 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3006,6 +3006,9 @@ static int
>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>    	long r = 1;
>>>    	int i = 0;
>>>    	long tmo;
>>> +	struct list_head local_shadow_list;
>>> +
>>> +	INIT_LIST_HEAD(&local_shadow_list);
>>>
>>>    	if (amdgpu_sriov_runtime(adev))
>>>    		tmo = msecs_to_jiffies(8000);
>>> @@ -3013,8 +3016,15 @@ static int
>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>>    		tmo = msecs_to_jiffies(100);
>>>
>>>    	DRM_INFO("recover vram bo from shadow start\n");
>>> +
>>>    	mutex_lock(&adev->shadow_list_lock);
>>>    	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list)
>>> {
>>> +		amdgpu_bo_ref(bo);
>>> +		list_add_tail(&bo->copy_shadow_list, &local_shadow_list);
>>> +	}
>> Please don't add an extra copy_shadow_list field to amdgpu_bo.
> If don't use an extra variable, the local shadow list will change according the adev->shadow_list,
> both for adding or deleting, it is not we want.

That is not correct, see amdgpu_bo_destroy:
>         if (!list_empty(&bo->shadow_list)) {
>                 mutex_lock(&adev->shadow_list_lock);
>                 list_del_init(&bo->shadow_list);
>                 mutex_unlock(&adev->shadow_list_lock);
>         }

The BO is only removed from the list when it is destroyed, since we 
grabbed a local reference it can't be destroyed. So we are safe here.

Regards,
Christian.

>
>> Instead just use bo->shadow list for this. When you hold a reference to the BO
>> it should not be removed from the shadow list.
>>
>> Additional to that you can just use list_splice_init() to move the whole shadow
>> list to your local list.
>>
>> Christian.
>>
>>> +	mutex_unlock(&adev->shadow_list_lock);
>>> +
>>> +	list_for_each_entry_safe(bo, tmp, &local_shadow_list,
>>> +copy_shadow_list) {
>>>    		next = NULL;
>>>    		amdgpu_device_recover_vram_from_shadow(adev, ring, bo,
>> &next);
>>>    		if (fence) {
>>> @@ -3033,8 +3043,8 @@ static int
>> amdgpu_device_handle_vram_lost(struct
>>> amdgpu_device *adev)
>>>
>>>    		dma_fence_put(fence);
>>>    		fence = next;
>>> +		amdgpu_bo_unref(&bo);
>>>    	}
>>> -	mutex_unlock(&adev->shadow_list_lock);
>>>
>>>    	if (fence) {
>>>    		r = dma_fence_wait_timeout(fence, false, tmo); diff --git
>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>> index 907fdf4..cfee16c 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>>> @@ -103,6 +103,7 @@ struct amdgpu_bo {
>>>    		struct list_head	mn_list;
>>>    		struct list_head	shadow_list;
>>>    	};
>>> +        struct list_head                copy_shadow_list;
>>>
>>>    	struct kgd_mem                  *kfd_bo;
>>>    };



More information about the amd-gfx mailing list