[PATCH v2] drm/amdgpu: Fix the dead lock issue.

Deng, Emily Emily.Deng at amd.com
Mon Sep 10 09:47:03 UTC 2018


>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken at gmail.com>
>Sent: Monday, September 10, 2018 5:41 PM
>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue.
>
>Am 10.09.2018 um 11:34 schrieb Emily Deng:
>> It will ramdomly have the dead lock issue when test TDR:
>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2.
>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow
>> is waiting for the shadow_list_lock 4.
>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv
>> lock.
>>
>> v2:
>>     Make a local copy of the list
>>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |  1 +
>>   2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index acfc63e..2b9f597 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3006,6 +3006,9 @@ static int
>amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>   	long r = 1;
>>   	int i = 0;
>>   	long tmo;
>> +	struct list_head local_shadow_list;
>> +
>> +	INIT_LIST_HEAD(&local_shadow_list);
>>
>>   	if (amdgpu_sriov_runtime(adev))
>>   		tmo = msecs_to_jiffies(8000);
>> @@ -3013,8 +3016,15 @@ static int
>amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
>>   		tmo = msecs_to_jiffies(100);
>>
>>   	DRM_INFO("recover vram bo from shadow start\n");
>> +
>>   	mutex_lock(&adev->shadow_list_lock);
>>   	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list)
>> {
>> +		amdgpu_bo_ref(bo);
>> +		list_add_tail(&bo->copy_shadow_list, &local_shadow_list);
>> +	}
>
>Please don't add an extra copy_shadow_list field to amdgpu_bo.
If don't use an extra variable, the local shadow list will change according the adev->shadow_list, 
both for adding or deleting, it is not we want.

>Instead just use bo->shadow list for this. When you hold a reference to the BO
>it should not be removed from the shadow list.
>
>Additional to that you can just use list_splice_init() to move the whole shadow
>list to your local list.
>
>Christian.
>
>> +	mutex_unlock(&adev->shadow_list_lock);
>> +
>> +	list_for_each_entry_safe(bo, tmp, &local_shadow_list,
>> +copy_shadow_list) {
>>   		next = NULL;
>>   		amdgpu_device_recover_vram_from_shadow(adev, ring, bo,
>&next);
>>   		if (fence) {
>> @@ -3033,8 +3043,8 @@ static int
>amdgpu_device_handle_vram_lost(struct
>> amdgpu_device *adev)
>>
>>   		dma_fence_put(fence);
>>   		fence = next;
>> +		amdgpu_bo_unref(&bo);
>>   	}
>> -	mutex_unlock(&adev->shadow_list_lock);
>>
>>   	if (fence) {
>>   		r = dma_fence_wait_timeout(fence, false, tmo); diff --git
>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>> index 907fdf4..cfee16c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
>> @@ -103,6 +103,7 @@ struct amdgpu_bo {
>>   		struct list_head	mn_list;
>>   		struct list_head	shadow_list;
>>   	};
>> +        struct list_head                copy_shadow_list;
>>
>>   	struct kgd_mem                  *kfd_bo;
>>   };



More information about the amd-gfx mailing list