[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()

Samuel Pitoiset samuel.pitoiset at gmail.com
Mon Feb 13 17:49:23 UTC 2017



On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>
>> In debug build, the kernel reported "possible recursive locking
>> detected" in this codepath. For debugging purposes, I also added
>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>> mutex was locked as expected.
>>
>> This should fix (random) GPU hangs. The easy way to reproduce the
>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>> Hitman. It will create a huge buffer, evict a bunch of buffers
>> (around ~5k) and deadlock.
>>
>> This regression has been introduced pretty recently.
>>
>> v2: only release the mutex if resv is NULL
>>
>> Fixes: 12a852219583 ("drm/amdgpu: improve
>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index d1ef1d064de4..556236a112c1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>> amdgpu_device *adev,
>>              &bo->placement, page_align, !kernel, NULL,
>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>              &amdgpu_ttm_bo_destroy);
>> -    if (unlikely(r != 0))
>> +    if (unlikely(r != 0)) {
>> +        if (!resv)
>> +            ww_mutex_unlock(&bo->tbo.resv->lock);
>>          return r;
>> +    }
>
> I was looking at this myself a couple of weeks back, and I'm pretty sure
> I had this exact same patch just to realize that it's actually incorrect.
>
> The problem is that ttm_bo_init will actually call the destroy function
> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been freed.
>
> This code is a huge mess. I'm surprised though: have you verified that
> this patch actually fixes a hang?

Yes, I triple-checked. I can't reproduce the hangs with Hitman.

This fixes a deadlock, here's the report:
https://hastebin.com/durodivoma.xml

The resv->lock has to be unlocked when ttm_bo_init() fails (I checked 
with a WARN_ON(is_locked)) because it doesn't call the destroy function 
in all situations. Presumably, when drm_vma_offset_add() fails and resv 
is not NULL, the mutex is not unlocked.

>
> Cheers,
> Nicolai
>
>
>>
>>      bo->tbo.priority = ilog2(bo->tbo.num_pages);
>>      if (kernel)
>>
>


More information about the amd-gfx mailing list