[PATCH v2 1/2] drm/amdgpu: fix a potential deadlock in amdgpu_bo_create_restricted()

Mon Feb 13 18:41:46 UTC 2017

Am 13.02.2017 um 19:32 schrieb Samuel Pitoiset:
>
>
> On 02/13/2017 07:19 PM, Nicolai Hähnle wrote:
>> On 13.02.2017 19:11, Samuel Pitoiset wrote:
>>>
>>>
>>> On 02/13/2017 07:04 PM, Nicolai Hähnle wrote:
>>>> On 13.02.2017 18:49, Samuel Pitoiset wrote:
>>>>>
>>>>>
>>>>> On 02/13/2017 05:25 PM, Nicolai Hähnle wrote:
>>>>>> On 09.02.2017 11:33, Samuel Pitoiset wrote:
>>>>>>> When ttm_bo_init() fails, the reservation mutex should be unlocked.
>>>>>>>
>>>>>>> In debug build, the kernel reported "possible recursive locking
>>>>>>> detected" in this codepath. For debugging purposes, I also added
>>>>>>> a "WARN_ON(ww_mutex_is_locked())" when ttm_bo_init() fails and the
>>>>>>> mutex was locked as expected.
>>>>>>>
>>>>>>> This should fix (random) GPU hangs. The easy way to reproduce the
>>>>>>> issue is to change the "Super Sampling" option from 1.0 to 2.0 in
>>>>>>> Hitman. It will create a huge buffer, evict a bunch of buffers
>>>>>>> (around ~5k) and deadlock.
>>>>>>>
>>>>>>> This regression has been introduced pretty recently.
>>>>>>>
>>>>>>> v2: only release the mutex if resv is NULL
>>>>>>>
>>>>>>> Fixes: 12a852219583 ("drm/amdgpu: improve
>>>>>>> AMDGPU_GEM_CREATE_VRAM_CLEARED handling (v2)")
>>>>>>> Signed-off-by: Samuel Pitoiset <samuel.pitoiset at gmail.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 5 ++++-
>>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> index d1ef1d064de4..556236a112c1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>>>> @@ -403,8 +403,11 @@ int amdgpu_bo_create_restricted(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>              &bo->placement, page_align, !kernel, NULL,
>>>>>>>              acc_size, sg, resv ? resv : &bo->tbo.ttm_resv,
>>>>>>>              &amdgpu_ttm_bo_destroy);
>>>>>>> -    if (unlikely(r != 0))
>>>>>>> +    if (unlikely(r != 0)) {
>>>>>>> +        if (!resv)
>>>>>>> + ww_mutex_unlock(&bo->tbo.resv->lock);
>>>>>>>          return r;
>>>>>>> +    }
>>>>>>
>>>>>> I was looking at this myself a couple of weeks back, and I'm pretty
>>>>>> sure
>>>>>> I had this exact same patch just to realize that it's actually
>>>>>> incorrect.
>>>>>>
>>>>>> The problem is that ttm_bo_init will actually call the destroy
>>>>>> function
>>>>>> (in our case, amdgpu_ttm_bo_destroy), so at this point, bo has been
>>>>>> freed.
>>>>>>
>>>>>> This code is a huge mess. I'm surprised though: have you verified 
>>>>>> that
>>>>>> this patch actually fixes a hang?
>>>>>
>>>>> Yes, I triple-checked. I can't reproduce the hangs with Hitman.
>>>>
>>>> That's surprising, but a relief. Maybe it ties into some of the other
>>>> problems I'm seeing as well.
>>>>
>>>> This means we need a real fix for this; I still think the current 
>>>> patch
>>>> is broken.
>>>
>>> Maybe the issue is somewhere else and this not the proper solution, but
>>> I don't think the given patch is broken as-is. It fixes deadlocks which
>>> are pretty easy to reproduce with Hitman (as explained in the commit
>>> description).
>>
>> I'm sorry, but a use-after-free is clearly broken.
>
> You are right. If the destroy callback is called, there is a 
> use-after-free which is bad, really..

bad. Calling the destroy callback when something goes wrong sound fishy 
to me in the first place when the structure initialized here is 
allocated by the caller.

Probably best to clean that up from the beginning.

Regards,
Christian.

>
>>
>> Nicolai
>>
>>>
>>>>
>>>>
>>>>> This fixes a deadlock, here's the report:
>>>>> https://hastebin.com/durodivoma.xml
>>>>>
>>>>> The resv->lock has to be unlocked when ttm_bo_init() fails (I checked
>>>>> with a WARN_ON(is_locked)) because it doesn't call the destroy 
>>>>> function
>>>>> in all situations. Presumably, when drm_vma_offset_add() fails and 
>>>>> resv
>>>>> is not NULL, the mutex is not unlocked.
>>>>
>>>> On which code path is the destroy function not called? If that is the
>>>> case, we're leaking memory.
>>>>
>>>> With the patch as-is, the error paths are either leaking memory (if
>>>> you're right) or accessing memory after it's freed (otherwise).
>>>> Obviously, neither is good.
>>>
>>> No, I was wrong. resv is always NULL in this situation. The best
>>> solution is probably to try to clean up that code path because I do
>>> agree: it's a bit messy.
>>>
>>>>
>>>> Nicolai
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx