[PATCH] dmr/amdgpu: Fix wrongly unref of BO

Christian König deathsimple at vodafone.de
Wed Apr 19 11:50:36 UTC 2017


> Without correctly kunmap, page table is corrupted. Page entries point 
> to wrong memory locations. You might call it completely harmless. But 
> I think this is a severe bug. Leaking memory is better than a 
> corrupted page table. Think security.
We are talking about the page tables for the vmalloc area in the kernel 
here, so no security problem. Leaking memory is much more problematic.

> Would you provide any document and reference by saying" It is 
> impossible to receive a signal during module load/unload"? For 
> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>
No, CTRL+C doesn't abort module load/unload. There have been patches to 
changes this a while ago, but IIRC it broke a whole bunch of driver 
relying on this.

> What about there is some other return error? What about in future 
> somebody improve amdgpu_bo_reserve to return other errors, 
> then function amdgpu_vram_scratch_fini becomes buggy?
>
Yes, that is indeed an issue. For example -EDEADLK is possible as well. 
That's why I said we should use amdgpu_bo_free_kernel() instead.

> While I am thinking whether there is a better way for the current 
> situation, would you give a real world example that my patch really 
> not working? Then we can address it.
>
I don't think there is because the driver can't receive a signal during 
load/unload, but the problem is rather that the patch doesn't improve 
the situation at all.

Regards,
Christian.

Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> Without correctly kunmap, page table is corrupted. Page entries point 
> to wrong memory locations. You might call it completely harmless. But 
> I think this is a severe bug. Leaking memory is better than a 
> corrupted page table. Think security.
>
>
> Would you provide any document and reference by saying" It is 
> impossible to receive a signal during module load/unload"? For 
> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>
>
> If "It is impossible to receive a signal during module load/unload", 
> interruptible waiting is fine too, because function amdgpu_bo_reserve 
> will return successfully.
>
>
> What about there is some other return error? What about in future 
> somebody improve amdgpu_bo_reserve to return other errors, 
> then function amdgpu_vram_scratch_fini becomes buggy?
>
>
> While I am thinking whether there is a better way for the current 
> situation, would you give a real world example that my patch really 
> not working? Then we can address it.
>
>
> Thanks,
>
> Alex Bin
>
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple at vodafone.de>
> *Sent:* Wednesday, April 19, 2017 2:35 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
> Hi AlexBin,
>
> the answer is ttm_bo_kunmap isn't called at all and yes in the case of 
> an iomap we leak the address space reserved.
>
> But that is completely harmless on a 64bit system compared to leaking 
> the memory backing the address space.
>
> Using amdgpu_bo_free_kernel() instead of openly coding it here is 
> probably a good idea.
>
> Additional to that it's probably a good idea to set the no_intr flag 
> when reserving kernel BOs. It is impossible to receive a signal during 
> module load/unload, but it's probably better to document that in the 
> code as well.
>
> Regards,
> Christian.
>
> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>> Hi Christian,
>>
>> Have you found how/where/when? When you said "mapping will just be 
>> released a bit later on", you must know the answer.
>>
>> It is difficult to prove something does not exist. Anyway, I will 
>> give it a try to prove such "later on" does not exist.
>>
>> Function ttm_bo_kunmap is the only function to unmap. To prove this, 
>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to 
>> correctly kunmap.
>>
>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
>> that all TTM delay delete mechanism or unref mechanism will NOT 
>> kunmap BO later on.
>>
>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap 
>> and amdgpu_gem_prime_vunmap.
>>
>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer is 
>> still NO.
>>
>> So all possibilities are searched. Did I miss anything?
>>
>> Thanks,
>> Alex Bin Xie
>>
>> ------------------------------------------------------------------------
>> *From:* Xie, AlexBin
>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>> *To:* Christian König; Zhou, David(ChunMing); 
>> amd-gfx at lists.freedesktop.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>
>> Hi Christian,
>>
>>
>> Would you point out where/when will kunmap happen for this BO when 
>> release? It must be somewhere in some function calls.
>>
>>
>> I checked before I asked for review. But I did not see such obvious 
>> kunmap function call.
>>
>>
>> If so, there should be a comment in function amdgpu_vram_scratch_fini 
>> to avoid future confusion.
>>
>>
>> Thanks,
>> Alex Bin Xie
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple at vodafone.de>
>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Hi AlexBin,
>>
>> No, David is right. This is a very common coding pattern in the 
>> kernel module.
>>
>> Freeing up a BO while there still exists a kernel mapping is 
>> perfectly ok, the mapping will just be released a bit later on.
>>
>> So this code is actually perfectly ok and just an optimization, but 
>> your patch breaks it and creates a memory leak.
>>
>> Regards,
>> Christian.
>>
>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>
>>> Hi David,
>>>
>>>
>>> When amdgpu_bo_reserve return errors, we cannot release the BO. This 
>>> is not a memory leak. General speaking, memory leak is unnoticed and 
>>> unintentional.
>>>
>>>
>>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>>> error value...
>>>
>>>
>>> The "memory leak" is not caused by my patch. It is caused because 
>>> reserving BO fails.
>>>
>>>
>>> This patch only aim to make function amdgpu_vram_scratch_fini behave 
>>> correctly.
>>>
>>>
>>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>>> error happens in a different patch.
>>>
>>>
>>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>>> this changes driver behaviour. Without a substantial issue, I would 
>>> be cautious for such a change.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin Xie
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Zhou, David(ChunMing)
>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>>
>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>
>>>> Hi David,
>>>>
>>>>
>>>> Thanks for the comments. However, please have look at 
>>>> amdgpu_bo_reserve definition.
>>>>
>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)
>>>>
>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>
>>>>
>>>> When we call this function like the following:
>>>>
>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>> The false means interruptible.
>>>>
>>>>
>>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>>> implementation when amdgpu_bo_reserve return any error.
>>>>
>>> Yeah, I see your mean, it's in driver un-loading, How about changing 
>>> to no interruptible? Your patch will make a memleak if bo_reserve 
>>> fails, but it seems not matter. I have no strong preference.
>>>
>>> Regards,
>>> David Zhou
>>>>
>>>>
>>>> Thanks,
>>>> Alex Bin Xie
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Zhou, David(ChunMing)
>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>>
>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>> > multiple times, etc.
>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>>> possible
>>>> here.
>>>>
>>>> Thanks,
>>>> David Zhou
>>>> >
>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>> > Signed-off-by: Alex Xie <AlexBin.Xie at amd.com>
>>>> > ---
>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>> >
>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > index 53996e3..1dcc2d1 100644
>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>>> amdgpu_device *adev)
>>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>> > amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>> >        }
>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>> >   }
>>>> >
>>>> >   /**
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170419/1009b48e/attachment-0001.html>


More information about the amd-gfx mailing list