[PATCH] dmr/amdgpu: Fix wrongly unref of BO

Thu Apr 20 08:43:19 UTC 2017

Hi AlexBin,

> Missing kunmap mapping in vmalloc will make kernel master page table 
> incorrect.
That's what I tried to explain yesterday, but unfortunately didn't had 
time to do so. There is not corruption of the kernel master page table 
in this case!

The call of ttm_bo_kunmap is completely optional, take a look at 
amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().

The aperture is kept mapped into the page tables for the whole time the 
driver is loaded. So this is a complete no-op and only done for consistency.

> It is good that you agree that there is no real world bad example 
> caused by my patch. I will not discuss whether it is an improvement or 
> not now to save time for both of us.
>
Great at least we can now agree to completely drop this patch.

Thanks,
Christian.

Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> Missing kunmap mapping in vmalloc will make kernel master page table 
> incorrect. I would not call such issue as completely harmless. Please 
> note that AMD graphic driver can run in 32 bit system. In 32 bit 
> system, vmalloc address space is much smaller than size of most GPU VRAM.
>
>
> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. 
> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when 
> amdgpu_bo_reserve returns error too. As result, kernel master page 
> table can become incorrect, or as you call it "completely harmless 
> vmalloc space leaking".
>
>
> Because amdgpu_bo_free_kernel is used in more places, such as psp 
> command submission, there will be bigger chance to have other usage 
> where signal is not blocked. This will become a real bug.
>
>
> I am thinking that we may fix the issue completely when TTM releases 
> BO. But that is a bigger change.
>
>
> It is good that you agree that there is no real world bad example 
> caused by my patch. I will not discuss whether it is an improvement or 
> not now to save time for both of us.
>
>
> Thanks,
>
> Alex Bin Xie
>
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple at vodafone.de>
> *Sent:* Wednesday, April 19, 2017 7:50 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Without correctly kunmap, page table is corrupted. Page entries point 
>> to wrong memory locations. You might call it completely harmless. But 
>> I think this is a severe bug. Leaking memory is better than a 
>> corrupted page table. Think security.
> We are talking about the page tables for the vmalloc area in the 
> kernel here, so no security problem. Leaking memory is much more 
> problematic.
>
>> Would you provide any document and reference by saying" It is 
>> impossible to receive a signal during module load/unload"? For 
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
> No, CTRL+C doesn't abort module load/unload. There have been patches 
> to changes this a while ago, but IIRC it broke a whole bunch of driver 
> relying on this.
>
>> What about there is some other return error? What about in future 
>> somebody improve amdgpu_bo_reserve to return other errors, 
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
> Yes, that is indeed an issue. For example -EDEADLK is possible as 
> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>
>> While I am thinking whether there is a better way for the current 
>> situation, would you give a real world example that my patch really 
>> not working? Then we can address it.
>>
> I don't think there is because the driver can't receive a signal 
> during load/unload, but the problem is rather that the patch doesn't 
> improve the situation at all.
>
> Regards,
> Christian.
>
> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>
>> Hi Christian,
>>
>>
>> Without correctly kunmap, page table is corrupted. Page entries point 
>> to wrong memory locations. You might call it completely harmless. But 
>> I think this is a severe bug. Leaking memory is better than a 
>> corrupted page table. Think security.
>>
>>
>> Would you provide any document and reference by saying" It is 
>> impossible to receive a signal during module load/unload"? For 
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
>>
>> If "It is impossible to receive a signal during module load/unload", 
>> interruptible waiting is fine too, because function amdgpu_bo_reserve 
>> will return successfully.
>>
>>
>> What about there is some other return error? What about in future 
>> somebody improve amdgpu_bo_reserve to return other errors, 
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
>>
>> While I am thinking whether there is a better way for the current 
>> situation, would you give a real world example that my patch really 
>> not working? Then we can address it.
>>
>>
>> Thanks,
>>
>> Alex Bin
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple at vodafone.de>
>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Hi AlexBin,
>>
>> the answer is ttm_bo_kunmap isn't called at all and yes in the case 
>> of an iomap we leak the address space reserved.
>>
>> But that is completely harmless on a 64bit system compared to leaking 
>> the memory backing the address space.
>>
>> Using amdgpu_bo_free_kernel() instead of openly coding it here is 
>> probably a good idea.
>>
>> Additional to that it's probably a good idea to set the no_intr flag 
>> when reserving kernel BOs. It is impossible to receive a signal 
>> during module load/unload, but it's probably better to document that 
>> in the code as well.
>>
>> Regards,
>> Christian.
>>
>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>> Hi Christian,
>>>
>>> Have you found how/where/when? When you said "mapping will just be 
>>> released a bit later on", you must know the answer.
>>>
>>> It is difficult to prove something does not exist. Anyway, I will 
>>> give it a try to prove such "later on" does not exist.
>>>
>>> Function ttm_bo_kunmap is the only function to unmap. To prove this, 
>>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to 
>>> correctly kunmap.
>>>
>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint 
>>> that all TTM delay delete mechanism or unref mechanism will NOT 
>>> kunmap BO later on.
>>>
>>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap 
>>> and amdgpu_gem_prime_vunmap.
>>>
>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for 
>>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer 
>>> is still NO.
>>>
>>> So all possibilities are searched. Did I miss anything?
>>>
>>> Thanks,
>>> Alex Bin Xie
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Xie, AlexBin
>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>> *To:* Christian König; Zhou, David(ChunMing); 
>>> amd-gfx at lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>> Hi Christian,
>>>
>>>
>>> Would you point out where/when will kunmap happen for this BO when 
>>> release? It must be somewhere in some function calls.
>>>
>>>
>>> I checked before I asked for review. But I did not see such obvious 
>>> kunmap function call.
>>>
>>>
>>> If so, there should be a comment in function 
>>> amdgpu_vram_scratch_fini to avoid future confusion.
>>>
>>>
>>> Thanks,
>>> Alex Bin Xie
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple at vodafone.de>
>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>> Hi AlexBin,
>>>
>>> No, David is right. This is a very common coding pattern in the 
>>> kernel module.
>>>
>>> Freeing up a BO while there still exists a kernel mapping is 
>>> perfectly ok, the mapping will just be released a bit later on.
>>>
>>> So this code is actually perfectly ok and just an optimization, but 
>>> your patch breaks it and creates a memory leak.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>
>>>> Hi David,
>>>>
>>>>
>>>> When amdgpu_bo_reserve return errors, we cannot release the BO. 
>>>> This is not a memory leak. General speaking, memory leak 
>>>> is unnoticed and unintentional.
>>>>
>>>>
>>>> The caller of function amdgpu_vram_scratch_fini ignores the return 
>>>> error value...
>>>>
>>>>
>>>> The "memory leak" is not caused by my patch. It is caused because 
>>>> reserving BO fails.
>>>>
>>>>
>>>> This patch only aim to make function amdgpu_vram_scratch_fini 
>>>> behave correctly.
>>>>
>>>>
>>>> To follow up, we can add a warning message when amdgpu_bo_reserve 
>>>> error happens in a different patch.
>>>>
>>>>
>>>> If function call amdgpu_bo_reserve is changed to uninterruptible, 
>>>> this changes driver behaviour. Without a substantial issue, I would 
>>>> be cautious for such a change.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin Xie
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Zhou, David(ChunMing)
>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>>
>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>>
>>>>> Thanks for the comments. However, please have look at 
>>>>> amdgpu_bo_reserve definition.
>>>>>
>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool 
>>>>> no_intr)
>>>>>
>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>
>>>>>
>>>>> When we call this function like the following:
>>>>>
>>>>>      r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> The false means interruptible.
>>>>>
>>>>>
>>>>> On the other hand,  when amdgpu_bo_reserve function return error, 
>>>>> why do we unref BO without kunmap and unpin the BO? This is wrong 
>>>>> implementation when amdgpu_bo_reserve return any error.
>>>>>
>>>> Yeah, I see your mean, it's in driver un-loading, How about 
>>>> changing to no interruptible? Your patch will make a memleak if 
>>>> bo_reserve fails, but it seems not matter. I have no strong preference.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex Bin Xie
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Zhou, David(ChunMing)
>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>
>>>>>
>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>>> > multiple times, etc.
>>>>>          r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't 
>>>>> possible
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> David Zhou
>>>>> >
>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie at amd.com>
>>>>> > ---
>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>> >   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > index 53996e3..1dcc2d1 100644
>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct 
>>>>> amdgpu_device *adev)
>>>>> >                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>> >                amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>> >                amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> >        }
>>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> >   }
>>>>> >
>>>>> >   /**
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170420/d2617804/attachment-0001.html>