[PATCH] dmr/amdgpu: Fix wrongly unref of BO
Christian König
deathsimple at vodafone.de
Thu Apr 20 08:43:19 UTC 2017
Hi AlexBin,
> Missing kunmap mapping in vmalloc will make kernel master page table
> incorrect.
That's what I tried to explain yesterday, but unfortunately didn't had
time to do so. There is not corruption of the kernel master page table
in this case!
The call of ttm_bo_kunmap is completely optional, take a look at
amdgpu_ttm_io_mem_reserve() and amdgpu_ttm_io_mem_free().
The aperture is kept mapped into the page tables for the whole time the
driver is loaded. So this is a complete no-op and only done for consistency.
> It is good that you agree that there is no real world bad example
> caused by my patch. I will not discuss whether it is an improvement or
> not now to save time for both of us.
>
Great at least we can now agree to completely drop this patch.
Thanks,
Christian.
Am 19.04.2017 um 21:30 schrieb Xie, AlexBin:
>
> Hi Christian,
>
>
> Missing kunmap mapping in vmalloc will make kernel master page table
> incorrect. I would not call such issue as completely harmless. Please
> note that AMD graphic driver can run in 32 bit system. In 32 bit
> system, vmalloc address space is much smaller than size of most GPU VRAM.
>
>
> amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1.
> It calls amdgpu_bo_reserve interruptible too. 2. It misses kunmap when
> amdgpu_bo_reserve returns error too. As result, kernel master page
> table can become incorrect, or as you call it "completely harmless
> vmalloc space leaking".
>
>
> Because amdgpu_bo_free_kernel is used in more places, such as psp
> command submission, there will be bigger chance to have other usage
> where signal is not blocked. This will become a real bug.
>
>
> I am thinking that we may fix the issue completely when TTM releases
> BO. But that is a bigger change.
>
>
> It is good that you agree that there is no real world bad example
> caused by my patch. I will not discuss whether it is an improvement or
> not now to save time for both of us.
>
>
> Thanks,
>
> Alex Bin Xie
>
>
> ------------------------------------------------------------------------
> *From:* Christian König <deathsimple at vodafone.de>
> *Sent:* Wednesday, April 19, 2017 7:50 AM
> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Without correctly kunmap, page table is corrupted. Page entries point
>> to wrong memory locations. You might call it completely harmless. But
>> I think this is a severe bug. Leaking memory is better than a
>> corrupted page table. Think security.
> We are talking about the page tables for the vmalloc area in the
> kernel here, so no security problem. Leaking memory is much more
> problematic.
>
>> Would you provide any document and reference by saying" It is
>> impossible to receive a signal during module load/unload"? For
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
> No, CTRL+C doesn't abort module load/unload. There have been patches
> to changes this a while ago, but IIRC it broke a whole bunch of driver
> relying on this.
>
>> What about there is some other return error? What about in future
>> somebody improve amdgpu_bo_reserve to return other errors,
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
> Yes, that is indeed an issue. For example -EDEADLK is possible as
> well. That's why I said we should use amdgpu_bo_free_kernel() instead.
>
>> While I am thinking whether there is a better way for the current
>> situation, would you give a real world example that my patch really
>> not working? Then we can address it.
>>
> I don't think there is because the driver can't receive a signal
> during load/unload, but the problem is rather that the patch doesn't
> improve the situation at all.
>
> Regards,
> Christian.
>
> Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:
>>
>> Hi Christian,
>>
>>
>> Without correctly kunmap, page table is corrupted. Page entries point
>> to wrong memory locations. You might call it completely harmless. But
>> I think this is a severe bug. Leaking memory is better than a
>> corrupted page table. Think security.
>>
>>
>> Would you provide any document and reference by saying" It is
>> impossible to receive a signal during module load/unload"? For
>> example, if the unload stuck in a lock, can CTRL+C stop the unload?
>>
>>
>> If "It is impossible to receive a signal during module load/unload",
>> interruptible waiting is fine too, because function amdgpu_bo_reserve
>> will return successfully.
>>
>>
>> What about there is some other return error? What about in future
>> somebody improve amdgpu_bo_reserve to return other errors,
>> then function amdgpu_vram_scratch_fini becomes buggy?
>>
>>
>> While I am thinking whether there is a better way for the current
>> situation, would you give a real world example that my patch really
>> not working? Then we can address it.
>>
>>
>> Thanks,
>>
>> Alex Bin
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Christian König <deathsimple at vodafone.de>
>> *Sent:* Wednesday, April 19, 2017 2:35 AM
>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>> Hi AlexBin,
>>
>> the answer is ttm_bo_kunmap isn't called at all and yes in the case
>> of an iomap we leak the address space reserved.
>>
>> But that is completely harmless on a 64bit system compared to leaking
>> the memory backing the address space.
>>
>> Using amdgpu_bo_free_kernel() instead of openly coding it here is
>> probably a good idea.
>>
>> Additional to that it's probably a good idea to set the no_intr flag
>> when reserving kernel BOs. It is impossible to receive a signal
>> during module load/unload, but it's probably better to document that
>> in the code as well.
>>
>> Regards,
>> Christian.
>>
>> Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
>>> Hi Christian,
>>>
>>> Have you found how/where/when? When you said "mapping will just be
>>> released a bit later on", you must know the answer.
>>>
>>> It is difficult to prove something does not exist. Anyway, I will
>>> give it a try to prove such "later on" does not exist.
>>>
>>> Function ttm_bo_kunmap is the only function to unmap. To prove this,
>>> search ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to
>>> correctly kunmap.
>>>
>>> Function ttm_bo_kunmap is not called by ttm itself. This is a hint
>>> that all TTM delay delete mechanism or unref mechanism will NOT
>>> kunmap BO later on.
>>>
>>> Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap
>>> and amdgpu_gem_prime_vunmap.
>>>
>>> Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for
>>> scratch VRAM BO. amdgpu_bo_free_kernel is a suspect but the answer
>>> is still NO.
>>>
>>> So all possibilities are searched. Did I miss anything?
>>>
>>> Thanks,
>>> Alex Bin Xie
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Xie, AlexBin
>>> *Sent:* Tuesday, April 18, 2017 2:04:33 PM
>>> *To:* Christian König; Zhou, David(ChunMing);
>>> amd-gfx at lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>
>>> Hi Christian,
>>>
>>>
>>> Would you point out where/when will kunmap happen for this BO when
>>> release? It must be somewhere in some function calls.
>>>
>>>
>>> I checked before I asked for review. But I did not see such obvious
>>> kunmap function call.
>>>
>>>
>>> If so, there should be a comment in function
>>> amdgpu_vram_scratch_fini to avoid future confusion.
>>>
>>>
>>> Thanks,
>>> Alex Bin Xie
>>> ------------------------------------------------------------------------
>>> *From:* Christian König <deathsimple at vodafone.de>
>>> *Sent:* Tuesday, April 18, 2017 1:46 PM
>>> *To:* Xie, AlexBin; Zhou, David(ChunMing); amd-gfx at lists.freedesktop.org
>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>> Hi AlexBin,
>>>
>>> No, David is right. This is a very common coding pattern in the
>>> kernel module.
>>>
>>> Freeing up a BO while there still exists a kernel mapping is
>>> perfectly ok, the mapping will just be released a bit later on.
>>>
>>> So this code is actually perfectly ok and just an optimization, but
>>> your patch breaks it and creates a memory leak.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:
>>>>
>>>> Hi David,
>>>>
>>>>
>>>> When amdgpu_bo_reserve return errors, we cannot release the BO.
>>>> This is not a memory leak. General speaking, memory leak
>>>> is unnoticed and unintentional.
>>>>
>>>>
>>>> The caller of function amdgpu_vram_scratch_fini ignores the return
>>>> error value...
>>>>
>>>>
>>>> The "memory leak" is not caused by my patch. It is caused because
>>>> reserving BO fails.
>>>>
>>>>
>>>> This patch only aim to make function amdgpu_vram_scratch_fini
>>>> behave correctly.
>>>>
>>>>
>>>> To follow up, we can add a warning message when amdgpu_bo_reserve
>>>> error happens in a different patch.
>>>>
>>>>
>>>> If function call amdgpu_bo_reserve is changed to uninterruptible,
>>>> this changes driver behaviour. Without a substantial issue, I would
>>>> be cautious for such a change.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin Xie
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Zhou, David(ChunMing)
>>>> *Sent:* Monday, April 17, 2017 10:38 PM
>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>
>>>>
>>>> On 2017年04月17日 22:54, Xie, AlexBin wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>>
>>>>> Thanks for the comments. However, please have look at
>>>>> amdgpu_bo_reserve definition.
>>>>>
>>>>> static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool
>>>>> no_intr)
>>>>>
>>>> Ah, this is a wired wrapper for ttm_bo_reserve.
>>>>
>>>>>
>>>>> When we call this function like the following:
>>>>>
>>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> The false means interruptible.
>>>>>
>>>>>
>>>>> On the other hand, when amdgpu_bo_reserve function return error,
>>>>> why do we unref BO without kunmap and unpin the BO? This is wrong
>>>>> implementation when amdgpu_bo_reserve return any error.
>>>>>
>>>> Yeah, I see your mean, it's in driver un-loading, How about
>>>> changing to no interruptible? Your patch will make a memleak if
>>>> bo_reserve fails, but it seems not matter. I have no strong preference.
>>>>
>>>> Regards,
>>>> David Zhou
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Alex Bin Xie
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Zhou, David(ChunMing)
>>>>> *Sent:* Friday, April 14, 2017 12:00 AM
>>>>> *To:* Xie, AlexBin; amd-gfx at lists.freedesktop.org
>>>>> *Subject:* Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO
>>>>>
>>>>>
>>>>> On 2017年04月14日 05:34, Alex Xie wrote:
>>>>> > According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
>>>>> > can return with -ERESTARTSYS. When this function was interrupted
>>>>> > by a signal, BO should not be unref. Otherwise the BO might be
>>>>> > released while is kmapped and pinned, or BO MIGHT be deref
>>>>> > multiple times, etc.
>>>>> r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
>>>>> we have specified interruptible to false, so -ERESTARTSYS isn't
>>>>> possible
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> David Zhou
>>>>> >
>>>>> > Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
>>>>> > Signed-off-by: Alex Xie <AlexBin.Xie at amd.com>
>>>>> > ---
>>>>> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>> >
>>>>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > index 53996e3..1dcc2d1 100644
>>>>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> > @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct
>>>>> amdgpu_device *adev)
>>>>> > amdgpu_bo_kunmap(adev->vram_scratch.robj);
>>>>> > amdgpu_bo_unpin(adev->vram_scratch.robj);
>>>>> > amdgpu_bo_unreserve(adev->vram_scratch.robj);
>>>>> > + amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> > }
>>>>> > - amdgpu_bo_unref(&adev->vram_scratch.robj);
>>>>> > }
>>>>> >
>>>>> > /**
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20170420/d2617804/attachment-0001.html>
More information about the amd-gfx
mailing list